Part V: Tools and Techniques
17. Using Commercial Search Engines and the Internet Archive
Main search engines
Sites giving information about search engines
Instructions for Link Searches with Commercial Search Engines
Most search engines allow only counts of links to a single given page, but HotBot and Yahoo! (see below) allow more sophisticated link counts, including counts of links to all pages with a common domain name. This document summarises how I think that search engines can be used for Webometric research. If you disagree, please tell me! See also the simpler set of instructions for webometric search engine queries. I am interested in both the search engine standard/advanced interfaces and the application programming interfaces (API or Web Service). The three main search engines at the moment seem to be Google, Microsoft, and the Yahoo! family (Yahoo!, AltaVista, AllTheWeb).

Site/domain coverage
This is the ability to discover from a search engine how many pages it has indexed with a single domain name: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index from the site www.wlv.ac.uk?
Site/sub-domains coverage
This is the ability to discover from a search engine how many pages it has indexed for multiple domain names with a common ending: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index from the site wlv.ac.uk, including www.wlv.ac.uk, www.scit.wlv.ac.uk and all other domain names ending in wlv.ac.uk?
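To make "all domain names ending in wlv.ac.uk" concrete, here is a minimal Python sketch of the suffix test that could be applied to a list of URLs you have already collected (e.g. from search results). The function name in_domain_family and the sample URLs other than those named above are hypothetical; this is an illustration of the definition, not how any search engine implements it.

```python
from urllib.parse import urlsplit


def in_domain_family(url: str, family: str) -> bool:
    """Return True if the URL's host is `family` or ends with '.' + family."""
    host = (urlsplit(url).hostname or "").lower()
    family = family.lower().lstrip(".")
    # Require a dot boundary so that "notwlv.ac.uk" does not count as "wlv.ac.uk".
    return host == family or host.endswith("." + family)


urls = [
    "http://www.wlv.ac.uk/",
    "http://www.scit.wlv.ac.uk/research/",       # hypothetical path
    "http://cybermetrics.wlv.ac.uk/database/",
    "http://www.notwlv.ac.uk/",                  # hypothetical look-alike host
    "http://search.msn.com/docs/",
]

# Keep only the pages that belong to the wlv.ac.uk family of domain names.
family_pages = [u for u in urls if in_domain_family(u, "wlv.ac.uk")]
print(len(family_pages), "of", len(urls), "URLs are in the wlv.ac.uk family")
```

The dot-boundary check is the design choice that matters here: a plain "ends with wlv.ac.uk" test would wrongly include unrelated domain names that merely share the trailing characters.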
URL or URL family coverage
This is the ability to discover from a search engine how many pages it has indexed from a collection of web pages with a common start to their URL: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index that start with http://search.msn.com/docs/? This is useful when a web site shares its domain name with other web sites. There is no perfect search for any search engine because they all seem to ignore the order of the segments. E.g. a search for inurl:search.msn.com/docs/ would match any URL containing the segments search, msn, com, and docs in any order (e.g. docs.search.msn.com would also match). Hence the second form of the search in the example section ensures that at least the domain name is correct. For a long URL, the search should be reliable unless the segments of the URL are all common.
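The order-insensitive behaviour described above can be imitated locally. The following Python sketch assumes that an inurl-style search simply checks that every segment of the query occurs somewhere in the URL; that is my reading of the observed behaviour, not a documented algorithm. The function names and the page help.htm are hypothetical.

```python
import re


def segments(text: str) -> set:
    """Split a URL or query into its lowercase alphanumeric segments."""
    return {s for s in re.split(r"[^0-9a-z]+", text.lower()) if s}


def loose_inurl_match(query: str, url: str) -> bool:
    """Assumed search engine behaviour: all query segments occur, in any order."""
    return segments(query) <= segments(url)


def strict_prefix_match(prefix: str, url: str) -> bool:
    """What the researcher actually wants: the URL starts with the prefix."""
    def strip_scheme(u: str) -> str:
        return re.sub(r"^[a-z]+://", "", u.lower())
    return strip_scheme(url).startswith(strip_scheme(prefix))


query = "search.msn.com/docs/"
wanted = "http://search.msn.com/docs/help.htm"   # hypothetical page in the family
unwanted = "http://docs.search.msn.com/"         # same segments, different order

for url in (wanted, unwanted):
    print(url, loose_inurl_match(query, url), strict_prefix_match(query, url))
```

Both URLs pass the loose test, but only the first passes the strict prefix test, which is why results from such a search may need filtering by URL prefix afterwards.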
Links to a single page
This is the ability to discover from a search engine how many links it knows about that point to any given single web page.
Links to a domain or family of domains
This is the ability to discover from a search engine how many links it knows about that point to any given web site, as defined by a domain name or family of domain names with a common ending.
Site inlinks
This is the ability to discover from a search engine how many links it knows about that point to any given web site, as defined by a domain name or family of domain names with a common ending, and excluding links from the same web site. Essentially, this can be done by combining a link or linkdomain search with a site or inurl search. Note that the second form of search is not perfect because it is not possible to generate a query that matches all links to a URL family, e.g. all links to pages with URLs starting with http://search.msn.com/docs/.
Links from one site to another
This is the ability to discover from a search engine how many links it knows about that point from one site or page to another.
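The site inlink and site-to-site searches above both combine a link-type operator with a site-type operator, so the query strings can be generated mechanically. The Python sketch below assumes Yahoo!-style syntax (linkdomain:, site: and a leading minus sign for exclusion). The operator names appear in the text above, but the exact combined forms are my assumption and may not be accepted by any particular search engine. The function names are hypothetical.

```python
def site_inlinks_query(domain: str) -> str:
    """Links to any page under `domain`, excluding links from `domain` itself.

    Assumes Yahoo!-style operators: linkdomain: for link targets, site: for
    the source pages, and '-' to exclude matches.
    """
    return f"linkdomain:{domain} -site:{domain}"


def intersite_links_query(source: str, target: str) -> str:
    """Links from pages on `source` to any page under `target` (same assumptions)."""
    return f"linkdomain:{target} site:{source}"


print(site_inlinks_query("wlv.ac.uk"))
print(intersite_links_query("search.msn.com", "wlv.ac.uk"))
```

These functions only build query text; submitting the queries and reading the reported hit counts is left to whichever interface or API is being used.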
Generic search engine issues
Google and Yahoo! return a maximum of 1000 results per search; Microsoft returns a maximum of 250.

API Issues
Google and Yahoo! return smaller figures for the API than for their main search interface and also return different URLs. Microsoft’s results are the same for both.
Text matching
In Google and other search engines, it is also possible to text-match URLs. For example, the search "cybermetrics.wlv.ac.uk" will match any page that contains the domain name "cybermetrics.wlv.ac.uk" in the visible text of the page (with or without a link), and, for some search engines, also in the invisible text of the page's HTML. It also matches longer URLs in the text of the page, such as cybermetrics.wlv.ac.uk/database/.

HotBot and Yahoo! for Advanced Web Searches
HotBot and Yahoo! allow searches for all URLs containing a given domain name, as reported by Lennart Björneborn and Isidro Aguillo. A message from Lennart Björneborn about using HotBot for link counts.
Isidro Aguillo comments that HotBot is offering a subset of Yahoo! results, and that link extraction can be conducted using Yahoo! with the following syntax: