- follow-up book (2009):
Introduction to Webometrics

 

Part V: Tools and Techniques

17. Using Commercial Search Engines and the Internet Archive

Main search engines

Sites giving information about search engines

Instructions for Link Searches with Commercial Search Engines

Most search engines only count links to a single given page, but HotBot and Yahoo! (see below) allow more sophisticated link counts, including counts of links to all pages that share a common domain name.

This document summarises how I think that search engines can be used for Webometric research. If you disagree, please tell me! See also the simpler set of instructions for webometric search engine queries.

I am interested in both the standard/advanced search engine interfaces and the application programming interfaces (API or Web Service). The three main search engines at the moment seem to be Google, Microsoft, and the Yahoo! family (Yahoo!, AltaVista, AllTheWeb).

Site/domain coverage

This is the ability to discover from a search engine how many pages it has indexed with a single domain name: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index from the site www.wlv.ac.uk?

| Search Engine | Query | Example | Comments |
|---|---|---|---|
| Google | site: | site:www.wlv.ac.uk | |
| Microsoft | site: | site:www.wlv.ac.uk | |
| Yahoo! | site: | site:www.wlv.ac.uk | |

Site/sub-domains coverage

This is the ability to discover from a search engine how many pages it has indexed for multiple domain names with a common ending: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index from the site wlv.ac.uk – including www.wlv.ac.uk, www.scit.wlv.ac.uk and all other domain names ending in wlv.ac.uk?

| Search Engine | Query | Example | Comments |
|---|---|---|---|
| Google | site: | site:wlv.ac.uk | |
| Microsoft | site: | site:wlv.ac.uk | |
| Yahoo! | site: | site:wlv.ac.uk | Also host: in AltaVista |
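The difference between the single-domain and sub-domain forms of the site: search can be illustrated with a small local check. This is only a sketch of the matching rule as described above, not the engines' actual implementation; the function name matches_site is hypothetical, and matching on a dot boundary (so that notwlv.ac.uk does not count as part of wlv.ac.uk) is an assumption.

```python
from urllib.parse import urlsplit


def matches_site(url: str, site: str) -> bool:
    """Return True if the URL's host equals `site` or is a sub-domain of it.

    Assumption: a site: search matches on whole domain-name segments.
    The URL must include a scheme (e.g. http://) so the host can be parsed.
    """
    host = (urlsplit(url).hostname or "").lower()
    site = site.lower().rstrip(".")
    return host == site or host.endswith("." + site)


# site:wlv.ac.uk style matching covers every domain ending in wlv.ac.uk
print(matches_site("http://www.scit.wlv.ac.uk/page.htm", "wlv.ac.uk"))
# site:www.wlv.ac.uk style matching covers only that one domain name
print(matches_site("http://www.scit.wlv.ac.uk/page.htm", "www.wlv.ac.uk"))
```

A check like this can be applied to the URLs a search engine returns, to confirm which of them belong to the site under study.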

URL or URL family coverage

This is the ability to discover from a search engine how many pages it has indexed from a collection of web pages with a common start to their URL: i.e. how many pages could potentially match a user’s search. For example: how many web pages does Google index that start with http://search.msn.com/docs/? This is useful when a web site shares its domain names with other web sites.

No search engine offers a perfect search for this, because they all seem to ignore the order of the URL segments. For example, a search for inurl:search.msn.com/docs/ would match any URL containing the segments search, msn, com, and docs in any order (so docs.search.msn.com would also match). Hence the second form of the search in the Example column ensures that at least the domain name is correct. For a long URL, the search should be reliable unless the segments of the URL are all common.

| Search Engine | Query | Example | Comments |
|---|---|---|---|
| Google | inurl: | inurl:search.msn.com/docs/ | |
| | | or inurl:docs/ site:search.msn.com | |
| Microsoft | inurl: | inurl:search.msn.com/docs/ | The "inurl" command does not work on many URLs: some URL segments always return zero hits. Some "words" within URLs cause problems. |
| | | or inurl:docs/ site:search.msn.com | |
| Yahoo! | allinurl: | allinurl:search.msn.com/docs/ | |
| | | or allinurl:docs/ site:search.msn.com | |
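Because the engines appear to ignore segment order, one way to tidy the results is to check each returned URL locally against the intended prefix. The following is a minimal sketch of that idea; the function names and the sample URLs are hypothetical, and it assumes the result URLs have already been collected into a list.

```python
from urllib.parse import urlsplit


def _host_and_path(url: str) -> str:
    """Reduce a URL to lower-cased host plus path, ignoring the scheme."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    return (parts.hostname or "").lower() + parts.path


def filter_url_family(urls, prefix):
    """Keep only the URLs that really start with `prefix`."""
    wanted = _host_and_path(prefix)
    return [u for u in urls if _host_and_path(u).startswith(wanted)]


# Hypothetical results for inurl:search.msn.com/docs/
results = [
    "http://search.msn.com/docs/help.aspx",
    "http://docs.search.msn.com/",         # same segments, wrong order
    "http://search.msn.com/other/docs/",   # docs segment in the wrong place
]
print(filter_url_family(results, "search.msn.com/docs/"))
```

Only the first of the three sample URLs survives the filter, since it is the only one whose host and path begin with search.msn.com/docs/.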

Links to a single page

This is the ability to discover from a search engine how many links it knows about that point to any given single web page.

| Search Engine | Query | Example | Comments |
|---|---|---|---|
| Google | link: | link:http://news.bbc.co.uk/1/hi/uk/4599030.stm | Only reports a fraction of links that Google knows about (10%?) |
| Microsoft | link: | link:http://news.bbc.co.uk/1/hi/uk/4599030.stm | |
| Yahoo! | link: | link:http://news.bbc.co.uk/1/hi/uk/4599030.stm | |

Links to a domain or family of domains

This is the ability to discover from a search engine how many links it knows about that point to any given web site, as defined by a domain name or family of domain names with a common ending.

| Search Engine | Query | Example | Comments |
|---|---|---|---|
| Google | N/A | | |
| Microsoft | linkdomain: | linkdomain:news.bbc.co.uk | |
| Yahoo! | linkdomain: | linkdomain:news.bbc.co.uk | Yahoo! now has a "Site Explorer" interface, avoiding syntax |

Site inlinks

This is the ability to discover from a search engine how many links it knows about that point to any given web site, as defined by a domain name or family of domain names with a common ending, and excluding links from the same web site. Essentially, this can be done by combining a link or linkdomain search with a site or inurl search.

Note that the second form of search is not perfect because it is not possible to generate a query that matches all links to a URL family, e.g. all links to pages with URLs starting with http://search.msn.com/docs/.

| Search Engine | Query | Examples | Comments |
|---|---|---|---|
| Google | N/A | | The link command cannot be combined with any others. |
| Microsoft | | linkdomain:news.bbc.co.uk -site:news.bbc.co.uk | |
| | | or link:search.msn.com/docs/ -inurl:search.msn.com/docs/ | |
| Yahoo! | | linkdomain:news.bbc.co.uk -site:news.bbc.co.uk | |
| | | or link:search.msn.com/docs/ -allinurl:search.msn.com/docs/ | |
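When many sites are being studied, the first form of the site inlink query can be generated rather than typed. This is a minimal sketch that only builds the query text; the function name and the SUPPORTED set are hypothetical, and it does not submit anything to a search engine. Google is left out because the link command reportedly cannot be combined with other commands.

```python
# Engines for which a combined linkdomain/-site syntax is listed above.
SUPPORTED = {"microsoft", "yahoo"}


def site_inlinks_query(engine: str, domain: str) -> str:
    """Build a query for links to `domain` that exclude the site's own pages."""
    if engine.lower() not in SUPPORTED:
        raise ValueError(f"no site inlink syntax known for {engine!r}")
    # Use a plain hyphen for the exclusion, not a typographic dash.
    return f"linkdomain:{domain} -site:{domain}"


print(site_inlinks_query("yahoo", "news.bbc.co.uk"))
```

Generating the string also avoids the risk of a word processor silently turning the hyphen before site: into a dash, which the engines may not treat as an exclusion operator.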

Links from one site to another

This is the ability to discover from a search engine how many links it knows about that point from one site or page to another.

| Search Engine | Query | Examples | Comments |
|---|---|---|---|
| Google | N/A | | The link command cannot be combined with any others. |
| Microsoft | | linkdomain:news.bbc.co.uk site:wlv.ac.uk | |
| | | or link:search.msn.com/docs/ site:news.bbc.co.uk | |
| Yahoo! | | linkdomain:news.bbc.co.uk site:wlv.ac.uk | |
| | | or link:search.msn.com/docs/ NOT site:news.bbc.co.uk | |
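For a study of links among several sites, one query is needed for each ordered pair of sites. The sketch below generates the linkdomain/site form of the query for every pair; the function name is hypothetical and the code only produces query strings, on the assumption that they would then be submitted to Microsoft or Yahoo! by hand or through an API.

```python
from itertools import permutations


def intersite_queries(sites):
    """Map each (source, target) pair to a query for links from source to target."""
    return {
        (source, target): f"linkdomain:{target} site:{source}"
        for source, target in permutations(sites, 2)
    }


queries = intersite_queries(["wlv.ac.uk", "news.bbc.co.uk"])
for (source, target), query in queries.items():
    print(f"{source} -> {target}: {query}")
```

For n sites this yields n(n-1) queries, so the number of searches grows quickly with the size of the site set.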

Generic search engine issues

Google and Yahoo! return a maximum of 1000 results per search; Microsoft returns a maximum of 250.

API Issues

Google and Yahoo! return smaller figures for the API than for their main search interface and also return different URLs. Microsoft’s results are the same for both.

 

Text matching

In Google and other search engines, it is also possible to text-match URLs. For example, the search "cybermetrics.wlv.ac.uk" will match any page that contains the domain name "cybermetrics.wlv.ac.uk" in its visible text (with or without a link) and, for some search engines, also in the invisible text of the page's HTML. It also matches longer URLs in the text of the page, such as cybermetrics.wlv.ac.uk/database/.
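The kind of match described here can be imitated locally on a downloaded page, which may help when checking what a text-match search has found. This is a rough sketch using a regular expression, not a description of how any search engine tokenises text; the function name and the sample text are hypothetical.

```python
import re


def find_url_mentions(text: str, domain: str):
    """Return every mention of `domain` in `text`, with any trailing URL path."""
    # After the domain name, keep reading until whitespace, a quote or an angle bracket.
    pattern = re.compile(re.escape(domain) + r"[^\s\"'<>]*", re.IGNORECASE)
    return pattern.findall(text)


sample = "See cybermetrics.wlv.ac.uk/database/ and http://cybermetrics.wlv.ac.uk for details"
print(find_url_mentions(sample, "cybermetrics.wlv.ac.uk"))
```

Run against the raw HTML of a page, the same function would also pick up mentions inside link targets and other invisible markup.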

HotBot and Yahoo! for Advanced Web Searches

HotBot and Yahoo! allow searches for all URLs containing a given domain name, as reported by Lennart Björneborn and Isidro Aguillo.

A message from Lennart Björneborn about using HotBot for link counts.

Looks like HotBot (http://www.hotbot.com) gives inlink counts for whole sites for search strings like:

linkdomain:www.db.dk and domain:uk

Although this example might include some non-uk pages with uk somewhere in their domain name. Somewhat confusingly, HotBot highlights search operators like 'and' and 'domain' in the result list ...

It is also possible to stem urls like:
linkdomain:db.dk and domain:ac.uk

HotBot uses the Inktomi database - owned by Yahoo (cf. http://www.searchengineshowdown.com/features/inktomi/) - so let's see how long they'll keep the advanced search operators ... :-(

Isidro Aguillo comments that HotBot is offering a subset of Yahoo! results, and that link extraction can be conducted using Yahoo! with the following syntax:

linkdomain:db.dk +site:ac.uk