Part I: Theory
1. Introduction
2. Crawlers and search engines
3. Theoretical perspectives
4. Sampling & correlations
Part II: Web structure
5. Link structures in the web graph
6. Content structure of the web
Part III: Academic links
7. Universities - link types
8. Universities - link models
9. Universities - international
10. Departments and disciplines
11. Journals and articles
Part IV: Applications
12. Site design & search engines
13. Health check for universities
14. Personal home pages
15. Academic network analysis
16. Business web sites
Part V: Tools and techniques
17. Search engines & Archive
18. Personal crawlers
19. Data cleansing
20. Cybermetrics database
21. Embedded link analysis
22. Social network analysis
23. Network visualisation
24. Academic web indicators
Part VI: Summary
25. Summary & future directions
26. Glossary
Online Appendix
Ethical issues for crawlers


- follow-up book (2009):
Introduction to Webometrics


This section contains the book's glossary plus any additional terms that have been requested. See also a free online encyclopedia.

  • Alternative Document Model (ADM). A method of aggregating web content into units for counting purposes. See the directory ADM, domain ADM, site ADM and page ADM definitions.
  • Citation. A reference from one publication to another. A citation is the reference viewed from the perspective of the referenced document.
  • Cybermetrics. The application of quantitative techniques to the Internet, influenced by informetrics.
  • Directory ADM. All the files in the same directory are treated as a single document. Directories are equated with the position of slashes in URLs, rather than by the actual directory/folder structure of pages on the hosting web server.
  • Domain name. The part of the URL of a web page that normally follows the http:// and precedes the first subsequent slash (if any). Note that this is a simplified definition; there is a longer computer science definition that encompasses additional variations (Berners-Lee, Fielding & Masinter, 1998).
  • Domain ADM. All files with the same domain name are treated as a single document.
  • File Transfer Protocol (FTP). A mechanism for transferring files between computers over the Internet.
  • HITS. Hyperlink-Induced Topic Search. An algorithm designed to use link structures to find the web pages most relevant to a given topic (see chapter 12).
  • Host. Used to refer to an individual computer such as a web server.
  • Hyperlink. A feature in a web page that allows users to click to navigate to a different web page.
  • Hyperlink Network Analysis (HNA). The application of social network analysis methods to the web.
  • HyperText Markup Language (HTML). The coding language in which web pages are described. This is interpreted by web browsers to produce the web pages that web users see, and is processed by web crawlers to extract the embedded links.
  • HyperText Transfer Protocol (HTTP). The mechanism used by programs such as web browsers and crawlers to communicate with a web server, for example to request a web page.
  • Indicator. An indicator is a number, a table of numbers, or a visual representation of quantitative information. Note that wider definitions of indicators are sometimes used, encompassing the presentation of non-quantitative information.
  • Internet. A large public network of computers running IP and able to communicate with each other.
  • Internet Protocol (IP). The basic mechanism for transferring information over the Internet.
  • IP address. A dot-separated list of numbers that identifies computers on the Internet, including web servers.
  • Link page. A web page containing a link. This terminology is sometimes used instead of link because search engines count link pages rather than links in response to a link-based query.
  • Page ADM. Each separate file is treated as a document for extracting links, or for other counting purposes.
  • PageRank. An algorithm used by Google to rank web pages using the link structure of the web (see chapter 12).
  • Pajek. A program for network visualization.
  • Path. A path between two nodes in a network is a contiguous chain of links, starting at the first and ending at the last.
  • Portable Document Format (PDF). A document format created by Adobe and commonly used for posting documents on the web.
  • Power law. A mathematical law that has been applied to many kinds of web data. It is related to rich-get-richer phenomena, and is also known as Lotka’s law. See chapter 5 for a definition and discussion.
  • Search engine. A program that allows users to type in an information request, such as a keyword query, and returns lists of web pages matching the query.
  • Site ADM. All files belonging to a clearly defined web site are treated as a single document.
  • Shortest path. A shortest path between two nodes is a path between them that has the minimum possible length.
  • Social Network Analysis (SNA). Social network analysis is a methodology that has evolved to study social groupings, particularly in terms of social and communication connections within a group.
  • SocSciBot. A web crawler available with this book and designed for research crawling.
  • SocSciBot Tools. A suite of programs that can be used to analyze the link structure files produced by SocSciBot and those available in the cybermetrics university link structure databases.
  • TLD spectral analysis. A technique for choosing an ADM to use for a data set (see chapter 19).
  • Top level domain (TLD). The final segment of a domain name. This will either be a generic top level domain, such as .edu, .com and .info, or a country-specific domain, such as .uk for the UK or .es for Spain.
  • UCINET. A program for social network analysis calculations.
  • University ADM. All files belonging to a university are treated as a single document.
  • Web. The collection of resources that can be obtained over the public Internet using HTTP.
  • Web crawler, robot, bot. A program that visits web pages, automatically extracts their links and follows them.
  • Web site. An entity without a single agreed definition. Loosely speaking, any collection of pages that have a consistent organizational, structural or visual theme may be thought of as a web site. Normally, web sites seem to have identifiable regularities in URLs, such as a common domain name, or a common directory on the web server.
  • Webometrics. The application of quantitative techniques to the web, influenced by informetrics.
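
The page, directory and domain ADM definitions can be illustrated with a minimal Python sketch. It is not the method used by SocSciBot or by the book; the function names `adm_unit`, `count_units` and `top_level_domain` and the example.ac.uk URLs are hypothetical, and the simplified domain name definition from this glossary is assumed. The site and university ADMs are omitted because they need a human-defined mapping from URLs to sites or institutions.

```python
from urllib.parse import urlsplit


def adm_unit(url: str, level: str) -> str:
    """Return the counting unit that `url` falls into under one ADM level."""
    parts = urlsplit(url)
    host = parts.hostname or ""   # simplified domain name, lower-cased
    path = parts.path or "/"      # treat a bare domain as its root page
    if level == "page":
        # Page ADM: every separate file is its own document.
        return host + path
    if level == "directory":
        # Directory ADM: cut the URL at its last slash, ignoring the
        # real folder layout on the web server.
        return host + path[: path.rfind("/") + 1]
    if level == "domain":
        # Domain ADM: all files sharing a domain name form one document.
        return host
    raise ValueError(f"unknown ADM level: {level}")


def count_units(urls, level: str) -> int:
    """Count distinct documents in `urls` under the chosen ADM level."""
    return len({adm_unit(u, level) for u in urls})


def top_level_domain(url: str) -> str:
    """Return the final segment of the domain name, e.g. '.uk'."""
    host = urlsplit(url).hostname or ""
    return "." + host.rsplit(".", 1)[-1] if "." in host else ""


urls = [
    "http://www.example.ac.uk/a/b.html",
    "http://www.example.ac.uk/a/c.html",
    "http://www.example.ac.uk/d/e.html",
    "http://lib.example.ac.uk/index.html",
]
for level in ("page", "directory", "domain"):
    print(level, count_units(urls, level))
```

Coarser ADMs merge more URLs into a single unit, so the same set of pages yields smaller counts as the level moves from page to directory to domain.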

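The path and shortest path definitions can be made concrete with a breadth-first search over a small directed link network. This is a sketch under stated assumptions: the network is held as a plain dictionary from each node to the nodes it links to, and the name `shortest_path` and the node labels are hypothetical. Programs such as Pajek and UCINET provide their own implementations.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Return one shortest directed path from start to goal, or None."""
    previous = {start: None}      # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through `previous` to rebuild the chain of links.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for target in links.get(node, ()):
            if target not in previous:   # first visit is the shortest route
                previous[target] = node
                queue.append(target)
    return None                   # no path exists in this direction


# Hypothetical network: a links to b and c, both link to d, d links to e.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
path = shortest_path(links, "a", "e")
print(path, "length:", len(path) - 1)
```

The length of a path is the number of links in it, which is one less than the number of nodes it visits. Because links are directed, a path from one node to another does not imply a path back.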

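The HTML and web crawler entries both mention extracting embedded links from a page. A minimal sketch of that single step, using Python's standard library, might look as follows. It is not SocSciBot's code; the names `LinkExtractor` and `extract_links` and the example URLs are hypothetical, and only `href` attributes of `a` tags are treated as hyperlinks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the hyperlink targets found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://other.example.org/">Other</a></p>'
)
print(extract_links(page, "http://www.example.ac.uk/index.html"))
```

A full crawler would also fetch each page over HTTP, queue the extracted URLs for later visits and respect the site owner's crawling instructions. Those parts are left out here.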
