Follow-up book (2009): Introduction to Webometrics

 

Part V: Tools and Techniques

19. Data Cleansing

For information about data cleansing with SocSciBot Tools, see its Tutorial 2.

The tutorial's practical suggestion is to set clear criteria for including pages in a research project. In a small link analysis project, every page should be visited to assess whether it meets these criteria.

In a large link analysis project, where it is not possible to visit every linking page, the source pages of the most frequently targeted links should be checked in order to identify the most influential unwanted pages. The rationale is that a single page containing one or more links is unlikely to influence the outcome of a link analysis, whereas the most likely source of an anomaly is a group of pages that all link to a single target page. This occurs in many mirror sites, where every page in the site links to the home page. Visiting the sources of highly targeted links is therefore likely to identify the largest mirror sites in the data set. Many research projects will not want to include such mirror sites, in which case they should be excluded as part of data cleansing.
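
The check for large projects can be sketched in a few lines of Python. This is only an illustration and not how SocSciBot Tools itself works: it assumes the crawl data is already available as a list of (source URL, target URL) pairs, and the function names `top_link_targets` and `source_sites`, along with all the example URLs, are hypothetical. It ranks targets by the number of distinct pages linking to them, then shows which sites those source pages belong to, so that the most heavily linked targets can be visited by hand first.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def top_link_targets(links, n=10):
    """Return the n targets linked to by the most distinct source pages.

    links: iterable of (source_url, target_url) pairs (assumed input format).
    Each result is (target_url, number_of_source_pages, sorted_source_urls).
    """
    sources_by_target = defaultdict(set)
    for source, target in links:
        sources_by_target[target].add(source)
    counts = Counter({t: len(s) for t, s in sources_by_target.items()})
    return [(t, c, sorted(sources_by_target[t]))
            for t, c in counts.most_common(n)]


def source_sites(sources):
    """Count source pages per host name, to reveal clusters from one site."""
    return Counter(urlsplit(url).netloc for url in sources)


# Hypothetical data: three pages of a mirror hosted at uni-a all link to
# the home page of the mirrored site, plus two ordinary links from uni-b.
links = [
    ("http://www.uni-a.example/mirror/p1.html", "http://docs.example.org/"),
    ("http://www.uni-a.example/mirror/p2.html", "http://docs.example.org/"),
    ("http://www.uni-a.example/mirror/p3.html", "http://docs.example.org/"),
    ("http://www.uni-b.example/staff/jo.html", "http://www.uni-c.example/"),
    ("http://www.uni-b.example/staff/jo.html", "http://docs.example.org/"),
]

for target, count, sources in top_link_targets(links, n=1):
    print(target, count, dict(source_sites(sources)))
```

A target whose source pages come overwhelmingly from one site, as with the uni-a pages here, is a candidate mirror site. Whether to exclude it remains a decision to be made against the project's inclusion criteria after visiting the pages.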
