Part V: Tools and Techniques
19. Data Cleansing
Information about data cleansing with SocSciBot Tools is in its Tutorial 2.
The practical suggestion given in the tutorial is that clear criteria ought to be set up for the inclusion of pages in a research project. In a small link analysis project, all pages should be visited to assess whether they match these criteria.
In a large link analysis project, where it is not possible to visit every linking page, the pages that are the sources of the most frequently targeted links should be checked. The aim is to identify the most influential unwanted pages. The rationale is that a single page containing one or more links is unlikely to influence the outcome of a link analysis, whereas the most likely source of an anomaly is a group of pages that all link to a single target page. This occurs in many mirror sites, where every page in the site links to the home page. Visiting the sources of highly targeted links is therefore likely to identify the largest mirror sites in the data set. Many research projects will not want to include such mirror sites, and they should be excluded as part of data cleansing.
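The check described above for large projects can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link data is assumed to be available as a list of (source page, target page) URL pairs, which is not necessarily the format that SocSciBot Tools produces, and the function name `sources_of_top_targets`, the `top_n` parameter, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def sources_of_top_targets(links, top_n=10):
    """Return the top_n most-targeted URLs, each with its distinct source pages.

    `links` is an iterable of (source_url, target_url) pairs (an assumed
    input format). Targets are ranked by the number of distinct source
    pages linking to them, with ties broken alphabetically by target URL.
    """
    sources_by_target = defaultdict(set)
    for source, target in links:
        sources_by_target[target].add(source)

    ranked = sorted(
        sources_by_target.items(),
        key=lambda item: (-len(item[1]), item[0]),
    )
    return [(target, sorted(sources)) for target, sources in ranked[:top_n]]


# Hypothetical data: three pages of one site all link to its home page.
links = [
    ("http://mirror.example/a.html", "http://mirror.example/"),
    ("http://mirror.example/b.html", "http://mirror.example/"),
    ("http://mirror.example/c.html", "http://mirror.example/"),
    ("http://dept.example/staff.html", "http://other.example/paper.html"),
]

for target, sources in sources_of_top_targets(links, top_n=1):
    print(target, "is linked from", len(sources), "pages:")
    for source in sources:
        print("  visit and check:", source)
```

The source pages listed for each highly targeted URL are the ones to visit by hand. The decision to exclude a site still rests on the project's inclusion criteria, not on the counts alone.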