

White Page Construction from Web Pages for Finding People on the Internet


Abstract

This paper proposes a method to extract proper names and their associated information from web pages for Internet/Intranet users automatically. The information extracted from World Wide Web documents includes proper nouns, E-mail addresses and home page URLs. Natural language processing techniques are employed to identify and classify proper nouns, which are usually unknown words. The information (i.e., home pages' URLs or e-mail addresses) for those proper nouns appearing in the anchor parts can be easily extracted using the associated anchor tags. For those proper nouns in the non-anchor part of a web page, different kinds of clues, such as the spelling method, adjacency principle and HTML tags, are used to relate proper nouns to their corresponding E-mail addresses and/or URLs. Based on the semantics of content and HTML tags, the extracted information is more accurate than the results obtained using traditional search engines. The results can be used to construct white pages for Internet/Intranet users or to build databases for finding people and organizations on the Internet. Such searching services are very useful for human communication and dissemination of information.

Review in Short

I came to this paper not expecting an awful lot, but ended up finding what appear to be deliberately deceitful representations of results. The paper makes use of a 'refined' organisational name classifier which is never outlined, indulges in sub-group analysis to present more promising-looking results, and conspicuously fails to report certain important figures.

Selection

My interest in this paper lies in the fact that the fundamental task it addresses -- finding people on the internet -- is closely related to my own research interests. The authors claim to use natural language processing to find key contact information in what would be, at best, sporadically structured text. (White Pages, by the way, are the US residential phone directory -- the person-listing counterpart of the business-listing Yellow Pages.) I don't expect to find much that is methodologically surprising in a 16-year-old paper, so this could be considered background reading into the very basics of my field. It's also worth noting that the paper appeared in a journal devoted to Chinese Language Processing, rather than English, so there's something of a context barrier.

Comment By Section

Introduction

I mentioned the context barrier of language, but it's also interesting to contrast the Web of 1998 with the one we're used to now. Chen and Bian acknowledge in passing that a key communication detail is a person's home page URL. Nowadays, my perception is that few people have what they would refer to as a 'home page', although Facebook pages and Twitter accounts perform much the same function -- corporately managed home pages which have more resilient infrastructure and greater capability for direct communication, but are perhaps less personal.

The core of the introduction is laying down an argument for the inadequacy of current (at the time) search engine indexing techniques. They give an example of a university web page containing labelled links to a number of other universities, pointing out that indexing strategies extracting terms from that page will store the names of those universities as terms to direct people to the referencing page, rather than those universities' own web pages. This does seem to be a plausible problem, as you would not expect an institution to repeat its own name very often, whereas affiliated institutions may well do so a lot. An analogy exists in personal exchanges, where you typically only say your own name in introductions, but may use a friend's several times to refer to them.

It seems the authors are mainly highlighting how a search engine's indexing technique wouldn't suffice for building a directory of people. They are accurate in that if a person says a name often, you would not assume that is their name, but it is also true that if you were looking for Mr. Chen and somebody was often mentioning him, it might be reasonable to approach them for information about him. If you're modelling the relationship between name and entity as 'is', then the search engine approach is obviously inappropriate. If on the other hand you are modelling the relationship as the entity possessing information on the name, then the search engine approach works well. There is a useful distinction here, but I think it would be better to support it by arguing the necessity of contact directories like the White Pages as opposed to searching and link-following approaches to reaching people.

The introduction wraps up by formalising the problem approached. They aim at:

  1. Extracting proper nouns from web pages, and distinguishing personal and organisational names;
  2. Mapping proper nouns to URLs and email addresses.

WWW Documents

The authors describe some of the opportunities and challenges involved in processing web pages as opposed to plain text. They mention that text terminators like full-stops might be absent, but could well be replaced by HTML tags which indicate structure (with the caveat that stylistic elements should not be treated as such). This is a fairly practical approach which I have little issue with.

Additionally, the authors consider anchor (<a>) tags evidence that the link text is a proper noun. This I'm less sure about. While their given examples are certainly proper nouns -- links labelled 'National Taiwan University' or 'Dean of Academic Affairs' -- it's not obvious that this reflects the majority of link usage. Many authors use inline links to reference claims as they make them, and an approach which over-weights the presence of an anchor element could end up admitting lots of false proper nouns, filling the automatically-generated White Pages with noise.

System Overview

The authors provide a high-level overview of their system, which thankfully confirms that their algorithm first tags all the proper nouns in the document, and then looks to see whether any of those proper nouns are surrounded by anchor tags (or else the <title> tag) providing contact information. Proper nouns without such metadata are placed into a data structure for later consideration, stored alongside an indicator of their position in the document. The same applies to links and email addresses, whose identification method is not given but is presumably regular-expression matching.
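
As a minimal sketch of what that extraction might look like -- naive patterns over the tokenised document, keeping token positions for the mapping stage described later -- something like the following would do. The patterns and function are mine, not the authors':

    import re

    # Hypothetical patterns -- the paper never gives its own. These are
    # deliberately naive: a real crawler would also need to handle
    # mailto: links, scheme-less URLs and obfuscated addresses.
    EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
    URL_RE = re.compile(r'https?://[^\s"<>]+')

    def extract_contacts(tokens):
        """Scan a tokenised document, keeping each hit's token position
        so the mapping algorithm can later measure distances."""
        emails, urls = [], []
        for pos, tok in enumerate(tokens):
            if EMAIL_RE.fullmatch(tok):
                emails.append((tok, pos))
            elif URL_RE.fullmatch(tok):
                urls.append((tok, pos))
        return emails, urls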

Identification of Proper Nouns

The authors draw on related work, including previous work by Chen, in constructing a name identification module for Chinese personal and organisational names. Though a separate study exists to detail and justify the design of this methodology, they provide a summary explanation. This serves to pad out the article, but also gives the reader more insight into the problem.

For personal names, their method essentially relies on the frequency of a character's usage in names compared to its usage in non-name words, multiplied across all the characters in a word and compared to a threshold value, though their explanation draws a distinction between slightly different variants of this for different types of name. Such character frequency analysis would probably not be suitable for English names, highlighting the importance of language-specific studies in computational linguistics.
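
In rough form (the notation and the exact normalisation are my guesses; the cited study has the details), the test appears to be something like

\[ \prod_{i=1}^{n} \frac{f_{name}(c_i)}{f_{other}(c_i)} \geq \theta \]

where f_name(c_i) is the frequency of character c_i in a corpus of names, f_other(c_i) its frequency in ordinary words, and θ a tuned threshold.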

Interestingly, it appears that the authors' system is generous in trying to match a name:

When a candidate cannot pass the thresholds, its last character is cut off and the remaining string is tried again

though without knowing more about Chinese names I can't comment on how much impact this re-evaluation might have. Additionally, they use low word association measures (low mutual information of collocated words) to identify plausible name words, though it is not clear in this summary how this is combined with their frequency-based models. Finally, the authors store candidate names in a paragraph-level cache, to facilitate re-identification of names and removal of characters which are not part of the name.
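
Returning to the truncation step quoted above, a hypothetical reconstruction of that back-off might be (score and threshold stand in for whichever of their frequency models applies; the two-character floor is my assumption):

    def best_name(candidate, score, threshold):
        """Shave characters off the end of a failing candidate until
        some prefix passes; return None if nothing does."""
        while len(candidate) >= 2:        # assume a name needs 2+ characters
            if score(candidate) >= threshold:
                return candidate
            candidate = candidate[:-1]    # cut off the last character, retry
        return None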

For organisational names, the summary stresses the many word sense and boundary difficulties involved in their detection, then (oddly) defers description of the method used to the experimental section. Skipping ahead to the experimental section, we see that this is a dodge. They acknowledge that a naive method (itself undefined) performed worse than expected, and then point to the improved performance rate for the 'refined' method, with no explanation of what this method is.

A Mapping Algorithm

A heuristics-based mapping algorithm determines the similarity between a URL and a proper noun as a score of 1 + the HTML_SCORE of the proper noun, divided by the number of tokens separating the URL and the proper noun,

\[ \frac{1 + HTML\_SCORE(PN)}{abs(PN.token\_pos - URL.token\_pos)} \]

where HTML_SCORE is defined as:

\[ Title ( PN ) + Heading ( PN ) + Address ( PN ) + Bold ( PN ) + Font ( PN ) + Italic ( PN )\]

That is, the sum of how many of the above metadata tags are around the noun. While I can see why any of these features could be seen as highlighting the noun, I'm not quite sure why their effect is regarded as cumulative. (Some might nostalgically note the presence of the now-deprecated font tag. The rarely-seen address tag being one of these features had me thinking I'd missed a definition somewhere.)

Where a proper noun is part of the <title> element, an additional term is added to the score: 1 over (the total number of tokens in the document, minus the URL's token position, plus 1).

\[ \frac{1 + HTML\_SCORE(PN)}{abs(PN.token\_pos - URL.token\_pos)} + \frac{Title(PN)}{token\_total - URL.token\_pos + 1}\]

This favours URLs positioned towards the end of the document, though the authors do not provide clear reasoning as to why URLs towards the end of a page might be related to something in the page's title.

In the case of email addresses, a weighted Pinyin Similarity value is added to the score. Pinyin Similarity is defined as the number of letters which match in Pinyin transliterations of the proper noun and the user component of the email address, out of the total number of characters in the user component of the email address. The authors stress that they weigh this value highly, as it would seem appropriate to do, given that many email addresses are variants of a person's name.

\[ \frac{1 + HTML\_SCORE(PN)}{abs(PN.token\_pos - Email.token\_pos)} + \frac{Title(PN)}{token\_total - Email.token\_pos + 1} + Email(PN) \times pinyin\_sim(PN, Email.user) \times Weight \]

With scores between proper nouns and both email addresses and URLs calculated according to this formula, a simple threshold is applied to determine which relationships should be kept.
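
For concreteness, here is the complete scoring procedure as I read it, in Python. The class layout, the position-wise Pinyin comparison and the weight value are all my own assumptions -- the paper gives neither its weight nor its threshold:

    from dataclasses import dataclass, field

    @dataclass
    class Item:
        text: str
        token_pos: int
        tags: set = field(default_factory=set)  # e.g. {'title', 'bold'}
        kind: str = 'url'                       # 'url' or 'email'
        user: str = ''                          # user part of an e-mail
        pinyin: str = ''                        # precomputed romanisation

    TAGS = ('title', 'heading', 'address', 'bold', 'font', 'italic')

    def html_score(pn):
        # One point for each highlighting tag wrapped around the noun.
        return sum(1 for t in TAGS if t in pn.tags)

    def pinyin_sim(pn, user):
        # Letters of the romanised noun matching the e-mail user part,
        # over the user part's length (position-wise match is my guess).
        hits = sum(1 for a, b in zip(pn.pinyin, user) if a == b)
        return hits / len(user) if user else 0.0

    def mapping_score(pn, target, token_total, weight=2.0):
        # Proximity term: highlighted nouns score higher, distant
        # noun/address pairs score lower.
        score = (1 + html_score(pn)) / abs(pn.token_pos - target.token_pos)
        # Title bonus, favouring targets near the end of the document.
        if 'title' in pn.tags:
            score += 1 / (token_total - target.token_pos + 1)
        # E-mail addresses also get the weighted Pinyin similarity.
        if target.kind == 'email':
            score += pinyin_sim(pn, target.user) * weight
        return score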

Experiments

The test of the system is performed on the same site used in a number of examples throughout the paper -- the web pages of the National Taiwan University. A gold standard of personal and organisational names is manually constructed by 'a person' and mapped to email addresses and URLs on the page, and the results of the automatic mapping are compared to this standard.

Here's where it starts to get fuzzy. Remember that certain relationships are formed directly from anchor tags (this is the Anchor Set), while others are formed by the methods in the previous section (this is the Content Set). Well, the results the authors present for the correct identification of proper nouns are divided along these lines. That doesn't make sense. According to the algorithm presented, proper nouns are identified before each proper noun is placed into the Anchor or Content sets. The Anchor and Content sets should be meaningless subdivisions when it comes to the performance of the proper noun identification module. More importantly, they never present a combined figure for the accuracy of the proper noun identification system.

When you see something like this, the suspicion is that the researchers are trying to disguise poor results. Thankfully, there's enough information given that I can examine their results more closely. The table below shows what the authors presented (the F1 scores were added by me).

Set       Proper Noun Type    # exist   # predicted   # correct   Precision   Recall   F1
Anchor    Personal                255           228         189      82.89%   74.12%   78.26%
Anchor    Organisational 1        746           611         213      34.86%   28.55%   31.39%
Anchor    Organisational 2        746           856         558      65.19%   74.80%   69.66%
Content   Personal               1732          3343        1470      43.97%   84.87%   57.93%
Content   Organisational 1       3029          2272         503      22.14%   16.61%   18.98%
Content   Organisational 2       3029          3392        2082      61.38%   68.74%   64.85%

Organisational 1 and 2, by the way, refer to the two organisation name classifiers. The method for Organisational 1 is presented in related work by Chen, but is inadequately described in this paper. The method for Organisational 2, so far as I can tell, has never been outlined. Okay, let's see what happens if we combine the Anchor and Content sets.

Set        Proper Noun Type    # exist   # predicted   # correct   Precision   Recall   F1
Combined   Personal               1987          3571        1659      46.45%   83.49%   59.69%
Combined   Organisational 1       3775          2883         716      24.84%   18.97%   21.51%
Combined   Organisational 2       3775          4248        2640      62.15%   69.93%   65.81%
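
(For anyone checking my arithmetic: the combined rows simply pool the subset counts before recomputing the metrics. For personal names, for example,

\[ Precision = \frac{189 + 1470}{228 + 3343} = \frac{1659}{3571} \approx 46.45\%, \qquad Recall = \frac{1659}{255 + 1732} = \frac{1659}{1987} \approx 83.49\% \]

with F1 the usual harmonic mean of the two.)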

The impressive-looking figures for the personal name identification system immediately plummet to a reasonable yet uninspiring F1 of around 60%. That would seem to be the reason the Anchor and Content subdivisions are being presented to us, and not the combined figures -- the authors are artificially inflating the apparent accuracy of their classifier.

Now, there are valid reasons to separate out the Personal and Organisational name types -- they use different methods, and it's useful to know how each component performs -- but let's see what the combined accuracy of the system is (under both Organisational classifiers).

Set        Proper Noun Type    # exist   # predicted   # correct   Precision   Recall   F1
Combined   P + O1                 5762          6454        2375      36.79%   41.21%   38.88%
Combined   P + O2                 5762          7819        4299      54.98%   74.61%   63.31%

So in conclusion, the best iteration of the proper noun identification module has an overall F1 score of 63%, with the best-performing component of that system remaining entirely unexplained.

Moving on to the content mapping system described in the previous section, we see a different kind of distortion going on in the presentation of results. Here is a copy of the information presented:

Content Set Mapping   # of items extracted by program   # of items mapped correctly by program   # of items mapped incorrectly by program   Accuracy
E-mail                64                                18                                       5                                          78.26%
HTTP                  16                                1                                        0                                          100%

Wow, 100% accuracy on mapping URLs and nearly 80% on email addresses! Yeah, no. First of all, notice how we're now getting an 'accuracy' figure rather than separate precision and recall figures? Notice how all the table headings are different? If you look closely, you'll see that the first column, '# of items extracted by program' is reporting something we were never told about -- how many email addresses and URLs were scraped from the web pages. That's not what's being tested -- we don't even know how they're doing that.

The figures presented in that table only show what you might call the precision value: the percentage of the predictions made which were right. What they don't tell us is the recall value: the percentage of the true mappings which the system actually found. The authors are not accounting for all the mappings their human evaluator found but which the mapping algorithm missed. In many cases where accuracy is reported instead of precision and recall, it's because the author doesn't know they should use those measures, but here the authors clearly do: the definition of recall sits right next to that table in the paper, alongside all the other tables where the authors used precision and recall rather than accuracy. Unlike the previous, subtler manipulation of the identification figures, this is a blatant case of choosing the metric which most flatters your results.
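
To put it in the paper's own terms, what the e-mail row of that table supports is

\[ Precision = \frac{18}{18 + 5} \approx 78.26\% \]

while the corresponding recall -- 18 divided by the number of true mappings the evaluator found -- cannot be computed, because that denominator is never reported.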

Unfortunately, the paper does not include figures on the number of mappings discovered by the manual analysis, so it is not possible for me to discern the real performance of the content mapping system from the publication.

The results section continues with an even emptier table, which simply shows what proportion of correct predictions came from where (19 correct predictions from the content mapping algorithm, 747 from the anchor tags; do the maths yourself). A somewhat more interesting paragraph details some of what the authors consider admissible shortcomings of the system, pointing out how references to ancient personal names were not captured due to their low rate of occurrence in the modern training corpus.

Impact

As of writing, Google Scholar tracks 9 citations for this paper, 7 of which are in English.

  1. Finding lists of people on the web: references it as evidence that 'proper names have been reliably extracted from Web pages'. The method to which they refer could be considered less reliable in light of my examination of its results.
  2. Personal name classification in web queries: cites this method as an example of an attempt at person-name recognition which relies on rich contextual information not available in web queries. This reference is accurate.
  3. Personal navigating agents: cites this paper in a dubious manner, firstly as an example of information extraction from free text within a single Web page (which it is not) and secondly as a system which tags personal and organisation names, E-mail addresses and URLs. While it is true that all these things are tagged in the system presented, only personal name identification is described in this paper, and even that description is a summary of a previous publication. The authors of the citing paper go on to mention the low precision figure (46.45%) for personal name identification which I have also identified, implicitly backing me up in removing the arbitrary division into anchor and content sets. However, here I find myself between two twisted interpretations of the data: the authors of the citing paper mention only the notably low precision of this classifier, and not its comparatively high recall (which is, however, still lower than that of their own classifiers). The citing paper's omission is less dangerous and more forgivable than the original misrepresentations, but still leaves a false impression.
  4. Learning-based Web query understanding: I cannot access this paper, so I cannot assess the context of the citation.
  5. Automatic keyword extraction of Chinese homepage for web understanding: this short paper (which mis-names the paper under review in its references) includes it in a comparative table of keyword extraction techniques for Chinese web pages. They use only the figures for personal names from the content mapping subdivision. It's not clear why they chose those figures rather than the anchor set or a combination, but neither choice would alter their conclusions.
  6. Ontology-Mediated Service Matching and Adaptation: I cannot access this paper, so I cannot assess the context of the citation.
  7. A Semantic Approach to Internet Tabular Information Extraction: cites this paper with a number of others as evidence for the statement 'Furthermore, to integrate information from different sources, domain knowledge is necessary', which is rather vague and empty but not obviously incorrect.

Reproducibility

On top of the usual and mundane aim of testing to see whether I can get the same results as the original authors, reproducibility here would allow me to report on the missing information regarding the recall rate of the content mapping system.

Even before beginning this review, I didn't hold great hopes for the paper's reproducibility. Sixteen years is a long time to keep hold of data and code. Sixteen years is long enough for the hardware on which you once stored it to have become obsolete and rusted away somewhere in the back of a lab. Given that the authors already published results in a misleading manner, I would not expect them to be particularly enthusiastic about digging up ancient code which could potentially be damaging to their reputations.

However, it's not actually all that bad. The website on which the analyses were run is still live, but its structure and contents are of course greatly altered. Thankfully, though, there is a snapshot of the site on the Internet Archive from 1998, and further snapshots from 1997. This means that the source material, or an insignificantly altered form of it, is still available for the enterprising critic. The algorithm for the content mapping module is preserved within the paper, as is a high-level overview of the algorithm for the whole directory-constructing system.

The same probably cannot be said of the manually-constructed gold standard against which the authors evaluated their classifier, which contains some of the most important missing information. Nor can I easily reconstruct my own version of this dataset, as I don't speak Chinese. On the implementation front, the main problem is the proper noun classifier, which is described in another paper, but in what appears to be insufficient detail to re-implement; in any case, it relies on a large LOB-style corpus of Chinese training data for which I have no clear reference.

I will attempt to contact the corresponding author, Hsin-Hsi Chen, and enquire about:

  1. An implementation of the proper noun classifier used in these publications.
  2. The balanced corpus used in training it.
  3. The gold-standard manual mapping of the NTU site, and any related data.

If I am unable to contact Chen, or if he is unable to provide these materials, I have no clear fallback for reproducing the paper, and will have to consider it unreproducible.


If you have no more direct means of contacting me, and wish to comment on this post, send electronic mail to lancaster.ac.uk for the user m.edwards7