Forensic Authorship Analysis and the World Wide Web

by Samuel Larner

Rating: ★★

This text, which appears to be an (MSc?) thesis updated with some additional results, centres on a topic raised during the trial of the Unabomber, Theodore Kaczynski. Kaczynski became a suspect in the case due to a linguistic clue:

Upon reading an internet version of the published manifesto, Linda Patrik became unnerved. Although she had never met her brother-in-law, there was something about the text that seemed familiar. She asked her husband, David Kaczynski, to read the publication and urged him to compare it with her brother-in-law’s writings. David sceptically complied, but started to suspect that his older brother, Theodore, may indeed be the Unabomber. The occurrence of one phrase in particular convinced him: cool-headed logicians. David recalled his brother “using that distinctive term on numerous occasions” (Fitzgerald, 2004: 208) and as a result, he contacted the FBI. To assist with the investigation, the Kaczynski family made available many documents known to have been written by Theodore for comparative analysis with the manifesto

The FBI took this seriously and decided, on the basis of the language of the Unabomber letters and manifesto, that Theodore was indeed their man. They obtained a search warrant, which eventually led to his capture. When Kaczynski was arrested and tried, his defence team attempted to argue that this linguistic evidence was an insufficient basis for the search warrant.

According to Coulthard (2000), Lakoff argued that many of the lexical items which were shared across both groups of texts could “quite easily occur in any argumentative text” (p.281) and were therefore not indicative of common authorship between the Unabomber’s terrorist manifesto and Kaczynski’s known writings. In particular, 12 words and phrases were selected for exemplification: at any rate, clearly, in practice, gotten, more or less, moreover, on the other hand, presumably, propaganda, thereabouts, and lexemes derived from the lemmas argu and propos (p. 281).

To counter the proposal that such words and phrases could occur in any similar text, Coulthard (2000) reports that the web was searched and approximately three million documents were found which included at least one or more of the 12 lexical items. However, when the search was limited to finding only documents which contained all 12 lexical items, only 69 documents were identified and each of these documents was an online copy of the terrorist manifesto (p. 282).
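The contrast between the two counts (any item versus all items) is easy to illustrate on a small scale. The following is only a rough sketch, assuming a local list of document strings rather than a web search engine; the helper names and the prefix matching used for the two lemmas are my own assumptions, not details reported by Coulthard or Larner.

```python
import re

# The ten fixed words and phrases quoted above.
PHRASES = [
    "at any rate", "clearly", "in practice", "gotten", "more or less",
    "moreover", "on the other hand", "presumably", "propaganda", "thereabouts",
]
# The two lemmas; matching them as word prefixes is an assumption.
STEMS = ["argu", "propos"]

# One case-insensitive pattern per lexical item (12 in total).
PATTERNS = [re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE) for p in PHRASES]
PATTERNS += [re.compile(r"\b" + s + r"\w*", re.IGNORECASE) for s in STEMS]


def items_present(text):
    """Count how many of the 12 lexical items occur at least once in text."""
    return sum(1 for pattern in PATTERNS if pattern.search(text))


def count_matches(documents):
    """Return (documents with at least one item, documents with all 12 items)."""
    any_hits = sum(1 for doc in documents if items_present(doc) >= 1)
    all_hits = sum(1 for doc in documents if items_present(doc) == len(PATTERNS))
    return any_hits, all_hits
```

Calling `count_matches` on a list of texts returns the two counts side by side. The point of the courtroom argument is that the first number is large for almost any collection of argumentative prose, while the second shrinks very quickly as more items are required together.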

The book attempts to answer the question of whether this rhetorical flourish in the Kaczynski trial points to a practical method of identifying the unknown author of a document. This is an interesting problem for authorship attribution, because the candidate set is practically unbounded -- anyone at all might be the author of the query document, and they could well not have published anything else.

Unfortunately, Larner does not provide a very convincing treatment of the problem. His coverage of other areas of authorship attribution is spotty at best, especially given that he is, theoretically, writing about 2014 and not just 2004. This would be forgivable if his own work were more compelling, but it is an almost laughably constrained case study, covering just three longform authors, often under tight search constraints. The latter are made necessary by the fact that the answer to the question is, obviously, no -- search engines are a terrible tool for this kind of attribution, and his contorted reading of the existing literature on that topic is quite confusing.