The Impact of JavaScript-Loaded Content on WaC Methods

A current trend in web design is a shift towards asynchronously loaded content. Web pages are initially served either blank or with only header and footer information, and the main body of the page is then loaded via AJAX calls to the distributing server or a content distribution network.

This trend has implications for Web-as-Corpus (WaC) methodologies: typical corpus collection methods, which generally work from the HTML first returned by the server, will find very little text in a page that a human reader sees as full of it. More importantly, there is reason to believe that particular sectors, and thus particular categories of text, are being obscured in this manner. Social media websites, for example, increasingly use such methods as part of advanced presentation techniques, and specifically to protect against web scraping.

The impact of this trend does not yet appear to have been quantified. This project would set out to do so.

Requirements

  1. A means of gathering a typical WaC sample.
  2. A means of gathering a WaC sample using screen-scraping tools.
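
One way the two samples listed above might be compared, page by page, is sketched below in Python. This is only an illustration under stated assumptions, not a proposed implementation: the names VisibleTextParser, visible_words and hidden_fraction are hypothetical, the measure (share of rendered words absent from the static response) is just one possible choice, and the rendered HTML is assumed to have been saved beforehand by some browser-based tool that executes JavaScript, which is not shown here.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect words from text nodes, ignoring <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a skipped element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.words.extend(data.split())


def visible_words(html):
    """Return the whitespace-separated words a text extractor would keep."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.words


def hidden_fraction(static_html, rendered_html):
    """Share of the rendered page's words missing from the static response.

    Hypothetical measure: 0.0 means nothing is hidden, 1.0 means the
    static response contains no text at all.
    """
    static_count = len(visible_words(static_html))
    rendered_count = len(visible_words(rendered_html))
    if rendered_count == 0:
        return 0.0
    return max(0.0, 1 - static_count / rendered_count)


# Toy example: the static response has only a header and a loader script,
# while the rendered page also contains the body text.
static_page = (
    "<html><body><header>Site</header>"
    "<div id='app'></div><script>load()</script></body></html>"
)
rendered_page = (
    "<html><body><header>Site</header>"
    "<div id='app'>Four words of text</div></body></html>"
)
print(hidden_fraction(static_page, rendered_page))
```

In the toy example the static response yields one word and the rendered page five, so roughly four fifths of the text is invisible to a collector that does not execute JavaScript. Averaging such a figure over each sample, or breaking it down by site category, would be one possible route to the quantification the project aims for.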