Collaborative Filtering as a Tool for Filesharing Network Investigation

Peer-to-peer filesharing networks are often of interest to law enforcement and other sorts of investigators. In the case of law enforcement, concerns such as child abuse media motivate monitoring of filesharing networks. In the case of commercial investigators, rampant media piracy attracts investigation, while at the same time there are marketing interests in predicting trends in media consumption.

The huge volume of data being shared on these systems is a hindrance for many investigative methods -- having to download large files to analyse their content is particularly troublesome where legal restrictions are a key aspect of an investigation. So other methods would be preferred, such as looking at the limited data embedded in the network protocols.

The traffic generated by filesharing networks such as Gnutella and Bittorrent can be viewed as an interaction between a set of users and a set of files. Users are defined by their IP addresses or their network's user identifiers. Files are identified by their hashes and by their filenames. Hashes are a faulty identification mechanism when you're interested in the content of the file, at least under the most common hashing systems, because any change to a file produces a different hash. Filenames are often a good indicator of content, but not entirely reliable. The data is often very sparse, with most files being shared by very few users.
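
As a rough illustration of this view of the data (not the code used in the project), the observed traffic can be held as two lookup tables: the files seen for each user, and the users seen for each file. The addresses, hashes and function names below are made up for the example.

```python
from collections import defaultdict

# Hypothetical observations: (user identifier, file hash) pairs seen in traffic.
observations = [
    ("198.51.100.7", "hashA"),
    ("198.51.100.7", "hashB"),
    ("203.0.113.9", "hashB"),
    ("203.0.113.9", "hashC"),
    ("192.0.2.44", "hashD"),
]


def build_index(pairs):
    """Return (files_by_user, users_by_file) as plain dicts of sets."""
    files_by_user = defaultdict(set)
    users_by_file = defaultdict(set)
    for user, file_hash in pairs:
        files_by_user[user].add(file_hash)
        users_by_file[file_hash].add(user)
    return dict(files_by_user), dict(users_by_file)


def density(files_by_user, users_by_file):
    """Fraction of all possible (user, file) cells that were actually observed."""
    observed = sum(len(files) for files in files_by_user.values())
    return observed / (len(files_by_user) * len(users_by_file))


files_by_user, users_by_file = build_index(observations)
print(density(files_by_user, users_by_file))
```

Storing only the observed pairs, rather than a full users-by-files grid, is what keeps this workable when most files are shared by very few users.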

Very similar datasets are used by online shopping sites like Amazon. Their data has the same basic pattern -- lots of users, lots of items, a sparse overlap. Amazon are a good example here because of their well-known recommendations based upon these datasets. Within the field of recommender systems there is a specific family of methods which are not based on content -- i.e., information about what an item is -- but simply on the pattern of user behaviour regarding it, an idea often described as 'the wisdom of the crowds'. This family is known as collaborative filtering.

There's reason to think that collaborative filtering systems could be particularly good as investigative tools for filesharing networks. Their lack of reliance on content should make them comparatively scalable, they have a track record on similar datasets, and their ability to perform analyses relies only on parts of the observable network traffic. But what would we use them for?

There are three types of task which come to mind:

  1. Predicting the genre of unknown files. This is, to a certain degree, what online shops use them for -- genres are clusters of users' interests. If a person likes crime fiction, and another person likes crime fiction, then you might expect their buying habits to overlap, and so things which one user has bought but the other hasn't might be good recommendations precisely because of this overlap in genre. There are obvious applications to filesharing investigations here: child abuse media is a domain where investigators would be keen to know that a file they don't know about falls into the same class as a set of files they do know about. The same goes for rightsholders investigating piracy. In both cases, filenames can often be unreliable because users try to avoid advertising file contents in order to evade investigation and prosecution.

  2. Predicting the behaviour of users. If a recommender system says a user is likely to enjoy something, does that mean they will get it? In online shopping scenarios, this question is often confounded by the fact that the shop uses the recommendation to try and sell the user that item. In a filesharing setting, however, it could be that recommendations are in fact predictive of future behaviour. An answer to this question could be very useful in scenarios such as the investigation of filesharing to help your company sell music.

  3. Identifying users. You are what you do -- information as varied and innocuous as what you write, how you type it, when you post status updates, and where you post it from have all been shown to have value in identifying a person, so there's good reason to suspect that your filesharing behaviour would be similarly identifiable. This has obvious applications where an investigator wants to identify some user as actually being another account of the same person.


I investigated the above possibilities with a particular eye to the applications for law enforcement -- mainly focusing on the child abuse media use case, as this is the primary reason law enforcement deal with filesharing networks. The investigation was carried out as my final undergraduate research project, guided by my supervisor, Awais Rashid. The project was linked with work Awais was already involved in on the EU Safer Internet Programme's Project Icop, and drew on some datasets created by monitoring the Gnutella and Bittorrent networks. It was reasonably serious business -- I had a criminal background check in order to be able to take part.

One small hurdle I ran into was finding a competitive collaborative filtering method which worked on binary data -- that is 'have bought' or rather 'have shared' -- rather than ratings data, which many collaborative filtering systems use. I searched the literature and found the Lightweight Collaborative Filtering for Binary Encoded Data algorithm, which seemed pretty suitable for what I needed.

To test the genre question, I found files which -- from their filenames -- seemed to belong to one category or another, and applied the recommender system to this initial set of files to try and generate files which were in the same category. It would've been good to be able to test the direct application of this to child abuse media, but the filenames for child abuse media are often written in a very niche language which I wasn't familiar with, and resources on it are usually restricted. The categories I looked at were pornography (a common proxy for child abuse media), pop music (a very different kind of file, widely shared) and files associated with software piracy -- key generators, etc. (another, more shady, kind of file). The filenames of the files predicted by the recommender were then given -- along with some randomly selected files as a control -- to some independent human reviewers, who classified each one as belonging to one of these categories, 'unknown' or 'other'.
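
To make the seed-expansion step concrete, here is a minimal sketch in Python. It is not the Lightweight Collaborative Filtering algorithm mentioned above, just a simple item-based scheme which scores each unknown file by its Jaccard similarity (overlap of sharers) with the seed files, using only binary 'shared / not shared' data. The function names and toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(seed_files, users_by_file, top_n=5):
    """Rank non-seed files by summed similarity of their sharers to the seeds' sharers."""
    scores = {}
    for candidate, sharers in users_by_file.items():
        if candidate in seed_files:
            continue
        score = sum(
            jaccard(sharers, users_by_file[seed])
            for seed in seed_files
            if seed in users_by_file
        )
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by file name for a stable order.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_n]


# Toy data: file -> set of users observed sharing it.
users_by_file = {
    "seed1": {"u1", "u2", "u3"},
    "seed2": {"u2", "u3"},
    "close": {"u1", "u2", "u3"},
    "partial": {"u3", "u4"},
    "unrelated": {"u5"},
}

print(recommend({"seed1", "seed2"}, users_by_file))
```

In this toy example 'close' is shared by exactly the users who share the seeds, so it ranks first; 'unrelated' has no sharers in common with any seed and is never suggested.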

The results here were good: the reviewers mostly agreed with the predictions of the tool. More interestingly, they agreed with the results from the Gnutella network more than they agreed with the results from the Bittorrent network, suggesting that properties of the network affect the performance of the prediction system. The system worked best on pop music.

To test the prediction question, I went over the history of those users who appear multiple times in my dataset, and at each point tried to predict what files they might share in the future, then looked to see if they did share any of these files. I found that, essentially, they didn't -- the predictive power of the method was very low on the Gnutella network, and it never produced a correct result on my Bittorrent datasets. This doesn't mean it can't be done, just that my method wasn't sufficient.
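
The evaluation loop described above can be sketched roughly as follows. This is an illustration under assumptions rather than the project's code: each user's history is taken to be a time-ordered list of sets of file hashes, `predict` stands in for whatever recommender is being tested, and a trial counts as a hit if any predicted file turns up later in that user's history.

```python
def hit_rate(snapshots_by_user, predict, top_n=5):
    """Fraction of prediction points at which a predicted file was later shared.

    snapshots_by_user: user -> list of sets of file hashes, in time order.
    predict: hypothetical callable (seen_files, top_n) -> list of file hashes.
    """
    hits = trials = 0
    for snapshots in snapshots_by_user.values():
        seen = set()
        # Users seen only once give no prediction points (the range is empty).
        for i in range(len(snapshots) - 1):
            seen |= snapshots[i]
            future = set().union(*snapshots[i + 1:]) - seen
            if not future:
                continue  # nothing new to predict at this point
            trials += 1
            if future & set(predict(seen, top_n)):
                hits += 1
    return hits / trials if trials else 0.0


# Toy predictor: guesses "b" for anyone who has shared "a", otherwise "x".
def toy_predict(seen, top_n):
    return ["b"] if "a" in seen else ["x"]


history = {
    "u1": [{"a"}, {"b"}],  # prediction is correct
    "u2": [{"c"}, {"d"}],  # prediction is wrong
    "u3": [{"a"}],         # seen only once, so not evaluated
}

print(hit_rate(history, toy_predict))
```

Only files the user has not already been seen sharing count as 'future' files, so a recommender cannot score hits by repeating what it was given.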

I struggled to find a good way to test the final application -- the use of the method to identify users -- because I didn't have any ground truth data about which of these IP addresses were actually the same person. Instead, I used the method to show that a user's sharing profile at one time scores highly on a similarity metric against the same user's profile at a different time, providing some weak evidence for the application scenario.
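
One way to frame that self-similarity check is sketched below. The similarity metric here is plain Jaccard similarity over sets of file hashes, which is an assumption made for illustration and not necessarily the metric used in the project; the identifiers and data are invented. The idea is to compare a user's earlier profile against every later profile and see where their own later profile ranks.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def self_match_rank(user, early, late):
    """Rank (1 = most similar) of a user's own later profile among all later
    profiles, when each is compared with that user's earlier profile.

    early, late: user -> set of file hashes; `user` must appear in both.
    """
    scores = {other: jaccard(early[user], files) for other, files in late.items()}
    ranked = sorted(scores, key=lambda u: (-scores[u], u))
    return ranked.index(user) + 1


# Toy profiles from two observation periods.
early = {
    "ipA": {"f1", "f2", "f3"},
    "ipB": {"f7", "f8"},
}
late = {
    "ipA": {"f2", "f3", "f4"},
    "ipB": {"f8", "f9"},
    "ipC": {"f3", "f9"},
}

for user in early:
    print(user, self_match_rank(user, early, late))
```

A rank of 1 for most users would be the kind of weak evidence described above: it shows profiles are stable over time, but without ground truth it cannot show that two different identifiers belong to one person.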