edwards2012collaborative


Collaborative filtering as an investigative tool for peer-to-peer filesharing networks

Edwards, M. and Rashid, A.

Abstract

The volume of illegal material on file-sharing networks poses a challenge for investigators attempting to police such networks. We propose a novel approach that automates the resource-intensive task of identifying previously unknown files of interest amongst hundreds of thousands of files shared on such networks. We also describe how this approach could be used to identify clusters of peers that might be closely related to each other, either as part of a syndicate, or as multiple personae of the same individual.

Our approach is based on the collaborative filtering techniques typically used in recommender systems. In this study we find that we can successfully make use of collaborative filtering techniques to find new media belonging to specific categories of interest to an investigation of a peer-to-peer network, without having to examine filenames or file contents. We also find evidence that distance metrics from collaborative filtering could be useful in the clustering and identification of peers on file-sharing networks. Additionally, we describe an unsuccessful attempt at using collaborative filtering to predict the future file-sharing behaviour of peers.

Summary

The traffic generated by filesharing networks such as Gnutella and BitTorrent can be viewed as an interaction between a set of users (identified by their IP addresses or their network's user identifiers) and a set of files, identified by their hashes (which are of limited value for identifying content) and their filenames (a reasonable but unreliable indicator). The data is often very sparse, with most files being shared by very few users.
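A minimal sketch of this view of the data, assuming observations arrive as (user identifier, file hash) pairs. The sample observations and the names `build_incidence` and `density` are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def build_incidence(observations):
    """Map each file hash to the set of users seen sharing it."""
    sharers = defaultdict(set)
    for user, file_hash in observations:
        sharers[file_hash].add(user)
    return dict(sharers)


def density(sharers):
    """Fraction of (user, file) cells that are filled in the incidence matrix."""
    users = set().union(*sharers.values()) if sharers else set()
    cells = len(users) * len(sharers)
    filled = sum(len(u) for u in sharers.values())
    return filled / cells if cells else 0.0


# Hypothetical observations: (user identifier, file hash).
observations = [
    ("u1", "h1"), ("u1", "h2"),
    ("u2", "h2"),
    ("u3", "h3"),
    ("u4", "h4"),
]
sharers = build_incidence(observations)
print(density(sharers))
```

Even in this toy example most cells are empty; on a real network the proportion of filled cells would be expected to be far smaller.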

This problem is similar to the problems faced by online shopping sites such as Amazon. These sites are interested in generating recommendations not only from content information ("what kind of film is this?") but also from patterns of user behaviour ("people like you liked this"). This is the field known as collaborative filtering. The question is whether collaborative filtering solutions can be applied to combat cybercrime on peer-to-peer networks.
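The "people like you liked this" idea can be sketched as a simple neighbourhood-based recommender. This is a generic illustration, not the method used in the paper: the choice of Jaccard similarity, the function names and the sample libraries are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, libraries, top_n=3):
    """Rank files the target lacks by the summed similarity of users who have them."""
    own = libraries[target]
    scores = {}
    for other, files in libraries.items():
        if other == target:
            continue
        sim = jaccard(own, files)
        for f in files - own:
            scores[f] = scores.get(f, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, s in ranked[:top_n] if s > 0]


# Hypothetical users and the files each one shares.
libraries = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "carol": {"x", "y"},
}
print(recommend("alice", libraries))
```

No filename or file content is consulted here; the ranking depends only on who shares what.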

There are three tasks we considered using collaborative filtering for:

  1. Predicting the genre of unknown files. This is particularly relevant for questions like "is this file child pornography?": because most of the files shared by pornographers are probably pornography, a genre classifier based on user behaviour is plausible. An experiment in which human raters judged the genre accuracy of a collaborative filtering classifier suggests that this could work.

  2. Predicting the behaviour of users. Given information about the kinds of file a person likes, it might be possible to predict which files they will download next. While this may be possible in principle, the method trialled in the paper did not work.

  3. Using file data to identify users. The kinds of file a person shares might reasonably be expected to reveal when two identities belong to the same person. We had no ground truth for this, so there was no rigorous evaluation, but we did demonstrate that users are more similar to themselves at different times than they are to other users.
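Tasks 1 and 3 above can be sketched with set overlaps alone. This is a hedged illustration under simple assumptions, not the paper's implementation: the co-sharer voting rule, the use of Jaccard similarity, the function names and the sample data (with neutral genre labels) are all hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_genre(file_id, sharers, labels):
    """Task 1: vote over the known genres of files co-shared with file_id."""
    votes = Counter()
    for other, users in sharers.items():
        if other == file_id or other not in labels:
            continue
        overlap = len(sharers[file_id] & users)
        if overlap:
            votes[labels[other]] += overlap  # one vote per shared user
    return votes.most_common(1)[0][0] if votes else None


def closest_earlier_identity(later_files, earlier_snapshots):
    """Task 3: find the earlier identity whose files best match a later snapshot."""
    return max(
        earlier_snapshots,
        key=lambda user: jaccard(later_files, earlier_snapshots[user]),
    )


# Hypothetical data: file -> users sharing it, plus genres for the known files.
sharers = {
    "f1": {"u1", "u2"},
    "f2": {"u1", "u2", "u3"},
    "f3": {"u4"},
    "f4": {"u4", "u5"},
    "new": {"u1", "u3"},
}
labels = {"f1": "music", "f2": "music", "f3": "film", "f4": "film"}
print(predict_genre("new", sharers, labels))

# Hypothetical snapshots of what each identity shared at an earlier time.
earlier = {"u1": {"a", "b", "c"}, "u2": {"x", "y", "z"}}
print(closest_earlier_identity({"a", "b", "d"}, earlier))
```

The second function mirrors the observation that a user's later snapshot tends to resemble their own earlier snapshot more than anyone else's; without ground truth, a match of this kind is only a lead for an investigator, not an identification.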