A JMC2 Suprema Project, 2004.

Collaborative filtering systems (recommender systems) attempt to match people with similar tastes in order to suggest items that may be of interest. This has an obvious commercial advantage in that it encourages customers to buy more. But it also benefits the customer as they should (if the system works correctly) end up buying products that they like.
Collaborative filtering depends on customers rating products that they own so that the system can gauge their tastes. As the number of possible products generally vastly out weights the number that any one customer can purchase, the information regarding people's tastes is incomplete.
The information that the system has to work with can be considered as a matrix where the known values are the rating that a customer has given that item:
| User | Items | ||||
| 1 | 2 | 3 | 4 | 5 | |
| 1 | 5 | 1 | 4 | 5 | ⊥ |
| 2 | 4 | 1 | 4 | ⊥ | ⊥ |
| 3 | ⊥ | ⊥ | 1 | ⊥ | 5 |
| 4 | ⊥ | ⊥ | ⊥ | ⊥ | 5 |
| 5 | ⊥ | ⊥ | ⊥ | ⊥ | ⊥ |
Just by looking at the table to the left we can see that users 1 and 2 seem to agree on items 1, 2 and 3. And as user 1 also enjoyed item 4, we should probably suggest it to user 2. Likewise, we shouldn't suggest item 3 to user 4.
However, that isn't an algorithm which we can implement and scale to a system as large as Amazon.
Click points!
The Pearson correlation is a statistical method which determines the correlation of a set of (x,y) pairs. It gives a value between -1 and 1 where 1 is a perfect positive correlation and -1 is a perfect negative correlation. It is given by the following formula:
In this case, the x values would be the ratings that one person has given to a set of items and the y values would be the ratings of another person for those same items.
You can experiment with it in the diagram on the right. Just click a series of points on the graph and click to calculate the Pearson value. Click clear to start again.
So a simple collaborative filtering system proceeds in three stages:
With the example table above the Pearson function gives a correlation of 0.97 for users 1 and 2. Thus, if user 2 was the active user then only user 1 would be in the set of considered users as no other user has enough data points in common to evaluate the Pearson function.
Once we have the set of considered users (which is only one use in this case) we can weight their scores by the Pearson value and average. For our example this gives a predicted rating of item 4, by user 2 of 0.97*5 = 4.85.
Of course, the Pearson system involves querying the whole database for every query. For small databases this is fine, but it doesn't scale. At the cost of accuracy, queries can be made in constant time with some precomputation. The simplest of these schemes would predict a rating, for a given item, of the average of all the rating for that item so far. It's certainly a constant time operation, but it's not very accurate. Fortunately, better systems exist.
The STIN1 [LEMIRE04] model based system is such a scheme. It involves an expensive precalculation, but once you've done that queries run in constant time. If you are really interested the algorithm is at the end of the paper above.
Another very scalable system is that employed by Amazon [AMAZON03] of item-item similarity. For each pair of items that a customer of Amazon has purchased, Amazon uses a similarity function and precomputes the similarity of all these pairs of items offline. Amazon uses the cosine of the angle between the vector as its similarity measure but, considered as a black-box, it could equally well be the Pearson function.
When a recommendation is needed the past history of the user is used to quickly find similar items, remove items that the user has already purchased and rank them. This is a very fast operation that is done in real-time.
If Amazon restricted the input to only those items that the user has actually purchased (rather than merely claiming to own) the algorithm would be very attack resistant. Unfortunately, as they do not do this, it's as weak in this respect any any of the other centralised systems.
If we want to ensure the reliability of this system then the single most important thing that we must prevent are incorrect results. The user will not expect the system to recommend every item that they could possibly want. So, if the system fails to make a recommendation which would be good for our user it presents very little problem. A much larger problem is that of false positives, which are items that are recommended (a positive result) but not actually wanted by the customer (a false result). This can easily result from naive implementations of Pearson's correlation. If we have two customers and they both like rock then they may give similar values for the same CDs. Now suppose that customer A is into Christian rock, but customer B isn't. Customer B may have only rated rock CDs and so this will give a very good correlation coefficient, and so by this measure the system will recommend Christian rock to customer B, as customer A rated the CDs highly, resulting in an angry customer.
When applying this method to a large system like Amazon, we might find that we have a very large number of items and a very large number of customers. However even with a huge number of customers we may well find that the average customer will only rate a few items, thus we have very little data, making it hard to give recommendations. If we have two customers with very similar tastes we may not recognise this if they do not rate any of the same items. This will give a nominal value for the correlation of zero. This can be further exacerbated by a problem called "synonymy". This is a problem caused by a database containing items that are listed separately, but are in fact the same. For example, the DVD and VHS edition of the same film.
One of the greatest problems with recommender systems is that they need a large database of ratings in order to make any reliable recommendations at all. A system which needs a year to get going before it can function properly will be seen as fairly useless by both the customers and the shop owner. This problem also applies to items which have just been added to the database, as they will have no ratings and thus cannot be recommended.
Different people have different scales in which they rate things. If you were to ask 10 people to rate a set of items on a scale of 0 to 10 some would be shy, rarely giving extreme values and others would be harsh, often scoring lowly. A collaborative filtering system shouldn't be effected by these variations. By inspection we can see that one user giving ratings of:
(1, 10, 1, ⊥)
And another giving ratings of:
(2, 4, 2, ⊥)
Are the same and two such ratings are equivalent if one can be obtained from the other by a scale and a transformation. STI invariant systems compensate for this. The Pearson function above is STI invariant, have a play with it to convince yourself of this.
Not all people are easy to classify as they don't conform to normal tastes and standards. This leads us to classify users into three groups. The "white sheep" are those that have tastes that are representative. These customers will have a large number of other users with whom they have a high correlation co-efficient. They are the easiest users for the system to deal with.
"Black sheep" are the opposite. They have very idiosyncratic tastes and this makes suggesting recommendations nearly impossible. Although this is a failure of the recommender system, non-electronic recommenders (ie a knowledgeable shop clerk) also have great problems in these cases, so this is an acceptable failure.
"Grey sheep", however, are the ones who cause a real problem. They are users who give usual and sometimes relatively unusual ratings to items. This means that, although they have other users for which a high correlation co-efficient is calculated, recommendations based on them largely turn out to be false positives.
To deal with the sparsity of the database and the problem of cold starts, we need to find ways in which to fill our databases other than with simple customer recommendations. One way to acheive this is to assume that whenever a customer has chosen to buy an item they are making an informed decision and thus we should insert in a default mark. This can either be a standard mark like 4/5 or the average score recorded by all other customers.
Ideally, we would like to have highly prolific users (those who have made a large number of recommendations) and this can be simulated by including a large database of ratings, such as sites like All Music and IMDB. This, however, does not properly represent the different tastes of various users, ie recommending a hardcore rock fan a very well rated R & B CD would normally result in a false positive. For this reason filterbots are added to the system. These agents base their recommendations on information already present in the database. A straightforward example of a filterbot is a genrebot, which would base its opinion solely on the genre of the item. For example, a "rockbot" would give a full marks to a CD simply because it is in the rock category. Where as it would give a low score to any other CD in the database.
Another very successful heuristic for dealing with the problems of developing a database is to group users together by similar tastes. A simplistic way to do this is to group users that have a high correlation co-efficient. Then each customer is grouped into a neighborhood and, from this, we can create an average or model user for each of these categories. This proves very helpful as it reduces the database size and deals with the problems outlined above.
If we have three customers, call them A, B & C where A & B and B & C have a high correlation , while A & C have no common items (and hence a zero value for the correlation coefficient) then normally we would not be able to recommend A's choices to C. But if we group A, B & C into a single neighborhood then we can make more correct recommendations to C. Neighborhoods can also help with the problems of "grey sheep" as their voices will be drowned out in favor of the more usual (white) members of a neighborhood. It is because of "grey sheep" that it can often be better to choose the mode of a neighborhood (the user who represents the center of the tastes of the group) rather than calculate the average user.
Only a small percentage of users will actually register scores and these ratings can often be quite unreliable as users often only register an opinion if they feel strongly about it. Therefore an alternative solution is not to recommend highly rated items, but those that have been purchased often.
Unfortunately, these systems are very easy to attack. In an attack to "capture" a certain demographic, an attacker can create many accounts on the system and rate items popular with that demographic highly. They can then rate another item (which, we suppose, the attacker has a financial interest in) highly. The correlation between the target demographic and the attacker's accounts will be high and the sheer number of accounts will push the attacker's item into recommendation lists.
Similarly, an attacker could drag down the predicted rating of an item by the same means.