Recommendation system using Collaborative Filtering

Chinmoy Borah · Published in Analytics Vidhya · 10 min read · Nov 11, 2020

Whenever we buy something from e-commerce websites like Amazon or Flipkart, we see suggestions like “People who bought item X also bought item Y”. I have always wondered how these websites find related products for users, and how they maintain such an amazing recommendation system for such a huge user base.

The amazing digital success story of Netflix, the world’s leading Internet television network and the most valuable streaming service in the world, is incomplete without a mention of its recommender systems, which focus on personalization. Personalization begins on Netflix’s homepage, which shows groups of videos arranged in horizontal rows. Each horizontal row has a title that relates to the videos in that group. Most of the personalization comes from the way the rows are selected and the order in which items are placed within them. Recommender systems at Netflix span various algorithmic approaches, including reinforcement learning, neural networks, causal modelling, probabilistic graphical models, matrix factorization, ensembles, and bandits.

In October 2006, Netflix launched the Netflix Prize, an open competition for the best algorithm to predict user ratings for films based on previous ratings, without any other information about the users or films. The main goal of the competition was to find a more accurate movie recommendation system to replace Netflix’s own system at the time, called Cinematch. Collaborative Filtering, a technique that has often been used for creating highly accurate and efficient recommendation systems, has enjoyed a surge of interest since the Netflix Prize competition.

This story focuses on Collaborative Filtering, which is based on the assumption that people like things similar to other things they like, and things that are liked by other people with similar taste. In this github repository, you can find some of these ideas implemented on the Netflix Prize competition data.

Input Types:

Recommendation systems rely on different types of input. Most convenient is high-quality explicit feedback, where users directly express their interest in items (like a 5-star rating for Dark Knight). However, explicit feedback is not always available. Thus, recommendation systems can infer user preferences from the more abundant implicit feedback, which indirectly reflects users’ opinions through their behavior: page views, clicks, purchase records, browsing history, whether or not a user listens to a music track, search patterns, or even mouse movements.

Understanding the Data:

We have an m × n matrix of ratings with users u ∈ [1, m] and items i ∈ [1, n]. A rating rᵤᵢ indicates the rating given by user u on item i. Usually the vast majority of ratings are unknown; for example, in the Netflix data 99% of the possible ratings are missing, because a user typically rates only a small portion of the movies.
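To make this concrete, here is a minimal sketch of how such a matrix can be stored in Python, keeping only the observed cells; the (user, item, rating) triplets below are made up for illustration, but the Netflix training files would be parsed into the same shape:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (user, item, rating) triplets with 0-based ids.
users = np.array([0, 0, 1, 2])
items = np.array([0, 2, 1, 2])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

m, n = 3, 4  # m users, n items
R = csr_matrix((ratings, (users, items)), shape=(m, n))

# Only the observed cells are stored; every other entry is "missing".
sparsity = 1.0 - R.nnz / (m * n)
print(f"{sparsity:.0%} of the possible ratings are missing")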

Baseline Estimates:

There is a systematic tendency for some users to give higher ratings than others, and for some items to receive higher ratings than others. We need to adjust the data by accounting for these effects, which we encapsulate within the baseline estimates. A baseline estimate for an unknown rating rᵤᵢ is denoted by bᵤᵢ and accounts for the user and item effects:

bᵤᵢ = μ + bᵤ + bᵢ

The parameters bᵤ and bᵢ indicate the observed deviations of user u and item i, respectively, from the average. For example, suppose we want a baseline estimate for the rating of the movie Dark Knight by a user named Bane. Say that the average rating over all movies, μ, is 3.8 stars. Furthermore, Dark Knight is better than an average movie, so it tends to be rated 0.4 stars above average. On the other hand, Bane is a critical user who tends to rate 0.2 stars lower than average. Thus, the baseline estimate for Dark Knight’s rating by Bane would be 3.8 − 0.2 + 0.4 = 4.0 stars.

In order to estimate bᵤ and bᵢ we can solve the least-squares optimization problem on the loss function given below:

min over bᵤ, bᵢ of  Σ₍ᵤ,ᵢ₎∈K (rᵤᵢ − μ − bᵤ − bᵢ)² + λ ( Σᵤ bᵤ² + Σᵢ bᵢ² )

Here K is the set of (u, i) pairs for which rᵤᵢ is known. The first term, before the plus sign, strives to find bᵤs and bᵢs that fit the given ratings, and the second term is the regularization term, which avoids overfitting by penalizing the magnitudes of the parameters.
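As an illustration, here is a minimal numpy sketch (not the competition code) that fits bᵤ and bᵢ by stochastic gradient descent on this objective; the triplets, learning rate, and λ below are made up for demonstration:

import numpy as np

# Hypothetical 0-based (user, item, rating) triplets.
triplets = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 2, 1.0)]

def fit_baselines(triplets, m, n, lam=0.02, lr=0.005, epochs=50):
    # SGD on the regularized least-squares objective above: each step
    # nudges b_u and b_i along the prediction error, while the lam term
    # shrinks them towards zero.
    mu = np.mean([r for _, _, r in triplets])
    b_u, b_i = np.zeros(m), np.zeros(n)
    for _ in range(epochs):
        for u, i, r in triplets:
            err = r - (mu + b_u[u] + b_i[i])
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
    return mu, b_u, b_i

mu, b_u, b_i = fit_baselines(triplets, m=3, n=4)
print(mu + b_u[0] + b_i[1])  # baseline estimate b_ui for user 0, item 1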

Neighborhood Models:

The standard method of Collaborative Filtering is known as the nearest-neighborhood algorithm. Its original form is the user-user method, which estimates unknown ratings based on ratings given by like-minded users. Almost all earlier CF systems used a user-user method. Later, the item-item approach also became popular; it estimates the rating a user gives to an item using the known ratings made by the same user on similar items.

Better scalability and improved accuracy make the item-item approach more favorable in many cases. The item-item method is also more successful at providing some kind of interpretability, or reasoning, behind the predictions it makes, since users are familiar with the items they previously preferred but do not know those allegedly like-minded users.

Our goal in the item-item approach is to predict r̂ᵤᵢ, the unobserved rating given by user u for item i. We identify the k items rated by u that are most similar to i, using a similarity measure frequently based on the Pearson correlation coefficient ρᵢⱼ, which measures the tendency of users to rate items i and j similarly. The predicted value of rᵤᵢ is then taken as a weighted average of the ratings of the neighboring items, with the similarities as weights, while adjusting for user and item effects through the baseline estimates bᵤᵢ:

r̂ᵤᵢ = bᵤᵢ + [ Σⱼ∈Nᵤᵏ(i) sim(i, j) · (rᵤⱼ − bᵤⱼ) ] / [ Σⱼ∈Nᵤᵏ(i) sim(i, j) ]

Here bᵤᵢ = μ + bᵤ + bᵢ is the baseline prediction of the (user, movie) rating, Nᵤᵏ(i) is the set of k similar movies (neighbors) of movie i rated by user u, and sim(i, j) is the similarity between movies i and j. Also, rᵤⱼ is the rating which user u gave to item j, and bᵤⱼ is the baseline prediction for user u on item j.
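A minimal numpy sketch of this prediction rule; it assumes R is a dense m × n array with np.nan for missing entries, sim is a precomputed n × n item similarity matrix, and mu, b_u, b_i come from the baseline fit above:

import numpy as np

def predict_item_item(u, i, R, sim, k, mu, b_u, b_i):
    # Weighted deviation-from-baseline formula above.
    rated = np.flatnonzero(~np.isnan(R[u]))             # items user u has rated
    rated = rated[rated != i]
    neighbours = rated[np.argsort(sim[i, rated])[-k:]]  # N_u^k(i): k most similar
    w = sim[i, neighbours]
    if w.sum() <= 0:                                    # no usable neighbours:
        return mu + b_u[u] + b_i[i]                     # fall back to the baseline
    dev = R[u, neighbours] - (mu + b_u[u] + b_i[neighbours])
    return mu + b_u[u] + b_i[i] + np.dot(w, dev) / w.sum()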

Generally, the similarity measure used is the cosine similarity or the Pearson correlation coefficient. But we can also use a shrunk Pearson-baseline correlation coefficient, which is based on the Pearson baseline similarity (i.e. we center ratings on the baseline predictions instead of the mean rating of the user/item). It is given by:

sim(i, j) = [ nᵢⱼ / (nᵢⱼ + λ) ] · ρᵢⱼ

The variable nᵢⱼ denotes the number of users who rated both i and j, and λ is a shrinkage constant (typically around 100): correlations supported by only a few common raters are shrunk towards zero.
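Here is one way this shrunk similarity could be computed for a pair of items, again assuming the dense-with-NaNs ratings array and fitted baselines from the previous sketches:

import numpy as np

def shrunk_pearson_baseline(i, j, R, mu, b_u, b_i, lam=100.0):
    # Users who rated both items i and j.
    common = np.flatnonzero(~np.isnan(R[:, i]) & ~np.isnan(R[:, j]))
    n_ij = len(common)
    if n_ij < 2:
        return 0.0
    # Centre the ratings on the baseline predictions, not on raw means.
    di = R[common, i] - (mu + b_u[common] + b_i[i])
    dj = R[common, j] - (mu + b_u[common] + b_i[j])
    denom = np.sqrt((di ** 2).sum() * (dj ** 2).sum())
    rho = (di * dj).sum() / denom if denom > 0 else 0.0
    # Shrink towards zero when few users support the estimate.
    return (n_ij / (n_ij + lam)) * rho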

Similarly, in the user-user approach, we identify the k users (neighbors) most similar to user u who have rated item i. The predicted value of rᵤᵢ is taken as a weighted average of the ratings given to item i by the k neighbors of user u, while adjusting for user and item effects through the baseline estimates bᵤᵢ. It is given as below:

r̂ᵤᵢ = bᵤᵢ + [ Σᵥ∈Nᵢᵏ(u) sim(u, v) · (rᵥᵢ − bᵥᵢ) ] / [ Σᵥ∈Nᵢᵏ(u) sim(u, v) ]

Also, in this story we have used a slightly different model than traditional neighborhood models, which in general are memory-based approaches to Collaborative Filtering. Memory-based approaches are non-parametric: no parameters are learned through optimization. The closest users or items are found using only cosine similarity or Pearson correlation coefficients, which involve nothing more than arithmetic operations, and those neighbors are used directly to make predictions. The neighborhood methods discussed here, in contrast, are based on formally optimizing a global cost function (the baseline parameters are learned). This leads to improved prediction accuracy, while maintaining merits of the neighborhood approach such as the interpretability of predictions and the ability to handle new ratings (or new users) without retraining the model.

Latent Factor Models:

An alternative approach to collaborative filtering, one that aims to uncover latent features that explain the observed ratings, is the latent factor model. In this story we look at latent factor models induced by singular value decomposition (SVD), a matrix factorization technique, on the user-item ratings matrix.

Sparsity and scalability are the two biggest challenges for the standard CF method, so matrix factorization offers a more advanced approach that decomposes the original sparse matrix into low-dimensional matrices with latent factors/features and less sparsity. It transforms both items and users into the same latent factor space, thus making them directly comparable.

Latent features express higher-level attributes of users and movies, such as the genre of the movie, the amount of action, orientation to children, depth of character development, or quirkiness. Intuitively, a user may give good ratings to the movies Iron Man, Source Code, and X-Men. These are not three independent opinions; rather, they show that this user favors sci-fi movies, and there may be many more sci-fi movies this user would like. The sci-fi category is one of the latent features in this case.

A typical SVD model associates each user u with a user-factors vector pᵤ ∈ Rᶠ, and each item i with an item-factors vector qᵢ ∈ Rᶠ. If the ratings matrix R were dense, pᵤ and qᵢ could easily be obtained analytically. However, a matrix of movie ratings is extremely sparse. So instead of factorizing R via SVD directly, we try to find pᵤ and qᵢ such that, when they are multiplied back together, the resulting matrix R′ is the closest approximation of R on the observed entries, and is no longer sparse. A prediction in R′ is the inner product of a user-factors vector and an item-factors vector, adjusted by the bias term bᵤᵢ:

r̂ᵤᵢ = bᵤᵢ + qᵢᵀpᵤ

where bᵤᵢ = μ + bᵤ + bᵢ, and we solve the optimization problem of minimizing the loss function below to find the optimal bᵤ, bᵢ, qᵢ, pᵤ:

min over bᵤ, bᵢ, qᵢ, pᵤ of  Σ₍ᵤ,ᵢ₎∈K (rᵤᵢ − μ − bᵤ − bᵢ − qᵢᵀpᵤ)² + λ ( bᵤ² + bᵢ² + ‖qᵢ‖² + ‖pᵤ‖² )

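A minimal numpy sketch of this biased-SVD model trained by stochastic gradient descent; the triplets and hyperparameters (f factors, λ, learning rate) are made up for illustration:

import numpy as np

triplets = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 2, 1.0)]

def fit_svd(triplets, m, n, f=20, lam=0.02, lr=0.005, epochs=30, seed=0):
    # SGD on the biased-SVD loss above; f is the number of latent factors.
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in triplets])
    b_u, b_i = np.zeros(m), np.zeros(n)
    P = rng.normal(0, 0.1, (m, f))  # user factors p_u
    Q = rng.normal(0, 0.1, (n, f))  # item factors q_i
    for _ in range(epochs):
        for u, i, r in triplets:
            err = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
            pu_old = P[u].copy()            # so both factor updates use the
            P[u] += lr * (err * Q[i] - lam * P[u])   # pre-step values
            Q[i] += lr * (err * pu_old - lam * Q[i])
    return mu, b_u, b_i, P, Q

mu, b_u, b_i, P, Q = fit_svd(triplets, m=3, n=4, f=2)
print(mu + b_u[0] + b_i[1] + Q[1] @ P[0])  # predicted rating for user 0, item 1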
However, the SVD model faces real difficulties when it needs to explain the predictions it makes. It abstracts users via an intermediate layer of user factors; this layer separates the computed predictions from past user actions and complicates explanations. Similarly, reflecting new ratings requires relearning the user factors, and cannot be done at virtually no cost as in the neighborhood models. Therefore, despite the higher accuracy of these models, neighborhood models are still expected to be a common choice in practical applications.

Implicit Feedback Models:

An explicit rating is a rating a user gives to an item, like 5 stars for Interstellar; it is direct feedback from users about how much they like an item. An implicit rating suggests user preference indirectly: page views, clicks, purchase history, browsing history, search patterns, whether or not a user listens to a music track, or even mouse movements. Such data is not available in the Netflix Prize dataset. However, a less obvious kind of implicit data does exist within it, and it can be used to significantly improve the prediction accuracy of the recommendation models we built: the dataset not only tells us the rating values, but also which movies users rate, regardless of how they rated them. In other words, a user implicitly tells us about her preferences by choosing which movies to voice an opinion on. We can use a model that extends the latent factor model with this implicit feedback, called SVD++.

In this model the prediction of the rating rᵤᵢ is given by:

r̂ᵤᵢ = μ + bᵤ + bᵢ + qᵢᵀ ( pᵤ + (1/√|Iᵤ|) Σⱼ∈Iᵤ yⱼ )

where Iᵤ is the set of all items rated by user u, |Iᵤ| is the size of that set, and the yⱼ are a new set of item factors that capture the implicit ratings of user u. Here, as already stated, an implicit rating describes the fact that user u rated item j, regardless of the rating value.

We solve the optimization problem on the loss function below to find the optimal bᵤ, bᵢ, qᵢ, pᵤ, yⱼ:

min over bᵤ, bᵢ, qᵢ, pᵤ, yⱼ of  Σ₍ᵤ,ᵢ₎∈K ( rᵤᵢ − μ − bᵤ − bᵢ − qᵢᵀ ( pᵤ + (1/√|Iᵤ|) Σⱼ∈Iᵤ yⱼ ) )² + λ ( bᵤ² + bᵢ² + ‖qᵢ‖² + ‖pᵤ‖² + Σⱼ∈Iᵤ ‖yⱼ‖² )
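For reference, the open-source scikit-surprise library ships an implementation of this model as SVDpp. A minimal usage sketch, with a made-up ratings dataframe standing in for the Netflix data:

import pandas as pd
from surprise import SVDpp, Dataset, Reader

# Hypothetical ratings; the Netflix files would be parsed into this shape.
df = pd.DataFrame({"user": [1, 1, 2, 3],
                   "item": [10, 20, 20, 30],
                   "rating": [5.0, 3.0, 4.0, 1.0]})

data = Dataset.load_from_df(df[["user", "item", "rating"]],
                            Reader(rating_scale=(1, 5)))
algo = SVDpp(n_factors=20, n_epochs=20)  # learns b_u, b_i, p_u, q_i and y_j
algo.fit(data.build_full_trainset())

print(algo.predict(uid=1, iid=30).est)   # predicted rating for user 1, item 30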

Final Thoughts:

This story provides a comprehensive look at various approaches to collaborative filtering, a technique widely popular for building highly accurate and efficient recommendation systems. We have mainly focused on the neighborhood approach (a traditional memory-based approach, with a slight change to make it parametric) and the latent factor approach (a model-based, parametric approach). We have also discussed how to take very basic implicit ratings of users and use them to significantly improve the prediction accuracy of the recommendation models.

However, both models have their own advantages and disadvantages, so in real life they are used interchangeably, or in ensembles with other types of recommendation systems.

Neighborhood models shine when users expect a system to give a reason for its predictions, rather than facing “black box” recommendations. They naturally provide intuitive explanations of the reasoning behind recommendations, which often enhances the user experience. Item-item neighborhood models can also provide updated recommendations immediately, as soon as users provide feedback to the system, without retraining the model or estimating new parameters. However, the approach doesn’t handle sparsity well, and it is not computationally efficient as the number of users and items grows.

Latent factor models offer a highly expressive ability to describe various aspects of the data, and thus tend to provide more accurate results than neighborhood models. But the underlying tastes expressed by latent features are not interpretable.

Regardless of these trade-offs, Collaborative Filtering faces the cold-start problem, which occurs when new items are added that have few or no interactions. Until a new item is rated by a substantial number of users, the model is unable to make personalized recommendations for it. Similarly, for long-tail items with little data, the model tends to give less weight and exhibits a popularity bias, recommending more popular items instead. Recommendations for such items will therefore be poor.

That’s all for now.

References:

  1. https://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf
  2. https://towardsdatascience.com/various-implementations-of-collaborative-filtering-100385c6dfe0
  3. https://www.kaggle.com/netflix-inc/netflix-prize-data
  4. https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26
