I listen to a lot of audiobooks and often find myself seeking out “similar books” as soon as I finish one, so I am constantly looking for new books to read. Usually, the best recommendations come from my husband, who reads many of the same books I do, but I figured I could find even more books if I extended that pool of ‘similar readers’. A book recommender system seemed like a logical first step.
The recommender was built loosely following the Recommender System assignment from the Coursera course Machine Learning, though I made a few of my own changes.
First, the data.
I started with the Book-Crossing dataset compiled by Cai-Nicolas Ziegler in order to build the recommender system. These data include both implicit and explicit ratings. Explicit ratings are actual ratings of books on a scale of 1–10, indicating how much the user enjoyed the book. Implicit ratings are simply ‘yes/no’: did this user read this book? At first, I wanted to use implicit ratings because it would be easier to gather input from a user (“have you read this book?”). However, I realized that implicit ratings would not account for personal preference the way explicit ratings would, which would make my recommender less ‘accurate’. Explicit ratings also have the benefit of making it very easy to rank the recommendations from ‘most likely to be enjoyed’ to ‘least likely to be enjoyed’.
To reduce the data, I separated the ratings into implicit and explicit ratings. I also deleted any users who had rated only books missing from the books dataset, and removed books not rated by any user in the user dataset. Since the same book can have multiple ISBNs due to different editions or printings, it was important to me to collapse books with the same title and author into a single ISBN. This way, a book accumulates ratings regardless of which version was read. This increases the rating count for most books and makes searching the database for a specific book much more straightforward, since I do not need to worry about duplicates.
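The ISBN consolidation step can be sketched with pandas. This is a toy illustration, not the project's actual code; the column names (“ISBN”, “Book-Title”, “Book-Author”, “User-ID”) follow the Book-Crossing CSV layout, and the tiny DataFrames stand in for the real files:

```python
import pandas as pd

# Toy stand-ins for the Book-Crossing books and ratings tables.
# "111" and "222" are two editions of the same book.
books = pd.DataFrame({
    "ISBN": ["111", "222", "333"],
    "Book-Title": ["Dune", "Dune", "Emma"],
    "Book-Author": ["Frank Herbert", "Frank Herbert", "Jane Austen"],
})
ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["111", "222", "333"],
})

# Pick one canonical ISBN per (title, author) pair ...
canonical = books.groupby(["Book-Title", "Book-Author"])["ISBN"].first()

# ... build a map from every ISBN to its canonical ISBN ...
isbn_map = books.set_index("ISBN").apply(
    lambda row: canonical[(row["Book-Title"], row["Book-Author"])], axis=1
)

# ... and rewrite the ratings so all editions share one ISBN.
ratings["ISBN"] = ratings["ISBN"].map(isbn_map)
```

After this, both editions of Dune contribute ratings to the same ISBN, so rating counts reflect the book rather than the printing.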
Finally, to reduce the size of the eventual pivot table so that I could work with it on my local machine, I also removed books and users with fewer than a minimum number of ratings. I chose the thresholds by simply looking at histograms of the number of ratings each user had and the number of ratings each book had.
From the histograms, I decided to set the cutoff at users with more than 100 ratings and books with more than 50 ratings, to retain most of the ratings while avoiding a very sparse matrix.
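The filtering step above might look something like the following sketch. The thresholds are scaled down so the toy data demonstrates the logic; in the real project they would be 100 ratings per user and 50 per book:

```python
import pandas as pd

# Toy ratings table; column names follow the Book-Crossing layout.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 3, 3],
    "ISBN": ["a", "b", "c", "a", "a", "b"],
    "Book-Rating": [9, 8, 7, 6, 5, 4],
})
MIN_USER, MIN_BOOK = 2, 2  # stand-ins for the real 100 / 50 cutoffs

# Count ratings per user and per book, then keep only rows where
# both the user and the book clear their minimums.
user_counts = ratings["User-ID"].value_counts()
book_counts = ratings["ISBN"].value_counts()
keep = (
    ratings["User-ID"].map(user_counts).gt(MIN_USER)
    & ratings["ISBN"].map(book_counts).gt(MIN_BOOK)
)
filtered = ratings[keep]
```

One caveat with this single-pass approach: dropping books can push some users back under the user threshold (and vice versa), so in practice the filter may need to be applied iteratively until the counts stabilize.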
The Model.
The recommender system makes use of collaborative filtering. There are two types of collaborative filtering: user-based and item-based. Item-based filtering narrows possible recommendations by looking for similar items, much as a recommendation like “movies like The Avengers” would give you more Marvel movies. This type of filtering could compare titles, authors, and publication years.
However, I wanted to build recommendations based on input from other users. In other words, I wanted to take into account whether someone enjoyed the book. This is done with user-based collaborative filtering, which narrows possible recommendations by finding the users most similar to the user looking for a recommendation. This functions similarly to a recommendation like “users who liked The Notebook also liked Up”. An item-based recommender would find little that makes The Notebook and Up similar, but a user-based recommender may capture the fact that both movies include sappy love stories, because people who like one type of love-story movie probably also like the other.
For a given user, the 10 most similar users are found with a k-nearest-neighbors algorithm. The books read by these similar users are then used in the collaborative filtering algorithm to estimate ratings for the user in question.
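Finding similar users can be sketched with scikit-learn's `NearestNeighbors` on the user-by-book pivot table. This is a minimal illustration with a toy matrix, and the choice of cosine distance is my assumption, not necessarily what the original notebook uses:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy user-by-book rating matrix (rows = users, columns = books, 0 = unrated).
# In the real project this would be the pivot table of explicit ratings.
ratings = np.array([
    [9, 8, 0, 0],
    [9, 7, 0, 0],
    [0, 0, 8, 9],
    [0, 0, 7, 8],
    [8, 9, 1, 0],
])

# Cosine distance compares rating *patterns*, so users who rate the same
# books similarly are close even if one rates everything a bit higher.
knn = NearestNeighbors(metric="cosine")
knn.fit(ratings)

# The 3 nearest neighbors of user 0; the closest is always user 0 itself,
# so the remaining indices are the similar users to draw books from.
distances, indices = knn.kneighbors(ratings[0:1], n_neighbors=3)
```

In the real system `n_neighbors` would be 11 (the user plus their 10 most similar users).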
This is done by finding the X and Theta that minimize the cost function: the squared error (X * transpose(Theta) – Y) summed over every book–user pair that has a rating, where Y holds the known ratings. Theta contains the parameters for each user: the amount each feature in X is scaled to predict that user’s rating. For instance, if there were a feature representing action scenes, and a user loved action scenes, the corresponding parameter in Theta would be high, predicting high ratings for books with many action scenes. X is the matrix describing the features of the books. While these features don’t have human-relatable descriptions such as ‘amount of action scenes’, by comparing across all the books, the algorithm ends up optimizing them so that similar books have similar feature values. The cost function also includes a regularization term on X and Theta so the model does not overfit the books that have already been read. Once X and Theta have been optimized, they can be used to predict ratings for books the user has not read, and the top 10 books with the highest predicted ratings are returned as recommendations.
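The cost described above can be written in a few lines of NumPy. The variable names (X, Theta, Y, and the rated-entry mask R) follow the convention of the Coursera assignment; the actual notebook may differ:

```python
import numpy as np

def cost(X, Theta, Y, R, lam):
    """Regularized collaborative-filtering cost.

    X:     (n_books, n_features) book feature matrix
    Theta: (n_users, n_features) per-user parameters
    Y:     (n_books, n_users)    known ratings
    R:     (n_books, n_users)    1 where a rating exists, else 0
    lam:   regularization strength
    """
    # Prediction error, zeroed out for book-user pairs with no rating.
    err = (X @ Theta.T - Y) * R
    # Squared error plus regularization on both X and Theta.
    return 0.5 * np.sum(err ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
```

Minimizing this jointly over X and Theta (e.g. with gradient descent or a scipy optimizer) yields the learned features and user parameters; the prediction for any unread book is then simply the corresponding entry of X @ Theta.T.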
With this algorithm, I now can recommend some books for a given user in the database! The notebook with code for this project can be found on Github here.
Next up: making an app to recommend books to new users! This involves creating a database that can expand with new users.