Dimensionality reduction

Movielens-20m is a dataset consisting in 20 million ratings and 465,000 tag applications applied to 27.000 movies by 138.000 users. We use the Singular Value Decomposition algorithm to build a recommender system for movies.

[1]:
from time import time
import os
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import scipy.linalg.interpolative as sli

from lightonml.datasets import movielens20m
from lightonml.encoding.base import SeparatedBitPlanEncoder, MixingBitPlanDecoder
from lightonml.projections.sklearn import OPUMap
from lightonopu.opu import OPU
[2]:
ratings, id_to_movie = movielens20m(processed=True, id_to_movie=True)
ratings = ratings.astype('float32')
[3]:
df_m = pd.DataFrame(id_to_movie[1:], columns=id_to_movie[0])
df_m.head()
[3]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

It’s possible to get the raw data from lightonml.datasets.movielens20m by setting processed=False. The array returned with processed=True is the result of the following code snippet run on the raw data:

df = pd.DataFrame(ratings[1:], columns=ratings[0])
df_m = pd.DataFrame(id_to_movie)
n_movies = len(df.movieId.unique())
n_users = len(df.userId.unique())
print('n users: {} n_movies: {}'.format(n_movies, n_users))

# create the user-item ranking matrix
df = df.pivot(index='movieId', columns='userId', values='rating')
ratings = df.values
[4]:
# demeaning ignoring nans along users
ratings -= np.nanmean(ratings, axis=0, keepdims=True)
# set nans to zero after demeaning
ratings[np.isnan(ratings)] = 0

Try SVD on original data

[5]:
try:
    start = time()
    u, s, v = np.linalg.svd(ratings)
    svd_original = time() - start
    print('Run SVD in {}'.format(svd_original))
except MemoryError:
    print('SVD requires too much memory.')
SVD requires too much memory.

Trying to perform SVD on the original data fails because of the high memory requirement of the algorithm.

Use randomized SVD instead

Randomized SVD consists in reducing the dimensionality of the data through random projections before performing SVD. The randomized version of the algorithm reduces the memory requirements and also decreases the computational complexity from \(O(kmn)\) to \(O(mn \log(k) + (m + n)k^2)\).

Where \(n\) is the number of samples, \(m\) is the number of features, and \(k\) is the number random features.

We follow algorithm 5.2 in Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, Halko et al., 2009.

[6]:
def interp_dec(A, k):
    idx, proj = sli.interp_decomp(A.astype('float64'), k)
    X = np.hstack([np.eye(k), proj])[:, np.argsort(idx)]
    return idx[:k], X


def randomize(x, k, thresh=0):
    mapping = OPUMap(n_components=k)
    x = (x > thresh).astype('uint8')
    y = mapping.transform(x)
    return y


def svd(x_proj, x, k):
    x_proj = np.dot(x, np.dot(x.T, x_proj))
    J, X = interp_dec(x_proj.T, k)
    Q1, R1 = np.linalg.qr(x[J, :])
    U, s, Vt = np.linalg.svd(np.dot(X.T, Q1))
    V = np.dot(Vt[:k, :], R1)
    return U[:, :k], s, V


def top_cosine_similarity(data, movie_id, top_n=10):
    index = movie_id - 1
    movie_row = data[index, :]
    magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
    similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
    sort_indices = np.argsort(-similarity)+1
    return sort_indices[:top_n]
[7]:
k = 100
start = time()
ratings_proj = randomize(ratings, k)
rp_time = time() - start
[8]:
c = 100
start = time()
u, s, v = svd(ratings_proj, ratings, c)
svd_time = time() - start
reconstruction = np.dot(u * s, v)
[9]:
del ratings_proj
[10]:
print('Total time: {:.2f}'.format(rp_time + svd_time))
print('RMSE: {:.4f}'.format(np.sqrt(np.mean((reconstruction-ratings)**2))))
Total time: 66.59
RMSE: 0.1152
[11]:
# keep only important singular values (90% of energy)
energy = 0
for i, el in enumerate(s):
    energy += el
    if energy > (s**2).sum()*0.9:
        break
k = i
[12]:
movie_id = 1
top_n = 2
sliced = u[:, :k]
indices = top_cosine_similarity(sliced, movie_id, top_n)
print('Query: {}, {}'.format(df_m.loc[0].title, df_m.loc[0].genres))

for idx in indices[1:]:
    print('Recommended: {}, {}'.format(df_m.loc[idx-1].title,
                                       df_m.loc[idx-1].genres))
Query: Toy Story (1995), Adventure|Animation|Children|Comedy|Fantasy
Recommended: Toy Story 2 (1999), Adventure|Animation|Children|Comedy|Fantasy