What Is a Recommendation System, and Why Would Someone Need One?
Recommendation systems help users discover items they are likely to enjoy.
Recommendation systems are a type of information filtering system: they improve the quality of search results by surfacing items that are more relevant to the search query or related to the user's search history. Recommender systems identify recommendations for individual users based on their past purchases and searches, and on other users' behaviour.
In this blog, we are going to learn about building a movie search engine based on text analysis, classifying movies into genres based on the plot text (or reviews) provided by users, and recommending similar movies to users based on their selections. The dataset that we are going to work with can be downloaded from https://www.kaggle.com/rounakbanik/the-movies-dataset/kernels
Phase 1 : Movie Search
We take a query from the user, calculate its TF-IDF scores, and search the TMDB dataset. The movies that are most similar to the query, as measured by cosine similarity, are returned to the user.
Term Frequency: For a term t in a document d, it is the number of occurrences of t in d.
Inverse Document Frequency: It gives more weight to terms that are rare across the corpus. For a collection of N documents, it is commonly computed as log(N / df), where df is the number of documents that contain the term.
Term Frequency-Inverse Document Frequency (TF-IDF): This is the weighting scheme that assigns a weight to a term t in a document d. It is the product of the term frequency and the inverse document frequency.
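To make the three definitions above concrete, here is a minimal sketch that computes TF-IDF weights with the Python standard library only. The three toy "plots" and the function name `tf_idf` are made up for illustration, and the sketch assumes the plain log(N / df) form of the inverse document frequency; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "a space crew fights an alien".split(),
    "a detective hunts a killer".split(),
    "an alien visits a small town".split(),
]
weights = tf_idf(docs)

for doc_weights in weights:
    print(doc_weights)
```

Note how the word "a", which occurs in every document, ends up with a weight of zero, while a word that occurs in only one document, such as "detective", gets the highest weight.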
Before we compute the cosine similarity between the query and the movies, we need to remove all the stop words, special characters, and unwanted spaces. Stemming and lemmatization then reduce the remaining words to a common base form, so that variants of the same word are matched.
Stemmer: It operates on a single word without knowledge of the context, typically by stripping suffixes, so the result is not always a real word.
Lemmatizer: It reduces a word to its dictionary form (the lemma), which helps to avoid creating features that are semantically similar but syntactically different.
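The cleaning steps described above can be sketched roughly as follows. This is only an illustration: the stop-word list is deliberately tiny, and `naive_stem` is a hypothetical toy suffix stripper, not the Porter algorithm or any library's stemmer. In a real project a library such as NLTK would normally handle stop words, stemming, and lemmatization.

```python
import re

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to"}

def naive_stem(word):
    """Toy suffix stripper (hypothetical helper, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop special characters and stop words, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special characters become spaces
    tokens = text.split()                     # split() also collapses extra spaces
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The  aliens are INVADING the cities!!")
print(tokens)  # ['alien', 'invad', 'citi']
```

The output shows the limitation of a stemmer mentioned above: "invad" and "citi" are not real words, whereas a lemmatizer would aim to return "invade" and "city".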
Cosine similarity is the standard way of quantifying the similarity between two documents: each document is represented as a vector, and the similarity is the cosine of the angle between the two vectors.
Once you have the cosine scores, you can return the top K most relevant results, i.e. the K highest-scoring ones, to the user.
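As a rough sketch of this scoring and ranking step, the snippet below computes cosine similarity between sparse vectors stored as dictionaries and returns the K best matches. The movie titles and weights are hand-made placeholders standing in for real TF-IDF vectors, and the function names are my own choices for illustration.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return the k (title, score) pairs with the highest cosine score."""
    scored = ((title, cosine_similarity(query_vec, vec))
              for title, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical, hand-made weights standing in for real TF-IDF vectors.
movies = {
    "Movie A": {"alien": 1.0, "space": 1.0},
    "Movie B": {"detective": 1.0, "killer": 1.0},
    "Movie C": {"alien": 1.0, "town": 1.0},
}
query = {"alien": 1.0, "space": 1.0}
results = top_k(query, movies, k=2)

for title, score in results:
    print(title, round(score, 3))
```

Here "Movie A" ranks first because it shares both query terms, "Movie C" comes second with one shared term, and "Movie B" is left out because it shares none.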
Challenges : Selecting the dataset was the major challenge for me. This was a critical phase of the entire project, because it's important that we select a suitable dataset that can be explored.
Phase 2 : Movie Classification
Movie classification is the process of assigning a movie to a genre based on its plot text or on users' reviews. We will train a model capable of predicting the genre of a movie from that text.
The following steps are required to create a text classification model in Python:
1) Import the required libraries.
2) Import the dataset (the TMDB dataset).
3) Preprocess the text, i.e. remove the unwanted spaces, special characters, and stop words. Tokenization followed by lemmatization or stemming are the most widely used techniques for this.
4) Convert the text into numbers. This is important because machines, unlike humans, do not understand raw text; they can only work with numbers. The Bag of Words (BOW) model and word embeddings can both be used to convert text into numbers. I have used the BOW model, in which every unique word across all the documents becomes a feature.
5) Find the TF-IDF scores.
6) Divide the data into training and test sets. The training set will be used to fit the model, and predictions will be made on the test set.
7) Train the model. I have used the Multinomial Naive Bayes classifier: fit the classifier on the training set and predict the labels for the test set.
8) Evaluate the performance of the classification model using metrics such as the confusion matrix, F1 score, and accuracy (I have implemented accuracy).
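The steps above can be sketched end to end in a few lines. This is a simplified illustration and not the project's actual code: the six toy plots and genre labels are invented, the class name `TinyMultinomialNB` is hypothetical, the split is a fixed hold-out instead of a shuffled one, and the classifier works on raw bag-of-words counts rather than TF-IDF weights to keep the arithmetic easy to follow. In practice, scikit-learn's `TfidfVectorizer`, `train_test_split`, and `MultinomialNB` cover the same steps.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.vocab = {term for doc in docs for term in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(count / n_docs)  # log prior of the class
            for term in doc:
                if term in self.vocab:  # ignore words never seen in training
                    p = (self.term_counts[label][term] + 1) / (total + len(self.vocab))
                    score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: (plot text, genre).
data = [
    ("alien spaceship invades earth", "scifi"),
    ("robot travels through space", "scifi"),
    ("detective hunts a killer", "crime"),
    ("police chase the bank robber", "crime"),
    ("alien robot in space", "scifi"),
    ("detective chases killer", "crime"),
]
# Fixed hold-out split for reproducibility; a real split would shuffle first.
train, test = data[:4], data[4:]

model = TinyMultinomialNB().fit(
    [text.split() for text, _ in train],
    [genre for _, genre in train],
)
y_true = [genre for _, genre in test]
y_pred = [model.predict(text.split()) for text, _ in test]

# Confusion matrix as counts of (true label, predicted label) pairs.
confusion = Counter(zip(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
print("confusion:", dict(confusion))
```

Add-one smoothing matters here: without it, a single word that never appeared with a genre during training would force that genre's probability to zero.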
Phase 3 : Movie Recommendation
We are going to recommend similar movies to users based on the movie they select. We will calculate the cosine similarity between the selected movie and the other movies, using keyword, crew, cast, and genre information, and display the most similar movies to the user.
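One way to sketch this idea is to merge each movie's keywords, cast, crew, and genres into a single bag of tokens and compare those bags with cosine similarity. The movie entries, field names, and function names below are invented placeholders, not values from the TMDB dataset, and the equal weighting of all four fields is an assumption made for simplicity.

```python
import math
from collections import Counter

def soup(movie):
    """Combine keywords, cast, crew and genres into one bag of tokens."""
    return Counter(movie["keywords"] + movie["cast"] + movie["crew"] + movie["genres"])

def cosine(u, v):
    """Cosine similarity between two Counters of token counts."""
    dot = sum(c * v[t] for t, c in u.items())  # a Counter returns 0 for missing keys
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def recommend(title, movies, k=2):
    """Return the k movies most similar to the selected title."""
    target = soup(movies[title])
    scores = [(other, cosine(target, soup(m)))
              for other, m in movies.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical metadata for three made-up movies.
movies = {
    "Movie A": {"keywords": ["space", "alien"], "cast": ["actor_x"],
                "crew": ["director_p"], "genres": ["scifi"]},
    "Movie B": {"keywords": ["space", "robot"], "cast": ["actor_x"],
                "crew": ["director_q"], "genres": ["scifi"]},
    "Movie C": {"keywords": ["heist", "bank"], "cast": ["actor_y"],
                "crew": ["director_p"], "genres": ["crime"]},
}
recs = recommend("Movie A", movies, k=2)

for title, score in recs:
    print(title, round(score, 2))
```

"Movie B" is recommended first because it shares a keyword, an actor, and a genre with "Movie A", while "Movie C" only shares the director.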
References :
https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system
https://www.datacamp.com/community/tutorials/recommender-systems-python