This project is an end-to-end Machine Learning web application that recommends the top 10 movies similar to a user's selection. The primary focus of this project was to implement a complete Data Science Lifecycle—from data collection and preprocessing to model building and web deployment.
Using Content-Based Filtering, the recommendation engine analyzes movie metadata to find contextual similarities. The final model is integrated into a clean, interactive user interface built with Streamlit.
- Language: Python
- Data Manipulation & Analysis: Pandas, NumPy
- Machine Learning & NLP: Scikit-Learn (`TfidfVectorizer`, `sigmoid_kernel`)
- Model Serialization: Joblib
- Web Framework: Streamlit
- Data Collection & Cleaning: Merged and cleaned the `movies.csv` and `credits.csv` datasets, handling missing values and extracting relevant features (genres, keywords, cast, crew, and overviews).
- Text Vectorization (NLP): Used `TfidfVectorizer` (Term Frequency-Inverse Document Frequency) to convert raw movie overviews and metadata into a matrix of TF-IDF features.
- Similarity Computation: Applied a sigmoid kernel to compute pairwise similarity scores between movies based on their feature vectors.
- Model Deployment: Exported the vectorized data and similarity models to `.pkl` files and built a Streamlit application (`model_deployment.py`) to serve real-time predictions.
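The vectorization, similarity, and serialization steps above can be sketched roughly as follows. This is a minimal, self-contained example: the toy dataframe, the `recommend` helper, and the output file names are illustrative, not the notebook's exact code.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Toy stand-in for the merged movies/credits dataframe.
movies = pd.DataFrame({
    "title": ["Spectre", "Skyfall", "Toy Story"],
    "overview": [
        "A cryptic message sends James Bond on a rogue mission.",
        "James Bond's loyalty to M is tested as her past haunts her.",
        "A cowboy doll is threatened by a new spaceman action figure.",
    ],
})

# Handle missing overviews, then build the TF-IDF feature matrix.
movies["overview"] = movies["overview"].fillna("")
tfv = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfv.fit_transform(movies["overview"])

# Pairwise similarity between all movies via the sigmoid kernel.
sim = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

# Map titles to row indices for lookup.
indices = pd.Series(movies.index, index=movies["title"])

def recommend(title, n=10):
    """Return up to n titles most similar to the given movie."""
    idx = indices[title]
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)
    scores = [s for s in scores if s[0] != idx][:n]
    return movies["title"].iloc[[i for i, _ in scores]].tolist()

# Persist the objects the Streamlit app loads later (the real project
# saves them into the dumped_obj directory).
outdir = tempfile.mkdtemp()
joblib.dump(movies, os.path.join(outdir, "movies.pkl"))
joblib.dump(sim, os.path.join(outdir, "similarity.pkl"))
```

Here `recommend("Spectre")` ranks "Skyfall" above "Toy Story", since the two Bond overviews share terms and therefore score higher under the sigmoid kernel.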
- Clone this GitHub repository.
- Install the required packages using pip:

  ```
  pip install -r requirements.txt
  ```

- The required dataframes and pre-trained models are already saved in the `dumped_obj` directory. The full exploratory and training code can be found in the Jupyter notebook `Movie_Recommendation_System.ipynb`.
- Option 1 (Train from scratch): Re-execute the Jupyter notebook. It will process the datasets and save the dataframes and models again in the `dumped_obj` directory.
- Option 2 (Run the web app directly): Use the saved models and run the script `model_deployment.py` from your terminal:

  ```
  streamlit run model_deployment.py
  ```
This command starts the Streamlit server on localhost and opens the simple UI in your default browser.
- Note: Before executing the Jupyter notebook or the `streamlit` command, make sure your terminal is pointing to the project's working directory.
The project uses two primary datasets (from the TMDB 5000 Movies Dataset) containing comprehensive movie metadata.
The first dataset, `credits.csv`, contains information about the cast and crew of the movies.
| Feature | Description |
|---|---|
| movie_id | A unique identifier for each movie. |
| cast | The names of lead and supporting actors. |
| crew | The names of the Director, Editor, Composer, Writer, etc. |
The second dataset, `movies.csv`, contains metadata and performance metrics for the movies.
| Feature | Description |
|---|---|
| budget | The budget with which the movie was made. |
| genre | The genre of the movie (Action, Comedy, Thriller, etc.). |
| homepage | A link to the homepage of the movie. |
| id | The unique identifier (matches movie_id in the credits dataset). |
| keywords | Keywords or tags related to the movie. |
| original_language | The language in which the movie was made. |
| original_title | The title of the movie before translation or adaptation. |
| overview | A brief description of the movie. |
| popularity | A numeric quantity specifying the movie's popularity. |
| production_companies | The production house of the movie. |
| production_countries | The country in which it was produced. |
| release_date | The date on which it was released. |
| revenue | The worldwide revenue generated by the movie. |
| runtime | The running time of the movie in minutes. |
| status | "Released" or "Rumored". |
| tagline | The movie's tagline. |
| title | The title of the movie. |
| vote_average | The average rating the movie received. |
| vote_count | The count of votes received. |
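For reference, columns like genres, keywords, cast, and crew in the TMDB files are stored as JSON-like strings, so the feature-extraction step described earlier has to parse them. A hedged sketch of how that might look (the toy rows and the `extract_names` helper are illustrative, not the notebook's actual code):

```python
import ast

import pandas as pd

# Toy rows shaped like the TMDB 5000 data: list-of-dict columns stored as strings.
movies = pd.DataFrame({
    "id": [19995],
    "title": ["Avatar"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
    "keywords": ['[{"id": 1463, "name": "culture clash"}]'],
})
credits = pd.DataFrame({
    "movie_id": [19995],
    "cast": ['[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}]'],
})

# The two files join on id (movies.csv) == movie_id (credits.csv).
df = movies.merge(credits, left_on="id", right_on="movie_id")

def extract_names(cell):
    """Parse a stringified list of dicts and keep only the 'name' values."""
    return [d["name"] for d in ast.literal_eval(cell)]

for col in ["genres", "keywords", "cast"]:
    df[col] = df[col].apply(extract_names)

print(df.loc[0, "genres"])  # ['Action', 'Adventure']
print(df.loc[0, "cast"])    # ['Sam Worthington', 'Zoe Saldana']
```

`ast.literal_eval` is the safe standard-library way to turn these stringified lists back into Python objects before flattening them into text features.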
Thank you and happy learning! 😄
