A data analytics platform for large-scale YouTube video network datasets. The project leverages NoSQL (MongoDB) and Apache Spark (GraphX/GraphFrames) to store, process, and analyze YouTube data, uncovering insights into trends, influence, and user interaction patterns.
YouTube is one of the world’s most socially and commercially influential platforms. This project builds an end-to-end YouTube Analyzer that:
- Efficiently stores and processes large-scale YouTube video datasets.
- Provides analytics on network structure, influence, and trends.
- Supports top-k queries, range queries, and influence analysis using modern big data tools.
Dataset: "Statistics and Social Network of YouTube Videos" by Xu Cheng, Cameron Dale, and Jiangchuan Liu.
Our architecture has four main components: a parser, a MongoDB database, a set of Apache Spark algorithms, and a Streamlit GUI app.
- Parsing algorithm
  - Parses the raw crawl .txt files from the initial dataset
  - Cleans the data of missing/invalid values
  - Inserts into MongoDB
- MongoDB instance
  - 4 collections
    - crawls
    - edges
    - video_snapshots
    - videos
- Apache Spark
  - Connects to MongoDB via the official connector
  - Algorithms written in Python with PySpark
  - Runs in standalone mode; requires at least 1 worker with 4 GB of memory
- Python GUI application using the Streamlit framework
  - Connects to both MongoDB and Spark
  - Loads precomputed algorithm results from MongoDB
  - Can dynamically run Spark algorithms based on user queries
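
The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's parser: it assumes each crawl line is tab-separated in the order video ID, uploader, age, category, length, views, rate, ratings, comments, followed by related video IDs. The field names and the `src`/`dst` edge layout are hypothetical.

```python
from typing import Optional

# Assumed column order of one tab-separated crawl line (not confirmed by this README).
FIELDS = ["video_id", "uploader", "age", "category", "length",
          "views", "rate", "ratings", "comments"]
INT_FIELDS = {"age", "length", "views", "ratings", "comments"}


def parse_line(line: str) -> Optional[dict]:
    """Return a cleaned record, or None if the line has missing or invalid values."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(FIELDS):
        return None  # incomplete row, e.g. a video ID with no metadata
    record = {}
    try:
        for name, raw in zip(FIELDS, parts):
            raw = raw.strip()
            if not raw:
                return None  # empty value in a required column
            if name in INT_FIELDS:
                record[name] = int(raw)
            elif name == "rate":
                record[name] = float(raw)
            else:
                record[name] = raw
    except ValueError:
        return None  # non-numeric value in a numeric column
    # Any remaining columns are treated as IDs of related videos.
    record["related"] = [p.strip() for p in parts[len(FIELDS):] if p.strip()]
    return record


def to_edges(record: dict) -> list:
    """Build one edge document per (video -> related video) link."""
    return [{"src": record["video_id"], "dst": rid} for rid in record["related"]]
```

Records and edges produced this way could then be written in batches, for example with PyMongo's `insert_many`, into the `videos` and `edges` collections.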
- Degree distribution (in-degree, out-degree, avg, min, max).
- Categorized statistics (by video category, size, views, etc.).
- Top-k queries:
  - Most popular videos
  - Highest-rated videos
  - Categories with the most uploads
- Range queries:
  - Videos by duration [t1, t2]
  - Videos by size [x, y]
- PageRank on the YouTube video graph to find top-k most influential videos.
- Analysis of influential video properties (views, edges, categories, etc.).
- Subgraph/motif queries (e.g., user-video relationships in recommendation paths).
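
To make the PageRank step concrete, here is a small pure-Python power-iteration sketch over an edge list. It is an illustration of the computation only, not the project's Spark implementation; the function names, the damping factor of 0.85, and the iteration count are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_k(rank, k):
    """Return the k highest-ranked (node, score) pairs."""
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Tiny example graph: both "a" and "b" link to "c", and "c" links back to "a".
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
most_influential = top_k(ranks, 1)
```

In GraphFrames, the analogous call is `pageRank(resetProbability=..., maxIter=...)`, where the reset probability corresponds to `1 - damping` in this sketch.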
- Database: MongoDB
- Processing Engine: Apache Spark
- Graph Analytics: Spark GraphFrames
- Document Store (MongoDB): stores video metadata as documents with attributes (uploader, category, views, etc.).
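
With video metadata stored as documents, the top-k and range queries listed earlier map naturally onto MongoDB filters and sort specifications. The helpers below are a hedged sketch: the function names are hypothetical, and field names such as `views` and `length` are assumptions about the document schema.

```python
def range_filter(field, low, high):
    """MongoDB filter matching documents with low <= field <= high."""
    if low > high:
        raise ValueError("low must not exceed high")
    return {field: {"$gte": low, "$lte": high}}


def top_k_spec(field, k):
    """Keyword arguments for a descending top-k query."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {"sort": [(field, -1)], "limit": k}


# Example: videos between 60 and 300 seconds long, and the 10 most viewed videos.
duration_query = range_filter("length", 60, 300)
popular_spec = top_k_spec("views", 10)
```

With PyMongo, these could be passed as `db.videos.find(duration_query)` and `db.videos.find({}, **popular_spec)`, assuming a `videos` collection that uses those field names.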
| Name | Role | Email | GitHub |
|---|---|---|---|
| Ross Kugler | Data Pipeline & API Lead, Hadoop MapReduce Researcher, Communication Liaison | ross.kugler@wsu.edu | rk3026 |
| Huy (Harry) Ky | Database Manager, MongoDB Researcher | giahuy.ky@wsu.edu | Harry908 |
| Ben Bordon | Analytics & Algorithms Lead, Spark GraphX/GraphFrames Researcher, Documentation | b.bordon@wsu.edu | wizkid0101 |