YouTube Analyzer

A data analytics platform for large-scale YouTube video network datasets. The project uses MongoDB (NoSQL) and Apache Spark (GraphX/GraphFrames) to store, process, and analyze YouTube data, surfacing trends, influence, and user interaction patterns.

Project Overview

YouTube is one of the world’s most socially and commercially influential platforms. This project builds an end-to-end YouTube Analyzer that:

Efficiently stores and processes large-scale YouTube video datasets.
Provides analytics on network structure, influence, and trends.
Supports top-k queries, range queries, and influence analysis using modern big data tools.

Dataset: "Statistics and Social Network of YouTube Videos" by Xu Cheng, Cameron Dale, and Jiangchuan Liu.

Architecture

The architecture has four main components: a parser, a MongoDB database, a set of Apache Spark algorithms, and a Streamlit GUI app.

Parsing Algorithm
- Parses the raw crawl .txt files from the initial dataset
- Cleans the data of missing/invalid values
- Inserts into MongoDB
MongoDB instance
- 4 collections
  - crawls
  - edges
  - video_snapshots
  - videos
Apache Spark
- Connects to MongoDB via the official connector
- Algorithms are written in Python with PySpark
- Runs in standalone mode; requires at least one worker with 4 GB of memory
Python GUI application using the Streamlit framework
- Connects to both MongoDB and Spark
- Loads precomputed algorithm results from MongoDB
- Can dynamically run Spark algorithms based on user queries
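
The parsing step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual parser: the tab-separated column order in `FIELDS`, the field names, and the helpers `parse_line` and `to_edges` are all assumptions made for the example, and the MongoDB insert is left out.

```python
from typing import Optional

# Assumed column order for one tab-separated crawl line. The README does not
# specify it, so these names are hypothetical.
FIELDS = ["video_id", "uploader", "age", "category", "length",
          "views", "rate", "ratings", "comments"]
INT_FIELDS = {"age", "length", "views", "ratings", "comments"}


def parse_line(line: str) -> Optional[dict]:
    """Parse one crawl line into a document, or return None if it is invalid."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(FIELDS):
        return None  # missing columns: skip the record
    doc = {}
    try:
        for name, raw in zip(FIELDS, parts):
            raw = raw.strip()
            if name in INT_FIELDS:
                doc[name] = int(raw)
            elif name == "rate":
                doc[name] = float(raw)
            else:
                if not raw:
                    return None  # empty text field counts as missing
                doc[name] = raw
    except ValueError:
        return None  # non-numeric value in a numeric column
    # Any remaining columns are treated as related video IDs (graph edges).
    doc["related"] = [p.strip() for p in parts[len(FIELDS):] if p.strip()]
    return doc


def to_edges(doc: dict) -> list:
    """Turn a parsed video document into edge documents (src -> dst)."""
    return [{"src": doc["video_id"], "dst": dst} for dst in doc["related"]]
```

In a pipeline like the one described, the dictionaries returned by `parse_line` and `to_edges` would be the candidates for insertion into collections such as `videos` and `edges`.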

App Preview

Features

1. Network Aggregation

Degree distribution (in-degree, out-degree, avg, min, max).
Categorized statistics (by video category, size, views, etc.).
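
The degree statistics listed above can be illustrated in plain Python. This is only a sketch of the computation on a small in-memory edge list, not the project's Spark job; the function name `degree_stats` and the `(src, dst)` tuple representation are assumptions.

```python
from collections import Counter


def degree_stats(edges):
    """Return min, max, and average in-degree and out-degree.

    `edges` is an iterable of (src, dst) pairs. Every video that appears as
    a source or a destination counts as a node, so a video with no outgoing
    links contributes an out-degree of 0.
    """
    edges = list(edges)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    if not nodes:
        return {"in": None, "out": None}

    def summarize(deg):
        values = [deg.get(n, 0) for n in nodes]
        return {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }

    return {"in": summarize(in_deg), "out": summarize(out_deg)}
```

On the full dataset the same counts would normally be computed as distributed group-by aggregations rather than in a single process.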

2. Search & Queries

Top-k queries:
- Most popular videos
- Highest-rated videos
- Categories with the most uploads
Range queries:
- Videos by duration [t1, t2]
- Videos by size [x, y]

3. Influence Analysis

PageRank on the YouTube video graph to find top-k most influential videos.
Analysis of influential video properties (views, edges, categories, etc.).
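
To make the PageRank step concrete, here is a small power-iteration sketch in plain Python. It is not the GraphFrames implementation the project uses and will not scale to the full graph; the damping factor of 0.85, the fixed iteration count, and the function names are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph given as (src, dst) pairs.

    Scores are normalized to sum to 1. Rank held by nodes with no outgoing
    links (dangling nodes) is redistributed evenly across all nodes.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_k(rank, k):
    """Return the k (node, score) pairs with the highest scores."""
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

For example, `top_k(pagerank(edges), 10)` would list the ten highest-ranked videos in a small edge list, which can then be joined against video metadata to study their views and categories.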

4. Pattern & User Analysis

Subgraph/motif queries (e.g., user-video relationships in recommendation paths).
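
As an illustration of a motif query, the sketch below finds two-hop paths (a links to b, b links to c) in a small in-memory edge list. It is a plain-Python stand-in for the idea, not the project's code; in GraphFrames a comparable pattern would be passed to `GraphFrame.find`, for example `"(a)-[]->(b); (b)-[]->(c)"`. The function name `two_hop_paths` is an assumption.

```python
from collections import defaultdict


def two_hop_paths(edges):
    """Return sorted (a, b, c) triples where a -> b and b -> c are edges.

    Paths that return straight to their starting node (a -> b -> a) are
    skipped, since they do not lead to a new video.
    """
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    paths = []
    for a, mids in successors.items():
        for b in mids:
            # .get avoids creating new keys while iterating over the dict
            for c in successors.get(b, ()):
                if c != a:
                    paths.append((a, b, c))
    return sorted(paths)
```

The nested loops make this suitable only for small samples; on the full graph the same pattern would be evaluated by the distributed graph engine.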

Tech Stack

Database: MongoDB
Processing Engine: Apache Spark
Graph Analytics: Spark GraphFrames
GUI: Streamlit

Data Model

Document Store (MongoDB): stores video metadata as documents with attributes (uploader, category, views, etc.).

Team

| Name | Role | Email | GitHub |
| --- | --- | --- | --- |
| Ross Kugler | Data Pipeline & API Lead, Hadoop MapReduce Researcher, Communication Liaison | ross.kugler@wsu.edu | rk3026 |
| Huy (Harry) Ky | Database Manager, MongoDB Researcher | giahuy.ky@wsu.edu | Harry908 |
| Ben Bordon | Analytics & Algorithms Lead, Spark GraphX/GraphFrames Researcher, Documentation | b.bordon@wsu.edu | wizkid0101 |


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

YouTube Analyzer

Project Overview

Architecture

App Preview

Features

1. Network Aggregation

2. Search & Queries

3. Influence Analysis

4. Pattern & User Analysis

Tech Stack

Data Model

Team

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

YouTube Analyzer

Project Overview

Architecture

App Preview

Features

1. Network Aggregation

2. Search & Queries

3. Influence Analysis

4. Pattern & User Analysis

Tech Stack

Data Model

Team

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages