rk3026/Youtube_Analyzer

YouTube Analyzer

A data analytics platform for large-scale YouTube video network datasets. The project leverages NoSQL (MongoDB) and Apache Spark (GraphX/GraphFrames) to store, process, and analyze YouTube data, uncovering insights into trends, influence, and user interaction patterns.


Project Overview

YouTube is one of the world’s most socially and commercially influential platforms. This project builds an end-to-end YouTube Analyzer that:

  • Efficiently stores and processes large-scale YouTube video datasets.
  • Provides analytics on network structure, influence, and trends.
  • Supports top-k queries, range queries, and influence analysis using modern big data tools.

Dataset: "Statistics and Social Network of YouTube Videos" by Xu Cheng, Cameron Dale, and Jiangchuan Liu.


Architecture

[Figure: architecture diagram]

The architecture has four main components: a parser, a MongoDB database, a set of Apache Spark algorithms, and a Streamlit GUI app.

  • Parsing Algorithm
    • Parses the raw crawl .txt files from the initial dataset
    • Cleans the data of missing/invalid values
    • Inserts into MongoDB
  • MongoDB instance
    • 4 collections
      • crawls
      • edges
      • video_snapshots
      • videos
  • Apache Spark
    • Connects to MongoDB via the official MongoDB Spark Connector
    • Algorithms are written in Python with PySpark
    • Runs in standalone mode; requires at least one worker with 4 GB of memory
  • Python GUI application using Streamlit framework
    • Connects to both MongoDB and Spark
    • Loads precomputed algorithm results from MongoDB
    • Can dynamically run Spark algorithms based on user queries
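
The parsing step above can be sketched roughly as follows. This is a minimal illustration, not the repository's parser: it assumes each crawl line is tab-separated in the order video ID, uploader, age, category, length, views, rate, ratings, comments, followed by related video IDs (check this against your copy of the dataset). The function name `parse_crawl_line` and the output field names are hypothetical.

```python
from typing import Optional

# Assumed column order of one tab-separated crawl line (verify against the dataset docs).
FIELDS = ["video_id", "uploader", "age", "category", "length",
          "views", "rate", "ratings", "comments"]
INT_FIELDS = {"age", "length", "views", "ratings", "comments"}


def parse_crawl_line(line: str) -> Optional[dict]:
    """Return a video document, or None if the line is incomplete or invalid."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(FIELDS):
        return None  # missing columns: skip the row
    doc = {}
    try:
        for name, raw in zip(FIELDS, parts):
            raw = raw.strip()
            if name in INT_FIELDS:
                doc[name] = int(raw)
            elif name == "rate":
                doc[name] = float(raw)
            elif raw:
                doc[name] = raw
            else:
                return None  # empty text field
    except ValueError:
        return None  # non-numeric value in a numeric column
    # Any remaining columns are related video IDs; these become graph edges.
    doc["related"] = [p.strip() for p in parts[len(FIELDS):] if p.strip()]
    return doc


sample = "abc123\tsomeuser\t700\tMusic\t215\t1000\t4.5\t20\t5\txyz789\tdef456"
video = parse_crawl_line(sample)
edges = [(video["video_id"], dst) for dst in video["related"]]
```

Documents and edge pairs produced this way could then be written to the `videos` and `edges` collections, for example with pymongo's `insert_many`.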

App Preview

[Figures: four screenshots of the Streamlit app]

Features

1. Network Aggregation

  • Degree distribution (in-degree, out-degree, avg, min, max).
  • Categorized statistics (by video category, size, views, etc.).
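
To make the degree statistics concrete, here is a small pure-Python sketch over an in-memory edge list. It only illustrates the computation; the project runs this kind of aggregation in Spark, where GraphFrames exposes comparable per-vertex counts through `inDegrees` and `outDegrees`. The function name `degree_stats` is hypothetical.

```python
from collections import Counter


def degree_stats(edges):
    """Compute min, max, and average in-degree and out-degree over all vertices."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = set(out_deg) | set(in_deg)
    if not nodes:
        return {}

    def summarize(deg):
        # Vertices with no edges in this direction count as degree 0.
        values = [deg.get(n, 0) for n in nodes]
        return {"min": min(values), "max": max(values),
                "avg": sum(values) / len(values)}

    return {"in": summarize(in_deg), "out": summarize(out_deg)}


edges = [("a", "b"), ("a", "c"), ("b", "c")]
stats = degree_stats(edges)
```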

2. Search & Queries

  • Top-k queries:
    • Most popular videos
    • Highest-rated videos
    • Categories with the most uploads
  • Range queries:
    • Videos by duration [t1, t2]
    • Videos by size [x, y]
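
The two query types can be sketched as below. This is an illustration under assumptions: the field names `views` and `length` are hypothetical and may not match the collection schema, and the top-k part runs on an in-memory list rather than on MongoDB or Spark.

```python
import heapq


def duration_range_filter(t1, t2):
    """Build a MongoDB filter document for videos whose length lies in [t1, t2]."""
    if t1 > t2:
        raise ValueError("t1 must not exceed t2")
    return {"length": {"$gte": t1, "$lte": t2}}


def top_k(videos, key, k):
    """Return the k documents with the largest value for `key`."""
    return heapq.nlargest(k, videos, key=lambda v: v[key])


videos = [
    {"video_id": "a", "views": 500, "length": 120},
    {"video_id": "b", "views": 9000, "length": 45},
    {"video_id": "c", "views": 1200, "length": 300},
]
popular = top_k(videos, "views", 2)
query = duration_range_filter(60, 200)

# With pymongo, assuming `coll` is a handle to the videos collection:
#   coll.find(query)                          # range query
#   coll.find().sort("views", -1).limit(2)    # top-k by views
```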

3. Influence Analysis

  • PageRank on the YouTube video graph to find top-k most influential videos.
  • Analysis of influential video properties (views, edges, categories, etc.).
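
For readers unfamiliar with PageRank, the following is a minimal pure-Python power-iteration sketch on a toy edge list. It is not the project's implementation, which runs on Spark (GraphFrames provides a `pageRank` method); the damping factor of 0.85 and the uniform handling of vertices with no outgoing links are assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; ranks sum to 1 across all vertices."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
top = sorted(ranks, key=ranks.get, reverse=True)[:2]  # top-k with k = 2
```

In this toy graph, `c` receives links from both `a` and `b`, so it ranks highest; `b` has no incoming links and keeps only the baseline share.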

4. Pattern & User Analysis

  • Subgraph/motif queries (e.g., user-video relationships in recommendation paths).
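
As a simple example of a motif, the sketch below finds pairs of videos that recommend each other. The choice of motif and the function name `mutual_pairs` are hypothetical and only illustrate the idea; in GraphFrames a comparable pattern would be written as a motif string such as `"(a)-[]->(b); (b)-[]->(a)"` passed to `find`.

```python
def mutual_pairs(edges):
    """Return sorted (a, b) pairs with a < b where both a->b and b->a exist."""
    edge_set = set(edges)
    return sorted((a, b) for a, b in edge_set
                  if a < b and (b, a) in edge_set)


edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c")]
pairs = mutual_pairs(edges)
```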

Tech Stack

  • Database: MongoDB
  • Processing Engine: Apache Spark (PySpark)
  • Graph Analytics: Spark GraphFrames
  • GUI: Streamlit

Data Model

  • Document Store (MongoDB): stores video metadata as documents with attributes (uploader, category, views, etc.).

Team

| Name | Role | Email | GitHub |
| --- | --- | --- | --- |
| Ross Kugler | Data Pipeline & API Lead, Hadoop MapReduce Researcher, Communication Liaison | ross.kugler@wsu.edu | rk3026 |
| Huy (Harry) Ky | Database Manager, MongoDB Researcher | giahuy.ky@wsu.edu | Harry908 |
| Ben Bordon | Analytics & Algorithms Lead, Spark GraphX/GraphFrames Researcher, Documentation | b.bordon@wsu.edu | wizkid0101 |
