
enricoferraiolo/SCP-Project


Earthquake Co-occurrence Analysis with Scala and Apache Spark on GCP

Overview

This project implements a distributed MapReduce solution in Scala and Apache Spark that identifies the pairs of geographical locations whose earthquakes most frequently occur on the same day. Location coordinates are rounded to the first decimal place, and two locations co-occur whenever both record a seismic event on the same date. The system is built to process large-scale earthquake datasets.
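The co-occurrence idea described above can be sketched in a few lines of plain Python. This is only an illustration, not the Spark implementation in EarthquakeAnalysis.scala: the input layout (latitude, longitude, date string), the function names, and the sample events are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def round_location(lat, lon):
    """Round coordinates to the first decimal place, as the overview describes."""
    return (round(lat, 1), round(lon, 1))


def top_cooccurring_pair(events):
    """Return ((location_a, location_b), count) for the most frequent pair.

    `events` is an iterable of (lat, lon, day) tuples. This layout is a
    hypothetical one chosen for the sketch, not the project's actual schema.
    """
    # "Map" step: collect the distinct rounded locations seen on each day.
    locations_by_day = defaultdict(set)
    for lat, lon, day in events:
        locations_by_day[day].add(round_location(lat, lon))

    # "Reduce" step: count every unordered pair once per day.
    pair_counts = Counter()
    for locations in locations_by_day.values():
        for pair in combinations(sorted(locations), 2):
            pair_counts[pair] += 1

    if not pair_counts:
        return None
    return max(pair_counts.items(), key=lambda item: item[1])


# Made-up sample events, for illustration only.
sample_events = [
    (37.502, 15.312, "2024-03-12"),
    (38.112, 13.352, "2024-03-12"),
    (37.498, 15.305, "2024-03-12"),  # rounds to the same cell as the first event
    (37.521, 15.324, "2024-04-01"),
    (38.124, 13.361, "2024-04-01"),
    (43.769, 11.262, "2024-04-03"),  # alone on its day, so it forms no pair
]

if __name__ == "__main__":
    print(top_cooccurring_pair(sample_events))
```

Using a set per day means repeated events in the same rounded cell count once, so a pair contributes at most one co-occurrence per date.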

The project is designed to run on Google Cloud Platform (GCP) Dataproc and includes an automated benchmarking script to analyze scalability across different cluster configurations.

Project Structure

  • src/main/scala/EarthquakeAnalysis.scala: Main application.
  • benchmark.py: Python script to automate GCP cluster creation, job submission, and log collection.
  • build.sbt: Build configuration for the Scala project.
  • report/: Contains the detailed technical report.

Setup and Usage

1. Prerequisites

  • SBT (Simple Build Tool)
  • Google Cloud CLI (gcloud)
  • Python 3.x (for benchmarking)

2. Local Build

Compile the project and package it as a JAR with sbt (the output lands in target/scala-2.12/):

sbt package

3. Cloud Deployment

  1. Configure GCP:
gcloud config set project [YOUR_PROJECT_ID]
  2. Upload Assets:
gsutil cp target/scala-2.12/scp-project_2.12-0.1.0-SNAPSHOT.jar gs://[YOUR_BUCKET]/jars/
gsutil cp dataset.csv gs://[YOUR_BUCKET]/dataset/input.csv

4. Running Benchmarks

The benchmark.py script automates testing across different cluster configurations. Edit the script to suit your setup, then run:

python benchmark.py
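For orientation, here is a minimal sketch of the kind of loop such a benchmarking script might run: create a cluster, submit the job, delete the cluster. It is not the contents of benchmark.py. The function name, the worker counts, the cluster naming scheme, and the per-run output folder are all hypothetical, and the sketch only prints the gcloud commands instead of running them.

```python
import shlex


def build_commands(cluster, region, bucket, num_workers, num_partitions):
    """Build the gcloud commands for one benchmark run (hypothetical helper)."""
    jar = f"gs://{bucket}/jars/scp-project_2.12-0.1.0-SNAPSHOT.jar"
    input_path = f"gs://{bucket}/dataset/input.csv"
    # Assumption: a separate output folder per run, so results do not collide.
    output_path = f"gs://{bucket}/output/workers-{num_workers}/"

    create = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}", f"--num-workers={num_workers}",
    ]
    submit = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}", f"--region={region}",
        "--class=EarthquakeAnalysis", f"--jars={jar}",
        "--", input_path, output_path, str(num_partitions),
    ]
    delete = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}", "--quiet",
    ]
    return [create, submit, delete]


if __name__ == "__main__":
    # Hypothetical configurations: (worker count, partition count).
    for workers, partitions in [(2, 8), (3, 12), (4, 16)]:
        commands = build_commands(
            cluster=f"scp-bench-{workers}w",
            region="europe-west1",
            bucket="YOUR_BUCKET",
            num_workers=workers,
            num_partitions=partitions,
        )
        for command in commands:
            # Dry run: print instead of executing with subprocess.run(command, check=True).
            print(shlex.join(command))
```

Keeping each command as an argument list rather than a single string avoids shell quoting problems if the commands are later passed to subprocess.run.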

5. Manual Job Submission

You can also submit jobs manually using the following command:

gcloud dataproc jobs submit spark \
  --cluster=[YOUR_CLUSTER_NAME] \
  --region=[YOUR_REGION] \
  --class=EarthquakeAnalysis \
  --jars=gs://[YOUR_BUCKET]/jars/scp-project_2.12-0.1.0-SNAPSHOT.jar \
  -- gs://[YOUR_BUCKET]/dataset/input.csv gs://[YOUR_BUCKET]/output/ [NUM_PARTITIONS]

About

Project for the Scalable and Cloud Programming course 2025/2026
