Earthquake Co-occurrence Analysis with Scala and Apache Spark on GCP

Overview

This project implements a distributed MapReduce-style job in Scala on Apache Spark that identifies the pairs of geographic locations with the highest number of daily earthquake co-occurrences. Coordinates are rounded to one decimal place, and two locations co-occur when both record a seismic event on the same day. The job is intended for large-scale earthquake datasets.
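
The co-occurrence counting described above can be illustrated with a small single-machine sketch. It is written in Python for brevity and is not the project's Scala code. The function name, the input tuple layout (date, latitude, longitude), and the choice to count each location at most once per day are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def top_cooccurring_pair(events):
    """Return ((loc_a, loc_b), count) for the most frequent pair, or None.

    `events` is an iterable of (date_string, latitude, longitude) tuples.
    This layout is hypothetical; the real dataset schema may differ.
    """
    # Map step: round coordinates to one decimal place and collect the
    # distinct locations seen on each day (assumption: duplicates within
    # a day count once).
    locations_by_day = {}
    for day, lat, lon in events:
        location = (round(lat, 1), round(lon, 1))
        locations_by_day.setdefault(day, set()).add(location)

    # Reduce step: emit every unordered pair of locations per day and
    # count on how many days each pair appears. Sorting makes (a, b)
    # and (b, a) the same key.
    pair_counts = Counter()
    for locations in locations_by_day.values():
        for pair in combinations(sorted(locations), 2):
            pair_counts[pair] += 1

    if not pair_counts:
        return None
    return pair_counts.most_common(1)[0]
```

On a cluster, the per-day grouping and the pair counting would each correspond to a shuffle stage, which is where the partition count passed to the job is likely to matter.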

The project is designed to run on Google Cloud Platform (GCP) Dataproc and includes an automated benchmarking script to analyze scalability across different cluster configurations.
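
Scalability across cluster sizes is commonly summarized with speedup and parallel efficiency relative to the smallest configuration. The sketch below shows one way to compute both from measured runtimes. The function name and the input format (a mapping from worker count to runtime in seconds) are assumptions and are not taken from benchmark.py.

```python
def scaling_metrics(runtimes_by_workers):
    """Compute (speedup, efficiency) per worker count.

    `runtimes_by_workers` maps number of workers -> runtime in seconds.
    The smallest worker count is used as the baseline (hypothetical
    convention; the project report may define the baseline differently).
    """
    base_workers = min(runtimes_by_workers)
    base_time = runtimes_by_workers[base_workers]

    metrics = {}
    for workers, seconds in sorted(runtimes_by_workers.items()):
        # Speedup: how many times faster than the baseline run.
        speedup = base_time / seconds
        # Efficiency: speedup divided by the growth in worker count.
        efficiency = speedup / (workers / base_workers)
        metrics[workers] = (speedup, efficiency)
    return metrics
```

An efficiency close to 1.0 indicates near-linear scaling, while lower values suggest that shuffle or coordination overhead is absorbing the added workers.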

Project Structure

src/main/scala/EarthquakeAnalysis.scala: Main application.
benchmark.py: Python script to automate GCP cluster creation, job submission, and log collection.
build.sbt: Build configuration for the Scala project.
report/: Contains the detailed technical report.

Setup and Usage

1. Prerequisites

SBT (Simple Build Tool)
Google Cloud CLI (gcloud)
Python 3.x (for benchmarking)

2. Local Build

Compile the project and generate the JAR file using sbt:

sbt package

3. Cloud Deployment

Configure GCP:

gcloud config set project [YOUR_PROJECT_ID]

Upload Assets:

gsutil cp target/scala-2.12/scp-project_2.12-0.1.0-SNAPSHOT.jar gs://[YOUR_BUCKET]/jars/
gsutil cp dataset.csv gs://[YOUR_BUCKET]/dataset/input.csv

4. Running Benchmarks

The benchmark.py script automates testing across different cluster configurations. Adjust the configurations in the script as needed, then run:

python benchmark.py
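
For readers who want to see what such automation might involve, here is a minimal sketch of one benchmark run: create a cluster, submit the job, delete the cluster. It is not the contents of benchmark.py. The function names, parameters, and cluster sizing are hypothetical, and only standard gcloud dataproc subcommands and flags are used.

```python
import subprocess


def build_benchmark_commands(cluster, region, num_workers,
                             jar_uri, input_uri, output_uri, partitions):
    """Build the gcloud commands for one benchmark run as argument lists.

    All parameter names are illustrative; nothing is executed here.
    """
    create = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]
    submit = [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        "--class=EarthquakeAnalysis",
        f"--jars={jar_uri}",
        # Everything after "--" is passed to the Spark application.
        "--", input_uri, output_uri, str(partitions),
    ]
    delete = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}",
        "--quiet",  # skip the interactive confirmation prompt
    ]
    return [create, submit, delete]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it possible to print and review them before any billable resources are created.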

5. Manual Job Submission

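You can also submit a job manually. The three arguments after the -- separator are passed to the application: the input CSV path, the output path, and the number of partitions.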

gcloud dataproc jobs submit spark \
  --cluster=[YOUR_CLUSTER_NAME] \
  --region=[YOUR_REGION] \
  --class=EarthquakeAnalysis \
  --jars=gs://[YOUR_BUCKET]/jars/scp-project_2.12-0.1.0-SNAPSHOT.jar \
  -- gs://[YOUR_BUCKET]/dataset/input.csv gs://[YOUR_BUCKET]/output/ [NUM_PARTITIONS]
