This project implements a distributed MapReduce solution in Scala and Apache Spark that identifies the pairs of geographical locations with the highest frequency of daily earthquake co-occurrences. The system processes large-scale earthquake datasets to find pairs of locations (with coordinates rounded to the first decimal place) that experience seismic events on the same day.
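To make the co-occurrence idea concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark implementation in EarthquakeAnalysis.scala. The `(date, latitude, longitude)` record layout, the function names, and the choice to count each pair at most once per day are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def round_location(lat, lon):
    """Round a coordinate pair to the first decimal place."""
    return (round(lat, 1), round(lon, 1))


def top_cooccurring_pair(events):
    """Return ((loc_a, loc_b), count) for the most frequent same-day pair.

    events: iterable of (date_string, latitude, longitude) tuples.
    This layout is hypothetical; the real dataset may differ.
    """
    # "Map" step: collect the distinct rounded locations seen on each day.
    locations_by_day = {}
    for day, lat, lon in events:
        locations_by_day.setdefault(day, set()).add(round_location(lat, lon))

    # "Reduce" step: count each unordered pair once per day it co-occurs.
    pair_counts = Counter()
    for locations in locations_by_day.values():
        for pair in combinations(sorted(locations), 2):
            pair_counts[pair] += 1

    if not pair_counts:
        return None
    return max(pair_counts.items(), key=lambda item: item[1])


# Made-up sample events, for illustration only.
sample_events = [
    ("2024-01-01", 37.502, 15.312),
    ("2024-01-01", 38.108, 13.352),
    ("2024-01-02", 37.521, 15.288),
    ("2024-01-02", 38.131, 13.377),
    ("2024-01-02", 43.712, 10.401),
    ("2024-01-03", 43.699, 10.412),
]
print(top_cooccurring_pair(sample_events))
```

Sorting the locations before pairing them means that (A, B) and (B, A) are counted as the same pair.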
The project is designed to run on Google Cloud Platform (GCP) Dataproc and includes an automated benchmarking script to analyze scalability across different cluster configurations.
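One common way to summarize scalability across cluster configurations is speedup and parallel efficiency relative to the smallest cluster. The sketch below assumes you have already collected one wall-clock runtime per worker count. The function name and the timings in the usage line are hypothetical and are not results from this project.

```python
def scaling_summary(runtimes_by_workers):
    """Compute (speedup, efficiency) per worker count.

    runtimes_by_workers: {worker_count: wall_clock_seconds}.
    The smallest worker count is used as the baseline.
    """
    baseline_workers = min(runtimes_by_workers)
    baseline_time = runtimes_by_workers[baseline_workers]

    summary = {}
    for workers in sorted(runtimes_by_workers):
        # Speedup: how many times faster than the baseline run.
        speedup = baseline_time / runtimes_by_workers[workers]
        # Efficiency: speedup divided by the growth in worker count.
        efficiency = speedup / (workers / baseline_workers)
        summary[workers] = (speedup, efficiency)
    return summary


# Hypothetical timings in seconds, for illustration only.
print(scaling_summary({2: 100.0, 4: 50.0, 8: 40.0}))
```

An efficiency close to 1.0 suggests the job scales almost linearly with the number of workers; lower values indicate growing overhead.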
- src/main/scala/EarthquakeAnalysis.scala: Main application.
- benchmark.py: Python script to automate GCP cluster creation, job submission, and log collection.
- build.sbt: Build configuration for the Scala project.
- report/: Contains the detailed technical report.
- SBT (Simple Build Tool)
- Google Cloud CLI (gcloud)
- Python 3.x (for benchmarking)
Compile the project and generate the JAR file using sbt:
sbt package
- Configure GCP:
gcloud config set project [YOUR_PROJECT_ID]
- Upload Assets:
gsutil cp target/scala-2.12/scp-project_2.12-0.1.0-SNAPSHOT.jar gs://[YOUR_BUCKET]/jars/
gsutil cp dataset.csv gs://[YOUR_BUCKET]/dataset/input.csv
The benchmark.py script automates the testing for different cluster configurations. Modify the script as you wish, then run:
python benchmark.py
You can also submit jobs manually using the following command:
gcloud dataproc jobs submit spark --cluster=[YOUR_CLUSTER_NAME] --region=[YOUR_REGION] --class=EarthquakeAnalysis --jars=gs://[YOUR_BUCKET]/jars/scp-project_2.12-0.1.0-SNAPSHOT.jar -- gs://[YOUR_BUCKET]/dataset/input.csv gs://[YOUR_BUCKET]/output/ [NUM_PARTITIONS]
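The manual submission above, together with the cluster creation and teardown that benchmark.py automates, could be scripted roughly as follows. This is a sketch under stated assumptions and not the actual contents of benchmark.py. The helper names, the `(cluster, workers, partitions)` configuration tuples, and the per-cluster output path are hypothetical. With `dry_run=True` nothing is executed and the commands are only returned.

```python
import subprocess


def create_cluster_cmd(cluster, region, num_workers):
    """Build the gcloud command that creates a Dataproc cluster."""
    return ["gcloud", "dataproc", "clusters", "create", cluster,
            f"--region={region}", f"--num-workers={num_workers}"]


def submit_job_cmd(cluster, region, jar_uri, input_uri, output_uri, partitions):
    """Build the same Spark submission shown in the manual command."""
    return ["gcloud", "dataproc", "jobs", "submit", "spark",
            f"--cluster={cluster}", f"--region={region}",
            "--class=EarthquakeAnalysis", f"--jars={jar_uri}",
            "--", input_uri, output_uri, str(partitions)]


def delete_cluster_cmd(cluster, region):
    """Build the gcloud command that deletes the cluster without prompting."""
    return ["gcloud", "dataproc", "clusters", "delete", cluster,
            f"--region={region}", "--quiet"]


def run_benchmark(configs, region, jar_uri, input_uri, output_prefix,
                  dry_run=True):
    """Create, use, and delete one cluster per (cluster, workers, partitions).

    Returns the list of commands; runs them only when dry_run is False.
    """
    commands = []
    for cluster, workers, partitions in configs:
        # Hypothetical layout: one output directory per cluster.
        output_uri = f"{output_prefix}/{cluster}/"
        commands.append(create_cluster_cmd(cluster, region, workers))
        commands.append(submit_job_cmd(cluster, region, jar_uri,
                                       input_uri, output_uri, partitions))
        commands.append(delete_cluster_cmd(cluster, region))
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


# Dry run with placeholder values; replace them before executing for real.
planned = run_benchmark(
    configs=[("bench-2w", 2, 8), ("bench-4w", 4, 16)],
    region="[YOUR_REGION]",
    jar_uri="gs://[YOUR_BUCKET]/jars/scp-project_2.12-0.1.0-SNAPSHOT.jar",
    input_uri="gs://[YOUR_BUCKET]/dataset/input.csv",
    output_prefix="gs://[YOUR_BUCKET]/output",
)
for cmd in planned:
    print(" ".join(cmd))
```

Building each command as a list of arguments avoids shell quoting issues and makes a dry run easy to inspect before any billable resources are created.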