souravg-db2/seismic-data-processor

Seismic Data Processor

A high-performance, distributed solution for processing large-scale seismic data using Ray for parallel processing and Delta Lake for efficient storage and analytics.

Overview

This solution accelerator provides an efficient way to process 3D seismic SEGY files by leveraging Ray's distributed computing capabilities to split and process data in parallel, then storing the results in Delta format for advanced analytics and further processing.

Key Features

  • Distributed Processing: Uses Ray to split SEGY files into chunks and process them in parallel across multiple worker nodes
  • Scalable Architecture: Configurable worker nodes and CPU allocation for optimal resource utilization
  • Delta Lake Integration: Writes processed data to Parquet and ingests it into Delta Lake, which provides ACID transactions and time travel
  • Memory Efficient: Processes data in configurable chunks to handle large seismic datasets
  • Databricks Optimized: Designed to run efficiently on Databricks with integrated utilities
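
To illustrate the chunk-based approach, here is a minimal sketch of how a trace range might be divided into fixed-size chunks before being handed to workers. The function name `plan_chunks` and its parameters are hypothetical and are not taken from this repository.

```python
from typing import Iterator, Tuple


def plan_chunks(total_traces: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) trace ranges covering [0, total_traces)."""
    if total_traces < 0:
        raise ValueError("total_traces must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_traces, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_traces)


chunks = list(plan_chunks(total_traces=10, chunk_size=4))
print(chunks)  # [(0, 4), (4, 8), (8, 10)]
```

A smaller `chunk_size` lowers peak memory per worker at the cost of more scheduling overhead, so the right value depends on trace length and available memory.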

Architecture

SEGY Files → Ray Distributed Processing → Parquet Files → Delta Lake → Analytics
  1. Ray Cluster Setup: Configures distributed computing cluster with specified worker nodes and resources
  2. Parallel Processing: Splits SEGY files into chunks and processes them concurrently across Ray workers
  3. Data Flattening: Converts 3D seismic data into a flat, tabular format suitable for analytics
  4. Parquet Output: Saves processed data as Parquet files for efficient storage
  5. Delta Integration: Parquet files are then ingested into Delta Lake for advanced analytics
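
Steps 2 and 3 can be sketched roughly as follows. This is not the repository's implementation: it uses `ThreadPoolExecutor` from the standard library as a stand-in for Ray so that it runs without a cluster, and the names `flatten_chunk` and `process_in_parallel`, the row schema, and the in-memory trace layout are all assumptions made for illustration. With Ray, the per-chunk function would typically be decorated with `@ray.remote` and the results collected with `ray.get`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

# Hypothetical in-memory trace: (inline, crossline, amplitude samples).
Trace = Tuple[int, int, Sequence[float]]
Row = Dict[str, float]


def flatten_chunk(traces: Sequence[Trace], sample_interval_ms: float) -> List[Row]:
    """Turn each trace into one row per sample (assumed flat schema)."""
    rows: List[Row] = []
    for inline, crossline, samples in traces:
        for i, amplitude in enumerate(samples):
            rows.append({
                "inline": inline,
                "crossline": crossline,
                "time_ms": i * sample_interval_ms,
                "amplitude": amplitude,
            })
    return rows


def process_in_parallel(
    chunks: Sequence[Sequence[Trace]],
    sample_interval_ms: float,
    max_workers: int = 4,
) -> List[Row]:
    """Flatten all chunks concurrently and concatenate the rows."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = pool.map(lambda c: flatten_chunk(c, sample_interval_ms), chunks)
        return [row for rows in results for row in rows]


chunks = [
    [(100, 200, [0.5, -0.25])],
    [(100, 201, [1.0, 0.0])],
]
rows = process_in_parallel(chunks, sample_interval_ms=4.0)
print(len(rows))  # 4
```

One row per sample keeps the schema simple for SQL-style analytics, though it multiplies the row count by the number of samples per trace; storing each trace as an array column would be an alternative design.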

Components

Core Processing Module (src/data_processor_core/)

  • segy_processor.py: Handles SEGY file metadata extraction and chunk-based processing
  • Implements Ray remote functions for distributed execution
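
As a rough illustration of what metadata extraction involves, the sketch below reads three fields from the 400-byte binary file header that follows the 3200-byte textual header in a SEG-Y file, then derives a trace count. It is not the code in `segy_processor.py`. It assumes fixed-length traces, no extended textual headers, big-endian byte order, and that the fields can be read as unsigned 16-bit integers; the function name `read_segy_metadata` is hypothetical.

```python
import io
import struct
from typing import BinaryIO, Dict

TEXT_HEADER_BYTES = 3200
BINARY_HEADER_BYTES = 400
TRACE_HEADER_BYTES = 240
# Bytes per sample for a few common SEG-Y data sample format codes:
# 1 = IBM float, 2 = 4-byte int, 3 = 2-byte int, 5 = IEEE float, 8 = 1-byte int.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def read_segy_metadata(stream: BinaryIO) -> Dict[str, int]:
    """Read basic metadata from a seekable SEG-Y byte stream."""
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()

    stream.seek(TEXT_HEADER_BYTES)
    header = stream.read(BINARY_HEADER_BYTES)
    if len(header) != BINARY_HEADER_BYTES:
        raise ValueError("file too short to contain a SEG-Y binary header")

    # File bytes 3217-3218, 3221-3222, 3225-3226 (offsets 16, 20, 24 here).
    (sample_interval_us,) = struct.unpack_from(">H", header, 16)
    (samples_per_trace,) = struct.unpack_from(">H", header, 20)
    (format_code,) = struct.unpack_from(">H", header, 24)
    if format_code not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported sample format code: {format_code}")

    trace_bytes = TRACE_HEADER_BYTES + samples_per_trace * BYTES_PER_SAMPLE[format_code]
    data_bytes = file_size - TEXT_HEADER_BYTES - BINARY_HEADER_BYTES
    return {
        "sample_interval_us": sample_interval_us,
        "samples_per_trace": samples_per_trace,
        "format_code": format_code,
        "trace_bytes": trace_bytes,
        "trace_count": data_bytes // trace_bytes,
    }
```

Knowing `trace_bytes` and `trace_count` up front is what makes chunking cheap: a worker can seek directly to byte offset `3600 + start * trace_bytes` and read only its own range of traces.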

Notebooks (src/notebooks/)

  • segy_to_parquet.py: Databricks notebook for orchestrating the entire processing pipeline
  • Configurable parameters for input/output paths, chunk sizes, and Ray cluster settings
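
The notebook's actual parameter names are not listed here, so the following is only a sketch of how such settings might be grouped and validated before a run. Every field name, default value, and path below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical bundle of pipeline settings (names are illustrative)."""
    input_path: str               # location of the source SEGY file(s)
    output_path: str              # where Parquet output is written
    chunk_size: int = 10_000      # traces per chunk
    num_worker_nodes: int = 2     # Ray worker nodes
    cpus_per_worker: int = 4      # CPUs allocated to each worker node

    def __post_init__(self) -> None:
        # Fail early on values that would make the run meaningless.
        for name in ("chunk_size", "num_worker_nodes", "cpus_per_worker"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

    @property
    def total_cpus(self) -> int:
        """Total CPUs available across all worker nodes."""
        return self.num_worker_nodes * self.cpus_per_worker


config = PipelineConfig(input_path="/path/to/input.segy", output_path="/path/to/output")
print(config.total_cpus)  # 8
```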

Usage

Prerequisites

  • Python 3.8+
  • Ray 2.0+
  • Databricks environment (for notebook execution)
  • Access to SEGY files and Delta Lake storage

Installation

pip install -r requirements.txt
