Skip to content

alessitomas/OCaml_ETL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

44 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ETL Project

Project Overview

This project implements an ETL (Extract, Transform, Load) pipeline for processing E-commerce data. It performs the following steps:

  • Extract: Fetches data via HTTP from a deployed REST API.
  • Transform: Processes the extracted data to generate insights.
  • Load: Saves the transformed data in CSV and SQLite database formats for further use.

Project Structure

/rest_api   β†’ Python-based REST API serving E-commerce data, deployed on an AWS EC2 instance.

/etl        β†’ OCaml-based ETL pipeline that fetches, processes, and stores data locally.

/etl/results β†’ Output directory containing the processed CSV and SQLite database files.

Setup and Usage

Prerequisites

Ensure you have the following dependencies installed before running the project:

Running the ETL Pipeline

  1. Navigate to the ETL project directory:

    cd etl
  2. Create a switch using ocaml-base-compiler.5.3.0 and activate it

opam switch create my_custom_switch ocaml-base-compiler.5.3.0
  eval $(opam env)

3 Installing the dependencies, it will use the file etl/etl.opam to install the dependencies

opam install . --deps-only
  1. Build the project:

    dune build
  2. Execute the ETL process:

    dune exec etl
  3. Follow the terminal instructions. Once the execution is complete, the processed files will be available in:

    etl/results/

  4. To run unit tests run:

 dune test

ETL Development Report

This report outlines the development process of the project, detailing the steps taken to implement the required functionalities. It serves as a guide for anyone looking to replicate or extend the project in the future. This project was implemented in OCaml and adheres to functional programming principles.

1. Create Python API

A Python API was developed to serve order and order item data. This API is responsible for exposing endpoints that return order details in JSON format.

API endpoints:

2. Set Up Project Structure

  1. Installed dune and created the basic project structure using it.
  2. Manually created directories to organize the project, separating concerns between:
    • lib/ β†’ for the main execution file
    • controller/ β†’ for fetching data logic
    • service/ β†’ for all processing data logic
    • test/ β†’ for testing

3. Fetch and Parse Data

Fetching Data: The ETL process retrieves order and order item data via HTTP GET requests using OCaml Cohttp for HTTP handling.

Parsing JSON: The HTTP response is parsed into JSON objects with the Yojson.Safe module.

Defining Data Structures:

OCaml record types order and order_item were created to model the extracted data:

type order = {
  id : int;
  client_id : int;
  order_date : string;
  status : string;
  origin : char;
}

type order_item = {
  order_id : int;
  product_id : int;
  quantity : int;
  price : float;
  tax : float;
}

Conversion functions were implemented to map JSON data to these OCaml types.

4. Perform Inner Join

  • Defined a new type, order_order_item, to represent the result of joining orders and order items:

    type order_order_item = {
      order : order;
      order_item : order_item;
    }
  • Used list processing functions (List.fold_left, List.map, and List.filter) to match items by order_id and generate a combined list of enriched records.

  • It is an inner join, so one order type can be combined with N order_item types.

5. Capture User Input for Filtering

  • Implemented command-line input capture to filter data based on status and origin.
  • Used standard OCaml read_line for user input and employed List.filter to apply the constraints dynamically.

6. Compute Order Total

  • Defined a type representing the final aggregated results:

    type order_total = {
      order_id : int;
      total_amount : float;
      total_taxes : float;
    }
  • Transformed order_order_item list into order_total list.

  • Grouped order_item by order_id, aggregating data related to price, quantity, and tax to calculate total_amount and total_taxes.

Detailed implementation:

  1. Using IntSet, created a list of unique order_id's.
  2. Iterated over this unique order_id's list using List.fold_left with an empty list as the accumulator representing the order_total list.
  3. For every unique order_id, filtered order_items that had the same order_id and iterated over them using an inner List.fold_left where the accumulator tracked total_amount and total_taxes.
  4. Using the inner List.fold_left accumulator return, created an order_total record and appended it to the order_total list accumulator.

7. Compute Additional Monthly Summary

  • Introduced a new type to store mean amount and mean tax per month:

    type monthly_mean = {
      year_month: string;
      mean_amount: float;
      mean_tax: float;
    }
  • Transformed order_order_item list into monthly_mean list.

  • Grouped order_item by yyyy-mm, aggregating data related to price, quantity, and tax to calculate total_amount and total_taxes, while also tracking the number of orders in that month to calculate the mean_amount and mean_tax.

Detailed implementation:

  1. Using StringSet, created a list of unique yyyy-mm values.
  2. Iterated over this unique yyyy-mm list using List.fold_left, with an empty list as accumulator representing the monthly_mean list, and calculated the number of orders in each month.
  3. For every unique yyyy-mm, filtered order_items with the same yyyy-mm and iterated over them using an inner List.fold_left where the accumulator tracked total_amount and total_taxes.
  4. Using the inner List.fold_left accumulator return, calculated the mean and created a monthly_mean record, appending it to the monthly_mean list accumulator.

8. Save Data to CSV

  • Utilized the OCaml CSV Library to write processed data to CSV files.
  • Implemented a function to transform OCaml records into CSV-compatible string lists.
  • Then used the library to transform these string lists representations into CSV files.
  • Bonus: Discovered bugs in the official documentation and contributed to open source by creating an issue detailing the problem: Issue Link.
  • Results are stored at: etl/results/csv

9. Save Data to SQLite

  • Designed a schema for storing order_total data and monthly_mean data in an SQLite database.
  • Used OCaml SQLite3 Library.
  • Created one table for each type representation.
  • Created functions that map record lists into SQL insertion value strings.
  • Executed the SQL statements.
  • Results are stored at: etl/results/sqlite/store_db.sqlite3

Results inspected using DB Browser for SQLite:

Order_total table:

order_total_db

Monthly_mean table:

monthly_mean_db

10. Document Functions

  • All functions were documented using structured docstrings.
  • Inside the docstring is the information if the function is pure or not.
  • AI assistance was used to generate the docstrings for each function.

11. Implement Unit Tests

Use of Generative AI

Generative AI was used in specific parts of the project:

  • Assisting with Python API creation.
  • Generating docstring documentation for functions.
  • Helping generate complete unit tests.
  • Helping in creation of README.md.

All Required Requirements

  • The project must be implemented in OCaml.
  • To compute the output, the use of map, reduce, and filter is required.
  • The code must include functions for reading from http API and writing to CSV files, which will generate impure functions.
  • Separate impure functions from pure functions in the project files, (In the docstring of each function, there is a pure section calssifying the function as pure or impure).
  • The input data must be loaded into a list of Records.
  • The use of Helper Functions to load fields into a Record is mandatory.
  • A project report must be written, describing how each step was built. This should serve as a guide for someone reproducing the project in the future. The report must explicitly state whether Generative AI was used or not.

Optional Requirements Checklist

  • Read input data from a static file on the internet (exposed via HTTP).
  • Save output data in an SQLite database.
  • Process input tables separately but preferably perform an inner join before transformation.
  • Organize the ETL project using dune.
  • Document all functions using docstring format.
  • Provide an additional output containing the average revenue and taxes paid, grouped by month and year.
  • Generate comprehensive test files for pure functions.

Bonus Deliverables

  • Deployed an API that serves data over the internet.
  • Parsed HTTP requests into JSON.
  • During development, found errors in the CSV library documentation and created an issue reporting it.

About

It's an ELT project, created using the functional programming language OCaml 🐫. The goal was to calculate aggregated information from data available at an API, it uses important concepts of functional programming such as: recursion, purity, side effects ...

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors