This project implements an ETL (Extract, Transform, Load) pipeline for processing E-commerce data. It performs the following steps:
- Extract: Fetches data via HTTP from a deployed REST API.
- Transform: Processes the extracted data to generate insights.
- Load: Saves the transformed data in CSV and SQLite database formats for further use.
/rest_api β Python-based REST API serving E-commerce data, deployed on an AWS EC2 instance.
/etl β OCaml-based ETL pipeline that fetches, processes, and stores data locally.
/etl/results β Output directory containing the processed CSV and SQLite database files.
Ensure you have the following dependencies installed before running the project:
- OCaml Tooling:
dune(build system) - Libraries:
cohttp-lwt-unix(HTTP requests)yojson(JSON parsing)csv(CSV generation)sqlite3(Database interactions)alcotest(Unit testing framework)
-
Navigate to the ETL project directory:
cd etl -
Create a switch using ocaml-base-compiler.5.3.0 and activate it
opam switch create my_custom_switch ocaml-base-compiler.5.3.0 eval $(opam env)3 Installing the dependencies, it will use the file etl/etl.opam to install the dependencies
opam install . --deps-only-
Build the project:
dune build
-
Execute the ETL process:
dune exec etl -
Follow the terminal instructions. Once the execution is complete, the processed files will be available in:
etl/results/ -
To run unit tests run:
dune testThis report outlines the development process of the project, detailing the steps taken to implement the required functionalities. It serves as a guide for anyone looking to replicate or extend the project in the future. This project was implemented in OCaml and adheres to functional programming principles.
A Python API was developed to serve order and order item data. This API is responsible for exposing endpoints that return order details in JSON format.
API endpoints:
- Order: http://44.200.31.239:8000/order
- OrderItem: http://44.200.31.239:8000/orderItem
- Installed
duneand created the basic project structure using it. - Manually created directories to organize the project, separating concerns between:
lib/β for the main execution filecontroller/β for fetching data logicservice/β for all processing data logictest/β for testing
Fetching Data: The ETL process retrieves order and order item data via HTTP GET requests using OCaml Cohttp for HTTP handling.
Parsing JSON: The HTTP response is parsed into JSON objects with the Yojson.Safe module.
Defining Data Structures:
OCaml record types order and order_item were created to model the extracted data:
type order = {
id : int;
client_id : int;
order_date : string;
status : string;
origin : char;
}
type order_item = {
order_id : int;
product_id : int;
quantity : int;
price : float;
tax : float;
}Conversion functions were implemented to map JSON data to these OCaml types.
-
Defined a new type,
order_order_item, to represent the result of joining orders and order items:type order_order_item = { order : order; order_item : order_item; }
-
Used list processing functions (
List.fold_left,List.map, andList.filter) to match items byorder_idand generate a combined list of enriched records. -
It is an inner join, so one order type can be combined with N order_item types.
- Implemented command-line input capture to filter data based on
statusandorigin. - Used standard OCaml
read_linefor user input and employedList.filterto apply the constraints dynamically.
-
Defined a type representing the final aggregated results:
type order_total = { order_id : int; total_amount : float; total_taxes : float; }
-
Transformed
order_order_itemlist intoorder_totallist. -
Grouped order_item by order_id, aggregating data related to price, quantity, and tax to calculate total_amount and total_taxes.
Detailed implementation:
- Using IntSet, created a list of unique order_id's.
- Iterated over this unique order_id's list using
List.fold_leftwith an empty list as the accumulator representing the order_total list. - For every unique order_id, filtered order_items that had the same order_id and iterated over them using an inner
List.fold_leftwhere the accumulator tracked total_amount and total_taxes. - Using the inner
List.fold_leftaccumulator return, created an order_total record and appended it to the order_total list accumulator.
-
Introduced a new type to store mean amount and mean tax per month:
type monthly_mean = { year_month: string; mean_amount: float; mean_tax: float; }
-
Transformed
order_order_itemlist intomonthly_meanlist. -
Grouped order_item by yyyy-mm, aggregating data related to price, quantity, and tax to calculate total_amount and total_taxes, while also tracking the number of orders in that month to calculate the mean_amount and mean_tax.
Detailed implementation:
- Using StringSet, created a list of unique yyyy-mm values.
- Iterated over this unique yyyy-mm list using
List.fold_left, with an empty list as accumulator representing the monthly_mean list, and calculated the number of orders in each month. - For every unique yyyy-mm, filtered order_items with the same yyyy-mm and iterated over them using an inner
List.fold_leftwhere the accumulator tracked total_amount and total_taxes. - Using the inner
List.fold_leftaccumulator return, calculated the mean and created a monthly_mean record, appending it to the monthly_mean list accumulator.
- Utilized the OCaml CSV Library to write processed data to CSV files.
- Implemented a function to transform OCaml records into CSV-compatible
string lists. - Then used the library to transform these
string listsrepresentations into CSV files. - Bonus: Discovered bugs in the official documentation and contributed to open source by creating an issue detailing the problem: Issue Link.
- Results are stored at:
etl/results/csv
- Designed a schema for storing order_total data and monthly_mean data in an SQLite database.
- Used OCaml SQLite3 Library.
- Created one table for each type representation.
- Created functions that map record lists into SQL insertion value strings.
- Executed the SQL statements.
- Results are stored at:
etl/results/sqlite/store_db.sqlite3
Results inspected using DB Browser for SQLite:
Order_total table:
Monthly_mean table:
- All functions were documented using structured docstrings.
- Inside the docstring is the information if the function is pure or not.
- AI assistance was used to generate the docstrings for each function.
- Complete unit tests were written for all pure functions using
Alcotest. - Leveraged insights from:
Generative AI was used in specific parts of the project:
- Assisting with Python API creation.
- Generating docstring documentation for functions.
- Helping generate complete unit tests.
- Helping in creation of README.md.
- The project must be implemented in OCaml.
- To compute the output, the use of map, reduce, and filter is required.
- The code must include functions for reading from http API and writing to CSV files, which will generate impure functions.
- Separate impure functions from pure functions in the project files, (In the docstring of each function, there is a pure section calssifying the function as pure or impure).
- The input data must be loaded into a list of Records.
- The use of Helper Functions to load fields into a Record is mandatory.
- A project report must be written, describing how each step was built. This should serve as a guide for someone reproducing the project in the future. The report must explicitly state whether Generative AI was used or not.
- Read input data from a static file on the internet (exposed via HTTP).
- Save output data in an SQLite database.
- Process input tables separately but preferably perform an inner join before transformation.
- Organize the ETL project using
dune. - Document all functions using docstring format.
- Provide an additional output containing the average revenue and taxes paid, grouped by month and year.
- Generate comprehensive test files for pure functions.
- Deployed an API that serves data over the internet.
- Parsed HTTP requests into JSON.
- During development, found errors in the CSV library documentation and created an issue reporting it.

