Real-Time Stock Market Analysis

A scalable, end-to-end data pipeline to ingest, process, and visualise stock market data in real-time. Built for MarketPulse Analytics, a FinTech firm serving institutional investors. This project leverages a modern data stack to handle high-velocity data streams, providing actionable insights through live dashboards.

Project Overview

MarketPulse Analytics faces increasing demand for ultra-low latency financial data as data volumes grow. The current infrastructure struggles to handle peak market periods such as stock market opens and earnings reports leading to delays that affect trading decisions and client satisfaction.

This project addresses those challenges by building a fault-tolerant, real-time data pipeline that:

Collects stock market data from the Alpha Vantage API via RapidAPI, providing live equity prices, trading volumes, and financial metrics
Streams the collected data through Apache Kafka for real-time event processing
Processes the stream using Apache Spark for transformations and analytics
Stores processed data in PostgreSQL for querying and reporting
Visualises insights via Power BI dashboards, delivering real-time stock trends, trading volumes, and sentiment signals to clients

Expected Outcomes

Scalable pipeline capable of processing high-velocity streams with low latency
Real-time insights via interactive dashboards for stock trends and trading volumes
Operational efficiency with reduced processing delays and more reliable analytics
Client satisfaction through faster, data-driven decision support

Business Context

Company: MarketPulse Analytics
Location: New York City, USA
Industry: FinTech — Financial Data Analytics
Founded: 2016

MarketPulse provides real-time market feeds, custom reporting dashboards, and predictive insights to hedge funds, asset managers, and electronic brokers. Key milestones include launching the first real-time reporting platform in 2016, expanding to global exchanges in 2018, and adding advanced analytics and sentiment modelling in 2022.

Key challenges driving this project:

Data latency — delays integrating diverse sources (exchanges, news, social sentiment) reduce insight accuracy
Scalability — the existing infrastructure bottlenecks under peak market load
System reliability — lack of robust monitoring makes anomaly detection difficult

Pipeline Architecture

Financial APIs  ──►  Kafka (broker)  ──►  Spark (stream processor)  ──►  PostgreSQL  ──►  Power BI
                        │
                    Kafka UI
                  (topic inspector)

Tech Stack and Data Flow

Component	Role
API Producer	Publishes JSON stock events to a Kafka topic
Apache Kafka	Distributed message broker; buffers and streams events in real-time
Kafka UI	Web interface to inspect topics, consumer groups, and message payloads
Apache Spark	Consumes from Kafka, applies transformations, writes results to PostgreSQL
PostgreSQL	Stores processed analytics data for querying and reporting
pgAdmin	Web-based GUI for managing and querying the PostgreSQL database
Power BI	External dashboard tool; connects directly to PostgreSQL for live reporting

Prerequisites

Ensure the following are installed on your machine before proceeding:

Docker Desktop (v24+ recommended) with Docker Compose v2
Git
Python 3.9+ (for running the producer locally)
Power BI Desktop (optional — for dashboard visualisation)

Verify your Docker installation:

docker --version
docker compose version

Getting Started

Step 1 — Create a RapidAPI account and get your Alpha Vantage API key

The producer fetches live stock market data from the Alpha Vantage API, accessed via RapidAPI.

Go to https://rapidapi.com and sign up for a free account using a valid email address
Once logged in, use the search bar to search for Alpha Vantage
From the results, select the one listed under the Finance category
On the API page, click the Test Endpoint tab to browse the available endpoints and confirm results are returning successfully
In the code snippet panel on the right, select Python from the language dropdown
Copy the generated code — this reflects the structure used in producer/extract.py
Navigate to the Pricing tab and subscribe to the free tier if prompted

Once you have your key, copy it from the App section under your RapidAPI dashboard (it appears as the X-RapidAPI-Key header value in the code snippet).

Step 2 — Clone the repository

git clone https://github.com/samuelede/Real-Time-Stock-Market-Analysis.git
cd Real-Time-Stock-Market-Analysis

Step 3 — Configure environment variables

Create a .env file in the project root:

touch .env

Populate it with your credentials — the API key must come first:

API_KEY=your_rapidapi_key_here

POSTGRES_USER=your_postgres_user
POSTGRES_PASSWORD=your_postgres_password
POSTGRES_DB=stock_market

Important: Never commit your .env file to version control. It is listed in .gitignore by default.

Step 4 — Install Python dependencies

pip install -r requirements.txt

Step 5 — Test the API connection

Before starting the full pipeline, confirm the producer can successfully fetch data from Alpha Vantage:

python producer/extract.py

You should see stock market data printed to the terminal. If you receive an authentication error, double-check your API_KEY value in .env.

Running the Pipeline

All infrastructure services are defined in compose.yml and run as Docker containers on a shared stock_data network.

Step 1 — Start all infrastructure services

From the project root, bring up all containers in detached mode:

docker compose up -d

Docker will pull the required images (first run may take a few minutes) and start the following services: spark-master, spark-worker, kafka, kafka-ui, postgres, and pgadmin.

Verify all containers are running:

docker compose ps

You should see all services with a running status.

Step 2 — Confirm Kafka is ready

Wait ~15–20 seconds for Kafka to complete its KRaft initialisation, then open Kafka UI in your browser:

http://localhost:8085

You should see a cluster named local with no topics yet. If the cluster does not appear, wait a few more seconds and refresh.

Step 3 — Start the stock data producer

The producer fetches live stock market data from Alpha Vantage and publishes it as JSON events to Kafka. Run it from the project root:

python producer/main.py

Once running, return to Kafka UI (http://localhost:8085) and navigate to Topics. You should see the stock market topic appear with incoming messages.

Step 4 — Submit the Spark streaming job

The Spark job consumes messages from Kafka, applies transformations, and writes the results to PostgreSQL. Submit it to the Spark master container:

docker exec spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.postgresql:postgresql:42.7.3 \
  /path/to/your/spark_job.py

Replace /path/to/your/spark_job.py with the actual path to the Spark consumer script inside the container, or mount it via a volume.

Monitor the job progress at the Spark Master web UI:

http://localhost:8077

Step 5 — Verify data in PostgreSQL

Open pgAdmin at http://localhost:5050 and log in with:

Email: admin@admin.com
Password: admin

Register a new server with the following connection details:

Field	Value
Host	`postgres` (or `localhost` if connecting from outside Docker)
Port	`5434`
Database	value from `POSTGRES_DB` in your `.env`
Username	value from `POSTGRES_USER` in your `.env`
Password	value from `POSTGRES_PASSWORD` in your `.env`

Query the processed stock data table to confirm records are being written.

Accessing the Services

Service	URL	Credentials
Spark Master UI	http://localhost:8077	—
Spark Worker UI	http://localhost:8081	—
Kafka UI	http://localhost:8085	—
pgAdmin	http://localhost:5050	admin@admin.com / admin
PostgreSQL (direct)	localhost:5434	from `.env`

Connecting Power BI

Power BI connects externally to the PostgreSQL database to build live dashboards.

Open Power BI Desktop
Select Get Data → PostgreSQL database
Enter the following connection details:
- Server: localhost:5434
- Database: value from POSTGRES_DB in your .env
Authenticate with your PostgreSQL username and password
Select the relevant tables and build your reports

For scheduled refresh or cloud publishing, configure the Power BI On-premises Data Gateway.

Project Structure

Real-Time-Stock-Market-Analysis/
├── consumer/
│   ├── consumer.py          # Spark streaming job — consumes from Kafka, writes to PostgreSQL
│   └── Dockerfile           # Container definition for the Spark consumer
├── img/
│   └── data-pipeline-architecture.svg   # Architecture diagram
├── producer/
│   ├── config.py            # Configuration and environment variable loading
│   ├── extract.py           # Fetches stock data from Alpha Vantage via RapidAPI
│   ├── main.py              # Entry point — orchestrates extraction and Kafka publishing
│   └── producer_setup.py   # Kafka producer initialisation and topic setup
├── compose.yml              # Docker Compose service definitions
├── requirements.txt         # Python dependencies
├── consumer.py              # Top-level consumer script (root-level entry point)
├── .env                     # Local environment variables (not committed)
├── .gitignore
└── README.md

Environment Variables

The following variables are required in your .env file:

Variable	Description	Example
`API_KEY`	RapidAPI key for Alpha Vantage stock data	`your_rapidapi_key_here`
`POSTGRES_USER`	PostgreSQL superuser name	`stockuser`
`POSTGRES_PASSWORD`	PostgreSQL password	`securepassword`
`POSTGRES_DB`	Name of the database to create	`stock_market`

These are injected into the relevant services at container startup via compose.yml.

Stopping the Pipeline

To stop all running containers without removing volumes (data is preserved):

docker compose stop

To stop and remove all containers, networks, and volumes (full teardown):

docker compose down -v

Learning Outcomes

Working through this project builds practical skills in:

Data engineering — designing and operating real-time pipelines with Kafka, Spark, and PostgreSQL
Stream processing — consuming, transforming, and writing continuous data streams at low latency
Containerisation — orchestrating multi-service architectures with Docker Compose
Analytics and visualisation — connecting live operational data to Power BI for client-facing reporting
Secrets management — handling credentials securely via environment variables

Contributing

Contributions are welcome and appreciated. If you'd like to improve the pipeline, fix a bug, or extend the project with new features, please follow the steps below.

How to contribute

Fork the repository — click the Fork button at the top of the GitHub page
Create a feature branch from main:

   git checkout -b feature/your-feature-name

Make your changes and ensure the pipeline still runs end-to-end before committing
Commit with a clear message:

   git commit -m "feat: describe what your change does"

Push your branch:

   git push origin feature/your-feature-name

Open a Pull Request against the main branch with a description of what you changed and why

Guidelines

Keep pull requests focused — one feature or fix per PR
Follow the existing code style and naming conventions
If adding a new service or dependency, update compose.yml, requirements.txt, and this README accordingly
Do not commit .env files or any credentials

Reporting issues

Found a bug or have a suggestion? Open an issue on the Issues page with as much detail as possible, including steps to reproduce if applicable.

License

This project is licensed under the MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Real-Time Stock Market Analysis

Table of Contents

Project Overview

Expected Outcomes

Business Context

Pipeline Architecture

Tech Stack and Data Flow

Prerequisites

Getting Started

Step 1 — Create a RapidAPI account and get your Alpha Vantage API key

Step 2 — Clone the repository

Step 3 — Configure environment variables

Step 4 — Install Python dependencies

Step 5 — Test the API connection

Running the Pipeline

Step 1 — Start all infrastructure services

Step 2 — Confirm Kafka is ready

Step 3 — Start the stock data producer

Step 4 — Submit the Spark streaming job

Step 5 — Verify data in PostgreSQL

Accessing the Services

Connecting Power BI

Project Structure

Environment Variables

Stopping the Pipeline

Learning Outcomes

Contributing

How to contribute

Guidelines

Reporting issues

License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
consumer		consumer
img		img
producer		producer
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
compose.yml		compose.yml
consumer.py		consumer.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Real-Time Stock Market Analysis

Table of Contents

Project Overview

Expected Outcomes

Business Context

Pipeline Architecture

Tech Stack and Data Flow

Prerequisites

Getting Started

Step 1 — Create a RapidAPI account and get your Alpha Vantage API key

Step 2 — Clone the repository

Step 3 — Configure environment variables

Step 4 — Install Python dependencies

Step 5 — Test the API connection

Running the Pipeline

Step 1 — Start all infrastructure services

Step 2 — Confirm Kafka is ready

Step 3 — Start the stock data producer

Step 4 — Submit the Spark streaming job

Step 5 — Verify data in PostgreSQL

Accessing the Services

Connecting Power BI

Project Structure

Environment Variables

Stopping the Pipeline

Learning Outcomes

Contributing

How to contribute

Guidelines

Reporting issues

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages