This tutorial runs on a single Postgres instance for simplicity: the sensors write to it, and Metabase reads from it. That's perfect for learning, but it hides an important real-world distinction. In a real company you would almost never point your analytics tool at the same database your application writes to. This file explains why — and what the three kinds of data stores (operational database, data warehouse, data lake) are for.
Why it matters for an analyst: the job description for this kind of role lists "Data Lake / Data Warehouse" as a required concept. You don't have to build them, but you must understand which store you're querying and why.
Data systems are pulled between two jobs that want opposite things:
- Transactions (OLTP — Online Transaction Processing). Many tiny, fast reads and writes: "insert this payment," "update this user's balance," "fetch this order." Needs to be instant, always consistent, never down.
- Analytics (OLAP — Online Analytical Processing). A few huge, slow queries: "average OEE per machine over the last year," "revenue by region by month." Scans millions of rows, runs for seconds or minutes.
If you run a heavy analytics query against the database that's also handling live payments, you can slow down or lock up the thing customers depend on. That single sentence is the reason the other stores exist. You separate the workloads so an analyst's big query never threatens the live system.
The source of truth for the running application. Optimised for fast, small, concurrent transactions. Usually row-oriented and heavily indexed for point lookups. Examples: Postgres, MySQL, SQL Server for relational; MongoDB, DynamoDB for NoSQL.
- Holds: current state — users, orders, balances, the latest sensor reading.
- Optimised for: speed and correctness of individual writes/reads.
- Bad at: large analytical scans (they compete with live traffic).
This is what our sensor_readings table would be in a real plant — the live
landing zone for incoming telemetry.
A separate database built for analysis. Data is copied here (on a schedule, or streamed) so analysts can run huge queries without touching the operational system. Usually column-oriented, which makes aggregations over a few columns across millions of rows extremely fast. Examples: Snowflake, BigQuery, Redshift, or Postgres itself for smaller scale.
- Holds: historical, cleaned, structured data, often reshaped into a schema designed for querying (e.g. star schema with fact + dimension tables).
- Optimised for: big aggregations, joins, time-series rollups.
- Bad at: being a live transactional system (not its job).
In this repo, our single Postgres is playing the warehouse role — Metabase queries it for analytics. In production it would be a separate warehouse fed from the operational DB, so the analytics load is isolated.
A cheap, vast store for raw data in any format — structured tables, JSON logs, images, sensor blobs, video — kept before anyone has decided how to use it. Usually object storage (Amazon S3, Azure Data Lake, Google Cloud Storage). You impose structure when you read ("schema-on-read"), not when you write.
- Holds: everything, raw and unprocessed, often huge volumes.
- Optimised for: cheap storage and flexibility; feeding ML and future use-cases you haven't designed yet.
- Bad at: fast, clean, governed querying on its own — raw data is messy. (Hence "data swamp" when a lake is left ungoverned.)
A common modern pattern is the lakehouse (Databricks, or warehouse engines querying lake storage directly), which blurs the warehouse/lake line — but the mental model above still holds.
A typical flow in a real company:
app / sensors ─► operational DB (OLTP) ─► [ETL/ELT] ─► data warehouse (OLAP) ─► BI / dashboards
│ ▲
└──────────► data lake (raw) ──────────┘ (and ► ML)
- The operational DB runs the business in real time.
- Raw events also land in the lake, cheaply, in case they're needed later.
- A pipeline (ETL = transform then load; ELT = load then transform) moves and cleans data into the warehouse.
- Analysts and dashboards query the warehouse, never the live OLTP system.
(Note: the job description for an analyst role often says "does not design ETL pipelines" — that arrow is the data engineer's job. The analyst consumes the warehouse the engineer fills. Knowing the boundary is part of the role.)
The roles are identical; only the nouns change.
- Operational DB: every payment, transfer, and balance update — must be instant and perfectly consistent (you cannot lose a transaction).
- Warehouse: daily transaction history reshaped for analysis — success rates, fraud trends, funnel conversion, regulatory reporting.
- Lake: raw event logs, app clickststreams, device fingerprints — kept for fraud-model training and audits.
- Why separate: you must never let a "fraud rate by region last quarter" query slow down live payment processing.
- Operational DB / time-series store: the live stream of sensor readings landing every fraction of a second (often a specialised time-series DB like InfluxDB or TimescaleDB).
- Warehouse: aggregated history — OEE, downtime, failures per shift/machine — for the dashboards.
- Lake: raw high-frequency sensor data and machine logs, kept cheaply for predictive-maintenance ML.
- Why separate: analytics over a year of readings shouldn't compete with the ingestion writing new readings right now.
- Operational DB: the live catalogue, carts, orders, inventory.
- Warehouse: sales history for conversion, AOV, cohort retention, product performance.
- Lake: clickstream, search logs, recommendation events — for personalisation models.
- Why separate: a "revenue by category by month" report must not slow down customers checking out.
We collapse all of this into one Postgres on purpose: it keeps the tutorial
to a single docker compose up with nothing to configure, while still letting
you practise the analyst's actual job — writing SQL, defining KPIs, building
dashboards. Just keep the real-world mental model in mind:
In production, the sensors would write to an operational/time-series database, a pipeline would move that data into a separate warehouse, and Metabase would query the warehouse — so heavy analytics never touch the live ingestion. Our single Postgres plays the warehouse role for learning purposes.
Understanding that separation — and being able to explain why it exists (don't let analytics crush the live system) — is exactly the kind of platform literacy the role expects.