jfmcdowell/ducklake-tutorial

DuckLake Tutorial

A hands-on tutorial for building a local lakehouse with DuckDB, DuckLake, and SQLMesh.

Why This Tutorial?

DuckLake brings lakehouse capabilities (ACID transactions, time travel, Parquet storage) to DuckDB. Combined with SQLMesh for data transformations, you get a lightweight but production-ready data stack that runs entirely on your laptop.

This repo is an unofficial companion to the Tobiko blog post, packaged as a Jupyter notebook for ease of exploration. The blog post covers:

  • What lakehouses are and why they matter
  • How DuckLake compares to other open table formats
  • The layered data architecture (raw → staging → marts)

New to these concepts? Read the blog post first, then come back here to build it yourself.

Prerequisites

  • Python 3.13+
  • uv package manager

Setup

# Install dependencies
uv sync

# Launch JupyterLab
uv run jupyter lab

Usage

  1. Open ducklake_tutorial.ipynb
  2. Run all cells

The notebook will:

  • Download NYC Taxi trip data (~50K rows sampled)
  • Initialize a DuckLake lakehouse
  • Run SQLMesh transformations (staging → dims → facts)
  • Query the transformed data
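For orientation, here is a minimal sketch of what the "initialize a DuckLake lakehouse" step could look like outside the notebook. It is not the notebook's actual code: the function names, the `lake` alias, and the `catalog.ducklake` / `lake_files/` paths are hypothetical, and it assumes a DuckDB version that ships the `ducklake` extension (1.3.0 or later).

```python
from pathlib import Path


def attach_statement(metadata_path: str, data_path: str, alias: str = "lake") -> str:
    """Build the ATTACH statement that opens (or creates) a DuckLake catalog.

    metadata_path: file holding the DuckLake catalog metadata (hypothetical name).
    data_path:     directory where DuckLake writes its Parquet files.
    """
    return f"ATTACH 'ducklake:{metadata_path}' AS {alias} (DATA_PATH '{data_path}')"


def init_lakehouse(root: str = "data", alias: str = "lake"):
    """Return a DuckDB connection with a DuckLake catalog attached as `alias`."""
    # Imported lazily so attach_statement() can be used without DuckDB installed.
    import duckdb

    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)

    con = duckdb.connect()
    con.execute("INSTALL ducklake")  # downloads the extension on first use
    con.execute("LOAD ducklake")
    con.execute(
        attach_statement(
            (base / "catalog.ducklake").as_posix(),
            (base / "lake_files").as_posix() + "/",
            alias,
        )
    )
    con.execute(f"USE {alias}")  # make the lakehouse the default catalog
    return con
```

Calling `init_lakehouse()` would create the catalog under `data/`, which matches the gitignored directory listed in the project structure. Tables created afterwards on that connection are stored as Parquet files under the data path.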

Project Structure

├── ducklake_tutorial.ipynb  # Main tutorial
├── src/ducklake/            # Helper utilities
├── sqlmesh/                 # SQLMesh config & models
├── data/                    # Generated data (gitignored)
└── pyproject.toml           # Dependencies

Data

The notebook uses the NYC TLC Trip Record Data (Yellow Taxi, January 2024).
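As a rough illustration of where that file comes from, the sketch below builds a download URL and a sampling query. The URL pattern is an assumption based on the naming convention TLC uses for its public Parquet files, and the sampling approach is a guess: this README does not say how the notebook draws its ~50K rows. Both function names are hypothetical.

```python
# Assumed base URL for TLC's public trip-record Parquet files.
TLC_BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def tlc_parquet_url(year: int, month: int, dataset: str = "yellow") -> str:
    """Return the assumed URL of one monthly trip-record file."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"{TLC_BASE_URL}/{dataset}_tripdata_{year:04d}-{month:02d}.parquet"


def sample_query(url: str, n_rows: int = 50_000) -> str:
    """Build a DuckDB query that reads the remote file and samples n_rows rows.

    Reading over HTTPS relies on DuckDB's httpfs extension being available.
    """
    return f"SELECT * FROM read_parquet('{url}') USING SAMPLE {n_rows} ROWS"
```

For the month used here, `tlc_parquet_url(2024, 1)` gives the Yellow Taxi file for January 2024, and the string returned by `sample_query` can be passed to `con.execute(...)` on a DuckDB connection.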
