zig-parquet

A pure Zig Parquet library. All five compression codecs. No C dependencies. Runs anywhere.

Built against Zig 0.16 — uses the unified std.Io reader/writer interfaces throughout.


Build Sizes

Pure Zig, all five compression codecs, no C compiler required. Measured with ReleaseSmall.

Target Size
Static library (macOS arm64) 1,365 KB
WASM (brotli-compressed) 199 KB

The 1,365 KB figure includes all five Zig codecs. Per-codec cost (on top of a 992 KB codec-less baseline) ranges from 89 KB for lz4 to 254 KB for brotli — use -Dcodecs= to pick a subset. See COMPRESSION.md for the full breakdown.

Features

  • Embeddable Native Library - Link Parquet support directly into native applications
  • Full Read/Write Support - Read and write Parquet files with all physical and logical types
  • All Standard Encodings - PLAIN, RLE, DICTIONARY, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT
  • Nested Types - Lists, structs, maps, and arbitrary nesting depth
  • Compression - zstd, gzip, snappy, lz4, brotli — pure Zig by default, no C/C++ required; C backends available as opt-in via -Dcodecs=c-only
  • Logical Types - STRING, DATE, TIME, TIMESTAMP (millis/micros/nanos), DECIMAL, UUID, INT annotations, FLOAT16, ENUM, JSON, BSON, INTERVAL, GEOMETRY, GEOGRAPHY
  • Dynamic Row API - Runtime DynamicWriter / DynamicReader for all types and arbitrary nesting depth
  • Schema-Agnostic Reading - Read any Parquet file without knowing the schema at compile time
  • Column Statistics - Min/max/null_count in column metadata
  • Page Index - Read and write OffsetIndex + ColumnIndex; readRowsFiltered skips pages whose min/max can't match a predicate, reading only needed byte ranges
  • Page-Level CRC Checksums - Written by default, validated on read
  • Key-Value Metadata - Read and write arbitrary file-level metadata
  • DataPage V1 and V2 - Read both page formats; write uses V1
  • Buffer and Callback Transports - Read/write from memory, files, or custom I/O backends
  • Hardened Against Malformed Input - Safe casting, bounds checking, and no undefined behavior on untrusted data
  • C ABI with Arrow C Data Interface - Call from C, C++, and other languages via ArrowSchema, ArrowArray, and ArrowArrayStream
  • Portable Deployment - Native library and CLI for desktops, servers, edge devices, and serverless jobs
  • WASM Compatible - 103 KB plain, 184 KB with all pure Zig codecs, or 446 KB with C codecs (brotli-compressed)
  • CLI Tool - pqi for inspecting and validating Parquet files

CLI Tool

The pqi command-line tool is included for working with Parquet files:

# Build the CLI
cd cli && zig build

# Show schema
pqi schema data.parquet

# Preview rows
pqi head data.parquet -n 10

# Output all rows as JSON
pqi cat data.parquet --json

# Row count
pqi count data.parquet

# File statistics
pqi stats data.parquet

# Row group details
pqi rowgroups data.parquet

# File size breakdown
pqi size data.parquet

# Column detail across row groups
pqi column data.parquet price quantity

# Validate file integrity
pqi validate data.parquet

# Show OffsetIndex + ColumnIndex for each row group / column
pqi page-index data.parquet

Why zig-parquet?

If you need Parquet support inside a native application, zig-parquet is a straightforward way to ship it.

  • Embed directly - Use Parquet from Zig or through the C ABI instead of shelling out to a separate tool or service
  • Keep deployment simple - Stay native without requiring the JVM, Python, or the full Arrow C++ stack
  • Ship across targets - Use the same core library on desktops, servers, edge devices, serverless workloads, and WASM
  • Start with the CLI - Use pqi to inspect and validate files, then embed the same implementation in your application

Installation

Add zig-parquet to your project using zig fetch. This will automatically download the package and update your build.zig.zon with the correct cryptographic hash:

zig fetch --save https://github.com/akeating/zig-parquet/releases/download/v0.2.0/zig-parquet-v0.2.0.tar.gz

Then in your build.zig:

const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});

const parquet = b.dependency("parquet", .{
    .target = target,
    .optimize = optimize,
});
// `exe` is your application's compile step (from b.addExecutable)
exe.root_module.addImport("parquet", parquet.module("parquet"));
exe.linkLibrary(parquet.artifact("parquet"));

Quick Start

Row-Based API (Recommended)

Define your schema at runtime, write rows with typed setters, and read back dynamically:

const std = @import("std");
const parquet = @import("parquet");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Write
    {
        const file = try std.fs.cwd().createFile("sensors.parquet", .{});
        defer file.close();

        var writer = try parquet.createFileDynamic(allocator, file);
        defer writer.deinit();

        // Columns are OPTIONAL by default; use .asRequired() for non-nullable
        const TypeInfo = parquet.TypeInfo;
        try writer.addColumn("sensor_id", TypeInfo.int32.asRequired(), .{});
        try writer.addColumn("timestamp", TypeInfo.timestamp_micros, .{});
        try writer.addColumn("temperature", TypeInfo.double_, .{});
        try writer.addColumn("location", TypeInfo.string, .{});
        writer.setCompression(.zstd);
        try writer.begin();

        try writer.setInt32(0, 1);
        try writer.setInt64(1, 1704067200000000);
        try writer.setDouble(2, 23.5);
        try writer.setBytes(3, "Building A");
        try writer.addRow();

        try writer.close();
    }

    // Read
    {
        const file = try std.fs.cwd().openFile("sensors.parquet", .{});
        defer file.close();

        var reader = try parquet.openFileDynamic(allocator, file, .{});
        defer reader.deinit();

        const rows = try reader.readAllRows(0);
        defer {
            for (rows) |row| row.deinit();
            allocator.free(rows);
        }

        for (rows) |row| {
            const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
            const temp = if (row.getColumn(2)) |v| v.asDouble() orelse 0 else 0;
            std.debug.print("Sensor {}: {d}°C\n", .{ id, temp });
        }
    }
}

Column Projection

Read only a subset of top-level columns (skips I/O for unrequested columns):

var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

// Read only columns 1 and 3 (dense-packed: returned rows have 2 values)
const rows = try reader.readRowsProjected(0, &.{ 1, 3 });
defer {
    for (rows) |row| row.deinit();
    allocator.free(rows);
}

for (rows) |row| {
    const name = if (row.getColumn(0)) |v| v.asBytes() orelse "" else "";
    const score = if (row.getColumn(1)) |v| v.asDouble() orelse 0 else 0;
    std.debug.print("{s}: {d}\n", .{ name, score });
}

Row Group Filtering

Use column statistics to skip row groups that don't match your criteria:

var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

const target: i32 = 42; // example: the value being searched for

for (0..reader.getNumRowGroups()) |rg| {
    const stats = reader.getColumnStatistics(0, rg) orelse continue;
    const min_bytes = stats.min_value orelse stats.min orelse continue;
    const max_bytes = stats.max_value orelse stats.max orelse continue;
    const min = std.mem.readInt(i32, min_bytes[0..4], .little);
    const max = std.mem.readInt(i32, max_bytes[0..4], .little);

    if (target < min or target > max) continue; // skip this row group

    const rows = try reader.readAllRows(rg);
    defer {
        for (rows) |row| row.deinit();
        allocator.free(rows);
    }
    // ... process matching rows ...
}

Page-Index Filtering

For files written with page indexes, readRowsFiltered evaluates column-level predicates against per-page min/max and reads only the pages whose ranges might match. Non-matching pages are never fetched from the source and never decompressed:

var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

// Keep rows where col 0 is between 100 and 200 AND col 1 equals the bytes "active".
const filters = [_]parquet.ColumnFilter{
    parquet.page_filter.betweenI32(0, 100, 200),
    parquet.page_filter.cmpBytes(1, .eq, "active"),
};

const rows = try reader.readRowsFiltered(0, &filters);
defer {
    for (rows) |row| row.deinit();
    allocator.free(rows);
}

The filter is conservative: kept rows come from pages whose stats could match, so you may still need a value-level check post-decode. readRowsFiltered requires every leaf column in the row group to carry an OffsetIndex; use readAllRows for unindexed files.
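Because the filter is page-granular, a value-level re-check after decode can reuse the same row accessors shown elsewhere in this README. A minimal sketch, assuming column 0 holds the i32 range-filtered above:

```zig
// Re-check the predicate on decoded values; page stats only guarantee
// that a page *might* contain matches.
for (rows) |row| {
    const v = (row.getColumn(0) orelse continue).asInt32() orelse continue;
    if (v < 100 or v > 200) continue; // in a matching page, but outside the range
    // ... process row ...
}
```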

Writing files with indexes is opt-in:

var writer = try parquet.createFileDynamic(allocator, file);
writer.write_page_index = true;
writer.max_page_size = 64 * 1024; // roll pages at 64 KiB of encoded data
// ... addColumn / begin / rows / close as usual

Per-page min/max is captured automatically during multi-page writes for non-dictionary-encoded flat primitive columns (i32/i64/f32/f64). Dictionary-encoded or byte-array columns still emit a valid single-page index with chunk-level stats. Inspect what you wrote with pqi page-index <file>.

Row Iterator

Stream through all rows without managing row groups manually. Only one row group's data is held in memory at a time:

var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();

var iter = reader.rowIterator();
defer iter.deinit();

while (try iter.next()) |row| {
    const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
    std.debug.print("id={}\n", .{id});
}

Supported Types

Physical Types

Parquet Type Zig Type
BOOLEAN bool
INT32 i32
INT64 i64
FLOAT f32
DOUBLE f64
BYTE_ARRAY []const u8
FIXED_LEN_BYTE_ARRAY []const u8

Logical Types

Logical Type TypeInfo Constant Physical Storage
STRING TypeInfo.string BYTE_ARRAY
DATE TypeInfo.date INT32 (days since epoch)
TIMESTAMP TypeInfo.timestamp_micros INT64
TIME TypeInfo.time_micros INT64
UUID TypeInfo.uuid FIXED_LEN_BYTE_ARRAY(16)
INTERVAL TypeInfo.interval FIXED_LEN_BYTE_ARRAY(12)
GEOMETRY TypeInfo.geometry BYTE_ARRAY (WKB)
GEOGRAPHY TypeInfo.geography BYTE_ARRAY (WKB)
DECIMAL TypeInfo.forDecimal(p, s) INT32/INT64/FIXED
JSON TypeInfo.json BYTE_ARRAY
BSON TypeInfo.bson BYTE_ARRAY
ENUM TypeInfo.enum_ BYTE_ARRAY

Nested Types

Build arbitrary nested schemas at runtime using SchemaNode:

// list<struct<product_id: i32, quantity: i32, price: f64>>
const pid = try writer.allocSchemaNode(.{ .int32 = .{} });
const qty = try writer.allocSchemaNode(.{ .int32 = .{} });
const price = try writer.allocSchemaNode(.{ .double = .{} });
var fields = try writer.allocSchemaFields(3);
fields[0] = .{ .name = try writer.dupeSchemaName("product_id"), .node = pid };
fields[1] = .{ .name = try writer.dupeSchemaName("quantity"), .node = qty };
fields[2] = .{ .name = try writer.dupeSchemaName("price"), .node = price };
const item = try writer.allocSchemaNode(.{ .struct_ = .{ .fields = fields } });
const items = try writer.allocSchemaNode(.{ .list = item });
try writer.addColumnNested("items", items, .{});

Supports lists, structs, maps, and arbitrary nesting depth (e.g., list<map<string, list<struct<...>>>>). See examples/basic/03_nested_types.zig for a complete example.

Compression

All major Parquet compression codecs are supported, individually selectable at build time:

Codec Default C Backend Notes
zstd Pure Zig libzstd 1.5.7 Zig: level-1 compressor + stdlib decompressor
gzip Pure Zig zlib 1.3.1 Zig: level-9 deflate compressor + stdlib decompressor
snappy Pure Zig snappy 1.2.2 (C++) Full Snappy block format
lz4 Pure Zig lz4 1.10.0 Full LZ4 raw block format
brotli Pure Zig brotli 1.2.0 Zig: quality-0 compressor + full decompressor

C backends available via -Dcodecs=c-only for maximum compression performance.

Pick the codec at runtime:

var writer = try parquet.createFileDynamic(allocator, file);
writer.setCompression(.zstd);

Choose which codec implementations are compiled in at build time:

zig build                           # all codecs, Zig implementations used by default
zig build -Dcodecs=none             # no compression (smallest binary)
zig build -Dcodecs=zig-only         # pure Zig codecs only (no C/C++ deps at all)
zig build -Dcodecs=c-only           # C/C++ codecs only (opt-in)
zig build -Dcodecs=c-zstd,zstd      # both zstd implementations (cross-impl testing)

See COMPRESSION.md for build sizes, API details, and the full set of build options.

Per-Column and Per-Leaf Options

Set options per column at definition time, or per leaf path for nested types:

// Per-column options via addColumn
try writer.addColumn("timestamp", parquet.TypeInfo.int64, .{
    .encoding = .delta_binary_packed,
    .compression = .zstd,
});

// Per-leaf options for nested columns via setPathProperties
try writer.addColumnNested("address", struct_node, .{});
try writer.setPathProperties("address.city", .{ .compression = .zstd });
try writer.setPathProperties("address.zip", .{ .use_dictionary = false });

Global defaults apply to any column/leaf without an explicit override:

writer.setUseDictionary(false);        // disable dictionary encoding globally
writer.setIntEncoding(.delta_binary_packed);  // default for int columns
writer.setMaxPageSize(1_048_576);      // 1MB page size limit

Spec Coverage

Feature Notes
Physical Types
BOOLEAN, INT32, INT64, FLOAT, DOUBLE All primitive types
BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY Variable and fixed-length binary
INT96 Legacy timestamp support; read always, write via column API only (not DynamicWriter)
Encodings
PLAIN All physical types
RLE / BIT_PACKED Levels, dictionary indices, booleans
PLAIN_DICTIONARY / RLE_DICTIONARY Strings and integers
DELTA_BINARY_PACKED Sorted integers, timestamps
DELTA_LENGTH_BYTE_ARRAY Variable-length byte arrays
DELTA_BYTE_ARRAY Sorted strings (prefix compression)
BYTE_STREAM_SPLIT Float/double/int/fixed columns
Compression
UNCOMPRESSED
SNAPPY Pure Zig (default); C++ backend via -Dcodecs=c-snappy or c-only
GZIP Pure Zig (default); C backend via -Dcodecs=c-gzip or c-only
ZSTD Pure Zig (default); C backend via -Dcodecs=c-zstd or c-only
LZ4_RAW Pure Zig (default); C backend via -Dcodecs=c-lz4 or c-only
BROTLI Pure Zig (default); C backend via -Dcodecs=c-brotli or c-only
LZ4 (non-raw) Hadoop-specific framing format
LZO Not implemented
Logical Types
STRING, ENUM, JSON, BSON BYTE_ARRAY with annotation
UUID FIXED_LEN_BYTE_ARRAY(16)
INT (8/16/32/64, signed/unsigned) Width annotations
DECIMAL INT32/INT64/FIXED backing
FLOAT16 Half-precision float
DATE Days since epoch
TIME (MILLIS/MICROS) Time of day
TIMESTAMP (MILLIS/MICROS) Instant or local
TIME/TIMESTAMP (NANOS) Full read/write support
INTERVAL Legacy ConvertedType (months/days/millis)
GEOMETRY / GEOGRAPHY GeoParquet 1.1 compatible
VARIANT Future
Nested Types
LIST 3-level structure
MAP Key-value pairs
Nested structs Arbitrary depth
Page Types
DATA_PAGE (v1)
DATA_PAGE_V2 Read only; optimized split decompression
DICTIONARY_PAGE
Features
Column projection Read subset of columns; skips I/O for unselected columns
Row group filtering Statistics-based; skip row groups via min/max/null_count
Streaming iteration Row iterator; one row group in memory at a time
Column statistics min/max/null_count
Multi-page columns Large column support
Multi-row-group files
Page Index (read) readRowsFiltered skips pages whose min/max can't match
Page Index (write) Flat columns; nested-leaf emission is a follow-up
Bloom filters Planned
CRC checksums Page-level CRC32
Encryption 🔍 Under review — Java/Python-only ecosystem support

Legend: ✅ Supported | ⏳ Planned | 🔍 Under review | ❌ Unsupported

Files containing unsupported features return explicit errors rather than silently producing incorrect results.
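In code, this surfaces as an ordinary Zig error from open/read calls, which you can report or propagate. A sketch (the specific error set is not enumerated here):

```zig
var reader = parquet.openFileDynamic(allocator, file, .{}) catch |err| {
    // e.g. a file using an unimplemented codec or feature
    std.debug.print("cannot read file: {s}\n", .{@errorName(err)});
    return err;
};
defer reader.deinit();
```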

WASM Support

Both wasm32-wasi and wasm32-freestanding targets are supported. WASI supports all codecs via -Dcodecs=; freestanding builds without compression. See COMPRESSION.md for per-codec WASM binary sizes.

Build for WASI:

cd zig-parquet
zig build -Dwasm_wasi -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_wasi.wasm

Build for browser (freestanding, no compression):

cd zig-parquet
zig build -Dwasm_freestanding -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_freestanding.wasm

Run with a WASI runtime:

wasmtime --dir=. zig-out/bin/parquet_wasi.wasm

See examples/wasm_demo/ and examples/wasm_freestanding/ for usage examples.

Requirements

  • Zig 0.16.0
  • No C compiler required for the default build (-Dcodecs=all uses pure Zig implementations)
  • C compiler only needed when opting into C codecs via -Dcodecs=c-only or individual codec names (e.g. -Dcodecs=c-zstd)
  • C++ compiler only needed for the C Snappy backend

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contributing

Contributions welcome! Please read the existing code style and add tests for new functionality.
