A pure Zig Parquet library. All five compression codecs. No C dependencies. Runs anywhere.
Built against Zig 0.16 — uses the unified std.Io reader/writer interfaces throughout.
Pure Zig, all five compression codecs, no C compiler required. Measured with ReleaseSmall.
| Target | Size |
|---|---|
| Static library (macOS arm64) | 1,365 KB |
| WASM (brotli-compressed) | 199 KB |
The 1,365 KB figure includes all five Zig codecs. Per-codec cost (on top of a 992 KB codec-less baseline) ranges from 89 KB for lz4 to 254 KB for brotli — use -Dcodecs= to pick a subset. See COMPRESSION.md for the full breakdown.
- Embeddable Native Library - Link Parquet support directly into native applications
- Full Read/Write Support - Read and write Parquet files with all physical and logical types
- All Standard Encodings - PLAIN, RLE, DICTIONARY, DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY, BYTE_STREAM_SPLIT
- Nested Types - Lists, structs, maps, and arbitrary nesting depth
- Compression - zstd, gzip, snappy, lz4, brotli — pure Zig by default, no C/C++ required; C backends available as opt-in via
-Dcodecs=c-only - Logical Types - STRING, DATE, TIME, TIMESTAMP (millis/micros/nanos), DECIMAL, UUID, INT annotations, FLOAT16, ENUM, JSON, BSON, INTERVAL, GEOMETRY, GEOGRAPHY
- Dynamic Row API - Runtime
DynamicWriter/DynamicReaderfor all types and arbitrary nesting depth - Schema-Agnostic Reading - Read any Parquet file without knowing the schema at compile time
- Column Statistics - Min/max/null_count in column metadata
- Page Index - Read and write OffsetIndex + ColumnIndex;
readRowsFilteredskips pages whose min/max can't match a predicate, reading only needed byte ranges - Page-Level CRC Checksums - Written by default, validated on read
- Key-Value Metadata - Read and write arbitrary file-level metadata
- DataPage V1 and V2 - Read both page formats; write uses V1
- Buffer and Callback Transports - Read/write from memory, files, or custom I/O backends
- Hardened Against Malformed Input - Designed for safe casting, bounds checking, and no undefined behavior on untrusted data
- C ABI with Arrow C Data Interface - Call from C, C++, and other languages via ArrowSchema, ArrowArray, and ArrowArrayStream
- Portable Deployment - Native library and CLI for desktops, servers, edge devices, and serverless jobs
- WASM Compatible - 103 KB plain, 184 KB with all pure Zig codecs, or 446 KB with C codecs (brotli-compressed)
- CLI Tool -
pqifor inspecting and validating Parquet files
The pqi command-line tool is included for working with Parquet files:
# Build the CLI
cd cli && zig build
# Show schema
pqi schema data.parquet
# Preview rows
pqi head data.parquet -n 10
# Output all rows as JSON
pqi cat data.parquet --json
# Row count
pqi count data.parquet
# File statistics
pqi stats data.parquet
# Row group details
pqi rowgroups data.parquet
# File size breakdown
pqi size data.parquet
# Column detail across row groups
pqi column data.parquet price quantity
# Validate file integrity
pqi validate data.parquet
# Show OffsetIndex + ColumnIndex for each row group / column
pqi page-index data.parquetIf you need Parquet support inside a native application, zig-parquet is a straightforward way to ship it.
- Embed directly - Use Parquet from Zig or through the C ABI instead of shelling out to a separate tool or service
- Keep deployment simple - Stay native without requiring the JVM, Python, or the full Arrow C++ stack
- Ship across targets - Use the same core library on desktops, servers, edge devices, serverless workloads, and WASM
- Start with the CLI - Use
pqito inspect and validate files, then embed the same implementation in your application
Add zig-parquet to your project using zig fetch. This will automatically download the package and update your build.zig.zon with the correct cryptographic hash:
zig fetch --save https://github.com/akeating/zig-parquet/releases/download/v0.2.0/zig-parquet-v0.2.0.tar.gzThen in your build.zig:
const target = b.standardTargetOptions(.{});
const optimize = b.standardOptimizeOption(.{});
const parquet = b.dependency("parquet", .{
.target = target,
.optimize = optimize,
});
exe.root_module.addImport("parquet", parquet.module("parquet"));
exe.linkLibrary(parquet.artifact("parquet"));Define your schema at runtime, write rows with typed setters, and read back dynamically:
const std = @import("std");
const parquet = @import("parquet");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Write
{
const file = try std.fs.cwd().createFile("sensors.parquet", .{});
defer file.close();
var writer = try parquet.createFileDynamic(allocator, file);
defer writer.deinit();
// Columns are OPTIONAL by default; use .asRequired() for non-nullable
const TypeInfo = parquet.TypeInfo;
try writer.addColumn("sensor_id", TypeInfo.int32.asRequired(), .{});
try writer.addColumn("timestamp", TypeInfo.timestamp_micros, .{});
try writer.addColumn("temperature", TypeInfo.double_, .{});
try writer.addColumn("location", TypeInfo.string, .{});
writer.setCompression(.zstd);
try writer.begin();
try writer.setInt32(0, 1);
try writer.setInt64(1, 1704067200000000);
try writer.setDouble(2, 23.5);
try writer.setBytes(3, "Building A");
try writer.addRow();
try writer.close();
}
// Read
{
const file = try std.fs.cwd().openFile("sensors.parquet", .{});
defer file.close();
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();
const rows = try reader.readAllRows(0);
defer {
for (rows) |row| row.deinit();
allocator.free(rows);
}
for (rows) |row| {
const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
const temp = if (row.getColumn(2)) |v| v.asDouble() orelse 0 else 0;
std.debug.print("Sensor {}: {d}°C\n", .{ id, temp });
}
}
}Read only a subset of top-level columns (skips I/O for unrequested columns):
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();
// Read only columns 1 and 3 (dense-packed: returned rows have 2 values)
const rows = try reader.readRowsProjected(0, &.{ 1, 3 });
defer {
for (rows) |row| row.deinit();
allocator.free(rows);
}
for (rows) |row| {
const name = if (row.getColumn(0)) |v| v.asBytes() orelse "" else "";
const score = if (row.getColumn(1)) |v| v.asDouble() orelse 0 else 0;
std.debug.print("{s}: {d}\n", .{ name, score });
}Use column statistics to skip row groups that don't match your criteria:
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();
for (0..reader.getNumRowGroups()) |rg| {
const stats = reader.getColumnStatistics(0, rg) orelse continue;
const min_bytes = stats.min_value orelse stats.min orelse continue;
const max_bytes = stats.max_value orelse stats.max orelse continue;
const min = std.mem.readInt(i32, min_bytes[0..4], .little);
const max = std.mem.readInt(i32, max_bytes[0..4], .little);
if (target < min or target > max) continue; // skip this row group
const rows = try reader.readAllRows(rg);
defer {
for (rows) |row| row.deinit();
allocator.free(rows);
}
// ... process matching rows ...
}For files written with page indexes, readRowsFiltered evaluates column-level predicates against per-page min/max and reads only the pages whose ranges might match. Non-matching pages are never fetched from the source and never decompressed:
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();
// Keep rows where col 0 is between 100 and 200 AND col 1 equals the bytes "active".
const filters = [_]parquet.ColumnFilter{
parquet.page_filter.betweenI32(0, 100, 200),
parquet.page_filter.cmpBytes(1, .eq, "active"),
};
const rows = try reader.readRowsFiltered(0, &filters);
defer {
for (rows) |row| row.deinit();
allocator.free(rows);
}The filter is conservative: kept rows come from pages whose stats could match, so you may still need a value-level check post-decode. readRowsFiltered requires every leaf column in the row group to carry an OffsetIndex; use readAllRows for unindexed files.
Writing files with indexes is opt-in:
var writer = try parquet.createFileDynamic(allocator, file);
writer.write_page_index = true;
writer.max_page_size = 64 * 1024; // roll pages at 64 KiB of encoded data
// ... addColumn / begin / rows / close as usualPer-page min/max is captured automatically during multi-page writes for non-dictionary-encoded flat primitive columns (i32/i64/f32/f64). Dictionary-encoded or byte-array columns still emit a valid single-page index with chunk-level stats. Inspect what you wrote with pqi page-index <file>.
Stream through all rows without managing row groups manually. Only one row group's data is held in memory at a time:
var reader = try parquet.openFileDynamic(allocator, file, .{});
defer reader.deinit();
var iter = reader.rowIterator();
defer iter.deinit();
while (try iter.next()) |row| {
const id = if (row.getColumn(0)) |v| v.asInt32() orelse 0 else 0;
std.debug.print("id={}\n", .{id});
}| Parquet Type | Zig Type |
|---|---|
| BOOLEAN | bool |
| INT32 | i32 |
| INT64 | i64 |
| FLOAT | f32 |
| DOUBLE | f64 |
| BYTE_ARRAY | []const u8 |
| FIXED_LEN_BYTE_ARRAY | []const u8 |
| Logical Type | TypeInfo Constant | Physical Storage |
|---|---|---|
| STRING | TypeInfo.string |
BYTE_ARRAY |
| DATE | TypeInfo.date |
INT32 (days since epoch) |
| TIMESTAMP | TypeInfo.timestamp_micros |
INT64 |
| TIME | TypeInfo.time_micros |
INT64 |
| UUID | TypeInfo.uuid |
FIXED_LEN_BYTE_ARRAY(16) |
| INTERVAL | TypeInfo.interval |
FIXED_LEN_BYTE_ARRAY(12) |
| GEOMETRY | TypeInfo.geometry |
BYTE_ARRAY (WKB) |
| GEOGRAPHY | TypeInfo.geography |
BYTE_ARRAY (WKB) |
| DECIMAL | TypeInfo.forDecimal(p, s) |
INT32/INT64/FIXED |
| JSON | TypeInfo.json |
BYTE_ARRAY |
| BSON | TypeInfo.bson |
BYTE_ARRAY |
| ENUM | TypeInfo.enum_ |
BYTE_ARRAY |
Build arbitrary nested schemas at runtime using SchemaNode:
// list<struct<product_id: i32, quantity: i32, price: f64>>
const pid = try writer.allocSchemaNode(.{ .int32 = .{} });
const qty = try writer.allocSchemaNode(.{ .int32 = .{} });
const price = try writer.allocSchemaNode(.{ .double = .{} });
var fields = try writer.allocSchemaFields(3);
fields[0] = .{ .name = try writer.dupeSchemaName("product_id"), .node = pid };
fields[1] = .{ .name = try writer.dupeSchemaName("quantity"), .node = qty };
fields[2] = .{ .name = try writer.dupeSchemaName("price"), .node = price };
const item = try writer.allocSchemaNode(.{ .struct_ = .{ .fields = fields } });
const items = try writer.allocSchemaNode(.{ .list = item });
try writer.addColumnNested("items", items, .{});Supports lists, structs, maps, and arbitrary nesting depth (e.g., list<map<string, list<struct<...>>>>).
See examples/basic/03_nested_types.zig for a complete example.
All major Parquet compression codecs are supported, individually selectable at build time:
| Codec | Default | C Backend | Notes |
|---|---|---|---|
| zstd | Pure Zig | libzstd 1.5.7 | Zig: level-1 compressor + stdlib decompressor |
| gzip | Pure Zig | zlib 1.3.1 | Zig: level-9 deflate compressor + stdlib decompressor |
| snappy | Pure Zig | snappy 1.2.2 (C++) | Full Snappy block format |
| lz4 | Pure Zig | lz4 1.10.0 | Full LZ4 raw block format |
| brotli | Pure Zig | brotli 1.2.0 | Zig: quality-0 compressor + full decompressor |
C backends available via -Dcodecs=c-only for maximum compression performance.
var writer = try parquet.createFileDynamic(allocator, file);
writer.setCompression(.zstd);zig build # all codecs, Zig implementations used by default
zig build -Dcodecs=none # no compression (smallest binary)
zig build -Dcodecs=zig-only # pure Zig codecs only (no C/C++ deps at all)
zig build -Dcodecs=c-only # C/C++ codecs only (opt-in)
zig build -Dcodecs=c-zstd,zstd # both zstd implementations (cross-impl testing)See COMPRESSION.md for build sizes, API details, and the full set of build options.
Set options per column at definition time, or per leaf path for nested types:
// Per-column options via addColumn
try writer.addColumn("timestamp", parquet.TypeInfo.int64, .{
.encoding = .delta_binary_packed,
.compression = .zstd,
});
// Per-leaf options for nested columns via setPathProperties
try writer.addColumnNested("address", struct_node, .{});
try writer.setPathProperties("address.city", .{ .compression = .zstd });
try writer.setPathProperties("address.zip", .{ .use_dictionary = false });Global defaults apply to any column/leaf without an explicit override:
writer.setUseDictionary(false); // disable dictionary encoding globally
writer.setIntEncoding(.delta_binary_packed); // default for int columns
writer.setMaxPageSize(1_048_576); // 1MB page size limit| Feature | Notes | |
|---|---|---|
| Physical Types | ||
| BOOLEAN, INT32, INT64, FLOAT, DOUBLE | ✅ | All primitive types |
| BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY | ✅ | Variable and fixed-length binary |
| INT96 | ✅ | Legacy timestamp support; read always, write via column API only (not DynamicWriter) |
| Encodings | ||
| PLAIN | ✅ | All physical types |
| RLE / BIT_PACKED | ✅ | Levels, dictionary indices, booleans |
| PLAIN_DICTIONARY / RLE_DICTIONARY | ✅ | Strings and integers |
| DELTA_BINARY_PACKED | ✅ | Sorted integers, timestamps |
| DELTA_LENGTH_BYTE_ARRAY | ✅ | Variable-length byte arrays |
| DELTA_BYTE_ARRAY | ✅ | Sorted strings (prefix compression) |
| BYTE_STREAM_SPLIT | ✅ | Float/double/int/fixed columns |
| Compression | ||
| UNCOMPRESSED | ✅ | |
| SNAPPY | ✅ | Pure Zig (default); C++ backend via -Dcodecs=c-snappy or c-only |
| GZIP | ✅ | Pure Zig (default); C backend via -Dcodecs=c-gzip or c-only |
| ZSTD | ✅ | Pure Zig (default); C backend via -Dcodecs=c-zstd or c-only |
| LZ4_RAW | ✅ | Pure Zig (default); C backend via -Dcodecs=c-lz4 or c-only |
| BROTLI | ✅ | Pure Zig (default); C backend via -Dcodecs=c-brotli or c-only |
| LZ4 (non-raw) | ❌ | Hadoop-specific framing format |
| LZO | ❌ | Not implemented |
| Logical Types | ||
| STRING, ENUM, JSON, BSON | ✅ | BYTE_ARRAY with annotation |
| UUID | ✅ | FIXED_LEN_BYTE_ARRAY(16) |
| INT (8/16/32/64, signed/unsigned) | ✅ | Width annotations |
| DECIMAL | ✅ | INT32/INT64/FIXED backing |
| FLOAT16 | ✅ | Half-precision float |
| DATE | ✅ | Days since epoch |
| TIME (MILLIS/MICROS) | ✅ | Time of day |
| TIMESTAMP (MILLIS/MICROS) | ✅ | Instant or local |
| TIME/TIMESTAMP (NANOS) | ✅ | Full read/write support |
| INTERVAL | ✅ | Legacy ConvertedType (months/days/millis) |
| GEOMETRY / GEOGRAPHY | ✅ | GeoParquet 1.1 compatible |
| VARIANT | ⏳ | Future |
| Nested Types | ||
| LIST | ✅ | 3-level structure |
| MAP | ✅ | Key-value pairs |
| Nested structs | ✅ | Arbitrary depth |
| Page Types | ||
| DATA_PAGE (v1) | ✅ | |
| DATA_PAGE_V2 | ✅ | Read only; optimized split decompression |
| DICTIONARY_PAGE | ✅ | |
| Features | ||
| Column projection | ✅ | Read subset of columns; skips I/O for unselected columns |
| Row group filtering | ✅ | Statistics-based; skip row groups via min/max/null_count |
| Streaming iteration | ✅ | Row iterator; one row group in memory at a time |
| Column statistics | ✅ | min/max/null_count |
| Multi-page columns | ✅ | Large column support |
| Multi-row-group files | ✅ | |
| Page Index (read) | ✅ | readRowsFiltered skips pages whose min/max can't match |
| Page Index (write) | ✅ | Flat columns; nested-leaf emission is a follow-up |
| Bloom filters | ⏳ | Planned |
| CRC checksums | ✅ | Page-level CRC32 |
| Encryption | 🔍 | Under review — Java/Python-only ecosystem support |
Legend: ✅ Supported | ⏳ Planned | 🔍 Under review | ❌ Unsupported
Files containing unsupported features return explicit errors rather than silently producing incorrect results.
Both wasm32-wasi and wasm32-freestanding targets are supported. WASI supports all codecs via -Dcodecs=; freestanding builds without compression. See COMPRESSION.md for per-codec WASM binary sizes.
Build for WASI:
cd zig-parquet
zig build -Dwasm_wasi -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_wasi.wasmBuild for browser (freestanding, no compression):
cd zig-parquet
zig build -Dwasm_freestanding -Doptimize=ReleaseSmall
# Output: zig-out/bin/parquet_freestanding.wasmRun with a WASI runtime:
wasmtime --dir=. zig-out/bin/parquet_wasi.wasmSee examples/wasm_demo/ and examples/wasm_freestanding/ for usage examples.
- Zig 0.16.0
- No C compiler required for the default build (
-Dcodecs=alluses pure Zig implementations) - C compiler only needed when opting into C codecs via
-Dcodecs=c-onlyor individual codec names (e.g.-Dcodecs=c-zstd) - C++ compiler only needed for the C Snappy backend
Licensed under either of
at your option.
Contributions welcome! Please read the existing code style and add tests for new functionality.
- Apache Parquet specification
- PyArrow for reference implementation and test file generation