A command-line tool for listing the contents and metadata of Apache Parquet files and partitioned parquet datasets, modelled on HDF5's h5ls.
curl -fsSL https://github.com/dunnock/pqls/releases/latest/download/install.sh | sh

cargo install pqls

Inspect a single file:

pqls data.parquet

Detailed stats (per-column min/max/nulls):

pqls -d data.parquet

Dump as CSV:

pqls --csv data.parquet
pqls --csv --head 100 data.parquet

List a partitioned dataset:

pqls /path/to/dataset/
pqls -d -r /path/to/dataset/

Machine-readable output:

pqls -q data.parquet

pqls [OPTIONS] <PATH> [PATH_B]

ARGS:
  <PATH>      path to a .parquet file or directory to inspect
  [PATH_B]    second .parquet file for schema diff (required by --diff)

OPTIONS:
      --diff               compare schemas of two files; exits 0 if identical, 1 if different
  -d, --detail             show per-row-group column statistics (min/max/nulls)
  -r, --recursive          recurse into a directory and list all .parquet files
      --csv                dump rows as CSV to stdout
      --head <N>           limit output to the first N rows (applies to --csv and --ndjson)
  -q, --quiet              suppress human-readable headers; emit tab-separated summary lines
      --schema             print schema only (column names and types)
      --json               emit output as JSON (works with --schema, --kv-meta, --check, --partition-stats, --diff)
      --ndjson             stream rows as newline-delimited JSON (NDJSON)
      --sample <N>         emit N randomly-sampled rows; requires --ndjson or --csv
      --columns <COLS>     comma-separated list of column names to project (e.g. id,ts,value)
      --kv-meta            print Parquet key-value metadata (writer version, custom properties)
      --scan-stats         scan the full file to compute per-column min/max/nulls/n_distinct; requires -d
      --partition-stats    aggregate row counts and file sizes across a Hive-partitioned directory; requires -r
      --check              verify file integrity by reading the footer and all row groups
      --deep               with --check: read every data page (slower but catches corrupt column data)
  -h, --help               print help
  -V, --version            print version

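
Several options above depend on another flag being present (--sample, --scan-stats, --partition-stats, --deep). For wrappers that assemble a pqls command line programmatically, a pre-flight check along the following lines may catch a bad combination before the process is spawned. This is only a sketch: the function names `has_flag` and `validate_pqls_flags` are hypothetical, the rules are copied from the option descriptions above, and pqls itself presumably performs its own validation (it documents an exit code for bad flag combinations).

```shell
#!/bin/sh
# has_flag FLAG ARGS...: succeed if FLAG appears verbatim among ARGS.
# Note: combined short flags such as -dr are NOT recognised by this sketch.
has_flag() {
    want=$1; shift
    for arg in "$@"; do
        [ "$arg" = "$want" ] && return 0
    done
    return 1
}

# validate_pqls_flags ARGS...: return 0 if the documented flag dependencies
# hold, otherwise print the problem to stderr and return 2.
validate_pqls_flags() {
    if has_flag --sample "$@" && ! has_flag --ndjson "$@" && ! has_flag --csv "$@"; then
        echo "--sample requires --ndjson or --csv" >&2
        return 2
    fi
    if has_flag --scan-stats "$@" && ! has_flag -d "$@" && ! has_flag --detail "$@"; then
        echo "--scan-stats requires -d" >&2
        return 2
    fi
    if has_flag --partition-stats "$@" && ! has_flag -r "$@" && ! has_flag --recursive "$@"; then
        echo "--partition-stats requires -r" >&2
        return 2
    fi
    if has_flag --deep "$@" && ! has_flag --check "$@"; then
        echo "--deep only applies with --check" >&2
        return 2
    fi
    return 0
}
```

A caller might run `validate_pqls_flags "$@" && pqls "$@"`, so that an invalid combination is reported without starting pqls at all.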
Static binary. No JVM, no Python interpreter, no pip install. Drop the binary on
any Linux box and it runs — sub-100ms startup on the critical path of a data pipeline.
Composable. Stdout is always clean (data only; warnings go to stderr). Pipe anywhere:
pqls --csv file.parquet | xsv stats
pqls --schema file.parquet | diff - expected.schema

Agent-friendly. Machine-readable --schema --json and --ndjson output let code
agents inspect schema and rows without parsing human text. See SKILL.md for patterns.
One-liner install:
curl -fsSL https://github.com/dunnock/pqls/releases/latest/download/install.sh | sh

Fast:

| Tool | Runtime | Startup | Schema dump | Stats | Pipe-composable |
|---|---|---|---|---|---|
| pqls | none (static) | ~50ms | --schema --json | --scan-stats | yes |
| parquet-tools | JVM | ~2s | text only | yes | no |
| DuckDB | native binary (C++) | ~200ms | SQL only | SQL | no |
| fastparquet | Python | ~500ms | Python API | Python API | no |

| Feature | pqls | parquet-cli (Apache) | pqrs | DuckDB |
|---|---|---|---|---|
| Static binary, no JVM/Python | yes | no (JAR) | yes | yes |
| --schema --json for agents | yes | no (text only) | no | via SQL |
| NDJSON rows (--ndjson) | yes | no | cat -f json | via SQL |
| Column projection (--columns) | yes | yes | no | via SQL |
| Random sampling (--sample N) | yes | no | yes | ORDER BY random() |
| Key-value metadata (--kv-meta) | yes | footer cmd | no | parquet_kv_metadata() |
| Directory / partition listing | yes | no | no | no |
| SKILL.md for code agents | yes | no | no | no |
| Composable (stdin/stdout clean) | yes | no | partial | no |

pqls is the only static binary in this list that produces JSON schema output and NDJSON rows without requiring SQL. It is designed for shell pipelines and agent tooling where DuckDB's startup time or SQL syntax is overhead.
pqls is designed to be called by code agents (Claude, Codex, Cursor, etc.) without any human at the terminal.
pqls --schema --json /path/to/foo.parquet

Returns a JSON object that is safe to parse with jq or Python's json.loads.
The logical_type field tells you DATE, TIMESTAMP_MICROS, DECIMAL(10,2), etc.

pqls --ndjson --sample 50 foo.parquet

50 rows, one JSON object per line. Pipe to jq for field inspection.

pqls --ndjson --columns user_id,amount --sample 20 foo.parquet

pqls --kv-meta --json foo.parquet | jq '.["pandas"]'

# Find which files in a partitioned dataset have more than 1M rows
pqls -q --recursive /data/events/ \
  | awk -F'\t' '$2 > 1000000 { print $1 }'

Scripts should test $?:

- 0: success, output on stdout
- 1: file/path error or schema mismatch (with --diff)
- 2: corrupt or invalid parquet, or bad flag combination

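
One caveat when combining the exit codes above with a pipeline: in POSIX sh, `$?` after `pqls ... | awk ...` is the status of awk, not of pqls. A minimal sketch of one workaround is to capture the output first, check the status, and only then filter. The function names (`describe_pqls_status`, `files_over`, `large_files`) are hypothetical, and the assumption that column 1 is the path and column 2 the row count is taken from the awk example above rather than from a documented format.

```shell
#!/bin/sh
# Map a pqls exit status to the meanings listed above.
describe_pqls_status() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "path error or schema mismatch" ;;
        2) echo "corrupt parquet or bad flag combination" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# files_over MIN: read tab-separated summary lines on stdin and print the
# first column of every line whose second column exceeds MIN.
files_over() {
    awk -F'\t' -v min="$1" '$2 + 0 > min + 0 { print $1 }'
}

# large_files MIN CMD...: run CMD (normally a pqls -q invocation), and only
# if it exits 0 pass its captured stdout through files_over.
# On failure, explain the status on stderr and return it unchanged.
large_files() {
    min=$1; shift
    status=0
    out=$("$@") || status=$?
    if [ "$status" -ne 0 ]; then
        echo "pqls failed: $(describe_pqls_status "$status")" >&2
        return "$status"
    fi
    printf '%s\n' "$out" | files_over "$min"
}
```

With these definitions, `large_files 1000000 pqls -q --recursive /data/events/` would behave like the awk pipeline above but propagate a non-zero pqls status to the caller. Capturing into a variable buffers the whole listing in memory, which is usually acceptable for summary lines but not for row dumps.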
Licensed under either of MIT or Apache-2.0 at your option.