dunnock/pqls

pqls

A command-line tool for listing the contents and metadata of Apache Parquet files and partitioned parquet datasets, modelled on HDF5's h5ls.

Install

curl -fsSL https://github.com/dunnock/pqls/releases/latest/download/install.sh | sh

Or install with cargo:

cargo install pqls

Examples

Inspect a single file:

pqls data.parquet

Detailed stats (per-row-group column min/max/nulls):

pqls -d data.parquet

Dump as CSV:

pqls --csv data.parquet
pqls --csv --head 100 data.parquet

List a partitioned dataset:

pqls /path/to/dataset/
pqls -d -r /path/to/dataset/

Machine-readable output:

pqls -q data.parquet

CLI

pqls [OPTIONS] <PATH> [PATH_B]

ARGS:
  <PATH>            path to a .parquet file or directory to inspect
  [PATH_B]          second .parquet file for schema diff (required by --diff)

OPTIONS:
      --diff                compare schemas of two files; exits 0 if identical, 1 if different
  -d, --detail              show per-row-group column statistics (min/max/nulls)
  -r, --recursive           recurse into a directory and list all .parquet files
      --csv                 dump rows as CSV to stdout
      --head <N>            limit output to the first N rows (applies to --csv and --ndjson)
  -q, --quiet               suppress human-readable headers; emit tab-separated summary lines
      --schema              print schema only (column names and types)
      --json                emit output as JSON (works with --schema, --kv-meta, --check, --partition-stats, --diff)
      --ndjson              stream rows as newline-delimited JSON (NDJSON)
      --sample <N>          emit N randomly-sampled rows; requires --ndjson or --csv
      --columns <COLS>      comma-separated list of column names to project (e.g. id,ts,value)
      --kv-meta             print Parquet key-value metadata (writer version, custom properties)
      --scan-stats          scan the full file to compute per-column min/max/nulls/n_distinct; requires -d
      --partition-stats     aggregate row counts and file sizes across a Hive-partitioned directory; requires -r
      --check               verify file integrity by reading the footer and all row groups
      --deep                with --check: read every data page (slower but catches corrupt column data)
  -h, --help                print help
  -V, --version             print version
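
Several options in the listing above only work together: --scan-stats requires -d, --partition-stats requires -r, and --sample requires --ndjson or --csv. A minimal sketch of wrapper functions that always pass the required companion flag; the function names are hypothetical and not part of pqls, only the flags come from the listing:

```shell
# Hypothetical helper names; only the pqls flags come from the option list.

# Full-scan column statistics: --scan-stats requires -d.
pq_scan_stats() {
  pqls -d --scan-stats "$1"
}

# Aggregate a Hive-partitioned directory as JSON: --partition-stats requires -r.
pq_partition_stats() {
  pqls -r --partition-stats --json "$1"
}

# Random sample of N rows as NDJSON: --sample requires --ndjson or --csv.
# Arguments: <path> <N>
pq_sample() {
  pqls --ndjson --sample "$2" "$1"
}
```

For example, `pq_sample foo.parquet 50` expands to `pqls --ndjson --sample 50 foo.parquet`.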

Why pqls?

Static binary. No JVM, no Python interpreter, no pip install. Drop the binary on any Linux box and it runs. Startup stays under 100 ms, which matters when pqls sits on the critical path of a data pipeline.

Composable. Stdout is always clean (data only; warnings go to stderr). Pipe anywhere:

pqls --csv file.parquet | xsv stats
pqls --schema file.parquet | diff - expected.schema
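
The schema diff above can be wrapped into a reusable drift check. This is a sketch only: the function name schema_matches is hypothetical, and it assumes expected.schema was captured earlier from `pqls --schema` output of a known-good file.

```shell
# Hypothetical helper: compare a file's schema with a saved snapshot.
# Arguments: <parquet file> <expected schema file>
# Returns diff's exit status: 0 if identical, non-zero otherwise.
# The diff itself goes to stderr so stdout stays clean for pipelines.
schema_matches() {
  pqls --schema "$1" | diff - "$2" >&2
}

# Usage:
#   schema_matches file.parquet expected.schema || echo "schema drift" >&2
```

If pqls itself fails, diff sees empty input and reports a mismatch against any non-empty snapshot, so a read error also surfaces as a failure.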

Agent-friendly. Machine-readable --schema --json and --ndjson output let code agents inspect schema and rows without parsing human text. See SKILL.md for patterns.

One-liner install:

curl -fsSL https://github.com/dunnock/pqls/releases/latest/download/install.sh | sh

Fast:

| Tool | Runtime | Startup | Schema dump | Stats | Pipe-composable |
|---|---|---|---|---|---|
| pqls | none (static) | ~50ms | --schema --json | --scan-stats | yes |
| parquet-tools | JVM | ~2s | text only | yes | no |
| DuckDB | C++ binary | ~200ms | SQL only | SQL | no |
| fastparquet | Python | ~500ms | Python API | Python API | no |

How pqls compares

| Feature | pqls | parquet-cli (Apache) | pqrs | DuckDB |
|---|---|---|---|---|
| Static binary, no JVM/Python | yes | no (JAR) | yes | yes |
| --schema --json for agents | yes | no (text only) | no | via SQL |
| NDJSON rows (--ndjson) | yes | no | cat -f json | via SQL |
| Column projection (--columns) | yes | yes | no | via SQL |
| Random sampling (--sample N) | yes | no | yes | ORDER BY random() |
| Key-value metadata (--kv-meta) | yes | footer cmd | no | parquet_kv_metadata() |
| Directory / partition listing | yes | no | no | no |
| SKILL.md for code agents | yes | no | no | no |
| Composable (stdin/stdout clean) | yes | no | partial | no |

pqls is the only static binary in this list that produces JSON schema output and NDJSON rows without requiring SQL. It is designed for shell pipelines and agent tooling where DuckDB's startup time or SQL syntax is overhead.

Agent usage

pqls is designed to be called by code agents (Claude, Codex, Cursor, etc.) without any human at the terminal.

Discover schema

pqls --schema --json /path/to/foo.parquet

Returns a JSON object — safe to parse with jq or Python json.loads. Field logical_type tells you DATE, TIMESTAMP_MICROS, DECIMAL(10,2), etc.

Sample rows to understand data

pqls --ndjson --sample 50 foo.parquet

50 rows, one JSON object per line. Pipe to jq for field inspection.

Project specific columns

pqls --ndjson --columns user_id,amount --sample 20 foo.parquet

Check embedded metadata (Spark / Pandas schema)

pqls --kv-meta --json foo.parquet | jq '.["pandas"]'

Composable pipeline example

# Find which files in a partitioned dataset have more than 1M rows
pqls -q --recursive /data/events/ \
  | awk -F'\t' '$2 > 1000000 { print $1 }'
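
The same tab-separated lines can feed an aggregate. The sketch below totals rows across a dataset. Like the awk filter above, it assumes field 1 is the file path and field 2 is the row count; any further fields in -q output are ignored. The function name total_rows is hypothetical.

```shell
# Sum the row-count column (field 2) of tab-separated summary lines on stdin.
# %.0f prints the total without an exponent, even for large counts.
total_rows() {
  awk -F'\t' '{ total += $2 } END { printf "%.0f\n", total }'
}

# Usage:
#   pqls -q --recursive /data/events/ | total_rows
```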

Exit code contract

Scripts should test $?:

  • 0 — success, output on stdout
  • 1 — file/path error or schema mismatch (with --diff)
  • 2 — corrupt or invalid parquet, or bad flag combination
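
A script can branch on these codes instead of parsing messages. A minimal sketch following the contract above; the function names and label strings are hypothetical, and --check is used only as an example invocation.

```shell
# Hypothetical helper: map a pqls exit status to a short label.
pqls_status_label() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "path-error-or-schema-mismatch" ;;
    2) echo "corrupt-or-bad-flags" ;;
    *) echo "unknown" ;;
  esac
}

# Run an integrity check, report the outcome on stderr,
# and hand the original status back to the caller.
check_file() {
  status=0
  pqls --check "$1" >/dev/null || status=$?
  echo "$1: $(pqls_status_label "$status")" >&2
  return "$status"
}

# Usage:
#   check_file data.parquet || exit $?
```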

License

Licensed under either of MIT or Apache-2.0 at your option.
