quodlibetor/s3glob

s3glob

s3glob is a fast AWS S3 listing tool that largely follows standard Unix glob syntax.

In my experience (on an EC2 instance) s3glob can list tens of millions of files in about 5 seconds, whereas I gave up on aws s3 ls after 5 minutes.

s3glob in action

Status

s3glob is basically complete. It does all the things I need. If you have any feature requests or bug reports please open an issue.

Usage

These two commands are equivalent:

s3glob ls "s3://my-bucket/a*/something/1*/other/*"
s3glob ls      "my-bucket/a*/something/1*/other/*"

Output is in the same format as aws s3 ls, but you can change it with the --format flag. For example, this will output just the s3://<bucket>/<key> for each object:

s3glob ls -f "{uri}" "s3://my-bucket/a*/something/1*/other/*"

You can also download objects:

s3glob dl "s3://my-bucket/a*/something/1*/other/*" my-local-dir

Local files will always be unique (two objects with the same filename won't stomp on each other). See s3glob dl --help to configure exactly how local paths are created.

Installation

Install prebuilt binaries via shell script

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/quodlibetor/s3glob/releases/latest/download/s3glob-installer.sh | sh

Install prebuilt binaries via powershell script

powershell -ExecutionPolicy ByPass -c "irm https://github.com/quodlibetor/s3glob/releases/latest/download/s3glob-installer.ps1 | iex"

Install prebuilt binaries via Homebrew

brew install quodlibetor/tap/s3glob

Syntax

Glob syntax supported:

  • * matches any number of non-delimiter characters. The default delimiter is /.
  • ? matches any single character. By default this includes the delimiter; pass --no-cross-delim to restrict ? to a single segment.
  • [abc]/[!abc] matches any single character in/not in the set. By default the negated form [!abc] may also match the delimiter; pass --no-cross-delim to keep it single-segment.
  • [a-z]/[!a-z] matches any single character in/not in the range, with the same --no-cross-delim rule for the negated form.
  • {a,b,c} matches any of the comma-separated options (but nested globs are not supported). Empty alternatives are allowed: {a,} matches either a or the empty string.
  • ** matches any number of characters, including the delimiter. At a **, s3glob discovers sub-prefixes via a bounded breadth-first walk so it can list them in parallel; if your bucket shape isn't suited to that, pass --no-recursive-auto-parallel to skip the walk.
  • A pattern (or any brace alternative) ending in / implicitly matches everything inside that directory: s3glob ls 'foo/' lists every object under foo/.
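The rules above can be read as a translation from glob to regular expression. The following is a minimal Python sketch of that reading, not s3glob's actual implementation: the function name glob_to_regex and the cross_delim parameter (standing in for --no-cross-delim) are made up for illustration, brace alternatives are treated as plain literals, and a trailing delimiter is only handled at the end of the whole pattern.

```python
import re


def glob_to_regex(pattern: str, delimiter: str = "/", cross_delim: bool = True) -> re.Pattern:
    """Translate a glob into a regex, following the syntax rules listed above.

    Assumptions of this sketch: brace alternatives are literals, and
    cross_delim=False plays the role of --no-cross-delim.
    """
    d = re.escape(delimiter)
    if pattern.endswith(delimiter):
        # A trailing delimiter means "everything inside this directory".
        pattern += "**"
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("**", i):
            out.append(".*")  # any characters, including the delimiter
            i += 2
        elif ch == "*":
            out.append(f"[^{d}]*")  # any run of non-delimiter characters
            i += 1
        elif ch == "?":
            out.append("." if cross_delim else f"[^{d}]")
            i += 1
        elif ch == "[":
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end].replace("\\", "\\\\").replace("^", "\\^")
            if body.startswith("!"):
                # Negated class: optionally also exclude the delimiter.
                extra = "" if cross_delim else d
                out.append(f"[^{extra}{body[1:]}]")
            else:
                out.append(f"[{body}]")
            i = end + 1
        elif ch == "{":
            end = pattern.index("}", i + 1)
            alternatives = pattern[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return re.compile("".join(out), re.DOTALL)
```

With this sketch, glob_to_regex("a*/b").fullmatch("a/x/b") fails because * stops at the delimiter, while glob_to_regex("a?b").fullmatch("a/b") succeeds unless cross_delim=False is passed.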

Differences from standard glob and globset

s3glob's syntax overlaps with traditional Unix glob, but with a few intentional deviations driven by the S3-listing model:

  • ** works anywhere, not only as a path component. Most glob implementations require ** to stand alone between delimiters (e.g. a/**/b). In s3glob, ** compiles to "any chars including the delimiter" wherever it appears: a**b matches a/x/y/b, and {x,y}** matches anything starting with x or y.
  • Negated character classes and ? can be made single-segment. By default ? matches any character, and [!a] matches any non-a character, in both cases including the delimiter. Pass --no-cross-delim (or set S3GLOB_CROSS_DELIM=false) to restrict both to a single segment. A future major version will flip this default to single-segment.
  • Empty brace alternatives are first-class. {a,}, {,a}, and {a,,b} are all valid; the empty alt matches the empty string. Many glob implementations reject these.
  • Trailing / is "match everything inside this directory". Pattern foo/ is internally rewritten to foo/**-equivalent.
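One way to picture the empty-alternative rule is to expand a brace group into the set of plain patterns it stands for. The Python sketch below does that for non-nested braces only; expand_braces is a hypothetical helper written for this illustration and says nothing about how s3glob handles braces internally.

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand non-nested {a,b,c} groups into every pattern they stand for."""
    # The capture group keeps the brace groups in the split result,
    # e.g. "x{a,}y" -> ["x", "{a,}", "y"].
    parts = re.split(r"(\{[^{}]*\})", pattern)
    choices = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            # An empty alternative becomes the empty string.
            choices.append(part[1:-1].split(","))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


print(expand_braces("{a,,b}"))   # ['a', '', 'b']
print(expand_braces("{x,y}**"))  # ['x**', 'y**']
```

Seen this way, {x,y}** is just the two patterns x** and y**, which is why it matches anything starting with x or y.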

Algorithm and performance implications

The tl;dr is that, up until the point a pattern has a ** in it, s3glob will search within directories filtering by any constants in the pattern to reduce the number of objects that need to be scanned:

  • fastest: bucket/a*/b*/**
  • fast: bucket/*a*/*b*/**
  • full scan: bucket/**a**/b**

AWS S3 allows us to enumerate objects within a prefix, but it does not natively support any filtering beyond that prefix. s3glob works around this by enumerating prefixes and matching them recursively against the provided glob pattern.
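To make the prefix-narrowing idea concrete, here is a small Python simulation over an in-memory list of keys. It is a sketch under assumptions, not s3glob's code: list_level imitates a delimiter-aware LIST call, glob_walk and literal_prefix are hypothetical names, and only *, ? and [...] are handled (no ** and no braces).

```python
import fnmatch

GLOB_CHARS = "*?["


def literal_prefix(segment: str) -> str:
    """Constant characters before the first glob metacharacter in a segment."""
    for i, ch in enumerate(segment):
        if ch in GLOB_CHARS:
            return segment[:i]
    return segment


def list_level(keys, prefix, delimiter="/"):
    """Imitate a delimiter-aware LIST: (sub-prefixes, objects) directly under prefix."""
    sub_prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        head, sep, _ = key[len(prefix):].partition(delimiter)
        if sep:
            sub_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(sub_prefixes), objects


def glob_walk(keys, pattern, delimiter="/"):
    """Match one path segment at a time, listing only under matching prefixes."""
    *dir_segments, last = pattern.split(delimiter)
    prefixes = [""]
    for seg in dir_segments:
        matched = []
        for parent in prefixes:
            # The constant start of the segment narrows the LIST request.
            subs, _ = list_level(keys, parent + literal_prefix(seg), delimiter)
            for sub in subs:
                name = sub[len(parent):-len(delimiter)]
                if fnmatch.fnmatchcase(name, seg):
                    matched.append(sub)
        prefixes = matched
    results = []
    for parent in prefixes:
        _, objects = list_level(keys, parent + literal_prefix(last), delimiter)
        results.extend(k for k in objects if fnmatch.fnmatchcase(k[len(parent):], last))
    return results


keys = ["a1/b1/x.txt", "a1/c1/x.txt", "a2/b2/y.txt", "z/b1/x.txt"]
print(glob_walk(keys, "a*/b*/*.txt"))  # ['a1/b1/x.txt', 'a2/b2/y.txt']
```

A pattern such as a*/b* gives each level a constant start to list under, whereas *a*/*b* has an empty literal prefix and must list every sub-prefix at that level before filtering.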

I have seen s3glob list hundreds of thousands of objects in a couple of seconds from within an EC2 instance.

A ** is where prefix-narrowing ends: segments after it can't be turned into prefix filters. s3glob still tries to parallelize the recursive listing itself, though. At the **, it walks one directory level at a time with delimiter-aware LIST calls and then scans the discovered sub-prefixes concurrently. For buckets with a broad subtree under ** this turns a single-stream recursive list into a parallel scan. For buckets with extremely wide trees it can cost extra LIST calls without much payoff. Pass --no-recursive-auto-parallel to force ** to immediately become a serial list.
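The shape of that walk-then-fan-out strategy can be sketched in Python against an in-memory key list. Everything here is an assumption made for illustration: the function names, the max_depth and max_prefixes bounds, and the use of a thread pool are not taken from s3glob, which may bound and schedule the walk quite differently.

```python
from concurrent.futures import ThreadPoolExecutor


def list_level(keys, prefix, delimiter="/"):
    """Imitate a delimiter-aware LIST: (sub-prefixes, objects) directly under prefix."""
    sub_prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        head, sep, _ = key[len(prefix):].partition(delimiter)
        if sep:
            sub_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(sub_prefixes), objects


def list_recursive(keys, prefix):
    """Imitate a plain recursive LIST with no delimiter."""
    return [k for k in keys if k.startswith(prefix)]


def discover(keys, prefix, max_depth=2, max_prefixes=64):
    """Bounded breadth-first walk: returns (prefixes left to scan, objects already seen)."""
    frontier, objects = [prefix], []
    for _ in range(max_depth):
        next_frontier = []
        for p in frontier:
            subs, objs = list_level(keys, p)
            objects.extend(objs)
            next_frontier.extend(subs)
        frontier = next_frontier
        # Stop early once there is nothing left or enough prefixes to fan out over.
        if not frontier or len(frontier) >= max_prefixes:
            break
    return frontier, objects


def parallel_recursive_list(keys, prefix, workers=8):
    """Discover sub-prefixes first, then scan each of them concurrently."""
    frontier, objects = discover(keys, prefix)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(lambda p: list_recursive(keys, p), frontier):
            objects.extend(chunk)
    return sorted(objects)
```

The trade-off described above shows up in discover: every level walked costs one extra LIST per prefix, which only pays off if the resulting frontier is large enough to keep the workers busy.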

What this means in general is that, if you have a keyspace that looks like:

2000_01_01-2024_12_31/a-z/0-999/OBJECT_ID.txt

where each - stands for the full range of values between its endpoints, then you can roughly determine how many objects s3glob will need to list by multiplying the number of values in each range. Adding a filter can reduce that number.

Some example approximate numbers:

Pattern | Approximate number of objects | Reason
s3glob ls 2000_01_01/a/*/OBJECT_ID.txt | 1,000 | 0-999 = 1000
s3glob ls 2000_01_01/[abc]/*/OBJECT_ID.txt | 3,000 | (a + b + c) * 0-999 = 3 * 1000
s3glob ls 2000_01_01/*/*/OBJECT_ID.txt | 26,000 | a-z * 0-999 = 26 * 1000
s3glob ls 2000_01_01/[!xyz]/*/OBJECT_ID.txt | 23,026 | list all of a-z = 26, filter out x, y, z = 23, then 23 * 1000 = 23,000 (plus the 26 prefixes listed)
s3glob ls 2000_01_*/*/*/OBJECT_ID.txt | 806,000 | 01-31 * a-z * 0-999 = 31 * 26 * 1000
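The arithmetic behind these estimates is a product of range widths. The short Python sketch below reproduces the figures; the constant names are invented here, and the [!xyz] case reads 23,026 as the 26 listed prefixes plus 23,000 objects, which is an interpretation of the table rather than something it states outright.

```python
from math import prod

DAYS_IN_JANUARY = 31  # 2000_01_01 through 2000_01_31
LETTERS = 26          # a-z
NUMBERS = 1000        # 0-999

estimates = {
    "2000_01_01/a/*/OBJECT_ID.txt": prod([1, 1, NUMBERS]),
    "2000_01_01/[abc]/*/OBJECT_ID.txt": prod([1, 3, NUMBERS]),
    "2000_01_01/*/*/OBJECT_ID.txt": prod([1, LETTERS, NUMBERS]),
    # Negation cannot narrow the LIST, so all 26 letter prefixes are listed
    # first, and only the 23 survivors are descended into.
    "2000_01_01/[!xyz]/*/OBJECT_ID.txt": LETTERS + prod([1, LETTERS - 3, NUMBERS]),
    "2000_01_*/*/*/OBJECT_ID.txt": prod([DAYS_IN_JANUARY, LETTERS, NUMBERS]),
}

for pattern, count in estimates.items():
    print(f"{pattern}: {count:,}")
```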

Copying

All code is available under the MIT or Apache 2.0 license, at your option.

Development

Performing a release

Ensure git-cliff and cargo-release are both installed (run mise install to get them) and run cargo release [patch|minor].

If things look good, run again with --execute.
