diff --git a/CHANGELOG.md b/CHANGELOG.md index c3d9b53..6595c8b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,12 @@ > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads. +## 1.1.0 (2026-06-09) + +RSpec tests: 1,038 → 1,070 + +- New `SmarterJSON.foreach(source)` — the streaming, composable sibling of `process_file`. `source` is a file path or an IO (a socket, `StringIO`, open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, never loading the whole file, so a large NDJSON / JSONL stream can be filtered or transformed with `.select` / `.map` / `.lazy` / `.first`; with a block it streams and returns the document count, like `process_file`. + ## 1.0.0 (2026-06-08) RSpec tests: 1,038 diff --git a/README.md b/README.md index 54bc4d2..bfa3ce5 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,17 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL, > **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.** +## Features at a glance + +- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list). +- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`). 
+- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over. +- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `` tags, and pulls every payload out of one messy blob. +- **Writes JSON too** — `generate` with pretty-printing, NDJSON, `sort_keys`, `ascii_only`, `script_safe`, `allow_nan`, and `coerce` (via `as_json`); iterative, so deeply nested data is depth-safe. +- **Keeps number precision** — `BigDecimal` by default (Oj-compatible), or `:float` / `:auto`. +- **Transparent leniency** — pass an optional `on_warning` callback to be handed every lenient fix (an empty slot collapsed, a duplicate key dropped, a code fence stripped, …); with no handler the parser stays silent and adds zero overhead. +- **Fast, and runs everywhere** — a C extension that matches or beats Oj, with a pure-Ruby fallback for platforms that can't build it. Stable, semantically versioned, thread-safe, Ruby 2.6+. + ## Why SmarterJSON? **Are you tired of seeing errors like these?** @@ -73,13 +84,15 @@ It raises only on genuinely unreadable input (unterminated string, mismatched br The lenient grammar is a superset of these human-JSON specs — listed once, here: * [JSON5](https://json5.org/) -* [HJSON](https://hjson.github.io/) +* [HJSON](https://hjson.github.io/) * [JWCC / HuJSON](https://github.com/tailscale/hujson) * [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html) * [JSONH](https://github.com/jsonh-org/Jsonh) * [JSONC (VS Code)](https://jsonc.org/) * [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464). + A deliberate subset. SmarterJSON's quoteless (unquoted) string values are single-line — it does **not** parse HJSON's unquoted multi-line strings; use a quoted or triple-quoted (`'''…'''`) string for multiline. 
This is by design: SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others — and an unquoted string that may span newlines collides with newline-as-a-document-separator (NDJSON, implicit-root config), so it is left out. + ## Installation ```ruby @@ -130,7 +143,7 @@ See [Examples](#examples) below for multi-document input, streaming, and recover ## Stable interface & thread safety -The public interface is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface. +The supported public interface consists of `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.foreach`, `SmarterJSON.generate`, and the documented options in this README/docs. `SmarterJSON.process` and `SmarterJSON.process_file` always return an `Array` of documents; `process_one` returns the single document's value (or `nil`), and emits a warning if there is more than one doc. Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls. @@ -254,6 +267,35 @@ SmarterJSON.process_file("#{Dir.home}/.claude/projects//.js end ``` +### Filtering and rewriting a large file (`foreach`) + +`SmarterJSON.foreach(source)` is the composable sibling of `process_file`. `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). With no block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so you can chain `.select` / `.map` and friends. 
Add `.lazy` to keep the whole chain bounded in memory, even when the filtered set is large: + +```ruby +# Keep only the user/assistant turns of a transcript — one document in memory at a time +SmarterJSON.foreach("session.jsonl", symbolize_keys: true) + .lazy + .select { |doc| %w[user assistant].include?(doc[:type]) } + .each { |doc| puts doc[:text] } +``` + +Because it streams both ends, you can **filter a big file down and rewrite it** without ever loading the whole thing: + +```ruby +File.open("filtered.jsonl", "w") do |out| + SmarterJSON.foreach("session.jsonl", symbolize_keys: true) + .lazy + .select { |doc| %w[user assistant].include?(doc[:type]) } + .each { |doc| out.puts SmarterJSON.generate(doc) } +end +``` + +Pass an IO instead of a path to stream straight from a socket or an HTTP response body — anything `IO`-like works (an IO is single-pass, read once): + +```ruby +SmarterJSON.foreach(response_io).each { |event| handle(event) } +``` + ### Recovering JSON from LLM / markdown noise When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.) diff --git a/docs/basic_read_api.md b/docs/basic_read_api.md index 2ab5308..6db35ad 100644 --- a/docs/basic_read_api.md +++ b/docs/basic_read_api.md @@ -103,6 +103,28 @@ SmarterJSON.process(io) { |doc| handle(doc) } The streaming path now frames whole top-level documents, not just one line at a time. That means NDJSON / JSONL still work, but pretty-printed multi-line objects and arrays work too, as do mixed `\n` / `\r\n` / `\r` line endings and comment-only separators between documents. +## `SmarterJSON.foreach` — stream a file or IO, composably + +`foreach` is the composable sibling of `process_file`. Its argument is a **file path or any IO** (a socket, a `StringIO`, an open `File`); a String is always a path, never content. 
+ +With a block it behaves exactly like the block form above — streams each document, returns the **document count**. Without a block it returns a plain `Enumerator` (like `CSV.foreach` — **not** an `Enumerator::Lazy`), so `.map` / `.select` return Arrays the usual way, and you can chain over the stream: + +```ruby +SmarterJSON.foreach("events.ndjson").each { |event| EventJob.perform_async(event) } # like the block form +SmarterJSON.foreach("events.ndjson").select { |e| e["level"] == "error" } # => an Array of the matches +``` + +It reads one document at a time, so `foreach(path).first(3)` only reads ~3 documents off disk, and `.next` pulls them one by one. `.map` / `.select` read the source lazily but still build an Array of their *result*; to keep a whole pipeline bounded end to end (a large filtered set off a fat file), add `.lazy` at the call site: + +```ruby +SmarterJSON.foreach("session.jsonl", symbolize_keys: true) + .lazy + .select { |doc| %w[user assistant].include?(doc[:type]) } + .each { |doc| puts doc[:text] } +``` + +Options are validated eagerly — a bad option key or value raises immediately, before any iteration. An **IO source is single-pass** (an IO can only be read once), so iterating the returned Enumerator a second time over the same IO yields nothing; a path-backed `foreach` re-opens the file and is re-iterable. + ## The C extension and the pure-Ruby fallback By default (`acceleration: true`) the C extension is used when it is compiled and loadable (`SmarterJSON::HAS_ACCELERATION` is then `true`); otherwise the pure-Ruby implementation runs and produces identical results. Pass `acceleration: false` to force the pure-Ruby path. See [Configuration Options](./options.md). 
diff --git a/docs/examples.md b/docs/examples.md index 8881f47..f9fdc6d 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -94,6 +94,17 @@ SmarterJSON.process_file("#{Dir.home}/.claude/projects//.js end ``` +**Filter and rewrite as a stream — `SmarterJSON.foreach`:** `foreach(source)` is the composable sibling of `process_file`; `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so it chains with `.select` / `.map`; add `.lazy` to keep the whole pipeline bounded in memory. This filters a transcript down to its user/assistant turns and writes a smaller file, never loading all of it: + +```ruby +File.open("filtered.jsonl", "w") do |out| + SmarterJSON.foreach("session.jsonl", symbolize_keys: true) + .lazy + .select { |doc| %w[user assistant].include?(doc[:type]) } + .each { |doc| out.puts SmarterJSON.generate(doc) } +end +``` + ### Example 6: Symbolize Keys ```ruby diff --git a/lib/smarter_json/parser.rb b/lib/smarter_json/parser.rb index df1fea2..6ccfa51 100644 --- a/lib/smarter_json/parser.rb +++ b/lib/smarter_json/parser.rb @@ -57,6 +57,41 @@ def process_file(path, options = {}, &block) end end + # SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of + # process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a + # plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way + # and return an Array. + # + # `source` is a file path (opened and streamed from disk, like process_file) OR an + # IO — a socket, a StringIO, an open File — streamed directly from its current + # position. A String is always a path, never content. An IO source is single-pass: + # it can only be read once, so iterating the returned Enumerator a second time over + # the same IO yields nothing. 
+ # + # Without a block: returns an Enumerator over each top-level document, reading one + # document at a time via readpartial — it never slurps the whole file the way + # process_file(path) does. So foreach(path).first(3) reads only ~3 documents off + # disk, and foreach(src).each { … } / .next stream in bounded memory. .map / .select + # read the source one document at a time but still build an Array of their result; + # for a chain that stays bounded end to end (a large filtered set off a fat file) + # opt into .lazy at the call site: foreach(src).lazy.select { … }.each { … }. + # + # With a block: streams each document and returns the document count — identical + # to process_file(path) { |doc| … } (or process(io) { |doc| … } for an IO). + # + # Options are validated eagerly (before the Enumerator is returned), so a bad + # option key or value fails fast rather than on first iteration. + def foreach(source, options = {}, &block) + options = Options.process_options(options) + return enum_for(:foreach, source, options) unless block + + if source.respond_to?(:read) # an IO (socket, StringIO, open File) — stream it directly + stream_io(source, options, &block) + else # a path — open the file and stream from disk + process_file(source, options, &block) + end + end + # SmarterJSON.process_one(input, options = {}) — the single-document accessor. # # Returns the first document's value (or nil when the input holds no documents). 
diff --git a/lib/smarter_json/version.rb b/lib/smarter_json/version.rb index 3d9352e..4ea7698 100644 --- a/lib/smarter_json/version.rb +++ b/lib/smarter_json/version.rb @@ -1,5 +1,5 @@ # frozen_string_literal: true module SmarterJSON - VERSION = "1.0.0" + VERSION = "1.1.0" end diff --git a/spec/foreach_spec.rb b/spec/foreach_spec.rb new file mode 100644 index 0000000..82bd6bc --- /dev/null +++ b/spec/foreach_spec.rb @@ -0,0 +1,142 @@ +# frozen_string_literal: true + +require "smarter_json" +require "tempfile" +require "stringio" + +# The contract for SmarterJSON.foreach(source) — the streaming, composable sibling of +# process_file. `source` is a file path or an IO (StringIO / socket / open File). It +# mirrors the stdlib convention (CSV.foreach / File.foreach): +# +# foreach(source) -> a plain Enumerator over each top-level document +# foreach(source) { |doc| ... } -> streams each document, returns the document count +# +# Eager : process_file(path) reads the whole file and returns an Array of documents. +# Streaming: foreach(path) reads one document at a time from disk (never slurping +# the whole file). The no-block form returns a plain Enumerator (NOT lazy), +# so .map / .select return Arrays and .first(n) / .next stay bounded; add +# .lazy at the call site for a chain that's bounded end to end. +RSpec.describe SmarterJSON, ".foreach" do + let(:fixtures_dir) { File.expand_path("fixtures", __dir__) } + let(:ndjson) { File.join(fixtures_dir, "multi_doc.ndjson") } # => [{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }] + + # Parity harness: every example runs on the C path and the pure-Ruby path. 
+ [true, false].each do |acceleration| + context "acceleration: #{acceleration}" do + it "without a block returns an Enumerator" do + expect(SmarterJSON.foreach(ndjson, acceleration: acceleration)).to be_a(Enumerator) + end + + it "returns a plain Enumerator (not lazy), so .map / .select return Arrays — like CSV.foreach" do + enum = SmarterJSON.foreach(ndjson, acceleration: acceleration) + expect(enum).not_to be_a(Enumerator::Lazy) + mapped = SmarterJSON.foreach(ndjson, acceleration: acceleration).map { |doc| doc["id"] } + expect(mapped).to be_a(Array) # not an Enumerator::Lazy needing .to_a / .force + expect(mapped).to eq([1, 2, 3]) + end + + it "the Enumerator yields the same documents process_file returns (eager/lazy parity)" do + eager = SmarterJSON.process_file(ndjson, acceleration: acceleration) + lazy = SmarterJSON.foreach(ndjson, acceleration: acceleration).to_a + expect(lazy).to eq(eager) + expect(lazy).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]) + end + + it "with a block streams each document and returns the document count" do + out = [] + rv = SmarterJSON.foreach(ndjson, acceleration: acceleration) { |doc| out << doc } + expect(out).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]) + expect(rv).to eq(out.length) # same return contract as process_file's block form + end + + it "composes — select / map run over the stream" do + ids = SmarterJSON.foreach(ndjson, acceleration: acceleration) + .select { |doc| doc["id"].odd? 
} + .map { |doc| doc["id"] } + expect(ids).to eq([1, 3]) + end + + it "supports external iteration with .next and raises StopIteration past the end" do + enum = SmarterJSON.foreach(ndjson, acceleration: acceleration) + expect(enum.next).to eq({ "id" => 1 }) + expect(enum.next).to eq({ "id" => 2 }) + expect(enum.next).to eq({ "id" => 3 }) + expect { enum.next }.to raise_error(StopIteration) + end + + it "streams from disk instead of slurping the whole file (bounded memory)" do + # The eager process_file loads the entire file with File.read; foreach must + # not — it reads incrementally via File.open/readpartial, so a huge file costs + # the same memory as a small one. Pin that mechanism: foreach yields every + # document without ever calling File.read. (Eager and lazy return identical + # documents — the parity test above — so the difference is memory, not result.) + expect(File).not_to receive(:read) + docs = SmarterJSON.foreach(ndjson, acceleration: acceleration).to_a + expect(docs).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]) + end + + it "passes options through (symbolize_keys)" do + docs = SmarterJSON.foreach(ndjson, symbolize_keys: true, acceleration: acceleration).to_a + expect(docs).to eq([{ id: 1 }, { id: 2 }, { id: 3 }]) + end + + it "validates options eagerly (fails fast, before any iteration)" do + expect { SmarterJSON.foreach(ndjson, bogus: 1, acceleration: acceleration) } + .to raise_error(ArgumentError, /unknown option/) + end + + it "yields nothing for an empty file" do + Tempfile.create(["empty", ".ndjson"]) do |f| + expect(SmarterJSON.foreach(f.path, acceleration: acceleration).to_a).to eq([]) + expect(SmarterJSON.foreach(f.path, acceleration: acceleration) { |_| }).to eq(0) + end + end + + it "yields a single document for a one-document file" do + Tempfile.create(["one", ".ndjson"]) do |f| + f.write(%({"only": true})) + f.flush + expect(SmarterJSON.foreach(f.path, acceleration: acceleration).to_a).to eq([{ "only" => true }]) + end + end + 
+ # foreach also accepts an IO (a socket, a StringIO, an open File) — the same + # source process(io) { … } streams, but composable. A String argument is always + # a path (like process_file); an IO is streamed directly. + it "accepts an IO (StringIO) and yields each document" do + io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}\n)) + expect(SmarterJSON.foreach(io, acceleration: acceleration).to_a) + .to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]) + end + + it "streams an IO with a block and returns the document count" do + io = StringIO.new(%({"a":1}\n{"b":2}\n)) + out = [] + rv = SmarterJSON.foreach(io, acceleration: acceleration) { |doc| out << doc } + expect(out).to eq([{ "a" => 1 }, { "b" => 2 }]) + expect(rv).to eq(2) + end + + it "composes (.select / .map) over an IO" do + io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}\n)) + odds = SmarterJSON.foreach(io, acceleration: acceleration).select { |d| d["id"].odd? }.map { |d| d["id"] } + expect(odds).to eq([1, 3]) + end + + it "accepts an open File handle as an IO" do + io = File.open(ndjson, "r:UTF-8") + expect(SmarterJSON.foreach(io, acceleration: acceleration).to_a) + .to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]) + ensure + io&.close + end + + it "an IO source is single-pass (an IO can only be read once)" do + io = StringIO.new(%({"id":1}\n{"id":2}\n)) + enum = SmarterJSON.foreach(io, acceleration: acceleration) + expect(enum.to_a).to eq([{ "id" => 1 }, { "id" => 2 }]) # first pass drains the IO + expect(enum.to_a).to eq([]) # IO now at EOF — nothing left + end + end + end +end
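
The block / no-block contract that the spec above pins down can be sketched with the stdlib alone. This is a minimal illustration, not SmarterJSON's implementation: `ToyNDJSON` and its `stream` helper are hypothetical names, and the stand-in only parses strict, one-document-per-line NDJSON via the stdlib `json` gem. It shows the same `enum_for` pattern the diff uses: a plain `Enumerator` without a block, a document count with a block, and single-pass behavior for an IO source.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for illustration only; NOT SmarterJSON's parser.
# Assumes strict NDJSON: exactly one JSON document per non-blank line.
module ToyNDJSON
  # With a block: yields each document and returns the document count.
  # Without a block: returns a plain Enumerator (not an Enumerator::Lazy).
  def self.foreach(source, &block)
    return enum_for(:foreach, source) unless block

    if source.respond_to?(:read) # an IO: stream from its current position
      stream(source, &block)
    else                         # anything else is treated as a file path
      File.open(source, "r:UTF-8") { |io| stream(io, &block) }
    end
  end

  # Reads line by line, so only one document is parsed at a time.
  def self.stream(io)
    count = 0
    io.each_line do |line|
      next if line.strip.empty?

      yield JSON.parse(line)
      count += 1
    end
    count
  end
end

ndjson = %({"id":1}\n{"id":2}\n{"id":3}\n)

# No block: a plain Enumerator, so .select / .map return Arrays.
enum = ToyNDJSON.foreach(StringIO.new(ndjson))
odd_ids = enum.select { |doc| doc["id"].odd? }.map { |doc| doc["id"] }

# The IO is now at EOF, so a second pass over the same Enumerator is empty.
second_pass = enum.to_a

# Block form: returns the number of documents streamed.
count = ToyNDJSON.foreach(StringIO.new(ndjson)) { |_doc| }

# Opting into .lazy keeps the whole chain one document at a time.
first_odd = ToyNDJSON.foreach(StringIO.new(ndjson))
                     .lazy
                     .select { |doc| doc["id"].odd? }
                     .map { |doc| doc["id"] }
                     .first(1)
```

Returning `enum_for` only when no block is given is what lets one method serve both call styles; the real `foreach` additionally validates options before building the Enumerator, which this sketch leaves out.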