6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,12 @@
> ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.

## 1.1.0 (2026-06-09)

RSpec tests: 1,038 → 1,070

- New `SmarterJSON.foreach(source)` — the streaming, composable sibling of `process_file`. `source` is a file path or an IO (a socket, `StringIO`, open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, never loading the whole file, so a large NDJSON / JSONL stream can be filtered or transformed with `.select` / `.map` / `.lazy` / `.first`; with a block it streams and returns the document count, like `process_file`.

## 1.0.0 (2026-06-08)

RSpec tests: 1,038
46 changes: 44 additions & 2 deletions README.md
@@ -6,6 +6,17 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,

> **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.**

## Features at a glance

- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
- **Writes JSON too** — `generate` with pretty-printing, NDJSON, `sort_keys`, `ascii_only`, `script_safe`, `allow_nan`, and `coerce` (via `as_json`); iterative, so deeply nested data is depth-safe.
- **Keeps number precision** — `BigDecimal` by default (Oj-compatible), or `:float` / `:auto`.
- **Transparent leniency** — pass an optional `on_warning` callback to be handed every lenient fix (an empty slot collapsed, a duplicate key dropped, a code fence stripped, …); with no handler the parser stays silent and adds zero overhead.
- **Fast, and runs everywhere** — a C extension that matches or beats Oj, with a pure-Ruby fallback for platforms that can't build it. Stable, semantically versioned, thread-safe, Ruby 2.6+.

## Why SmarterJSON?

**Are you tired of seeing errors like these?**
@@ -73,13 +84,15 @@ It raises only on genuinely unreadable input (unterminated string, mismatched br
The lenient grammar is a superset of these human-JSON specs — listed once, here:

* [JSON5](https://json5.org/)
* [HJSON](https://hjson.github.io/)
* [HJSON](https://hjson.github.io/) <sup>†</sup>
* [JWCC / HuJSON](https://github.com/tailscale/hujson)
* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
* [JSONH](https://github.com/jsonh-org/Jsonh)
* [JSONC (VS Code)](https://jsonc.org/)
* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).

<sup>†</sup> A deliberate subset. SmarterJSON's quoteless (unquoted) string values are single-line — it does **not** parse HJSON's unquoted multi-line strings; use a quoted or triple-quoted (`'''…'''`) string for multiline. This is by design: SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others — and an unquoted string that may span newlines collides with newline-as-a-document-separator (NDJSON, implicit-root config), so it is left out.

## Installation

```ruby
@@ -130,7 +143,7 @@ See [Examples](#examples) below for multi-document input, streaming, and recover

## Stable interface & thread safety

The public interface is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
The supported public surface is `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.foreach`, `SmarterJSON.generate`, and the documented options in this README/docs. Without a block, `SmarterJSON.process` and `SmarterJSON.process_file` return an `Array` of documents (the block form streams and returns the document count); `process_one` returns the single document's value (or `nil`) and emits a warning if the input holds more than one document.

Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls.

@@ -254,6 +267,35 @@ SmarterJSON.process_file("#{Dir.home}/.claude/projects/<project>/<session-id>.js
end
```

### Filtering and rewriting a large file (`foreach`)

`SmarterJSON.foreach(source)` is the composable sibling of `process_file`. `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). With no block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so you can chain `.select` / `.map` and friends. Add `.lazy` to keep the whole chain bounded in memory, even when the filtered set is large:

```ruby
# Keep only the user/assistant turns of a transcript — one document in memory at a time
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
.lazy
.select { |doc| %w[user assistant].include?(doc[:type]) }
.each { |doc| puts doc[:text] }
```

Because it streams at both ends (reading and writing), you can **filter a big file down and rewrite it** without ever loading the whole thing:

```ruby
File.open("filtered.jsonl", "w") do |out|
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
.lazy
.select { |doc| %w[user assistant].include?(doc[:type]) }
.each { |doc| out.puts SmarterJSON.generate(doc) }
end
```
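
For comparison, the same pipeline shape can be sketched with the standard library alone. This is an illustration under stated assumptions, not SmarterJSON itself: it uses `File.foreach` and the stdlib `json` gem, so it only handles strict NDJSON with one document per line, and the file names and sample records are made up for the example.

```ruby
require "json"
require "tmpdir"

# Stdlib-only sketch (not SmarterJSON): strict NDJSON, one document per line.
# File names and sample records are hypothetical.
kept_types = Dir.mktmpdir do |dir|
  src = File.join(dir, "session.jsonl")
  dst = File.join(dir, "filtered.jsonl")

  records = [
    { type: "user",      text: "hi" },
    { type: "system",    text: "boot" },
    { type: "assistant", text: "hello" }
  ]
  File.write(src, records.map { |r| JSON.generate(r) }.join("\n") + "\n")

  # File.foreach without a block returns an Enumerator over lines;
  # .lazy keeps the whole chain to one line in memory at a time.
  File.open(dst, "w") do |out|
    File.foreach(src)
        .lazy
        .map    { |line| JSON.parse(line, symbolize_names: true) }
        .select { |doc| %w[user assistant].include?(doc[:type]) }
        .each   { |doc| out.puts JSON.generate(doc) }
  end

  # Read the (small) result back to see which document types survived.
  File.readlines(dst, chomp: true).map { |line| JSON.parse(line)["type"] }
end

p kept_types # => ["user", "assistant"]
```

The stdlib version needs an explicit `.map` to parse each line and breaks on anything that is not strict single-line JSON, which is the gap `foreach` is described as closing.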

Pass an IO instead of a path to stream straight from a socket or an HTTP response body — anything `IO`-like works (an IO is single-pass, read once):

```ruby
SmarterJSON.foreach(response_io).each { |event| handle(event) }
```

### Recovering JSON from LLM / markdown noise

When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.)
22 changes: 22 additions & 0 deletions docs/basic_read_api.md
@@ -103,6 +103,28 @@ SmarterJSON.process(io) { |doc| handle(doc) }

The streaming path now frames whole top-level documents, not just one line at a time. That means NDJSON / JSONL still work, but pretty-printed multi-line objects and arrays work too, as do mixed `\n` / `\r\n` / `\r` line endings and comment-only separators between documents.

## `SmarterJSON.foreach` — stream a file or IO, composably

`foreach` is the composable sibling of `process_file`. Its argument is a **file path or any IO** (a socket, a `StringIO`, an open `File`); a String is always a path, never content.

With a block it behaves exactly like the block form above — streams each document, returns the **document count**. Without a block it returns a plain `Enumerator` (like `CSV.foreach` — **not** an `Enumerator::Lazy`), so `.map` / `.select` return Arrays the usual way, and you can chain over the stream:

```ruby
SmarterJSON.foreach("events.ndjson").each { |event| EventJob.perform_async(event) } # like the block form
SmarterJSON.foreach("events.ndjson").select { |e| e["level"] == "error" } # => an Array of the matches
```
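
The block / no-block contract described above follows a common Ruby pattern built on `enum_for`. The sketch below shows that pattern with a hypothetical stand-in method, `each_document`, which is not part of SmarterJSON and parses strict NDJSON only, via the stdlib `json` gem:

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for SmarterJSON.foreach, NOT the gem's implementation:
# strict NDJSON only, one document per line, parsed with the stdlib json gem.
def each_document(io)
  # No block: hand back a plain Enumerator over this same method.
  return enum_for(:each_document, io) unless block_given?

  count = 0
  io.each_line do |line|
    next if line.strip.empty?

    yield JSON.parse(line)
    count += 1
  end
  count # block form: the number of documents streamed
end

text = %({"level":"info"}\n{"level":"error"}\n{"level":"error"}\n)

block_count = each_document(StringIO.new(text)) { |doc| doc }
enum        = each_document(StringIO.new(text))
errors      = enum.select { |doc| doc["level"] == "error" }

p block_count   # => 3
p enum.class    # => Enumerator
p errors.class  # => Array
```

Because the no-block form is a plain `Enumerator`, `select` returns an ordinary `Array` with no `.to_a` or `.force` step.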

It reads one document at a time, so `foreach(path).first(3)` only reads ~3 documents off disk, and `.next` pulls them one by one. `.map` / `.select` read the source lazily but still build an Array of their *result*; to keep a whole pipeline bounded end to end (a large filtered set off a fat file), add `.lazy` at the call site:

```ruby
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
.lazy
.select { |doc| %w[user assistant].include?(doc[:type]) }
.each { |doc| puts doc[:text] }
```
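
To make the difference between a plain chain and a `.lazy` chain concrete, the sketch below counts how many lines are actually pulled from the source. `counted_documents` and its `stats` hash are hypothetical helpers for illustration only (strict NDJSON via the stdlib `json` gem); they are not part of SmarterJSON.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for SmarterJSON.foreach (strict NDJSON via stdlib json).
# stats[:read] counts how many lines were actually pulled from the source.
def counted_documents(io, stats)
  return enum_for(:counted_documents, io, stats) unless block_given?

  io.each_line do |line|
    stats[:read] += 1
    yield JSON.parse(line)
  end
end

text = (1..5).map { |i| %({"id":#{i}}) }.join("\n") + "\n"

# first(2) on the plain Enumerator stops reading after two documents.
first_stats = { read: 0 }
first_two = counted_documents(StringIO.new(text), first_stats).first(2)

# A plain select drains the source and builds its Array before first(1) runs.
plain_stats = { read: 0 }
counted_documents(StringIO.new(text), plain_stats).select { |d| d["id"].even? }.first(1)

# With .lazy the chain stops pulling as soon as first(1) is satisfied.
lazy_stats = { read: 0 }
counted_documents(StringIO.new(text), lazy_stats).lazy.select { |d| d["id"].even? }.first(1)

p first_stats[:read] # => 2
p plain_stats[:read] # => 5
p lazy_stats[:read]  # => 2
```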

Options are validated eagerly — a bad option key or value raises immediately, before any iteration. An **IO source is single-pass** (an IO can only be read once), so iterating the returned Enumerator a second time over the same IO yields nothing; a path-backed `foreach` re-opens the file and is re-iterable.
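
The single-pass behaviour of an IO source, compared with a re-iterable path source, can be sketched with a stand-in that dispatches on `respond_to?(:read)`. `documents_from` is a hypothetical helper for illustration (strict NDJSON via the stdlib `json` gem), not the gem's implementation:

```ruby
require "json"
require "stringio"
require "tempfile"

# Hypothetical stand-in for SmarterJSON.foreach (strict NDJSON via stdlib json):
# an IO is read from its current position; anything else is treated as a path
# and opened afresh on every pass.
def documents_from(source, &block)
  return enum_for(:documents_from, source) unless block

  if source.respond_to?(:read)
    source.each_line { |line| block.call(JSON.parse(line)) }
  else
    File.open(source, "r:UTF-8") do |file|
      file.each_line { |line| block.call(JSON.parse(line)) }
    end
  end
end

text = %({"id":1}\n{"id":2}\n)

io_enum   = documents_from(StringIO.new(text))
io_first  = io_enum.to_a # first pass drains the IO
io_second = io_enum.to_a # the IO is at EOF, so nothing is left

path_first, path_second = Tempfile.create(["docs", ".ndjson"]) do |file|
  file.write(text)
  file.flush
  path_enum = documents_from(file.path)
  [path_enum.to_a, path_enum.to_a] # each pass re-opens the file
end

p io_first.length    # => 2
p io_second.length   # => 0
p path_first.length  # => 2
p path_second.length # => 2
```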

## The C extension and the pure-Ruby fallback

By default (`acceleration: true`) the C extension is used when it is compiled and loadable (`SmarterJSON::HAS_ACCELERATION` is then `true`); otherwise the pure-Ruby implementation runs and produces identical results. Pass `acceleration: false` to force the pure-Ruby path. See [Configuration Options](./options.md).
11 changes: 11 additions & 0 deletions docs/examples.md
@@ -94,6 +94,17 @@ SmarterJSON.process_file("#{Dir.home}/.claude/projects/<project>/<session-id>.js
end
```

**Filter and rewrite as a stream — `SmarterJSON.foreach`:** `foreach(source)` is the composable sibling of `process_file`; `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so it chains with `.select` / `.map`; add `.lazy` to keep the whole pipeline bounded in memory. This filters a transcript down to its user/assistant turns and writes a smaller file, never loading all of it:

```ruby
File.open("filtered.jsonl", "w") do |out|
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
.lazy
.select { |doc| %w[user assistant].include?(doc[:type]) }
.each { |doc| out.puts SmarterJSON.generate(doc) }
end
```

### Example 6: Symbolize Keys

```ruby
35 changes: 35 additions & 0 deletions lib/smarter_json/parser.rb
@@ -57,6 +57,41 @@ def process_file(path, options = {}, &block)
end
end

# SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of
# process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a
# plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way
# and return an Array.
#
# `source` is a file path (opened and streamed from disk, like process_file) OR an
# IO — a socket, a StringIO, an open File — streamed directly from its current
# position. A String is always a path, never content. An IO source is single-pass:
# it can only be read once, so iterating the returned Enumerator a second time over
# the same IO yields nothing.
#
# Without a block: returns an Enumerator over each top-level document, reading one
# document at a time via readpartial — it never slurps the whole file the way
# process_file(path) does. So foreach(path).first(3) reads only ~3 documents off
# disk, and foreach(src).each { … } / .next stream in bounded memory. .map / .select
# read the source one document at a time but still build an Array of their result;
# for a chain that stays bounded end to end (a large filtered set off a fat file)
# opt into .lazy at the call site: foreach(src).lazy.select { … }.each { … }.
#
# With a block: streams each document and returns the document count — identical
# to process_file(path) { |doc| … } (or process(io) { |doc| … } for an IO).
#
# Options are validated eagerly (before the Enumerator is returned), so a bad
# option key or value fails fast rather than on first iteration.
def foreach(source, options = {}, &block)
options = Options.process_options(options)
return enum_for(:foreach, source, options) unless block

if source.respond_to?(:read) # an IO (socket, StringIO, open File) — stream it directly
stream_io(source, options, &block)
else # a path — open the file and stream from disk
process_file(source, options, &block)
end
end

# SmarterJSON.process_one(input, options = {}) — the single-document accessor.
#
# Returns the first document's value (or nil when the input holds no documents).
2 changes: 1 addition & 1 deletion lib/smarter_json/version.rb
@@ -1,5 +1,5 @@
# frozen_string_literal: true

module SmarterJSON
VERSION = "1.0.0"
VERSION = "1.1.0"
end
142 changes: 142 additions & 0 deletions spec/foreach_spec.rb
@@ -0,0 +1,142 @@
# frozen_string_literal: true

require "smarter_json"
require "tempfile"
require "stringio"

# The contract for SmarterJSON.foreach(source) — the streaming, composable sibling of
# process_file. `source` is a file path or an IO (StringIO / socket / open File). It
# mirrors the stdlib convention (CSV.foreach / File.foreach):
#
# foreach(source) -> a plain Enumerator over each top-level document
# foreach(source) { |doc| ... } -> streams each document, returns the document count
#
# Eager : process_file(path) reads the whole file and returns an Array of documents.
# Streaming: foreach(path) reads one document at a time from disk (never slurping
# the whole file). The no-block form returns a plain Enumerator (NOT lazy),
# so .map / .select return Arrays and .first(n) / .next stay bounded; add
# .lazy at the call site for a chain that's bounded end to end.
RSpec.describe SmarterJSON, ".foreach" do
let(:fixtures_dir) { File.expand_path("fixtures", __dir__) }
let(:ndjson) { File.join(fixtures_dir, "multi_doc.ndjson") } # => [{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }]

# Parity harness: every example runs on the C path and the pure-Ruby path.
[true, false].each do |acceleration|
context "acceleration: #{acceleration}" do
it "without a block returns an Enumerator" do
expect(SmarterJSON.foreach(ndjson, acceleration: acceleration)).to be_a(Enumerator)
end

it "returns a plain Enumerator (not lazy), so .map / .select return Arrays — like CSV.foreach" do
enum = SmarterJSON.foreach(ndjson, acceleration: acceleration)
expect(enum).not_to be_a(Enumerator::Lazy)
mapped = SmarterJSON.foreach(ndjson, acceleration: acceleration).map { |doc| doc["id"] }
expect(mapped).to be_a(Array) # not an Enumerator::Lazy needing .to_a / .force
expect(mapped).to eq([1, 2, 3])
end

it "the Enumerator yields the same documents process_file returns (eager/streaming parity)" do
eager = SmarterJSON.process_file(ndjson, acceleration: acceleration)
streamed = SmarterJSON.foreach(ndjson, acceleration: acceleration).to_a
expect(streamed).to eq(eager)
expect(streamed).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }])
end

it "with a block streams each document and returns the document count" do
out = []
rv = SmarterJSON.foreach(ndjson, acceleration: acceleration) { |doc| out << doc }
expect(out).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }])
expect(rv).to eq(out.length) # same return contract as process_file's block form
end

it "composes — select / map run over the stream" do
ids = SmarterJSON.foreach(ndjson, acceleration: acceleration)
.select { |doc| doc["id"].odd? }
.map { |doc| doc["id"] }
expect(ids).to eq([1, 3])
end

it "supports external iteration with .next and raises StopIteration past the end" do
enum = SmarterJSON.foreach(ndjson, acceleration: acceleration)
expect(enum.next).to eq({ "id" => 1 })
expect(enum.next).to eq({ "id" => 2 })
expect(enum.next).to eq({ "id" => 3 })
expect { enum.next }.to raise_error(StopIteration)
end

it "streams from disk instead of slurping the whole file (bounded memory)" do
# The eager process_file loads the entire file with File.read; foreach must
# not — it reads incrementally via File.open/readpartial, so a huge file costs
# the same memory as a small one. Pin that mechanism: foreach yields every
# document without ever calling File.read. (Eager and streaming reads return identical
# documents, as the parity test above shows, so the difference is memory, not result.)
expect(File).not_to receive(:read)
docs = SmarterJSON.foreach(ndjson, acceleration: acceleration).to_a
expect(docs).to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }])
end

it "passes options through (symbolize_keys)" do
docs = SmarterJSON.foreach(ndjson, symbolize_keys: true, acceleration: acceleration).to_a
expect(docs).to eq([{ id: 1 }, { id: 2 }, { id: 3 }])
end

it "validates options eagerly (fails fast, before any iteration)" do
expect { SmarterJSON.foreach(ndjson, bogus: 1, acceleration: acceleration) }
.to raise_error(ArgumentError, /unknown option/)
end

it "yields nothing for an empty file" do
Tempfile.create(["empty", ".ndjson"]) do |f|
expect(SmarterJSON.foreach(f.path, acceleration: acceleration).to_a).to eq([])
expect(SmarterJSON.foreach(f.path, acceleration: acceleration) { |_| }).to eq(0)
end
end

it "yields a single document for a one-document file" do
Tempfile.create(["one", ".ndjson"]) do |f|
f.write(%({"only": true}))
f.flush
expect(SmarterJSON.foreach(f.path, acceleration: acceleration).to_a).to eq([{ "only" => true }])
end
end

# foreach also accepts an IO (a socket, a StringIO, an open File) — the same
# source process(io) { … } streams, but composable. A String argument is always
# a path (like process_file); an IO is streamed directly.
it "accepts an IO (StringIO) and yields each document" do
io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}\n))
expect(SmarterJSON.foreach(io, acceleration: acceleration).to_a)
.to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }])
end

it "streams an IO with a block and returns the document count" do
io = StringIO.new(%({"a":1}\n{"b":2}\n))
out = []
rv = SmarterJSON.foreach(io, acceleration: acceleration) { |doc| out << doc }
expect(out).to eq([{ "a" => 1 }, { "b" => 2 }])
expect(rv).to eq(2)
end

it "composes (.select / .map) over an IO" do
io = StringIO.new(%({"id":1}\n{"id":2}\n{"id":3}\n))
odds = SmarterJSON.foreach(io, acceleration: acceleration).select { |d| d["id"].odd? }.map { |d| d["id"] }
expect(odds).to eq([1, 3])
end

it "accepts an open File handle as an IO" do
io = File.open(ndjson, "r:UTF-8")
expect(SmarterJSON.foreach(io, acceleration: acceleration).to_a)
.to eq([{ "id" => 1 }, { "id" => 2 }, { "id" => 3 }])
ensure
io&.close
end

it "an IO source is single-pass (an IO can only be read once)" do
io = StringIO.new(%({"id":1}\n{"id":2}\n))
enum = SmarterJSON.foreach(io, acceleration: acceleration)
expect(enum.to_a).to eq([{ "id" => 1 }, { "id" => 2 }]) # first pass drains the IO
expect(enum.to_a).to eq([]) # IO now at EOF — nothing left
end
end
end
end