v1.4.0 — Smart Extract, tutorial, and path language by sahilsunny · Pull Request #17 · ScrapingBee/scrapingbee-cli

sahilsunny · 2026-04-01T23:16:56Z

Summary

--smart-extract: Client-side extraction with auto-format detection (JSON, HTML, XML, CSV, Markdown, text) and a full path language — recursive search (...key), glob patterns (*), value/key filters ([=pattern]), regex ([=/pattern/]), context expansion (~N), OR/AND operators, JSON schema output. Works on all commands.
Interactive tutorial: 25-step walkthrough (11 chapters) from basics to production pipelines, with syntax-highlighted command box, recursive prereq resolution, and brand colors.
Deprecation of --extract-field and --fields: Replaced by --smart-extract. Deprecation warnings added, removal planned for v2.0.0.
Bug fixes: Regex crash in value filter, CSV header auto-detection, ancestry corruption in recursive search, type annotation fixes across all commands.
AI platform updates: --smart-extract documented in AGENTS.md, SKILL.md (all 5 platforms), plugin.json. LLM-focused positioning: "extract just the data your LLM needs."

Test plan

600 unit tests passing (67 new for path language + smart-extract)
196 e2e tests passing
Integration tests passing (2 server-side 429s, not code bugs)
ruff lint + format: all checks passed
ty type check: all checks passed
Backward compatible: --extract-field and --fields still work
Skills synced across all 5 AI platforms
Tutorial tested end-to-end with prereq chain resolution

New features: - --smart-extract: client-side extraction with auto-format detection (JSON, HTML, XML, CSV, Markdown, plain text) and a full path language supporting recursive search (...key), glob patterns (*), value/key filters ([=pattern], [key=pattern]), regex ([=/pattern/]), context expansion (~N), OR (|), AND (&), multi-index ([0,3,7]), slicing, [keys]/[values], escaped literals ((key.name)), and JSON schema output - Interactive tutorial command (25 steps, 11 chapters) with syntax- highlighted command box, prereq chain resolution, and ScrapingBee brand colors - --smart-extract works on all commands: scrape, google, amazon, walmart, youtube, chatgpt, fast-search - Deprecation warnings for --extract-field and --fields (use --smart-extract instead, removal in v2.0.0) Improvements: - Enhanced --extract-field and --fields with full path language support - CSV header row auto-detection (skips headers without --input-column) - Recursive prereq resolution in tutorial runner - API key check before prereq execution in tutorial - File-based usage cache with locking - Batch resume discovery (scrapingbee --resume) - Scraping pipeline agent added to all 5 AI platform skill trees - Skills synced across .agents/, .github/, .kiro/, .opencode/, plugins/ - LLM-focused marketing in AGENTS.md and SKILL.md frontmatter - scripts/sync-skills.sh for cross-platform skill synchronization Bug fixes: - Fixed regex compilation crash in value filter (_build_matcher) - Fixed ancestry list corruption in recursive context search - Fixed CSV serialization using str() instead of json.dumps() for dicts - Fixed retries/backoff type annotations across all commands - Suppressed harmless coroutine-never-awaited test warnings Tests: - 600 unit tests (67 new for path language + smart-extract) - 196 e2e tests passing - Full backward compatibility with v1.3.1 --extract-field and --fields

…ixes - Chainable ~N context expansion — works anywhere in the path, not just after recursive search. Find a value, go up, keep chaining. - [!=pattern] negation filter for excluding values/dicts - [*=pattern] glob key filter for matching dicts by any key's value - Scalar-only value matching — filters skip dicts/lists to prevent false positives from stringifying entire subtrees - List flattening in recursive search — ...section[id=faq] now correctly finds individual elements - Guard against None dict keys in recursive walk - Pass root through recursive _resolve_path calls for correct ~N ancestry - CSV header auto-detection in read_input_file - Deprecation warnings for --extract-field and --fields - Tutorial restructured: 25 steps, 11 chapters, prereq chain resolution - 647 unit tests (114 smart-extract tests including 47 chaining tests) - Updated CHANGELOG, AGENTS.md, SKILL.md, Amazon Q JSON skill

.agents/skills/scrapingbee-cli/reference/batch/output.md

src/scrapingbee_cli/batch.py

Cap _recursive_walk_simple() and _recursive_walk_ctx() at 100 levels of recursion depth. These functions power the ...key recursive search in --smart-extract and previously had no depth guard — a pathologically nested JSON structure could exhaust Python's call stack and crash ungracefully. Now they return gracefully when the limit is exceeded.

sahilsunny requested a review from kostas-jakeliunas-sb April 1, 2026 23:18

kostas-jakeliunas-sb reviewed Apr 2, 2026

View reviewed changes

.agents/skills/scrapingbee-cli/reference/batch/output.md Outdated Show resolved Hide resolved

Replace MD5 with SHA-256 for manifest content hashing

442c8e0

kostas-jakeliunas-sb reviewed Apr 2, 2026

View reviewed changes

src/scrapingbee_cli/batch.py Show resolved Hide resolved

kostas-jakeliunas-sb force-pushed the v1.4.0 branch from 3a3198b to f9bf45a Compare April 2, 2026 10:59

kostas-jakeliunas-sb approved these changes Apr 2, 2026

View reviewed changes

kostas-jakeliunas-sb merged commit ea13734 into main Apr 2, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.4.0 — Smart Extract, tutorial, and path language#17

v1.4.0 — Smart Extract, tutorial, and path language#17
kostas-jakeliunas-sb merged 4 commits intomainfrom
v1.4.0

sahilsunny commented Apr 1, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sahilsunny commented Apr 1, 2026

Summary

Test plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants