Skip to content

v1.4.0 — Smart Extract, tutorial, and path language#17

Merged
kostas-jakeliunas-sb merged 4 commits intomainfrom
v1.4.0
Apr 2, 2026
Merged

v1.4.0 — Smart Extract, tutorial, and path language#17
kostas-jakeliunas-sb merged 4 commits intomainfrom
v1.4.0

Conversation

@sahilsunny
Copy link
Copy Markdown
Collaborator

Summary

  • --smart-extract: Client-side extraction with auto-format detection (JSON, HTML, XML, CSV, Markdown, text) and a full path language — recursive search (...key), glob patterns (*), value/key filters ([=pattern]), regex ([=/pattern/]), context expansion (~N), OR/AND operators, JSON schema output. Works on all commands.
  • Interactive tutorial: 25-step walkthrough (11 chapters) from basics to production pipelines, with syntax-highlighted command box, recursive prereq resolution, and brand colors.
  • Deprecation of --extract-field and --fields: Replaced by --smart-extract. Deprecation warnings added, removal planned for v2.0.0.
  • Bug fixes: Regex crash in value filter, CSV header auto-detection, ancestry corruption in recursive search, type annotation fixes across all commands.
  • AI platform updates: --smart-extract documented in AGENTS.md, SKILL.md (all 5 platforms), plugin.json. LLM-focused positioning: "extract just the data your LLM needs."

Test plan

  • 600 unit tests passing (67 new for path language + smart-extract)
  • 196 e2e tests passing
  • Integration tests passing (2 server-side 429s, not code bugs)
  • ruff lint + format: all checks passed
  • ty type check: all checks passed
  • Backward compatible: --extract-field and --fields still work
  • Skills synced across all 5 AI platforms
  • Tutorial tested end-to-end with prereq chain resolution

New features:
- --smart-extract: client-side extraction with auto-format detection
  (JSON, HTML, XML, CSV, Markdown, plain text) and a full path language
  supporting recursive search (...key), glob patterns (*), value/key
  filters ([=pattern], [key=pattern]), regex ([=/pattern/]), context
  expansion (~N), OR (|), AND (&), multi-index ([0,3,7]), slicing,
  [keys]/[values], escaped literals ((key.name)), and JSON schema output
- Interactive tutorial command (25 steps, 11 chapters) with syntax-
  highlighted command box, prereq chain resolution, and ScrapingBee
  brand colors
- --smart-extract works on all commands: scrape, google, amazon,
  walmart, youtube, chatgpt, fast-search
- Deprecation warnings for --extract-field and --fields (use
  --smart-extract instead, removal in v2.0.0)

Improvements:
- Enhanced --extract-field and --fields with full path language support
- CSV header row auto-detection (skips headers without --input-column)
- Recursive prereq resolution in tutorial runner
- API key check before prereq execution in tutorial
- File-based usage cache with locking
- Batch resume discovery (scrapingbee --resume)
- Scraping pipeline agent added to all 5 AI platform skill trees
- Skills synced across .agents/, .github/, .kiro/, .opencode/, plugins/
- LLM-focused marketing in AGENTS.md and SKILL.md frontmatter
- scripts/sync-skills.sh for cross-platform skill synchronization

Bug fixes:
- Fixed regex compilation crash in value filter (_build_matcher)
- Fixed ancestry list corruption in recursive context search
- Fixed CSV serialization using str() instead of json.dumps() for dicts
- Fixed retries/backoff type annotations across all commands
- Suppressed harmless coroutine-never-awaited test warnings

Tests:
- 600 unit tests (67 new for path language + smart-extract)
- 196 e2e tests passing
- Full backward compatibility with v1.3.1 --extract-field and --fields
…ixes

- Chainable ~N context expansion — works anywhere in the path, not just
  after recursive search. Find a value, go up, keep chaining.
- [!=pattern] negation filter for excluding values/dicts
- [*=pattern] glob key filter for matching dicts by any key's value
- Scalar-only value matching — filters skip dicts/lists to prevent
  false positives from stringifying entire subtrees
- List flattening in recursive search — ...section[id=faq] now correctly
  finds individual elements
- Guard against None dict keys in recursive walk
- Pass root through recursive _resolve_path calls for correct ~N ancestry
- CSV header auto-detection in read_input_file
- Deprecation warnings for --extract-field and --fields
- Tutorial restructured: 25 steps, 11 chapters, prereq chain resolution
- 647 unit tests (114 smart-extract tests including 47 chaining tests)
- Updated CHANGELOG, AGENTS.md, SKILL.md, Amazon Q JSON skill
Cap _recursive_walk_simple() and _recursive_walk_ctx() at 100 levels
of recursion depth. These functions power the ...key recursive search
in --smart-extract and previously had no depth guard — a pathologically
nested JSON structure could exhaust Python's call stack and crash
ungracefully. Now they return gracefully when the limit is exceeded.
@kostas-jakeliunas-sb kostas-jakeliunas-sb merged commit ea13734 into main Apr 2, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants