Releases: AIMLPM/markcrawl

v0.11.1 — default aggregator-page URL filter

12 May 04:49
a5e158b

Reject mdBook /print.html and Hugo /_print/ pages during crawl-time URL filtering. These single-render-of-whole-tree pages have artificially high keyword density and pollute embedding-based retrieval rankings.

Why

Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots. By comparison, four of the five other well-functioning crawlers in the benchmark returned /_print/ in 0% of kubernetes-docs slots.

These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

What changed

  • New default URL patterns rejected pre-fetch (saves crawl budget):
    */print.html, */_print, */_print/, */_print/*, */print/index.html
  • New kwarg include_aggregator_pages: bool = False on crawl() and both engine classes for offline-archive use cases.
  • CLI flag --include-aggregators mirrors the kwarg.
  • User-supplied exclude_paths and include_paths still apply independently — the aggregator filter composes with both, doesn't replace either.

Substring-match safety

Patterns are anchored to avoid over-matching:

URL                        | Behavior
/book/print.html           | rejected (mdBook)
/blueprint.html            | passes (print is mid-word)
/preprint.html             | passes (academic content)
/imprint/                  | passes (legal page)
/_print/index.html         | rejected (Hugo)
/_printer-friendly/css.css | passes (asset path)
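
The anchoring shown in the table can be checked with a short sketch. It assumes the patterns are shell-style wildcards matched against the URL path. The helper names is_aggregator_path and should_fetch are hypothetical, and markcrawl's actual matcher may differ; only the pattern strings and the include_aggregator_pages flag come from these notes.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Pattern strings copied from the release notes; the matching logic is an assumption.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def is_aggregator_path(url: str) -> bool:
    """Return True when the URL path matches one of the aggregator patterns."""
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in AGGREGATOR_PATTERNS)


def should_fetch(url: str, include_aggregator_pages: bool = False) -> bool:
    """Hypothetical pre-fetch check: reject aggregator pages unless opted in."""
    return include_aggregator_pages or not is_aggregator_path(url)
```

Every pattern requires a literal / immediately before print or _print, which is why names such as /blueprint.html and /_printer-friendly/ cannot match.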

Expected impact

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).

Tests

647 passing (was 611); 36 new tests in tests/test_v011_1_aggregator_filter.py covering default rejection, substring safety, opt-out flag, composition with user filters, and CrawlEngine + AsyncCrawlEngine parity.

Migration

No breaking changes. Default behavior unchanged on sites that don't generate aggregator pages. For users archiving offline docs that include print views, pass include_aggregator_pages=True or --include-aggregators.

v0.11.0 — binary downloads + filters

06 May 10:32
cd26ee1

Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":

markcrawl/binaries.py — streaming binary downloads

New crawl(..., download_types=["pdf","docx"], ...) opt-in kwarg:

  • Streaming with size cap: stream=True / aiter_bytes() with per-chunk accumulation. Never buffers the full body. Default 25 MB per-file cap, 200 file count cap.
  • Atomic write via .tmp + os.replace. Partial files unlinked on cap-exceed.
  • Content-type validated BEFORE writing bytes — a .pdf URL serving text/html (login wall, marketing splash) is dropped immediately.
  • JSONL row gains downloads field when a page's binaries were downloaded: [{url, path, size_bytes, content_type}, ...]. Field omitted when empty (backward compat).
  • Sitemap entries route to download queue when they match download_types (symmetry with link discovery).
  • All v0.10.x safety nets (respect_robots, idle_timeout_s, include_subdomains) apply uniformly to downloads.
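
A minimal sketch of the cap-and-atomic-write behavior listed above, assuming the response body arrives as an iterable of byte chunks (as it would from a streaming HTTP client). The names save_capped, content_type_ok, and EXPECTED_TYPES are hypothetical illustrations, not the contents of markcrawl/binaries.py, and the set of content types markcrawl accepts is not stated in these notes.

```python
import os
from pathlib import Path
from typing import Iterable

# Assumed mapping for illustration; these are the standard MIME types for each extension.
EXPECTED_TYPES = {
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def content_type_ok(extension: str, content_type_header: str) -> bool:
    """Check the Content-Type header before any bytes are written."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == EXPECTED_TYPES.get(extension)


def save_capped(chunks: Iterable[bytes], dest: Path, max_bytes: int) -> bool:
    """Stream chunks to dest via a .tmp file; return False if the cap is exceeded."""
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    with open(tmp, "wb") as fh:
        for chunk in chunks:
            written += len(chunk)
            if written > max_bytes:
                break  # stop reading as soon as the cap is crossed
            fh.write(chunk)
    if written > max_bytes:
        tmp.unlink()  # drop the partial file
        return False
    os.replace(tmp, dest)  # atomic rename: dest is never seen half-written
    return True
```

Writing to a sibling .tmp file and renaming it means a reader of the output directory sees either the complete file or nothing.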

markcrawl/filters.py — reusable pre-fetch filters

from markcrawl import crawl
from markcrawl.filters import is_likely_resume

result = crawl(
    base_url="https://example.com/templates",
    out_dir="./resumes",
    download_types=["pdf", "docx"],
    download_filter=is_likely_resume,
)
print(f"Saved {result.downloads_count} files")
  • DownloadCandidate(url, anchor_text, parent_url, parent_title, extension) — pre-fetch context passed to filters.
  • is_likely_resume / is_likely_paper / exclude_legal_boilerplate — reusable URL+anchor heuristics. Best-effort, not classifiers.
  • Filters run pre-fetch — rejected URLs never get fetched, zero HTTP bytes transferred.
  • Compose via lambda c: positive(c) and exclude_legal_boilerplate(c).
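
The composition pattern can be sketched without installing anything. The DownloadCandidate below is a stand-in dataclass that only mirrors the field list given above, and mentions_resume / not_legal_boilerplate are hypothetical heuristics written for this example; they are not the library's is_likely_resume or exclude_legal_boilerplate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DownloadCandidate:
    """Stand-in with the fields named in the release notes."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[DownloadCandidate], bool]


def mentions_resume(c: DownloadCandidate) -> bool:
    """Hypothetical positive heuristic over URL and anchor text."""
    return "resume" in f"{c.url} {c.anchor_text}".lower()


def not_legal_boilerplate(c: DownloadCandidate) -> bool:
    """Hypothetical negative heuristic: drop obvious legal documents."""
    haystack = f"{c.url} {c.anchor_text}".lower()
    return not any(word in haystack for word in ("privacy", "terms", "cookie"))


# Same shape as the lambda composition described in the notes.
combined: Filter = lambda c: mentions_resume(c) and not_legal_boilerplate(c)
```

Because a filter is just a callable returning bool, any such function could be passed wherever a download_filter is accepted.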

New CrawlResult fields

  • downloads_count: int — files saved
  • downloads_bytes: int — total bytes saved
  • downloads_size_skipped: List[str] — URLs that exceeded the size cap
  • downloads_type_skipped: List[str] — URLs whose content-type didn't match

Migration

No breaking changes. Default download_types=None preserves v0.10.6 behavior exactly.

Deferred

  • Live-network smoke harness case for an ATS-template aggregator → v0.11.1.
  • Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with pypdf / python-docx / mammoth / unstructured downstream of saved files.

Tests

611 passing (was 566 on v0.10.6; +45 in tests/test_v011_binary_downloads.py). Spec specs/binary-downloads.md confidence-reviewed; all SC/DS rated ≥ 90% before implementation began.

v0.10.6 — opt-in respect_robots flag

05 May 06:43
75ecb9d

New crawl(..., respect_robots: bool = True) — default unchanged (robots.txt Disallow rules honored). Setting respect_robots=False bypasses Disallow but still honors Crawl-delay (politeness preserved). Caller takes responsibility for legality, ethics, and downstream consequences.

Why

robots.txt is the only widely-deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.

Three guardrails

  1. Loud, non-silenceable warning at engine setup when bypass is active — both progress callback and Python logger.warning. No env-var or CLI override; the choice must be made deliberately in code.
  2. CrawlResult.robots_respected: bool — mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
  3. CrawlResult.robots_bypassed_count: int (count of unique URLs that robots.txt Disallowed but that were fetched anyway). Always 0 when robots_respected is True. Lets you see the actual impact of the override: small numbers mean robots.txt wasn't constraining you.

When bypass was active, the end-of-crawl summary reports either "had no effect this run" (count=0) or "fetched N URL(s) that robots.txt Disallowed" (count>0).
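
The bypass semantics can be illustrated with the standard library's robots.txt parser. This is a sketch under stated assumptions, not markcrawl's engine code: the RobotsGate class and its attribute names are hypothetical, and it only models the two behaviors described above (Disallow can be bypassed and is counted per unique URL, Crawl-delay is always exposed).

```python
from typing import Iterable, Optional, Set
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Hypothetical gate: optionally bypass Disallow, never hide Crawl-delay."""

    def __init__(self, robots_lines: Iterable[str], user_agent: str,
                 respect_robots: bool = True) -> None:
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)
        self.user_agent = user_agent
        self.respect_robots = respect_robots
        self._bypassed: Set[str] = set()

    def allow(self, url: str) -> bool:
        """Return True if the URL may be fetched under the current setting."""
        if self.parser.can_fetch(self.user_agent, url):
            return True
        if self.respect_robots:
            return False
        self._bypassed.add(url)  # a set, so repeated URLs are counted once
        return True

    @property
    def crawl_delay(self) -> Optional[float]:
        """Crawl-delay is reported regardless of respect_robots."""
        return self.parser.crawl_delay(self.user_agent)

    @property
    def bypassed_count(self) -> int:
        return len(self._bypassed)
```

With respect_robots left at True, bypassed_count stays at 0 by construction, matching the invariant stated for robots_bypassed_count.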

What stays unchanged

  • Default behavior: robots.txt Disallow rules honored.
  • Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.

Migration

No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:

  • Your own site (forgotten or misconfigured robots.txt)
  • Authorized pen-testing engagements
  • Internal / intranet documentation you own
  • RAG ingestion of docs that the site owner explicitly wants ingested but whose robots.txt does not yet allow your UA

566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).

v0.10.5 — adaptive scope broadening

05 May 03:02
9ba5d2b

When a crawl exhausts its narrow auto-derived scope (e.g. /docs/concepts/* from a kubernetes seed) with budget remaining, the engine now attempts one-level broadening (/docs/concepts/* → /docs/*) before terminating. URLs filtered under the previous scope are stashed during link discovery and replayed through the broader scope.

Empirical verification (real network, max_pages=400)

Site            | v0.10.4 | v0.10.5 | Delta
kubernetes-docs | 195/400 | 400/400 | +105%
rust-book       | 111     | 111     | unchanged (guardrail held)
postgres-docs   | 80      | 80      | unchanged
newegg          | 1       | 1       | unchanged (engine handles WAF gracefully)

Rust-book is deliberately unchanged: its Tier 0 single-segment scope /book/* cannot broaden short of whole-host, which the guardrail blocks. We don't auto-pull /std/, /cargo/, /nomicon/ even though crawl4ai-raw does — those are different publications, and our scope honors the seed's intent.

Guardrails

Broadening fires only when:

  1. Scope was auto-derived. User-explicit include_paths is respected as intent and never mutated.
  2. Current scope's leftmost segment is in _DOCS_HUB_MARKERS (docs, book, learn, tutorial, guide, reference, manual, handbook, api, etc.) or the site classifies as docs / apiref by hostname.
  3. One-level broadening doesn't land at whole-host (/*).
  4. Cap of _DEFAULT_MAX_BROADEN_EVENTS = 2 per crawl.
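
A rough sketch of the one-level broadening step and three of its guardrails (auto-derived scope only, docs-hub marker check, no whole-host result). The function broaden_one_level is hypothetical, the marker set is only the subset listed above, and the hostname classification and per-crawl event cap are left out.

```python
from typing import Optional

# Subset of markers named in the notes; the real _DOCS_HUB_MARKERS may contain more.
DOCS_HUB_MARKERS = {
    "docs", "book", "learn", "tutorial", "guide",
    "reference", "manual", "handbook", "api",
}


def broaden_one_level(scope: str, auto_derived: bool = True) -> Optional[str]:
    """Return the parent scope pattern, or None when a guardrail blocks broadening."""
    if not auto_derived or not scope.endswith("/*"):
        return None  # user-supplied include_paths are never mutated
    segments = [s for s in scope[:-2].split("/") if s]
    if not segments or segments[0] not in DOCS_HUB_MARKERS:
        return None  # leftmost segment is not a docs hub
    parent = segments[:-1]
    if not parent:
        return None  # broadening would land at whole-host (/*)
    return "/" + "/".join(parent) + "/*"
```

This reproduces the two cases in the verification table: /docs/concepts/* broadens to /docs/*, while /book/* stays put because its only parent is the whole host.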

API additions (additive only)

  • CrawlResult.scope_history: List[List[str]] — sequence of include_paths patterns the crawl traversed. Auditable. Empty if no scope was set.

Migration

No breaking changes. Behavior preserved exactly when the user passes include_paths explicitly. For default crawls on docs sites, expect more pages and the same (or better) signal-to-noise — the broadening guardrail is intentionally tight (docs hub markers only, no whole-host fallback).

549 tests passing (was 528 on v0.10.4; +21 in tests/test_v0105_adaptive_scope.py).

v0.10.4 — idle-timeout reset signal fix + release smoke harness

05 May 01:50
cb53648

In v0.10.3 the idle-timeout clock reset only on save_page, so it mis-fired on bursty crawls where the engine was fetching pages successfully but most were deduped or fell under min_words. The public benchmark surfaced this on huggingface-transformers (21/200 pages saved before the timer fired at 120 s).

Fix

The idle-timeout clock now resets on any meaningful progress event:

  • save_page (already in v0.10.3)
  • successful HTTP 2xx response
  • discover_links call that adds at least one new URL to the queue

4xx / 5xx responses do not reset the clock — anti-bot loops still get caught.
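
The reset rules can be sketched as a small watchdog with an injectable clock. IdleWatchdog and its method names are hypothetical, chosen for this illustration rather than taken from the engine; treating a timeout of 0 as "disabled" follows the idle_timeout_s=0 convention from the v0.10.3 notes.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Hypothetical idle timer that resets on any meaningful progress event."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_progress = clock()

    def _touch(self) -> None:
        self._last_progress = self._clock()

    def on_page_saved(self) -> None:
        self._touch()

    def on_response(self, status: int) -> None:
        if 200 <= status < 300:  # 4xx / 5xx do not count as progress
            self._touch()

    def on_links_discovered(self, new_url_count: int) -> None:
        if new_url_count > 0:  # only discovery that grows the queue counts
            self._touch()

    def stalled(self) -> bool:
        if self.idle_timeout_s <= 0:  # 0 disables the watchdog
            return False
        return self._clock() - self._last_progress >= self.idle_timeout_s
```

Passing the clock in as a parameter keeps the timer testable without real waiting.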

Empirical verification

A fresh crawl of huggingface-transformers at max_pages=200:

Version | Pages saved | Elapsed
v0.10.3 | 21          | 120 s (timer fired early)
v0.10.4 | 174         | 236 s (graceful exit)

8x improvement on the bursty-discovery case; idle timer now functions as a true deadlock detector, not a save-rate guard.

CrawlResult API additions (additive only)

  • first_status: Optional[int] — first observable HTTP status. Lets callers distinguish engine bugs from external WAF/anti-bot blocks without scraping logs.
  • stalled: bool. True when the run was terminated by the idle-timeout watchdog rather than by running out of work or hitting max_pages.

Pre-release smoke harness

New bench/local_replica/release_smoke.py runs crawl() against ~4 real sites with per-site baselines. Treats first_status >= 400 + 0 pages as BLOCKED (skip, not fail) so transient WAF blocks don't false-alarm. Catches stall-detection regressions, coverage regressions, and anti-bot diagnostic regressions in 5-10 min.

Migration

No breaking changes. Users who set MARKCRAWL_IDLE_TIMEOUT_S=300 to work around the v0.10.3 mis-fire can drop the override — 120 s is correct again.

528 tests passing (was 521 on v0.10.3; +7 covering the new reset paths).

v0.10.3 — benchmark resilience fixes

04 May 18:43
faff5c6

Three generalizable resilience fixes surfaced by the public llm-crawler-benchmarks v1.3 cycle. All site-agnostic — none reference the sites or site classes that surfaced them.

Fixes

Partial-write recovery. pages.jsonl is now line-buffered (buffering=1) and save_page flushes after every row. SIGKILL / external watchdog termination no longer leaves an empty JSONL on disk; OS page cache holds all written rows.
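
A minimal sketch of the line-buffered write pattern, assuming a plain text-mode file handle. open_pages_jsonl and save_row are hypothetical helper names. Note that flush() hands data to the operating system but does not fsync, so this protects against the process being killed, not against power loss.

```python
import json
from pathlib import Path
from typing import IO, Any, Dict


def open_pages_jsonl(path: Path) -> IO[str]:
    """Open the JSONL file in append mode with line buffering (buffering=1)."""
    return open(path, "a", encoding="utf-8", buffering=1)


def save_row(fh: IO[str], row: Dict[str, Any]) -> None:
    """Write one JSON object per line and flush it to the OS immediately."""
    fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    fh.flush()
```

Rows written this way are visible to other readers of the file while the writer still has it open, which is what lets a killed crawl leave a usable partial pages.jsonl behind.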

Discovery-exhaustion stall detection (idle_timeout_s). Engine tracks _last_save_time and terminates gracefully when no new page has been saved for idle_timeout_s seconds (default 120). Catches link-graph churn after reachable pages exhaust without site-specific heuristics.

0-page diagnostic. Engine captures the first observed HTTP status. On crawls that finish with pages_saved == 0, logs a class-aware warning: 4xx/5xx → likely anti-bot block, 200 → likely min_words too high or JS-rendered, no response → seed unreachable / DNS error.
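
The class-aware warning amounts to a small mapping from the first observed status to a likely cause. The sketch below is an illustration only: diagnose_zero_pages is a hypothetical name, and the "unclassified" fallback for statuses the notes do not mention is an assumption.

```python
from typing import Optional


def diagnose_zero_pages(first_status: Optional[int]) -> str:
    """Map the first observed HTTP status to a likely cause of a 0-page crawl."""
    if first_status is None:
        return "seed unreachable / DNS error"
    if first_status >= 400:
        return "likely anti-bot block"
    if first_status == 200:
        return "likely min_words too high or JS-rendered"
    return "unclassified"  # assumption: statuses not covered by the notes
```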

API additions (additive only)

  • crawl(..., idle_timeout_s: Optional[float] = None)
  • CrawlEngine / AsyncCrawlEngine accept idle_timeout_s kwarg
  • MARKCRAWL_IDLE_TIMEOUT_S env var
  • DEFAULT_IDLE_TIMEOUT_S = 120.0 module constant
  • Set idle_timeout_s=0 (or env to 0) to disable

Verification

  • 521 tests passing (was 500 on v0.10.2; +21 in tests/test_v0103_resilience.py)
  • Ruff lint clean
  • All 4 Python versions (3.10–3.13) green on CI

Migration

No breaking changes. Default idle_timeout_s=120 is generous and only fires on genuine stalls. Users running long-blocked crawls intentionally (e.g. waiting on slow renders) can pass idle_timeout_s=0.

See CHANGELOG.md for full details.

v0.10.2 — Sitemap pre-enumeration deadline · fixes retailer-index timeouts

03 May 17:44

tl;dr

Patch release fixing a regression surfaced by llm-crawler-benchmarks against v0.10.1: pathological sitemap-indexes (ikea: 2,113 locale shards) consumed 200+ s in pre-enumeration before any page got crawled, tripping benchmark zero-output watchdogs (120 s).

The sitemap-discovery phase now has a 60 s wallclock budget shared across all top-level sitemaps + their recursive children. When the budget fires, the parser returns whatever URLs it has collected so far and the crawl proceeds normally.

Verified locally on the failing sites

Site                     | v0.10.1                   | v0.10.2
ikea (max_pages=30)      | 0 pages (heartbeat fired) | 30 pages saved in 49.7 s
huggingface-transformers | regression on bench CI    | 30 pages saved in 36.2 s

What changed

  • markcrawl.robots.parse_sitemap_xml and parse_sitemap_xml_async: new time_budget_s kwarg (default 60.0), threaded through recursion via the internal _deadline. Async path switches from asyncio.gather to asyncio.as_completed so pending child-sitemap tasks are cancelled rather than awaited once the budget fires.
  • markcrawl.core: both sync and async crawl paths instantiate a shared deadline at the start of sitemap discovery.
  • 2 new tests in tests/test_sitemap_parallel.py covering the short-circuit and the no-op default.
  • 500 tests passing (was 498).
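
The as_completed approach can be sketched in isolation. This is not the code in markcrawl.robots: fetch_child and collect_with_budget are hypothetical, the sleeps stand in for network fetches of child sitemaps, and the URLs are placeholders. It shows the general pattern of keeping partial results and cancelling pending tasks once a time budget expires.

```python
import asyncio
from typing import Dict, List


async def fetch_child(name: str, delay_s: float) -> List[str]:
    """Simulated child-sitemap fetch; the sleep stands in for network latency."""
    await asyncio.sleep(delay_s)
    return [f"https://example.com/{name}/page"]


async def collect_with_budget(children: Dict[str, float],
                              time_budget_s: float) -> List[str]:
    """Collect URLs from child sitemaps, keeping whatever finished in time."""
    tasks = [asyncio.create_task(fetch_child(n, d)) for n, d in children.items()]
    urls: List[str] = []
    try:
        for next_done in asyncio.as_completed(tasks, timeout=time_budget_s):
            urls.extend(await next_done)
    except asyncio.TimeoutError:
        pass  # budget fired: return the partial result
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()  # cancel instead of awaiting slow children
        await asyncio.gather(*tasks, return_exceptions=True)
    return urls
```

With asyncio.gather alone, the slowest child sets the total wait; iterating as_completed with a timeout lets the caller stop at the deadline and still use everything that finished before it.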

Compatibility

No CLI flag changes. No behavior change for sites with normal sitemaps (which finish in <10 s anyway). Only the pathological-index path is affected.

For benchmark integrators

pip install --upgrade markcrawl==0.10.2 and re-run the previously failing sites. Crawl wallclock for ikea drops from "timeout, 0 pages" to "max_pages saved within budget."

v0.10.1 — Local embedder is the default · zero-cost RAG

03 May 09:39

tl;dr

pip install markcrawl now ships a complete crawl-and-embed stack with zero API cost. The default embedder flips from OpenAI 3-small to the bake-off-winning local mixedbread-ai/mxbai-embed-large-v1. Combined with the v0.10.0 chunker work, v0.10.1 closes the leaderboard story:

Metric (vs v0.9.9-rc1)        | v0.10.1 default | Δ
Mean MRR (11-site local pool) | 0.3859          | +0.040 (+11.5%)
Cost at 50M pages             | $0              | −$10,152/yr
Chunks per page               | 10.49           | −48% smaller index

Multi-trial validated: +14% MRR on all-MiniLM-L6-v2 (6 trials, all positive) and +15% on OpenAI 3-small (3 trials, all positive) on the chunker change. The mxbai swap is MRR-neutral (Δ −0.018 within ±0.020 SC-B2 noise band) at $0/yr cost-at-scale.

What's new in 0.10.1

  • pip install markcrawl now bundles the ML stack (torch + transformers + sentence-transformers + sentencepiece). The chunker's chunk_semantic and the new default embedder work out of the box.
  • Default embedder = mixedbread-ai/mxbai-embed-large-v1 (local, zero API cost). Replaces the previous OpenAI 3-small default.
  • markcrawl[ml] kept as a no-op alias — existing install commands keep working.
  • Override paths: MARKCRAWL_EMBEDDER=text-embedding-3-small env var, or embedding_model="..." / embedder=... kwargs on upload(...).

Lean install (no ML deps)

pip install --no-deps markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity
# Then either set OPENAI_API_KEY for the OpenAI fallback, or skip embedding entirely.

Migration

Default kwargs to upload(...) now produce mxbai-embedded rows automatically — callers simply stop being charged for OpenAI. To stay on OpenAI explicitly:

from markcrawl.upload import upload
upload(jsonl_path=..., supabase_url=..., supabase_key=...,
       embedding_model="text-embedding-3-small")

Or set MARKCRAWL_EMBEDDER=text-embedding-3-small in your environment.

v0.10.0 — Chunker MRR lift + embedder bake-off + retry resilience

03 May 09:13

Highlights

Metric                | v0.9.9-rc1 (baseline) | v0.10.0
Mean MRR (local pool) | 0.3461                | 0.3859 (+11.5%)
Cost at 50M pages     | $10,152               | $0 with markcrawl[ml] / $5,246 default
Chunks per page       | 20.3                  | 10.49 (-48% smaller index)
Tests                 | 350                   | 493 passing

What's in this release

  • Chunker defaults flipped (Track D). chunk_markdown now defaults to min_words=250, section_overlap_words=40, strip_markdown_links=True. Multi-trial validated: +14% MRR on st-mini (6 trials, all positive), +15% on OpenAI 3-small (3 trials, all positive). Backward-compatible — pass legacy values explicitly to opt out.
  • Embedder abstraction (Track B). New markcrawl/embedder.py ships with Embedder ABC + OpenAIEmbedder + LocalSentenceTransformerEmbedder (with model-specific instruction prefixes for asymmetric retrieval). Bake-off winner is mixedbread-ai/mxbai-embed-large-v1 — passes SC-B1 ($0 cost) + SC-B2 (Δ −0.018 MRR within ±0.020 band).
  • Rerank infrastructure (Track A). New markcrawl/retrieval.py with CrossEncoderReranker. Failed the +0.030 MRR bar on this distribution (regressed −0.013); ships opt-in only.
  • Ecom diagnosis (Track C). Documented that newegg/ikea failures are crawl-discovery, not extraction. Per-site auto_path_scope / auto_path_priority / use_sitemap overrides plumbed in the local-replica harness; SC-C1 closure deferred to v0.11 (sitemap-first + UA rotation).
  • Retry resilience. Tenacity-backed retry, binary-parity CI smoke tests, structured exhaustion logs.

Per-category MRR (11-site canonical pool)

Category       | v0.9.9 | v0.10.0 | Δ
framework_docs | 0.3641 | 0.3771  | +0.0130
api_docs       | 0.4840 | 0.5414  | +0.0574
reference      | 0.3750 | 0.3750  | 0.0000
tutorial       | 0.4605 | 0.7000  | +0.2395
ecommerce      | 0.0625 | 0.0625  | 0.0000
blog           | 0.5000 | 0.5000  | 0.0000
news           | 0.1667 | 0.1667  | 0.0000

No category regresses. The largest per-site regression (huggingface-transformers −0.083) is offset within the framework_docs category by react-dev's gain.

Known v0.11 work

  1. Default flip for mxbai embedder. Extras-aware factory: pick mxbai when markcrawl[ml] is installed, fall back to OpenAI 3-small. Captures the −$10K/yr cost reduction in production callers.
  2. Sitemap-first ecom discovery. Closes SC-C1 (ikea reaches its canonical products; newegg avoids anti-bot via different URL queue).
  3. Newegg anti-bot mitigation. UA rotation pool + retry-with-jitter for ecom-class sites.

Reports

  • bench/local_replica/v010_release_report.md — authoritative release report with side-by-side comparison.
  • bench/local_replica/track_d_report.md — chunker sweep methodology (56 configs).
  • bench/local_replica/track_a_report.md — rerank failure analysis.
  • bench/local_replica/track_b_report.md — embedder bake-off (4 of 5 candidates run; nomic aborted).
  • bench/local_replica/track_c_report.md — ecom discovery vs extraction diagnosis.
  • specs/v010-leaderboard-sweep.md — campaign spec with per-track status markers.

v0.9.3 — ecommerce category-marker scope

26 Apr 04:07

Generic URL-convention fix for ecommerce sites where the seed URL passes through a category-index segment but target items live at sibling paths. Same kind of rule as the v0.9.2 `/wiki/` article-container check.

Validated on the 4 sites used by the public llm-crawler-benchmarks rotation

Site                     | Public v0.9.1 | v0.9.2 | v0.9.3 | Notes
mdn-css                  | 0.125         | 0.5625 | 0.5625 | unchanged from v0.9.2
kubernetes-docs          | 0.542         | 0.9062 | 0.9062 | unchanged from v0.9.2
huggingface-transformers | 0.000         | 0.3438 | 0.3438 | unchanged from v0.9.2
ikea                     | 0.375         | 0.0000 | 0.1250 | recovered from v0.9.2 over-tight scope
AVG MRR                  | 0.260         | 0.4531 | 0.4844 | +0.224 over public (+86% relative)

(Ikea remains below public's 0.375 due to long-tail variance — a 200-page random sample of thousands of products will hit different specific named items each run. The scope fix itself works as designed.)

What's new

auto_path_scope now detects ecommerce category-index markers

When the seed URL passes through a /<marker>/ segment that's a known ecommerce-platform URL convention, the segments BEFORE the marker become the scope. This is generic — applies to any site adopting the convention, not domain-specific.

Markers detected: cat, category, categories, products, shop, collections.

Used by Shopify, WooCommerce, Magento, Salesforce Commerce Cloud defaults, plus IKEA, Etsy, BigCommerce, and many more.

Seed URL                                       | New v0.9.3 scope            | Why
ikea.com/us/en/cat/furniture-fu001/            | /us/en/*                    | Products at /us/en/p/* are siblings
myshop.com/store/collections/spring/products/x | /store/collections/spring/* | Deepest marker wins; outer is parent grouping
myshop.com/products/single-thing               | None                        | Marker at root → siblings span whole site
mywp.com/blog/category/news/post-1             | /blog/*                     | News posts are at /blog/<slug>

Multi-marker tiebreak: deepest wins

For URLs with nested markers (e.g. /store/collections/X/products/Y), the deepest marker is the leaf-level category — outer markers are parent groupings. Scope is anchored at the segments before the deepest marker.
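
The marker rule and the deepest-wins tiebreak can be expressed in a few lines. This is a sketch of the rule as described, not the auto_path_scope implementation: derive_scope is a hypothetical name, and returning None here only means this particular rule yields no scope (markcrawl may apply other rules that this example ignores).

```python
from typing import Optional
from urllib.parse import urlsplit

# Marker words listed in the release notes.
ECOM_MARKERS = {"cat", "category", "categories", "products", "shop", "collections"}


def derive_scope(seed_url: str) -> Optional[str]:
    """Scope = the path segments before the deepest ecommerce marker."""
    if "://" not in seed_url:
        seed_url = "https://" + seed_url  # accept scheme-less seeds like the table's
    segments = [s for s in urlsplit(seed_url).path.split("/") if s]
    marker_positions = [
        i for i, s in enumerate(segments) if s.lower() in ECOM_MARKERS
    ]
    if not marker_positions:
        return None  # rule does not apply
    deepest = marker_positions[-1]
    if deepest == 0:
        return None  # marker at root: siblings span the whole site
    return "/" + "/".join(segments[:deepest]) + "/*"
```

Applied to the four seed URLs in the table, this yields /us/en/*, /store/collections/spring/*, None, and /blog/* respectively.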

Tests

324 passing (was 316 in v0.9.2; +8 new ecommerce-marker tests covering ikea-style /us/en/cat, Shopify /products & /collections, deepest-marker tiebreak, case-insensitivity).

Migration

No API changes. Behavior shift only on seeds passing through one of the marker words. If you were relying on a tight scope at the marker level, pass auto_path_scope=False and use include_paths explicitly.

Known limitations

  • Long-tail product queries: with constrained max_pages, hitting specific named products on a large catalog (ikea MALM/SLATTUM, etc.) depends on which products end up in BFS order — irreducible variance. Larger max_pages reduces this.
  • SPA sites: pages requiring JavaScript-rendered navigation still need explicit --render-js.

Install

pip install 'markcrawl[js]==0.9.3'