Releases: AIMLPM/markcrawl
v0.11.1 — default aggregator-page URL filter
Reject mdBook /print.html and Hugo /_print/ pages during crawl-time URL filtering. These single-render-of-whole-tree pages have artificially high keyword density and pollute embedding-based retrieval rankings.
Why
Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five other well-functioning competitors returned 0% /_print/ on kubernetes-docs.
These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.
What changed
- New default URL patterns rejected pre-fetch (saves crawl budget): `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`
- New kwarg `include_aggregator_pages: bool = False` on `crawl()` and both engine classes for offline-archive use cases.
- CLI flag `--include-aggregators` mirrors the kwarg.
- User-supplied `exclude_paths` and `include_paths` still apply independently: the aggregator filter composes with both and replaces neither.
Substring-match safety
Patterns are anchored to avoid over-matching:
| URL | Behavior |
|---|---|
| `/book/print.html` | rejected (mdBook) |
| `/blueprint.html` | passes (print is mid-word) |
| `/preprint.html` | passes (academic content) |
| `/imprint/` | passes (legal page) |
| `/_print/index.html` | rejected (Hugo) |
| `/_printer-friendly/css.css` | passes (asset path) |
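A minimal sketch of how the anchored patterns behave, assuming plain glob matching against the URL path. The function name `is_aggregator_url` is hypothetical and markcrawl's internal matcher may differ; only the pattern list comes from the notes above.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Default patterns as listed in the release notes.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def is_aggregator_url(url: str) -> bool:
    """Return True when the URL path matches one of the aggregator patterns."""
    path = urlsplit(url).path
    # Every pattern requires a "/" right before "print" or "_print",
    # so mid-word hits such as "/blueprint.html" do not match.
    return any(fnmatchcase(path, pattern) for pattern in AGGREGATOR_PATTERNS)
```

Because each pattern carries its own leading slash, the check stays a cheap string match and never needs to fetch the page.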
Expected impact
Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).
Tests
647 passing (was 611); 36 new tests in tests/test_v011_1_aggregator_filter.py covering default rejection, substring safety, opt-out flag, composition with user filters, and CrawlEngine + AsyncCrawlEngine parity.
Migration
No breaking changes. Default behavior unchanged on sites that don't generate aggregator pages. For users archiving offline docs that include print views, pass include_aggregator_pages=True or --include-aggregators.
v0.11.0 — binary downloads + filters
Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":
markcrawl/binaries.py — streaming binary downloads
New crawl(..., download_types=["pdf","docx"], ...) opt-in kwarg:
- Streaming with size cap: `stream=True` / `aiter_bytes()` with per-chunk accumulation. Never buffers the full body. Default 25 MB per-file cap, 200 file count cap.
- Atomic write via `.tmp` + `os.replace`. Partial files unlinked on cap-exceed.
- Content-type validated BEFORE writing bytes: a `.pdf` URL serving `text/html` (login wall, marketing splash) is dropped immediately.
- JSONL row gains `downloads` field when a page's binaries were downloaded: `[{url, path, size_bytes, content_type}, ...]`. Field omitted when empty (backward compat).
- Sitemap entries route to download queue when they match `download_types` (symmetry with link discovery).
- All v0.10.x safety nets (`respect_robots`, `idle_timeout_s`, `include_subdomains`) apply uniformly to downloads.
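The size cap, content-type check and atomic write can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `markcrawl/binaries.py`: `save_stream` is a hypothetical name, and the chunks are passed in as a plain iterable so the example runs without a network.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

MAX_BYTES = 25 * 1024 * 1024  # 25 MB per-file cap, per the notes above


def save_stream(chunks: Iterable[bytes], dest: Path, content_type: str,
                allowed_types: frozenset[str],
                max_bytes: int = MAX_BYTES) -> Optional[int]:
    """Write chunks to dest atomically; return bytes written, or None if dropped."""
    # Validate the content type before any byte reaches disk.
    if content_type.split(";")[0].strip().lower() not in allowed_types:
        return None

    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    with open(tmp, "wb") as fh:
        for chunk in chunks:
            written += len(chunk)
            if written > max_bytes:
                break  # stop reading as soon as the cap is exceeded
            fh.write(chunk)

    if written > max_bytes:
        tmp.unlink()  # remove the partial file
        return None

    os.replace(tmp, dest)  # rename over the final path in one step
    return written
```

Writing to a sibling `.tmp` file first means a reader of `dest` sees either the complete file or nothing.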
markcrawl/filters.py — reusable pre-fetch filters
    from markcrawl import crawl
    from markcrawl.filters import is_likely_resume

    result = crawl(
        base_url="https://example.com/templates",
        out_dir="./resumes",
        download_types=["pdf", "docx"],
        download_filter=is_likely_resume,
    )
    print(f"Saved {result.downloads_count} files")

- `DownloadCandidate(url, anchor_text, parent_url, parent_title, extension)`: pre-fetch context passed to filters.
- `is_likely_resume` / `is_likely_paper` / `exclude_legal_boilerplate`: reusable URL+anchor heuristics. Best-effort, not classifiers.
- Filters run pre-fetch: rejected URLs never get fetched, zero HTTP bytes transferred.
- Compose via `lambda c: positive(c) and exclude_legal_boilerplate(c)`.
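To make the composition idea concrete, here is a self-contained sketch that does not import markcrawl. `Candidate` is a stand-in carrying the field names listed above, and `looks_like_resume`, `not_legal` and `all_of` are hypothetical helpers; the library's real heuristics are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    """Stand-in for DownloadCandidate, using the field names from the notes."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[Candidate], bool]


def looks_like_resume(c: Candidate) -> bool:
    # Hypothetical heuristic: keyword in the URL or the anchor text.
    text = f"{c.url} {c.anchor_text}".lower()
    return any(word in text for word in ("resume", "curriculum"))


def not_legal(c: Candidate) -> bool:
    # Hypothetical heuristic: drop obvious legal boilerplate links.
    text = f"{c.url} {c.anchor_text}".lower()
    return not any(word in text for word in ("privacy", "terms", "cookie"))


def all_of(*filters: Filter) -> Filter:
    """Combine filters so a candidate must pass every one of them."""
    return lambda c: all(f(c) for f in filters)


download_filter = all_of(looks_like_resume, not_legal)
```

A filter is just a predicate over the pre-fetch context, so combining two of them needs nothing more than boolean logic.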
New CrawlResult fields
- `downloads_count: int`: files saved
- `downloads_bytes: int`: total bytes saved
- `downloads_size_skipped: List[str]`: URLs that exceeded the size cap
- `downloads_type_skipped: List[str]`: URLs whose content-type didn't match
Migration
No breaking changes. Default download_types=None preserves v0.10.6 behavior exactly.
Deferred
- Live-network smoke harness case for an ATS-template aggregator → v0.11.1.
- Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with `pypdf` / `python-docx` / `mammoth` / `unstructured` downstream of saved files.
Tests
611 passing (was 566 on v0.10.6; +45 in tests/test_v011_binary_downloads.py). Spec specs/binary-downloads.md confidence-reviewed; all SC/DS rated ≥ 90% before implementation began.
v0.10.6 — opt-in respect_robots flag
New crawl(..., respect_robots: bool = True) — default unchanged (robots.txt Disallow rules honored). Setting respect_robots=False bypasses Disallow but still honors Crawl-delay (politeness preserved). Caller takes responsibility for legality, ethics, and downstream consequences.
Why
robots.txt is the only widely-deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.
Three guardrails
- Loud, non-silenceable warning at engine setup when bypass is active, emitted through both the progress callback and Python `logger.warning`. No env-var or CLI override; the choice must be made deliberately in code.
- `CrawlResult.robots_respected: bool`: mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
- `CrawlResult.robots_bypassed_count: int`: count of unique URLs that robots.txt Disallowed but that were fetched anyway. Always 0 when `robots_respected` is True. Lets you see the actual impact of the override; small numbers mean robots wasn't constraining you.
When bypass was active, the end-of-crawl summary reports either "had no effect this run" (count=0) or "fetched N URL(s) that robots.txt Disallowed" (count>0).
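As an illustration of what the bypass counter measures, the sketch below uses the standard library's `urllib.robotparser` on an inline robots.txt. `audit_bypass` and the sample rules are hypothetical; markcrawl's own bookkeeping is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for the example only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def audit_bypass(robots_txt: str, user_agent: str, fetched_urls: list[str]):
    """Return (bypassed_count, crawl_delay) for URLs fetched with the bypass on."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Count unique URLs that robots.txt would have disallowed.
    bypassed = {u for u in fetched_urls if not parser.can_fetch(user_agent, u)}
    # Crawl-delay is still read, since politeness is honored either way.
    return len(bypassed), parser.crawl_delay(user_agent)
```

A count of zero would mean the override changed nothing for that crawl.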
What stays unchanged
- Default behavior: robots.txt Disallow rules honored.
- Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.
Migration
No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:
- Your own site (forgotten or misconfigured robots.txt)
- Authorized pen-testing engagements
- Internal / intranet documentation you own
- RAG ingestion of docs that the site owner explicitly wants ingested, where the owner forgot to whitelist your UA
566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).
v0.10.5 — adaptive scope broadening
When a crawl exhausts its narrow auto-derived scope (e.g. /docs/concepts/* from a kubernetes seed) with budget remaining, the engine now attempts one-level broadening (/docs/concepts/* → /docs/*) before terminating. URLs filtered under the previous scope are stashed during link discovery and replayed through the broader scope.
Empirical verification (real network, max_pages=400)
| Site | v0.10.4 | v0.10.5 | Delta |
|---|---|---|---|
| kubernetes-docs | 195/400 | 400/400 | +105% |
| rust-book | 111 | 111 | unchanged (guardrail held) |
| postgres-docs | 80 | 80 | unchanged |
| newegg | 1 | 1 | unchanged (engine handles WAF gracefully) |
Rust-book is deliberately unchanged: its Tier 0 single-segment scope /book/* cannot broaden short of whole-host, which the guardrail blocks. We don't auto-pull /std/, /cargo/, /nomicon/ even though crawl4ai-raw does — those are different publications, and our scope honors the seed's intent.
Guardrails
Broadening fires only when:
- Scope was auto-derived. User-explicit `include_paths` is respected as intent and never mutated.
- Current scope's leftmost segment is in `_DOCS_HUB_MARKERS` (`docs`, `book`, `learn`, `tutorial`, `guide`, `reference`, `manual`, `handbook`, `api`, etc.) or the site classifies as `docs` / `apiref` by hostname.
- One-level broadening doesn't land at whole-host (`/*`).
- Cap of `_DEFAULT_MAX_BROADEN_EVENTS = 2` per crawl.
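The guardrails above can be sketched as a single function. This is a simplified illustration: `broaden_once` is a hypothetical name, the marker set is only the subset spelled out in the notes, and the hostname-based `docs` / `apiref` classification is left out.

```python
from typing import Optional

# Subset of the hub markers named above; the real list is longer.
DOCS_HUB_MARKERS = {"docs", "book", "learn", "tutorial", "guide",
                    "reference", "manual", "handbook", "api"}
MAX_BROADEN_EVENTS = 2


def broaden_once(scope: str, auto_derived: bool, events_so_far: int) -> Optional[str]:
    """Return the one-level-broader scope, or None when a guardrail blocks it."""
    # User-supplied scopes are never mutated, and broadening is capped.
    if not auto_derived or events_so_far >= MAX_BROADEN_EVENTS:
        return None
    segments = [s for s in scope.removesuffix("/*").split("/") if s]
    # Only scopes rooted at a docs-hub marker qualify.
    if not segments or segments[0] not in DOCS_HUB_MARKERS:
        return None
    parent = segments[:-1]
    if not parent:
        return None  # dropping the last segment would land at whole-host "/*"
    return "/" + "/".join(parent) + "/*"
```

Under these rules `/docs/concepts/*` widens to `/docs/*`, while `/book/*` stays put because its only broader scope is the whole host.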
API additions (additive only)
CrawlResult.scope_history: List[List[str]]— sequence of include_paths patterns the crawl traversed. Auditable. Empty if no scope was set.
Migration
No breaking changes. Behavior preserved exactly when the user passes include_paths explicitly. For default crawls on docs sites, expect more pages and the same (or better) signal-to-noise — the broadening guardrail is intentionally tight (docs hub markers only, no whole-host fallback).
549 tests passing (was 528 on v0.10.4; +21 in tests/test_v0105_adaptive_scope.py).
v0.10.4 — idle-timeout reset signal fix + release smoke harness
The v0.10.3 idle-timeout reset only on save_page, which mis-fired on bursty crawls where the engine was successfully fetching pages but most were getting deduped or were under min_words. The public benchmark surfaced this on huggingface-transformers (21/200 pages saved before the timer fired at 120 s).
Fix
The idle-timeout clock now resets on any meaningful progress event:
- `save_page` (already in v0.10.3)
- successful HTTP 2xx response
- `discover_links` call that adds at least one new URL to the queue
4xx / 5xx responses do not reset the clock — anti-bot loops still get caught.
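A minimal sketch of the reset rules, assuming a watchdog object with an injectable clock. The class and method names are hypothetical; only the three reset events, the 120 s default and the "0 disables" convention come from these notes.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Tracks time since the last meaningful progress event."""

    def __init__(self, timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_progress = clock()

    def _reset(self) -> None:
        self._last_progress = self._clock()

    def on_page_saved(self) -> None:
        self._reset()

    def on_response(self, status: int) -> None:
        # Only 2xx counts as progress; 4xx / 5xx leave the clock running.
        if 200 <= status < 300:
            self._reset()

    def on_links_discovered(self, new_urls: int) -> None:
        # Discovery counts only when it actually grows the queue.
        if new_urls > 0:
            self._reset()

    def stalled(self) -> bool:
        if self.timeout_s <= 0:  # a timeout of 0 disables the watchdog
            return False
        return self._clock() - self._last_progress >= self.timeout_s
```

Passing the clock in makes the behavior easy to check without waiting two minutes.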
Empirical verification
A fresh crawl of huggingface-transformers at max_pages=200:
| version | pages saved | elapsed |
|---|---|---|
| v0.10.3 | 21 | 120 s (timer fired early) |
| v0.10.4 | 174 | 236 s (graceful exit) |
8x improvement on the bursty-discovery case; idle timer now functions as a true deadlock detector, not a save-rate guard.
CrawlResult API additions (additive only)
- `first_status: Optional[int]`: first observable HTTP status. Lets callers distinguish engine bugs from external WAF/anti-bot blocks without scraping logs.
- `stalled: bool`: `True` when the run was terminated by the idle-timeout watchdog rather than running out of work or hitting `max_pages`.
Pre-release smoke harness
New bench/local_replica/release_smoke.py runs crawl() against ~4 real sites with per-site baselines. Treats first_status >= 400 + 0 pages as BLOCKED (skip, not fail) so transient WAF blocks don't false-alarm. Catches stall-detection regressions, coverage regressions, and anti-bot diagnostic regressions in 5-10 min.
Migration
No breaking changes. Users who set MARKCRAWL_IDLE_TIMEOUT_S=300 to work around the v0.10.3 mis-fire can drop the override — 120 s is correct again.
528 tests passing (was 521 on v0.10.3; +7 covering the new reset paths).
v0.10.3 — benchmark resilience fixes
Three generalizable resilience fixes surfaced by the public llm-crawler-benchmarks v1.3 cycle. All site-agnostic — none reference the sites or site classes that surfaced them.
Fixes
Partial-write recovery. pages.jsonl is now line-buffered (buffering=1) and save_page flushes after every row. SIGKILL / external watchdog termination no longer leaves an empty JSONL on disk; OS page cache holds all written rows.
Discovery-exhaustion stall detection (idle_timeout_s). Engine tracks _last_save_time and terminates gracefully when no new page has been saved for idle_timeout_s seconds (default 120). Catches link-graph churn after reachable pages exhaust without site-specific heuristics.
0-page diagnostic. Engine captures the first observed HTTP status. On crawls that finish with pages_saved == 0, logs a class-aware warning: 4xx/5xx → likely anti-bot block, 200 → likely min_words too high or JS-rendered, no response → seed unreachable / DNS error.
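The class-aware warning amounts to a small mapping from the first observed status to a likely cause. The sketch below is an assumption about its shape; `diagnose_zero_pages` and the fallback message are hypothetical, while the three diagnoses are taken from the paragraph above.

```python
from typing import Optional


def diagnose_zero_pages(first_status: Optional[int]) -> str:
    """Map the first observed HTTP status to a likely cause of a 0-page crawl."""
    if first_status is None:
        return "seed unreachable or DNS error"
    if first_status >= 400:
        return "likely anti-bot block"
    if first_status == 200:
        return "likely min_words too high or JS-rendered"
    # Other statuses are not covered by the notes; this branch is a placeholder.
    return "no diagnosis for this status"
```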
API additions (additive only)
- `crawl(..., idle_timeout_s: Optional[float] = None)`
- `CrawlEngine` / `AsyncCrawlEngine` accept `idle_timeout_s` kwarg
- `MARKCRAWL_IDLE_TIMEOUT_S` env var
- `DEFAULT_IDLE_TIMEOUT_S = 120.0` module constant
- Set `idle_timeout_s=0` (or env to `0`) to disable
Verification
- 521 tests passing (was 500 on v0.10.2; +21 in `tests/test_v0103_resilience.py`)
- Ruff lint clean
- All 4 Python versions (3.10–3.13) green on CI
Migration
No breaking changes. Default idle_timeout_s=120 is generous and only fires on genuine stalls. Users running long-blocked crawls intentionally (e.g. waiting on slow renders) can pass idle_timeout_s=0.
See CHANGELOG.md for full details.
v0.10.2 — Sitemap pre-enumeration deadline · fixes retailer-index timeouts
tl;dr
Patch release fixing a regression surfaced by llm-crawler-benchmarks against v0.10.1: pathological sitemap-indexes (ikea: 2,113 locale shards) consumed 200+ s in pre-enumeration before any page got crawled, tripping benchmark zero-output watchdogs (120 s).
The sitemap-discovery phase now has a 60 s wallclock budget shared across all top-level sitemaps + their recursive children. When the budget fires, the parser returns whatever URLs it has collected so far and the crawl proceeds normally.
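The shared-budget idea can be sketched with an in-memory sitemap tree in place of real HTTP fetches. `collect_urls` and `SitemapTree` are hypothetical names for this example and the real parser in `markcrawl.robots` is not reproduced; the point is that one deadline is computed up front and checked at every level of recursion.

```python
import time
from typing import Callable, Dict, List, Tuple

# In-memory stand-in for fetched sitemaps: name -> (child sitemaps, page URLs).
SitemapTree = Dict[str, Tuple[List[str], List[str]]]


def collect_urls(tree: SitemapTree, roots: List[str],
                 time_budget_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Walk sitemaps depth-first, returning early once the shared budget is spent."""
    deadline = clock() + time_budget_s  # one deadline for every root and child
    found: List[str] = []

    def visit(name: str) -> None:
        if clock() >= deadline:
            return  # budget fired: keep whatever has been collected so far
        children, urls = tree.get(name, ([], []))
        found.extend(urls)
        for child in children:
            visit(child)

    for root in roots:
        visit(root)
    return found
```

Because the result list is filled incrementally, an expired budget still yields a usable partial URL set.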
Verified locally on the failing sites
| Site | v0.10.1 | v0.10.2 |
|---|---|---|
| ikea (max_pages=30) | 0 pages (heartbeat fired) | 30 pages saved in 49.7 s |
| huggingface-transformers | regression on bench CI | 30 pages saved in 36.2 s |
What changed
- `markcrawl.robots.parse_sitemap_xml` and `parse_sitemap_xml_async`: new `time_budget_s` kwarg (default `60.0`), threaded through recursion via the internal `_deadline`. Async path switches from `asyncio.gather` to `asyncio.as_completed` so pending child-sitemap tasks are cancelled rather than awaited once the budget fires.
- `markcrawl.core`: both sync and async crawl paths instantiate a shared deadline at the start of sitemap discovery.
- 2 new tests in `tests/test_sitemap_parallel.py` covering the short-circuit and the no-op default.
- 500 tests passing (was 498).
Compatibility
No CLI flag changes. No behavior change for sites with normal sitemaps (which finish in <10 s anyway). Only the pathological-index path is affected.
For benchmark integrators
pip install --upgrade markcrawl==0.10.2 and re-run the previously failing sites. Crawl wallclock for ikea drops from "timeout, 0 pages" to "max_pages saved within budget."
v0.10.1 — Local embedder is the default · zero-cost RAG
tl;dr
pip install markcrawl now ships a complete crawl-and-embed stack with zero API cost. The default embedder flips from OpenAI 3-small to the bake-off-winning local mixedbread-ai/mxbai-embed-large-v1. Combined with the v0.10.0 chunker work, v0.10.1 closes the leaderboard story:
| Metric (vs v0.9.9-rc1) | v0.10.1 default | Δ |
|---|---|---|
| Mean MRR (11-site local pool) | 0.3859 | +0.040 (+11.5%) |
| Cost at 50M pages | $0 | −$10,152/yr |
| Chunks per page | 10.49 | −48% smaller index |
Multi-trial validated: +14% MRR on all-MiniLM-L6-v2 (6 trials, all positive) and +15% on OpenAI 3-small (3 trials, all positive) on the chunker change. The mxbai swap is MRR-neutral (Δ −0.018 within ±0.020 SC-B2 noise band) at $0/yr cost-at-scale.
What's new in 0.10.1
- `pip install markcrawl` now bundles the ML stack (torch + transformers + sentence-transformers + sentencepiece). The chunker's `chunk_semantic` and the new default embedder work out of the box.
- Default embedder = `mixedbread-ai/mxbai-embed-large-v1` (local, zero API cost). Replaces the previous OpenAI 3-small default.
- `markcrawl[ml]` kept as a no-op alias, so existing install commands keep working.
- Override paths: `MARKCRAWL_EMBEDDER=text-embedding-3-small` env var, or `embedding_model="..."` / `embedder=...` kwargs on `upload(...)`.
Lean install (no ML deps)
    pip install --no-deps markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity
    # Then either set OPENAI_API_KEY for the OpenAI fallback, or skip embedding entirely.

Migration
Default kwargs to `upload(...)` now produce mxbai-embedded rows automatically; callers simply stop being charged for OpenAI. To stay on OpenAI explicitly:

    from markcrawl.upload import upload

    upload(jsonl_path=..., supabase_url=..., supabase_key=...,
           embedding_model="text-embedding-3-small")

Or set `MARKCRAWL_EMBEDDER=text-embedding-3-small` in your environment.
Reports
- `bench/local_replica/v010_release_report.md`: full v0.10 release report.
- `bench/local_replica/track_b_report.md`: embedder bake-off (4 of 5 candidates run on the canonical 11-site pool).
- `bench/local_replica/track_d_report.md`: chunker sweep (56 configs).
v0.10.0 — Chunker MRR lift + embedder bake-off + retry resilience
Highlights
| Metric | v0.9.9-rc1 (baseline) | v0.10.0 |
|---|---|---|
| Mean MRR (local pool) | 0.3461 | 0.3859 (+11.5%) |
| Cost at 50M pages | $10,152 | $0 with markcrawl[ml] / $5,246 default |
| Chunks per page | 20.3 | 10.49 (-48% smaller index) |
| Tests | 350 | 493 passing |
What's in this release
- Chunker defaults flipped (Track D). `chunk_markdown` now defaults to `min_words=250`, `section_overlap_words=40`, `strip_markdown_links=True`. Multi-trial validated: +14% MRR on st-mini (6 trials, all positive), +15% on OpenAI 3-small (3 trials, all positive). Backward-compatible: pass legacy values explicitly to opt out.
- Embedder abstraction (Track B). New `markcrawl/embedder.py` ships with `Embedder` ABC + `OpenAIEmbedder` + `LocalSentenceTransformerEmbedder` (with model-specific instruction prefixes for asymmetric retrieval). Bake-off winner is `mixedbread-ai/mxbai-embed-large-v1`, which passes SC-B1 ($0 cost) + SC-B2 (Δ −0.018 MRR within ±0.020 band).
- Rerank infrastructure (Track A). New `markcrawl/retrieval.py` with `CrossEncoderReranker`. Failed the +0.030 MRR bar on this distribution (regressed −0.013); ships opt-in only.
- Ecom diagnosis (Track C). Documented that newegg/ikea failures are crawl-discovery, not extraction. Per-site `auto_path_scope` / `auto_path_priority` / `use_sitemap` overrides plumbed in the local-replica harness; SC-C1 closure deferred to v0.11 (sitemap-first + UA rotation).
- Retry resilience. Tenacity-backed retry, binary-parity CI smoke tests, structured exhaustion logs.
Per-category MRR (11-site canonical pool)
| Category | v0.9.9 | v0.10.0 | Δ |
|---|---|---|---|
| framework_docs | 0.3641 | 0.3771 | +0.0130 |
| api_docs | 0.4840 | 0.5414 | +0.0574 |
| reference | 0.3750 | 0.3750 | 0.0000 |
| tutorial | 0.4605 | 0.7000 | +0.2395 |
| ecommerce | 0.0625 | 0.0625 | 0.0000 |
| blog | 0.5000 | 0.5000 | 0.0000 |
| news | 0.1667 | 0.1667 | 0.0000 |
No category regresses. The largest per-site regression (huggingface-transformers −0.083) is offset within the framework_docs category by react-dev's gain.
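For readers unfamiliar with the metric, the sketch below shows the standard mean reciprocal rank calculation. It assumes one rank per query (the 1-based position of the first relevant hit, or `None` for a miss); the benchmark's exact cutoff and query set are not specified in these notes, so treat this as the textbook definition only.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """ranks[i] is the 1-based rank of the first relevant hit, or None for a miss."""
    if not ranks:
        return 0.0
    # A miss contributes 0 to the sum but still counts in the denominator.
    return sum(1.0 / r for r in ranks if r) / len(ranks)


def relative_lift(baseline: float, new: float) -> float:
    """Relative change of new over baseline, e.g. 0.115 for +11.5%."""
    return (new - baseline) / baseline
```

With the headline figures from the Highlights table, `relative_lift(0.3461, 0.3859)` gives roughly 0.115, matching the reported +11.5%.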
Known v0.11 work
- Default flip for mxbai embedder. Extras-aware factory: pick mxbai when `markcrawl[ml]` is installed, fall back to OpenAI 3-small. Captures the −$10K/yr cost reduction in production callers.
- Sitemap-first ecom discovery. Closes SC-C1 (ikea reaches its canonical products; newegg avoids anti-bot via different URL queue).
- Newegg anti-bot mitigation. UA rotation pool + retry-with-jitter for ecom-class sites.
Reports
- `bench/local_replica/v010_release_report.md`: authoritative release report with side-by-side comparison.
- `bench/local_replica/track_d_report.md`: chunker sweep methodology (56 configs).
- `bench/local_replica/track_a_report.md`: rerank failure analysis.
- `bench/local_replica/track_b_report.md`: embedder bake-off (4 of 5 candidates run; nomic aborted).
- `bench/local_replica/track_c_report.md`: ecom discovery vs extraction diagnosis.
- `specs/v010-leaderboard-sweep.md`: campaign spec with per-track status markers.
v0.9.3 — ecommerce category-marker scope
Generic URL-convention fix for ecommerce sites where the seed URL passes through a category-index segment but target items live at sibling paths. Same kind of rule as the v0.9.2 `/wiki/` article-container check.
Validated on the 4 sites used by the public llm-crawler-benchmarks rotation
| Site | Public v0.9.1 | v0.9.2 | v0.9.3 | Notes |
|---|---|---|---|---|
| mdn-css | 0.125 | 0.5625 | 0.5625 | unchanged from v0.9.2 |
| kubernetes-docs | 0.542 | 0.9062 | 0.9062 | unchanged from v0.9.2 |
| huggingface-transformers | 0.000 | 0.3438 | 0.3438 | unchanged from v0.9.2 |
| ikea | 0.375 | 0.0000 | 0.1250 | recovered from v0.9.2 over-tight scope |
| AVG MRR | 0.260 | 0.4531 | 0.4844 | +0.224 over public (+86% relative) |
(Ikea remains below public's 0.375 due to long-tail variance — a 200-page random sample of thousands of products will hit different specific named items each run. The scope fix itself works as designed.)
What's new
auto_path_scope now detects ecommerce category-index markers
When the seed URL passes through a /<marker>/ segment that's a known ecommerce-platform URL convention, the segments BEFORE the marker become the scope. This is generic — applies to any site adopting the convention, not domain-specific.
Markers detected: cat, category, categories, products, shop, collections.
Used by Shopify, WooCommerce, Magento, Salesforce Commerce Cloud defaults, plus IKEA, Etsy, BigCommerce, and many more.
| Seed URL | New v0.9.3 scope | Why |
|---|---|---|
| `ikea.com/us/en/cat/furniture-fu001/` | `/us/en/*` | Products at `/us/en/p/*` are siblings |
| `myshop.com/store/collections/spring/products/x` | `/store/collections/spring/*` | Deepest marker wins; outer is parent grouping |
| `myshop.com/products/single-thing` | None | Marker at root → siblings span whole site |
| `mywp.com/blog/category/news/post-1` | `/blog/*` | News posts are at `/blog/<slug>` |
Multi-marker tiebreak: deepest wins
For URLs with nested markers (e.g. /store/collections/X/products/Y), the deepest marker is the leaf-level category — outer markers are parent groupings. Scope is anchored at the segments before the deepest marker.
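The marker rule and the deepest-wins tiebreak can be sketched as below. `ecommerce_scope` is a hypothetical name and the real logic inside `auto_path_scope` may handle more cases; the marker list and the expected scopes come from this section. Seed URLs need a scheme here so that `urlsplit` separates the host from the path.

```python
from typing import Optional
from urllib.parse import urlsplit

# Markers listed in the release notes.
ECOM_MARKERS = {"cat", "category", "categories", "products", "shop", "collections"}


def ecommerce_scope(seed_url: str) -> Optional[str]:
    """Return the scope pattern anchored before the deepest marker, or None."""
    segments = [s for s in urlsplit(seed_url).path.split("/") if s]
    positions = [i for i, s in enumerate(segments) if s.lower() in ECOM_MARKERS]
    if not positions:
        return None  # no marker in the path: this rule does not apply
    deepest = positions[-1]
    if deepest == 0:
        return None  # marker at root: siblings span the whole site
    return "/" + "/".join(segments[:deepest]) + "/*"
```

Matching is case-insensitive on the segment, in line with the case-insensitivity tests mentioned for this release.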
Tests
324 passing (was 316 in v0.9.2; +8 new ecommerce-marker tests covering ikea-style /us/en/cat, Shopify /products & /collections, deepest-marker tiebreak, case-insensitivity).
Migration
No API changes. Behavior shift only on seeds passing through one of the marker words. If you were relying on a tight scope at the marker level, pass auto_path_scope=False and use include_paths explicitly.
Known limitations
- Long-tail product queries: with constrained `max_pages`, hitting specific named products on a large catalog (ikea MALM/SLATTUM, etc.) depends on which products end up in BFS order, which is irreducible variance. Larger `max_pages` reduces this.
- SPA sites: pages requiring JavaScript-rendered navigation still need explicit `--render-js`.
Install
pip install 'markcrawl[js]==0.9.3'