Releases · AIMLPM/markcrawl

12 May 04:49

AIMLPM

v0.11.1

a5e158b

v0.11.1 — default aggregator-page URL filter Latest


Reject mdBook /print.html and Hugo /_print/ pages during crawl-time URL filtering. These single-render-of-whole-tree pages have artificially high keyword density and pollute embedding-based retrieval rankings.

Why

Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five other well-functioning competitors returned 0% /_print/ on kubernetes-docs.

These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

What changed

New default URL patterns rejected pre-fetch (saves crawl budget):
*/print.html, */_print, */_print/, */_print/*, */print/index.html
New kwarg include_aggregator_pages: bool = False on crawl() and both engine classes for offline-archive use cases.
The CLI flag --include-aggregators mirrors the kwarg.
User-supplied exclude_paths and include_paths still apply independently; the aggregator filter composes with both and replaces neither.

Substring-match safety

Patterns are anchored to avoid over-matching:

| URL | Behavior |
| --- | --- |
| `/book/print.html` | rejected (mdBook) |
| `/blueprint.html` | passes (`print` is mid-word) |
| `/preprint.html` | passes (academic content) |
| `/imprint/` | passes (legal page) |
| `/_print/index.html` | rejected (Hugo) |
| `/_printer-friendly/css.css` | passes (asset path) |
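The anchoring behavior in the table can be reproduced with plain glob matching. The sketch below is an illustration only: the pattern strings come from the notes above, but the use of `fnmatch` and the helper name `is_aggregator_path` are assumptions, not markcrawl's actual implementation.

```python
from fnmatch import fnmatchcase

# Pattern strings copied from the release notes; the matching logic is assumed.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def is_aggregator_path(path: str) -> bool:
    """Return True when a URL path matches any default aggregator pattern."""
    # Each pattern begins with "*/", so the matched name must start right
    # after a slash and can never sit in the middle of a longer word.
    return any(fnmatchcase(path, pattern) for pattern in AGGREGATOR_PATTERNS)


for path in ("/book/print.html", "/blueprint.html", "/_print/index.html"):
    print(path, "rejected" if is_aggregator_path(path) else "passes")
```

Because the slash is part of every pattern, `/blueprint.html` and `/imprint/` fall through without any special-case exclusions.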

Expected impact

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).
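The notes quote MRR throughout without defining it. Assuming the benchmark uses the standard mean reciprocal rank (the average of 1/rank of the first relevant result per query, with misses scoring 0), a minimal sketch looks like this. The function name and the sample ranks are hypothetical.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over queries; a query with no relevant hit scores 0."""
    if not first_hit_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_hit_ranks if rank is not None)
    return total / len(first_hit_ranks)


# Four hypothetical queries: first relevant hit at rank 1, 2, and 4, plus one miss.
print(mean_reciprocal_rank([1, 2, 4, None]))
```

Under this definition, an aggregator page that pushes the wanted chapter from rank 1 to rank 2 halves that query's contribution.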

Tests

647 passing (was 611); 36 new tests in tests/test_v011_1_aggregator_filter.py covering default rejection, substring safety, opt-out flag, composition with user filters, and CrawlEngine + AsyncCrawlEngine parity.

Migration

No breaking changes. Default behavior unchanged on sites that don't generate aggregator pages. For users archiving offline docs that include print views, pass include_aggregator_pages=True or --include-aggregators.


06 May 10:32

AIMLPM

v0.11.0

cd26ee1

v0.11.0 — binary downloads + filters

Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":

`markcrawl/binaries.py` — streaming binary downloads

New crawl(..., download_types=["pdf","docx"], ...) opt-in kwarg:

Streaming with size cap — stream=True / aiter_bytes() with per-chunk accumulation. Never buffers the full body. Default 25 MB per-file cap, 200 file count cap.
Atomic write via .tmp + os.replace. Partial files unlinked on cap-exceed.
Content-type validated BEFORE writing bytes — a .pdf URL serving text/html (login wall, marketing splash) is dropped immediately.
JSONL row gains downloads field when a page's binaries were downloaded: [{url, path, size_bytes, content_type}, ...]. Field omitted when empty (backward compat).
Sitemap entries route to download queue when they match download_types (symmetry with link discovery).
All v0.10.x safety nets (respect_robots, idle_timeout_s, include_subdomains) apply uniformly to downloads.
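The combination of content-type validation, a streaming size cap, and an atomic `.tmp` + `os.replace` write can be sketched with the standard library alone. This is not the code in `markcrawl/binaries.py`; `save_stream`, its parameters, and the chunk iterator standing in for an HTTP body are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional

MAX_BYTES = 25 * 1024 * 1024  # the 25 MB per-file cap quoted in the notes


def save_stream(chunks: Iterable[bytes], content_type: str, dest: Path,
                allowed_types: frozenset,
                max_bytes: int = MAX_BYTES) -> Optional[int]:
    """Write chunks to dest atomically; return bytes written, or None if dropped."""
    # Validate the content type before any byte is written to disk.
    if content_type.split(";")[0].strip().lower() not in allowed_types:
        return None
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    raise OverflowError("size cap exceeded")
                fh.write(chunk)  # only one chunk is held in memory at a time
        os.replace(tmp, dest)  # atomic rename within one filesystem
        return written
    except OverflowError:
        tmp.unlink(missing_ok=True)  # leave no partial file behind
        return None


with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "paper.pdf"
    size = save_stream([b"%PDF-", b"1.7"], "application/pdf", target,
                       frozenset({"application/pdf"}))
    print(size, target.exists())
```

Readers of the output directory only ever see complete files, because the final name appears only after the last chunk has been written.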

`markcrawl/filters.py` — reusable pre-fetch filters

from markcrawl import crawl
from markcrawl.filters import is_likely_resume

result = crawl(
    base_url="https://example.com/templates",
    out_dir="./resumes",
    download_types=["pdf", "docx"],
    download_filter=is_likely_resume,
)
print(f"Saved {result.downloads_count} files")

DownloadCandidate(url, anchor_text, parent_url, parent_title, extension) — pre-fetch context passed to filters.
is_likely_resume / is_likely_paper / exclude_legal_boilerplate — reusable URL+anchor heuristics. Best-effort, not classifiers.
Filters run pre-fetch — rejected URLs never get fetched, zero HTTP bytes transferred.
Compose via lambda c: positive(c) and exclude_legal_boilerplate(c).
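To make the composition pattern concrete, here is a self-contained sketch. The `DownloadCandidate` dataclass below is a local stand-in that mirrors the field names listed above, and `mentions_resume` / `not_legal_boilerplate` are hypothetical heuristics written for this example; they are not the library's `is_likely_resume` or `exclude_legal_boilerplate`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DownloadCandidate:
    """Local stand-in with the fields named in the release notes."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[DownloadCandidate], bool]


def mentions_resume(c: DownloadCandidate) -> bool:
    """Hypothetical positive heuristic on URL plus anchor text."""
    text = f"{c.url} {c.anchor_text}".lower()
    return "resume" in text or "curriculum vitae" in text


def not_legal_boilerplate(c: DownloadCandidate) -> bool:
    """Hypothetical negative heuristic: drop obvious legal documents."""
    text = f"{c.url} {c.anchor_text}".lower()
    return not any(word in text for word in ("privacy", "terms", "license"))


# Composition as described in the notes: a positive filter AND an exclusion.
keep: Filter = lambda c: mentions_resume(c) and not_legal_boilerplate(c)

candidate = DownloadCandidate(
    url="https://example.com/files/resume-template.pdf",
    anchor_text="Download resume template",
    parent_url="https://example.com/templates",
    parent_title="Templates",
    extension="pdf",
)
print(keep(candidate))
```

Since the filter only sees URL and anchor context, it can run before any request is made, which is what keeps rejected candidates at zero bytes transferred.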

New `CrawlResult` fields

downloads_count: int — files saved
downloads_bytes: int — total bytes saved
downloads_size_skipped: List[str] — URLs that exceeded the size cap
downloads_type_skipped: List[str] — URLs whose content-type didn't match

Migration

No breaking changes. Default download_types=None preserves v0.10.6 behavior exactly.

Deferred

Live-network smoke harness case for an ATS-template aggregator → v0.11.1.
Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with pypdf / python-docx / mammoth / unstructured downstream of saved files.

Tests

611 passing (was 566 on v0.10.6; +45 in tests/test_v011_binary_downloads.py). Spec specs/binary-downloads.md confidence-reviewed; all SC/DS rated ≥ 90% before implementation began.


05 May 06:43

AIMLPM

v0.10.6

75ecb9d

v0.10.6 — opt-in respect_robots flag

New crawl(..., respect_robots: bool = True) — default unchanged (robots.txt Disallow rules honored). Setting respect_robots=False bypasses Disallow but still honors Crawl-delay (politeness preserved). Caller takes responsibility for legality, ethics, and downstream consequences.

Why

robots.txt is the only widely-deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.

Three guardrails

Loud, non-silenceable warning at engine setup when bypass is active — both progress callback and Python logger.warning. No env-var or CLI override; the choice must be made deliberately in code.
CrawlResult.robots_respected: bool — mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
CrawlResult.robots_bypassed_count: int, the count of unique URLs that robots.txt disallowed but that were fetched anyway. Always 0 when robots_respected is True. It shows the actual impact of the override: a small number means robots.txt was barely constraining the crawl.

When bypass was active, the end-of-crawl summary reports either "had no effect this run" (count=0) or "fetched N URL(s) that robots.txt Disallowed" (count>0).
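The split between Disallow (bypassable) and Crawl-delay (always honored) can be illustrated with the standard library's `urllib.robotparser`. This sketch does not use markcrawl's internals; `plan_fetches`, the `example-bot` user agent, and the inline robots.txt are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def plan_fetches(urls, respect_robots=True, user_agent="example-bot"):
    """Return (urls_to_fetch, crawl_delay, bypassed_count) for a URL list."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    delay = parser.crawl_delay(user_agent)  # read in both modes
    to_fetch, bypassed = [], set()
    for url in urls:
        if parser.can_fetch(user_agent, url):
            to_fetch.append(url)
        elif not respect_robots:
            to_fetch.append(url)
            bypassed.add(url)  # unique disallowed URLs fetched anyway
    return to_fetch, delay, len(bypassed)


urls = ["https://example.com/docs/", "https://example.com/private/a"]
print(plan_fetches(urls, respect_robots=True))
print(plan_fetches(urls, respect_robots=False))
```

The delay value is the same in both calls; only the set of fetched URLs and the bypass counter change when the flag is flipped.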

What stays unchanged

Default behavior: robots.txt Disallow rules honored.
Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.

Migration

No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:

Your own site (forgotten or misconfigured robots.txt)
Authorized pen-testing engagements
Internal / intranet documentation you own
RAG ingestion of docs that the site owner explicitly wants ingested, where the owner forgot to whitelist your UA

566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).


05 May 03:02

AIMLPM

v0.10.5

9ba5d2b

v0.10.5 — adaptive scope broadening

When a crawl exhausts its narrow auto-derived scope (e.g. /docs/concepts/* from a kubernetes seed) with budget remaining, the engine now attempts one-level broadening (/docs/concepts/* → /docs/*) before terminating. URLs filtered under the previous scope are stashed during link discovery and replayed through the broader scope.

Empirical verification (real network, max_pages=400)

| Site | v0.10.4 | v0.10.5 | Delta |
| --- | --- | --- | --- |
| kubernetes-docs | 195/400 | 400/400 | +105% |
| rust-book | 111 | 111 | unchanged (guardrail held) |
| postgres-docs | 80 | 80 | unchanged |
| newegg | 1 | 1 | unchanged (engine handles WAF gracefully) |

Rust-book is deliberately unchanged: its Tier 0 single-segment scope /book/* cannot broaden short of whole-host, which the guardrail blocks. We don't auto-pull /std/, /cargo/, /nomicon/ even though crawl4ai-raw does — those are different publications, and our scope honors the seed's intent.

Guardrails

Broadening fires only when:

Scope was auto-derived. User-explicit include_paths is respected as intent and never mutated.
Current scope's leftmost segment is in _DOCS_HUB_MARKERS (docs, book, learn, tutorial, guide, reference, manual, handbook, api, etc.) or the site classifies as docs / apiref by hostname.
One-level broadening doesn't land at whole-host (/*).
Cap of _DEFAULT_MAX_BROADEN_EVENTS = 2 per crawl.
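The path-based guardrails can be sketched as a small pure function. This is an approximation under stated assumptions: the marker set is only the subset named above, the hostname-based docs / apiref classification is left out, and `broaden_one_level` is a hypothetical name rather than the engine's own.

```python
from typing import Optional

# Subset of markers named in the notes; the real _DOCS_HUB_MARKERS may differ.
DOCS_HUB_MARKERS = frozenset({"docs", "book", "learn", "tutorial", "guide",
                              "reference", "manual", "handbook", "api"})
MAX_BROADEN_EVENTS = 2  # mirrors _DEFAULT_MAX_BROADEN_EVENTS from the notes


def broaden_one_level(pattern: str) -> Optional[str]:
    """Drop the last segment of a '/a/b/*' pattern, or return None if blocked."""
    segments = [s for s in pattern.strip("/").split("/") if s and s != "*"]
    if not segments or segments[0].lower() not in DOCS_HUB_MARKERS:
        return None  # leftmost segment is not a docs hub marker
    if len(segments) == 1:
        return None  # one more step would land at whole-host "/*"
    return "/" + "/".join(segments[:-1]) + "/*"


# Replay the broadening steps for an auto-derived scope, honoring the cap.
scopes = ["/docs/concepts/*"]
while len(scopes) - 1 < MAX_BROADEN_EVENTS:
    wider = broaden_one_level(scopes[-1])
    if wider is None:
        break
    scopes.append(wider)
print(scopes)
```

With these rules, `/book/*` returns `None`, which matches the rust-book row in the table above: a single-segment scope has nowhere to broaden except whole-host.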

API additions (additive only)

CrawlResult.scope_history: List[List[str]] — sequence of include_paths patterns the crawl traversed. Auditable. Empty if no scope was set.

Migration

No breaking changes. Behavior preserved exactly when the user passes include_paths explicitly. For default crawls on docs sites, expect more pages and the same (or better) signal-to-noise — the broadening guardrail is intentionally tight (docs hub markers only, no whole-host fallback).

549 tests passing (was 528 on v0.10.4; +21 in tests/test_v0105_adaptive_scope.py).


05 May 01:50

AIMLPM

v0.10.4

cb53648

v0.10.4 — idle-timeout reset signal fix + release smoke harness

In v0.10.3 the idle-timeout clock reset only on save_page, so it misfired on bursty crawls where the engine was successfully fetching pages but most were deduped or fell under min_words. The public benchmark surfaced this on huggingface-transformers (21/200 pages saved before the timer fired at 120 s).

Fix

The idle-timeout clock now resets on any meaningful progress event:

save_page (already in v0.10.3)
successful HTTP 2xx response
discover_links call that adds at least one new URL to the queue

4xx / 5xx responses do not reset the clock — anti-bot loops still get caught.
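The reset rules above amount to a small watchdog that tracks the time of the last progress event. The class below is a sketch, not the engine's code: `IdleWatchdog`, its method names, and the injectable clock are assumptions chosen to keep the example testable.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Reports a stall when no progress event arrives within timeout_s."""

    def __init__(self, timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_progress = clock()

    def _reset(self) -> None:
        self._last_progress = self._clock()

    def on_page_saved(self) -> None:
        self._reset()

    def on_response(self, status: int) -> None:
        # Only 2xx counts as progress; 4xx / 5xx leave the clock running.
        if 200 <= status < 300:
            self._reset()

    def on_links_discovered(self, new_url_count: int) -> None:
        # Discovery counts only when it adds at least one new URL.
        if new_url_count > 0:
            self._reset()

    def stalled(self) -> bool:
        # A timeout of 0 disables the watchdog, as in the v0.10.3 notes.
        if self.timeout_s <= 0:
            return False
        return self._clock() - self._last_progress >= self.timeout_s


fake_now = [0.0]
dog = IdleWatchdog(timeout_s=120.0, clock=lambda: fake_now[0])
fake_now[0] = 100.0
dog.on_response(403)  # anti-bot response: no reset
fake_now[0] = 121.0
print(dog.stalled())
```

Feeding a stream of 403 responses never resets the clock, so an anti-bot loop still trips the watchdog after the timeout.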

Empirical verification

A fresh crawl of huggingface-transformers at max_pages=200:

| version | pages saved | elapsed |
| --- | --- | --- |
| v0.10.3 | 21 | 120 s (timer fired early) |
| v0.10.4 | 174 | 236 s (graceful exit) |

8x improvement on the bursty-discovery case; idle timer now functions as a true deadlock detector, not a save-rate guard.

CrawlResult API additions (additive only)

first_status: Optional[int] — first observable HTTP status. Lets callers distinguish engine bugs from external WAF/anti-bot blocks without scraping logs.
stalled: bool — True when the run was terminated by the idle-timeout watchdog rather than running out of work or hitting max_pages.

Pre-release smoke harness

New bench/local_replica/release_smoke.py runs crawl() against ~4 real sites with per-site baselines. A run with first_status >= 400 and 0 pages saved is treated as BLOCKED (skip, not fail), so transient WAF blocks don't raise false alarms. The harness catches stall-detection regressions, coverage regressions, and anti-bot diagnostic regressions in 5-10 min.

Migration

No breaking changes. Users who set MARKCRAWL_IDLE_TIMEOUT_S=300 to work around the v0.10.3 mis-fire can drop the override — 120 s is correct again.

528 tests passing (was 521 on v0.10.3; +7 covering the new reset paths).


04 May 18:43

AIMLPM

v0.10.3

faff5c6

v0.10.3 — benchmark resilience fixes

Three generalizable resilience fixes surfaced by the public llm-crawler-benchmarks v1.3 cycle. All site-agnostic — none reference the sites or site classes that surfaced them.

Fixes

Partial-write recovery. pages.jsonl is now line-buffered (buffering=1) and save_page flushes after every row. SIGKILL / external watchdog termination no longer leaves an empty JSONL on disk; OS page cache holds all written rows.

Discovery-exhaustion stall detection (idle_timeout_s). Engine tracks _last_save_time and terminates gracefully when no new page has been saved for idle_timeout_s seconds (default 120). Catches link-graph churn after reachable pages exhaust without site-specific heuristics.

0-page diagnostic. Engine captures the first observed HTTP status. On crawls that finish with pages_saved == 0, logs a class-aware warning: 4xx/5xx → likely anti-bot block, 200 → likely min_words too high or JS-rendered, no response → seed unreachable / DNS error.
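The partial-write recovery fix above relies on two standard Python behaviors: `buffering=1` selects line buffering for text files, and `flush()` hands the row to the operating system. A minimal sketch, with the hypothetical helpers `append_row` and `write_and_peek` and a temporary directory standing in for the crawl's output folder:

```python
import json
import os
import tempfile


def append_row(fh, row: dict) -> None:
    """Write one JSONL row and push it out of Python's buffer immediately."""
    fh.write(json.dumps(row) + "\n")
    fh.flush()


def write_and_peek(row: dict) -> list:
    """Write a row, then read the file back while the writer is still open."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "pages.jsonl")
        writer = open(path, "w", buffering=1, encoding="utf-8")
        try:
            append_row(writer, row)
            # The writer has not been closed, yet the row is already visible.
            with open(path, encoding="utf-8") as reader:
                return [json.loads(line) for line in reader]
        finally:
            writer.close()


print(write_and_peek({"url": "https://example.com/a", "title": "A"}))
```

Note that `flush()` moves data into the OS page cache, which survives the process being killed; surviving a power loss would additionally need `os.fsync`, which the notes do not claim.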

API additions (additive only)

crawl(..., idle_timeout_s: Optional[float] = None)
CrawlEngine / AsyncCrawlEngine accept idle_timeout_s kwarg
MARKCRAWL_IDLE_TIMEOUT_S env var
DEFAULT_IDLE_TIMEOUT_S = 120.0 module constant
Set idle_timeout_s=0 (or env to 0) to disable

Verification

521 tests passing (was 500 on v0.10.2; +21 in tests/test_v0103_resilience.py)
Ruff lint clean
All 4 Python versions (3.10–3.13) green on CI

Migration

No breaking changes. Default idle_timeout_s=120 is generous and only fires on genuine stalls. Users running long-blocked crawls intentionally (e.g. waiting on slow renders) can pass idle_timeout_s=0.

See CHANGELOG.md for full details.


03 May 17:44

AIMLPM

v0.10.2

c8d55c7

v0.10.2 — Sitemap pre-enumeration deadline · fixes retailer-index timeouts

tl;dr

Patch release fixing a regression surfaced by llm-crawler-benchmarks against v0.10.1: pathological sitemap-indexes (ikea: 2,113 locale shards) consumed 200+ s in pre-enumeration before any page got crawled, tripping benchmark zero-output watchdogs (120 s).

The sitemap-discovery phase now has a 60 s wallclock budget shared across all top-level sitemaps + their recursive children. When the budget fires, the parser returns whatever URLs it has collected so far and the crawl proceeds normally.
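A shared wallclock budget of this kind is usually implemented by computing one absolute deadline up front and passing it down the recursion. The synchronous sketch below shows the idea only; `collect_urls`, the `fetch` callback (standing in for the HTTP request plus XML parsing), and the fake clock are assumptions, not the signatures in `markcrawl.robots`.

```python
import time
from typing import Callable, List, Tuple

# fetch(sitemap_url) -> (page_urls, child_sitemap_urls); hypothetical stand-in.
Fetch = Callable[[str], Tuple[List[str], List[str]]]


def collect_urls(sitemap: str, fetch: Fetch, deadline: float,
                 clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Walk a sitemap tree, returning partial results once the deadline passes."""
    if clock() >= deadline:
        return []  # budget spent: skip this branch, keep what was collected
    pages, children = fetch(sitemap)
    found = list(pages)
    for child in children:
        # The same absolute deadline is shared by every level of recursion.
        found.extend(collect_urls(child, fetch, deadline, clock))
    return found


# Fake sitemap index with three shards; each fetch "costs" 25 seconds.
tree = {
    "index": ([], ["shard-a", "shard-b", "shard-c"]),
    "shard-a": (["/a1"], []),
    "shard-b": (["/b1"], []),
    "shard-c": (["/c1"], []),
}
now = [0.0]


def fake_fetch(url: str) -> Tuple[List[str], List[str]]:
    now[0] += 25.0
    return tree[url]


print(collect_urls("index", fake_fetch, deadline=60.0, clock=lambda: now[0]))
```

In this run the third shard is never fetched because the clock has passed 60 s, but the URLs from the first two shards are still returned, so the crawl can proceed with what it has.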

Verified locally on the failing sites

| Site | v0.10.1 | v0.10.2 |
| --- | --- | --- |
| ikea (max_pages=30) | 0 pages (heartbeat fired) | 30 pages saved in 49.7 s |
| huggingface-transformers | regression on bench CI | 30 pages saved in 36.2 s |

What changed

markcrawl.robots.parse_sitemap_xml and parse_sitemap_xml_async: new time_budget_s kwarg (default 60.0), threaded through recursion via the internal _deadline. Async path switches from asyncio.gather to asyncio.as_completed so pending child-sitemap tasks are cancelled rather than awaited once the budget fires.
markcrawl.core: both sync and async crawl paths instantiate a shared deadline at the start of sitemap discovery.
2 new tests in tests/test_sitemap_parallel.py covering the short-circuit and the no-op default.
500 tests passing (was 498).

Compatibility

No CLI flag changes. No behavior change for sites with normal sitemaps (which finish in <10 s anyway). Only the pathological-index path is affected.

For benchmark integrators

pip install --upgrade markcrawl==0.10.2 and re-run the previously failing sites. Crawl wallclock for ikea drops from "timeout, 0 pages" to "max_pages saved within budget."


03 May 09:39

AIMLPM

v0.10.1

13fa2c2

v0.10.1 — Local embedder is the default · zero-cost RAG

tl;dr

pip install markcrawl now ships a complete crawl-and-embed stack with zero API cost. The default embedder flips from OpenAI 3-small to the bake-off-winning local mixedbread-ai/mxbai-embed-large-v1. Combined with the v0.10.0 chunker work, v0.10.1 closes the leaderboard story:

| Metric (vs v0.9.9-rc1) | v0.10.1 default | Δ |
| --- | --- | --- |
| Mean MRR (11-site local pool) | 0.3859 | +0.040 (+11.5%) |
| Cost at 50M pages | $0 | −$10,152/yr |
| Chunks per page | 10.49 | −48% smaller index |

Multi-trial validated: +14% MRR on all-MiniLM-L6-v2 (6 trials, all positive) and +15% on OpenAI 3-small (3 trials, all positive) on the chunker change. The mxbai swap is MRR-neutral (Δ −0.018 within ±0.020 SC-B2 noise band) at $0/yr cost-at-scale.

What's new in 0.10.1

pip install markcrawl now bundles the ML stack (torch + transformers + sentence-transformers + sentencepiece). The chunker's chunk_semantic and the new default embedder work out of the box.
Default embedder = mixedbread-ai/mxbai-embed-large-v1 (local, zero API cost). Replaces the previous OpenAI 3-small default.
markcrawl[ml] kept as a no-op alias — existing install commands keep working.
Override paths: MARKCRAWL_EMBEDDER=text-embedding-3-small env var, or embedding_model="..." / embedder=... kwargs on upload(...).

Lean install (no ML deps)

pip install --no-deps markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity
# Then either set OPENAI_API_KEY for the OpenAI fallback, or skip embedding entirely.

Migration

Default kwargs to upload(...) now produce mxbai-embedded rows automatically — callers simply stop being charged for OpenAI. To stay on OpenAI explicitly:

from markcrawl.upload import upload
upload(jsonl_path=..., supabase_url=..., supabase_key=...,
       embedding_model="text-embedding-3-small")

Or set MARKCRAWL_EMBEDDER=text-embedding-3-small in your environment.

Reports

bench/local_replica/v010_release_report.md — full v0.10 release report.
bench/local_replica/track_b_report.md — embedder bake-off (4 of 5 candidates run on the canonical 11-site pool).
bench/local_replica/track_d_report.md — chunker sweep (56 configs).


03 May 09:13

AIMLPM

v0.10.0

17175a8

v0.10.0 — Chunker MRR lift + embedder bake-off + retry resilience

Highlights

| Metric | v0.9.9-rc1 (baseline) | v0.10.0 |
| --- | --- | --- |
| Mean MRR (local pool) | 0.3461 | 0.3859 (+11.5%) |
| Cost at 50M pages | $10,152 | $0 with `markcrawl[ml]` / $5,246 default |
| Chunks per page | 20.3 | 10.49 (-48% smaller index) |
| Tests | 350 | 493 passing |

What's in this release

Chunker defaults flipped (Track D). chunk_markdown now defaults to min_words=250, section_overlap_words=40, strip_markdown_links=True. Multi-trial validated: +14% MRR on st-mini (6 trials, all positive), +15% on OpenAI 3-small (3 trials, all positive). Backward-compatible — pass legacy values explicitly to opt out.
Embedder abstraction (Track B). New markcrawl/embedder.py ships with Embedder ABC + OpenAIEmbedder + LocalSentenceTransformerEmbedder (with model-specific instruction prefixes for asymmetric retrieval). Bake-off winner is mixedbread-ai/mxbai-embed-large-v1 — passes SC-B1 ($0 cost) + SC-B2 (Δ −0.018 MRR within ±0.020 band).
Rerank infrastructure (Track A). New markcrawl/retrieval.py with CrossEncoderReranker. Failed the +0.030 MRR bar on this distribution (regressed −0.013); ships opt-in only.
Ecom diagnosis (Track C). Documented that newegg/ikea failures are crawl-discovery, not extraction. Per-site auto_path_scope / auto_path_priority / use_sitemap overrides plumbed in the local-replica harness; SC-C1 closure deferred to v0.11 (sitemap-first + UA rotation).
Retry resilience. Tenacity-backed retry, binary-parity CI smoke tests, structured exhaustion logs.

Per-category MRR (11-site canonical pool)

| Category | v0.9.9 | v0.10.0 | Δ |
| --- | --- | --- | --- |
| framework_docs | 0.3641 | 0.3771 | +0.0130 |
| api_docs | 0.4840 | 0.5414 | +0.0574 |
| reference | 0.3750 | 0.3750 | 0.0000 |
| tutorial | 0.4605 | 0.7000 | +0.2395 |
| ecommerce | 0.0625 | 0.0625 | 0.0000 |
| blog | 0.5000 | 0.5000 | 0.0000 |
| news | 0.1667 | 0.1667 | 0.0000 |

No category regresses. The largest per-site regression (huggingface-transformers −0.083) is offset within the framework_docs category by react-dev's gain.

Known v0.11 work

Default flip for mxbai embedder. Extras-aware factory: pick mxbai when markcrawl[ml] is installed, fall back to OpenAI 3-small. Captures the −$10K/yr cost reduction in production callers.
Sitemap-first ecom discovery. Closes SC-C1 (ikea reaches its canonical products; newegg avoids anti-bot via different URL queue).
Newegg anti-bot mitigation. UA rotation pool + retry-with-jitter for ecom-class sites.

Reports

bench/local_replica/v010_release_report.md — authoritative release report with side-by-side comparison.
bench/local_replica/track_d_report.md — chunker sweep methodology (56 configs).
bench/local_replica/track_a_report.md — rerank failure analysis.
bench/local_replica/track_b_report.md — embedder bake-off (4 of 5 candidates run; nomic aborted).
bench/local_replica/track_c_report.md — ecom discovery vs extraction diagnosis.
specs/v010-leaderboard-sweep.md — campaign spec with per-track status markers.


26 Apr 04:07

AIMLPM

v0.9.3

e2b6183

v0.9.3 — ecommerce category-marker scope

Generic URL-convention fix for ecommerce sites where the seed URL passes through a category-index segment but target items live at sibling paths. Same kind of rule as the v0.9.2 `/wiki/` article-container check.

Validated on the 4 sites used by the public llm-crawler-benchmarks rotation

| Site | Public v0.9.1 | v0.9.2 | v0.9.3 | Notes |
| --- | --- | --- | --- | --- |
| mdn-css | 0.125 | 0.5625 | 0.5625 | unchanged from v0.9.2 |
| kubernetes-docs | 0.542 | 0.9062 | 0.9062 | unchanged from v0.9.2 |
| huggingface-transformers | 0.000 | 0.3438 | 0.3438 | unchanged from v0.9.2 |
| ikea | 0.375 | 0.0000 | 0.1250 | recovered from v0.9.2 over-tight scope |
| AVG MRR | 0.260 | 0.4531 | 0.4844 | +0.224 over public (+86% relative) |

(Ikea remains below public's 0.375 due to long-tail variance — a 200-page random sample of thousands of products will hit different specific named items each run. The scope fix itself works as designed.)

What's new

`auto_path_scope` now detects ecommerce category-index markers

When the seed URL passes through a /<marker>/ segment that's a known ecommerce-platform URL convention, the segments BEFORE the marker become the scope. This is generic — applies to any site adopting the convention, not domain-specific.

Markers detected: cat, category, categories, products, shop, collections.

Used by Shopify, WooCommerce, Magento, Salesforce Commerce Cloud defaults, plus IKEA, Etsy, BigCommerce, and many more.

| Seed URL | New v0.9.3 scope | Why |
| --- | --- | --- |
| `ikea.com/us/en/cat/furniture-fu001/` | `/us/en/*` | Products at `/us/en/p/*` are siblings |
| `myshop.com/store/collections/spring/products/x` | `/store/collections/spring/*` | Deepest marker wins; outer is parent grouping |
| `myshop.com/products/single-thing` | None | Marker at root → siblings span whole site |
| `mywp.com/blog/category/news/post-1` | `/blog/*` | News posts are at `/blog/<slug>` |

Multi-marker tiebreak: deepest wins

For URLs with nested markers (e.g. /store/collections/X/products/Y), the deepest marker is the leaf-level category — outer markers are parent groupings. Scope is anchored at the segments before the deepest marker.
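The marker rule and the deepest-wins tiebreak can be written as a short function that reproduces the four rows of the table above. This is a sketch of the rule as described, not the body of `auto_path_scope`; the name `ecommerce_scope` is hypothetical, and returning `None` when no marker is present is an assumption (the real function has other scope tiers).

```python
from typing import Optional
from urllib.parse import urlsplit

# Markers listed in the release notes.
ECOM_MARKERS = frozenset({"cat", "category", "categories",
                          "products", "shop", "collections"})


def ecommerce_scope(url: str) -> Optional[str]:
    """Return the scope pattern anchored before the deepest marker segment."""
    if "://" not in url:
        url = "https://" + url  # the table writes seeds without a scheme
    segments = [s for s in urlsplit(url).path.split("/") if s]
    positions = [i for i, s in enumerate(segments) if s.lower() in ECOM_MARKERS]
    if not positions:
        return None  # no marker: this rule does not apply
    deepest = positions[-1]  # nested markers: the deepest one wins
    if deepest == 0:
        return None  # marker at root: siblings span the whole site
    return "/" + "/".join(segments[:deepest]) + "/*"


for seed in ("ikea.com/us/en/cat/furniture-fu001/",
             "myshop.com/store/collections/spring/products/x"):
    print(seed, "->", ecommerce_scope(seed))
```

Lower-casing each segment before the lookup gives the case-insensitivity that the test list below mentions.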

Tests

324 passing (was 316 in v0.9.2; +8 new ecommerce-marker tests covering ikea-style /us/en/cat, Shopify /products & /collections, deepest-marker tiebreak, case-insensitivity).

Migration

No API changes. Behavior shift only on seeds passing through one of the marker words. If you were relying on a tight scope at the marker level, pass auto_path_scope=False and use include_paths explicitly.

Known limitations

Long-tail product queries: with constrained max_pages, hitting specific named products on a large catalog (ikea MALM/SLATTUM, etc.) depends on which products end up in BFS order — irreducible variance. Larger max_pages reduces this.
SPA sites: pages requiring JavaScript-rendered navigation still need explicit --render-js.

Install

pip install 'markcrawl[js]==0.9.3'

