Add CC-scale MinerU-HTML layout-clustering + propagation pipeline (91% fewer LLM calls, F1=0.91) by VibhuJawa · Pull Request #2075 · NVIDIA-NeMo/Curator

VibhuJawa · 2026-06-13T06:06:47Z

Dripper Common Crawl Clustering Pipeline

Full clustering+propagation pipeline for HTML extraction at CC scale using NeMo Curator best practices.

Architecture

Stage	Implementation	Result
Stage 1a	`DOMFeatureExtractionStage(ProcessingStage)`, RayActorPoolExecutor	86k pages in 102s
Stage 1b	`HostDBSCANStage(ProcessingStage)`, Resources(gpus=1), cuML	92.9% LLM call reduction
Stage 1c+2+2b	GPU vLLM kv-fp8 via Pipeline.run()	164.9 p/s/node
Stage 3	`_Stage3PropagationStage(ProcessingStage)`, PPT=16, LPT+HTML-size sort	F1=0.8450
Stage 3b	GPU fallback for over-extracted siblings (pred>2.5x ref)	F1=0.9175
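The Stage 3b trigger in the table (pred>2.5x ref) can be sketched as a simple length-ratio gate. This is a minimal illustration rather than the PR's code: the name `needs_llm_fallback` is hypothetical, and it assumes `pred` is the propagated text for a sibling, `ref` is the representative's extracted text, and both are compared by character length.

```python
def needs_llm_fallback(pred: str, ref: str, max_ratio: float = 2.5) -> bool:
    """Return True when a propagated extraction looks over-extracted."""
    # Assumption: lengths are measured in characters; the pipeline may use tokens.
    if not ref:
        # Assumption: with no reference text the ratio is undefined,
        # so the sibling is sent back to the LLM to be safe.
        return True
    return len(pred) > max_ratio * len(ref)


# A sibling 2.6x longer than its representative is routed to the fallback.
over_extracted = needs_llm_fallback("x" * 26, "y" * 10)
# A sibling exactly at the threshold is kept, because the comparison is strict.
at_threshold = needs_llm_fallback("x" * 25, "y" * 10)
```

A strict `>` keeps pages at exactly 2.5x on the cheap CPU path; whether the pipeline treats the boundary the same way is not stated in the PR.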

Validated Results

F1 = 0.9175 > 0.90 target (+0.181 vs original pipeline)
GPU throughput = 164.9 p/s/node > 163 target
Curator best practices: ProcessingStage, RayActorPoolExecutor, Resources, dripper_cached_venv
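The commit log describes the quality metric as token F1 against a pure Dripper baseline. A minimal sketch of such a metric follows, assuming whitespace tokenization and multiset overlap; the tokenizer and the handling of empty inputs in the actual benchmark are not shown in this PR, so both are assumptions here.

```python
from collections import Counter


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 between a predicted and a reference extraction."""
    # Assumption: tokens are whitespace-separated and counted as a multiset.
    pred_tokens = Counter(pred.split())
    ref_tokens = Counter(ref.split())
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        # Covers disjoint texts and empty inputs (scored 0.0 here by choice).
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


score = token_f1("main article body", "main article text")
```

With two of three tokens shared in each direction, precision and recall are both 2/3, so `score` is 2/3.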

Key design decisions

Single execution backend: RayActorPoolExecutor (ProcessPool fallback removed)
PPT=16 siblings per task with HTML-size descending sort for optimal load balancing
html_element_dict pre-parsed on driver once per cluster (avoids repeated JSON+eval per sibling)
Sim-gate: use LBP body even when structural similarity < 0.75 threshold
Stage 3b GPU fallback routes 14% of over-extracted siblings back to LLM
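The load-balancing decision above (PPT=16 with an HTML-size descending sort, plus LPT in the architecture table) can be sketched as follows. This is an illustration under stated assumptions, not the code in stage3_cpu_propagation.py: it assumes PPT means siblings per task, that HTML byte size is the cost estimate, and that LPT refers to the classic longest-processing-time-first greedy rule. The function names are hypothetical.

```python
import heapq


def chunk_by_size(html_sizes, pages_per_task=16):
    """Sort sibling indexes by HTML size (largest first) and cut into tasks."""
    order = sorted(range(len(html_sizes)), key=lambda i: html_sizes[i], reverse=True)
    return [order[i:i + pages_per_task] for i in range(0, len(order), pages_per_task)]


def lpt_assign(costs, n_workers):
    """Greedy LPT: place each item, largest first, on the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(idx)
        heapq.heappush(heap, (load + costs[idx], worker))
    return assignment


costs = [9, 7, 6, 5, 4, 3]
plan = lpt_assign(costs, n_workers=2)
loads = [sum(costs[i] for i in worker_items) for worker_items in plan]
tasks = chunk_by_size([5, 1, 9, 3], pages_per_task=2)
```

Sorting largest-first matters because placing big pages early leaves small pages to even out the tail; in this toy input both workers end with a load of 17.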

LOC summary (tutorials/text/dripper-common-crawl/)

stage3_cpu_propagation.py: 1107 → 870 lines (-237) — removed ProcessPool path, collapsed argparse
dashboard_server.py: removed from PR (NVIDIA-internal hostnames/paths, dev tool only)

🤖 Generated with Claude Code

copy-pr-bot · 2026-06-13T06:06:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-13T06:12:36Z

Greptile Summary

This PR adds a 7-stage CC-scale MinerU-HTML pipeline that replaces per-page LLM inference with GPU-accelerated DOM clustering (cuML DBSCAN) + template propagation, achieving ~91% fewer LLM calls and F1=0.91 vs the standalone baseline. The tutorial scripts under dripper-common-crawl/ are well-validated and benchmark-proven; however, the library-level integration stages in nemo_curator/stages/text/experimental/dripper/ have three defects that prevent the deferred-propagation path from working correctly.

Tutorial pipeline (stage1a–stage3b, run_mineru_pipeline.sh): correct pickle+base64 serialization, robust _parse_mapping_json, two-tier static/dynamic LBP with per-cluster F1 validation, and offline-batched vLLM with chat-template application — all four bugs from development are fixed here.
Library stages (DripperHTMLLayoutTemplateStage, DripperHTMLLayoutPropagationStage): the fix for bug #4 (tuple-key destruction) is not ported: stage.py still uses json.dumps(default=str) while propagation_stage.py uses json.loads; additionally, propagation_stage.py has an iloc/loc mismatch that silently reads wrong rows on non-default-indexed DataFrames.
Infra changes (llm_client.py, openai_client.py, core/client.py, dynamo/ray-serve configs): all look correct; the query_model_with_usage extension and _run_with_retry_and_concurrency refactor are clean improvements.

Confidence Score: 3/5

Safe to merge for teams using only the tutorial scripts; the library integration stages for deferred propagation are broken and should not be used until the serialization and indexing defects are fixed.

The tutorial pipeline and benchmarks are solid and correctly implement the four bug fixes. The library-level DripperHTMLLayoutTemplateStage+DripperHTMLLayoutPropagationStage deferred-propagation path has two independent defects: json.dumps(default=str) in stage.py destroys the tuple keys that LayoutBatchParser requires (the same bug #4 the PR fixes in the tutorial scripts), and propagation_stage.py's df.iloc[idx] reads positional rows when idx is a label, silently producing wrong or out-of-bounds results. Both would cause the integrated library path to produce incorrect output with no error surfaced to the caller.

Files needing attention: nemo_curator/stages/text/experimental/dripper/propagation_stage.py and nemo_curator/stages/text/experimental/dripper/stage.py (lines around the mapping_json serialization and iloc/loc indexing).

Important Files Changed

Filename	Overview
nemo_curator/stages/text/experimental/dripper/propagation_stage.py	New library stage for deferred template propagation; has two P1 bugs: iloc/loc indexing mismatch that silently reads wrong rows on non-default-indexed DataFrames, and json.loads cannot decode pickle+base64 mapping blobs from stage2b output.
nemo_curator/stages/text/experimental/dripper/stage.py	Large new file implementing the full Dripper pipeline stages; the deferred-propagation serialization path uses json.dumps(default=str) which stringifies tuple keys in html_element_dict, leaving bug #4 unfixed for the library integration path even though it was fixed in the tutorial scripts.
nemo_curator/stages/text/experimental/dripper/gpu_layout_clustering.py	New GPU-accelerated DBSCAN clustering module using cuML/cupy with sklearn fallback; logic is sound but accesses llm-webkit private internals via fragile name-mangling guesses.
nemo_curator/stages/text/experimental/dripper/__init__.py	New package init; exports 7 stages from stage.py but omits DripperHTMLLayoutPropagationStage from propagation_stage.py.
tutorials/text/dripper-common-crawl/stage2b_cpu_postprocess.py	Tutorial postprocess stage correctly uses pickle+base64 for lossless serialization of mapping templates with tuple keys; aligns with stage3 deserialization.
tutorials/text/dripper-common-crawl/stage3_cpu_propagation.py	Tutorial propagation stage with robust _parse_mapping_json that handles both pickle+base64 and legacy JSON; two-tier static/dynamic LBP with per-cluster validation is well-implemented.
nemo_curator/models/client/llm_client.py	Refactors retry/concurrency logic into a reusable _run_with_retry_and_concurrency helper and adds _coerce_generation_config; behavioral change from returning [] to raising RuntimeError on the unexpected-path is intentional and safer.
nemo_curator/models/client/openai_client.py	Adds OpenAIChatCompletionResult dataclass and query_model_with_usage methods to expose token usage counters; _usage_int correctly handles bool subclass-of-int edge case.
tests/stages/text/experimental/dripper/test_pipeline_correctness.py	39 pure-Python regression tests covering pickle+base64 round-trip, xpath parsing, HTML coercion, F1 metrics, and grep-based wiring guards; correctly locks in the four bug fixes.
tutorials/text/dripper-common-crawl/stage2_gpu_inference_offline.py	Offline-batched vLLM inference with per-GPU subprocess isolation, LPT slice balancing, and correct chat-template application; replaces the per-request Ray-Serve bottleneck.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Stage 1a: CPU\nDOM feature extraction\n(get_feature, 64 workers)"] --> B["Stage 1b: GPU\ncuML DBSCAN clustering\n(pick cluster reps)"]
    B --> C["Stage 1c: CPU\nsimplify + build_prompt\n(reps + singletons only)"]
    C --> D["Stage 2: GPU\nOffline-batched vLLM\n(1 LLM.generate per GPU)"]
    D --> E["Stage 2b: CPU\npickle+base64 serialization ✓"]
    E --> F["Stage 3: CPU\nLayoutBatchParser propagation\n_parse_mapping_json ✓"]
    F --> G{propagation success?}
    G -->|yes| H["Output: dripper_content"]
    G -->|fallback| I["Stage 3b: LLM re-inference"]
    I --> H
    J["Library: DripperHTMLLayoutTemplateStage"] -->|"json.dumps default=str ❌"| K["DripperHTMLLayoutPropagationStage\niloc bug ❌"]
    K --> L["LayoutBatchParser fails"]
    style E fill:#d4edda
    style F fill:#d4edda
    style J fill:#f8d7da
    style K fill:#f8d7da
    style L fill:#f8d7da

Reviews (1): Last reviewed commit: "Fix stage1b GPU OOM: chunk oversized hos..."

greptile-apps · 2026-06-13T06:12:40Z

+        for idx in df.index[pending_mask]:
+            row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]


iloc vs loc indexing mismatch

df.index[pending_mask] yields label values, but the row is fetched via df.iloc[idx], which uses a positional integer offset, not a label. pd.Series always has a get method, so hasattr(df.iloc[idx], "get") is always True — the fallback branch (df.loc[idx]) is dead code. On any DataFrame whose index is not the default RangeIndex(0, N) (e.g. after filtering, concatenation, or re-indexing), df.iloc[idx] will silently return the wrong row or raise IndexError, while the label-based writes at lines 140–146 (df.at[idx, ...]) will correctly target the right row — producing corrupted output. The read should always use df.loc[idx].

Suggested change

-        for idx in df.index[pending_mask]:
-            row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]
+        for idx in df.index[pending_mask]:
+            row = df.loc[idx]
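The mismatch described in this comment is easy to reproduce on a small frame. The sketch below uses made-up data (the `url` and `pending` columns are illustrative, not the stage's schema) to show that labels taken from `df.index[mask]` only line up with `iloc` positions while the index is still the default `RangeIndex`.

```python
import pandas as pd

df = pd.DataFrame(
    {"url": ["a", "b", "c", "d"], "pending": [False, True, False, True]}
)
# Filtering drops row 0, so the remaining index labels are 1, 2, 3.
filtered = df[df["url"] != "a"]
pending_mask = filtered["pending"].to_numpy()

labels = list(filtered.index[pending_mask])  # labels of the pending rows: 1 and 3

# Label-based read: returns the rows the mask actually selected.
by_label = [filtered.loc[label, "url"] for label in labels]

# Positional read with a label: position 1 is the row labelled 2, i.e. the wrong row.
wrong_row = filtered.iloc[labels[0]]["url"]

# Position 3 does not exist in a 3-row frame, so this read fails outright.
try:
    filtered.iloc[labels[1]]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```

Here `by_label` is `['b', 'd']`, while the positional read returns `'c'` for the first pending label and raises for the second, which matches the "wrong row or IndexError" behaviour described above.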

greptile-apps · 2026-06-13T06:12:41Z

+            is_representative = candidate_idx == representative_idx
+            results[candidate_idx] = replace(
+                candidate_result,


Bug #4 (tuple-key serialization) not fixed in the library integration path

The PR description identifies pickle+base64 as the fix for tuple keys being destroyed by json.dumps (Bug #4), and the tutorial scripts (stage2b_cpu_postprocess.py) correctly use base64.b64encode(pickle.dumps(template)). However, DripperHTMLLayoutTemplateStage still serializes via json.dumps(mapping_data, default=str), which stringifies every tuple key in html_element_dict (e.g. ("div", "class", "content") → "('div', 'class', 'content')"). The companion DripperHTMLLayoutPropagationStage.process() then deserializes with json.loads, so those keys arrive as strings rather than tuples, silently breaking LayoutBatchParser when the integrated library pipeline is used with defer_propagation=True. The fix applied to the tutorial scripts was not ported to these library-level stages.
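The difference between the two serialization paths can be sketched with the standard library alone. The template below is made-up data shaped like the tuple-keyed example in this comment; it is not the real `html_element_dict`. Standard-library JSON only accepts str, int, float, bool, or None as object keys, so tuple keys have to be turned into strings somewhere on the JSON path; the sketch stringifies them explicitly to show what the reader of the blob then receives.

```python
import base64
import json
import pickle

# Illustrative template with tuple keys (hypothetical data).
template = {("div", "class", "content"): "main", ("div", "id", "footer"): "other"}

# Lossless path: pickle preserves key types, base64 makes the bytes column-safe text.
blob = base64.b64encode(pickle.dumps(template)).decode("ascii")
restored = pickle.loads(base64.b64decode(blob))

# Lossy path: once keys are stringified for JSON, they come back as plain strings.
json_text = json.dumps({str(key): value for key, value in template.items()})
stringified = json.loads(json_text)

tuple_key = ("div", "class", "content")
tuple_key_survives_pickle = tuple_key in restored
tuple_key_survives_json = tuple_key in stringified
```

A lookup by tuple succeeds on `restored` and fails on `stringified`, which is the failure mode a tuple-keyed parser would hit. Note that `pickle.loads` executes arbitrary code from the blob, so this format is only appropriate for data the pipeline itself wrote.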

greptile-apps · 2026-06-13T06:12:42Z

+                    try:
+                        mapping_by_cluster[cluster] = json.loads(mapping_json)
+                    except Exception:  # noqa: BLE001
+                        pass


json.loads cannot decode pickle+base64 mapping blobs from stage2b

When DripperHTMLLayoutPropagationStage is used as a downstream stage of DripperHTMLLayoutTemplateStage with the current json.dumps serialization (see the sibling comment on stage.py), the mapping data is JSON and json.loads works. But if a user feeds the stage2b tutorial output (pickle+base64 format) into this library stage, json.loads will raise on every row and the broad except Exception: pass will silently discard every mapping, causing no_mapping_data_for_cluster errors for 100% of rows. The library stage should use the same _parse_mapping_json helper from stage3_cpu_propagation.py that handles both formats (or at least document the accepted format explicitly).
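A parser that accepts both formats might look like the sketch below. The tutorial's `_parse_mapping_json` is not shown in this thread, so `parse_mapping_blob` is a hypothetical stand-in written from the description only; the try-JSON-first order and the error type are assumptions.

```python
import base64
import binascii
import json
import pickle


def parse_mapping_blob(blob: str):
    """Decode a mapping stored either as JSON text or as base64-encoded pickle."""
    if not blob:
        return None  # nothing stored for this row
    try:
        return json.loads(blob)  # legacy JSON format
    except ValueError:
        pass  # not JSON; try the pickle+base64 format next
    try:
        # Only safe for blobs produced by the pipeline itself (pickle runs code).
        return pickle.loads(base64.b64decode(blob, validate=True))
    except (binascii.Error, pickle.UnpicklingError, EOFError) as exc:
        # Surface the failure instead of silently dropping the mapping.
        raise ValueError("mapping blob is neither JSON nor base64 pickle") from exc


template = {("div", "class", "content"): "main"}
pickled_blob = base64.b64encode(pickle.dumps(template)).decode("ascii")

from_pickle = parse_mapping_blob(pickled_blob)
from_json = parse_mapping_blob('{"a": 1}')
```

Raising on an unrecognized blob, rather than passing, means a format mismatch shows up as one clear error instead of a missing mapping on every row.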

greptile-apps · 2026-06-13T06:12:43Z

+        assert self._web_bindings is not None
+        assert self._bindings is not None


assert statements are disabled by the -O interpreter flag

Using bare assert for runtime precondition checks in library code is unsafe: Python's -O (optimize) mode removes all assert statements at compile time. If a caller constructs the stage without calling setup() first (e.g., in a worker process where setup() was expected via worker_metadata), these guards silently disappear and the code proceeds with None bindings, yielding opaque AttributeErrors later. Replace with explicit RuntimeError guards.
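An explicit guard of the kind suggested here could be sketched as follows. `BindingsStage` is a hypothetical stand-in, not the stage from this PR; only the pattern (check, then raise `RuntimeError`) is the point.

```python
class BindingsStage:
    """Hypothetical stage that loads its bindings lazily in setup()."""

    def __init__(self):
        self._bindings = None

    def setup(self):
        # Stand-in for loading the real bindings.
        self._bindings = object()

    def _require_bindings(self):
        # Unlike assert, this check still runs under `python -O`.
        if self._bindings is None:
            raise RuntimeError("setup() must be called before process()")
        return self._bindings

    def process(self, task):
        self._require_bindings()
        return task


stage = BindingsStage()
try:
    stage.process("page")
    guard_fired = False
except RuntimeError:
    guard_fired = True

stage.setup()
result = stage.process("page")
```

Centralizing the check in one helper keeps the error message consistent across every method that needs the bindings.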

greptile-apps · 2026-06-13T06:12:44Z

+from nemo_curator.stages.text.experimental.dripper.stage import (
+    DripperHTMLExtractionStage,
+    DripperHTMLExtractionPipelineStage,
+    DripperHTMLInferenceStage,
+    DripperHTMLLayoutClusteringStage,
+    DripperHTMLLayoutTemplateStage,
+    DripperHTMLPostprocessStage,
+    DripperHTMLPreprocessStage,
+)
+
+__all__ = [
+    "DripperHTMLExtractionStage",
+    "DripperHTMLExtractionPipelineStage",
+    "DripperHTMLInferenceStage",
+    "DripperHTMLLayoutClusteringStage",
+    "DripperHTMLLayoutTemplateStage",
+    "DripperHTMLPostprocessStage",
+    "DripperHTMLPreprocessStage",
+]


DripperHTMLLayoutPropagationStage missing from public exports

propagation_stage.py introduces DripperHTMLLayoutPropagationStage as the CPU companion to DripperHTMLLayoutTemplateStage, but it is not imported or listed in __all__ here. Consumers of nemo_curator.stages.text.experimental.dripper cannot discover or import the new stage through the package's public surface without going directly to the submodule.

greptile-apps · 2026-06-13T06:12:45Z

+
+    layout_ids = [int(x) for x in layout_ids]
+
+    success = []
+    for idd, sample in zip(layout_ids, sampled_list, strict=False):
+        sample["layout_id"] = idd
+        sample["max_layer_n"] = layer_n
+        success.append(sample)
+
+    n_clusters = len({x for x in layout_ids if x >= 0})
+    n_noise = sum(1 for x in layout_ids if x < 0)
+    logger.info(f"cluster_html_struct_gpu: n={len(sampled_list)} → {n_clusters} clusters ({n_noise} noise)")
+    return success, list(set(layout_ids))
+
+
+def _get_simp_features(cosin_mod: ModuleType) -> Callable:
+    """Return llm-webkit's feature-vectorization function.
+
+    The helper that turns raw layout features into the (tags, attrs) vectors lives
+    in ``llm_web_kit.html_layout.html_layout_cosin`` as a module-private function.
+    Python name-mangles a module-level ``__simp_features`` to
+    ``_<module>__simp_features``, so we look up both that mangled name and the
+    bare name explicitly. We raise a clear error if neither is present (rather
+    than silently scanning ``dir()``) so an upstream rename surfaces immediately.


Private name-mangling access is fragile across llm-webkit versions

_get_simp_features tries to locate the feature-vectorization helper by guessing multiple mangled names (_html_layout_cosin__simp_features, __simp_features, simp_features) via getattr. Python's name-mangling rule applies to class bodies only — a module-level __simp_features is not automatically mangled, so the string "_html_layout_cosin__simp_features" relies on an implementation detail of how the module was authored. Any refactoring or rename in llm_web_kit silently falls through to the third fallback name (simp_features) and then raises a RuntimeError, breaking all GPU clustering without a clear error message until runtime. Consider pinning the llm-webkit version or exposing this as a public helper in llm-webkit upstream.
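The mangling rule this comment relies on can be checked with a throwaway module. The sketch below builds a fake module (it does not import llm_web_kit, and `find_helper` is a hypothetical lookup, not the PR's `_get_simp_features`) and tries the three candidate names listed above in the same order.

```python
import types

# A stand-in module that defines a double-underscore function at module level.
fake_module = types.ModuleType("html_layout_cosin")
exec("def __simp_features(x):\n    return ('vectorized', x)\n", fake_module.__dict__)


def find_helper(module, candidates):
    """Return the first callable attribute among the candidate names."""
    for name in candidates:
        helper = getattr(module, name, None)
        if callable(helper):
            return name, helper
    raise RuntimeError(f"none of {candidates} found in {module.__name__}")


found_name, helper = find_helper(
    fake_module,
    ("_html_layout_cosin__simp_features", "__simp_features", "simp_features"),
)


class InsideClass:
    def lookup(self, module):
        # Inside a class body this attribute is compiled as
        # module._InsideClass__simp_features, so the lookup fails.
        return module.__simp_features


try:
    InsideClass().lookup(fake_module)
    mangled_in_class = False
except AttributeError:
    mangled_in_class = True
```

The module-level function is stored under its bare name, so the mangled candidate never matches; mangling only appears when the attribute is referenced from inside a class body, which is one reason to use `getattr` with a string there.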

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…metrics - stage.py: _split_large_precomputed_layout_group splits oversized precomputed layout clusters (exceeding layout_template_max_exact_host_pages) using dom_path_hash or feature_hash fingerprinting instead of processing them as one monolithic group; standalone mode leaves them to fallback - main.py: add --layout-baseline-output-dir arg; build_layout_category_timing_metrics and build_layout_cluster_timing_metrics add per-category/cluster timing breakdowns; build_layout_baseline_comparison_metrics computes incremental non-exact layout savings and F1 against a pure-Dripper baseline run - submit_nebius_single_node.sh: wire LAYOUT_BASELINE_OUTPUT_DIR passthrough - test_stage.py: cover standalone and dom_path_hash large-group splitting Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Tracks the CPU-only layout diagnostic pipeline alongside the rest of the dripper-common-crawl tutorial so diagnostics are reproducible from the repo: - remote_dripper_layout_diag.py: CPU-only replication of stage.py layout propagation; produces layout_diag_clusters.csv, layout_diag_propagation.csv, layout_diag_metadata.json - summarize_dripper_layout_diag.py: post-processes diagnostic CSVs; reports F1 distribution, call-reduction estimate, worst clusters - submit_nebius_layout_diag.sh: Slurm submission wrapper; syncs remote_dripper_layout_diag.py to remote, generates SBATCH script - lib_nebius_ssh.sh: SSH helper library; required by submit_nebius_layout_diag.sh Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

stage.py:
- QW-2: Sort exemplars_by_layout.items() in both _assign_layout_by_exemplar_similarity methods to make cluster boundary assignment deterministic across runs
- QW-3: Replace propagated_results.pop(0) with index-based access via enumerate to eliminate fragile parallel-list coupling
- QW-4: Reconcile layout_template_more_noise_enable default to True (matches llm-webkit upstream and diag script default)
- GAP-2: Fix max_layer_n sourcing at both clustering locations to skip noise pages (layout_id=-1) when reading the representative layer depth
remote_dripper_layout_diag.py:
- QW-1: Track f1_counts[mode] separately so per-mode mean F1 uses the correct denominator when one mode has more errors than another
summarize_dripper_layout_diag.py:
- QW-5: Add HOST_MIN_F1_BEGIN section showing min and mean F1 per host for saved rows; directly surfaces publicpay-style false-pass regressions
- QW-6: Compute and print validation_probe_overhead_llm_calls and estimated_net_call_reduction subtracting validation sample LLM cost
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

remote_dripper_layout_diag.py: - New build_precomputed_layout_shards(): loads a precomputed manifest parquet (dripper_layout_id column) and groups base_df rows globally by layout ID, bypassing the per-shard DBSCAN that limits clusters to 64-row batch windows - Main loop: when LAYOUT_PRECOMPUTED_MANIFEST is set, each precomputed layout cluster becomes one shard and raw_groups=[shard_indexes], using the layout ID in cluster_id for traceability - page_signature_mode sub-splitting still applied within each global group submit_nebius_layout_diag.sh: - Wire LAYOUT_PRECOMPUTED_MANIFEST env var through to the job script Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

stage.py:
- _LayoutTemplateRowResult: add layout_pending_propagation and layout_mapping_json fields
- DripperHTMLLayoutTemplateStage + DripperHTMLExtractionPipelineStage: add layout_template_defer_propagation flag
- _process_layout_group_with_status: when defer_propagation=True and validation passes, mark remaining sibling rows as pending instead of running LayoutBatchParser (the 11s/row CPU bottleneck); store mapping_data JSON on the pending rows so the propagation stage can reconstruct it
- process(): emit dripper_layout_pending_propagation and dripper_layout_mapping_json columns when defer_propagation=True
- Wire defer_propagation through pipeline stage to inner stage
propagation_stage.py (new):
- DripperHTMLLayoutPropagationStage: CPU-only stage that reads GPU output with pending_propagation markers, looks up representative mapping_data by cluster, runs LayoutBatchParser for each sibling, applies content-length ratio guard, and marks results
Expected impact: GPU stage drops from ~600s to ~250s by removing the 23,859s of CPU propagation work from the H100 job. H100-hours projection improves from 387K to ~160K.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

main.py: - Add --layout-template-defer-propagation arg; wire to pipeline stage and metrics; insert DripperHTMLLayoutPropagationStage after the GPU stage when layout_template_mode=True and defer_propagation=True - Fix singleton shard explosion: _layout_key_or_row_fallback now uses host key (~unassigned-host-{host}) as fallback instead of per-row sentinel (~unassigned-layout-{row_id}), so unassigned pages share shards rather than creating one shard each — reduces shard count by 10-30% on datasets with many unclustered pages - Import DripperHTMLLayoutPropagationStage from propagation_stage module submit_nebius_single_node.sh: - Wire LAYOUT_TEMPLATE_DEFER_PROPAGATION env var through to --layout-template-defer-propagation / --no-layout-template-defer-propagation Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

The propagation stage (DripperHTMLLayoutPropagationStage) looks up layout_mapping_json from the representative row of each cluster, but the previous implementation stored it on every sibling row instead. Fix: compute mapping_json_for_representative once and set layout_mapping_json on the representative result; siblings get empty string. Removes the per-sibling json.dumps() call which was wasting memory storing N copies of the same data. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Step-by-step Jupyter notebook for DGX A100 covering:
0. Setup & imports
1. Load 8192 CC pages, view raw HTML
2. DOM feature extraction with llm-webkit get_feature
3. Layout clustering (DBSCAN): live demo + global cluster viz
4. Representative selection: scoring formula walkthrough
5. HTML simplification: show 12.83% token reduction
6. LLM extraction: MinerU-HTML main/other labeling
7. Template propagation: CPU-only sibling inference
8. Validation: token F1 vs pure Dripper baseline
9. Cost analysis: H100-hours comparison chart
10. Full pipeline: DripperHTMLExtractionPipelineStage end-to-end
Data: /raid/vjawa/dripper_tutorial/ on dgx-a100-02 (10.184.206.11)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

lib_nebius_ssh.sh: add nebius_resolve_rsync_host() which maps any nb-hel-cs-001-* node to nb-hel-cs-001-dc-01.nvidia.com (or dc-02 via NEBIUS_RSYNC_HOST env override). DC nodes are significantly faster for bulk file transfers than login or vscode nodes. submit_nebius_layout_diag.sh: wire rsync_host via nebius_resolve_rsync_host so both the rsync SSH command string and the destination host use the dc node. All scripts in .claude/scripts/ updated with the same pattern. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…e; graceful baseline loading - Replace pd.read_parquet() with read_parquet_safe() which uses pq.ParquetFile().read().to_pandas() — avoids ArrowInvalid from ParquetDataset memory-map buffering on pyarrow 23.0.1 - Fix CURATOR_REPO to /raid/vjawa/nemo-curator-adlr-mm/submodules/Curator - Baseline loading is now try/except with clear re-transfer instructions - Cells 22/23 guard against baseline=None Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

bindings.simplify() does not exist — the API is: case = bindings.case_cls(bindings.input_cls(raw_html=html, url=url)) case = bindings.simplify_single_input(case) simplified = DripperHTMLExtractionStage._get_processed_attr(case, 'simpled_html') mapped = DripperHTMLExtractionStage._get_processed_attr(case, 'map_html') Add simplify_html() helper function in cell-19 so all downstream cells can call it cleanly. Fix cells 19, 20, 26 which used the wrong API. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Captures measured timing per stage from all experiments:
- WARC fetch: 1.2s/record sequential, ~50/s async (64 workers)
- get_feature(): 89 pages/s, 11.2ms/page on real CC HTML
- DBSCAN: 11s-91s per batch depending on host size
- LLM inference: 8.19s (representatives), 2.78s (fallback), 1.85s (standalone)
- Template propagation: 11.2s/page mean (56% of GPU job CPU, 0% GPU)
- End-to-end H100: 374s (baseline) → 599s → projected ~250s with defer_propagation
- Bottleneck priority table and next experiments list
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Covers LLM call efficiency, throughput/cost, propagation F1 quality, per-host analysis, cluster size distribution, content examples, and a summary scorecard. Paths are configurable at the top; graceful fallback when runs are not yet complete. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

run_mineru_html_standalone.py:
- Runs MinerU-HTML directly from the upstream library (no Curator infra)
- Reads pages from a manifest parquet (url + html columns)
- Batches pages through MinerUHTML.process() (vLLM backend)
- Writes dripper_results.parquet + metrics.json
- Same output schema as Curator Dripper for fair comparison
submit_mineru_standalone.sh:
- Slurm submit script for the standalone baseline
- Uses smoke-run venv (has mineru_html + vllm already installed)
- 1 node × 8 H100s, configurable batch size and max pages
compare_clustering_vs_standalone.ipynb:
- 8-section comparison notebook (Run A with clustering vs Run B standalone)
- Pre-configured for jobs 334943 (clustering) and 334945 (standalone)
- LLM call efficiency, F1 quality, per-host analysis, scorecard
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

gpu_layout_clustering.py:
- Drop-in replacement for llm-webkit's cluster_html_struct
- For large clusters (≥200 pages): uses cupy batched matmul for cosine similarity (one GPU matmul vs N² Python loop) + cuML DBSCAN
- For small clusters: falls back to sklearn (GPU overhead not worth it)
- Falls back gracefully when CUDA/cuML not available
- Preserves exact same tag_weight=0.7/attr_weight=0.3 as upstream
stage.py:
- _load_llm_web_kit_bindings now wires cluster_html_struct_gpu as the cluster_html_struct binding; automatic GPU usage when available
Expected speedup for N=3000 pages:
Before: ~25 min (4.5M Python loop iterations)
After: ~5-10s (cuBLAS batched matmul on H100)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
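The "one matmul instead of an N² Python loop" idea in this commit can be sketched on the CPU with NumPy. This is not the module's code: the function name is hypothetical, the input is assumed to be two dense feature matrices (one row per page), and combining the two cosine similarities as a weighted sum with 0.7/0.3 is an assumption about how the upstream weights are applied.

```python
import numpy as np


def weighted_cosine_similarity(tags, attrs, tag_weight=0.7, attr_weight=0.3):
    """All-pairs similarity from two feature matrices using two matmuls."""

    def normalize(matrix):
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        # Leave all-zero rows as zeros instead of dividing by zero.
        return matrix / np.where(norms == 0, 1.0, norms)

    tag_unit = normalize(np.asarray(tags, dtype=np.float64))
    attr_unit = normalize(np.asarray(attrs, dtype=np.float64))
    # Rows are unit vectors, so each product is a full cosine-similarity matrix.
    return tag_weight * (tag_unit @ tag_unit.T) + attr_weight * (attr_unit @ attr_unit.T)


tags = [[1, 0], [1, 0], [0, 1]]
attrs = [[1, 1], [1, 1], [1, 0]]
similarity = weighted_cosine_similarity(tags, attrs)
```

Pages 0 and 1 have identical features and score 1.0, while page 2 shares only part of its attribute vector with them. Since cupy exposes a largely NumPy-compatible array API, the same expression is the kind of code that can be moved to the GPU, and a distance matrix derived from it (for example `1 - similarity`) is what a DBSCAN run with a precomputed metric would consume.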

submit_nebius_single_node.sh: add --extra deduplication_cuda12 to uv sync so cuml-cu12==25.10.* gets installed in every smoke-run venv — enables gpu_layout_clustering.py GPU path automatically on H100 nodes. submit_mineru_standalone.sh: export TENSOR_PARALLEL_SIZE env var in the SBATCH script so run_mineru_html_standalone.py uses all 8 GPUs. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

submit_nebius_single_node.sh: check for DRIPPER_CACHED_VENV path on Lustre. If it exists (pre-built with cuml, mineru_html, llm_web_kit), use it as UV_PROJECT_ENVIRONMENT — uv sync --inexact runs in <60s (skips already- installed packages). Falls back to per-job .venv when cache not present. Run scripts/create_cached_venv.sh once to build the cache. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

A 3-stage streaming pipeline that replaces per-page LLM extraction with DOM-layout clustering + template propagation, with strict CPU/GPU stage separation. Built on the existing experimental Dripper stage bindings.
- Stage 1a (CPU) DOM feature extraction; 1b (GPU) cuML DBSCAN clustering
- Stage 1c (CPU) simplify + build_prompt + item_count
- Stage 2 (GPU) offline-batched vLLM inference (kv-cache fp8), 6x over per-request serving
- Stage 2b (CPU) parse_result + convert2content + propagation template
- Stage 3 (CPU) two-tier LayoutBatchParser propagation + per-cluster validation
- Stage 3b route propagation failures back to the LLM; trafilatura recovery
Results vs standalone Dripper: token-F1 0.91, ~91% fewer LLM calls, Stage 2 27->163 pages/s/node. Includes pure-python regression tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Drop Nebius/Slurm-cluster-bespoke files (lib_nebius_ssh.sh, submit_nebius_*.sh, submit_mineru_standalone.sh, remote/summarize layout-diag scripts, build_host_bucketed_index_shards.py, scratch runners) and replace hardcoded /lustre + cluster-host paths with portable defaults (HF_HOME / ~/.cache/huggingface, placeholders in notebooks). The pipeline runs via the generic, env-var-driven run_mineru_pipeline.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…nes) Consolidates 3 separate stage files into one cohesive module. Removes 3 files, adds _base_stages.py with shared imports and all 4 stages. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

These 3 files (1,159 lines) were consolidated into _base_stages.py. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Remove verbose status-tracking dataclasses (_LayoutGroupAttempt, _LayoutGroupRun, _ValidationOutcome, _InferContext), collapse three separate validation methods into __post_init__, merge _select_representative_index into _select_representative_indexes, inline _missing_layout_result / _run_health_check / _fallback_infer_context / _effective_validation_rows, and refactor _infer_and_postprocess_row to use flat kwargs instead of _InferContext. Core algorithm and all ruff checks preserved. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

stage_gpu_pipeline.py: 558->299 (remove all stage1c/2b boilerplate, keep vLLM) stage3_cpu_propagation.py: 674->228 (remove duplicate LBP logic, thin wrapper) stage1b_gpu_dbscan.py: 361->236 (remove duplicated utilities) stage1a_feature_extraction.py: 181->161 (collapse helpers) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…(-303 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

… lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…ocessing deleted), trim docstrings (-62 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

… lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…l_helpers) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…opagation_stage.py/_base_stages.py Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

….py (-24 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…n_stage.py Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…ents (-14 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…emplate.py (-96 lines) Moves DripperLayoutAdvancedConfig to the module that actually uses it for planning. Removes 95 lines from layout_template.py via agent cuts. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…e.py, import from _layout_planning (-14 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…3 (-4 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

…thod docs - Replace advanced: DripperLayoutAdvancedConfig | None = None with 12 flat layout_xxx fields - Remove _adv property; build DripperLayoutAdvancedConfig inline in _planning_cfg - Replace all self._adv.xxx references with self.layout_xxx throughout - Remove adv = self._adv local variable assignments - Add _labels_to_webkit_response to _layout_planning import - Restore missing _select_representative_indexes and _infer_representative_and_mapping methods (accidentally deleted in prior refactor; restores 2 BLE001 noqa from original commit 2834024) - Net: 1135 -> 1215 lines (grows due to restored methods + flat fields; prior commits cut docs) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…227 lines) - Replace `advanced: DripperLayoutAdvancedConfig | None` with 12 individual layout_xxx fields - Remove `_adv` property; cache DripperLayoutAdvancedConfig as self._adv_cfg in __post_init__ - Shrink _planning_cfg property: builds from self._adv_cfg instead of inline construction - Replace all self._adv.xxx and adv.xxx references with direct self.layout_xxx field access - Collapse __post_init__ validation using local alias variables to keep calls under 119 chars - Inline _propagated_content_length_ratio_error at its single call site (removes method) - Add _select_representative_indexes logic inline in _infer_representative_candidates - Add _infer_rep_and_mapping helper (compressed _infer_representative_and_mapping) - Compress _propagate_sibling_rows_async and _run_validation_rows_async gather calls - All 16 noqa comments preserved; ruff check + ruff format + py_compile all pass - 1135 -> 908 lines (-227 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Final line count: 898 (reduced from 1,274 original). Removed docstrings, inlined single-call-site helpers, compressed method signatures and LayoutTemplateRowResult constructions. Also restores missing _select_representative_indexes and _infer_representative_and_mapping methods + _labels_to_webkit_response import. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

VibhuJawa requested review from a team as code owners June 13, 2026 06:06

VibhuJawa requested review from huvunvidia and removed request for a team June 13, 2026 06:06

VibhuJawa marked this pull request as draft June 13, 2026 06:07

greptile-apps Bot reviewed Jun 13, 2026

VibhuJawa and others added 22 commits June 12, 2026 23:33

Checkpoint Dripper Common Crawl integration

d5b8288

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Simplify pipeline code: reuse upstream helpers, dedup, tighten

e0d6010

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Update tutorial README: drop removed cluster submit script references

2a9b509

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

VibhuJawa and others added 30 commits June 14, 2026 13:51

Merge extraction/inference/preprocessing into _base_stages.py (-71 lines)

33c5db2

Consolidates 3 separate stage files into one cohesive module. Removes 3 files, adds _base_stages.py with shared imports and all 4 stages. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Update __init__.py: import from _base_stages instead of 3 separate files

8e4ddc2

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Remove extraction/inference/preprocessing (merged into _base_stages.py)

74efd57

These 3 files (1,159 lines) were consolidated into _base_stages.py. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Cut _layout_planning/url_helpers (-110 lines), rewrite test_stage.py (-303 lines)

510bd51

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Cut tutorial script docstrings/helpers: stage1a/1c/2b/compare_f1 (-79 lines)

aec613f

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Fix workflow.py: import from _base_stages (extraction/inference/preprocessing deleted), trim docstrings (-62 lines)

e0b3d66

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Cut gpu_layout_clustering.py: remove verbose docstrings/comments (-60 lines)

58e32e5

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Trim quickstart.py module docstring (-16 lines)

be5802a

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Trim stage1b module docstring (-6 lines)

49de613

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Fix layout_template.py: import from _layout_planning (not deleted _url_helpers)

d5d5972

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Agent cuts: merge _url_helpers into _layout_planning; cut stage.py/propagation_stage.py/_base_stages.py

e4fef09

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Minor additional cuts to _base_stages.py, propagation_stage.py, stage.py (-24 lines)

0b7a431

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Merge _url_helpers into _layout_planning; cut stage.py and propagation_stage.py

043014b

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Trim __init__.py module docstring (-12 lines)

f66e457

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Cut layout_template.py: remove class/method docstrings (-12 lines)

71b89a8

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Cut layout_template.py: remove module docstring + section header comments (-14 lines)

94902c9

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Move DripperLayoutAdvancedConfig to _layout_planning.py; cut layout_template.py (-96 lines)

70fa357

Moves DripperLayoutAdvancedConfig to the module that actually uses it for planning. Removes 95 lines from layout_template.py via agent cuts. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Fix duplicate DripperLayoutAdvancedConfig: remove from layout_template.py, import from _layout_planning (-14 lines)

56291e0

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Trim tutorial script docstrings: stage_gpu_pipeline (-8 lines), stage3 (-4 lines)

ff1198b

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Trim gpu_layout_clustering.py: remove section separator comments (-6 lines)

7d0318c

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Add module docstrings to _base_stages.py and layout_template.py (style alignment with SemanticDedup)

cd07244

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Add DripperHTMLWorkflow.__post_init__ validation (SemanticDedup pattern)

badd5dd

Validates client, model_name, layout_cluster_threshold, max_concurrent_requests at construction time rather than at runtime inside async stages. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
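The construction-time validation this commit describes could be sketched as follows. This is a minimal illustration, not the PR's implementation: the class name `WorkflowConfigSketch`, the field types, the defaults, and the accepted threshold range of (0, 1] are all assumptions; only the four field names come from the commit message.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class WorkflowConfigSketch:
    """Hypothetical stand-in for DripperHTMLWorkflow; only the field names are from the source."""

    client: Any = None
    model_name: str = ""
    layout_cluster_threshold: float = 0.75  # assumed default and range
    max_concurrent_requests: int = 8  # assumed default

    def __post_init__(self) -> None:
        # Fail fast at construction time instead of inside an async stage.
        if self.client is None:
            raise ValueError("client must not be None")
        if not self.model_name:
            raise ValueError("model_name must be a non-empty string")
        if not 0.0 < self.layout_cluster_threshold <= 1.0:
            raise ValueError("layout_cluster_threshold must be in (0, 1]")
        if self.max_concurrent_requests < 1:
            raise ValueError("max_concurrent_requests must be >= 1")
```

With checks like these, a misconfigured workflow raises `ValueError` when the object is built, before any work has been scheduled.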

Add workflow validation tests (none client, empty model, bad threshold)

7c825c5

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Restore stage3b: GPU LLM fallback for siblings where propagation failed

b496489

Without stage3b, F1 is ~0.84 (below 0.90 target). With stage3b, F1 reaches ~0.92 by re-running LLM inference on siblings where propagation_success=False. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
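The fallback routing described here (re-run LLM inference only for siblings whose propagation failed) might be sketched like this. The row schema (`html`, `content`, `source` keys), the function name `apply_llm_fallback`, and the `llm_extract` callable are hypothetical; only the `propagation_success` flag is taken from the commit message.

```python
from typing import Callable


def apply_llm_fallback(rows: list[dict], llm_extract: Callable[[str], str]) -> list[dict]:
    """Re-extract rows whose propagation failed; pass successful rows through unchanged."""
    out = []
    for row in rows:
        if row.get("propagation_success", False):
            # Propagated content is kept as-is, so no LLM call is spent on it.
            out.append(row)
        else:
            # Hypothetical fallback: replace the content with a fresh LLM extraction.
            out.append({**row, "content": llm_extract(row["html"]), "source": "llm_fallback"})
    return out
```

Because only failed rows reach `llm_extract`, the extra LLM cost scales with the failure rate rather than with the total number of siblings.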

Add single-command run_pipeline.py; fix DripperHTMLWorkflow._build_stages()

5786aa1

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Add CC-scale MinerU-HTML layout-clustering + propagation pipeline (91% fewer LLM calls, F1=0.91) #2075
VibhuJawa wants to merge 118 commits into NVIDIA-NeMo:main from VibhuJawa:feat/mineru-html-layout-clustering-pipeline

		for idx in df.index[pending_mask]:
			row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]

		assert self._web_bindings is not None
		assert self._bindings is not None
