Add CC-scale MinerU-HTML layout-clustering + propagation pipeline (91% fewer LLM calls, F1=0.91)#2075

Draft
VibhuJawa wants to merge 118 commits into
NVIDIA-NeMo:mainfrom
VibhuJawa:feat/mineru-html-layout-clustering-pipeline

@VibhuJawa VibhuJawa commented Jun 13, 2026

Dripper Common Crawl Clustering Pipeline

Full clustering+propagation pipeline for HTML extraction at CC scale using NeMo Curator best practices.

Architecture

| Stage | Implementation | Result |
|---|---|---|
| Stage 1a | DOMFeatureExtractionStage(ProcessingStage), RayActorPoolExecutor | 86k pages in 102s |
| Stage 1b | HostDBSCANStage(ProcessingStage), Resources(gpus=1), cuML | 92.9% LLM call reduction |
| Stage 1c+2+2b | GPU vLLM kv-fp8 via Pipeline.run() | 164.9 p/s/node |
| Stage 3 | _Stage3PropagationStage(ProcessingStage), PPT=16, LPT+HTML-size sort | F1=0.8450 |
| Stage 3b | GPU fallback for over-extracted siblings (pred>2.5x ref) | F1=0.9175 |
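
The Stage 3b routing rule (pred>2.5x ref) can be sketched as a simple length-ratio gate. This is a hedged illustration only: it assumes "pred" is the propagated text for a sibling, "ref" is the text extracted for the cluster representative, and length is measured in characters. The function name `is_over_extracted` is hypothetical and does not come from the PR.

```python
def is_over_extracted(pred_text: str, ref_text: str, max_ratio: float = 2.5) -> bool:
    """Return True when the propagated text is suspiciously longer than the reference.

    Assumption: lengths are compared in characters; the PR may use tokens instead.
    """
    ref_len = len(ref_text)
    if ref_len == 0:
        # Nothing to compare against: treat any non-empty prediction as suspect.
        return len(pred_text) > 0
    return len(pred_text) > max_ratio * ref_len


# Siblings that trip the gate would be sent back to the LLM instead of being kept.
siblings = {"page-1": "x" * 26, "page-2": "x" * 12}
reference = "x" * 10
fallback_ids = [pid for pid, text in siblings.items() if is_over_extracted(text, reference)]
```

In this sketch a sibling exactly at the 2.5x boundary is kept; only strictly larger outputs are rerouted.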

Validated Results

  • F1 = 0.9175 > 0.90 target (+0.181 vs original pipeline)
  • GPU throughput = 164.9 p/s/node > 163 target
  • Curator best practices: ProcessingStage, RayActorPoolExecutor, Resources, dripper_cached_venv
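
For readers unfamiliar with the metric, a token-level F1 between an extracted page and a baseline extraction can be sketched as follows. This is a minimal sketch that assumes whitespace tokenization and multiset overlap; the PR does not show its exact tokenization, so `token_f1` here is an illustrative stand-in rather than the pipeline's scorer.

```python
from collections import Counter


def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 using whitespace tokens and multiset (bag-of-words) overlap."""
    pred_tokens = Counter(pred.split())
    ref_tokens = Counter(ref.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        # Also covers the case where either side is empty.
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


score = token_f1("main article body text", "main article body")
```

Over-extraction lowers precision while recall stays high, which is why a length-ratio fallback can raise the aggregate F1.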

Key design decisions

  • Single execution backend: RayActorPoolExecutor (ProcessPool fallback removed)
  • PPT=16 siblings per task with HTML-size descending sort for optimal load balancing
  • html_element_dict pre-parsed on driver once per cluster (avoids repeated JSON+eval per sibling)
  • Sim-gate: use LBP body even when structural similarity < 0.75 threshold
  • Stage 3b GPU fallback routes 14% of over-extracted siblings back to LLM
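
The PPT=16 and size-descending-sort decisions can be illustrated with a small scheduling sketch. It assumes HTML size is used as a proxy for per-sibling processing cost and that LPT means the greedy "longest processing time first" heuristic. The helper names `make_tasks` and `lpt_assign` are hypothetical and are not taken from the PR.

```python
import heapq


def make_tasks(pages, ppt=16):
    """Sort (page_id, html_size) pairs largest-first and chunk them into tasks of `ppt` siblings."""
    ordered = sorted(pages, key=lambda p: p[1], reverse=True)
    return [ordered[i:i + ppt] for i in range(0, len(ordered), ppt)]


def lpt_assign(costs, n_workers):
    """Greedy LPT: give the next-largest cost to the currently least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_workers)]
    for cost in sorted(costs, reverse=True):
        load, w = heapq.heappop(heap)
        bins[w].append(cost)
        heapq.heappush(heap, (load + cost, w))
    return bins


tasks = make_tasks([("a", 10), ("b", 30), ("c", 20)], ppt=2)
bins = lpt_assign([9, 7, 6, 5, 4, 3], n_workers=2)
```

Dispatching the heaviest work first leaves the small items to fill gaps at the end, which tends to shorten the tail where one worker is still busy and the others are idle.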

LOC summary (tutorials/text/dripper-common-crawl/)

  • stage3_cpu_propagation.py: 1107 → 870 lines (-237) — removed ProcessPool path, collapsed argparse
  • dashboard_server.py: removed from PR (NVIDIA-internal hostnames/paths, dev tool only)

🤖 Generated with Claude Code

@VibhuJawa VibhuJawa requested review from a team as code owners June 13, 2026 06:06
@VibhuJawa VibhuJawa requested review from huvunvidia and removed request for a team June 13, 2026 06:06
copy-pr-bot Bot commented Jun 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@VibhuJawa VibhuJawa marked this pull request as draft June 13, 2026 06:07
greptile-apps Bot commented Jun 13, 2026

Greptile Summary

This PR adds a 7-stage CC-scale MinerU-HTML pipeline that replaces per-page LLM inference with GPU-accelerated DOM clustering (cuML DBSCAN) + template propagation, achieving ~91% fewer LLM calls and F1=0.91 vs the standalone baseline. The tutorial scripts under dripper-common-crawl/ are well-validated and benchmark-proven; however, the library-level integration stages in nemo_curator/stages/text/experimental/dripper/ have three defects that prevent the deferred-propagation path from working correctly.

  • Tutorial pipeline (stage1a–stage3b, run_mineru_pipeline.sh): correct pickle+base64 serialization, robust _parse_mapping_json, two-tier static/dynamic LBP with per-cluster F1 validation, and offline-batched vLLM with chat-template application. All four bugs from development are fixed here.
  • Library stages (DripperHTMLLayoutTemplateStage, DripperHTMLLayoutPropagationStage): the fix for bug #4 (tuple-key destruction) is not ported: stage.py still uses json.dumps(default=str) while propagation_stage.py uses json.loads. Additionally, propagation_stage.py has an iloc/loc mismatch that silently reads wrong rows on non-default-indexed DataFrames.
  • Infra changes (llm_client.py, openai_client.py, core/client.py, dynamo/ray-serve configs): all look correct; the query_model_with_usage extension and _run_with_retry_and_concurrency refactor are clean improvements.

Confidence Score: 3/5

Safe to merge for teams using only the tutorial scripts; the library integration stages for deferred propagation are broken and should not be used until the serialization and indexing defects are fixed.

The tutorial pipeline and benchmarks are solid and correctly implement the four bug fixes. The library-level DripperHTMLLayoutTemplateStage+DripperHTMLLayoutPropagationStage deferred-propagation path has two independent defects: json.dumps(default=str) in stage.py destroys the tuple keys that LayoutBatchParser requires (the same bug #4 the PR fixes in the tutorial scripts), and propagation_stage.py's df.iloc[idx] reads positional rows when idx is a label, silently producing wrong or out-of-bounds results. Both would cause the integrated library path to produce incorrect output with no error surfaced to the caller.

nemo_curator/stages/text/experimental/dripper/propagation_stage.py and nemo_curator/stages/text/experimental/dripper/stage.py (lines around the mapping_json serialization and iloc/loc indexing).

Important Files Changed

| Filename | Overview |
|---|---|
| nemo_curator/stages/text/experimental/dripper/propagation_stage.py | New library stage for deferred template propagation; has two P1 bugs: iloc/loc indexing mismatch that silently reads wrong rows on non-default-indexed DataFrames, and json.loads cannot decode pickle+base64 mapping blobs from stage2b output. |
| nemo_curator/stages/text/experimental/dripper/stage.py | Large new file implementing the full Dripper pipeline stages; the deferred-propagation serialization path uses json.dumps(default=str) which stringifies tuple keys in html_element_dict, leaving bug #4 unfixed for the library integration path even though it was fixed in the tutorial scripts. |
| nemo_curator/stages/text/experimental/dripper/gpu_layout_clustering.py | New GPU-accelerated DBSCAN clustering module using cuML/cupy with sklearn fallback; logic is sound but accesses llm-webkit private internals via fragile name-mangling guesses. |
| nemo_curator/stages/text/experimental/dripper/__init__.py | New package init; exports 7 stages from stage.py but omits DripperHTMLLayoutPropagationStage from propagation_stage.py. |
| tutorials/text/dripper-common-crawl/stage2b_cpu_postprocess.py | Tutorial postprocess stage correctly uses pickle+base64 for lossless serialization of mapping templates with tuple keys; aligns with stage3 deserialization. |
| tutorials/text/dripper-common-crawl/stage3_cpu_propagation.py | Tutorial propagation stage with robust _parse_mapping_json that handles both pickle+base64 and legacy JSON; two-tier static/dynamic LBP with per-cluster validation is well-implemented. |
| nemo_curator/models/client/llm_client.py | Refactors retry/concurrency logic into a reusable _run_with_retry_and_concurrency helper and adds _coerce_generation_config; behavioral change from returning [] to raising RuntimeError on the unexpected-path is intentional and safer. |
| nemo_curator/models/client/openai_client.py | Adds OpenAIChatCompletionResult dataclass and query_model_with_usage methods to expose token usage counters; _usage_int correctly handles bool subclass-of-int edge case. |
| tests/stages/text/experimental/dripper/test_pipeline_correctness.py | 39 pure-Python regression tests covering pickle+base64 round-trip, xpath parsing, HTML coercion, F1 metrics, and grep-based wiring guards; correctly locks in the four bug fixes. |
| tutorials/text/dripper-common-crawl/stage2_gpu_inference_offline.py | Offline-batched vLLM inference with per-GPU subprocess isolation, LPT slice balancing, and correct chat-template application; replaces the per-request Ray-Serve bottleneck. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Stage 1a: CPU\nDOM feature extraction\n(get_feature, 64 workers)"] --> B["Stage 1b: GPU\ncuML DBSCAN clustering\n(pick cluster reps)"]
    B --> C["Stage 1c: CPU\nsimplify + build_prompt\n(reps + singletons only)"]
    C --> D["Stage 2: GPU\nOffline-batched vLLM\n(1 LLM.generate per GPU)"]
    D --> E["Stage 2b: CPU\npickle+base64 serialization ✓"]
    E --> F["Stage 3: CPU\nLayoutBatchParser propagation\n_parse_mapping_json ✓"]
    F --> G{propagation success?}
    G -->|yes| H["Output: dripper_content"]
    G -->|fallback| I["Stage 3b: LLM re-inference"]
    I --> H
    J["Library: DripperHTMLLayoutTemplateStage"] -->|"json.dumps default=str ❌"| K["DripperHTMLLayoutPropagationStage\niloc bug ❌"]
    K --> L["LayoutBatchParser fails"]
    style E fill:#d4edda
    style F fill:#d4edda
    style J fill:#f8d7da
    style K fill:#f8d7da
    style L fill:#f8d7da

Reviews (1): Last reviewed commit: "Fix stage1b GPU OOM: chunk oversized hos..." | Re-trigger Greptile

Comment on lines +117 to +118
for idx in df.index[pending_mask]:
    row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]

P1 iloc vs loc indexing mismatch

df.index[pending_mask] yields label values, but the row is fetched via df.iloc[idx], which uses a positional integer offset, not a label. pd.Series always has a get method, so hasattr(df.iloc[idx], "get") is always True — the fallback branch (df.loc[idx]) is dead code. On any DataFrame whose index is not the default RangeIndex(0, N) (e.g. after filtering, concatenation, or re-indexing), df.iloc[idx] will silently return the wrong row or raise IndexError, while the label-based writes at lines 140–146 (df.at[idx, ...]) will correctly target the right row — producing corrupted output. The read should always use df.loc[idx].

Suggested change
-for idx in df.index[pending_mask]:
-    row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]
+for idx in df.index[pending_mask]:
+    row = df.loc[idx]

Comment on lines +2706 to +2708
is_representative = candidate_idx == representative_idx
results[candidate_idx] = replace(
    candidate_result,

P1 Bug #4 (tuple-key serialization) not fixed in the library integration path

The PR description identifies pickle+base64 as the fix for tuple keys being destroyed by json.dumps (Bug #4), and the tutorial scripts (stage2b_cpu_postprocess.py) correctly use base64.b64encode(pickle.dumps(template)). However, DripperHTMLLayoutTemplateStage still serializes via json.dumps(mapping_data, default=str), which stringifies every tuple key in html_element_dict (e.g. ("div", "class", "content") → "('div', 'class', 'content')"). The companion DripperHTMLLayoutPropagationStage.process() then deserializes with json.loads, so those keys arrive as strings rather than tuples, silently breaking LayoutBatchParser when the integrated library pipeline is used with defer_propagation=True. The fix applied to the tutorial scripts was not ported to these library-level stages.
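
A minimal sketch of the difference between the two serialization routes, using a made-up template dict. The helper names `encode_template` and `decode_template` are hypothetical. One caveat: in standard CPython, `json.dumps` raises `TypeError` on tuple dictionary keys even when `default=str` is supplied, so the sketch stringifies keys explicitly to emulate the lossy path described in this comment.

```python
import base64
import json
import pickle

# Hypothetical template with tuple keys, shaped like the html_element_dict example above.
template = {("div", "class", "content"): "main", ("nav", "id", "menu"): "other"}


def encode_template(obj) -> str:
    """Lossless route: pickle, then base64 so the blob fits in a text column."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_template(blob: str):
    """Inverse of encode_template. Only unpickle data you produced yourself."""
    return pickle.loads(base64.b64decode(blob))


restored = decode_template(encode_template(template))

# Lossy route: keys have to become strings before JSON will accept them.
from_json = json.loads(json.dumps({str(k): v for k, v in template.items()}))
```

After the JSON round trip, a lookup such as `from_json[("div", "class", "content")]` fails with `KeyError`, because the key is now the string `"('div', 'class', 'content')"`. Since `pickle.loads` can execute arbitrary code, this format is only appropriate for blobs produced inside the same trusted pipeline.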

Comment on lines +111 to +114
try:
    mapping_by_cluster[cluster] = json.loads(mapping_json)
except Exception:  # noqa: BLE001
    pass

P1 json.loads cannot decode pickle+base64 mapping blobs from stage2b

When DripperHTMLLayoutPropagationStage is used as a downstream stage of DripperHTMLLayoutTemplateStage with the current json.dumps serialization (see the sibling comment on stage.py), the mapping data is JSON and json.loads works. But if a user feeds the stage2b tutorial output (pickle+base64 format) into this library stage, json.loads will raise on every row and the bare except: pass will silently discard every mapping, causing no_mapping_data_for_cluster errors for 100% of rows. The library stage should use the same _parse_mapping_json helper from stage3_cpu_propagation.py that handles both formats (or at least document the accepted format explicitly).
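
One way a dual-format decoder could look is sketched below. It is not the `_parse_mapping_json` helper from stage3_cpu_propagation.py, whose body is not shown in this thread; `parse_mapping_blob` is a hypothetical name. The sketch assumes a JSON payload always starts with `{` or `[`, and that anything else is a base64-encoded pickle.

```python
import base64
import json
import logging
import pickle

logger = logging.getLogger(__name__)


def parse_mapping_blob(blob):
    """Decode a mapping stored either as JSON text or as a base64-encoded pickle.

    Returns None, and logs a warning, when the blob is empty or cannot be decoded.
    """
    text = (blob or "").strip()
    if not text:
        return None
    try:
        if text[0] in "{[":
            return json.loads(text)
        # validate=True rejects characters outside the base64 alphabet.
        return pickle.loads(base64.b64decode(text, validate=True))
    except Exception as exc:  # broad on purpose, but logged rather than silently dropped
        logger.warning("could not decode mapping blob: %s", exc)
        return None


pickled = base64.b64encode(pickle.dumps({("div", "class"): "main"})).decode("ascii")
decoded_pickle = parse_mapping_blob(pickled)
decoded_json = parse_mapping_blob('{"a": 1}')
```

Logging the failure keeps the "every mapping was discarded" case visible in worker logs instead of surfacing later as per-row no_mapping_data_for_cluster errors.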

Comment on lines +163 to +164
assert self._web_bindings is not None
assert self._bindings is not None

P2 assert statements are disabled by the -O interpreter flag

Using bare assert for runtime precondition checks in library code is unsafe: Python's -O (optimize) mode removes all assert statements at compile time. If a caller constructs the stage without calling setup() first (e.g., in a worker process where setup() was expected via worker_metadata), these guards silently disappear and the code proceeds with None bindings, yielding opaque AttributeErrors later. Replace with explicit RuntimeError guards.
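
A sketch of the suggested replacement, using a stand-in class rather than the real stage. `ExampleStage` and `_require_bindings` are hypothetical names; only the attribute name `_bindings` is taken from the excerpt above.

```python
class ExampleStage:
    """Stand-in for a stage whose bindings are loaded lazily in setup()."""

    def __init__(self):
        self._bindings = None

    def setup(self):
        # The real stage would load llm-webkit bindings here; a dict is enough for the sketch.
        self._bindings = {"ready": True}

    def _require_bindings(self):
        # Unlike `assert`, this check still runs under `python -O`.
        if self._bindings is None:
            raise RuntimeError("ExampleStage.setup() must be called before process()")
        return self._bindings

    def process(self, item: str) -> str:
        self._require_bindings()
        return f"processed:{item}"


stage = ExampleStage()
try:
    stage.process("x")
    raised = False
except RuntimeError:
    raised = True

stage.setup()
result = stage.process("x")
```

The explicit guard turns a later, opaque AttributeError on None into an immediate error that names the missing setup() call.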

Comment on lines +17 to +35
from nemo_curator.stages.text.experimental.dripper.stage import (
    DripperHTMLExtractionStage,
    DripperHTMLExtractionPipelineStage,
    DripperHTMLInferenceStage,
    DripperHTMLLayoutClusteringStage,
    DripperHTMLLayoutTemplateStage,
    DripperHTMLPostprocessStage,
    DripperHTMLPreprocessStage,
)

__all__ = [
    "DripperHTMLExtractionStage",
    "DripperHTMLExtractionPipelineStage",
    "DripperHTMLInferenceStage",
    "DripperHTMLLayoutClusteringStage",
    "DripperHTMLLayoutTemplateStage",
    "DripperHTMLPostprocessStage",
    "DripperHTMLPreprocessStage",
]

P2 DripperHTMLLayoutPropagationStage missing from public exports

propagation_stage.py introduces DripperHTMLLayoutPropagationStage as the CPU companion to DripperHTMLLayoutTemplateStage, but it is not imported or listed in __all__ here. Consumers of nemo_curator.stages.text.experimental.dripper cannot discover or import the new stage through the package's public surface without going directly to the submodule.

Comment on lines +194 to +217

    layout_ids = [int(x) for x in layout_ids]

    success = []
    for idd, sample in zip(layout_ids, sampled_list, strict=False):
        sample["layout_id"] = idd
        sample["max_layer_n"] = layer_n
        success.append(sample)

    n_clusters = len({x for x in layout_ids if x >= 0})
    n_noise = sum(1 for x in layout_ids if x < 0)
    logger.info(f"cluster_html_struct_gpu: n={len(sampled_list)} → {n_clusters} clusters ({n_noise} noise)")
    return success, list(set(layout_ids))


def _get_simp_features(cosin_mod: ModuleType) -> Callable:
    """Return llm-webkit's feature-vectorization function.

    The helper that turns raw layout features into the (tags, attrs) vectors lives
    in ``llm_web_kit.html_layout.html_layout_cosin`` as a module-private function.
    Python name-mangles a module-level ``__simp_features`` to
    ``_<module>__simp_features``, so we look up both that mangled name and the
    bare name explicitly. We raise a clear error if neither is present (rather
    than silently scanning ``dir()``) so an upstream rename surfaces immediately.

P2 Private name-mangling access is fragile across llm-webkit versions

_get_simp_features tries to locate the feature-vectorization helper by guessing multiple mangled names (_html_layout_cosin__simp_features, __simp_features, simp_features) via getattr. Python's name-mangling rule applies to class bodies only — a module-level __simp_features is not automatically mangled, so the string "_html_layout_cosin__simp_features" relies on an implementation detail of how the module was authored. Any refactoring or rename in llm_web_kit silently falls through to the third fallback name (simp_features) and then raises a RuntimeError, breaking all GPU clustering without a clear error message until runtime. Consider pinning the llm-webkit version or exposing this as a public helper in llm-webkit upstream.
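
The mangling rule can be checked with a throwaway module. The sketch below builds a fake module in memory (it does not import llm_web_kit) and shows that a module-level double-underscore function keeps its bare name, while a class-level one is rewritten. `resolve_private` is a hypothetical helper that mirrors the explicit-lookup approach and uses the three candidate names listed in this comment.

```python
import types

SOURCE = '''
def __simp_features(x):
    return ("module-level", x)

class Holder:
    def __hidden(self):
        return "class-level"
'''

# Build a stand-in module in memory; the name is illustrative only.
mod = types.ModuleType("html_layout_cosin_demo")
exec(SOURCE, mod.__dict__)

module_names = [n for n in vars(mod) if "simp_features" in n]
class_names = [n for n in vars(mod.Holder) if "hidden" in n]


def resolve_private(module, candidates):
    """Return the first callable attribute found, or raise with the names that were tried."""
    for name in candidates:
        fn = getattr(module, name, None)
        if callable(fn):
            return fn
    raise RuntimeError(f"none of {list(candidates)} found in {module.__name__}")


fn = resolve_private(mod, ("_html_layout_cosin__simp_features", "__simp_features", "simp_features"))
```

Here `module_names` contains only `__simp_features` and `class_names` contains `_Holder__hidden`, so a lookup of a `_<module>__name` form succeeds only if the upstream module defines that exact name itself.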

VibhuJawa and others added 22 commits June 12, 2026 23:33
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…metrics

- stage.py: _split_large_precomputed_layout_group splits oversized
  precomputed layout clusters (exceeding layout_template_max_exact_host_pages)
  using dom_path_hash or feature_hash fingerprinting instead of processing
  them as one monolithic group; standalone mode leaves them to fallback
- main.py: add --layout-baseline-output-dir arg; build_layout_category_timing_metrics
  and build_layout_cluster_timing_metrics add per-category/cluster timing
  breakdowns; build_layout_baseline_comparison_metrics computes incremental
  non-exact layout savings and F1 against a pure-Dripper baseline run
- submit_nebius_single_node.sh: wire LAYOUT_BASELINE_OUTPUT_DIR passthrough
- test_stage.py: cover standalone and dom_path_hash large-group splitting

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Tracks the CPU-only layout diagnostic pipeline alongside the rest of the
dripper-common-crawl tutorial so diagnostics are reproducible from the repo:

- remote_dripper_layout_diag.py: CPU-only replication of stage.py layout
  propagation; produces layout_diag_clusters.csv, layout_diag_propagation.csv,
  layout_diag_metadata.json
- summarize_dripper_layout_diag.py: post-processes diagnostic CSVs; reports
  F1 distribution, call-reduction estimate, worst clusters
- submit_nebius_layout_diag.sh: Slurm submission wrapper; syncs
  remote_dripper_layout_diag.py to remote, generates SBATCH script
- lib_nebius_ssh.sh: SSH helper library; required by submit_nebius_layout_diag.sh

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
stage.py:
- QW-2: Sort exemplars_by_layout.items() in both _assign_layout_by_exemplar_similarity
  methods to make cluster boundary assignment deterministic across runs
- QW-3: Replace propagated_results.pop(0) with index-based access via enumerate
  to eliminate fragile parallel-list coupling
- QW-4: Reconcile layout_template_more_noise_enable default to True (matches
  llm-webkit upstream and diag script default)
- GAP-2: Fix max_layer_n sourcing at both clustering locations to skip noise
  pages (layout_id=-1) when reading the representative layer depth

remote_dripper_layout_diag.py:
- QW-1: Track f1_counts[mode] separately so per-mode mean F1 uses the correct
  denominator when one mode has more errors than another

summarize_dripper_layout_diag.py:
- QW-5: Add HOST_MIN_F1_BEGIN section showing min and mean F1 per host for
  saved rows; directly surfaces publicpay-style false-pass regressions
- QW-6: Compute and print validation_probe_overhead_llm_calls and
  estimated_net_call_reduction subtracting validation sample LLM cost

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
remote_dripper_layout_diag.py:
- New build_precomputed_layout_shards(): loads a precomputed manifest
  parquet (dripper_layout_id column) and groups base_df rows globally
  by layout ID, bypassing the per-shard DBSCAN that limits clusters to
  64-row batch windows
- Main loop: when LAYOUT_PRECOMPUTED_MANIFEST is set, each precomputed
  layout cluster becomes one shard and raw_groups=[shard_indexes],
  using the layout ID in cluster_id for traceability
- page_signature_mode sub-splitting still applied within each global group

submit_nebius_layout_diag.sh:
- Wire LAYOUT_PRECOMPUTED_MANIFEST env var through to the job script

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
stage.py:
- _LayoutTemplateRowResult: add layout_pending_propagation and
  layout_mapping_json fields
- DripperHTMLLayoutTemplateStage + DripperHTMLExtractionPipelineStage:
  add layout_template_defer_propagation flag
- _process_layout_group_with_status: when defer_propagation=True and
  validation passes, mark remaining sibling rows as pending instead of
  running LayoutBatchParser (the 11s/row CPU bottleneck); store mapping_data
  JSON on the pending rows so the propagation stage can reconstruct it
- process(): emit dripper_layout_pending_propagation and
  dripper_layout_mapping_json columns when defer_propagation=True
- Wire defer_propagation through pipeline stage to inner stage

propagation_stage.py (new):
- DripperHTMLLayoutPropagationStage: CPU-only stage that reads GPU output
  with pending_propagation markers, looks up representative mapping_data by
  cluster, runs LayoutBatchParser for each sibling, applies content-length
  ratio guard, and marks results

Expected impact: GPU stage drops from ~600s to ~250s by removing the
23,859s of CPU propagation work from the H100 job. H100-hours projection
improves from 387K to ~160K.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
main.py:
- Add --layout-template-defer-propagation arg; wire to pipeline stage and
  metrics; insert DripperHTMLLayoutPropagationStage after the GPU stage
  when layout_template_mode=True and defer_propagation=True
- Fix singleton shard explosion: _layout_key_or_row_fallback now uses
  host key (~unassigned-host-{host}) as fallback instead of per-row
  sentinel (~unassigned-layout-{row_id}), so unassigned pages share shards
  rather than creating one shard each — reduces shard count by 10-30%
  on datasets with many unclustered pages
- Import DripperHTMLLayoutPropagationStage from propagation_stage module

submit_nebius_single_node.sh:
- Wire LAYOUT_TEMPLATE_DEFER_PROPAGATION env var through to
  --layout-template-defer-propagation / --no-layout-template-defer-propagation

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
The propagation stage (DripperHTMLLayoutPropagationStage) looks up
layout_mapping_json from the representative row of each cluster, but the
previous implementation stored it on every sibling row instead.

Fix: compute mapping_json_for_representative once and set layout_mapping_json
on the representative result; siblings get empty string. Removes the per-sibling
json.dumps() call which was wasting memory storing N copies of the same data.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Step-by-step Jupyter notebook for DGX A100 covering:
  0. Setup & imports
  1. Load 8192 CC pages, view raw HTML
  2. DOM feature extraction with llm-webkit get_feature
  3. Layout clustering (DBSCAN) — live demo + global cluster viz
  4. Representative selection — scoring formula walkthrough
  5. HTML simplification — show 12.83% token reduction
  6. LLM extraction — MinerU-HTML main/other labeling
  7. Template propagation — CPU-only sibling inference
  8. Validation — token F1 vs pure Dripper baseline
  9. Cost analysis — H100-hours comparison chart
  10. Full pipeline — DripperHTMLExtractionPipelineStage end-to-end

Data: /raid/vjawa/dripper_tutorial/ on dgx-a100-02 (10.184.206.11)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
lib_nebius_ssh.sh: add nebius_resolve_rsync_host() which maps any
nb-hel-cs-001-* node to nb-hel-cs-001-dc-01.nvidia.com (or dc-02 via
NEBIUS_RSYNC_HOST env override). DC nodes are significantly faster for bulk
file transfers than login or vscode nodes.

submit_nebius_layout_diag.sh: wire rsync_host via nebius_resolve_rsync_host
so both the rsync SSH command string and the destination host use the dc node.

All scripts in .claude/scripts/ updated with the same pattern.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e; graceful baseline loading

- Replace pd.read_parquet() with read_parquet_safe() which uses
  pq.ParquetFile().read().to_pandas() — avoids ArrowInvalid from
  ParquetDataset memory-map buffering on pyarrow 23.0.1
- Fix CURATOR_REPO to /raid/vjawa/nemo-curator-adlr-mm/submodules/Curator
- Baseline loading is now try/except with clear re-transfer instructions
- Cells 22/23 guard against baseline=None

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
bindings.simplify() does not exist — the API is:
  case = bindings.case_cls(bindings.input_cls(raw_html=html, url=url))
  case = bindings.simplify_single_input(case)
  simplified = DripperHTMLExtractionStage._get_processed_attr(case, 'simpled_html')
  mapped     = DripperHTMLExtractionStage._get_processed_attr(case, 'map_html')

Add simplify_html() helper function in cell-19 so all downstream cells
can call it cleanly. Fix cells 19, 20, 26 which used the wrong API.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Captures measured timing per stage from all experiments:
- WARC fetch: 1.2s/record sequential, ~50/s async (64 workers)
- get_feature(): 89 pages/s, 11.2ms/page on real CC HTML
- DBSCAN: 11s-91s per batch depending on host size
- LLM inference: 8.19s (representatives), 2.78s (fallback), 1.85s (standalone)
- Template propagation: 11.2s/page mean — 56% of GPU job CPU, 0% GPU
- End-to-end H100: 374s (baseline) → 599s → projected ~250s with defer_propagation
- Bottleneck priority table and next experiments list

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Covers LLM call efficiency, throughput/cost, propagation F1 quality,
per-host analysis, cluster size distribution, content examples, and
a summary scorecard. Paths are configurable at the top; graceful
fallback when runs are not yet complete.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
run_mineru_html_standalone.py:
- Runs MinerU-HTML directly from the upstream library (no Curator infra)
- Reads pages from a manifest parquet (url + html columns)
- Batches pages through MinerUHTML.process() (vLLM backend)
- Writes dripper_results.parquet + metrics.json
- Same output schema as Curator Dripper for fair comparison

submit_mineru_standalone.sh:
- Slurm submit script for the standalone baseline
- Uses smoke-run venv (has mineru_html + vllm already installed)
- 1 node × 8 H100s, configurable batch size and max pages

compare_clustering_vs_standalone.ipynb:
- 8-section comparison notebook (Run A with clustering vs Run B standalone)
- Pre-configured for jobs 334943 (clustering) and 334945 (standalone)
- LLM call efficiency, F1 quality, per-host analysis, scorecard

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
gpu_layout_clustering.py:
- Drop-in replacement for llm-webkit's cluster_html_struct
- For large clusters (≥200 pages): uses cupy batched matmul for cosine
  similarity (one GPU matmul vs N² Python loop) + cuML DBSCAN
- For small clusters: falls back to sklearn (GPU overhead not worth it)
- Falls back gracefully when CUDA/cuML not available
- Preserves exact same tag_weight=0.7/attr_weight=0.3 as upstream

stage.py:
- _load_llm_web_kit_bindings now wires cluster_html_struct_gpu as the
  cluster_html_struct binding — automatic GPU usage when available

Expected speedup for N=3000 pages:
  Before: ~25 min (4.5M Python loop iterations)
  After:  ~5-10s  (cuBLAS batched matmul on H100)
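
A pure-Python sketch of the weighted similarity that the batched matmul is described as accelerating. It assumes the pairwise score is a weighted sum of the cosine similarity of tag vectors and of attribute vectors, with the 0.7/0.3 weights quoted above. The exact upstream formula in llm-webkit is not shown here, and `layout_similarity` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def layout_similarity(tags_a, attrs_a, tags_b, attrs_b, tag_weight=0.7, attr_weight=0.3):
    """Assumed combination: weighted sum of tag-vector and attribute-vector cosine."""
    return tag_weight * cosine(tags_a, tags_b) + attr_weight * cosine(attrs_a, attrs_b)


# Same tag profile, orthogonal attribute profile.
sim = layout_similarity([1, 0], [1, 0], [1, 0], [0, 1])
```

On a GPU the same quantity for all pairs can be obtained by row-normalizing each feature matrix and multiplying it by its own transpose, which is what replaces the N² Python loop.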

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
submit_nebius_single_node.sh: add --extra deduplication_cuda12 to uv sync
so cuml-cu12==25.10.* gets installed in every smoke-run venv — enables
gpu_layout_clustering.py GPU path automatically on H100 nodes.

submit_mineru_standalone.sh: export TENSOR_PARALLEL_SIZE env var in the
SBATCH script so run_mineru_html_standalone.py uses all 8 GPUs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
submit_nebius_single_node.sh: check for DRIPPER_CACHED_VENV path on Lustre.
If it exists (pre-built with cuml, mineru_html, llm_web_kit), use it as
UV_PROJECT_ENVIRONMENT — uv sync --inexact runs in <60s (skips already-
installed packages). Falls back to per-job .venv when cache not present.

Run scripts/create_cached_venv.sh once to build the cache.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
A 3-stage streaming pipeline that replaces per-page LLM extraction with DOM-layout
clustering + template propagation, with strict CPU/GPU stage separation. Built on
the existing experimental Dripper stage bindings.

- Stage 1a (CPU) DOM feature extraction; 1b (GPU) cuML DBSCAN clustering
- Stage 1c (CPU) simplify + build_prompt + item_count
- Stage 2 (GPU) offline-batched vLLM inference (kv-cache fp8) — 6x over per-request serving
- Stage 2b (CPU) parse_result + convert2content + propagation template
- Stage 3 (CPU) two-tier LayoutBatchParser propagation + per-cluster validation
- Stage 3b route propagation failures back to the LLM; trafilatura recovery

Results vs standalone Dripper: token-F1 0.91, ~91% fewer LLM calls,
Stage 2 27->163 pages/s/node. Includes pure-python regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Drop Nebius/Slurm-cluster-bespoke files (lib_nebius_ssh.sh, submit_nebius_*.sh,
submit_mineru_standalone.sh, remote/summarize layout-diag scripts,
build_host_bucketed_index_shards.py, scratch runners) and replace hardcoded
/lustre + cluster-host paths with portable defaults (HF_HOME / ~/.cache/huggingface,
placeholders in notebooks). The pipeline runs via the generic, env-var-driven
run_mineru_pipeline.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
VibhuJawa and others added 30 commits June 14, 2026 13:51
…nes)

Consolidates 3 separate stage files into one cohesive module.
Removes 3 files, adds _base_stages.py with shared imports and all 4 stages.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
These 3 files (1,159 lines) were consolidated into _base_stages.py.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Remove verbose status-tracking dataclasses (_LayoutGroupAttempt,
_LayoutGroupRun, _ValidationOutcome, _InferContext), collapse three
separate validation methods into __post_init__, merge
_select_representative_index into _select_representative_indexes,
inline _missing_layout_result / _run_health_check /
_fallback_infer_context / _effective_validation_rows, and refactor
_infer_and_postprocess_row to use flat kwargs instead of _InferContext.
Core algorithm and all ruff checks preserved.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
stage_gpu_pipeline.py: 558->299 (remove all stage1c/2b boilerplate, keep vLLM)
stage3_cpu_propagation.py: 674->228 (remove duplicate LBP logic, thin wrapper)
stage1b_gpu_dbscan.py: 361->236 (remove duplicated utilities)
stage1a_feature_extraction.py: 181->161 (collapse helpers)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…(-303 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…ocessing deleted), trim docstrings (-62 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…l_helpers)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…opagation_stage.py/_base_stages.py

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
….py (-24 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…n_stage.py

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…ents (-14 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…emplate.py (-96 lines)

Moves DripperLayoutAdvancedConfig to the module that actually uses it for planning.
Removes 95 lines from layout_template.py via agent cuts.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e.py, import from _layout_planning (-14 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…3 (-4 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…thod docs

- Replace advanced: DripperLayoutAdvancedConfig | None = None with 12 flat layout_xxx fields
- Remove _adv property; build DripperLayoutAdvancedConfig inline in _planning_cfg
- Replace all self._adv.xxx references with self.layout_xxx throughout
- Remove adv = self._adv local variable assignments
- Add _labels_to_webkit_response to _layout_planning import
- Restore missing _select_representative_indexes and _infer_representative_and_mapping methods
  (accidentally deleted in prior refactor; restores 2 BLE001 noqa from original commit 2834024)
- Net: 1135 -> 1215 lines (grows due to restored methods + flat fields; prior commits cut docs)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…227 lines)

- Replace `advanced: DripperLayoutAdvancedConfig | None` with 12 individual layout_xxx fields
- Remove `_adv` property; cache DripperLayoutAdvancedConfig as self._adv_cfg in __post_init__
- Shrink _planning_cfg property: builds from self._adv_cfg instead of inline construction
- Replace all self._adv.xxx and adv.xxx references with direct self.layout_xxx field access
- Collapse __post_init__ validation using local alias variables to keep calls under 119 chars
- Inline _propagated_content_length_ratio_error at its single call site (removes method)
- Add _select_representative_indexes logic inline in _infer_representative_candidates
- Add _infer_rep_and_mapping helper (compressed _infer_representative_and_mapping)
- Compress _propagate_sibling_rows_async and _run_validation_rows_async gather calls
- All 16 noqa comments preserved; ruff check + ruff format + py_compile all pass
- 1135 -> 908 lines (-227 lines)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Final line count: 898 (reduced from 1,274 original).
Removed docstrings, inlined single-call-site helpers, compressed
method signatures and LayoutTemplateRowResult constructions.
Also restores missing _select_representative_indexes and
_infer_representative_and_mapping methods + _labels_to_webkit_response import.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e alignment with SemanticDedup)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Validates client, model_name, layout_cluster_threshold, max_concurrent_requests
at construction time rather than at runtime inside async stages.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Without stage3b, F1 is ~0.84 (below 0.90 target).
With stage3b, F1 reaches ~0.92 by re-running LLM inference on
siblings where propagation_success=False.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…ages()

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>