Add CC-scale MinerU-HTML layout-clustering + propagation pipeline (91% fewer LLM calls, F1=0.91) #2075
Greptile Summary
This PR adds a 7-stage CC-scale MinerU-HTML pipeline that replaces per-page LLM inference with GPU-accelerated DOM clustering (cuML DBSCAN) + template propagation, achieving ~91% fewer LLM calls and F1=0.91 vs the standalone baseline. The tutorial scripts under
Confidence Score: 3/5
Safe to merge for teams using only the tutorial scripts; the library integration stages for deferred propagation are broken and should not be used until the serialization and indexing defects are fixed.
The tutorial pipeline and benchmarks are solid and correctly implement the four bug fixes. The library-level DripperHTMLLayoutTemplateStage+DripperHTMLLayoutPropagationStage deferred-propagation path has two independent defects: json.dumps(default=str) in stage.py destroys the tuple keys that LayoutBatchParser requires (the same bug #4 the PR fixes in the tutorial scripts), and propagation_stage.py's df.iloc[idx] reads positional rows when idx is a label, silently producing wrong or out-of-bounds results. Both would cause the integrated library path to produce incorrect output with no error surfaced to the caller.
nemo_curator/stages/text/experimental/dripper/propagation_stage.py and nemo_curator/stages/text/experimental/dripper/stage.py (lines around the mapping_json serialization and iloc/loc indexing).
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Stage 1a: CPU\nDOM feature extraction\n(get_feature, 64 workers)"] --> B["Stage 1b: GPU\ncuML DBSCAN clustering\n(pick cluster reps)"]
B --> C["Stage 1c: CPU\nsimplify + build_prompt\n(reps + singletons only)"]
C --> D["Stage 2: GPU\nOffline-batched vLLM\n(1 LLM.generate per GPU)"]
D --> E["Stage 2b: CPU\npickle+base64 serialization ✓"]
E --> F["Stage 3: CPU\nLayoutBatchParser propagation\n_parse_mapping_json ✓"]
F --> G{propagation success?}
G -->|yes| H["Output: dripper_content"]
G -->|fallback| I["Stage 3b: LLM re-inference"]
I --> H
J["Library: DripperHTMLLayoutTemplateStage"] -->|"json.dumps default=str ❌"| K["DripperHTMLLayoutPropagationStage\niloc bug ❌"]
K --> L["LayoutBatchParser fails"]
style E fill:#d4edda
style F fill:#d4edda
style J fill:#f8d7da
style K fill:#f8d7da
style L fill:#f8d7da
Reviews (1): Last reviewed commit: "Fix stage1b GPU OOM: chunk oversized hos..."
for idx in df.index[pending_mask]:
    row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]
df.index[pending_mask] yields label values, but the row is fetched via df.iloc[idx], which uses a positional integer offset, not a label. pd.Series always has a get method, so hasattr(df.iloc[idx], "get") is always True — the fallback branch (df.loc[idx]) is dead code. On any DataFrame whose index is not the default RangeIndex(0, N) (e.g. after filtering, concatenation, or re-indexing), df.iloc[idx] will silently return the wrong row or raise IndexError, while the label-based writes at lines 140–146 (df.at[idx, ...]) will correctly target the right row — producing corrupted output. The read should always use df.loc[idx].
Suggested change:
- for idx in df.index[pending_mask]:
-     row = df.iloc[idx] if hasattr(df.iloc[idx], "get") else df.loc[idx]
+ for idx in df.index[pending_mask]:
+     row = df.loc[idx]
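The difference between label and positional reads is easy to reproduce outside the stage. The following is a minimal sketch, assuming only pandas; the frame, the column names (`url`, `pending`) and the variable names are illustrative and not taken from propagation_stage.py.

```python
import pandas as pd

# Illustrative frame whose index stops being RangeIndex(0, N) after filtering.
df = pd.DataFrame(
    {"url": ["a", "b", "c", "d"], "pending": [False, True, False, True]},
    index=[10, 11, 12, 13],
)
kept = df[df["url"] != "a"]                # remaining index labels: 11, 12, 13
pending_mask = kept["pending"].to_numpy()  # boolean mask aligned with `kept`
labels = list(kept.index[pending_mask])    # label values, here 11 and 13

# Label-based read: returns the row that the label names.
rows_by_label = [kept.loc[idx, "url"] for idx in labels]

# Positional read with a label: `kept` has only 3 rows, so position 11 is out of bounds.
try:
    kept.iloc[labels[0]]
    iloc_error = None
except IndexError as exc:
    iloc_error = exc

# When the label happens to be a valid position, iloc silently returns another row.
shuffled = pd.DataFrame({"url": ["x", "y", "z"]}, index=[2, 0, 1])
wrong = shuffled.iloc[2]["url"]  # third row by position
right = shuffled.loc[2]["url"]   # row whose label is 2
```

The second case is the more dangerous one in a pipeline, because nothing raises: the read and the later `df.at[idx, ...]` write simply refer to different rows.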
is_representative = candidate_idx == representative_idx
results[candidate_idx] = replace(
    candidate_result,
Bug #4 (tuple-key serialization) not fixed in the library integration path
The PR description identifies pickle+base64 as the fix for tuple keys being destroyed by json.dumps (Bug #4), and the tutorial scripts (stage2b_cpu_postprocess.py) correctly use base64.b64encode(pickle.dumps(template)). However, DripperHTMLLayoutTemplateStage still serializes via json.dumps(mapping_data, default=str), which stringifies every tuple key in html_element_dict (e.g. ("div", "class", "content") → "('div', 'class', 'content')"). The companion DripperHTMLLayoutPropagationStage.process() then deserializes with json.loads, so those keys arrive as strings rather than tuples, silently breaking LayoutBatchParser when the integrated library pipeline is used with defer_propagation=True. The fix applied to the tutorial scripts was not ported to these library-level stages.
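Why tuple keys do not survive a JSON round trip, and why pickle+base64 does preserve them, can be checked with the standard library alone. This is a sketch under assumptions: `template` is an illustrative stand-in for the real mapping, and how exactly the keys end up stringified inside stage.py is not visible in this excerpt. Note that the standard `json` module rejects tuple dict keys outright even when `default=str` is passed, because `default` is only consulted for values; either way, the tuples are lost.

```python
import base64
import json
import pickle

# Illustrative stand-in for a template mapping keyed by tuples.
template = {("div", "class", "content"): "main", ("div", "id", "footer"): "other"}

# 1) json.dumps refuses tuple keys; the default= hook is not applied to keys.
try:
    json.dumps(template, default=str)
    json_error = None
except TypeError as exc:
    json_error = exc

# 2) Stringifying the keys first makes the dump succeed, but json.loads
#    hands back plain strings, not tuples.
stringified = json.dumps({str(key): value for key, value in template.items()})
from_json = json.loads(stringified)

# 3) pickle + base64 produces an ASCII-safe blob that restores the tuples.
blob = base64.b64encode(pickle.dumps(template)).decode("ascii")
from_pickle = pickle.loads(base64.b64decode(blob))
```

pickle should only be used for blobs the pipeline itself produced, since unpickling untrusted data can execute arbitrary code.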
try:
    mapping_by_cluster[cluster] = json.loads(mapping_json)
except Exception:  # noqa: BLE001
    pass
json.loads cannot decode pickle+base64 mapping blobs from stage2b
When DripperHTMLLayoutPropagationStage is used as a downstream stage of DripperHTMLLayoutTemplateStage with the current json.dumps serialization (see the sibling comment on stage.py), the mapping data is JSON and json.loads works. But if a user feeds the stage2b tutorial output (pickle+base64 format) into this library stage, json.loads will raise on every row and the blanket except Exception: pass will silently discard every mapping, causing no_mapping_data_for_cluster errors for 100% of rows. The library stage should use the same _parse_mapping_json helper from stage3_cpu_propagation.py that handles both formats (or at least document the accepted format explicitly).
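A dual-format decoder could look roughly like the following. This is a hypothetical sketch, not the `_parse_mapping_json` helper from stage3_cpu_propagation.py (whose body is not shown here); the name `parse_mapping_blob` and the choice to try JSON first are assumptions.

```python
import base64
import binascii
import json
import pickle
from typing import Any


def parse_mapping_blob(blob: str) -> Any:
    """Hypothetical helper: decode a mapping stored as JSON or as base64(pickle)."""
    try:
        return json.loads(blob)
    except json.JSONDecodeError:
        pass  # not JSON; fall through to the pickle+base64 format
    try:
        raw = base64.b64decode(blob, validate=True)
        # pickle.loads can execute arbitrary code: only decode blobs this pipeline wrote.
        return pickle.loads(raw)
    except (binascii.Error, pickle.UnpicklingError, EOFError) as exc:
        # Surface the failure instead of silently dropping the mapping.
        raise ValueError("mapping blob is neither JSON nor base64-encoded pickle") from exc


# Usage: both formats decode, and garbage raises instead of vanishing.
tuple_keyed = {("div", "class"): "main"}
decoded_json = parse_mapping_blob('{"a": 1}')
decoded_pickle = parse_mapping_blob(base64.b64encode(pickle.dumps(tuple_keyed)).decode("ascii"))
```

Raising a `ValueError` keeps the caller in control: it can log the cluster ID and count the failure rather than ending up with an empty `mapping_by_cluster`.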
assert self._web_bindings is not None
assert self._bindings is not None
assert statements are disabled by the -O interpreter flag
Using bare assert for runtime precondition checks in library code is unsafe: Python's -O (optimize) mode removes all assert statements at compile time. If a caller constructs the stage without calling setup() first (e.g., in a worker process where setup() was expected via worker_metadata), these guards silently disappear and the code proceeds with None bindings, yielding opaque AttributeErrors later. Replace with explicit RuntimeError guards.
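An explicit guard that survives `python -O` could be sketched as below. `ExampleStage` and `_require_bindings` are hypothetical names used only to show the pattern; the real stage's attributes and setup logic are not reproduced here.

```python
from typing import Any, List, Optional


class ExampleStage:
    """Hypothetical stage; only the guard pattern is the point."""

    def __init__(self) -> None:
        self._bindings: Optional[Any] = None

    def setup(self) -> None:
        # Stand-in for loading the real bindings.
        self._bindings = object()

    def _require_bindings(self) -> Any:
        # Unlike `assert`, this check is not removed by the -O flag.
        if self._bindings is None:
            raise RuntimeError("ExampleStage.setup() must be called before process()")
        return self._bindings

    def process(self, batch: List[str]) -> List[str]:
        self._require_bindings()
        return list(batch)
```

Calling `ExampleStage().process([...])` without `setup()` then fails immediately with a message that names the missing step, instead of an `AttributeError` on `None` further down.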
from nemo_curator.stages.text.experimental.dripper.stage import (
    DripperHTMLExtractionStage,
    DripperHTMLExtractionPipelineStage,
    DripperHTMLInferenceStage,
    DripperHTMLLayoutClusteringStage,
    DripperHTMLLayoutTemplateStage,
    DripperHTMLPostprocessStage,
    DripperHTMLPreprocessStage,
)

__all__ = [
    "DripperHTMLExtractionStage",
    "DripperHTMLExtractionPipelineStage",
    "DripperHTMLInferenceStage",
    "DripperHTMLLayoutClusteringStage",
    "DripperHTMLLayoutTemplateStage",
    "DripperHTMLPostprocessStage",
    "DripperHTMLPreprocessStage",
]
DripperHTMLLayoutPropagationStage missing from public exports
propagation_stage.py introduces DripperHTMLLayoutPropagationStage as the CPU companion to DripperHTMLLayoutTemplateStage, but it is not imported or listed in __all__ here. Consumers of nemo_curator.stages.text.experimental.dripper cannot discover or import the new stage through the package's public surface without going directly to the submodule.
    layout_ids = [int(x) for x in layout_ids]

    success = []
    for idd, sample in zip(layout_ids, sampled_list, strict=False):
        sample["layout_id"] = idd
        sample["max_layer_n"] = layer_n
        success.append(sample)

    n_clusters = len({x for x in layout_ids if x >= 0})
    n_noise = sum(1 for x in layout_ids if x < 0)
    logger.info(f"cluster_html_struct_gpu: n={len(sampled_list)} → {n_clusters} clusters ({n_noise} noise)")
    return success, list(set(layout_ids))


def _get_simp_features(cosin_mod: ModuleType) -> Callable:
    """Return llm-webkit's feature-vectorization function.

    The helper that turns raw layout features into the (tags, attrs) vectors lives
    in ``llm_web_kit.html_layout.html_layout_cosin`` as a module-private function.
    Python name-mangles a module-level ``__simp_features`` to
    ``_<module>__simp_features``, so we look up both that mangled name and the
    bare name explicitly. We raise a clear error if neither is present (rather
    than silently scanning ``dir()``) so an upstream rename surfaces immediately.
Private name-mangling access is fragile across llm-webkit versions
_get_simp_features tries to locate the feature-vectorization helper by guessing multiple mangled names (_html_layout_cosin__simp_features, __simp_features, simp_features) via getattr. Python's name-mangling rule applies to class bodies only — a module-level __simp_features is not automatically mangled, so the string "_html_layout_cosin__simp_features" relies on an implementation detail of how the module was authored. Any refactoring or rename in llm_web_kit silently falls through to the third fallback name (simp_features) and then raises a RuntimeError, breaking all GPU clustering without a clear error message until runtime. Consider pinning the llm-webkit version or exposing this as a public helper in llm-webkit upstream.
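The mangling rule can be verified with a throwaway module. In this sketch the module name `html_layout_cosin` and the function body are stand-ins; nothing here imports or reflects the real llm-webkit code.

```python
import types

# Throwaway module with a double-underscore function (stand-in for the upstream helper).
mod = types.ModuleType("html_layout_cosin")
exec("def __simp_features(x):\n    return ('simplified', x)\n", mod.__dict__)

# Module-level names are stored verbatim: no mangling happens here.
stored_names = [name for name in vars(mod) if "simp_features" in name]


class Caller:
    def direct(self, module):
        # Inside a class body this attribute access is compiled as
        # module._Caller__simp_features, which does not exist.
        return module.__simp_features

    def by_string(self, module):
        # String literals are never mangled, so getattr finds the real name.
        return getattr(module, "__simp_features")


try:
    Caller().direct(mod)
    direct_error = None
except AttributeError as exc:
    direct_error = exc

fn = Caller().by_string(mod)
```

So the mangled prefix depends on the class doing the lookup, not on the module that defines the function, which is why a hard-coded `_<module>__simp_features` string is not something Python itself produces.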
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…metrics
- stage.py: _split_large_precomputed_layout_group splits oversized precomputed layout clusters (exceeding layout_template_max_exact_host_pages) using dom_path_hash or feature_hash fingerprinting instead of processing them as one monolithic group; standalone mode leaves them to fallback
- main.py: add --layout-baseline-output-dir arg; build_layout_category_timing_metrics and build_layout_cluster_timing_metrics add per-category/cluster timing breakdowns; build_layout_baseline_comparison_metrics computes incremental non-exact layout savings and F1 against a pure-Dripper baseline run
- submit_nebius_single_node.sh: wire LAYOUT_BASELINE_OUTPUT_DIR passthrough
- test_stage.py: cover standalone and dom_path_hash large-group splitting
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Tracks the CPU-only layout diagnostic pipeline alongside the rest of the dripper-common-crawl tutorial so diagnostics are reproducible from the repo:
- remote_dripper_layout_diag.py: CPU-only replication of stage.py layout propagation; produces layout_diag_clusters.csv, layout_diag_propagation.csv, layout_diag_metadata.json
- summarize_dripper_layout_diag.py: post-processes diagnostic CSVs; reports F1 distribution, call-reduction estimate, worst clusters
- submit_nebius_layout_diag.sh: Slurm submission wrapper; syncs remote_dripper_layout_diag.py to remote, generates SBATCH script
- lib_nebius_ssh.sh: SSH helper library; required by submit_nebius_layout_diag.sh
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
stage.py:
- QW-2: Sort exemplars_by_layout.items() in both _assign_layout_by_exemplar_similarity methods to make cluster boundary assignment deterministic across runs
- QW-3: Replace propagated_results.pop(0) with index-based access via enumerate to eliminate fragile parallel-list coupling
- QW-4: Reconcile layout_template_more_noise_enable default to True (matches llm-webkit upstream and diag script default)
- GAP-2: Fix max_layer_n sourcing at both clustering locations to skip noise pages (layout_id=-1) when reading the representative layer depth
remote_dripper_layout_diag.py:
- QW-1: Track f1_counts[mode] separately so per-mode mean F1 uses the correct denominator when one mode has more errors than another
summarize_dripper_layout_diag.py:
- QW-5: Add HOST_MIN_F1_BEGIN section showing min and mean F1 per host for saved rows; directly surfaces publicpay-style false-pass regressions
- QW-6: Compute and print validation_probe_overhead_llm_calls and estimated_net_call_reduction subtracting validation sample LLM cost
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
remote_dripper_layout_diag.py:
- New build_precomputed_layout_shards(): loads a precomputed manifest parquet (dripper_layout_id column) and groups base_df rows globally by layout ID, bypassing the per-shard DBSCAN that limits clusters to 64-row batch windows
- Main loop: when LAYOUT_PRECOMPUTED_MANIFEST is set, each precomputed layout cluster becomes one shard and raw_groups=[shard_indexes], using the layout ID in cluster_id for traceability
- page_signature_mode sub-splitting still applied within each global group
submit_nebius_layout_diag.sh:
- Wire LAYOUT_PRECOMPUTED_MANIFEST env var through to the job script
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
stage.py:
- _LayoutTemplateRowResult: add layout_pending_propagation and layout_mapping_json fields
- DripperHTMLLayoutTemplateStage + DripperHTMLExtractionPipelineStage: add layout_template_defer_propagation flag
- _process_layout_group_with_status: when defer_propagation=True and validation passes, mark remaining sibling rows as pending instead of running LayoutBatchParser (the 11s/row CPU bottleneck); store mapping_data JSON on the pending rows so the propagation stage can reconstruct it
- process(): emit dripper_layout_pending_propagation and dripper_layout_mapping_json columns when defer_propagation=True
- Wire defer_propagation through pipeline stage to inner stage
propagation_stage.py (new):
- DripperHTMLLayoutPropagationStage: CPU-only stage that reads GPU output with pending_propagation markers, looks up representative mapping_data by cluster, runs LayoutBatchParser for each sibling, applies content-length ratio guard, and marks results
Expected impact: GPU stage drops from ~600s to ~250s by removing the 23,859s of CPU propagation work from the H100 job. H100-hours projection improves from 387K to ~160K.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
main.py:
- Add --layout-template-defer-propagation arg; wire to pipeline stage and
metrics; insert DripperHTMLLayoutPropagationStage after the GPU stage
when layout_template_mode=True and defer_propagation=True
- Fix singleton shard explosion: _layout_key_or_row_fallback now uses
host key (~unassigned-host-{host}) as fallback instead of per-row
sentinel (~unassigned-layout-{row_id}), so unassigned pages share shards
rather than creating one shard each — reduces shard count by 10-30%
on datasets with many unclustered pages
- Import DripperHTMLLayoutPropagationStage from propagation_stage module
submit_nebius_single_node.sh:
- Wire LAYOUT_TEMPLATE_DEFER_PROPAGATION env var through to
--layout-template-defer-propagation / --no-layout-template-defer-propagation
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
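The effect of the host-level fallback on shard count can be illustrated with a small sketch. The function name `layout_key_or_host_fallback`, the `layout-{id}` key format, and treating a negative layout ID as unassigned are assumptions for illustration; only the `~unassigned-host-{host}` and `~unassigned-layout-{row_id}` sentinel formats come from the commit message above.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


def layout_key_or_host_fallback(layout_id: Optional[int], url: str) -> str:
    """Illustrative shard key: the layout ID when assigned, else a per-host sentinel."""
    if layout_id is not None and layout_id >= 0:
        return f"layout-{layout_id}"  # assumed key format
    host = urlsplit(url).hostname or "unknown"
    return f"~unassigned-host-{host}"


# (layout_id, url) pairs; None and -1 both stand for "no cluster assigned" here.
pages: List[Tuple[Optional[int], str]] = [
    (3, "https://a.example/1"),
    (None, "https://a.example/2"),
    (None, "https://a.example/3"),
    (-1, "https://b.example/x"),
]

host_keys = {layout_key_or_host_fallback(lid, url) for lid, url in pages}

# Previous behaviour for comparison: one sentinel per unassigned row.
row_keys = {
    f"layout-{lid}" if lid is not None and lid >= 0 else f"~unassigned-layout-{row_id}"
    for row_id, (lid, _url) in enumerate(pages)
}
```

With these four pages the per-row sentinel yields four shards and the per-host sentinel yields three, because the two unassigned pages on the same host now share a shard.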
The propagation stage (DripperHTMLLayoutPropagationStage) looks up layout_mapping_json from the representative row of each cluster, but the previous implementation stored it on every sibling row instead. Fix: compute mapping_json_for_representative once and set layout_mapping_json on the representative result; siblings get empty string. Removes the per-sibling json.dumps() call which was wasting memory storing N copies of the same data. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Step-by-step Jupyter notebook for DGX A100 covering:
0. Setup & imports
1. Load 8192 CC pages, view raw HTML
2. DOM feature extraction with llm-webkit get_feature
3. Layout clustering (DBSCAN): live demo + global cluster viz
4. Representative selection: scoring formula walkthrough
5. HTML simplification: show 12.83% token reduction
6. LLM extraction: MinerU-HTML main/other labeling
7. Template propagation: CPU-only sibling inference
8. Validation: token F1 vs pure Dripper baseline
9. Cost analysis: H100-hours comparison chart
10. Full pipeline: DripperHTMLExtractionPipelineStage end-to-end
Data: /raid/vjawa/dripper_tutorial/ on dgx-a100-02 (10.184.206.11)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
lib_nebius_ssh.sh: add nebius_resolve_rsync_host() which maps any nb-hel-cs-001-* node to nb-hel-cs-001-dc-01.nvidia.com (or dc-02 via NEBIUS_RSYNC_HOST env override). DC nodes are significantly faster for bulk file transfers than login or vscode nodes. submit_nebius_layout_diag.sh: wire rsync_host via nebius_resolve_rsync_host so both the rsync SSH command string and the destination host use the dc node. All scripts in .claude/scripts/ updated with the same pattern. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e; graceful baseline loading - Replace pd.read_parquet() with read_parquet_safe() which uses pq.ParquetFile().read().to_pandas() — avoids ArrowInvalid from ParquetDataset memory-map buffering on pyarrow 23.0.1 - Fix CURATOR_REPO to /raid/vjawa/nemo-curator-adlr-mm/submodules/Curator - Baseline loading is now try/except with clear re-transfer instructions - Cells 22/23 guard against baseline=None Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
bindings.simplify() does not exist; the API is:
    case = bindings.case_cls(bindings.input_cls(raw_html=html, url=url))
    case = bindings.simplify_single_input(case)
    simplified = DripperHTMLExtractionStage._get_processed_attr(case, 'simpled_html')
    mapped = DripperHTMLExtractionStage._get_processed_attr(case, 'map_html')
Add simplify_html() helper function in cell-19 so all downstream cells can call it cleanly. Fix cells 19, 20, 26 which used the wrong API.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Captures measured timing per stage from all experiments:
- WARC fetch: 1.2s/record sequential, ~50/s async (64 workers)
- get_feature(): 89 pages/s, 11.2ms/page on real CC HTML
- DBSCAN: 11s-91s per batch depending on host size
- LLM inference: 8.19s (representatives), 2.78s (fallback), 1.85s (standalone)
- Template propagation: 11.2s/page mean (56% of GPU job CPU, 0% GPU)
- End-to-end H100: 374s (baseline) → 599s → projected ~250s with defer_propagation
- Bottleneck priority table and next experiments list
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Covers LLM call efficiency, throughput/cost, propagation F1 quality, per-host analysis, cluster size distribution, content examples, and a summary scorecard. Paths are configurable at the top; graceful fallback when runs are not yet complete. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
run_mineru_html_standalone.py:
- Runs MinerU-HTML directly from the upstream library (no Curator infra)
- Reads pages from a manifest parquet (url + html columns)
- Batches pages through MinerUHTML.process() (vLLM backend)
- Writes dripper_results.parquet + metrics.json
- Same output schema as Curator Dripper for fair comparison
submit_mineru_standalone.sh:
- Slurm submit script for the standalone baseline
- Uses smoke-run venv (has mineru_html + vllm already installed)
- 1 node × 8 H100s, configurable batch size and max pages
compare_clustering_vs_standalone.ipynb:
- 8-section comparison notebook (Run A with clustering vs Run B standalone)
- Pre-configured for jobs 334943 (clustering) and 334945 (standalone)
- LLM call efficiency, F1 quality, per-host analysis, scorecard
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
gpu_layout_clustering.py:
- Drop-in replacement for llm-webkit's cluster_html_struct
- For large clusters (≥200 pages): uses cupy batched matmul for cosine similarity (one GPU matmul vs N² Python loop) + cuML DBSCAN
- For small clusters: falls back to sklearn (GPU overhead not worth it)
- Falls back gracefully when CUDA/cuML not available
- Preserves exact same tag_weight=0.7/attr_weight=0.3 as upstream
stage.py:
- _load_llm_web_kit_bindings now wires cluster_html_struct_gpu as the cluster_html_struct binding (automatic GPU usage when available)
Expected speedup for N=3000 pages:
Before: ~25 min (4.5M Python loop iterations)
After: ~5-10s (cuBLAS batched matmul on H100)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
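The idea of replacing the pairwise loop with one matrix product can be sketched on the CPU with NumPy. This is not the code in gpu_layout_clustering.py: the function name is hypothetical, and combining the tag and attribute similarities as a weighted sum with the 0.7/0.3 weights is an assumption about how those weights are applied.

```python
import numpy as np


def weighted_cosine_similarity(tags, attrs, tag_weight=0.7, attr_weight=0.3):
    """Illustrative all-pairs similarity: two matmuls instead of an N^2 Python loop."""

    def unit_rows(matrix):
        # Normalise each row to unit length; all-zero rows are left as zeros.
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        return matrix / np.where(norms == 0, 1.0, norms)

    t = unit_rows(np.asarray(tags, dtype=np.float64))
    a = unit_rows(np.asarray(attrs, dtype=np.float64))
    # Row i, column j holds the combined similarity between pages i and j.
    return tag_weight * (t @ t.T) + attr_weight * (a @ a.T)


# Three toy pages: pages 0 and 1 are identical, page 2 differs.
sim = weighted_cosine_similarity(
    tags=[[1, 0], [1, 0], [0, 1]],
    attrs=[[1, 1], [1, 1], [1, 0]],
)
distance = 1.0 - sim  # a precomputed distance matrix of the kind DBSCAN can consume

# Number of unordered page pairs a pairwise loop would visit for N=3000.
pairs_for_3000 = 3000 * 2999 // 2
```

`pairs_for_3000` evaluates to 4,498,500, which matches the "4.5M Python loop iterations" figure above. The same array expressions are expected to work with cupy arrays in place of NumPy ones, though that is untested here.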
submit_nebius_single_node.sh: add --extra deduplication_cuda12 to uv sync so cuml-cu12==25.10.* gets installed in every smoke-run venv — enables gpu_layout_clustering.py GPU path automatically on H100 nodes. submit_mineru_standalone.sh: export TENSOR_PARALLEL_SIZE env var in the SBATCH script so run_mineru_html_standalone.py uses all 8 GPUs. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
submit_nebius_single_node.sh: check for DRIPPER_CACHED_VENV path on Lustre. If it exists (pre-built with cuml, mineru_html, llm_web_kit), use it as UV_PROJECT_ENVIRONMENT — uv sync --inexact runs in <60s (skips already- installed packages). Falls back to per-job .venv when cache not present. Run scripts/create_cached_venv.sh once to build the cache. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
A 3-stage streaming pipeline that replaces per-page LLM extraction with DOM-layout clustering + template propagation, with strict CPU/GPU stage separation. Built on the existing experimental Dripper stage bindings.
- Stage 1a (CPU) DOM feature extraction; 1b (GPU) cuML DBSCAN clustering
- Stage 1c (CPU) simplify + build_prompt + item_count
- Stage 2 (GPU) offline-batched vLLM inference (kv-cache fp8), 6x over per-request serving
- Stage 2b (CPU) parse_result + convert2content + propagation template
- Stage 3 (CPU) two-tier LayoutBatchParser propagation + per-cluster validation
- Stage 3b route propagation failures back to the LLM; trafilatura recovery
Results vs standalone Dripper: token-F1 0.91, ~91% fewer LLM calls, Stage 2 27->163 pages/s/node. Includes pure-python regression tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Drop Nebius/Slurm-cluster-bespoke files (lib_nebius_ssh.sh, submit_nebius_*.sh, submit_mineru_standalone.sh, remote/summarize layout-diag scripts, build_host_bucketed_index_shards.py, scratch runners) and replace hardcoded /lustre + cluster-host paths with portable defaults (HF_HOME / ~/.cache/huggingface, placeholders in notebooks). The pipeline runs via the generic, env-var-driven run_mineru_pipeline.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…nes) Consolidates 3 separate stage files into one cohesive module. Removes 3 files, adds _base_stages.py with shared imports and all 4 stages. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
These 3 files (1,159 lines) were consolidated into _base_stages.py. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Remove verbose status-tracking dataclasses (_LayoutGroupAttempt, _LayoutGroupRun, _ValidationOutcome, _InferContext), collapse three separate validation methods into __post_init__, merge _select_representative_index into _select_representative_indexes, inline _missing_layout_result / _run_health_check / _fallback_infer_context / _effective_validation_rows, and refactor _infer_and_postprocess_row to use flat kwargs instead of _InferContext. Core algorithm and all ruff checks preserved. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com> Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
stage_gpu_pipeline.py: 558->299 (remove all stage1c/2b boilerplate, keep vLLM)
stage3_cpu_propagation.py: 674->228 (remove duplicate LBP logic, thin wrapper)
stage1b_gpu_dbscan.py: 361->236 (remove duplicated utilities)
stage1a_feature_extraction.py: 181->161 (collapse helpers)
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…(-303 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…ocessing deleted), trim docstrings (-62 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
… lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…l_helpers) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…opagation_stage.py/_base_stages.py Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
….py (-24 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…n_stage.py Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…ents (-14 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…emplate.py (-96 lines) Moves DripperLayoutAdvancedConfig to the module that actually uses it for planning. Removes 95 lines from layout_template.py via agent cuts. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e.py, import from _layout_planning (-14 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…3 (-4 lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…lines) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…thod docs
- Replace advanced: DripperLayoutAdvancedConfig | None = None with 12 flat layout_xxx fields
- Remove _adv property; build DripperLayoutAdvancedConfig inline in _planning_cfg
- Replace all self._adv.xxx references with self.layout_xxx throughout
- Remove adv = self._adv local variable assignments
- Add _labels_to_webkit_response to _layout_planning import
- Restore missing _select_representative_indexes and _infer_representative_and_mapping methods (accidentally deleted in prior refactor; restores 2 BLE001 noqa from original commit 2834024)
- Net: 1135 -> 1215 lines (grows due to restored methods + flat fields; prior commits cut docs)
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…227 lines)
- Replace `advanced: DripperLayoutAdvancedConfig | None` with 12 individual layout_xxx fields
- Remove `_adv` property; cache DripperLayoutAdvancedConfig as self._adv_cfg in __post_init__
- Shrink _planning_cfg property: builds from self._adv_cfg instead of inline construction
- Replace all self._adv.xxx and adv.xxx references with direct self.layout_xxx field access
- Collapse __post_init__ validation using local alias variables to keep calls under 119 chars
- Inline _propagated_content_length_ratio_error at its single call site (removes method)
- Add _select_representative_indexes logic inline in _infer_representative_candidates
- Add _infer_rep_and_mapping helper (compressed _infer_representative_and_mapping)
- Compress _propagate_sibling_rows_async and _run_validation_rows_async gather calls
- All 16 noqa comments preserved; ruff check + ruff format + py_compile all pass
- 1135 -> 908 lines (-227 lines)
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Final line count: 898 (reduced from 1,274 original). Removed docstrings, inlined single-call-site helpers, compressed method signatures and LayoutTemplateRowResult constructions. Also restores missing _select_representative_indexes and _infer_representative_and_mapping methods + _labels_to_webkit_response import. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
…e alignment with SemanticDedup) Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Validates client, model_name, layout_cluster_threshold, max_concurrent_requests at construction time rather than at runtime inside async stages. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
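Construction-time validation of this kind is commonly done in a dataclass `__post_init__`. The sketch below uses the four field names from the commit message, but the class name, the defaults, and the accepted ranges are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExampleLayoutStage:
    """Hypothetical stage config; defaults and bounds are illustrative only."""

    client: Any
    model_name: str
    layout_cluster_threshold: float = 0.9
    max_concurrent_requests: int = 8

    def __post_init__(self) -> None:
        # Fail at construction, before any worker or async task is started.
        if self.client is None:
            raise ValueError("client must not be None")
        if not self.model_name:
            raise ValueError("model_name must be a non-empty string")
        if not 0.0 < self.layout_cluster_threshold <= 1.0:
            raise ValueError("layout_cluster_threshold must be in (0, 1]")
        if self.max_concurrent_requests < 1:
            raise ValueError("max_concurrent_requests must be at least 1")
```

A misconfigured stage then raises where the pipeline is assembled, rather than minutes later inside an async stage on a remote worker.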
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Without stage3b, F1 is ~0.84 (below 0.90 target). With stage3b, F1 reaches ~0.92 by re-running LLM inference on siblings where propagation_success=False. Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
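The fallback routing amounts to a filter on the propagation flag. A minimal sketch, assuming rows are plain dicts: `propagation_success` and `dripper_content` appear elsewhere in this PR, while the `html` and `source` keys, the function name, and the `infer` callable are hypothetical stand-ins for the real LLM call.

```python
from typing import Callable, Dict, List


def reinfer_failed_siblings(rows: List[Dict], infer: Callable[[str], str]) -> int:
    """Re-run inference for rows whose propagation failed; return how many were re-run."""
    reinferred = 0
    for row in rows:
        if not row.get("propagation_success", False):
            row["dripper_content"] = infer(row["html"])
            row["source"] = "llm_fallback"  # hypothetical provenance marker
            reinferred += 1
    return reinferred


# Usage with a stub in place of the LLM.
rows = [
    {"html": "<p>a</p>", "propagation_success": True, "dripper_content": "a"},
    {"html": "<p>b</p>", "propagation_success": False, "dripper_content": ""},
]
n_fallback = reinfer_failed_siblings(rows, infer=lambda html: html.upper())
```

Only the failed rows incur an extra LLM call, which is how the fallback can raise F1 without giving up most of the call reduction.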
…ages() Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Dripper Common Crawl Clustering Pipeline
Full clustering+propagation pipeline for HTML extraction at CC scale using NeMo Curator best practices.
Architecture
DOMFeatureExtractionStage(ProcessingStage), RayActorPoolExecutor
HostDBSCANStage(ProcessingStage), Resources(gpus=1), cuML
_Stage3PropagationStage(ProcessingStage), PPT=16, LPT+HTML-size sort
Validated Results
Key design decisions
LOC summary (tutorials/text/dripper-common-crawl/)
stage3_cpu_propagation.py: 1107 → 870 lines (-237); removed ProcessPool path, collapsed argparse
dashboard_server.py: removed from PR (NVIDIA-internal hostnames/paths, dev tool only)
🤖 Generated with Claude Code