Add LLM judge filter stages by arhamm1 · Pull Request #2060 · NVIDIA-NeMo/Curator

arhamm1 · 2026-06-09T22:13:44Z

Summary

Adds LLM-backed text judge stages for analysis, task relevance, and natural-language conditions.

Reuses existing LLMClient, AsyncLLMClient, GenerationConfig, ProcessingStage, and DocumentBatch APIs instead of introducing new client or execution plumbing.

Preserves Curator batch metadata/stage perf, attach parse/provenance columns, and make filtering optional via score/result keep fields.

Support boundary

This PR adds pure text ProcessingStage implementations and mocked unit coverage only.

It does not add a new model server, Ray Serve/Dynamo endpoint, Xenna backend, or full-corpus workflow.

Endpoint/model behavior should be validated separately with a bounded sample before publishing docs that imply production support for a specific model/backend.

Details

LLMAnalysisFilterStage: scores text with a JSON rubric over configurable 1-5 dimension keys, normalizes scores to 0-1, filters against min_score/max_score, and records normalized recommendation/tags, raw response when requested, parse errors, and provenance.

LLMTaskRelevanceFilterStage: extends analysis scoring with task descriptions plus JSON/JSONL validation examples, supports n_shot limiting, and validates malformed validation context early.

LLMConditionFilterStage: classifies rows against a natural-language condition with direct, CoT, few-shot, and CoT+few-shot prompt modes, storing both the boolean condition result and the final keep decision so failure policy is visible.

Verification

Ruff passed locally: uv run --group linting ruff check nemo_curator/stages/text/llm_judge tests/stages/text/llm_judge

Targeted LLM judge unit tests passed locally: tests/stages/text/llm_judge/test_llm_judge.py

Targeted LLM client unit tests passed locally: tests/models/client/test_llm_client.py, tests/models/client/test_openai_client.py

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

copy-pr-bot · 2026-06-09T22:13:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-09T22:19:17Z

Greptile Summary

This PR adds three new ProcessingStage implementations backed by an existing LLMClient/AsyncLLMClient: LLMAnalysisFilterStage (rubric scoring), LLMTaskRelevanceFilterStage (task-aligned scoring with validation examples), and LLMConditionFilterStage (yes/no natural-language classification). All three reuse existing Curator plumbing for batches, generation config, and provenance tracking.

LLMAnalysisFilterStage scores dimensions 1–5 and normalizes with a correct min-max formula (avg − 1) / 4 to [0.0, 1.0]; LLMTaskRelevanceFilterStage extends it with cached validation context built once in __post_init__.
LLMConditionFilterStage supports direct, CoT, few-shot, and CoT+few-shot prompt strategies and separates the condition result from the final keep decision to make failure policy visible.
One defect: the empty-batch early return in base.py skips _write_results, so the output batch lacks the declared output columns (keep_field, score_field, etc.), breaking any downstream stage that reads them.

Confidence Score: 4/5

Safe to merge after fixing the empty-batch path in base.py; all other logic is correct and well-tested.

The empty-batch early return in base.py exits before _write_results is called, so the returned DocumentBatch never receives the output columns declared in outputs(). Any downstream stage that reads keep_field or score_field on a batch that started empty will raise KeyError. Everything else — normalization, async dispatch, condition parsing, validation-context caching — is correct and covered by tests.

nemo_curator/stages/text/llm_judge/base.py — the empty-batch guard at line 479 needs to either be removed or extended to call _write_results before returning.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/llm_judge/base.py	Core judge stage; empty-batch early return skips output-column initialization, leaving downstream consumers without expected columns.
nemo_curator/stages/text/llm_judge/analysis.py	Score normalization correctly uses min-max formula (average − 1) / 4 giving range [0.0, 1.0]; validation logic and parse path look correct.
nemo_curator/stages/text/llm_judge/condition.py	Condition stage correctly separates keep decision from condition result and handles all prompt strategies; no logic issues found.
nemo_curator/stages/text/llm_judge/task_relevance.py	Validation context is correctly computed once in post_init and cached via _validation_context; n_shot, path loading, and example validation all look correct.
nemo_curator/stages/text/llm_judge/_utils.py	JSON extraction helpers correctly handle string escaping and nested braces; normalization and coercion utilities are straightforward and safe.
tests/stages/text/llm_judge/test_llm_judge.py	Good mock coverage for sync/async clients, score normalization, parse failures, and condition strategies; validates min-max normalization and caching.

Sequence Diagram

sequenceDiagram
    participant P as Pipeline
    participant S as LLMJudgeStage.process()
    participant B as build_messages()
    participant C as LLMClient
    participant R as parse_response()
    participant W as _write_results()

    P->>S: DocumentBatch
    S->>S: batch.to_pandas() → df
    alt df.empty (bug: skips W)
        S-->>P: original batch (no output columns)
    else rows present
        loop per row (sync or async)
            S->>B: row dict
            alt empty input / no condition
                B-->>S: None → no_call_result()
            else messages built
                B-->>S: messages
                S->>C: query_model(messages, model)
                C-->>S: raw_response
                S->>R: raw_response, row, messages
                R-->>S: LLMJudgeResult(keep, score, …)
            end
        end
        S->>W: df, results[]
        W-->>S: df with keep_field, score_field, …
        S-->>P: "new DocumentBatch (filtered if filter=True)"
    end

Comments Outside Diff (1)

nemo_curator/stages/text/llm_judge/base.py, line 479-492 (link)

Empty batch silently omits expected output columns

When df.empty the stage returns the original batch before _write_results is called, so none of the declared output columns (keep_field, score_field, record_field, etc.) are ever added to the DataFrame. Any downstream stage that accesses df["llm_analysis_keep"] (or similar) on a previously empty batch will raise KeyError. The same guard applies to the all-filtered-out case immediately below, which continues correctly because _write_results has already been called by that point.

The simplest fix is to drop the early-return guard entirely and let _process_sync([]) / _process_async([]) produce empty result lists — _write_results(df, []) will then assign empty columns with the correct names, giving the output batch the expected schema even when it carries zero rows.

_{Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/llm-judge..." | Re-trigger Greptile}

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

arhamm1 · 2026-06-10T20:33:58Z

/ok to test

copy-pr-bot · 2026-06-10T20:34:02Z

/ok to test

@arhamm1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

arhamm1 · 2026-06-10T23:59:20Z

/ok to test e84fb5d

arhamm1 added 8 commits June 9, 2026 15:09

Add LLM judge package exports

95d0567

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add LLM judge test package marker

8f50a30

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add LLM judge utility helpers

94a9ea6

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add shared LLM judge stage base

cfded63

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add LLM analysis filter stage

4d20afb

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add LLM condition filter stage

e007885

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add LLM task relevance filter stage

dcd49e6

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

Add tests for LLM judge filters

a024197

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

arhamm1 requested a review from a team as a code owner June 9, 2026 22:13

arhamm1 requested review from VibhuJawa and removed request for a team June 9, 2026 22:13

Address LLM judge review feedback

e84fb5d

Signed-off-by: Arham Mehta <arhamm@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 10, 2026 23:59 Inactive

copy-pr-bot Bot temporarily deployed to test June 10, 2026 23:59 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 10, 2026 23:59 Inactive

copy-pr-bot Bot temporarily deployed to public June 11, 2026 00:04 Inactive

copy-pr-bot Bot temporarily deployed to public June 11, 2026 00:05 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 00:11 Inactive

Merge branch 'main' into codex/llm-judge-filters

8bc4654

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add LLM judge filter stages#2060

Add LLM judge filter stages#2060
arhamm1 wants to merge 10 commits into
mainfrom
codex/llm-judge-filters

arhamm1 commented Jun 9, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 9, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

arhamm1 commented Jun 10, 2026

Uh oh!

copy-pr-bot Bot commented Jun 10, 2026

Uh oh!

arhamm1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arhamm1 commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Support boundary

Details

Verification

Uh oh!

copy-pr-bot Bot commented Jun 9, 2026

Uh oh!

greptile-apps Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

arhamm1 commented Jun 10, 2026

Uh oh!

copy-pr-bot Bot commented Jun 10, 2026

Uh oh!

arhamm1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arhamm1 commented Jun 9, 2026 •

edited

Loading

greptile-apps Bot commented Jun 9, 2026 •

edited

Loading