Add LLM judge filter stages#2060

Open
arhamm1 wants to merge 10 commits into main from codex/llm-judge-filters

arhamm1 commented Jun 9, 2026

Contributor

Summary

Adds LLM-backed text judge stages for analysis, task relevance, and natural-language conditions.

Reuses existing LLMClient, AsyncLLMClient, GenerationConfig, ProcessingStage, and DocumentBatch APIs instead of introducing new client or execution plumbing.

Preserves Curator batch metadata and stage perf, attaches parse/provenance columns, and makes filtering optional via score/result keep fields.

Support boundary

This PR adds pure text ProcessingStage implementations and mocked unit coverage only.

It does not add a new model server, Ray Serve/Dynamo endpoint, Xenna backend, or full-corpus workflow.

Endpoint/model behavior should be validated separately with a bounded sample before publishing docs that imply production support for a specific model/backend.

Details

LLMAnalysisFilterStage: scores text with a JSON rubric over configurable 1-5 dimension keys, normalizes scores to 0-1, filters against min_score/max_score, and records normalized recommendation/tags, raw response when requested, parse errors, and provenance.
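
The scoring path described above can be sketched in a few lines. This is a minimal illustration, not the stage's actual code: it assumes the (avg − 1) / 4 normalization quoted in the review summary, and the function names, the dimension keys, and the inclusive treatment of min_score/max_score are all hypothetical.

```python
from statistics import mean


def normalize_score(scores: dict, low: int = 1, high: int = 5) -> float:
    """Average the per-dimension scores and min-max scale the result to [0, 1]."""
    for key, value in scores.items():
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside the {low}-{high} rubric range")
    return (mean(scores.values()) - low) / (high - low)


def keep_row(score: float, min_score: float = 0.0, max_score: float = 1.0) -> bool:
    """Assumption: both thresholds are inclusive."""
    return min_score <= score <= max_score


rubric = {"clarity": 4, "accuracy": 5, "depth": 3}  # hypothetical dimension keys
score = normalize_score(rubric)
print(score, keep_row(score, min_score=0.6))  # 0.75 True
```

With a 1-5 rubric, an all-ones response maps to 0.0 and an all-fives response maps to 1.0, so thresholds can be expressed independently of the rubric scale.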

LLMTaskRelevanceFilterStage: extends analysis scoring with task descriptions plus JSON/JSONL validation examples, supports n_shot limiting, and validates malformed validation context early.
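
As a rough sketch of what early validation with n_shot limiting could look like, the helper below parses either a JSON array or JSONL text and fails fast on malformed input. The helper name, the required "text" key, and the choice to keep the first n_shot examples are assumptions for illustration; the PR's actual schema and selection rule are not shown on this page.

```python
import json
from typing import Optional


def load_validation_examples(text: str, n_shot: Optional[int] = None) -> list:
    """Parse JSON-array or JSONL text, validate each example, and cap the count at n_shot."""
    stripped = text.strip()
    if stripped.startswith("["):
        examples = json.loads(stripped)  # JSON array of objects
    else:
        # JSONL: one JSON object per non-blank line
        examples = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    for index, example in enumerate(examples):
        # Hypothetical schema check: every example must be an object with a "text" key
        if not isinstance(example, dict) or "text" not in example:
            raise ValueError(f"validation example {index} must be an object with a 'text' key")
    if n_shot is not None:
        if n_shot < 0:
            raise ValueError("n_shot must be non-negative")
        examples = examples[:n_shot]
    return examples


jsonl = '{"text": "a"}\n{"text": "b"}\n{"text": "c"}'
print(len(load_validation_examples(jsonl, n_shot=2)))  # 2
```

Because json.JSONDecodeError subclasses ValueError, a caller can catch a single exception type for both syntax errors and schema errors.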

LLMConditionFilterStage: classifies rows against a natural-language condition with direct, CoT, few-shot, and CoT+few-shot prompt modes, storing both the boolean condition result and the final keep decision so failure policy is visible.
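
To illustrate why the condition result and the keep decision are stored separately, here is a small sketch under stated assumptions: the verdict is taken from the last non-empty line of the response (as a CoT-style prompt might produce), and a hypothetical keep_on_failure flag stands in for the failure policy. None of these names come from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConditionOutcome:
    condition_result: Optional[bool]  # None when the response could not be parsed
    keep: bool                        # final decision after applying the failure policy


def parse_condition(raw_response: str) -> Optional[bool]:
    """Read a yes/no verdict from the last non-empty line of the response."""
    lines = [line.strip().lower() for line in raw_response.splitlines() if line.strip()]
    if not lines:
        return None
    verdict = lines[-1].strip(" .!*:")
    if verdict in ("yes", "true"):
        return True
    if verdict in ("no", "false"):
        return False
    return None


def decide(raw_response: str, keep_on_failure: bool = False) -> ConditionOutcome:
    result = parse_condition(raw_response)
    keep = keep_on_failure if result is None else result
    return ConditionOutcome(condition_result=result, keep=keep)


print(decide("The text discusses GPUs.\nYes"))
print(decide("maybe", keep_on_failure=True))
```

In the second call the row is kept even though no condition result was parsed, which is exactly the situation the two separate fields make visible.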

Verification

Ruff passed locally: uv run --group linting ruff check nemo_curator/stages/text/llm_judge tests/stages/text/llm_judge

Targeted LLM judge unit tests passed locally: tests/stages/text/llm_judge/test_llm_judge.py

Targeted LLM client unit tests passed locally: tests/models/client/test_llm_client.py, tests/models/client/test_openai_client.py

arhamm1 added 8 commits June 9, 2026 15:09
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
arhamm1 requested a review from a team as a code owner June 9, 2026 22:13
arhamm1 requested review from VibhuJawa and removed request for a team June 9, 2026 22:13
copy-pr-bot Bot commented Jun 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR adds three new ProcessingStage implementations backed by an existing LLMClient/AsyncLLMClient: LLMAnalysisFilterStage (rubric scoring), LLMTaskRelevanceFilterStage (task-aligned scoring with validation examples), and LLMConditionFilterStage (yes/no natural-language classification). All three reuse existing Curator plumbing for batches, generation config, and provenance tracking.

  • LLMAnalysisFilterStage scores dimensions 1–5 and normalizes with a correct min-max formula (avg − 1) / 4 to [0.0, 1.0]; LLMTaskRelevanceFilterStage extends it with cached validation context built once in __post_init__.
  • LLMConditionFilterStage supports direct, CoT, few-shot, and CoT+few-shot prompt strategies and separates the condition result from the final keep decision to make failure policy visible.
  • One defect: the empty-batch early return in base.py skips _write_results, so the output batch lacks the declared output columns (keep_field, score_field, etc.), breaking any downstream stage that reads them.

Confidence Score: 4/5

Safe to merge after fixing the empty-batch path in base.py; all other logic is correct and well-tested.

The empty-batch early return in base.py exits before _write_results is called, so the returned DocumentBatch never receives the output columns declared in outputs(). Any downstream stage that reads keep_field or score_field on a batch that started empty will raise KeyError. Everything else — normalization, async dispatch, condition parsing, validation-context caching — is correct and covered by tests.

nemo_curator/stages/text/llm_judge/base.py — the empty-batch guard at line 479 needs to either be removed or extended to call _write_results before returning.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/text/llm_judge/base.py | Core judge stage; empty-batch early return skips output-column initialization, leaving downstream consumers without expected columns. |
| nemo_curator/stages/text/llm_judge/analysis.py | Score normalization correctly uses min-max formula (average − 1) / 4 giving range [0.0, 1.0]; validation logic and parse path look correct. |
| nemo_curator/stages/text/llm_judge/condition.py | Condition stage correctly separates keep decision from condition result and handles all prompt strategies; no logic issues found. |
| nemo_curator/stages/text/llm_judge/task_relevance.py | Validation context is correctly computed once in __post_init__ and cached via _validation_context; n_shot, path loading, and example validation all look correct. |
| nemo_curator/stages/text/llm_judge/_utils.py | JSON extraction helpers correctly handle string escaping and nested braces; normalization and coercion utilities are straightforward and safe. |
| tests/stages/text/llm_judge/test_llm_judge.py | Good mock coverage for sync/async clients, score normalization, parse failures, and condition strategies; validates min-max normalization and caching. |
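
For readers unfamiliar with the kind of JSON extraction mentioned for _utils.py, the following is an independent sketch of one way to pull the first JSON object out of a chatty model response while respecting string escapes and nested braces. It is not the PR's helper; the function name and behavior on failure (returning None) are assumptions.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first balanced {...} span that parses as JSON, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for index in range(start, len(text)):
            char = text[index]
            if in_string:
                # Inside a string literal, braces are ignored and backslashes escape
                if escaped:
                    escaped = False
                elif char == "\\":
                    escaped = True
                elif char == '"':
                    in_string = False
            elif char == '"':
                in_string = True
            elif char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:index + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        start = text.find("{", start + 1)
    return None


reply = 'Sure! {"scores": {"clarity": 4}, "note": "a } and \\"quote\\""} Done.'
print(extract_first_json_object(reply))
```

Tracking the in-string state is what keeps the closing brace inside the "note" value from ending the object early.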

Sequence Diagram

sequenceDiagram
    participant P as Pipeline
    participant S as LLMJudgeStage.process()
    participant B as build_messages()
    participant C as LLMClient
    participant R as parse_response()
    participant W as _write_results()

    P->>S: DocumentBatch
    S->>S: batch.to_pandas() → df
    alt df.empty (bug: skips W)
        S-->>P: original batch (no output columns)
    else rows present
        loop per row (sync or async)
            S->>B: row dict
            alt empty input / no condition
                B-->>S: None → no_call_result()
            else messages built
                B-->>S: messages
                S->>C: query_model(messages, model)
                C-->>S: raw_response
                S->>R: raw_response, row, messages
                R-->>S: LLMJudgeResult(keep, score, …)
            end
        end
        S->>W: df, results[]
        W-->>S: df with keep_field, score_field, …
        S-->>P: "new DocumentBatch (filtered if filter=True)"
    end

Comments Outside Diff (1)

  1. nemo_curator/stages/text/llm_judge/base.py, lines 479-492

    P1 Empty batch silently omits expected output columns

    When df.empty is true, the stage returns the original batch before _write_results is called, so none of the declared output columns (keep_field, score_field, record_field, etc.) are ever added to the DataFrame. Any downstream stage that accesses df["llm_analysis_keep"] (or similar) on a previously empty batch will raise KeyError. The all-filtered-out case immediately below is not affected, because _write_results has already been called by that point.

    The simplest fix is to drop the early-return guard entirely and let _process_sync([]) / _process_async([]) produce empty result lists — _write_results(df, []) will then assign empty columns with the correct names, giving the output batch the expected schema even when it carries zero rows.
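
The schema difference can be shown with a small stand-alone sketch. It models the DataFrame as a plain dict of column lists rather than using pandas, and the column names, judge stand-in, and function names are hypothetical; it only illustrates the control flow described in this comment.

```python
OUTPUT_COLUMNS = ("keep", "score")  # hypothetical stand-ins for keep_field / score_field


def judge(text: str) -> dict:
    """Stand-in for the LLM call: keeps any non-blank text."""
    kept = bool(text.strip())
    return {"keep": kept, "score": 1.0 if kept else 0.0}


def write_results(columns: dict, results: list) -> dict:
    """Return a copy of the table with one output column per declared name."""
    updated = dict(columns)
    for name in OUTPUT_COLUMNS:
        updated[name] = [result[name] for result in results]
    return updated


def process_with_guard(columns: dict) -> dict:
    if not columns["text"]:
        return columns  # early return: output columns are never added
    return write_results(columns, [judge(text) for text in columns["text"]])


def process_without_guard(columns: dict) -> dict:
    # An empty input yields an empty results list, and the columns are still created
    return write_results(columns, [judge(text) for text in columns["text"]])


empty = {"text": []}
print(sorted(process_with_guard(empty)))     # ['text']
print(sorted(process_without_guard(empty)))  # ['keep', 'score', 'text']
```

Without the guard, a zero-row batch still carries every declared output column, so downstream lookups by column name succeed.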

Reviews (3): Last reviewed commit: "Merge branch 'main' into codex/llm-judge..."

Signed-off-by: Arham Mehta <arhamm@nvidia.com>
arhamm1 commented Jun 10, 2026

Contributor Author

/ok to test

copy-pr-bot Bot commented Jun 10, 2026

/ok to test

@arhamm1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

arhamm1 commented Jun 10, 2026

Contributor Author

/ok to test e84fb5d
