Add LLM judge filter stages #2060
Signed-off-by: Arham Mehta <arhamm@nvidia.com>
Greptile Summary
This PR adds three new
Confidence Score: 4/5
Safe to merge after fixing the empty-batch path in base.py; all other logic is correct and well-tested. The empty-batch guard at line 479 of nemo_curator/stages/text/llm_judge/base.py returns early without adding the output columns; it needs to either be removed or extended to call _write_results before returning.
Sequence Diagram
sequenceDiagram
participant P as Pipeline
participant S as LLMJudgeStage.process()
participant B as build_messages()
participant C as LLMClient
participant R as parse_response()
participant W as _write_results()
P->>S: DocumentBatch
S->>S: batch.to_pandas() → df
alt df.empty (bug: skips W)
S-->>P: original batch (no output columns)
else rows present
loop per row (sync or async)
S->>B: row dict
alt empty input / no condition
B-->>S: None → no_call_result()
else messages built
B-->>S: messages
S->>C: query_model(messages, model)
C-->>S: raw_response
S->>R: raw_response, row, messages
R-->>S: LLMJudgeResult(keep, score, …)
end
end
S->>W: df, results[]
W-->>S: df with keep_field, score_field, …
S-->>P: "new DocumentBatch (filtered if filter=True)"
end
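The empty-batch concern flagged in the review can be illustrated with a small, self-contained sketch. It is not the code in base.py: `Batch`, `JudgeResult`, `OUTPUT_COLUMNS`, `write_results`, `process`, and `fake_judge` are hypothetical stand-ins, and a plain list of dicts replaces the pandas DataFrame. The point is only that routing an empty batch through the same write step keeps the output schema identical for empty and non-empty inputs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative output column names; the real stage uses configurable fields.
OUTPUT_COLUMNS = ("keep", "score")


@dataclass
class JudgeResult:
    """Hypothetical stand-in for the PR's LLMJudgeResult."""
    keep: bool
    score: Optional[float] = None


@dataclass
class Batch:
    """Hypothetical stand-in for a DocumentBatch: a schema plus row dicts."""
    columns: list
    rows: list = field(default_factory=list)


def write_results(batch: Batch, results: list, filter_rows: bool) -> Batch:
    """Attach judge outputs to every row and always extend the schema."""
    columns = list(batch.columns) + [
        c for c in OUTPUT_COLUMNS if c not in batch.columns
    ]
    rows = [
        {**row, "keep": res.keep, "score": res.score}
        for row, res in zip(batch.rows, results)
    ]
    if filter_rows:
        rows = [row for row in rows if row["keep"]]
    return Batch(columns=columns, rows=rows)


def process(
    batch: Batch,
    judge: Callable[[dict], JudgeResult],
    filter_rows: bool = False,
) -> Batch:
    """Judge each row, then write results; no early return for empty input."""
    # An empty batch yields an empty results list and still reaches
    # write_results, so the output columns are present either way.
    results = [judge(row) for row in batch.rows]
    return write_results(batch, results, filter_rows)


def fake_judge(row: dict) -> JudgeResult:
    """Stands in for build_messages -> query_model -> parse_response."""
    if not row.get("text"):
        # Assumed no-call behaviour for empty input: drop, with no score.
        return JudgeResult(keep=False)
    return JudgeResult(keep=True, score=1.0)
```

With this shape, downstream stages that select the keep or score columns do not need a special case for empty batches.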
/ok to test
@arhamm1, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test e84fb5d
Summary
- Adds LLM-backed text judge stages for analysis, task relevance, and natural-language conditions.
- Reuses the existing LLMClient, AsyncLLMClient, GenerationConfig, ProcessingStage, and DocumentBatch APIs instead of introducing new client or execution plumbing.
- Preserves Curator batch metadata and stage perf, attaches parse/provenance columns, and makes filtering optional via score/result keep fields.
Support boundary
This PR adds pure text ProcessingStage implementations and mocked unit coverage only.
It does not add a new model server, Ray Serve/Dynamo endpoint, Xenna backend, or full-corpus workflow.
Endpoint/model behavior should be validated separately with a bounded sample before publishing docs that imply production support for a specific model/backend.
Details
- LLMAnalysisFilterStage: scores text with a JSON rubric over configurable 1-5 dimension keys, normalizes scores to 0-1, filters against min_score/max_score, and records normalized recommendation/tags, raw response when requested, parse errors, and provenance.
- LLMTaskRelevanceFilterStage: extends analysis scoring with task descriptions plus JSON/JSONL validation examples, supports n_shot limiting, and validates malformed validation context early.
- LLMConditionFilterStage: classifies rows against a natural-language condition with direct, CoT, few-shot, and CoT+few-shot prompt modes, storing both the boolean condition result and the final keep decision so failure policy is visible.
Verification
- Ruff passed locally: uv run --group linting ruff check nemo_curator/stages/text/llm_judge tests/stages/text/llm_judge
- Targeted LLM judge unit tests passed locally: tests/stages/text/llm_judge/test_llm_judge.py
- Targeted LLM client unit tests passed locally: tests/models/client/test_llm_client.py, tests/models/client/test_openai_client.py
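For readers unfamiliar with the scoring described under Details for LLMAnalysisFilterStage, the following is a rough sketch of one way to turn a 1-5 JSON rubric into a 0-1 score and apply min_score/max_score. The PR description does not say how dimensions are combined, so the mean over dimensions, the (x - 1) / 4 scaling, the drop-on-parse-failure policy, and the function names `normalize_scores` and `keep_row` are all assumptions made for illustration.

```python
import json


def normalize_scores(raw_response, dimension_keys):
    """Return (score, parse_error); score is in [0, 1] or None on failure.

    Assumption: each dimension is an integer or float in 1-5, scaled with
    (x - 1) / 4 and averaged. The actual stage may combine them differently.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return None, "expected a JSON object"
    if not dimension_keys:
        return None, "no dimension keys configured"

    scaled = []
    for key in dimension_keys:
        value = payload.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            return None, f"dimension {key!r} missing or outside 1-5"
        scaled.append((value - 1) / 4)  # 1 -> 0.0, 5 -> 1.0
    return sum(scaled) / len(scaled), None


def keep_row(score, min_score=0.0, max_score=1.0):
    """Keep a row only if it has a score inside the inclusive bounds.

    Assumption: rows without a usable score are dropped; a real failure
    policy could just as well keep them.
    """
    return score is not None and min_score <= score <= max_score


# Usage: two rubric dimensions, one perfect and one middling.
score, error = normalize_scores('{"clarity": 5, "accuracy": 3}', ["clarity", "accuracy"])
decision = keep_row(score, min_score=0.5)
```

Returning the parse error alongside the score mirrors the idea of recording parse errors as a column rather than raising, so a single malformed response does not fail the whole batch.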