[REVIEW] model-supply-chain: add evaluation-set and backdoor-regression evidence gates

## Skill Being Reviewed
**Skill name:** `model-supply-chain`
**Skill path:** `skills/ai-security/model-supply-chain/`

## False Positive Analysis

**Benign code that triggers a false positive:**
```yaml
model:
  source: huggingface
  repo_id: org/classifier-v2
  revision: 3b7d8e0f4f0d4cb10f6e1d1a5a33d8c4d5310ad2
  artifact_sha256: 4d7d3b2d5f9b7b1e85f0d2a41f9a9c0cf2f2b2ab71a81f8ed81e3d52be6a8a91

evaluation:
  clean_set:
    dataset_id: org/classifier-clean-v4
    revision: 7f2d1c1d9b31c1e0d9f0e4aa23cc5ed2c31a7711
  backdoor_canaries:
    dataset_id: internal/backdoor-canary-v2
    revision: 2026-06-01
  thresholds:
    clean_accuracy_min: 0.91
    canary_attack_success_max: 0.02
```

**Why this is a false positive:**

The skill correctly flags unpinned model sources and missing checksums, but a supply-chain review can still over-report a properly pinned model if it does not inspect evaluation evidence. In the benign case above, the model, evaluation dataset, and backdoor-canary suite are all versioned. A finding that says "no model card or provenance documentation available" would be too broad if the repository has machine-readable provenance plus regression evidence but no narrative model card.

The skill should separate missing human-readable documentation from missing safety/evaluation evidence. For production ML, a signed or pinned model plus versioned clean and adversarial regression suites can be stronger evidence than a model card alone.

## Coverage Gaps

**Missed variant 1: model artifact is pinned, but evaluation set floats**
```python
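from datasets import load_dataset
from transformers import AutoModelForSequenceClassification

# Model weights are pinned to an immutable commit hash.
model = AutoModelForSequenceClassification.from_pretrained(
    "org/classifier-v2",
    revision="3b7d8e0f4f0d4cb10f6e1d1a5a33d8c4d5310ad2",
)

# Evaluation data is NOT pinned: no `revision=`, so the test set can change silently.
eval_data = load_dataset("org/classifier-eval", split="test")
metrics = evaluate(model, eval_data)  # `evaluate` is a project helper (not shown)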

**Why it should be caught:**

The model is pinned, but the evaluation dataset is not. A compromised or updated test set can hide regressions, backdoor behavior, safety-policy drift, or data leakage. The current skill focuses on model and training data provenance, but it does not require the reviewer to bind evaluation datasets, thresholds, and regression results to the released model artifact.
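
A reviewer could approximate this check with a small static scan. The sketch below is illustrative only: `find_unpinned_loads` and `LOADER_NAMES` are hypothetical names, and the scan only tests for the presence of a `revision=` keyword, so a floating value such as `revision="main"` would still pass.

```python
import ast

# Hypothetical set of loader names worth checking; extend per project.
LOADER_NAMES = {"load_dataset", "from_pretrained"}


def find_unpinned_loads(source: str) -> list[tuple[int, str]]:
    """Return (line, callee) for loader calls that pass no `revision=` keyword."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `load_dataset(...)` and `Model.from_pretrained(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        has_revision = any(kw.arg == "revision" for kw in node.keywords)
        if name in LOADER_NAMES and not has_revision:
            findings.append((node.lineno, name))
    return sorted(findings)


SNIPPET = '''
model = AutoModelForSequenceClassification.from_pretrained(
    "org/classifier-v2",
    revision="3b7d8e0f4f0d4cb10f6e1d1a5a33d8c4d5310ad2",
)
eval_data = load_dataset("org/classifier-eval", split="test")
'''

findings = find_unpinned_loads(SNIPPET)
print(findings)
```

On the snippet from this variant, the pinned `from_pretrained()` call is accepted and only the `load_dataset()` call is reported.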

**Missed variant 2: no trigger/canary regression gate for backdoor behavior**
```python
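def release_candidate(model_uri: str) -> None:
    # `load_model`, `run_clean_eval`, and `promote_to_production` are project helpers (not shown).
    model = load_model(model_uri)
    clean_accuracy = run_clean_eval(model)
    # Only aggregate clean accuracy is checked; there is no canary or trigger-regression gate.
    if clean_accuracy >= 0.90:
        promote_to_production(model_uri)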

**Why it should be caught:**

Backdoored models can preserve clean benchmark performance while misbehaving on trigger inputs or targeted slices. A release gate that only checks aggregate clean accuracy can pass a poisoned model. The review should require targeted canary, slice, or trigger-regression evidence for relevant model classes, especially when third-party pre-trained weights or adapters are used.
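
For contrast, a gate that also consults canary evidence might look like the following. This is a minimal sketch under assumptions: `GateThresholds` and `release_decision` are hypothetical names, the default thresholds are copied from the benign YAML example above, and missing canary evidence blocks the release unless a Not Applicable rationale is recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateThresholds:
    # Defaults mirror the thresholds in the benign config example.
    clean_accuracy_min: float = 0.91
    canary_attack_success_max: float = 0.02


def release_decision(
    clean_accuracy: float,
    canary_attack_success: Optional[float],
    thresholds: GateThresholds = GateThresholds(),
    not_applicable_rationale: Optional[str] = None,
) -> str:
    """Return "release" or "block" from clean and canary evidence."""
    if clean_accuracy < thresholds.clean_accuracy_min:
        return "block"
    if canary_attack_success is None:
        # No canary result: fail closed unless a rationale explains why
        # trigger testing does not apply to this model class.
        return "release" if not_applicable_rationale else "block"
    if canary_attack_success > thresholds.canary_attack_success_max:
        return "block"
    return "release"


# A model with strong clean accuracy but a high canary attack success rate.
print(release_decision(clean_accuracy=0.95, canary_attack_success=0.40))
```

The design choice here is to fail closed: a model with strong clean accuracy is still blocked when canary results are absent or exceed the threshold.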

## Edge Cases

- Evaluation data may be private or sensitive. The review should accept hash, dataset version, owner, and result evidence without requiring raw sample disclosure.
- Some models have no meaningful "trigger" concept. The report should support Not Applicable with rationale rather than force generic tests.
- Benchmark thresholds can be gamed if the pass/fail gate is owned by the same training job that produces the model. Require independent evaluation job identity or reviewer evidence for high-risk models.
- Fine-tuned models can pass base-model canaries but fail on domain-specific slices. The evaluation matrix should cover base behavior and fine-tune domain behavior separately.
- Regression reports can be stale. Require model artifact ID, evaluation dataset ID, run ID, timestamp, and environment/version binding.
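
The binding, staleness, and independence cases above can be sketched as a single check. The field names, the `report_problems` helper, and the 30-day freshness window are assumptions made for illustration, not part of the skill.

```python
from datetime import datetime, timedelta, timezone


def report_problems(
    report: dict,
    released_sha256: str,
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed freshness window
) -> list[str]:
    """List reasons a regression report cannot back the released artifact."""
    problems = []
    if report.get("artifact_sha256") != released_sha256:
        problems.append("report is not bound to the released artifact")
    if now - report["timestamp"] > max_age:
        problems.append("report is stale")
    # Fails closed: two missing identities compare equal and are flagged.
    if report.get("evaluator_id") == report.get("training_job_id"):
        problems.append("evaluator is not independent of the training job")
    return problems


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
fresh = {
    "artifact_sha256": "abc",
    "timestamp": datetime(2026, 6, 20, tzinfo=timezone.utc),
    "evaluator_id": "eval-job-7",
    "training_job_id": "train-job-3",
}
stale = dict(
    fresh,
    timestamp=datetime(2026, 1, 1, tzinfo=timezone.utc),
    evaluator_id="train-job-3",
)

print(report_problems(fresh, "abc", now))
print(report_problems(stale, "def", now))
```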

## Remediation Quality

- [x] Fix resolves the vulnerability
- [x] Fix doesn't introduce new security issues
- [x] Fix doesn't break functionality
- **Issues found:**

Add an evaluation integrity and backdoor-regression subsection to the model supply-chain report. For each production or release-candidate model, require:

- model artifact ID, revision, and checksum;
- evaluation dataset ID plus a revision, checksum, or immutable snapshot reference;
- clean benchmark thresholds and actual run results;
- targeted slice, canary, or trigger-regression tests where applicable;
- run ID, evaluator identity, timestamp, and execution environment;
- decision outcome: release, block, monitor, or Not Evaluable.
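
One way to make this list machine-checkable is a completeness check over an evidence record. The field names, `REQUIRED_FIELDS`, and `missing_evidence` below are hypothetical; a real report format would define its own schema.

```python
# Hypothetical field names, one group per required item in the list above.
REQUIRED_FIELDS = (
    "model_artifact_id", "model_revision", "model_sha256",
    "eval_dataset_id", "eval_dataset_pin",
    "clean_thresholds", "clean_results",
    "targeted_tests",  # results, or a Not Applicable rationale
    "run_id", "evaluator_id", "timestamp", "environment",
    "decision",
)
ALLOWED_DECISIONS = {"release", "block", "monitor", "not_evaluable"}


def missing_evidence(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [], {})]
    if "decision" not in missing and record["decision"] not in ALLOWED_DECISIONS:
        missing.append("decision (unrecognized value)")
    return missing


complete = {name: "present" for name in REQUIRED_FIELDS}
complete["decision"] = "release"

# Same record, but the evaluation dataset is not pinned.
floating_eval = dict(complete, eval_dataset_pin=None)

print(missing_evidence(complete))
print(missing_evidence(floating_eval))
```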

Severity should depend on model risk:

- High: production third-party or fine-tuned model has pinned weights but floating evaluation data, no release-result binding, or no targeted canary/slice coverage for a known backdoor risk class.
- Medium: evaluation evidence exists but lacks immutable dataset/run identity.
- Low: documentation gaps where model and evaluation evidence are technically bound.
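
The severity tiers could be encoded roughly as follows. This is one possible reading of the tiers, not a definitive mapping: the `Evidence` fields and the `severity` function are hypothetical, and a model outside the high-risk class with unbound evidence is treated as Medium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    high_risk_model: bool         # production third-party or fine-tuned model
    eval_data_pinned: bool        # evaluation dataset has an immutable identity
    results_bound: bool           # results reference the released artifact
    canary_covered: bool          # targeted tests exist, or are Not Applicable
    run_identity_immutable: bool  # run ID and environment are recorded immutably
    has_model_card: bool          # narrative documentation exists


def severity(e: Evidence) -> Optional[str]:
    """Map evidence gaps to "high", "medium", "low", or None (no finding)."""
    technically_bound = e.eval_data_pinned and e.results_bound and e.canary_covered
    if e.high_risk_model and not technically_bound:
        return "high"
    if not (technically_bound and e.run_identity_immutable):
        return "medium"
    if not e.has_model_card:
        return "low"
    return None


# The benign case from the false positive analysis: everything is bound,
# but there is no narrative model card.
benign = Evidence(True, True, True, True, True, has_model_card=False)
print(severity(benign))
```

Under this reading, the benign configuration from the false positive analysis yields a Low documentation finding rather than a broad provenance finding.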

## Comparison to Other Tools

| Tool | Catches this? | Notes |
|------|:---:|-------|
| Semgrep | Partial | Can detect unpinned dataset loads or missing `revision=` in known frameworks, but cannot by itself prove evaluation adequacy. |
| CodeQL | Partial | Can model some data flows, but ML evaluation provenance needs project-specific queries. |
| MLflow / experiment tracking | Partial | Can bind run IDs and metrics if configured, but the security review must check whether the tracked datasets and thresholds are immutable. |
| Manual model release review | Yes | A release review can inspect model identity, evaluation identity, canary coverage, and approval evidence together. |

## Overall Assessment

**Strengths:**

- Strong coverage of model source, checksums, serialization risk, training data lineage, fine-tuning pipeline controls, and inference dependencies.
- Correctly calls out unpinned `from_pretrained()` and unsafe deserialization paths.
- Good mapping to OWASP LLM03, SLSA, and MITRE ATLAS supply-chain concerns.

**Needs improvement:**

- The skill does not require evaluation dataset provenance or result-to-artifact binding.
- It mentions backdoor detection at a high level, but it does not require canary/slice regression evidence before model promotion.
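- The skill treats a model card as interchangeable with release-quality evaluation evidence. A missing card is flagged even when machine-readable evidence is bound (false positive), and a present card can mask missing evaluation evidence (false negative).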

**Priority recommendations:**
1. Add an evaluation integrity matrix to the report format.
2. Require immutable evaluation dataset identity and model-artifact-to-result binding.
3. Require targeted canary or trigger-regression evidence for applicable models. Mark models with no meaningful trigger concept as Not Applicable with rationale, and cases that cannot be evaluated as Not Evaluable.
4. Add search hints for `evaluate`, `eval_dataset`, `load_dataset`, `mlflow`, `wandb`, `canary`, `backdoor`, `trigger`, and `benchmark`.

**Official references used:**
- https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10
- https://owasp.org/www-project-top-10-for-large-language-model-applications/

## Bounty Info
- [x] I have read and agree to the [CONTRIBUTING.md](../../CONTRIBUTING.md) bounty terms
- **Preferred payment method:** PayPal `samik4184@gmail.com`

