Skill Being Reviewed
Skill name: model-supply-chain
Skill path: skills/ai-security/model-supply-chain/
False Positive Analysis
Benign code that triggers a false positive:
```yaml
model:
  source: huggingface
  repo_id: org/classifier-v2
  revision: 3b7d8e0f4f0d4cb10f6e1d1a5a33d8c4d5310ad2
  artifact_sha256: 4d7d3b2d5f9b7b1e85f0d2a41f9a9c0cf2f2b2ab71a81f8ed81e3d52be6a8a91
evaluation:
  clean_set:
    dataset_id: org/classifier-clean-v4
    revision: 7f2d1c1d9b31c1e0d9f0e4aa23cc5ed2c31a7711
  backdoor_canaries:
    dataset_id: internal/backdoor-canary-v2
    revision: 2026-06-01
  thresholds:
    clean_accuracy_min: 0.91
    canary_attack_success_max: 0.02
```
Why this is a false positive:
The skill correctly flags unpinned model sources and missing checksums, but a supply-chain review can still over-report a properly pinned model if it does not inspect evaluation evidence. In the benign case above, the model, evaluation dataset, and backdoor-canary suite are all versioned. A finding that says "no model card or provenance documentation available" would be too broad if the repository has machine-readable provenance plus regression evidence but no narrative model card.
The skill should separate missing human-readable documentation from missing safety/evaluation evidence. For production ML, a signed or pinned model plus versioned clean and adversarial regression suites can be stronger evidence than a model card alone.
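To keep the two finding types apart, a reviewer could check the manifest field by field. The sketch below is illustrative only. It assumes the manifest above has already been parsed into a Python dict, treats a 40-hex commit as a pinned model revision, and uses a hypothetical `model_card` key and `classify_manifest` helper that are not part of the skill.

```python
import re

SHA1_RE = re.compile(r"^[0-9a-f]{40}$")
SHA256_RE = re.compile(r"^[0-9a-f]{64}$")


def classify_manifest(manifest: dict) -> dict:
    """Split findings into evidence gaps and documentation gaps (hypothetical helper).

    An evidence gap means the model or evaluation inputs are not bound to
    identifiers; a documentation gap means only human-readable material is missing.
    """
    evidence_gaps = []
    doc_gaps = []

    model = manifest.get("model", {})
    if not SHA1_RE.match(str(model.get("revision", ""))):
        evidence_gaps.append("model revision is not a full commit hash")
    if not SHA256_RE.match(str(model.get("artifact_sha256", ""))):
        evidence_gaps.append("model artifact has no SHA-256 checksum")

    evaluation = manifest.get("evaluation", {})
    for suite in ("clean_set", "backdoor_canaries"):
        entry = evaluation.get(suite)
        # Any non-empty revision counts as pinned here; a stricter review
        # might require a content hash instead of a date or tag.
        if not entry or not entry.get("revision"):
            evidence_gaps.append(f"evaluation suite {suite!r} is missing or unpinned")
    if not evaluation.get("thresholds"):
        evidence_gaps.append("no release thresholds recorded")

    # A missing narrative model card is reported on its own and is never
    # escalated to an evidence gap.
    if not manifest.get("model_card"):
        doc_gaps.append("no human-readable model card")

    return {"evidence_gaps": evidence_gaps, "documentation_gaps": doc_gaps}
```

On the benign manifest, this yields no evidence gaps and a single documentation gap, which fits a documentation finding rather than a provenance failure.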
Coverage Gaps
Missed variant 1: model artifact is pinned, but evaluation set floats
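```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification

# Model weights are pinned to an exact commit.
model = AutoModelForSequenceClassification.from_pretrained(
    "org/classifier-v2",
    revision="3b7d8e0f4f0d4cb10f6e1d1a5a33d8c4d5310ad2",
)
# No revision=: the evaluation split resolves to whatever the hub serves today.
eval_data = load_dataset("org/classifier-eval", split="test")
# Project-specific helper (not the Hugging Face `evaluate` library).
metrics = evaluate(model, eval_data)
```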
Why it should be caught:
The model is pinned, but the evaluation dataset is not. A compromised or updated test set can hide regressions, backdoor behavior, safety-policy drift, or data leakage. The current skill focuses on model and training data provenance, but it does not require the reviewer to bind evaluation datasets, thresholds, and regression results to the released model artifact.
Missed variant 2: no trigger/canary regression gate for backdoor behavior
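```python
# load_model, run_clean_eval, and promote_to_production are project-specific helpers.
def release_candidate(model_uri: str) -> None:
    model = load_model(model_uri)
    clean_accuracy = run_clean_eval(model)
    # Promotion is gated on aggregate clean accuracy only; no canary,
    # slice, or trigger-regression check runs before release.
    if clean_accuracy >= 0.90:
        promote_to_production(model_uri)
```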
Why it should be caught:
Backdoored models can preserve clean benchmark performance while misbehaving on trigger inputs or targeted slices. A release gate that only checks aggregate clean accuracy can pass a poisoned model. The review should require targeted canary, slice, or trigger-regression evidence for relevant model classes, especially when third-party pre-trained weights or adapters are used.
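A release gate that also checks a trigger canary suite might look like the following. It is a sketch that assumes a classifier, a canary set of trigger-bearing inputs drawn from non-target classes, and a known attacker target label; `GateThresholds`, `attack_success_rate`, and `release_decision` are hypothetical names, and the threshold fields mirror the manifest in the false-positive example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GateThresholds:
    clean_accuracy_min: float
    canary_attack_success_max: float


def attack_success_rate(canary_preds: Sequence[int], attacker_target: int) -> float:
    """Fraction of trigger-bearing canaries classified as the attacker's target.

    Assumes every canary input was drawn from a non-target class, so any
    prediction equal to attacker_target counts as a successful trigger.
    """
    if not canary_preds:
        raise ValueError("empty canary set: result is Not Evaluable, not a pass")
    hits = sum(1 for p in canary_preds if p == attacker_target)
    return hits / len(canary_preds)


def release_decision(clean_accuracy: float, canary_preds: Sequence[int],
                     attacker_target: int, t: GateThresholds) -> str:
    """Return "release" only if both the clean gate and the canary gate pass."""
    if clean_accuracy < t.clean_accuracy_min:
        return "block"
    if attack_success_rate(canary_preds, attacker_target) > t.canary_attack_success_max:
        return "block"
    return "release"
```

Raising on an empty canary set, rather than returning 0.0, keeps a missing suite from silently passing the gate.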
Edge Cases
- Evaluation data may be private or sensitive. The review should accept hash, dataset version, owner, and result evidence without requiring raw sample disclosure.
- Some models have no meaningful "trigger" concept. The report should support Not Applicable with rationale rather than force generic tests.
- Benchmark thresholds can be gamed if the pass/fail gate is owned by the same training job that produces the model. Require independent evaluation job identity or reviewer evidence for high-risk models.
- Fine-tuned models can pass base-model canaries but fail on domain-specific slices. The evaluation matrix should cover base behavior and fine-tune domain behavior separately.
- Regression reports can be stale. Require model artifact ID, evaluation dataset ID, run ID, timestamp, and environment/version binding.
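The binding, staleness, and independence requirements can be expressed as a check over a single evaluation record. The record fields and the `check_eval_record` helper below are hypothetical; `environment` stands in for whatever identifies the execution environment (for example a container image digest), and timestamps are assumed to be timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class EvalRecord:
    model_artifact_sha256: str
    dataset_id: str
    dataset_revision: str
    run_id: str
    evaluator_identity: str
    timestamp: datetime  # must be timezone-aware
    environment: str     # e.g. image digest or lockfile hash (assumption)


def check_eval_record(record: EvalRecord, candidate_sha256: str,
                      expected_dataset_id: str, expected_dataset_revision: str,
                      training_identity: str, max_age: timedelta,
                      now: Optional[datetime] = None) -> List[str]:
    """Return the problems that make this result unusable as release evidence."""
    if now is None:
        now = datetime.now(timezone.utc)
    problems = []
    if record.model_artifact_sha256 != candidate_sha256:
        problems.append("result is bound to a different model artifact")
    if (record.dataset_id, record.dataset_revision) != (expected_dataset_id, expected_dataset_revision):
        problems.append("evaluation dataset does not match the pinned dataset")
    if now - record.timestamp > max_age:
        problems.append("evaluation result is stale")
    # Independence: the evaluator should not be the job that trained the model.
    if record.evaluator_identity == training_identity:
        problems.append("evaluator is the same identity as the training job")
    if not record.run_id or not record.environment:
        problems.append("missing run ID or environment binding")
    return problems
```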
Remediation Quality
Add an evaluation integrity and backdoor-regression subsection to the model supply-chain report. For each production or release-candidate model, require:
- model artifact ID, revision, and checksum;
- evaluation dataset ID, revision, checksum, or immutable snapshot;
- clean benchmark thresholds and actual run results;
- targeted slice, canary, or trigger-regression tests where applicable;
- run ID, evaluator identity, timestamp, and execution environment;
- decision outcome: release, block, monitor, or Not Evaluable.
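For the checksum or immutable-snapshot item, one option is a deterministic digest over an evaluation dataset directory. This sketch assumes the evaluation split is materialized as local files; `snapshot_sha256` is a hypothetical helper, and the digest layout is an illustrative choice rather than a standard format.

```python
import hashlib
from pathlib import Path


def snapshot_sha256(root: str) -> str:
    """Deterministic SHA-256 over every file under root.

    Each file contributes its relative POSIX path, a NUL separator, and its
    own SHA-256, in sorted path order, so adding, removing, renaming, or
    editing any file changes the digest.
    """
    base = Path(root)
    files = sorted(
        (p.relative_to(base).as_posix(), p) for p in base.rglob("*") if p.is_file()
    )
    digest = hashlib.sha256()
    for rel, path in files:
        file_hash = hashlib.sha256()
        with path.open("rb") as fh:
            # Stream in 1 MiB chunks so large shards need not fit in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                file_hash.update(chunk)
        digest.update(f"{rel}\0{file_hash.hexdigest()}\n".encode("utf-8"))
    return digest.hexdigest()
```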
Severity should depend on model risk:
- High: production third-party or fine-tuned model has pinned weights but floating evaluation data, no release-result binding, or no targeted canary/slice coverage for a known backdoor risk class.
- Medium: evaluation evidence exists but lacks immutable dataset/run identity.
- Low: documentation gaps where model and evaluation evidence are technically bound.
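The rubric can be encoded as a small decision function so that severity assignment stays consistent across reviews. The `EvalFacts` fields are hypothetical booleans a reviewer would fill in; `None` means none of the three rubric rows applies and the case is left to reviewer judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalFacts:
    """Hypothetical review facts for one model; field names are illustrative."""
    high_risk_deployment: bool      # production third-party or fine-tuned model
    weights_pinned: bool
    eval_data_pinned: bool
    results_bound_to_release: bool
    known_backdoor_risk_class: bool
    has_targeted_tests: bool        # canary, slice, or trigger regression
    evidence_exists: bool
    immutable_dataset_and_run_id: bool
    documentation_complete: bool


def severity(f: EvalFacts) -> Optional[str]:
    """Map review facts to the High/Medium/Low rubric."""
    if f.high_risk_deployment and (
        (f.weights_pinned and not f.eval_data_pinned)
        or not f.results_bound_to_release
        or (f.known_backdoor_risk_class and not f.has_targeted_tests)
    ):
        return "High"
    if f.evidence_exists and not f.immutable_dataset_and_run_id:
        return "Medium"
    # Low applies only once model and evaluation evidence are technically bound.
    if (not f.documentation_complete and f.eval_data_pinned
            and f.results_bound_to_release):
        return "Low"
    return None
```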
Comparison to Other Tools
| Tool | Catches this? | Notes |
| --- | --- | --- |
| Semgrep | Partial | Can detect unpinned dataset loads or missing `revision=` in known frameworks, but cannot prove evaluation adequacy alone. |
| CodeQL | Partial | Can model some data flows, but ML evaluation provenance needs project-specific queries. |
| MLflow / experiment tracking | Partial | Can bind run IDs and metrics if configured, but the security review must check whether the tracked datasets and thresholds are immutable. |
| Manual model release review | Yes | A release review can inspect model identity, evaluation identity, canary coverage, and approval evidence together. |
Overall Assessment
Strengths:
- Strong coverage of model source, checksums, serialization risk, training data lineage, fine-tuning pipeline controls, and inference dependencies.
- Correctly calls out unpinned `from_pretrained()` calls and unsafe deserialization paths.
- Good mapping to OWASP LLM03, SLSA, and MITRE ATLAS supply-chain concerns.
Needs improvement:
- The skill does not require evaluation dataset provenance or result-to-artifact binding.
- It mentions backdoor detection at a high level, but it does not require canary/slice regression evidence before model promotion.
- A model card can be treated as a substitute for release-quality evaluation evidence, which can create both false positives and false negatives.
Priority recommendations:
- Add an evaluation integrity matrix to the report format.
- Require immutable evaluation dataset identity and model-artifact-to-result binding.
- Require targeted canary or trigger-regression evidence for applicable models and mark unsupported cases as Not Evaluable.
- Add search hints for `evaluate`, `eval_dataset`, `load_dataset`, `mlflow`, `wandb`, `canary`, `backdoor`, `trigger`, and `benchmark`.
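The search hints can be compiled into a single pattern for a first-pass scan. This is a sketch; `hint_hits` is a hypothetical helper, and the letter-only boundaries are an assumption chosen so that `canary_attack_success_max` still matches `canary` while `retrigger` does not match `trigger`.

```python
import re
from typing import Iterator, Tuple

HINTS = ("evaluate", "eval_dataset", "load_dataset", "mlflow", "wandb",
         "canary", "backdoor", "trigger", "benchmark")
# Reject matches that sit inside a longer alphabetic word, but allow
# underscores and punctuation on either side.
HINT_RE = re.compile(
    r"(?<![A-Za-z])(" + "|".join(map(re.escape, HINTS)) + r")(?![A-Za-z])",
    re.IGNORECASE,
)


def hint_hits(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, hint) for every search-hint match in text."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in HINT_RE.finditer(line):
            yield lineno, m.group(1).lower()
```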
Official references used:
Bounty Info
samik4184@gmail.com