From ca08151d44448328eed89c8889a0dd6e29ce2754 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 10:44:48 -0700
Subject: [PATCH 01/24] Minor D4D rubric 10 semantic edits

---
 .claude/agents/d4d-rubric10-semantic.md | 26 ++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md
index 2a5b0dac..b36bd73b 100644
--- a/.claude/agents/d4d-rubric10-semantic.md
+++ b/.claude/agents/d4d-rubric10-semantic.md
@@ -29,24 +29,24 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 ### Scoring Standards

 A sub-element scores **1** (present/pass) ONLY if:
-- ✅ The field exists in the D4D file AND is non-empty
-- ✅ Contains **meaningful, non-trivial content** (not just boilerplate)
-- ✅ Provides **actionable information** to dataset users
-- ✅ Is **complete enough** to support the sub-element's stated purpose
+- The field exists in the D4D file AND is non-empty
+- Contains **meaningful, non-trivial content** (not just boilerplate)
+- Provides **actionable information** to dataset users
+- Is **complete enough** to support the sub-element's stated purpose

 Score **0** (absent/fail) if:
-- ❌ Field is missing, null, or empty
-- ❌ Content is generic, boilerplate, or placeholder text
-- ❌ Information is incomplete, vague, or too high-level
-- ❌ Does not meaningfully address the sub-element's intent
+- Field is missing, null, or empty
+- Content is generic, boilerplate, or placeholder text
+- Information is incomplete, vague, or does not address the purpose of the D4D, element, or sub-element
+- Does not meaningfully address the sub-element's intent

 ### Quality vs. Presence

 **This is NOT simple field-presence detection.** You must assess the **quality and usefulness** of the content:

-- ✅ **Good:** "Participants recruited from 5 specialty clinics across North America (MGH, UF, UT Health, Tufts, Emory) with IRB approval from each institution."
-- ⚠️ **Marginal:** "Data collected from multiple sites."
-- ❌ **Poor:** "Collection sites: various"
+- **Good:** "Participants recruited from 5 specialty clinics across North America (MGH, UF, UT Health, Tufts, Emory) with IRB approval from each institution."
+- **Marginal:** "Data collected from multiple sites."
+- **Poor:** "Collection sites: various"

 ### Semantic Analysis Requirements

@@ -54,7 +54,7 @@ Score **0** (absent/fail) if:

 1. **Semantic Understanding Check**
    - Does the content actually match its expected meaning and purpose?
-   - Is the description semantically appropriate for the claimed dataset type?
+   - Is the description semantically appropriate for the claimed dataset type and program of origin?
    - Are technical terms used correctly and consistently?

 2. **Correctness Validation**
@@ -78,7 +78,7 @@ Score **0** (absent/fail) if:
      - IF funding present → EXPECT `purposes` aligns with funding goals

 4. **Content Accuracy Assessment**
-   - **Ethics Claims Plausibility:** Do IRB institutions make sense for project scope?
+   - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope?
    - **Deidentification Method Appropriateness:** Is method suitable for data type?
    - **Funding Pattern Matching:** Do grant numbers follow expected patterns?
    - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)?
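
The Temporal Consistency check (collection → processing → publication) amounts to a simple ordering test. A minimal Python sketch, assuming the three dates have already been parsed out of the D4D file; the function and parameter names are hypothetical and are not D4D schema fields:

```python
from datetime import date
from typing import Optional


def dates_in_order(
    collection: Optional[date],
    processing: Optional[date],
    publication: Optional[date],
) -> bool:
    """Return True if the known dates follow collection <= processing <= publication.

    Missing dates (None) are skipped rather than treated as failures;
    that choice is an assumption of this sketch, not a rubric rule.
    """
    known = [d for d in (collection, processing, publication) if d is not None]
    # Compare each known date with the next one in pipeline order.
    return all(earlier <= later for earlier, later in zip(known, known[1:]))
```

A file whose publication date precedes its collection date would fail this check and, under the rubric, would warrant a note in the semantic analysis.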
From 7422458278778a0d4088076aa495fa3aa42427c4 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 11:37:38 -0700
Subject: [PATCH 02/24] Minor updates to rubric 20 semantic eval

---
 .claude/agents/d4d-rubric20-semantic.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 1f79c8ef..9fb151af 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -13,17 +13,17 @@ color: purple

 # D4D Rubric20 Semantic Evaluator

-You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**.
+You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**.

 ## Your Task

-Read the provided D4D YAML file and perform a **semantic quality assessment** that goes beyond simple quality checks to include correctness validation, consistency checking, and deep semantic understanding across 20 evaluation questions organized into 4 categories. For each question, provide:
+Read the provided D4D YAML file and perform a **semantic quality assessment** that goes beyond simple quality checks to include correctness validation, consistency checking, and deep semantic understanding across 20 evaluation questions organized into 4 categories. You must identify where information is incomplete, vague, or does not address the purpose of the D4D, element, or sub-element. For each question, provide:

 1. **Score** - Either numeric (0-5 scale) or pass/fail depending on question type
 2. **Score label** - Description of the quality level achieved
 3. **Evidence** - Specific quotes or field references from the D4D file
 4. **Quality assessment** - Brief explanation of scoring rationale
-5. **Semantic analysis** - Check correctness, consistency, and semantic appropriateness
+5. **Semantic analysis** - Check correctness, consistency, and semantic relevance to the element or sub-element

 ## Evaluation Criteria

@@ -45,11 +45,11 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th

 **This is NOT simple field-presence detection.** Assess the **quality, completeness, and usefulness** of the content:

-- ✅ **Score 5 Example:** "Participants recruited from 5 specialty clinics (MGH: voice disorders, UF: respiratory, UT Health: neurological, Tufts: mood disorders, Emory: cardiac conditions) with full IRB approval (protocols: MGH-2023-001, UF-2023-045). Inclusion: adults 18-85, English-speaking. Exclusion: cognitive impairment, active substance abuse."
+- **Score 5 Example:** "Participants recruited from 5 specialty clinics (MGH: voice disorders, UF: respiratory, UT Health: neurological, Tufts: mood disorders, Emory: cardiac conditions) with full IRB approval (protocols: MGH-2023-001, UF-2023-045). Inclusion: adults 18-85, English-speaking. Exclusion: cognitive impairment, active substance abuse."

-- ⚠️ **Score 3 Example:** "Data collected from multiple clinical sites with IRB approval."
+- **Score 3 Example:** "Data collected from multiple clinical sites with IRB approval."

-- ❌ **Score 0 Example:** "Collection sites: various"
+- **Score 0 Example:** "Collection sites: various"

 ### Semantic Analysis Requirements

@@ -84,7 +84,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
      - IF license allows reuse → EXPECT distribution formats specified

 4. **Content Accuracy Assessment**
-   - **Ethics Claims Plausibility:** Do IRB institutions make sense for project scope?
+   - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope?
    - **Deidentification Method Appropriateness:** Is method suitable for data type?
    - **Funding Pattern Matching:** Do grant numbers follow expected patterns?
    - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)?
From b0225b22b74b249bbe899c54da8eab560a5ca283 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 11:38:59 -0700
Subject: [PATCH 03/24] Update text of assessment sections

---
 .claude/agents/d4d-rubric20-semantic.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 9fb151af..703e7361 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -57,7 +57,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th

 1. **Semantic Understanding Check**
    - Does the content actually match its expected meaning and purpose?
-   - Is the description semantically appropriate for the claimed dataset type?
+   - Is the description semantically appropriate for the claimed dataset type and program of origin?
    - Are technical terms used correctly and consistently?

 2. **Correctness Validation**
@@ -85,12 +85,12 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th

 4. **Content Accuracy Assessment**
    - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope?
-   - **Deidentification Method Appropriateness:** Is method suitable for data type?
+   - **Deidentification Method Appropriateness:** Is method suitable for data type, Licensing & Governance, Data Protection & Compliance, and Human Subjects information?
    - **Funding Pattern Matching:** Do grant numbers follow expected patterns?
    - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)?
-   - **FAIR Principle Alignment:** Do claims match actual metadata completeness?
+   - **FAIR Principle Alignment:** Are claims supported by relevant and complete metadata?

-**Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. This affects scoring - reduce score if semantic issues detected.
+**Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. This affects scoring - reduce score if semantic issues detected. Always note where semantic issues impacted scoring.

 ## Rubric20 Specification

@@ -148,7 +148,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** 2–3 file types
 - **5:** >3 file types

-**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety indicates multi-modal data.
+**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated in `description`, `purposes`, or `keywords`.

 ---

@@ -161,7 +161,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **Pass:** Numeric file size or instance count found
 - **Fail:** No file size/instance metadata

-**Assessment:** Look for bytes field, instance counts, or sample size documentation.
+**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables an estimate of the file size.

 ---

From d160f4c890f0767f93acadad2468fbd03903e9c7 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 11:40:53 -0700
Subject: [PATCH 04/24] Remove gov review restriction

Suggest performing assessment based on generalized criteria
---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 703e7361..c32a1766 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -204,9 +204,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Basic ethics (IRB + deidentification)
 - **5:** Comprehensive (all human subjects protections documented)

-**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas.
-
-**Applies to:** Bridge2AI-Voice, AI-READI
+**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas. Only score this question if human subjects or governance restrictions are identified elsewhere.

 ---

From 73c59fc8fa452eff9681060f5c494af2783d241a Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 11:50:10 -0700
Subject: [PATCH 05/24] Remove gov review restriction for Q9

Suggest performing assessment based on generalized criteria
---
 .claude/agents/d4d-rubric20-semantic.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index c32a1766..7e992ccd 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -204,7 +204,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Basic ethics (IRB + deidentification)
 - **5:** Comprehensive (all human subjects protections documented)

-**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas. Only score this question if human subjects or governance restrictions are identified elsewhere.
+**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

@@ -218,9 +218,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** License only
 - **5:** License + restrictions + confidentiality classification

-**Assessment:** Evaluate clarity and completeness of governance and access documentation.
-
-**Applies to:** Bridge2AI-Voice, Dataverse
+**Assessment:** Evaluate clarity and completeness of governance and terms of use documentation. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

From e2c1fbb7a8a48c1089dcc290d50967f354cc2248 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 12:59:27 -0700
Subject: [PATCH 06/24] Add generalized review selection criteria to Q10

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 7e992ccd..cf4590ff 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -232,9 +232,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Standard format but no schema reference
 - **5:** Standard formats + schema/ontology compliance

-**Assessment:** Check for standard formats (Parquet, TSV, OMOP, FHIR, DICOM), encoding, and schema conformance references.
-
-**Applies to:** Bridge2AI-Voice, Health Nexus
+**Assessment:** Check for standard formats (Parquet, TSV, OMOP, FHIR, DICOM), encoding, and schema conformance references. Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

From 33b19e35d30509784377d8713aa26b4cb6358eaa Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:00:23 -0700
Subject: [PATCH 07/24] Add generalized review selection criteria to Q11

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index cf4590ff..8ebc1659 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -248,9 +248,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** At least one strategy or tool listed
 - **5:** Comprehensive strategies with software versions/URLs

-**Assessment:** Look for strategy documentation and software names, versions, and links.
-
-**Applies to:** Bridge2AI-Voice
+**Assessment:** Look for strategy documentation and software names, versions, and links. Always report results of this question, but only score if software tools were identified elsewhere as shared and available for reuse.

 ---

From 85a943cb1672c8bc6d96ac7e19d49e23069c2819 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:03:47 -0700
Subject: [PATCH 08/24] Add generalized review selection criteria to Q12

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 8ebc1659..0a3e5356 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -262,9 +262,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Partial description (e.g., mechanism only)
 - **5:** Full collection protocol with methods, collectors, and timeframes

-**Assessment:** Evaluate detail level and completeness of collection protocol documentation.
-
-**Applies to:** Bridge2AI-Voice, AI-READI
+**Assessment:** Evaluate detail level and completeness of collection protocol documentation. Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.

 ---

From b5e0d5f4b6dffd784a794ecb5e35dd62b5273b86 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:09:29 -0700
Subject: [PATCH 09/24] Add generalized review selection criteria to Q13

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 0a3e5356..4134907f 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -276,9 +276,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Version number + basic access info
 - **5:** Comprehensive versioning with errata, updates, and release notes

-**Assessment:** Evaluate completeness of version tracking infrastructure.
-
-**Applies to:** Bridge2AI-Voice, Dataverse
+**Assessment:** Evaluate completeness of version tracking infrastructure. Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.

 ---

From af92b8d2ed7cf46c927faa5e5c36cf9d0b189a15 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:12:02 -0700
Subject: [PATCH 10/24] Add generalized review selection criteria to Q14

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 4134907f..ec72ac01 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -290,9 +290,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** One citation or external resource
 - **5:** Multiple references and dataset citation

-**Assessment:** Count publications, external resources, and check for formal dataset citation.
-
-**Applies to:** Bridge2AI-Voice, AI-READI
+**Assessment:** Count publications, external resources, and check for formal dataset citation. Always report results of this question, but only score if publication was identified elsewhere and datasets were shared and available for reuse.

 ---

From f3bd5085bfcf0634e1dae8c0acf1197d51197b78 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:13:22 -0700
Subject: [PATCH 11/24] Add generalized review selection criteria to Q15

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index ec72ac01..319ca59a 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -304,9 +304,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** General human data without subgroup description
 - **5:** Detailed demographics and inclusion/exclusion criteria

-**Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations.
-
-**Applies to:** Bridge2AI-Voice, AI-READI
+**Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

From 120e5652825303d929dbc02d70ab2a2e5f1b1bea Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:15:44 -0700
Subject: [PATCH 12/24] Add generalized review selection criteria to Q17

---
 .claude/agents/d4d-rubric20-semantic.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 319ca59a..835a2c90 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -319,7 +319,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **Pass:** At least one working external URL present
 - **Fail:** No external links found

-**Assessment:** Verify presence of persistent URLs.
+**Assessment:** Verify presence of persistent URLs.

 ---

@@ -333,9 +333,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Partially described access mechanism
 - **5:** Fully defined access path (platform, login, policy)

-**Assessment:** Evaluate clarity of access instructions through distribution formats and licensing.
-
-**Applies to:** Dataverse, PhysioNet
+**Assessment:** Evaluate clarity of access instructions through distribution formats and licensing. Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

From 3dd6c01c80d3e82b577ace6a253da023ff1eb84e Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:17:21 -0700
Subject: [PATCH 13/24] Add generalized review selection criteria to Q20

---
 .claude/agents/d4d-rubric20-semantic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 835a2c90..41ded03f 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -374,9 +374,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **Pass:** Cross-platform links verified
 - **Fail:** No external linkages found

-**Assessment:** Look for external resources linking to related platforms (FAIRhub, PhysioNet, GitHub, etc.).
-
-**Applies to:** Health Nexus, PhysioNet, FAIRhub
+**Assessment:** Look for external resources linking to related platforms (FAIRhub, PhysioNet, GitHub, etc.). Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

From 0f438c891c5c63aebbde5cc349483b696c3b7ef7 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:26:44 -0700
Subject: [PATCH 14/24] Use "Applies to" field for reporting criteria

---
 .claude/agents/d4d-rubric20-semantic.md | 42 ++++++++++++++++++-------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 41ded03f..5faea7e5 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -204,7 +204,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Basic ethics (IRB + deidentification)
 - **5:** Comprehensive (all human subjects protections documented)

-**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.
+**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas.
+
+**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

@@ -218,7 +220,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** License only
 - **5:** License + restrictions + confidentiality classification

-**Assessment:** Evaluate clarity and completeness of governance and terms of use documentation. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.
+**Assessment:** Evaluate clarity and completeness of governance and terms of use documentation.
+
+**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

@@ -232,7 +236,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Standard format but no schema reference
 - **5:** Standard formats + schema/ontology compliance

-**Assessment:** Check for standard formats (Parquet, TSV, OMOP, FHIR, DICOM), encoding, and schema conformance references. Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.
+**Assessment:** Check for standard formats (Parquet, TSV, OMOP, FHIR, DICOM), encoding, and schema conformance references.
+
+**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

@@ -248,7 +254,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** At least one strategy or tool listed
 - **5:** Comprehensive strategies with software versions/URLs

-**Assessment:** Look for strategy documentation and software names, versions, and links. Always report results of this question, but only score if software tools were identified elsewhere as shared and available for reuse.
+**Assessment:** Look for strategy documentation and software names, versions, and links.
+
+**Applies to:** Always report results of this question, but only score if software tools were identified elsewhere as shared and available for reuse.

 ---

@@ -262,7 +270,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Partial description (e.g., mechanism only)
 - **5:** Full collection protocol with methods, collectors, and timeframes

-**Assessment:** Evaluate detail level and completeness of collection protocol documentation. Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.
+**Assessment:** Evaluate detail level and completeness of collection protocol documentation.
+
+**Applies to:** Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.

 ---

@@ -276,7 +286,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Version number + basic access info
 - **5:** Comprehensive versioning with errata, updates, and release notes

-**Assessment:** Evaluate completeness of version tracking infrastructure. Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.
+**Assessment:** Evaluate completeness of version tracking infrastructure.
+
+**Applies to:** Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse.

 ---

@@ -290,7 +302,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** One citation or external resource
 - **5:** Multiple references and dataset citation

-**Assessment:** Count publications, external resources, and check for formal dataset citation. Always report results of this question, but only score if publication was identified elsewhere and datasets were shared and available for reuse.
+**Assessment:** Count publications, external resources, and check for formal dataset citation.
+
+**Applies to:** Always report results of this question, but only score if publication was identified elsewhere and datasets were shared and available for reuse.

 ---

@@ -304,7 +318,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** General human data without subgroup description
 - **5:** Detailed demographics and inclusion/exclusion criteria

-**Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations. Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.
+**Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations.
+
+**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere.

 ---

@@ -333,7 +349,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **3:** Partially described access mechanism
 - **5:** Fully defined access path (platform, login, policy)

-**Assessment:** Evaluate clarity of access instructions through distribution formats and licensing. Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.
+**Assessment:** Evaluate clarity of access instructions through distribution formats and licensing.
+
+**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

@@ -374,7 +392,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 - **Pass:** Cross-platform links verified
 - **Fail:** No external linkages found

-**Assessment:** Look for external resources linking to related platforms (FAIRhub, PhysioNet, GitHub, etc.). Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.
+**Assessment:** Look for external resources linking to related platforms (FAIRhub, PhysioNet, GitHub, etc.).
+
+**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse.

 ---

@@ -709,7 +729,7 @@ semantic_analysis_summary:

 2. **Evidence-Based Scoring:** Include specific field values and quotes.

-3. **Context-Aware:** Some questions apply only to specific dataset types (see "applies_to" field).
+3. **Context-Aware:** Some questions apply only to specific dataset and program types (see "Applies to" field in questions).

 4. **Graduated Scoring:** Use the full 0-5 range for numeric questions based on quality levels.

From cd8ae54a699a127afcbacc04f3f82c91f85223f1 Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 1 Jun 2026 13:45:18 -0700
Subject: [PATCH 15/24] Modify procedure for multi-sheet eval

---
 .claude/agents/d4d-rubric20-semantic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 5faea7e5..b04445cc 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -753,9 +753,9 @@ semantic_analysis_summary:
 **User:** "Run rubric20 assessment on CM4AI D4D files (curated, gpt5, claudecode)"

 **Agent:**
-1. Evaluates each file separately
-2. Generates detailed quality assessments
-3. Highlights differences in FAIR compliance and technical documentation
+1. Evaluates each file separately and generates detailed quality assessments, following the procedure in Example 1
+2. Compares and contrasts content and scoring between files
+3. Reports a summary of the comparison between files

 ## How This Agent Works

From 6451997ee8763a39304264d885b67e7fdcd2b74c Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Mon, 15 Jun 2026 17:30:52 -0700
Subject: [PATCH 16/24] Address issue 155

Add first draft of language to handle applies to logic in evaluation and scoring
---
 .claude/agents/d4d-rubric20-semantic.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index b04445cc..79005299 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -82,6 +82,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
    - **FAIR Logic:**
      - IF DOI present → EXPECT publicly accessible landing page
      - IF license allows reuse → EXPECT distribution formats specified
+   - **'Applies to' Logic:**
+     - If `Applies to` condition is listed, check that relevant information was provided elsewhere
+     - EXAMPLE: IF shared tools were not described in the document, question 11 is not applicable

 4. **Content Accuracy Assessment**
    - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope?
@@ -92,6 +95,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th

 **Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. This affects scoring - reduce score if semantic issues detected. Always note where semantic issues impacted scoring.
+ ## Rubric20 Specification ### Category 1: Structural Completeness (Questions 1-5) @@ -453,7 +457,8 @@ Return your evaluation as a **JSON object** with this EXACT structure: "overall_score": { "total_points": 72.5, "max_points": 84, - "percentage": 86.3 + "percentage": 86.3, + "questions_not_applicable": 0 }, "categories": [ { @@ -462,6 +467,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "id": 1, "name": "Field Completeness", + "applicable": "true", "description": "Proportion of mandatory schema fields populated", "score_type": "numeric", "score": 5, @@ -473,6 +479,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "id": 2, "name": "Entry Length Adequacy", + "applicable": "true", "score_type": "numeric", "score": 5, "max_score": 5, @@ -723,6 +730,8 @@ semantic_analysis_summary: - **Technical Documentation (5 questions):** 25 points max (5 numeric @5 each) - **FAIRness & Accessibility (5 questions):** 13 points max (3 numeric @5 each + 2 pass/fail) +**NOTE:** if any question in the output includes "applicable": "false", decrease the Maximum Possible Score by the question `max_score` and use the adjusted Maximum Possible Score to calculate score percentage. Report the number of non-applicable questions in the "questions_not_applicable" field. + ## Key Principles 1. **Quality over Presence:** Assess content usefulness, not just existence. 
From baec775098c025187706c779e2d08cbb4b11b2b5 Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Mon, 15 Jun 2026 17:31:46 -0700 Subject: [PATCH 17/24] Update d4d-rubric20-semantic.md --- .claude/agents/d4d-rubric20-semantic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index 79005299..fc3cb81a 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -730,7 +730,7 @@ semantic_analysis_summary: - **Technical Documentation (5 questions):** 25 points max (5 numeric @5 each) - **FAIRness & Accessibility (5 questions):** 13 points max (3 numeric @5 each + 2 pass/fail) -**NOTE:** if any question in the output includes "applicable": "false", decrease the Maximum Possible Score by the question `max_score` and use the adjusted Maximum Possible Score to calculate score percentage. Report the number of non-applicable questions in the "questions_not_applicable" field. +**NOTE:** If any question in the output includes "applicable": "false", decrease the Maximum Possible Score by the question `max_score` and use the adjusted Maximum Possible Score to calculate score percentage. Report the number of non-applicable questions in the "questions_not_applicable" field. ## Key Principles From eca681676e84ad9b7202d5ee80940aeb346bfca1 Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 10:44:41 -0700 Subject: [PATCH 18/24] Update d4d-rubric20-semantic.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Change 1 — overall_score block Removed percentage, added excluded_max_points, adjusted_max_points, and normalized_percentage. The example values reflect a zero-exclusion case (all three new fields show 84 / 86.3%) to match the existing example context. 
Change 2 - N/A question example
Added question 11 as an illustrative applicable: false entry inside the JSON schema block, showing "score": null, "score_label": "Not applicable", and a quality_note explaining the exclusion. It sits alongside the existing question 1 and 2 examples.

Change 3 - Scoring Summary section
Replaced the one-sentence NOTE with a three-part N/A Question Convention block defining: (1) the applicable: false / score: null encoding rule, (2) the denominator arithmetic (excluded_max_points, adjusted_max_points, normalized_percentage), and (3) the batch aggregation requirement. The header now reads "Maximum Possible Score: 84 points (before N/A exclusions)" to flag the baseline.

Change 4 - Batch EvaluationSummary YAML
Added average_excluded_max_points, average_adjusted_max_points, and average_normalized_percentage to overall_performance, each method_comparison entry, and each project_comparison entry. Removed average_percentage from those blocks. Added excluded_max_points, adjusted_max_points, and normalized_percentage to best_performer and worst_performer. Replaced average_percentage with average_normalized_percentage in category_performance. Updated the CSV column list to replace percentage with excluded_max_points, adjusted_max_points, normalized_percentage.

Change 5 - Updated Applies to Logic instructions in Cross-Field Consistency Checking Updates made with Claude Sonnet 4.6 Adaptive --- .claude/agents/d4d-rubric20-semantic.md | 94 +++++++++++++++++++------ 1 file changed, 73 insertions(+), 21 deletions(-) diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index fc3cb81a..1b434eb3 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -84,8 +84,22 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - IF license allows reuse → EXPECT distribution formats specified - **'Applies to' Logic:** - If `Applies to` condition is listed, check that relevant information was provided elsewhere - - EXAMPLE: IF shared tools were not described in the document, question 11 is not applicable - + - **Step 1 — Resolve all five trigger conditions before scoring any question:** + + | Condition | Satisfied when… | Gates | + |---|---|---| + | Human subjects / governance | `human_subject_research.involves_human_subjects=True` OR governance/regulatory restrictions mentioned anywhere in the datasheet | Q8, Q9, Q15 | + | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Q10, Q17, Q20 | + | Software tools shared & available for reuse | `software_and_tools` lists at least one tool AND an access path (URL, repo, or distribution) exists | Q11 | + | Data collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`, `data_collectors`, or `collection_timeframes`) AND the datasets shared condition above is met | Q12, Q13 | + | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Q14 | + + - **Step 2 — Apply the N/A encoding convention:** If a 
condition is not met, set `applicable: false` and `score: null` for every question it gates. Do not emit `0`. Subtract the question's `max_score` from the denominator per the N/A Question Convention in the Scoring Summary section. + - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared. + - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Q10, Q17, Q20 are applicable and scored. + - EXAMPLE (applicable + scored low): `human_subject_research.involves_human_subjects=True` but the datasheet is a core/instrument-only record with no ethics fields populated → Q8 and Q9 are reported (flag the gap) but the condition is met so they remain applicable and receive a low score, not N/A. + - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Q10, Q11, Q12, Q13, Q14, Q17, Q20 are all set to `applicable: false`, `score: null`, and excluded from the denominator. + 4. **Content Accuracy Assessment** - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope? - **Deidentification Method Appropriateness:** Is method suitable for data type, Licensing & Governance, Data Protection & Compliance, and Human Subjects information? 
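
The Step 1 table and Step 2 encoding rule in the hunk above amount to a small gating function. The following is a minimal Python sketch of that logic, not code from the repository. It assumes the D4D datasheet has already been loaded into a plain `dict` with `human_subject_research` as a nested mapping. The function names `resolve_conditions` and `not_applicable_questions` and the four keyword flags are hypothetical, standing in for judgments the table words in prose (governance mentioned "anywhere", license "explicitly permits reuse", and so on). It also simplifies the "links to accessible data" test to mere presence of `download_url` or `page`, which leans toward `applicable: true` in the spirit of the ambiguity rule.

```python
from typing import Any

# Question gates copied from the Step 1 table in the patch.
GATES: dict[str, list[int]] = {
    "human_subjects_or_governance": [8, 9, 15],
    "datasets_shared": [10, 17, 20],
    "software_tools_shared": [11],
    "collection_and_datasets_shared": [12, 13],
    "publication_and_datasets_shared": [14],
}

# Collection fields named in the "Data collection identified" row.
COLLECTION_FIELDS = (
    "acquisition_methods",
    "collection_mechanisms",
    "data_collectors",
    "collection_timeframes",
)


def resolve_conditions(
    d4d: dict[str, Any],
    *,
    governance_mentioned: bool = False,    # hypothetical: requires reading the text
    license_permits_reuse: bool = False,   # hypothetical: requires reading the license
    tool_access_path: bool = False,        # hypothetical: URL/repo exists for a tool
    publication_referenced: bool = False,  # hypothetical: found in external_resources
) -> dict[str, bool]:
    """Resolve all five trigger conditions before any question is scored."""
    hsr = d4d.get("human_subject_research") or {}
    # Simplification: a present URL is treated as "accessible".
    shared = (
        bool(d4d.get("distribution_formats"))
        or bool(d4d.get("download_url") or d4d.get("page"))
        or license_permits_reuse
    )
    return {
        "human_subjects_or_governance": (
            bool(hsr.get("involves_human_subjects")) or governance_mentioned
        ),
        "datasets_shared": shared,
        "software_tools_shared": (
            bool(d4d.get("software_and_tools")) and tool_access_path
        ),
        "collection_and_datasets_shared": (
            any(d4d.get(f) for f in COLLECTION_FIELDS) and shared
        ),
        "publication_and_datasets_shared": (
            (bool(d4d.get("citation")) or publication_referenced) and shared
        ),
    }


def not_applicable_questions(conditions: dict[str, bool]) -> list[int]:
    """Return the question ids to emit with applicable=false and score=null."""
    return sorted(
        q for name, qs in GATES.items() if not conditions[name] for q in qs
    )
```

Resolving every condition up front, before scoring, keeps the two compound rows ("AND datasets shared") tied to a single `shared` value, so a question cannot be scored under one reading of the sharing condition and excluded under another.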
@@ -457,7 +471,9 @@ Return your evaluation as a **JSON object** with this EXACT structure: "overall_score": { "total_points": 72.5, "max_points": 84, - "percentage": 86.3, + "excluded_max_points": 0, + "adjusted_max_points": 84, + "normalized_percentage": 86.3, "questions_not_applicable": 0 }, "categories": [ @@ -487,7 +503,18 @@ Return your evaluation as a **JSON object** with this EXACT structure: "evidence": "description: 420 chars, motivation: N/A", "quality_note": "Description is comprehensive at 420 characters" }, - ... (remaining questions 3-5) + { + "id": 11, + "name": "Tool and Software Transparency", + "applicable": "false", + "score_type": "numeric", + "score": null, + "max_score": 5, + "score_label": "Not applicable", + "evidence": "No shared software tools identified in this datasheet", + "quality_note": "Excluded from denominator per 'Applies to' condition: software tools not shared" + }, + "... (remaining questions 3-5)" ], "category_score": 23, "category_max": 24 @@ -495,7 +522,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "Metadata Quality & Content", "questions": [ - ... (questions 6-10) + "... (questions 6-10)" ], "category_score": 18, "category_max": 22 @@ -503,7 +530,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "Technical Documentation", "questions": [ - ... (questions 11-15) + "... (questions 11-15)" ], "category_score": 19, "category_max": 25 @@ -511,7 +538,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "FAIRness & Accessibility", "questions": [ - ... (questions 16-20) + "... 
(questions 16-20)" ], "category_score": 12.5, "category_max": 13 @@ -567,7 +594,9 @@ evaluation_date: "" overall_performance: average_score: 52.3 max_score: 84 - average_percentage: 62.3 + average_excluded_max_points: 8.5 + average_adjusted_max_points: 75.5 + average_normalized_percentage: 69.3 best_score: 68.0 worst_score: 38.5 best_performer: @@ -575,36 +604,48 @@ overall_performance: method: claudecode_agent project: AI_READI score: 68.0 - percentage: 81.0 + excluded_max_points: 5 + adjusted_max_points: 79 + normalized_percentage: 86.1 worst_performer: file: CHORUS_d4d.yaml method: gpt5 project: CHORUS score: 38.5 - percentage: 45.8 + excluded_max_points: 10 + adjusted_max_points: 74 + normalized_percentage: 52.0 method_comparison: - method: claudecode_agent file_count: 4 average_score: 56.2 - average_percentage: 66.9 + average_excluded_max_points: 7.5 + average_adjusted_max_points: 76.5 + average_normalized_percentage: 73.5 rank: 1 - method: claudecode_assistant file_count: 4 average_score: 48.4 - average_percentage: 57.6 + average_excluded_max_points: 9.5 + average_adjusted_max_points: 74.5 + average_normalized_percentage: 64.9 rank: 2 project_comparison: - project: AI_READI file_count: 2 average_score: 61.5 - average_percentage: 73.2 + average_excluded_max_points: 5.0 + average_adjusted_max_points: 79.0 + average_normalized_percentage: 77.8 rank: 1 - project: CM4AI file_count: 2 average_score: 54.8 - average_percentage: 65.2 + average_excluded_max_points: 8.0 + average_adjusted_max_points: 76.0 + average_normalized_percentage: 72.1 rank: 2 category_performance: @@ -612,22 +653,22 @@ category_performance: category_name: "Structural Completeness and Core Metadata" average_score: 15.8 max_score: 24 - average_percentage: 65.8 + average_normalized_percentage: 65.8 - category_id: "2" category_name: "Metadata Quality and Detail" average_score: 14.2 max_score: 22 - average_percentage: 64.5 + average_normalized_percentage: 64.5 - category_id: "3" category_name: 
"Technical Documentation and Reproducibility" average_score: 12.5 max_score: 25 - average_percentage: 50.0 + average_normalized_percentage: 50.0 - category_id: "4" category_name: "FAIRness and Accessibility" average_score: 9.8 max_score: 13 - average_percentage: 75.4 + average_normalized_percentage: 75.4 common_strengths: - description: "Strong structural completeness with semantically validated fields" @@ -711,7 +752,7 @@ semantic_analysis_summary: ### Additional Output Files 1. **CSV Summary:** `all_scores.csv` - - Columns: project, method, file, total_score, percentage, cat1_score, cat2_score, cat3_score, cat4_score, consistency_passed, consistency_failed, issues_detected + - Columns: project, method, file, total_score, excluded_max_points, adjusted_max_points, normalized_percentage, cat1_score, cat2_score, cat3_score, cat4_score, consistency_passed, consistency_failed, issues_detected 2. **Markdown Report:** `summary_report.md` - Executive summary with scoring tables @@ -724,13 +765,24 @@ semantic_analysis_summary: ## Scoring Summary -**Maximum Possible Score:** 84 points +**Maximum Possible Score:** 84 points (before N/A exclusions) - **Structural Completeness (5 questions):** 24 points max (4 numeric @5 each + 1 pass/fail) - **Metadata Quality & Content (5 questions):** 22 points max (4 numeric @5 each + 1 pass/fail) - **Technical Documentation (5 questions):** 25 points max (5 numeric @5 each) - **FAIRness & Accessibility (5 questions):** 13 points max (3 numeric @5 each + 2 pass/fail) -**NOTE:** If any question in the output includes "applicable": "false", decrease the Maximum Possible Score by the question `max_score` and use the adjusted Maximum Possible Score to calculate score percentage. Report the number of non-applicable questions in the "questions_not_applicable" field. +**N/A Question Convention:** + +1. **Encoding:** Set `applicable: false` and `score: null` for any question whose `Applies to` condition is not met. 
Do not emit `0` for these questions — a zero score penalizes datasheets for which the question is simply irrelevant. + +2. **Denominator rule:** Subtract the question's `max_score` from `max_points` to compute `adjusted_max_points`. Report `normalized_percentage = total_points / adjusted_max_points × 100`. This is the only percentage reported; it is comparable across datasheets regardless of how many questions are excluded. + - `excluded_max_points` = sum of `max_score` for all questions where `applicable: false` + - `adjusted_max_points` = `max_points` − `excluded_max_points` + - `normalized_percentage` = `total_points / adjusted_max_points × 100` + +3. **Batch aggregation:** Apply the same convention in the `EvaluationSummary`. Report `average_excluded_max_points`, `average_adjusted_max_points`, and `average_normalized_percentage` at the overall, method, and project levels so cross-file comparisons remain meaningful even when different datasheets trigger different N/A conditions (e.g., core vs. full-schema datasheets). + +**NOTE:** Report the count of non-applicable questions in the `questions_not_applicable` field of `overall_score`. 
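
The denominator rule above is plain arithmetic, and a short worked sketch may make the field relationships easier to check. The Python below is an illustration under stated assumptions, not the evaluator's implementation: the `summarize` function and the `Question` type are hypothetical names, and `applicable` is assumed to be a real boolean (the JSON example in this patch encodes it as the strings "true"/"false", so a caller would need to convert first).

```python
from typing import Optional, TypedDict


class Question(TypedDict):
    """Hypothetical minimal shape of one scored question."""
    id: int
    applicable: bool
    score: Optional[float]  # None when applicable is False
    max_score: float


def summarize(questions: list[Question], max_points: float = 84) -> dict:
    """Build the overall_score fields following the N/A Question Convention."""
    # excluded_max_points: sum of max_score over non-applicable questions.
    excluded = sum(q["max_score"] for q in questions if not q["applicable"])
    # Only applicable questions contribute points; null scores are skipped.
    total = sum(
        q["score"]
        for q in questions
        if q["applicable"] and q["score"] is not None
    )
    adjusted = max_points - excluded
    # Guard against a datasheet where every question is excluded.
    pct = round(total / adjusted * 100, 1) if adjusted > 0 else None
    return {
        "total_points": total,
        "max_points": max_points,
        "excluded_max_points": excluded,
        "adjusted_max_points": adjusted,
        "normalized_percentage": pct,
        "questions_not_applicable": sum(
            1 for q in questions if not q["applicable"]
        ),
    }
```

With no exclusions, the sketch reproduces the figures in the patch's own `overall_score` example: 72.5 points over 84 rounds to 86.3. Returning `None` when the adjusted denominator is zero is a choice made here for illustration; the patch does not say how that case should be reported.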
## Key Principles From fcde51883d65e4f482de5ff5decd528b857b0fbf Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 11:04:56 -0700 Subject: [PATCH 19/24] Align rubric10 with rubric20 Add 'applies to' logic and evaluation guidelines from rubric20 semantic eval Update outputs to be consistent with rubric20 semantic eval --- .claude/agents/d4d-rubric10-semantic.md | 123 +++++++++++++++++++++--- 1 file changed, 111 insertions(+), 12 deletions(-) diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md index b36bd73b..4a1aa251 100644 --- a/.claude/agents/d4d-rubric10-semantic.md +++ b/.claude/agents/d4d-rubric10-semantic.md @@ -76,6 +76,24 @@ Score **0** (absent/fail) if: - **Funding Logic:** - IF `funders` present → EXPECT `funding_and_acknowledgements.funding.agency` matches - IF funding present → EXPECT `purposes` aligns with funding goals + - **'Applies to' Logic:** + - If an element or sub-element is only meaningful under a specific condition, check that the condition is satisfied before scoring it + - EXAMPLE: IF no human subjects are identified in the datasheet, Element 4 sub-elements are not applicable + - **Step 1 — Resolve all five trigger conditions before scoring any element:** + + | Condition | Satisfied when… | Gates | + |---|---|---| + | Human subjects / governance | `human_subject_research.involves_human_subjects=True` OR governance/regulatory restrictions mentioned anywhere in the datasheet | Element 4 (all 5 sub-elements) | + | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Element 3 sub-elements 1–4, Element 6 (all), Element 8 (all), Element 10 (all) | + | Software tools shared & available for reuse | `software_and_tools` lists at least one tool AND an access path (URL, repo, or distribution) exists | Element 8 sub-elements 3–4 | + | Data 
collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`) AND the datasets shared condition above is met | Element 8 sub-elements 1–2 | + | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Element 10 sub-element 2 | + + - **Step 2 — Apply the N/A encoding convention:** If a condition is not met, set `applicable: false` and `score: null` for every sub-element it gates. Do not emit `0`. Subtract 1 from the denominator per excluded sub-element per the N/A Sub-Element Convention below. + - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared. + - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Element 6, 8, and 10 sub-elements are applicable and scored. + - EXAMPLE (applicable + scored low): `human_subject_research.involves_human_subjects=True` but no IRB fields populated → Element 4 sub-elements are applicable (condition met) and receive a score of 0, flagged as a consistency gap. + - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Element 3 sub-elements 1–4, all of Element 6, all of Element 8, and all of Element 10 are set to `applicable: false`, `score: null`, and excluded from the denominator. 4. **Content Accuracy Assessment** - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope? 
@@ -83,6 +101,23 @@ Score **0** (absent/fail) if: - **Funding Pattern Matching:** Do grant numbers follow expected patterns? - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)? +### N/A Sub-Element Convention + +**Maximum Possible Score:** 50 points (before N/A exclusions; 10 elements × 5 sub-elements × 1 point each) + +Some sub-elements are only applicable under certain conditions (see 'Applies to' Logic in Cross-Field Consistency Checking). When a condition is not met: + +1. **Encoding:** Set `applicable: false` and `score: null` for the sub-element. Do not emit `0` — a zero score penalizes datasheets for which the sub-element is simply irrelevant. + +2. **Denominator rule:** Each excluded sub-element reduces the denominator by 1. + - `excluded_max_points` = count of sub-elements where `applicable: false` + - `adjusted_max_points` = `max_points` − `excluded_max_points` + - `normalized_percentage` = `total_points / adjusted_max_points × 100` + +3. **Batch aggregation:** Apply the same convention in the `EvaluationSummary`. Report `average_excluded_max_points`, `average_adjusted_max_points`, and `average_normalized_percentage` at the overall, method, and project levels so cross-file comparisons remain meaningful even when different datasheets trigger different N/A conditions. + +Report the count of non-applicable sub-elements in the `sub_elements_not_applicable` field of `overall_score`. + **Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. ## Rubric10 Specification @@ -151,18 +186,22 @@ Score **0** (absent/fail) if: 1. **License Terms Allow Reuse** - Fields: `license_and_use_terms` - Look for: Clear license (CC BY, CC BY-NC-SA, etc.) with reuse permissions + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. 
**Data Formats Are Standardized (encoding, format)** - Fields: `format`, `encoding` - Look for: Use of standard formats (JSON, TSV, Parquet, DICOM, WFDB) and character encoding + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 3. **Schema or Ontology Conformance Stated** - Fields: `conforms_to`, `conforms_to_schema` - Look for: References to schemas (OMOP, FHIR, schema.org, etc.) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. **Variable Metadata with Identifiers Defined** - Fields: `variables` - Look for: Variable-level metadata with identifiers and descriptions + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Use Guidance Provided (intended, prohibited uses)** - Fields: `intended_uses`, `prohibited_uses`, `discouraged_uses` @@ -184,22 +223,27 @@ Score **0** (absent/fail) if: - Fields: `ethical_reviews`, `human_subject_research` - Look for: IRB approval details, institutional oversight, ethics review boards - **Semantic Check:** If `human_subject_research.involves_human_subjects=True`, this MUST be populated + - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 2. **Deidentification Method Described** - Fields: `is_deidentified` - Look for: Specific deidentification method (HIPAA Safe Harbor, Expert Determination, k-anonymity) + - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 3. 
**Privacy Protections Beyond Deidentification** - Fields: `participant_privacy` - Look for: Privacy protections, anonymization procedures, reidentification risk assessment + - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 4. **Informed Consent Obtained from Participants** - Fields: `informed_consent` - Look for: Consent procedures, consent type (written, verbal), withdrawal mechanisms + - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 5. **Vulnerable Populations and Compensation Documented** - Fields: `vulnerable_populations`, `participant_compensation` - Look for: Protections for vulnerable populations, compensation details + - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. --- @@ -236,22 +280,27 @@ Score **0** (absent/fail) if: 1. **Dataset Version Number Provided** - Fields: `version` - Look for: Version number (1.0, 1.1, 2.0.1) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. **Version Access Methods Documented** - Fields: `version_access` - Look for: How to access different versions of the dataset + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 3. **Change Descriptions and Errata Provided** - Fields: `errata`, `updates` - Look for: Errata documentation, update descriptions, change logs + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. 
**Update Schedule or Frequency Indicated** - Fields: `updates` - Look for: Update schedule, maintenance plan, update frequency + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Provenance and Source Derivation Documented** - Fields: `was_derived_from`, `release_notes` - Look for: Source provenance, dataset derivation, release notes + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -292,22 +341,27 @@ Score **0** (absent/fail) if: 1. **Collection Mechanisms and Settings Described** - Fields: `collection_mechanisms` - Look for: Collection procedures, settings, timeframes + - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. 2. **Data Acquisition Methods Listed** - Fields: `acquisition_methods` - Look for: Instruments, devices, software used for data capture and acquisition + - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. 3. **Preprocessing, Cleaning, and Labeling Strategies** - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies` - Look for: Preprocessing pipeline, cleaning steps, labeling methods + - **Applies to:** Always report results of this sub-element, but only score if software tools are identified elsewhere as shared and available for reuse. 4. **Software and Tools Documented** - Fields: `software_and_tools` - Look for: Software names, versions, processing tools, GitHub repos + - **Applies to:** Always report results of this sub-element, but only score if software tools are identified elsewhere as shared and available for reuse. 5. 
**External Standards and Resources Referenced** - Fields: `external_resources`, `conforms_to` - Look for: Published papers, standards documents, external documentation + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -344,22 +398,27 @@ Score **0** (absent/fail) if: 1. **Dataset Published on a Recognized Platform** - Fields: `publisher` - Look for: PhysioNet, Dataverse, FAIRhub, Zenodo, institutional repository + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. **Citation and DOI for Cross-referencing** - Fields: `citation`, `doi` - Look for: Recommended citation format, DOI for cross-referencing + - **Applies to:** Always report results of this sub-element, but only score if a publication is identified elsewhere and datasets are shared and available for reuse. 3. **Community Standards or Schema Conformance** - Fields: `conforms_to` - Look for: OMOP, FHIR, schema.org, Dublin Core, other community standards + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. **Outreach Materials and Documentation Links** - Fields: `external_resources`, `page` - Look for: Webinars, tutorials, documentation links, landing pages + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Related Datasets with Typed Relationships** - Fields: `related_datasets` - Look for: Related datasets with relationship types (supplements, derives from, is version of) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 
--- @@ -417,7 +476,10 @@ Return your evaluation as a **JSON object** with this EXACT structure: "overall_score": { "total_points": 38.5, "max_points": 50, - "percentage": 77.0 + "excluded_max_points": 0, + "adjusted_max_points": 50, + "normalized_percentage": 77.0, + "sub_elements_not_applicable": 0 }, "elements": [ { @@ -459,7 +521,30 @@ Return your evaluation as a **JSON object** with this EXACT structure: "element_score": 5, "element_max": 5 }, - ... (repeat for all 10 elements) + { + "id": 4, + "name": "Ethical Use and Privacy Safeguards", + "description": "Does the dataset provide clear information about consent, privacy, and ethical oversight?", + "sub_elements": [ + { + "name": "IRB or Ethics Review Documented", + "applicable": true, + "score": 1, + "evidence": "ethical_reviews: IRB approval from 5 institutions documented", + "quality_note": "Human subjects confirmed; IRB details present" + }, + { + "name": "Informed Consent Obtained from Participants", + "applicable": false, + "score": null, + "evidence": "human_subject_research.involves_human_subjects not present in this datasheet", + "quality_note": "Excluded from denominator: human subjects / governance condition not met" + } + ], + "element_score": 1, + "element_max": 1 + }, + "... 
(repeat for all 10 elements)" ], "assessment": { "strengths": [ @@ -511,7 +596,9 @@ evaluation_date: "" overall_performance: average_score: 35.2 max_score: 50 - average_percentage: 70.4 + average_excluded_max_points: 4.2 + average_adjusted_max_points: 45.8 + average_normalized_percentage: 76.9 best_score: 42.0 worst_score: 28.0 best_performer: @@ -519,36 +606,48 @@ overall_performance: method: claudecode_agent project: AI_READI score: 42.0 - percentage: 84.0 + excluded_max_points: 0 + adjusted_max_points: 50 + normalized_percentage: 84.0 worst_performer: file: CHORUS_d4d.yaml method: gpt5 project: CHORUS score: 28.0 - percentage: 56.0 + excluded_max_points: 5 + adjusted_max_points: 45 + normalized_percentage: 62.2 method_comparison: - method: claudecode_agent file_count: 4 average_score: 37.5 - average_percentage: 75.0 + average_excluded_max_points: 3.0 + average_adjusted_max_points: 47.0 + average_normalized_percentage: 79.8 rank: 1 - method: claudecode_assistant file_count: 4 average_score: 32.8 - average_percentage: 65.6 + average_excluded_max_points: 5.5 + average_adjusted_max_points: 44.5 + average_normalized_percentage: 73.7 rank: 2 project_comparison: - project: AI_READI file_count: 2 average_score: 39.0 - average_percentage: 78.0 + average_excluded_max_points: 2.0 + average_adjusted_max_points: 48.0 + average_normalized_percentage: 81.3 rank: 1 - project: CM4AI file_count: 2 average_score: 36.5 - average_percentage: 73.0 + average_excluded_max_points: 4.0 + average_adjusted_max_points: 46.0 + average_normalized_percentage: 79.3 rank: 2 element_performance: @@ -556,12 +655,12 @@ element_performance: element_name: "Dataset Discovery and Identification" average_score: 4.2 max_score: 5 - average_percentage: 84.0 + average_normalized_percentage: 84.0 - element_id: "2" element_name: "Terms of Reuse" average_score: 4.5 max_score: 5 - average_percentage: 90.0 + average_normalized_percentage: 90.0 # ... 
(10 elements total) common_strengths: @@ -636,7 +735,7 @@ semantic_analysis_summary: ### Additional Output Files 1. **CSV Summary:** `all_scores.csv` - - Columns: project, method, file, total_score, percentage, consistency_passed, consistency_failed, issues_detected + - Columns: project, method, file, total_score, excluded_max_points, adjusted_max_points, normalized_percentage, consistency_passed, consistency_failed, issues_detected 2. **Markdown Report:** `summary_report.md` - Executive summary with comparison tables From dc283383afca6dd71be4ab338293bd4f7419b6f1 Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 11:40:39 -0700 Subject: [PATCH 20/24] Typo fixes and linting --- .claude/agents/d4d-rubric20-semantic.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index 1b434eb3..3f0fc77e 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -13,7 +13,7 @@ color: purple # D4D Rubric20 Semantic Evaluator -You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. +You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. 
## Your Task @@ -166,7 +166,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **3:** 2–3 file types - **5:** >3 file types -**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated `description`, `purposes`, or `keywords`. +**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated in `description`, `purposes`, or `keywords`. --- @@ -179,7 +179,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** Numeric file size or instance count found - **Fail:** No file size/instance metadata -**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables and estimate of the file size. +**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables an estimate of the file size. --- @@ -353,7 +353,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** At least one working external URL present - **Fail:** No external links found -**Assessment:** Verify presence of persistent URLs. +**Assessment:** Verify presence of persistent URLs. --- From cd4ed4b14308c501dca7109694ba91d0689e5e87 Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 11:44:20 -0700 Subject: [PATCH 21/24] Update schema mappings in rubric10 and rubric20 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Summary of all 19 changes across both files: rubric20 (6 changes): Content Accuracy Assessment — section-name prose replaced with actual field names (license_and_use_terms, data_protection_impacts, participant_privacy.reidentification_risk, etc.) 
in both ethics plausibility bullets Q8 — vulnerable_populations corrected to at_risk_populations; added participant_privacy.reidentification_risk, data_protection_impacts, regulatory_restrictions.hipaa_compliant, regulatory_restrictions.other_compliance, regulatory_restrictions.governance_committee_contact; score-5 label updated Q9 — added data_protection_impacts, regulatory_restrictions.governance_committee_contact Q11 — description expanded; added annotation_analyses, machine_annotation_tools, imputation_protocols, missing_data_documentation Q12 — added raw_data_sources Q15 — added DataSubset.is_data_split, DataSubset.is_subpopulation rubric10 (13 changes): Content Accuracy Assessment — same field-anchoring fix as rubric20 E2.2 — added regulatory_restrictions.hipaa_compliant, regulatory_restrictions.other_compliance E4 Consistency Checks — new rule: data_protection_impacts present → expect participant_privacy.reidentification_risk assessed E4.1 — renamed to "IRB or Ethics Review and Data Protection Impact"; added data_protection_impacts, regulatory_restrictions.governance_committee_contact E4.3 — renamed to "Privacy Protections and Re-identification Risk Assessment"; added participant_privacy.reidentification_risk E4.5 — vulnerable_populations corrected to at_risk_populations E5.1 — added DataSubset.is_subpopulation E5.2 — added DataSubset.is_data_split E5.5 — added missing_data_documentation E6.5 — added raw_data_sources E8.2 — added raw_data_sources E8.3 — renamed to "Preprocessing, Cleaning, Labeling, and Annotation Quality"; added annotation_analyses, machine_annotation_tools, imputation_protocols E9.2 — renamed to "Biases Categorized Using Standard Taxonomy (RAI-aligned)"; added future_use_impacts with rai:dataSocialImpact reference Implemented with Claude Sonnet 4.6 Adaptive --- .claude/agents/d4d-rubric10-semantic.md | 58 +++++++++++++------------ .claude/agents/d4d-rubric20-semantic.md | 29 +++++++------ 2 files changed, 45 insertions(+), 42 deletions(-) 
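Reviewer note (illustration only, not part of the patch): the new E4 consistency rule listed in the summary above (`data_protection_impacts` present, so expect `participant_privacy.reidentification_risk` to be assessed) could be checked mechanically along the following lines. This is a minimal sketch under stated assumptions: the datasheet is already loaded into a plain `dict`, and `participant_privacy` is a mapping with a `reidentification_risk` key. The helper names `is_populated` and `check_dpia_implies_reid_risk` are hypothetical and do not exist in the repository.

```python
from __future__ import annotations

from typing import Any


def is_populated(value: Any) -> bool:
    """Treat None, blank strings, and empty containers as absent."""
    if value is None:
        return False
    if isinstance(value, str):
        return len(value.strip()) > 0
    if isinstance(value, (list, dict, tuple, set)):
        return len(value) > 0
    return True


def check_dpia_implies_reid_risk(d4d: dict) -> list[str]:
    """Return a list of issue strings for the E4 rule; empty means consistent.

    Assumption: participant_privacy is a mapping. If the real schema uses a
    different shape (for example a list of objects), this lookup must change.
    """
    issues: list[str] = []
    privacy = d4d.get("participant_privacy")
    reid_risk = privacy.get("reidentification_risk") if isinstance(privacy, dict) else None
    if is_populated(d4d.get("data_protection_impacts")) and not is_populated(reid_risk):
        issues.append(
            "data_protection_impacts present but "
            "participant_privacy.reidentification_risk not assessed"
        )
    return issues
```

A non-empty result would correspond to an entry the agent is asked to flag in `semantic_analysis.issues_detected`. The rule is one-directional, so a datasheet with neither field populated produces no issue.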
diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md index 4a1aa251..9fd84d75 100644 --- a/.claude/agents/d4d-rubric10-semantic.md +++ b/.claude/agents/d4d-rubric10-semantic.md @@ -96,8 +96,8 @@ Score **0** (absent/fail) if: - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Element 3 sub-elements 1–4, all of Element 6, all of Element 8, and all of Element 10 are set to `applicable: false`, `score: null`, and excluded from the denominator. 4. **Content Accuracy Assessment** - - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope? - - **Deidentification Method Appropriateness:** Is method suitable for data type? + - **Ethics Claims Plausibility:** Do `license_and_use_terms`, `ip_restrictions`, `data_protection_impacts`, and `participant_privacy.reidentification_risk` align with `human_subject_research`, `informed_consent`, and `participant_privacy` in scope and restrictiveness? + - **Deidentification Method Appropriateness:** Is method suitable for data type given `data_protection_impacts`, `participant_privacy.reidentification_risk`, and `human_subject_research` values? - **Funding Pattern Matching:** Do grant numbers follow expected patterns? - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)? @@ -162,8 +162,8 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - Look for: Clear access policy, IP-based restrictions, or licensing terms 2. 
**Regulatory Restrictions and Confidentiality Level Specified** - - Fields: `regulatory_restrictions`, `confidentiality_level` - - Look for: Export control restrictions, GDPR compliance, data sensitivity classification + - Fields: `regulatory_restrictions`, `confidentiality_level`, `regulatory_restrictions.hipaa_compliant`, `regulatory_restrictions.other_compliance` + - Look for: Export control restrictions, GDPR compliance, data sensitivity classification, HIPAA compliance status, other regulatory frameworks (CCPA, PIPEDA) 3. **Download URL or Platform Link Available** - Fields: `download_url` @@ -216,12 +216,13 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - IF `human_subject_research.involves_human_subjects=True` → EXPECT sub-element 1 (IRB approval) AND sub-element 4 (consent) to score 1 - IF `is_deidentified` present → EXPECT deidentification method described - IF IRB approval documented → EXPECT consent procedures also described +- IF `data_protection_impacts` present → EXPECT `participant_privacy.reidentification_risk` assessed - Flag any inconsistencies in semantic_analysis.issues_detected **Sub-elements:** -1. **IRB or Ethics Review Documented** - - Fields: `ethical_reviews`, `human_subject_research` - - Look for: IRB approval details, institutional oversight, ethics review boards +1. **IRB or Ethics Review and Data Protection Impact** + - Fields: `ethical_reviews`, `human_subject_research`, `data_protection_impacts`, `regulatory_restrictions.governance_committee_contact` + - Look for: IRB approval details, institutional oversight, ethics review boards, data protection impact assessments (DPIAs), governance committee contacts - **Semantic Check:** If `human_subject_research.involves_human_subjects=True`, this MUST be populated - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 
@@ -230,9 +231,9 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - Look for: Specific deidentification method (HIPAA Safe Harbor, Expert Determination, k-anonymity) - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. -3. **Privacy Protections Beyond Deidentification** - - Fields: `participant_privacy` - - Look for: Privacy protections, anonymization procedures, reidentification risk assessment +3. **Privacy Protections and Re-identification Risk Assessment** + - Fields: `participant_privacy`, `participant_privacy.reidentification_risk` + - Look for: Privacy protections, anonymization procedures, explicit re-identification risk assessment and mitigation measures - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 4. **Informed Consent Obtained from Participants** @@ -241,8 +242,8 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. 5. **Vulnerable Populations and Compensation Documented** - - Fields: `vulnerable_populations`, `participant_compensation` - - Look for: Protections for vulnerable populations, compensation details + - Fields: `at_risk_populations`, `participant_compensation` + - Look for: Protections for at-risk populations, compensation details - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere. --- @@ -252,12 +253,12 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica **Sub-elements:** 1. 
**Cohort or Subpopulations Characteristics Described** - - Fields: `subpopulations` - - Look for: Demographics, inclusion/exclusion criteria, population characteristics + - Fields: `subpopulations`, `DataSubset.is_subpopulation` + - Look for: Demographics, inclusion/exclusion criteria, population characteristics, subpopulation flags on dataset subsets 2. **Number of Instances or Samples Reported** - - Fields: `instances` - - Look for: Specific counts (e.g., 306 participants, 12,523 recordings) + - Fields: `instances`, `DataSubset.is_data_split` + - Look for: Specific counts (e.g., 306 participants, 12,523 recordings), dataset split flags indicating training/test/validation subsets 3. **Variable-Level Metadata and Tabular Flag** - Fields: `variables`, `is_tabular` @@ -268,8 +269,8 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - Look for: Disease conditions, phenotypes, topics covered in the dataset 5. **Data Quality Issues and Anomalies Documented** - - Fields: `anomalies`, `sampling_strategies` - - Look for: Known data quality issues, anomalies, sampling methods + - Fields: `anomalies`, `sampling_strategies`, `missing_data_documentation` + - Look for: Known data quality issues, anomalies, sampling methods, missing data patterns and handling strategies --- @@ -298,8 +299,8 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. 
**Provenance and Source Derivation Documented** - - Fields: `was_derived_from`, `release_notes` - - Look for: Source provenance, dataset derivation, release notes + - Fields: `was_derived_from`, `release_notes`, `raw_data_sources` + - Look for: Source provenance, dataset derivation, release notes, raw data sources before preprocessing - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -344,13 +345,13 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. 2. **Data Acquisition Methods Listed** - - Fields: `acquisition_methods` - - Look for: Instruments, devices, software used for data capture and acquisition + - Fields: `acquisition_methods`, `raw_data_sources` + - Look for: Instruments, devices, software used for data capture and acquisition, raw data sources before preprocessing - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. -3. **Preprocessing, Cleaning, and Labeling Strategies** - - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies` - - Look for: Preprocessing pipeline, cleaning steps, labeling methods +3. 
**Preprocessing, Cleaning, Labeling, and Annotation Quality** + - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `annotation_analyses`, `machine_annotation_tools`, `imputation_protocols` + - Look for: Preprocessing pipeline, cleaning steps, labeling methods, annotation quality analyses, machine annotation tools, imputation protocols for missing values - **Applies to:** Always report results of this sub-element, but only score if software tools are identified elsewhere as shared and available for reuse. 4. **Software and Tools Documented** @@ -373,9 +374,9 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica - Fields: `known_limitations` - Look for: Explicit limitations section with known issues -2. **Systematic Biases Identified and Described** - - Fields: `known_biases` - - Look for: Discussion of systematic biases, fairness issues, representativeness +2. **Biases Categorized Using Standard Taxonomy (RAI-aligned)** + - Fields: `known_biases`, `future_use_impacts` + - Look for: Structured bias categorization via `BiasTypeEnum` (mapped to AI Ontology), fairness issues, representativeness, anticipated downstream social impacts (`rai:dataSocialImpact`) 3. **Data Anomalies and Quality Issues Noted** - Fields: `anomalies` @@ -827,3 +828,4 @@ See `notes/RUBRIC_AGENT_USAGE.md` for comprehensive usage examples. 
- **Complement, Not Replace:** This LLM-based evaluation complements the existing field-presence detection in `src/evaluation/evaluate_d4d.py` - **Cost:** ~$0.10-0.30 per file evaluation via Anthropic API - **Time:** ~30-60 seconds per file (slower than presence detection but provides deeper insights) + diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index 3f0fc77e..3239bc70 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -13,7 +13,7 @@ color: purple # D4D Rubric20 Semantic Evaluator -You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. +You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. 
## Your Task @@ -84,6 +84,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - IF license allows reuse → EXPECT distribution formats specified - **'Applies to' Logic:** - If `Applies to` condition is listed, check that relevant information was provided elsewhere + - EXAMPLE: IF shared tools were not described in the document, question 11 is not applicable - **Step 1 — Resolve all five trigger conditions before scoring any question:** | Condition | Satisfied when… | Gates | @@ -99,10 +100,10 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Q10, Q17, Q20 are applicable and scored. - EXAMPLE (applicable + reported, not scored): `human_subject_research.involves_human_subjects=True` but the datasheet is a core/instrument-only record with no ethics fields populated → Q8 and Q9 are reported (flag the gap) but the condition is met so they remain applicable and receive a low score, not N/A. - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Q10, Q11, Q12, Q13, Q14, Q17, Q20 are all set to `applicable: false`, `score: null`, and excluded from the denominator. - + 4. **Content Accuracy Assessment** - - **Ethics Claims Plausibility:** Do Licensing & Governance and Data Protection & Compliance sections align with Human Subjects section and overall project scope? - - **Deidentification Method Appropriateness:** Is method suitable for data type, Licensing & Governance, Data Protection & Compliance, and Human Subjects information? 
+ - **Ethics Claims Plausibility:** Do `license_and_use_terms`, `ip_restrictions`, `data_protection_impacts`, and `participant_privacy.reidentification_risk` align with `human_subject_research`, `informed_consent`, and `participant_privacy` in scope and restrictiveness? + - **Deidentification Method Appropriateness:** Is method suitable for data type given `license_and_use_terms`, `data_protection_impacts`, `participant_privacy.reidentification_risk`, and `human_subject_research` values? - **Funding Pattern Matching:** Do grant numbers follow expected patterns? - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)? - **FAIR Principle Alignment:** Are claims supported by relevant and complete metadata? @@ -166,7 +167,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **3:** 2–3 file types - **5:** >3 file types -**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated in `description`, `purposes`, or `keywords`. +**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated `description`, `purposes`, or `keywords`. --- @@ -179,7 +180,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** Numeric file size or instance count found - **Fail:** No file size/instance metadata -**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables an estimate of the file size. +**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables and estimate of the file size. 
--- @@ -215,12 +216,12 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th #### Question 8: Ethical and Privacy Declarations **Description:** Comprehensive ethics coverage including IRB approval, deidentification, privacy protections, informed consent, participant compensation, and vulnerable population safeguards. -**Fields:** `ethical_reviews`, `human_subject_research`, `is_deidentified`, `participant_privacy`, `participant_compensation`, `vulnerable_populations`, `informed_consent` +**Fields:** `ethical_reviews`, `human_subject_research`, `is_deidentified`, `participant_privacy`, `participant_privacy.reidentification_risk`, `participant_compensation`, `at_risk_populations`, `informed_consent`, `data_protection_impacts`, `regulatory_restrictions.hipaa_compliant`, `regulatory_restrictions.other_compliance`, `regulatory_restrictions.governance_committee_contact` **Scoring (numeric 0-5):** - **0:** No ethics fields present - **3:** Basic ethics (IRB + deidentification) -- **5:** Comprehensive (all human subjects protections documented) +- **5:** Comprehensive (all human subjects protections and data protection impacts documented) **Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas @@ -231,7 +232,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th #### Question 9: Access Requirements and Governance Documentation **Description:** Whether access policy, license, IP restrictions, regulatory restrictions, and confidentiality level are clearly defined. 
-**Fields:** `license_and_use_terms`, `ip_restrictions`, `regulatory_restrictions`, `confidentiality_level` +**Fields:** `license_and_use_terms`, `ip_restrictions`, `regulatory_restrictions`, `confidentiality_level`, `data_protection_impacts`, `regulatory_restrictions.governance_committee_contact` **Scoring (numeric 0-5):** - **0:** No license or access info @@ -263,9 +264,9 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th ### Category 3: Technical Documentation (Questions 11-15) #### Question 11: Tool and Software Transparency -**Description:** Mentions of preprocessing, cleaning, and labeling strategies with software tools used in data preparation. +**Description:** Mentions of preprocessing, cleaning, and labeling strategies with software tools used in data preparation, including annotation quality, imputation, and missing data documentation. -**Fields:** `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `software_and_tools` +**Fields:** `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `software_and_tools`, `annotation_analyses`, `machine_annotation_tools`, `imputation_protocols`, `missing_data_documentation` **Scoring (numeric 0-5):** - **0:** No software tools documented @@ -281,7 +282,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th #### Question 12: Collection Protocol Clarity **Description:** Description completeness of data collection mechanisms, acquisition methods, data collectors, and collection timeframes. 
-**Fields:** `acquisition_methods`, `collection_mechanisms`, `data_collectors`, `collection_timeframes` +**Fields:** `acquisition_methods`, `collection_mechanisms`, `data_collectors`, `collection_timeframes`, `raw_data_sources` **Scoring (numeric 0-5):** - **0:** No collection description @@ -329,7 +330,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th #### Question 15: Human Subject Representation **Description:** Inclusion of human subjects, demographic diversity, or subgroup details. -**Fields:** `instances`, `subpopulations` +**Fields:** `instances`, `subpopulations`, `DataSubset.is_data_split`, `DataSubset.is_subpopulation` **Scoring (numeric 0-5):** - **0:** No human subject information @@ -353,7 +354,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** At least one working external URL present - **Fail:** No external links found -**Assessment:** Verify presence of persistent URLs. +**Assessment:** Verify presence of persistent URLs. --- From 1328f09f186915c54253283b266e128838ed58f0 Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 12:08:52 -0700 Subject: [PATCH 22/24] rubric20 - Address circular applies to restrictions and specificity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Q8 and Q15 now check description, keywords, and collection_mechanisms for human subjects evidence — none of those fields belong to Q8 or Q15. Q9's gate was removed entirely: every dataset must document its access and license terms, so absence of that documentation always scores 0 rather than triggering an exclusion. Q11 now checks external_resources and description/purposes for software-as-output evidence rather than its own software_and_tools field. 
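Illustration only (not part of the commit): the non-circularity described above can be stated as a set check, in which the fields that decide whether a question applies must not overlap with the fields that question scores. The sketch below is a rough aid under stated assumptions. The gate and scoring field lists are transcribed from this patch series for Q8, Q11, and Q15, dotted names are compared by their top-level component, and `top_level` and `circular_gates` are hypothetical helper names.

```python
from __future__ import annotations

# Fields used to decide applicability (the "Applies to" evidence).
GATE_FIELDS: dict[int, set[str]] = {
    8: {"description", "keywords", "collection_mechanisms"},
    11: {"external_resources", "description", "purposes"},
    15: {"description", "keywords", "collection_mechanisms"},
}

# Fields each question scores (its own **Fields:** list).
OWN_FIELDS: dict[int, set[str]] = {
    8: {
        "ethical_reviews", "human_subject_research", "is_deidentified",
        "participant_privacy", "participant_privacy.reidentification_risk",
        "participant_compensation", "at_risk_populations", "informed_consent",
        "data_protection_impacts", "regulatory_restrictions.hipaa_compliant",
        "regulatory_restrictions.other_compliance",
        "regulatory_restrictions.governance_committee_contact",
    },
    11: {
        "preprocessing_strategies", "cleaning_strategies", "labeling_strategies",
        "software_and_tools", "annotation_analyses", "machine_annotation_tools",
        "imputation_protocols", "missing_data_documentation",
    },
    15: {
        "instances", "subpopulations",
        "DataSubset.is_data_split", "DataSubset.is_subpopulation",
    },
}


def top_level(field: str) -> str:
    """Reduce a dotted field path to its top-level name."""
    return field.split(".", 1)[0]


def circular_gates(
    gates: dict[int, set[str]], own: dict[int, set[str]]
) -> dict[int, set[str]]:
    """Map each question to the gate fields that it also scores (circular ones)."""
    overlaps: dict[int, set[str]] = {}
    for question, gate in gates.items():
        shared = {top_level(f) for f in gate} & {top_level(f) for f in own.get(question, set())}
        if shared:
            overlaps[question] = shared
    return overlaps
```

With the lists above, `circular_gates(GATE_FIELDS, OWN_FIELDS)` returns an empty mapping. The previous Q11 gate, which relied on `software_and_tools`, would be reported as circular. Q9 is omitted because it no longer has a gate.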
--- .claude/agents/d4d-rubric20-semantic.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index 3239bc70..5ad2bcf4 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -89,14 +89,15 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th | Condition | Satisfied when… | Gates | |---|---|---| - | Human subjects / governance | `human_subject_research.involves_human_subjects=True` OR governance/regulatory restrictions mentioned anywhere in the datasheet | Q8, Q9, Q15 | + | Human subjects | `description` or `keywords` reference human participants, patients, or clinical research, OR `collection_mechanisms` describes human participant recruitment — checked via Q1/Q2/Q3/Q12 fields only, never Q8's own fields | Q8, Q15 | | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Q10, Q17, Q20 | - | Software tools shared & available for reuse | `software_and_tools` lists at least one tool AND an access path (URL, repo, or distribution) exists | Q11 | + | Software tools produced as dataset output | `external_resources` references a code repository, OR `description`/`purposes` explicitly identifies software production as a dataset output — checked via Q2/Q7/Q14 fields only, never Q11's own fields | Q11 | | Data collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`, `data_collectors`, or `collection_timeframes`) AND the datasets shared condition above is met | Q12, Q13 | | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Q14 | - **Step 2 — Apply the N/A encoding convention:** If a condition is 
not met, set `applicable: false` and `score: null` for every question it gates. Do not emit `0`. Subtract the question's `max_score` from the denominator per the N/A Question Convention in the Scoring Summary section. - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared. + - **Anti-circular rule:** A question's own scoring fields may not be the sole basis for excluding it. If the only reason to set `applicable: false` is the absence of the question's own fields, treat the question as `applicable: true` and score accordingly (receiving 0 if those fields are absent). Applicability must be evidenced by fields belonging to a *different* question. Emit `applicability_status` and `applicability_evidence` before scoring every conditional question to make this determination explicit and auditable. - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Q10, Q17, Q20 are applicable and scored. - EXAMPLE (applicable + reported, not scored): `human_subject_research.involves_human_subjects=True` but the datasheet is a core/instrument-only record with no ethics fields populated → Q8 and Q9 are reported (flag the gap) but the condition is met so they remain applicable and receive a low score, not N/A. - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Q10, Q11, Q12, Q13, Q14, Q17, Q20 are all set to `applicable: false`, `score: null`, and excluded from the denominator. 
@@ -225,7 +226,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas -**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere. +**Applies to:** Always report results of this question, but only score if `description`, `keywords`, or `collection_mechanisms` (from Q1–Q3, Q12) contain evidence of human participants, patients, or clinical research. Do not use Q8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- @@ -241,7 +242,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate clarity and completeness of governance and terms of use documentation. -**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere. +**Applies to:** Always applicable. Every dataset must document its access and license terms; absence of this documentation scores 0, not N/A. --- @@ -275,7 +276,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Look for strategy documentation and software names, versions, and links. -**Applies to:** Always report results of this question, but only score if software tools were identified elsewhere as shared and available for reuse. +**Applies to:** Always report results of this question, but only score if `external_resources` (from Q14) references a code repository, OR `description` or `purposes` (from Q2, Q7) explicitly identifies software production as a dataset output. Do not use Q11's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 
--- @@ -339,7 +340,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations. -**Applies to:** Always report results of this question, but only score if human subjects or governance restrictions are identified elsewhere. +**Applies to:** Always report results of this question, but only score if `description`, `keywords`, or `collection_mechanisms` (from Q1–Q3, Q12) contain evidence of human participants, patients, or clinical research. Do not use Q15's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- @@ -485,6 +486,8 @@ Return your evaluation as a **JSON object** with this EXACT structure: "id": 1, "name": "Field Completeness", "applicable": "true", + "applicability_status": "always_applicable", + "applicability_evidence": "", "description": "Proportion of mandatory schema fields populated", "score_type": "numeric", "score": 5, @@ -508,6 +511,8 @@ Return your evaluation as a **JSON object** with this EXACT structure: "id": 11, "name": "Tool and Software Transparency", "applicable": "false", + "applicability_status": "not_applicable", + "applicability_evidence": "external_resources contains no code repository URLs; description and purposes contain no reference to software production as a dataset output", "score_type": "numeric", "score": null, "max_score": 5, From d40024a228c9748aa3c8ba19a5140cde0a91b06a Mon Sep 17 00:00:00 2001 From: Orion Banks <49208907+Bankso@users.noreply.github.com> Date: Tue, 16 Jun 2026 12:10:34 -0700 Subject: [PATCH 23/24] rubric10 - address circular dependencies and applies to specificity MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All five E4 sub-elements and both E8 sub-elements 3–4 had their gate text replaced with the same non-circular predicates. 
For E4, the human subjects check now uses E1 and E8 fields; the governance check now uses E2 fields — both external to E4. The trigger table was also split into separate rows for "human subjects" and "governance restrictions" to make the non-circular nature of each explicit. For E8 sub-elements 3–4, applicability is now determined by E10 and E1/E7 fields rather than E8's own software_and_tools.
---
 .claude/agents/d4d-rubric10-semantic.md | 29 +++++++++++++++----------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md
index 9fd84d75..0fca0d0a 100644
--- a/.claude/agents/d4d-rubric10-semantic.md
+++ b/.claude/agents/d4d-rubric10-semantic.md
@@ -83,14 +83,16 @@ Score **0** (absent/fail) if:
 
    | Condition | Satisfied when… | Gates |
    |---|---|---|
-   | Human subjects / governance | `human_subject_research.involves_human_subjects=True` OR governance/regulatory restrictions mentioned anywhere in the datasheet | Element 4 (all 5 sub-elements) |
+   | Human subjects | `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment — never E4's own fields | Element 4 (all 5 sub-elements) |
+   | Governance restrictions | `regulatory_restrictions` or `confidentiality_level` (from E2) indicate governance constraints — E2 fields, not E4 fields, so non-circular | Element 4 (all 5 sub-elements) |
    | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Element 3 sub-elements 1–4, Element 6 (all), Element 8 (all), Element 10 (all) |
-   | Software tools shared & available for reuse | `software_and_tools` lists at least one tool AND an access path (URL, repo, or distribution) exists | Element 8 sub-elements 3–4 |
+   | Software tools produced as dataset output | `external_resources`
(from E10) references a code repository, OR `description`/`purposes` (from E1/E7) explicitly identifies software production as a dataset output — never E8's own fields | Element 8 sub-elements 3–4 |
    | Data collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`) AND the datasets shared condition above is met | Element 8 sub-elements 1–2 |
    | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Element 10 sub-element 2 |
 
    - **Step 2 — Apply the N/A encoding convention:** If a condition is not met, set `applicable: false` and `score: null` for every sub-element it gates. Do not emit `0`. Subtract 1 from the denominator per excluded sub-element per the N/A Sub-Element Convention above.
    - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared.
+   - **Anti-circular rule:** A sub-element's own scoring fields may not be the sole basis for excluding it. If the only reason to set `applicable: false` is the absence of the sub-element's own fields, treat it as `applicable: true` and score accordingly (receiving 0 if those fields are absent). Applicability must be evidenced by fields belonging to a *different* element. Emit `applicability_status` and `applicability_evidence` before scoring every conditional sub-element to make this determination explicit and auditable.
    - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Element 6, 8, and 10 sub-elements are applicable and scored.
    - EXAMPLE (applicable + scored low): `human_subject_research.involves_human_subjects=True` but no IRB fields populated → Element 4 sub-elements are applicable (condition met) and receive a score of 0, flagged as a consistency gap.
    - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Element 3 sub-elements 1–4, all of Element 6, all of Element 8, and all of Element 10 are set to `applicable: false`, `score: null`, and excluded from the denominator.
@@ -224,27 +226,27 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica
    - Fields: `ethical_reviews`, `human_subject_research`, `data_protection_impacts`, `regulatory_restrictions.governance_committee_contact`
    - Look for: IRB approval details, institutional oversight, ethics review boards, data protection impact assessments (DPIAs), governance committee contacts
    - **Semantic Check:** If `human_subject_research.involves_human_subjects=True`, this MUST be populated
-   - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere.
+   - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 2. **Deidentification Method Described**
    - Fields: `is_deidentified`
    - Look for: Specific deidentification method (HIPAA Safe Harbor, Expert Determination, k-anonymity)
-   - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere.
+   - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 3. **Privacy Protections and Re-identification Risk Assessment**
    - Fields: `participant_privacy`, `participant_privacy.reidentification_risk`
    - Look for: Privacy protections, anonymization procedures, explicit re-identification risk assessment and mitigation measures
-   - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere.
+   - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 4. **Informed Consent Obtained from Participants**
    - Fields: `informed_consent`
    - Look for: Consent procedures, consent type (written, verbal), withdrawal mechanisms
-   - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere.
+   - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 5. **Vulnerable Populations and Compensation Documented**
    - Fields: `at_risk_populations`, `participant_compensation`
    - Look for: Protections for at-risk populations, compensation details
-   - **Applies to:** Always report results of this sub-element, but only score if human subjects or governance restrictions are identified elsewhere.
+   - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 ---
 
@@ -352,12 +354,12 @@ Report the count of non-applicable sub-elements in the `sub_elements_not_applica
 3. **Preprocessing, Cleaning, Labeling, and Annotation Quality**
    - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `annotation_analyses`, `machine_annotation_tools`, `imputation_protocols`
    - Look for: Preprocessing pipeline, cleaning steps, labeling methods, annotation quality analyses, machine annotation tools, imputation protocols for missing values
-   - **Applies to:** Always report results of this sub-element, but only score if software tools are identified elsewhere as shared and available for reuse.
+   - **Applies to:** Always report results of this sub-element, but only score if `external_resources` (from E10) references a code repository, OR `description` or `purposes` (from E1/E7) explicitly identifies software production as a dataset output. Do not use E8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 4. **Software and Tools Documented**
    - Fields: `software_and_tools`
    - Look for: Software names, versions, processing tools, GitHub repos
-   - **Applies to:** Always report results of this sub-element, but only score if software tools are identified elsewhere as shared and available for reuse.
+   - **Applies to:** Always report results of this sub-element, but only score if `external_resources` (from E10) references a code repository, OR `description` or `purposes` (from E1/E7) explicitly identifies software production as a dataset output. Do not use E8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring.
 
 5.
**External Standards and Resources Referenced**
    - Fields: `external_resources`, `conforms_to`
@@ -530,6 +532,8 @@ Return your evaluation as a **JSON object** with this EXACT structure:
         {
           "name": "IRB or Ethics Review Documented",
           "applicable": true,
+          "applicability_status": "applicable",
+          "applicability_evidence": "description contains 'voice recordings from participants'; keywords include 'clinical trial'",
           "score": 1,
           "evidence": "ethical_reviews: IRB approval from 5 institutions documented",
           "quality_note": "Human subjects confirmed; IRB details present"
@@ -537,9 +541,11 @@ Return your evaluation as a **JSON object** with this EXACT structure:
         {
           "name": "Informed Consent Obtained from Participants",
           "applicable": false,
+          "applicability_status": "not_applicable",
+          "applicability_evidence": "description and keywords contain no clinical/patient/participant terms; collection_mechanisms absent; regulatory_restrictions and confidentiality_level (E2) not populated",
           "score": null,
-          "evidence": "human_subject_research.involves_human_subjects not present in this datasheet",
-          "quality_note": "Excluded from denominator: human subjects / governance condition not met"
+          "evidence": "No human subject evidence found in E1, E2, or E8 fields",
+          "quality_note": "Excluded from denominator: human subjects and governance conditions not met via external fields"
         }
       ],
       "element_score": 1,
@@ -828,4 +834,3 @@ See `notes/RUBRIC_AGENT_USAGE.md` for comprehensive usage examples.
 - **Complement, Not Replace:** This LLM-based evaluation complements the existing field-presence detection in `src/evaluation/evaluate_d4d.py`
 - **Cost:** ~$0.10-0.30 per file evaluation via Anthropic API
 - **Time:** ~30-60 seconds per file (slower than presence detection but provides deeper insights)
-
From ed2faf2f5e6e2455b00fe38a3ee5d793dc0897ed Mon Sep 17 00:00:00 2001
From: Orion Banks <49208907+Bankso@users.noreply.github.com>
Date: Tue, 16 Jun 2026 12:20:37 -0700
Subject: [PATCH 24/24] Specify evidence for program of origin

Encourages deterministic behavior
---
 .claude/agents/d4d-rubric10-semantic.md | 2 +-
 .claude/agents/d4d-rubric20-semantic.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md
index 0fca0d0a..8ade9272 100644
--- a/.claude/agents/d4d-rubric10-semantic.md
+++ b/.claude/agents/d4d-rubric10-semantic.md
@@ -54,7 +54,7 @@ Score **0** (absent/fail) if:
 
 1. **Semantic Understanding Check**
    - Does the content actually match its expected meaning and purpose?
-   - Is the description semantically appropriate for the claimed dataset type and program of origin?
+   - Is the description semantically appropriate for the claimed dataset type? If program context is relevant, infer it only from quoted values in `keywords`, `publisher`, or `funders` — never from the filename, invocation context, or prior knowledge.
    - Are technical terms used correctly and consistently?
 
 2. **Correctness Validation**
diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md
index 5ad2bcf4..37f3d1d6 100644
--- a/.claude/agents/d4d-rubric20-semantic.md
+++ b/.claude/agents/d4d-rubric20-semantic.md
@@ -57,7 +57,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th
 
 1. **Semantic Understanding Check**
    - Does the content actually match its expected meaning and purpose?
-   - Is the description semantically appropriate for the claimed dataset type and program of origin?
+   - Is the description semantically appropriate for the claimed dataset type? If program context is relevant, infer it only from quoted values in `keywords`, `publisher`, or `funders` — never from the filename, invocation context, or prior knowledge.
    - Are technical terms used correctly and consistently?
 
 2. **Correctness Validation**