diff --git a/.claude/agents/d4d-rubric10-semantic.md b/.claude/agents/d4d-rubric10-semantic.md index 2a5b0dac..8ade9272 100644 --- a/.claude/agents/d4d-rubric10-semantic.md +++ b/.claude/agents/d4d-rubric10-semantic.md @@ -29,24 +29,24 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th ### Scoring Standards A sub-element scores **1** (present/pass) ONLY if: -- ✅ The field exists in the D4D file AND is non-empty -- ✅ Contains **meaningful, non-trivial content** (not just boilerplate) -- ✅ Provides **actionable information** to dataset users -- ✅ Is **complete enough** to support the sub-element's stated purpose +- The field exists in the D4D file AND is non-empty +- Contains **meaningful, non-trivial content** (not just boilerplate) +- Provides **actionable information** to dataset users +- Is **complete enough** to support the sub-element's stated purpose Score **0** (absent/fail) if: -- ❌ Field is missing, null, or empty -- ❌ Content is generic, boilerplate, or placeholder text -- ❌ Information is incomplete, vague, or too high-level -- ❌ Does not meaningfully address the sub-element's intent +- Field is missing, null, or empty +- Content is generic, boilerplate, or placeholder text +- Information is incomplete, vague, or does not address the purpose of the D4D, element, or sub-element +- Does not meaningfully address the sub-element's intent ### Quality vs. Presence **This is NOT simple field-presence detection.** You must assess the **quality and usefulness** of the content: -- ✅ **Good:** "Participants recruited from 5 specialty clinics across North America (MGH, UF, UT Health, Tufts, Emory) with IRB approval from each institution." -- ⚠️ **Marginal:** "Data collected from multiple sites." -- ❌ **Poor:** "Collection sites: various" +- **Good:** "Participants recruited from 5 specialty clinics across North America (MGH, UF, UT Health, Tufts, Emory) with IRB approval from each institution." +- **Marginal:** "Data collected from multiple sites." +- **Poor:** "Collection sites: various" ### Semantic Analysis Requirements @@ -54,7 +54,7 @@ Score **0** (absent/fail) if: 1. **Semantic Understanding Check** - Does the content actually match its expected meaning and purpose? - - Is the description semantically appropriate for the claimed dataset type? + - Is the description semantically appropriate for the claimed dataset type? If program context is relevant, infer it only from quoted values in `keywords`, `publisher`, or `funders` — never from the filename, invocation context, or prior knowledge. - Are technical terms used correctly and consistently? 2. **Correctness Validation** @@ -76,13 +76,50 @@ Score **0** (absent/fail) if: - **Funding Logic:** - IF `funders` present → EXPECT `funding_and_acknowledgements.funding.agency` matches - IF funding present → EXPECT `purposes` aligns with funding goals + - **'Applies to' Logic:** + - If an element or sub-element is only meaningful under a specific condition, check that the condition is satisfied before scoring it + - EXAMPLE: IF no human subjects are identified in the datasheet, Element 4 sub-elements are not applicable + - **Step 1 — Resolve all five trigger conditions before scoring any element:** + + | Condition | Satisfied when… | Gates | + |---|---|---| + | Human subjects | `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment — never E4's own fields | Element 4 (all 5 sub-elements) | + | Governance restrictions | `regulatory_restrictions` or `confidentiality_level` (from E2) indicate governance constraints — E2 fields, not E4 fields, so non-circular | Element 4 (all 5 sub-elements) | + | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Element 3 sub-elements 1–4, Element 6 (all), Element 8 (all), Element 10 (all) | + | Software tools produced as dataset output | `external_resources` (from E10) references a code repository, OR `description`/`purposes` (from E1/E7) explicitly identifies software production as a dataset output — never E8's own fields | Element 8 sub-elements 3–4 | + | Data collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`) AND the datasets shared condition above is met | Element 8 sub-elements 1–2 | + | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Element 10 sub-element 2 | + + - **Step 2 — Apply the N/A encoding convention:** If a condition is not met, set `applicable: false` and `score: null` for every sub-element it gates. Do not emit `0`. Subtract 1 from the denominator per excluded sub-element per the N/A Sub-Element Convention above. + - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared. + - **Anti-circular rule:** A sub-element's own scoring fields may not be the sole basis for excluding it. If the only reason to set `applicable: false` is the absence of the sub-element's own fields, treat it as `applicable: true` and score accordingly (receiving 0 if those fields are absent). Applicability must be evidenced by fields belonging to a *different* element. Emit `applicability_status` and `applicability_evidence` before scoring every conditional sub-element to make this determination explicit and auditable. + - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Element 6, 8, and 10 sub-elements are applicable and scored. + - EXAMPLE (applicable + scored low): `human_subject_research.involves_human_subjects=True` but no IRB fields populated → Element 4 sub-elements are applicable (condition met) and receive a score of 0, flagged as a consistency gap. + - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Element 3 sub-elements 1–4, all of Element 6, all of Element 8, and all of Element 10 are set to `applicable: false`, `score: null`, and excluded from the denominator. 4. **Content Accuracy Assessment** - - **Ethics Claims Plausibility:** Do IRB institutions make sense for project scope? - - **Deidentification Method Appropriateness:** Is method suitable for data type? + - **Ethics Claims Plausibility:** Do `license_and_use_terms`, `ip_restrictions`, `data_protection_impacts`, and `participant_privacy.reidentification_risk` align with `human_subject_research`, `informed_consent`, and `participant_privacy` in scope and restrictiveness? + - **Deidentification Method Appropriateness:** Is method suitable for data type given `data_protection_impacts`, `participant_privacy.reidentification_risk`, and `human_subject_research` values? - **Funding Pattern Matching:** Do grant numbers follow expected patterns? - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)? +### N/A Sub-Element Convention + +**Maximum Possible Score:** 50 points (before N/A exclusions; 10 elements × 5 sub-elements × 1 point each) + +Some sub-elements are only applicable under certain conditions (see 'Applies to' Logic in Cross-Field Consistency Checking). When a condition is not met: + +1. **Encoding:** Set `applicable: false` and `score: null` for the sub-element. Do not emit `0` — a zero score penalizes datasheets for which the sub-element is simply irrelevant. + +2. **Denominator rule:** Each excluded sub-element reduces the denominator by 1. + - `excluded_max_points` = count of sub-elements where `applicable: false` + - `adjusted_max_points` = `max_points` − `excluded_max_points` + - `normalized_percentage` = `total_points / adjusted_max_points × 100` + +3. **Batch aggregation:** Apply the same convention in the `EvaluationSummary`. Report `average_excluded_max_points`, `average_adjusted_max_points`, and `average_normalized_percentage` at the overall, method, and project levels so cross-file comparisons remain meaningful even when different datasheets trigger different N/A conditions. + +Report the count of non-applicable sub-elements in the `sub_elements_not_applicable` field of `overall_score`. + **Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. ## Rubric10 Specification @@ -127,8 +164,8 @@ Score **0** (absent/fail) if: - Look for: Clear access policy, IP-based restrictions, or licensing terms 2. **Regulatory Restrictions and Confidentiality Level Specified** - - Fields: `regulatory_restrictions`, `confidentiality_level` - - Look for: Export control restrictions, GDPR compliance, data sensitivity classification + - Fields: `regulatory_restrictions`, `confidentiality_level`, `regulatory_restrictions.hipaa_compliant`, `regulatory_restrictions.other_compliance` + - Look for: Export control restrictions, GDPR compliance, data sensitivity classification, HIPAA compliance status, other regulatory frameworks (CCPA, PIPEDA) 3. **Download URL or Platform Link Available** - Fields: `download_url` @@ -151,18 +188,22 @@ Score **0** (absent/fail) if: 1. **License Terms Allow Reuse** - Fields: `license_and_use_terms` - Look for: Clear license (CC BY, CC BY-NC-SA, etc.) with reuse permissions + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. **Data Formats Are Standardized (encoding, format)** - Fields: `format`, `encoding` - Look for: Use of standard formats (JSON, TSV, Parquet, DICOM, WFDB) and character encoding + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 3. **Schema or Ontology Conformance Stated** - Fields: `conforms_to`, `conforms_to_schema` - Look for: References to schemas (OMOP, FHIR, schema.org, etc.) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. **Variable Metadata with Identifiers Defined** - Fields: `variables` - Look for: Variable-level metadata with identifiers and descriptions + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Use Guidance Provided (intended, prohibited uses)** - Fields: `intended_uses`, `prohibited_uses`, `discouraged_uses` @@ -177,29 +218,35 @@ Score **0** (absent/fail) if: - IF `human_subject_research.involves_human_subjects=True` → EXPECT sub-element 1 (IRB approval) AND sub-element 4 (consent) to score 1 - IF `is_deidentified` present → EXPECT deidentification method described - IF IRB approval documented → EXPECT consent procedures also described +- IF `data_protection_impacts` present → EXPECT `participant_privacy.reidentification_risk` assessed - Flag any inconsistencies in semantic_analysis.issues_detected **Sub-elements:** -1. **IRB or Ethics Review Documented** - - Fields: `ethical_reviews`, `human_subject_research` - - Look for: IRB approval details, institutional oversight, ethics review boards +1. **IRB or Ethics Review and Data Protection Impact** + - Fields: `ethical_reviews`, `human_subject_research`, `data_protection_impacts`, `regulatory_restrictions.governance_committee_contact` + - Look for: IRB approval details, institutional oversight, ethics review boards, data protection impact assessments (DPIAs), governance committee contacts - **Semantic Check:** If `human_subject_research.involves_human_subjects=True`, this MUST be populated + - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 2. **Deidentification Method Described** - Fields: `is_deidentified` - Look for: Specific deidentification method (HIPAA Safe Harbor, Expert Determination, k-anonymity) + - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. -3. **Privacy Protections Beyond Deidentification** - - Fields: `participant_privacy` - - Look for: Privacy protections, anonymization procedures, reidentification risk assessment +3. **Privacy Protections and Re-identification Risk Assessment** + - Fields: `participant_privacy`, `participant_privacy.reidentification_risk` + - Look for: Privacy protections, anonymization procedures, explicit re-identification risk assessment and mitigation measures + - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 4. **Informed Consent Obtained from Participants** - Fields: `informed_consent` - Look for: Consent procedures, consent type (written, verbal), withdrawal mechanisms + - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 5. **Vulnerable Populations and Compensation Documented** - - Fields: `vulnerable_populations`, `participant_compensation` - - Look for: Protections for vulnerable populations, compensation details + - Fields: `at_risk_populations`, `participant_compensation` + - Look for: Protections for at-risk populations, compensation details + - **Applies to:** Always report results of this sub-element, but only score if `description` or `keywords` (from E1) reference human participants, patients, or clinical research, OR `collection_mechanisms` (from E8) describes human participant recruitment, OR `regulatory_restrictions`/`confidentiality_level` (from E2) indicate governance constraints. Do not use E4's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- @@ -208,12 +255,12 @@ Score **0** (absent/fail) if: **Sub-elements:** 1. **Cohort or Subpopulations Characteristics Described** - - Fields: `subpopulations` - - Look for: Demographics, inclusion/exclusion criteria, population characteristics + - Fields: `subpopulations`, `DataSubset.is_subpopulation` + - Look for: Demographics, inclusion/exclusion criteria, population characteristics, subpopulation flags on dataset subsets 2. **Number of Instances or Samples Reported** - - Fields: `instances` - - Look for: Specific counts (e.g., 306 participants, 12,523 recordings) + - Fields: `instances`, `DataSubset.is_data_split` + - Look for: Specific counts (e.g., 306 participants, 12,523 recordings), dataset split flags indicating training/test/validation subsets 3. **Variable-Level Metadata and Tabular Flag** - Fields: `variables`, `is_tabular` @@ -224,8 +271,8 @@ Score **0** (absent/fail) if: - Look for: Disease conditions, phenotypes, topics covered in the dataset 5. **Data Quality Issues and Anomalies Documented** - - Fields: `anomalies`, `sampling_strategies` - - Look for: Known data quality issues, anomalies, sampling methods + - Fields: `anomalies`, `sampling_strategies`, `missing_data_documentation` + - Look for: Known data quality issues, anomalies, sampling methods, missing data patterns and handling strategies --- @@ -236,22 +283,27 @@ Score **0** (absent/fail) if: 1. **Dataset Version Number Provided** - Fields: `version` - Look for: Version number (1.0, 1.1, 2.0.1) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. **Version Access Methods Documented** - Fields: `version_access` - Look for: How to access different versions of the dataset + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 3. **Change Descriptions and Errata Provided** - Fields: `errata`, `updates` - Look for: Errata documentation, update descriptions, change logs + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. **Update Schedule or Frequency Indicated** - Fields: `updates` - Look for: Update schedule, maintenance plan, update frequency + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Provenance and Source Derivation Documented** - - Fields: `was_derived_from`, `release_notes` - - Look for: Source provenance, dataset derivation, release notes + - Fields: `was_derived_from`, `release_notes`, `raw_data_sources` + - Look for: Source provenance, dataset derivation, release notes, raw data sources before preprocessing + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -292,22 +344,27 @@ Score **0** (absent/fail) if: 1. **Collection Mechanisms and Settings Described** - Fields: `collection_mechanisms` - Look for: Collection procedures, settings, timeframes + - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. 2. **Data Acquisition Methods Listed** - - Fields: `acquisition_methods` - - Look for: Instruments, devices, software used for data capture and acquisition + - Fields: `acquisition_methods`, `raw_data_sources` + - Look for: Instruments, devices, software used for data capture and acquisition, raw data sources before preprocessing + - **Applies to:** Always report results of this sub-element, but only score if data collection is identified elsewhere and datasets are shared and available for reuse. -3. **Preprocessing, Cleaning, and Labeling Strategies** - - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies` - - Look for: Preprocessing pipeline, cleaning steps, labeling methods +3. **Preprocessing, Cleaning, Labeling, and Annotation Quality** + - Fields: `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `annotation_analyses`, `machine_annotation_tools`, `imputation_protocols` + - Look for: Preprocessing pipeline, cleaning steps, labeling methods, annotation quality analyses, machine annotation tools, imputation protocols for missing values + - **Applies to:** Always report results of this sub-element, but only score if `external_resources` (from E10) references a code repository, OR `description` or `purposes` (from E1/E7) explicitly identifies software production as a dataset output. Do not use E8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 4. **Software and Tools Documented** - Fields: `software_and_tools` - Look for: Software names, versions, processing tools, GitHub repos + - **Applies to:** Always report results of this sub-element, but only score if `external_resources` (from E10) references a code repository, OR `description` or `purposes` (from E1/E7) explicitly identifies software production as a dataset output. Do not use E8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. 5. **External Standards and Resources Referenced** - Fields: `external_resources`, `conforms_to` - Look for: Published papers, standards documents, external documentation + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -319,9 +376,9 @@ Score **0** (absent/fail) if: - Fields: `known_limitations` - Look for: Explicit limitations section with known issues -2. **Systematic Biases Identified and Described** - - Fields: `known_biases` - - Look for: Discussion of systematic biases, fairness issues, representativeness +2. **Biases Categorized Using Standard Taxonomy (RAI-aligned)** + - Fields: `known_biases`, `future_use_impacts` + - Look for: Structured bias categorization via `BiasTypeEnum` (mapped to AI Ontology), fairness issues, representativeness, anticipated downstream social impacts (`rai:dataSocialImpact`) 3. **Data Anomalies and Quality Issues Noted** - Fields: `anomalies` @@ -344,22 +401,27 @@ Score **0** (absent/fail) if: 1. **Dataset Published on a Recognized Platform** - Fields: `publisher` - Look for: PhysioNet, Dataverse, FAIRhub, Zenodo, institutional repository + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 2. **Citation and DOI for Cross-referencing** - Fields: `citation`, `doi` - Look for: Recommended citation format, DOI for cross-referencing + - **Applies to:** Always report results of this sub-element, but only score if a publication is identified elsewhere and datasets are shared and available for reuse. 3. **Community Standards or Schema Conformance** - Fields: `conforms_to` - Look for: OMOP, FHIR, schema.org, Dublin Core, other community standards + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 4. **Outreach Materials and Documentation Links** - Fields: `external_resources`, `page` - Look for: Webinars, tutorials, documentation links, landing pages + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. 5. **Related Datasets with Typed Relationships** - Fields: `related_datasets` - Look for: Related datasets with relationship types (supplements, derives from, is version of) + - **Applies to:** Always report results of this sub-element, but only score if datasets are identified elsewhere as shared and available for reuse. --- @@ -417,7 +479,10 @@ Return your evaluation as a **JSON object** with this EXACT structure: "overall_score": { "total_points": 38.5, "max_points": 50, - "percentage": 77.0 + "excluded_max_points": 0, + "adjusted_max_points": 50, + "normalized_percentage": 77.0, + "sub_elements_not_applicable": 0 }, "elements": [ { @@ -459,7 +524,34 @@ Return your evaluation as a **JSON object** with this EXACT structure: "element_score": 5, "element_max": 5 }, - ... (repeat for all 10 elements) + { + "id": 4, + "name": "Ethical Use and Privacy Safeguards", + "description": "Does the dataset provide clear information about consent, privacy, and ethical oversight?", + "sub_elements": [ + { + "name": "IRB or Ethics Review Documented", + "applicable": true, + "applicability_status": "applicable", + "applicability_evidence": "description contains 'voice recordings from participants'; keywords include 'clinical trial'", + "score": 1, + "evidence": "ethical_reviews: IRB approval from 5 institutions documented", + "quality_note": "Human subjects confirmed; IRB details present" + }, + { + "name": "Informed Consent Obtained from Participants", + "applicable": false, + "applicability_status": "not_applicable", + "applicability_evidence": "description and keywords contain no clinical/patient/participant terms; collection_mechanisms absent; regulatory_restrictions and confidentiality_level (E2) not populated", + "score": null, + "evidence": "No human subject evidence found in E1, E2, or E8 fields", + "quality_note": "Excluded from denominator: human subjects and governance conditions not met via external fields" + } + ], + "element_score": 1, + "element_max": 1 + }, + "... (repeat for all 10 elements)" ], "assessment": { "strengths": [ @@ -511,7 +603,9 @@ evaluation_date: "" overall_performance: average_score: 35.2 max_score: 50 - average_percentage: 70.4 + average_excluded_max_points: 4.2 + average_adjusted_max_points: 45.8 + average_normalized_percentage: 76.9 best_score: 42.0 worst_score: 28.0 best_performer: @@ -519,36 +613,48 @@ overall_performance: method: claudecode_agent project: AI_READI score: 42.0 - percentage: 84.0 + excluded_max_points: 0 + adjusted_max_points: 50 + normalized_percentage: 84.0 worst_performer: file: CHORUS_d4d.yaml method: gpt5 project: CHORUS score: 28.0 - percentage: 56.0 + excluded_max_points: 5 + adjusted_max_points: 45 + normalized_percentage: 62.2 method_comparison: - method: claudecode_agent file_count: 4 average_score: 37.5 - average_percentage: 75.0 + average_excluded_max_points: 3.0 + average_adjusted_max_points: 47.0 + average_normalized_percentage: 79.8 rank: 1 - method: claudecode_assistant file_count: 4 average_score: 32.8 - average_percentage: 65.6 + average_excluded_max_points: 5.5 + average_adjusted_max_points: 44.5 + average_normalized_percentage: 73.7 rank: 2 project_comparison: - project: AI_READI file_count: 2 average_score: 39.0 - average_percentage: 78.0 + average_excluded_max_points: 2.0 + average_adjusted_max_points: 48.0 + average_normalized_percentage: 81.3 rank: 1 - project: CM4AI file_count: 2 average_score: 36.5 - average_percentage: 73.0 + average_excluded_max_points: 4.0 + average_adjusted_max_points: 46.0 + average_normalized_percentage: 79.3 rank: 2 element_performance: @@ -556,12 +662,12 @@ element_performance: element_name: "Dataset Discovery and Identification" average_score: 4.2 max_score: 5 - average_percentage: 84.0 + average_normalized_percentage: 84.0 - element_id: "2" element_name: "Terms of Reuse" average_score: 4.5 max_score: 5 - average_percentage: 90.0 + average_normalized_percentage: 90.0 # ... (10 elements total) common_strengths: @@ -636,7 +742,7 @@ semantic_analysis_summary: ### Additional Output Files 1. **CSV Summary:** `all_scores.csv` - - Columns: project, method, file, total_score, percentage, consistency_passed, consistency_failed, issues_detected + - Columns: project, method, file, total_score, excluded_max_points, adjusted_max_points, normalized_percentage, consistency_passed, consistency_failed, issues_detected 2. **Markdown Report:** `summary_report.md` - Executive summary with comparison tables diff --git a/.claude/agents/d4d-rubric20-semantic.md b/.claude/agents/d4d-rubric20-semantic.md index 1f79c8ef..37f3d1d6 100644 --- a/.claude/agents/d4d-rubric20-semantic.md +++ b/.claude/agents/d4d-rubric20-semantic.md @@ -13,17 +13,17 @@ color: purple # D4D Rubric20 Semantic Evaluator -You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. +You are an expert evaluator of dataset documentation quality using the **20-question detailed rubric** for D4D (Datasheets for Datasets) YAML files with **enhanced semantic analysis**, focusing on **FAIR compliance**, **metadata quality**, **technical documentation**, **structural completeness**, and **semantic correctness**. ## Your Task -Read the provided D4D YAML file and perform a **semantic quality assessment** that goes beyond simple quality checks to include correctness validation, consistency checking, and deep semantic understanding across 20 evaluation questions organized into 4 categories. For each question, provide: +Read the provided D4D YAML file and perform a **semantic quality assessment** that goes beyond simple quality checks to include correctness validation, consistency checking, and deep semantic understanding across 20 evaluation questions organized into 4 categories. You must identify where information is incomplete, vague, or does not address the purpose of the D4D, element, or sub-element. For each question, provide: 1. **Score** - Either numeric (0-5 scale) or pass/fail depending on question type 2. **Score label** - Description of the quality level achieved 3. **Evidence** - Specific quotes or field references from the D4D file 4. **Quality assessment** - Brief explanation of scoring rationale -5. **Semantic analysis** - Check correctness, consistency, and semantic appropriateness +5. **Semantic analysis** - Check correctness, consistency, and semantic relevance to the element or sub-element ## Evaluation Criteria @@ -45,11 +45,11 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **This is NOT simple field-presence detection.** Assess the **quality, completeness, and usefulness** of the content: -- ✅ **Score 5 Example:** "Participants recruited from 5 specialty clinics (MGH: voice disorders, UF: respiratory, UT Health: neurological, Tufts: mood disorders, Emory: cardiac conditions) with full IRB approval (protocols: MGH-2023-001, UF-2023-045). Inclusion: adults 18-85, English-speaking. Exclusion: cognitive impairment, active substance abuse." +- **Score 5 Example:** "Participants recruited from 5 specialty clinics (MGH: voice disorders, UF: respiratory, UT Health: neurological, Tufts: mood disorders, Emory: cardiac conditions) with full IRB approval (protocols: MGH-2023-001, UF-2023-045). Inclusion: adults 18-85, English-speaking. Exclusion: cognitive impairment, active substance abuse." -- ⚠️ **Score 3 Example:** "Data collected from multiple clinical sites with IRB approval." +- **Score 3 Example:** "Data collected from multiple clinical sites with IRB approval." -- ❌ **Score 0 Example:** "Collection sites: various" +- **Score 0 Example:** "Collection sites: various" ### Semantic Analysis Requirements @@ -57,7 +57,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th 1. **Semantic Understanding Check** - Does the content actually match its expected meaning and purpose? - - Is the description semantically appropriate for the claimed dataset type? + - Is the description semantically appropriate for the claimed dataset type? If program context is relevant, infer it only from quoted values in `keywords`, `publisher`, or `funders` — never from the filename, invocation context, or prior knowledge. - Are technical terms used correctly and consistently? 2. **Correctness Validation** @@ -82,15 +82,35 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **FAIR Logic:** - IF DOI present → EXPECT publicly accessible landing page - IF license allows reuse → EXPECT distribution formats specified + - **'Applies to' Logic:** + - If `Applies to` condition is listed, check that relevant information was provided elsewhere + - EXAMPLE: IF shared tools were not described in the document, question 11 is not applicable + - **Step 1 — Resolve all five trigger conditions before scoring any question:** + + | Condition | Satisfied when… | Gates | + |---|---|---| + | Human subjects | `description` or `keywords` reference human participants, patients, or clinical research, OR `collection_mechanisms` describes human participant recruitment — checked via Q1/Q2/Q3/Q12 fields only, never Q8's own fields | Q8, Q15 | + | Datasets shared & available for reuse | `distribution_formats` populated OR `download_url`/`page` links to accessible data OR license explicitly permits reuse | Q10, Q17, Q20 | + | Software tools produced as dataset output | `external_resources` references a code repository, OR `description`/`purposes` explicitly identifies software production as a dataset output — checked via Q2/Q7/Q14 fields only, never Q11's own fields | Q11 | + | Data collection identified AND datasets shared | Collection fields populated (`acquisition_methods`, `collection_mechanisms`, `data_collectors`, or `collection_timeframes`) AND the datasets shared condition above is met | Q12, Q13 | + | Publication identified AND datasets shared | `citation` or `external_resources` includes at least one publication reference AND the datasets shared condition above is met | Q14 | + + - **Step 2 — Apply the N/A encoding convention:** If a condition is not met, set `applicable: false` and `score: null` for every question it gates. Do not emit `0`. Subtract the question's `max_score` from the denominator per the N/A Question Convention in the Scoring Summary section. + - **Ambiguity rule:** When a condition is borderline (e.g., a dataset page exists but access requires approval), default to `applicable: true` and score based on what is documented. This prevents silent N/A inflation on datasets that are partially shared. + - **Anti-circular rule:** A question's own scoring fields may not be the sole basis for excluding it. If the only reason to set `applicable: false` is the absence of the question's own fields, treat the question as `applicable: true` and score accordingly (receiving 0 if those fields are absent). Applicability must be evidenced by fields belonging to a *different* question. Emit `applicability_status` and `applicability_evidence` before scoring every conditional question to make this determination explicit and auditable. + - EXAMPLE (applicable + scored): `distribution_formats` lists Parquet and TSV with a PhysioNet download URL → datasets shared condition is met → Q10, Q17, Q20 are applicable and scored. + - EXAMPLE (applicable + reported, not scored): `human_subject_research.involves_human_subjects=True` but the datasheet is a core/instrument-only record with no ethics fields populated → Q8 and Q9 are reported (flag the gap) but the condition is met so they remain applicable and receive a low score, not N/A. + - EXAMPLE (not applicable): No `distribution_formats`, no accessible URL, license is proprietary/internal-only → datasets shared condition is NOT met → Q10, Q11, Q12, Q13, Q14, Q17, Q20 are all set to `applicable: false`, `score: null`, and excluded from the denominator. 4. **Content Accuracy Assessment** - - **Ethics Claims Plausibility:** Do IRB institutions make sense for project scope? - - **Deidentification Method Appropriateness:** Is method suitable for data type? + - **Ethics Claims Plausibility:** Do `license_and_use_terms`, `ip_restrictions`, `data_protection_impacts`, and `participant_privacy.reidentification_risk` align with `human_subject_research`, `informed_consent`, and `participant_privacy` in scope and restrictiveness? + - **Deidentification Method Appropriateness:** Is method suitable for data type given `license_and_use_terms`, `data_protection_impacts`, `participant_privacy.reidentification_risk`, and `human_subject_research` values? - **Funding Pattern Matching:** Do grant numbers follow expected patterns? - **Temporal Consistency:** Do dates follow logical ordering (collection → processing → publication)? - - **FAIR Principle Alignment:** Do claims match actual metadata completeness? + - **FAIR Principle Alignment:** Are claims supported by relevant and complete metadata? + +**Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. This affects scoring - reduce score if semantic issues detected. Always note where semantic issues impacted scoring. -**Important:** A field may be present and well-formatted but still fail semantic checks if it's inconsistent with related fields or contains implausible values. This affects scoring - reduce score if semantic issues detected. ## Rubric20 Specification @@ -148,7 +168,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **3:** 2–3 file types - **5:** >3 file types -**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety indicates multi-modal data. +**Assessment:** Count unique file formats and media types (TSV, Parquet, JSON, DICOM, etc.). Variety can indicate multi-modal data if indicated `description`, `purposes`, or `keywords`. --- @@ -161,7 +181,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** Numeric file size or instance count found - **Fail:** No file size/instance metadata -**Assessment:** Look for bytes field, instance counts, or sample size documentation. +**Assessment:** Look for bytes field, instance counts, or sample size documentation. Note that sample size only enables and estimate of the file size. --- @@ -197,32 +217,32 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th #### Question 8: Ethical and Privacy Declarations **Description:** Comprehensive ethics coverage including IRB approval, deidentification, privacy protections, informed consent, participant compensation, and vulnerable population safeguards. -**Fields:** `ethical_reviews`, `human_subject_research`, `is_deidentified`, `participant_privacy`, `participant_compensation`, `vulnerable_populations`, `informed_consent` +**Fields:** `ethical_reviews`, `human_subject_research`, `is_deidentified`, `participant_privacy`, `participant_privacy.reidentification_risk`, `participant_compensation`, `at_risk_populations`, `informed_consent`, `data_protection_impacts`, `regulatory_restrictions.hipaa_compliant`, `regulatory_restrictions.other_compliance`, `regulatory_restrictions.governance_committee_contact` **Scoring (numeric 0-5):** - **0:** No ethics fields present - **3:** Basic ethics (IRB + deidentification) -- **5:** Comprehensive (all human subjects protections documented) +- **5:** Comprehensive (all human subjects protections and data protection impacts documented) -**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas. +**Assessment:** Evaluate comprehensiveness of ethical documentation across all protection areas -**Applies to:** Bridge2AI-Voice, AI-READI +**Applies to:** Always report results of this question, but only score if `description`, `keywords`, or `collection_mechanisms` (from Q1–Q3, Q12) contain evidence of human participants, patients, or clinical research. Do not use Q8's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- #### Question 9: Access Requirements and Governance Documentation **Description:** Whether access policy, license, IP restrictions, regulatory restrictions, and confidentiality level are clearly defined. -**Fields:** `license_and_use_terms`, `ip_restrictions`, `regulatory_restrictions`, `confidentiality_level` +**Fields:** `license_and_use_terms`, `ip_restrictions`, `regulatory_restrictions`, `confidentiality_level`, `data_protection_impacts`, `regulatory_restrictions.governance_committee_contact` **Scoring (numeric 0-5):** - **0:** No license or access info - **3:** License only - **5:** License + restrictions + confidentiality classification -**Assessment:** Evaluate clarity and completeness of governance and access documentation. +**Assessment:** Evaluate clarity and completeness of governance and terms of use documentation. -**Applies to:** Bridge2AI-Voice, Dataverse +**Applies to:** Always applicable. Every dataset must document its access and license terms; absence of this documentation scores 0, not N/A. --- @@ -238,16 +258,16 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Check for standard formats (Parquet, TSV, OMOP, FHIR, DICOM), encoding, and schema conformance references. -**Applies to:** Bridge2AI-Voice, Health Nexus +**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse. --- ### Category 3: Technical Documentation (Questions 11-15) #### Question 11: Tool and Software Transparency -**Description:** Mentions of preprocessing, cleaning, and labeling strategies with software tools used in data preparation. +**Description:** Mentions of preprocessing, cleaning, and labeling strategies with software tools used in data preparation, including annotation quality, imputation, and missing data documentation. -**Fields:** `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `software_and_tools` +**Fields:** `preprocessing_strategies`, `cleaning_strategies`, `labeling_strategies`, `software_and_tools`, `annotation_analyses`, `machine_annotation_tools`, `imputation_protocols`, `missing_data_documentation` **Scoring (numeric 0-5):** - **0:** No software tools documented @@ -256,14 +276,14 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Look for strategy documentation and software names, versions, and links. -**Applies to:** Bridge2AI-Voice +**Applies to:** Always report results of this question, but only score if `external_resources` (from Q14) references a code repository, OR `description` or `purposes` (from Q2, Q7) explicitly identifies software production as a dataset output. Do not use Q11's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- #### Question 12: Collection Protocol Clarity **Description:** Description completeness of data collection mechanisms, acquisition methods, data collectors, and collection timeframes. -**Fields:** `acquisition_methods`, `collection_mechanisms`, `data_collectors`, `collection_timeframes` +**Fields:** `acquisition_methods`, `collection_mechanisms`, `data_collectors`, `collection_timeframes`, `raw_data_sources` **Scoring (numeric 0-5):** - **0:** No collection description @@ -272,7 +292,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate detail level and completeness of collection protocol documentation. -**Applies to:** Bridge2AI-Voice, AI-READI +**Applies to:** Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse. --- @@ -288,7 +308,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate completeness of version tracking infrastructure. -**Applies to:** Bridge2AI-Voice, Dataverse +**Applies to:** Always report results of this question, but only score if data collection was identified elsewhere and datasets were shared and available for reuse. --- @@ -304,14 +324,14 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Count publications, external resources, and check for formal dataset citation. -**Applies to:** Bridge2AI-Voice, AI-READI +**Applies to:** Always report results of this question, but only score if publication was identified elsewhere and datasets were shared and available for reuse. --- #### Question 15: Human Subject Representation **Description:** Inclusion of human subjects, demographic diversity, or subgroup details. -**Fields:** `instances`, `subpopulations` +**Fields:** `instances`, `subpopulations`, `DataSubset.is_data_split`, `DataSubset.is_subpopulation` **Scoring (numeric 0-5):** - **0:** No human subject information @@ -320,7 +340,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate demographic detail and population characterization through instances and subpopulations. -**Applies to:** Bridge2AI-Voice, AI-READI +**Applies to:** Always report results of this question, but only score if `description`, `keywords`, or `collection_mechanisms` (from Q1–Q3, Q12) contain evidence of human participants, patients, or clinical research. Do not use Q15's own fields as the applicability signal. Emit `applicability_status` and `applicability_evidence` before scoring. --- @@ -335,7 +355,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th - **Pass:** At least one working external URL present - **Fail:** No external links found -**Assessment:** Verify presence of persistent URLs. +**Assessment:** Verify presence of persistent URLs. --- @@ -351,7 +371,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Evaluate clarity of access instructions through distribution formats and licensing. -**Applies to:** Dataverse, PhysioNet +**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse. --- @@ -394,7 +414,7 @@ Read the provided D4D YAML file and perform a **semantic quality assessment** th **Assessment:** Look for external resources linking to related platforms (FAIRhub, PhysioNet, GitHub, etc.). -**Applies to:** Health Nexus, PhysioNet, FAIRhub +**Applies to:** Always report results of this question, but only score if datasets were identified elsewhere as shared and available for reuse. --- @@ -453,7 +473,10 @@ Return your evaluation as a **JSON object** with this EXACT structure: "overall_score": { "total_points": 72.5, "max_points": 84, - "percentage": 86.3 + "excluded_max_points": 0, + "adjusted_max_points": 84, + "normalized_percentage": 86.3, + "questions_not_applicable": 0 }, "categories": [ { @@ -462,6 +485,9 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "id": 1, "name": "Field Completeness", + "applicable": "true", + "applicability_status": "always_applicable", + "applicability_evidence": "", "description": "Proportion of mandatory schema fields populated", "score_type": "numeric", "score": 5, @@ -473,6 +499,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "id": 2, "name": "Entry Length Adequacy", + "applicable": "true", "score_type": "numeric", "score": 5, "max_score": 5, @@ -480,7 +507,20 @@ Return your evaluation as a **JSON object** with this EXACT structure: "evidence": "description: 420 chars, motivation: N/A", "quality_note": "Description is comprehensive at 420 characters" }, - ... (remaining questions 3-5) + { + "id": 11, + "name": "Tool and Software Transparency", + "applicable": "false", + "applicability_status": "not_applicable", + "applicability_evidence": "external_resources contains no code repository URLs; description and purposes contain no reference to software production as a dataset output", + "score_type": "numeric", + "score": null, + "max_score": 5, + "score_label": "Not applicable", + "evidence": "No shared software tools identified in this datasheet", + "quality_note": "Excluded from denominator per 'Applies to' condition: software tools not shared" + }, + "... (remaining questions 3-5)" ], "category_score": 23, "category_max": 24 @@ -488,7 +528,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "Metadata Quality & Content", "questions": [ - ... (questions 6-10) + "... (questions 6-10)" ], "category_score": 18, "category_max": 22 @@ -496,7 +536,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "Technical Documentation", "questions": [ - ... (questions 11-15) + "... (questions 11-15)" ], "category_score": 19, "category_max": 25 @@ -504,7 +544,7 @@ Return your evaluation as a **JSON object** with this EXACT structure: { "name": "FAIRness & Accessibility", "questions": [ - ... (questions 16-20) + "... (questions 16-20)" ], "category_score": 12.5, "category_max": 13 @@ -560,7 +600,9 @@ evaluation_date: "" overall_performance: average_score: 52.3 max_score: 84 - average_percentage: 62.3 + average_excluded_max_points: 8.5 + average_adjusted_max_points: 75.5 + average_normalized_percentage: 69.3 best_score: 68.0 worst_score: 38.5 best_performer: @@ -568,36 +610,48 @@ overall_performance: method: claudecode_agent project: AI_READI score: 68.0 - percentage: 81.0 + excluded_max_points: 5 + adjusted_max_points: 79 + normalized_percentage: 86.1 worst_performer: file: CHORUS_d4d.yaml method: gpt5 project: CHORUS score: 38.5 - percentage: 45.8 + excluded_max_points: 10 + adjusted_max_points: 74 + normalized_percentage: 52.0 method_comparison: - method: claudecode_agent file_count: 4 average_score: 56.2 - average_percentage: 66.9 + average_excluded_max_points: 7.5 + average_adjusted_max_points: 76.5 + average_normalized_percentage: 73.5 rank: 1 - method: claudecode_assistant file_count: 4 average_score: 48.4 - average_percentage: 57.6 + average_excluded_max_points: 9.5 + average_adjusted_max_points: 74.5 + average_normalized_percentage: 64.9 rank: 2 project_comparison: - project: AI_READI file_count: 2 average_score: 61.5 - average_percentage: 73.2 + average_excluded_max_points: 5.0 + average_adjusted_max_points: 79.0 + average_normalized_percentage: 77.8 rank: 1 - project: CM4AI file_count: 2 average_score: 54.8 - average_percentage: 65.2 + average_excluded_max_points: 8.0 + average_adjusted_max_points: 76.0 + average_normalized_percentage: 72.1 rank: 2 category_performance: @@ -605,22 +659,22 @@ category_performance: category_name: "Structural Completeness and Core Metadata" average_score: 15.8 max_score: 24 - average_percentage: 65.8 + average_normalized_percentage: 65.8 - category_id: "2" category_name: "Metadata Quality and Detail" average_score: 14.2 max_score: 22 - average_percentage: 64.5 + average_normalized_percentage: 64.5 - category_id: "3" category_name: "Technical Documentation and Reproducibility" average_score: 12.5 max_score: 25 - average_percentage: 50.0 + average_normalized_percentage: 50.0 - category_id: "4" category_name: "FAIRness and Accessibility" average_score: 9.8 max_score: 13 - average_percentage: 75.4 + average_normalized_percentage: 75.4 common_strengths: - description: "Strong structural completeness with semantically validated fields" @@ -704,7 +758,7 @@ semantic_analysis_summary: ### Additional Output Files 1. **CSV Summary:** `all_scores.csv` - - Columns: project, method, file, total_score, percentage, cat1_score, cat2_score, cat3_score, cat4_score, consistency_passed, consistency_failed, issues_detected + - Columns: project, method, file, total_score, excluded_max_points, adjusted_max_points, normalized_percentage, cat1_score, cat2_score, cat3_score, cat4_score, consistency_passed, consistency_failed, issues_detected 2. **Markdown Report:** `summary_report.md` - Executive summary with scoring tables @@ -717,19 +771,32 @@ semantic_analysis_summary: ## Scoring Summary -**Maximum Possible Score:** 84 points +**Maximum Possible Score:** 84 points (before N/A exclusions) - **Structural Completeness (5 questions):** 24 points max (4 numeric @5 each + 1 pass/fail) - **Metadata Quality & Content (5 questions):** 22 points max (4 numeric @5 each + 1 pass/fail) - **Technical Documentation (5 questions):** 25 points max (5 numeric @5 each) - **FAIRness & Accessibility (5 questions):** 13 points max (3 numeric @5 each + 2 pass/fail) +**N/A Question Convention:** + +1. **Encoding:** Set `applicable: false` and `score: null` for any question whose `Applies to` condition is not met. Do not emit `0` for these questions — a zero score penalizes datasheets for which the question is simply irrelevant. + +2. **Denominator rule:** Subtract the question's `max_score` from `max_points` to compute `adjusted_max_points`. Report `normalized_percentage = total_points / adjusted_max_points × 100`. This is the only percentage reported; it is comparable across datasheets regardless of how many questions are excluded. + - `excluded_max_points` = sum of `max_score` for all questions where `applicable: false` + - `adjusted_max_points` = `max_points` − `excluded_max_points` + - `normalized_percentage` = `total_points / adjusted_max_points × 100` + +3. **Batch aggregation:** Apply the same convention in the `EvaluationSummary`. Report `average_excluded_max_points`, `average_adjusted_max_points`, and `average_normalized_percentage` at the overall, method, and project levels so cross-file comparisons remain meaningful even when different datasheets trigger different N/A conditions (e.g., core vs. full-schema datasheets). + +**NOTE:** Report the count of non-applicable questions in the `questions_not_applicable` field of `overall_score`. + ## Key Principles 1. **Quality over Presence:** Assess content usefulness, not just existence. 2. **Evidence-Based Scoring:** Include specific field values and quotes. -3. **Context-Aware:** Some questions apply only to specific dataset types (see "applies_to" field). +3. **Context-Aware:** Some questions apply only to specific dataset and program types (see "Applies to" field in questions). 4. **Graduated Scoring:** Use the full 0-5 range for numeric questions based on quality levels. @@ -753,9 +820,9 @@ semantic_analysis_summary: **User:** "Run rubric20 assessment on CM4AI D4D files (curated, gpt5, claudecode)" **Agent:** -1. Evaluates each file separately -2. Generates detailed quality assessments -3. Highlights differences in FAIR compliance and technical documentation +1. Evaluates each file separately and generates detailed quality assessments, following the procedure in Example 1 +2. Compare and contrast content and scoring between files +3. Report summary of comparison between files ## How This Agent Works