31 changes: 31 additions & 0 deletions .github/workflows/submission-readiness.yml
@@ -0,0 +1,31 @@
name: submission-readiness

on:
push:
branches: ["main", "checks/**"]
pull_request:
branches: ["main"]

jobs:
lightweight-checks:
runs-on: ubuntu-latest
steps:
- name: Check out repository
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.14"

- name: Install pinned dependencies
run: make setup-lock

- name: Run unit tests
run: make test

- name: Check publication consistency
run: make consistency

- name: Run submission readiness checks
run: make verify-submission
20 changes: 10 additions & 10 deletions MODEL_CARD.md
@@ -6,8 +6,8 @@
|---|---|
| Model name | `v1.3_primary_no_reverse_causality` |
| Model type | Calibrated soft-voting ensemble of CatBoost, XGBoost, and LightGBM |
| Development data | NHANES 2011-2014 adults age 30+ with full periodontal examination, `n=9,379` |
| Same-source temporal validation data | NHANES 2009-2010, `n=5,177` |
| Development data | NHANES 2011-2014 adults age 30+ with full periodontal examination, `n=9,034` |
| Same-source temporal validation data | NHANES 2009-2010, `n=5,037` |
| Outcome | Any CDC/AAP periodontitis versus no periodontitis |
| Primary feature count | 29 predictors |
| Secondary feature count | 33 predictors |
@@ -21,16 +21,16 @@ This model is intended for research benchmarking, methods comparison, and risk-s

| Evaluation | AUC-ROC | PR-AUC | Brier | Notes |
|---|---:|---:|---:|---|
| Internal 5-fold CV, primary 29-feature model | 0.7172 | 0.8157 | 0.1812 | Excludes treatment-seeking variables |
| Internal 5-fold CV, secondary 33-feature model | 0.7255 | 0.8207 | 0.1793 | Includes treatment-seeking variables |
| Same-source temporal validation, frozen primary model | 0.6771 | 0.7735 | 0.2003 | NHANES 2009-2010 |
| Internal 5-fold CV, primary 29-feature model | 0.6896 | 0.8240 | 0.1871 | Excludes treatment-seeking variables |
| Internal 5-fold CV, secondary 33-feature model | 0.6996 | 0.8295 | 0.1844 | Includes treatment-seeking variables |
| Same-source temporal validation, frozen primary model | 0.6495 | 0.7727 | 0.2023 | NHANES 2009-2010 |

Temporal operating points for the frozen primary model:

| Threshold | Sensitivity | Specificity | PPV | NPV | Appropriate interpretation |
|---:|---:|---:|---:|---:|---|
| 0.35 | 97.1% | 18.1% | 70.8% | 75.2% | High-sensitivity triage threshold; many false positives and some false negatives remain |
| 0.65 | 82.6% | 43.3% | 74.9% | 54.9% | More balanced threshold; still not sufficient for diagnosis |
| 0.35 | 98.9% | 5.5% | 70.0% | 69.1% | High-sensitivity triage threshold; many false positives and some false negatives remain |
| 0.65 | 77.7% | 45.2% | 76.0% | 47.5% | More balanced threshold; still not sufficient for diagnosis |
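
The operating points in the table can be recomputed from predicted probabilities and true labels. The following is a minimal sketch, assuming binary labels (1 = periodontitis) and a rule that flags a case when the score is at or above the threshold. The function name `operating_point` and the toy data are illustrative and are not taken from the repository.

```python
def operating_point(y_true, y_score, threshold):
    """Return sensitivity, specificity, PPV, and NPV at one threshold.

    Assumption: a score >= threshold counts as a positive screen.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Guard against empty cells in the confusion matrix.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data for illustration only (not NHANES values).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.7, 0.4, 0.6, 0.2, 0.8, 0.5, 0.3]

low = operating_point(y_true, y_score, 0.35)
high = operating_point(y_true, y_score, 0.65)
```

Raising the threshold trades sensitivity for specificity, which is the pattern the table shows between 0.35 and 0.65.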

## Feature Sets

@@ -58,7 +58,7 @@ The temporal validation cohort is useful because the model is frozen and evaluat

Known applicability limits:

- High analytic-sample prevalence, around 67-68%, limits direct PPV/NPV transfer to lower-prevalence populations.
- High analytic-sample prevalence, around 66-72% depending on cycle and weighting, limits direct PPV/NPV transfer to lower-prevalence populations.
- Missingness indicators may learn survey logistics, so the deployment-ready no-indicator model should be reported as a conservative benchmark.
- Subgroup calibration and discrimination should be regenerated before journal submission using `scripts/04_publication_analyses.py`.
- Any implementation outside NHANES-like research data requires local recalibration and independent safety assessment.
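
The prevalence limit noted in the list can be illustrated with Bayes' rule. The sketch below takes the sensitivity and specificity reported at threshold 0.65 and recomputes PPV and NPV at two prevalence values. The 0.69 figure is an assumed value inside the 66-72% range quoted above, and 0.30 is a hypothetical lower-prevalence setting; neither is a reported result.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Compute PPV and NPV from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Sensitivity and specificity at threshold 0.65 from the temporal table.
SENS, SPEC = 0.777, 0.452

# Assumed prevalence within the 66-72% analytic-sample range.
ppv_high, npv_high = ppv_npv(SENS, SPEC, 0.69)

# Hypothetical lower-prevalence population.
ppv_low, npv_low = ppv_npv(SENS, SPEC, 0.30)

print(f"prevalence 0.69: PPV={ppv_high:.3f}, NPV={npv_high:.3f}")
print(f"prevalence 0.30: PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
```

With sensitivity and specificity held fixed, PPV falls and NPV rises as prevalence drops, so the tabulated predictive values should not be carried over to a lower-prevalence population unchanged.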
@@ -70,8 +70,8 @@ make setup-lock
source venv/bin/activate
make test
make consistency
make reproduce
make temporal
make verify-submission
make reproduce-full
```

The consistency check enforces agreement between result artifacts, README, this model card, and the manuscript source.
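
One way such a check could work is to keep a small set of canonical values and confirm that each one appears in every document. This is a hypothetical sketch, not the implementation of `scripts/check_publication_consistency.py`; the function name `find_missing_values`, the metric keys, and the sample strings are assumptions made for illustration.

```python
def find_missing_values(canonical, documents):
    """Return (document name, metric name) pairs whose canonical value is absent.

    canonical: mapping of metric name to the exact string expected in each file.
    documents: mapping of document name to its full text.
    """
    missing = []
    for doc_name, text in documents.items():
        for metric, value in canonical.items():
            if value not in text:
                missing.append((doc_name, metric))
    return missing


# Canonical strings taken from the updated performance tables.
canonical = {"internal_auc": "0.6896", "temporal_auc": "0.6495"}

# Stand-in document texts; a real check would read the files from disk.
documents = {
    "README.md": "Internal AUC 0.6896, temporal AUC 0.6495",
    "MODEL_CARD.md": "Internal AUC 0.7172, temporal AUC 0.6495",  # stale value
}

problems = find_missing_values(canonical, documents)
for doc_name, metric in problems:
    print(f"{doc_name}: missing canonical value for {metric}")
```

A check of this kind would exit non-zero when `problems` is non-empty, which lets a CI step fail on drift between the documents.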
48 changes: 39 additions & 9 deletions Makefile
@@ -1,4 +1,7 @@
.PHONY: help setup setup-lock download process train reproduce temporal test consistency notebook clean figures lock dirs manuscript
SHELL := /bin/bash
PYTHON ?= ./venv/bin/python

.PHONY: help setup setup-lock download process train reproduce temporal test consistency verify-submission reproduce-full notebook clean figures lock dirs manuscript

help:
@echo "NHANES Periodontitis ML Project - Make Commands"
@@ -21,6 +24,8 @@ help:
@echo ""
@echo "Publication:"
@echo " make consistency - Check result and manuscript consistency"
@echo " make verify-submission - Run lightweight submission-readiness checks"
@echo " make reproduce-full - Run full local reproduction workflow"
@echo " make manuscript - Render PDF manuscript if pandoc is installed"
@echo " make figures - Generate publication figures from saved results"
@echo ""
@@ -44,36 +49,61 @@ setup-lock:

test:
@echo "Running pytest unit tests..."
./venv/bin/python -m pytest tests/ -v --tb=short
$(PYTHON) -m pytest tests/ -v --tb=short
@echo "Tests complete"

consistency:
@echo "Checking publication consistency..."
python3 scripts/check_publication_consistency.py
$(PYTHON) scripts/check_publication_consistency.py
@echo "Publication consistency checks passed"

verify-submission:
@echo "Running submission-readiness checks..."
$(MAKE) test
$(MAKE) consistency
$(PYTHON) scripts/verify_submission.py
$(PYTHON) scripts/05_number_manuscript_lines.py
@echo "Submission-readiness checks complete"

download:
@echo "Downloading NHANES data..."
python3 scripts/01_download_nhanes_data.py
$(PYTHON) scripts/01_download_nhanes_data.py
@echo "Download complete"

process:
@echo "Processing and merging NHANES components..."
python3 scripts/02_process_nhanes_data.py
$(PYTHON) scripts/02_process_nhanes_data.py
@echo "Processing complete"

train:
@echo "Training models..."
python3 scripts/03_train_models.py
$(PYTHON) scripts/03_train_models.py
@echo "Training complete"

reproduce:
@echo "Running primary-model reproduction workflow..."
bash scripts/run_v13_primary.sh
$(PYTHON) scripts/reproduce_v13_primary.py

temporal:
@echo "Running same-source temporal validation workflow..."
bash scripts/run_external_validation.sh
$(PYTHON) scripts/run_temporal_validation.py

reproduce-full:
@mkdir -p logs
@set -euo pipefail; \
LOG="logs/full_reproduction_$$(date -u +%Y%m%dT%H%M%SZ).log"; \
echo "Writing full reproduction log to $$LOG"; \
{ \
$(MAKE) download; \
$(MAKE) process; \
$(MAKE) reproduce; \
$(MAKE) temporal; \
$(PYTHON) scripts/04_publication_analyses.py \
--input data/processed/publication_predictions.parquet \
--feature-cols age bmi waist_cm waist_height height_cm systolic_bp diastolic_bp glucose triglycerides hdl; \
$(MAKE) consistency; \
$(MAKE) verify-submission; \
} 2>&1 | tee "$$LOG"

notebook:
@echo "Launching Jupyter notebook..."
@@ -85,7 +115,7 @@ figures:

manuscript:
@echo "Rendering manuscript if pandoc is installed..."
python3 scripts/05_number_manuscript_lines.py
$(PYTHON) scripts/05_number_manuscript_lines.py
@if command -v pandoc >/dev/null 2>&1; then \
mkdir -p reports; \
pandoc docs/publication/ARTICLE_DRAFT.md \
31 changes: 15 additions & 16 deletions README.md
@@ -4,8 +4,8 @@ This repository contains a reproducible benchmark of low-cost predictors for per

## Current Study Framing

- Development cohort: NHANES 2011-2014 adults age 30+ with full periodontal examination, `n=9,379`.
- Same-source temporal validation cohort: NHANES 2009-2010, `n=5,177`.
- Development cohort: NHANES 2011-2014 adults age 30+ with full periodontal examination, `n=9,034`.
- Same-source temporal validation cohort: NHANES 2009-2010, `n=5,037`.
- Outcome: any periodontitis versus no periodontitis using CDC/AAP case definitions.
- Primary model: calibrated soft-voting ensemble with 29 predictors after excluding treatment-seeking/reverse-causality variables.
- Secondary model: 33 predictors with the treatment-seeking variables restored for upper-bound sensitivity analysis.
@@ -17,18 +17,18 @@ These values are the source-of-truth values enforced by `scripts/check_publicati

| Analysis | Model | Features | AUC-ROC | PR-AUC | Notes |
|---|---:|---:|---:|---:|---|
| Internal 5-fold CV | Primary no reverse-causality | 29 | 0.7172 | 0.8157 | Main development estimate |
| Internal 5-fold CV | Secondary full-feature | 33 | 0.7255 | 0.8207 | Adds dental visit, flossing, loose teeth, and floss-missing flag |
| Same-source temporal validation | Frozen primary model on 2009-2010 | 29 | 0.6771 | 0.7735 | Same survey system, earlier cycle |
| Internal 5-fold CV | Primary no reverse-causality | 29 | 0.6896 | 0.8240 | Main development estimate |
| Internal 5-fold CV | Secondary full-feature | 33 | 0.6996 | 0.8295 | Adds dental visit, flossing, loose teeth, and floss-missing flag |
| Same-source temporal validation | Frozen primary model on 2009-2010 | 29 | 0.6495 | 0.7727 | Same survey system, earlier cycle |

Temporal operating points for the frozen primary model:

| Threshold | Sensitivity | Specificity | PPV | NPV | Interpretation |
|---:|---:|---:|---:|---:|---|
| 0.35 | 97.1% | 18.1% | 70.8% | 75.2% | High-sensitivity triage; negative screens are not definitive |
| 0.65 | 82.6% | 43.3% | 74.9% | 54.9% | More balanced but still requires clinical confirmation |
| 0.35 | 98.9% | 5.5% | 70.0% | 69.1% | High-sensitivity triage; negative screens are not definitive |
| 0.65 | 77.7% | 45.2% | 76.0% | 47.5% | More balanced but still requires clinical confirmation |

The key conclusion is deliberately modest: with these low-cost predictors, discrimination is around 0.72 internally and around 0.68 under same-source temporal validation. The observed performance is useful as a benchmark, not as proof of readiness for clinical implementation.
The key conclusion is deliberately modest: with these low-cost predictors, discrimination is around 0.69 internally and around 0.65 under same-source temporal validation. The observed performance is useful as a benchmark, not as proof of readiness for clinical implementation.

## Reproducibility

@@ -51,18 +51,13 @@ Run lightweight checks that do not require NHANES data:
```bash
make test
make consistency
make verify-submission
```

Run the full workflows after NHANES data are available:
Run the full local reproduction after NHANES data are available:

```bash
make download
make process
make reproduce
make temporal
python3 scripts/04_publication_analyses.py \
--input data/processed/publication_predictions.parquet \
--feature-cols age bmi waist_cm systolic_bp diastolic_bp glucose triglycerides hdl
make reproduce-full
```

The legacy notebooks are retired as source-of-truth artifacts. The maintained publication surface is the script targets, result artifacts, model card, tests, and manuscript source, with consistency checks to prevent silent drift across those files.
@@ -75,7 +70,11 @@ The legacy notebooks are retired as source-of-truth artifacts. The maintained pu
| `src/evaluation.py` | Metrics, threshold selection, calibration, and plotting helpers |
| `src/publication_analysis.py` | Survey-weighted prevalence, subgroup performance, and missingness tables |
| `scripts/check_publication_consistency.py` | Guards canonical values and conservative publication wording |
| `scripts/verify_submission.py` | Runs lightweight submission-readiness gates |
| `scripts/reproduce_v13_primary.py` | Regenerates internal v1.3 benchmark result artifacts |
| `scripts/run_temporal_validation.py` | Regenerates same-source temporal validation artifacts |
| `scripts/04_publication_analyses.py` | Generates publication sensitivity tables from processed predictions |
| `results/publication_sensitivity_tables.md` | Survey-weighted prevalence and subgroup performance summary generated by the full reproduction |
| `results/` | Saved result artifacts used by the manuscript and model card |
| `docs/publication/ARTICLE_DRAFT.md` | Current manuscript source |
