🩺 syntha

A Synthea-inspired hybrid synthetic patient record generator. Learns the joint distribution of real anonymized Turkish-cohort EHR episodes with a Gaussian copula, then layers Synthea-style clinical pathways on top to emit fully-coded FHIR R4 bundles in Turkish.

License: Apache 2.0 · Python 3.10+ · FHIR R4 · Locale: tr-TR


What it is

syntha is a Python library, command-line tool, and signed cross-platform desktop app for generating realistic synthetic patient records — flat CSVs and FHIR R4 transaction Bundles — that match the statistical structure of an anonymized Turkish-cohort EHR while staying physiologically valid and clinically coded.

The pipeline is hybrid:

  1. Gaussian copula fitted on real anonymized episodes — preserves marginal distributions (age, labs, vitals, comorbidity prevalence) and their joint correlation structure.
  2. Physiologic filter — rejects samples that violate pulse-pressure, Friedewald lipid coherence, or eGFR ↔ creatinine constraints.
  3. Synthea-style clinical modules — nine condition-specific state activations that emit Encounters, MedicationRequests (RxNorm-coded), Procedures, and CarePlans matching each patient's comorbidity profile.
  4. FHIR R4 export — Patient + Observation + Condition + Encounter + MedicationRequest + Procedure + CarePlan + DiagnosticReport + RiskAssessment + FamilyMemberHistory, dual-coded LOINC / SNOMED CT / ICD-10 / RxNorm, Turkish locale (names, addresses, language code, display text).
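
Steps 1 and 2 can be illustrated with a minimal, standard-library-only sketch. It is not syntha's implementation: it handles only two continuous columns (no mixed-type correlation or nearest-PSD projection), the toy blood-pressure data is made up, and the function names and the 10 mmHg pulse-pressure floor are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def fit_copula(sbp, dbp):
    """Keep the latent correlation plus the sorted marginals of both columns."""
    rho = pearson(normal_scores(sbp), normal_scores(dbp))
    return {"rho": rho, "sbp": sorted(sbp), "dbp": sorted(dbp)}

def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at quantile u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def sample_copula(model, n, rng):
    """Draw correlated normals, then push them back through the marginals."""
    rho = model["rho"]
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)  # 2x2 Cholesky
        rows.append({
            "sbp": empirical_quantile(model["sbp"], STD_NORMAL.cdf(z1)),
            "dbp": empirical_quantile(model["dbp"], STD_NORMAL.cdf(z2)),
        })
    return rows

def is_physiologic(row, min_pulse_pressure=10):
    """Reject rows whose pulse pressure (SBP - DBP) is implausibly small."""
    return row["sbp"] - row["dbp"] >= min_pulse_pressure

# Made-up "source" data, for illustration only.
rng = random.Random(0)
sbp = [rng.gauss(125, 14) for _ in range(500)]
dbp = [0.5 * s + rng.gauss(18, 6) for s in sbp]

model = fit_copula(sbp, dbp)
samples = sample_copula(model, 200, rng)
kept = [row for row in samples if is_physiologic(row)]
print(f"rho={model['rho']:.2f}, kept {len(kept)} of {len(samples)} samples")
```

Because samples are mapped back through the empirical quantiles, every synthetic value in this sketch is one that occurs in the source column. A production generator would normally smooth or interpolate the marginals instead.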

Desktop app

Download macOS Apple Silicon (.dmg)   Download Windows installer (.exe)   Download Linux AppImage

A Tauri 2 app that bundles the trained Gaussian copula. You pick the cohort, n, seed, and constraints; the app samples synthetic patients fully client-side (no Python at runtime) and downloads a CSV. The macOS DMG is Developer-ID signed, notarized, and stapled, and the Windows installer is code-signed. All three OSes ship a minisign-signed auto-updater, so existing installs get an in-app upgrade banner on next launch.

Install URLs auto-resolve to the latest release via releases/latest/download/… — no per-version link maintenance.

Install

# PyPI
pip install syntha-ehr

# Or from source
git clone https://github.com/ArioMoniri/syntha
cd syntha
pip install -e ".[dev]"

# Or Docker
docker pull ghcr.io/ariomoniri/syntha:latest

Quick start

# Generate 1 000 synthetic episodes + FHIR bundles + model card + validation report
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant \
  --n 1000 --cohort tolerant

# Longitudinal — multiple encounters per patient with shared HASTA_ID
syntha generate \
  --input data/raw/pristine_tolerant_episodes.csv \
  --output output/tolerant_long \
  --n 2000 --cohort tolerant \
  --longitudinal --encounters-per-patient 4 --years-of-history 3

# Validate a synthetic CSV against the source it was trained on
syntha validate \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/validation.json

# Run a privacy audit (MIA + AIA)
syntha audit \
  --source data/raw/pristine_tolerant_episodes.csv \
  --synthetic output/tolerant/synthetic_tolerant_episodes.csv \
  --output output/tolerant/privacy.json

By default the CSV writer drops 29 source-pipeline curation flags (pristine_*, berturk_*, drug-safety filters, rf_*) — those are training metadata, not clinical observations, and most are degenerate (constant 0 or 1) in the pristine cohort. Pass --curation-flags to keep them for QA work.
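
The column filtering described above can be sketched with glob patterns. This is an illustration, not the actual CSV writer: `select_columns` and the `*_example` column names are hypothetical, and the drug-safety filter columns are left out because their names are not given here.

```python
from fnmatch import fnmatchcase

# Glob patterns named in the paragraph above. The drug-safety filter columns
# are not listed there, so this sketch does not try to match them.
CURATION_PATTERNS = ("pristine_*", "berturk_*", "rf_*")

def select_columns(columns, keep_curation_flags=False):
    """Return the columns to write; curation flags are dropped by default."""
    if keep_curation_flags:
        return list(columns)
    return [
        col for col in columns
        if not any(fnmatchcase(col, pattern) for pattern in CURATION_PATTERNS)
    ]

# pristine_example and berturk_example are made-up names for illustration.
columns = ["HASTA_ID", "Hipertansiyon", "pristine_example", "berturk_example", "rf_kanser"]
print(select_columns(columns))
print(select_columns(columns, keep_curation_flags=True))
```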

What it produces

For every synthetic patient, syntha emits a FHIR R4 transaction Bundle:

| Resource | Coding | What |
| --- | --- | --- |
| 👤 Patient | | Turkish HumanName + Address (ISO 3166-2:TR province), communication.language = tr, derived birthDate |
| 🧪 Observation ×~12 | LOINC | Labs (glucose, full lipid panel, CBC, LFTs, eGFR/creatinine, ferritin, B12) + vitals (BP) |
| 🩺 Condition ×N | SNOMED CT + ICD-10 | Every active comorbidity, dual-coded, with English + clinical-Turkish display |
| 🏥 Encounter ×M | SNOMED CT | One per active condition, fired by the relevant module |
| 💊 MedicationRequest ×P | RxNorm | First-line therapy per condition, with dosage |
| 🔬 Procedure ×Q | SNOMED CT | HbA1c, lipid panel, ECG, spirometry, etc. |
| 📋 CarePlan ×R | SNOMED CT | Disease-specific lifestyle + monitoring plans |
| 📊 DiagnosticReport | LOINC | Lipid, CBC, CMP, iron, BP panels grouping their constituent Observations |
| 🎯 RiskAssessment | SNOMED CT | Charlson Comorbidity Index |
| 👪 FamilyMemberHistory | SNOMED CT | When rf_kanser / rf_kronik_hastalik are set |

…plus a flat CSV matching the input schema (minus the 29 dropped curation flags) for drop-in use as training data, a JSON model card with the source_sha256 and marginals, and a validation report.
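
For readers new to FHIR, here is a minimal sketch of the shape of a transaction Bundle, reduced to one Patient and one glucose Observation. It is not syntha's exporter: `make_bundle` is a hypothetical helper, and the LOINC code 2345-7 is a common choice for serum glucose rather than a confirmed detail of syntha's output.

```python
import json
import uuid

def make_bundle(glucose_mg_dl):
    """Build a two-entry FHIR R4 transaction Bundle as a plain dict."""
    patient_url = f"urn:uuid:{uuid.uuid4()}"
    patient = {
        "resourceType": "Patient",
        "communication": [
            {"language": {"coding": [{"system": "urn:ietf:bcp:47", "code": "tr"}]}}
        ],
    }
    observation = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{
            "system": "http://loinc.org",
            "code": "2345-7",  # assumed code for serum/plasma glucose
            "display": "Glucose [Mass/volume] in Serum or Plasma",
        }]},
        # Entries inside a transaction refer to each other by fullUrl.
        "subject": {"reference": patient_url},
        "valueQuantity": {
            "value": glucose_mg_dl,
            "unit": "mg/dL",
            "system": "http://unitsofmeasure.org",
            "code": "mg/dL",
        },
    }
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {"fullUrl": patient_url, "resource": patient,
             "request": {"method": "POST", "url": "Patient"}},
            {"fullUrl": f"urn:uuid:{uuid.uuid4()}", "resource": observation,
             "request": {"method": "POST", "url": "Observation"}},
        ],
    }

bundle = make_bundle(92.0)
print(json.dumps(bundle, indent=2)[:200])
```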

Distribution fidelity

A 100-episode synthetic sample from the tolerant cohort, compared against the full 135 569-row source:

| Metric | Value |
| --- | --- |
| n source / synthetic | 135 569 / 100 |
| Max Kolmogorov–Smirnov across continuous columns | 0.14 |
| Mean KS | 0.07 |
| Max binary-prevalence error | 0.025 (has_rx_data) |
| Disease-prevalence error (HTN / DM / hyperlipidemia) | 0.015 / 0.004 / 0.010 |
| Spearman correlation Frobenius diff | 2.94 |
| Fraction of synthetic patients with all labs in reference range | reported per cohort in validation_report.json |
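
As a reminder of what the KS and prevalence-error rows measure, the following sketch computes both from scratch on made-up numbers. The function names are illustrative and this is not the code behind `syntha validate`.

```python
from bisect import bisect_right

def ks_statistic(source, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs, which is always attained at an observed data point."""
    a, b = sorted(source), sorted(synthetic)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def prevalence_error(source_flags, synthetic_flags):
    """Absolute difference in the share of 1s between two binary columns."""
    p_source = sum(source_flags) / len(source_flags)
    p_synthetic = sum(synthetic_flags) / len(synthetic_flags)
    return abs(p_source - p_synthetic)

# Toy values, for illustration only.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(prevalence_error([1, 0, 0, 0], [1, 1, 0, 0]))
```

A KS value of 0 means the two empirical distributions coincide and 1 means they do not overlap at all, so a maximum of 0.14 across continuous columns indicates marginals that track the source closely.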

Marginals

[Figure: Marginal distributions, source vs synthetic]

Spearman correlation structure

[Figure: Spearman correlations, source vs synthetic vs diff]

Disease prevalence

[Figure: Comorbidity prevalence, source vs synthetic]

FHIR endpoints

# Spin up a local read-only FHIR R4 server
syntha serve --bundles examples/sample_output/sample_bundles.ndjson --port 8080

# Then:
curl http://127.0.0.1:8080/metadata           # CapabilityStatement
curl http://127.0.0.1:8080/Patient            # search-set Bundle
curl http://127.0.0.1:8080/Patient/{id}
curl http://127.0.0.1:8080/\$export           # Bulk Data NDJSON

scripts/post_to_fhir.sh posts every transaction Bundle in an NDJSON file to any FHIR R4 endpoint (default: the public HAPI test server).
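
For Python-only workflows, the same idea can be sketched with the standard library. This is an assumption-laden equivalent, not a port of the shell script: `build_requests` and `post_all` are hypothetical names, and the target must be a writable FHIR server (the `syntha serve` demo server is read-only).

```python
import json
import urllib.request

def build_requests(ndjson_text, base_url):
    """Turn each NDJSON line into a POST of one transaction Bundle.
    Per the FHIR spec, transactions are posted to the server base URL."""
    requests = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        bundle = json.loads(line)
        if bundle.get("resourceType") != "Bundle" or bundle.get("type") != "transaction":
            raise ValueError("expected a transaction Bundle on every line")
        requests.append(urllib.request.Request(
            base_url,
            data=json.dumps(bundle).encode("utf-8"),
            headers={"Content-Type": "application/fhir+json"},
            method="POST",
        ))
    return requests

def post_all(requests):
    """Send the prepared requests and collect HTTP status codes."""
    statuses = []
    for request in requests:
        with urllib.request.urlopen(request) as response:
            statuses.append(response.status)
    return statuses

# Dry run on an in-memory line; nothing is sent until post_all() is called.
demo_line = json.dumps({"resourceType": "Bundle", "type": "transaction", "entry": []})
prepared = build_requests(demo_line + "\n", "https://example.org/fhir")
print(len(prepared), prepared[0].get_method(), prepared[0].full_url)
```

Here `https://example.org/fhir` is a placeholder; substitute the base URL of the server you intend to load.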

Turkish cohort + Turkish output

The trained models bundled with the desktop app and the example output come from pristine_strict_episodes.csv and pristine_tolerant_episodes.csv — anonymized retrospective EHR episodes from a Turkish patient cohort selected to represent clinically pristine adults. The source CSVs themselves are gitignored and never redistributed.

The output is Turkish-localized:

  • Patient names sampled from Turkish given-name and family-name distributions (src/syntha/locale/turkish.py).
  • Addresses use Turkish cities weighted by approximate population with ISO 3166-2:TR province codes.
  • Every Condition carries both an English SNOMED display and a clinical-Turkish translation in Condition.code.text.
  • Patient.communication.language is tr.

All clinical terminology used (LOINC, SNOMED CT, ICD-10, RxNorm) comes from open international standards. No licensed terminology content is embedded.
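
A dual-coded Condition with a Turkish `code.text` might look like the sketch below. The helper name is hypothetical, and the SNOMED CT and ICD-10 codes shown are common choices for hypertension, not necessarily the ones syntha emits.

```python
def make_hypertension_condition(patient_reference):
    """Build a FHIR R4 Condition carrying SNOMED CT and ICD-10 codings
    plus a Turkish display string in code.text."""
    return {
        "resourceType": "Condition",
        "clinicalStatus": {"coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/condition-clinical",
            "code": "active",
        }]},
        "code": {
            "coding": [
                {"system": "http://snomed.info/sct",
                 "code": "38341003",  # assumed code
                 "display": "Hypertensive disorder"},
                {"system": "http://hl7.org/fhir/sid/icd-10",
                 "code": "I10",  # assumed code
                 "display": "Essential (primary) hypertension"},
            ],
            "text": "Hipertansiyon",
        },
        "subject": {"reference": patient_reference},
    }

condition = make_hypertension_condition("Patient/example")
print([coding["system"] for coding in condition["code"]["coding"]])
```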

Synthea-style clinical modules

Nine modules ship out of the box (src/syntha/modules/); each fires on its corresponding comorbidity flag.

| Module | Source flag(s) | Emits |
| --- | --- | --- |
| 🫀 Hypertension | Hipertansiyon | Encounter, 1–2 antihypertensives (stage 2 → dual), CarePlan |
| 🍬 Diabetes | DM_Tum, DM_Komplikasyonlu | Encounter, HbA1c, metformin (+ insulin if severe), CarePlan |
| 🧀 Hyperlipidemia | Hiperlipidemi | Encounter, lipid panel, statin (high-intensity if LDL ≥ 190) |
| 🦋 Thyroid | Tiroid | Encounter, TSH, levothyroxine |
| 😔 Depression | Depresyon | Psych encounter, sertraline, CBT CarePlan |
| 😰 Anxiety | Anksiyete | Psych encounter, escitalopram (or buspirone if already on an SSRI) |
| ❤️ Ischemic heart disease | Iskemik_Kalp | Cardiology encounter, ECG, aspirin + β-blocker + statin |
| 🌬️ Asthma | Astim | Resp encounter, spirometry, SABA + ICS |
| 🚭 COPD | COPD | Resp encounter, spirometry, LABA + SABA |

Module authoring guide: docs/MODULES.md.
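
To make the flag-to-output mapping concrete, here is a sketch of the hyperlipidemia row as a plain function. The real module interface is defined in docs/MODULES.md and is not reproduced here; the function names, the `ldl` column name, and the dict-based output are all hypothetical.

```python
def hyperlipidemia_module(patient):
    """Fire only when the Hiperlipidemi flag is set, then emit an encounter,
    a lipid panel, and a statin whose intensity depends on LDL."""
    if not patient.get("Hiperlipidemi"):
        return []
    statin = "high-intensity statin" if patient["ldl"] >= 190 else "statin"
    return [
        {"resourceType": "Encounter", "reason": "hyperlipidemia"},
        {"resourceType": "Procedure", "what": "lipid panel"},
        {"resourceType": "MedicationRequest", "what": statin},
    ]

# Hypothetical registry keyed by source flag; only one module is sketched.
MODULES = {"Hiperlipidemi": hyperlipidemia_module}

def run_modules(patient):
    """Run every registered module and concatenate what they emit."""
    emitted = []
    for module in MODULES.values():
        emitted.extend(module(patient))
    return emitted

print(run_modules({"Hiperlipidemi": 1, "ldl": 205}))
print(run_modules({"Hiperlipidemi": 0, "ldl": 205}))
```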

Architecture

┌──────────────┐    ┌──────────────────┐    ┌──────────────────────┐
│  Source CSV  │──▶│  Gaussian copula  │──▶│ Physiologic filter   │
│ (Turkish     │    │ (mixed-type ρ;   │    │ (BP, Friedewald,     │
│  pristine)   │    │ nearest-PSD)     │    │  eGFR ↔ creatinine)  │
└──────────────┘    └──────────────────┘    └─────────┬────────────┘
                                                       │
                                  ┌────────────────────┴────────────────────┐
                                  │                                         │
                                  ▼                                         ▼
                       ┌──────────────────┐                  ┌──────────────────────────┐
                       │ Longitudinal     │   (optional)     │  Single-encounter CSV +  │
                       │ expansion        │ ───────────────▶│  FHIR R4 export with      │
                       │ (drift, Poisson) │                  │  module activation        │
                       └─────────┬────────┘                  └──────────────────────────┘
                                 │
                                 ▼
                          (same FHIR export)

Full math (mixed-type correlation, nearest-PSD projection, conditional missingness, AR(1) lab drift): docs/ARCHITECTURE.md.
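
As a quick intuition for the AR(1) lab drift used in longitudinal expansion, the sketch below generates a lab trajectory that reverts toward a long-run mean. The parameterization (`mu`, `phi`, `sigma`) and the example values are assumptions; docs/ARCHITECTURE.md has the actual formulation.

```python
import random

def ar1_drift(baseline, mu, phi, sigma, steps, rng):
    """Return [x_0, ..., x_steps] where x_0 is the baseline value and
    x_t = mu + phi * (x_{t-1} - mu) + eps_t, with eps_t ~ Normal(0, sigma)."""
    series = [baseline]
    for _ in range(steps):
        noise = rng.gauss(0.0, sigma)
        series.append(mu + phi * (series[-1] - mu) + noise)
    return series

# Illustrative numbers only: a glucose-like value over four encounters.
rng = random.Random(42)
trajectory = ar1_drift(baseline=100.0, mu=90.0, phi=0.8, sigma=3.0, steps=3, rng=rng)
print([round(value, 1) for value in trajectory])
```

With `phi` close to 1, consecutive encounters stay strongly correlated; with `phi` near 0, each visit is nearly an independent draw around `mu`.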

CLI reference

| Command | What |
| --- | --- |
| syntha generate | End-to-end: train copula → sample → modules → CSV + FHIR + model card + validation |
| syntha fit | Fit and persist a copula in a registry without sampling |
| syntha sample | Raw sampling from a registered model |
| syntha sample-conditional | AST-validated rejection sampling against a pandas filter expression |
| syntha fhir | Convert an existing synthetic CSV to FHIR R4 bundles |
| syntha validate | KS / Wasserstein / correlation diff + reference-range coverage |
| syntha audit | Privacy audit (membership-inference + attribute-inference) |
| syntha serve | Read-only FHIR R4 demo server |
| syntha export-model | Export a registered copula to v2 JSON for the desktop app |
| syntha list-models, show-card | Inspect the registry |

Run syntha <cmd> --help for full option lists.
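
The idea behind AST-validated rejection sampling can be sketched as follows: parse the filter expression, refuse anything outside a small whitelist of syntax nodes, then keep drawing until enough rows pass. This sketch evaluates against plain dicts rather than pandas, and the function names, the whitelist, and the `age` / `ldl` columns are assumptions, not the CLI's actual behavior.

```python
import ast
import random

# Only comparisons, boolean logic, column names, and literals are allowed.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.Eq, ast.NotEq,
    ast.Name, ast.Load, ast.Constant,
)

def compile_filter(expression, columns):
    """Validate the expression's AST, then compile it for repeated evaluation."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in columns:
            raise ValueError(f"unknown column: {node.id}")
    return compile(tree, "<filter>", "eval")

def sample_conditional(draw, code, n, max_tries=10_000):
    """Rejection sampling: keep only the draws that satisfy the filter."""
    kept = []
    for _ in range(max_tries):
        row = draw()
        if eval(code, {"__builtins__": {}}, row):
            kept.append(row)
            if len(kept) == n:
                break
    return kept

rng = random.Random(1)

def draw_row():
    # Stand-in for sampling from a fitted model; values are made up.
    return {"age": rng.randint(18, 90), "ldl": rng.uniform(60, 220)}

code = compile_filter("age >= 40 and ldl < 190", {"age", "ldl"})
rows = sample_conditional(draw_row, code, n=5)
print(len(rows), all(r["age"] >= 40 and r["ldl"] < 190 for r in rows))
```

Validating the tree before evaluation means an expression such as `__import__('os')` is rejected up front, because function calls are not on the whitelist.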

Example output

A pretty-printed sample Bundle, a 100-episode synthetic CSV, the model card, and the validation report all live under examples/sample_output/ and are tracked in git.

| File | What |
| --- | --- |
| sample_bundle_pretty.json | One pretty-printed transaction Bundle |
| sample_bundles.ndjson | 100 Bundles, one per line (Bulk-FHIR style) |
| sample_episodes.csv | 100 synthetic episodes matching the input schema |
| sample_model_card.json | source_sha256, n_train, marginals, top correlations |
| sample_validation_report.json | KS / Wasserstein / correlation-Frobenius per column |

For FHIR-aware rendering: drop the Bundle onto simplifier.net or the HL7 Clinical FHIR Renderer.

What it is not

  • Not privacy-proof. Gaussian copulas are not differentially private. Run syntha audit before sharing any synthetic dataset trained on a small or sensitive cohort.
  • Not a substitute for real PHI when validity hinges on rare events — the copula reproduces the bulk of the joint distribution, not the long tails.
  • Not a population-representative Turkish cohort by default — the source is selected for clinically-pristine adults, so synthetic disease prevalence is lower than TÜİK national figures. Calibration to TÜİK is a curation task — see ROADMAP.md and COLLABORATE.md for how to help.
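
One quick sanity check before sharing data is the distance from each synthetic row to its closest source row. The sketch below is a swapped-in, generic technique and not what `syntha audit` runs (its membership- and attribute-inference attacks are not described here); the function names and toy rows are assumptions.

```python
import math

def distance_to_closest_record(synthetic_rows, source_rows):
    """For each synthetic row, the Euclidean distance to the nearest source row.
    Columns should be scaled to comparable ranges before calling this."""
    return [
        min(math.dist(synthetic, source) for source in source_rows)
        for synthetic in synthetic_rows
    ]

def share_of_exact_copies(distances):
    """Fraction of synthetic rows that reproduce a source row exactly."""
    return sum(1 for d in distances if d == 0) / len(distances)

# Toy (age, systolic BP) rows, for illustration only.
source_rows = [(50, 120), (60, 140)]
synthetic_rows = [(50, 120), (55, 130)]
distances = distance_to_closest_record(synthetic_rows, source_rows)
print(distances, share_of_exact_copies(distances))
```

A high share of zero or near-zero distances suggests the generator is replaying source rows, which is a signal to run the full audit and inspect the result before release.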

Contributing + collaboration

Open-source, Apache 2.0, contributions welcome from clinicians, data scientists, and software engineers alike. Three places to start:

  • 🧑‍⚕️ Clinicians: see COLLABORATE.md for the live list of tasks needing clinical-Turkish guidance (drug calibration, ICD specificity, new modules), plus the in-app Collaborate panel that surfaces the same list with one-click "claim" via your GitHub handle.
  • 💻 Developers: see CONTRIBUTING.md for dev setup, commit conventions, and the test matrix.
  • 🗺️ Project direction: see ROADMAP.md for the staged plan, what's shipped, and what's queued.

License + citation

Apache 2.0 © 2026 Ariorad Moniri — see LICENSE. If you use syntha in academic work, please cite:

Moniri, A. (2026). syntha: hybrid synthetic patient record generator
trained on Turkish pristine-healthy EHR cohorts.
https://github.com/ArioMoniri/syntha

Acknowledgements

| Project | What it gives us |
| --- | --- |
| 🩺 Synthea | Inspiration for the clinical-module layer and FHIR output format |
| 🧪 LOINC | Lab and observation codes |
| 🧬 SNOMED CT | Condition, procedure, encounter, and care-plan terminology |
| 📑 ICD-10 | Diagnosis coding alongside SNOMED |
| 💊 RxNorm | Medication coding |
| 📊 Turkish-cohort EHR data steward | De-identified retrospective episodes (anonymized upstream; never redistributed by this repo) |

Community

  • 💬 Discussions: open questions, "is this the right tool for X?", show-and-tell
  • 🐛 Issues: bug reports + feature requests + clinical curation
  • 🤝 Collaborate: live list of clinician + dev + data tasks · also surfaced in the desktop app
  • 📖 Contributing: dev setup, commit conventions, test matrix
  • 🗺️ Roadmap: shipped + queued + what needs a clinician
  • 📋 Changelog: Semver, Keep-a-Changelog, generated by release-please
