Pipeline to extract cancer patient data from Hopital Foch databases and map them to the OSIRIS-RWD format.
run_pipeline.py # Pipeline entry point
config.py # Paths, constants
db.py # Database connection helpers (EDS, Axia, CHIMIO)
utils.py # Shared utilities (I/O, type conversion)
steps/
    s01_create_cohort.py # Step 1: cancer cohort from EDS (PostgreSQL)
    s02_patient_admin.py # Step 2: patient demographics from Axia (Oracle)
    s03_primary_cancer.py # Step 3: primary cancer data from CHIMIO (Oracle)
    s04_medication.py # Step 4: medication data from CHIMIO (Oracle)
.env # Credentials (gitignored)
.env.example # Template for .env
data/ # Output files (gitignored)
| Step | Module | Source | Output |
|---|---|---|---|
| 1 | steps/s01_create_cohort.py | EDS (PostgreSQL) | data/cancer_ipp.csv |
| 2 | steps/s02_patient_admin.py | Axia (Oracle) | data/osiris_rwd_export.json |
| 3 | steps/s03_primary_cancer.py | CHIMIO (Oracle) | enriches JSON with primaryCancer, cancerOrder, tnmEvent, patient measures |
| 4 | steps/s04_medication.py | CHIMIO (Oracle) | enriches JSON with medication (ATC codes, drug names, dates) |
- Python 3.10+
- Access to EDS V2 (PostgreSQL), Axia (Oracle) and CHIMIO (Oracle) databases
- Oracle Client 12c installed (required for Axia thick mode connection)
python -m venv venv
venv\Scripts\activate
pip install psycopg2 oracledb python-dotenv
Copy .env.example to .env and fill in your database credentials:
cp .env.example .env
Warning: .env is in .gitignore and must never be committed.
# Run the full pipeline
python run_pipeline.py
# Run a single step
python run_pipeline.py --step 1 # cohort only
python run_pipeline.py --step 2 # patient admin only (uses existing cohort)
python run_pipeline.py --step 3 # primary cancer only (uses existing JSON)
python run_pipeline.py --step 4 # medication only (uses existing JSON)
To add a new step:
- Create steps/s05_my_step.py with a run(patients, ...) function
- Add the call in run_pipeline.py
- Done
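As a rough illustration of what a new step and its call could look like, here is a minimal sketch. Only the file name steps/s05_my_step.py and the run(patients, ...) entry point come from this README. The shape of a patient record (a plain dict), the myStepStatus field, the default_status parameter and the run_steps helper are all assumptions made for the example, not the project's actual code; see documentation.md for the real JSON structure.

```python
"""Hypothetical steps/s05_my_step.py (sketch only, not the project's code)."""
from typing import Any, Callable, Dict, List, Optional

# Assumption: each patient is a plain dict loaded from the JSON export.
Patient = Dict[str, Any]
Step = Callable[[List[Patient]], List[Patient]]


def run(patients: List[Patient], default_status: str = "unknown") -> List[Patient]:
    """Enrich each patient record in place and return the same list."""
    for patient in patients:
        # "myStepStatus" is an invented field name used for illustration.
        # setdefault keeps any value already present on the record.
        patient.setdefault("myStepStatus", default_status)
    return patients


def run_steps(
    steps: Dict[int, Step],
    patients: List[Patient],
    only: Optional[int] = None,
) -> List[Patient]:
    """Hypothetical dispatcher: run every step in order, or just one."""
    for number in sorted(steps):
        if only is None or number == only:
            patients = steps[number](patients)
    return patients


if __name__ == "__main__":
    # Invented sample records; the "ipp" key is an assumption.
    demo = [{"ipp": "0001"}, {"ipp": "0002", "myStepStatus": "done"}]
    print(run_steps({5: run}, demo, only=5))
```

Having each step take the patient list and return it keeps the call in run_pipeline.py to a single line per step, and makes a step easy to exercise on a small hand-written list before pointing it at a database.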
See documentation.md for the full technical documentation: data sources, variable mappings, ICD-10 criteria, pseudonymization, and JSON structure.