Export cancer patient data from Hopital Foch databases and map them to the OSIRIS-RWD format.
The process is incremental: each script enriches the output JSON with new variables.
run_pipeline.py # Entry point (--step N for single step)
config.py # Paths, constants
db.py # Database connections (EDS, Axia, CHIMIO)
utils.py # Shared I/O and type helpers
steps/
s01_create_cohort.py -> data/cancer_ipp.csv
s02_patient_admin.py -> data/osiris_rwd_export.json
data/correspondence_ipp_pseudo.csv
s03_primary_cancer.py -> enriches osiris_rwd_export.json with primaryCancer
s04_medication.py -> enriches osiris_rwd_export.json with medication
Module: steps/s01_create_cohort.py
Source: EDS V2 (PostgreSQL)
- Host:
192.168.2.52:5432 - Database:
dwh_care - Tables:
dwh.dwh_patient_ipphist,dwh.dwh_data,dwh.dwh_thesaurus_data
Method: Select patients with at least one CIM10 cancer diagnosis using an EXISTS subquery (optimized to avoid a costly JOIN + DISTINCT).
ICD-10 criteria using code ranges (instead of an exhaustive list):
| ICD-10 range | Description | SQL condition |
|---|---|---|
| C00 - C97 | Malignant neoplasms (+ all subcodes) | left(concept_code, 1) = 'C' and substring(concept_code from 1 for 3) between 'C00' and 'C97' |
| D00 - D09 | Carcinomas in situ (+ all subcodes) | left(concept_code, 1) = 'D' and substring(concept_code from 1 for 3) between 'D00' and 'D09' |
| Z85* | Personal history of malignant neoplasm | concept_code like 'Z85%' |
Output: data/cancer_ipp.csv (1 column: ipp)
Module: steps/s02_patient_admin.py
Source: Axia database (Oracle)
- Host:
srvorat500.hopital-foch.net:1521 - Database:
AXIN(service_name) - Schema:
drci - Connection mode: thick (Oracle Client 12c)
| OSIRIS-RWD field | Axia column | Axia table | Transformation |
|---|---|---|---|
patientId |
SI_NUMDOS |
drci.SERVEUR_IDENTITE |
Pseudonymized as PSCCn |
originCenterId |
- | - | Fixed value (920100019, Foch FINESS — to verify) |
birthdateYear |
SI_NAIS |
drci.SERVEUR_IDENTITE |
extract(year) |
birthdateMonth |
SI_NAIS |
drci.SERVEUR_IDENTITE |
extract(month) |
birthdateDay |
SI_NAIS |
drci.SERVEUR_IDENTITE |
extract(day) |
biologicalSex |
SI_SEXE |
drci.SERVEUR_IDENTITE |
1 -> MALE, 2 -> FEMALE |
deathDateYear |
SI_DATEDEC |
drci.SERVEUR_IDENTITE |
extract(year) |
deathDateMonth |
SI_DATEDEC |
drci.SERVEUR_IDENTITE |
extract(month) |
deathDateDay |
SI_DATEDEC |
drci.SERVEUR_IDENTITE |
extract(day) |
latestVisitDate* |
HO_DDEB |
drci.HOSPITALISATION |
max(HO_DDEB) + extract() |
latestContactDate* |
HO_DDEB |
drci.HOSPITALISATION |
Same value as latestVisitDate |
drci.SERVEUR_IDENTITE.SI_NUMDOS = drci.HOSPITALISATION.HO_MANUMDOS
- Each IPP receives a sequential identifier:
PSCC1,PSCC2, ... - The correspondence table
ipp <-> pseudo_idis saved indata/correspondence_ipp_pseudo.csv - This file must remain internal and should not be shared
data/osiris_rwd_export.json: list of patients in OSIRIS-RWD format (first part)data/correspondence_ipp_pseudo.csv: IPP / pseudonym correspondence table
{
"patientId": "PSCC1",
"originCenterId": "920100019",
"birthdateYear": 1943,
"birthdateMonth": 6,
"birthdateDay": 1,
"biologicalSex": "MALE",
"latestNews": {
"vitalStatus": null,
"latestVisitDateDay": 15,
"latestVisitDateMonth": 12,
"latestVisitDateYear": 2023,
"deathDateDay": null,
"deathDateMonth": null,
"deathDateYear": null,
"latestContactDateDay": 15,
"latestContactDateMonth": 12,
"latestContactDateYear": 2023,
"vitalStatusUpdateDay": null,
"vitalStatusUpdateMonth": null,
"vitalStatusUpdateYear": null,
"observation": []
},
"primaryCancer": [
{
"cancerOrder": 1,
"topographyCode": "C34.9",
"topographyGroup": "POUMON NON A PETITES CELLULES",
"morphologyCode": null,
"morphologyGroup": null,
"cancerDiagnosisDateYear": 2021,
"cancerDiagnosisDateMonth": 3,
"cancerDiagnosisDateDay": 15,
"laterality": null,
"cancerDiagnosisInCenter": true,
"cancerDiagnosisMethod": null,
"cancerDiagnosisCode": "C34.9",
"cancerCareInCenter": null,
"primaryCancerGrade": [],
"primaryCancerStage": [],
"tumorPathoEvent": [],
"tnmEvent": [{"tnmType": "c", "t": "T3", "n": "N1", "m": "M0"}],
"tumorSize": [],
"imaging": [],
"surgery": [],
"radiotherapy": [],
"measure": [],
"observation": []
}
]
}Module: steps/s03_primary_cancer.py
Source: CHIMIO database (Oracle)
- Schema owner:
CHIMSECOURS - Connection mode: thick (Oracle Client 12c)
| OSIRIS-RWD field | CHIMIO column | CHIMIO table | Notes |
|---|---|---|---|
cancerOrder |
— | — | Computed: chronological order of cancers per patient (1, 2, 3...) based on cancerDiagnosisDate. |
topographyCode |
codediag1 |
CHIMSECOURS.diagnost |
ICD-10 code (C34, C50...). Often -1 at Foch (not filled). Formatted to dotted notation (C34 → C34.9). |
topographyGroup |
nomlocal |
CHIMSECOURS.localisa |
Text (GLIOBLASTOME, COLON RECTUM...). Main topography identifier at Foch. |
cancerDiagnosisDate* |
dateinclus |
CHIMSECOURS.inclusio |
YYYYMMDD VARCHAR2. Real inclusion/diagnosis date. |
cancerDiagnosisCode |
codediag1 |
CHIMSECOURS.diagnost |
Same as topographyCode (ICD-10). |
primaryCancerStage |
codest1 / lignestade |
CHIMSECOURS.stade |
Simple stage codes: 01→I, 02→II, 03→III, 04→IV. -1 = empty. |
tnmEvent |
codest1 |
CHIMSECOURS.stade |
TNM codes (cT3bN1M0, pT1aN2aM1b) parsed into structured {tnmType, t, n, m}. |
morphologyCode |
— | — | Not available in CHIMIO |
laterality |
— | — | Not available in CHIMIO |
patient.pacleunik = inclusio.pacleunik
inclusio.dicleunik = diagnost.dicleunik
inclusio.codelocal = localisa.codelocal
inclusio.stcleunik = stade.stcleunik (LEFT JOIN)
inclusio.ltcleunik = lignetr.ltcleunik
inclusio.prcleunik = protocol.prcleunik
At Foch, patient.noobspat is identical to patient_etab.c_ipp (the hospital IPP). The query uses noobspat directly — no need for the patient_etab join.
inclusio.inactinclus = 0(active inclusions only)codediag1: values-1,0, empty are treated as NULLcodest1/lignestade: values-1,0,(Sans)are treated as NULL- Patient measures: only included when value > 0
Computed per patient after grouping all primaryCancer entries:
- Cancers are sorted chronologically by diagnosis date (year, month, day)
- Assigned sequential order: 1, 2, 3, ...
- Single cancer →
cancerOrder = 1
codest1 from the stade table contains either:
- TNM codes (e.g.
cT3N2M0,pT1aN2aM1b,cTisN0M0) → parsed into structuredtnmEvent:{"tnmType": "c", "t": "T3", "n": "N2", "m": "M0"} - Simple stage codes (
01,02,03,04) → mapped to roman numerals (I, II, III, IV) inprimaryCancerStage - Invalid values (
-1,0,(Sans)) → ignored
- The
indicationtable exists but is empty at Foch (not included in query) - IPP matching: CHIMIO returns 10-digit IPPs; correspondence CSV IPPs are normalized with
zfill(10)to ensure matching
Module: steps/s04_medication.py
Source: CHIMIO database (Oracle)
- Schema owner:
CHIMSECOURS - Connection mode: thick (Oracle Client 12c)
| OSIRIS-RWD field | CHIMIO column | CHIMIO table | Notes |
|---|---|---|---|
moleculeCode |
CLASSETH |
CHIMSECOURS.dci |
ATC code (L01BC02, L01XA02...). 100% filled (261/261 molecules). |
drugAtcCode |
NOMDCI |
CHIMSECOURS.dci |
Drug name / DCI (FLUORO-URACILE, OXALIPLATINE...). |
startDate* |
DATEPRES |
CHIMSECOURS.prescrip |
Prescription date, YYYYMMDD VARCHAR2. |
phase |
— | — | Not mapped (could use lignetr.lignetrait: ADJUVANT, 1ERE LIGNE...). |
clinicalTrial |
incessai |
CHIMSECOURS.inclusio |
1 = yes (clinical trial inclusion). |
patient.pacleunik = inclusio.pacleunik
inclusio.I0CLEUNIK = prescrip.I0CLEUNIK
prescrip.P0CLEUNIK = lignepre.P0CLEUNIK
lignepre.DCCLEUNIK = dci.DCCLEUNIK
prescrip.SECLEUNIK = sejour.SECLEUNIK (not used yet)
inclusio.inactinclus = 0(active inclusions only)dci.CLASSETH IS NOT NULL AND != '-1'(excludes solvents whereNOMDCI = 'sans')prescrip.DATEPRES IS NOT NULL(excludes planned/not-yet-administered prescriptions)- Results are grouped by (ipp, date, ATC code, drug name, clinical trial) to deduplicate
| Code | Unit | Verification |
|---|---|---|
| 1 | mg/kg | BEVACIZUMAB: dosepres=8.95, dose=548.22 |
| 2 | mg/m² | OXALIPLATINE: dosepres=100, dose=150 (BSA ~1.5) |
| 3 | mg | Fixed doses (8, 80, 500) |
| 5 | UI | L-ASPARAGINASE: dosepres=10000 |
| 7 | AUC | CARBOPLATINE: dosepres=4.73, dose=401.44 |
| 8 | AUC | CARBOPLATINE: dosepres=4.86, dose=430.48 |
| 13 | g/m² | Values 1, 1.5 |
| Code | Route | Count |
|---|---|---|
| IVL | IV lente (perfusion) | 968,878 |
| IVD | IV directe | 299,956 |
| PO | Per Os (oral) | 235,452 |
| SC | Sous-cutané | 32,483 |
| IVC | IV continue | 30,942 |
| VESI | Vésicale | 3,614 |
| THECA | Intrathécale | 1,770 |
| IM | Intramusculaire | 1,600 |
| IA | Intra-artérielle | 1,012 |
{
"moleculeCode": "L01XA03",
"drugAtcCode": "OXALIPLATINE",
"startDateYear": 2010,
"startDateMonth": 6,
"startDateDay": 8,
"phase": null,
"clinicalTrial": false,
"measure": [],
"observation": []
}- The
produittable contains product presentations (UCD codes, dosages) linked to DCI viaDCCLEUNIK— not used currently - The
sejourtable (hospital stays) is linked to prescriptions viaprescrip.SECLEUNIK = sejour.SECLEUNIK— available for future use - UNITE and CODEVOIE are not included in the OSIRIS-RWD medication structure but documented here for reference
Subsequent scripts will load data/osiris_rwd_export.json and enrich it with:
surgery,radiotherapy,imagingmeasure,observation