Skip to content

Latest commit

 

History

History
309 lines (242 loc) · 11 KB

File metadata and controls

309 lines (242 loc) · 11 KB

OSIRIS-RWD Mapping - Documentation

Objective

Export cancer patient data from Hopital Foch databases and map them to the OSIRIS-RWD format.

The process is incremental: each script enriches the output JSON with new variables.


Architecture

run_pipeline.py                     # Entry point (--step N for single step)
config.py                           # Paths, constants
db.py                               # Database connections (EDS, Axia, CHIMIO)
utils.py                            # Shared I/O and type helpers
steps/
  s01_create_cohort.py   ->  data/cancer_ipp.csv
  s02_patient_admin.py   ->  data/osiris_rwd_export.json
                              data/correspondence_ipp_pseudo.csv
  s03_primary_cancer.py  ->  enriches osiris_rwd_export.json with primaryCancer
  s04_medication.py      ->  enriches osiris_rwd_export.json with medication

Step 1: Cancer cohort creation

Module: steps/s01_create_cohort.py

Source: EDS V2 (PostgreSQL)

  • Host: 192.168.2.52:5432
  • Database: dwh_care
  • Tables: dwh.dwh_patient_ipphist, dwh.dwh_data, dwh.dwh_thesaurus_data

Method: Select patients with at least one CIM10 cancer diagnosis using an EXISTS subquery (optimized to avoid a costly JOIN + DISTINCT).

ICD-10 criteria using code ranges (instead of an exhaustive list):

ICD-10 range Description SQL condition
C00 - C97 Malignant neoplasms (+ all subcodes) left(concept_code, 1) = 'C' and substring(concept_code from 1 for 3) between 'C00' and 'C97'
D00 - D09 Carcinomas in situ (+ all subcodes) left(concept_code, 1) = 'D' and substring(concept_code from 1 for 3) between 'D00' and 'D09'
Z85* Personal history of malignant neoplasm concept_code like 'Z85%'

Output: data/cancer_ipp.csv (1 column: ipp)


Step 2: Administrative data extraction (Axia)

Module: steps/s02_patient_admin.py

Source: Axia database (Oracle)

  • Host: srvorat500.hopital-foch.net:1521
  • Database: AXIN (service_name)
  • Schema: drci
  • Connection mode: thick (Oracle Client 12c)

Tables and variables

OSIRIS-RWD field Axia column Axia table Transformation
patientId SI_NUMDOS drci.SERVEUR_IDENTITE Pseudonymized as PSCCn
originCenterId - - Fixed value (920100019, Foch FINESS — to verify)
birthdateYear SI_NAIS drci.SERVEUR_IDENTITE extract(year)
birthdateMonth SI_NAIS drci.SERVEUR_IDENTITE extract(month)
birthdateDay SI_NAIS drci.SERVEUR_IDENTITE extract(day)
biologicalSex SI_SEXE drci.SERVEUR_IDENTITE 1 -> MALE, 2 -> FEMALE
deathDateYear SI_DATEDEC drci.SERVEUR_IDENTITE extract(year)
deathDateMonth SI_DATEDEC drci.SERVEUR_IDENTITE extract(month)
deathDateDay SI_DATEDEC drci.SERVEUR_IDENTITE extract(day)
latestVisitDate* HO_DDEB drci.HOSPITALISATION max(HO_DDEB) + extract()
latestContactDate* HO_DDEB drci.HOSPITALISATION Same value as latestVisitDate

Join

drci.SERVEUR_IDENTITE.SI_NUMDOS = drci.HOSPITALISATION.HO_MANUMDOS

Pseudonymization

  • Each IPP receives a sequential identifier: PSCC1, PSCC2, ...
  • The correspondence table ipp <-> pseudo_id is saved in data/correspondence_ipp_pseudo.csv
  • This file must remain internal and should not be shared

Output

  • data/osiris_rwd_export.json: list of patients in OSIRIS-RWD format (first part)
  • data/correspondence_ipp_pseudo.csv: IPP / pseudonym correspondence table

Generated JSON structure (per patient)

{
  "patientId": "PSCC1",
  "originCenterId": "920100019",
  "birthdateYear": 1943,
  "birthdateMonth": 6,
  "birthdateDay": 1,
  "biologicalSex": "MALE",
  "latestNews": {
    "vitalStatus": null,
    "latestVisitDateDay": 15,
    "latestVisitDateMonth": 12,
    "latestVisitDateYear": 2023,
    "deathDateDay": null,
    "deathDateMonth": null,
    "deathDateYear": null,
    "latestContactDateDay": 15,
    "latestContactDateMonth": 12,
    "latestContactDateYear": 2023,
    "vitalStatusUpdateDay": null,
    "vitalStatusUpdateMonth": null,
    "vitalStatusUpdateYear": null,
    "observation": []
  },
  "primaryCancer": [
    {
      "cancerOrder": 1,
      "topographyCode": "C34.9",
      "topographyGroup": "POUMON NON A PETITES CELLULES",
      "morphologyCode": null,
      "morphologyGroup": null,
      "cancerDiagnosisDateYear": 2021,
      "cancerDiagnosisDateMonth": 3,
      "cancerDiagnosisDateDay": 15,
      "laterality": null,
      "cancerDiagnosisInCenter": true,
      "cancerDiagnosisMethod": null,
      "cancerDiagnosisCode": "C34.9",
      "cancerCareInCenter": null,
      "primaryCancerGrade": [],
      "primaryCancerStage": [],
      "tumorPathoEvent": [],
      "tnmEvent": [{"tnmType": "c", "t": "T3", "n": "N1", "m": "M0"}],
      "tumorSize": [],
      "imaging": [],
      "surgery": [],
      "radiotherapy": [],
      "measure": [],
      "observation": []
    }
  ]
}

Step 3: Primary cancer data extraction (CHIMIO)

Module: steps/s03_primary_cancer.py

Source: CHIMIO database (Oracle)

  • Schema owner: CHIMSECOURS
  • Connection mode: thick (Oracle Client 12c)

Tables and variables

OSIRIS-RWD field CHIMIO column CHIMIO table Notes
cancerOrder Computed: chronological order of cancers per patient (1, 2, 3...) based on cancerDiagnosisDate.
topographyCode codediag1 CHIMSECOURS.diagnost ICD-10 code (C34, C50...). Often -1 at Foch (not filled). Formatted to dotted notation (C34 → C34.9).
topographyGroup nomlocal CHIMSECOURS.localisa Text (GLIOBLASTOME, COLON RECTUM...). Main topography identifier at Foch.
cancerDiagnosisDate* dateinclus CHIMSECOURS.inclusio YYYYMMDD VARCHAR2. Real inclusion/diagnosis date.
cancerDiagnosisCode codediag1 CHIMSECOURS.diagnost Same as topographyCode (ICD-10).
primaryCancerStage codest1 / lignestade CHIMSECOURS.stade Simple stage codes: 01→I, 02→II, 03→III, 04→IV. -1 = empty.
tnmEvent codest1 CHIMSECOURS.stade TNM codes (cT3bN1M0, pT1aN2aM1b) parsed into structured {tnmType, t, n, m}.
morphologyCode Not available in CHIMIO
laterality Not available in CHIMIO

Joins

patient.pacleunik = inclusio.pacleunik
inclusio.dicleunik = diagnost.dicleunik
inclusio.codelocal = localisa.codelocal
inclusio.stcleunik = stade.stcleunik        (LEFT JOIN)
inclusio.ltcleunik = lignetr.ltcleunik
inclusio.prcleunik = protocol.prcleunik

IPP matching

At Foch, patient.noobspat is identical to patient_etab.c_ipp (the hospital IPP). The query uses noobspat directly — no need for the patient_etab join.

Filters

  • inclusio.inactinclus = 0 (active inclusions only)
  • codediag1: values -1, 0, empty are treated as NULL
  • codest1 / lignestade: values -1, 0, (Sans) are treated as NULL
  • Patient measures: only included when value > 0

Cancer order (cancerOrder)

Computed per patient after grouping all primaryCancer entries:

  • Cancers are sorted chronologically by diagnosis date (year, month, day)
  • Assigned sequential order: 1, 2, 3, ...
  • Single cancer → cancerOrder = 1

TNM parsing (tnmEvent)

codest1 from the stade table contains either:

  • TNM codes (e.g. cT3N2M0, pT1aN2aM1b, cTisN0M0) → parsed into structured tnmEvent:
    {"tnmType": "c", "t": "T3", "n": "N2", "m": "M0"}
  • Simple stage codes (01, 02, 03, 04) → mapped to roman numerals (I, II, III, IV) in primaryCancerStage
  • Invalid values (-1, 0, (Sans)) → ignored

Notes

  • The indication table exists but is empty at Foch (not included in query)
  • IPP matching: CHIMIO returns 10-digit IPPs; correspondence CSV IPPs are normalized with zfill(10) to ensure matching

Step 4: Medication data extraction (CHIMIO)

Module: steps/s04_medication.py

Source: CHIMIO database (Oracle)

  • Schema owner: CHIMSECOURS
  • Connection mode: thick (Oracle Client 12c)

Tables and variables

OSIRIS-RWD field CHIMIO column CHIMIO table Notes
moleculeCode CLASSETH CHIMSECOURS.dci ATC code (L01BC02, L01XA02...). 100% filled (261/261 molecules).
drugAtcCode NOMDCI CHIMSECOURS.dci Drug name / DCI (FLUORO-URACILE, OXALIPLATINE...).
startDate* DATEPRES CHIMSECOURS.prescrip Prescription date, YYYYMMDD VARCHAR2.
phase Not mapped (could use lignetr.lignetrait: ADJUVANT, 1ERE LIGNE...).
clinicalTrial incessai CHIMSECOURS.inclusio 1 = yes (clinical trial inclusion).

Joins

patient.pacleunik = inclusio.pacleunik
inclusio.I0CLEUNIK = prescrip.I0CLEUNIK
prescrip.P0CLEUNIK = lignepre.P0CLEUNIK
lignepre.DCCLEUNIK = dci.DCCLEUNIK
prescrip.SECLEUNIK = sejour.SECLEUNIK       (not used yet)

Filters

  • inclusio.inactinclus = 0 (active inclusions only)
  • dci.CLASSETH IS NOT NULL AND != '-1' (excludes solvents where NOMDCI = 'sans')
  • prescrip.DATEPRES IS NOT NULL (excludes planned/not-yet-administered prescriptions)
  • Results are grouped by (ipp, date, ATC code, drug name, clinical trial) to deduplicate

UNITE codes (observed at Foch)

Code Unit Verification
1 mg/kg BEVACIZUMAB: dosepres=8.95, dose=548.22
2 mg/m² OXALIPLATINE: dosepres=100, dose=150 (BSA ~1.5)
3 mg Fixed doses (8, 80, 500)
5 UI L-ASPARAGINASE: dosepres=10000
7 AUC CARBOPLATINE: dosepres=4.73, dose=401.44
8 AUC CARBOPLATINE: dosepres=4.86, dose=430.48
13 g/m² Values 1, 1.5

Voies d'administration (CODEVOIE in lignepre)

Code Route Count
IVL IV lente (perfusion) 968,878
IVD IV directe 299,956
PO Per Os (oral) 235,452
SC Sous-cutané 32,483
IVC IV continue 30,942
VESI Vésicale 3,614
THECA Intrathécale 1,770
IM Intramusculaire 1,600
IA Intra-artérielle 1,012

Generated JSON structure (per medication entry)

{
  "moleculeCode": "L01XA03",
  "drugAtcCode": "OXALIPLATINE",
  "startDateYear": 2010,
  "startDateMonth": 6,
  "startDateDay": 8,
  "phase": null,
  "clinicalTrial": false,
  "measure": [],
  "observation": []
}

Notes

  • The produit table contains product presentations (UCD codes, dosages) linked to DCI via DCCLEUNIK — not used currently
  • The sejour table (hospital stays) is linked to prescriptions via prescrip.SECLEUNIK = sejour.SECLEUNIK — available for future use
  • UNITE and CODEVOIE are not included in the OSIRIS-RWD medication structure but documented here for reference

Next steps

Subsequent scripts will load data/osiris_rwd_export.json and enrich it with:

  • surgery, radiotherapy, imaging
  • measure, observation