Merged
5 changes: 3 additions & 2 deletions README.md
@@ -42,7 +42,7 @@ For more details, see the **[Python API Reference](docs/API.md)**.
- **Modular Architecture**: Decoupled backbones, interaction modules, and pathway output heads.
- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
- **Pathway-Exclusive Prediction**: Directly predicts biological pathway activity scores (e.g., 50 MSigDB Hallmark pathways) — no intermediate gene reconstruction step.
- **Offline Pathway Targets**: Ground-truth pathway activities are pre-computed offline (`stf-compute-pathways`) from raw gene expression using QC → CP10k normalisation → z-score → mean pathway aggregation. This eliminates the circular auxiliary loss used in previous versions.
- **Offline Pathway Targets**: Ground-truth pathway activities are pre-computed offline (`stf-compute-pathways`) from raw gene expression using QC → CP10k normalisation → mean pathway aggregation. This eliminates the circular auxiliary loss used in previous versions.
- **Spatial Pattern Coherence**: Optimised using a composite **MSE + PCC (Pearson Correlation) loss**.
- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, **PLIP**, and **GigaPath**.

@@ -87,7 +87,7 @@ stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data

### 2. Pre-Compute Pathway Activity Targets

Before training, compute the offline pathway activity matrix for each sample. This step applies per-spot QC, CP10k normalisation, and z-scoring before aggregating gene expression into MSigDB Hallmark pathway scores.
Before training, compute the offline pathway activity matrix for each sample. This step applies per-spot QC and CP10k normalisation, then aggregates gene expression into MSigDB Hallmark pathway scores as the per-spot mean over each pathway's member genes.
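
A minimal NumPy sketch of the scoring step, assuming QC has already been applied and the counts are a dense spots × genes matrix. The function name `score_pathways` and the toy inputs are illustrative and not taken from the `stf-compute-pathways` source. The `log1p` after CP10k and the zero fill below `min_genes` follow the description in `docs/PATHWAY_MAPPING.md`.

```python
import numpy as np

def score_pathways(counts, gene_names, gene_sets, min_genes=5):
    """Mean log1p-CP10k expression per pathway, per spot (illustrative)."""
    counts = np.asarray(counts, dtype=np.float64)
    # CP10k: scale every spot (row) to 10,000 total counts, then log1p.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty spots
    logn = np.log1p(counts / totals * 1e4)

    index = {g: i for i, g in enumerate(gene_names)}
    scores = np.zeros((counts.shape[0], len(gene_sets)), dtype=np.float32)
    for j, members in enumerate(gene_sets.values()):
        # Only member genes that are present in the matrix contribute.
        cols = [index[g] for g in members if g in index]
        if len(cols) >= min_genes:  # otherwise the column stays at zero
            scores[:, j] = logn[:, cols].mean(axis=1)
    return scores
```

Because no per-slide statistic enters the computation, the same counts give the same score regardless of which slide they come from.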

```bash
stf-compute-pathways --data-dir hest_data
```
@@ -123,6 +123,7 @@ Visualization plots and spatial pathway activation maps will be saved to the `./

- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the pathway-exclusive prediction architecture, quad-flow interaction logic, and network scaling.
- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Offline pathway scoring methodology, QC pipeline, and MSigDB integration.
- **[SVG Exploratory Analysis](docs/SVG_HEST_EXPLORATORY_ANALYSIS.md)**: Detailed report on spatially variable pathway analysis across 95 HEST samples and data-driven target curation.
- **[Data Structure](docs/DATA_FORMAT.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.

## Development
21 changes: 14 additions & 7 deletions docs/PATHWAY_MAPPING.md
@@ -16,9 +16,8 @@ For each `.h5ad` file, the following steps are applied in order:
| :--- | :--- | :--- |
| **1. QC Filtering** | Remove low-quality spots (min UMIs, min detected genes, max MT%) on **raw counts** | QC before normalisation prevents low-quality spots from distorting library-size estimates |
| **2. CP10k Normalisation** | Scale each spot to 10,000 total counts, then apply `log1p` | Corrects for sequencing depth differences between spots |
| **3. Gene Z-Scoring** | Standardise each gene across surviving spots (mean=0, std=1) | Eliminates housekeeping gene dominance; every gene gets equal weight |
| **4. Pathway Aggregation** | For each pathway: take the mean z-score of its member genes present in the matrix | Produces a single, comparable activity score per pathway per spot |
| **5. Moran I** | Compute Moran's I for each gene on the raw counts | Computes spatial autocorrelation for each gene |
| **3. Pathway Aggregation** | For each pathway: take the mean log1p CP10k expression of its member genes present in the matrix | Slide-stationary by construction — no per-slide statistics enter the score, so the same biological state in two slides yields the same target value |
| **4. Moran's I (diagnostic)** | Compute per-pathway Moran's I on the activity matrix | Records spatial autocorrelation alongside the targets; not used in the loss |

Pathways with fewer than `--min-genes` (default: 5) detected members are filled with zeros. Samples with fewer than `--min-pathways` (default: 25) scorable pathways are excluded entirely.

@@ -38,10 +37,12 @@ These defaults follow standard scRNA-seq / spatial transcriptomics QC practice t
Each sample is saved as a compressed HDF5 file at `<data_dir>/pathway_activities/<sample_id>.h5`:

```text
activities float32 (n_spots, n_pathways) # z-scored pathway activity matrix
barcodes bytes (n_spots,) # spot barcode strings
pathway_names bytes (n_pathways,) # pathway name labels
activities float32 (n_spots, n_pathways) # mean log1p CP10k pathway score
barcodes bytes (n_spots,) # spot barcode strings
pathway_names bytes (n_pathways,) # pathway name labels
pathway_morans_i float32 (n_pathways,) # per-pathway Moran's I (diagnostic)
attrs:
format_version int # on-disk schema version (current: 2)
n_spots_before_qc int # total spots in raw h5ad
n_spots_after_qc int # spots surviving QC
qc_min_umis int
@@ -50,6 +51,12 @@ attrs:
n_scored_pathways int # pathways meeting the min_genes threshold
```

> **Breaking change (v2):** `activities` is now the simple mean of log1p
> CP10k expression of pathway members — no per-slide z-score. Files written
> by older builds carry `format_version=1` (or no version attribute) and are
> rejected at load time. Re-run `stf-compute-pathways --overwrite` to
> regenerate.

These files are consumed at training time by `HEST_FeatureDataset` when `--pathway-targets-dir` is provided (which defaults to `<data_dir>/pathway_activities`).
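
A hedged sketch of how a loader might enforce the version gate from the breaking-change note. `check_format_version` and `load_activities` are hypothetical helper names, not the project's API; only the dataset and attribute names come from the schema listed here. `h5py` is imported lazily so the version check can be exercised without it.

```python
CURRENT_FORMAT_VERSION = 2  # "current: 2" in the schema listing

def check_format_version(attrs):
    """Raise if the file predates the mean log1p CP10k schema."""
    version = attrs.get("format_version")  # absent in the oldest files
    if version is None or int(version) != CURRENT_FORMAT_VERSION:
        raise ValueError(
            f"unsupported format_version={version!r}; "
            "re-run `stf-compute-pathways --overwrite`"
        )

def load_activities(path):
    """Return (activities, barcodes, pathway_names) for one sample file."""
    import h5py  # third-party dependency, imported lazily

    with h5py.File(path, "r") as f:
        check_format_version(f.attrs)
        activities = f["activities"][:]
        barcodes = [b.decode() for b in f["barcodes"][:]]
        names = [n.decode() for n in f["pathway_names"][:]]
    return activities, barcodes, names
```

Checking the attribute before reading any dataset keeps stale z-scored files from silently mixing with version 2 targets.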

### Usage
Expand Down Expand Up @@ -100,7 +107,7 @@ The current design eliminates this entirely:
| Aspect | Old (Auxiliary Loss) | New (Pre-computed Targets) |
| :--- | :--- | :--- |
| Target source | Computed in-flight from training labels | Computed once, offline, from raw expression |
| QC & normalisation | None | Per-spot QC → CP10k → z-score |
| QC & normalisation | None | Per-spot QC → CP10k → mean pathway aggregation |
| Model output | Gene expression (via gene reconstructor) | Pathway activity scores directly |
| Loss objective | `L_gene + λ · (1 - PCC(scores, pseudo-targets))` | `MSE + PCC` against pre-computed activities |
| Interpretability | Indirect (pathway scores were internal and needed to be mapped back to pathways) | Direct (output *is* the pathway activity) |
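
The table gives the new objective only as `MSE + PCC`. One plausible reading, shown here as a NumPy sketch, adds the mean squared error to `1 - PCC`, with the Pearson correlation taken per pathway across spots and then averaged. The weighting `lam`, the `eps` stabiliser, and the choice of axis are assumptions rather than the project's confirmed formulation.

```python
import numpy as np

def mse_pcc_loss(pred, target, lam=1.0, eps=1e-8):
    """MSE plus lam * (1 - mean per-pathway Pearson r); illustrative only."""
    pred = np.asarray(pred, dtype=np.float64)      # (n_spots, n_pathways)
    target = np.asarray(target, dtype=np.float64)  # (n_spots, n_pathways)
    mse = np.mean((pred - target) ** 2)
    # Centre each pathway column, then correlate across spots.
    p = pred - pred.mean(axis=0, keepdims=True)
    t = target - target.mean(axis=0, keepdims=True)
    pcc = (p * t).sum(axis=0) / (
        np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0)) + eps
    )
    return float(mse + lam * (1.0 - pcc.mean()))
```

The MSE term anchors the absolute scale of each score, while the correlation term rewards getting the spatial pattern right even when the scale is off.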
71 changes: 71 additions & 0 deletions docs/SVG_HEST_EXPLORATORY_ANALYSIS.md
@@ -0,0 +1,71 @@
# SVG Exploratory Analysis & Pathway Curation Report

This report summarizes the data-driven analysis of Spatially Variable Pathways across the HEST dataset (95 samples, 85 valid human samples after QC) conducted to refine the targets for model training.

## 1. Methodology

The analysis was performed using a standalone utility (`scripts/analyze_svg.py`) that:
1. **Strips common gene prefixes** (e.g., `GRCh38_`, `GRCm38_`) to ensure compatibility with MSigDB Hallmark gene sets.
2. **Computes pathway activities** using a sum-aggregation method (normalized to 10k target sum).
3. **Calculates Moran's I** for each of the 50 Hallmark pathways per sample.
4. **Aggregates statistics** (mean, median, std, etc.) across all valid human samples.
5. **Analyzes correlations** between spot-level pathway activities to understand redundancy.
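
For reference, Moran's I for one pathway on one slide can be sketched as follows. The k-nearest-neighbour binary weights and both helper names are assumptions; this report does not state which neighbourhood definition `scripts/analyze_svg.py` uses.

```python
import numpy as np

def knn_weights(coords, k=6):
    """Symmetric binary weights linking each spot to its k nearest spots."""
    coords = np.asarray(coords, dtype=np.float64)
    # Dense pairwise distances: O(n^2) memory, fine for a sketch.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a spot is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    w = np.zeros_like(d)
    w[np.arange(len(coords))[:, None], nearest] = 1.0
    return np.maximum(w, w.T)  # symmetrise

def morans_i(values, weights):
    """I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    z = np.asarray(values, dtype=np.float64)
    z = z - z.mean()
    denom = (z ** 2).sum()
    total_w = weights.sum()
    if denom == 0 or total_w == 0:
        return float("nan")  # undefined for constant input or no neighbours
    return float(len(z) / total_w * (z @ weights @ z) / denom)
```

Values near +1 indicate that neighbouring spots carry similar scores, values near 0 indicate no spatial structure, and negative values indicate alternating patterns.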

## 2. Global Spatial Autocorrelation Results

The following plot shows the ranked mean Moran's I across 85 human samples for all 50 Hallmark pathways.

![Global SVG Analysis](./assets/reports/svg_analysis_full.png)

### Key Observations:
* **Widespread Spatial Structure**: All 50 pathways exhibit positive spatial autocorrelation (Mean Moran's I > 0.15).
* **High-Signal Pathways**: Top-ranked pathways include **MYC Targets V1** (0.665), **E2F Targets** (0.639), **G2M Checkpoint** (0.633), and **Oxidative Phosphorylation** (0.631).
* **Variance vs. Spatiality**: High expression variance does not always equate to high spatial coherence. Some pathways vary significantly between spots but lack a spatially organized pattern.

---

## 3. CRC Pathway Curation

Based on these results, the curated list of pathways for Colorectal Cancer (CRC) was validated. Although spatial autocorrelation varies across the set, all 14 selected hallmarks exceed the retention baseline of **Mean Moran's I > 0.20** and are therefore retained for training.

| Status | Pathway | Mean Moran's I | % Samples > 0.05 |
| :--- | :--- | :--- | :--- |
| ✅ **Retained** | EPITHELIAL_MESENCHYMAL_TRANSITION | 0.602 | 98.8% |
| ✅ **Retained** | DNA_REPAIR | 0.554 | 91.8% |
| ✅ **Retained** | APOPTOSIS | 0.547 | 100.0% |
| ✅ **Retained** | P53_PATHWAY | 0.546 | 92.9% |
| ✅ **Retained** | HYPOXIA | 0.539 | 100.0% |
| ✅ **Retained** | APICAL_JUNCTION | 0.498 | 100.0% |
| ✅ **Retained** | INFLAMMATORY_RESPONSE | 0.487 | 100.0% |
| ✅ **Retained** | PI3K_AKT_MTOR_SIGNALING | 0.483 | 91.8% |
| ✅ **Retained** | KRAS_SIGNALING_UP | 0.469 | 98.8% |
| ✅ **Retained** | IL6_JAK_STAT3_SIGNALING | 0.408 | 98.8% |
| ✅ **Retained** | TGF_BETA_SIGNALING | 0.397 | 94.1% |
| ✅ **Retained** | ANGIOGENESIS | 0.339 | 94.1% |
| ✅ **Retained** | WNT_BETA_CATENIN_SIGNALING | 0.302 | 90.6% |
| ✅ **Retained** | KRAS_SIGNALING_DN | 0.250 | 95.3% |
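
The two summary columns can be derived from a samples × pathways matrix of per-sample Moran's I values. This is a sketch under that assumption: `summarise_morans` is an illustrative name, and NaN is assumed to mark a pathway that could not be scored in a given sample.

```python
import numpy as np

def summarise_morans(per_sample_i, threshold=0.05):
    """Return (mean Moran's I, % of samples above threshold) per pathway."""
    x = np.asarray(per_sample_i, dtype=np.float64)  # (n_samples, n_pathways)
    valid = ~np.isnan(x)
    n_valid = valid.sum(axis=0)
    # Mean over the samples in which the pathway was scored.
    mean_i = np.where(valid, x, 0.0).sum(axis=0) / n_valid
    # Unscored entries are mapped to -inf so they never count as "above".
    above = (np.where(valid, x, -np.inf) > threshold).sum(axis=0)
    return mean_i, 100.0 * above / n_valid
```

With 85 valid samples, a value such as 98.8% corresponds to 84 of 85 samples exceeding the threshold.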

### Rationale:
Although **WNT/β-catenin** and **KRAS_DN** have lower Moran's I scores (0.30 and 0.25 respectively) than **EMT** (0.60), both remain above the background noise floor (~0.15). Their relative spatial uniformity likely reflects constitutive activation by driver mutations (e.g., APC mutations switching WNT "on" across the whole tissue), but the remaining spatial gradients are biologically important for capturing tumor margins and stroma-epithelial interactions.

---

## 4. Pathway Correlation & Redundancy

To ensure the model is learning distinct biological signals, we analyzed the correlation between spot-level activities of the 14 CRC pathways.

![Pathway Correlations](./assets/reports/pathway_correlations_full.png)

### Correlation Insights:
* **Biological Axes**: Strong correlations exist between **Angiogenesis** and **EMT** (r=0.749), and between **TGF-β** and **Apoptosis** (r=0.668). These axes represent co-regulated spatial processes.
* **Distinct Signals**: Despite these correlations, each pathway provides a unique biological "view" of the tissue. Retaining the full set allows the model to learn complex regulatory relationships rather than just isolated spatial patterns.
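
A sketch of the redundancy check, assuming spot-level activities are stacked into one spots × pathways array. The function name and the 0.6 cut-off are illustrative choices, not values from this report.

```python
import numpy as np

def correlated_pairs(activities, names, min_abs_r=0.6):
    """List pathway pairs whose spot-level Pearson |r| meets the cut-off."""
    # rowvar=False treats each column (pathway) as one variable.
    r = np.corrcoef(np.asarray(activities, dtype=np.float64), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= min_abs_r:
                pairs.append((names[i], names[j], float(r[i, j])))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))
```

Pairs surfaced this way, such as Angiogenesis and EMT, are candidates for inspection rather than automatic removal.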

**Conclusion**: All 14 CRC pathways exhibit sufficient spatial structure and biological relevance to be included as training targets. This ensures the model learns a comprehensive representation of the CRC tissue microenvironment.

---

## 5. Technical Improvements

During this analysis, two critical fixes were implemented:
1. **Gene Prefix Stripping**: Fixed an issue where samples like `TENX175` had all-zero pathway scores because gene names were prefixed with `GRCh38_`.
2. **Sample Compatibility Check**: Added a check for Hallmark gene overlap to automatically skip mouse samples or low-density panels that cannot be accurately scored using human Hallmark sets.
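
Both fixes can be sketched in a few lines of plain Python. The helper names are hypothetical, the prefix list covers only the two examples named in this report, and the overlap threshold used to skip a sample is not stated here, so the sketch returns the overlap fraction and leaves the cut-off to the caller.

```python
import re

# Genome-build prefixes named in the report; tolerate repeated underscores.
_PREFIX = re.compile(r"^(GRCh38|GRCm38)_+")

def strip_gene_prefix(name):
    """Remove a leading genome-build prefix, e.g. 'GRCh38_TP53' -> 'TP53'."""
    return _PREFIX.sub("", name)

def hallmark_overlap(gene_names, hallmark_genes):
    """Fraction of Hallmark genes found among the prefix-stripped names."""
    present = {strip_gene_prefix(g) for g in gene_names}
    hallmark = set(hallmark_genes)
    if not hallmark:
        return 0.0
    return len(present & hallmark) / len(hallmark)
```

Mouse symbols such as `Trp53` do not match the upper-case human Hallmark symbols, so mouse samples show near-zero overlap even after prefix stripping.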