BenjaminIsaac0111 · Jun 10, 2026
@@ -262,18 +262,7 @@ hest_data/patches/TENX175.h5
 hest_data/st/TENX175.h5ad
 hest_data/tissue_seg/TENX175_contours.geojson
 hest_data/wsis/TENX175.tif
-.idea/.gitignore
-.idea/csv-editor.xml
-.idea/deployment.xml
-.idea/jupyter-settings.xml
-.idea/misc.xml
-.idea/modules.xml
-.idea/SpatialTranscriptFormer.iml
-.idea/vcs.xml
-.idea/inspectionProfiles/profiles_settings.xml
-.idea/inspectionProfiles/Project_Default.xml
-.idea/runConfigurations/STF_Compute_Pathways.xml
-.idea/runConfigurations/STF_Train_PrimaryPathway.xml
+.idea/
 .gemini/settings.json
 .gemini/agents/literature-search.md
 .gemini/agents/test-triage.md

@@ -0,0 +1,52 @@
+# Changelog
+
+All notable changes to the SpatialTranscriptFormer project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+---
+
+## [Unreleased]
+
+### Added
+- Created `CHANGELOG.md` documenting project history, milestones, and design choices.
+- Documented the role of Moran's I (diagnostic target validation and spatial representation collapse detection) in [PATHWAY_MAPPING.md](docs/PATHWAY_MAPPING.md) and [spatial_stats.py](src/spatial_transcript_former/data/spatial_stats.py).
+
+### Changed
+- Refactored baseline models (`HE2RNA`, `ViT_ST` in [regression.py](src/spatial_transcript_former/models/regression.py)) to accept `num_pathways` instead of `num_genes` and directly regress pathway activities.
+- Corrected console script entry points in [pyproject.toml](pyproject.toml) to map to `recipes/hest/` instead of `data/`.
+- Updated [setup.ps1](setup.ps1) and [setup.sh](setup.sh) to suggest `stf-compute-pathways` instead of `stf-build-vocab`.
+- Cleaned up parameter descriptions and docstrings in [dataset.py](src/spatial_transcript_former/recipes/hest/dataset.py), [trainer.py](src/spatial_transcript_former/training/trainer.py), and [checkpoint.py](src/spatial_transcript_former/checkpoint.py).
+- Completely updated documentation files ([DATALOADER.md](docs/DATALOADER.md), [MODELS.md](docs/MODELS.md), [SC_BEST_PRACTICES.md](docs/SC_BEST_PRACTICES.md), [TRAINING_GUIDE.md](docs/TRAINING_GUIDE.md), [TESTING.md](docs/TESTING.md), [PRECOMPUTED_WORKFLOW.md](docs/PRECOMPUTED_WORKFLOW.md), [DATA_FORMAT.md](docs/DATA_FORMAT.md)) to reflect the pathway-exclusive paradigm and remove legacy gene-reconstruction references.
+
+### Removed
+- Deleted obsolete gene vocabulary builder script `build_vocab.py`.
+- Deleted obsolete gene availability analysis document [GENE_ANALYSIS.md](docs/GENE_ANALYSIS.md).
+
+---
+
+## [0.2.0] - 2026-06
+
+### Added
+- Integrated a multi-loss framework combining Concordance Correlation Coefficient (CCC), Huber loss, and CLIP-style contrastive loss to improve target convergence and model robustness.
+- Added direct supervision head for pre-computed pathway targets, eliminating circular dependency issues from older auxiliary pathway loss architectures.
+- Created public inference API and model wrapping framework.
+- Introduced Moran's I diagnostics for Spatially Variable Gene (SVG) selection and spatial pattern evaluation.
+- Added licensing disclaimers and specific attribution details for MSigDB Hallmark gene sets (CC BY 4.0), HEST-1k dataset, and third-party foundation models (CTransPath, Phikon).
+
+### Fixed
+- Resolved `TypeError` in the transformer encoder by passing `enable_nested_tensor=False` to PyTorch's `TransformerEncoder` constructor.
+- Configured pytest warnings filter in `pyproject.toml` to suppress non-critical output noise (e.g. deprecations from third-party libraries).
+
+---
+
+## [0.1.0] - 2026-03
+
+### Added
+- Initialized core package architecture, modules, test suite, and scripts.
+- Implemented the quad-flow interaction system (early fusion of spatial transcriptomics and whole-slide histology features).
+- Added `LocalPatchMixer` module (Scatter-Gather depthwise 2D convolutions) to introduce localized spatial inductive biases into slide spot processing.
+- Added support for pre-computing histology feature extraction (e.g. using CTransPath) and building KD-Tree representations for spatial neighbor retrieval.
+- Developed an interactive Matplotlib visualization widget to overlay predicted pathway activities on histology slide coordinates.
+- Set up GitHub Actions CI workflow for automated testing.
@@ -1,59 +1,64 @@
 # HEST Dataloader Documentation
 
-The `SpatialTranscriptFormer` uses a custom PyTorch dataloader designed for memory-efficient loading of large-scale spatial transcriptomics datasets.
+The `SpatialTranscriptFormer` uses custom PyTorch dataloaders designed for memory-efficient loading of large-scale spatial transcriptomics datasets. The framework supports two loading paths: loading raw histology patches or loading pre-extracted feature vectors.
 
 ## Core Implementation Details
 
-The implementation is located in `src/spatial_transcript_former/data/dataset.py`.
+The implementation is located in [dataset.py](../src/spatial_transcript_former/recipes/hest/dataset.py).
 
-### 1. `HEST_Dataset` Class
+### 1. Raw-Patch Loading Path
 
-This class implements the standard `torch.utils.data.Dataset` interface.
+This path is used when training or evaluating directly on pixel-space images.
 
-- **Lazy Loading**: To avoid overwhelming memory, it uses lazy loading for H5 file handles. File objects are initialized only when the first item is requested (typically within a worker process).
-- **Indexing**: It supports an optional `indices` map, which allows it to represent a subset of the original data (e.g., after filtering for valid ST spots) without duplicating arrays in memory.
-- **Transformation**: Images are permuted from `(H, W, C)` to `(C, H, W)` and normalized to `[0, 1]`.
+*   **`HEST_Dataset` Class**: Loads raw histology patches from a HEST `.h5` file. It supports:
+    *   **Lazy File Access**: File handles are created lazily inside each worker process to avoid pickling issues during multiprocessing.
+    *   **Neighbourhood Context**: Can retrieve a patch along with its $K$ nearest neighbours.
+    *   **Dihedral Augmentation**: Randomly rotates or flips patch pixels and coordinates in sync.
+*   **`get_hest_dataloader`**: High-level orchestrator that creates a `DataLoader` over raw patches for a list of sample IDs, combining individual datasets using `ConcatDataset`.
+*   **Returned Tuples**: The dataloader yields `(patches, None, rel_coords)`; the second element, formerly the gene expression counts, is now `None`.
 
-### 2. `load_gene_expression_matrix`
+### 2. Pre-Computed Feature Loading Path
 
-This utility function handles the complex process of aligning image patches to gene expression data.
+This is the default path used by the SpatialTranscriptFormer training pipeline (`--precomputed`), as it avoids repeated backbone inference.
 
-- **Barcode Alignment**: Since not every image patch in an `.h5` file necessarily has a corresponding transcriptomic profile in the `.h5ad` file, the function performs a lookup using the spot barcodes.
-- **Gene Selection**: It can either:
-  1. Select the top `N` most expressed genes from a single sample.
-  2. Align the current sample to a predefined list of global gene names (filling missing genes with zeros).
-- **Sparse Support**: It handles both dense and sparse (CSR) matrix formats in the `.h5ad` file.
+*   **`HEST_FeatureDataset` Class**: Loads pre-extracted feature vectors (e.g. CTransPath, Phikon) from `.pt` files and aligns them to pre-computed pathway activity targets from `.h5` files.
+    *   **Spot barcode alignment**: Filters features to keep only spots that passed quality control (QC) in the corresponding `.h5ad` file.
+    *   **Stationary Coordinate Normalisation**: Normalises coordinates relative to the slide's centroid and standard deviation so coordinates are invariant to batching.
+    *   **Patch Mode**: Returns a single spot feature vector, its local neighbourhood features (optionally with random dropout augmentation), pre-computed pathway targets, and relative coordinates.
+    *   **Whole-Slide Mode**: Returns all spots on the slide as a single sequence.
+*   **`get_hest_feature_dataloader`**: Builds a `DataLoader` over the feature datasets.
+    *   In **patch mode**, yields standard batched tensors `(feats, None, pathway_acts, coords)`.
+    *   In **whole-slide mode**, pads variable-length slides to the longest slide in the batch and appends a boolean padding mask. Yields `(padded_feats, None, padded_pathways, padded_coords, mask)`.
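
The whole-slide padding described above can be illustrated with a minimal sketch. It is not the project's collate function: plain Python lists stand in for tensors, `pad_slides` is a hypothetical helper name, and the mask polarity (`True` marks a padded position, as in PyTorch's `key_padding_mask`) is an assumption.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_slides(
    slides: Sequence[List[Vector]], pad_value: float = 0.0
) -> Tuple[List[List[Vector]], List[List[bool]]]:
    """Pad variable-length slides to the longest one and build a padding mask.

    Each slide is a list of per-spot feature vectors. The mask convention
    (True = padded position) is an assumption; the project may use the
    opposite polarity.
    """
    if not slides:
        return [], []
    max_len = max(len(slide) for slide in slides)
    # Infer the feature width from the first non-empty slide.
    dim = next((len(slide[0]) for slide in slides if slide), 0)

    padded, mask = [], []
    for slide in slides:
        n_pad = max_len - len(slide)
        filler = [[pad_value] * dim for _ in range(n_pad)]
        padded.append([list(v) for v in slide] + filler)
        mask.append([False] * len(slide) + [True] * n_pad)
    return padded, mask


slides = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # slide with 3 spots
    [[7.0, 8.0]],                          # slide with 1 spot
]
padded_feats, mask = pad_slides(slides)
```

Here the second slide is extended to three spots with zero vectors, and its mask row flags the two padded positions so that attention and the loss can ignore them.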
 
-### 3. `get_hest_dataloader`
+---
 
-The high-level orchestrator that creates a unified dataloader for multiple samples.
-
-- **Sample Concatenation**: It iterates through multiple sample IDs and creates individual `HEST_Dataset` instances, which are then combined using `torch.utils.data.ConcatDataset`.
-- **Global Gene Lock**: The first sample found "locks" the gene list (usually the top 1000 genes). Every subsequent sample in the loop is then aligned to this specific set of genes to ensure consistent input dimensions for the model.
-
-## Usage Example
+## Usage Example (Pre-Computed Features)
 
 ```python
-from spatial_transcript_former.data import get_hest_dataloader
+from spatial_transcript_former.recipes.hest.dataset import get_hest_feature_dataloader
 
-# IDs from your metadata split
+# Pre-selected training sample IDs
 train_ids = ['MEND29', 'TENX156', ...]
 
-dataloader = get_hest_dataloader(
-    root_dir="A:/hest_data",
+dataloader = get_hest_feature_dataloader(
+    root_dir="./hest_data",
     ids=train_ids,
     batch_size=32,
     shuffle=True,
     num_workers=4,
-    num_genes=1000
+    n_neighbors=6,
+    pathway_targets_dir="./hest_data/pathway_activities"
 )
 
-for patches, gene_counts in dataloader:
-    # patches shape: (BS, 3, 224, 224)
-    # gene_counts shape: (BS, 1000)
+for feats, _, pathway_acts, rel_coords in dataloader:
+    # feats shape: (BS, 1 + n_neighbors, feature_dim)
+    # pathway_acts shape: (BS, num_pathways)
+    # rel_coords shape: (BS, 1 + n_neighbors, 2)
     ...
 ```
 
-## Stratified Splitting
+---
+
+## Patient-Aware Stratified Splitting
 
-For robust evaluation, we use `split_hest_patients` in `src/spatial_transcript_former/data/splitting.py`. This ensures that all samples from a single patient go into the same split (Train/Val/Test), preventing data leakage due to biological similarities between slides from the same donor.
+To prevent data leakage due to biological similarities between multiple slides from the same donor, all samples from a single patient are assigned to the same split (train/val/test). The splitting logic is located in [splitting.py](../src/spatial_transcript_former/recipes/hest/splitting.py) and exposed via the `stf-split` command.
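
As a rough illustration of the patient-level constraint, the sketch below groups sample IDs by patient and assigns whole patients to a split. It is not the code in `splitting.py`: `split_by_patient`, its parameters, and the sample and patient IDs are hypothetical, and it only enforces grouping (it does not balance splits by organ or any other stratum).

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_patient(
    sample_to_patient: Dict[str, str],
    val_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split sample IDs into train/val so that no patient appears in both."""
    by_patient = defaultdict(list)
    for sample_id, patient_id in sample_to_patient.items():
        by_patient[patient_id].append(sample_id)

    # Shuffle patients (not samples) with a fixed seed for reproducibility.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Keep at least one patient on each side whenever there are two or more.
    n_val = round(len(patients) * val_fraction)
    n_val = min(max(n_val, 1), len(patients) - 1) if len(patients) > 1 else 0
    val_patients = set(patients[:n_val])

    train = sorted(s for p in patients[n_val:] for s in by_patient[p])
    val = sorted(s for p in val_patients for s in by_patient[p])
    return train, val


# Hypothetical mapping: two patients contribute two slides each.
sample_to_patient = {"S1": "P1", "S2": "P1", "S3": "P2", "S4": "P3", "S5": "P3"}
train_ids, val_ids = split_by_patient(sample_to_patient, val_fraction=0.34)
```

Because the shuffle operates on patient IDs, slides `S1` and `S2` (both from `P1`) always end up on the same side of the split, whichever side that is.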
@@ -22,7 +22,10 @@ your_data_dir/
 │   ├── sample1.pt
 │   ├── sample2.pt
 │   └── ...
-└── global_genes.json             # Generated by stf-build-vocab
+└── pathway_activities/           # Pre-computed pathway targets
+    ├── sample1.h5
+    ├── sample2.h5
+    └── ...
 ```
 
 *Note: If you are training from scratch on raw image crops, you would have a `patches/` directory with `.h5` files instead of the `he_features_ctranspath/` directory.*
@@ -41,7 +44,7 @@ If you processed your data using standard tools like **10x SpaceRanger** and loa
 #### `var` (Variables / Genes)
 
 - Must contain an index representing the gene names (e.g., standard HGNC symbols like `TRAP1`, `BRCA1`).
-- These names are used by `stf-build-vocab` to map your dataset to biological pathways (like MSigDB Hallmarks).
+- These names are used by `stf-compute-pathways` to map your dataset to biological pathways (like MSigDB Hallmarks) during target pre-computation.
 
 #### `X` (Expression Matrix)
 
@@ -75,7 +78,7 @@ adata.uns['spatial'] = {
 Once your files match the structure above:
 
 1. **Place data:** Put your `.h5ad` files into `your_data_dir/st/`.
-2. **Build Vocabulary:** Run `stf-build-vocab --data-dir your_data_dir/`. This will scan all your `.h5ad` files, find the most highly expressed genes, map them to biological pathways, and generate `global_genes.json`.
+2. **Compute Pathway Targets:** Run `stf-compute-pathways --data-dir your_data_dir/`. This will process your `.h5ad` files, apply spot quality control, and pre-compute the Hallmark pathway activity target matrices, which are saved to `your_data_dir/pathway_activities/`.
 3. **Extract Features (Optional):** If you haven't already, run the feature extraction pipeline (e.g., `stf-extract`) to generate the `.pt` files in `he_features_ctranspath/`.
 4. **Train:** You are now ready to run `stf-train`!
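
Before launching training, it can help to confirm that every sample has all three artefacts in place. The following is a minimal sketch, not part of the package: `find_incomplete_samples` is a hypothetical helper, and it assumes the `.h5ad`, `.pt`, and `.h5` files of one sample share the same file stem, as the directory tree above suggests.

```python
from pathlib import Path
from typing import Dict, List


def find_incomplete_samples(
    data_dir: str, feature_dir: str = "he_features_ctranspath"
) -> Dict[str, List[str]]:
    """Map each sample ID found in st/ to the expected files that are missing."""
    root = Path(data_dir)
    missing: Dict[str, List[str]] = {}
    for h5ad in sorted((root / "st").glob("*.h5ad")):
        sample_id = h5ad.stem
        expected = [
            root / feature_dir / f"{sample_id}.pt",            # extracted features
            root / "pathway_activities" / f"{sample_id}.h5",   # pathway targets
        ]
        absent = [p.relative_to(root).as_posix() for p in expected if not p.exists()]
        if absent:
            missing[sample_id] = absent
    return missing
```

An empty dictionary means every `.h5ad` file has a matching feature file and pathway target file; otherwise each key lists what still needs to be generated for that sample.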
 

@@ -56,7 +56,7 @@ The spatial relationships of gene expression are central to this model. It is no
 
 1. **Positional Encoding** — Each patch token receives a 2D sinusoidal encoding of its (x, y) coordinate on the tissue. This means the pathway tokens, when they attend to patches, can distinguish *where* each patch is. A pathway token can learn that EMT is localised at the tumour-stroma boundary, not uniformly across the slide.
 
-2. **PCC Loss (Spatial Pattern Coherence)** — The Pearson Correlation component in the composite loss measures whether the *spatial pattern* of each gene's predicted expression matches the ground truth pattern, independently of scale. A model that predicts the same value everywhere scores PCC = 0, even if the mean is correct. This directly penalises spatial collapse.
+2. **PCC Loss (Spatial Pattern Coherence)** — The Pearson Correlation component in the composite loss measures whether the *spatial pattern* of each pathway's predicted activity matches the ground truth pattern, independently of scale. A model that predicts the same value everywhere scores PCC = 0, even if the mean is correct. This directly penalises spatial collapse.
 
 Together, these ensure the model learns *spatially-varying* pathway activation maps rather than slide-level averages.
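
To make the scale-independence and the collapse penalty concrete, here is a small pure-Python sketch of the Pearson correlation across the spots of one slide. It is illustrative only: the function name is hypothetical, and stabilising the denominator with a small `eps` (so that a constant prediction yields 0 rather than a division by zero) is an assumption about how such a term is usually implemented, not a description of the project's loss code.

```python
import math
from typing import Sequence


def pearson(pred: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Pearson correlation across spots; eps keeps the zero-variance case finite."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + eps)


target = [0.25, 0.5, 0.75, 0.5]           # ground-truth activity per spot
collapsed = [0.5, 0.5, 0.5, 0.5]          # constant prediction with the correct mean
rescaled = [2 * t + 5 for t in target]    # right pattern, wrong scale and offset

pcc_collapsed = pearson(collapsed, target)
pcc_rescaled = pearson(rescaled, target)
```

The collapsed prediction has the right mean but scores 0, while the rescaled prediction scores close to 1 despite being numerically far from the target, which is why the PCC term is paired with a scale-sensitive term such as MSE.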
 
@@ -157,8 +157,8 @@ where $\hat{h}_i$ and $\hat{p}_k$ are the L2-normalised patch and pathway tokens
 
 | Mode | Input | Output | Supervision |
 | :--- | :--- | :--- | :--- |
-| **Dense (whole-slide)** | All patches from a slide | Per-patch gene predictions $(B, S, G)$ | Masked MSE+PCC at each spot |
-| **Global** | All patches from a slide | Slide-level prediction $(B, G)$ | Mean-pooled expression |
+| **Dense (whole-slide)** | All patches from a slide | Per-patch pathway predictions $(B, S, P)$ | Masked MSE+PCC at each spot |
+| **Global** | All patches from a slide | Slide-level pathway prediction $(B, P)$ | Mean-pooled pathway activities |
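
For the dense mode, "masked" supervision means that padded positions contribute nothing to the loss. A minimal sketch for a single slide of shape $(S, P)$ follows. Plain lists stand in for tensors, `masked_mse` is a hypothetical name, and the mask polarity (`True` marks padding) is an assumption.

```python
from typing import Sequence


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    pad_mask: Sequence[bool],
) -> float:
    """Mean squared error over real spots only, for one slide of shape (S, P)."""
    total, count = 0.0, 0
    for p_row, t_row, is_pad in zip(pred, target, pad_mask):
        if is_pad:
            continue  # padded position: excluded from both the sum and the count
        total += sum((p - t) ** 2 for p, t in zip(p_row, t_row))
        count += len(p_row)
    return total / count if count else 0.0


pred = [[1.0, 2.0], [0.0, 0.0], [9.0, 9.0]]
target = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
pad_mask = [False, False, True]  # the third spot is padding
loss = masked_mse(pred, target, pad_mask)
```

The large error at the padded third spot is ignored, so the loss is averaged over the four real entries only and comes out as 2.0.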
 
 ---
 

@@ -59,6 +59,14 @@ attrs:
 
 These files are consumed at training time by `HEST_FeatureDataset` when `--pathway-targets-dir` is provided (which defaults to `<data_dir>/pathway_activities`).
 
+### The Role of Moran's I in the Project
+
+Historically, Moran's I was introduced to rank and select Spatially Variable Genes (SVGs) when the model was trained on high-dimensional gene expression reconstruction. With the transition to the strictly pathway-exclusive architecture, its role has shifted:
+
+- **Why it is NOT used in the Loss**: Down-weighting pathways with low Moran's I during training was dropped because it is counterproductive. Crucial cancer pathways (e.g., Wnt/β-catenin) can exhibit low spatial autocorrelation across spots due to constitutive activation (from driver mutations like APC), yet remain key targets that the model must predict.
+- **Why it is kept as a Diagnostic**: The pre-computed `pathway_morans_i` dataset in the `.h5` files acts as a slide-level spatial signature. It is used to curate disease-specific pathway sets (ensuring targets are above a background noise floor of ~0.15) and for offline biological analysis.
+- **Role in Validation (Collapse Detection)**: During validation, the training engine dynamically computes the Pearson correlation of predicted vs. ground-truth Moran's I vectors across the pathways. If the model suffers from spot-level representation collapse (predicting identical mean values everywhere), the predicted Moran's I drops to 0, which immediately registers as a drop in the validation `spatial_coherence` score.
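
The collapse behaviour can be seen in a small sketch of global Moran's I for one pathway on one slide. This is not the code in `spatial_stats.py`: the binary adjacency weights and the choice to return 0.0 when the values have no variance are assumptions made for illustration.

```python
from typing import Sequence


def morans_i(
    values: Sequence[float],
    weights: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> float:
    """Global Moran's I for one pathway over the spots of one slide.

    weights[i][j] is the spatial weight between spots i and j (0 on the
    diagonal). Returns 0.0 when the values have no variance, which is the
    collapse case; this convention is an assumption.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(sum(row) for row in weights)
    if denom < eps or w_sum == 0:
        return 0.0
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / denom)


# Four spots on a line; each spot neighbours the adjacent spots (binary weights).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = [0.0, 0.0, 1.0, 1.0]       # spatially clustered activity
alternating = [0.0, 1.0, 0.0, 1.0]  # neighbours always differ
collapsed = [0.5, 0.5, 0.5, 0.5]    # identical prediction at every spot
```

On this toy slide the clustered pattern gives a positive value (about 0.33), the alternating pattern gives -1, and the collapsed prediction gives 0. A vector of such per-pathway values for predictions that are all near 0 correlates poorly with the ground-truth vector, which is what the validation score picks up.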
+
 ### Usage
 
 ```bash