The challenge can be found at https://kaggle.com/competitions/echoes-of-silenced-genes.
Citation:
@misc{echoes-of-silenced-genes,
author = {Myllia Biotechnology},
title = {Myllia| Echoes of Silenced Genes: A Cell Challenge},
year = {2026},
howpublished = {\url{https://kaggle.com/competitions/echoes-of-silenced-genes}},
note = {Kaggle}
}
Predicting responses of a human cancer cell line to CRISPR perturbations
Single-cell RNA-seq of most human cell types. Genes are perturbed using CRISPRi to decrease transcription rates.
Only slightly adapted from the Competition page.
The WMAE for a single perturbation is given by:
with true and predicted being delta expression values across
The weights vector is calculated using t values of the moderated t-statistic (limma package),
with
If the perturbation targets gene with index
The resulting weight vector is defined as:
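As a rough illustration of the per-perturbation metric, the sketch below computes a weighted mean absolute error between true and predicted delta expression vectors. It takes the weight vector as given and does not reproduce the limma-based weight computation. Dividing by the sum of the weights is an assumption made here, not something taken from the competition page, and the function name `weighted_mae` is hypothetical.

```python
def weighted_mae(true_delta, pred_delta, weights):
    """Weighted mean absolute error over genes for one perturbation.

    Assumption: the result is normalized by the sum of the weights.
    The competition's exact normalization may differ.
    """
    if not (len(true_delta) == len(pred_delta) == len(weights)):
        raise ValueError("all inputs must have one entry per gene")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_errors = (
        w * abs(t - p) for t, p, w in zip(true_delta, pred_delta, weights)
    )
    return sum(weighted_errors) / total_weight


# Toy example with two genes; the first gene carries three times the weight.
example_score = weighted_mae([1.0, 0.0], [0.0, 0.0], [3.0, 1.0])
```

Genes with larger weights dominate the score, so an error on a strongly responding gene costs more than the same error on an unaffected gene.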
Over L perturbations the WMAE ratio score W is then calculated as
where
The baseline prediction is defined as the arithmetic mean of the log fold-change vectors across the 80 training perturbations. Let
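The baseline described above can be sketched in a few lines. The example assumes the training log fold-changes are available as a list of equal-length vectors, one per perturbation; the function name `baseline_prediction` and the toy values are hypothetical.

```python
def baseline_prediction(train_lfc):
    """Gene-wise arithmetic mean of log fold-change vectors.

    train_lfc: list of per-perturbation vectors, all of the same length.
    Returns one vector that is used as the prediction for every perturbation.
    """
    if not train_lfc:
        raise ValueError("need at least one training perturbation")
    n_perts = len(train_lfc)
    n_genes = len(train_lfc[0])
    if any(len(row) != n_genes for row in train_lfc):
        raise ValueError("all vectors must cover the same genes")
    return [sum(row[g] for row in train_lfc) / n_perts for g in range(n_genes)]


# Toy example: two training perturbations, two genes.
baseline = baseline_prediction([[1.0, 2.0], [3.0, 6.0]])
```

Because the baseline ignores which gene was perturbed, a model only improves on it by capturing perturbation-specific effects.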
We define the weighted cosine similarity score
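A weighted cosine similarity is commonly computed by applying the weights inside both the dot product and the two norms. The sketch below assumes that common form; the competition's exact definition is not reproduced here. Returning 0.0 when a norm is zero is also an assumption, and the name `weighted_cosine` is hypothetical.

```python
import math


def weighted_cosine(a, b, weights):
    """Cosine similarity with per-gene weights in the dot product and norms.

    Assumption: sum(w*a*b) / (sqrt(sum(w*a*a)) * sqrt(sum(w*b*b))).
    Assumption: returns 0.0 if either weighted norm is zero.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("all inputs must have one entry per gene")
    dot = sum(w * x * y for x, y, w in zip(a, b, weights))
    norm_a = math.sqrt(sum(w * x * x for x, w in zip(a, weights)))
    norm_b = math.sqrt(sum(w * y * y for y, w in zip(b, weights)))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Identical directions give 1, opposite directions give -1.
same_direction = weighted_cosine([1.0, 0.0], [2.0, 0.0], [1.0, 1.0])
opposite_direction = weighted_cosine([1.0, 0.0], [-1.0, 0.0], [1.0, 1.0])
```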
Let the smoothstep function be defined as:
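For orientation, the function most often meant by "smoothstep" is the clamped cubic 3t² − 2t³. The sketch below implements that standard form; whether the competition uses exactly this variant, and how its gating constants map onto the edges, is an assumption. The parameters `edge0` and `edge1` are hypothetical names.

```python
def smoothstep(x, edge0=0.0, edge1=1.0):
    """Standard cubic smoothstep: 0 below edge0, 1 above edge1, smooth between.

    Assumption: the common 3t^2 - 2t^3 form; the competition may use another.
    """
    if edge1 <= edge0:
        raise ValueError("edge1 must be greater than edge0")
    # Rescale x to [0, 1] and clamp values outside the edges.
    t = (x - edge0) / (edge1 - edge0)
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


# The midpoint between the edges maps to 0.5.
midpoint_value = smoothstep(1.5, edge0=1.0, edge1=2.0)
```

The function has zero slope at both edges, which is what makes it suitable as a smooth gate rather than a hard threshold.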
We fix the gating constants
and the (smooth) gate weight
Let
(If the denominator is zero, define
5127 genes, 120 perturbations (0.5 validation, 0.5 test)
Possible Approaches:
- VAE
- Transformer
- GNN
- build an integrated pathway map based on Reactome
- take perturbation --> check all pathways it relates to --> give this as an additional feature (one-hot)
- sequence depth, genes, cell types
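The pathway-feature idea from the list above could look roughly like the following. It assumes a pathway map is already available as a dictionary from pathway name to member genes; the names `pathway_features`, `pathway_A`, `GENE1` and so on are placeholders, not real Reactome identifiers. Since one gene can belong to several pathways, the resulting vector is strictly speaking multi-hot rather than one-hot.

```python
def pathway_features(target_gene, pathway_to_genes):
    """Binary membership vector of a perturbed gene over all pathways.

    pathway_to_genes: dict mapping pathway name -> set of gene symbols
    (hypothetical structure, e.g. derived from a Reactome export).
    Pathways are sorted by name so the feature order is stable.
    """
    pathway_names = sorted(pathway_to_genes)
    return [
        1 if target_gene in pathway_to_genes[name] else 0
        for name in pathway_names
    ]


# Placeholder pathway map with made-up names.
toy_pathways = {
    "pathway_A": {"GENE1", "GENE2"},
    "pathway_B": {"GENE2", "GENE3"},
    "pathway_C": {"GENE4"},
}
features_gene2 = pathway_features("GENE2", toy_pathways)
```

A perturbation whose target is in no known pathway gets an all-zero vector, so the model would need another signal to handle such genes.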
Preprocessing
- get different datasets (myllia, vcc, replogle)
- merge into one big adata object with obs = sgrna_symbol, RNAcount, featurecount, mitocount
- drop all genes not in the submission → loses biological complexity, but would otherwise be too expensive
- if a dataset is missing genes, impute them using the mean non-targeting expressions from myllia
  - possible TODO: use reactome for educated guess of the expression
- only use vcc & replogle for training and all myllia as val & test / for fine-tuning?
- train/test split: use 30 perts from myllia.h5ad for test set (and all )
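The gene filtering and imputation steps above can be sketched without any particular library. The example assumes each dataset is represented as a dictionary from gene symbol to per-cell values and that the mean non-targeting expression per gene has already been computed from the myllia data. All names (`align_to_submission_genes`, `GENE1`, the toy values) are hypothetical; a real pipeline would operate on the AnnData objects instead.

```python
def align_to_submission_genes(expr_by_gene, submission_genes, nontargeting_mean):
    """Restrict a dataset to the submission genes and impute missing ones.

    expr_by_gene: dict gene -> list of per-cell expression values.
    submission_genes: ordered list of genes required in the submission.
    nontargeting_mean: dict gene -> mean expression in non-targeting cells.
    Genes absent from the dataset are filled with the non-targeting mean.
    """
    if not expr_by_gene:
        raise ValueError("dataset contains no genes")
    n_cells = len(next(iter(expr_by_gene.values())))
    aligned = {}
    for gene in submission_genes:
        if gene in expr_by_gene:
            aligned[gene] = list(expr_by_gene[gene])
        else:
            # Imputation: constant value for every cell in this dataset.
            aligned[gene] = [nontargeting_mean[gene]] * n_cells
    return aligned


# Toy dataset with two cells; GENE3 is missing and GENE9 is not needed.
toy_dataset = {"GENE1": [0.0, 1.0], "GENE9": [5.0, 5.0]}
aligned = align_to_submission_genes(
    toy_dataset, ["GENE1", "GENE3"], {"GENE1": 0.4, "GENE3": 2.5}
)
```

Imputed genes carry no perturbation signal in that dataset, so it may be worth tracking which genes were imputed and masking them in the training loss.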
ML
- goal: learn the expression patterns → graph-based approach?
- Transformer-based