Myllia| Echoes of Silenced Genes: A Cell Challenge

The Challenge can be found here.
Citation:

@misc{echoes-of-silenced-genes,
    author = {Myllia Biotechnology},
    title = {Myllia| Echoes of Silenced Genes: A Cell Challenge},
    year = {2026},
    howpublished = {\url{https://kaggle.com/competitions/echoes-of-silenced-genes}},
    note = {Kaggle}
}

Problemset

Predicting responses of a human cancer cell line to CRISPR perturbations

Data

Single-cell RNA-seq of most human cell types. Genes are perturbed using CRISPRi to decrease transcription rates.

Evaluation

Only slightly adapted from the Competition page.

Weighted Mean Absolute Error (WMAE)

The WMAE for a single perturbation is given by:

$$ WMAE(true, pred) = \frac{1}{n} \sum_{i=1}^{n} w_i \, |true_i - pred_i|, $$

where $true$ and $pred$ are the true and predicted delta expression values across $n$ genes, and $w = (w_1, \dots, w_n) \in \mathbb{R}^n$ is a weight vector satisfying

$$ w_i \ge 0 \text{ for } i = 1,\dots,n, \qquad \sum_{i=1}^{n} w_i = n. $$
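The WMAE definition above can be sketched in a few lines. This is a minimal illustration in plain Python (no NumPy, to keep it dependency-free); the function name `wmae` and the toy numbers are assumptions made for the example, not part of the competition code.

```python
def wmae(true, pred, weights):
    """Weighted mean absolute error over n genes for a single perturbation."""
    n = len(true)
    if len(pred) != n or len(weights) != n:
        raise ValueError("true, pred and weights must have the same length")
    # (1/n) * sum_i w_i * |true_i - pred_i|
    return sum(w * abs(t - p) for t, p, w in zip(true, pred, weights)) / n


# Toy example with n = 4 genes (made-up values).
# The weights are non-negative and sum to n, as required.
true = [0.5, -1.0, 0.0, 2.0]
pred = [0.0, -1.0, 0.5, 1.0]
weights = [2.0, 1.0, 0.5, 0.5]

score = wmae(true, pred, weights)
print(score)
```

Because the weights sum to $n$, uniform weights $w_i = 1$ reduce this to the ordinary mean absolute error.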

The weight vector is calculated from the t-values of the moderated t-statistic (limma package),

$$ t_i=\frac{\text{estimated effect for gene } i}{\text{moderated standard error for gene } i \text{ (from eBayes)}} $$

with $t= (t_1,\dots, t_n) \in\mathbb{R}^n$ for a single perturbation:

$$ c_i = \min(|t_i|+0.1,\, 10), \qquad i=1,\dots,n. $$

If the perturbation targets the gene with index $g \in \lbrace 1,\dots,n \rbrace$, set $c_g = 0$. Let

$$ M = \max_{1 \le j \le n} c_j. $$

The resulting weight vector is defined as:

$$ w_i = n\,\frac{\left(\frac{c_i}{M}\right)^2}{\sum_{k=1}^{n}\left(\frac{c_k}{M}\right)^2}, \qquad i = 1,\dots,n. $$
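As a rough sketch, the weight construction can be written as follows. The function name `wmae_weights`, the `target_index` argument, and the toy t-values are assumptions for illustration; in practice the t-values would come from limma.

```python
def wmae_weights(t_values, target_index=None):
    """Turn moderated t-statistics for one perturbation into WMAE weights."""
    n = len(t_values)
    # c_i = min(|t_i| + 0.1, 10): clip extreme t-values, keep every gene > 0
    c = [min(abs(t) + 0.1, 10.0) for t in t_values]
    # The targeted gene itself gets zero weight
    if target_index is not None:
        c[target_index] = 0.0
    m = max(c)
    scaled_sq = [(ci / m) ** 2 for ci in c]
    total = sum(scaled_sq)
    # Normalise so that the weights sum to n
    return [n * s / total for s in scaled_sq]


# Toy example: 4 genes, the perturbation targets the gene at index 1
t_values = [0.0, 3.0, -12.0, 1.5]
weights = wmae_weights(t_values, target_index=1)
print(weights)
```

Note that $M$ cancels in the ratio, so it only rescales the intermediate values; the final weights depend on the squared $c_i$ alone.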

Over $L$ perturbations, the WMAE ratio score $W$ is then calculated as

$$ W = \sum_{l=1}^{L}\min\left(5, \log_{2}\left(\frac{WMAE_{l}^{base}}{WMAE_{l}^{pred}}\right)\right) $$

where $WMAE_{l}^{base}$ and $WMAE_{l}^{pred}$ are the WMAE values of the baseline and model predictions for the $l$-th perturbation. The cap at 5 only protects against outliers.
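A minimal sketch of the ratio score, assuming the per-perturbation WMAE values have already been computed and are strictly positive. The function name `wmae_ratio_score` and the numbers are made up for the example.

```python
import math


def wmae_ratio_score(wmae_base, wmae_pred, cap=5.0):
    """Sum over perturbations of min(cap, log2(WMAE_base / WMAE_pred))."""
    if len(wmae_base) != len(wmae_pred):
        raise ValueError("need one baseline and one model WMAE per perturbation")
    return sum(min(cap, math.log2(b / p)) for b, p in zip(wmae_base, wmae_pred))


# Toy example with L = 3 perturbations (made-up WMAE values)
wmae_base = [0.4, 0.2, 0.8]
wmae_pred = [0.2, 0.2, 0.0125]  # 2x better, equal, 64x better (capped at 5)

score = wmae_ratio_score(wmae_base, wmae_pred)
print(score)
```

A perturbation where the model only matches the baseline contributes 0, and one where it is worse than the baseline contributes a negative term.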

The baseline prediction is defined as the arithmetic mean of the log fold-change vectors across the 80 training perturbations. Let $x^{(j)} \in \mathbb{R}^n$ denote the log fold-change vector for training perturbation $j$, for $j = 1, \dots, 80$. The baseline prediction vector is

$$ \bar{x} = \frac{1}{80}\sum_{j=1}^{80}x^{(j)} $$
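The baseline is a gene-wise mean, which can be sketched as follows. The function name `baseline_prediction` and the tiny 3x3 input are assumptions for illustration; the real input would hold 80 training vectors.

```python
def baseline_prediction(train_lfc):
    """Gene-wise arithmetic mean of the training log fold-change vectors."""
    n_perts = len(train_lfc)
    n_genes = len(train_lfc[0])
    return [sum(x[g] for x in train_lfc) / n_perts for g in range(n_genes)]


# Toy example: 3 training perturbations x 3 genes (made-up values)
train_lfc = [
    [0.0, 1.0, -2.0],
    [1.0, 1.0, 0.0],
    [2.0, -2.0, -1.0],
]

baseline = baseline_prediction(train_lfc)
print(baseline)
```

The same vector is used as the baseline prediction for every perturbation being scored.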

Weighted Cosine Similarity (Wcos)

We define the weighted cosine similarity score $Wcos$ between two vectors $a, b \in \mathbb{R}^m$. In our setting, $a$ is the single concatenated ground-truth delta vector across all genes in all perturbations of length $m = L \times n$, and $b$ is the corresponding concatenated predictions vector.

Let the smoothstep function be defined as:

$$ s(t) = t^2 (3 - 2t). $$

We fix the gating constants $left = 0$ and $right = 0.3$. For each coordinate in the concatenated vectors $i = 1, \dots, m$, we define

$$ x_i = \max(|a_i|, |b_i|), \quad t_i = \frac{x_i - left}{right - left} = \frac{x_i}{0.3}, \quad \tilde{t}_i = \min(1, \max(0, t_i)), $$

and the (smooth) gate weight

$$ w_i = s(\tilde{t}_i). $$

Let $w_i^2$ denote the squared weights. Then the weighted cosine similarity is

$$ Wcos(a, b) = \frac{\sum_{i=1}^{m} w_i^2 a_i b_i}{\big(\sum_{i=1}^{m} w_i^2 a_i^2\big)^{1/2} \, \big(\sum_{i=1}^{m} w_i^2 b_i^2\big)^{1/2}}. $$

(If the denominator is zero, define $Wcos(a, b) = 0$.)
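The gated cosine similarity can be sketched like this. The function names `smoothstep` and `weighted_cosine` and the example vectors are assumptions for illustration; the formulas follow the definitions above.

```python
import math


def smoothstep(t):
    """s(t) = t^2 (3 - 2t), for t already clamped to [0, 1]."""
    return t * t * (3.0 - 2.0 * t)


def weighted_cosine(a, b, left=0.0, right=0.3):
    """Cosine similarity with smooth gate weights that mute small entries."""
    num = norm_a = norm_b = 0.0
    for ai, bi in zip(a, b):
        x = max(abs(ai), abs(bi))
        t = min(1.0, max(0.0, (x - left) / (right - left)))
        w2 = smoothstep(t) ** 2  # squared gate weight w_i^2
        num += w2 * ai * bi
        norm_a += w2 * ai * ai
        norm_b += w2 * bi * bi
    denom = math.sqrt(norm_a) * math.sqrt(norm_b)
    return num / denom if denom > 0.0 else 0.0


# Toy example (made-up values): the last coordinate disagrees in sign,
# but both entries are far below 0.3, so its gate weight is close to zero.
a = [1.0, -0.5, 0.01]
b = [1.0, -0.5, -0.01]
print(weighted_cosine(a, b))
```

Coordinates where both the truth and the prediction are below the `right` threshold in magnitude contribute little, so the score is dominated by genes with a sizeable effect in at least one of the two vectors.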

Final score

$$ W\times\max(0, Wcos) $$
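For completeness, a tiny sketch of how the two parts combine. The function name `final_score` and the input values are made up for the example.

```python
def final_score(w, wcos):
    """Final score: W * max(0, Wcos)."""
    return w * max(0.0, wcos)


print(final_score(6.0, 0.5))   # positive cosine scales W
print(final_score(6.0, -0.2))  # negative cosine zeroes the score
```

A negative $Wcos$ therefore wipes out any gain in $W$, while a positive $Wcos$ cannot rescue a negative $W$.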

Submission

5127 genes, 120 perturbations (0.5 validation, 0.5 test)

Approach

Possible Approaches:

VAE
Transformer
GNN

build an integrated pathway map based on Reactome
take perturbation --> check all pathways it relates to --> give this as an additional feature (one-hot)
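One way the pathway feature idea could look, as a rough sketch. The gene and pathway names are hypothetical placeholders, and the gene-to-pathway mapping is assumed to have been built from a Reactome export beforehand. Since a gene can sit in several pathways, the vector may contain more than one 1.

```python
def pathway_features(target_gene, gene_to_pathways, pathway_ids):
    """Indicator vector over pathways: 1 where the perturbed gene is a member."""
    index = {p: i for i, p in enumerate(pathway_ids)}
    features = [0] * len(pathway_ids)
    for pathway in gene_to_pathways.get(target_gene, ()):
        if pathway in index:
            features[index[pathway]] = 1
    return features


# Hypothetical toy mapping (placeholder names, not real Reactome identifiers)
gene_to_pathways = {
    "GENE_A": {"PATHWAY_1", "PATHWAY_3"},
    "GENE_B": {"PATHWAY_2"},
}
pathway_ids = ["PATHWAY_1", "PATHWAY_2", "PATHWAY_3"]

features = pathway_features("GENE_A", gene_to_pathways, pathway_ids)
print(features)
```

Genes without any annotated pathway get an all-zero vector, which the model would have to treat as "no pathway information".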

in different datasets look out for

sequencing depth, genes, cell types

External Datasets

Preprocessing

get different datasets (myllia, vcc, replogle)
merge into one big adata object with obs = sgrna_symbol, RNAcount, featurecount, mitocount
drop all genes not in the submission → loses biological complexity, but would otherwise be too expensive
if a dataset is missing genes, impute them using the mean non-targeting expression from myllia
- possible TODO: use reactome for educated guess of the expression
only use vcc & replogle for training and all myllia as val & test / for fine-tuning?
train/test split: use 30 perts from myllia.h5ad for test set (and all )
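The gene alignment and imputation steps from the list above could be sketched as follows. This uses plain lists instead of an AnnData object to stay dependency-free; the function name `align_genes`, the gene names, and all values are assumptions for illustration.

```python
def align_genes(rows, dataset_genes, submission_genes, control_means):
    """Reorder columns to the submission genes; fill missing genes with the
    mean non-targeting (control) expression."""
    col = {g: j for j, g in enumerate(dataset_genes)}
    aligned = []
    for row in rows:
        aligned.append([
            row[col[g]] if g in col else control_means[g]
            for g in submission_genes
        ])
    return aligned


# Toy example (made-up values): the external dataset has an extra gene "GX"
# that is dropped, and lacks "G3", which is imputed from the control means.
rows = [[1.0, 2.0, 3.0]]
dataset_genes = ["G1", "G2", "GX"]
submission_genes = ["G2", "G3", "G1"]
control_means = {"G1": 0.5, "G2": 0.6, "G3": 0.7}

aligned = align_genes(rows, dataset_genes, submission_genes, control_means)
print(aligned)
```

After this step every dataset shares the same gene order, so the matrices can be stacked into one object.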

ML

goal: learn the expression patterns → graph based approach?
Transformer based

