fgseaGPU

GPU-accelerated Fast pre-ranked Gene Set Enrichment Analysis. A re-implementation of alserglab/fgsea with the permutation null and per-pathway enrichment score computed in batched LibTorch tensor operations that run on CUDA / ROCm / Apple Metal, with a portable C++23 std::execution::par_unseq CPU fallback.

The R interface mirrors fgsea::fgseaSimple — drop-in replacement.

Why

The hot loop of GSEA is the permutation null: for each pathway, compute the enrichment-score running sum for thousands of random gene-set permutations. Per-permutation work is independent, with no data dependencies — embarrassingly parallel. The original fgsea exploits a clever per-CPU-core segment-tree algorithm to get O(n + mK log K). On a GPU we can collapse the entire null into three fused tensor ops (scatter, cumsum, reduce) and run permutations by the million.

Project layout

fgseaGPU/
├── DESCRIPTION                     # R package metadata
├── NAMESPACE
├── LICENSE                         # MIT, with upstream attribution
├── R/fgsea.R                       # public R API
├── src/
│   ├── fgsea_core.h                # algorithm definition + types
│   ├── fgsea_cpu.h                 # std::execution CPU backend
│   ├── fgsea_gpu.h                 # LibTorch GPU backend (opt-in)
│   ├── fgsea_dispatch.h            # picks device, chunks batch
│   ├── rcpp_glue.cpp               # R <-> C++ marshalling
│   ├── RcppExports.cpp
│   └── Makevars                    # R build (Linux, macOS)
├── tests/testthat/                 # R-side tests, incl. vs upstream fgsea
├── tests/cpp/test_core.cpp         # standalone C++ unit tests
├── bench/fgsea_bench.cpp           # CLI benchmark
└── CMakeLists.txt                  # standalone C++ build

Installation

CPU-only (no GPU)

# In R, with the working directory at the repo root:
install.packages(".", repos = NULL, type = "source")
library(fgseaGPU)
fgseaBackendInfo()
# $torch_built FALSE   $device "cpu"   $concurrency 16

With the Torch GPU backend

# Grab LibTorch from https://pytorch.org/get-started/locally/ (cxx11 ABI)
# Then point R at it:
export FGSEA_TORCH=1
export FGSEA_TORCH_HOME=$HOME/opt/libtorch
export FGSEA_TORCH_CUDA=1   # if your LibTorch is a CUDA build
R CMD INSTALL .

On macOS with Apple Silicon, install a LibTorch build with MPS enabled. The package auto-detects MPS at runtime.

Precise (Neumaier-compensated) ES accumulation

The enrichment-score walk in calcEs uses naive += summation by default, which is fast and accurate to ~1e-12 relative error for typical pathway sizes. If you're running 10⁶+ permutation studies on very large gene sets and want the accumulation precision pinned at one machine epsilon regardless of summand count, build with the opt-in Neumaier path:

FGSEA_PRECISE=1 R CMD INSTALL .

Cost when enabled: two extra floating-point adds and one compare per gene-set member walked. Combinable with FGSEA_TORCH=1.

Usage

Three entry points mirror the upstream API:

library(fgseaGPU)
data(examplePathways, package = "fgsea")
data(exampleRanks,    package = "fgsea")

# Multilevel (default): accurate small p-values, CPU-parallel across pathways
res <- fgsea(examplePathways, exampleRanks)

# Simple permutation null: GPU-accelerated when LibTorch is built in
res_simple <- fgseaSimple(examplePathways, exampleRanks,
                          nperm = 10000, device = "auto")

# Over-representation analysis (Fisher / hypergeometric)
foraRes <- fora(examplePathways,
                genes    = tail(names(exampleRanks), 200),
                universe = names(exampleRanks))

Routing rule: fgsea(...) delegates to fgseaMultilevel by default, or to fgseaSimple if you supply nperm — matching the upstream convention.

Entry point	Best for	Backend
`fgseaSimple`	Many permutations, preranked stats	GPU (LibTorch)
`fgseaMultilevel`	Accurate small p-values (down to eps)	CPU `par_unseq`
`fgseaPhenotype`	Classical phenotype-permutation on a matrix	CPU `par_unseq`, GPU matmul
`fora`	Set-based over-representation	CPU `par_unseq`

fgseaPhenotype is the Subramanian-style mode that upstream fgsea skips: input is an expression matrix plus a two-class label vector. Each permutation shuffles the labels, recomputes a per-gene rank metric (signal-to-noise, Welch's t, or class-mean variants) via Welford-stable accumulation, re-ranks, and walks ES on that fresh ranking. The GPU path turns the metric pass into a [G, S] @ [B, S]ᵀ → [G, B] matmul and reuses the same scatter-cumsum-reduce kernel as fgseaSimple with stats varying per column.

Why no GPU for multilevel?

Multilevel splitting builds dependent MCMC chains: each move conditions on the previous state, so a single chain can't be vectorised onto a GPU thread block. We instead parallelise across pathways with std::execution::par_unseq, which scales linearly to the number of CPU cores for typical pathway counts (5k–25k).

Algorithm notes

The enrichment-score formula is the literal definition from Subramanian et al. 2005, with the Korotkevich et al. weighting refinement. We do not use the upstream square-root segment-tree heuristic — its inherently sequential cumulative-sum step doesn't map onto a GPU. We instead pay O(n) per permutation and recover the constant-factor loss by running tens of thousands of permutations in parallel.
Random sampling on the device uses the argsort-of-uniform-noise trick for batched k-subsets. Reproducibility is guaranteed by a per-call splitmix-derived seed.
Multiple-testing correction is Benjamini–Hochberg over all kept pathways, matching upstream.

Multiple-testing correction: Storey q-values

The padj column in every output now holds a Storey-Tibshirani q-value rather than a Benjamini–Hochberg adjusted p-value. The q-value is the FDR analogue but with the proportion of true nulls π₀ estimated from the high-p tail of the empirical distribution rather than assumed to be 1. This gives strictly more power than BH whenever π₀ < 1 (typical in pathway analysis with a moderate-to-strong signal), and reduces to BH in the worst case.

For the continuous pipelines (fgseaSimple, fgseaMultilevel) we use the bootstrap λ-selection rule of Storey (2002): for a grid of λ in {0.05, 0.10, …, 0.95} we estimate π₀(λ) and pick the λ minimising the bootstrap MSE against the empirical minimum.

For fora, the hypergeometric tail is discrete, so naive q-values inherit the well-known conservatism of one-sided discrete tests. We apply the Lancaster / Heyse (2011) mid-p adjustment

midP = P(X ≥ q) − ½ · P(X = q)

before feeding the sequence into the Storey machinery. The output exposes both pval (the strict upper-tail probability) and midP columns, with padj derived from the latter.

The estimated π₀ is attached to each output as an attribute:

res <- fgsea(pathways, stats)
attr(res, "pi0")         # numeric, in (0, 1]
attr(res, "padj.method") # "storey-tibshirani"

If you need vanilla BH for compatibility, run stats::p.adjust(res$pval, "BH") on the pval column.

Differences vs upstream `fgsea`

A zero observed ES (degenerate gene set, all-zero member statistics) raises a domain_error rather than silently returning an empty leading edge — fail loud is friendlier than fail quiet here.
padj is a Storey q-value, not BH (see section above).
fora returns an additional midP column.
fgseaMultilevel exposes a floored flag indicating whether the reported p-value hit the eps floor before the level threshold bracketed the observed ES.
geseca, collapsePathways, plotting helpers and the gene-set preparation utilities are not yet ported.

Pushing to GitHub

The repo is initialised locally. To fork to your account:

cd ~/Documents/Claude/Projects/fgseaGPU
gh repo create paliocha/fgseaGPU \
    --source=. --public \
    --description "GPU/Torch-backed fast preranked GSEA" \
    --push

License

MIT, with attribution to the upstream fgsea authors. See LICENSE and inst/UPSTREAM_LICENCE.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
R		R
bench		bench
inst		inst
man		man
src		src
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.lintr		.lintr
CMakeLists.txt		CMakeLists.txt
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
cleanup		cleanup
configure		configure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fgseaGPU

Why

Project layout

Installation

CPU-only (no GPU)

With the Torch GPU backend

Precise (Neumaier-compensated) ES accumulation

Usage

Why no GPU for multilevel?

Algorithm notes

Multiple-testing correction: Storey q-values

Differences vs upstream `fgsea`

Pushing to GitHub

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

fgseaGPU

Why

Project layout

Installation

CPU-only (no GPU)

With the Torch GPU backend

Precise (Neumaier-compensated) ES accumulation

Usage

Why no GPU for multilevel?

Algorithm notes

Multiple-testing correction: Storey q-values

Differences vs upstream fgsea

Pushing to GitHub

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Differences vs upstream `fgsea`

Packages