Native Rust hyperparameter optimization — Optuna-style without reloading data #83

@eprifti

Context

Issue #77 proposed Optuna in Python (gpredomicspy). But Python-level optimization reloads data from disk for every trial — wasteful when the dataset is large (e.g., wetlab 1981×918 matrix).

A native Rust implementation would:

  1. Load data once
  2. Run hundreds of parameter trials in-memory
  3. Use the same feature selection cache across trials
  4. Cut per-trial overhead by orders of magnitude compared with spawning a Python subprocess for each trial
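
To make point 3 concrete, here is a minimal sketch of a feature selection cache that survives across trials. It assumes selection depends only on a minimal prevalence threshold; `FeatureCache`, `select`, and `misses` are hypothetical names for illustration, not existing gpredomics code.

```rust
use std::collections::HashMap;

/// Caches selected feature indices per prevalence threshold (in percent).
/// Hypothetical helper: the real cache key would likely include more
/// parameters (e.g. data_type).
pub struct FeatureCache {
    cache: HashMap<u32, Vec<usize>>,
    /// Number of times the selection had to be recomputed.
    pub misses: usize,
}

impl FeatureCache {
    pub fn new() -> Self {
        Self { cache: HashMap::new(), misses: 0 }
    }

    /// `prevalence[j]` is the percentage of samples in which feature `j`
    /// is present. Returns the indices of features at or above `min_pct`.
    pub fn select(&mut self, prevalence: &[f64], min_pct: u32) -> &[usize] {
        if !self.cache.contains_key(&min_pct) {
            // First trial with this threshold: compute and store.
            self.misses += 1;
            let threshold = f64::from(min_pct);
            let selected: Vec<usize> = prevalence
                .iter()
                .enumerate()
                .filter_map(|(j, &p)| (p >= threshold).then_some(j))
                .collect();
            self.cache.insert(min_pct, selected);
        }
        &self.cache[&min_pct]
    }
}
```

Two trials that sample the same threshold then share one computation, which is where the saving over a per-trial subprocess would come from.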

Design

Core: optimize() function in lib.rs

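pub fn optimize(
    data: &Data,
    base_param: &Param,
    search_space: &SearchSpace,
    n_trials: usize,
    metric: OptMetric,        // TestAUC, TestSpearman, CVMeanAUC, etc.
    mut sampler: Sampler,     // TPE (default), Random, Grid
) -> OptResult {
    // Data is loaded ONCE by the caller and shared across all trials.
    // History, suggest(), run_trial() and best() are placeholder names.
    let mut history = History::new();
    for trial in 0..n_trials {
        // Start from base_param and override only the sampled fields
        let param = sampler.suggest(base_param, search_space, &history);
        let result = run_trial(data, &param, metric);  // no disk I/O
        history.push(trial, param, result);
    }
    // Owned copy of the best parameters and their metric value
    let (best_params, best_value) = history.best();
    OptResult { best_params, best_value, history }
}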
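
The loop above can be sketched as a self-contained random search. This is an illustration under stated assumptions, not the proposed implementation: `Data`, `Trial`, and `XorShift64` are stand-ins invented here, only one parameter (`k_penalty`) is sampled, and the objective is passed in as a closure so the example needs no gpredomics types.

```rust
/// Hypothetical stand-in for the loaded dataset (not the gpredomics `Data`).
pub struct Data {
    pub values: Vec<f64>,
}

/// One evaluated trial: the sampled parameter and the score it obtained.
#[derive(Debug, Clone, PartialEq)]
pub struct Trial {
    pub k_penalty: f64,
    pub score: f64,
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(pub u64);

impl XorShift64 {
    /// Returns a pseudo-random f64 in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Random search: borrows `data` for every trial and returns the full
/// history plus the index of the highest-scoring trial (None if no trials).
pub fn optimize<F>(
    data: &Data,
    n_trials: usize,
    seed: u64,
    objective: F,
) -> (Vec<Trial>, Option<usize>)
where
    F: Fn(&Data, f64) -> f64,
{
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut history: Vec<Trial> = Vec::with_capacity(n_trials);
    let mut best: Option<usize> = None;
    for _ in 0..n_trials {
        // Uniform draw over the k_penalty range used in the YAML example.
        let k_penalty = 1e-5 + rng.next_f64() * (1e-2 - 1e-5);
        let score = objective(data, k_penalty); // in-memory, no disk I/O
        history.push(Trial { k_penalty, score });
        let idx = history.len() - 1;
        if best.map_or(true, |b| score > history[b].score) {
            best = Some(idx);
        }
    }
    (history, best)
}
```

A caller would load the data once, then call something like `optimize(&data, 100, 42, |d, k| score_model(d, k))`, where `score_model` is whatever function fits and evaluates one model.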

Sampler options

  1. Random — uniform random sampling (baseline)
  2. TPE (Tree-structured Parzen Estimator) — Optuna's default, Bayesian
  3. Grid — exhaustive grid search for small spaces
  4. CMA-ES — covariance matrix adaptation for continuous params
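
As an illustration of the Grid option, the sketch below enumerates the cartesian product of discrete choices. The function name `grid_points` and the string-only representation are assumptions made for this example; continuous parameters would need to be discretized first.

```rust
/// Enumerates every combination of the given discrete choices
/// (cartesian product). The last dimension varies fastest.
/// An empty `dims` yields a single empty point; a dimension with no
/// choices yields no points at all.
pub fn grid_points(dims: &[Vec<&str>]) -> Vec<Vec<String>> {
    let mut points: Vec<Vec<String>> = vec![Vec::new()];
    for choices in dims {
        let mut next = Vec::with_capacity(points.len() * choices.len());
        for prefix in &points {
            for choice in choices {
                // Extend each partial point with one more coordinate.
                let mut point = prefix.clone();
                point.push((*choice).to_string());
                next.push(point);
            }
        }
        points = next;
    }
    points
}
```

The number of points is the product of the dimension sizes, so this only suits small spaces, as the list notes.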

Search space definition (in param.yaml)

optimize:
  n_trials: 100
  metric: test_auc           # or cv_mean_auc, spearman, etc.
  sampler: tpe
  search_space:
    algo: [ga, beam, sa, ils, lasso]
    k_penalty: {log_uniform: [1e-5, 0.01]}
    language: [ter, "bin,ter", "bin,ter,ratio"]
    data_type: [prev, raw, "raw,prev"]
    population_size: {int: [500, 10000]}
    cooling_rate: {uniform: [0.99, 0.9999]}
    feature_minimal_prevalence_pct: {int: [5, 30]}
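
One possible in-memory form of these entries is sketched below. The `Dimension` and `Value` enums are hypothetical names, not an existing gpredomics API, and YAML parsing is left out. Each dimension maps a uniform draw `u` in [0, 1) to a value, so any sampler only needs to produce numbers in that interval.

```rust
/// One entry of the search space (illustrative names).
#[derive(Debug, Clone)]
pub enum Dimension {
    Categorical(Vec<String>),
    Uniform { low: f64, high: f64 },
    LogUniform { low: f64, high: f64 },
    Int { low: i64, high: i64 },
}

/// A sampled value for one parameter.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Float(f64),
    Int(i64),
}

impl Dimension {
    /// Maps a uniform draw `u` in [0, 1) to a value of this dimension.
    pub fn sample(&self, u: f64) -> Value {
        match self {
            Dimension::Categorical(choices) => {
                // Assumes a non-empty choice list (panics otherwise).
                let idx = ((u * choices.len() as f64) as usize).min(choices.len() - 1);
                Value::Str(choices[idx].clone())
            }
            Dimension::Uniform { low, high } => Value::Float(low + u * (high - low)),
            Dimension::LogUniform { low, high } => {
                // Interpolate in log space, then map back; requires low > 0.
                Value::Float((low.ln() + u * (high.ln() - low.ln())).exp())
            }
            Dimension::Int { low, high } => {
                // Inclusive range: each integer gets an equal share of [0, 1).
                let span = (high - low + 1) as f64;
                Value::Int((low + (u * span) as i64).min(*high))
            }
        }
    }
}
```

With `log_uniform: [1e-5, 0.01]`, a draw of `u = 0.5` lands on the geometric midpoint of the range (about 3.16e-4) rather than the arithmetic one, which is why a log scale suits a penalty spanning three orders of magnitude.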

Key advantages over Python Optuna

| | Python Optuna (#77) | Native Rust |
|---|---|---|
| Data loading | Once per trial (subprocess) | Once total |
| Feature selection | Recomputed per trial | Cached |
| Overhead per trial | ~2s (process spawn + data I/O) | ~0ms |
| 100 trials on Qin2014 | ~200s + algo time | ~algo time only |
| Parallelism | Python GIL limited | Full rayon parallelism |
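
The shared-data point can be illustrated without any external crate. The sketch below uses `std::thread::scope` from the standard library instead of rayon (a deliberate substitution to keep the example dependency-free); `evaluate_parallel` is a hypothetical name, and the dataset is reduced to a slice of floats.

```rust
use std::thread;

/// Evaluates every candidate against the same borrowed dataset, one scoped
/// thread per candidate, and returns the scores in candidate order.
/// The dataset is never copied: each thread only holds a reference to it.
pub fn evaluate_parallel<F>(data: &[f64], candidates: &[f64], objective: F) -> Vec<f64>
where
    F: Fn(&[f64], f64) -> f64 + Sync,
{
    thread::scope(|s| {
        let handles: Vec<_> = candidates
            .iter()
            .map(|&c| {
                let objective = &objective;
                // The scope guarantees the thread ends before `data` is dropped.
                s.spawn(move || objective(data, c))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("trial thread panicked"))
            .collect()
    })
}
```

Spawning one thread per candidate is only reasonable for small batches; a work-stealing pool such as rayon's would bound the thread count for hundreds of trials.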

Implementation phases

  1. Random sampler + grid — simplest, proves the architecture
  2. TPE sampler — port the core algorithm (kernel density estimation)
  3. Pruning — early stopping of unpromising trials (median pruner)
  4. CLI integration: gpredomics --optimize param.yaml
  5. Web app — "Tune" button that calls optimize() via gpredomicspy
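
For phase 3, a simplified median rule could look like the sketch below. It is not a port of Optuna's pruner: it assumes a metric where higher is better (e.g. AUC), compares only the current step's value, and `should_prune` and its parameters are names invented for this example.

```rust
/// Median of a non-empty slice (averages the two middle values for even lengths).
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.total_cmp(b));
    let n = values.len();
    if n % 2 == 1 {
        values[n / 2]
    } else {
        (values[n / 2 - 1] + values[n / 2]) / 2.0
    }
}

/// Returns true when the running trial should be stopped at `step`.
/// `completed[i][s]` is the intermediate score of finished trial `i` at
/// step `s`. Trials that never reached `step` are ignored.
pub fn should_prune(
    completed: &[Vec<f64>],
    step: usize,
    value: f64,
    min_trials: usize,
) -> bool {
    let mut at_step: Vec<f64> = completed
        .iter()
        .filter_map(|trial| trial.get(step).copied())
        .collect();
    if at_step.len() < min_trials.max(1) {
        return false; // not enough evidence yet, let the trial continue
    }
    // Prune only when strictly below the median of earlier trials.
    value < median(&mut at_step)
}
```

In a GA or beam search trial, `step` could be the generation or epoch index and `value` the best score so far; the `min_trials` guard keeps the first few trials from being pruned against too little history.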

References

  • Akiba et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD.
  • Bergstra et al. (2011). Algorithms for Hyper-Parameter Optimization. NeurIPS. Introduces the Tree-structured Parzen Estimator (TPE).
