ferrix: Fe (iron) + matrix — a low-level ML compute library written from scratch in Rust.
- No external ML dependencies — no candle, no burn, no tch
- Everything from scratch — like NumPy, but in Rust
- Strongly typed — zero unsound generics, `Element` trait bounds everywhere (see the sketch after this list)
- Zero-copy where possible — Arc-backed `Storage<T>` enables shared tensor views
- Designed to be extended — clean module boundaries, composable abstractions
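The `Element` bound referenced above isn't spelled out in this README. As a rough, hypothetical sketch (names and methods are assumptions, not ferrix's actual trait), such a bound typically gathers the `Copy` and arithmetic requirements every kernel needs, so generic code can stick to a single `T: Element` clause:

```rust
use std::fmt::Debug;
use std::ops::{Add, Div, Mul, Sub};

/// Hypothetical numeric-element bound; ferrix's real `Element` trait may differ.
pub trait Element:
    Copy + Debug + PartialOrd
    + Add<Output = Self> + Sub<Output = Self>
    + Mul<Output = Self> + Div<Output = Self>
{
    fn zero() -> Self;
    fn one() -> Self;
}

impl Element for f32 {
    fn zero() -> Self { 0.0 }
    fn one() -> Self { 1.0 }
}

impl Element for f64 {
    fn zero() -> Self { 0.0 }
    fn one() -> Self { 1.0 }
}

/// A generic kernel only needs the one bound.
fn dot<T: Element>(a: &[T], b: &[T]) -> T {
    a.iter().zip(b).fold(T::zero(), |acc, (&x, &y)| acc + x * y)
}

fn main() {
    assert_eq!(dot(&[1.0f32, 2.0], &[3.0, 4.0]), 11.0);
}
```

Centralizing the numeric requirements in one trait keeps the rest of the generic code free of ad-hoc `where` clauses.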
| Module | What's inside |
|---|---|
| `tensor/` | `Tensor<T>`, `Shape`, `Storage<T>`, strided views, broadcasting |
| `ops/` | Element-wise, reduce, matmul (naive + tiled + batched; tiling sketched after this table), unary math |
| `autograd/` | Reverse-mode autodiff, computation `Tape`, `Variable` |
| `nn/` | `Linear`, `Conv2d`, `MaxPool2d`, `BatchNorm1d`, `Sequential`, ReLU, Sigmoid, Tanh, GELU, Softmax, MSE/BCE/CE loss |
| `optim/` | SGD (+ momentum, Nesterov, weight decay), Adam (+ AMSGrad) |
| `optim/scheduler` | `StepLR`, `CosineAnnealingLR` with linear warm-up |
| `utils/` | Xavier / Kaiming / uniform / normal init; `DataLoader` + `Dataset` |
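The `ops/` row mentions a tiled matmul. As a minimal illustration of the blocking idea (not ferrix's own kernel), the loops visit fixed-size blocks so the parts of `a` and `b` being reused stay hot in cache:

```rust
/// Illustrative tiled matmul over row-major f32 buffers; `c` must be zeroed
/// by the caller because each k-block accumulates into it.
const TILE: usize = 64;

fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i0 in (0..m).step_by(TILE) {
        for k0 in (0..k).step_by(TILE) {
            for j0 in (0..n).step_by(TILE) {
                // Work on one TILE x TILE block at a time.
                for i in i0..(i0 + TILE).min(m) {
                    for kk in k0..(k0 + TILE).min(k) {
                        let a_ik = a[i * k + kk];
                        for j in j0..(j0 + TILE).min(n) {
                            c[i * n + j] += a_ik * b[kk * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, k, n) = (3, 4, 2);
    let a: Vec<f32> = (0..m * k).map(|v| v as f32).collect();
    let b: Vec<f32> = (0..k * n).map(|v| v as f32).collect();
    let mut c = vec![0.0f32; m * n];
    matmul_tiled(&a, &b, &mut c, m, k, n);
    // C[0][0] = 0*0 + 1*2 + 2*4 + 3*6 = 28
    assert_eq!(c[0], 28.0);
}
```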
```toml
# Cargo.toml
[dependencies]
ferrix = { path = "." }
```

```rust
use ferrix::tensor::Tensor;

let a = Tensor::<f32>::randn(vec![3, 4]);
let b = Tensor::<f32>::ones(vec![3, 4]);
let c = &a + &b;
println!("{}", c);
```

```rust
use ferrix::autograd::Variable;
use ferrix::tensor::Tensor;

let x = Variable::new(Tensor::from_vec(vec![2.0f32, 3.0], vec![2]).unwrap(), true);
let y = Variable::new(Tensor::from_vec(vec![1.0f32, 1.0], vec![2]).unwrap(), true);
let z = x.mul(&y).sum_all(); // z = sum(x * y)
z.backward().unwrap();
println!("dz/dx = {:?}", x.grad()); // should be [1.0, 1.0]
println!("dz/dy = {:?}", y.grad()); // should be [2.0, 3.0]
```

```bash
cargo run --example train_xor
cargo run --example train_cnn
```

```text
ferrix — XOR Training Example
================================
Epoch 0 | Loss: 0.693147
Epoch 200 | Loss: 0.421803
...
Epoch 2000 | Loss: 0.018472
Final predictions:
[0.0, 0.0] → 0.0321 (pred=0, exp=0) ✓
[0.0, 1.0] → 0.9712 (pred=1, exp=1) ✓
[1.0, 0.0] → 0.9698 (pred=1, exp=1) ✓
[1.0, 1.0] → 0.0289 (pred=0, exp=0) ✓
Accuracy: 4/4
```
```bash
# Build (zero warnings)
cargo build

# Run all tests
cargo test

# Run benchmarks (matmul naive vs tiled, relu/sigmoid throughput)
cargo bench
```

```text
ferrix/
├── src/
│   ├── tensor/    # Tensor<T>, Shape, Storage<T>, indexing, iterators
│   ├── ops/       # broadcast, elementwise, reduce, matmul, unary
│   ├── autograd/  # Tape (DAG), Variable, backward pass
│   ├── nn/        # Module trait, Linear, Conv2d, MaxPool2d, BatchNorm1d, Sequential, activations, losses
│   ├── optim/     # Optimizer trait, SGD, Adam, schedulers
│   └── utils/     # initializers, DataLoader
├── tests/         # Integration tests per module
├── benches/       # Criterion benchmarks
└── examples/
    ├── train_xor.rs  # End-to-end MLP training
    └── train_cnn.rs  # End-to-end CNN training
```
`Storage<T>` uses `Arc<Vec<T>>` for shared ownership. A slice (e.g., `tensor.row(0)`) returns a new `Tensor` pointing into the same buffer with a different offset and `len` — no copy.
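A minimal sketch of that layout, with assumed field names rather than ferrix's real `Storage<T>`/`Tensor<T>` definitions: slicing clones the `Arc`, not the data, so both views read the same heap buffer.

```rust
use std::sync::Arc;

/// Illustrative Arc-backed buffer (field names are assumptions).
struct Storage<T> {
    data: Arc<Vec<T>>,
}

struct View<T> {
    storage: Storage<T>, // shares the buffer
    offset: usize,
    len: usize,
}

impl<T: Copy> View<T> {
    /// "Slicing" produces a new view into the same allocation: only the
    /// Arc's reference count changes, no element is copied.
    fn slice(&self, start: usize, len: usize) -> View<T> {
        assert!(start + len <= self.len);
        View {
            storage: Storage { data: Arc::clone(&self.storage.data) },
            offset: self.offset + start,
            len,
        }
    }

    fn as_slice(&self) -> &[T] {
        &self.storage.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let base = View {
        storage: Storage { data: Arc::new(vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]) },
        offset: 0,
        len: 6,
    };
    let row = base.slice(3, 3); // second "row" of a 2x3 tensor, zero-copy
    assert_eq!(row.as_slice(), &[4.0, 5.0, 6.0]);
}
```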
The `Tape` is a flat `Vec<Node>` recorded in execution order. Each `Variable` holds an index into it. `backward()` topologically sorts the tape and walks it in reverse, calling each node's stored backward closure.
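Here is a toy scalar version of that pattern (names and signatures are assumptions, not ferrix's API): a flat node vector, parent indices, and one stored closure per node that maps the node's upstream gradient to its parents' gradients; `backward` seeds the root with 1.0 and walks the vector back to front.

```rust
use std::cell::RefCell;

/// Each node knows its parents and how to turn its own upstream gradient
/// into gradients for those parents.
struct Node {
    parents: Vec<usize>,
    backward: Box<dyn Fn(f32) -> Vec<f32>>,
}

#[derive(Default)]
struct Tape {
    nodes: RefCell<Vec<Node>>,
}

impl Tape {
    fn push(&self, parents: Vec<usize>, backward: Box<dyn Fn(f32) -> Vec<f32>>) -> usize {
        let mut nodes = self.nodes.borrow_mut();
        nodes.push(Node { parents, backward });
        nodes.len() - 1
    }

    fn leaf(&self) -> usize {
        self.push(vec![], Box::new(|_: f32| Vec::new()))
    }

    fn mul(&self, a: usize, b: usize, av: f32, bv: f32) -> usize {
        // d(a*b)/da = b, d(a*b)/db = a
        self.push(vec![a, b], Box::new(move |g: f32| vec![g * bv, g * av]))
    }

    fn backward(&self, root: usize) -> Vec<f32> {
        let nodes = self.nodes.borrow();
        let mut grads = vec![0.0; nodes.len()];
        grads[root] = 1.0;
        // The tape is recorded in execution order, so reverse index order is
        // already a valid topological order for this sketch.
        for i in (0..nodes.len()).rev() {
            let parent_grads = (nodes[i].backward)(grads[i]);
            for (&p, g) in nodes[i].parents.iter().zip(parent_grads) {
                grads[p] += g;
            }
        }
        grads
    }
}

fn main() {
    let tape = Tape::default();
    let (x, y) = (tape.leaf(), tape.leaf()); // x = 2.0, y = 3.0
    let z = tape.mul(x, y, 2.0, 3.0);        // z = x * y
    let grads = tape.backward(z);
    assert_eq!((grads[x], grads[y]), (3.0, 2.0)); // dz/dx = y, dz/dy = x
}
```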
SGD and Adam mutate the underlying storage in place when applying the gradient step, avoiding extra allocations. The `unsafe` block is sound because parameter updates are sequential and the tensor isn't aliased at that point.
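A hedged sketch of such an in-place step (hypothetical helper, not the crate's optimizer API): the safe counterpart of that invariant is `Arc::get_mut`, which only yields `&mut` access while the buffer has exactly one owner, i.e. while no other view is alive.

```rust
use std::sync::Arc;

/// Illustrative in-place SGD update: writes straight into the parameter
/// buffer, so no temporary tensor is allocated per step.
fn sgd_step(param: &mut [f32], grad: &[f32], lr: f32) {
    assert_eq!(param.len(), grad.len());
    for (p, g) in param.iter_mut().zip(grad) {
        *p -= lr * g;
    }
}

fn main() {
    // With Arc-backed storage, Arc::get_mut succeeds only while the buffer
    // is not aliased -- the same invariant the unsafe fast path relies on.
    let mut weights = Arc::new(vec![0.5f32, -0.25, 1.0]);
    let grads = vec![0.1f32, 0.2, -0.3];

    let buf = Arc::get_mut(&mut weights).expect("no other views alive during the step");
    sgd_step(buf, &grads, 0.01);

    assert!((weights[0] - 0.499).abs() < 1e-6);
}
```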
MIT