Daily Performance Improvement Research & Plan
Project Overview
Furnace is an F# tensor library with support for differentiable programming, designed for machine learning, probabilistic programming, and optimization. It provides:
Nested and mixed-mode differentiation
PyTorch-familiar naming and idioms
Multiple backends: Reference (CPU-only F#), Torch (TorchSharp/LibTorch with CUDA support)
Common optimizers, model elements, differentiable probability distributions
Current Performance Testing Infrastructure
Benchmarking Framework
BenchmarkDotNet is used for micro-benchmarking
Benchmarks are in tests/Furnace.Benchmarks/
Python comparison benchmarks in tests/Furnace.Benchmarks.Python/
Command to run benchmarks: dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"
Current benchmark matrix tests:
Tensor sizes: 16, 2048, 65536 elements
Data types: int32, float32, float64
Devices: cpu, cuda
Operations: creation (zeros, ones, rand), basic math (addition, scalar ops), matrix ops (matmul)
Current Performance Numbers
From existing benchmark results, the Reference backend significantly outperforms the alternatives on small tensors:
Reference backend is ~100x faster than PyTorch for small tensor operations
TorchSharp/Torch backend is ~2-3x faster than PyTorch but ~50x slower than the Reference backend
Example (16-element float32 tensors on CPU):
PyTorch addition: ~759ms
TorchSharp addition: ~523ms
Reference addition: ~15ms
Performance Bottlenecks Analysis
1. Backend Layer Inefficiencies
TorchSharp Interop Overhead: heavy cost of converting between F# types and TorchSharp C# types
Type Conversion: repeated conversions between Dtype/torch.ScalarType and Device/torch.Device (a conversion-caching sketch follows below)
Handle Management: each tensor operation creates new handles, with disposal overhead
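As a hedged illustration of removing per-call conversion cost, the sketch below computes the Dtype-to-torch.ScalarType mapping once instead of re-deriving it on every operation. The Dtype cases shown are a stand-in for Furnace's real Dtype, and the cached table is an assumption for illustration, not the actual backend code.

open TorchSharp

// Illustrative only: build the Dtype -> torch.ScalarType mapping once,
// rather than re-converting on every tensor operation. This Dtype is a
// simplified stand-in, not Furnace's actual definition.
type Dtype = Float32 | Float64 | Int32

let toScalarType : Dtype -> torch.ScalarType =
    let table = dict [ Float32, torch.ScalarType.Float32
                       Float64, torch.ScalarType.Float64
                       Int32,   torch.ScalarType.Int32 ]
    fun dt -> table.[dt]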
2. Tensor Creation Performance
Reference backend creates tensors ~100x faster than Torch backend
Significant overhead in fromCpuData operations through TorchSharp
3. Memory Management
No explicit tensor disposal (issue #19, "Explicit disposal discussion"); the library currently relies on the GC
TorchSharp tensors accumulate in memory until the GC runs (a dispose-scope sketch follows below)
Potential for memory pressure during intensive computation
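One mitigation, sketched below under the assumption that TorchSharp's dispose-scope API is reachable from the backend at this point, is to bound intermediate tensor lifetimes explicitly instead of waiting for the GC:

open TorchSharp

// Sketch: bound TorchSharp tensor lifetimes with a dispose scope. All
// intermediates created inside the scope are disposed when it closes;
// only the returned tensor is moved out to the caller.
let forwardStep (x: torch.Tensor) (w: torch.Tensor) =
    use scope = torch.NewDisposeScope()
    let y = x.matmul(w).relu()        // intermediates die with the scope
    y.MoveToOuterDisposeScope()       // keep the result alive for the caller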
4. Algorithm-Level Optimizations
Marsaglia Gaussian generator inefficiency (issue #23, "Improve the Marsaglia Gaussian generator"): the second generated sample is discarded
Missing vectorization opportunities in Reference backend
No SIMD optimizations for CPU operations
5. Differentiation Overhead
Multiple tensor wrapper types (TensorC/TensorF/TensorR) add call overhead
Deep nesting in primalDeep operations with recursive calls (a simplified analogue is sketched below)
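As a simplified scalar analogue (not Furnace's actual TensorC/TensorF/TensorR implementation), the sketch below shows why each level of nesting adds a dispatch and an allocation per operation, and why primal extraction must recurse:

// Simplified scalar analogue of nested forward-mode wrappers; illustrative
// only, not the actual Furnace tensor types.
type Dual =
    | Const of float                        // primal only (like TensorC)
    | Forward of Dual * Dual                // primal + tangent (like TensorF)
    member d.Primal  = match d with Const _ -> d | Forward (p, _) -> p
    member d.Tangent = match d with Const _ -> Const 0.0 | Forward (_, t) -> t
    member d.PrimalDeep =                   // recursive unwrapping, as in primalDeep
        match d with Const p -> p | Forward (p, _) -> p.PrimalDeep
    static member (+) (a: Dual, b: Dual) =
        match a, b with
        | Const x, Const y -> Const (x + y)
        | _ -> Forward (a.Primal + b.Primal, a.Tangent + b.Tangent)
    static member (*) (a: Dual, b: Dual) =
        match a, b with
        | Const x, Const y -> Const (x * y)
        | _ ->
            // product rule: extra allocation and dispatch at every nesting level
            Forward (a.Primal * b.Primal,
                     a.Primal * b.Tangent + a.Tangent * b.Primal)

// d/dx (x * x) at x = 3.0:
let x = Forward (Const 3.0, Const 1.0)
let dy = (x * x).Tangent.PrimalDeep         // 6.0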
Typical Workloads & Bottlenecks
Machine Learning Workloads
Neural network training: Forward/backward passes with large matrix operations
Optimization algorithms (Adam, SGD): Many small tensor operations per step
Model inference: Mostly matrix multiplications and activations
Performance Characteristics
I/O bound: data loading, model serialization
CPU bound: Reference backend operations, automatic differentiation
Memory bound: large tensor operations, gradient accumulation
GPU bound: CUDA operations through TorchSharp, when available
Performance Goals by Round
Round 1: Low-Hanging Fruit (Target: 20-50% improvement)
Fix Marsaglia Gaussian generator - cache the second sample instead of discarding it, a ~2x improvement for random-normal generation (see the sketch after this list)
Optimize tensor creation paths - reduce type conversions in TorchSharp backend
Add tensor operation fusion - combine multiple operations to reduce intermediate allocations
Improve scalar operations - optimize tensor-scalar arithmetic patterns
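A minimal sketch of the cached-sample fix (Marsaglia polar method), using a hypothetical PolarGaussian helper rather than Furnace's actual generator code:

// Sketch: Marsaglia polar method that caches the second sample (issue #23)
// instead of discarding it. Names here are illustrative only.
type PolarGaussian(rng: System.Random) =
    let mutable spare = nan                  // cached second sample; nan = empty
    member _.Next() : float =
        if not (System.Double.IsNaN spare) then
            let v = spare
            spare <- nan
            v
        else
            // draw points in the unit square until one lands inside the unit circle
            let mutable u = 0.0
            let mutable v = 0.0
            let mutable s = 0.0
            while not (s > 0.0 && s < 1.0) do
                u <- rng.NextDouble() * 2.0 - 1.0
                v <- rng.NextDouble() * 2.0 - 1.0
                s <- u * u + v * v
            let m = sqrt (-2.0 * log s / s)
            spare <- v * m                   // keep the second sample for the next call
            u * m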
Round 2: Backend Optimizations (Target: 2-5x improvement)
SIMD vectorization for Reference backend hot paths (see the sketch after this list)
Memory pooling for intermediate tensor allocations
Lazy evaluation for tensor operation chains
In-place operation support for appropriate cases
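A minimal SIMD sketch for the elementwise-add hot path, using System.Numerics.Vector over raw arrays; this is illustrative and does not reflect the Reference backend's actual storage layout:

open System.Numerics

// Sketch: SIMD-accelerated elementwise addition over raw float32 arrays.
let addSimd (a: float32[]) (b: float32[]) (result: float32[]) =
    let width = Vector<float32>.Count
    let mutable i = 0
    // vectorized main loop: `width` elements per iteration
    while i <= a.Length - width do
        let va = Vector<float32>(a, i)
        let vb = Vector<float32>(b, i)
        (va + vb).CopyTo(result, i)
        i <- i + width
    // scalar tail for the remaining elements
    while i < a.Length do
        result.[i] <- a.[i] + b.[i]
        i <- i + 1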
Round 3: Architecture Changes (Target: 5-10x improvement)
Native backend with F# P/Invoke to optimized BLAS/LAPACK (see the sketch after this list)
Tensor operation batching for better GPU utilization
JIT compilation for tensor operation graphs
Custom automatic differentiation with specialized reverse-mode implementation
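A hedged P/Invoke sketch for the native-backend direction; the library name ("openblas") and this wrapper are assumptions for illustration, not an existing Furnace component:

module NativeBlas =
    open System.Runtime.InteropServices

    // Hypothetical binding to OpenBLAS's standard cblas_sgemm entry point.
    [<DllImport("openblas", CallingConvention = CallingConvention.Cdecl)>]
    extern void cblas_sgemm(int order, int transA, int transB,
                            int m, int n, int k,
                            float32 alpha, float32[] a, int lda,
                            float32[] b, int ldb,
                            float32 beta, float32[] c, int ldc)

    let private CblasRowMajor, CblasNoTrans = 101, 111

    // C = A (m x k) * B (k x n), row-major dense matmul via native BLAS
    let matmul (a: float32[]) (b: float32[]) m k n =
        let c = Array.zeroCreate<float32> (m * n)
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0f, a, k, b, n, 0.0f, c, n)
        c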
Build & Testing Commands
Standard Build
dotnet restore
dotnet build --configuration Release --no-restore --verbosity normal
dotnet test --configuration Release --no-build
Benchmark Commands
# Run F# benchmarks
dotnet run --project tests\Furnace.Benchmarks\Furnace.Benchmarks.fsproj -c Release --filter "*"
# Run Python benchmarks (updates source files)
dotnet run --project tests\Furnace.Benchmarks.Python\Furnace.Benchmarks.Python.fsproj -c Release --filter "*"
GPU Testing
dotnet test /p:Furnace_TESTGPU=true
Profiling & Measurement Setup
Micro-benchmarking
Use existing BenchmarkDotNet infrastructure
Target operations in the BasicTensorOps benchmark suite (a minimal example follows below)
Focus on operations with high iteration counts (workloadSize / tensorSize factor)
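A minimal sketch of adding a benchmark in the existing BenchmarkDotNet style; raw arrays stand in for tensors here so as not to guess Furnace's API, and a real benchmark would sit alongside the BasicTensorOps suite:

open BenchmarkDotNet.Attributes

// Sketch: micro-benchmark over the existing size matrix. Raw arrays stand
// in for tensors; a real benchmark would call the Furnace tensor API.
[<MemoryDiagnoser>]
type ElementwiseAddBenchmarks() =
    let mutable a = Array.empty<float32>
    let mutable b = Array.empty<float32>

    [<Params(16, 2048, 65536)>]             // mirrors the existing size matrix
    member val Size = 0 with get, set

    [<GlobalSetup>]
    member this.Setup() =
        let rng = System.Random(42)
        a <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))
        b <- Array.init this.Size (fun _ -> float32 (rng.NextDouble()))

    [<Benchmark>]
    member _.ArrayAdd() = Array.map2 (+) a b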
Profiling Tools
.NET profilers: PerfView, JetBrains dotMemory, Visual Studio Diagnostics
Native profilers: Intel VTune (for TorchSharp), perf on Linux
Memory profilers: Application Verifier, Debug heap
Performance Measurement Strategy
Compare Reference vs TorchSharp backends
Measure against PyTorch baselines
Test across tensor size matrix: small (16), medium (2048), large (65536)
Profile both CPU and GPU (CUDA) execution paths
Maintainer Priorities
Based on existing issues and project direction:
Correctness over performance - ensure mathematical correctness
API stability - minimize breaking changes to public interfaces
Cross-platform support - maintain Linux/macOS/Windows compatibility
Educational use - support learning ML concepts with F#
PyTorch compatibility - maintain familiar API patterns
Concrete Next Steps
Environment Setup for Performance Work
Build project: dotnet build -c Release
Install benchmark dependencies: dotnet tool restore
Run baseline benchmarks to establish current performance
Set up profiling tools (PerfView/dotMemory)
Create performance regression test suite
Development Workflow
Identify bottleneck via profiling/benchmarks
Implement optimization in isolated branch
Measure improvement with micro-benchmarks
Run regression tests to ensure correctness
Performance CI integration with benchmark comparison (a coarse guard is sketched below)
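To complement the CI integration step, a coarse wall-clock guard like the hypothetical helper below can flag gross performance regressions between runs; a real suite would compare BenchmarkDotNet baselines rather than a fixed budget:

open System.Diagnostics

// Hypothetical helper for a coarse performance regression check. Fixed
// wall-clock budgets are noisy; prefer BenchmarkDotNet baselines in CI.
let assertWithinBudget (budgetMs: float) (label: string) (op: unit -> unit) =
    op ()                                           // warm-up (JIT, caches)
    let iterations = 100
    let sw = Stopwatch.StartNew()
    for _ in 1 .. iterations do op ()
    sw.Stop()
    let perOpMs = sw.Elapsed.TotalMilliseconds / float iterations
    if perOpMs > budgetMs then
        failwithf "%s regressed: %.3f ms/op exceeds %.3f ms/op budget" label perOpMs budgetMs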
This plan provides a systematic approach to performance improvement with clear measurement criteria and realistic improvement targets.