A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones
---
MobileFineTuner is an open-source framework that enables practical, privacy-preserving fine-tuning of Large Language Models (LLMs) directly on commodity mobile phones. By keeping sensitive user data on-device, MobileFineTuner addresses critical privacy concerns while unlocking vast amounts of valuable private-domain data for personalized model adaptation.
Unlike simulation-based or desktop-bound approaches, MobileFineTuner runs natively on mobile hardware through a lean C++ implementation, eliminating Python runtime overhead and enabling both Full Fine-Tuning (Full-FT) and Parameter-Efficient Fine-Tuning (PEFT/LoRA) under tight resource constraints.
- Efficiency: Pure C++ implementation with modular operators, automatic differentiation, and full backpropagation—no Python runtime or external ML frameworks required
- Scalability: Supports multiple mainstream LLM architectures (GPT-2, Gemma) with flexible interfaces for custom training strategies and federated learning integration
- Usability: Simple high-level APIs that abstract away system complexity, enabling rapid prototyping and practical deployment
- Privacy-Preserving: All training data remains on-device, supporting GDPR compliance and user privacy expectations
- Resource-Aware: Built-in memory and energy optimizations designed specifically for mobile constraints
Memory Optimization:
- ZeRO-inspired parameter sharding with LRU-based offloading to disk
- Optional FP16 quantization for disk-stored parameters
- Gradient accumulation for micro-batch training under tight memory budgets
Energy Optimization:
- Energy-aware computation scheduler adapting to battery level and temperature
- Dynamic throttling to reduce power draw during sustained training
## Table of Contents

- Installation & Build
- Quick Start
- Supported Models
- Core Components
- Memory Optimization
- Energy-Aware Training
- Evaluation
- PyTorch Alignment
- Benchmarks
- Project Structure
- Citation
- Contributing
- License
## Installation & Build

Prerequisites:

- Compiler: C++17 or later
- Build System: CMake ≥ 3.10
- Threading: pthreads
- BLAS (optional): Apple Accelerate, OpenBLAS, or Intel MKL for accelerated matrix operations
```bash
cd operators
mkdir build && cd build
cmake .. -DUSE_BLAS=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

Build Outputs:

- `liboperators.a` - Core framework library
- `gpt2_lora_finetune` - GPT-2 LoRA training CLI
- `train_lora_gemma` - Gemma LoRA training CLI
- `eval_ppl` - WikiText-2 perplexity evaluation
- `eval_mmlu` - MMLU benchmark evaluation
## Quick Start

Download the WikiText-2 dataset and pretrained model weights:

```bash
# WikiText-2 raw text files
mkdir -p data/wikitext2/wikitext-2-raw
# Place wiki.train.raw, wiki.valid.raw, wiki.test.raw in the above directory

# GPT-2 pretrained weights (HuggingFace format)
# Place in gpt2_lora_finetune/pretrained/gpt2/
```

### GPT-2 Small (124M parameters)
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora.safetensors \
  --epochs 1 --batch_size 4 --grad_accum_steps 2 --seq_len 128 \
  --rank 8 --alpha 16 --lr 2e-4 --warmup_steps 100 \
  --eval_interval 200 --clip_grad_norm 1.0
```

### Gemma 270M
```bash
./build/train_lora_gemma \
  --model_dir gemma-3-270m \
  --data_dir data/wikitext2/wikitext-2-raw \
  --output_dir runs/gemma_270m_lora \
  --epochs 1 --batch 4 --grad_accum 1 --seq_len 256 \
  --learning_rate 2e-4 --warmup_ratio 0.03 \
  --lora_r 8 --lora_alpha 32 --targets full
```

### Training with Parameter Sharding

Add parameter sharding to reduce peak memory usage:
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora_shard.safetensors \
  --shard_enable \
  --shard_dir /tmp/gft_param_shard \
  --shard_budget_mb 512 \
  --shard_fp16_disk 1 \
  --epochs 1 --batch_size 4 --seq_len 128
```

### Energy-Aware Throttling

Adapt computation to battery and temperature constraints:
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora_energy.safetensors \
  --pm_interval 10 \
  --pm_batt_thresh 20 --pm_fb_high 2.0 --pm_fb_low 0.5 \
  --pm_temp_thresh 42 --pm_ft_high 2.0 --pm_ft_low 0.5 \
  --epochs 1 --batch_size 4 --seq_len 128
```

## Supported Models

- GPT-2 Small: 124M parameters, 12 layers, 768 hidden dimensions
- GPT-2 Medium: 355M parameters, 24 layers, 1024 hidden dimensions
- GPT-2 Large: 774M parameters (experimental)
- Gemma-3 270M: Compact decoder-only transformer with Grouped Query Attention (GQA)
- Gemma-3 1B: Scaled version with 18 layers, 2048 hidden dimensions
MobileFineTuner's modular architecture supports easy extension to new models. Key interfaces:
- `Tensor` class for multi-dimensional arrays
- `ops::` namespace for differentiable operations
- Model graph definition (see `finetune_ops/graph/gpt2_model.cpp` and `gemma_model.cpp`)
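To give a flavor of what extending the framework looks like, the fragment below composes a small feed-forward block from the `Tensor` and `ops::` interfaces listed above. It is a sketch in the style of the autograd example further down: the header paths and the `ops::gelu` activation call are assumptions for illustration, not the framework's verbatim API.

```cpp
// Illustrative sketch only: header paths and ops::gelu are assumed,
// not taken from the framework's actual API.
#include "finetune_ops/core/tensor.h"   // assumed location of the Tensor class
#include "finetune_ops/nn/ops.h"        // assumed location of the ops:: namespace

// A minimal feed-forward block expressed with differentiable ops, so gradients
// flow through it automatically when loss.backward() is called later.
Tensor feed_forward_block(const Tensor& x,
                          const Tensor& w1, const Tensor& b1,
                          const Tensor& w2, const Tensor& b2) {
    auto h = ops::linear(x, w1, b1);    // expand: [batch, seq, 4 * hidden]
    h = ops::gelu(h);                   // assumed activation op
    return ops::linear(h, w2, b2);      // project back: [batch, seq, hidden]
}
```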
## Core Components

Custom Tensor Implementation:
- Pooled memory allocation for reduced malloc overhead
- Automatic gradient tracking with topological sort-based backward pass
- In-place operation support with copy-on-write semantics
```cpp
// Example: Forward and backward through custom ops
auto x = Tensor::randn({batch_size, seq_len, hidden_dim});
auto y = ops::linear(x, weight, bias);
auto loss = ops::mse_loss(y, target);
loss.backward(); // Automatic gradient computation
```

GPT-2 Architecture:
- Transformer decoder with fused QKV attention
- Causal attention masking for autoregressive generation
- Layer normalization and residual connections
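To make the causal-masking rule concrete, here is a small standalone snippet (independent of the framework's fused attention kernel) that blocks attention to future positions before the row-wise softmax.

```cpp
// Standalone illustration of causal masking: query position i may only attend
// to key positions j <= i. Not the framework's fused attention implementation.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// scores is a flattened [seq_len x seq_len] matrix of raw attention logits
// (row = query position, column = key position).
void apply_causal_mask(std::vector<float>& scores, int seq_len) {
    for (int i = 0; i < seq_len; ++i)
        for (int j = i + 1; j < seq_len; ++j)   // positions after the query are masked
            scores[i * seq_len + j] = -std::numeric_limits<float>::infinity();
}

// Row-wise softmax after masking; masked entries receive zero probability.
void softmax_rows(std::vector<float>& scores, int seq_len) {
    for (int i = 0; i < seq_len; ++i) {
        float row_max = -std::numeric_limits<float>::infinity();
        for (int j = 0; j < seq_len; ++j)
            row_max = std::max(row_max, scores[i * seq_len + j]);
        float sum = 0.0f;
        for (int j = 0; j < seq_len; ++j) {
            float e = std::exp(scores[i * seq_len + j] - row_max);
            scores[i * seq_len + j] = e;
            sum += e;
        }
        for (int j = 0; j < seq_len; ++j)
            scores[i * seq_len + j] /= sum;
    }
}
```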
Gemma Architecture:
- Grouped Query Attention (GQA) for reduced memory footprint
- RoPE (Rotary Position Embedding) for positional encoding
- GeGLU activation in feed-forward layers
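For reference, the RoPE rotation can be sketched as a standalone function: each channel pair (2i, 2i+1) of a query or key head is rotated by an angle proportional to the token position. This shows the math only, not the framework's kernel; the base of 10000 is the common default and an assumption here.

```cpp
// Standalone sketch of Rotary Position Embedding (RoPE): each pair of channels
// (2i, 2i+1) is rotated by an angle that depends on the token position.
// Illustrative only; base 10000 is assumed.
#include <cmath>
#include <vector>

void apply_rope(std::vector<float>& q, int pos, int head_dim, float base = 10000.0f) {
    for (int i = 0; i < head_dim / 2; ++i) {
        float theta = pos * std::pow(base, -2.0f * i / head_dim);
        float c = std::cos(theta), s = std::sin(theta);
        float x0 = q[2 * i], x1 = q[2 * i + 1];
        q[2 * i]     = x0 * c - x1 * s;   // rotate the (x0, x1) pair
        q[2 * i + 1] = x0 * s + x1 * c;
    }
}
```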
Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation:
- Inject trainable low-rank matrices into attention and MLP layers
- Freeze base model parameters to reduce memory and computation
- PEFT-compatible SafeTensors format for adapter persistence
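The LoRA update itself is simple: the frozen weight W is augmented with a trainable low-rank product scaled by alpha/r, i.e. y = Wx + (alpha/r)·B(Ax). A standalone sketch of that forward path (illustrative, not the framework's LoRA layer):

```cpp
// Standalone sketch of a LoRA forward pass: y = W x + (alpha / r) * B (A x).
// W is frozen; only A [r x in] and B [out x r] are trained. Illustrative only.
#include <vector>

std::vector<float> lora_forward(const std::vector<float>& x,   // [in]
                                const std::vector<float>& W,   // [out x in], frozen
                                const std::vector<float>& A,   // [r x in], trainable
                                const std::vector<float>& B,   // [out x r], trainable
                                int in, int out, int r, float alpha) {
    std::vector<float> Ax(r, 0.0f), y(out, 0.0f);
    for (int i = 0; i < r; ++i)                 // A x
        for (int j = 0; j < in; ++j) Ax[i] += A[i * in + j] * x[j];
    float scale = alpha / r;
    for (int o = 0; o < out; ++o) {
        for (int j = 0; j < in; ++j) y[o] += W[o * in + j] * x[j];          // frozen path
        for (int i = 0; i < r; ++i)  y[o] += scale * B[o * r + i] * Ax[i];  // LoRA path
    }
    return y;
}
```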
```
// LoRA injection targets
GPT-2: Attn QKV + Attn Proj
Gemma: Q/K/V/O projections + Gate/Up/Down MLP projections
```

Fast and safe tensor serialization via SafeTensors:
- Load pretrained weights from HuggingFace format
- Automatic key mapping for model compatibility
- Optional transpose for linear layer weights
- Save LoRA adapters in PEFT-compatible format
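The "optional transpose" typically matters for GPT-2 checkpoints, whose HuggingFace Conv1D-style weights (c_attn, c_proj, c_fc) are stored as [in_features, out_features], while a conventional linear layer expects [out_features, in_features]. A minimal standalone sketch of that step (not the framework's loader):

```cpp
// Illustrative transpose for GPT-2 Conv1D-style weights stored as
// [in_features, out_features] in HuggingFace checkpoints.
#include <cstddef>
#include <vector>

std::vector<float> transpose_weight(const std::vector<float>& w, int in_f, int out_f) {
    std::vector<float> t(static_cast<std::size_t>(out_f) * in_f);
    for (int i = 0; i < in_f; ++i)
        for (int o = 0; o < out_f; ++o)
            t[static_cast<std::size_t>(o) * in_f + i] =
                w[static_cast<std::size_t>(i) * out_f + o];
    return t;
}
```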
## Memory Optimization

### Parameter Sharding

Inspired by ZeRO (Zero Redundancy Optimizer), MobileFineTuner implements parameter offloading to overcome mobile memory constraints.
How It Works:
- All model parameters are registered with the `ParameterSharder`
- A resident memory budget (e.g., 512 MB) is enforced
- Parameters are loaded on-demand via `require()` calls during forward/backward
- An LRU eviction policy offloads inactive parameters to disk
- Optional FP16 quantization reduces disk storage by 50%
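To make the `require()`/LRU interaction concrete, here is a much-simplified standalone sketch of the idea; the real `ParameterSharder` manages actual tensors, shard files, and optional FP16 conversion, and its API beyond the `require()` name above is not shown here.

```cpp
// Much-simplified standalone sketch of LRU-based parameter offloading.
// "Disk" is stubbed out and a parameter is represented only by its byte count.
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class LruParameterCache {
public:
    explicit LruParameterCache(std::size_t budget_bytes) : budget_(budget_bytes) {}

    // require(): make a parameter resident, evicting LRU entries if needed.
    void require(const std::string& name, std::size_t bytes) {
        auto it = index_.find(name);
        if (it != index_.end()) {                 // already resident: mark as recent
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        while (resident_ + bytes > budget_ && !lru_.empty()) evict_oldest();
        load_from_disk(name);                     // stub: read (possibly FP16) shard
        lru_.push_front({name, bytes});
        index_[name] = lru_.begin();
        resident_ += bytes;
    }

private:
    struct Entry { std::string name; std::size_t bytes; };

    void evict_oldest() {
        Entry victim = lru_.back();
        write_to_disk(victim.name);               // stub: offload tensor to disk
        resident_ -= victim.bytes;
        index_.erase(victim.name);
        lru_.pop_back();
    }

    void load_from_disk(const std::string&) {}    // placeholder
    void write_to_disk(const std::string&) {}     // placeholder

    std::size_t budget_;
    std::size_t resident_ = 0;
    std::list<Entry> lru_;                        // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```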
Configuration:
```bash
--shard_enable            # Enable parameter sharding
--shard_dir /tmp/shard    # Disk offload directory
--shard_budget_mb 512     # Resident RAM budget (MB)
--shard_fp16_disk 1       # Enable FP16 disk quantization
```

Memory Savings:
- GPT-2 Small: ~58% reduction (2.4 GB → 1.0 GB)
- GPT-2 Medium: ~55% reduction (3.8 GB → 1.7 GB)
- Gemma 270M: ~40% reduction (4.2 GB → 2.5 GB)
Trade-offs:
- Adds disk I/O overhead (~5-10% runtime increase)
- Most effective when parameter memory dominates activation memory
- Minimal overhead with SSD storage
### Gradient Accumulation

Divide large batches into micro-batches to reduce activation memory:

```bash
--batch_size 8         # Effective batch size
--grad_accum_steps 4   # Accumulate over 4 micro-batches
```

Result: Forward/backward runs on `batch_size / grad_accum_steps = 2` samples at a time, reducing peak activation memory by ~75% while maintaining gradient quality.
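To see why accumulation preserves gradient quality: summing per-sample gradients scaled by 1/(effective batch size) over all micro-batches yields exactly the full-batch mean gradient. A tiny standalone example on a toy 1-D model (not MobileFineTuner's trainer API) illustrating one accumulated update:

```cpp
// Standalone illustration of gradient accumulation on a toy model y = w * x.
// Scaling each per-sample gradient by 1/(effective batch size) while looping
// over micro-batches is equivalent to averaging each micro-batch loss and
// scaling it by 1/grad_accum_steps, as the CLI flags above do.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int grad_accum_steps = 4;
    const float lr = 0.1f;
    std::vector<std::pair<float, float>> batch = {
        {1.0f, 2.0f}, {2.0f, 4.0f}, {3.0f, 6.0f}, {4.0f, 8.0f},
        {0.5f, 1.0f}, {1.5f, 3.0f}, {2.5f, 5.0f}, {3.5f, 7.0f}};  // effective batch of 8

    float w = 0.0f, grad = 0.0f;
    const int micro_size = static_cast<int>(batch.size()) / grad_accum_steps;  // 2 samples
    for (int micro = 0; micro < grad_accum_steps; ++micro) {
        for (int k = 0; k < micro_size; ++k) {
            auto [x, y] = batch[micro * micro_size + k];
            float err = w * x - y;                    // dL/dw for L = 0.5 * err^2
            grad += (err * x) / batch.size();         // scale to the full-batch mean
        }
    }
    w -= lr * grad;                                   // one optimizer step per effective batch
    std::printf("w after one accumulated step: %f\n", w);
    return 0;
}
```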
## Energy-Aware Training

MobileFineTuner includes a power monitor (`opt_ops/energy/power_monitor`) that adapts computation intensity to battery and temperature constraints, extending battery life during sustained training. The monitor uses frequency-based throttling (Hz), which is internally converted to a sleep duration (ms) between training steps.
Battery-Based Throttling:
```bash
--pm_interval 10        # Check battery every 10 steps
--pm_batt_thresh 20.0   # Throttle below 20% battery (default threshold)
--pm_fb_high 2.0        # Frequency high: 2 Hz (500ms sleep) when battery < threshold
--pm_fb_low 0.5         # Frequency low: 0.5 Hz (2000ms sleep) when battery ≥ threshold
```

Temperature-Based Throttling:
```bash
--pm_temp_thresh 42.0   # Throttle above 42°C (default threshold)
--pm_ft_high 2.0        # Frequency high: 2 Hz when temp > threshold
--pm_ft_low 0.5         # Frequency low: 0.5 Hz when temp ≤ threshold
```

Manual Schedule Override:
```bash
--pm_schedule "0-99:300,100-199:200,200-:100"
# Steps 0-99:    300ms sleep per step
# Steps 100-199: 200ms sleep per step
# Steps 200+:    100ms sleep per step
```

Benefits:

- Energy Savings: Adjustable throttling based on real-time battery/temperature telemetry
- Thermal Management: Prevents device overheating during extended training sessions
- Flexible Control: Supports manual telemetry simulation and deterministic schedule override
- Minimal Accuracy Loss: Training time increases with sleep duration, but final model quality is preserved
Note: Frequency parameters (pm_fb_high, pm_ft_high) are specified in Hz and internally converted to sleep milliseconds. For example, 2.0 Hz = 500ms sleep, 0.5 Hz = 2000ms sleep.
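The conversion is simply `sleep_ms = 1000 / freq_hz`. The snippet below is a standalone sketch of that conversion plus a parser for the manual schedule format shown above; it is simplified relative to the real power monitor (no telemetry handling), and the parsing details are an assumption based on the documented `begin-end:ms` syntax.

```cpp
// Standalone sketch: convert a throttle frequency (Hz) to a per-step sleep (ms)
// and look up a sleep value from a manual schedule string such as
// "0-99:300,100-199:200,200-:100". Simplified: no error handling or telemetry.
#include <sstream>
#include <string>
#include <vector>

int sleep_ms_from_hz(double freq_hz) {
    return static_cast<int>(1000.0 / freq_hz);      // 2.0 Hz -> 500 ms, 0.5 Hz -> 2000 ms
}

struct ScheduleEntry { long begin; long end; int sleep_ms; };  // end < 0 means open-ended

std::vector<ScheduleEntry> parse_schedule(const std::string& spec) {
    std::vector<ScheduleEntry> entries;
    std::stringstream ss(spec);
    std::string item;
    while (std::getline(ss, item, ',')) {           // e.g. "100-199:200" or "200-:100"
        auto colon = item.find(':');
        auto dash  = item.find('-');
        long begin = std::stol(item.substr(0, dash));
        std::string end_str = item.substr(dash + 1, colon - dash - 1);
        long end = end_str.empty() ? -1 : std::stol(end_str);
        entries.push_back({begin, end, std::stoi(item.substr(colon + 1))});
    }
    return entries;
}

int sleep_for_step(const std::vector<ScheduleEntry>& sched, long step, int fallback_ms) {
    for (const auto& e : sched)
        if (step >= e.begin && (e.end < 0 || step <= e.end)) return e.sleep_ms;
    return fallback_ms;
}
```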
## Evaluation

### Perplexity (WikiText-2)

Measure language modeling quality:
```bash
./build/eval_ppl \
  --data_root data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_path runs/gpt2_lora.safetensors \
  --lora_merge 1
```

Expected Results:
- GPT-2 Small baseline: ~29.5 PPL
- GPT-2 Small + LoRA (1 epoch): ~26.8 PPL
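For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better:

```cpp
// Perplexity from a summed token-level negative log-likelihood (standard definition).
#include <cmath>

double perplexity(double summed_token_nll, long long num_tokens) {
    return std::exp(summed_token_nll / static_cast<double>(num_tokens));
}
```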
### MMLU

Multi-task language understanding:
```bash
./build/eval_mmlu \
  --mmlu_root data/mmlu/data \
  --split dev \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_path runs/gpt2_lora.safetensors \
  --lora_merge 1 \
  --fewshot 0
```

## PyTorch Alignment

For numerical validation and debugging, MobileFineTuner includes PyTorch reference implementations in `pytorch_alignment/`:
```bash
# GPT-2 LoRA (PyTorch baseline)
python pytorch_alignment/gpt2_lora_finetune.py \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out pytorch_runs/gpt2_lora \
  --epochs 1 --batch_size 8 --seq_len 128

# Gemma LoRA (PyTorch baseline)
python pytorch_alignment/gemma_lora_finetune.py \
  --model_dir gemma-3-270m \
  --data_dir data/wikitext2/wikitext-2-raw \
  --output_dir pytorch_runs/gemma_lora \
  --epochs 1 --batch 4 --seq_len 256
```

Use Cases:
- Verify loss curves match C++ implementation
- Compare final adapter weights for numerical parity
- Debug gradient flow and optimizer behavior
## Benchmarks

### Peak Memory Usage

Performance benchmarks on commodity mobile devices:

| Model | Baseline | + Sharding (512MB) | Reduction |
|---|---|---|---|
| GPT-2 Small | 2.4 GB | 1.0 GB | 58% |
| GPT-2 Medium | 3.8 GB | 1.7 GB | 55% |
| Gemma 270M | 4.2 GB | 2.5 GB | 40% |
| Gemma 1B | 8.5 GB | 4.8 GB | 44% |
### Training Speed

Training speed depends on device hardware, BLAS acceleration, and model size. Representative examples:

| Model | Configuration | Approximate Time/Epoch |
|---|---|---|
| GPT-2 Small | batch=4, seq_len=128, BLAS enabled | 4-6 hours on modern mobile SoC |
| GPT-2 Medium | batch=4, seq_len=128, BLAS enabled | 10-14 hours on modern mobile SoC |
| Gemma 270M | batch=4, seq_len=256, BLAS enabled | 8-12 hours on modern mobile SoC |
Times vary significantly with device specs, thermal throttling, and memory optimization settings.

### Energy Impact

| Configuration | Training Time | Energy Impact |
|---|---|---|
| No throttling | Baseline | High power draw, thermal throttling likely |
| Energy-aware throttling | 1.5-2× baseline | Reduced power, extended battery life |
Note: Actual power savings depend on device hardware, battery level, and throttling aggressiveness.
## Project Structure

```text
MobileFineTuner/
├── operators/ # Core C++ framework
│ ├── finetune_ops/
│ │ ├── core/ # Tensor, autograd, memory manager
│ │ ├── graph/ # GPT-2, Gemma model graphs
│ │ ├── nn/ # Neural network layers (LoRA, Linear, etc.)
│ │ ├── optim/ # Optimizers (Adam), trainers
│ │ └── data/ # WikiText-2 dataset loader
│ ├── opt_ops/
│ │ ├── energy/ # Power monitor
│ │ └── sharding/ # Parameter sharder
│ └── CMakeLists.txt
├── gpt2_lora_finetune/ # GPT-2 training/eval CLIs
│ ├── main.cpp # LoRA training entry point
│ ├── eval_ppl.cpp # Perplexity evaluation
│ └── eval_mmlu.cpp # MMLU benchmark
├── pytorch_alignment/ # PyTorch reference scripts
│ ├── gpt2_lora_finetune.py
│ ├── gpt2_full_finetune.py
│ └── gemma_lora_finetune.py
├── scripts/ # Automation and benchmarking
│ ├── benchmark/ # Sharding, energy benchmarks
│ └── Finetune/ # Training scripts
├── data/ # Expected data root
│ ├── wikitext2/
│ └── mmlu/
├── gemma-3-270m/ # Gemma model weights (HF format)
├── gemma-3-1b/
└── README.md
```

## Contributing

We welcome contributions from the community! Areas of interest include:
- New Model Architectures: Llama, Mistral, Qwen, etc.
- Mobile Platform Support: iOS Metal acceleration, Android NNAPI integration
- Optimization Techniques: FlashAttention, quantization (INT8/INT4), model pruning
- Federated Learning: Distributed training protocols for privacy-preserving aggregation
- Benchmarking: Real-world mobile device experiments and profiling
To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
Code Style:

- C++: Follow Google C++ Style Guide
- Python: Follow PEP 8 with Black formatter
- Documentation: Add inline comments for complex logic, update README for new features
## Citation

Authors:
- Jiaxiang Geng (Duke Kunshan University, The University of Hong Kong)
- Lunyu Zhao (Duke Kunshan University)
- Yiyi Lu (Duke Kunshan University)
- Bing Luo (Duke Kunshan University)
Email: {jg645, lz269, yl996, bl291}@duke.edu
We thank the open-source community for foundational tools and datasets:
- HuggingFace Transformers for model implementations and pretrained weights
- Microsoft DeepSpeed for ZeRO optimizer inspiration
- WikiText-2 and MMLU benchmark creators
- Apple and Google for mobile hardware access and development tools
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2024 Mobile LLM Fine-Tuning Project Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Built with passion for privacy-preserving mobile AI
