A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones
---
MobileFineTuner is an open-source framework that enables practical, privacy-preserving fine-tuning of Large Language Models (LLMs) directly on commodity mobile phones. By keeping sensitive user data on-device, MobileFineTuner addresses critical privacy concerns while unlocking vast amounts of valuable private-domain data for personalized model adaptation.
Unlike simulation-based or desktop-bound approaches, MobileFineTuner runs natively on mobile hardware through a lean C++ implementation, eliminating Python runtime overhead and enabling both Full Fine-Tuning (Full-FT) and Parameter-Efficient Fine-Tuning (PEFT/LoRA) under tight resource constraints.
- Efficiency: Pure C++ implementation with modular operators, automatic differentiation, and full backpropagation—no Python runtime or external ML frameworks required
- Scalability: Supports multiple mainstream LLM architectures (GPT-2, Gemma) with flexible interfaces for custom training strategies and federated learning integration
- Usability: Simple high-level APIs that abstract away system complexity, enabling rapid prototyping and practical deployment
- Privacy-Preserving: All training data remains on-device, supporting GDPR compliance and user privacy expectations
- Resource-Aware: Built-in memory and energy optimizations designed specifically for mobile constraints
Memory Optimization:
- ZeRO-inspired parameter sharding with LRU-based offloading to disk
- Optional FP16 quantization for disk-stored parameters
- Gradient accumulation for micro-batch training under tight memory budgets
Energy Optimization:
- Energy-aware computation scheduler adapting to battery level and temperature
- Dynamic throttling to reduce power draw during sustained training
## Table of Contents

- Installation & Build
- Quick Start
- Supported Models
- Core Components
- Memory Optimization
- Energy-Aware Training
- Evaluation
- PyTorch Alignment
- Benchmarks
- Project Structure
- Citation
- Contributing
- License
## Installation & Build

Prerequisites:

- Compiler: C++17 or later
- Build System: CMake ≥ 3.10
- Threading: pthreads
- BLAS (optional): Apple Accelerate, OpenBLAS, or Intel MKL for accelerated matrix operations
```bash
cd operators
mkdir build && cd build
cmake .. -DUSE_BLAS=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

Build Outputs:

- `liboperators.a` - Core framework library
- `gpt2_lora_finetune` - GPT-2 LoRA training CLI
- `train_lora_gemma` - Gemma LoRA training CLI
- `eval_ppl` - WikiText-2 perplexity evaluation
- `eval_mmlu` - MMLU benchmark evaluation
## Quick Start

Download the WikiText-2 dataset and pretrained model weights:

```bash
# WikiText-2 raw text files
mkdir -p data/wikitext2/wikitext-2-raw
# Place wiki.train.raw, wiki.valid.raw, wiki.test.raw in the above directory

# GPT-2 pretrained weights (HuggingFace format)
# Place in gpt2_lora_finetune/pretrained/gpt2/
```

### GPT-2 Small (124M parameters)
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora.safetensors \
  --epochs 1 --batch_size 4 --grad_accum_steps 2 --seq_len 128 \
  --rank 8 --alpha 16 --lr 2e-4 --warmup_steps 100 \
  --eval_interval 200 --clip_grad_norm 1.0
```

### Gemma 270M
```bash
./build/train_lora_gemma \
  --model_dir gemma-3-270m \
  --data_dir data/wikitext2/wikitext-2-raw \
  --output_dir runs/gemma_270m_lora \
  --epochs 1 --batch 4 --grad_accum 1 --seq_len 256 \
  --learning_rate 2e-4 --warmup_ratio 0.03 \
  --lora_r 8 --lora_alpha 32 --targets full
```

### Training with Parameter Sharding

Add parameter sharding to reduce peak memory usage:
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora_shard.safetensors \
  --shard_enable \
  --shard_dir /tmp/gft_param_shard \
  --shard_budget_mb 512 \
  --shard_fp16_disk 1 \
  --epochs 1 --batch_size 4 --seq_len 128
```

### Energy-Aware Throttling

Adapt computation to battery and temperature constraints:
```bash
./build/gpt2_lora_finetune \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out runs/gpt2_lora_energy.safetensors \
  --pm_interval 10 \
  --pm_batt_thresh 20 --pm_fb_high 2.0 --pm_fb_low 0.5 \
  --pm_temp_thresh 42 --pm_ft_high 2.0 --pm_ft_low 0.5 \
  --epochs 1 --batch_size 4 --seq_len 128
```

## Supported Models

- GPT-2 Small: 124M parameters, 12 layers, 768 hidden dimensions
- GPT-2 Medium: 355M parameters, 24 layers, 1024 hidden dimensions
- GPT-2 Large: 774M parameters (experimental)
- Gemma-3 270M: Compact decoder-only transformer with Grouped Query Attention (GQA)
- Gemma-3 1B: Scaled version with 18 layers, 2048 hidden dimensions
MobileFineTuner's modular architecture supports easy extension to new models. Key interfaces:
- `Tensor` class for multi-dimensional arrays
- `ops::` namespace for differentiable operations
- Model graph definition (see `finetune_ops/graph/gpt2_model.cpp` and `gemma_model.cpp`)
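To give a flavor of what extending the framework looks like, the fragment below composes a small feed-forward block from the `Tensor` and `ops::` interfaces listed above. It is a sketch in the style of the autograd example further down: the header paths and the `ops::gelu` activation call are assumptions for illustration, not the framework's verbatim API.

```cpp
// Illustrative sketch only: header paths and ops::gelu are assumed,
// not taken from the framework's actual API.
#include "finetune_ops/core/tensor.h"   // assumed location of the Tensor class
#include "finetune_ops/nn/ops.h"        // assumed location of the ops:: namespace

// A minimal feed-forward block expressed with differentiable ops, so gradients
// flow through it automatically when loss.backward() is called later.
Tensor feed_forward_block(const Tensor& x,
                          const Tensor& w1, const Tensor& b1,
                          const Tensor& w2, const Tensor& b2) {
    auto h = ops::linear(x, w1, b1);    // expand: [batch, seq, 4 * hidden]
    h = ops::gelu(h);                   // assumed activation op
    return ops::linear(h, w2, b2);      // project back: [batch, seq, hidden]
}
```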
## Core Components

Custom Tensor Implementation:
- Pooled memory allocation for reduced malloc overhead
- Automatic gradient tracking with topological sort-based backward pass
- In-place operation support with copy-on-write semantics
```cpp
// Example: Forward and backward through custom ops
auto x = Tensor::randn({batch_size, seq_len, hidden_dim});
auto y = ops::linear(x, weight, bias);
auto loss = ops::mse_loss(y, target);
loss.backward(); // Automatic gradient computation
```

GPT-2 Architecture:
- Transformer decoder with fused QKV attention
- Causal attention masking for autoregressive generation
- Layer normalization and residual connections
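To make the causal-masking rule concrete, here is a small standalone snippet (independent of the framework's fused attention kernel) that blocks attention to future positions before the row-wise softmax.

```cpp
// Standalone illustration of causal masking: query position i may only attend
// to key positions j <= i. Not the framework's fused attention implementation.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// scores is a flattened [seq_len x seq_len] matrix of raw attention logits
// (row = query position, column = key position).
void apply_causal_mask(std::vector<float>& scores, int seq_len) {
    for (int i = 0; i < seq_len; ++i)
        for (int j = i + 1; j < seq_len; ++j)   // positions after the query are masked
            scores[i * seq_len + j] = -std::numeric_limits<float>::infinity();
}

// Row-wise softmax after masking; masked entries receive zero probability.
void softmax_rows(std::vector<float>& scores, int seq_len) {
    for (int i = 0; i < seq_len; ++i) {
        float row_max = -std::numeric_limits<float>::infinity();
        for (int j = 0; j < seq_len; ++j)
            row_max = std::max(row_max, scores[i * seq_len + j]);
        float sum = 0.0f;
        for (int j = 0; j < seq_len; ++j) {
            float e = std::exp(scores[i * seq_len + j] - row_max);
            scores[i * seq_len + j] = e;
            sum += e;
        }
        for (int j = 0; j < seq_len; ++j)
            scores[i * seq_len + j] /= sum;
    }
}
```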
Gemma Architecture:
- Grouped Query Attention (GQA) for reduced memory footprint
- RoPE (Rotary Position Embedding) for positional encoding
- GeGLU activation in feed-forward layers
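For reference, the RoPE rotation can be sketched as a standalone function: each channel pair (2i, 2i+1) of a query or key head is rotated by an angle proportional to the token position. This shows the math only, not the framework's kernel; the base of 10000 is the common default and an assumption here.

```cpp
// Standalone sketch of Rotary Position Embedding (RoPE): each pair of channels
// (2i, 2i+1) is rotated by an angle that depends on the token position.
// Illustrative only; base 10000 is assumed.
#include <cmath>
#include <vector>

void apply_rope(std::vector<float>& q, int pos, int head_dim, float base = 10000.0f) {
    for (int i = 0; i < head_dim / 2; ++i) {
        float theta = pos * std::pow(base, -2.0f * i / head_dim);
        float c = std::cos(theta), s = std::sin(theta);
        float x0 = q[2 * i], x1 = q[2 * i + 1];
        q[2 * i]     = x0 * c - x1 * s;   // rotate the (x0, x1) pair
        q[2 * i + 1] = x0 * s + x1 * c;
    }
}
```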
Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation:
- Inject trainable low-rank matrices into attention and MLP layers
- Freeze base model parameters to reduce memory and computation
- PEFT-compatible SafeTensors format for adapter persistence
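The LoRA update itself is simple: the frozen weight W is augmented with a trainable low-rank product scaled by alpha/r, i.e. y = Wx + (alpha/r)·B(Ax). A standalone sketch of that forward path (illustrative, not the framework's LoRA layer):

```cpp
// Standalone sketch of a LoRA forward pass: y = W x + (alpha / r) * B (A x).
// W is frozen; only A [r x in] and B [out x r] are trained. Illustrative only.
#include <vector>

std::vector<float> lora_forward(const std::vector<float>& x,   // [in]
                                const std::vector<float>& W,   // [out x in], frozen
                                const std::vector<float>& A,   // [r x in], trainable
                                const std::vector<float>& B,   // [out x r], trainable
                                int in, int out, int r, float alpha) {
    std::vector<float> Ax(r, 0.0f), y(out, 0.0f);
    for (int i = 0; i < r; ++i)                 // A x
        for (int j = 0; j < in; ++j) Ax[i] += A[i * in + j] * x[j];
    float scale = alpha / r;
    for (int o = 0; o < out; ++o) {
        for (int j = 0; j < in; ++j) y[o] += W[o * in + j] * x[j];          // frozen path
        for (int i = 0; i < r; ++i)  y[o] += scale * B[o * r + i] * Ax[i];  // LoRA path
    }
    return y;
}
```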
```
// LoRA injection targets
GPT-2: Attn QKV + Attn Proj
Gemma: Q/K/V/O projections + Gate/Up/Down MLP projections
```

Fast and safe tensor serialization via SafeTensors:
- Load pretrained weights from HuggingFace format
- Automatic key mapping for model compatibility
- Optional transpose for linear layer weights
- Save LoRA adapters in PEFT-compatible format
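The "optional transpose" typically matters for GPT-2 checkpoints, whose HuggingFace Conv1D-style weights (c_attn, c_proj, c_fc) are stored as [in_features, out_features], while a conventional linear layer expects [out_features, in_features]. A minimal standalone sketch of that step (not the framework's loader):

```cpp
// Illustrative transpose for GPT-2 Conv1D-style weights stored as
// [in_features, out_features] in HuggingFace checkpoints.
#include <cstddef>
#include <vector>

std::vector<float> transpose_weight(const std::vector<float>& w, int in_f, int out_f) {
    std::vector<float> t(static_cast<std::size_t>(out_f) * in_f);
    for (int i = 0; i < in_f; ++i)
        for (int o = 0; o < out_f; ++o)
            t[static_cast<std::size_t>(o) * in_f + i] =
                w[static_cast<std::size_t>(i) * out_f + o];
    return t;
}
```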
## Memory Optimization

### Parameter Sharding

Inspired by ZeRO (Zero Redundancy Optimizer), MobileFineTuner implements parameter offloading to overcome mobile memory constraints.
How It Works:
- All model parameters are registered with the `ParameterSharder`
- A resident memory budget (e.g., 512 MB) is enforced
- Parameters are loaded on-demand via `require()` calls during forward/backward
- An LRU eviction policy offloads inactive parameters to disk
- Optional FP16 quantization reduces disk storage by 50%
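To make the `require()`/LRU interaction concrete, here is a much-simplified standalone sketch of the idea; the real `ParameterSharder` manages actual tensors, shard files, and optional FP16 conversion, and its API beyond the `require()` name above is not shown here.

```cpp
// Much-simplified standalone sketch of LRU-based parameter offloading.
// "Disk" is stubbed out and a parameter is represented only by its byte count.
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

class LruParameterCache {
public:
    explicit LruParameterCache(std::size_t budget_bytes) : budget_(budget_bytes) {}

    // require(): make a parameter resident, evicting LRU entries if needed.
    void require(const std::string& name, std::size_t bytes) {
        auto it = index_.find(name);
        if (it != index_.end()) {                 // already resident: mark as recent
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        while (resident_ + bytes > budget_ && !lru_.empty()) evict_oldest();
        load_from_disk(name);                     // stub: read (possibly FP16) shard
        lru_.push_front({name, bytes});
        index_[name] = lru_.begin();
        resident_ += bytes;
    }

private:
    struct Entry { std::string name; std::size_t bytes; };

    void evict_oldest() {
        Entry victim = lru_.back();
        write_to_disk(victim.name);               // stub: offload tensor to disk
        resident_ -= victim.bytes;
        index_.erase(victim.name);
        lru_.pop_back();
    }

    void load_from_disk(const std::string&) {}    // placeholder
    void write_to_disk(const std::string&) {}     // placeholder

    std::size_t budget_;
    std::size_t resident_ = 0;
    std::list<Entry> lru_;                        // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```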
Configuration:
```bash
--shard_enable            # Enable parameter sharding
--shard_dir /tmp/shard    # Disk offload directory
--shard_budget_mb 512     # Resident RAM budget (MB)
--shard_fp16_disk 1       # Enable FP16 disk quantization
```

Memory Savings:
- GPT-2 Small: ~58% reduction (2.4 GB → 1.0 GB)
- GPT-2 Medium: ~55% reduction (3.8 GB → 1.7 GB)
- Gemma 270M: ~40% reduction (4.2 GB → 2.5 GB)
Trade-offs:
- Adds disk I/O overhead (~5-10% runtime increase)
- Most effective when parameter memory dominates activation memory
- Minimal overhead with SSD storage
### Gradient Accumulation

Divide large batches into micro-batches to reduce activation memory:

```bash
--batch_size 8         # Effective batch size
--grad_accum_steps 4   # Accumulate over 4 micro-batches
```

Result: Forward/backward runs on `batch_size / grad_accum_steps = 2` samples at a time, reducing peak activation memory by ~75% while maintaining gradient quality.
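To see why accumulation preserves gradient quality: summing per-sample gradients scaled by 1/(effective batch size) over all micro-batches yields exactly the full-batch mean gradient. A tiny standalone example on a toy 1-D model (not MobileFineTuner's trainer API) illustrating one accumulated update:

```cpp
// Standalone illustration of gradient accumulation on a toy model y = w * x.
// Scaling each per-sample gradient by 1/(effective batch size) while looping
// over micro-batches is equivalent to averaging each micro-batch loss and
// scaling it by 1/grad_accum_steps, as the CLI flags above do.
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int grad_accum_steps = 4;
    const float lr = 0.1f;
    std::vector<std::pair<float, float>> batch = {
        {1.0f, 2.0f}, {2.0f, 4.0f}, {3.0f, 6.0f}, {4.0f, 8.0f},
        {0.5f, 1.0f}, {1.5f, 3.0f}, {2.5f, 5.0f}, {3.5f, 7.0f}};  // effective batch of 8

    float w = 0.0f, grad = 0.0f;
    const int micro_size = static_cast<int>(batch.size()) / grad_accum_steps;  // 2 samples
    for (int micro = 0; micro < grad_accum_steps; ++micro) {
        for (int k = 0; k < micro_size; ++k) {
            auto [x, y] = batch[micro * micro_size + k];
            float err = w * x - y;                    // dL/dw for L = 0.5 * err^2
            grad += (err * x) / batch.size();         // scale to the full-batch mean
        }
    }
    w -= lr * grad;                                   // one optimizer step per effective batch
    std::printf("w after one accumulated step: %f\n", w);
    return 0;
}
```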
## Energy-Aware Training

MobileFineTuner includes a power monitor (`opt_ops/energy/power_monitor`) that adapts computation intensity to battery and temperature constraints, extending battery life during sustained training. The monitor uses frequency-based throttling (Hz), which is internally converted to a sleep duration (ms) between training steps.
Battery-Based Throttling:
```bash
--pm_interval 10        # Check battery every 10 steps
--pm_batt_thresh 20.0   # Throttle below 20% battery (default threshold)
--pm_fb_high 2.0        # Frequency high: 2 Hz (500ms sleep) when battery < threshold
--pm_fb_low 0.5         # Frequency low: 0.5 Hz (2000ms sleep) when battery ≥ threshold
```

Temperature-Based Throttling:
```bash
--pm_temp_thresh 42.0   # Throttle above 42°C (default threshold)
--pm_ft_high 2.0        # Frequency high: 2 Hz when temp > threshold
--pm_ft_low 0.5         # Frequency low: 0.5 Hz when temp ≤ threshold
```

Manual Schedule Override:
```bash
--pm_schedule "0-99:300,100-199:200,200-:100"
# Steps 0-99:    300ms sleep per step
# Steps 100-199: 200ms sleep per step
# Steps 200+:    100ms sleep per step
```

Benefits:

- Energy Savings: Adjustable throttling based on real-time battery/temperature telemetry
- Thermal Management: Prevents device overheating during extended training sessions
- Flexible Control: Supports manual telemetry simulation and deterministic schedule override
- Minimal Accuracy Loss: Training time increases with sleep duration, but final model quality is preserved
Note: Frequency parameters (pm_fb_high, pm_ft_high) are specified in Hz and internally converted to sleep milliseconds. For example, 2.0 Hz = 500ms sleep, 0.5 Hz = 2000ms sleep.
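The conversion is simply `sleep_ms = 1000 / freq_hz`. The snippet below is a standalone sketch of that conversion plus a parser for the manual schedule format shown above; it is simplified relative to the real power monitor (no telemetry handling), and the parsing details are an assumption based on the documented `begin-end:ms` syntax.

```cpp
// Standalone sketch: convert a throttle frequency (Hz) to a per-step sleep (ms)
// and look up a sleep value from a manual schedule string such as
// "0-99:300,100-199:200,200-:100". Simplified: no error handling or telemetry.
#include <sstream>
#include <string>
#include <vector>

int sleep_ms_from_hz(double freq_hz) {
    return static_cast<int>(1000.0 / freq_hz);      // 2.0 Hz -> 500 ms, 0.5 Hz -> 2000 ms
}

struct ScheduleEntry { long begin; long end; int sleep_ms; };  // end < 0 means open-ended

std::vector<ScheduleEntry> parse_schedule(const std::string& spec) {
    std::vector<ScheduleEntry> entries;
    std::stringstream ss(spec);
    std::string item;
    while (std::getline(ss, item, ',')) {           // e.g. "100-199:200" or "200-:100"
        auto colon = item.find(':');
        auto dash  = item.find('-');
        long begin = std::stol(item.substr(0, dash));
        std::string end_str = item.substr(dash + 1, colon - dash - 1);
        long end = end_str.empty() ? -1 : std::stol(end_str);
        entries.push_back({begin, end, std::stoi(item.substr(colon + 1))});
    }
    return entries;
}

int sleep_for_step(const std::vector<ScheduleEntry>& sched, long step, int fallback_ms) {
    for (const auto& e : sched)
        if (step >= e.begin && (e.end < 0 || step <= e.end)) return e.sleep_ms;
    return fallback_ms;
}
```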
## Evaluation

### Perplexity (WikiText-2)

Measure language modeling quality:
```bash
./build/eval_ppl \
  --data_root data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_path runs/gpt2_lora.safetensors \
  --lora_merge 1
```

Expected Results:
- GPT-2 Small baseline: ~29.5 PPL
- GPT-2 Small + LoRA (1 epoch): ~26.8 PPL
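For reference, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better:

```cpp
// Perplexity from a summed token-level negative log-likelihood (standard definition).
#include <cmath>

double perplexity(double summed_token_nll, long long num_tokens) {
    return std::exp(summed_token_nll / static_cast<double>(num_tokens));
}
```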
### MMLU

Multi-task language understanding:
```bash
./build/eval_mmlu \
  --mmlu_root data/mmlu/data \
  --split dev \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_path runs/gpt2_lora.safetensors \
  --lora_merge 1 \
  --fewshot 0
```

## PyTorch Alignment

For numerical validation and debugging, MobileFineTuner includes PyTorch reference implementations in `pytorch_alignment/`:
```bash
# GPT-2 LoRA (PyTorch baseline)
python pytorch_alignment/gpt2_lora_finetune.py \
  --data_dir data/wikitext2/wikitext-2-raw \
  --pretrained_dir gpt2_lora_finetune/pretrained/gpt2 \
  --lora_out pytorch_runs/gpt2_lora \
  --epochs 1 --batch_size 8 --seq_len 128

# Gemma LoRA (PyTorch baseline)
python pytorch_alignment/gemma_lora_finetune.py \
  --model_dir gemma-3-270m \
  --data_dir data/wikitext2/wikitext-2-raw \
  --output_dir pytorch_runs/gemma_lora \
  --epochs 1 --batch 4 --seq_len 256
```

Use Cases:
- Verify loss curves match C++ implementation
- Compare final adapter weights for numerical parity
- Debug gradient flow and optimizer behavior
## Benchmarks

### Peak Memory Usage

Performance benchmarks on commodity mobile devices:

| Model | Baseline | + Sharding (512MB) | Reduction |
|---|---|---|---|
| GPT-2 Small | 2.4 GB | 1.0 GB | 58% |
| GPT-2 Medium | 3.8 GB | 1.7 GB | 55% |
| Gemma 270M | 4.2 GB | 2.5 GB | 40% |
| Gemma 1B | 8.5 GB | 4.8 GB | 44% |
### Training Speed

Training speed depends on device hardware, BLAS acceleration, and model size. Representative examples:

| Model | Configuration | Approximate Time/Epoch |
|---|---|---|
| GPT-2 Small | batch=4, seq_len=128, BLAS enabled | 4-6 hours on modern mobile SoC |
| GPT-2 Medium | batch=4, seq_len=128, BLAS enabled | 10-14 hours on modern mobile SoC |
| Gemma 270M | batch=4, seq_len=256, BLAS enabled | 8-12 hours on modern mobile SoC |
Times vary significantly with device specs, thermal throttling, and memory optimization settings.

### Energy Impact

| Configuration | Training Time | Energy Impact |
|---|---|---|
| No throttling | Baseline | High power draw, thermal throttling likely |
| Energy-aware throttling | 1.5-2× baseline | Reduced power, extended battery life |
Note: Actual power savings depend on device hardware, battery level, and throttling aggressiveness.
## Project Structure

```text
MobileFineTuner/
├── operators/ # Core C++ framework
│ ├── finetune_ops/
│ │ ├── core/ # Tensor, autograd, memory manager
│ │ ├── graph/ # GPT-2, Gemma model graphs
│ │ ├── nn/ # Neural network layers (LoRA, Linear, etc.)
│ │ ├── optim/ # Optimizers (Adam), trainers
│ │ └── data/ # WikiText-2 dataset loader
│ ├── opt_ops/
│ │ ├── energy/ # Power monitor
│ │ └── sharding/ # Parameter sharder
│ └── CMakeLists.txt
├── gpt2_lora_finetune/ # GPT-2 training/eval CLIs
│ ├── main.cpp # LoRA training entry point
│ ├── eval_ppl.cpp # Perplexity evaluation
│ └── eval_mmlu.cpp # MMLU benchmark
├── pytorch_alignment/ # PyTorch reference scripts
│ ├── gpt2_lora_finetune.py
│ ├── gpt2_full_finetune.py
│ └── gemma_lora_finetune.py
├── scripts/ # Automation and benchmarking
│ ├── benchmark/ # Sharding, energy benchmarks
│ └── Finetune/ # Training scripts
├── data/ # Expected data root
│ ├── wikitext2/
│ └── mmlu/
├── gemma-3-270m/ # Gemma model weights (HF format)
├── gemma-3-1b/
└── README.md
```

## Contributing

We welcome contributions from the community! Areas of interest include:
- New Model Architectures: Llama, Mistral, Qwen, etc.
- Mobile Platform Support: iOS Metal acceleration, Android NNAPI integration
- Optimization Techniques: FlashAttention, quantization (INT8/INT4), model pruning
- Federated Learning: Distributed training protocols for privacy-preserving aggregation
- Benchmarking: Real-world mobile device experiments and profiling
To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/your-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/your-feature`)
- Open a Pull Request
Code Style:

- C++: Follow Google C++ Style Guide
- Python: Follow PEP 8 with Black formatter
- Documentation: Add inline comments for complex logic, update README for new features
## Citation

Authors:
- Jiaxiang Geng (Duke Kunshan University, The University of Hong Kong)
- Lunyu Zhao (Duke Kunshan University)
- Yiyi Lu (Duke Kunshan University)
- Bing Luo (Duke Kunshan University)
Email: {jg645, lz269, yl996, bl291}@duke.edu
We thank the open-source community for foundational tools and datasets:
- HuggingFace Transformers for model implementations and pretrained weights
- Microsoft DeepSpeed for ZeRO optimizer inspiration
- WikiText-2 and MMLU benchmark creators
- Apple and Google for mobile hardware access and development tools
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2024 Mobile LLM Fine-Tuning Project Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Built with passion for privacy-preserving mobile AI
