🔗 Paper link: Efficient Large Language Model Inference with Neural Block Linearization
Figure: Illustration of Neural Block Linearization (NBL), which replaces a multi-head attention layer with an efficient linear layer using the closed-form LMMSE estimator.

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges to their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error (LMMSE) estimators. NBL leverages Canonical Correlation Analysis (CCA) to compute a theoretical upper bound on the approximation error, and uses this bound as a substitution criterion, selecting the layers with the lowest linearization error. NBL can be applied efficiently to pre-trained LLMs without fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B increases inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution for improving the inference efficiency of LLMs.
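At its core, the replacement is the closed-form LMMSE (linear regression) estimator fitted on calibration activations: given a block's inputs X and outputs Y, the best linear map in the mean-squared-error sense has a closed form. A minimal numpy sketch (illustrative only; `lmmse_fit`, the ridge term, and all shapes are our own choices, not the repository's code):

```python
import numpy as np

def lmmse_fit(X, Y):
    """Closed-form LMMSE affine map so that Y ≈ X @ W.T + b.

    X: (n, d_in) calibration inputs to the block.
    Y: (n, d_out) corresponding block outputs.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    # Auto- and cross-covariance; tiny ridge for numerical stability.
    cov_xx = Xc.T @ Xc / len(X) + 1e-6 * np.eye(X.shape[1])
    cov_yx = Yc.T @ Xc / len(X)
    W = cov_yx @ np.linalg.inv(cov_xx)   # (d_out, d_in)
    b = mu_y - W @ mu_x
    return W, b

# Sanity check: recover a random affine map from noiseless samples.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 8))
b_true = rng.normal(size=4)
X = rng.normal(size=(256, 8))
Y = X @ W_true.T + b_true
W, b = lmmse_fit(X, Y)
print(np.allclose(W, W_true, atol=1e-3))  # → True
```

The replaced layer then needs only one matrix-vector product per token, which is where the inference speed-up comes from.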
Set up the environment:

```bash
conda create -n llm-drop python=3.10
conda activate llm-drop

# For NBL:
cd ./NBL
pip install -r requirements.txt
```

Apply NBL to a model:

```bash
bash scripts/apply_nbl/layer_nbl_llama.sh
bash scripts/apply_nbl/layer_nbl_mistral.sh
```

These scripts will:
- Generate importance scores for blocks/layers.
- Determine which modules to retain/drop.
- Save compressed model configs and weights.
- Intermediate outputs (CCA values, importance scores) → stored under `/llm_variables/`
- Compressed models → stored under `../results_prune/cache/`
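The stored CCA values drive layer selection: canonical correlations between a block's input and output activations indicate how well a linear map can reproduce the block. A simplified sketch of such a criterion (the scoring function here is a hypothetical proxy for illustration; see the paper for the exact CCA-based upper bound NBL uses):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between activation matrices X (n, dx), Y (n, dy)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the centered column spaces via thin SVD.
    Ux = np.linalg.svd(Xc, full_matrices=False)[0]
    Uy = np.linalg.svd(Yc, full_matrices=False)[0]
    # Singular values of Ux^T Uy are exactly the canonical correlations.
    return np.clip(np.linalg.svd(Ux.T @ Uy, compute_uv=False), 0.0, 1.0)

def linearization_score(X, Y):
    """Proxy score: lower means the block is closer to a linear map."""
    rho = canonical_correlations(X, Y)
    return float(np.mean(1.0 - rho**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
Y_linear = X @ rng.normal(size=(4, 4))   # a perfectly linear "block"
Y_random = rng.normal(size=(1000, 4))    # an unrelated "block"
print(linearization_score(X, Y_linear) < linearization_score(X, Y_random))  # → True
```

Layers with the lowest score would be the first candidates for replacement by their LMMSE approximations.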
Evaluate performance on reasoning benchmarks:

```bash
bash scripts/benchmark/benchmark_lm_eval_llama.sh
bash scripts/benchmark/benchmark_lm_eval_mistral.sh
```

- NBL evaluation builds on EleutherAI/lm-evaluation-harness.
- To reproduce results, please use this fork.
- Custom modeling files for NBL-adapted Mistral/Llama are in `src/llmtuner/model`.
Benchmark inference speed:

```bash
bash scripts/benchmark/benchmark_speed.sh
```

For AWQ-based quantization:
```bash
python quantize.py
```

See AutoAWQ for CUDA-specific installation details.
For speculative decoding with EAGLE:

```bash
bash Speculative/EAGLE/run_speculative_mt_bench.sh
```

To train LoRA adapters and fuse them into the NBL model:

```bash
cd LoRA
python lora.py       # trains LoRA adapters
python lora_save.py  # fuses tuned layers into NBL model
```
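Conceptually, fusing a trained LoRA adapter into a linear layer is a one-time weight merge, after which inference needs no extra adapter matmuls. A schematic sketch (the shapes, `alpha`/`r` scaling convention, and `fuse_lora` helper are assumptions for illustration, not the repository's implementation):

```python
import numpy as np

def fuse_lora(W, A, B, alpha, r):
    """Merge a LoRA adapter into a base weight matrix.

    W: (d_out, d_in) base weight; A: (r, d_in), B: (d_out, r).
    Returns a single weight computing the same map as W plus the
    scaled low-rank update.
    """
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 5))
A = rng.normal(size=(2, 5))
B = rng.normal(size=(6, 2))
x = rng.normal(size=5)
fused = fuse_lora(W, A, B, alpha=4.0, r=2)
# Fused forward pass matches the base + adapter path.
print(np.allclose(fused @ x, W @ x + (4.0 / 2) * (B @ (A @ x))))  # → True
```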
To measure calibration runtime:

```bash
cd "Calibration Runtime"
python calc.py
```

- NBL builds on CASE-Lab-UMD/LLM-Drop
- Evaluation via EleutherAI/lm-evaluation-harness
- Quantization via AutoAWQ
```bibtex
@article{erdogan2025efficient,
  title={Efficient Large Language Model Inference with Neural Block Linearization},
  author={Erdogan, Mete and Tonin, Francesco and Cevher, Volkan},
  journal={arXiv preprint arXiv:2505.21077},
  year={2025}
}
```
