🔗 Paper link: Efficient Large Language Model Inference with Neural Block Linearization
Figure: Illustration of Neural Block Linearization (NBL), which replaces a multi-head attention layer with an efficient linear layer using the closed-form LMMSE estimator.

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges to their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error (LMMSE) estimators. NBL leverages Canonical Correlation Analysis (CCA) to compute a theoretical upper bound on the approximation error, and uses this bound as a substitution criterion, selecting the layers with the lowest linearization error. NBL can be applied efficiently to pre-trained LLMs without fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B increases inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution for improving the inference efficiency of LLMs.
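At its core, the replacement is the closed-form LMMSE (linear regression) estimator fitted on calibration activations: given a block's inputs X and outputs Y, the best linear map in the mean-squared-error sense has a closed form. A minimal numpy sketch (illustrative only; `lmmse_fit`, the ridge term, and all shapes are our own choices, not the repository's code):

```python
import numpy as np

def lmmse_fit(X, Y):
    """Closed-form LMMSE affine map so that Y ≈ X @ W.T + b.

    X: (n, d_in) calibration inputs to the block.
    Y: (n, d_out) corresponding block outputs.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    # Auto- and cross-covariance; tiny ridge for numerical stability.
    cov_xx = Xc.T @ Xc / len(X) + 1e-6 * np.eye(X.shape[1])
    cov_yx = Yc.T @ Xc / len(X)
    W = cov_yx @ np.linalg.inv(cov_xx)   # (d_out, d_in)
    b = mu_y - W @ mu_x
    return W, b

# Sanity check: recover a random affine map from noiseless samples.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 8))
b_true = rng.normal(size=4)
X = rng.normal(size=(256, 8))
Y = X @ W_true.T + b_true
W, b = lmmse_fit(X, Y)
print(np.allclose(W, W_true, atol=1e-3))  # → True
```

The replaced layer then needs only one matrix-vector product per token, which is where the inference speed-up comes from.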
Set up the environment:

```bash
conda create -n llm-drop python=3.10
conda activate llm-drop

# For NBL:
cd ./NBL
pip install -r requirements.txt
```

Apply NBL to a model:

```bash
bash scripts/apply_nbl/layer_nbl_llama.sh
bash scripts/apply_nbl/layer_nbl_mistral.sh
```

These scripts will:
- Generate importance scores for blocks/layers.
- Determine which modules to retain/drop.
- Save compressed model configs and weights.
- Intermediate outputs (CCA values, importance scores) → stored under `/llm_variables/`
- Compressed models → stored under `../results_prune/cache/`
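The stored CCA values drive layer selection: canonical correlations between a block's input and output activations indicate how well a linear map can reproduce the block. A simplified sketch of such a criterion (the scoring function here is a hypothetical proxy for illustration; see the paper for the exact CCA-based upper bound NBL uses):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between activation matrices X (n, dx), Y (n, dy)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the centered column spaces via thin SVD.
    Ux = np.linalg.svd(Xc, full_matrices=False)[0]
    Uy = np.linalg.svd(Yc, full_matrices=False)[0]
    # Singular values of Ux^T Uy are exactly the canonical correlations.
    return np.clip(np.linalg.svd(Ux.T @ Uy, compute_uv=False), 0.0, 1.0)

def linearization_score(X, Y):
    """Proxy score: lower means the block is closer to a linear map."""
    rho = canonical_correlations(X, Y)
    return float(np.mean(1.0 - rho**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
Y_linear = X @ rng.normal(size=(4, 4))   # a perfectly linear "block"
Y_random = rng.normal(size=(1000, 4))    # an unrelated "block"
print(linearization_score(X, Y_linear) < linearization_score(X, Y_random))  # → True
```

Layers with the lowest score would be the first candidates for replacement by their LMMSE approximations.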
Evaluate performance on reasoning benchmarks:

```bash
bash scripts/benchmark/benchmark_lm_eval_llama.sh
bash scripts/benchmark/benchmark_lm_eval_mistral.sh
```

- NBL evaluation builds on EleutherAI/lm-evaluation-harness.
- To reproduce results, please use this fork.
- Custom modeling files for NBL-adapted Mistral/Llama are in `src/llmtuner/model`.
Benchmark inference speed:

```bash
bash scripts/benchmark/benchmark_speed.sh
```

For AWQ-based quantization:
```bash
python quantize.py
```

See AutoAWQ for CUDA-specific installation details.
For speculative decoding with EAGLE:

```bash
bash Speculative/EAGLE/run_speculative_mt_bench.sh
```

To train LoRA adapters and fuse them into the NBL model:

```bash
cd LoRA
python lora.py       # trains LoRA adapters
python lora_save.py  # fuses tuned layers into NBL model
```
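Conceptually, fusing a trained LoRA adapter into a linear layer is a one-time weight merge, after which inference needs no extra adapter matmuls. A schematic sketch (the shapes, `alpha`/`r` scaling convention, and `fuse_lora` helper are assumptions for illustration, not the repository's implementation):

```python
import numpy as np

def fuse_lora(W, A, B, alpha, r):
    """Merge a LoRA adapter into a base weight matrix.

    W: (d_out, d_in) base weight; A: (r, d_in), B: (d_out, r).
    Returns a single weight computing the same map as W plus the
    scaled low-rank update.
    """
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 5))
A = rng.normal(size=(2, 5))
B = rng.normal(size=(6, 2))
x = rng.normal(size=5)
fused = fuse_lora(W, A, B, alpha=4.0, r=2)
# Fused forward pass matches the base + adapter path.
print(np.allclose(fused @ x, W @ x + (4.0 / 2) * (B @ (A @ x))))  # → True
```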
To measure calibration runtime:

```bash
cd "Calibration Runtime"
python calc.py
```

- NBL builds on CASE-Lab-UMD/LLM-Drop
- Evaluation via EleutherAI/lm-evaluation-harness
- Quantization via AutoAWQ
```bibtex
@article{erdogan2025efficient,
  title={Efficient Large Language Model Inference with Neural Block Linearization},
  author={Erdogan, Mete and Tonin, Francesco and Cevher, Volkan},
  journal={arXiv preprint arXiv:2505.21077},
  year={2025}
}
```
