📚 Paper Reading

A curated collection of research papers on AI systems, compilers, architecture, and systems software.

🔧 Deep Learning Compiler

Status	Paper	Venue	Links
✅	The Deep Learning Compiler: A Comprehensive Survey	—	Paper / Note
✅	MLIR: Scaling Compiler Infrastructure for Domain Specific Computation	CGO'21	Paper / Note
✅	TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code	CGO'19	Paper / Note
✅	Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks	OSDI'20	Paper / Note
✅	ROLLER: Fast and Efficient Tensor Compilation for Deep Learning	OSDI'22	Paper / Note
✅	BOLT: Bridging The Gap Between Auto-Tuners and Hardware-Native Performance	MLSys'22	Paper / Note
✅	AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures	ASPLOS'22	Paper / Note
✅	AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction	ISCA'22	Paper / Note
✅	Welder: Scheduling Deep Learning Memory Access via Tile-graph	OSDI'23	Paper / Note
✅	Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators	OSDI'23	Paper
✅	Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning	OSDI'23	Paper / Note
✅	Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion	HPCA'23	Paper / Note
✅	Graphene: An IR for Optimized Tensor Computations on GPUs	ASPLOS'23	Paper / Note
✅	Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor	SOSP'24	Paper
✅	ThunderKittens: Simple, Fast, and Adorable AI Kernels	—	Paper
✅	Mirage: A Multi-Level Superoptimizer for Tensor Programs	OSDI'25	Paper
✅	PipeThreader: Software-Defined Pipelining for Efficient DNN Execution	OSDI'25	Paper
✅	TileLang: A Composable Tiled Programming Model for AI Systems	—	Paper
✅	Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References	arXiv'25	Paper
✅	KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads	OSDI'25	Paper

🚀 LLM Inference

General

Status	Paper	Venue	Links
✅	A Survey of LLM Inference Systems	—	Paper / Note
⬜	WaferLLM: Large Language Model Inference at Wafer Scale	OSDI'25	Paper

Long Context Inference

Status	Paper	Venue	Links
✅	Training-Free Long-Context Scaling of Large Language Models	ICML'24	Paper / Note
✅	Efficient Streaming Language Models with Attention Sinks	ICLR'24	Paper
✅	Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference	ICML'24	Paper
✅	DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads	ICLR'25	Paper

LLM Serving

Status	Paper	Venue	Links
✅	SGLang: Efficient Execution of Structured Language Model Programs	—	Paper
✅	FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving	—	Paper
⬜	DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving	OSDI'24	Paper
⬜	LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism	SOSP'24	Paper
⬜	Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot	FAST'25	Paper
⬜	NanoFlow: Towards Optimal Large Language Model Serving Throughput	OSDI'25	Paper

MegaKernel

Status	Paper	Venue	Links
✅	Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B	Blog	Paper
✅	Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs	arXiv'25	Paper
✅	TileRT: Tile-Based Runtime for Ultra-Low-Latency LLM Inference	—	Paper
✅	SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations	arXiv'25	Paper
✅	Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference	Blog	Paper

🏋️ LLM Training

Distributed Training

Status	Paper	Venue	Links
⬜	LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism	—	Paper

RL Training

Status	Paper	Venue	Links
✅	Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning	arXiv'25	Paper

Compute-Communication Overlap

Status	Paper	Venue	Links
✅	Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion	—	Paper
✅	DeepEP: An Efficient Expert-Parallel Communication Library	—	Paper
⬜	Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning	ASPLOS'24	Paper
⬜	Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts	—	Paper
⬜	TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives	MLSys'25	Paper
⬜	Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler	—	Paper
⬜	FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation	EuroSys'25	Paper
⬜	TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference	—	Paper

🧠 Deep Learning

Attention Mechanisms & Variants

Status	Paper	Venue	Links
✅	Attention Is All You Need	NeurIPS'17	Paper / Note
✅	Big Bird: Transformers for Longer Sequences	NeurIPS'20	Paper / Note
✅	FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness	NeurIPS'22	Paper / Note
✅	FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning	arXiv	Paper / Note
✅	Flash-Decoding for Long-Context Inference	Blog	Paper / Note
✅	A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention	—	Paper

New Architectures

Status	Paper	Venue	Links
✅	Gated Linear Attention Transformers with Hardware-Efficient Training	arXiv	Paper / Note
✅	Kimi Linear: An Expressive, Efficient Attention Architecture	arXiv'25	Paper
✅	DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models	arXiv'25	Paper

On-Device / Mobile

Status	Paper	Venue	Links
✅	On-Device Training Under 256KB Memory	NeurIPS'22	Paper
✅	PockEngine: Sparse and Efficient Fine-tuning in a Pocket	MICRO'23	Paper / Note

🤖 LLM for Kernel Optimization

Status	Paper	Venue	Links
✅	AVO: Agentic Variation Operators for Autonomous Evolutionary Search	arXiv'26	Paper / Note

🖥️ GPU Microarchitecture

Status	Paper	Venue	Links
⬜	Understanding Latency Hiding on GPUs	—	Paper

📐 Math Foundations

Status	Paper	Venue	Links
⬜	Categorical Foundations for CuTe Layouts	—	Paper

⚙️ Compiler

Status	Paper	Venue	Links
✅	Honeycomb: Secure and Efficient GPU Executions via Static Validation	OSDI'23	Paper / Note
✅	HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis	ASPLOS'24	Paper / Note

🐧 Operating Systems

Status	Paper	Venue	Links
✅	RedLeaf: Isolation and Communication in a Safe Operating System	OSDI'20	Paper / Note
✅	Theseus: an Experiment in Operating System Structure and State Management	OSDI'20	Paper
✅	Unikraft: Fast, Specialized Unikernels the Easy Way	EuroSys'21	Paper / Note
✅	The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems	SOSP'21	Paper / Note

🛡️ Hypervisor & Virtualization

Status	Paper	Venue	Links
✅	HyperBench: A Benchmark Suite for Virtualization Capabilities	—	Paper / Note
✅	DuVisor: a User-level Hypervisor Through Delegated Virtualization	arXiv'22	Paper
✅	AvA: Accelerated Virtualization of Accelerators	ASPLOS'20	Paper
✅	Security and Performance in the Delegated User-level Virtualization	OSDI'23	Paper / Note
✅	System Virtualization for Neural Processing Units	HotOS'23	Paper
✅	Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs	EuroSys'23	Paper / Note
✅	Honeycomb: Secure and Efficient GPU Executions via Static Validation	OSDI'23	Paper / Note

🔬 RISC-V

Status	Paper	Venue	Links
✅	A First Look at RISC-V Virtualization from an Embedded Systems Perspective	TC'21	Paper
✅	CVA6 RISC-V Virtualization: Architecture, Microarchitecture, and Design Space Exploration	arXiv'23	Paper

If you find this list helpful, feel free to ⭐ star this repo!
