A curated collection of research papers on AI systems, compilers, architecture, and systems software.
Legend: ✅ = Read | ⬜ = To Read | 📝 = Note Available

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | The Deep Learning Compiler: A Comprehensive Survey | — | Paper / Note |
| ✅ | MLIR: Scaling Compiler Infrastructure for Domain Specific Computation | CGO'21 | Paper / Note |
| ✅ | TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code | CGO'19 | Paper / Note |
| ✅ | Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks | OSDI'20 | Paper / Note |
| ✅ | ROLLER: Fast and Efficient Tensor Compilation for Deep Learning | OSDI'22 | Paper / Note |
| ✅ | BOLT: Bridging The Gap Between Auto-Tuners and Hardware-Native Performance | MLSys'22 | Paper / Note |
| ✅ | AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures | ASPLOS'22 | Paper / Note |
| ✅ | AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction | ISCA'22 | Paper / Note |
| ✅ | Welder: Scheduling Deep Learning Memory Access via Tile-graph | OSDI'23 | Paper / Note |
| ✅ | Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators | OSDI'23 | Paper |
| ✅ | Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning | OSDI'23 | Paper / Note |
| ✅ | Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion | HPCA'23 | Paper / Note |
| ✅ | Graphene: An IR for Optimized Tensor Computations on GPUs | ASPLOS'23 | Paper / Note |
| ✅ | Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor | SOSP'24 | Paper |
| ✅ | ThunderKittens: Simple, Fast, and Adorable AI Kernels | — | Paper |
| ✅ | Mirage: A Multi-Level Superoptimizer for Tensor Programs | OSDI'25 | Paper |
| ✅ | PipeThreader: Software-Defined Pipelining for Efficient DNN Execution | OSDI'25 | Paper |
| ✅ | TileLang: A Composable Tiled Programming Model for AI Systems | — | Paper |
| ✅ | Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References | arXiv'25 | Paper |
| ✅ | KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads | OSDI'25 | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | A Survey of LLM Inference Systems | — | Paper / Note |
| ⬜ | WaferLLM: Large Language Model Inference at Wafer Scale | OSDI'25 | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Training-Free Long-Context Scaling of Large Language Models | ICML'24 | Paper / Note |
| ✅ | Efficient Streaming Language Models with Attention Sinks | ICLR'24 | Paper |
| ✅ | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML'24 | Paper |
| ✅ | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | ICLR'25 | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | SGLang: Efficient Execution of Structured Language Model Programs | — | Paper |
| ✅ | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | — | Paper |
| ⬜ | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI'24 | Paper |
| ⬜ | LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism | SOSP'24 | Paper |
| ⬜ | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | FAST'25 | Paper |
| ⬜ | NanoFlow: Towards Optimal Large Language Model Serving Throughput | OSDI'25 | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B | Blog | Paper |
| ✅ | Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs | arXiv'25 | Paper |
| ✅ | TileRT: Tile-Based Runtime for Ultra-Low-Latency LLM Inference | — | Paper |
| ✅ | SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations | arXiv'25 | Paper |
| ✅ | Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference | Blog | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ⬜ | LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism | — | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning | arXiv'25 | Paper |
Compute-Communication Overlap

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion | — | Paper |
| ✅ | DeepEP: An Efficient Expert-Parallel Communication Library | — | Paper |
| ⬜ | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | ASPLOS'24 | Paper |
| ⬜ | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | — | Paper |
| ⬜ | TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives | MLSys'25 | Paper |
| ⬜ | Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler | — | Paper |
| ⬜ | FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation | EuroSys'25 | Paper |
| ⬜ | TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | — | Paper |
Attention Mechanisms & Variants

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Attention Is All You Need | NeurIPS'17 | Paper / Note |
| ✅ | Big Bird: Transformers for Longer Sequences | NeurIPS'20 | Paper / Note |
| ✅ | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS'22 | Paper / Note |
| ✅ | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | arXiv | Paper / Note |
| ✅ | Flash-Decoding for Long-Context Inference | Blog | Paper / Note |
| ✅ | A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention | — | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Gated Linear Attention Transformers with Hardware-Efficient Training | arXiv | Paper / Note |
| ✅ | Kimi Linear Attention: An Expressive, Efficient Attention Architecture | arXiv'25 | Paper |
| ✅ | DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | arXiv'25 | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | On-Device Training Under 256KB Memory | NeurIPS'22 | Paper |
| ✅ | PockEngine: Sparse and Efficient Fine-tuning in a Pocket | MICRO'23 | Paper / Note |
🤖 LLM for Kernel Optimization

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | AVO: Agentic Variation Operators for Autonomous Evolutionary Search | arXiv'26 | Paper / Note |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ⬜ | Understanding Latency Hiding on GPUs | — | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ⬜ | Categorical Foundations for CuTe Layouts | — | Paper |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | Honeycomb: Secure and Efficient GPU Executions via Static Validation | OSDI'23 | Paper / Note |
| ✅ | HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis | ASPLOS'24 | Paper / Note |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | RedLeaf: Isolation and Communication in a Safe Operating System | OSDI'20 | Paper / Note |
| ✅ | Theseus: an Experiment in Operating System Structure and State Management | OSDI'20 | Paper |
| ✅ | Unikraft: Fast, Specialized Unikernels the Easy Way | EuroSys'21 | Paper / Note |
| ✅ | The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems | SOSP'21 | Paper / Note |
🛡️ Hypervisor & Virtualization

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | HyperBench: A Benchmark Suite for Virtualization Capabilities | — | Paper / Note |
| ✅ | DuVisor: a User-level Hypervisor Through Delegated Virtualization | arXiv'22 | Paper |
| ✅ | AvA: Accelerated Virtualization of Accelerators | ASPLOS'22 | Paper |
| ✅ | Security and Performance in the Delegated User-level Virtualization | OSDI'23 | Paper / Note |
| ✅ | System Virtualization for Neural Processing Units | HotOS'23 | Paper |
| ✅ | Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs | EuroSys'23 | Paper / Note |
| ✅ | Honeycomb: Secure and Efficient GPU Executions via Static Validation | OSDI'23 | Paper / Note |

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| ✅ | A First Look at RISC-V Virtualization from an Embedded Systems Perspective | TC'21 | Paper |
| ✅ | CVA6 RISC-V Virtualization: Architecture, Microarchitecture, and Design Space Exploration | arXiv'23 | Paper |

If you find this list helpful, feel free to ⭐ star this repo!