# 📚 Paper Reading

A curated collection of research papers on AI systems, compilers, architecture, and systems software.

## Table of Contents

- 🔧 Deep Learning Compiler
- 🚀 LLM Inference
- 🏋️ LLM Training
- 🧠 Deep Learning
- 🤖 LLM for Kernel Optimization
- 🖥️ GPU Microarchitecture
- 📐 Math Foundations
- ⚙️ Compiler
- 🐧 Operating Systems
- 🛡️ Hypervisor & Virtualization
- 🔬 RISC-V

**Legend:** ✅ = Read | ⬜ = To Read | 📝 = Note Available


## 🔧 Deep Learning Compiler

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | The Deep Learning Compiler: A Comprehensive Survey | | Paper / Note |
| | MLIR: Scaling Compiler Infrastructure for Domain Specific Computation | CGO'21 | Paper / Note |
| | TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code | CGO'19 | Paper / Note |
| | Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks | OSDI'20 | Paper / Note |
| | ROLLER: Fast and Efficient Tensor Compilation for Deep Learning | OSDI'22 | Paper / Note |
| | BOLT: Bridging The Gap Between Auto-Tuners and Hardware-Native Performance | MLSys'22 | Paper / Note |
| | AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures | ASPLOS'22 | Paper / Note |
| | AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction | ISCA'22 | Paper / Note |
| | Welder: Scheduling Deep Learning Memory Access via Tile-graph | OSDI'23 | Paper / Note |
| | Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators | OSDI'23 | Paper |
| | Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning | OSDI'23 | Paper / Note |
| | Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion | HPCA'23 | Paper / Note |
| | Graphene: An IR for Optimized Tensor Computations on GPUs | ASPLOS'23 | Paper / Note |
| | Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor | SOSP'24 | Paper |
| | ThunderKittens: Simple, Fast, and Adorable AI Kernels | | Paper |
| | Mirage: A Multi-Level Superoptimizer for Tensor Programs | OSDI'25 | Paper |
| | PipeThreader: Software-Defined Pipelining for Efficient DNN Execution | OSDI'25 | Paper |
| | TileLang: A Composable Tiled Programming Model for AI Systems | | Paper |
| | Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References | arXiv'25 | Paper |
| | KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads | OSDI'25 | Paper |

## 🚀 LLM Inference

### General

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | A Survey of LLM Inference Systems | | Paper / Note |
| | WaferLLM: Large Language Model Inference at Wafer Scale | OSDI'25 | Paper |

### Long Context Inference

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Training-Free Long-Context Scaling of Large Language Models | ICML'24 | Paper / Note |
| | Efficient Streaming Language Models with Attention Sinks | ICLR'24 | Paper |
| | Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | ICML'24 | Paper |
| | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | ICLR'25 | Paper |

### LLM Serving

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | SGLang: Efficient Execution of Structured Language Model Programs | | Paper |
| | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | | Paper |
| | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI'24 | Paper |
| | LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism | SOSP'24 | Paper |
| | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | FAST'25 | Paper |
| | NanoFlow: Towards Optimal Large Language Model Serving Throughput | OSDI'25 | Paper |

### MegaKernel

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B | Blog | Paper |
| | Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs | arXiv'25 | Paper |
| | TileRT: Tile-Based Runtime for Ultra-Low-Latency LLM Inference | | Paper |
| | SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations | arXiv'25 | Paper |
| | Compiling LLMs into a MegaKernel: A Path to Low-Latency Inference | Blog | Paper |

## 🏋️ LLM Training

### Distributed Training

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism | | Paper |

### RL Training

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning | arXiv'25 | Paper |

### Compute-Communication Overlap

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Flux: Fast Software-based Communication Overlap on GPUs through Kernel Fusion | | Paper |
| | DeepEP: An Efficient Expert-Parallel Communication Library | | Paper |
| | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | ASPLOS'24 | Paper |
| | Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts | | Paper |
| | TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives | MLSys'25 | Paper |
| | Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler | | Paper |
| | FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation | EuroSys'25 | Paper |
| | TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | | Paper |

## 🧠 Deep Learning

### Attention Mechanisms & Variants

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Attention Is All You Need | NeurIPS'17 | Paper / Note |
| | Big Bird: Transformers for Longer Sequences | NeurIPS'20 | Paper / Note |
| | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS'22 | Paper / Note |
| | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | arXiv | Paper / Note |
| | Flash-Decoding for Long-Context Inference | Blog | Paper / Note |
| | A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention | | Paper |

### New Architectures

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Gated Linear Attention Transformers with Hardware-Efficient Training | arXiv | Paper / Note |
| | Kimi Linear Attention: An Expressive, Efficient Attention Architecture | arXiv'25 | Paper |
| | DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models | arXiv'25 | Paper |

### On-Device / Mobile

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | On-Device Training Under 256KB Memory | NeurIPS'22 | Paper |
| | PockEngine: Sparse and Efficient Fine-tuning in a Pocket | MICRO'23 | Paper / Note |

## 🤖 LLM for Kernel Optimization

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | AVO: Agentic Variation Operators for Autonomous Evolutionary Search | arXiv'26 | Paper / Note |

## 🖥️ GPU Microarchitecture

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Understanding Latency Hiding on GPUs | | Paper |

## 📐 Math Foundations

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Categorical Foundations for CuTe Layouts | | Paper |

## ⚙️ Compiler

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | Honeycomb: Secure and Efficient GPU Executions via Static Validation | OSDI'23 | Paper / Note |
| | HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis | ASPLOS'24 | Paper / Note |

## 🐧 Operating Systems

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | RedLeaf: Isolation and Communication in a Safe Operating System | OSDI'20 | Paper / Note |
| | Theseus: an Experiment in Operating System Structure and State Management | OSDI'20 | Paper |
| | Unikraft: Fast, Specialized Unikernels the Easy Way | EuroSys'21 | Paper / Note |
| | The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems | SOSP'21 | Paper / Note |

## 🛡️ Hypervisor & Virtualization

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | HyperBench: A Benchmark Suite for Virtualization Capabilities | | Paper / Note |
| | DuVisor: a User-level Hypervisor Through Delegated Virtualization | arXiv'22 | Paper |
| | AvA: Accelerated Virtualization of Accelerators | ASPLOS'22 | Paper |
| | Security and Performance in the Delegated User-level Virtualization | OSDI'23 | Paper / Note |
| | System Virtualization for Neural Processing Units | HotOS'23 | Paper |
| | Nephele: Extending Virtualization Environments for Cloning Unikernel-based VMs | EuroSys'23 | Paper / Note |
| | Honeycomb: Secure and Efficient GPU Executions via Static Validation | OSDI'23 | Paper / Note |

## 🔬 RISC-V

| Status | Paper | Venue | Links |
| --- | --- | --- | --- |
| | A First Look at RISC-V Virtualization from an Embedded Systems Perspective | TC'21 | Paper |
| | CVA6 RISC-V Virtualization: Architecture, Microarchitecture, and Design Space Exploration | arXiv'23 | Paper |

If you find this list helpful, feel free to ⭐ star this repo!
