Custom RMSNorm kernel for Llama 3 8B
This repository provides a custom CUDA implementation of RMSNorm, together with the integration code needed to use it as a drop-in replacement in Transformer-based large language models. LLaMA models are explicitly supported.
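For reference, the operation that such a kernel computes can be sketched in plain Python. This is not the repository's CUDA code; it is a minimal, unoptimized illustration of the standard RMSNorm formula, y_i = x_i / sqrt(mean(x^2) + eps) * w_i, applied to a single hidden vector. The function name `rms_norm` and the default `eps=1e-5` are assumptions chosen for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm over one hidden vector (illustrative, not the CUDA kernel).

    x      -- list of floats, the hidden activations
    weight -- list of floats, the learned per-channel scale
    eps    -- small constant for numerical stability (assumed default)
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Mean of squares over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    # Reciprocal root mean square, with eps inside the square root.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Normalize, then apply the per-channel scale.
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

A reference like this can be useful for checking a custom kernel's output on small inputs, within a floating-point tolerance.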
Start the server:
uvicorn app:app --host 0.0.0.0 --port 8000
Generate text:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain RMSNorm",
    "max_new_tokens": 128
  }'
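
The same request can also be sent from Python using only the standard library. The sketch below mirrors the curl call above; the helper names `build_generate_request` and `generate` are hypothetical, and because the response schema is not documented here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=128,
                           base_url="http://localhost:8000"):
    """Build a POST request for the /generate endpoint (hypothetical helper)."""
    payload = json.dumps(
        {"prompt": prompt, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=128, base_url="http://localhost:8000"):
    """Send the request and return the raw response body as text.

    The response format is an unknown here, so no fields are assumed.
    """
    req = build_generate_request(prompt, max_new_tokens, base_url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `print(generate("Explain RMSNorm"))` would print whatever the endpoint returns.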