
Add GPU-accelerated AV1 encode/decode (VA-API, NVENC/NVDEC, Vulkan Video) #217

@staging-devin-ai-integration

Description

Summary

Add hardware-accelerated AV1 video encoding and decoding to StreamKit, leveraging GPU capabilities via VA-API, NVIDIA NVENC/NVDEC, and (future) Vulkan Video. This builds on the CPU AV1 foundation (#216) and the GPU compositing work (#194) to move toward a zero-copy GPU media pipeline.

Motivation

  • CPU AV1 encoding (rav1e) gives up significant quality to reach real-time speeds; HW encoders deliver better quality at the same latency
  • GPU compositing via wgpu (#194) already keeps frames on the GPU, so HW codecs reduce CPU↔GPU round trips
  • AV1 HW encode is available on recent GPUs from all three major vendors: Intel (Arc+), AMD (RDNA 3+), NVIDIA (RTX 40xx+)
  • The ultimate goal is a zero-copy pipeline: GPU decode → GPU composite → GPU encode with no CPU copies

Prerequisites

HW Acceleration Options

Option A: VA-API via cros-codecs / cros-libva

| Property | Detail |
| --- | --- |
| Crate | cros-codecs v0.0.6 (BSD-3-Clause) |
| AV1 decode | Supported (VA-API) |
| AV1 encode | Supported (VA-API) |
| VP9 decode/encode | Also supported |
| HW support | Intel and AMD on Linux (via intel-media-driver / Mesa). NVIDIA's proprietary driver has no native VA-API support; the community nvidia-vaapi-driver covers decode only |
| Maturity | Solid: developed for ChromeOS/crosvm, 546k downloads |
| wgpu interop | No native interop; requires a DMA-BUF → Vulkan VkImage → wgpu texture bridge (custom work) |

Option B: NVIDIA NVENC/NVDEC (native SDK bindings)

| Property | Detail |
| --- | --- |
| SDK | NVIDIA Video Codec SDK |
| AV1 encode | Supported (RTX 40xx+, Ada Lovelace) |
| AV1 decode | Supported (RTX 30xx+, Ampere) |
| Rust bindings | Custom FFI bindings, or an existing crate such as nvidia-video-codec-sdk |
| HW support | NVIDIA GPUs only |
| Performance | Best-in-class for NVIDIA hardware, lower latency than VA-API on NVIDIA |
| wgpu interop | CUDA↔Vulkan interop is possible but complex |

Option C: Vulkan Video via vk-video

| Property | Detail |
| --- | --- |
| Crate | vk-video v0.2.1 (MIT) |
| AV1 support | Not yet; currently H.264 only |
| wgpu interop | Native: built on wgpu, decoded frames are wgpu::Textures |
| Maturity | Very early (1.2k downloads), developed by Software Mansion for Smelter |
| Vulkan spec | VK_KHR_video_decode_av1 ratified Feb 2024, VK_KHR_video_encode_av1 ratified Nov 2024 |
| Potential | Best long-term option for zero-copy pipeline, but needs AV1 support added |

Proposed Phased Approach

Phase 2a: VA-API HW acceleration (most practical today)

// crates/nodes/src/video/vaapi.rs (new)

/// VA-API accelerated AV1 encoder using cros-codecs.
pub struct VaapiAv1EncoderNode { config: VaapiAv1EncoderConfig }
// Input: I420/NV12 raw frames
// Output: AV1 encoded packets
// Uses cros_codecs::encoder for VA-API AV1 encoding

/// VA-API accelerated AV1 decoder using cros-codecs.  
pub struct VaapiAv1DecoderNode { config: VaapiAv1DecoderConfig }
// Input: AV1 encoded packets
// Output: NV12 frames (VA surface → mapped to CPU buffer)

/// Runtime capability detection
pub fn probe_vaapi_capabilities() -> VaapiCapabilities { ... }
// Checks available VA-API profiles/entrypoints for AV1 encode/decode
  • Feature: vaapi = ["dep:cros-codecs"]
  • Fallback: auto-detect VA-API at runtime, fall back to CPU rav1e/rav1d if unavailable
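  • Works on Intel/AMD on Linux; on NVIDIA, VA-API is available only through the community nvidia-vaapi-driver (decode only), so HW encode on NVIDIA needs Phase 2b
  • Docker: requires the DRM render node passed into the container (e.g. --device /dev/dri) plus VA-API drivers in the image; the --gpus flag applies only to NVIDIA GPUs via the NVIDIA Container Toolkit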
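The runtime fallback described above could look roughly like the sketch below. It assumes `VaapiCapabilities` exposes one boolean per direction; those field names, the backend enums, and the `select_*` functions are hypothetical and not part of cros-codecs or StreamKit today. `None` stands for "no usable VA-API device was found".

```rust
/// Hypothetical capability report; the field names are assumptions for this sketch.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct VaapiCapabilities {
    pub av1_encode: bool,
    pub av1_decode: bool,
}

/// Encoder backends named in this proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncoderBackend {
    Vaapi,
    Rav1e,
}

/// Decoder backends named in this proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecoderBackend {
    Vaapi,
    Rav1d,
}

/// Pick VA-API for encode only when the probe reported AV1 encode support;
/// otherwise fall back to the CPU encoder.
pub fn select_encoder(caps: Option<VaapiCapabilities>) -> EncoderBackend {
    match caps {
        Some(c) if c.av1_encode => EncoderBackend::Vaapi,
        _ => EncoderBackend::Rav1e,
    }
}

/// Same rule for decode. Encode and decode are chosen independently because
/// a driver may expose one entrypoint without the other.
pub fn select_decoder(caps: Option<VaapiCapabilities>) -> DecoderBackend {
    match caps {
        Some(c) if c.av1_decode => DecoderBackend::Vaapi,
        _ => DecoderBackend::Rav1d,
    }
}
```

Selecting per direction means a decode-only driver still gets HW decode while encode falls back to rav1e.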

Phase 2b: NVIDIA NVENC/NVDEC (optional, for best NVIDIA performance)

  • Feature: nvenc = ["dep:nvidia-video-codec-sdk"] (or custom FFI bindings)
  • Only worth pursuing if VA-API on NVIDIA proves insufficient (in performance or in encode coverage)
  • NVENC typically has lower latency and better rate control than VA-API on NVIDIA
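A minimal sketch of how the `vaapi` and `nvenc` Cargo features could gate which backends are compiled in. The `Backend` enum, the `compiled_backends` function, and the preference order (NVENC, then VA-API, then CPU) are assumptions for illustration, not a settled design.

```rust
/// Hypothetical backend identifiers for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Nvenc,
    Vaapi,
    Cpu,
}

/// Backends compiled into this build, in an assumed preference order.
/// `cfg!` evaluates to a compile-time boolean, so a build without the
/// `vaapi` or `nvenc` feature simply reports the CPU backend alone.
pub fn compiled_backends() -> Vec<Backend> {
    let mut backends = Vec::new();
    if cfg!(feature = "nvenc") {
        backends.push(Backend::Nvenc);
    }
    if cfg!(feature = "vaapi") {
        backends.push(Backend::Vaapi);
    }
    // The CPU path (rav1e/rav1d) is always available as the last resort.
    backends.push(Backend::Cpu);
    backends
}
```

Compile-time gating only says what the binary can use; runtime probing is still needed to decide what the host actually supports.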

Phase 2c: Vulkan Video zero-copy pipeline (future, when ecosystem matures)

AV1 packets → Vulkan Video decode → wgpu::Texture (NV12)
  → GPU compositor (wgpu, already merged) → wgpu::Texture (output)
  → Vulkan Video encode → AV1 packets
  • No CPU roundtrips — frames never leave GPU memory
  • Requires vk-video crate to add AV1 support, OR building Vulkan Video AV1 directly using ash + wgpu Vulkan interop
  • The wgpu Device and Queue are shared between compositor and codec
  • This is the ultimate performance target
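The shape of this pipeline can be sketched with plain traits. Everything below is a hypothetical stand-in: `GpuContext` plays the role of the shared wgpu Device and Queue, `GpuFrame` plays the role of a `wgpu::Texture` handle, and `MockStage` exists only so the sketch runs without a GPU. The point is that stages exchange texture handles, never CPU pixel buffers.

```rust
use std::sync::Arc;

/// Stand-in for the shared wgpu Device + Queue pair (assumption for this sketch).
#[derive(Debug)]
pub struct GpuContext {
    pub name: String,
}

/// Stand-in for a GPU texture: a handle plus dimensions, no pixel data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GpuFrame {
    pub texture_id: u64,
    pub width: u32,
    pub height: u32,
}

pub trait GpuDecoder {
    /// AV1 packet in, GPU-resident frame out (None if no frame is produced).
    fn decode(&mut self, packet: &[u8]) -> Option<GpuFrame>;
}

pub trait GpuCompositor {
    fn composite(&mut self, inputs: &[GpuFrame]) -> GpuFrame;
}

pub trait GpuEncoder {
    /// GPU-resident frame in, encoded packet out.
    fn encode(&mut self, frame: &GpuFrame) -> Vec<u8>;
}

/// Decode → composite → encode for one packet. Only encoded bytes cross
/// the CPU boundary; frames are passed between stages as handles.
pub fn process_packet<D, C, E>(dec: &mut D, comp: &mut C, enc: &mut E, packet: &[u8]) -> Option<Vec<u8>>
where
    D: GpuDecoder,
    C: GpuCompositor,
    E: GpuEncoder,
{
    let frame = dec.decode(packet)?;
    let output = comp.composite(&[frame]);
    Some(enc.encode(&output))
}

/// Mock stage that hands out fresh texture ids; all stages share one context.
pub struct MockStage {
    ctx: Arc<GpuContext>,
    next_id: u64,
}

impl MockStage {
    pub fn new(ctx: Arc<GpuContext>) -> Self {
        Self { ctx, next_id: 0 }
    }

    pub fn context(&self) -> &Arc<GpuContext> {
        &self.ctx
    }
}

impl GpuDecoder for MockStage {
    fn decode(&mut self, packet: &[u8]) -> Option<GpuFrame> {
        if packet.is_empty() {
            return None;
        }
        self.next_id += 1;
        Some(GpuFrame { texture_id: self.next_id, width: 1280, height: 720 })
    }
}

impl GpuCompositor for MockStage {
    fn composite(&mut self, inputs: &[GpuFrame]) -> GpuFrame {
        self.next_id += 1;
        // Output takes the size of the first input, or 0x0 when there is none.
        let (width, height) = inputs.first().map_or((0, 0), |f| (f.width, f.height));
        GpuFrame { texture_id: self.next_id, width, height }
    }
}

impl GpuEncoder for MockStage {
    fn encode(&mut self, frame: &GpuFrame) -> Vec<u8> {
        // Placeholder "bitstream": the texture id as little-endian bytes.
        frame.texture_id.to_le_bytes().to_vec()
    }
}
```

Cloning one `Arc<GpuContext>` into each stage mirrors the shared Device and Queue requirement listed above.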

Tasks

  • Evaluate VA-API AV1 encode/decode quality and latency with cros-codecs (PoC)
  • Implement VaapiAv1EncoderNode and VaapiAv1DecoderNode
  • Add runtime GPU capability detection and CPU fallback
  • Evaluate NVIDIA NVENC/NVDEC bindings (if VA-API perf on NVIDIA is insufficient)
  • Update Docker GPU images for VA-API/NVENC support
  • Benchmark HW vs CPU encode/decode (latency, quality, throughput)
  • Track vk-video AV1 progress for future zero-copy integration
  • Update docs
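For the benchmarking task, a small harness along these lines might be enough to compare per-frame latency between backends. The function names are hypothetical, and the closure stands in for a single encode or decode call; the percentile uses the nearest-rank method.

```rust
use std::time::{Duration, Instant};

/// Time `iterations` calls of `op` and return one latency sample per call.
pub fn measure_latencies<F: FnMut()>(iterations: usize, mut op: F) -> Vec<Duration> {
    let mut samples = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        op();
        samples.push(start.elapsed());
    }
    samples
}

/// Nearest-rank percentile for `p` in 0.0..=100.0; None for an empty sample set.
pub fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank: ceil(p/100 * n), clamped to a valid 1-based rank.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}
```

Reporting p50 and p99 rather than the mean keeps occasional stalls (driver queue flushes, surface allocation) visible in the comparison.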

Open Questions

  1. VA-API vs NVENC priority — Start with VA-API (cross-vendor) or NVENC (best NVIDIA perf)?
  2. DMA-BUF → wgpu bridge — Build custom interop for Phase 2a, or wait for Vulkan Video (Phase 2c)?
  3. Auto-selection — Should the runtime automatically choose HW > CPU, or let the user configure?
  4. Docker story — Separate Dockerfile.gpu-vaapi / Dockerfile.gpu-nvenc, or unified with runtime detection?
