Detailed description of the requested feature
I've been trying to achieve low-latency, efficient video processing, something I think Jetson Thor is particularly well suited for.
By the time I route a video clip to the VLM, the frames are already in GPU memory (NVMM buffers). They may well exist as tensors too, since a neural network detection is what flagged the clip for VLM processing in the first place.
It seems there's currently no API for passing such frames in directly.
The only image entry points are the host pybinds load_image_from_path and load_image_from_bytes (edgellm_pybind.cpp:224-225). As far as I can tell, no device/DLPack/NVMM input API exists. The only way in is an encoded image file or JPEG bytes, which get stb-decoded back into a host buffer, even when the frame is already a decoded GPU tensor.
So the frame has to be encoded and then decoded again, which seems a waste.
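To illustrate what a device-side entry point would have to distinguish, here is a minimal Python sketch built on the DLPack `__dlpack_device__` protocol. The helper `is_cuda_resident` and the `FakeGpuFrame` class are hypothetical names made up for this example and are not part of tensorrt-edge-llm; the device type codes are the ones from the DLPack header (`kDLCPU` = 1, `kDLCUDA` = 2).

```python
from typing import Any

# Device type codes as defined in the DLPack header (dlpack.h).
KDL_CPU = 1
KDL_CUDA = 2


def is_cuda_resident(frame: Any) -> bool:
    """Return True if `frame` reports CUDA device memory via DLPack.

    Hypothetical helper: a device-side entry point could use a check like
    this to decide whether the encode/decode round trip can be skipped.
    """
    if not hasattr(frame, "__dlpack_device__"):
        return False  # not a DLPack producer, so only the host path applies
    device_type, _device_id = frame.__dlpack_device__()
    return int(device_type) == KDL_CUDA


class FakeGpuFrame:
    """Stand-in for a GPU tensor, for illustration only."""

    def __dlpack_device__(self):
        return (KDL_CUDA, 0)  # (device_type, device_id)


print(is_cuda_resident(FakeGpuFrame()))   # a DLPack producer on a CUDA device
print(is_cuda_resident(b"\xff\xd8jpeg"))  # encoded bytes, host path only
```

Frames that pass this check are the ones that currently have to be re-encoded just to get through the host-only entry points.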
Timeline
I can bite the bullet and spend the compute and time on the superfluous encode/decode, so in that sense I'm not blocked. But I do think it keeps the Jetson Thor from reaching its full hardware potential.
Describe alternatives you've considered
Another option is to modify tensorrt-edge-llm itself. I've prototyped this locally (device-aware ImageData, a GPU resize via NPP, and cudaMemcpyDeviceToDevice in the ViT runner) and validated that it produces output identical to the host path.
Looking at the changes, this isn't a local patch or a small PR but a change to the framework itself, which would be better implemented upstream.
Hence this issue rather than a PR.
I also noticed "Native Video Processing" on the roadmap, so maybe you're planning to implement this anyway.
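For illustration, a rough sketch of what "device-aware ImageData" could mean, written in Python for brevity. This is not the prototype code: `MemoryKind`, the `ImageData` fields, and `copy_kind_for` are hypothetical names, and the sketch assumes tightly packed uint8 pixels. Only the two `cudaMemcpyKind` names are real CUDA runtime identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    HOST = "host"
    DEVICE = "device"


@dataclass(frozen=True)
class ImageData:
    """Hypothetical image descriptor that records where the pixels live."""

    ptr: int            # raw address of the pixel buffer (host or device)
    height: int
    width: int
    channels: int
    memory: MemoryKind  # which address space `ptr` belongs to

    @property
    def nbytes(self) -> int:
        # Assumes tightly packed uint8 pixels (one byte per channel).
        return self.height * self.width * self.channels


def copy_kind_for(image: ImageData) -> str:
    """Name the cudaMemcpyKind needed to stage the image into a device input buffer."""
    if image.memory is MemoryKind.DEVICE:
        return "cudaMemcpyDeviceToDevice"
    return "cudaMemcpyHostToDevice"


# Example: a 1280x720 RGB frame that is already on the GPU.
frame = ImageData(ptr=0x7F0000000000, height=720, width=1280,
                  channels=3, memory=MemoryKind.DEVICE)
print(copy_kind_for(frame), frame.nbytes)
```

The point is that the memory location travels with the image, so the runner can pick the right copy instead of assuming a host buffer.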
Target hardware/use case
Hardware: Jetson Thor
Software: JetPack 7.2, CUDA 13.0
Use case: edge video / sensor processing