Move TRT-RTX runtime controls to runtime context managers (v3, for review)#3
Move TRT-RTX runtime controls to runtime context managers (v3, for review)#3tp5uiuc wants to merge 6 commits into
Conversation
…anagers Replaces the v2 design that packed three runtime-mode controls (``cuda_graph_strategy``, ``dynamic_shapes_kernel_specialization_strategy``, ``runtime_cache``) into ``CompilationSettings`` and the serialized engine tuple. Per pytorch#4310, these are runtime mode controls -- not engine properties -- and shouldn't pin at compile time or round-trip through serialization. Highlights: * New ``RuntimeSettings`` dataclass on both Python and C++ sides (``py/torch_tensorrt/runtime/_runtime_settings.py``, ``core/runtime/RuntimeSettings.h``). Three fields: ``dynamic_shapes_kernel_specialization_strategy``, ``cuda_graph_strategy``, ``runtime_cache``. The cache field accepts ``None``, a path string (engine creates an implicit handle, saves on ``__del__``, mirrors old ``runtime_cache_path=`` behavior), or a ``RuntimeCacheHandle`` (shared cache, lifecycle owned by the ``runtime_cache()`` CM). * New ``RuntimeCacheHandle`` registered as a torchbind class (``torch.classes.tensorrt.RuntimeCacheHandle``) so the same C++ ``IRuntimeCache`` shared_ptr crosses the Python/C++ boundary. * New per-engine ``update_runtime_settings`` API on both ``TRTEngine`` flavors. Fast-paths on settings equality; eagerly rebuilds ``IRuntimeConfig`` + recreates execution context on diff. * Three new context managers in ``torch_tensorrt.runtime``: ``runtime_config(target_or_targets, **kw)`` (the pool API; also yields the target so ``with runtime_config(model, ...) as m:`` works), ``runtime_cache(target, path)`` (shared cache CM), and the per-knob sugars ``set_cuda_graph_strategy`` / ``set_dynamic_shapes_kernel_strategy``. All three accept a list of modules for multi-target use; the cache CM yields the ``RuntimeCacheHandle`` for inspection or explicit ``save()``. * New ``runtime_settings=`` kwarg on ``compile()``, ``cross_compile_for_windows()``, and ``convert_module()`` so callers can prime the engine with the right values up front. Compile-time hint avoids the enter/exit recreate cost. * ``CompilationSettings`` loses the three fields; the compiler entry points drop the three kwargs. ``SerializedInfoIndex`` drops the four RTX-related slots; ``SERIALIZATION_LEN`` returns to 12. Engines saved with the old 16-slot layout will raise the existing layout-mismatch error on load. * Three existing test files migrated to the new API; new ``tests/py/dynamo/runtime/test_004_runtime_settings.py`` covers the data model, compile-time hint, runtime CM restore semantics, multi-target form, and dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up bugs exposed by the cross-runtime test parameterization on
the C++ engine path:
1. ``torch.classes.tensorrt.Engine.update_runtime_settings(...)`` rejected
Python ``None`` for the ``RuntimeCacheHandle`` argument because TorchBind
does not auto-convert ``None`` to a null ``c10::intrusive_ptr``. Switch
the signature to ``c10::optional<c10::intrusive_ptr<RuntimeCacheHandle>>``
so the default ``runtime_cache=None`` case round-trips cleanly.
2. ``RuntimeSettings(runtime_cache="/some/path")`` only auto-saved to disk
on engine destruction for the Python runtime (via ``_TRTEngine.__del__``).
The C++ engine had no equivalent saver and the IRuntimeCache it
materialized internally wasn't accessible from Python.
Make the cpp path symmetric:
- Expose ``serialize() -> at::Tensor`` / ``deserialize(at::Tensor)`` /
``has_cache()`` on the torchbind ``RuntimeCacheHandle`` class. ``at::Tensor``
of uint8 is used instead of ``std::string`` because TorchBind forces
``std::string`` through Python ``str`` (UTF-8) and serialized cache bytes
are not valid UTF-8.
- In ``TorchTensorRTModule.setup_engine`` (cpp branch), pre-materialize a
torchbind handle when ``runtime_cache`` is a path string, store it on
the module, and substitute it into ``_runtime_settings`` so the dispatch
passes the same handle through.
- Add ``_load_cpp_implicit_cache`` / ``_save_cpp_implicit_cache`` and a
module ``__del__`` that mirrors the Python ``_TRTEngine`` saver, with
``filelock`` + atomic-rename semantics.
- Teach ``_to_torchbind_handle`` to pass an already-torchbind
``torch.ScriptObject`` through unchanged.
All cpp + python runtime tests pass on TRT-RTX 1.5: test_004 (12/12),
test_000 (10/10), test_001 dynamic_shapes (14/14), test_001 cuda_graph
(13/13).
| runtime_cache: Optional[Union[str, "RuntimeCacheHandle"]] = None # noqa: F821 | ||
|
|
||
| def __post_init__(self) -> None: | ||
| if ( |
There was a problem hiding this comment.
The _runtime_settings is only expected to be used in TRT-RTX. For regular TRT, let's issue a warning here saying this has no effect.
There was a problem hiding this comment.
Done in 4bfa982. RuntimeSettings.__post_init__ now emits a one-shot UserWarning on non-RTX builds (gated by a module-level _NON_RTX_WARNING_EMITTED so it does not spam):
global _NON_RTX_WARNING_EMITTED
if not ENABLED_FEATURES.tensorrt_rtx and not _NON_RTX_WARNING_EMITTED:
warnings.warn(
"RuntimeSettings is only honored on TRT-RTX builds; "
"values will be ignored on standard TRT.",
UserWarning,
stacklevel=2,
)
_NON_RTX_WARNING_EMITTED = True…timeCacheHandle lifecycle Structural cleanup on top of the v3 work (no observable behavior change). C++ side -------- ``RuntimeSettings`` migrates from a ``TRTEngine`` member to a ``TRTRuntimeConfig`` member -- the value-type now lives with its primary consumer (the IRuntimeConfig builder). ``TRTRuntimeConfig`` gains ``set_settings()`` (the diff-and-invalidate primitive) and turns the static ``uses_internal_capture`` / ``is_monolithic_capturable`` helpers into instance methods so callers do not need to pass settings around. ``TRTEngine::runtime_settings()`` forwards through. Python side ----------- Introduces a Python ``TRTRuntimeConfig`` class mirroring the C++ struct. ``_TRTEngine`` drops its three legacy fields (``runtime_config``, ``runtime_settings``, ``_implicit_cache_handle``) for a single ``self._trt_runtime_config`` member; ``_create_execution_context`` / ``update_runtime_settings`` / ``_is_monolithic_capturable`` / ``_enable_rtx_native_cudagraphs`` all delegate. Every ``ENABLED_FEATURES.tensorrt_rtx`` branch related to runtime-mode controls is absorbed into the shim, so engine and module call sites stay uniform across TRT and TRT-RTX builds. Following the project's grouping convention, ``py/torch_tensorrt/runtime/_runtime_settings.py`` is merged into ``_runtime_config.py``; that file now holds ``RuntimeSettings``, the new ``TRTRuntimeConfig``, the existing ``runtime_config()`` CM, and its factory. Imports across the tree are repointed. RuntimeCacheHandle ownership model ---------------------------------- Save-on-destruction moves from the two engine-side ``__del__`` paths (``_TRTEngine.close()`` for Python runtime, ``TorchTensorRTModule.__del__`` for cpp runtime) onto ``RuntimeCacheHandle.__del__`` itself, gated by a new ``autosave_on_del`` flag. The flag is set by ownership context: * Engine-implicit handles (created from a path-string compile-time hint) get ``autosave_on_del=True`` -- no other Python object holds them, so the destructor is the only save opportunity. * The ``runtime_cache(target, path)`` CM uses ``autosave_on_del=False`` on the handle it constructs; its ``__exit__`` saves explicitly. * Hand-built handles default to ``autosave_on_del=False`` so save timing stays under the user's control. The handle additionally accepts a ``torchbind_handle`` sibling so the same Python object can wrap either a ``trt.IRuntimeCache`` (Python rt) or a ``torch.classes.tensorrt.RuntimeCacheHandle`` (cpp rt); ``save`` / ``load`` source bytes from whichever is populated. The cpp-runtime helpers on ``TorchTensorRTModule`` (``_load_cpp_implicit_cache``, ``_save_cpp_implicit_cache``, ``__del__``) and the duplicate save logic in ``_TRTEngine.close()`` are removed; both runtimes funnel through the single ``RuntimeCacheHandle.__del__`` path. Tests ----- test_000 grows two new tests asserting the new contract: * ``test_cm_does_not_double_save_on_rc_gc`` -- only one save fires per CM block even after ``rc`` is GC'd. * ``test_user_built_handle_no_autosave_by_default`` -- hand-built handles do not autosave on GC. All 51 runtime tests pass on the refactored design (test_004 12/12, test_000 12/12, test_001 ds 14/14, test_001 cg 13/13).
Five follow-up changes responding to PR review comments: * **Fold strategy sugar into ``_runtime_config.py``.** Delete ``_dynamic_shapes_kernel_strategy.py`` and ``_cuda_graph_strategy.py``; ``set_dynamic_shapes_kernel_strategy`` / ``set_cuda_graph_strategy`` now live alongside the ``runtime_config`` CM they delegate to. ``torch_tensorrt/runtime/__init__.py`` re-exports them from the consolidated module. * **Hoist ``RuntimeSettings`` defaults into ``_defaults.py``.** Three new constants (``DYNAMIC_SHAPES_KERNEL_SPECIALIZATION_STRATEGY``, ``CUDA_GRAPH_STRATEGY``, ``RUNTIME_CACHE_PATH``) mirror the compilation-settings pattern. ``RUNTIME_CACHE_PATH`` defaults to a per-user temp file similar to ``ENGINE_CACHE_DIR``, so users get a disk-backed runtime cache without explicit opt-in; override via ``RuntimeSettings(runtime_cache="/path")`` or the ``runtime_cache`` CM. Test_000 and test_004 updated to reflect the new default. * **Warn on non-RTX ``RuntimeSettings`` construction.** ``__post_init__`` now emits a one-shot ``UserWarning`` on regular TRT builds (gated by ``ENABLED_FEATURES.tensorrt_rtx``) so users see that the settings have no effect. * **Drop ``TYPE_CHECKING`` string forward-refs for ``RuntimeSettings``.** Direct top-level imports across ``_compiler.py``, ``_conversion.py``, ``_TRTEngine.py`` and ``_TorchTensorRTModule.py``; bare ``Optional[RuntimeSettings]`` annotations everywhere. Deferred imports inside ``__init__`` / ``__setstate__`` removed. All 51 runtime tests pass (test_004 12/12, test_000 12/12, test_001 ds 14/14, test_001 cg 13/13).
| // Expose the underlying IRuntimeCache bytes to Python so the Python- | ||
| // side save/load logic can persist them under filelock. Returns an | ||
| // empty uint8 tensor if the cache hasn't been materialized yet. | ||
| // | ||
| // We return ``at::Tensor`` rather than ``std::string`` because TorchBind | ||
| // forces ``std::string`` to round-trip through Python ``str`` (UTF-8) | ||
| // and serialized cache bytes are not valid UTF-8. | ||
| .def( | ||
| "serialize", | ||
| [](const c10::intrusive_ptr<RuntimeCacheHandle>& self) -> at::Tensor { | ||
| #ifdef TRT_MAJOR_RTX | ||
| auto opts = at::TensorOptions().dtype(at::kByte); | ||
| if (!self->cache) { | ||
| return at::empty({0}, opts); | ||
| } | ||
| auto host_mem = make_trt(self->cache->serialize()); | ||
| if (!host_mem) { | ||
| return at::empty({0}, opts); | ||
| } |
There was a problem hiding this comment.
These methods "serialize", "deserialize", "has_cache" should be implemented elsewhere (in its cpp file). If the concern is introducing at:: symbols we can work around it. This file should only be used for registration.
Another smell is using #ifdef here, these should be hidden in the implementation.
There was a problem hiding this comment.
Done in 111cdb2eb. serialize / deserialize / has_cache are now member functions of RuntimeCacheHandle declared in RuntimeSettings.h and implemented in RuntimeSettings.cpp. All #ifdef TRT_MAJOR_RTX lives inside those method bodies; the registration in register_jit_hooks.cpp is preprocessor-free for this class:
torch::class_<RuntimeCacheHandle>("tensorrt", "RuntimeCacheHandle")
.def(torch::init<std::string>())
.def_readwrite("path", &RuntimeCacheHandle::path)
.def("serialize", &RuntimeCacheHandle::serialize)
.def("deserialize", &RuntimeCacheHandle::deserialize)
.def("has_cache", &RuntimeCacheHandle::has_cache);| [](const c10::intrusive_ptr<TRTEngine>& self, | ||
| std::string const& dynamic_shapes_kernel_specialization_strategy, | ||
| std::string const& cuda_graph_strategy, | ||
| c10::optional<c10::intrusive_ptr<RuntimeCacheHandle>> runtime_cache) -> void { |
There was a problem hiding this comment.
Is it not possible to implement this as a property with getter and setter because of this c10::optional<c10::intrusive_ptr<RuntimeCacheHandle>> runtime_cache?
There was a problem hiding this comment.
Possible — the c10::optional<c10::intrusive_ptr<RuntimeCacheHandle>> signature is fine for torchbind def_property (device_memory_budget immediately below in this same registration is a property on TRTEngine, for a comparable point).
The reason update_runtime_settings is a single bundled setter is that RuntimeSettings is the unit of context invalidation: changing any one of the three fields ends up calling recreate_execution_context once. Splitting into three individual properties would cause three sequential context-recreates on the engine-setup path (where all three are set together via _dispatch_runtime_settings_to_engine). The diff-check inside TRTRuntimeConfig::set_settings would catch no-op repeats, but consecutive changing writes would each trigger a recreate.
If you would rather have property syntax I can split it, but the bundled form keeps setup tight. WDYT?
Mirror ``TRTRuntimeConfig.set_settings`` (Python runtime) on the cpp runtime path. Previously the cpp side dropped the C++ engine's intrusive_ptr on settings change but left ``self._implicit_cache_handle`` on the ``TorchTensorRTModule`` pointing at the *old* wrapper -- the new cache had no Python autosave companion and never wrote to disk. Factor the path-string-to-torchbind-handle materialization into ``TorchTensorRTModule._materialize_cpp_implicit_handle``. Called from ``setup_engine`` and ``_dispatch_runtime_settings_to_engine`` (cpp branch); synchronously saves the prior wrapper before swap, replaces ``self._implicit_cache_handle`` with the new one, then runs ``load()`` after the C++ engine has attached the IRuntimeCache. Test: ``test_set_runtime_settings_saves_prior_cache_on_swap`` (parametrized over both runtimes). Compiles with path A; swaps to path B; asserts A is written synchronously at swap time and B is written on ``del compiled``. The walk-to-inner-module is wrapped in a helper so the loop variable doesn't outlive the call and keep the inner TRT module alive past ``del compiled`` (which would suppress the post-del autosave). All 53 tests pass (test_004 12/12, test_000 14/14, test_001 ds 14/14, test_001 cg 13/13).
| return dynamic_shapes_kernel_specialization_strategy == other.dynamic_shapes_kernel_specialization_strategy && | ||
| cuda_graph_strategy == other.cuda_graph_strategy && runtime_cache.get() == other.runtime_cache.get(); |
There was a problem hiding this comment.
Can't we use std::tie(a,b,c) == std::tie(other.a, other.b, other.c)?
There was a problem hiding this comment.
Done in 111cdb2eb. Note: runtime_cache.get() returns a temporary T* so the raw pointers are hoisted into locals first to satisfy std::ties lvalue requirement.
auto* this_cache = runtime_cache.get();
auto* other_cache = other.runtime_cache.get();
return std::tie(dynamic_shapes_kernel_specialization_strategy, cuda_graph_strategy, this_cache) ==
std::tie(other.dynamic_shapes_kernel_specialization_strategy, other.cuda_graph_strategy, other_cache);| RuntimeSettings result = *this; | ||
| result.dynamic_shapes_kernel_specialization_strategy = override.dynamic_shapes_kernel_specialization_strategy; | ||
| result.cuda_graph_strategy = override.cuda_graph_strategy; | ||
| if (override.runtime_cache) { | ||
| result.runtime_cache = override.runtime_cache; | ||
| } | ||
| return result; | ||
| } |
There was a problem hiding this comment.
I don't like this at all. An easier way of doing this is to define a swap method and implement it that way (similar to copy-and-swap idiom) for example
There was a problem hiding this comment.
Dropped merge() entirely in 111cdb2eb (see also your follow-up comment below making the same point about member-wise assignment). Direct field assignment is the idiom now — no merge() helper.
| .def("path", &RuntimeCacheHandle::path) | ||
| .def("set_path", &RuntimeCacheHandle::set_path) |
There was a problem hiding this comment.
Done in 111cdb2eb. path is now a public field on the (now struct) RuntimeCacheHandle and registered as def_readwrite("path", &RuntimeCacheHandle::path). The old path() getter and set_path() setter methods are gone.
| private: | ||
| std::string path_; |
There was a problem hiding this comment.
What does making path private get us at the end of day? its not like if path changes the runtime cache immediately changes. better to make it public and replace the entire class RuntimeCacheHandle with a struct.
There was a problem hiding this comment.
Done in 111cdb2eb — RuntimeCacheHandle is now a struct with a public path field. The cache shared_ptr is also a public field. Agreed there were no encapsulation invariants worth preserving.
| // For non-default detection on the strategy strings we always overlay; the cache | ||
| // pointer is overlaid iff `override.runtime_cache` is non-null. | ||
| RuntimeSettings merge(RuntimeSettings const& override) const; | ||
|
|
There was a problem hiding this comment.
what's the purpose of this again? Can't we do
auto another_runtime_settings = old.runtime_settings;
another_runtime_settings.dynamic_shapes_kernel_specialization_strategy = "something_else";
old.runtime_settings = another_runtime_settings; // or std::swap(old.runtime_settings, another_runtime_settings);
instead? Why add anothr function if we are going to member-wise swap anyway?
There was a problem hiding this comment.
Dropped merge() in 111cdb2. Direct field assignment / std::swap on the struct is exactly the right pattern — no helper required. (For context: Python RuntimeSettings is @dataclass(frozen=True), so on the Python side RuntimeSettings.merge(**kwargs) exists as the dataclass-replace idiom; that is unrelated and stays.)
| const std::string& serialized_metadata, | ||
| const ResourceAllocationStrategy resource_allocation_strategy, | ||
| TRTRuntimeConfig runtime_cfg) | ||
| RuntimeSettings runtime_settings) |
There was a problem hiding this comment.
Will this constructor be ever called from the Python side so that Runtimesettings get passed in and context gets created only once?
There was a problem hiding this comment.
Today: no. This is the "first" constructor overload (the public one that accepts a RuntimeSettings directly) — but the only torchbind init is the vector<string> one at L87 of register_jit_hooks.cpp, which goes through the deserialization constructor at L114 of TRTEngine.cpp, passing RuntimeSettings{} (defaults). The Python side then calls engine.update_runtime_settings(...) right after construction, which recreates the context once. So construction → recreate-1 (with defaults) → settings dispatch → recreate-2 (with the real settings).
To get it down to one recreate, I would add a second torch::init<...> overload that accepts the strategy strings + the cache handle alongside serialized_info, route through this RuntimeSettings-aware constructor, and skip the post-construction dispatch in Python. Doable in this PR if you want — but the savings are small (one createExecutionContext per engine) and the dispatch path stays as-is for live runtime_config CM changes anyway.
| std::string dynamic_shapes_kernel_specialization_strategy = "lazy"; | ||
| std::string cuda_graph_strategy = "disabled"; |
There was a problem hiding this comment.
Do we need to store strings? Can't we store ints here since they will be converted to enumerations anyway? This way we just static cast them to get the enumeration
There was a problem hiding this comment.
Tradeoff:
- Ints: type-safe;
static_cast<nvinfer1::DynamicShapesKernelSpecializationStrategy>(settings_.foo)is a no-op; one validation layer. - Strings: human-readable
to_str()(logged viaoperator<<) is useful for support reports / debug logs; identical Python-and-C++ value-form for the canonical setting; reasonably grep-able through the codebase.
Today the layout is "strings everywhere, two validation points": Python RuntimeSettings.__post_init__ validates against the string→int map, then TRTRuntimeConfig::ensure_initialized re-validates inside to_trt_ds_strategy / to_trt_cg_strategy. Moving to int internally means: keep the Python-facing API as strings, convert once at the torchbind boundary (update_runtime_settings lambda), store ints, skip the second validation. The cost is that to_str() needs a reverse-lookup table to print "lazy" / "whole_graph_capture".
Happy to push that through if you prefer ints — I held off here because the current form keeps logs human-readable on both sides of the boundary. WDYT?
…Handle C++-side cleanup spurred by review comments on #3: - Convert ``RuntimeCacheHandle`` from a class with a private ``path_`` field + accessor methods (``path()`` / ``set_path()``) to a struct with a public ``path`` field. Re-register the torchbind binding via ``.def_readwrite("path", &RuntimeCacheHandle::path)``. - Move the bodies of ``serialize``, ``deserialize``, and ``has_cache`` out of the JIT-binding registration file (``register_jit_hooks.cpp``) and into member functions implemented in ``RuntimeSettings.cpp``. The ``#ifdef TRT_MAJOR_RTX`` guards live inside those impls; the registration file is preprocessor-free for these bindings. - Use ``std::tie`` in ``RuntimeSettings::operator==`` for cleaner field-wise comparison (raw ``intrusive_ptr::get()`` results hoisted to lvalues to satisfy ``std::tie``'s reference requirement). - Drop ``RuntimeSettings::merge``. C++ ``RuntimeSettings`` is now value-typed end-to-end; direct field assignment is the idiom. No callers used ``merge`` outside its own definition. No behavior change. Python-side ``RuntimeCacheHandle`` wrapper and the runtime test suite are unaffected.
Rewrites the v2 design (PR #2 base branch) to move
cuda_graph_strategy,dynamic_shapes_kernel_specialization_strategy,runtime_cachefromCompilationSettings/ serialized engine slots to runtime context managers per pytorch#4310.Summary
RuntimeSettingsdataclass on both Python and C++ sides;RuntimeCacheHandleregistered as a torchbind class for shared-cache semantics.torch_tensorrt.runtime:runtime_config(pool API),runtime_cache(shared cache), plus per-knob sugars. All accept a list of modules.runtime_settings=kwarg oncompile()/cross_compile_for_windows()/convert_module()for compile-time hints (1 context-create cost, no enter/exit recreate).update_runtime_settings(rs)with fast-path equality check; rebuildsIRuntimeConfig+ recreates execution context on diff.SerializedInfoIndexdrops 4 RTX slots;SERIALIZATION_LENback to 12.Tests
test_004_runtime_settings.py(12 tests) covering data model, compile-time hint, CM restore, multi-target, dispatch.test_000_runtime_cache.py,test_001_dynamic_shapes_kernel_strategy.py,test_001_cuda_graph_strategy.pymigrated to the new API.Status
SKIP=mypyfor the pre-existing_TRTEngine.pyerrors tracked separately).test_004all 12 pass; Python-runtime half of the three other test files passes.libtensorrt_rtx.so.1atcuda_engine->getStreamableWeightsSize()-- I confirmed this is a pre-existing environmental issue on the test node (the same crash occurs with a known-good pre-built v2 wheel installed in the same env), not a regression from this refactor.