Skip to content

fix: expose inference server kwargs#2028

Open
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1621
Open

fix: expose inference server kwargs#2028
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-1621

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Expose a generic server_kwargs parameter on InferenceServer and forward it to Ray Serve's build_openai_app, allowing users to pass options such as ingress_deployment_config for tuning ingress replicas.

Closes #1621

Usage

from nemo_curator.core.serve import InferenceServer, RayServeModelConfig

server = InferenceServer(
    models=[RayServeModelConfig(model_identifier="my-model")],
    server_kwargs={
        "ingress_deployment_config": {
            "autoscaling_config": {"min_replicas": 4, "max_replicas": 4},
        },
    },
)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing

  • uv run ruff check nemo_curator/core/serve/server.py nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py
  • uv run python -m py_compile nemo_curator/core/serve/server.py nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py
  • uv run pytest tests/core/serve/ray_serve/test_backend.py tests/core/serve/test_server.py (blocked locally: package intentionally raises on Darwin; NeMo-Curator currently only supports Linux)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner May 25, 2026 03:14
@nightcityblade nightcityblade requested review from ayushdg and removed request for a team May 25, 2026 03:14
@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a server_kwargs pass-through on InferenceServer so callers can supply arbitrary keyword arguments (e.g. ingress_deployment_config) directly to Ray Serve's build_openai_app factory without needing internal changes for each option.

  • server.py: adds server_kwargs: dict[str, Any] = field(default_factory=dict) to InferenceServer.
  • backend.py: extracts _build_openai_app_args(), which deep-copies server_kwargs, guards against the reserved llm_configs key, and correctly merges the quiet runtime_env into any user-supplied ingress_deployment_config.
  • test_backend.py: two new tests verify the forwarding/merge behavior and the llm_configs guard.

Confidence Score: 5/5

Safe to merge; changes are additive and well-tested.

The change is small and additive — a new optional field with a default-empty dict and a clean extraction of build-args logic into a helper. The guard against the reserved llm_configs key prevents silent data loss, the deep-copy isolates caller state, and the runtime-env merge correctly handles user-supplied ingress_deployment_config. Two targeted tests cover both the happy path and the error path.

No files require special attention beyond the noted server_kwargs being Ray Serve-specific despite living on the generic InferenceServer dataclass.

Important Files Changed

Filename Overview
nemo_curator/core/serve/server.py Adds server_kwargs: dict[str, Any] field to InferenceServer; only consumed by RayServeBackend
nemo_curator/core/serve/ray_serve/backend.py Extracts _build_openai_app_args to deep-copy server_kwargs, guard against reserved llm_configs key, and properly merge quiet_runtime_env into any user-supplied ingress_deployment_config
tests/core/serve/ray_serve/test_backend.py Adds two tests: forwarding of server_kwargs into build_openai_app args (including runtime-env merge) and rejection of reserved llm_configs key

Sequence Diagram

sequenceDiagram
    participant User
    participant InferenceServer
    participant RayServeBackend
    participant build_openai_app

    User->>InferenceServer: "InferenceServer(models, server_kwargs={...})"
    User->>InferenceServer: start()
    InferenceServer->>RayServeBackend: _deploy()
    RayServeBackend->>RayServeBackend: _build_openai_app_args(llm_configs, quiet_env)
    Note over RayServeBackend: deepcopy(server_kwargs)<br/>guard llm_configs key<br/>merge quiet_runtime_env
    RayServeBackend->>build_openai_app: build_args (server_kwargs + llm_configs)
    build_openai_app-->>RayServeBackend: app
    RayServeBackend->>RayServeBackend: serve.run(app)
Loading

Reviews (3): Last reviewed commit: "Remove redundant Ray Serve deepcopy call..." | Re-trigger Greptile

Comment on lines +104 to +105
build_args = deepcopy(self._server.server_kwargs)
build_args["llm_configs"] = llm_configs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 llm_configs key silently overwritten from server_kwargs — if a caller includes "llm_configs" in server_kwargs (e.g. trying to pre-seed the list), line 105 clobbers it without any warning. Adding an explicit guard makes the contract clear and surfaces the mistake instead of silently dropping user data.

Suggested change
build_args = deepcopy(self._server.server_kwargs)
build_args["llm_configs"] = llm_configs
build_args = deepcopy(self._server.server_kwargs)
if "llm_configs" in build_args:
msg = "'llm_configs' is a reserved key and must not be set in server_kwargs; use InferenceServer.models instead."
raise ValueError(msg)
build_args["llm_configs"] = llm_configs

Comment on lines +109 to +110
ingress_config = deepcopy(build_args.get("ingress_deployment_config", {}))
actor_options = deepcopy(ingress_config.get("ray_actor_options", {}))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The deepcopy calls on lines 109–110 are redundant. build_args is already a deep copy of server_kwargs (line 104), so any nested dict extracted from it is already an independent object. The extra copies add allocation overhead without protecting anything.

Suggested change
ingress_config = deepcopy(build_args.get("ingress_deployment_config", {}))
actor_options = deepcopy(ingress_config.get("ray_actor_options", {}))
ingress_config = build_args.get("ingress_deployment_config", {})
actor_options = ingress_config.get("ray_actor_options", {})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed in 0e954ce by reusing the nested dictionaries from the already-deep-copied build_args instead of deepcopying them again. Verification: python3 -m ruff check nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py passed; python3 -m compileall -q nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py passed. Targeted pytest is blocked locally because ray is not installed in this environment.

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Addressed the Greptile feedback in the latest push:

  • _build_openai_app_args now rejects server_kwargs containing llm_configs instead of silently overwriting it.
  • Added a regression test for that validation path.

Validation run locally:

  • uvx ruff check nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py
  • .venv/bin/python -m py_compile nemo_curator/core/serve/ray_serve/backend.py tests/core/serve/ray_serve/test_backend.py
  • .venv/bin/python -m pytest tests/core/serve/ray_serve/test_backend.py could not complete on this macOS ARM host because NeMo-Curator raises its Linux-only platform guard during test collection; installing the full inference_server extra also hit NVIDIA/vLLM wheel availability limits on macOS ARM.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

InferenceServer: expose server_kwargs to allow tuning ingress replicas

2 participants