
[Enhancement] Expose inference speed metrics for Chat and Completion #399

Open
cch1rag wants to merge 1 commit into undreamai:main from cch1rag:enhancement/expose-inference-speed

Conversation


@cch1rag cch1rag commented Mar 23, 2026

Summary

Closes #315.

This exposes the backend timing data on the Unity side.

After a completed Chat() or Completion() call, you can now read:

  • TokensPerSecond
  • PromptTokensPerSecond

These values come from the timing data already returned by the backend. If timing data is not available, both properties stay at -1.
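As a rough illustration of how such values can be derived, here is a minimal Python sketch of the calculation and the -1 fallback. The field names (`predicted_n`, `predicted_ms`, `prompt_n`, `prompt_ms`) follow the llama.cpp server's `timings` object and are an assumption about the backend payload; the actual PR does this parsing in C# inside LLMClient.

```python
import json

def parse_speed_metrics(response_json: str) -> tuple[float, float]:
    """Return (tokens_per_second, prompt_tokens_per_second).

    Falls back to (-1.0, -1.0) when the response is not valid JSON or
    carries no timing data, mirroring the behavior described above.
    """
    try:
        timings = json.loads(response_json).get("timings")
    except (ValueError, TypeError, AttributeError):
        return -1.0, -1.0
    if not timings:
        return -1.0, -1.0

    def rate(count_key: str, ms_key: str) -> float:
        # tokens / seconds, guarding against missing or zero fields
        count, ms = timings.get(count_key), timings.get(ms_key)
        if not count or not ms:
            return -1.0
        return count / (ms / 1000.0)

    return rate("predicted_n", "predicted_ms"), rate("prompt_n", "prompt_ms")

example = ('{"content": "hi", "timings": {"predicted_n": 50, '
           '"predicted_ms": 2500, "prompt_n": 120, "prompt_ms": 300}}')
print(parse_speed_metrics(example))  # (20.0, 400.0)
print(parse_speed_metrics('{"content": "hi"}'))  # (-1.0, -1.0)
```

The key design point is that a missing or malformed `timings` object never raises; callers always get a sentinel pair they can check before displaying the metrics.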

What changed

  • added TokensPerSecond and PromptTokensPerSecond to LLMClient
  • updated the managed wrapper to request the JSON completion response when needed
  • parsed the response JSON and still returned plain text to callers
  • used the same parsing path for both Completion() and LLMAgent.Chat()
  • added tests for JSON parsing, fallback behavior, and stale timing reset

Testing

  • git diff --check
  • Unity EditMode parser tests passed:
    • LLMUnityTests.TestLLMClient_ResponseParsing

I kept the README update and the package test-discovery cleanup out of this PR to keep it focused on the issue itself. I'm happy to send those as follow-ups if they would be useful.



Development

Successfully merging this pull request may close these issues.

How to Calculate Token Score (Token / minute) to check if the CPU is strong enough?
