It would be very helpful to expose more detailed performance metrics in gpullama, similar to what Ollama provides.
Right now, it is difficult to properly evaluate performance (especially across CPU/GPU backends and TornadoVM execution) without fine-grained timing information. Having a consistent set of metrics would significantly improve benchmarking, profiling, and optimization.
Proposed metrics
Core metrics (aligned with Ollama-style reporting):
total_duration – total time to generate the full response
load_duration – time spent loading the model
prompt_eval_count – number of input tokens processed
prompt_eval_duration (prefill) – time spent processing the prompt
eval_count – number of generated output tokens
eval_duration (decode) – time spent generating tokens
TornadoVM-specific metrics:
tornado_task_graph_compile_duration – time to compile the Tornado task graph
tornado_task_graph_warmup_duration – time spent in warmup/execution until steady state
All timings should ideally be reported in nanoseconds for consistency and precision.
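As a rough illustration of what the proposed metrics could look like on the Java side, here is a minimal sketch of a metrics holder with all durations in nanoseconds. The type and field names are hypothetical (nothing like this exists in gpullama yet); they simply mirror the Ollama-style counters plus the TornadoVM-specific ones listed above.

```java
// Hypothetical sketch only: a value object collecting the proposed metrics.
// All *_DurationNs fields are in nanoseconds, as suggested above.
record InferenceMetrics(
        long totalDurationNs,                   // total_duration
        long loadDurationNs,                    // load_duration
        int promptEvalCount,                    // prompt_eval_count (input tokens)
        long promptEvalDurationNs,              // prompt_eval_duration (prefill)
        int evalCount,                          // eval_count (output tokens)
        long evalDurationNs,                    // eval_duration (decode)
        long tornadoTaskGraphCompileDurationNs, // tornado_task_graph_compile_duration
        long tornadoTaskGraphWarmupDurationNs   // tornado_task_graph_warmup_duration
) { }
```

A record keeps this immutable and trivial to serialize into whatever reporting format the CLI or API ends up using.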
With the above we can calculate:
time_to_first_token (TTFT) – can be derived from existing durations (e.g. load + prefill + first decode step), so it may not need separate instrumentation if timestamps are available
prefill_throughput as tok/s = prompt_eval_count / prompt_eval_duration
decode_throughput as tok/s = eval_count / eval_duration
total_throughput as tok/s = (prompt_eval_count + eval_count) / total_duration <--- as we do now
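The derived quantities above could be computed as below. This is a sketch under the assumption that the raw counts and nanosecond durations proposed earlier are available; the class and method names are made up for illustration.

```java
// Hypothetical helpers for the derived metrics; all durations in nanoseconds.
final class DerivedMetrics {
    static final double NS_PER_SEC = 1_000_000_000.0;

    // prefill_throughput = prompt_eval_count / prompt_eval_duration (tok/s)
    static double prefillThroughput(int promptEvalCount, long promptEvalDurationNs) {
        return promptEvalCount / (promptEvalDurationNs / NS_PER_SEC);
    }

    // decode_throughput = eval_count / eval_duration (tok/s)
    static double decodeThroughput(int evalCount, long evalDurationNs) {
        return evalCount / (evalDurationNs / NS_PER_SEC);
    }

    // total_throughput = (prompt_eval_count + eval_count) / total_duration (tok/s)
    static double totalThroughput(int promptEvalCount, int evalCount, long totalDurationNs) {
        return (promptEvalCount + evalCount) / (totalDurationNs / NS_PER_SEC);
    }

    // TTFT derived from existing durations: load + prefill + first decode step
    static long timeToFirstTokenNs(long loadNs, long prefillNs, long firstDecodeStepNs) {
        return loadNs + prefillNs + firstDecodeStepNs;
    }
}
```

Keeping the division in `double` and converting nanoseconds to seconds once avoids the integer-truncation pitfalls of dividing two `long` durations directly.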
Why this is useful
These metrics would make it easier to:
break down execution into loading, prefill, decode, and runtime overheads
understand TornadoVM-specific costs (compilation and warmup)
compare CPU vs GPU vs TornadoVM performance more accurately
identify bottlenecks and guide optimizations