verl-project/rl-insight

RL-Insight

Online observability for reinforcement learning training. RL-Insight connects training-side metrics, RL state traces, and service dashboards across distributed rollout and optimization workloads.

Why RL-Insight

RL-Insight focuses on the online observability path that RL training needs most:

  • One-command server startup: install and start Prometheus, Tempo, and Grafana with rl-insight server install and rl-insight server start.
  • Trainer and rollout metric aggregation: collect key actor, rollout, and transfer queue metrics in one monitoring view while keeping training-side instrumentation lightweight.
  • Grafana dashboards for RL workloads: provide a ready-to-use dashboard structure for training metrics, rollout behavior, engine metrics, and RL state timelines.

Architecture

[Figure: RL-Insight monitor architecture]

The monitor has two data paths. Trainer-side Python API events are aggregated by the client and monitor hub, then exposed to Prometheus or exported to Tempo. Rollout and inference engines expose their own metrics endpoints directly to Prometheus. Grafana queries Prometheus and Tempo to render the RL dashboards.
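
To make the second data path concrete, the following is a minimal sketch of a process exposing a scrape endpoint in the Prometheus text exposition format, using only the Python standard library. The metric name rollout_queue_depth, the engine label, and the helpers render_gauge and make_server are hypothetical names chosen for illustration. Real rollout and inference engines ship their own exporters, and this is not how RL-Insight or any particular engine implements its endpoint.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauge(name, value, documentation="", **labels):
    """Render one gauge sample in the Prometheus text exposition format.

    Label values are not escaped here, so keep them free of quotes and
    backslashes in this sketch.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    sample = f"{name}{{{label_text}}}" if label_text else name
    return (
        f"# HELP {name} {documentation}\n"
        f"# TYPE {name} gauge\n"
        f"{sample} {float(value)}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve a single hypothetical gauge at /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_gauge(
            "rollout_queue_depth", 7, "Pending rollout requests", engine="engine-0"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0, host="127.0.0.1"):
    """Create the endpoint; port 0 asks the OS for any free port."""
    return HTTPServer((host, port), MetricsHandler)
```

Calling make_server(port).serve_forever() would keep the endpoint up until interrupted. Prometheus would then need a scrape target pointing at that host and port.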

Demo

Watch the demo video

News

  • [2026/06/16] RL-Insight officially supports Online Monitor, including one-command server startup, trainer and rollout metric aggregation, and Grafana dashboards for RL workloads.

Get Started

Start with the guide that matches your current setup:

| Document | What it covers | When to use it |
| --- | --- | --- |
| Server Installation | Prometheus, Tempo, and Grafana service setup, including supported Linux platforms, direct installation, offline installation, and existing service binaries. | Use this first if the monitor services are not installed or you need to verify the server environment. |
| Quick Start | A full smoke-test flow: install the Python package, start the monitor stack, emit sample metric/trace data, and open Grafana. | Use this after the services are ready, or when you want to validate the monitor path end to end. |

Recommended order:

  1. Prepare the server services with Server Installation.
  2. Run the end-to-end flow with Quick Start.

Server Stack

RL-Insight manages three open-source services locally on Linux:

| Service | Purpose | Default port | Required version | Installer version |
| --- | --- | --- | --- | --- |
| Prometheus | Metric storage and queries | 9090 | >= 2.30.0 | 2.54.1 |
| Tempo | Trace storage and query API | 3200 | >= 2.0.0 | 2.6.1 |
| Grafana | Dashboards and trace exploration | 3000 | >= 13.0.0 | 13.0.0 |

rl-insight server install downloads supported Linux binaries into ~/.rl-insight/services. rl-insight server start runs Prometheus, Tempo, and Grafana with data persisted under ~/.rl-insight/data by default.
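
After rl-insight server start, a quick way to confirm that the three services are listening is to probe their default ports. The sketch below uses only the Python standard library; port_is_open and check_stack are hypothetical helper names, not part of RL-Insight, and the host defaults to 127.0.0.1 on the assumption that the stack runs on the local machine. An open TCP port only shows that something is listening, not that the service is healthy.

```python
import socket

# Default ports from the Server Stack table.
DEFAULT_PORTS = {"prometheus": 9090, "tempo": 3200, "grafana": 3000}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host="127.0.0.1", ports=None):
    """Map each service name to whether its port accepts TCP connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_up in check_stack().items():
        print(f"{name}: {'listening' if is_up else 'not reachable'}")
```

If the ports were changed in the server config, pass a custom ports mapping to check_stack instead of relying on the defaults.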

Training API

The rl_insight/ package exports the online monitor's public API, so training code only needs to import one module:

| API | Use |
| --- | --- |
| init(project=None, experiment_name=None, config=None) | Enable monitoring once per process and attach global labels. |
| metric_count(name, amount=1.0, documentation="", **labels) | Increment a Prometheus counter. |
| metric_value(name, value, documentation="", **labels) | Record the latest value for a gauge. |
| metric_distribution(name, value, documentation="", **labels) | Add one sample to a histogram. |
| trace_state(state_name, state_lane_id=None, **labels) | Record a named RL state interval. |
| trace_op(name=None, extra_labels=None, **static_labels) | Decorate a synchronous function and emit one duration span per call. |
| finish() | Reset in-process monitor state. |
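
As a rough illustration of how the three metric calls might be combined in a training loop, the sketch below emits a counter, a gauge, and a histogram sample for one rollout step. It takes the monitor as a parameter so that it runs without RL-Insight installed: RecordingMonitor is a hypothetical stand-in that only mirrors the signatures listed above, and report_rollout_step, the metric names, and the role label are likewise made up for this example.

```python
class RecordingMonitor:
    """Hypothetical stand-in mirroring three call signatures from the API table."""

    def __init__(self):
        self.calls = []

    def metric_count(self, name, amount=1.0, documentation="", **labels):
        self.calls.append(("count", name, amount, labels))

    def metric_value(self, name, value, documentation="", **labels):
        self.calls.append(("value", name, value, labels))

    def metric_distribution(self, name, value, documentation="", **labels):
        self.calls.append(("distribution", name, value, labels))


def report_rollout_step(monitor, latency_s, queue_depth, role="rollout"):
    """Emit one rollout step's metrics through a monitor-like object."""
    # Counter: how many rollout steps have completed.
    monitor.metric_count(
        "rollout_steps", amount=1.0, documentation="Completed rollout steps", role=role
    )
    # Gauge: only the most recent queue depth is kept.
    monitor.metric_value(
        "transfer_queue_depth",
        queue_depth,
        documentation="Latest transfer queue depth",
        role=role,
    )
    # Histogram: each call contributes one latency sample.
    monitor.metric_distribution(
        "rollout_latency_seconds",
        latency_s,
        documentation="Per-step rollout latency",
        role=role,
    )


monitor = RecordingMonitor()
report_rollout_step(monitor, latency_s=0.42, queue_depth=3)
```

With RL-Insight installed, the imported module could presumably be passed in place of the stand-in, after init(...) and before finish(), assuming the functions are exposed at module level as the table suggests.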

Configuration can be passed to insight.init(config=...) or through environment variables:

import rl_insight as insight  # import alias inferred from the insight.* calls

insight.init(
    project="verl",
    experiment_name="ppo-smoke-test",
    config={
        "server": {
            "namespace": "rl_insight_monitor",
            "backend": "ray",
            "service_ip": "10.0.0.8",
        },
        "prometheus": {
            "metrics_report_port": 9092,
            "prometheus_port": 9090,
        },
        "otel": {
            "otel_port": 4318,
        },
    },
)

Useful environment variables:

| Variable | Purpose |
| --- | --- |
| RL_INSIGHT_SERVICE_IP | RL-Insight server IP used by training workers to export traces. |
| RL_INSIGHT_OTEL_PORT | OTLP HTTP port, default 4318. |
| RL_INSIGHT_PROMETHEUS_PORT | Prometheus HTTP port, default 9090. |
| RL_INSIGHT_PROMETHEUS_CONFIG_FILE | Prometheus config path when the monitor hub updates scrape targets. |
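
The sketch below shows one way a worker-side script might turn these variables into endpoint URLs, which can help when debugging connectivity. It is not RL-Insight's own resolution logic: resolve_endpoints is a hypothetical helper, the 127.0.0.1 fallback is an assumption (the table gives no default IP), and it assumes Prometheus runs on the same host as the service IP. The /v1/traces path is the standard OTLP over HTTP traces path.

```python
import os


def resolve_endpoints(env=None, default_ip="127.0.0.1"):
    """Build endpoint URLs from the RL_INSIGHT_* variables, with documented defaults."""
    env = os.environ if env is None else env
    # The fallback IP is an assumption made for this sketch.
    service_ip = env.get("RL_INSIGHT_SERVICE_IP", default_ip)
    # Port defaults come from the environment variable table.
    otel_port = int(env.get("RL_INSIGHT_OTEL_PORT", "4318"))
    prometheus_port = int(env.get("RL_INSIGHT_PROMETHEUS_PORT", "9090"))
    return {
        "otlp_traces": f"http://{service_ip}:{otel_port}/v1/traces",
        "prometheus": f"http://{service_ip}:{prometheus_port}",
    }


if __name__ == "__main__":
    for name, url in resolve_endpoints().items():
        print(f"{name}: {url}")
```

Passing an explicit mapping as env keeps the helper easy to test without touching the real process environment.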

Recipe Offline Analysis

Offline timeline, heatmap, and parser utilities are kept under recipe/; see Recipe README for that workflow.

Roadmap

  • Q1 Roadmap #6
  • Q2 Roadmap #49

Documentation

  • Quick Start: install RL-Insight, start the services, instrument code, and open Grafana.
  • Server Installation: Linux service requirements, supported OS/CPU combinations, and version policy.
  • Default server config: bundled ports, retention settings, and service config paths.
  • Recipe README: offline timeline, heatmap, and parser utilities.

Contribution Guide

See CONTRIBUTING.md.
