PostgreSQL Replication Lag Diagnostics Toolkit

A toolkit for diagnosing replication lag in multi-site PostgreSQL topologies (e.g. Primary → local standby → remote DR standby). It separates the two root causes that are routinely confused: a network bottleneck (single-stream TCP) versus an apply bottleneck (standby disk/CPU).

All scripts are read-only against the database and have no external dependencies (only bash, psql, python3; iostat/mpstat from the sysstat package are optional).

Repository Layout

.
├── repl.script.conf      # SCRIPT/toolkit behaviour: output, logging, cadence, appearance (committed)
├── repl.env.example     # ENVIRONMENT template: PostgreSQL/OS/topology (committed)
├── repl.env             # your local environment config (git-ignored; copy of the example)
├── bin/                 # single entry point (put this dir on PATH)
│   └── pg-repl-diag     # dispatcher: routes sub-commands to libexec/
├── libexec/
│   └── pg-repl-diag/    # sub-command implementations (not called directly)
│       ├── collect          # pg-repl-diag collect
│       ├── net-test         # pg-repl-diag net-test
│       ├── apply-check      # pg-repl-diag apply-check
│       ├── sample-primary   # pg-repl-diag sample-primary
│       ├── sample-standby   # pg-repl-diag sample-standby
│       └── dashboard        # pg-repl-diag dashboard
├── lib/                 # shared library
│   ├── repl_common.sh   # config loader + validation + per-run logging
│   └── repl_format.sh   # shared console UI (banners, sections, status tags)
└── output/              # all runtime artifacts (git-ignored, created on demand)
    ├── metrics/         # sampler CSV files
    ├── reports/         # static collector reports
    ├── bursts/          # incident burst captures
    ├── dashboards/      # generated HTML dashboards
    └── log/             # per-run console logs (stdout+stderr of each command)

Configuration is split by concern across two files, both loaded via lib/repl_common.sh, which aborts if a variable an operation depends on is missing:

repl.script.conf — script / toolkit behaviour (output dirs, logging, sampling cadence, thresholds, console appearance). Identical on every node and committed to git.
repl.env — deployment environment (PostgreSQL connection, target / peer IPs, link bandwidth). Site-specific, copied from repl.env.example, and git-ignored so site values are never committed.

Scripts never embed configuration values.

Package Contents

Everything is driven through a single command, pg-repl-diag. Put bin/ on your PATH (or call bin/pg-repl-diag directly) and run pg-repl-diag help for the full list.

Command	Run on	Type	Purpose
`pg-repl-diag collect`	every node	one-time	Static OS + DB configuration snapshot
`pg-repl-diag net-test`	Primary	on-demand	Single-stream vs parallel test (iperf3) + WAL rate
`pg-repl-diag apply-check`	Standby	on-demand	Monitor apply gap, wait events, disk, redo CPU
`pg-repl-diag sample-primary`	Primary	periodic	Sample lag & WAL rate → CSV + burst captures
`pg-repl-diag sample-standby`	Standby	periodic	Sample apply/network/disk/CPU → CSV + burst captures
`pg-repl-diag dashboard`	anywhere	offline	Convert CSV → interactive HTML dashboard

Configuration

cp repl.env.example repl.env
$EDITOR repl.env          # adjust connection + topology for THIS node
# (toolkit behaviour lives in repl.script.conf and rarely needs changes)

Both files are sourced by bash and use the VAR="${VAR:-default}" form, so any value already exported in the environment still takes precedence — per-invocation overrides keep working, e.g. INTERVAL=5 pg-repl-diag sample-standby.

Environment — site-specific (repl.env, from repl.env.example):

Variable	Default	Meaning
`PGHOST`/`PGPORT`/`PGUSER`/`PGDATABASE`	libpq defaults	PostgreSQL connection
`TARGET`	(empty, required for network test)	standby IP running `iperf3 -s`
`LINK_MBPS`	1000	this link's per-direction bandwidth (Mbps), for the BDP calc
`PEERS`	(empty, optional)	space-separated peer IPs for RTT in the collector

Script / toolkit — same on every node (repl.script.conf):

Variable	Default	Meaning
`OUTPUT_DIR`	`./output`	root for all generated artifacts
`METRICS_DIR` / `REPORTS_DIR` / `BURST_DIR` / `DASHBOARD_DIR`	under `OUTPUT_DIR`	per-type output folders
`LOG_DIR`	`$OUTPUT_DIR/log`	per-run console logs (every command mirrors stdout+stderr here; set `REPL_NO_LOG=1` to disable)
`INTERVAL`	10	seconds between sampler samples
`THRESHOLD_LAG_S`	30	lag threshold (s) that triggers a burst capture
`BURST_COOLDOWN`	120	minimum gap between bursts (s)
`COUNT`	0	0 = sampler runs indefinitely; >0 = stop after N samples
`APPLY_INTERVAL` / `APPLY_COUNT`	5 / 12	cadence of the on-demand apply probe
`IPERF_PORT` / `DURATION` / `PARALLEL` / `WAL_SAMPLE`	5201 / 20 / 8 / 15	network test parameters
`REPL_TOOLKIT_NAME` / `REPL_TOOLKIT_VERSION`	toolkit name / `1.0`	shown in the console banner of every script
`REPL_WIDTH`	74	banner / rule width in columns
`REPL_COLOR`	`auto`	`auto` (color only on a terminal) / `always` / `never`

All scripts share one console look & feel via lib/repl_format.sh (banners, sections, [INFO]/[ OK ]/[WARN]/[FAIL] status tags, and VERDICT blocks). Color is automatic on a terminal and suppressed when output is redirected or NO_COLOR is set, so log files stay clean.

Do not store the database password in repl.env — use ~/.pgpass (chmod 600). Ideally run the scripts as the postgres user so config files and process-level metrics are readable.

Operations validate the variables they need and abort when one is missing — e.g. pg-repl-diag net-test refuses to run without TARGET.

Prerequisites

Platform

Linux only. The scripts read /proc, /sys, and net.ipv4.* sysctls and rely on iproute2 / sysstat, so they do not run on macOS or Windows. Run them directly on the PostgreSQL hosts (Primary, local standby, remote DR standby) — the host is where the disk, CPU, and TCP socket being diagnosed actually live. bash 4+ is required (the scripts use arrays, mapfile, and process substitution).

OS user

Run the scripts as the postgres OS user (or root). This is needed so that:

postgresql.auto.conf inside data_directory is readable (collector §19);
the redo/startup process CPU can be read from /proc/<pid>/stat (standby sampler cpu_startup_pct column);
ss -tinp can show the PID owning the replication socket;
ethtool (NIC speed/duplex) works — it needs root.

Running as an unrelated user does not crash the scripts, but the items above are silently reported as unavailable.

Database access & privileges

Connect via libpq env vars (PGHOST / PGPORT / PGUSER / PGDATABASE in repl.env).
Put the password in ~/.pgpass (chmod 600) — never in repl.env.
The role needs read access to the monitoring views. Grant pg_monitor (GRANT pg_monitor TO youruser;) or use a superuser. Without it, client_addr, wait_event, and query text in pg_stat_activity / pg_stat_replication are hidden, which weakens the apply-side and socket diagnosis.
All queries are strictly read-only.
PostgreSQL 12+ works; 14+ recommended (pg_stat_wal). pg_stat_checkpointer (PG17+) and a few view columns are version-gated and skipped gracefully on older releases.

Ports & firewall

Port	Direction	Used by	Required?
5432 (`PGPORT`)	standby → primary	streaming replication and every script's `psql`	Yes
5201 (`IPERF_PORT`)	primary → standby	`pg-repl-diag net-test` only	Only for the network test
ICMP (ping)	primary ↔ standby	RTT / packet-loss baseline in the collector & network test	Recommended

For the network test, start iperf3 -s on the standby (listening on IPERF_PORT, default 5201) and make sure that port is reachable from the Primary.

Required packages / commands

Present on a standard Linux + PostgreSQL install; listed for completeness:

Package	Provides	Why it's needed
`bash` (4+)	shell	arrays / `mapfile` / process substitution
postgresql client	`psql`	every script queries the DB
`python3`	`python3`	JSON parsing (iperf3) and float math
coreutils + `awk`/`sed`/`grep`	`date`, `df`, `hostname`, `getconf`, …	text processing & math
`procps`	`ps`, `sysctl`, `free`	process lookup, TCP sysctls, memory
`iproute2`	`ss`, `ip`	replication-socket RTT/retransmits, addressing

Required only for pg-repl-diag net-test:

Package	Provides	Note
`iperf3`	`iperf3`	needed on both Primary (client) and standby (`iperf3 -s`)
`iputils`	`ping`	RTT & packet-loss baseline

Optional packages — what you lose without them

Everything below degrades gracefully; the scripts keep running and only the listed data goes uncollected.

Package	Command	Used for	If missing
`sysstat`	`iostat`	disk `%util` / `await` / busiest device	`disk_util` / `disk_dev` / `disk_await` recorded as `0` / `-`; the "DISK I/O" apply-side root cause cannot be confirmed
`sysstat`	`mpstat`	per-core CPU inside standby burst captures	that block is omitted from `burst_standby_*.txt`
`util-linux`	`lscpu`	CPU model / per-core clock (collector §2)	falls back to `/proc/cpuinfo`
`util-linux`	`lsblk`	block devices / I/O scheduler (collector §4)	section skipped
`ethtool`	`ethtool`	NIC speed/duplex/ring buffers (collector §6)	shows "(ethtool not available)"
`chrony` / `ntp`	`chronyc` / `ntpq`	clock-sync check — skew invalidates lag numbers (collector §7)	warns; verify the three nodes share a synced clock manually
`net-tools`	`netstat`	global retransmit counters (collector §8)	line skipped (`ss` still gives per-socket data)

Install

# RHEL/Rocky
sudo yum install -y sysstat iperf3 ethtool chrony
# Debian/Ubuntu
sudo apt-get install -y sysstat iperf3 ethtool chrony

iperf3 is only needed if you run the network test. sysstat is the most impactful optional package — without iostat you cannot distinguish a disk-bound standby from a CPU-bound one.

Workflow

Make the dispatcher executable once: chmod +x bin/pg-repl-diag. Put bin/ on your PATH to drop the bin/ prefix below, or call bin/pg-repl-diag … directly.

1) Static snapshot (once, on each node)

# On the Primary, also measure RTT to both standbys:
pg-repl-diag collect <standby-1-ip> <standby-2-ip>
# On the remote standby:
pg-repl-diag collect <primary-ip>

Output: output/reports/repl_collect_<host>_<ts>.txt (passwords auto-redacted).

2) Periodic sampling (run continuously)

On the Primary:

pg-repl-diag sample-primary start     # background daemon; console log -> output/log/
pg-repl-diag sample-primary status    # running? | stop with: ... sample-primary stop

On the remote standby:

pg-repl-diag sample-standby start     # background daemon; console log -> output/log/
pg-repl-diag sample-standby status    # running? | stop with: ... sample-standby stop

start is restart-safe and self-guarding: a lock file kept next to the sampler (libexec/pg-repl-diag/.<output>_metrics.lock, alongside a .pid) ensures only one instance runs per node — a second start, or a stray foreground run, is refused. You can still run it in the foreground for an interactive session by omitting the action (pg-repl-diag sample-standby, Ctrl-C to stop).

Raw output:

output/metrics/primary_metrics.csv, output/metrics/standby_metrics.csv (append, restart-safe)
output/bursts/burst_primary_<ts>.txt, output/bursts/burst_standby_<ts>.txt (incident snapshots)

3) On-demand deep dives

pg-repl-diag net-test <standby-ip>   # on the Primary (start 'iperf3 -s' on the standby first)
pg-repl-diag apply-check             # on the standby

4) Generate the dashboard

Collect the CSVs from each node into output/metrics/, then:

# the dispatcher loads the output paths from the config for you:
pg-repl-diag dashboard
# or point at explicit paths:
pg-repl-diag dashboard --metrics-dir ./output/metrics --burst-dir ./output/bursts \
                       --out ./output/dashboards/dashboard.html

This writes three self-contained pages (open any in a browser, no internet needed):

File	Contents
`dashboard.html`	Overview: topology, node roles, global verdict, links
`dashboard-primary.html`	Primary side: WAL generation rate + per-standby replication lag
`dashboard-standby.html`	Standby side: apply/gap/disk/CPU/RTT + wait events

Every chart is zoomable: scroll to zoom the time axis, drag to pan, double-click to reset. All charts on a page share one time window, so they stay aligned while you zoom — and the axis adds dates (with midnight markers) once the window spans more than a day, so multi-day captures read cleanly.

Multiple nodes / cascade. By default primary_metrics.csv and standby_metrics.csv are treated as the primary and the standby. To compare more nodes — or to show a cascading standby (a node that both receives and sends WAL, e.g. Primary → standby-B → standby-C, or a standby that was promoted) — copy each node's CSVs in with a node-name prefix:

# from node "edb-b" (a cascade middle node: receives from A, sends to C)
cp edb-b:/…/output/metrics/standby_metrics.csv  output/metrics/edb-b_standby_metrics.csv
cp edb-b:/…/output/metrics/primary_metrics.csv  output/metrics/edb-b_primary_metrics.csv

A node that appears under the same prefix in both roles is flagged CASCADE on both dashboards, with cross-links to its other role, so a cascade topology is unambiguous.

Running as a systemd service (optional, cleaner)

/etc/systemd/system/repl-sampler.service:

[Unit]
Description=PG Replication Sampler
After=postgresql.service

[Service]
User=postgres
WorkingDirectory=/opt/repl-diag
Environment=REPL_ENV_FILE=/opt/repl-diag/repl.env
ExecStart=/opt/repl-diag/bin/pg-repl-diag sample-standby   ; use sample-primary on the primary node
Restart=always

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload && sudo systemctl enable --now repl-sampler

Reading the Dashboard

Start on dashboard.html (Overview): it shows the topology, each node's role, the global verdict, and links to the two role pages. Use the nav bar to switch between Overview / Primary / Standby. Zoom controls (scroll / drag / double-click) apply to every chart, and the visible time window is shown top-right.

Standby page — each standby node gets its own verdict, summary cards, and charts:

Replication Lag (time_lag) — the symptom users feel.
Arrival vs Apply Rate — if apply < arrival, the standby is falling behind → apply-bound.
Apply Gap — if it keeps rising → apply is not catching up.
Resource Saturation — disk %util & redo CPU (0–100 scale). Correlate with lag spikes.
Network Quality — rtt & retransmits on the replication socket.
Wait Event Distribution — where the redo process spends its time.

Primary page — each primary node shows its WAL Generation Rate plus replay_lag / total lag per downstream standby, so you can compare standbys. For a cascade node, its upstream lag is on the Standby page and its downstream lag is on the Primary page (both carry a CASCADE badge linking to the other).

Decision Tree

High lag?
├─ Apply < Arrival  &  gap growing  → APPLY-BOUND
│   ├─ high disk %util / IO waits    → standby disk I/O (upgrade storage, separate WAL)
│   ├─ redo CPU ~100% (1 core)       → single-thread CPU (higher per-core clock, reduce WAL)
│   └─ Recovery/Conflict waits       → query conflict (tune hot_standby_feedback)
└─ Low arrival, no apply backlog → NETWORK-BOUND
    ├─ rising retransmits / loss      → loss-bound    → switch congestion control to BBR
    └─ high rtt, retransmits ~0       → latency-bound → raise TCP buffers above the BDP

To prove NETWORK-BOUND vs link capacity, run pg-repl-diag net-test (single-stream vs parallel test). For apply-side detail, run pg-repl-diag apply-check.

Notes

Start the samplers before an incident so the rolling history captures it.
Sampling during peak / batch hours reveals the peak WAL rate, which drives lag.
All output containing primary_conninfo/socket data has passwords redacted.
For stronger long-term monitoring, consider Prometheus + postgres_exporter + node_exporter. This toolkit is for fast diagnosis without heavy setup.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
bin		bin
docker		docker
lib		lib
libexec/pg-repl-diag		libexec/pg-repl-diag
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
repl.env.example		repl.env.example
repl.script.conf		repl.script.conf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PostgreSQL Replication Lag Diagnostics Toolkit

Repository Layout

Package Contents

Configuration

Prerequisites

Platform

OS user

Database access & privileges

Ports & firewall

Required packages / commands

Optional packages — what you lose without them

Install

Workflow

1) Static snapshot (once, on each node)

2) Periodic sampling (run continuously)

3) On-demand deep dives

4) Generate the dashboard

Running as a systemd service (optional, cleaner)

Reading the Dashboard

Decision Tree

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PostgreSQL Replication Lag Diagnostics Toolkit

Repository Layout

Package Contents

Configuration

Prerequisites

Platform

OS user

Database access & privileges

Ports & firewall

Required packages / commands

Optional packages — what you lose without them

Install

Workflow

1) Static snapshot (once, on each node)

2) Periodic sampling (run continuously)

3) On-demand deep dives

4) Generate the dashboard

Running as a systemd service (optional, cleaner)

Reading the Dashboard

Decision Tree

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages