A toolkit for diagnosing replication lag in multi-site PostgreSQL topologies (e.g. Primary → local standby → remote DR standby). It separates the two root causes that are routinely confused: a network bottleneck (single-stream TCP) versus an apply bottleneck (standby disk/CPU).
All scripts are read-only against the database and have no external
dependencies (only bash, psql, python3; iostat/mpstat from the
sysstat package are optional).
.
├── repl.script.conf # SCRIPT/toolkit behaviour: output, logging, cadence, appearance (committed)
├── repl.env.example # ENVIRONMENT template: PostgreSQL/OS/topology (committed)
├── repl.env # your local environment config (git-ignored; copy of the example)
├── bin/ # single entry point (put this dir on PATH)
│ └── pg-repl-diag # dispatcher: routes sub-commands to libexec/
├── libexec/
│ └── pg-repl-diag/ # sub-command implementations (not called directly)
│ ├── collect # pg-repl-diag collect
│ ├── net-test # pg-repl-diag net-test
│ ├── apply-check # pg-repl-diag apply-check
│ ├── sample-primary # pg-repl-diag sample-primary
│ ├── sample-standby # pg-repl-diag sample-standby
│ └── dashboard # pg-repl-diag dashboard
├── lib/ # shared library
│ ├── repl_common.sh # config loader + validation + per-run logging
│ └── repl_format.sh # shared console UI (banners, sections, status tags)
└── output/ # all runtime artifacts (git-ignored, created on demand)
├── metrics/ # sampler CSV files
├── reports/ # static collector reports
├── bursts/ # incident burst captures
├── dashboards/ # generated HTML dashboards
└── log/ # per-run console logs (stdout+stderr of each command)
Configuration is split by concern across two files, both loaded via
lib/repl_common.sh, which aborts if a variable an operation depends on is
missing:
repl.script.conf— script / toolkit behaviour (output dirs, logging, sampling cadence, thresholds, console appearance). Identical on every node and committed to git.repl.env— deployment environment (PostgreSQL connection, target / peer IPs, link bandwidth). Site-specific, copied fromrepl.env.example, and git-ignored so site values are never committed.
Scripts never embed configuration values.
Everything is driven through a single command, pg-repl-diag. Put bin/ on
your PATH (or call bin/pg-repl-diag directly) and run pg-repl-diag help for
the full list.
| Command | Run on | Type | Purpose |
|---|---|---|---|
pg-repl-diag collect |
every node | one-time | Static OS + DB configuration snapshot |
pg-repl-diag net-test |
Primary | on-demand | Single-stream vs parallel test (iperf3) + WAL rate |
pg-repl-diag apply-check |
Standby | on-demand | Monitor apply gap, wait events, disk, redo CPU |
pg-repl-diag sample-primary |
Primary | periodic | Sample lag & WAL rate → CSV + burst captures |
pg-repl-diag sample-standby |
Standby | periodic | Sample apply/network/disk/CPU → CSV + burst captures |
pg-repl-diag dashboard |
anywhere | offline | Convert CSV → interactive HTML dashboard |
cp repl.env.example repl.env
$EDITOR repl.env # adjust connection + topology for THIS node
# (toolkit behaviour lives in repl.script.conf and rarely needs changes)Both files are sourced by bash and use the VAR="${VAR:-default}" form, so any
value already exported in the environment still takes precedence — per-invocation
overrides keep working, e.g. INTERVAL=5 pg-repl-diag sample-standby.
Environment — site-specific (repl.env, from repl.env.example):
| Variable | Default | Meaning |
|---|---|---|
PGHOST/PGPORT/PGUSER/PGDATABASE |
libpq defaults | PostgreSQL connection |
TARGET |
(empty, required for network test) | standby IP running iperf3 -s |
LINK_MBPS |
1000 | this link's per-direction bandwidth (Mbps), for the BDP calc |
PEERS |
(empty, optional) | space-separated peer IPs for RTT in the collector |
Script / toolkit — same on every node (repl.script.conf):
| Variable | Default | Meaning |
|---|---|---|
OUTPUT_DIR |
./output |
root for all generated artifacts |
METRICS_DIR / REPORTS_DIR / BURST_DIR / DASHBOARD_DIR |
under OUTPUT_DIR |
per-type output folders |
LOG_DIR |
$OUTPUT_DIR/log |
per-run console logs (every command mirrors stdout+stderr here; set REPL_NO_LOG=1 to disable) |
INTERVAL |
10 | seconds between sampler samples |
THRESHOLD_LAG_S |
30 | lag threshold (s) that triggers a burst capture |
BURST_COOLDOWN |
120 | minimum gap between bursts (s) |
COUNT |
0 | 0 = sampler runs indefinitely; >0 = stop after N samples |
APPLY_INTERVAL / APPLY_COUNT |
5 / 12 | cadence of the on-demand apply probe |
IPERF_PORT / DURATION / PARALLEL / WAL_SAMPLE |
5201 / 20 / 8 / 15 | network test parameters |
REPL_TOOLKIT_NAME / REPL_TOOLKIT_VERSION |
toolkit name / 1.0 |
shown in the console banner of every script |
REPL_WIDTH |
74 | banner / rule width in columns |
REPL_COLOR |
auto |
auto (color only on a terminal) / always / never |
All scripts share one console look & feel via lib/repl_format.sh (banners,
sections, [INFO]/[ OK ]/[WARN]/[FAIL] status tags, and VERDICT blocks).
Color is automatic on a terminal and suppressed when output is redirected or
NO_COLOR is set, so log files stay clean.
Do not store the database password in
repl.env— use~/.pgpass(chmod 600). Ideally run the scripts as thepostgresuser so config files and process-level metrics are readable.
Operations validate the variables they need and abort when one is missing —
e.g. pg-repl-diag net-test refuses to run without TARGET.
Linux only. The scripts read /proc, /sys, and net.ipv4.* sysctls and rely
on iproute2 / sysstat, so they do not run on macOS or Windows. Run them
directly on the PostgreSQL hosts (Primary, local standby, remote DR standby) — the
host is where the disk, CPU, and TCP socket being diagnosed actually live.
bash 4+ is required (the scripts use arrays, mapfile, and process substitution).
Run the scripts as the postgres OS user (or root). This is needed so that:
postgresql.auto.confinsidedata_directoryis readable (collector §19);- the redo/
startupprocess CPU can be read from/proc/<pid>/stat(standby samplercpu_startup_pctcolumn); ss -tinpcan show the PID owning the replication socket;ethtool(NIC speed/duplex) works — it needs root.
Running as an unrelated user does not crash the scripts, but the items above are silently reported as unavailable.
- Connect via libpq env vars (
PGHOST/PGPORT/PGUSER/PGDATABASEinrepl.env). - Put the password in
~/.pgpass(chmod 600) — never inrepl.env. - The role needs read access to the monitoring views. Grant
pg_monitor(GRANT pg_monitor TO youruser;) or use a superuser. Without it,client_addr,wait_event, and query text inpg_stat_activity/pg_stat_replicationare hidden, which weakens the apply-side and socket diagnosis. - All queries are strictly read-only.
- PostgreSQL 12+ works; 14+ recommended (
pg_stat_wal).pg_stat_checkpointer(PG17+) and a few view columns are version-gated and skipped gracefully on older releases.
| Port | Direction | Used by | Required? |
|---|---|---|---|
5432 (PGPORT) |
standby → primary | streaming replication and every script's psql |
Yes |
5201 (IPERF_PORT) |
primary → standby | pg-repl-diag net-test only |
Only for the network test |
| ICMP (ping) | primary ↔ standby | RTT / packet-loss baseline in the collector & network test | Recommended |
For the network test, start iperf3 -s on the standby (listening on IPERF_PORT,
default 5201) and make sure that port is reachable from the Primary.
Present on a standard Linux + PostgreSQL install; listed for completeness:
| Package | Provides | Why it's needed |
|---|---|---|
bash (4+) |
shell | arrays / mapfile / process substitution |
| postgresql client | psql |
every script queries the DB |
python3 |
python3 |
JSON parsing (iperf3) and float math |
coreutils + awk/sed/grep |
date, df, hostname, getconf, … |
text processing & math |
procps |
ps, sysctl, free |
process lookup, TCP sysctls, memory |
iproute2 |
ss, ip |
replication-socket RTT/retransmits, addressing |
Required only for pg-repl-diag net-test:
| Package | Provides | Note |
|---|---|---|
iperf3 |
iperf3 |
needed on both Primary (client) and standby (iperf3 -s) |
iputils |
ping |
RTT & packet-loss baseline |
Everything below degrades gracefully; the scripts keep running and only the listed data goes uncollected.
| Package | Command | Used for | If missing |
|---|---|---|---|
sysstat |
iostat |
disk %util / await / busiest device |
disk_util / disk_dev / disk_await recorded as 0 / -; the "DISK I/O" apply-side root cause cannot be confirmed |
sysstat |
mpstat |
per-core CPU inside standby burst captures | that block is omitted from burst_standby_*.txt |
util-linux |
lscpu |
CPU model / per-core clock (collector §2) | falls back to /proc/cpuinfo |
util-linux |
lsblk |
block devices / I/O scheduler (collector §4) | section skipped |
ethtool |
ethtool |
NIC speed/duplex/ring buffers (collector §6) | shows "(ethtool not available)" |
chrony / ntp |
chronyc / ntpq |
clock-sync check — skew invalidates lag numbers (collector §7) | warns; verify the three nodes share a synced clock manually |
net-tools |
netstat |
global retransmit counters (collector §8) | line skipped (ss still gives per-socket data) |
# RHEL/Rocky
sudo yum install -y sysstat iperf3 ethtool chrony
# Debian/Ubuntu
sudo apt-get install -y sysstat iperf3 ethtool chrony
iperf3is only needed if you run the network test.sysstatis the most impactful optional package — withoutiostatyou cannot distinguish a disk-bound standby from a CPU-bound one.
Make the dispatcher executable once:
chmod +x bin/pg-repl-diag. Putbin/on yourPATHto drop thebin/prefix below, or callbin/pg-repl-diag …directly.
# On the Primary, also measure RTT to both standbys:
pg-repl-diag collect <standby-1-ip> <standby-2-ip>
# On the remote standby:
pg-repl-diag collect <primary-ip>Output: output/reports/repl_collect_<host>_<ts>.txt (passwords auto-redacted).
On the Primary:
pg-repl-diag sample-primary start # background daemon; console log -> output/log/
pg-repl-diag sample-primary status # running? | stop with: ... sample-primary stopOn the remote standby:
pg-repl-diag sample-standby start # background daemon; console log -> output/log/
pg-repl-diag sample-standby status # running? | stop with: ... sample-standby stopstart is restart-safe and self-guarding: a lock file kept next to the sampler
(libexec/pg-repl-diag/.<output>_metrics.lock, alongside a .pid) ensures only
one instance runs per node — a second start, or a stray foreground run, is
refused. You can still run it in the foreground for an interactive session by
omitting the action (pg-repl-diag sample-standby, Ctrl-C to stop).
Raw output:
output/metrics/primary_metrics.csv,output/metrics/standby_metrics.csv(append, restart-safe)output/bursts/burst_primary_<ts>.txt,output/bursts/burst_standby_<ts>.txt(incident snapshots)
pg-repl-diag net-test <standby-ip> # on the Primary (start 'iperf3 -s' on the standby first)
pg-repl-diag apply-check # on the standbyCollect the CSVs from each node into output/metrics/, then:
# the dispatcher loads the output paths from the config for you:
pg-repl-diag dashboard
# or point at explicit paths:
pg-repl-diag dashboard --metrics-dir ./output/metrics --burst-dir ./output/bursts \
--out ./output/dashboards/dashboard.htmlThis writes three self-contained pages (open any in a browser, no internet needed):
| File | Contents |
|---|---|
dashboard.html |
Overview: topology, node roles, global verdict, links |
dashboard-primary.html |
Primary side: WAL generation rate + per-standby replication lag |
dashboard-standby.html |
Standby side: apply/gap/disk/CPU/RTT + wait events |
Every chart is zoomable: scroll to zoom the time axis, drag to pan, double-click to reset. All charts on a page share one time window, so they stay aligned while you zoom — and the axis adds dates (with midnight markers) once the window spans more than a day, so multi-day captures read cleanly.
Multiple nodes / cascade. By default primary_metrics.csv and
standby_metrics.csv are treated as the primary and the standby. To compare more
nodes — or to show a cascading standby (a node that both receives and sends
WAL, e.g. Primary → standby-B → standby-C, or a standby that was promoted) —
copy each node's CSVs in with a node-name prefix:
# from node "edb-b" (a cascade middle node: receives from A, sends to C)
cp edb-b:/…/output/metrics/standby_metrics.csv output/metrics/edb-b_standby_metrics.csv
cp edb-b:/…/output/metrics/primary_metrics.csv output/metrics/edb-b_primary_metrics.csvA node that appears under the same prefix in both roles is flagged CASCADE on both dashboards, with cross-links to its other role, so a cascade topology is unambiguous.
/etc/systemd/system/repl-sampler.service:
[Unit]
Description=PG Replication Sampler
After=postgresql.service
[Service]
User=postgres
WorkingDirectory=/opt/repl-diag
Environment=REPL_ENV_FILE=/opt/repl-diag/repl.env
ExecStart=/opt/repl-diag/bin/pg-repl-diag sample-standby ; use sample-primary on the primary node
Restart=always
[Install]
WantedBy=multi-user.targetsudo systemctl daemon-reload && sudo systemctl enable --now repl-samplerStart on dashboard.html (Overview): it shows the topology, each node's role,
the global verdict, and links to the two role pages. Use the nav bar to switch
between Overview / Primary / Standby. Zoom controls (scroll / drag / double-click)
apply to every chart, and the visible time window is shown top-right.
Standby page — each standby node gets its own verdict, summary cards, and charts:
- Replication Lag (time_lag) — the symptom users feel.
- Arrival vs Apply Rate — if apply < arrival, the standby is falling behind → apply-bound.
- Apply Gap — if it keeps rising → apply is not catching up.
- Resource Saturation — disk %util & redo CPU (0–100 scale). Correlate with lag spikes.
- Network Quality — rtt & retransmits on the replication socket.
- Wait Event Distribution — where the redo process spends its time.
Primary page — each primary node shows its WAL Generation Rate plus replay_lag / total lag per downstream standby, so you can compare standbys. For a cascade node, its upstream lag is on the Standby page and its downstream lag is on the Primary page (both carry a CASCADE badge linking to the other).
High lag?
├─ Apply < Arrival & gap growing → APPLY-BOUND
│ ├─ high disk %util / IO waits → standby disk I/O (upgrade storage, separate WAL)
│ ├─ redo CPU ~100% (1 core) → single-thread CPU (higher per-core clock, reduce WAL)
│ └─ Recovery/Conflict waits → query conflict (tune hot_standby_feedback)
└─ Low arrival, no apply backlog → NETWORK-BOUND
├─ rising retransmits / loss → loss-bound → switch congestion control to BBR
└─ high rtt, retransmits ~0 → latency-bound → raise TCP buffers above the BDP
To prove NETWORK-BOUND vs link capacity, run pg-repl-diag net-test
(single-stream vs parallel test). For apply-side detail, run pg-repl-diag apply-check.
- Start the samplers before an incident so the rolling history captures it.
- Sampling during peak / batch hours reveals the peak WAL rate, which drives lag.
- All output containing
primary_conninfo/socket data has passwords redacted. - For stronger long-term monitoring, consider Prometheus +
postgres_exporter+node_exporter. This toolkit is for fast diagnosis without heavy setup.