Skip to content

Add Unix socket listener with proper HBA support#175

Open
vadv wants to merge 24 commits intomasterfrom
benchmark-unix-socket
Open

Add Unix socket listener with proper HBA support#175
vadv wants to merge 24 commits intomasterfrom
benchmark-unix-socket

Conversation

@vadv
Copy link
Copy Markdown
Collaborator

@vadv vadv commented Apr 5, 2026

Summary

  • unix_socket_dir config creates .s.PGSQL.<port> socket file. Connect with psql -h <dir> or pgbench -h <dir>.
  • HBA local rules now match Unix socket connections (previously parsed but ignored). host/hostssl/hostnossl match only TCP.
  • is_unix flag threaded through check_hbaClient::startup for correct rule dispatch.

Known limitations

  • No unix_socket_permissions. Restrict access through directory permissions.
  • Unix listener not handed off during SIGUSR2 binary upgrade (~100ms reconnect window).
  • only_ssl_connections does not reject Unix socket connections.

Comment thread src/client/entrypoint.rs Fixed
@vadv vadv force-pushed the benchmark-unix-socket branch 4 times, most recently from 0a555a4 to 65d09ad Compare April 5, 2026 19:25
dmitrivasilyev added 9 commits April 6, 2026 12:43
Client connections via Unix domain socket (unix_socket_dir config option).
Bypasses TCP stack entirely — useful for benchmarking and local deployments.

Changes:
- unix_socket_dir config creates .s.PGSQL.<port> socket (pgbench-compatible)
- HBA local rules now correctly match Unix socket connections
  (previously local rules were always skipped)
- HBA host/hostssl/hostnossl rules only match TCP connections
- max_connections enforced for Unix socket clients
- Socket file cleaned up on shutdown
- is_unix flag threaded through check_hba -> Client::startup
- BDD test: psql via Unix socket reaches PostgreSQL backend
  (pg_backend_pid proves real backend, not synthetic response)
- CHANGELOG.md with known limitations
- Fix duplicate comment in annotated config for unix_socket_dir
- Add ${PG_TEMP_DIR} placeholder for BDD tests
Needed: compare all three poolers (pg_doorman, pgbouncer, odyssey) via
Unix domain socket alongside existing TCP benchmarks.

What changed: bench.feature runs Unix socket tests first — simple,
extended, prepared protocols with 1/40/120/500/10000 clients, with and
without reconnect. Odyssey gets a dedicated unix-socket-only instance
(it cannot bind TCP and Unix socket in one process). Benchmark tables
now use ▲/▼ arrows for faster visual scanning of results.
Needed: AWS Fargate bench run failed because odyssey refused to start
the unix-socket-only instance with "unix_socket_mode is not set".

Added unix_socket_mode "0644" to the dedicated odyssey instance config.
Needed: during the long benchmark run the polling loop only printed
"Task status is RUNNING" with no insight into what the bench was doing.
Hard to tell if it was making progress or stuck.

Now every 5th poll (~2.5 minutes) the loop fetches new log events from
CloudWatch and prints them inline, so the workflow output shows the
bench output as it happens.
Needed: the .s.PGSQL.<port> socket file inherited the process umask after
bind, so the access surface depended on whoever launched pg_doorman. There
was no way to declare the intended permissions in the config; operators had
to umask-wrap systemd units or rely on parent directory permissions.

What changed: a new general.unix_socket_mode setting (octal string,
default "0600") fixes the permission bits on the socket file immediately
after bind. The mode is validated at config load — invalid octal, empty
strings, and bits above 0o777 (setuid/setgid/sticky) are rejected upfront
rather than failing later inside the listener. The applied mode is logged
at startup. To grant a Unix group, set "0660"; to allow any local user,
set "0666".
Needed: cargo clippy --tests --deny "warnings" failed with 34 errors
across BDD test helpers and unit-test modules, blocking CI lint gate.

What changed: applied clippy's recommended fixes - converted
manual range checks to RangeInclusive::contains, replaced loop+match
patterns with while let, swapped &[b'X'] for b"X" byte strings,
removed useless format!/return/lifetimes, derived Default where
trivial, used struct update syntax instead of post-construction
field assignment, dropped unnecessary i32 casts, and switched
from DoubleEndedIterator::last() to next_back(). No test logic
or assertions were modified.
Needed: previous run hit the 120-minute polling timeout at 124 minutes
of ECS RUNNING state — the bench had finished test 226/240 and only
needed ~7 more minutes. Adding the unix socket suite (90 new pgbench
runs at -T 30s each) pushed the wall-clock past the old 120-min
ceiling. The pre-unix-socket bench took ~1.2 hours, so a 3-hour ceiling
gives comfortable headroom for future additions.

Also: aws logs get-log-events was called without --start-from-head and
without pagination, returning only the tail of CloudWatch events. For a
multi-MB log stream this could chop off the ===BEGIN_BENCHMARK_RESULTS===
marker even on a successful run, so the markdown file was at risk of
silent truncation. Now we paginate from the start until nextForwardToken
stops advancing.
Needed: a 2-hour bench produces hundreds of MB of CloudWatch events
(the unix-socket odyssey instance alone logs every connection accept
and disconnect). Paginating the whole stream from the start with
get-log-events takes 30+ minutes and times out the workflow before
the artifact is uploaded.

What changed: switched to filter-log-events which accepts a CloudWatch
filter pattern and a time window. Two new optional inputs let the
caller skip the noisy setup phase or grab a specific minute range.
With a focused pattern (e.g. "?Scenario ?pgbench ?error") the workflow
finishes in under a minute even on the largest streams.
@vadv vadv force-pushed the benchmark-unix-socket branch from fc2df36 to 7c96ac6 Compare April 6, 2026 09:46
dmitrivasilyev added 15 commits April 6, 2026 13:53
Needed: the previous fix paginated the entire CloudWatch stream from
the start to avoid truncating the markdown result. For a 2-hour bench
the stream is hundreds of MB (mostly odyssey accept/disconnect spam),
so the download step crawled for 30+ minutes and could exceed the
GitHub Actions runner timeout.

What changed: a single filter-log-events call locates the timestamp
of the ===BEGIN_BENCHMARK_RESULTS=== marker server-side, then we only
fetch events from that timestamp onward and stop as soon as the END
marker lands in the file. The download finishes in seconds regardless
of stream size while still capturing the full markdown payload. If the
marker is missing (bench did not finish) we fall back to the last 200
events for context.
Needed: 3.4.0 is not released yet, so a 3.4.1 entry on top of it was
incorrect. The items attributed to 3.4.1 (log readability, auth
failure IP logging, replenish noise, runtime log level control, etc.)
all belong to the same unreleased 3.4.0 train.
Needed: 3.4.0 is still unreleased, so a separate Unreleased section
above it was redundant. The Unix socket listener, HBA local rule
matching, and unix_socket_mode entries all belong to the same 3.4.0
release train.
Needed: the new is_unix parameter turned check_hba into a security-critical
matcher (local rules gate Unix sockets, host* rules gate TCP). The only
existing tests exercised TCP paths, so any regression in the unix_socket
listener's authentication path would have passed CI.

What changed: thirteen unit tests spread across src/auth/hba.rs and
src/config/tests.rs pin down the new semantics — local vs host rule
transport matching, reject-before-trust ordering via Unix, scram-only
rules returning NotMatched for md5 clients, and the legacy general.hba
fallback allowing all Unix clients while still enforcing CIDR limits
on TCP clients. The legacy branch became reachable from tests after
extracting check_hba_with_general, a pure helper that takes an explicit
General snapshot instead of pulling from the global config.
Needed: the listener bind path created the socket file at umask-derived
rights (usually 0644), then ratcheted them down to unix_socket_mode via
set_permissions. The window between bind and chmod allowed a concurrent
local client to reach the listener with weaker permissions than the
operator configured, silently bypassing the intended restriction and
contradicting the changelog promise that the socket "does not depend on
the process umask".

What changed: before bind() the process temporarily tightens its umask
to cover every bit the configured mode forbids, then restores the old
umask once the file is in place. set_permissions still runs afterwards
so operators can loosen the mode (e.g. 0660 with a group bit) — the
guard only enforces a floor. Two serial unit tests pin the guard's
restore-on-drop and bit-preservation behaviour, and two new BDD
scenarios stat the socket file to verify both the 0600 default and the
0660 override are honoured end-to-end.
Needed: the listener bring-up called remove_file unconditionally, so
pointing unix_socket_dir at a shared directory such as /var/run/postgresql
would silently delete another service's live socket and cut off every
local client before pg_doorman even bound its own.

What changed: a dedicated prepare_unix_socket_path helper stats the
target, probes it with a local connect, and only unlinks the file when
the probe is refused (stale leftover from a crash). If a process is
still listening, the helper returns a descriptive error and run_server
exits with OSERR instead of clobbering the foreign inode. Three unit
tests cover the missing-file, stale-file, and live-listener branches.
Needed: when an operator migrates to the Unix listener but keeps the
legacy general.hba CIDR whitelist, the IP-based rule no longer covers
local clients — check_hba_with_general returns Allow for every Unix
connection. The silent privilege expansion is only visible by reading
the HBA matcher source, so an operator can easily miss it.

What changed: Config::validate now calls
legacy_hba_bypassed_by_unix_socket and emits a warn! that tells the
operator to move to pg_hba with explicit `local` rules if they want
the whitelist to keep working. The check lives in a dedicated helper
so unit tests cover the four branches (bypass detected, no unix dir,
empty legacy list, pg_hba overrides) without touching the log stream.
Needed: the Unix accept branch dropped the socket the moment
CURRENT_CLIENT_COUNT exceeded max_connections, so psql and similar
clients saw "server closed the connection unexpectedly" and operators
went hunting for a segfault or an OOM. The TCP branch had the right
behaviour for years — it consumes the startup packet and returns a
proper 53300 ErrorResponse — and Unix needs to match it.

What changed: a new client_entrypoint_too_many_clients_already_unix
reads the startup message (refusing a stray SSL request with a clear
08P01) and writes the same "sorry, too many clients already" error
response that TCP emits. The Unix accept branch calls it instead of
silently returning, so max_connections rejection is now observable
end-to-end for both transports.

Test coverage for the rejection BDD scenario is intentionally deferred:
the scenario needs two concurrent Unix clients to cross the limit
deterministically, and the current BDD harness has no helper for
parallel psql processes. Follow-up will land once that helper exists.
Needed: on SIGUSR2 binary upgrade the successor rebinds the same
.s.PGSQL.<port> file before the parent reaches its shutdown path. The
parent then unlinked the path unconditionally, wiping the successor's
inode and leaving the new listener nameless — every subsequent Unix
client would see ENOENT. The changelog admitted a ~100ms refusal gap,
but the real outcome was losing the Unix listener until a full restart.

What changed: right after bind+chmod the listener startup captures the
(dev, ino) of the socket file via a new UnixSocketOwnership helper. At
shutdown we stat the path again and only call remove_file when the
inode still matches; if it has changed we log that another process owns
the path and leave the file alone. The decision lives in a pure
UnixSocketOwnership::inspect helper covered by six unit tests — match,
replacement, missing, skip-on-stat-error, plus a round-trip that
actually re-binds the same path to exercise the full "successor
replaces inode" scenario.
Needed: run_server inlined ~60 lines of bind+chmod+ownership+umask
plumbing, which made the failure modes (missing directory, stale
socket, chmod denied) unreachable from unit tests and noisy to read.

What changed: create_unix_listener now owns the whole bring-up —
prepare path, raise the umask, bind, drop the guard, chmod, capture
ownership — and returns a Result so the caller keeps the exit-on-
error policy without leaking OS exit codes into the helper. Three
tokio tests exercise the happy path, the missing-directory error,
and the group-readable mode override in a tempdir, so regressions in
any of those branches trip CI instead of a production bring-up.
Needed: the HBA matcher and Client::startup took a (SocketAddr, ssl,
is_unix) triplet with two unrelated booleans sitting next to each
other — a classic swap-me footgun. On the Unix path callers had to
manufacture a 127.0.0.1:0 "fake_addr" to satisfy the TCP-shaped
signature, which then leaked into SHOW CLIENTS rows and HBA error
messages and made operators think unix clients were localhost TCP.

What changed: a new top-level ClientTransport enum (Tcp { peer, ssl }
or Unix) is threaded through PgHba::check_hba, check_hba_with_general,
the public config::check_hba wrapper, Client::startup, and both
client_entrypoint variants. is_unix(), is_tls(), hba_ip() and
peer_display() keep the call sites readable. Unix clients no longer
construct fake_addr at the entrypoint layer — the sentinel lives
exclusively inside Client::startup for struct-level compatibility
while a follow-up teaches Client to store a typed peer. Existing HBA
unit tests and config legacy-hba tests were ported to the new API
(nothing lost), the hba_eval_tests suite now builds its
ClientIdentifier via ClientTransport::Tcp, and test modules that
touch the filesystem were marked `#[serial]` to coexist with the
umask guard tests. 423 lib tests stay green.
Needed: the TCP and Unix entrypoints each inlined the same "call
Client::startup, log connected, run handle, flush disconnect_stats"
sequence. Same four locals, same branch shape, only the ClientTransport
variant and the log label differed. The plain-continue-after-rejected-
TLS branch in the TCP path was a third copy. Any change to the lifecycle
meant hunting down three mirrored places.

What changed: a new drive_authenticated_client generic helper takes the
split stream halves, the transport descriptor, and a log label, then
runs the full post-startup pipeline. The three former copies collapse
to single call sites and client_entrypoint_unix shrinks to a straight
dispatch on the three ClientConnectionType variants. No behavioural
change — tests (incl. the 13 HBA unit tests) still pass.
Needed: both accept sites inlined the same match on
Result<Option<ClientSessionInfo>, Error> — same identity formatting,
same elapsed-time rendering, same info-vs-warn split. Any tweak to
session logging meant changing two places in lockstep.

What changed: a new log_session_end helper takes the entrypoint result,
the connection_id, a peer label, the session start time, and the
config flag, and produces the single-line disconnect log. The TCP
accept branch passes addr.to_string(), the Unix branch passes the
"unix:" sentinel — the only transport-specific bit left in the
select arms. Spawned tasks shrink from ~45 lines each to ~25 with no
behavioural change; ClientSessionInfo became pub to cross the module
boundary. 423 lib tests green.
Needed: the unix-socket feature only carried the happy-path scenario
plus the file-mode assertions added with the chmod fix. Critical HBA
and TLS rules that reach the matcher only via the new ClientTransport
plumbing had no end-to-end coverage, so a regression in is_unix
handling could pass CI silently.

What changed: three new scenarios on top of the existing happy path —
"local reject blocks Unix", "host rule does not match Unix", and
"only_ssl_connections does not block Unix". Each spins up pg_doorman
with the relevant pg_hba snippet (or tls_mode = require) and asserts
the expected outcome via psql over the socket. Two new step
definitions in postgres_helper exercise the rejection path:
psql_unix_connection_fails (no error string check) and
psql_unix_connection_fails_with (substring match for structured error
verification).
Needed: the BDD suite checked the trust path and the bare-rejection
path through the new HBA local-rule plumbing, but never exercised a
password-authenticated client coming in over the Unix socket. md5 is
the most common production setup, so a regression in the auth
pipeline for Unix would still pass CI.

What changed: two new scenarios in unix-socket.feature reuse the
existing tests/fixture.sql user (example_user_1 / md5...test) — one
asserts SELECT 1 succeeds when the password matches, the other
asserts the connection fails with the wrong password. Two new step
definitions in postgres_helper supply the PGPASSWORD env var for
both the success and failure paths.
Comment thread src/app/server.rs Dismissed
Comment thread src/client/entrypoint.rs Dismissed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants