Skip to content

feat: Allow aws-vault to safely be run in parallel#337

Draft
timvisher-dd wants to merge 30 commits intoByteNess:mainfrom
timvisher-dd:sso-browser-lock
Draft

feat: Allow aws-vault to safely be run in parallel#337
timvisher-dd wants to merge 30 commits intoByteNess:mainfrom
timvisher-dd:sso-browser-lock

Conversation

@timvisher-dd
Copy link
Copy Markdown

Closes 99designs/aws-vault#1275 (abandoned)

I run aws-vault as a credentials_provider for SSO backed AWS profiles. This is often in parallel contexts of dozens to hundreds of invocations at once.

In that context when credentials needed to be refreshed, aws-vault would open an unconstrained amount of browser tabs in parallel, usually triggering HTTP 500 responses on the AWS side and failing to acquire creds.

To mitigate this I developed a small wrapper around aws-vault that would use a Bash dir lock (Wooledge BashFAQ 45) when there was a possibility that the credentials would need to be refreshed. This worked but it was also quite slow as it would lock the entire aws-vault execution rather than just the step that might launch a browser tab. The dir locking strategy was also sensitive to being killed by process managers like terraform and so had to do things like die voluntarily after magic amounts of seconds to avoid being SIGKILLed and leaving a stale lock around.

This changeset introduces a cross-platform process lock that is tolerant of all forms of process death (even SIGKILL), and wires it into the SSO/OIDC browser flow so only one process refreshes at a time. This keeps parallel invocations fast while avoiding the browser storm that triggers HTTP 500s.

While stress testing I also found that the session cache could race across processes, leading to "item already exists" errors from the keychain. This branch adds a session cache refresh lock so only one process writes back to the same cache entry at once, while others wait and re-check.

Because the parallelism is now safe and fast, I also hit SSO OIDC rate limits (429). This change adds Retry-After support with jittered backoff so the SSO token exchange is resilient under heavy load.

In a stress test across 646 AWS profiles in 9 SSO Directories I'm able to retrieve creds successfully in ~36 seconds on my box. This fails on upstream HEAD because the browser storm overwhelms the IAM/SSO APIs and the keychain cache update races.

Performance implications

This change serializes keychain operations (for the macOS keychain backend only) so only one aws-vault process is touching the keychain at a time. This avoids concurrent unlock prompts and other keychain contention at the cost of making highly parallel workloads wait their turn for keychain access. In practice this has been acceptable for my workloads.

On my machine, successfully retrieving creds for 642 profiles completes in under 60 seconds, which is good enough that I did not do a formal before/after benchmark. If maintainers want more detailed measurements I can provide them.

Testing

Unit tests have been added for the SSO/OIDC lock, session cache lock, and OIDC retry logic. I also used some local integration test scripts that clear out the creds and run aws-vault export --format=json across different sets of my profiles and assert that it succeeds. Finally I've converted my local tooling to use this fork of aws-vault and have been exercising it there without issue.

Colofon

I did not write any of this code. Codex did. That said I have read through it in some detail and it looks reasonable to me.

Co-authored-by: Codex noreply@openai.com

@timvisher-dd timvisher-dd requested a review from mbevc1 as a code owner April 4, 2026 00:07
@github-actions github-actions bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file feat labels Apr 4, 2026
@mbevc1
Copy link
Copy Markdown
Contributor

mbevc1 commented Apr 4, 2026

Thanks, sorry but we might have talked about some of the feedback, but context is a bit lost now :)

So PR still looks good and I think just a bit more feedback to polish the codebase.

  • The login command is missing --parallel-safe wiring
    exec, export, and rotate all receive the flag, but login (which also opens a browser and calls SSO) is not in the diff. If a user sets AWS_VAULT_PARALLEL_SAFE=true, login will silently not benefit from the lock, which is the most visible browser-storm scenario. Either add it to login or document clearly that login is intentionally excluded and why.
  • The USAGE.md documentation says the lock "applies to all backends" — but then the trade-offs section correctly notes the serialization cost. The inconsistency is in the code: the lock is applied in global.go whenever --parallel-safe is set, regardless of backend. For the file backend or pass backend, the underlying operations are likely already atomic at the OS level and the lock adds pure overhead. Consider gating the LockedKeyring wrapper on a.KeyringBackend == "keychain" (or making the behavior clearly documented per-backend), to avoid unnecessary serialization on Linux.

cli/global.go Outdated
if a.ParallelSafe {
lockKey := a.KeyringConfig.KeychainName
if lockKey == "" {
lockKey = "aws-vault"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback "aws-vault" is a fixed string shared across all instances regardless of backend. On Linux with the file or pass backend, two users on the same machine or two different aws-vault configs would contend on the same lock file path. The lock key should incorporate something backend-specific (e.g., the backend type, or the file-based keyring directory). Consider deriving the key from the full keyring config rather than just the keychain name.

go.mod Outdated
github.com/byteness/keyring v1.9.0
github.com/charmbracelet/huh v1.0.0
github.com/charmbracelet/lipgloss v1.1.0
github.com/gofrs/flock v0.8.1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to go with latest v0.13.0 here?
https://github.com/gofrs/flock/releases

return time.Unix(0, v*int64(time.Millisecond))
}

const (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if some of those constants could be either exposed as configurable flags (e.g., --parallel-safe-timeout) or defined as named constants with a comment explaining the rationale for each value.

}
}

func (p *CachedSessionProvider) ensureSessionDependencies() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lazy initialization via nil-checks on struct fields and it's harder to test and reason about than explicit construction. The tests inject mocks into the struct fields directly, which is fine, but production code paths rely on ensureSessionDependencies() being called. Consider using a constructor function or sync.Once to make initialization explicit and safe. At minimum, add a comment warning that this must be called before any field use.

if locked {
return p.doLockedSessionWork(ctx)
}
if err = waiter.sleepAfterMiss(ctx); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inner err = waiter.sleepAfterMiss(ctx) shadows the lock err variable. While this doesn't cause a bug here, it's confusing. Use a new variable: if sleepErr := waiter.sleepAfterMiss(ctx); sleepErr != nil.

@timvisher-dd timvisher-dd force-pushed the sso-browser-lock branch 4 times, most recently from 5233d3e to 7e55e68 Compare April 7, 2026 18:34
@timvisher-dd timvisher-dd marked this pull request as draft April 8, 2026 11:02
@timvisher-dd timvisher-dd marked this pull request as ready for review April 9, 2026 18:13
@timvisher-dd timvisher-dd marked this pull request as draft April 9, 2026 18:13
timvisher-dd and others added 16 commits April 10, 2026 08:17
Add file-based process locks (via github.com/gofrs/flock) and a
reusable lock-wait loop for serializing keychain, session-cache, and
SSO-token operations across concurrent aws-vault processes.

New files:
- process_lock.go: ProcessLock interface + file-lock implementation
- lock_waiter.go: poll-sleep-log wait loop with configurable timing
- keychain_lock.go, session_lock.go, sso_lock.go: typed lock wrappers
- locked_keyring.go: keyring.Keyring decorator that serializes all
  read/write operations behind a process lock
- lock_test.go: shared test helpers for lock-based tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the lock infrastructure into the SSO browser flow and session
cache so only one aws-vault process refreshes credentials at a time.

- SSO token lock: serializes OIDC browser auth per StartURL so
  concurrent processes don't each open a browser tab.
- Session cache lock: serializes cache writes so concurrent processes
  don't race on "item already exists" keyring errors.
- Keyring lock: wraps the keyring with a LockedKeyring decorator when
  --parallel-safe is enabled to serialize all keyring operations.
- New --parallel-safe flag / AWS_VAULT_PARALLEL_SAFE env var threaded
  through exec, export, login, and rotate commands.
- NewTempCredentialsProviderWithOptions for opt-in parallel safety.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When many parallel aws-vault processes hit GetRoleCredentials
simultaneously, AWS returns HTTP 429 (TooManyRequests). Add a retry
loop with Retry-After header support and exponential backoff with
jitter so the SSO token exchange is resilient under heavy load.

Gives up after 5 minutes of persistent 429s as a safety net.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a section explaining the --parallel-safe flag, what it protects
against (browser storms, keyring races), its trade-offs (serialized
keyring ops, all-or-nothing opt-in), and current limitations (lock
wait timeout, SSO retry timeout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename shadowed err to sleepErr in getSessionWithLock
- Add rationale comments to retry/backoff constants
- Add doc comments to ensureSessionDependencies/ensureSSODependencies
  explaining the lazy-init pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Login opens a browser for an interactive console session — you cannot
meaningfully log in to multiple consoles in parallel — so there is no
concurrent-access problem for --parallel-safe to solve. Document this
in the code comment, --help output, and USAGE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lock key previously fell back to the fixed string "aws-vault" for
non-keychain backends, causing unintended contention on shared systems.
Now each backend type incorporates its specific config (directory, prefix,
collection name, vault ID, etc.) into the lock key so different backends
and configurations lock independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three identical sleep-with-context-cancellation functions collapsed
into one in process_lock.go.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename getRoleCredentialsAsStsCredemtials → getRoleCredentialsAsStsCredentials
- Use lockLogger type instead of func(string, ...any) for ssoLogf
- Flip capDelay > max to max < capDelay per repo convention
- Replace defaultSSOSleep with shared defaultContextSleep

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keychain_lock.go, session_lock.go, and sso_lock.go were structurally
identical — each defined a type alias, two constructors, and a filename
helper differing only by a string prefix. Replace all three with a
single NewDefaultLock(prefix, key) in process_lock.go and use
ProcessLock directly instead of the type aliases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…struct

The 10-parameter constructor was hard to read and easy to mis-order.
Replace with lockWaiterOpts struct for named fields. Also use the
shared defaultContextSleep instead of an inline copy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old range always added 10-30% to the base delay, never reducing it.
This meant concurrent processes that received the same Retry-After
header would all retry at nearly the same time. The new 0.5x-1.5x
range provides full jitter — some fire earlier, some later — properly
decorrelating parallel retries.

Note: Go 1.20+ auto-seeds math/rand, so the seeding concern does not
apply to this codebase (go 1.25).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The keyring lock-wait loop already had a 2-minute timeout safety net,
but the session and SSO loops would block indefinitely if the lock
holder hung. Add matching context.WithTimeout to both, consistent
with the keyring's defaultKeyringLockTimeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Login was documented as excluded from --parallel-safe, but Keyring()
unconditionally wrapped with LockedKeyring when the flag was set.
Split into rawKeyring (shared base) and Keyring (adds lock wrapper).
Login now calls RawKeyring() to bypass the lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
timvisher-dd and others added 14 commits April 10, 2026 08:17
The try/sleep/recheck lock protocol was copy-pasted in three places.
Extract a single generic withProcessLock[T] that encodes the protocol
once: check cache, try lock, do work under lock, sleep and retry.

Also fixes a bug where the SSO retry loop used time.Until(deadline)
(real clock) instead of deadline.Sub(p.ssoNow()) (injected clock),
making the timeout untestable with fake clocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestJitteredBackoffProgression: verify cap doubles per attempt
- TestJitteredBackoffRespectsMax: verify cap doesn't exceed max
- TestJitteredBackoffDoublesPerAttempt: verify deterministic cap
- TestJitterRetryAfterRange: verify 0.5x-1.5x jitter range
- TestJitterRetryAfterZeroBase/NegativeBase: edge cases
- TestGetRoleCredentialsTimeoutOnPersistentRateLimit: end-to-end
  test with fake SSO server and fake clock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover all backend types (keychain, file, pass, secret-service, kwallet,
wincred, op variants) with config set and empty, plus the fallback
to backend name and "aws-vault".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate applyParallelSafety spray pattern — each provider constructor
now accepts parallelSafe and configures locking at construction time.
This removes the post-construction mutation of UseSessionLock and
EnableSSOTokenLock, making providers fully configured at birth.

The observable behavior is unchanged: same lock files, same
serialization guarantees, same test assertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… joining

Tests verify:
- Lock retry behavior when lock is not immediately available
- Timeout after configurable duration when lock is never acquired
- errors.Join of work error and unlock error

Also makes lockTimeout a struct field (was hardcoded constant) so
tests can inject shorter timeouts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that UseSessionLock is set on the CachedSessionProvider when
ParallelSafe=true for the GetSessionToken (stored credentials) path.
This complements the existing SSO path test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use ProfileConfig directly instead of config file parsing to avoid
needing MfaPromptMethod. Add AssumeRole path test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At attempt ~37, base << uint(attempt-1) overflows int64 and produces a
negative value. The existing guard caught this but reset to base (200ms)
instead of max (5s), making late retries more aggressive under sustained
429s — exactly the wrong behavior during a rate-limit storm.

Also extends TestJitteredBackoffRespectsMax to cover attempts 20-60 with
both min and max bounds, catching the overflow regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The in-process sync.Mutex is not covered by the 2-minute flock timeout.
If a keyring operation hangs (e.g. a stuck gpg subprocess), other
goroutines in the same process block indefinitely on mu.Lock(). Document
this limitation since the keyring.Keyring interface is not context-aware.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous description enumerated specific commands (exec, export,
rotate) but the keyring wrapping via a.Keyring() is app-global. Describe
what the flag does without enumerating commands to avoid inaccuracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lock-wait timeout (context.WithTimeout) was wrapping the entire
withProcessLock call, which meant it capped both the time waiting for
the lock AND the time doing work under the lock. For SSO browser auth,
this could cancel a legitimate login that takes longer than 2 minutes.

Fix: doWork closures now capture the caller's ctx (via `func(_ context.Context)`)
instead of receiving the timeout-constrained waitCtx. The lock-wait timeout
only bounds the polling loop, not the work done once the lock is acquired.

Also replaces the redundant `Lock ProcessLock` field in lockWaiterOpts
with `LockPath string` — the field was only used for `.Path()` in log
messages, and passing the full lock created a risk of divergence with
the positional lock parameter in withProcessLock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The OIDC device-flow polling loop used time.Sleep which doesn't respond
to context cancellation. While holding the SSO flock, this blocked other
processes from attempting the lock and prevented prompt cancellation on
Ctrl+C. Switch to p.ssoSleep (defaultContextSleep) so the sleep respects
the ctx passed to newOIDCToken.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
getCachedSession errors were silently swallowed — the code fell through
to re-fetch credentials on any error, which is correct self-healing
behavior but made real keyring problems (permissions, corruption)
invisible. Add log.Printf for errors that aren't keyring.ErrKeyNotFound
so they surface in debug mode without changing the fallthrough behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Keyring() refactor split keyringImpl (parallel-safe wrapper) from
rawKeyringImpl (underlying keyring). Example tests that inject a mock
keyring were still setting keyringImpl, but Keyring() now calls
rawKeyring() first — which tried keyring.Open and failed on CI where
secret-service is unavailable. Set rawKeyringImpl instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation feat

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants