feat: Allow aws-vault to safely be run in parallel#337
feat: Allow aws-vault to safely be run in parallel#337timvisher-dd wants to merge 30 commits intoByteNess:mainfrom
Conversation
|
Thanks, sorry but we might have talked about some of the feedback, but context is a bit lost now :) So PR still looks good and I think just a bit more feedback to polish the codebase.
|
cli/global.go
Outdated
| if a.ParallelSafe { | ||
| lockKey := a.KeyringConfig.KeychainName | ||
| if lockKey == "" { | ||
| lockKey = "aws-vault" |
There was a problem hiding this comment.
The fallback "aws-vault" is a fixed string shared across all instances regardless of backend. On Linux with the file or pass backend, two users on the same machine or two different aws-vault configs would contend on the same lock file path. The lock key should incorporate something backend-specific (e.g., the backend type, or the file-based keyring directory). Consider deriving the key from the full keyring config rather than just the keychain name.
go.mod
Outdated
| github.com/byteness/keyring v1.9.0 | ||
| github.com/charmbracelet/huh v1.0.0 | ||
| github.com/charmbracelet/lipgloss v1.1.0 | ||
| github.com/gofrs/flock v0.8.1 |
There was a problem hiding this comment.
Any reason not to go with latest v0.13.0 here?
https://github.com/gofrs/flock/releases
| return time.Unix(0, v*int64(time.Millisecond)) | ||
| } | ||
|
|
||
| const ( |
There was a problem hiding this comment.
I wonder if some of those constants could be either exposed as configurable flags (e.g., --parallel-safe-timeout) or defined as named constants with a comment explaining the rationale for each value.
vault/cachedsessionprovider.go
Outdated
| } | ||
| } | ||
|
|
||
| func (p *CachedSessionProvider) ensureSessionDependencies() { |
There was a problem hiding this comment.
This seems like a lazy initialization via nil-checks on struct fields and it's harder to test and reason about than explicit construction. The tests inject mocks into the struct fields directly, which is fine, but production code paths rely on ensureSessionDependencies() being called. Consider using a constructor function or sync.Once to make initialization explicit and safe. At minimum, add a comment warning that this must be called before any field use.
vault/cachedsessionprovider.go
Outdated
| if locked { | ||
| return p.doLockedSessionWork(ctx) | ||
| } | ||
| if err = waiter.sleepAfterMiss(ctx); err != nil { |
There was a problem hiding this comment.
The inner err = waiter.sleepAfterMiss(ctx) shadows the lock err variable. While this doesn't cause a bug here, it's confusing. Use a new variable: if sleepErr := waiter.sleepAfterMiss(ctx); sleepErr != nil.
5233d3e to
7e55e68
Compare
Add file-based process locks (via github.com/gofrs/flock) and a reusable lock-wait loop for serializing keychain, session-cache, and SSO-token operations across concurrent aws-vault processes. New files: - process_lock.go: ProcessLock interface + file-lock implementation - lock_waiter.go: poll-sleep-log wait loop with configurable timing - keychain_lock.go, session_lock.go, sso_lock.go: typed lock wrappers - locked_keyring.go: keyring.Keyring decorator that serializes all read/write operations behind a process lock - lock_test.go: shared test helpers for lock-based tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire the lock infrastructure into the SSO browser flow and session cache so only one aws-vault process refreshes credentials at a time. - SSO token lock: serializes OIDC browser auth per StartURL so concurrent processes don't each open a browser tab. - Session cache lock: serializes cache writes so concurrent processes don't race on "item already exists" keyring errors. - Keyring lock: wraps the keyring with a LockedKeyring decorator when --parallel-safe is enabled to serialize all keyring operations. - New --parallel-safe flag / AWS_VAULT_PARALLEL_SAFE env var threaded through exec, export, login, and rotate commands. - NewTempCredentialsProviderWithOptions for opt-in parallel safety. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When many parallel aws-vault processes hit GetRoleCredentials simultaneously, AWS returns HTTP 429 (TooManyRequests). Add a retry loop with Retry-After header support and exponential backoff with jitter so the SSO token exchange is resilient under heavy load. Gives up after 5 minutes of persistent 429s as a safety net. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a section explaining the --parallel-safe flag, what it protects against (browser storms, keyring races), its trade-offs (serialized keyring ops, all-or-nothing opt-in), and current limitations (lock wait timeout, SSO retry timeout). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename shadowed err to sleepErr in getSessionWithLock - Add rationale comments to retry/backoff constants - Add doc comments to ensureSessionDependencies/ensureSSODependencies explaining the lazy-init pattern Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Login opens a browser for an interactive console session — you cannot meaningfully log in to multiple consoles in parallel — so there is no concurrent-access problem for --parallel-safe to solve. Document this in the code comment, --help output, and USAGE.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lock key previously fell back to the fixed string "aws-vault" for non-keychain backends, causing unintended contention on shared systems. Now each backend type incorporates its specific config (directory, prefix, collection name, vault ID, etc.) into the lock key so different backends and configurations lock independently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three identical sleep-with-context-cancellation functions collapsed into one in process_lock.go. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename getRoleCredentialsAsStsCredemtials → getRoleCredentialsAsStsCredentials - Use lockLogger type instead of func(string, ...any) for ssoLogf - Flip capDelay > max to max < capDelay per repo convention - Replace defaultSSOSleep with shared defaultContextSleep Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
keychain_lock.go, session_lock.go, and sso_lock.go were structurally identical — each defined a type alias, two constructors, and a filename helper differing only by a string prefix. Replace all three with a single NewDefaultLock(prefix, key) in process_lock.go and use ProcessLock directly instead of the type aliases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…struct The 10-parameter constructor was hard to read and easy to mis-order. Replace with lockWaiterOpts struct for named fields. Also use the shared defaultContextSleep instead of an inline copy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old range always added 10-30% to the base delay, never reducing it. This meant concurrent processes that received the same Retry-After header would all retry at nearly the same time. The new 0.5x-1.5x range provides full jitter — some fire earlier, some later — properly decorrelating parallel retries. Note: Go 1.20+ auto-seeds math/rand, so the seeding concern does not apply to this codebase (go 1.25). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The keyring lock-wait loop already had a 2-minute timeout safety net, but the session and SSO loops would block indefinitely if the lock holder hung. Add matching context.WithTimeout to both, consistent with the keyring's defaultKeyringLockTimeout. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Login was documented as excluded from --parallel-safe, but Keyring() unconditionally wrapped with LockedKeyring when the flag was set. Split into rawKeyring (shared base) and Keyring (adds lock wrapper). Login now calls RawKeyring() to bypass the lock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The try/sleep/recheck lock protocol was copy-pasted in three places. Extract a single generic withProcessLock[T] that encodes the protocol once: check cache, try lock, do work under lock, sleep and retry. Also fixes a bug where the SSO retry loop used time.Until(deadline) (real clock) instead of deadline.Sub(p.ssoNow()) (injected clock), making the timeout untestable with fake clocks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestJitteredBackoffProgression: verify cap doubles per attempt - TestJitteredBackoffRespectsMax: verify cap doesn't exceed max - TestJitteredBackoffDoublesPerAttempt: verify deterministic cap - TestJitterRetryAfterRange: verify 0.5x-1.5x jitter range - TestJitterRetryAfterZeroBase/NegativeBase: edge cases - TestGetRoleCredentialsTimeoutOnPersistentRateLimit: end-to-end test with fake SSO server and fake clock Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cover all backend types (keychain, file, pass, secret-service, kwallet, wincred, op variants) with config set and empty, plus the fallback to backend name and "aws-vault". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate applyParallelSafety spray pattern — each provider constructor now accepts parallelSafe and configures locking at construction time. This removes the post-construction mutation of UseSessionLock and EnableSSOTokenLock, making providers fully configured at birth. The observable behavior is unchanged: same lock files, same serialization guarantees, same test assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… joining Tests verify: - Lock retry behavior when lock is not immediately available - Timeout after configurable duration when lock is never acquired - errors.Join of work error and unlock error Also makes lockTimeout a struct field (was hardcoded constant) so tests can inject shorter timeouts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that UseSessionLock is set on the CachedSessionProvider when ParallelSafe=true for the GetSessionToken (stored credentials) path. This complements the existing SSO path test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use ProfileConfig directly instead of config file parsing to avoid needing MfaPromptMethod. Add AssumeRole path test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
At attempt ~37, base << uint(attempt-1) overflows int64 and produces a negative value. The existing guard caught this but reset to base (200ms) instead of max (5s), making late retries more aggressive under sustained 429s — exactly the wrong behavior during a rate-limit storm. Also extends TestJitteredBackoffRespectsMax to cover attempts 20-60 with both min and max bounds, catching the overflow regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The in-process sync.Mutex is not covered by the 2-minute flock timeout. If a keyring operation hangs (e.g. a stuck gpg subprocess), other goroutines in the same process block indefinitely on mu.Lock(). Document this limitation since the keyring.Keyring interface is not context-aware. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous description enumerated specific commands (exec, export, rotate) but the keyring wrapping via a.Keyring() is app-global. Describe what the flag does without enumerating commands to avoid inaccuracy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The lock-wait timeout (context.WithTimeout) was wrapping the entire withProcessLock call, which meant it capped both the time waiting for the lock AND the time doing work under the lock. For SSO browser auth, this could cancel a legitimate login that takes longer than 2 minutes. Fix: doWork closures now capture the caller's ctx (via `func(_ context.Context)`) instead of receiving the timeout-constrained waitCtx. The lock-wait timeout only bounds the polling loop, not the work done once the lock is acquired. Also replaces the redundant `Lock ProcessLock` field in lockWaiterOpts with `LockPath string` — the field was only used for `.Path()` in log messages, and passing the full lock created a risk of divergence with the positional lock parameter in withProcessLock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The OIDC device-flow polling loop used time.Sleep which doesn't respond to context cancellation. While holding the SSO flock, this blocked other processes from attempting the lock and prevented prompt cancellation on Ctrl+C. Switch to p.ssoSleep (defaultContextSleep) so the sleep respects the ctx passed to newOIDCToken. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
getCachedSession errors were silently swallowed — the code fell through to re-fetch credentials on any error, which is correct self-healing behavior but made real keyring problems (permissions, corruption) invisible. Add log.Printf for errors that aren't keyring.ErrKeyNotFound so they surface in debug mode without changing the fallthrough behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Keyring() refactor split keyringImpl (parallel-safe wrapper) from rawKeyringImpl (underlying keyring). Example tests that inject a mock keyring were still setting keyringImpl, but Keyring() now calls rawKeyring() first — which tried keyring.Open and failed on CI where secret-service is unavailable. Set rawKeyringImpl instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ee7d472 to
952e2d9
Compare
Closes 99designs/aws-vault#1275 (abandoned)
I run aws-vault as a credentials_provider for SSO backed AWS profiles. This is often in parallel contexts of dozens to hundreds of invocations at once.
In that context when credentials needed to be refreshed, aws-vault would open an unconstrained amount of browser tabs in parallel, usually triggering HTTP 500 responses on the AWS side and failing to acquire creds.
To mitigate this I developed a small wrapper around aws-vault that would use a Bash dir lock (Wooledge BashFAQ 45) when there was a possibility that the credentials would need to be refreshed. This worked but it was also quite slow as it would lock the entire aws-vault execution rather than just the step that might launch a browser tab. The dir locking strategy was also sensitive to being killed by process managers like terraform and so had to do things like die voluntarily after magic amounts of seconds to avoid being SIGKILLed and leaving a stale lock around.
This changeset introduces a cross-platform process lock that is tolerant of all forms of process death (even SIGKILL), and wires it into the SSO/OIDC browser flow so only one process refreshes at a time. This keeps parallel invocations fast while avoiding the browser storm that triggers HTTP 500s.
While stress testing I also found that the session cache could race across processes, leading to "item already exists" errors from the keychain. This branch adds a session cache refresh lock so only one process writes back to the same cache entry at once, while others wait and re-check.
Because the parallelism is now safe and fast, I also hit SSO OIDC rate limits (429). This change adds Retry-After support with jittered backoff so the SSO token exchange is resilient under heavy load.
In a stress test across 646 AWS profiles in 9 SSO Directories I'm able to retrieve creds successfully in ~36 seconds on my box. This fails on upstream HEAD because the browser storm overwhelms the IAM/SSO APIs and the keychain cache update races.
Performance implications
This change serializes keychain operations (for the macOS keychain backend only) so only one aws-vault process is touching the keychain at a time. This avoids concurrent unlock prompts and other keychain contention at the cost of making highly parallel workloads wait their turn for keychain access. In practice this has been acceptable for my workloads.
On my machine, successfully retrieving creds for 642 profiles completes in under 60 seconds, which is good enough that I did not do a formal before/after benchmark. If maintainers want more detailed measurements I can provide them.
Testing
Unit tests have been added for the SSO/OIDC lock, session cache lock, and OIDC retry logic. I also used some local integration test scripts that clear out the creds and run
aws-vault export --format=jsonacross different sets of my profiles and assert that it succeeds. Finally I've converted my local tooling to use this fork of aws-vault and have been exercising it there without issue.Colofon
I did not write any of this code. Codex did. That said I have read through it in some detail and it looks reasonable to me.
Co-authored-by: Codex noreply@openai.com