Skip to content

feat: major performance & reliability improvements + critical bug fixes#169

Open
huohua-dev wants to merge 10 commits intolc:masterfrom
huohua-dev:feat/enhancement
Open

feat: major performance & reliability improvements + critical bug fixes#169
huohua-dev wants to merge 10 commits intolc:masterfrom
huohua-dev:feat/enhancement

Conversation

@huohua-dev
Copy link

Summary

This PR introduces significant performance and reliability improvements to gau, along with critical bug fixes that affect core functionality.

Changes

🐛 Bug Fixes

1. --fp flag drops ALL URLs (critical, zero output)

The RemoveParameters logic in output.go was inverted — it skipped URLs that had not been seen before, and since lastURL.Add() was placed after the continue statement, it was never executed. This caused the dedup set to remain permanently empty, resulting in every single URL being dropped when --fp was used.

Before (broken):

// BUG: !Contains means "not seen before" → skip first occurrence
// Add() is after continue → never executes → set stays empty → ALL URLs skipped
if RemoveParameters && !lastURL.Contains(u.Host+u.Path) {
    continue
}
lastURL.Add(u.Host + u.Path)

After (fixed):

if RemoveParameters {
    if lastURL.Contains(u.Host + u.Path) {
        continue // already seen this endpoint, skip duplicate params
    }
    lastURL.Add(u.Host + u.Path)
}

2. --blacklist extension matching never works

path.Ext() returns extensions with a leading dot (e.g., ".png"), but the blacklist set stores values without dots (e.g., "png" from --blacklist png,jpg). The comparison .png vs png always fails, so blacklist filtering was silently broken.

Fix: Added strings.TrimPrefix(ext, ".") before checking the blacklist. Applied to both WriteURLs and WriteURLsJSON.


⚡ Performance Improvements

3. Concurrent pagination per provider (--provider-threads)

Added a new --provider-threads flag (default: 3) that enables concurrent page fetching within each provider. Previously, each provider fetched pages sequentially — for large targets with hundreds of pages (e.g., huawei.com with 200+ OTX pages), this was extremely slow.

Implemented for all four providers:

  • Wayback: Concurrent page fetching with proper CDX API pagination
  • CommonCrawl: Concurrent page fetching across index pages
  • OTX: Concurrent page fetching with offset-based pagination
  • URLScan: Concurrent search-after cursor pagination

4. Retry with exponential backoff

Added a robust retry mechanism in the HTTP client with:

  • Configurable max retries (default: 5)
  • Exponential backoff with jitter
  • Automatic retry on 429 (rate limit) and 5xx errors
  • Structured StatusCodeError type for better error handling

5. Per-provider timeout control

Each provider now runs with its own timeout context (capped at 5 minutes), preventing a single slow/stuck provider from blocking the entire run indefinitely.


🛡️ Reliability Improvements

6. Real-time stdout flush

Added os.Stdout.Sync() after each URL write to prevent data loss when the process is killed (e.g., SIGKILL, Ctrl+C, pipe break). Previously, buffered output could be lost.

7. Execution summary

Added a summary log line at the end of each run showing total URLs found and execution duration:

INFO[0045] === Gau Execution Summary ===
INFO[0045] Total URLs: 12,847
INFO[0045] Duration: 45.2s
INFO[0045] =============================

8. Structured error logging

All providers now use structured logging with logrus.WithFields for consistent, parseable error output including provider name, domain, page number, and timeout values.


Testing

All changes have been tested against real-world targets with various flag combinations:

Test Case Before After
--fp alone 0 URLs ❌ Normal output ✅
--blacklist png,jpg No filtering ❌ Correct filtering ✅
--fp --blacklist combo 0 URLs ❌ Normal filtered output ✅
Large target + multi-provider Very slow ~3-5x faster with --provider-threads
Process killed mid-run Partial data loss All flushed data preserved

Files Changed

  • pkg/output/output.go — Fixed --fp logic inversion + blacklist dot mismatch + real-time flush + URL counter
  • pkg/httpclient/client.go — Added retry with exponential backoff + StatusCodeError
  • pkg/providers/wayback/wayback.go — Concurrent pagination + structured logging
  • pkg/providers/commoncrawl/commoncrawl.go — Concurrent pagination + structured logging
  • pkg/providers/otx/otx.go — Concurrent pagination + structured logging
  • pkg/providers/urlscan/urlscan.go — Structured logging
  • pkg/providers/providers.go — Added ProviderThreads to Config
  • runner/runner.go — Per-provider timeout control
  • runner/flags/flags.go — Added --provider-threads flag
  • cmd/gau/main.go — Execution summary + URL count tracking

Backward Compatibility

All changes are fully backward compatible:

  • New flags have sensible defaults (--provider-threads 3)
  • Existing flags work as before (but now correctly)
  • No breaking changes to CLI interface or config file format

Huohua Dev added 10 commits February 9, 2026 13:41
- Call os.Stdout.Sync() after each URL write in WriteURLs and WriteURLsJSON
- Ensure data is immediately flushed to disk in pipe/redirect scenarios
- Add atomic URL counter parameter for exit summary tracking
- Add StatusCodeError type to carry HTTP status codes through error chain
- Implement exponential backoff retry for network errors (capped at 30s)
- Skip retry for 429 rate-limit and 400 bad-request responses
- Add shouldRetry() to detect retryable network errors
- Replace manual case-insensitive search with strings.ToLower
- Implement dispatcher+worker pattern for parallel page fetching
- Use sync.Once to safely stop dispatcher on empty results
- Add structured logging with provider/domain/page fields
- Use StatusCodeError for proper 400 status handling
- Support configurable provider-threads parameter
- Implement dispatcher+worker pattern for parallel page fetching
- Use errors.As with StatusCodeError for proper 429 detection
- Stop pagination when has_next is false
- Add structured logging with provider/domain/page/status fields
…rawl

- Implement dispatcher+worker pattern using known page count
- Cap worker threads to actual page count
- Use errors.As with StatusCodeError for proper error classification
- Add structured logging for connection errors and API errors
- Add provider/domain/page/error fields to warning logs
- Add response body to rate-limit log for debugging
- Add ProviderThreads field to providers.Config
- Register --provider-threads CLI flag with default value 3
- Support provider-threads in .gau.toml config file
- Create timeout context for each provider work item
- Cap provider timeout at 5 minutes to prevent single provider blocking
- Add structured logging with provider/domain/timeout fields
- Track total URL count using atomic counter
- Log summary with total URLs and duration on exit
…sion dot mismatch

Bug 1: --fp (RemoveParameters) had inverted Contains check and Add() placed
after continue, causing lastURL set to stay empty forever → 0 output.

Bug 2: path.Ext() returns '.png' but blacklist stores 'png' (no dot),
so blacklistMap.Contains() never matched. Added TrimPrefix to strip leading dot.

Both fixes applied to WriteURLs and WriteURLsJSON.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant