fix Apache combined-format user-field capture class #5

Open
HrachShah wants to merge 1 commit into main from fix/apache-combined-user-class

Conversation

@HrachShah HrachShah commented Jun 9, 2026

Owner

The previous COMBINED_PATTERN captured the user field with (?P<user>\s+), one or more whitespace characters, which never matches the literal '-' that Apache uses for an empty user (or any real username). As a result, the combined branch never matched any real combined-format line; parse() silently fell through to COMMON_PATTERN, dropping the user_agent and referer fields (the entire reason a "combined" format exists).

The fix changes the user capture to \S+ (non-whitespace) with a mandatory \s+ separator after it, so the timestamp's opening [ always has a preceding space to anchor against. The referer and user_agent captures are also broadened from [^"]+ to [^"]* so that an empty quoted value ("") is accepted.
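A minimal sketch of the difference between the two user captures. The real COMBINED_PATTERN is not shown in this PR text, so `OLD_PREFIX` and `NEW_PREFIX` below are hypothetical, simplified patterns that cover only the fields up to the timestamp's opening bracket.

```python
import re

# Hypothetical prefix patterns (assumed for illustration, not the module's
# actual regexes). Only the user group and its separator differ.
OLD_PREFIX = re.compile(r'^(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\s+)\[')
NEW_PREFIX = re.compile(r'^(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+\[')

line = (
    '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)

# The whitespace-only user group cannot consume '-', so the old form fails.
old_match = OLD_PREFIX.match(line)

# The non-whitespace user group captures '-' and the separator absorbs the space.
new_match = NEW_PREFIX.match(line)

print(old_match)                # None
print(new_match.group("user"))  # -
```

The same holds for a real username such as `frank`: a whitespace-only group cannot capture it either, which is why the old branch could not match either kind of line.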

Two new tests in TestApacheParserCombinedUser cover both the real-username case (where the original bug was masked because the parser silently fell back to common format) and the hyphen-user case. All 23 tests in tests/test_parsers.py pass.

Summary by Sourcery

Fix Apache combined log parsing so the combined-format branch matches real log lines and preserves user, referer, and user-agent fields.

Bug Fixes:

  • Correct Apache combined log pattern to capture non-whitespace user values and allow hyphen placeholders, ensuring combined-format lines are parsed instead of falling back to common format.

Tests:

  • Add regression tests covering Apache combined-format parsing for both real usernames and hyphen placeholder users, including referer and user-agent fields.

Apache's combined log format (and the common format it extends) uses '-' as
the literal placeholder for an empty user. The previous COMBINED_PATTERN
captured the user field with '\s+' (one or more whitespace characters), which
never matches a literal '-'. The pattern therefore never matched any real
combined-format line; parse() silently fell through to COMMON_PATTERN, and
the user_agent and referer fields (the entire reason a 'combined' format
exists) were dropped.

The fix changes the user capture to '\S+' (non-whitespace) and adds a
mandatory '\s+' separator after it so the timestamp '[' has a guaranteed
preceding space to anchor against. The referer and user_agent captures are
also broadened from '[^"]+' to '[^"]*' so an empty quoted form is
accepted (some log emitters write "" for a missing referer instead of
'-', and a regex that demands at least one character fails to match those).
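The effect of broadening the quoted captures can be sketched on the tail of a log line alone. `STRICT_TAIL` and `LENIENT_TAIL` are hypothetical names and patterns introduced for this illustration; they cover only the two quoted fields, not the full pattern in apache.py.

```python
import re

# Hypothetical tail patterns (assumed). The only difference is the
# quantifier inside the quotes: '+' needs one character, '*' accepts zero.
STRICT_TAIL = re.compile(r'"(?P<referer>[^"]+)"\s+"(?P<user_agent>[^"]+)"')
LENIENT_TAIL = re.compile(r'"(?P<referer>[^"]*)"\s+"(?P<user_agent>[^"]*)"')

# An emitter that writes "" instead of "-" for a missing referer.
tail = '"" "Mozilla/5.0"'

strict = STRICT_TAIL.fullmatch(tail)    # fails: the referer group is empty
lenient = LENIENT_TAIL.fullmatch(tail)  # succeeds with an empty referer

print(strict)                            # None
print(repr(lenient.group("referer")))    # ''
print(lenient.group("user_agent"))       # Mozilla/5.0
```

Both forms still accept the usual `"-"` placeholder, so the broader quantifier only adds the empty-string case.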

Two new tests in TestApacheParserCombinedUser cover both the
real-username case (where the original bug was masked because the parser
silently fell back to common-format) and the hyphen-user case (which was
already 'working' in the loose sense that can_parse returned True). The
combined-format user_agent and referer fields are now correctly populated
in both cases.

All 23 tests in tests/test_parsers.py pass.

sourcery-ai Bot commented Jun 9, 2026

Reviewer's Guide

Adjusts the Apache combined log regex to correctly capture the user field (including '-' and real users) and allow empty referer/user-agent values, and adds regression tests to verify parsing for both real and hyphen users.

Flow diagram for ApacheParser.parse combined vs common patterns

flowchart TD
    A[Log line input] --> B[ApacheParser.parse]
    B --> C[Apply COMBINED_PATTERN]
    C -->|match| D[Extract host, ident, user, timestamp, request, status, size, referer, user_agent]
    D --> E[Return combined-format parsed entry]
    C -->|no match| F[Apply COMMON_PATTERN]
    F -->|match| G[Extract host, ident, user, timestamp, request, status, size]
    G --> H[Return common-format parsed entry]
    F -->|no match| I[Raise parse error or return None]

    %% COMBINED_PATTERN now uses \S+ for user and [^"]* for referer/user_agent
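The flow in the diagram can be sketched as a two-step fallback. This is an illustration under stated assumptions, not the code in apache.py: the pattern bodies, the dict return type, and the choice to return None on failure are all assumed here, and only the names COMBINED_PATTERN and COMMON_PATTERN come from the PR.

```python
import re
from typing import Optional

# Assumed pattern bodies; the real ones live in
# src/log_analyzer_cli/parsers/apache.py and are not shown in this PR text.
COMMON = (
    r'(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\S+)'
)
COMMON_PATTERN = re.compile(COMMON + r'$')
COMBINED_PATTERN = re.compile(
    COMMON + r'\s+"(?P<referer>[^"]*)"\s+"(?P<user_agent>[^"]*)"$'
)


def parse(line: str) -> Optional[dict]:
    """Try the combined pattern first, then fall back to the common one."""
    candidates = (("combined", COMBINED_PATTERN), ("common", COMMON_PATTERN))
    for fmt, pattern in candidates:
        match = pattern.match(line.strip())
        if match:
            return {"format": fmt, **match.groupdict()}
    return None  # neither pattern matched


combined_line = (
    '192.168.1.10 - frank [20/Mar/2025:10:15:32 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"https://example.com/referer" "Mozilla/5.0 (X11; Linux) Firefox/120"'
)
common_line = (
    '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

print(parse(combined_line)["format"])  # combined
print(parse(common_line)["format"])    # common
print(parse("not a log line"))         # None
```

Trying the more specific pattern first matters: if the combined pattern fails for any reason, the fallback silently yields an entry without referer or user_agent, which is the failure mode this PR describes.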

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Fix Apache combined log regex so combined-format lines match correctly and user/referer/user-agent fields are captured as intended. | • Change user capture group from whitespace to non-whitespace token with following space separator to ensure proper matching and anchor before timestamp.<br>• Relax referer and user_agent capture groups to accept empty quoted strings while still excluding embedded quotes. | src/log_analyzer_cli/parsers/apache.py |
| Add regression tests to ensure Apache combined-format parsing works for both real users and hyphen placeholder users and that referer/user-agent are preserved. | • Introduce TestApacheParserCombinedUser test class documenting the previous bug and expected behavior.<br>• Add test for combined-format line with a real username verifying user, referer, and user_agent metadata.<br>• Add test for combined-format line with '-' user verifying user and user_agent metadata are parsed correctly. | tests/test_parsers.py |

coderabbitai Bot commented Jun 9, 2026

Warning

Review limit reached

@HrachShah, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 55 minutes and 29 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eca15bf9-7b9b-4768-bd73-09635e3c9e86

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 0d133d3.

📒 Files selected for processing (2)
  • src/log_analyzer_cli/parsers/apache.py
  • tests/test_parsers.py

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="tests/test_parsers.py" line_range="207-216" />
<code_context>
+        assert entry.metadata["referer"] == "https://example.com/referer"
+        assert entry.metadata["user_agent"] == "Mozilla/5.0 (X11; Linux) Firefox/120"
+
+    def test_parse_apache_combined_with_hyphen_user(self):
+        parser = ApacheParser()
+        line = (
+            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+        )
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.metadata["user"] == "-"
+        assert entry.metadata["user_agent"] == "Mozilla/5.0"
</code_context>
<issue_to_address>
**suggestion (testing):** Add an assertion for the referer in the hyphen-user case to prove the combined format (not common) is used.

Previously, `parse()` fell back to the common pattern for hyphen users, which dropped both `referer` and `user_agent`. This test currently only checks `user` and `user_agent`. Adding `assert entry.metadata["referer"] == "-"` would explicitly verify that the combined pattern is used and that the `referer` field is preserved instead of being lost via the common-pattern fallback.
</issue_to_address>

Comment thread tests/test_parsers.py
Comment on lines +207 to +216
    def test_parse_apache_combined_with_hyphen_user(self):
        parser = ApacheParser()
        line = (
            '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
            '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
        )
        entry = parser.parse(line)
        assert entry is not None
        assert entry.metadata["user"] == "-"
        assert entry.metadata["user_agent"] == "Mozilla/5.0"

suggestion (testing): Add an assertion for the referer in the hyphen-user case to prove the combined format (not common) is used.

Previously, parse() fell back to the common pattern for hyphen users, which dropped both referer and user_agent. This test currently only checks user and user_agent. Adding assert entry.metadata["referer"] == "-" would explicitly verify that the combined pattern is used and that the referer field is preserved instead of being lost via the common-pattern fallback.
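One way the suggested assertion might look, written so it runs on its own. The `ApacheParser` class below is a hypothetical stand-in (the real one lives in src/log_analyzer_cli/parsers/apache.py and is not shown here), and `_COMBINED` is an assumed pattern; only the test body mirrors the code under review.

```python
import re
from types import SimpleNamespace

# Assumed combined-format pattern, used only by the stand-in parser below.
_COMBINED = re.compile(
    r'(?P<host>\S+)\s+(?P<ident>\S+)\s+(?P<user>\S+)\s+'
    r'\[(?P<timestamp>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\S+)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<user_agent>[^"]*)"$'
)


class ApacheParser:
    """Hypothetical stand-in exposing parse() and entry.metadata."""

    def parse(self, line):
        match = _COMBINED.match(line)
        if match is None:
            return None
        return SimpleNamespace(metadata=match.groupdict())


def test_parse_apache_combined_with_hyphen_user():
    parser = ApacheParser()
    line = (
        '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
    )
    entry = parser.parse(line)
    assert entry is not None
    assert entry.metadata["user"] == "-"
    # Suggested addition: only the combined branch populates referer.
    assert entry.metadata["referer"] == "-"
    assert entry.metadata["user_agent"] == "Mozilla/5.0"
```

Because a common-format match carries no referer group at all, this assertion fails on a silent fallback, which makes the branch under test explicit.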
