fix Apache combined-format user-field capture class#5
Conversation
Apache's combined log format (and the common format it extends) uses '-' as the literal placeholder for an empty user. The previous COMBINED_PATTERN captured the user field with '\\s+' — one-or-more whitespace — which never matches a literal '-'. The pattern therefore never matched any real combined-format line; parse() silently fell through to COMMON_PATTERN, and the user_agent and referer fields (the entire reason a 'combined' format exists) were dropped on the floor. The fix changes the user capture to '\\S+' (non-whitespace) and adds a mandatory '\\s+' separator after it so the timestamp '[' has a guaranteed preceding space to anchor against. The referer and user_agent captures are also broadened from '[^\"]+' to '[^\"]*' so an empty quoted form is accepted (some log emitters do emit "" for a missing referer instead of '-', and a regex that demands at least one character fails to match those). Two new tests in TestApacheParserCombinedUser cover both the real-username case (where the original bug was masked — the parser silently fell back to common-format) and the hyphen-user case (which was already 'working' in the loose sense that can_parse returned True). The combined-format user_agent and referer fields are now correctly populated in both cases. All 23 tests in tests/test_parsers.py pass.
Reviewer's GuideAdjusts the Apache combined log regex to correctly capture the user field (including '-' and real users) and allow empty referer/user-agent values, and adds regression tests to verify parsing for both real and hyphen users. Flow diagram for ApacheParser.parse combined vs common patternsflowchart TD
A[Log line input] --> B[ApacheParser.parse]
B --> C[Apply COMBINED_PATTERN]
C -->|match| D[Extract host, ident, user, timestamp, request, status, size, referer, user_agent]
D --> E[Return combined-format parsed entry]
C -->|no match| F[Apply COMMON_PATTERN]
F -->|match| G[Extract host, ident, user, timestamp, request, status, size]
G --> H[Return common-format parsed entry]
F -->|no match| I[Raise parse error or return None]
%% COMBINED_PATTERN now uses \S+ for user and [^"]* for referer/user_agent
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Warning Review limit reached
More reviews will be available in 55 minutes and 29 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="tests/test_parsers.py" line_range="207-216" />
<code_context>
+ assert entry.metadata["referer"] == "https://example.com/referer"
+ assert entry.metadata["user_agent"] == "Mozilla/5.0 (X11; Linux) Firefox/120"
+
+ def test_parse_apache_combined_with_hyphen_user(self):
+ parser = ApacheParser()
+ line = (
+ '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] '
+ '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
+ )
+ entry = parser.parse(line)
+ assert entry is not None
+ assert entry.metadata["user"] == "-"
+ assert entry.metadata["user_agent"] == "Mozilla/5.0"
</code_context>
<issue_to_address>
**suggestion (testing):** Add an assertion for the referer in the hyphen-user case to prove the combined format (not common) is used.
Previously, `parse()` fell back to the common pattern for hyphen users, which dropped both `referer` and `user_agent`. This test currently only checks `user` and `user_agent`. Adding `assert entry.metadata["referer"] == "-"` would explicitly verify that the combined pattern is used and that the `referer` field is preserved instead of being lost via the common-pattern fallback.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def test_parse_apache_combined_with_hyphen_user(self): | ||
| parser = ApacheParser() | ||
| line = ( | ||
| '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] ' | ||
| '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"' | ||
| ) | ||
| entry = parser.parse(line) | ||
| assert entry is not None | ||
| assert entry.metadata["user"] == "-" | ||
| assert entry.metadata["user_agent"] == "Mozilla/5.0" |
There was a problem hiding this comment.
suggestion (testing): Add an assertion for the referer in the hyphen-user case to prove the combined format (not common) is used.
Previously, parse() fell back to the common pattern for hyphen users, which dropped both referer and user_agent. This test currently only checks user and user_agent. Adding assert entry.metadata["referer"] == "-" would explicitly verify that the combined pattern is used and that the referer field is preserved instead of being lost via the common-pattern fallback.
The previous
COMBINED_PATTERNcaptured the user field with(?P<user>\\s+)— one-or-more whitespace, which never matches the literal'-'that Apache uses for an empty user. That meant the combined branch never matched any real combined-format line;parse()silently fell through toCOMMON_PATTERN, dropping the user_agent and referer fields (the entire reason a "combined" format exists).The fix changes the user capture to
\\S+(non-whitespace) with a mandatory\\s+separator after it, so the timestamp[has a guaranteed preceding space to anchor against. The referer and user_agent captures are also broadened from[^"]+to[^"]*so an empty quoted form is accepted.Two new tests in
TestApacheParserCombinedUsercover both the real-username case (where the original bug was masked — the parser silently fell back to common-format) and the hyphen-user case. All 23 tests intests/test_parsers.pypass.Summary by Sourcery
Fix Apache combined log parsing so the combined-format branch matches real log lines and preserves user, referer, and user-agent fields.
Bug Fixes:
Tests: