swallow out-of-range numeric timestamps in JSON parser#3
Conversation
…OSError on file read, analyze command raises (OSError, ValueError, RuntimeError)
…attern — the standalone port regex was replacing ports after IP substitution, turning 192.168.1.1:8080 into <IP>:<PORT> instead of <IP>:<PORT>; now IP:port is replaced first as a distinct unit
…00 was being mangled by the standalone port regex; now replaced as a unit before port substitution
…fe80::1, ::1, and 2001:0db8:85a3:0000:0000:8a2e:0370:7334 were all being mangled by the standalone port regex; fe80::1 became 'fe80::<PORT>' so different IPv6 addresses got grouped together as the same pattern; now compressed and full-form IPv6 addresses are replaced as units before the port substitution runs, and the bracketed URL form [::1]:8080 is recognized as well
…ined log format The user field was using \s+ (whitespace only) instead of \S+ (any non-whitespace), so the COMBINED_PATTERN could never match a real combined log line — every real combined log has '-' or a username in that position. Because parse() always tried COMBINED_PATTERN first and silently fell back to COMMON_PATTERN when it failed, the 'referer' and 'user_agent' fields were never actually extracted from any combined log line. Changed the user field to \S+ and pulled the whitespace before the '[' into its own \s+ so the pattern matches a real combined log and the optional referer/user-agent capture group is exercised.
…mtimestamp raises ValueError (year out of range, e.g. a 1e15s Unix timestamp = year 33658) and OverflowError (1e308 — larger than the platform time_t), and a negative timestamp can raise OSError on some platforms. _extract_timestamp is internal and its contract is 'returns Optional[datetime]', so a busted numeric field is just 'no timestamp', not 'the whole JSON line is unparseable'. The old code let the exception escape parse(), which dropped the entire ParsedEntry (level, message, source, metadata all gone) and surfaced a traceback to the CLI caller for what should be a per-line parse failure. Wrapping the call in try/except (ValueError, OSError, OverflowError) keeps the rest of the entry, sets timestamp to None, and lets the CLI analyze the other 10,000 lines in the file as usual.
Reviewer's GuideThis PR hardens JSON and Apache log parsing, normalizes IP/port patterns more robustly (including IPv6), tweaks CLI exception handling, and slightly refines time-bucketing text output, while adding tests to pin the new behaviors. Sequence diagram for JSONLogParser.parse handling out-of-range numeric timestampssequenceDiagram
actor User
participant CLI as LogAnalyzerCLI
participant Parser as JSONLogParser
User->>CLI: analyze
CLI->>Parser: parse(line)
Parser->>Parser: json.loads(line)
alt valid JSON
Parser->>Parser: _extract_timestamp(data)
alt timestamp in range
Parser-->>Parser: timestamp datetime
else timestamp out of range
Parser-->>Parser: raise ValueError/OSError/OverflowError
Parser->>Parser: catch (ValueError, OSError, OverflowError)
Parser-->>Parser: timestamp = None
end
Parser->>Parser: _extract_level(data)
Parser->>Parser: _extract_message(data)
Parser-->>CLI: ParsedEntry(timestamp or None,...)
else invalid JSON
Parser-->>CLI: None
end
Sequence diagram for narrowed CLI exception handling in analyzesequenceDiagram
actor User
participant CLI as LogAnalyzerCLI
User->>CLI: analyze
CLI->>CLI: process_logs()
alt raises OSError/ValueError/RuntimeError
CLI-->>CLI: catch (OSError, ValueError, RuntimeError)
CLI->>User: echo("Error: {e}")
CLI->>User: sys.exit(1)
else no handled exception
CLI-->>User: print analysis output
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThe PR improves the log analyzer's robustness and correctness across multiple components. JSON parser timestamps now gracefully handle out-of-range values instead of failing entire lines, Apache combined-log parsing fixes the user field regex, CLI error handling becomes more specific, time distribution display formatting is corrected, and error pattern normalization better handles IPv4 and IPv6 addresses with ports. ChangesParser, CLI, and Utility Robustness Enhancements
🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Hey - I've found 3 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="src/log_analyzer_cli/cli.py" line_range="136" />
<code_context>
click.echo(output_str)
- except Exception as e:
+ except (OSError, ValueError, RuntimeError) as e:
click.echo(f"Error: {e}", err=True)
sys.exit(1)
</code_context>
<issue_to_address>
**issue (bug_risk):** Narrowing the caught exceptions in `analyze` may surface raw tracebacks (e.g. ClickException) to the user instead of a clean error message.
Previously this block caught all `Exception`, so any unexpected failure (including `click.ClickException` and other non-`ValueError`/`RuntimeError` parsing errors) was turned into a consistent `Error: ...` message and exit code. With the narrower tuple, those now surface as raw tracebacks, changing CLI output and potentially breaking scripts that depend on the current error format/exit status. If the intention is to avoid hiding programmer bugs, consider instead catching a project-specific exception (e.g. `LogAnalyzerError`) or at least including `click.ClickException`, while allowing truly unexpected exceptions to propagate.
</issue_to_address>
### Comment 2
<location path="src/log_analyzer_cli/cli.py" line_range="168" />
<code_context>
if line:
sample_lines.append(line)
- except Exception as e:
+ except OSError as e:
click.echo(f"Warning: Could not read file: {e}", err=True)
return None
</code_context>
<issue_to_address>
**issue (bug_risk):** Catching only `OSError` in `_get_parser` means decoding errors (e.g. `UnicodeDecodeError`) will now crash instead of yielding a warning and continuing.
Previously, any read failure—including encoding errors—was treated as non-fatal, with a warning and `None` returned. Restricting this to `OSError` means `UnicodeDecodeError` and other non-`OSError` read issues will now crash the CLI, changing behavior for users with misencoded files. If you still want “can’t read this file” to be non-fatal, please also catch `UnicodeError`/`UnicodeDecodeError` or otherwise handle these consistently with your existing I/O warnings.
</issue_to_address>
### Comment 3
<location path="tests/test_parsers.py" line_range="231-240" />
<code_context>
assert parser_class is None
+
+
+class TestApacheCombinedPatternExtraction:
+ """Tests that the COMBINED_PATTERN regex actually matches a real combined log
+ and properly extracts referer and user_agent fields.
+ """
+
+ def test_combined_pattern_extracts_referer(self):
+ parser = ApacheParser()
+ line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"'
+ entry = parser.parse(line)
+ assert entry is not None
+ assert entry.metadata.get("referer") == "https://example.com/"
+ assert entry.metadata.get("user_agent") == "Mozilla/5.0"
+
+ def test_combined_pattern_matches_with_dash_user(self):
</code_context>
<issue_to_address>
**suggestion (testing):** Add an assertion on the parsed timestamp or other core fields to fully exercise the updated Apache combined log regex
The new tests cover the added groups (`referer`, `user_agent`, `user`), but the regex change also modified the `timestamp` capture and surrounding whitespace. Please add an assertion that `entry.timestamp` is non-`None` and equals the expected datetime so we also validate timestamp parsing isn’t regressed by this change.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| click.echo(output_str) | ||
|
|
||
| except Exception as e: | ||
| except (OSError, ValueError, RuntimeError) as e: |
There was a problem hiding this comment.
issue (bug_risk): Narrowing the caught exceptions in analyze may surface raw tracebacks (e.g. ClickException) to the user instead of a clean error message.
Previously this block caught all Exception, so any unexpected failure (including click.ClickException and other non-ValueError/RuntimeError parsing errors) was turned into a consistent Error: ... message and exit code. With the narrower tuple, those now surface as raw tracebacks, changing CLI output and potentially breaking scripts that depend on the current error format/exit status. If the intention is to avoid hiding programmer bugs, consider instead catching a project-specific exception (e.g. LogAnalyzerError) or at least including click.ClickException, while allowing truly unexpected exceptions to propagate.
| if line: | ||
| sample_lines.append(line) | ||
| except Exception as e: | ||
| except OSError as e: |
There was a problem hiding this comment.
issue (bug_risk): Catching only OSError in _get_parser means decoding errors (e.g. UnicodeDecodeError) will now crash instead of yielding a warning and continuing.
Previously, any read failure—including encoding errors—was treated as non-fatal, with a warning and None returned. Restricting this to OSError means UnicodeDecodeError and other non-OSError read issues will now crash the CLI, changing behavior for users with misencoded files. If you still want “can’t read this file” to be non-fatal, please also catch UnicodeError/UnicodeDecodeError or otherwise handle these consistently with your existing I/O warnings.
| class TestApacheCombinedPatternExtraction: | ||
| """Tests that the COMBINED_PATTERN regex actually matches a real combined log | ||
| and properly extracts referer and user_agent fields. | ||
| """ | ||
|
|
||
| def test_combined_pattern_extracts_referer(self): | ||
| parser = ApacheParser() | ||
| line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"' | ||
| entry = parser.parse(line) | ||
| assert entry is not None |
There was a problem hiding this comment.
suggestion (testing): Add an assertion on the parsed timestamp or other core fields to fully exercise the updated Apache combined log regex
The new tests cover the added groups (referer, user_agent, user), but the regex change also modified the timestamp capture and surrounding whitespace. Please add an assertion that entry.timestamp is non-None and equals the expected datetime so we also validate timestamp parsing isn’t regressed by this change.
JSONLogParser.parse() called _extract_timestamp(data) without a try/except, so any numeric timestamp that datetime.fromtimestamp can't represent — a 1e15-second value (year 33658), a -1e15-second value (year -31686769), a 1e308 (OverflowError, larger than the platform time_t), or some negative values on platforms that raise OSError — raised an unhandled exception out of parse(). That dropped the entire ParsedEntry (level, message, source, metadata all gone) and surfaced a traceback to the CLI caller for what should have been a per-line parse failure. The internal _extract_timestamp contract is 'returns Optional[datetime]', so a busted numeric field is just 'no timestamp', not 'the whole JSON line is unparseable'.
This commit wraps the _extract_timestamp call in try/except (ValueError, OSError, OverflowError) and sets timestamp = None on failure. The level/message/source/metadata are still extracted, and a JSON log file with one busted timestamp (very common in cloud logs that mix Unix-epoch-seconds and Unix-epoch-milliseconds, or where a clock skew produces a far-future timestamp) no longer aborts the whole analyze run.
Tests:
Summary by Sourcery
Handle invalid timestamps and improve log parsing and normalization robustness across JSON, Apache, and CLI layers.
Bug Fixes:
Enhancements:
Tests:
Summary by CodeRabbit
Bug Fixes