Skip to content

swallow out-of-range numeric timestamps in JSON parser#3

Open
HrachShah wants to merge 7 commits into
mainfrom
fix/json-parser-out-of-range-timestamp
Open

swallow out-of-range numeric timestamps in JSON parser#3
HrachShah wants to merge 7 commits into
mainfrom
fix/json-parser-out-of-range-timestamp

Conversation

@HrachShah

@HrachShah HrachShah commented Jun 9, 2026

Copy link
Copy Markdown
Owner

JSONLogParser.parse() called _extract_timestamp(data) without a try/except, so any numeric timestamp that datetime.fromtimestamp can't represent — a 1e15-second value (year 33658), a -1e15-second value (year -31686769), a 1e308 (OverflowError, larger than the platform time_t), or some negative values on platforms that raise OSError — raised an unhandled exception out of parse(). That dropped the entire ParsedEntry (level, message, source, metadata all gone) and surfaced a traceback to the CLI caller for what should have been a per-line parse failure. The internal _extract_timestamp contract is 'returns Optional[datetime]', so a busted numeric field is just 'no timestamp', not 'the whole JSON line is unparseable'.

This commit wraps the _extract_timestamp call in try/except (ValueError, OSError, OverflowError) and sets timestamp = None on failure. The level/message/source/metadata are still extracted, and a JSON log file with one busted timestamp (very common in cloud logs that mix Unix-epoch-seconds and Unix-epoch-milliseconds, or where a clock skew produces a far-future timestamp) no longer aborts the whole analyze run.

Tests:

  • existing 49 tests still pass
  • 2 new tests cover the out-of-range case (1e15, -1e15, 1e308) and the unaffected happy path

Open in Capy

Summary by Sourcery

Handle invalid timestamps and improve log parsing and normalization robustness across JSON, Apache, and CLI layers.

Bug Fixes:

  • Preserve JSON log entries with out-of-range numeric timestamps by treating them as missing timestamps instead of failing the entire parse.
  • Correct Apache combined log regex to properly capture the user and timestamp fields in real-world log lines.
  • Prevent overly broad exception handling in the CLI analyze command and file sampling helper by narrowing caught exceptions to expected error types.

Enhancements:

  • Extend error pattern normalization to recognize and normalize IPv6 addresses and IP:port combinations more accurately in log messages.
  • Refine time-distribution bucketing output to produce explicit minute-and-second bucket labels in text reports.
  • Add tests for JSON timestamp edge cases and Apache combined log parsing to validate the updated parsing behavior.

Tests:

  • Introduce tests ensuring out-of-range JSON numeric timestamps drop only the timestamp field while preserving the rest of the entry.
  • Add Apache combined log parser tests to verify referer, user, and user_agent extraction from combined log format lines.

Summary by CodeRabbit

Bug Fixes

  • Apache combined log format now correctly parses user fields in log entries
  • JSON logs with invalid or out-of-range timestamps are retained instead of being dropped
  • Time distribution reports now use corrected interval bucketing calculations
  • Improved IP address and IPv6 normalization for more accurate pattern matching

Zo Bot added 7 commits May 14, 2026 11:47
…OSError on file read, analyze command raises (OSError, ValueError, RuntimeError)
…attern — the standalone port regex was replacing ports after IP substitution, turning 192.168.1.1:8080 into <IP>:<PORT> instead of <IP>:<PORT>; now IP:port is replaced first as a distinct unit
…00 was being mangled by the standalone port regex; now replaced as a unit before port substitution
…fe80::1, ::1, and 2001:0db8:85a3:0000:0000:8a2e:0370:7334 were all being mangled by the standalone port regex; fe80::1 became 'fe80::<PORT>' so different IPv6 addresses got grouped together as the same pattern; now compressed and full-form IPv6 addresses are replaced as units before the port substitution runs, and the bracketed URL form [::1]:8080 is recognized as well
…ined log format

The user field was using \s+ (whitespace only) instead of \S+ (any non-whitespace), so the COMBINED_PATTERN could never match a real combined log line — every real combined log has '-' or a username in that position. Because parse() always tried COMBINED_PATTERN first and silently fell back to COMMON_PATTERN when it failed, the 'referer' and 'user_agent' fields were never actually extracted from any combined log line. Changed the user field to \S+ and pulled the whitespace before the '[' into its own \s+ so the pattern matches a real combined log and the optional referer/user-agent capture group is exercised.
…mtimestamp raises ValueError (year out of range, e.g. a 1e15s Unix timestamp = year 33658) and OverflowError (1e308 — larger than the platform time_t), and a negative timestamp can raise OSError on some platforms. _extract_timestamp is internal and its contract is 'returns Optional[datetime]', so a busted numeric field is just 'no timestamp', not 'the whole JSON line is unparseable'. The old code let the exception escape parse(), which dropped the entire ParsedEntry (level, message, source, metadata all gone) and surfaced a traceback to the CLI caller for what should be a per-line parse failure. Wrapping the call in try/except (ValueError, OSError, OverflowError) keeps the rest of the entry, sets timestamp to None, and lets the CLI analyze the other 10,000 lines in the file as usual.
@sourcery-ai

sourcery-ai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Reviewer's Guide

This PR hardens JSON and Apache log parsing, normalizes IP/port patterns more robustly (including IPv6), tweaks CLI exception handling, and slightly refines time-bucketing text output, while adding tests to pin the new behaviors.

Sequence diagram for JSONLogParser.parse handling out-of-range numeric timestamps

sequenceDiagram
    actor User
    participant CLI as LogAnalyzerCLI
    participant Parser as JSONLogParser

    User->>CLI: analyze
    CLI->>Parser: parse(line)
    Parser->>Parser: json.loads(line)
    alt valid JSON
        Parser->>Parser: _extract_timestamp(data)
        alt timestamp in range
            Parser-->>Parser: timestamp datetime
        else timestamp out of range
            Parser-->>Parser: raise ValueError/OSError/OverflowError
            Parser->>Parser: catch (ValueError, OSError, OverflowError)
            Parser-->>Parser: timestamp = None
        end
        Parser->>Parser: _extract_level(data)
        Parser->>Parser: _extract_message(data)
        Parser-->>CLI: ParsedEntry(timestamp or None,...)
    else invalid JSON
        Parser-->>CLI: None
    end
Loading

Sequence diagram for narrowed CLI exception handling in analyze

sequenceDiagram
    actor User
    participant CLI as LogAnalyzerCLI

    User->>CLI: analyze
    CLI->>CLI: process_logs()
    alt raises OSError/ValueError/RuntimeError
        CLI-->>CLI: catch (OSError, ValueError, RuntimeError)
        CLI->>User: echo("Error: {e}")
        CLI->>User: sys.exit(1)
    else no handled exception
        CLI-->>User: print analysis output
    end
Loading

File-Level Changes

Change Details Files
Ensure JSON parser treats out-of-range numeric timestamps as missing instead of aborting parsing for the entire line.
  • Wrap _extract_timestamp call in a try/except for ValueError, OSError, and OverflowError within parse().
  • On timestamp extraction failure, set timestamp to None but still extract level, message, and other fields.
  • Add tests that verify extremely large, extremely small, and overflow numeric timestamps yield entries with timestamp=None while preserving other fields.
  • Add a sanity test ensuring valid numeric timestamps still parse into non-None datetime values.
src/log_analyzer_cli/parsers/json_log.py
tests/test_parsers.py
Fix Apache combined log regex to correctly capture the user and timestamp fields and add coverage for referer/user_agent extraction.
  • Adjust the COMBINED_PATTERN to use a non-whitespace token for user, add explicit whitespace before the timestamp, and keep the timestamp group unchanged.
  • Add tests to verify combined log lines are parsed and that referer, user_agent, and user fields are correctly extracted into metadata.
src/log_analyzer_cli/parsers/apache.py
tests/test_parsers.py
Improve error message normalization to handle IPv6 addresses and IP:port formats consistently in CLI output.
  • Change IPv4 IP:port replacement to use ':' instead of ''.
  • Add regexes to normalize IPv6 with zone ID and port, bracketed IPv6 with port, full-form IPv6, and compressed IPv6 forms, ensuring ports are not misidentified inside IPv6 literals.
  • Normalize plain IPv4 addresses to '' after handling IP:port cases.
src/log_analyzer_cli/utils.py
Narrow CLI exception handling to avoid swallowing unexpected errors and adjust file-read warnings to match expected failure modes.
  • Change analyze() top-level exception handler to catch only OSError, ValueError, and RuntimeError, exiting with an error message on these known failure types.
  • Restrict the _get_parser() file-reading try/except to catch only OSError, returning None and printing a warning when the file cannot be read.
src/log_analyzer_cli/cli.py
Refine text time-interval bucketing output format for time distributions.
  • Make the minute bucket computation explicit with parentheses for readability.
  • Include ':00' seconds in the formatted bucket key strings, changing from 'YYYY-MM-DD HH:MM' to 'YYYY-MM-DD HH:MM:00'.
src/log_analyzer_cli/formatters/text.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5a1a25f7-1d40-4348-b1dd-b8b3ae3b3d2e

📥 Commits

Reviewing files that changed from the base of the PR and between e93757f and 9224045.

📒 Files selected for processing (6)
  • src/log_analyzer_cli/cli.py
  • src/log_analyzer_cli/formatters/text.py
  • src/log_analyzer_cli/parsers/apache.py
  • src/log_analyzer_cli/parsers/json_log.py
  • src/log_analyzer_cli/utils.py
  • tests/test_parsers.py

📝 Walkthrough

Walkthrough

The PR improves the log analyzer's robustness and correctness across multiple components. JSON parser timestamps now gracefully handle out-of-range values instead of failing entire lines, Apache combined-log parsing fixes the user field regex, CLI error handling becomes more specific, time distribution display formatting is corrected, and error pattern normalization better handles IPv4 and IPv6 addresses with ports.

Changes

Parser, CLI, and Utility Robustness Enhancements

Layer / File(s) Summary
JSON parser timestamp resilience
src/log_analyzer_cli/parsers/json_log.py, tests/test_parsers.py
Timestamp extraction is guarded with try/except catching ValueError, OSError, and OverflowError; timestamp is set to None instead of failing the line. Tests verify out-of-range and valid numeric timestamp handling.
Apache combined log user field pattern
src/log_analyzer_cli/parsers/apache.py, tests/test_parsers.py
COMBINED_PATTERN user capture group changed from \s+ to \S+ to match non-whitespace instead of whitespace. Tests confirm correct extraction of referer, user_agent, and user fields including dash placeholders.
CLI exception handling narrowing
src/log_analyzer_cli/cli.py
The analyze command handler and _get_parser helper now catch specific exception types (OSError, ValueError, RuntimeError for analyze; OSError for _get_parser) instead of all Exception, allowing unexpected errors to propagate.
Time distribution interval formatting
src/log_analyzer_cli/formatters/text.py
Minute bucket calculation uses integer division and multiplication; interval keys now append explicit :00 seconds, changing time-based log grouping granularity.
Error pattern IP and port normalization
src/log_analyzer_cli/utils.py
normalize_error_pattern expands IP preprocessing: IPv4 IP:port becomes <IP>:<PORT>, multiple IPv6 variants (zone-ID, bracketed, full, compressed) are normalized to <IP> or [<IP>]:<PORT>, with port-handling logic positioned before standalone port substitution to prevent IPv6 corruption.

🎯 2 (Simple) | ⏱️ ~12 minutes

A rabbit hops through logs with glee,
Catching errors gracefully,
Timestamps, ports, and patterns refined,
Five small fixes, well-designed! 🐰✨
Robustness blooms where bugs once grew.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'swallow out-of-range numeric timestamps in JSON parser' directly and specifically describes the main change: handling out-of-range timestamps in the JSON parser by catching exceptions instead of propagating them.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/json-parser-out-of-range-timestamp
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/json-parser-out-of-range-timestamp

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 3 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src/log_analyzer_cli/cli.py" line_range="136" />
<code_context>
         click.echo(output_str)

-    except Exception as e:
+    except (OSError, ValueError, RuntimeError) as e:
         click.echo(f"Error: {e}", err=True)
         sys.exit(1)
</code_context>
<issue_to_address>
**issue (bug_risk):** Narrowing the caught exceptions in `analyze` may surface raw tracebacks (e.g. ClickException) to the user instead of a clean error message.

Previously this block caught all `Exception`, so any unexpected failure (including `click.ClickException` and other non-`ValueError`/`RuntimeError` parsing errors) was turned into a consistent `Error: ...` message and exit code. With the narrower tuple, those now surface as raw tracebacks, changing CLI output and potentially breaking scripts that depend on the current error format/exit status. If the intention is to avoid hiding programmer bugs, consider instead catching a project-specific exception (e.g. `LogAnalyzerError`) or at least including `click.ClickException`, while allowing truly unexpected exceptions to propagate.
</issue_to_address>

### Comment 2
<location path="src/log_analyzer_cli/cli.py" line_range="168" />
<code_context>
                 if line:
                     sample_lines.append(line)
-    except Exception as e:
+    except OSError as e:
         click.echo(f"Warning: Could not read file: {e}", err=True)
         return None
</code_context>
<issue_to_address>
**issue (bug_risk):** Catching only `OSError` in `_get_parser` means decoding errors (e.g. `UnicodeDecodeError`) will now crash instead of yielding a warning and continuing.

Previously, any read failure—including encoding errors—was treated as non-fatal, with a warning and `None` returned. Restricting this to `OSError` means `UnicodeDecodeError` and other non-`OSError` read issues will now crash the CLI, changing behavior for users with misencoded files. If you still want “can’t read this file” to be non-fatal, please also catch `UnicodeError`/`UnicodeDecodeError` or otherwise handle these consistently with your existing I/O warnings.
</issue_to_address>

### Comment 3
<location path="tests/test_parsers.py" line_range="231-240" />
<code_context>
         assert parser_class is None
+
+
+class TestApacheCombinedPatternExtraction:
+    """Tests that the COMBINED_PATTERN regex actually matches a real combined log
+    and properly extracts referer and user_agent fields.
+    """
+
+    def test_combined_pattern_extracts_referer(self):
+        parser = ApacheParser()
+        line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"'
+        entry = parser.parse(line)
+        assert entry is not None
+        assert entry.metadata.get("referer") == "https://example.com/"
+        assert entry.metadata.get("user_agent") == "Mozilla/5.0"
+
+    def test_combined_pattern_matches_with_dash_user(self):
</code_context>
<issue_to_address>
**suggestion (testing):** Add an assertion on the parsed timestamp or other core fields to fully exercise the updated Apache combined log regex

The new tests cover the added groups (`referer`, `user_agent`, `user`), but the regex change also modified the `timestamp` capture and surrounding whitespace. Please add an assertion that `entry.timestamp` is non-`None` and equals the expected datetime so we also validate timestamp parsing isn’t regressed by this change.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

click.echo(output_str)

except Exception as e:
except (OSError, ValueError, RuntimeError) as e:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Narrowing the caught exceptions in analyze may surface raw tracebacks (e.g. ClickException) to the user instead of a clean error message.

Previously this block caught all Exception, so any unexpected failure (including click.ClickException and other non-ValueError/RuntimeError parsing errors) was turned into a consistent Error: ... message and exit code. With the narrower tuple, those now surface as raw tracebacks, changing CLI output and potentially breaking scripts that depend on the current error format/exit status. If the intention is to avoid hiding programmer bugs, consider instead catching a project-specific exception (e.g. LogAnalyzerError) or at least including click.ClickException, while allowing truly unexpected exceptions to propagate.

if line:
sample_lines.append(line)
except Exception as e:
except OSError as e:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Catching only OSError in _get_parser means decoding errors (e.g. UnicodeDecodeError) will now crash instead of yielding a warning and continuing.

Previously, any read failure—including encoding errors—was treated as non-fatal, with a warning and None returned. Restricting this to OSError means UnicodeDecodeError and other non-OSError read issues will now crash the CLI, changing behavior for users with misencoded files. If you still want “can’t read this file” to be non-fatal, please also catch UnicodeError/UnicodeDecodeError or otherwise handle these consistently with your existing I/O warnings.

Comment thread tests/test_parsers.py
Comment on lines +231 to +240
class TestApacheCombinedPatternExtraction:
"""Tests that the COMBINED_PATTERN regex actually matches a real combined log
and properly extracts referer and user_agent fields.
"""

def test_combined_pattern_extracts_referer(self):
parser = ApacheParser()
line = '192.168.1.10 - - [20/Mar/2025:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "https://example.com/" "Mozilla/5.0"'
entry = parser.parse(line)
assert entry is not None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add an assertion on the parsed timestamp or other core fields to fully exercise the updated Apache combined log regex

The new tests cover the added groups (referer, user_agent, user), but the regex change also modified the timestamp capture and surrounding whitespace. Please add an assertion that entry.timestamp is non-None and equals the expected datetime so we also validate timestamp parsing isn’t regressed by this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant