ait-detectmate · viktorbeck98 · Mar 17, 2026 · Mar 7, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,102 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+DetectMateLibrary is a Python library for log processing and anomaly detection. It provides composable, stream-friendly components (parsers and detectors) that communicate via Protobuf-based schemas. The library is designed for both single-process and microservice deployments.
+
+## Development Commands
+
+```bash
+# Install dependencies and pre-commit hooks
+uv sync --dev
+uv run prek install
+
+# Run tests
+uv run pytest -q
+uv run pytest -s                                      # show stdout (disables output capture)
+uv run pytest --cov=. --cov-report=term-missing       # with coverage
+uv run pytest tests/test_foo.py                       # single test file
+
+# Run linting/formatting (all pre-commit hooks)
+uv run prek run -a
+
+# Recompile Protobuf (only if schemas.proto is modified)
+protoc --proto_path=src/detectmatelibrary/schemas/ \
+  --python_out=src/detectmatelibrary/schemas/ \
+  src/detectmatelibrary/schemas/schemas.proto
+
+# Scaffold a new component workspace
+mate create --type <parser|detector> --name <name> --dir <target_dir>
+```
+
+## Architecture
+
+### Data Flow
+
+```
+Raw Logs → Parser → ParserSchema → Detector → DetectorSchema (Alerts)
+```
+
+All data flows through typed Protobuf-backed schema objects. Components are stateful and support an optional training phase before detection.
+
+### Core Abstractions (`src/detectmatelibrary/common/`)
+
+- **`CoreComponent`** — base class managing buffering, ID generation, and training state
+  - **`CoreParser(CoreComponent)`** — parse raw logs into `ParserSchema`
+  - **`CoreDetector(CoreComponent)`** — detect anomalies in `ParserSchema`, emit `DetectorSchema`
+- **`CoreConfig`** / **`CoreParserConfig`** / **`CoreDetectorConfig`** — Pydantic-based configuration hierarchy
+
+### Schema System (`src/detectmatelibrary/schemas/`)
+
+- `BaseSchema` wraps generated Protobuf messages with dict-like access (`schema["field"]`)
+- Key schemas: `LogSchema`, `ParserSchema`, `DetectorSchema`
+- Support serialization to/from bytes for transport and persistence
+
+### Buffering Modes (`src/detectmatelibrary/utils/data_buffer.py`)
+
+Three modes via `ArgsBuffer` config:
+- **NO_BUF** — one item at a time (default)
+- **BATCH** — accumulate N items, process as batch
+- **WINDOW** — sliding window of size N
+
+### Implementations
+
+- **Parsers** (`src/detectmatelibrary/parsers/`): `JsonParser`, `DummyParser`, `TemplateMatcherParser` (uses Drain3 for template mining)
+- **Detectors** (`src/detectmatelibrary/detectors/`): `NewValueDetector`, `NewValueComboDetector`, `RandomDetector`, `DummyDetector`
+- **Utilities** (`src/detectmatelibrary/utils/`): `DataBuffer`, `EventPersistency`, `KeyExtractor`, `TimeFormatHandler`, `IdGenerator`
+
+## Extending the Library
+
+Implement a custom detector by subclassing `CoreDetector`:
+
+```python
+class MyDetectorConfig(CoreDetectorConfig):
+    method_type: str = "my_detector"
+    my_param: int = 10
+
+class MyDetector(CoreDetector):
+    def __init__(self, name="MyDetector", config=MyDetectorConfig()):
+        super().__init__(name=name, config=config)
+
+    def train(self, input_: ParserSchema) -> None:
+        pass  # optional
+
+    def detect(self, input_: ParserSchema, output_: DetectorSchema) -> bool:
+        output_["detectorID"] = self.name
+        output_["score"] = 0.0
+        return False  # True = anomaly detected
+```
+
+The same pattern applies to `CoreParser`: implement `parse(input_: LogSchema, output_: ParserSchema) -> bool`.
+
+## Code Quality
+
+Pre-commit hooks enforce:
+- **mypy** strict mode
+- **flake8** linting, **autopep8** formatting (max line length 110)
+- **bandit** security checks, **vulture** dead-code detection (70% threshold)
+- **docformatter** docstring style
+
+Python 3.12 is required (see `.python-version`).
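Note on the "Buffering Modes" section of CLAUDE.md: the three modes are easier to tell apart with a small example. The following is a minimal sketch in plain Python, not the library's `DataBuffer`. The class name `SketchBuffer`, its `add` method, and the mode strings are hypothetical, and it assumes that BATCH clears its buffer after each emitted batch while WINDOW slides forward by one item.

```python
from collections import deque
from typing import Any, Callable, Optional


class SketchBuffer:
    """Hypothetical stand-in for the three modes (not the library's DataBuffer)."""

    def __init__(self, mode: str, process: Callable[[Any], Any], size: int = 1) -> None:
        if mode not in {"NO_BUF", "BATCH", "WINDOW"}:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.process = process
        self.size = size
        # In WINDOW mode the bounded deque drops the oldest item on overflow.
        self.items: deque = deque(maxlen=size if mode == "WINDOW" else None)

    def add(self, item: Any) -> Optional[Any]:
        # NO_BUF: hand every item straight to the processing function.
        if self.mode == "NO_BUF":
            return self.process(item)
        self.items.append(item)
        # BATCH and WINDOW both wait until `size` items are available.
        if len(self.items) < self.size:
            return None
        snapshot = list(self.items)
        # BATCH (assumed behavior): start over with an empty buffer.
        if self.mode == "BATCH":
            self.items.clear()
        return self.process(snapshot)
```

Under these assumptions, feeding the items 1, 2, 3 with `size=2` yields one batch `[1, 2]` in BATCH mode, and two windows `[1, 2]` and `[2, 3]` in WINDOW mode.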
diff --git a/docs/parsers.md b/docs/parsers.md
@@ -102,4 +102,10 @@ def test_my_parser_parse():
     assert out["variables"] == ["a", "b", "c"]
 ```
 
+## Available parsers
+
+- [JSON Parser](parsers/json_parser.md): extracts structured fields from JSON-formatted logs.
+- [Template Matcher](parsers/template_matcher.md): matches logs against a predefined set of `<*>` templates.
+- [LogBatcher Parser](parsers/logbatcher_parser.md): LLM-based parser that infers templates from raw logs with no training data.
+
 Go back to [Index](index.md)
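Note on the `<*>` template notation used in the parser list: a template can be read as a pattern in which each `<*>` captures one variable. The sketch below shows that idea with the standard `re` module only. `template_to_regex` and `extract_variables` are hypothetical helper names, and this is not how `TemplateMatcherParser` or `LogBatcherParser` is implemented.

```python
import re
from typing import List, Optional


def template_to_regex(template: str) -> "re.Pattern":
    """Compile a `<*>` template into a regex with one capture group per wildcard."""
    # Escape the literal text between wildcards so it is matched verbatim.
    literal_parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + "(.*?)".join(literal_parts) + "$")


def extract_variables(template: str, log: str) -> Optional[List[str]]:
    """Return the wildcard values in order of appearance, or None if the log does not match."""
    match = template_to_regex(template).match(log)
    return list(match.groups()) if match else None
```

For the template `User <*> logged in from <*>` and the log `User admin logged in from 192.168.1.10`, `extract_variables` returns `["admin", "192.168.1.10"]`. A log that does not fit the template returns `None`.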
diff --git a/docs/parsers/logbatcher_parser.md b/docs/parsers/logbatcher_parser.md
@@ -0,0 +1,106 @@
+# LogBatcher Parser
+
+LLM-based log parser that infers event templates from raw log messages using any OpenAI-compatible model. No training data or labeled examples are required.
+
+|            | Schema                        | Description                              |
+|------------|-------------------------------|------------------------------------------|
+| **Input**  | [LogSchema](../schemas.md)    | Raw log string                           |
+| **Output** | [ParserSchema](../schemas.md) | Structured log with template and variables |
+
+## Overview
+
+`LogBatcherParser` wraps the [LogBatcher](https://github.com/LogIntelligence/LogBatcher) engine (MIT, LogIntelligence 2024) as a `CoreParser`. Parsing proceeds in two phases:
+
+1. **Cache lookup** — the incoming log is matched against previously seen templates using a hash-based exact match followed by a tree-based similarity check. If a match is found, no LLM call is made.
+2. **LLM query** — on a cache miss, the log is submitted to the configured model. The returned template is stored in the cache for future reuse.
+
+Variable slots in templates use the `<*>` wildcard notation (e.g. `User <*> logged in from <*>`). Extracted variables are written to `output_["variables"]` in order of appearance.
+
+## Configuration
+
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `method_type` | string | `"logbatcher_parser"` | Parser type identifier |
+| `model` | string | `"gpt-4o-mini"` | Model name passed to the OpenAI-compatible endpoint |
+| `api_key` | string | `""` | API key for the chosen provider |
+| `base_url` | string | `""` | Base URL of the OpenAI-compatible endpoint. Leave empty to use the default OpenAI endpoint |
+| `batch_size` | int | `10` | Maximum number of logs submitted per LLM call |
+
+Example YAML fragment (OpenAI):
+
+```yaml
+parsers:
+  LogBatcherParser:
+    method_type: logbatcher_parser
+    params:
+      model: "gpt-4o-mini"
+      api_key: "<YOUR_API_KEY>"
+      batch_size: 10
+```
+
+Example YAML fragment (local Ollama):
+
+```yaml
+parsers:
+  LogBatcherParser:
+    method_type: logbatcher_parser
+    params:
+      model: "llama3"
+      api_key: "ollama"
+      base_url: "http://localhost:11434/v1"
+      batch_size: 10
+```
+
+## Usage examples
+
+Basic usage — parse a raw log and read the inferred template:
+
+```python
+from detectmatelibrary.parsers.logbatcher import LogBatcherParser, LogBatcherParserConfig
+import detectmatelibrary.schemas as schemas
+
+config = LogBatcherParserConfig(
+    api_key="<YOUR_API_KEY>",
+    model="gpt-4o-mini",
+    batch_size=10,
+)
+
+parser = LogBatcherParser(name="LogBatcherParser", config=config)
+
+input_log = schemas.LogSchema({
+    "logID": "1",
+    "log": "User admin logged in from 192.168.1.10",
+})
+
+output = schemas.ParserSchema()
+parser.parse(input_log, output)
+
+print(output["template"])    # e.g. "User <*> logged in from <*>"
+print(output["variables"])   # e.g. ["admin", "192.168.1.10"]
+print(output["EventID"])     # integer index assigned by the cache
+```
+
+Using a local Ollama instance:
+
+```python
+config = LogBatcherParserConfig(
+    api_key="ollama",
+    model="llama3",
+    base_url="http://localhost:11434/v1",
+    batch_size=10,
+)
+parser = LogBatcherParser(name="LogBatcherParser", config=config)
+```
+
+Passing config as a dict:
+
+```python
+parser = LogBatcherParser(config={
+    "method_type": "logbatcher_parser",
+    "api_key": "<YOUR_API_KEY>",
+    "model": "gpt-4o-mini",
+    "batch_size": 10,
+})
+```
+
+Go back to [Index](../index.md)
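Note on the two parsing phases described in `logbatcher_parser.md`: the cache-then-model flow can be sketched as follows. This is an illustration under stated assumptions, not the LogBatcher engine. `TemplateCache`, `fake_model`, and `model_calls` are hypothetical names, the tree-based similarity check is replaced by a plain linear scan over known templates, and the LLM is replaced by a local function that masks digit runs.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def fake_model(log: str) -> str:
    """Stand-in for the LLM call: replaces digit runs with the `<*>` wildcard."""
    return re.sub(r"\d+", "<*>", log)


class TemplateCache:
    """Hypothetical two-phase lookup: exact match, template scan, then model query."""

    def __init__(self, query_model: Callable[[str], str]) -> None:
        self.query_model = query_model
        self.exact: Dict[str, int] = {}   # raw log -> event ID
        self.templates: List[str] = []    # event ID -> template
        self.model_calls = 0

    def _match_template(self, log: str) -> Optional[int]:
        # Linear scan used here in place of the tree-based similarity check.
        for event_id, template in enumerate(self.templates):
            pattern = "(.*?)".join(re.escape(p) for p in template.split("<*>"))
            if re.fullmatch(pattern, log):
                return event_id
        return None

    def parse(self, log: str) -> Tuple[int, str]:
        # Phase 1: cache lookup (exact hash match, then known templates).
        event_id = self.exact.get(log)
        if event_id is None:
            event_id = self._match_template(log)
        # Phase 2: only a cache miss reaches the model.
        if event_id is None:
            self.model_calls += 1
            template = self.query_model(log)
            if template not in self.templates:
                self.templates.append(template)
            event_id = self.templates.index(template)
        self.exact[log] = event_id
        return event_id, self.templates[event_id]
```

With this sketch, parsing `Connection 42 closed` triggers one model call and stores `Connection <*> closed`. Parsing `Connection 7 closed` afterwards is answered from the cache, so `model_calls` stays at 1.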
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -18,6 +18,7 @@ nav:
   - Parsers Methods:
     - Template Matcher: parsers/template_matcher.md
     - Json Parser: parsers/json_parser.md
+    - LogBatcher Parser: parsers/logbatcher_parser.md
   - Detectors Methods:
     - Random Detector: detectors/random_detector.md
     - New Value: detectors/new_value.md

diff --git a/pyproject.toml b/pyproject.toml
@@ -10,6 +10,12 @@ dependencies = [
     "pydantic>=2.11.7",
     "pyyaml>=6.0.3",
     "regex>=2025.11.3",
+    "kafka-python>=2.3.0",
+    "openai>=2.26.0",
+    "tenacity>=9.1.4",
+    "scipy>=1.17.1",
+    "scikit-learn>=1.8.0",
+    "tiktoken>=0.12.0",
     "numpy>=2.3.2",
     "pandas>=2.3.2",
     "polars>=1.38.1",

diff --git a/src/detectmatelibrary/parsers/logbatcher/__init__.py b/src/detectmatelibrary/parsers/logbatcher/__init__.py
@@ -0,0 +1,29 @@
+# MIT License
+#
+# Copyright (c) 2024 LogIntelligence
+#
+# Based on LogBatcher (https://github.com/LogIntelligence/LogBatcher)
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+
+# import sys, os
+# sys.path.append(os.path.join(os.getcwd(), "parsing", "parsers"))
+
+# flake8: noqa
+from .parser import LogBatcherParserConfig, LogBatcherParser  # noqa: F401
diff --git a/src/detectmatelibrary/parsers/logbatcher/engine/LICENSE b/src/detectmatelibrary/parsers/logbatcher/engine/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 LogIntelligence
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.