XDRGEN

xdrgen is a CLI tool that generates production-like Defender XDR telemetry based on a provided YAML profile.

⚠️ Experimental / heavily vibe-coded. This project is a work in progress and was largely built by feel rather than against an authoritative spec. The generated telemetry will contain errors — wrong enum values, fields that wouldn't co-occur in real Defender data, distributions that don't match production, etc. Don't rely on it for anything that matters until each table has been evaluated against real-world samples. Correctness will be tightened in future changes.

It does two things:

  1. generate — produce coherent, production-like telemetry events as JSON, driven by a YAML profile that lists which tables to emit.
  2. update-models — fetch the canonical list of Defender XDR / MDE tables and their Solution Analyzer column schemas from the Azure/Azure-Sentinel repo, and emit one strongly-typed Pydantic model per table into ./models/.

Requirements

  • Python 3.12+
  • uv (recommended; see Without uv for a pip-based setup)

Without uv

If you'd rather not install uv, set up a venv with pip and drop the uv run prefix from every command in this README:

python3.12 -m venv .venv
source .venv/bin/activate
pip install -e .

xdrgen generate profile.yaml -n 100

Dev tooling lives under the dev dependency group in pyproject.toml. pip doesn't install dependency groups today, so for the test suite / linter add them by hand:

pip install pytest pytest-asyncio respx ruff
pytest -q

Commands

generate

░██    ░██ ░███████   ░█████████    ░██████  ░██████████ ░███    ░██
 ░██  ░██  ░██   ░██  ░██     ░██  ░██   ░██ ░██         ░████   ░██
  ░██░██   ░██    ░██ ░██     ░██ ░██        ░██         ░██░██  ░██
   ░███    ░██    ░██ ░█████████  ░██  █████ ░█████████  ░██ ░██ ░██
  ░██░██   ░██    ░██ ░██   ░██   ░██     ██ ░██         ░██  ░██░██
 ░██  ░██  ░██   ░██  ░██    ░██   ░██  ░███ ░██         ░██   ░████
░██    ░██ ░███████   ░██     ░██   ░█████░█ ░██████████ ░██    ░███
                                                               v0.1.0

 Usage: xdrgen generate [OPTIONS] [PROFILE]

 Generate production-like Defender XDR telemetry as JSON.

 Events are buffered in memory and flushed to disk every `--flush-every`
 events — so neither finite (`-n`) nor `--indefinite` runs grow memory
 without bound. The buffer is also flushed on `Ctrl+C` and at the end of
 a finite run, so no event written to memory is ever lost.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│   profile      [PROFILE]  Optional YAML profile selecting tables and overriding tenant fixtures.                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --output                  -o      PATH                    Output path. Defaults to `./telemetry.json`, or            │
│                                                           `./telemetry/` with `--per-table`.                         │
│ --count                   -n      INTEGER RANGE [x>=1]    Number of events to generate (ignored with --indefinite).  │
│                                                           [default: 10]                                              │
│ --indefinite                                              Run until interrupted with Ctrl+C.                         │
│ --interval                -i      FLOAT RANGE [x>=0.0]    Seconds to wait between events. [default: 1.0]             │
│ --echo                                                    Also print each event to stdout.                           │
│ --per-table                                               Group events per-table: one file per event (file sink) or  │
│                                                           one topic per table (kafka).                               │
│ --flush-every                     INTEGER RANGE [x>=1]    Buffer this many events before flushing to the active      │
│                                                           sink.                                                      │
│                                                           [default: 10000]                                           │
│ --sink                            [json|kafka|kustainer]  Destination for events: `json`, `kafka`, or `kustainer`.   │
│                                                           [default: json]                                            │
│ --kafka-bootstrap                 TEXT                    Kafka bootstrap servers, e.g. `localhost:9092`. Required   │
│                                                           for --sink kafka.                                          │
│ --kafka-topic                     TEXT                    Kafka topic to produce to (ignored with --per-table).      │
│                                                           [default: xdrgen]                                          │
│ --kafka-topic-prefix              TEXT                    Prefix for per-table Kafka topic names (only with          │
│                                                           --per-table).                                              │
│                                                           [default: xdrgen.]                                         │
│ --kustainer-cluster               TEXT                    Kustainer (Kusto emulator) HTTP endpoint.                  │
│                                                           [default: http://localhost:8080]                           │
│ --kustainer-database              TEXT                    Kustainer database events are ingested into.               │
│                                                           [default: NetDefaultDB]                                    │
│ --kustainer-table-prefix          TEXT                    Prefix prepended to every Kustainer table name.            │
│ --help                                                    Show this message and exit.                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Produce coherent, production-like telemetry events as JSON, driven by an optional YAML profile. Each event is generated, validated through its Pydantic model (so field names and types match the real Defender XDR columns), and handed to a sink: json by default. Pick one with --sink; see Sinks below for what ships and how to add more.

Events are buffered in memory and flushed to the active sink every --flush-every events (default 10 000), as well as at the end of a run and on Ctrl+C. Neither finite (-n) nor --indefinite runs grow memory without bound, and no buffered event is ever lost.

--per-table cross-cuts whichever sink is active — it changes how events are grouped, not where they're sent:

  • --sink json (default): one combined ./telemetry.json array → with --per-table, one file per event under ./telemetry/{TableName}-{n:04d}.json.
  • --sink kafka: every event to --kafka-topic (default xdrgen) → with --per-table, one topic per table named {--kafka-topic-prefix}{TableName} (the default prefix xdrgen. yields xdrgen.CloudAppEvents, xdrgen.EmailEvents, …).
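The per-event file names under --per-table follow a zero-padded format. A minimal sketch of that naming scheme (whether the counter is global or per-table is an assumption here, not confirmed above):

```python
# Sketch of the per-table file-naming scheme: {TableName}-{n:04d}.json.
# Assumption: the counter is kept per table and starts at zero.
from collections import defaultdict

def per_table_name(table: str, counter: defaultdict) -> str:
    n = counter[table]
    counter[table] += 1
    return f"{table}-{n:04d}.json"

counter = defaultdict(int)
print(per_table_name("CloudAppEvents", counter))  # CloudAppEvents-0000.json
print(per_table_name("EmailEvents", counter))     # EmailEvents-0000.json
print(per_table_name("CloudAppEvents", counter))  # CloudAppEvents-0001.json
```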

Both the YAML profile itself and its tables: key are optional — omit either to generate for every table that has a generator.

# All tables that have a generator, file sink, ./telemetry.json
uv run xdrgen generate

# 100 events, no delay between them, into a custom file
uv run xdrgen generate -n 100 -i 0 -o ./out/cae.json

# 100 events, one JSON file per event, into ./out/events/
uv run xdrgen generate -n 100 -i 0 -o ./out/events --per-table

# Stream events forever, one every 2 seconds, flushes after 10000 events or once interrupted
uv run xdrgen generate --indefinite -i 2

# Stream events forever, flush every 100 events
uv run xdrgen generate --indefinite --flush-every 100

# Stream to Kafka instead, single topic
uv run xdrgen generate --sink kafka --kafka-bootstrap localhost:9092
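A minimal profile that restricts generation to two tables might look like this (the tables: key comes from the description above; the exact list shape is an assumption — profile.example.yaml is the authoritative reference):

```yaml
# Minimal profile: only the tables: key, no overrides.
tables:
  - CloudAppEvents
  - EmailEvents
```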

Pass --echo to also stream every generated event to stdout as it is written, regardless of which sink is active — useful for piping or watching the stream live.

Profile overrides

The same YAML profile can carry an optional overrides: mapping to replace fixtures baked into the default World. Useful when you want the stream to look like it came from your tenant rather than the default contoso.com fixture. Every override key is optional; omit the whole block (or individual keys) to keep the defaults.

The profile is validated by Pydantic models defined in world.py (Profile, Overrides, plus the typed sub-models for User, IPEntry, ClientApp, etc.) — unknown keys, wrong shapes, and missing required sub-fields fail fast with a clear error from xdrgen generate. At runtime, Profile.build_world() produces a frozen World instance that is threaded into every generator, so overrides apply atomically with no module-level mutation.
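The frozen-World pattern can be sketched with stdlib dataclasses (names here are hypothetical stand-ins — the real models are Pydantic and live in world.py):

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for the frozen World: immutable after construction,
# so overrides must be applied while building it, never by mutating module state.
@dataclass(frozen=True)
class World:
    tenant_domain: str = "contoso.com"
    tenant_id: str = "a1b2c3d4-5e6f-4071-8293-94a5b6c7d8e9"

def build_world(overrides: dict) -> World:
    # Unknown keys fail fast, mirroring the Pydantic validation described above.
    defaults = World()
    unknown = set(overrides) - set(defaults.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown override keys: {sorted(unknown)}")
    return replace(defaults, **overrides)

world = build_world({"tenant_domain": "fabrikam.com"})
print(world.tenant_domain)  # fabrikam.com
```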

Scalar overrides (replace a single value):

| Key | Default | Surfaces in |
| --- | --- | --- |
| tenant_id | a1b2c3d4-5e6f-4071-8293-94a5b6c7d8e9 | Every TenantId, OrganizationId in RawEventData |
| tenant_domain | contoso.com | Cloud / UPN domain, intra-org Message-Ids |
| on_prem_ad_domain | contoso.local | AccountDomain on Identity* tables |
| on_prem_netbios_domain | CONTOSO | NetBIOS form of the on-prem domain |
| on_prem_sid_prefix | S-1-5-21-1004336348-1177238915-682003330 | AccountSid prefix (per-user RID is appended) |

Collection overrides (fully replace the default list — no merge):

| Key | Item shape | Used by |
| --- | --- | --- |
| domain_controllers | name, ip, device_id | Identity* tables |
| devices | device_id, device_name, os_platform, os_version, public_ip, local_ip, mac_address, machine_group, primary_user_upn | Device* tables (DeviceEvents, DeviceProcessEvents, DeviceLogonEvents, …) |
| processes | file_name, folder_path, company, description, internal_file_name, original_file_name, product_name, product_version, command_lines, integrity_level, elevation, signature_status, signer_type, parent | InitiatingProcess* columns on every Device* table |
| users | display_name, upn, object_id, type, device_name, device_id, last_password_change_days_ago, sam_account_name, sid_rid, given_name, surname, department, job_title, employee_id, city, country | Almost every table |
| ips | ip, city, state, country, isp, category, latitude, longitude | Source IPs across cloud and email tables |
| user_agents | ua, platform, device_type, browser | CloudAppEvents, EntraIdSignInEvents, IdentityLogonEvents |
| resources | name, id | EntraIdSignInEvents.ResourceDisplayName / .ResourceId, plus EntraIdSpnSignInEvents when overridden (otherwise SPN events fall back to a workload-identity-flavoured default) |
| client_apps | name, app_id | EntraIdSignInEvents.Application / .ApplicationId |
| service_principals | name, id, app_id, is_managed_identity | EntraIdSpnSignInEvents.ServicePrincipalName / .ServicePrincipalId / .Application / .ApplicationId |
| groups | name, id | GraphApiAuditEvents.RequestUri (/groups/{id}/…) |
| conditional_access_policies | id, displayName, enforcedGrantControls, enforcedSessionControls | EntraIdSignInEvents.ConditionalAccessPolicies |
| entra_sign_in_error_codes | code, weight, description | EntraIdSignInEvents.ErrorCode distribution + AuthenticationProcessingDetails |
| entra_spn_sign_in_error_codes | code, weight, description | EntraIdSpnSignInEvents.ErrorCode distribution |
| email_templates | subject, sender_*, recipient_persona, direction, delivery_action, delivery_location, email_action, email_size, bulk_complaint_level, authentication_details, confidence_level, optional threat / policy fields, plus nested attachments (file_name, extension, file_type, file_size) and urls (url, domain, location) | Pre-built email pool feeding EmailEvents / EmailAttachmentInfo / EmailPostDeliveryEvents / EmailUrlInfo / UrlClickEvents (correlated by NetworkMessageId) |

A fully documented example profile is shipped at the repo root as profile.example.yaml with sample values for every override — copy it and edit:

cp profile.example.yaml profile.yaml
uv run xdrgen generate profile.yaml -n 100

Examples

# 100 events, no delay between them, into a custom file
uv run xdrgen generate profile.yaml -n 100 -i 0 -o ./out/cae.json

# 100 events, one JSON file per event, into ./out/events/
uv run xdrgen generate profile.yaml -n 100 -i 0 -o ./out/events --per-table

# Stream events forever, one every 2 seconds (writes once interrupted)
uv run xdrgen generate profile.yaml --indefinite -i 2

Sinks

Sinks live in sinks/. Each module defines a sink (a Sink-protocol class — write(batch) and close()) plus a build(...) factory; main._build_sink returns one based on --sink. The --per-table and --flush-every flags are sink-agnostic — they're handled by main and apply uniformly to whichever sink is active.

Three sinks ship today:

  • --sink json (default) — JSON to disk. Single combined array, or per-event files with --per-table. See sinks/json.py.
  • --sink kafka — produces JSON to a Kafka broker via kafka-python. The table name is used as the message key so partitioning stays consistent per table. See sinks/kafka.py.
  • --sink kustainer — ingests directly into Kustainer, Microsoft's official Kusto/ADX emulator, via the azure-kusto-data SDK. Each event lands in the table named after its Pydantic model (e.g. CloudAppEvents). The emulator does not implement streaming or queued ingestion, so the sink uses the universally-supported .ingest inline control command on the engine endpoint. See sinks/kustainer.py.
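The two-method Sink protocol named above boils down to very little code; here is a minimal sketch using typing.Protocol (the exact signature in sinks/base.py may differ — this is an illustration, not the real interface):

```python
from typing import Protocol

class Sink(Protocol):
    def write(self, batch: list[dict]) -> None: ...
    def close(self) -> None: ...

# Trivial in-memory sink satisfying the protocol — handy in unit tests.
class ListSink:
    def __init__(self) -> None:
        self.events: list[dict] = []

    def write(self, batch: list[dict]) -> None:
        self.events.extend(batch)

    def close(self) -> None:
        pass

def build() -> ListSink:
    # Mirrors the per-module build(...) factory convention described above.
    return ListSink()

sink: Sink = build()
sink.write([{"Table": "CloudAppEvents"}])
sink.close()
print(len(sink.events))  # 1
```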
Testing the Kafka sink locally

A docker/docker-compose-kafka.yml spins up a single-broker Kafka (KRaft mode, no Zookeeper) plus Kafka UI so you can watch topics fill up:

docker compose -f docker/docker-compose-kafka.yml up -d
uv run xdrgen generate -n 100 -i 0 --sink kafka --kafka-bootstrap localhost:9092
# Then browse http://localhost:8080 → cluster `local` → Topics.

The compose file exposes two listeners on the broker — localhost:9092 for clients on your host (xdrgen) and kafka:29092 for clients on the compose network (Kafka UI). Mismatched listeners are the most common Kafka-in-Compose footgun; this split keeps both paths working.
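The dual-listener split typically looks something like this in a compose file (a heavily simplified sketch, not the shipped docker/docker-compose-kafka.yml — the image, listener names, and required KRaft controller settings will differ):

```yaml
services:
  kafka:
    image: apache/kafka:latest   # assumption: any KRaft-capable image
    ports:
      - "9092:9092"
    environment:
      # HOST listener for clients on your machine (xdrgen), DOCKER listener
      # for containers on the compose network (e.g. Kafka UI).
      KAFKA_LISTENERS: HOST://0.0.0.0:9092,DOCKER://0.0.0.0:29092
      KAFKA_ADVERTISED_LISTENERS: HOST://localhost:9092,DOCKER://kafka:29092
```

The advertised listener is what the broker tells clients to reconnect to, which is why a single listener can never serve both the host and the compose network correctly.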

Testing the Kustainer sink locally

A docker/docker-compose-kustainer.yml spins up the official kustainer-linux image (single HTTP endpoint on 8080, unauthenticated, NetDefaultDB database). Tables aren't created automatically by the sink — bootstrap them once with scripts/create_kustainer_tables.py, which walks every Pydantic model under ./models/ and emits a .create-merge table for each. Re-running it after xdrgen update-models is safe — .create-merge only adds new columns, it never drops existing data.

docker compose -f docker/docker-compose-kustainer.yml up -d
uv run python scripts/create_kustainer_tables.py
uv run xdrgen generate -n 100 -i 0 --sink kustainer

The script and the sink share their type mapping (str→string, int→long, bool→bool, float→real, datetime→datetime, anything else→dynamic) so the schema the script writes always matches the rows the sink emits. Pass --dry-run to print the control commands without touching the cluster, and --cluster / --database to point at a non-default emulator.
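The shared type mapping can be sketched as a small dict plus a .create-merge builder (helper names here are hypothetical; the real logic lives in scripts/create_kustainer_tables.py and sinks/kustainer.py):

```python
# Python annotation -> Kusto column type, as listed above.
KUSTO_TYPES = {
    "str": "string",
    "int": "long",
    "bool": "bool",
    "float": "real",
    "datetime": "datetime",
}

def kusto_type(py_type: str) -> str:
    # Anything unmapped becomes dynamic, matching the fallback described above.
    return KUSTO_TYPES.get(py_type, "dynamic")

def create_merge_command(table: str, columns: dict[str, str]) -> str:
    cols = ", ".join(f"{name}: {kusto_type(t)}" for name, t in columns.items())
    return f".create-merge table {table} ({cols})"

print(create_merge_command("CloudAppEvents", {"Timestamp": "datetime", "RawEventData": "dict"}))
# .create-merge table CloudAppEvents (Timestamp: datetime, RawEventData: dynamic)
```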

Querying the local emulator

Kustainer speaks the standard Kusto REST API at http://localhost:8080/v1/rest/query (queries) and /v1/rest/mgmt (control commands). Anything that talks to a real ADX cluster works against it — the only difference is the URL and the lack of authentication.

The lowest-friction option is the Azure Data Explorer Web UI: sign in with your Microsoft account, click + Add -> Connection, paste http://localhost:8080 into the Connection URI field, and you get the same query editor you'd use against a real cluster. The Web UI is a client-side JavaScript app, so queries go straight from your browser to localhost:8080.

For a one-shot curl:

curl -s http://localhost:8080/v1/rest/query \
  -H 'Content-Type: application/json' \
  -d '{"db":"NetDefaultDB","csl":"CloudAppEvents | take 5"}'

Or from Python via the same SDK the sink uses:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_no_authentication("http://localhost:8080")
client = KustoClient(kcsb)
response = client.execute("NetDefaultDB", "CloudAppEvents | summarize count() by ActionType")
for row in response.primary_results[0]:
    print(row["ActionType"], row["count_"])

.show tables (control command, run via execute_mgmt) is the quickest way to confirm the bootstrap script created everything before you start ingesting.

The image is x86_64 only; on Apple Silicon, set platform: linux/amd64 on the service in the compose file (Docker emulates x86, so it is slow). The emulator has no separate ingest URI, no streaming ingestion, no queued ingestion, and no authentication — mcr.microsoft.com/azuredataexplorer/kustainer-linux is for local development only.

Production-like data

The generators are hand-curated per table to produce realistic, correlated values rather than random noise — a single tenant ID, the same user pool, the same IP-to-geo mapping, and user agents paired with their matching OS / browser. Per-table specifics (what each generator does to mirror real Defender data) live in generators/PRODUCTION_LIKE_DATA.md.

Supported tables

The generate command currently has handcrafted generators for:

  • CloudAppEvents
  • DeviceEvents
  • DeviceFileCertificateInfo
  • DeviceImageLoadEvents
  • DeviceLogonEvents
  • DeviceNetworkEvents
  • DeviceNetworkInfo
  • DeviceProcessEvents
  • DeviceRegistryEvents
  • EmailAttachmentInfo
  • EmailEvents
  • EmailPostDeliveryEvents
  • EmailUrlInfo
  • EntraIdSignInEvents
  • EntraIdSpnSignInEvents
  • GraphApiAuditEvents
  • IdentityAccountInfo
  • IdentityDirectoryEvents
  • IdentityEvents
  • IdentityLogonEvents
  • IdentityQueryEvents
  • UrlClickEvents

Listing a table in the YAML that does not have a generator yet will fail fast with a list of available tables. More generators will be added over time.
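The fail-fast behavior can be pictured as a simple registry check (names here are hypothetical — the real lookup lives inside the generate command):

```python
# Hypothetical generator registry: table name -> generator callable.
GENERATORS = {
    "CloudAppEvents": lambda: {"Table": "CloudAppEvents"},
    "EmailEvents": lambda: {"Table": "EmailEvents"},
}

def resolve_tables(requested: list[str]) -> list[str]:
    unknown = [t for t in requested if t not in GENERATORS]
    if unknown:
        available = ", ".join(sorted(GENERATORS))
        raise SystemExit(f"No generator for {unknown}. Available tables: {available}")
    return requested

print(resolve_tables(["EmailEvents"]))  # ['EmailEvents']
```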

update-models

Fetches three CSVs from Azure/Azure-Sentinel/Tools/Solutions Analyzer/:

  • tables.csv and tables_reference.csv — overlapping table catalogues. The union of rows whose category column contains xdr or mde (case-insensitive) defines the Defender XDR / MDE table set.
  • table_schemas.csv — column-level schemas (table name, column name, type, description).

It then writes one Pydantic model per table into ./models/.

uv run xdrgen update-models

System columns starting with _ (e.g. _BilledSize, _IsBillable, _ResourceId) are skipped — only first-class table columns become model fields. All fields are Optional[T] = Field(None, ...) because XDR events are inherently sparse. Column descriptions from the source docs become Pydantic field descriptions.

from models import CloudAppEvents

event = CloudAppEvents(
    ActionType="FileDownloaded",
    AccountDisplayName="Avery Chen",
    IPAddress="20.43.122.12",
)
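The skip-underscore and Optional-field rules above can be sketched as a tiny source generator (function names are hypothetical; the real implementation is the update-models command):

```python
def model_source(table: str, columns: list[tuple[str, str, str]]) -> str:
    """Emit a Pydantic-style model body from (name, py_type, description) tuples.
    System columns starting with '_' are skipped; every field is Optional."""
    lines = [f"class {table}(BaseModel):"]
    for name, py_type, description in columns:
        if name.startswith("_"):
            continue  # _BilledSize, _IsBillable, _ResourceId, ...
        lines.append(
            f"    {name}: Optional[{py_type}] = Field(None, description={description!r})"
        )
    return "\n".join(lines)

src = model_source("CloudAppEvents", [
    ("Timestamp", "datetime", "Event timestamp"),
    ("_BilledSize", "float", "System column"),
])
print(src)
```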

You usually don't need to run this yourself — a GitHub Actions workflow runs update-models daily at 06:00 UTC against main, and any diff in ./models/ is opened as a PR and squash-merged once lint and tests pass. So main always tracks the latest upstream Defender XDR / MDE schemas.

Development

Everything below is for working on xdrgen itself, not for using the CLI.

Formatting

The codebase is formatted with ruff (target Python 3.12, configured in pyproject.toml). Always run ruff format before committing changes:

# Format every file in-place
uv run ruff format .

# Or just check for drift without writing
uv run ruff format --check .

Linting

The codebase is also linted with ruff (target Python 3.12, configured in pyproject.toml). Always run ruff check --fix before committing changes — safe fixes are applied automatically; anything left is for you to address:

# Lint every file and apply safe fixes in-place
uv run ruff check --fix .

# Or just report issues without writing
uv run ruff check .

Running tests

The test suite lives under tests/ and runs with pytest:

# Run everything
uv run pytest

# Run a single file
uv run pytest tests/test_telemetry.py

# Run a single test by name
uv run pytest tests/test_telemetry.py::test_identity_logon_events_terminate_at_a_known_dc

# Quieter output
uv run pytest -q

Adding a new sink

  1. Drop a module in sinks/ (e.g. sinks/s3.py) that exports a class implementing the Sink protocol from sinks/base.py plus a top-level build(...) factory returning an instance.
  2. Add an entry to the SinkChoice enum in main.py and a branch in _build_sink that calls your factory with the relevant CLI flags.
  3. Add CLI flags for any sink-specific config (mirror the --kafka-* pattern), and a unit test in tests/test_sinks.py that stubs out the underlying client so it doesn't need real infrastructure.
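The stubbing pattern from step 3 might look like this with unittest.mock (the S3Sink class and its client calls are hypothetical — in the real test you would patch the underlying client class inside your sink module instead):

```python
from unittest import mock

# Hypothetical sink wrapping an external client, for illustration only.
class S3Sink:
    def __init__(self, client):
        self.client = client

    def write(self, batch):
        for event in batch:
            self.client.put_object(Body=str(event))

    def close(self):
        self.client.close()

def test_s3_sink_writes_each_event():
    client = mock.Mock()          # stands in for the real client
    sink = S3Sink(client)
    sink.write([{"a": 1}, {"b": 2}])
    sink.close()
    assert client.put_object.call_count == 2
    client.close.assert_called_once()

test_s3_sink_writes_each_event()
```

Because the client is a Mock, the test never needs real infrastructure — the same idea applies to stubbing kafka-python or the azure-kusto-data client.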
