Add YARA-X rules engine integration #2900
- New YaraScanTask in the pipeline (between carving and indexing) using libyara-x-capi 1.16.0 via JNA (in-process).
- yara/ task subpackage: YaraEngine, YaraScanner, YaraRulesetLoader, YaraInstallPaths, YaraMatch, MatchedString, YaraHighlightSupport.
- YaraConfig configurable + conf/YaraConfig.txt catalog/limits.
- yara:* properties (ExtraProperties), UI facet/columns, HTML report.
- --yara-only CLI mode to re-apply the catalog over a processed case (SkipCommitedTask + IndexTask.updateDocuments).
- Bundled native (tools/yara-x/), license, ThirdParty + ReleaseNotes, CI step verifying the bundled Linux .so.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR introduces an in-process YARA-X scanning capability into IPED’s processing pipeline via JNA bindings to libyara-x-capi, including configuration, UI integration (facet/columns + highlight terms), CI verification for the bundled Linux native, and a new --yara-only mode that re-scans an existing case and updates Lucene documents in place.
Changes:
- Added YARA-X engine wrapper + scanner, catalog discovery, match decoding, and a pipeline task (`YaraScanTask`) that persists results into `yara:*` extra-properties.
- Added `--yara-only` CLI mode and indexing changes to update existing Lucene docs for committed items.
- Added packaging/docs/licensing/CI pieces to ship and validate the native runtime and document third-party usage.
Reviewed changes
Copilot reviewed 43 out of 45 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| tools/yara-x/win64/yara_x.h | Bundled upstream C header for reference. |
| tools/yara-x/README.md | Documents native layout, version pinning, hashes, and update procedure. |
| tools/yara-x/LICENSE | License file for bundled YARA-X runtime (currently a placeholder in PR). |
| ThirdParty.txt | Adds YARA-X + JNA third-party notices. |
| ReleaseNotes.txt | Adds release entry describing YARA-X integration and --yara-only. |
| licenses/YARA-X.txt | Adds BSD-3-Clause license text for YARA-X. |
| iped-engine/src/test/java/iped/engine/task/yara/YaraScanTaskIntegrationTest.java | End-to-end integration tests for YaraScanTask against real native lib. |
| iped-engine/src/test/java/iped/engine/task/yara/YaraRulesetLoaderTest.java | Unit tests for YARA ruleset discovery. |
| iped-engine/src/test/java/iped/engine/task/yara/YaraHighlightSupportTest.java | Tests for hex-to-facet decoding logic. |
| iped-engine/src/test/java/iped/engine/task/yara/YaraEngineTest.java | Integration-gated tests for engine compilation and scanning. |
| iped-engine/src/test/java/iped/engine/config/YaraConfigTest.java | Unit tests for YaraConfig parsing and defaults. |
| iped-engine/src/main/java/iped/engine/task/yara/YaraScanTask.java | New pipeline task: loads catalog once, scans items, persists yara:* fields, emits metrics. |
| iped-engine/src/main/java/iped/engine/task/yara/YaraScanner.java | Per-worker scanner wrapper and match collection via native callbacks. |
| iped-engine/src/main/java/iped/engine/task/yara/YaraRulesetLoader.java | Recursive discovery of .yar/.yara sources (deterministic ordering). |
| iped-engine/src/main/java/iped/engine/task/yara/YaraMatch.java | Immutable match model (namespace/name/tags/strings). |
| iped-engine/src/main/java/iped/engine/task/yara/YaraInstallPaths.java | Auto-detects release root and bundled native directory. |
| iped-engine/src/main/java/iped/engine/task/yara/YaraHighlightSupport.java | Decodes matched bytes for facet/highlighting (printable ASCII vs hex). |
| iped-engine/src/main/java/iped/engine/task/yara/YaraEngine.java | JNA bindings, native loading strategy, compilation + error parsing. |
| iped-engine/src/main/java/iped/engine/task/yara/MatchedString.java | Represents a matched byte slice (id/offset/hex/truncation). |
| iped-engine/src/main/java/iped/engine/task/SkipCommitedTask.java | Alters committed-item skipping behavior to support --yara-only. |
| iped-engine/src/main/java/iped/engine/task/index/IndexTask.java | Adds updateDocuments path in --yara-only mode. |
| iped-engine/src/main/java/iped/engine/task/HTMLReportTask.java | Adjusts extra-properties handling comments; references a (missing) renderer. |
| iped-engine/src/main/java/iped/engine/config/YaraConfig.java | New task config: rule dirs, size/timeout, scan policy, library hint. |
| iped-engine/src/main/java/iped/engine/CmdLineArgs.java | Adds isYaraOnly() default method (docs currently out of sync). |
| iped-engine/pom.xml | Adds JNA dependency. |
| iped-app/src/main/java/iped/app/ui/MetadataPanel.java | Extends highlight-term selection to yara:match:* fields. |
| iped-app/src/main/java/iped/app/ui/columns/ColumnsManager.java | Adds YARA group and groups yara:* extra attributes. |
| iped-app/src/main/java/iped/app/processing/Main.java | Enforces enableYara=true requirement in --yara-only mode. |
| iped-app/src/main/java/iped/app/processing/CmdLineArgsImpl.java | Adds --yara-only flag parsing + validation; implies --continue. |
| iped-app/resources/localization/iped-engine-messages.properties | Adds YARA task/report message keys (some wording out of sync). |
| iped-app/resources/localization/iped-engine-messages_pt_BR.properties | Adds pt-BR equivalents (some wording out of sync). |
| iped-app/resources/localization/iped-desktop-messages.properties | Adds ColumnsManager.Yara label. |
| iped-app/resources/localization/iped-desktop-messages_pt_BR.properties | Adds pt-BR ColumnsManager.Yara label. |
| iped-app/resources/localization/iped-desktop-messages_it_IT.properties | Adds it-IT ColumnsManager.Yara label. |
| iped-app/resources/localization/iped-desktop-messages_fr_FR.properties | Adds fr-FR ColumnsManager.Yara label. |
| iped-app/resources/localization/iped-desktop-messages_es_AR.properties | Adds es-AR ColumnsManager.Yara label. |
| iped-app/resources/localization/iped-desktop-messages_de_DE.properties | Adds de-DE ColumnsManager.Yara label. |
| iped-app/resources/config/IPEDConfig.txt | Adds enableYara toggle with documentation. |
| iped-app/resources/config/conf/YaraConfig.txt | Adds default YARA config file template. |
| iped-app/resources/config/conf/TaskInstaller.xml | Inserts YaraScanTask into pipeline between carving and indexing. |
| iped-app/pom.xml | Copies tools/yara-x into the release tree during build. |
| iped-api/src/main/java/iped/properties/ExtraProperties.java | Adds yara: constants for tags + per-rule match fields. |
| .github/workflows/maven.yml | CI step to verify bundled Linux .so and run integration-gated tests. |
| PLACEHOLDER — populated at release-build time with the LICENSE file from | ||
| https://github.com/VirusTotal/yara-x (BSD 3-clause). | ||
|
|
||
| This file MUST be replaced with the actual upstream YARA-X license text before | ||
| the native binaries (`win64/yara_x_capi.dll`, `linux64/libyara_x_capi.so`) | ||
| are shipped. See `licenses/YARA-X.txt` for the canonical copy used by IPED's | ||
| third-party license aggregation. |
| ├── LICENSE (BSD 3-clause from upstream YARA-X) | ||
| ├── win64/ | ||
| │ ├── yara_x_capi.dll (21,542,400 bytes — YARA-X 1.16.0, MSVC x86_64) | ||
| │ └── yara_x.h (39,444 bytes — C header, kept for reference) | ||
| └── linux64/ | ||
| └── (empty — see "Linux build" section below) | ||
| ``` |
| Unlike classic YARA, upstream YARA-X **publishes prebuilt, self-contained | ||
| binaries** for Windows and Linux, so no manual build is needed. | ||
|
|
||
| 1. **Identify the target version** at https://github.com/VirusTotal/yara-x/releases. | ||
| Look for the assets whose names start with `libyara-x-capi-vX.Y.Z-...`. | ||
|
|
||
| 2. **Linux (x86_64)**: there is **NO prebuilt in release 1.16.0** (upstream only | ||
| publishes the `yara-x-capi-*-msvc.zip` asset for Windows; the Linux asset | ||
| `yara-x-v1.16.0-x86_64-unknown-linux-gnu.gz` is the `yara-x` CLI, not the C API). |
| @@ -1,3 +1,7 @@ | |||
| TBD: IPED-4.4.0 | |||
| News: | |||
| #spec/001-yara-rules-engine: YARA Rules Engine. New `YaraScanTask` in the processing pipeline applies YARA-X 1.16.0 rules (via libyara-x-capi, in-process through JNA) to item content during processing, populating the indexed multi-valued fields `yara:rule` and `yara:tag` plus a structured `yara:matches` JSON field per matched item. Catalog is configured via `conf/YaraConfig.txt` (`ruleDirectories`, `maxFileSizeBytes`, `perItemTimeoutMs`, `scanAllItems`, `matchHexMaxBytes`). Profile-level overrides live at `profiles/<X>/conf/YaraConfig.txt`; `forensic` and `pedo` ship with `enableYara=true` by default (no-op without rules). Module `cuckoo` is banned at runtime via `yrx_compiler_ban_module`; the compiler runs with `YRX_RELAXED_RE_SYNTAX` for compatibility with classic YARA catalogs. The analysis UI exposes matched rules and tags as a dedicated facet (group `ColumnsManager.Yara`) so the analyst can filter and bookmark by rule with the standard flow. The HTML report includes a structured per-item "YARA matches" block with HTML-safe escape (`YaraReportRenderer`). A new CLI flag `--yara-only -o <CASE_OUTPUT_DIR>` re-applies the current catalog to an already processed case without reprocessing the pipeline, updating `yara:*` fields in the existing Lucene index. Native binary `yara_x_capi.dll` ships under `tools/yara-x/win64/`; Linux `libyara_x_capi.so` must be built from source via `cargo build -p yara-x-capi --release` (see `tools/yara-x/README.md`). Specs and contracts in `specs/001-yara-rules-engine/`. New dependency: `net.java.dev.jna:jna:5.7.0` declared in `iped-engine/pom.xml`. | |||
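The option names listed in the release note above (`ruleDirectories`, `maxFileSizeBytes`, `perItemTimeoutMs`, `scanAllItems`, `matchHexMaxBytes`) could be read along the following lines. This is only a sketch: the `key = value` syntax, the `;` separator for directories, every default value, and the class name `YaraConfigSketch` are assumptions made for illustration, not the PR's actual `YaraConfig`. The `>= 100` check mirrors the ">= 100 ms" remark in a later commit message; its exact behaviour in the PR is unverified.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical reader for the option names quoted in the release note.
// Defaults, the ';' separator and the key=value syntax are assumptions.
final class YaraConfigSketch {
    final List<String> ruleDirectories = new ArrayList<>();
    final long maxFileSizeBytes;
    final int perItemTimeoutMs;
    final boolean scanAllItems;
    final int matchHexMaxBytes;

    private YaraConfigSketch(Properties p) {
        for (String dir : p.getProperty("ruleDirectories", "").split(";")) {
            if (!dir.trim().isEmpty()) {
                ruleDirectories.add(dir.trim());
            }
        }
        maxFileSizeBytes = Long.parseLong(p.getProperty("maxFileSizeBytes", "67108864").trim());
        perItemTimeoutMs = Integer.parseInt(p.getProperty("perItemTimeoutMs", "30000").trim());
        scanAllItems = Boolean.parseBoolean(p.getProperty("scanAllItems", "false").trim());
        matchHexMaxBytes = Integer.parseInt(p.getProperty("matchHexMaxBytes", "64").trim());
        if (perItemTimeoutMs < 100) {
            throw new IllegalArgumentException("perItemTimeoutMs must be >= 100");
        }
    }

    /** Parses properties-style text. Note: Properties treats '\' as an escape character. */
    static YaraConfigSketch parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new YaraConfigSketch(p);
    }

    public static void main(String[] args) {
        YaraConfigSketch c = parse("ruleDirectories = rules/malware;rules/custom\nperItemTimeoutMs = 500\n");
        System.out.println(c.ruleDirectories + " timeoutMs=" + c.perItemTimeoutMs);
    }
}
```

Because `java.util.Properties` treats backslashes as escapes, Windows-style paths would need forward slashes or doubled backslashes under this assumed format.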
| /** | ||
| * When {@code true}, IPED runs only the YARA-X pipeline over an already | ||
| * processed case (without ingesting new evidence), updating the | ||
| * {@code yara:rule}/{@code yara:tag}/{@code yara:matches} fields in the | ||
| * existing Lucene index. See {@code specs/001-yara-rules-engine/research.md} | ||
| * §R-08 and {@code contracts/cli-yara-only.contract.md}. | ||
| * | ||
| * <p>Defaults to {@code false} (standard processing mode). The method is | ||
| * {@code default} to preserve compatibility with existing | ||
| * implementations of {@code CmdLineArgs}.</p> | ||
| */ |
| /** | ||
| * Integration test of the full {@link YaraScanTask} pipeline against the real | ||
| * {@code libyara-x-capi}. Loads a small rule catalog from a temp directory, | ||
| * compiles via {@link YaraEngine}, and runs the task's {@link YaraScanTask#process} | ||
| * on in-memory {@link Item}s to verify that {@code yara:rule}, {@code yara:tag} | ||
| * and {@code yara:matches} are populated correctly (FR-001 / FR-003 / FR-004 / | ||
| * FR-005 / FR-006 / FR-012). | ||
| * | ||
| * <p>Skipped via {@link org.junit.Assume} when {@code libyara-x-capi} is not | ||
| * loadable in the test environment.</p> | ||
| */ |
| YaraScanTask.Name=YARA scan | ||
| YaraScanTask.Description=Apply YARA-X rules to item content and tag matched items with rule/tag/offset metadata. | ||
| YaraScanTask.EngineUnavailable=YARA-X engine (libyara-x-capi) not loadable — YARA scan disabled for this case. |
| YaraScanTask.Name=Scan YARA | ||
| YaraScanTask.Description=Aplica regras YARA-X ao conteúdo do item e marca os itens casados com regra/tag/offset. | ||
| YaraScanTask.EngineUnavailable=Engine YARA-X (libyara-x-capi) não pôde ser carregada — scan YARA desabilitado para este caso. |
| if (!parentsWithLostSubitems.remove(trackID)) { | ||
| item.setToIgnore(true); | ||
| // In --yara-only mode we deliberately do NOT setToIgnore: the item must | ||
| // keep flowing through the pipeline so YaraScanTask can re-scan its | ||
| // content and IndexTask can issue an updateDocuments for the existing | ||
| // trackID. We still tag IS_COMMITTED so IndexTask knows it's the | ||
| // update branch (vs. addDocuments for new items). | ||
| item.setTempAttribute(IS_COMMITTED, Boolean.TRUE.toString()); | ||
| return; | ||
| if (!args.isYaraOnly()) { | ||
| item.setToIgnore(true); | ||
| return; | ||
| } |
| } | ||
| } | ||
|
|
||
| // YARA match rendering moved to iped.engine.task.yara.YaraReportRenderer (testable in isolation). |
Functional:
- YaraScanTask: ceil perItemTimeoutMs -> seconds so sub-second timeouts (the config accepts >= 100 ms) no longer truncate to 0 (= no timeout).
- YaraScanTask: cap content reads at maxFileSizeBytes+1 instead of an unbounded readAllBytes(), so items with unknown/incorrect getLength() (notably with scanAllItems=true) can't bypass the size limit and OOM.

Docs/comments:
- Sync comments, Javadoc, i18n, ReleaseNotes and TaskInstaller to the rev-5 field model (yara:tag + per-rule yara:match:<namespace>/<name>); drop stale yara:rule / yara:matches references (kept only where they document the removal).
- Remove the dangling reference to the non-existent YaraReportRenderer.
- Drop references to specs/001-yara-rules-engine/* and "Constitution Principle" (spec-kit artifacts not shipped to upstream).
- Replace the tools/yara-x/LICENSE placeholder with the actual YARA-X BSD-3-Clause text; fix README (linux64 .so is shipped; remove the "prebuilt for both platforms" contradiction).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
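The timeout fix in this commit comes down to rounding milliseconds up instead of truncating. A minimal sketch of that rounding follows; the class and method names are hypothetical, and the convention that 0 means "no timeout" is taken from the commit message above.

```java
// Hypothetical helper illustrating ceiling conversion from ms to whole seconds.
final class TimeoutRounding {
    /** Rounds milliseconds up to whole seconds; 0 or negative input yields 0 ("no timeout"). */
    static int ceilToSeconds(long timeoutMs) {
        if (timeoutMs <= 0) {
            return 0;
        }
        // (ms - 1) / 1000 + 1 is a ceiling division that cannot overflow for ms > 0.
        long seconds = (timeoutMs - 1) / 1000 + 1;
        return (int) Math.min(seconds, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Plain integer division would turn 100 ms into 0 s, i.e. no timeout at all.
        System.out.println(100 / 1000);
        System.out.println(ceilToSeconds(100));
        System.out.println(ceilToSeconds(1001));
    }
}
```

With truncating division a 100 ms setting becomes 0 seconds, which the scanner treats as unlimited; rounding up makes the smallest accepted setting behave as a 1-second timeout instead.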
…nf-inc#15)

Per Copilot review (sepinf-inc#15), the --yara-only re-index path has known issues that are deferred (not fixed) in this PR:
- IndexTask.updateDocuments(trackId) removes only the parent doc; content fragments carry fragParentId but not trackId, so stale fragments can remain on re-index.
- Leaf items are reassigned a new id on reprocess (SkipCommitedTask only restores ids for containers/dirs/roots/split-text items).

Corrected the misleading IndexTask comment that claimed fragments were deleted, and marked --yara-only as EXPERIMENTAL in the CLI help and ReleaseNotes, advising a full reprocess for production cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
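The stale-fragment limitation described in this commit can be reproduced with a toy model. The sketch below uses plain Java collections, not Lucene; the field names `trackId` and `fragParentId` come from the commit message, while `id`, `fragNum` and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-collections model of "delete by term, then add a block". Not Lucene code.
final class StaleFragmentDemo {
    /** One parent doc (carries trackId) plus N fragment docs (carry only fragParentId). */
    static List<Map<String, String>> itemBlock(String trackId, String id, int fragments) {
        List<Map<String, String>> block = new ArrayList<>();
        block.add(Map.of("trackId", trackId, "id", id));
        for (int i = 0; i < fragments; i++) {
            block.add(Map.of("fragParentId", id, "fragNum", Integer.toString(i)));
        }
        return block;
    }

    /** Removes docs whose field equals value, then appends the new block. */
    static void updateDocuments(List<Map<String, String>> index, String field, String value,
            List<Map<String, String>> block) {
        index.removeIf(doc -> value.equals(doc.get(field)));
        index.addAll(block);
    }

    static long countFragmentsOf(List<Map<String, String>> index, String parentId) {
        return index.stream().filter(doc -> parentId.equals(doc.get("fragParentId"))).count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>(itemBlock("t1", "10", 2));
        // Re-index the same trackId; the leaf item gets a new id (11), as the commit notes.
        updateDocuments(index, "trackId", "t1", itemBlock("t1", "11", 2));
        System.out.println("docs=" + index.size() + " staleFragments=" + countFragmentsOf(index, "10"));
    }
}
```

In this model the delete matches only the old parent, so the two fragments that point at id 10 survive the update and the index grows from three to five documents.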
Thanks for the review. I went through all 16 comments; here is how each was handled (fixes in
Functional fixes
Documentation / accuracy fixes
The field model was simplified in a late revision to
Deferred (documented), not fixed in this PR
Intentional behavior (by design)
| public List<YaraMatch> scan(byte[] buffer, int length, int timeoutSeconds) { | ||
| if (closed || scannerPtr == null || buffer == null || length <= 0) { | ||
| return Collections.emptyList(); | ||
| } | ||
| collector.reset(buffer, length); | ||
| if (timeoutSeconds > 0) { | ||
| YaraEngine.LibYaraX.INSTANCE.yrx_scanner_set_timeout(scannerPtr, (long) timeoutSeconds); | ||
| } | ||
| Memory native_buf = new Memory(length); | ||
| native_buf.write(0, buffer, 0, length); | ||
| try { | ||
| int rc = YaraEngine.LibYaraX.INSTANCE.yrx_scanner_scan(scannerPtr, native_buf, (long) length); | ||
| if (rc != YaraEngine.YRX_SUCCESS && rc != YaraEngine.YRX_SCAN_TIMEOUT) { | ||
| logger.debug("yrx_scanner_scan returned {}", rc); | ||
| } | ||
| return collector.takeMatches(); | ||
| } finally { | ||
| // Release the reference to the Java buffer; the native callback does NOT | ||
| // retain pointers after yrx_scanner_scan returns. | ||
| collector.clearBuffer(); | ||
| } | ||
| } |
| synchronized (finished) { | ||
| if (!finished.get()) { | ||
| if (sharedEngine != null) { | ||
| sharedEngine.close(); | ||
| sharedEngine = null; | ||
| } |
| if (isCommitted && yaraOnly) { | ||
| // --yara-only re-index (EXPERIMENTAL): refresh an already-committed item's | ||
| // yara:* fields. updateDocuments(Term, Iterable) deletes the docs matching | ||
| // the trackId and atomically adds the new block. | ||
| // | ||
| // KNOWN LIMITATION: only the parent (metadata) doc carries trackId; the | ||
| // content-fragment docs carry fragParentId but NOT trackId, so they are | ||
| // NOT removed here -- re-running --yara-only can leave stale content | ||
| // fragments behind. Leaf items are also re-assigned a new id on reprocess | ||
| // (SkipCommitedTask only restores ids for containers/dirs/roots/split-text | ||
| // items), which can change the parent id. Treat --yara-only as experimental | ||
| // until the re-index path is hardened. | ||
| Term trackIdTerm = new Term(IndexItem.TRACK_ID, Util.getTrackID(evidence)); | ||
| worker.writer.updateDocuments(trackIdTerm, new DocumentsIterable(evidence, fragReader)); | ||
| } else { |
| if (cmdLineParams.isYaraOnly()) { | ||
| // --yara-only goes through the normal Manager flow (DataSourceReader → | ||
| // pipeline → IndexTask). The CLI parser already enforced that -d is | ||
| // present and the case folder exists; isContinue() now also returns | ||
| // true for yara-only mode so SkipCommitedTask loads the committed | ||
| // trackIDs, and IndexTask switches to updateDocuments for those items. | ||
| YaraConfig yaraConfig = ConfigurationManager.get().findObject(YaraConfig.class); | ||
| if (yaraConfig == null || !yaraConfig.isEnabled()) { | ||
| throw new IPEDException( | ||
| "--yara-only requires enableYara=true in IPEDConfig.txt (or in the chosen -profile). " | ||
| + "Otherwise YaraScanTask would not run and updateDocuments would wipe the existing yara:* fields."); | ||
| } |
Addresses Copilot review: the existing guard only checked enableYara=true, but that alone does not guarantee YaraScanTask runs. If the native lib is unavailable, ruleDirectories is empty, or no .yar/.yara files are found, the task stays disabled and IndexTask's updateDocuments path would re-index the items WITHOUT yara:* attributes, silently wiping the previously stored yara:* fields across the case. Main.startManager() now aborts --yara-only (before any processing) unless ruleDirectories is non-empty, at least one rule file is discovered, and libyara-x-capi loads, reusing YaraRulesetLoader.discover and YaraEngine.ensureAvailable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
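The three preconditions named in this commit (non-empty `ruleDirectories`, at least one discovered rule file, loadable native library) could be checked roughly as below. This is a sketch under assumptions: `YaraOnlyPreflight`, `discoverRules` and `check` are hypothetical names, and the engine probe is passed in as a `BooleanSupplier` standing in for the real `YaraEngine.ensureAvailable` call, whose signature is not shown in this PR excerpt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.BooleanSupplier;
import java.util.stream.Stream;

// Hypothetical preflight for a "--yara-only"-style run; names are illustrative only.
final class YaraOnlyPreflight {
    static boolean isRuleFile(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".yar") || name.endsWith(".yara");
    }

    /** Recursively collects .yar/.yara files, sorted for deterministic ordering. */
    static List<Path> discoverRules(List<Path> ruleDirs) {
        List<Path> found = new ArrayList<>();
        for (Path dir : ruleDirs) {
            if (!Files.isDirectory(dir)) {
                continue;
            }
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.filter(Files::isRegularFile).filter(YaraOnlyPreflight::isRuleFile).forEach(found::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        Collections.sort(found);
        return found;
    }

    /** Returns a reason to abort, or empty when the run may proceed. */
    static Optional<String> check(List<Path> ruleDirs, BooleanSupplier engineLoads) {
        if (ruleDirs == null || ruleDirs.isEmpty()) {
            return Optional.of("ruleDirectories is empty");
        }
        if (discoverRules(ruleDirs).isEmpty()) {
            return Optional.of("no .yar/.yara files found");
        }
        if (!engineLoads.getAsBoolean()) {
            return Optional.of("libyara-x-capi could not be loaded");
        }
        return Optional.empty();
    }
}
```

Running the cheap checks first and the native-library probe last keeps the failure message specific, and aborting before any item is processed is what prevents the silent wipe of stored `yara:*` fields described above.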
Thanks for the second pass. Notes on the three new comments (latest commit
Addressed
Not a defect in this codebase (explained)
Low priority (acknowledged)
| if (hint != null && !hint.isEmpty()) { | ||
| File hintFile = new File(hint); | ||
| if (hintFile.exists()) { | ||
| NativeLibrary.addSearchPath("yara_x_capi", hintFile.getParentFile().getAbsolutePath()); | ||
| } | ||
| } |
| /** Idempotent shutdown (no-op for YARA-X — classic libyara required {@code yr_finalize}). */ | ||
| public static synchronized void shutdown() { | ||
| libraryAvailable = false; | ||
| } |
| long cap = Math.min(maxBytes + 1, (long) MAX_ARRAY_LENGTH); | ||
| return in.readNBytes((int) cap); |
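The capped read in the hunk above relies on reading one byte more than the limit, so that an oversized item can be detected without loading it fully. A self-contained sketch of that idea follows; `BoundedRead`, `readCapped`, `exceedsLimit` and the value chosen for `MAX_ARRAY_LENGTH` are assumptions for illustration. It also clamps before adding 1, which avoids the overflow that `maxBytes + 1` would hit if `maxBytes` were `Long.MAX_VALUE`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical bounded-read helper; requires Java 11+ for InputStream.readNBytes(int).
final class BoundedRead {
    // Assumed safe upper bound for array sizes; the PR's actual constant is not shown.
    static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    /** Reads at most maxBytes + 1 bytes, so the caller can tell "too large" from "fits". */
    static byte[] readCapped(InputStream in, long maxBytes) throws IOException {
        // Clamp first, then add 1, so the addition can never overflow.
        long cap = Math.min(Math.max(maxBytes, 0L), MAX_ARRAY_LENGTH - 1L) + 1L;
        return in.readNBytes((int) cap);
    }

    /** True when the stream held more than maxBytes bytes. */
    static boolean exceedsLimit(byte[] data, long maxBytes) {
        return data.length > maxBytes;
    }

    public static void main(String[] args) throws IOException {
        byte[] got = readCapped(new ByteArrayInputStream(new byte[10]), 4);
        System.out.println(got.length + " bytes read, exceeds=" + exceedsLimit(got, 4));
    }
}
```

A caller would skip the scan when `exceedsLimit` returns true, regardless of what the item's declared length says.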
| long offset = match.offset; | ||
| long length = match.length; | ||
| if (offset < 0 || length <= 0) { | ||
| return; | ||
| } | ||
| boolean truncated = length > matchHexMaxBytes; | ||
| String hex = extractHex(offset, length); | ||
| out.add(new MatchedString(id, offset, hex, truncated)); |
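The hunk above calls an `extractHex` helper whose body is not shown. One plausible shape, assuming it hex-encodes at most `matchHexMaxBytes` bytes of the matched slice and never reads past the scanned buffer, is sketched below. The signature, the bounds handling and the class name `MatchHex` are assumptions, not the PR's code.

```java
// Hypothetical hex extraction with a byte cap; not the PR's actual extractHex.
final class MatchHex {
    /** Hex-encodes up to maxBytes bytes of buffer[offset, offset + length), clamped to the valid range. */
    static String extractHex(byte[] buffer, int validLength, long offset, long length, int maxBytes) {
        if (buffer == null || offset < 0 || length <= 0) {
            return "";
        }
        int limit = Math.min(validLength, buffer.length);
        if (offset >= limit) {
            return "";
        }
        long available = limit - offset;
        int n = (int) Math.min(Math.min(length, available), Math.max(maxBytes, 0));
        StringBuilder sb = new StringBuilder(n * 2);
        for (int i = 0; i < n; i++) {
            int b = buffer[(int) offset + i] & 0xFF;
            sb.append(Character.forDigit(b >>> 4, 16)).append(Character.forDigit(b & 0x0F, 16));
        }
        return sb.toString();
    }

    /** Mirrors the "truncated" flag in the hunk: the match is longer than the cap. */
    static boolean isTruncated(long length, int maxBytes) {
        return length > maxBytes;
    }

    public static void main(String[] args) {
        byte[] buf = {0x4D, 0x5A, (byte) 0x90, 0x00};
        System.out.println(extractHex(buf, buf.length, 0, 4, 2) + " truncated=" + isTruncated(4, 2));
    }
}
```

Keeping the truncation flag separate from the hex string lets a consumer distinguish a short match from a long match whose stored bytes were cut at the cap.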
| String printable = decodePrintable(hex); | ||
| if (printable != null) { | ||
| return printable; | ||
| } | ||
| return hex.toLowerCase(); | ||
| } |
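The fallback in the hunk above (printable text when possible, lower-case hex otherwise) depends on a `decodePrintable` helper that is not shown. A sketch of how such a helper might work follows, assuming "printable" means every byte lies in the ASCII range 0x20 to 0x7E; the class name `HighlightTerm`, the method `toTerm` and that range are assumptions.

```java
import java.util.Locale;

// Hypothetical decoder: hex string -> printable ASCII, or null when any byte is not printable.
final class HighlightTerm {
    static String decodePrintable(String hex) {
        if (hex == null || hex.isEmpty() || hex.length() % 2 != 0) {
            return null;
        }
        StringBuilder sb = new StringBuilder(hex.length() / 2);
        for (int i = 0; i < hex.length(); i += 2) {
            int hi = Character.digit(hex.charAt(i), 16);
            int lo = Character.digit(hex.charAt(i + 1), 16);
            if (hi < 0 || lo < 0) {
                return null; // not a hex digit
            }
            int b = (hi << 4) | lo;
            if (b < 0x20 || b > 0x7E) {
                return null; // control or non-ASCII byte: fall back to hex
            }
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Printable text when every byte is printable ASCII, otherwise lower-case hex. */
    static String toTerm(String hex) {
        if (hex == null) {
            return "";
        }
        String printable = decodePrintable(hex);
        return printable != null ? printable : hex.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toTerm("6576696C"));
        System.out.println(toTerm("4D5A9000"));
    }
}
```

Using `Locale.ROOT` for the lower-casing keeps the result independent of the JVM's default locale, a small hardening over the bare `toLowerCase()` in the hunk.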
Summary
Adds a YARA-X rules engine to the processing pipeline. A new `YaraScanTask` scans item content with a configurable catalog of YARA rules and records the matches as searchable properties, a dedicated UI facet/columns, and entries in the HTML report.

The engine runs in-process via `libyara-x-capi` 1.16.0 (YARA-X) through JNA bindings, with no external process.

What's included
Pipeline / engine (`iped-engine/.../task/yara/`)
- `YaraScanTask`: new task between carving and indexing.
- `YaraEngine` (JNA) + `YaraScanner`: native engine wrapper.
- `YaraRulesetLoader`, `YaraInstallPaths`, `YaraMatch`, `MatchedString`, `YaraHighlightSupport`.
- `YaraConfig` configurable + `conf/YaraConfig.txt` (rule catalog + limits).

Item model / search / UI
- `yara:*` properties on `ExtraProperties`.
- UI facet/columns (`MetadataPanel`, `ColumnsManager`).
- HTML report (`HTMLReportTask`).

CLI
- `--yara-only` mode to re-apply the rule catalog over an already-processed case (orchestrated through `SkipCommitedTask` + `IndexTask.updateDocuments`).

Build / native / docs
- JNA dependency in `iped-engine`; `copy-yara-x` execution in `iped-app` ships the natives into the release tree.
- `tools/yara-x/`: Linux x86_64 `libyara_x_capi.so` and Windows x64 `yara_x_capi.dll` + header, plus `README.md` / `LICENSE`.
- `licenses/YARA-X.txt`, `ThirdParty.txt` and `ReleaseNotes.txt` updated.
- CI step (`maven.yml`) that verifies the bundled Linux `.so`.

Configuration
Enabled via `IPEDConfig.txt`; rule catalog and limits in `conf/YaraConfig.txt`.

Testing
Full reactor `mvn clean package` is green on Java 11 (Liberica Full 11.0.31). The YARA tests run against the bundled native library and pass: `YaraConfigTest` 22/22, `YaraEngineTest` 5/5, `YaraHighlightSupportTest` 13/13, `YaraRulesetLoaderTest` 9/9, `YaraScanTaskIntegrationTest` 7/7.

Note for maintainers
The two native libraries are committed under `tools/yara-x/` (~31.9 MB `.so` + ~21.5 MB `.dll`). YARA-X 1.16.0 publishes a prebuilt C API only for Windows MSVC; the Linux `.so` was built locally (`cargo build -p yara-x-capi --release`, see `tools/yara-x/README.md`). If you prefer not to track these binaries in the repo, I'm happy to switch to a build-time download/unpack instead; just let me know your preference.