Getting Started

This comprehensive guide covers installation, usage, configuration, and troubleshooting for the Protocol Reverse Engineering pipeline.

Prerequisites
Installation
Your First Analysis
Usage Guide
Feature Modes
LLM Integration
Ground Truth Evaluation
Diagnostic Tools
Step-by-Step Execution
Troubleshooting
Configuration Reference

Prerequisites

Before you begin, ensure you have:

Python 3.10 or higher
- Check: python --version or python3 --version
- Download: https://www.python.org/downloads/
TShark (Wireshark CLI)
- Check: tshark --version
- Download: https://www.wireshark.org/download.html
- Note: Install Wireshark, which includes TShark
PCAP files containing protocol traffic you want to analyze

Installation

Step 1: Clone or Download the Repository

git clone <repository-url>
cd protocol_re

Step 2: Create a Virtual Environment

# Create virtual environment
python3 -m venv venv

# Activate it
# On Linux/Mac:
source venv/bin/activate

# On Windows (PowerShell):
venv\Scripts\activate

# On Windows (Command Prompt):
venv\Scripts\activate.bat

Step 3: Install Dependencies

pip install -r requirements.txt

Dependencies installed:

numpy - Numerical computing
scikit-learn - Machine learning and clustering
hdbscan - Hierarchical density-based clustering
scapy - Packet manipulation (optional extraction method)
torch - PyTorch for neural features
colorama - Colored terminal output

Your First Analysis

Example 1: Analyze Modbus TCP Traffic

# Assuming PCAP file are in pcaps/ folder.
# Run the pipeline, without using LLM.
python main.py pcaps/ --tshark-filter mbtcp --llm-render-only

What happens:

Extracts Modbus TCP payloads from PCAPs
Discovers message families using clustering
Infers field boundaries
Detects request/response pairs
Assigns semantic labels to fields
Generates comprehensive protocol specification

Output:

output/protocol_report.md - Human-readable specification
output/protocol_report.html - Interactive HTML report
data/10_protocol_model.json - Machine-readable model

Usage Guide

Basic Usage

# Basic analysis
python main.py pcaps/ --tshark-filter <filter>

# Use existing messages (skip extraction)
python main.py --use-existing-messages

Common TShark filters:

Protocol	Filter	Description
Modbus TCP	`mbtcp`	Modbus TCP protocol
S7comm	`s7comm`	Siemens S7 communication
DNP3	`dnp3`	DNP3 SCADA protocol
IEC 60870-5-104	`iec104`	IEC 104 protocol
Custom TCP	`tcp.port == 2000`	TCP port 2000
Custom UDP	`udp.port == 2222`	UDP port 2222

Find available filters:

tshark -G protocols

Advanced Usage

Collection and Deduplication

python main.py source_files/ --collect --tshark-filter mbtcp

This will:

Collect all PCAPs from source_files/ into pcaps/
Remove duplicate captures
Run the full pipeline

TCP Port Extraction (Alternative to TShark)

python main.py pcaps/ --extraction-method tcp --service-port 502

When to use:

TShark filter is not available for your protocol
You want to extract by TCP/UDP port only
TShark is not installed

Enhanced Boundary Detection

Options:

--boundary-max-fields 12 - Limit maximum fields per family (default: 15)
--enable-merging - Enable multi-pass segment merging

Impact:

Reduces false positive boundaries
Eliminates excessive 1-byte fields

Multi-Layer Protocol Detection

python main.py pcaps/ --tshark-filter mbtcp --enable-layer-detection

Options:

--enable-layer-detection - Enable layer detection
--layer-min-confidence 0.7 - Minimum confidence threshold

Use cases:

Protocols with stable outer headers
Transport framing + application payload
Protocol tunneling scenarios

Message Limits

python main.py pcaps/ --tshark-filter mbtcp --max-messages 50000

Default: 200,000 messages

When to adjust:

Small captures: reduce for faster processing
Large captures: increase for better coverage
Memory constraints: reduce to limit memory usage

Feature Modes

Raw Bytes Mode

Use padded byte vectors:

python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode raw_bytes

Pros:

good accuracy on simple protocols
Fast and deterministic
No external dependencies

Use when:

A trained neural model is not available
Protocol has clear structural patterns

Structural Mode

Use symbolic protocol features:

python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode structural

Pros:

Protocol-agnostic feature extraction
Interpretable features

Use when:

You want to understand feature importance
Raw bytes mode is not working well

Neural Mode

Use VAE latent vectors:

python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode neural --family-neural-model-path assets/pre_trained/industrial_VAE.pth

Pros:

Can capture complex patterns
Learned representations

Cons:

May produce poor clustering (collapsed latent space)
Requires PyTorch and trained model

Use when:

A well-trained VAE model is available
Protocol has complex, non-obvious patterns

Hybrid Mode

Combine neural and structural features:

# Adaptive fusion (recommended)
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode hybrid --fusion-method adaptive --family-neural-model-path assets/pre_trained/industrial_VAE.pth

# Learned fusion with MLP
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode hybrid --fusion-method learned --family-neural-model-path assets/pre_trained/industrial_VAE.pth

# Fixed weights
python main.py pcaps/ --tshark-filter mbtcp \
    --family-feature-mode hybrid \
    --fusion-method fixed \
    --fusion-neural-weight 0.3 \
    --fusion-structural-weight 0.7 \
    --family-neural-model-path assets/pre_trained/industrial_VAE.pth

Fusion methods:

adaptive - Quality-based automatic weighting (default)
learned - MLP-based feature importance learning
fixed - Manual weight specification
concat - Simple concatenation

Features:

Automatic neural collapse detection
Fallback to structural features when neural fails
Latent vector caching for speed

LLM Integration

Setup

Create config/llm_config.json:

{
  "api_key_required": "yes",
  "openai_base_url": "https://api.openai.com/v1",
  "model": "gpt-4o-mini",
  "temperature": 0.1,
  "max_tokens": 4000,
  "timeout": 180,
  "max_retries": 5,
  "retry_delay_seconds": 2.0,
  "max_retry_delay_seconds": 30.0,
  "request_interval_seconds": 1.0
}

Set API key:

# Linux/Mac
export OPENAI_API_KEY=<your-api-key>

# Windows PowerShell
$env:OPENAI_API_KEY = "<your-api-key>"

Run with LLM Refinement

The runner tries to load config/llm_config.json from root folder.

python main.py pcaps/ --tshark-filter mbtcp

LLM Options

# Custom prompt template
python main.py pcaps/ --tshark-filter mbtcp \
    --llm-config config/llm_config.json \
    --llm-template custom_prompt.md

# Render prompt only (no API call)
python main.py pcaps/ --tshark-filter mbtcp --llm-render-only

# Adjust LLM parameters (temperature, max_tokens, timeout are read from the config file only — edit config/llm_config.json to change them)
python main.py pcaps/ --tshark-filter mbtcp \
    --llm-config config/llm_config.json

Stage-Specific LLM Refinement

Run individual LLM refinement stages:

# Boundary refinement
python scripts/07b_refine_boundaries_llm.py \
    data/05_families.json \
    data/05_families.refined.json \
    --config config/llm_config.json

# Semantic labeling
python scripts/11b_label_semantics_llm.py \
    data/05_families.json \
    data/09_semantics.json \
    data/09_semantics.llm.json \
    --config config/llm_config.json

# Relation validation
python scripts/10b_validate_relations_llm.py \
    data/08_relations.json \
    data/08_relations.validated.json \
    --config config/llm_config.json

Ground Truth Evaluation

Prepare Ground Truth

Create a ground truth JSON file (see truth_files/modbus.json for example).

Run with Evaluation

python main.py pcaps/ --tshark-filter mbtcp \
    --ground-truth-json truth_files/modbus.json

View Evaluation Results

Check output/protocol_report.html, the final evaluation section:

Message type matching (accuracy/F1)
Field boundary detection (accuracy/F1)
Semantic labeling (accuracy/F1)
Relation detection (accuracy/F1)
Overall score

Diagnostic Tools

Diagnose Neural Features

Analyze neural feature quality and detect collapsed latent spaces:

python scripts/diagnostics/20_diagnose_neural_features.py data/01_messages.jsonl \
    --sample-size 5000 \
    --model-path assets/pre_trained/industrial_VAE.pth \
    --latent-cache data/latent_cache.json

Output:

Latent space variance analysis
Separation metrics
Comparison with structural features
Recommendations

Test Enhanced Neural Features

Compare original vs enhanced neural features:

python scripts/diagnostics/21_test_enhanced_neural.py data/01_messages.jsonl \
    --sample-size 5000 \
    --model-path assets/pre_trained/industrial_VAE.pth

Test Boundary Detection

Test boundary detection with different thresholds:

python scripts/diagnostics/22_test_boundary_detection.py data/01_messages.jsonl \
    --assignments-json data/02_family_assignments.json \
    --features-json data/03_family_features.json

Test Learned Fusion

Test hybrid feature fusion methods:

python scripts/diagnostics/23_test_learned_fusion.py data/01_messages.jsonl \
    --model-path assets/pre_trained/industrial_VAE.pth

Test Boundary Refinement

Compute boundary quality metrics and test LLM refinement:

python scripts/diagnostics/24_test_boundary_refinement.py data/05_families.json \
    --messages-json data/01_messages.jsonl \
    --assignments-json data/02_family_assignments.json

Step-by-Step Execution

For debugging or custom workflows, run stages individually:

# Set Python path
export PYTHONPATH=src  # Windows: $env:PYTHONPATH="src"

# Stage 03: Extract messages
python scripts/03_extract_messages.py pcaps data/01_messages.jsonl \
    --extraction-method tshark \
    --tshark-filter mbtcp \
    --max-messages 200000

# Stage 04: Discover families
python scripts/04_discover_families.py data/01_messages.jsonl \
    data/02_family_assignments.json \
    --sample-size 100000 \
    --feature-mode raw_bytes

# Stage 05: Infer framing
python scripts/05_infer_framing.py data/01_messages.jsonl \
    data/02_family_assignments.json \
    data/04_framing.json

# Stage 06: Extract features
python scripts/06_extract_features.py data/01_messages.jsonl \
    data/03_family_features.json \
    --assignments-json data/02_family_assignments.json

# Stage 07: Infer boundaries
python scripts/07_infer_boundaries.py data/01_messages.jsonl \
    data/05_families.json \
    --assignments-json data/02_family_assignments.json \
    --features-json data/03_family_features.json \
    --framing-json data/04_framing.json \
    --enhanced \
    --max-fields 15

# Stage 08: Pair requests/responses
python scripts/08_pair_requests_responses.py data/01_messages.jsonl \
    data/06_pairs.json \
    --assignments-json data/02_family_assignments.json

# Stage 09: Infer discriminators
python scripts/09_infer_keywords.py data/01_messages.jsonl \
    data/07_keywords.json \
    --assignments-json data/02_family_assignments.json \
    --features-json data/03_family_features.json \
    --framing-json data/04_framing.json

# Stage 10: Infer relations
python scripts/10_infer_relations.py data/01_messages.jsonl \
    data/02_family_assignments.json \
    data/06_pairs.json \
    data/08_relations.json

# Stage 11: Infer semantics
python scripts/11_infer_semantics.py data/05_families.json \
    data/08_relations.json \
    data/09_semantics.json

# Stage 12: Build protocol model
python scripts/12_build_protocol_model.py data/05_families.json \
    data/10_protocol_model.json \
    --features-json data/03_family_features.json \
    --keywords-json data/07_keywords.json \
    --relations-json data/08_relations.json \
    --semantics-json data/09_semantics.json \
    --framing-json data/04_framing.json

# Stage 13: Evaluate pipeline
python scripts/13_evaluate_pipeline.py data/01_messages.jsonl \
    data/02_family_assignments.json \
    data/05_families.json \
    data/06_pairs.json \
    data/08_relations.json \
    data/11_evaluation.json \
    --semantics-json data/09_semantics.json

# Stage 14: Export LLM evidence
python scripts/14_export_llm_evidence.py data/10_protocol_model.json \
    data/12_llm_evidence.json \
    --evaluation-json data/11_evaluation.json

# Stage 15: Analyze with LLM
python scripts/15_analyze_with_llm.py data/12_llm_evidence.json \
    data/13_llm_analysis.json \
    --config config/llm_config.json \
    --prompt-out data/13_llm_prompt.md

# Stage 15b: Apply LLM refinement
python scripts/15b_apply_llm_refinement.py data/10_protocol_model.json \
    data/13_llm_analysis.json \
    data/10_protocol_model.refined.json \
    --evidence-json data/12_llm_evidence.json \
    --schema-json assets/schema/protocol_model.schema.json \
    --patches-out data/13_llm_patches.json \
    --validation-out data/13_llm_patch_validation.json

# Stage 16: Prepare evaluation data
python scripts/16_prepare_evaluation_data.py data/10_protocol_model.json \
    data/11_evaluation.json \
    data/13_llm_analysis.json \
    data/14_evaluation_model_data.json \
    --refined-protocol-model-json data/10_protocol_model.refined.json \
    --patch-validation-json data/13_llm_patch_validation.json

# Stage 17: Evaluate against ground truth
python scripts/17_evaluate_protocol_spec.py data/14_evaluation_model_data.json \
    truth_files/modbus.json \
    data/15_evaluation_result.json

# Stage 18: Export Markdown
python scripts/18_export_markdown.py data/10_protocol_model.refined.json \
    output/protocol_report.md \
    --evaluation-json data/11_evaluation.json \
    --llm-analysis-json data/13_llm_analysis.json \
    --final-evaluation-json data/15_evaluation_result.json

# Stage 19: Export HTML
python scripts/19_export_html.py data/10_protocol_model.refined.json \
    output/protocol_report.html \
    --evaluation-json data/11_evaluation.json \
    --llm-analysis-json data/13_llm_analysis.json \
    --final-evaluation-json data/15_evaluation_result.json

Troubleshooting

TShark Not Found

Error: tshark: command not found

Solution:

Install Wireshark (includes TShark)
Add TShark to PATH
Verify: tshark --version

No Messages Extracted

Error: No messages found in corpus

Possible causes:

Incorrect TShark filter
PCAP files don't contain matching traffic
Extraction method mismatch

Solutions:

Verify filter: tshark -r capture.pcap -Y "mbtcp" -T fields -e data
Try alternative extraction: --extraction-method tcp --service-port 502
Check PCAP contents: tshark -r capture.pcap

Poor Clustering Results

Symptoms:

Too few families
All messages in one cluster
Low silhouette score

Solutions:

Try different feature mode: --family-feature-mode raw_bytes
Diagnose neural features: python scripts/diagnostics/20_diagnose_neural_features.py
Adjust clustering parameters: --sample-size 50000
Check message diversity: ensure captures contain varied traffic

Over-Segmentation

Symptoms:

Too many 1-byte fields
Low boundary precision
Excessive field count (too many fields per family)

Solutions:

Reduce field limit: --boundary-max-fields 12
Use LLM refinement: --llm-config config/llm_config.json

LLM API Errors

Error: OpenAI API error: 401 Unauthorized

Solution:

Check API key: echo $OPENAI_API_KEY
Verify config: cat config/llm_config.json
Test API: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Error: Timeout waiting for LLM response

Solution:

Increase timeout: --llm-timeout 300
Reduce evidence size: --family-limit 10

Memory Issues

Error: MemoryError or system slowdown

Solutions:

Reduce message limit: --max-messages 50000
Reduce clustering sample: --sample-size 10000
Close other applications
Use 64-bit Python

Slow Performance

Symptoms:

Pipeline takes > 15 minutes for 200K messages
Stages hang or appear frozen

Solutions:

Check TShark performance: time the extraction stage
Reduce sample size: --sample-size 50000
Use raw_bytes mode (fastest): --family-feature-mode raw_bytes

ModuleNotFoundError

Error: ModuleNotFoundError: No module named 'protocol_re'

Solution: Set Python path before running individual scripts:

# Linux/Mac
export PYTHONPATH=src

# Windows PowerShell
$env:PYTHONPATH="src"

Note: main.py sets this automatically; only needed for individual scripts.

Configuration Reference

Main Pipeline Options

python main.py <pcap-dir> [OPTIONS]

Required (one of):
  --tshark-filter FILTER        TShark display filter (e.g., mbtcp)
  --extraction-method tcp       Use TCP port extraction
  --use-existing-messages       Skip extraction, use existing data/01_messages.jsonl

Extraction:
  --max-messages N              Maximum messages to extract (default: 200000)
  --service-port PORT           TCP port for extraction (with --extraction-method tcp)

Clustering:
  --family-feature-mode MODE    Feature mode: raw_bytes (default), structural, neural, hybrid
  --sample-size N               Clustering sample size (default: 100000)
  --family-neural-model-path    Path to neural model (default: assets/pre_trained/industrial_VAE.pth)

Boundaries:
  --boundary-max-fields N       Maximum fields per family (default: 15)

Layer Detection:
  --enable-layer-detection      Enable multi-layer protocol detection
  --layer-min-confidence N      Minimum confidence for layer detection (default: 0.6)

LLM:
  --llm-config FILE             LLM configuration file (default: config/llm_config.json)
  --llm-render-only             Skip LLM API calls, but render the prompts.
  --llm-template                LLM template file path to override default template for stage 15.
  --reuse-llm-responses         Before each LLM API call, reuse an existing stage result JSON response when present.
  --llm-boundary-confidence     Minimum confidence for LLM boundary merge suggestions (default: 0.6).
  --llm-semantic-confidence     Minimum confidence for LLM semantic labels (default: 0.5).
  --llm-relation-confidence     Minimum confidence for LLM relation validation (default: 0.7).

Evaluation:
  --ground-truth-json FILE      Ground truth protocol for evaluation

Other:
  --collect                     Collect PCAPs from source tree first

Next Step

Read Architecture for system design details

FilesExpand file tree

getting_started.md

Latest commit

History

getting_started.md

File metadata and controls

Getting Started

Table of Contents

Prerequisites

Installation

Step 1: Clone or Download the Repository

Step 2: Create a Virtual Environment

Step 3: Install Dependencies

Your First Analysis

Example 1: Analyze Modbus TCP Traffic

Usage Guide

Basic Usage

Advanced Usage

Collection and Deduplication

TCP Port Extraction (Alternative to TShark)

Enhanced Boundary Detection

Multi-Layer Protocol Detection

Message Limits

Feature Modes

Raw Bytes Mode

Structural Mode

Neural Mode

Hybrid Mode

LLM Integration

Setup

Run with LLM Refinement

LLM Options

Stage-Specific LLM Refinement

Ground Truth Evaluation

Prepare Ground Truth

Run with Evaluation

View Evaluation Results

Diagnostic Tools

Diagnose Neural Features

Test Enhanced Neural Features

Test Boundary Detection

Test Learned Fusion

Test Boundary Refinement

Step-by-Step Execution

Troubleshooting

TShark Not Found

No Messages Extracted

Poor Clustering Results

Over-Segmentation

LLM API Errors

Memory Issues

Slow Performance

ModuleNotFoundError

Configuration Reference

Main Pipeline Options

Next Step