This comprehensive guide covers installation, usage, configuration, and troubleshooting for the Protocol Reverse Engineering pipeline.
- Prerequisites
- Installation
- Your First Analysis
- Usage Guide
- Feature Modes
- LLM Integration
- Ground Truth Evaluation
- Diagnostic Tools
- Step-by-Step Execution
- Troubleshooting
- Configuration Reference
Before you begin, ensure you have:
-
Python 3.10 or higher
- Check:
python --versionorpython3 --version - Download: https://www.python.org/downloads/
- Check:
-
TShark (Wireshark CLI)
- Check:
tshark --version - Download: https://www.wireshark.org/download.html
- Note: Install Wireshark, which includes TShark
- Check:
-
PCAP files containing protocol traffic you want to analyze
git clone <repository-url>
cd protocol_re# Create virtual environment
python3 -m venv venv
# Activate it
# On Linux/Mac:
source venv/bin/activate
# On Windows (PowerShell):
venv\Scripts\activate
# On Windows (Command Prompt):
venv\Scripts\activate.batpip install -r requirements.txtDependencies installed:
numpy- Numerical computingscikit-learn- Machine learning and clusteringhdbscan- Hierarchical density-based clusteringscapy- Packet manipulation (optional extraction method)torch- PyTorch for neural featurescolorama- Colored terminal output
# Assuming PCAP file are in pcaps/ folder.
# Run the pipeline, without using LLM.
python main.py pcaps/ --tshark-filter mbtcp --llm-render-onlyWhat happens:
- Extracts Modbus TCP payloads from PCAPs
- Discovers message families using clustering
- Infers field boundaries
- Detects request/response pairs
- Assigns semantic labels to fields
- Generates comprehensive protocol specification
Output:
output/protocol_report.md- Human-readable specificationoutput/protocol_report.html- Interactive HTML reportdata/10_protocol_model.json- Machine-readable model
# Basic analysis
python main.py pcaps/ --tshark-filter <filter>
# Use existing messages (skip extraction)
python main.py --use-existing-messagesCommon TShark filters:
| Protocol | Filter | Description |
|---|---|---|
| Modbus TCP | mbtcp |
Modbus TCP protocol |
| S7comm | s7comm |
Siemens S7 communication |
| DNP3 | dnp3 |
DNP3 SCADA protocol |
| IEC 60870-5-104 | iec104 |
IEC 104 protocol |
| Custom TCP | tcp.port == 2000 |
TCP port 2000 |
| Custom UDP | udp.port == 2222 |
UDP port 2222 |
Find available filters:
tshark -G protocolspython main.py source_files/ --collect --tshark-filter mbtcpThis will:
- Collect all PCAPs from
source_files/intopcaps/ - Remove duplicate captures
- Run the full pipeline
python main.py pcaps/ --extraction-method tcp --service-port 502When to use:
- TShark filter is not available for your protocol
- You want to extract by TCP/UDP port only
- TShark is not installed
Options:
--boundary-max-fields 12- Limit maximum fields per family (default: 15)--enable-merging- Enable multi-pass segment merging
Impact:
- Reduces false positive boundaries
- Eliminates excessive 1-byte fields
python main.py pcaps/ --tshark-filter mbtcp --enable-layer-detectionOptions:
--enable-layer-detection- Enable layer detection--layer-min-confidence 0.7- Minimum confidence threshold
Use cases:
- Protocols with stable outer headers
- Transport framing + application payload
- Protocol tunneling scenarios
python main.py pcaps/ --tshark-filter mbtcp --max-messages 50000Default: 200,000 messages
When to adjust:
- Small captures: reduce for faster processing
- Large captures: increase for better coverage
- Memory constraints: reduce to limit memory usage
Use padded byte vectors:
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode raw_bytesPros:
- good accuracy on simple protocols
- Fast and deterministic
- No external dependencies
Use when:
- A trained neural model is not available
- Protocol has clear structural patterns
Use symbolic protocol features:
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode structuralPros:
- Protocol-agnostic feature extraction
- Interpretable features
Use when:
- You want to understand feature importance
- Raw bytes mode is not working well
Use VAE latent vectors:
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode neural --family-neural-model-path assets/pre_trained/industrial_VAE.pthPros:
- Can capture complex patterns
- Learned representations
Cons:
- May produce poor clustering (collapsed latent space)
- Requires PyTorch and trained model
Use when:
- A well-trained VAE model is available
- Protocol has complex, non-obvious patterns
Combine neural and structural features:
# Adaptive fusion (recommended)
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode hybrid --fusion-method adaptive --family-neural-model-path assets/pre_trained/industrial_VAE.pth
# Learned fusion with MLP
python main.py pcaps/ --tshark-filter mbtcp --family-feature-mode hybrid --fusion-method learned --family-neural-model-path assets/pre_trained/industrial_VAE.pth
# Fixed weights
python main.py pcaps/ --tshark-filter mbtcp \
--family-feature-mode hybrid \
--fusion-method fixed \
--fusion-neural-weight 0.3 \
--fusion-structural-weight 0.7 \
--family-neural-model-path assets/pre_trained/industrial_VAE.pthFusion methods:
adaptive- Quality-based automatic weighting (default)learned- MLP-based feature importance learningfixed- Manual weight specificationconcat- Simple concatenation
Features:
- Automatic neural collapse detection
- Fallback to structural features when neural fails
- Latent vector caching for speed
- Create
config/llm_config.json:
{
"api_key_required": "yes",
"openai_base_url": "https://api.openai.com/v1",
"model": "gpt-4o-mini",
"temperature": 0.1,
"max_tokens": 4000,
"timeout": 180,
"max_retries": 5,
"retry_delay_seconds": 2.0,
"max_retry_delay_seconds": 30.0,
"request_interval_seconds": 1.0
}- Set API key:
# Linux/Mac
export OPENAI_API_KEY=<your-api-key>
# Windows PowerShell
$env:OPENAI_API_KEY = "<your-api-key>"The runner tries to load config/llm_config.json from root folder.
python main.py pcaps/ --tshark-filter mbtcp# Custom prompt template
python main.py pcaps/ --tshark-filter mbtcp \
--llm-config config/llm_config.json \
--llm-template custom_prompt.md
# Render prompt only (no API call)
python main.py pcaps/ --tshark-filter mbtcp --llm-render-only
# Adjust LLM parameters (temperature, max_tokens, timeout are read from the config file only — edit config/llm_config.json to change them)
python main.py pcaps/ --tshark-filter mbtcp \
--llm-config config/llm_config.jsonRun individual LLM refinement stages:
# Boundary refinement
python scripts/07b_refine_boundaries_llm.py \
data/05_families.json \
data/05_families.refined.json \
--config config/llm_config.json
# Semantic labeling
python scripts/11b_label_semantics_llm.py \
data/05_families.json \
data/09_semantics.json \
data/09_semantics.llm.json \
--config config/llm_config.json
# Relation validation
python scripts/10b_validate_relations_llm.py \
data/08_relations.json \
data/08_relations.validated.json \
--config config/llm_config.jsonCreate a ground truth JSON file (see truth_files/modbus.json for example).
python main.py pcaps/ --tshark-filter mbtcp \
--ground-truth-json truth_files/modbus.jsonCheck output/protocol_report.html, the final evaluation section:
- Message type matching (accuracy/F1)
- Field boundary detection (accuracy/F1)
- Semantic labeling (accuracy/F1)
- Relation detection (accuracy/F1)
- Overall score
Analyze neural feature quality and detect collapsed latent spaces:
python scripts/diagnostics/20_diagnose_neural_features.py data/01_messages.jsonl \
--sample-size 5000 \
--model-path assets/pre_trained/industrial_VAE.pth \
--latent-cache data/latent_cache.jsonOutput:
- Latent space variance analysis
- Separation metrics
- Comparison with structural features
- Recommendations
Compare original vs enhanced neural features:
python scripts/diagnostics/21_test_enhanced_neural.py data/01_messages.jsonl \
--sample-size 5000 \
--model-path assets/pre_trained/industrial_VAE.pthTest boundary detection with different thresholds:
python scripts/diagnostics/22_test_boundary_detection.py data/01_messages.jsonl \
--assignments-json data/02_family_assignments.json \
--features-json data/03_family_features.jsonTest hybrid feature fusion methods:
python scripts/diagnostics/23_test_learned_fusion.py data/01_messages.jsonl \
--model-path assets/pre_trained/industrial_VAE.pthCompute boundary quality metrics and test LLM refinement:
python scripts/diagnostics/24_test_boundary_refinement.py data/05_families.json \
--messages-json data/01_messages.jsonl \
--assignments-json data/02_family_assignments.jsonFor debugging or custom workflows, run stages individually:
# Set Python path
export PYTHONPATH=src # Windows: $env:PYTHONPATH="src"
# Stage 03: Extract messages
python scripts/03_extract_messages.py pcaps data/01_messages.jsonl \
--extraction-method tshark \
--tshark-filter mbtcp \
--max-messages 200000
# Stage 04: Discover families
python scripts/04_discover_families.py data/01_messages.jsonl \
data/02_family_assignments.json \
--sample-size 100000 \
--feature-mode raw_bytes
# Stage 05: Infer framing
python scripts/05_infer_framing.py data/01_messages.jsonl \
data/02_family_assignments.json \
data/04_framing.json
# Stage 06: Extract features
python scripts/06_extract_features.py data/01_messages.jsonl \
data/03_family_features.json \
--assignments-json data/02_family_assignments.json
# Stage 07: Infer boundaries
python scripts/07_infer_boundaries.py data/01_messages.jsonl \
data/05_families.json \
--assignments-json data/02_family_assignments.json \
--features-json data/03_family_features.json \
--framing-json data/04_framing.json \
--enhanced \
--max-fields 15
# Stage 08: Pair requests/responses
python scripts/08_pair_requests_responses.py data/01_messages.jsonl \
data/06_pairs.json \
--assignments-json data/02_family_assignments.json
# Stage 09: Infer discriminators
python scripts/09_infer_keywords.py data/01_messages.jsonl \
data/07_keywords.json \
--assignments-json data/02_family_assignments.json \
--features-json data/03_family_features.json \
--framing-json data/04_framing.json
# Stage 10: Infer relations
python scripts/10_infer_relations.py data/01_messages.jsonl \
data/02_family_assignments.json \
data/06_pairs.json \
data/08_relations.json
# Stage 11: Infer semantics
python scripts/11_infer_semantics.py data/05_families.json \
data/08_relations.json \
data/09_semantics.json
# Stage 12: Build protocol model
python scripts/12_build_protocol_model.py data/05_families.json \
data/10_protocol_model.json \
--features-json data/03_family_features.json \
--keywords-json data/07_keywords.json \
--relations-json data/08_relations.json \
--semantics-json data/09_semantics.json \
--framing-json data/04_framing.json
# Stage 13: Evaluate pipeline
python scripts/13_evaluate_pipeline.py data/01_messages.jsonl \
data/02_family_assignments.json \
data/05_families.json \
data/06_pairs.json \
data/08_relations.json \
data/11_evaluation.json \
--semantics-json data/09_semantics.json
# Stage 14: Export LLM evidence
python scripts/14_export_llm_evidence.py data/10_protocol_model.json \
data/12_llm_evidence.json \
--evaluation-json data/11_evaluation.json
# Stage 15: Analyze with LLM
python scripts/15_analyze_with_llm.py data/12_llm_evidence.json \
data/13_llm_analysis.json \
--config config/llm_config.json \
--prompt-out data/13_llm_prompt.md
# Stage 15b: Apply LLM refinement
python scripts/15b_apply_llm_refinement.py data/10_protocol_model.json \
data/13_llm_analysis.json \
data/10_protocol_model.refined.json \
--evidence-json data/12_llm_evidence.json \
--schema-json assets/schema/protocol_model.schema.json \
--patches-out data/13_llm_patches.json \
--validation-out data/13_llm_patch_validation.json
# Stage 16: Prepare evaluation data
python scripts/16_prepare_evaluation_data.py data/10_protocol_model.json \
data/11_evaluation.json \
data/13_llm_analysis.json \
data/14_evaluation_model_data.json \
--refined-protocol-model-json data/10_protocol_model.refined.json \
--patch-validation-json data/13_llm_patch_validation.json
# Stage 17: Evaluate against ground truth
python scripts/17_evaluate_protocol_spec.py data/14_evaluation_model_data.json \
truth_files/modbus.json \
data/15_evaluation_result.json
# Stage 18: Export Markdown
python scripts/18_export_markdown.py data/10_protocol_model.refined.json \
output/protocol_report.md \
--evaluation-json data/11_evaluation.json \
--llm-analysis-json data/13_llm_analysis.json \
--final-evaluation-json data/15_evaluation_result.json
# Stage 19: Export HTML
python scripts/19_export_html.py data/10_protocol_model.refined.json \
output/protocol_report.html \
--evaluation-json data/11_evaluation.json \
--llm-analysis-json data/13_llm_analysis.json \
--final-evaluation-json data/15_evaluation_result.jsonError: tshark: command not found
Solution:
- Install Wireshark (includes TShark)
- Add TShark to PATH
- Verify:
tshark --version
Error: No messages found in corpus
Possible causes:
- Incorrect TShark filter
- PCAP files don't contain matching traffic
- Extraction method mismatch
Solutions:
- Verify filter:
tshark -r capture.pcap -Y "mbtcp" -T fields -e data - Try alternative extraction:
--extraction-method tcp --service-port 502 - Check PCAP contents:
tshark -r capture.pcap
Symptoms:
- Too few families
- All messages in one cluster
- Low silhouette score
Solutions:
- Try different feature mode:
--family-feature-mode raw_bytes - Diagnose neural features:
python scripts/diagnostics/20_diagnose_neural_features.py - Adjust clustering parameters:
--sample-size 50000 - Check message diversity: ensure captures contain varied traffic
Symptoms:
- Too many 1-byte fields
- Low boundary precision
- Excessive field count (too many fields per family)
Solutions:
- Reduce field limit:
--boundary-max-fields 12 - Use LLM refinement:
--llm-config config/llm_config.json
Error: OpenAI API error: 401 Unauthorized
Solution:
- Check API key:
echo $OPENAI_API_KEY - Verify config:
cat config/llm_config.json - Test API:
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models
Error: Timeout waiting for LLM response
Solution:
- Increase timeout:
--llm-timeout 300 - Reduce evidence size:
--family-limit 10
Error: MemoryError or system slowdown
Solutions:
- Reduce message limit:
--max-messages 50000 - Reduce clustering sample:
--sample-size 10000 - Close other applications
- Use 64-bit Python
Symptoms:
- Pipeline takes > 15 minutes for 200K messages
- Stages hang or appear frozen
Solutions:
- Check TShark performance: time the extraction stage
- Reduce sample size:
--sample-size 50000 - Use raw_bytes mode (fastest):
--family-feature-mode raw_bytes
Error: ModuleNotFoundError: No module named 'protocol_re'
Solution: Set Python path before running individual scripts:
# Linux/Mac
export PYTHONPATH=src
# Windows PowerShell
$env:PYTHONPATH="src"Note: main.py sets this automatically; only needed for individual scripts.
python main.py <pcap-dir> [OPTIONS]
Required (one of):
--tshark-filter FILTER TShark display filter (e.g., mbtcp)
--extraction-method tcp Use TCP port extraction
--use-existing-messages Skip extraction, use existing data/01_messages.jsonl
Extraction:
--max-messages N Maximum messages to extract (default: 200000)
--service-port PORT TCP port for extraction (with --extraction-method tcp)
Clustering:
--family-feature-mode MODE Feature mode: raw_bytes (default), structural, neural, hybrid
--sample-size N Clustering sample size (default: 100000)
--family-neural-model-path Path to neural model (default: assets/pre_trained/industrial_VAE.pth)
Boundaries:
--boundary-max-fields N Maximum fields per family (default: 15)
Layer Detection:
--enable-layer-detection Enable multi-layer protocol detection
--layer-min-confidence N Minimum confidence for layer detection (default: 0.6)
LLM:
--llm-config FILE LLM configuration file (default: config/llm_config.json)
--llm-render-only Skip LLM API calls, but render the prompts.
--llm-template LLM template file path to override default template for stage 15.
--reuse-llm-responses Before each LLM API call, reuse an existing stage result JSON response when present.
--llm-boundary-confidence Minimum confidence for LLM boundary merge suggestions (default: 0.6).
--llm-semantic-confidence Minimum confidence for LLM semantic labels (default: 0.5).
--llm-relation-confidence Minimum confidence for LLM relation validation (default: 0.7).
Evaluation:
--ground-truth-json FILE Ground truth protocol for evaluation
Other:
--collect Collect PCAPs from source tree first- Read Architecture for system design details