Skip to content

norriy0u/geoshield-env

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

6 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

title GeoShield
colorFrom blue
colorTo green
sdk docker
pinned false
tags
openenv

GeoShield: Autonomous Intelligence Triage Engine

OpenEnv HuggingFace Docker

Hackathon Submission
Developed for the 2026 Meta ร— HuggingFace ร— Scaler OpenEnv competition. All automated Phase 1 and Phase 2 evaluations successfully passed. Prepared for Phase 3 human review.

Tip
Live deployed version: https://norriy0u-geoshield-env.hf.space


Overview

GeoShield is a real-world OpenEnv environment where an AI agent acts as a Defense Zone Commander. The agent receives textual intelligence reports โ€” simulating the output of a Vision-Language Model (like Meta's Llama Vision + SAM 2) analyzing satellite imagery โ€” and must make fast, accurate strategic decisions.

This environment models a genuine operational challenge faced by modern defense analysts: processing high volumes of satellite intelligence, filtering false alarms, classifying threats, allocating scarce reconnaissance assets, and unmasking covert operations hidden behind civilian cover stories.

Key Features

  • Procedural Case Generation: Unlimited seed-deterministic cases via template-based synthesis โ€” no memorization risk
  • Multi-Step Episodes: All 4 tasks support intel gathering before terminal decisions
  • Deterministic Heuristic Grading: Zero-LLM, fully reproducible across millions of runs
  • Levenshtein Similarity Matching: Cover story identification scored via edit distance
  • Negation-Aware Keyword Filtering: Prevents trivial gaming ("not a threat" won't score)
  • Report-Reference Scoring: Reasoning that references specific case details scores higher than boilerplate
  • False Positive Penalties: Harsh penalties for classifying legitimate facilities as covert
  • Red Herring Cases: Task 4 includes deceptively suspicious legitimate facilities

๐Ÿ’ก Why This Problem?

Borders and defense zones span thousands of kilometers of remote terrain. Human analysts suffer from fatigue, scale limitations, and cognitive overload when monitoring continuous satellite feeds.

Real-world parallels:

  • Project Maven (Pentagon): AI system that scans drone and satellite footage to flag targets for human analysts
  • Palantir & Anduril: Multi-billion dollar defense tech companies whose core product is taking sensor data (satellites, drones) and using AI to help commanders make fast decisions
  • NATO STANAG 3596: Standard for imagery intelligence reporting that our observation format is modeled after
  • NGA (National Geospatial-Intelligence Agency): Employs thousands of analysts who perform exactly this kind of satellite imagery triage daily

GeoShield trains agents to perform the Strategic Command Interface layer โ€” the decision-making system that sits on top of vision models and converts raw intelligence into actionable commands.


๐Ÿ— Agent Loop Architecture

flowchart LR
    classDef input fill:#0f172a,stroke:#334155,color:#f8fafc,rx:8px,ry:8px
    classDef env fill:#1e3a8a,stroke:#3b82f6,color:#eff6ff,rx:8px,ry:8px
    classDef agent fill:#4c1d95,stroke:#8b5cf6,color:#f5f3ff,rx:8px,ry:8px
    classDef grader fill:#064e3b,stroke:#10b981,color:#ecfdf5,rx:8px,ry:8px

    subgraph Reality
        S["๐Ÿ›ฐ๏ธ Satellite Sensors"]:::input -->|"Raw Image"| V["๐Ÿ” Vision Model"]:::input
    end

    subgraph GeoShield Simulation
        V -->|"Intelligence Text"| E["๐Ÿ›ก๏ธ GeoShield Env"]:::env
        E <-->|"State Observation <br> vs <br> Strategic Action"| A["๐Ÿค– AI Agent (LLM)"]:::agent
        E -->|"Evaluate Action"| G["โš–๏ธ Deterministic Grader"]:::grader
        G -.->|"Reward (0.0 - 1.0)"| E
    end
Loading

๐Ÿ“Š Baseline Inference Leaderboard

Agent Task 1 (Easy) Task 2 (Medium) Task 3 (Hard) Task 4 (Ultra) Overall
LLM Agent (Qwen2.5-72B) 0.85 0.76 0.65 0.55 0.70
Rules Agent 0.66 0.67 0.57 0.60 0.63
Random Agent 0.46 0.36 0.18 0.26 0.32

Note: Task 4 (Ultra) includes red-herring cases โ€” facilities that look suspicious but are legitimate. This prevents simple "flag everything as covert" strategies and genuinely challenges frontier models.


๐Ÿ”„ Procedural Case Generation

GeoShield uses template-based procedural generation to create unlimited unique cases from seed values. This eliminates memorization risk while maintaining full reproducibility.

Feature Description
Deterministic Same seed โ†’ same case, every time
Unlimited Any integer seed produces a valid case
Diverse 20+ regions, 30+ facility names, 20+ threat indicators, 20+ cover stories
Anti-memorization Cases are composited from template fragments, not stored
Difficulty-aware Templates produce easy, medium, hard, and ultra cases at controlled ratios
# Generate the same case anywhere, anytime
from src.geoshield.server.procedural_generator import generate_procedural_case

case = generate_procedural_case(task_id=4, seed=42)
# Always returns the identical case for seed=42

Static JSONL data files (240 cases across 8 splits) are included as fallback.


๐ŸŽฏ Tasks & Difficulty Progression

Task 1 โ€” False Alarm Detection (Easy) โ€” Multi-Step

Objective: Classify satellite reports as false alarms or real threats requiring review.

Field Value
Difficulty Easy
Max Steps 2
Actions ignore, flag_for_review, request_context
Score Ceiling 0.95
Multi-Step Step 1: gather additional context โ†’ Step 2: decide

New: Agents can request_context to receive additional intelligence before making their classification. Intel-gathering agents receive a small score bonus.


Task 2 โ€” Threat Classification & Severity (Medium) โ€” Multi-Step

Objective: Classify the anomaly type AND assign a threat level from 1โ€“10.

Field Value
Difficulty Medium
Max Steps 3
Actions troop_movement, illegal_construction, unauthorized_aircraft, weapons_cache, civilian_activity, request_analysis
Extra Fields threat_level (int 1โ€“10)
Score Ceiling 0.85
Multi-Step Step 1: request sensor analysis โ†’ Step 2: classify + rate severity

Partial credit: Related classifications get 0.45 (e.g. weapons_cache for troop_movement). Threat level within ยฑ1 โ†’ 0.80 score.


Task 3 โ€” Multi-Zone Drone Allocation (Hard)

Objective: Analyze 3 sectors, optionally investigate one, then deploy ONE drone with strategic reasoning.

Field Value
Difficulty Hard
Max Steps 6
Actions deploy_to_sector_a/b/c, investigate_sector_a/b/c
Extra Fields reasoning (str)
Score Ceiling 0.80
Reward 0.5 ร— sector selection + 0.5 ร— reasoning quality

Reasoning scoring (enhanced):

  • Length bonuses (+0.08 at 20 chars, +0.08 at 80, +0.07 at 150)
  • Strategic keyword hits with negation filtering (+0.08 at 3+, +0.07 at 5+)
  • Report-reference scoring: mentions of specific case details score higher
  • Diversity penalty: repeated phrases reduce score
  • Causal language bonus ("because", "therefore", "evidence shows")
  • Investigation before deployment earns a bonus

Task 4 โ€” Covert Operation Detection (Ultra)

Objective: Identify facilities using civilian cover stories to hide military activity.

Field Value
Difficulty Ultra
Max Steps 4
Actions covert_operation, legitimate_activity, request_verification
Extra Fields cover_story_identified, deception_type, reasoning
Score Ceiling 0.75
Reward 0.40 ร— classification + 0.25 ร— cover story + 0.15 ร— deception type + 0.20 ร— reasoning

Advanced grading features:

  • Levenshtein similarity: Cover story matching uses edit distance (60%) + keyword overlap (40%)
  • False positive penalty: Calling a legitimate facility "covert" caps score at 0.25
  • Red herring resistance: Ultra-difficulty includes deceptively suspicious legitimate facilities
  • Deception types: civilian_military, commercial_weapons, construction_fortification, logistics_supply, research_weapons

๐ŸŽจ Premium Tactical Dashboard

GeoShield includes a fully custom, brutalist C2 (Command and Control) frontend designed to showcase agent decision-making to human reviewers.

Key UI Highlights:

  • Animated Dashboard: Interactive SVG score gauge, live terminal event logger, and dynamically rendered intelligence sector cards.
  • Audio Feedback: Integrated Web Audio API tactical sounds (success chords, failure tones, uplink beeps) with a global mute toggle.
  • High-Fidelity Landing Page: Animated radar sweep SVG graphics, glowing bento grids, and responsive dark-mode architecture.
  • Zero-Dependency Frontend: Pure HTML, CSS, and Vanilla JavaScript (app.js) to ensure lightning-fast loading within HuggingFace Spaces.

๐Ÿ”ฌ Reward Evaluation (Deterministic Heuristic Graders)

All grading is performed via deterministic heuristics with zero LLM dependency โ€” fully reproducible across millions of inference runs.

Grading Features

Feature Description
Score ceilings Each difficulty tier caps maximum score (easyโ‰ค0.95, mediumโ‰ค0.85, hardโ‰ค0.80, ultraโ‰ค0.75)
Negation filtering Keywords in negated context ("not a threat") are filtered out to prevent gaming
Levenshtein matching Cover story identification uses edit distance for nuanced similarity
Report-reference scoring Reasoning that references specific case details (coordinates, sector names, threat types) scores higher than generic strategic language
Partial credit Related classifications and near-correct threat levels receive proportional scores
Reasoning quality Length, keyword density, sentence structure, causal language, and specificity are scored
Diversity penalty Highly repetitive/boilerplate reasoning is penalized via trigram analysis
Wrong-answer caps Totally wrong classification caps total score regardless of reasoning
False positive penalty Classifying legitimate as covert is penalized harshly (capped at 0.25)
Intel-gathering bonus Agents that gather intel before deciding earn a small score bonus
Minimum response length Tasks 3 & 4 require minimum reasoning length for full credit

Rewards are clamped strictly to (0.02, 0.98) โ€” never hitting exact boundaries.


๐Ÿ“ Observation Space

class GeoObservation(BaseModel):
    task_id: int                               # 1, 2, 3, or 4
    case_id: str                               # unique case identifier
    step: int                                  # current step number
    difficulty: str                            # "easy" | "medium" | "hard" | "ultra"
    report: Optional[str]                      # intelligence report text (Tasks 1, 2, 4)
    context: Optional[str]                     # situational context (Tasks 1, 2, 4)
    sectors: Optional[List[SectorReport]]      # multi-sector reports (Task 3)
    available_actions: List[str]               # valid actions for this task
    available_assets: Optional[str]            # available assets (Task 3)
    hint: Optional[str]                        # task instruction hint
    investigation_results: Optional[dict]      # results of investigate action (Task 3)
    steps_remaining: Optional[int]             # steps left in episode
class SectorReport(BaseModel):
    sector_id: str          # "sector_a" | "sector_b" | "sector_c"
    summary: str            # intelligence summary text
    anomaly_type: str       # detected anomaly category
    confidence: float       # detection confidence 0.0โ€“1.0
    coordinates: str        # geographic coordinates
    timestamp: str          # UTC timestamp

โšก Action Space

class GeoShieldAction(BaseModel):
    action: str                          # required โ€” the primary decision
    threat_level: Optional[int]          # Task 2 โ€” severity 1-10
    target_sector: Optional[str]         # Task 3 โ€” chosen sector id
    reasoning: Optional[str]             # Tasks 3 & 4 โ€” strategic reasoning
    cover_story_identified: Optional[str] # Task 4 โ€” civilian cover being used
    deception_type: Optional[str]        # Task 4 โ€” category of deception

๐Ÿš€ Quick Start

Use the Live HF Space

# Health check
curl -X GET https://norriy0u-geoshield-env.hf.space/health

# List tasks
curl -X GET https://norriy0u-geoshield-env.hf.space/tasks

# Start episode
curl -X POST https://norriy0u-geoshield-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": 1, "seed": 42}'

# Multi-step: gather intel first
curl -X POST https://norriy0u-geoshield-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": "request_context", "session_id": "<your_session_id>"}'

# Then decide
curl -X POST https://norriy0u-geoshield-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": "flag_for_review", "session_id": "<your_session_id>"}'

Full Episode Example (Task 4 โ€” Covert Operation)

# Step 1 โ€” Reset
curl -X POST https://norriy0u-geoshield-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": 4, "seed": 42}'

# Step 2 โ€” Request verification (gather intel)
curl -X POST https://norriy0u-geoshield-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": "request_verification", "session_id": "<sid>"}'

# Step 3 โ€” Final decision with reasoning
curl -X POST https://norriy0u-geoshield-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": "covert_operation", "cover_story_identified": "commercial drone delivery hub", "deception_type": "civilian_military", "reasoning": "Drone range and payload exceeds commercial specs. Military-grade encrypted communications detected in control room.", "session_id": "<sid>"}'

Local Development

git clone https://github.com/norriy0u/geoshield-env
cd geoshield-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

Docker

docker build -t geoshield .
docker run -p 7860:7860 geoshield

Run Inference

export HF_TOKEN=your_token_here
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export ENV_URL=http://localhost:7860
python inference.py

Run Tests

python -m pytest tests/ -v
# 37 tests โ€” all passing

๐Ÿ“ก API Endpoints

Endpoint Method Description
/reset POST Start new episode: {"task_id": 1, "seed": 42}
/step POST Submit action, get reward
/state GET/POST Get current episode state
/info GET Full environment documentation + features
/tasks GET List all tasks with descriptions
/health GET Health check + feature flags

๐Ÿ—‚ Project Structure

geoshield-env/
โ”œโ”€โ”€ Dockerfile
โ”œโ”€โ”€ requirements.txt
โ”œโ”€โ”€ openenv.yaml
โ”œโ”€โ”€ pyproject.toml
โ”œโ”€โ”€ README.md
โ”œโ”€โ”€ inference.py                    # Baseline inference script (multi-step)
โ”œโ”€โ”€ server/
โ”‚   โ””โ”€โ”€ app.py                     # FastAPI server handling API and Static UI
โ”œโ”€โ”€ static/                        # ๐ŸŽจ Frontend Tactical Dashboard
โ”‚   โ”œโ”€โ”€ index.html                 # Main dashboard UI
โ”‚   โ”œโ”€โ”€ landing.html               # Entry page with animated radar
โ”‚   โ”œโ”€โ”€ style.css                  # UI Design System
โ”‚   โ”œโ”€โ”€ landing.css                # Landing page UI 
โ”‚   โ””โ”€โ”€ app.js                     # Core interaction and WebAudio logic
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ geoshield/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ models.py              # Pydantic models
โ”‚       โ”œโ”€โ”€ constants.py           # Task configs & keywords
โ”‚       โ””โ”€โ”€ server/
โ”‚           โ”œโ”€โ”€ __init__.py
โ”‚           โ”œโ”€โ”€ environment.py     # Core OpenEnv environment (multi-step)
โ”‚           โ”œโ”€โ”€ graders.py         # Deterministic heuristic graders (Levenshtein)
โ”‚           โ”œโ”€โ”€ generators.py      # Hybrid static/procedural sampling
โ”‚           โ””โ”€โ”€ procedural_generator.py  # ๐Ÿ†• Template-based case synthesis
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ task{1-4}_train.jsonl      # 30 cases each (static fallback)
โ”‚   โ””โ”€โ”€ task{1-4}_eval.jsonl       # 30 cases each (static fallback)
โ””โ”€โ”€ tests/
    โ””โ”€โ”€ test_smoke.py              # 37 tests โ€” endpoints, graders, procedural gen

๐Ÿ“ฆ Data

Primary: Procedural generation (unlimited seed-deterministic cases per task).
Fallback: 240 static cases across 8 splits (4 tasks ร— train/eval ร— 30 cases each).

Difficulty Design Philosophy
Easy Solvable by keyword matching
Medium Requires contextual reasoning
Hard Involves deliberate ambiguity that challenges frontier LLMs
Ultra Requires multi-signal deception analysis + red herring resistance

๐Ÿ”ฎ Future Roadmap

  • Temporal Memory & Intelligence Layers: Moving beyond episodic triage to instantiate continuous RL loops where agents must maintain a persistent "threat memory bank" across ongoing operations.
  • Map-Based Spatial Reasoning: Expanding the sector system into an explicit coordinate grid network. Agents will have to reason spatially, recognizing that a troop buildup in Grid A3 is a flanking maneuver against Sector B2.
  • Multi-Agent Red-Teaming: Evolving from single-agent analysis to a multi-agent adversarial framework. A generative 'Red Team' agent actively spawns coordinated deception campaigns while the 'Blue Team' analyst attempts to triage them.
  • Semantic Grading: Lightweight sentence-embedding metrics for reasoning evaluation while strictly avoiding LLM-as-judge unreliability.

๐Ÿ“š References


Citation

@software{geoshield2026,
  title = {GeoShield: Satellite Intelligence Triage Environment for RL Agent Evaluation},
  author = {norriy0u},
  year = {2026},
  url = {https://huggingface.co/spaces/norriy0u/geoshield-env},
  note = {Procedural multi-task environment with Levenshtein grading for defense analyst workflow automation}
}

License

MIT License โ€” open for research and agent evaluation use.

About

An OpenEnv environment where AI agents triage satellite intelligence reports, classify threats, and make real-time defense decisions.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors