OwlerLite: Scope- and Freshness-Aware Browser-Based RAG

This repository contains the implementation of OwlerLite: Scope- and Freshness-Aware Web Retrieval for LLM Assistants.

Abstract

Browser-based language models often use retrieval-augmented generation (RAG) but typically rely on fixed, outdated indices that give users no control over which sources are consulted. This can lead to answers that mix trusted and untrusted content or draw on stale information. We present OwlerLite, a browser-based RAG system that makes user-defined scopes and data freshness central to retrieval. Users define reusable scopes (sets of web pages or sources) and select them when querying. A freshness-aware crawler monitors live pages, uses a semantic change detector to identify meaningful updates, and selectively re-indexes changed content. OwlerLite integrates text relevance, scope choice, and recency into a unified retrieval model. Implemented as a browser extension, it represents a step toward more controllable and trustworthy web assistants.

Keywords: Retrieval-Augmented Generation • Browser Extensions • Semantic Freshness • Knowledge Graph Retrieval • Explainable Information Retrieval

A browser-based RAG system that enables persistent, scoped retrieval over user-defined web collections with semantic freshness tracking and transparent provenance.

Overview

OwlerLite is a browser extension and backend service that provides scope-aware retrieval-augmented generation over curated web resources. Unlike traditional RAG systems that operate over static indices or web-search-enabled assistants that rely on ephemeral queries, OwlerLite maintains persistent, user-defined scopes and tracks semantic changes at the chunk level.

The system builds on concepts from OWLer, a collaborative open web crawler, but targets a different layer: persistent, scoped, and versioned corpora for browser-based RAG rather than large-scale general-purpose crawling.

OwlerLite browser extension: Configuration, Scope Management, and Query Interface

[DEMO] Using OwlerLite for RAG

OwlerLite can be explored as a scope- and freshness-aware RAG system in two ways: by installing the browser extension or by watching the recorded demo.

Browser Extension

Install the browser extension to create and manage your own scopes, add web resources, and query with full control over which sources are consulted. See the Installation section for setup instructions.

Video Demo

A recorded walkthrough of OwlerLite shows the end-to-end experience:

Choosing a model: Selecting from available LLM backends for answer generation
Selecting a scope: Picking which indexed collection to query against
Querying with scope exclusivity: The retrieval context is restricted exclusively to the selected scope, ensuring answers are grounded only in the chosen sources

live_demo.mp4

Walkthrough: model selection, scope selection, and scope-exclusive querying

Featured Scope

Scope	Description
MS MARCO	A freshly crawled and indexed partition of the original 3.5 million URLs from the MS MARCO dataset, continuously updated with semantic freshness tracking

How to Search with Scopes

As shown in the video, to query with a specific scope:

Type your query in the input field
Mention the scope using the @ syntax (e.g., @msmarco) to restrict retrieval to that scope
Submit your query — the system will only retrieve context from the selected scope

Example: Querying with @msmarco scope to retrieve context exclusively from the MS MARCO collection
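
The @ syntax above can be illustrated with a small parser. This is a minimal sketch, not the extension's actual code: the function name `parse_scoped_query`, the set of characters allowed in a scope name, and the lowercasing of names are all assumptions made for illustration.

```python
import re

# Assumed mention format: "@" at the start of a token, followed by
# letters, digits, "_" or "-". The real extension may accept other names.
SCOPE_MENTION = re.compile(r"(?<!\S)@([A-Za-z0-9_-]+)")


def parse_scoped_query(raw: str) -> tuple[str, list[str]]:
    """Split a raw query into (query text, mentioned scope names)."""
    scopes: list[str] = []
    for name in SCOPE_MENTION.findall(raw):
        name = name.lower()
        if name not in scopes:  # keep first-seen order, drop repeats
            scopes.append(name)
    # Remove the mentions and collapse the leftover whitespace.
    text = " ".join(SCOPE_MENTION.sub(" ", raw).split())
    return text, scopes


print(parse_scoped_query("@msmarco what is a knowledge graph?"))
```

Because the pattern requires the @ to start a token, an email address inside the query text is not mistaken for a scope mention.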

Key Features

Scope-Aware Retrieval

Explicit scope definition: Users create named scopes corresponding to sets of web resources
Multi-scope queries: Select one or more scopes at query time for focused retrieval
Scope fidelity: Retrieval prioritizes passages from selected scopes
Reusable configurations: Scopes persist across sessions and can be exported/imported

Semantic Freshness Tracking

Chunk-level change detection: SimHash fingerprints detect meaningful content changes
Selective re-ingestion: Only semantically updated chunks are re-indexed
Freshness signals: Retrieval scores incorporate recency and update patterns
Version lineage: Track how content evolves over time with diff views

Transparent Provenance

Score breakdowns: See semantic similarity, graph evidence, scope priors, and freshness contributions
Scope badges: Visual indicators of which collections contain each result
Version information: Timestamps and version identifiers for retrieved passages
Explanation interface: Detailed provenance for every answer

Privacy-Preserving Architecture

Local processing: All data remains on your infrastructure
Self-hosted backend: Complete control over crawling, indexing, and retrieval
No mandatory external services: Cloud LLM integration is optional and uses user-provided keys

Architecture

The system consists of three main subsystems as described in the research:

1. Freshness-Aware Crawler and Ingester

Located in services/crawler/, this component:

Monitors web resources associated with user-defined scopes
Extracts main content using DOM-based readability heuristics
Computes 64-bit SimHash signatures over 5-gram shingles for each chunk
Detects semantic changes by comparing signatures across versions
Selectively re-ingests only chunks that cross similarity thresholds
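
The signature step can be sketched as follows. This is an illustrative Python version, not the Go code in services/crawler/: it assumes word-level 5-grams (the text above does not say whether shingles are built from words or characters) and uses BLAKE2b as the 64-bit shingle hash, which is a choice made here for the example.

```python
import hashlib


def shingles(text: str, n: int = 5) -> list[str]:
    """Word-level n-gram shingles (assumed; could also be character-level)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simhash64(text: str, n: int = 5) -> int:
    """64-bit SimHash: per-bit majority vote over the hashed shingles."""
    counts = [0] * 64
    for sh in shingles(text, n):
        # 8-byte BLAKE2b digest as an illustrative 64-bit hash function.
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


old = simhash64("the crawler monitors web resources associated with user defined scopes")
new = simhash64("the crawler monitors web resources associated with user defined scopes daily")
print(hamming(old, new))
```

Comparing the Hamming distance between the old and new signature of a chunk is what lets the crawler decide whether the chunk needs to be re-ingested.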

2. LightRAG-Based Retrieval Backend

Implemented across services/orchestrator/ and services/lightrag/, featuring:

Vector store for dense passage embeddings
Knowledge graph with entities and relations
Scope and version metadata annotation
Hybrid scoring function combining semantic similarity, graph evidence, scope priors, and freshness
Filter-based candidate generation restricted to selected scopes
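
Filter-based candidate generation can be pictured as a metadata filter applied before scoring. The sketch below is a simplified assumption: the `Passage` record and its field names are hypothetical and do not reflect the orchestrator's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    """Hypothetical passage record; field names are illustrative only."""
    passage_id: str
    text: str
    scopes: set[str] = field(default_factory=set)


def filter_by_scopes(passages: list[Passage], selected: set[str]) -> list[Passage]:
    """Keep only passages that belong to at least one selected scope."""
    return [p for p in passages if p.scopes & selected]


corpus = [
    Passage("p1", "MS MARCO passage", {"msmarco"}),
    Passage("p2", "Python docs passage", {"python-docs"}),
]
print([p.passage_id for p in filter_by_scopes(corpus, {"msmarco"})])
```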

3. Browser Extension Frontend

Powered by services/extension/, providing:

Sidebar interface for conversational queries
Scope management (create, edit, delete, export/import)
Configuration interface for API keys and backend settings
Statistics dashboard showing crawl status and freshness metrics
Real-time backend connection monitoring

Installation

Prerequisites

Docker and Docker Compose
Go 1.22+ (for local development)
Modern web browser (Chrome, Firefox, or Edge)
OpenAI API key (or compatible LLM endpoint)

Setup

Clone the repository

git clone https://github.com/your-org/owlerlite
cd owlerlite

Configure LightRAG

Edit services/lightrag/.env.example with your API credentials:

LLM_BINDING_HOST=https://api.openai.com/v1
LLM_BINDING_API_KEY=your-openai-api-key
LLM_MODEL=gpt-4o-mini

EMBEDDING_BINDING_HOST=https://api.openai.com/v1
EMBEDDING_BINDING_API_KEY=your-openai-api-key
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIM=1536

Build and start backend services

make setup  # Generate protobuf files and dependencies
make build  # Build all Docker images
make up     # Start all services

Install browser extension

Navigate to services/extension/:

Firefox:
- Open about:debugging#/runtime/this-firefox
- Click "Load Temporary Add-on"
- Select services/extension/dist/manifest.json

Chrome/Edge:
- Navigate to chrome://extensions/ or edge://extensions/
- Enable "Developer mode"
- Click "Load unpacked"
- Select the services/extension/dist folder

Configure the extension

- Click the OwlerLite extension icon
- Enter the API endpoint: http://localhost:7001
- Add your LLM API keys (same as backend configuration)
- Test the connection to verify backend availability

Usage

Basic Operation

Define scopes: Click the extension icon → Create new scope → Add URL patterns or individual pages
Add resources: Use the "Add Page" feature to include the current browser tab in a scope
Query with scopes: Open sidebar (Ctrl+Shift+O) → Select scopes → Type your question
Review results: See ranked passages with score breakdowns and freshness indicators
Track changes: Monitor semantic updates through version information and diffs

Advanced Features

Scope Management

Create scopes with URL pattern matching (e.g., https://docs.python.org/*)
Enable auto-tracking to automatically add visited pages to matching scopes
Export/import scope configurations for backup or sharing
View scope statistics: pattern count, indexed pages, freshness status
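
URL pattern matching for scopes can be approximated with shell-style wildcards. This is a sketch under assumptions: the extension's real pattern syntax is not documented here, and `matching_scopes` is a hypothetical helper. With `fnmatchcase`, `*` matches any run of characters (including `/`), while `?` and `[` are also treated as wildcard syntax, so patterns containing a literal query string would need escaping.

```python
from fnmatch import fnmatchcase


def matching_scopes(url: str, scope_patterns: dict[str, list[str]]) -> list[str]:
    """Return the names of scopes with at least one pattern matching the URL."""
    return [
        name
        for name, patterns in scope_patterns.items()
        if any(fnmatchcase(url, pattern) for pattern in patterns)
    ]


scopes = {
    "python-docs": ["https://docs.python.org/*"],
    "go-docs": ["https://go.dev/doc/*"],
}
print(matching_scopes("https://docs.python.org/3/library/re.html", scopes))
```

A check like this is also what auto-tracking would need in order to decide which scopes a visited page belongs to.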

Conversational Interface

Chat-like sidebar interface for natural interaction
Results appear inline within the conversation
Context-aware follow-up questions
Expandable score breakdowns and explanations

System Monitoring

Real-time crawl queue status
Active crawls and pending updates count
Recent activity log
Freshness overview per scope
Backend connection health indicator

Technical Implementation

Core Technologies

Backend: Go 1.22+, Docker, gRPC, URLFrontier protocol
Frontend: TypeScript, Web Extensions API (Manifest V3), Next.js (Web UI)
RAG Framework: LightRAG (vector + knowledge graph)
Semantic Hashing: SimHash with 5-gram shingles
Storage: Vector store and graph database via LightRAG

Semantic Freshness Model

The system implements chunk-level freshness detection with two-threshold classification:

Change Detection: For each chunk pair (c_old, c_new):

σ(c_old, c_new) = 1 - Hamming(s(c_old), s(c_new)) / 64

Where s(c) is the 64-bit SimHash signature.

Thresholds:

τ₁ = 0.97 (approximately Hamming ≤ 2): Chunks with σ ≥ τ₁ are treated as unchanged, so re-indexing is skipped
τ₂ = 0.90 (approximately Hamming ≥ 6): Chunks with σ ≤ τ₂ are treated as semantically updated and are always re-indexed
Intermediate band (τ₂ < σ < τ₁): Embedding-based SemDeDup decides whether to re-index
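
The two-threshold rule can be sketched as a small classifier over 64-bit signatures. The function names and the returned labels are assumptions for illustration, and the embedding-based check for the intermediate band is left out. Note that the thresholds are rounded: a 2-bit difference gives σ = 1 - 2/64 = 0.96875 and a 6-bit difference gives σ = 0.90625, so with the literal values 0.97 and 0.90 both fall into the intermediate band, and an implementation may prefer to compare Hamming distances directly.

```python
TAU_1 = 0.97  # sigma at or above this: chunk is treated as unchanged
TAU_2 = 0.90  # sigma at or below this: chunk is treated as updated


def similarity(sig_old: int, sig_new: int) -> float:
    """sigma = 1 - Hamming(s_old, s_new) / 64 for 64-bit signatures."""
    return 1.0 - bin(sig_old ^ sig_new).count("1") / 64


def classify(sig_old: int, sig_new: int,
             tau_1: float = TAU_1, tau_2: float = TAU_2) -> str:
    """Return 'unchanged', 'updated', or 'ambiguous' (label names assumed)."""
    sigma = similarity(sig_old, sig_new)
    if sigma >= tau_1:
        return "unchanged"   # skip re-indexing
    if sigma <= tau_2:
        return "updated"     # always re-index
    return "ambiguous"       # defer to an embedding-based comparison


print(classify(0x0, 0x1))    # signatures differing in 1 bit
print(classify(0x0, 0x7F))   # signatures differing in 7 bits
```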

Retrieval Objective

Scope- and freshness-aware scoring:

h(q,p) = α·sim_vec(q,p) + (1-α)·score_graph(q,p) + β·log g(p; S_q) + δ·fresh(p)

Where:

sim_vec: Semantic similarity (vector search)
score_graph: Graph-based evidence (entity/relation paths)
g(p; S_q): Scope prior (1.0 if p in selected scopes, γ otherwise)
fresh(p): Freshness signal (exponential decay from last update)

Default parameters: α=0.8, β=0.2, δ=0.15, γ=0.1
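
The scoring function can be written out directly with the default parameters. This is a sketch under stated assumptions: the logarithm is taken to be the natural log, and the freshness signal is modeled as exponential decay with a 30-day half-life, a value chosen here for illustration because the decay rate is not given above.

```python
import math


def freshness(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay in (0, 1]; the 30-day half-life is an assumed value."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def hybrid_score(sim_vec: float, score_graph: float, in_scope: bool,
                 age_days: float, alpha: float = 0.8, beta: float = 0.2,
                 delta: float = 0.15, gamma: float = 0.1) -> float:
    """h(q,p) = alpha*sim_vec + (1-alpha)*score_graph + beta*log g + delta*fresh."""
    scope_prior = 1.0 if in_scope else gamma  # g(p; S_q)
    return (alpha * sim_vec
            + (1 - alpha) * score_graph
            + beta * math.log(scope_prior)  # natural log assumed
            + delta * freshness(age_days))


in_scope = hybrid_score(0.7, 0.5, in_scope=True, age_days=0)
out_of_scope = hybrid_score(0.7, 0.5, in_scope=False, age_days=0)
print(round(in_scope, 4), round(out_of_scope, 4))
```

With these defaults an in-scope passage has a scope term of zero (log 1 = 0), while an out-of-scope passage is penalized by β·ln γ = 0.2 · ln 0.1, roughly -0.46, regardless of its similarity or freshness.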

Privacy Engineering

Self-hosted infrastructure: all components run locally or on user-controlled servers
No external data transmission: user data never leaves configured boundaries
Configurable API endpoints: choose cloud or local LLM providers
Transparent operation: full visibility into crawling, indexing, and retrieval

New Features

Web UI (Alternative to Browser Extension)

A full-featured Next.js web interface at http://localhost:3000:

Query interface with scope selection chips
Score breakdown visualization showing semantic, graph, scope, and freshness contributions
Version diff modal with highlighted additions/removals
Crawler stats dashboard showing pages, chunks, and versions

Natural Language Rationale

Query responses optionally include LLM-generated explanations of why results were ranked as shown. Enable with generate_nl: true in query requests.

Evaluation Harness

Offline evaluation matching the paper's TREC 2024 RAG protocol:

cd evaluation
pip install -r requirements.txt
python run_eval.py

Implements:

SF@k (Scope Fidelity): Fraction of top-k results from target scope
SL@k (Scope Leakage): Fraction of top-k results outside target scope
NDCG@10: Normalized DCG with graded relevance
Synthetic scope clustering: K-means on document embeddings
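
The scope metrics and NDCG can be sketched in a few lines. This is not the code in run_eval.py: function names are hypothetical, the fractions are taken over the results actually returned when fewer than k are available, and NDCG uses linear gain with a log2 discount, which is one common formulation among several.

```python
import math


def scope_fidelity(ranked_scopes: list[str], target: str, k: int) -> float:
    """SF@k: fraction of the top-k results that come from the target scope."""
    top = ranked_scopes[:k]
    return sum(s == target for s in top) / len(top) if top else 0.0


def scope_leakage(ranked_scopes: list[str], target: str, k: int) -> float:
    """SL@k: fraction of the top-k results from outside the target scope."""
    top = ranked_scopes[:k]
    return sum(s != target for s in top) / len(top) if top else 0.0


def ndcg_at_k(gains: list[float], k: int = 10, ideal_gains=None) -> float:
    """NDCG@k with graded gains; ideal_gains defaults to the ranked gains,
    which is a simplification when not all judged documents are retrieved."""
    def dcg(values):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values[:k]))

    pool = gains if ideal_gains is None else ideal_gains
    ideal = dcg(sorted(pool, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


ranking = ["msmarco", "msmarco", "other", "msmarco"]
print(scope_fidelity(ranking, "msmarco", 4), scope_leakage(ranking, "msmarco", 4))
```

Under these definitions SF@k and SL@k sum to 1 for any non-empty result list.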

robots.txt Compliance

The crawler respects robots.txt rules with per-host caching.
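
Per-host caching of robots.txt rules can be sketched with Python's standard `urllib.robotparser`. This is an illustration, not the Go crawler's implementation: the `RobotsCache` class, the injected `fetch_text` callable, and the user agent string `OwlerLiteBot` are all hypothetical, and cache expiry is omitted.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one parsed robots.txt per host (hypothetical helper)."""

    def __init__(self, fetch_text, user_agent: str = "OwlerLiteBot"):
        # fetch_text(url) -> str is injected so the cache can run offline;
        # the user agent name is an assumption for this example.
        self._fetch_text = fetch_text
        self._user_agent = user_agent
        self._parsers: dict[str, RobotFileParser] = {}

    def allowed(self, url: str) -> bool:
        """Check a URL against its host's rules, fetching them at most once."""
        parts = urlsplit(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(host)
        if parser is None:
            parser = RobotFileParser()
            parser.parse(self._fetch_text(f"{host}/robots.txt").splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(self._user_agent, url)


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP fetch, returning a fixed robots.txt body."""
    return "User-agent: *\nDisallow: /private/\n"


cache = RobotsCache(fake_fetch)
print(cache.allowed("https://example.org/docs/page"))
print(cache.allowed("https://example.org/private/page"))
```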

Development

Project Structure

owlerlite/
├── services/
│   ├── api/                  # REST API service
│   │   ├── main.py           # API service entry point
│   │   ├── requirements.txt  # Python dependencies
│   │   └── Dockerfile
│   ├── crawler/              # Web content fetching and change detection
│   │   ├── main.go           # Crawler service entry point
│   │   └── Dockerfile
│   ├── frontier/             # URL queue management (URLFrontier protocol)
│   │   ├── main.go           # Frontier service entry point
│   │   ├── proto/            # gRPC protocol definitions
│   │   └── Dockerfile
│   ├── orchestrator/         # Query routing and scope management
│   │   ├── main.go           # Orchestrator service entry point
│   │   └── Dockerfile
│   ├── lightrag/             # RAG backend (vector + graph)
│   │   └── Dockerfile
│   ├── extension/            # Browser extension
│   │   ├── src/
│   │   │   ├── sidebar.*     # Main query interface
│   │   │   ├── popup.*       # Configuration interface
│   │   │   ├── background.js # Extension orchestration
│   │   │   └── manifest.json # Extension metadata
│   │   ├── build.sh          # Build script
│   │   └── dist/             # Built extension
│   └── ui/                   # Web UI (Next.js)
│       └── app/
├── Makefile                  # Build and deployment automation
├── docker-compose.yml        # Service orchestration
└── README.md

License

See project documentation for license information.

Acknowledgments

This work builds on OWLer, developed in the OpenWebSearch.eu project, and LightRAG from the HKU Data Intelligence Lab.
