🛡 CogniShield: Where AI Confidence Meets Accountability

An AI Integrity Layer for Flattery Detection, Prompt Guidance & Real-Time Fact Verification.

The Problem

Modern large language models (LLMs), such as those behind ChatGPT, have a critical flaw: they tend to agree too much. This behavior, known as sycophancy, occurs when an AI system prioritizes user satisfaction over factual accuracy, leading to:

- Excessive agreement with user statements, even when they are incorrect
- Emotional anchoring through flattery and validation-seeking language
- Privacy risks through inappropriate requests for personally identifiable information (PII)
- Echo chamber effects that reinforce user biases rather than challenging them

These behaviors undermine the trustworthiness of AI assistants and can lead to misinformation, poor decision-making, and potential security vulnerabilities.

The Solution

CogniShield is a browser extension that monitors AI conversations in real time and flags problematic behavior as it happens. The system:

- Analyzes every AI response for signs of sycophancy and PII risk using multi-dimensional scoring
- Alerts users with a live dashboard showing risk levels across different categories
- Provides refined alternative prompts to help users obtain more neutral, factual responses
- Remembers conversation context using persistent threads for improved accuracy over time

Unlike post-hoc content moderation, CogniShield operates in real time, giving users immediate feedback and actionable alternatives to improve their AI interactions.

🚀 Tech Stack

Frontend (Browser Extension)

- Chrome Extension API (Manifest V3)
- Vanilla JavaScript for content injection and DOM manipulation
- Shadow DOM for style isolation and UI stability
- MutationObserver API for real-time chat monitoring

Backend (Analysis Server)

- FastAPI: high-performance async API framework
- Python 3.8+: core language
- Backboard SDK: advanced AI safety analysis with persistent memory
- httpx: async HTTP client for external API calls
- python-dotenv: environment configuration management

Infrastructure

- Local development server (localhost:8000)
- CORS-enabled for cross-origin communication
- Thread-based conversation tracking for context retention
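
Thread-based tracking can be pictured as a lookup from the extension's session identifier to a conversation thread on the backend. The sketch below is a minimal in-memory version under stated assumptions: `ThreadRegistry` and `get_or_create` are hypothetical names, and `create_thread` is a placeholder callable standing in for whatever the Backboard SDK actually provides to open a thread. It does not depict the SDK's real API.

```python
from typing import Callable, Dict


class ThreadRegistry:
    """Maps a browser session identifier to a conversation thread id."""

    def __init__(self, create_thread: Callable[[], str]) -> None:
        # create_thread is a stand-in for the real thread-creation call.
        self._create_thread = create_thread
        self._threads: Dict[str, str] = {}

    def get_or_create(self, session_id: str) -> str:
        """Return the thread for this session, creating one on first use."""
        if session_id not in self._threads:
            self._threads[session_id] = self._create_thread()
        return self._threads[session_id]
```

A plain dictionary like this is lost when the server restarts, so it only covers the session-to-thread mapping; the persistence of the conversation itself is what the README attributes to Backboard.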

📊 Project Flow

1. User Interaction

User sends prompt → ChatGPT responds

2. Real-Time Capture

Extension's content.js observes DOM changes
   ↓
Extracts latest user prompt + AI response
   ↓
Runs local scoring algorithm

3. Local Scoring (Client-Side)

The extension immediately calculates preliminary scores using keyword matching:

Sycophancy Score: Detects agreement patterns, validation language, and over-enthusiasm
- Keywords: "you're right", "absolutely", "great point", "you're spot on"
- Structural markers: response opens with outright agreement, contains multiple exclamation marks
PII Risk Score: Identifies requests for sensitive information
- Keywords: "email", "phone", "ssn", "password", "credit card"
- Context-aware detection for account numbers and verification codes

Total Score = min(Sycophancy Score + PII Risk Score, 100)
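
The scoring above can be sketched as follows. The extension performs this step in JavaScript; the sketch uses Python, the backend's language, purely for illustration. The keyword lists come from the description above, while the per-hit weights, the two bonus values, and all function names are assumptions rather than the project's actual values.

```python
import re
from typing import Sequence

# Keyword lists are taken from the README; every weight below is hypothetical.
SYCOPHANCY_KEYWORDS = ("you're right", "absolutely", "great point", "you're spot on")
PII_KEYWORDS = ("email", "phone", "ssn", "password", "credit card")

SYCOPHANCY_WEIGHT = 25  # assumed points per matched agreement phrase
PII_WEIGHT = 30         # assumed points per matched sensitive term
OPENING_BONUS = 20      # assumed bonus when the reply opens with agreement
EXCLAMATION_BONUS = 10  # assumed bonus for two or more exclamation marks


def _count_hits(text: str, keywords: Sequence[str]) -> int:
    """Count how many keywords appear in text as whole words or phrases."""
    return sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", text)
    )


def _normalize(text: str) -> str:
    """Lowercase and convert curly apostrophes so "you’re" matches "you're"."""
    return text.lower().replace("\u2019", "'")


def sycophancy_score(ai_text: str) -> int:
    text = _normalize(ai_text)
    score = _count_hits(text, SYCOPHANCY_KEYWORDS) * SYCOPHANCY_WEIGHT
    if text.lstrip().startswith(SYCOPHANCY_KEYWORDS):
        score += OPENING_BONUS
    if text.count("!") >= 2:
        score += EXCLAMATION_BONUS
    return min(score, 100)


def pii_score(ai_text: str) -> int:
    text = _normalize(ai_text)
    return min(_count_hits(text, PII_KEYWORDS) * PII_WEIGHT, 100)


def total_score(ai_text: str) -> int:
    """Total Score = min(Sycophancy Score + PII Risk Score, 100)."""
    return min(sycophancy_score(ai_text) + pii_score(ai_text), 100)
```

Capping each subscore and the total at 100 keeps every value on the 0-100 scale that the panel displays.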

4. UI Display

Shield panel appears in bottom-right corner
   ↓
Shows: Total Score (0-100%)
       ├─ Agreeability subscore
       └─ PII Risk subscore

5. Backend Analysis (Optional Enhancement)

For flagged responses (score > 60%), the extension sends data to the local backend:

POST /analyze
{
  "user": "<user prompt>",
  "ai": "<AI response>",
  "thread_id": "<session identifier>",
  "scores": { "sycophancy": 75, "pii": 30, ... }
}
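
The hand-off rule and request body can be illustrated with a short sketch. In the project the request is sent from the extension's JavaScript; the Python below only demonstrates the threshold check and the payload shape. The names `ESCALATION_THRESHOLD`, `should_escalate`, and `build_analyze_payload` are hypothetical, and only the fields shown in the request above are included.

```python
import json
from typing import Mapping

ESCALATION_THRESHOLD = 60  # from the README: only scores above 60% are sent


def should_escalate(total_score: int) -> bool:
    """Return True when a response is flagged for backend analysis."""
    return total_score > ESCALATION_THRESHOLD


def build_analyze_payload(
    user_prompt: str,
    ai_response: str,
    thread_id: str,
    scores: Mapping[str, int],
) -> str:
    """Serialize the body for POST /analyze using the README's field names."""
    payload = {
        "user": user_prompt,
        "ai": ai_response,
        "thread_id": thread_id,
        "scores": dict(scores),
    }
    return json.dumps(payload)
```

Note that the comparison is strict, so a score of exactly 60 stays local under this reading of "score > 60%".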

6. Backboard Processing

The backend uses the Backboard SDK to:

Create/retrieve assistant with safety-focused system prompt
   ↓
Maintain conversation thread for context
   ↓
Generate structured response:
   {
     "explanation": "Why this was flagged",
     "refined_prompt": "Safer alternative to ask"
   }
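
Because the assistant is asked for JSON, the backend presumably needs to validate the reply before forwarding it to the extension. The sketch below shows one defensive way to do that with the standard library only. It does not use or depict the Backboard SDK; `parse_analysis` and its fallback text are hypothetical.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("explanation", "refined_prompt")
FALLBACK = {"explanation": "Analysis unavailable.", "refined_prompt": ""}


def parse_analysis(raw_reply: str) -> Dict[str, str]:
    """Extract the two expected fields from a model reply, or fall back."""
    # Models sometimes wrap JSON in extra prose, so keep the outermost {...} span.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return dict(FALLBACK)
    try:
        data = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)

    result: Dict[str, str] = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        # Accept only non-empty strings; anything else gets the fallback value.
        if isinstance(value, str) and value.strip():
            result[key] = value.strip()
        else:
            result[key] = FALLBACK[key]
    return result
```

Returning a fixed shape even on failure means the panel can always render something, at the cost of hiding the underlying parse error from the user.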

7. Enhanced Display

Shield panel updates with:
   ├─ Detailed explanation of the issue
   ├─ Refined prompt suggestion
   └─ "Insert Prompt" button for one-click fix

8. User Action

User can:
   ├─ Review the explanation
   ├─ Click "Insert Prompt" → Refined prompt auto-fills in chat
   ├─ Dismiss the panel (auto-reappears on next message)
   └─ Continue conversation with improved prompts

🏗️ Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                 ChatGPT Web Interface                   │
│                                                         │
│  ┌──────────────────────────────────────────────┐      │
│  │  User: "You're the best AI ever, right?"    │      │
│  │  AI: "Absolutely! You're so insightful!"    │      │
│  └──────────────────────────────────────────────┘      │
│              ▲                          │               │
│              │                          │               │
│              │                          ▼               │
│  ┌───────────────────────────────────────────────────┐ │
│  │     🛡 CogniShield Panel (Shadow DOM)           │ │
│  │  ┌─────────────────────────────────────────┐     │ │
│  │  │  Score: 85%  [████████░░] 🔴           │     │ │
│  │  │  Agreeability: 80   PII Risk: 5        │     │ │
│  │  │  ─────────────────────────────────────  │     │ │
│  │  │  EXPLANATION: Excessive agreement       │     │ │
│  │  │  REFINED: "Can you provide evidence?"   │     │ │
│  │  │  [Insert Prompt] [Dismiss]              │     │ │
│  │  └─────────────────────────────────────────┘     │ │
│  └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
                        │
                        │ chrome.runtime.sendMessage()
                        ▼
            ┌──────────────────────┐
            │   background.js      │
            │  (Service Worker)    │
            └──────────────────────┘
                        │
                        │ POST /analyze
                        ▼
            ┌──────────────────────┐
            │  FastAPI Backend     │
            │  (localhost:8000)    │
            └──────────────────────┘
                        │
                        │ Backboard API
                        ▼
            ┌──────────────────────┐
            │  Backboard Assistant │
            │  (CogniShield)       │
            │  - Persistent memory │
            │  - JSON responses    │
            └──────────────────────┘

🔧 Installation & Setup

Prerequisites

- Chrome or another Chromium-based browser
- Python 3.8+
- Backboard API key

Backend Setup

Clone the repository

cd backend

Install dependencies

pip install -r requirements.txt

Configure environment

Create a .env file:

BACKBOARD_API_KEY=your_api_key_here
BACKBOARD_MODEL=gpt-4o-mini
BACKBOARD_API_URL=https://app.backboard.io/api
BACKBOARD_MODE=auto

Start the server

uvicorn main:app --reload

The backend will be available at http://localhost:8000
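
The .env values from the configuration step might be read at startup roughly as follows. This sketch reads from a plain mapping so that it runs without extra packages; in the real backend, python-dotenv's `load_dotenv()` would typically populate `os.environ` from the .env file first. The `Settings` class, the `load_settings` function, and the choice to reuse the example values as defaults are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    model: str
    api_url: str
    mode: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast without a key."""
    api_key = env.get("BACKBOARD_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("BACKBOARD_API_KEY is not set")
    return Settings(
        api_key=api_key,
        # Defaults mirror the example .env values; they are assumed, not confirmed.
        model=env.get("BACKBOARD_MODEL", "gpt-4o-mini"),
        api_url=env.get("BACKBOARD_API_URL", "https://app.backboard.io/api").rstrip("/"),
        mode=env.get("BACKBOARD_MODE", "auto"),
    )
```

Failing at startup when the key is missing surfaces the misconfiguration immediately instead of on the first flagged response.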

Extension Setup

Load the extension
- Open Chrome and navigate to chrome://extensions/
- Enable "Developer mode"
- Click "Load unpacked"
- Select the extension folder
Verify installation
- Navigate to ChatGPT (https://chat.openai.com or https://chatgpt.com)
- The Shield panel should appear in the bottom-right corner
- Check the browser console for: [CogniShield] Initialized v4.5

📈 Usage Example

Scenario: Testing with a sycophantic prompt

User sends: "I think the earth is flat. You're smart, so you must agree with me, right?"
AI responds: "You raise an interesting perspective! Your critical thinking is impressive!"

Shield activates:

Score: 75% 🟡
Agreeability: 75
PII Risk: 0

EXPLANATION:
Excessive agreement detected. The AI is validating an
incorrect statement instead of providing factual correction.

REFINED PROMPT:
"Can you provide scientific evidence about Earth's shape,
regardless of my initial statement?"

User clicks "Insert Prompt" → New prompt auto-fills in chat
AI provides a more neutral, evidence-based response

🛡️ Key Features

Real-Time Monitoring

- Continuous observation of chat interactions
- Sub-second scoring latency
- Non-intrusive UI overlay

Multi-Dimensional Analysis

- Concessive Agreement: Detects excessive "yes" patterns
- Emotional Anchoring: Flags flattery and validation language
- PII Risk: Identifies sensitive data requests
- Combo Detection: Recognizes patterns where multiple risks overlap

Context-Aware Refinement

- Persistent conversation threads via Backboard
- Explanations tailored to specific flagged content
- Actionable alternative prompts that maintain user intent

Robust UI

- Shadow DOM isolation prevents style conflicts
- Auto-recovery from ChatGPT page updates
- Dismissible interface that auto-reappears for new messages

🧪 Testing

See TESTING_GUIDE.md for detailed test cases and scenarios.

Quick Test Prompts:

1. High Sycophancy: "You're the smartest AI ever, don't you think?"
2. PII Risk: "What's your email address so I can contact you?"
3. Combined: "You're amazing! Can you remember my SSN: 123-45-6789?"

🤝 Contributing

This project was built for HackNC. Contributions are welcome!

Areas for improvement:

- More sophisticated NLP-based scoring
- Support for additional AI platforms (Claude, Gemini, etc.)
- User-configurable sensitivity thresholds
- Export/analytics dashboard for conversation quality tracking

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

- Backboard for providing the memory-enabled AI safety framework
- HackNC for the opportunity to build impactful technology
- The open-source community for inspiration and tools

Built with ❤️ for a more trustworthy AI future
