hookstream

4,107 events/sec at p95 8ms. Production-grade webhook delivery engine with guaranteed delivery, exponential backoff, real-time observability, and multi-tenant isolation — deployable to Kubernetes in minutes.

| Metric | Result |
| --- | --- |
| Ingest throughput | 4,107 events/sec |
| p95 ingest latency | 8ms |
| Delivery throughput | 1,812 deliveries/sec |
| p95 delivery latency | 143ms |
| Steady-state error rate | 0.00% |

k6 load test — 4 vCPU / 8 GB, 1 API + 3 worker replicas. Full results: benchmarks/results/baseline-2025-02-13.txt


What this is

Hookstream is a standalone webhook delivery service. When your application fires an event, Hookstream handles getting that payload to subscriber endpoints — reliably, with retries, with full delivery history, and with observability built in.

Think of it as a self-hosted Svix or Hookdeck. Every SaaS company needs this internally; most build it badly as an afterthought. This is the correct version.

Built for systems where delivery failure is not acceptable.


Architecture

Your Application
      │
      │  POST /events  (fire and forget)
      ▼
┌──────────────────┐
│   Ingest API      │  Express + TypeScript
│   - Auth check    │  Validates, enqueues, returns 202
│   - Payload sign  │  HMAC-SHA256 signature header
│   - Enqueue       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Redis / BullMQ  │  Durable job queue
│   - Per-tenant    │  Separate queues per tenant
│   - Priority      │  Configurable concurrency
│   - Persistence   │
└────────┬─────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│         Delivery Workers                  │
│                                           │
│  ┌─────────────────────────────────────┐ │
│  │  For each subscriber endpoint:      │ │
│  │  1. POST payload with HMAC header   │ │
│  │  2. Expect 2xx within timeout       │ │
│  │  3. On failure → exponential retry  │ │
│  │     Attempt 1: immediate            │ │
│  │     Attempt 2: +30s                 │ │
│  │     Attempt 3: +5min                │ │
│  │     Attempt 4: +30min               │ │
│  │     Attempt 5: +2hr → dead letter   │ │
│  └─────────────────────────────────────┘ │
│                                           │
│  Circuit breaker per endpoint:            │
│  5 consecutive failures → open circuit   │
│  Auto-retry after 10min cooling period   │
└────────┬─────────────────────────────────┘
         │
    ┌────┴──────────────┐
    ▼                   ▼
PostgreSQL          Prometheus
(delivery log,      (metrics: delivery
 dead letter,        rate, latency,
 subscriber cfg)     failure rate,
                     queue depth)
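
The retry schedule in the diagram can be read as a lookup table. The sketch below is illustrative only: `RETRY_DELAYS_MS` and `nextStep` are hypothetical names, not taken from the Hookstream source, and it assumes each delay is measured from the previous failed attempt and that an event is dead-lettered once attempt 5 fails.

```typescript
// Delay before each attempt, indexed by (attempt number - 1). Values from the diagram.
const RETRY_DELAYS_MS: readonly number[] = [
  0,                  // attempt 1: immediate
  30 * 1000,          // attempt 2: +30s
  5 * 60 * 1000,      // attempt 3: +5min
  30 * 60 * 1000,     // attempt 4: +30min
  2 * 60 * 60 * 1000, // attempt 5: +2hr
];

type NextStep =
  | { action: 'retry'; attempt: number; delayMs: number }
  | { action: 'dead-letter' };

// Given the 1-based number of the attempt that just failed, decide what happens next.
function nextStep(failedAttempt: number): NextStep {
  if (!Number.isInteger(failedAttempt) || failedAttempt < 1) {
    throw new RangeError('failedAttempt must be a positive integer');
  }
  // Assumption: once the last scheduled attempt fails, the event is dead-lettered.
  if (failedAttempt >= RETRY_DELAYS_MS.length) {
    return { action: 'dead-letter' };
  }
  return {
    action: 'retry',
    attempt: failedAttempt + 1,
    delayMs: RETRY_DELAYS_MS[failedAttempt],
  };
}
```

For example, `nextStep(1)` schedules attempt 2 after 30 seconds, and `nextStep(5)` returns the dead-letter action.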

Tech stack

| Layer | Technology | Why |
| --- | --- | --- |
| Runtime | Node.js 20 + TypeScript | Type-safe message contracts across producers and workers |
| Queue | BullMQ on Redis | Persistent, ordered, retry-aware job queue with dashboard |
| Database | PostgreSQL 15 | Delivery log, subscriber registry, dead letter store |
| Metrics | Prometheus + Grafana | Real-time delivery visibility; Grafana dashboard included |
| Containerisation | Docker + Docker Compose | Dev stack up in one command |
| Orchestration | Kubernetes + Helm | Production deployment with horizontal scaling of workers |
| CI/CD | GitHub Actions | Lint → test → build → push to registry |
| Logging | Pino | Structured JSON logs, low overhead |

Features

Delivery

  • Guaranteed at-least-once delivery
  • Exponential backoff with configurable retry schedule
  • Per-endpoint circuit breaker — failed endpoints are paused, not hammered
  • Dead letter queue for undeliverable events with manual replay
  • HMAC-SHA256 signature on every delivery (X-Hookstream-Signature header)
  • Configurable delivery timeout per endpoint
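
The per-endpoint circuit breaker can be pictured as a small state machine. This is a minimal sketch, not the contents of `CircuitBreaker.ts`: the class name, method names, and the half-open probing behaviour are assumptions. Only the thresholds (5 consecutive failures, 10 minute cooling period) come from the architecture diagram.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class EndpointCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly coolDownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 5,           // consecutive failures that open the circuit
    coolDownMs = 10 * 60 * 1000,    // cooling period before deliveries resume
    now: () => number = Date.now,   // injectable clock, handy in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.coolDownMs = coolDownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.coolDownMs ? 'half-open' : 'open';
  }

  // Deliveries are blocked only while the circuit is open and still cooling down.
  canDeliver(): boolean {
    return this.state() !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe in half-open state, or hitting the threshold, (re)opens the circuit.
    if (this.state() === 'half-open' || this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

A worker would call `canDeliver()` before each POST and report the outcome with `recordSuccess()` or `recordFailure()`, keeping one breaker instance per endpoint.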

Subscriber management

  • Register endpoints via API with per-endpoint event type filters
  • Endpoint verification challenge (similar to Stripe's webhook registration)
  • Per-tenant endpoint isolation — tenants cannot see each other's delivery logs
  • Pause / resume individual endpoints
  • Rotate signing secrets without downtime
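
As an illustration of how event type filters, pausing, and tenant isolation combine when an event is fanned out, here is a minimal sketch. The `Subscriber` shape and `matchSubscribers` are hypothetical, and exact-match filtering on the event type is an assumption; Hookstream's real matching rules may differ.

```typescript
// Hypothetical subscriber record; field names are illustrative only.
interface Subscriber {
  id: string;
  tenantId: string;
  url: string;
  events: string[]; // event types the endpoint registered for
  paused: boolean;
}

// Select the endpoints that should receive an event of `eventType` for one tenant.
function matchSubscribers(
  subscribers: Subscriber[],
  tenantId: string,
  eventType: string,
): Subscriber[] {
  return subscribers.filter(
    (s) =>
      s.tenantId === tenantId &&    // never cross tenant boundaries
      !s.paused &&                  // paused endpoints receive nothing
      s.events.includes(eventType), // assumption: exact match on event type
  );
}
```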

Observability

  • Prometheus metrics exported at /metrics:
    • hookstream_deliveries_total (labels: tenant, event_type, status)
    • hookstream_delivery_duration_seconds
    • hookstream_queue_depth (per tenant)
    • hookstream_circuit_breaker_state (per endpoint)
  • Grafana dashboard definition included (dashboards/hookstream.json)
  • Structured JSON logging on every delivery attempt
  • Health endpoint: GET /health — liveness + queue connectivity

API

POST   /api/v1/events                        # Ingest event
GET    /api/v1/events/:id                    # Event detail + delivery attempts
POST   /api/v1/subscribers                   # Register endpoint
GET    /api/v1/subscribers                   # List endpoints
DELETE /api/v1/subscribers/:id               # Remove endpoint
PATCH  /api/v1/subscribers/:id/pause         # Pause delivery
POST   /api/v1/subscribers/:id/rotate-secret # Rotate HMAC secret
GET    /api/v1/deliveries                    # Delivery log (filterable)
POST   /api/v1/deliveries/:id/replay         # Replay a failed delivery
POST   /api/v1/deliveries/diagnose           # AI-powered failure diagnosis
GET    /api/v1/dead-letter                   # Dead letter queue
GET    /metrics                              # Prometheus metrics
GET    /health                               # Health check

AI delivery diagnostics

POST /api/v1/deliveries/diagnose sends an endpoint's recent failure history to Claude and returns a structured root-cause diagnosis. This is useful when a subscriber's deliveries have landed in the dead-letter queue and you need to understand why without manually querying delivery logs.

curl -X POST http://localhost:3000/api/v1/deliveries/diagnose \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"subscriber_id": "sub_abc123"}'

Example response:

{
  "data": {
    "summary": "Endpoint returning 401 consistently across all 14 attempts over 6 hours.",
    "likely_cause": "Authentication token has expired or been rotated on the receiving end.",
    "pattern": "auth",
    "recommended_action": "Rotate the signing secret on the subscriber and update the token on the receiving endpoint.",
    "confidence": "high"
  }
}

Requires ANTHROPIC_API_KEY in .env. Returns 501 if not configured.
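
If you consume the diagnosis from TypeScript, it may help to validate the response before acting on it. The sketch below assumes the five string fields shown in the example response are always present. `Diagnosis` and `parseDiagnosis` are hypothetical names, not part of Hookstream.

```typescript
const DIAGNOSIS_FIELDS = [
  'summary',
  'likely_cause',
  'pattern',
  'recommended_action',
  'confidence',
] as const;

// Shape inferred from the example response; treat it as an assumption.
type Diagnosis = Record<(typeof DIAGNOSIS_FIELDS)[number], string>;

// Validate a decoded JSON body and return its `data` object as a Diagnosis.
function parseDiagnosis(body: unknown): Diagnosis {
  const data = (body as { data?: unknown } | null)?.data;
  if (typeof data !== 'object' || data === null) {
    throw new Error('diagnose response has no "data" object');
  }
  const record = data as Record<string, unknown>;
  for (const field of DIAGNOSIS_FIELDS) {
    if (typeof record[field] !== 'string') {
      throw new Error(`diagnose response is missing string field "${field}"`);
    }
  }
  return record as Diagnosis;
}
```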


Getting started

git clone https://github.com/ykachala/hookstream.git
cd hookstream
cp .env.example .env
docker compose up

Services:

  • API: http://localhost:3000
  • BullMQ Board: http://localhost:3001
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3002 (admin / admin)

# Run tests
npm test

# Run load test (k6)
k6 run tests/load/ingest.js

Register a subscriber and fire an event

# Register your endpoint
curl -X POST http://localhost:3000/api/v1/subscribers \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://your-endpoint.com/hooks", "events": ["payment.completed", "user.created"]}'

# Fire an event
curl -X POST http://localhost:3000/api/v1/events \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "payment.completed", "payload": {"amount": 5000, "currency": "ZAR"}}'

Verifying webhook signatures

On your endpoint, verify the X-Hookstream-Signature header:

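import crypto from 'crypto';

// `payload` must be the raw request body exactly as received, not re-serialized JSON.
function verifySignature(payload: string, signature: string, secret: string): boolean {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');
  const received = Buffer.from(signature);
  const wanted = Buffer.from(`sha256=${expected}`);
  // timingSafeEqual throws when lengths differ, so reject mismatched lengths first.
  return received.length === wanted.length && crypto.timingSafeEqual(received, wanted);
}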
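
To test a receiver locally, it can be handy to produce a signature the same way the sender would. The sketch below assumes the header value is `sha256=` followed by the hex HMAC-SHA256 of the raw body, which is what the verification code implies. `signPayload` and `isValidSignature` are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Build the header value for a raw request body (assumed format: "sha256=<hex digest>").
function signPayload(rawBody: string, secret: string): string {
  const digest = createHmac('sha256', secret).update(rawBody).digest('hex');
  return `sha256=${digest}`;
}

// Receiver-side check: compare in constant time, rejecting length mismatches up front
// because timingSafeEqual throws when the buffers differ in length.
function isValidSignature(rawBody: string, header: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(rawBody, secret));
  const received = Buffer.from(header);
  return received.length === expected.length && timingSafeEqual(received, expected);
}

// Round trip: sign a body, then verify it with the right and the wrong secret.
const body = JSON.stringify({ type: 'payment.completed', payload: { amount: 5000 } });
const header = signPayload(body, 'whsec_example');
console.log(isValidSignature(body, header, 'whsec_example')); // true
console.log(isValidSignature(body, header, 'wrong-secret'));  // false
```

Sign and verify the exact bytes on the wire. Parsing the JSON and serializing it again can change whitespace or key order, which changes the digest.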

Performance

Benchmarked on a 4-vCPU / 8GB instance with 3 worker replicas:

| Metric | Value |
| --- | --- |
| Ingest throughput | 4,107 events/sec |
| Delivery throughput (200 OK endpoints) | 1,812 deliveries/sec |
| p95 ingest latency | 8ms |
| p95 delivery latency (fast endpoint) | 143ms |
| Queue drain time (10k backlog) | ~6 seconds |

Load test: k6 run tests/load/ingest.js — no sleep, targets ingest saturation. Captured output from the baseline run: benchmarks/results/baseline-2025-02-13.txt.
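
The queue drain figure follows from the delivery throughput. As a rough back-of-the-envelope check, assuming one delivery per queued job, no new ingest while draining, and a sustained rate of roughly 1,800 deliveries/sec (`estimateDrainSeconds` is a hypothetical helper):

```typescript
// Estimate how long a backlog takes to drain at a steady delivery rate.
function estimateDrainSeconds(backlog: number, deliveriesPerSec: number): number {
  if (deliveriesPerSec <= 0) {
    throw new RangeError('deliveriesPerSec must be positive');
  }
  return backlog / deliveriesPerSec;
}

const drainSeconds = estimateDrainSeconds(10_000, 1_800);
console.log(drainSeconds.toFixed(1)); // "5.6", in line with the ~6 seconds reported
```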


Kubernetes deployment

helm install hookstream ./helm/hookstream \
  --set workers.replicas=3 \
  --set redis.url=$REDIS_URL \
  --set database.url=$DATABASE_URL

Workers scale independently of the API — scale them horizontally to increase delivery throughput without touching the ingest layer.


Project structure

hookstream/
├── src/
│   ├── api/              # Ingest and management routes
│   ├── workers/          # BullMQ delivery workers
│   │   ├── DeliveryWorker.ts
│   │   └── CircuitBreaker.ts
│   ├── queue/            # Queue definitions and producers
│   ├── db/               # PostgreSQL queries (no ORM — raw queries for perf)
│   ├── metrics/          # Prometheus client setup and counters
│   └── config/
├── tests/
│   ├── unit/
│   ├── integration/
│   └── load/             # k6 scripts
├── helm/                 # Kubernetes Helm chart
├── dashboards/           # Grafana dashboard JSON
├── docker-compose.yml
└── .github/workflows/

Author: Yoweli Kachala  |  LinkedIn  |  Cape Town, South Africa
