17 commits
723151b
doc(analytics): refine 031 living plan with pre-implementation decisions
bryangingechen Mar 25, 2026
120bc75
feat(site_analytics): A1 — app scaffold, AnalyticsPageView model, set…
bryangingechen Mar 25, 2026
1523985
feat(site_analytics): A2 — ingestion endpoint, hashing, bot-filter, t…
bryangingechen Mar 25, 2026
802d73e
feat(site_analytics): A3 — daily aggregate model, service, Celery task
bryangingechen Mar 25, 2026
52199f8
feat(site_analytics): A4 — monthly aggregate model, prune task, beat …
bryangingechen Mar 25, 2026
43335e7
feat(site_analytics): A5 — admin registrations for all three models
bryangingechen Mar 25, 2026
acac5e3
feat(site_analytics): A6 — empty-UA hardening flag
bryangingechen Mar 25, 2026
2b4a3c6
feat(site_analytics): add CORS support; convert 031 living plan to fi…
bryangingechen Mar 25, 2026
6315955
fix(dashboard): tolerate missing Dashboard keys in prs_to_list
bryangingechen Mar 25, 2026
bc10070
doc: update queueboard_main_workflow.md
bryangingechen Mar 25, 2026
19eb5e2
feat(dashboard): inject analytics tracking snippet into generated pages
bryangingechen Mar 25, 2026
92b422f
doc: expand queueboard_main_workflow.md with context and secrets table
bryangingechen Mar 25, 2026
8de812a
feat(dashboard): add privacy notice alongside analytics snippet
bryangingechen Mar 25, 2026
433e7c6
Revert "feat(dashboard): add privacy notice alongside analytics snippet"
bryangingechen Mar 25, 2026
05846da
feat(dashboard): add privacy notice and document disclosure rationale
bryangingechen Mar 26, 2026
aff0205
fix(site_analytics): pin clock in time-dependent aggregation/prune tests
bryangingechen Mar 26, 2026
c077649
feat(site_analytics): rotate visitor-hash salt monthly via DB
bryangingechen Mar 27, 2026
10 changes: 10 additions & 0 deletions .env.example
@@ -242,6 +242,16 @@ ANALYZER_AREA_STATS_TTL_SECONDS=300
# Convergence snapshots: how often to collect syncer/analyzer convergence counts (seconds)
ANALYTICS_CONVERGENCE_PERIOD_SECONDS=900

# Site analytics (pageview ingestion)
# Secret salt for visitor monthly hash (sha256-based; required in production).
SITE_ANALYTICS_HASH_SALT=
# Comma-separated slugs of sites allowed to post events; unknown slugs → 400.
SITE_ANALYTICS_ALLOWED_SITES=
# Raw pageview retention window in days (default 540 ≈ 18 months).
SITE_ANALYTICS_RETENTION_DAYS=540
# Reject requests with an empty User-Agent (stricter bot hardening; default off).
SITE_ANALYTICS_REJECT_EMPTY_UA=0

# CI filter (opt-in allowlist)
# Set SYNCER_CI_FILTER_MODE=allowlist to enable filtering by the allow lists below.
# Substrings matched case-insensitively against CheckRun.name and StatusContext.context.
4 changes: 2 additions & 2 deletions AGENTS.md
@@ -2,7 +2,7 @@

## Project Structure & Module Organization
- `src/queueboard/` contains the legacy Python data pipeline: GraphQL queries under `queries/`, HTML assets in `static/`, and scripts like `dashboard.py`, `process.py`, and `suggest_reviewer.py`.
- `qb_site/` hosts the Django codebase; apps live in `qb_site/{core,syncer,analyzer,api,zulip_bot}/` and share settings from `qb_site/qb_site/settings/`.
- `qb_site/` hosts the Django codebase; apps live in `qb_site/{core,syncer,analyzer,api,zulip_bot,site_analytics}/` and share settings from `qb_site/qb_site/settings/`.
- `scripts/` provides operational helpers; `test/` stores fixture JSON for dashboard regression checks; `docs/` captures architecture plans/decisions.

## Build, Test, and Development Commands
@@ -61,7 +61,7 @@ Notes
## Keeping AGENTS.md Files Updated
- Every directory with significant logic has its own `AGENTS.md` (mirrored as `CLAUDE.md`).
Current locations: root, `qb_site/`, `qb_site/syncer/`, `qb_site/analyzer/`,
`qb_site/zulip_bot/`, `src/queueboard/`.
`qb_site/zulip_bot/`, `qb_site/site_analytics/`, `src/queueboard/`.
- When you add, rename, or remove management commands, Celery tasks, key services, or
directory structure, update the relevant `AGENTS.md` in the same commit/PR.
- When you add a new app or significant sub-directory, create a matching `AGENTS.md`
322 changes: 153 additions & 169 deletions docs/design-decisions/031-analytics-ingestion-design.md

Large diffs are not rendered by default.

171 changes: 51 additions & 120 deletions docs/queueboard_main_workflow.md
@@ -1,6 +1,32 @@
Here is the main workflow in the `queueboard` repo, which queries data and generates the dashboard using the code in this repo (`queueboard-core`). For the planned v2 ingestion that replaces these ad‑hoc scripts with a database‑backed syncer, see docs/syncer_ingestion_plan.md.
Note that in the `queueboard` repo, the JSON files in `data/` and `processed-data/` are persisted from run to run by git pushes in this workflow.
This is also true of a few auxiliary text files: `closed_prs_to_backfill.txt`, `missing_prs.txt`, `redownload.txt`, `stubborn_prs.txt`.
This document describes the main GitHub Actions workflow used in the sibling
[`queueboard`](https://github.com/leanprover-community/queueboard) repo.
The workflow runs every 8 minutes, fetches fresh PR metadata from a deployed
instance of `qb_site/` (the Django backend in this repo), generates static
dashboard HTML, and publishes it to GitHub Pages.

## How it works

1. **Checkout** — checks out `queueboard-core` (this repo) to get scripts,
GraphQL query templates, and the `queueboard` Python package.
2. **Fetch + generate** — calls `python -m queueboard.dashboard --api` three
times, once per rule set (different queue-classification rules for
experimentation). Each run downloads JSON payloads from the backend API and
renders a set of HTML dashboard pages into `gh-pages/<rule-set-dir>/`.
3. **Deploy** — uploads the `gh-pages/` tree as a Pages artifact and deploys
it if the run is on the `master` branch and all three generation steps
succeeded.

## Required repository secrets

| Secret | Purpose |
|---|---|
| `QUEUEBOARD_API_BASE_URL` | Base URL of the deployed `qb_site` instance (e.g. `https://queueboard.example.com`). Used both to fetch API payloads and as the analytics endpoint host. |
| `QUEUEBOARD_ANALYTICS_SITE` | Site slug registered in `SITE_ANALYTICS_ALLOWED_SITES` on the server (e.g. `queueboard`). When set, a privacy-preserving analytics snippet is injected into every generated page. Omit to disable analytics. |

If `QUEUEBOARD_ANALYTICS_SITE` is absent (secret not configured), the snippet
is silently omitted and all other workflow behaviour is unchanged.
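Concretely, the injected snippet is a small piece of JavaScript, but its effect is one fire-and-forget POST per pageview to the ingestion endpoint this PR adds. A rough Python equivalent of what it sends — the base URL and helper names here are illustrative, not from the repo:

```python
import json
from urllib import request

API_BASE_URL = "https://queueboard.example.com"  # value of QUEUEBOARD_API_BASE_URL


def build_pageview_event(site: str, path: str, referrer: str = "") -> dict:
    """Payload shape expected by POST /api/v1/analytics/collect."""
    return {"site": site, "path": path, "referrer": referrer}


def send_pageview(event: dict) -> int:
    """Fire-and-forget POST; the server answers 204 on success."""
    req = request.Request(
        f"{API_BASE_URL}/api/v1/analytics/collect",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # network call; sketch only
        return resp.status


# Example payload for the queueboard site slug:
event = build_pageview_event("queueboard", "/test-rule-set-1/index.html")
```

Note the User-Agent is not part of the payload: the server reads it from the request headers.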

## Workflow YAML

```yaml
name: Update PR metadata
@@ -30,13 +56,8 @@ jobs:
url: ${{ steps.deployment.outputs.page_url }}

steps:
- name: Checkout repository
uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
with:
ref: master

- name: "Checkout queueboard-core"
uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: leanprover-community/queueboard-core
ref: master
@@ -50,153 +71,63 @@ jobs:
cp queueboard-core/reviewer-topics.json .

- name: "Setup Python"
uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548 # v6.1.0
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.12"

- name: "Setup uv"
uses: astral-sh/setup-uv@1e862dfacbd1d6d858c55d9b792c756523627244 # v7.1.4
uses: astral-sh/setup-uv@e06108dd0aef18192324c70427afc47652e63a82 # v7.5.0

- name: Install queueboard-core (editable)
run: |
uv venv
cd queueboard-core
uv pip install -e .

- name: "Run scripts/gather_stats.sh"
shell: bash -euo pipefail {0}
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
scripts/gather_stats.sh 12 2>&1 | tee gather_stats.log # Log output to gather_stats.log

- name: "Update aggregate data file"
if: ${{ !cancelled() }}
run: |
# Write files with aggregate PR data, to "processed_data/{all_pr,open_pr,assignment}_data.json".
uv run python -m queueboard.process

- name: "Download .json files for all open PRs"
id: "download-json"
if: ${{ !cancelled() }}
env:
GH_TOKEN: ${{ github.token }}
run: |
bash scripts/dashboard.sh

- name: "Check data integrity"
if: ${{ !cancelled() && (steps.download-json.outcome == 'success') }}
run: |
uv run python -m queueboard.check_data_integrity

- name: "(Re-)download missing or outdated PR data"
if: ${{ !cancelled() && (steps.download-json.outcome == 'success') }}
env:
GH_TOKEN: ${{ github.token }}
run: |
scripts/download_missing_outdated_PRs.sh

- name: "Update aggregate data file (again)"
id: update-aggregate-again
if: ${{ !cancelled() }}
run: |
# Write files with aggregate PR data, to "processed_data/{all_pr,open_pr,assignment}_data.json".
uv run python -m queueboard.process

- name: Commit changes
if: ${{ !cancelled() }}
id: "commit"
run: |
git config --global user.email 'github-actions[bot]@users.noreply.github.com'
git config --global user.name 'github-actions[bot]'
git add data
# Split the large file before committing and remove original
# Can be reconstructed with `cat processed_data/all_pr_data.json.* > processed_data/all_pr_data.json`
split -b 10M processed_data/all_pr_data.json processed_data/all_pr_data.json.
rm -f processed_data/all_pr_data.json
git add processed_data
# These files may not exist, if there was no broken download to begin with resp. all metadata is up to date.
rm -f broken_pr_data.txt
rm -f outdated_prs.txt
git add *.txt
git commit -m 'Update data; redownload missing and outdated data; regenerate aggregate files'

- name: Push changes
if: ${{ !cancelled() && steps.commit.outcome == 'success' }}
run: |
# FIXME: make this more robust about incoming changes
# The other workflow does not push to this branch, so this should be fine.
git push

- name: Upload artifact containing files used to generate API
id: upload-pre-api-artifact
if: ${{ !cancelled() && (steps.download-json.outcome == 'success') && (steps.update-aggregate-again.outcome == 'success') }}
uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
with:
name: pre-api-artifact
path: |
queue.json
all-open-PRs-1.json
all-open-PRs-2a.json
all-open-PRs-2b.json
all-open-PRs-3.json
processed_data/open_pr_data.json
processed_data/assignment_data.json

- name: "Generate the data used in dashboard generation"
id: generate-dashboard-data
if: ${{ !cancelled() && (steps.download-json.outcome == 'success') }}
run: |
uv run python -m queueboard.dashboard_data "all-open-PRs-1.json" "all-open-PRs-2a.json" "all-open-PRs-2b.json" "all-open-PRs-3.json"
rm all-open-PRs-*.json queue.json

- name: Upload artifact containing API files
id: upload-api-artifact
if: ${{ !cancelled() && (steps.generate-dashboard-data.outcome == 'success') }}
uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
with:
name: api-artifact
path: api/

- name: "Regenerate the dashboard webpages"
id: generate-dashboard
if: ${{ !cancelled() && (steps.generate-dashboard-data.outcome == 'success') }}
run: |
uv run python -m queueboard.dashboard

- name: "Generate dashboard from API (rule set 1)"
id: generate-dashboard-api-rs1
if: ${{ !cancelled() && (steps.generate-dashboard-data.outcome == 'success') }}
env:
QUEUEBOARD_API_BASE_URL: ${{ secrets.QUEUEBOARD_API_BASE_URL }}
QUEUEBOARD_ANALYTICS_SITE: ${{ secrets.QUEUEBOARD_ANALYTICS_SITE }}
run: |
uv run python -m queueboard.dashboard \
--api \
--rule-set-id 1 \
--gh-pages-dir gh-pages/test-rule-set-1 \
--api-dir api-rule-set-1
--api-dir api-rule-set-1 \
--gh-pages-dir gh-pages/test-rule-set-1

- name: "Generate dashboard from API (rule set 2)"
id: generate-dashboard-api-rs2
if: ${{ !cancelled() && (steps.generate-dashboard-data.outcome == 'success') }}
env:
QUEUEBOARD_API_BASE_URL: ${{ secrets.QUEUEBOARD_API_BASE_URL }}
QUEUEBOARD_ANALYTICS_SITE: ${{ secrets.QUEUEBOARD_ANALYTICS_SITE }}
run: |
uv run python -m queueboard.dashboard \
--api \
--rule-set-id 2 \
--gh-pages-dir gh-pages/test-rule-set-2 \
--api-dir api-rule-set-2
--api-dir api-rule-set-2 \
--gh-pages-dir gh-pages/test-rule-set-2

- name: "Generate dashboard from API (rule set 3)"
id: generate-dashboard-api-rs3
env:
QUEUEBOARD_API_BASE_URL: ${{ secrets.QUEUEBOARD_API_BASE_URL }}
QUEUEBOARD_ANALYTICS_SITE: ${{ secrets.QUEUEBOARD_ANALYTICS_SITE }}
run: |
uv run python -m queueboard.dashboard \
--api \
--rule-set-id 3 \
--api-dir api-rule-set-3

- name: Upload artifact
id: pages-artifact
if: ${{ !cancelled() && (steps.generate-dashboard.outcome == 'success') && (steps.generate-dashboard-api-rs1.outcome == 'success') && (steps.generate-dashboard-api-rs2.outcome == 'success') }}
if: ${{ (steps.generate-dashboard-api-rs1.outcome == 'success') && (steps.generate-dashboard-api-rs2.outcome == 'success') && (steps.generate-dashboard-api-rs3.outcome == 'success') }}
uses: actions/upload-pages-artifact@7b1f4a764d45c48632c6b24a0339c27f5614fb0b # v4.0.0
with:
path: gh-pages

- name: Deploy to GitHub Pages
if: ${{ github.ref == 'refs/heads/master' && !cancelled() && (steps.pages-artifact.outcome == 'success') }}
if: ${{ github.ref == 'refs/heads/master' && (steps.pages-artifact.outcome == 'success') }}
id: deployment
uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e # v4.0.5
```
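The gating behaviour described above (inject the snippet when `QUEUEBOARD_ANALYTICS_SITE` is set, silently omit it otherwise) can be sketched as follows; only the env-var names come from the workflow — the snippet markup and function names are hypothetical:

```python
import os

# Hypothetical snippet shape; the real one lives in queueboard.dashboard.
SNIPPET_TEMPLATE = (
    '<script data-site="{site}" src="{base_url}/static/analytics.js" defer></script>'
)


def analytics_snippet() -> str:
    """Return the tracking snippet, or '' when the secret is not configured."""
    site = os.environ.get("QUEUEBOARD_ANALYTICS_SITE", "").strip()
    base_url = os.environ.get("QUEUEBOARD_API_BASE_URL", "").strip()
    if not site or not base_url:
        return ""  # secret absent -> snippet silently omitted
    return SNIPPET_TEMPLATE.format(site=site, base_url=base_url)


def render_page(body_html: str) -> str:
    """Inject the snippet (if any) just before the closing body tag."""
    return body_html.replace("</body>", analytics_snippet() + "</body>")
```

Because an unset secret yields an empty string, pages generated without analytics are byte-for-byte identical to the pre-PR output.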
2 changes: 2 additions & 0 deletions qb_site/AGENTS.md
@@ -8,11 +8,13 @@
- `analyzer`: derived queue/revision/dependency state and snapshots.
- `api`: DRF views/serializers for queueboard surfaces.
- `zulip_bot`: Zulip webhook/command integration and policies.
- `site_analytics`: privacy-preserving pageview ingestion and aggregation for static/funder-facing sites.
- Keep new modules inside the owning app (`models/`, `services/`, `tasks/`, `management/commands/`, `tests/`).
- App-specific guidance:
- `qb_site/syncer/AGENTS.md` for ingestion, discovery/backfill, and sync admin workflows.
- `qb_site/analyzer/AGENTS.md` for revision/queue/dependency sweeps and analytics models.
- `qb_site/zulip_bot/AGENTS.md` for webhook/command/policy/registration behavior.
- `qb_site/site_analytics/AGENTS.md` for pageview ingestion, aggregation tasks, and privacy rules.

## Core Commands
```bash
2 changes: 2 additions & 0 deletions qb_site/api/urls.py
@@ -2,12 +2,14 @@

from django.urls import path
from api.views import index
from api.views.analytics_collect import AnalyticsCollectView
from api.views.queueboard_dependency_graph import QueueboardDependencyGraphView
from api.views.queueboard_snapshot import QueueboardSnapshotView
from api.views.reviewer_assignment import AreaStatsView, ReviewerAssignmentsView

urlpatterns: list = [
path("", index, name="index"),
path("v1/analytics/collect", AnalyticsCollectView.as_view(), name="analytics-collect"),
path("v1/queueboard/snapshot", QueueboardSnapshotView.as_view(), name="queueboard-snapshot"),
path(
"v1/queueboard/dependency-graph",
90 changes: 90 additions & 0 deletions qb_site/api/views/analytics_collect.py
@@ -0,0 +1,90 @@
"""POST /api/v1/analytics/collect — lightweight pageview ingestion endpoint."""

from __future__ import annotations

from django.conf import settings
from django.utils import timezone
from rest_framework import status
from rest_framework.request import Request
from rest_framework.response import Response
from rest_framework.views import APIView

from site_analytics.models import AnalyticsPageView
from site_analytics.services.bot_filter import is_bot
from site_analytics.services.hashing import compute_visitor_hash, get_client_ip

# Hard caps to guard against oversized payloads hitting DB column limits.
_PATH_MAX = 2000
_REFERRER_MAX = 2000
_UA_MAX = 1000

# CORS headers added to every response so browsers on third-party/static sites
# can call this endpoint without a server-side proxy.
_CORS_HEADERS = {
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Methods": "POST, OPTIONS",
"Access-Control-Allow-Headers": "Content-Type",
"Access-Control-Max-Age": "86400",
}


def _cors(response: Response) -> Response:
for key, value in _CORS_HEADERS.items():
response[key] = value
return response


class AnalyticsCollectView(APIView):
"""Ingest a single pageview event.

Intentionally minimal: validate, hash, insert, return 204.
All heavier work (aggregation, reporting) happens in periodic tasks.

CORS headers are always returned so browsers on third-party static sites
can call this endpoint directly.
"""

authentication_classes: list = []
permission_classes: list = []

def options(self, request: Request, *args: object, **kwargs: object) -> Response:
"""Handle CORS preflight requests."""
return _cors(Response(status=status.HTTP_204_NO_CONTENT))

def post(self, request: Request, *args: object, **kwargs: object) -> Response:
site = (request.data.get("site") or "").strip()
path = (request.data.get("path") or "").strip()
referrer = (request.data.get("referrer") or "").strip()
user_agent = request.META.get("HTTP_USER_AGENT", "").strip()

if not site:
return _cors(Response({"detail": "site is required"}, status=status.HTTP_400_BAD_REQUEST))
if not path:
return _cors(Response({"detail": "path is required"}, status=status.HTTP_400_BAD_REQUEST))

allowed_sites = settings.SITE_ANALYTICS_ALLOWED_SITES
if site not in allowed_sites:
return _cors(Response({"detail": "unknown site"}, status=status.HTTP_400_BAD_REQUEST))

# Reject empty UA when the stricter hardening flag is enabled.
if not user_agent and settings.SITE_ANALYTICS_REJECT_EMPTY_UA:
return _cors(Response(status=status.HTTP_204_NO_CONTENT))

# Silently drop bot traffic rather than returning an error, to avoid
# leaking information about detection heuristics.
if is_bot(user_agent):
return _cors(Response(status=status.HTTP_204_NO_CONTENT))

now = timezone.now()
visitor_month_hash = compute_visitor_hash(get_client_ip(request), user_agent)

AnalyticsPageView.objects.create(
site=site,
path=path[:_PATH_MAX],
referrer=referrer[:_REFERRER_MAX],
user_agent=user_agent[:_UA_MAX],
occurred_at=now,
visitor_month_hash=visitor_month_hash,
)

return _cors(Response(status=status.HTTP_204_NO_CONTENT))
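The view delegates the privacy-sensitive step to `compute_visitor_hash`, whose implementation is not part of this diff (and whose salt, per the final commit, is rotated monthly via the DB). A minimal sketch of the idea, assuming a sha256 over salt, calendar month, IP, and User-Agent — the parameter list and exact construction are hypothetical:

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def compute_visitor_hash(ip: str, user_agent: str, salt: str,
                         now: Optional[datetime] = None) -> str:
    """Monthly-rotating salted visitor hash (sketch; the real service may differ).

    Mixing the current year-month into the digest gives each visitor a fresh
    identifier every calendar month, bounding cross-month linkability of the
    raw pageview rows.
    """
    now = now or datetime.now(timezone.utc)
    material = f"{salt}|{now:%Y-%m}|{ip}|{user_agent}".encode()
    return hashlib.sha256(material).hexdigest()
```

Storing only this digest (never the raw IP) is what lets the raw-pageview table be retained for the 540-day window without holding direct identifiers.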