Guide-first, production-oriented reference implementation for exporting GitHub repository access, storing daily snapshots in BigQuery, and exposing current-state and historical views for governance reporting.
This repo focuses on a practical governance problem: turning GitHub access into something queryable, reviewable, and auditable over time instead of treating it as a point-in-time export.
It is designed for security, platform, and engineering governance use cases where teams need a repeatable way to answer:
- who has access to which repositories
- which access is direct versus team-derived
- where privileged access is concentrated
- whether outside collaborators or stale org states still have access
- whether the collection pipeline itself is healthy
No real organization data is included in this repository.
- Governance with history: it keeps append-only snapshots instead of overwriting state, which makes audits and trend analysis possible
- Current state and history together: it supports both latest-state views for reporting and raw snapshots for investigations
- Operational visibility: it tracks sync health and skipped items instead of silently hiding partial failures
- Practical cloud path: Cloud Run, BigQuery, Scheduler, and Looker Studio give a straightforward deployment and reporting model
On each run, the service:
- calls the GitHub REST API through
gh - exports organization access data into CSV files under a per-run temporary directory
- loads the CSVs into append-only BigQuery snapshot tables
- refreshes
latest_*views for current-state consumers - refreshes
current_access_matrixfor dashboarding - records sync telemetry in
sync_runsandsync_skipped_items
The result is a model that supports both:
- audit and trend analysis from historical raw tables
- current-state reporting from stable reporting views
- GitHub access review and entitlement reporting
- Security governance dashboards
- Platform engineering ownership and team-access mapping
- Management visibility into privileged or exceptional access
- Operational monitoring of the sync pipeline itself
flowchart LR
A["Cloud Scheduler"] --> B["Cloud Run /sync"]
B --> C["GitHub API export script"]
C --> D["CSV snapshots in /tmp/<run_id>"]
D --> E["BigQuery raw snapshot tables"]
E --> F["latest_* views"]
F --> G["current_access_matrix view"]
E --> H["sync_runs + sync_skipped_items"]
G --> I["Looker Studio dashboard"]
Supporting documentation:
- app/: Flask service, loader logic, schemas, and BigQuery orchestration
- scripts/: GitHub export script and GCP deployment helpers
- tests/: focused tests for parsing and view query generation
- docs/: architecture, deployment, and dashboard documentation
- sample-data/: synthetic example outputs for schema illustration
- environment-driven configuration
- append-only snapshot model for audit history
- latest-state views for simple consumers
- reporting view for dashboard use
- retry handling for transient GitHub API failures
- explicit skipped-item logging instead of silent data loss
- authenticated Cloud Run endpoint for scheduled execution
The service reads these environment variables:
| Variable | Purpose | Default |
|---|---|---|
ORG |
GitHub organization to export | example-org |
BQ_PROJECT |
Target GCP project | example-gcp-project |
BQ_DATASET |
Target BigQuery dataset | github_access_audit |
BQ_LOCATION |
BigQuery and deployment region | us-central1 |
EXPORT_SCRIPT_PATH |
Exporter path inside the container | /app/scripts/export_all_access.sh |
GH_TOKEN |
GitHub PAT injected from Secret Manager | none |
The loader writes append-only data into these BigQuery tables:
| Table | Purpose |
|---|---|
repos |
Repository inventory and metadata |
org_memberships |
Org membership role and state per user |
outside_collaborators |
External users outside the org |
teams |
Team inventory |
team_members |
Team membership relationships |
team_repo_permissions |
Team-to-repository grants |
repo_user_permissions |
Effective repository access by user |
repo_team_permissions |
Team access visible from the repository side |
custom_repository_roles |
Custom role definitions when available |
sync_runs |
Run-level health and load summary |
sync_skipped_items |
API failures and skipped items per run |
Every snapshot table includes:
snapshot_datesnapshot_tsrun_idorgsource_file
| View | Purpose |
|---|---|
latest_<table> |
Most recent snapshot for a given raw table |
current_access_matrix |
Joined current-state model for dashboarding |
The Cloud Run service exposes:
GET /healthz- lightweight health check
POST /sync- executes one end-to-end sync run
The /sync endpoint should remain authenticated.
- Python 3.12+
ghjqgcloudbq
- Cloud Run
- Cloud Scheduler
- Secret Manager
- BigQuery
- Cloud Build
- Artifact Registry
Recommended minimum token access:
- Classic PAT:
read:organdrepo - Fine-grained PAT: org member/team read plus repository metadata and collaborator read for the target repositories
cd <repo-dir>
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
export GH_TOKEN='your_pat'
export ORG='example-org'
export BQ_PROJECT='example-gcp-project'
export BQ_DATASET='github_access_audit'
export BQ_LOCATION='us-central1'
python -m app.mainTrigger a sync locally:
curl -X POST http://127.0.0.1:8080/syncThe deployment scripts are reusable and environment-driven. Do not edit the scripts for each environment; set the variables explicitly.
cd <repo-dir>
export PROJECT_ID='example-gcp-project'
export REGION='us-central1'
export DATASET='github_access_audit'
export ORG='example-org'
export SERVICE_NAME='github-access-sync'
export SECRET_NAME='github-access-token'
export GH_PAT='your_pat'
bash scripts/deploy.shIf GH_PAT is not exported, the bootstrap step falls back to gh auth token from the local GitHub CLI keychain.
Detailed deployment and IAM guidance: docs/deployment.md
Recommended usage pattern:
- use
current_access_matrixfor current-state access reporting - use
latest_*views for simple point-in-time reporting - use raw snapshot tables and
sync_runsfor historical analysis and operational tracking
Step-by-step Looker Studio instructions: docs/dashboard-guide.md
Synthetic sample files are included to illustrate the reporting shape:
- Keep the Cloud Run service private.
- Allow only approved callers such as Cloud Scheduler or a dedicated invoker service account.
- Do not commit PATs,
.envfiles, service account keys, exported snapshots, or real dashboard screenshots. - Treat
sync_skipped_itemsas an operational signal; skipped reads should be reviewed, not ignored. - Consider migrating from PAT-based auth to a GitHub App if you need stronger credential controls.
Run the local verification suite:
cd <repo-dir>
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
pytest
bash -n scripts/*.sh
python -m compileall app tests- Export duration depends on organization size and GitHub API responsiveness.
- Some GitHub endpoints may intermittently fail or throttle; these are retried and then recorded as skipped items if they still fail.
- Custom repository roles may be unavailable depending on plan and token permissions.
current_access_matrixis row-based, so distinct repo-user counts should use a calculated key instead of raw record count.
- GitHub App authentication
- Terraform for infrastructure provisioning
- CI for tests and static checks
- alerting on sync failures or abnormal skip volume
- richer lineage for direct versus inherited access paths