Parent Epic: #190
Status: ✅ COMPLETED (2025-12-28)
Grafana dashboards and alerts deployed via Terraform IaC to Grafana Cloud.
Overview
Create Grafana dashboards and alerting rules for M3W production monitoring.
Deployed Resources
| Resource | URL | Status |
|---|---|---|
| System Overview | https://test3207.grafana.net/d/m3w-system-overview | ✅ Working |
| Application Dashboard | https://test3207.grafana.net/d/m3w-application | ⚠️ Partial (container panels working; request metrics need #249) |
| Log Explorer | https://test3207.grafana.net/d/m3w-log-explorer | ⏳ Pending (needs logging PR merged) |
| Alert Rules | https://test3207.grafana.net/alerting/list | ✅ Working |
Dashboards
1. System Overview ✅ Working Now
- Node health (all 4 VMs)
- CPU, Memory, Disk, Network per node
- Container count
- System load, Uptime
2. Application Dashboard ⚠️ Partial
- Container CPU/Memory panels ✅ Working
- Request rate/error rate/latency ⏳ Needs structured logs from #249 (Backend structured logging with traceId)
3. Log Explorer ⏳ Pending
- Pre-configured queries for common searches
- Error log view
- Request trace view (by traceId)
- Requires: Backend structured logging with traceId (#249) merged and deployed
Alert Rules ✅ Deployed
Critical (Email immediately)
| Alert | Condition | For | Status |
|---|---|---|---|
| Node Down | No metrics for 2min | 2m | ✅ Active |
| Disk Usage Critical | >90% usage | 5m | ✅ Active |
| Memory Usage Critical | >90% usage | 5m | ✅ Active |
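A rule from the table above could be expressed with the Grafana Terraform provider roughly as follows. This is a minimal sketch, not the contents of the actual `grafana.tf`: the folder title, the `prometheus_datasource_uid` variable, the group name, the `severity` label, and the PromQL expression (which assumes node_exporter-style metric names) are all hypothetical, and the expression `model` JSON may need adjusting for the Grafana version in use.

```hcl
# Hypothetical variable: UID of the Prometheus data source in Grafana Cloud.
variable "prometheus_datasource_uid" {
  type = string
}

# Hypothetical folder to hold the alert rules.
resource "grafana_folder" "alerts" {
  title = "M3W Alerts"
}

resource "grafana_rule_group" "critical" {
  name             = "m3w-critical" # hypothetical group name
  folder_uid       = grafana_folder.alerts.uid
  interval_seconds = 60

  rule {
    name      = "Disk Usage Critical"
    condition = "C"  # the threshold step below decides firing
    for       = "5m" # matches the "For" column in the table

    labels = {
      severity = "critical" # assumed label, useful for routing
    }

    # A: disk usage percentage per filesystem (assumed metric names).
    data {
      ref_id         = "A"
      datasource_uid = var.prometheus_datasource_uid
      relative_time_range {
        from = 600
        to   = 0
      }
      model = jsonencode({
        refId   = "A"
        expr    = "100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)"
        instant = true
      })
    }

    # B: reduce each series to its last value.
    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "B"
        type       = "reduce"
        expression = "A"
        reducer    = "last"
      })
    }

    # C: fire when usage is above 90%.
    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "C"
        type       = "threshold"
        expression = "B"
        conditions = [{
          evaluator = { type = "gt", params = [90] }
        }]
      })
    }
  }
}
```

The other rules would follow the same query, reduce, threshold shape with a different expression, threshold value, and `for` duration.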
Warning (Email digest)
| Alert | Condition | For | Status |
|---|---|---|---|
| High CPU | >80% usage | 10m | ✅ Active |
| High Memory | >80% usage | 10m | ✅ Active |
| Disk Usage Warning | >80% usage | 5m | ✅ Active |
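The split between immediate email for critical alerts and a digest for warnings could be modelled as a notification policy that routes on a label. The sketch below assumes each rule carries a `severity` label. The contact point name, the email address, and all timing values are placeholders rather than the deployed configuration.

```hcl
# Hypothetical email contact point; the address is a placeholder.
resource "grafana_contact_point" "email" {
  name = "m3w-email"

  email {
    addresses = ["ops@example.com"]
  }
}

# Note: this resource manages the whole notification policy tree of the stack.
resource "grafana_notification_policy" "root" {
  contact_point = grafana_contact_point.email.name
  group_by      = ["alertname"]

  # Critical: short wait so the email goes out almost immediately.
  policy {
    contact_point = grafana_contact_point.email.name
    group_by      = ["alertname"]
    group_wait    = "10s"

    matcher {
      label = "severity"
      match = "="
      value = "critical"
    }
  }

  # Warning: batch alerts into a less frequent digest-style email.
  policy {
    contact_point   = grafana_contact_point.email.name
    group_by        = ["severity"]
    group_wait      = "10m"
    group_interval  = "1h"
    repeat_interval = "12h"

    matcher {
      label = "severity"
      match = "="
      value = "warning"
    }
  }
}
```

Grouping warnings by `severity` alone collapses them into a single notification per interval, which approximates a digest. Critical alerts stay grouped per alert name so each one is reported separately.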
Tasks
- Create System Overview dashboard
- Create Application dashboard
- Create Log Explorer dashboard
- Configure alert contact point (Email)
- Create alert rules
- Test alerts (trigger manually)
- Document dashboard access (in SECRETS.md)
IaC Files (m3w-k8s)
terraform/
├── grafana.tf # Dashboard, alert, contact point resources
├── grafana/
│ ├── system-overview.json
│ ├── application.json
│ └── log-explorer.json
├── providers.tf # grafana provider v4.21.0
└── variables.tf # grafana_url, grafana_service_account_token
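For orientation, the layout above might be wired together along these lines. This is a hedged sketch based only on the file names and comments in the tree: the resource name `m3w` and the use of `for_each` are assumptions, and the real `grafana.tf` may define each dashboard separately.

```hcl
# providers.tf: pin the Grafana provider (version taken from the tree above).
terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "4.21.0"
    }
  }
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_service_account_token
}

# variables.tf: inputs named in the tree above.
variable "grafana_url" {
  type = string
}

variable "grafana_service_account_token" {
  type      = string
  sensitive = true # keep the token out of plan output
}

# grafana.tf: one dashboard resource per JSON file (resource name is hypothetical).
resource "grafana_dashboard" "m3w" {
  for_each    = toset(["system-overview", "application", "log-explorer"])
  config_json = file("${path.module}/grafana/${each.key}.json")
}
```

With this approach the dashboard UID in each URL (for example `m3w-system-overview`) would come from the `uid` field inside the corresponding JSON file.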
Dependencies
- Set up Grafana Cloud + deploy Alloy agents #251 (Alloy deployed, data flowing)
- Backend structured logging with traceId #249 (needed for the Application and Log Explorer dashboards)
- OpenResty JSON access log format #252 (Gateway JSON logs)