8 changes: 8 additions & 0 deletions AGENTS.md
@@ -25,6 +25,13 @@ make lint-shellcheck # shellcheck all *.sh
make validate-maestro # renders maestro chart to /dev/null
```

Lifecycle function checks:
```bash
make test-lifecycle-function # go test ./... in functions/lifecycle-enforcer/
make build-lifecycle-function # go build ./... in functions/lifecycle-enforcer/
make lint-lifecycle-function # go vet ./... in functions/lifecycle-enforcer/
```

Template/dry-run all four Helmfile environments explicitly:
```bash
# environment specific
@@ -90,6 +97,7 @@ Key variables:
| `BROKER_TYPE` | `googlepubsub` | `rabbitmq` | |
| `API_IMAGE_TAG` | `dev` | `local` | |
| `IMAGE_PULL_POLICY` | `Always` | `IfNotPresent` | |
| `LIFECYCLE_DIR` | `functions/lifecycle-enforcer` | `functions/lifecycle-enforcer` | Go Cloud Function source |

---

22 changes: 22 additions & 0 deletions Makefile
@@ -39,6 +39,8 @@ MAESTRO_CONSUMER ?= cluster1
MAESTRO_NAMESPACE ?= maestro
KUBECONFIG ?= $(HOME)/.kube/config

LIFECYCLE_DIR ?= functions/lifecycle-enforcer

CLEANER_NAMESPACE ?= $(NAMESPACE)
CLEANER_SCHEDULE ?= 0 * * * *
CLEANER_LABEL_SELECTOR ?= hyperfleet.io/cluster-id
@@ -243,6 +245,26 @@ uninstall-hyperfleet-adapters: check-kubectl-context ## Uninstall Hyperfleet Ada
	helmfile -f helmfile/helmfile.yaml.gotmpl -e $(HELMFILE_ENV) -l component=adapter destroy


# ==== Lifecycle Function Targets ====
.PHONY: test-lifecycle-function
test-lifecycle-function: ## Run unit tests for the lifecycle enforcer function
	@command -v go >/dev/null 2>&1 || { echo "ERROR: go is not installed"; exit 1; }
	cd "$(LIFECYCLE_DIR)" && go test ./... -v

.PHONY: build-lifecycle-function
build-lifecycle-function: ## Build the lifecycle enforcer function
	@command -v go >/dev/null 2>&1 || { echo "ERROR: go is not installed"; exit 1; }
	cd "$(LIFECYCLE_DIR)" && go build ./...

.PHONY: lint-lifecycle-function
lint-lifecycle-function: ## Lint the lifecycle enforcer function
	@command -v go >/dev/null 2>&1 || { echo "ERROR: go is not installed"; exit 1; }
	cd "$(LIFECYCLE_DIR)" && go vet ./...

.PHONY: add-ttl-labels
add-ttl-labels: ## Add TTL labels to existing GKE clusters (DRY_RUN=true by default)
	./scripts/add-ttl-labels.sh

# ==== Namespace Cleaner Targets ====
.PHONY: install-cleaner
install-cleaner: check-helm check-kubectl ## Install namespace cleaner CronJob (CLEANER_SCHEDULE, CLEANER_LABEL_SELECTOR, CLEANER_AGE_MINUTES)
21 changes: 20 additions & 1 deletion README.md
@@ -131,6 +131,15 @@ Configuration precedence (highest to lowest):
| `make install-cleaner` | Install namespace cleaner CronJob (configurable via `CLEANER_*` variables) |
| `make uninstall-cleaner` | Uninstall namespace cleaner CronJob |

### Lifecycle Enforcer

| Target | Description |
|--------|-------------|
| `make test-lifecycle-function` | Run unit tests for the lifecycle enforcer Cloud Function |
| `make build-lifecycle-function` | Build the lifecycle enforcer Cloud Function |
| `make lint-lifecycle-function` | Lint the lifecycle enforcer Cloud Function |
| `make add-ttl-labels` | Add TTL labels to existing GKE clusters (`DRY_RUN=true` by default) |

### Validation / CI

| Target | Description |
@@ -189,16 +198,20 @@ hyperfleet-infra/
│ ├── maestro/ # Maestro umbrella chart (deps via helm-git)
│ └── rabbitmq/ # Dev-only RabbitMQ (not production-ready)
├── scripts/
│ ├── add-ttl-labels.sh # Adds TTL labels to existing GKE clusters
│ ├── generate-rabbitmq-values.sh # Generates RabbitMQ broker config
│ └── kind-build-images.sh # Builds and loads images into kind
├── functions/
│ └── lifecycle-enforcer/ # Cloud Function: GKE cluster lifecycle enforcement
├── terraform/
│ ├── README.md # Detailed Terraform documentation
│ ├── main.tf # Root module (GKE cluster, Pub/Sub, firewall)
│ ├── main.tf # Root module (GKE cluster, Pub/Sub, firewall, lifecycle)
│ ├── helm-values-files.tf # Writes generated Helm values via local_file
│ ├── bootstrap/ # One-time GCP setup scripts (admin only)
│ ├── shared/ # Shared VPC infrastructure (deploy once)
│ ├── modules/
│ │ ├── cluster/gke/ # GKE cluster module
│ │ ├── lifecycle/ # Lifecycle enforcer (Cloud Function + Scheduler)
│ │ └── pubsub/ # Google Pub/Sub module
│ └── envs/gke/ # Per-developer tfvars and tfbackend files
├── generated-values-from-terraform/ # Auto-generated, gitignored
@@ -237,6 +250,12 @@ terraform apply

See [terraform/shared/README.md](terraform/shared/README.md) for details.

## Lifecycle Enforcer

A Cloud Function (Go) that enforces the [GCP Developer Cluster Lifecycle Policy](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/docs/gcp-developer-cluster-lifecycle.md): idle shutdown (>12h), TTL expiration, and missing-owner enforcement. It runs hourly via Cloud Scheduler and is deployed via Terraform when `enable_lifecycle_enforcer = true`.

See [functions/lifecycle-enforcer/README.md](functions/lifecycle-enforcer/README.md) for architecture, deployment, rollout, and configuration details.

## Related Repositories

- [hyperfleet-api](https://github.com/openshift-hyperfleet/hyperfleet-api) — API server
185 changes: 185 additions & 0 deletions functions/lifecycle-enforcer/README.md
@@ -0,0 +1,185 @@
# Lifecycle Enforcer

A Cloud Function (Go) that enforces the [GCP Developer Cluster Lifecycle Policy](https://github.com/openshift-hyperfleet/architecture/blob/main/hyperfleet/docs/gcp-developer-cluster-lifecycle.md).

It runs hourly via Cloud Scheduler, iterates over all GKE clusters in the `hcm-hyperfleet` project, and enforces:

- **Idle shutdown**: scales node pools to 0 when all nodes have been running for more than 12h
- **TTL expiration**: scales to 0 once the `ttl` label date has passed; deletes the cluster 48h after that shutdown
- **Missing owner**: scales to 0 on detection; deletes the cluster 7 days after that shutdown
- **Exempt clusters**: clusters labelled `environment: cicd` or named `hyperfleet-dev-ci-infra-*` are skipped

## Architecture

```
Cloud Scheduler (hourly, UTC)
▼ HTTP POST + OIDC token
Cloud Function Gen2 (lifecycle-enforcer)
├─ Lists all GKE clusters in hcm-hyperfleet
├─ For each cluster:
│ ├─ Fetches node pools + instances via Compute API
│ ├─ EvaluateCluster() → Decision (skip/shutdown/delete/label-only)
│ └─ Executes the action (or logs if DRY_RUN=true)
└─ Returns JSON with per-cluster results
```
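The flow above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the code in `function.go`: the `Result` type and the `enforce` and `dryRunEnabled` helpers are hypothetical names, the GKE and Compute API calls are replaced by plain function values, and treating every `DRY_RUN` value other than the literal `false` as dry-run is an assumption about how the safe default might be kept.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Result is a hypothetical per-cluster entry in the JSON response.
type Result struct {
	Cluster  string `json:"cluster"`
	Action   string `json:"action"`
	Executed bool   `json:"executed"`
	Error    string `json:"error,omitempty"`
}

// dryRunEnabled treats anything except the literal "false" as dry-run
// (an assumption), so an unset DRY_RUN keeps the safe default.
func dryRunEnabled(value string) bool {
	return value != "false"
}

// enforce evaluates every cluster and executes the chosen action unless
// dryRun is set. decide and execute stand in for the real rule evaluation
// and the GKE API calls.
func enforce(clusters []string, dryRun bool,
	decide func(cluster string) string,
	execute func(cluster, action string) error) []Result {

	results := make([]Result, 0, len(clusters))
	for _, name := range clusters {
		res := Result{Cluster: name, Action: decide(name)}
		if res.Action != "skip" && !dryRun {
			if err := execute(name, res.Action); err != nil {
				res.Error = err.Error()
			} else {
				res.Executed = true
			}
		}
		results = append(results, res)
	}
	return results
}

func main() {
	// Stand-in decision rule: exempt CI infra clusters, shut down the rest.
	decide := func(name string) string {
		if name == "hyperfleet-dev-ci-infra-1" {
			return "skip"
		}
		return "shutdown"
	}
	// Stand-in executor: only logs what a real implementation would do.
	execute := func(cluster, action string) error {
		fmt.Fprintf(os.Stderr, "executing %s on %s\n", action, cluster)
		return nil
	}

	clusters := []string{"hyperfleet-dev-alice", "hyperfleet-dev-ci-infra-1"}
	results := enforce(clusters, dryRunEnabled(os.Getenv("DRY_RUN")), decide, execute)
	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Passing the decision and execution steps in as functions keeps the loop testable without any cloud credentials; a dry run simply never calls `execute`.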

### Infrastructure (Terraform)

| Resource | Purpose |
| ------------------- | ------------------------------------------------------------------------------- |
| Cloud Function Gen2 | Iterates clusters, evaluates rules, executes actions |
| Cloud Scheduler | Hourly HTTP POST trigger with OIDC auth |
| GCS Bucket | Stores the function source zip |
| Service Accounts | Function SA (`container.admin`, `compute.viewer`), Scheduler SA (`run.invoker`) |

Terraform module: [`terraform/modules/lifecycle/`](../../terraform/modules/lifecycle/)

## Enforcement Rules

### Decision priority

1. **Exempt check** — skip if `environment: cicd` or name starts with `hyperfleet-dev-ci-infra-`
2. **Deletion check** — if `shutdown-date` label exists and grace period expired:
   - Missing owner + >7 days → delete
   - TTL expired + >48h → delete
3. **Shutdown check** — if cluster is running (node count > 0):
   - Missing `owner` label → scale to 0, set `shutdown-date`
   - Missing or expired `ttl` label → scale to 0, set `shutdown-date`
   - All nodes running >12h → scale to 0 (no `shutdown-date`, no deletion path)
4. **No action** — cluster is healthy

### Labels

| Label | Set by | Purpose |
| --------------- | ---------------------------------- | ----------------------------------------------------------------- |
| `environment` | Terraform (`var.environment`) | `dev` = enforced, `cicd` = exempt |
| `ttl` | Terraform (`plantimestamp()` + 5d) | Expiration date (`YYYY-MM-DD`). Re-applying Terraform renews it |
| `owner` | Terraform (`var.developer_name`) | Cluster ownership |
| `shutdown-date` | Enforcer function | Tracks when a cluster was first shut down (grace period tracking) |
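
As a rough illustration of the `ttl` label arithmetic, the following Go sketch computes the value a fresh apply would produce (apply date plus 5 days, `YYYY-MM-DD`) and how many days remain on an existing label. The function names `newTTL`, `daysLeft`, and `utcDate` are hypothetical, and the use of UTC is an assumption; the label itself is set by Terraform, not by this code.

```go
package main

import (
	"fmt"
	"time"
)

const ttlLayout = "2006-01-02" // Go reference layout for YYYY-MM-DD

// newTTL returns the label value for an apply happening at the given time:
// the date five days later, in UTC (the UTC choice is an assumption).
func newTTL(applied time.Time) string {
	return applied.UTC().AddDate(0, 0, 5).Format(ttlLayout)
}

// daysLeft reports the whole days between today (UTC) and the ttl date.
// The result is negative once the date has passed.
func daysLeft(ttl string, now time.Time) (int, error) {
	expiry, err := time.Parse(ttlLayout, ttl)
	if err != nil {
		return 0, fmt.Errorf("invalid ttl label %q: %w", ttl, err)
	}
	today := now.UTC().Truncate(24 * time.Hour) // midnight UTC of the current day
	return int(expiry.Sub(today).Hours() / 24), nil
}

// utcDate builds a midnight-UTC timestamp; a small helper for examples.
func utcDate(year, month, dayOfMonth int) time.Time {
	return time.Date(year, time.Month(month), dayOfMonth, 0, 0, 0, 0, time.UTC)
}

func main() {
	applied := utcDate(2024, 6, 10)
	ttl := newTTL(applied)
	left, err := daysLeft(ttl, applied)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("ttl=%s, days left=%d\n", ttl, left)
}
```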

### State machine

```
┌─────────────────────────────────────────────┐
│                   RUNNING                   │
│    (nodes > 0, TTL valid, owner present)    │
└──────────┬──────────────┬──────────────┬────┘
           │              │              │
 idle >12h │  TTL expired │     no owner │
           ▼              ▼              ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SCALED DOWN  │ │ SCALED DOWN  │ │ SCALED DOWN  │
│ (idle)       │ │ +shutdown-   │ │ +shutdown-   │
│ no deletion  │ │ date label   │ │ date label   │
│ path         │ │              │ │              │
└──────────────┘ └──────┬───────┘ └──────┬───────┘
                        │                │
developer scales        │ 48h            │ 7 days
back up anytime         ▼                ▼
                   ┌──────────┐     ┌──────────┐
                   │ DELETED  │     │ DELETED  │
                   └──────────┘     └──────────┘
```

## Scaling back up (daily workflow)

The idle shutdown scales your cluster to 0 once all nodes have been running for more than 12 hours. To start working the next day, scale your node pool back up:

```bash
gcloud container clusters resize hyperfleet-dev-<username> \
--node-pool hyperfleet-dev-<username>-pool \
--num-nodes 1 \
--zone us-central1-a \
--project hcm-hyperfleet \
--quiet
```

No TTL renewal is needed — the idle shutdown does not affect your TTL or trigger the deletion path.

## Renewing your cluster (TTL)

The `ttl` label is set to the current date + 5 days on every `terraform apply`. When your TTL is about to expire (or has already expired), renew it:

```bash
make install-terraform
```

This resets the TTL and clears the enforcement state. If the cluster was already scaled to 0, you also need to scale the node pool back up (see above).

## Deployment

### Enable in tfvars

```hcl
enable_lifecycle_enforcer = true
lifecycle_enforcer_dry_run = true # safe by default
```

### Apply

```bash
make install-terraform
```

### Rollout

1. Deploy with `lifecycle_enforcer_dry_run = true` (default) — logs all actions without executing
2. Check logs in Cloud Logging:
   ```
   resource.type="cloud_run_revision"
   resource.labels.service_name="lifecycle-enforcer"
   ```
3. Add TTL labels to existing clusters:
   ```bash
   DRY_RUN=false make add-ttl-labels
   ```
4. When confident, set `lifecycle_enforcer_dry_run = false` and re-apply

### Configuration

| Variable | Default | Description |
| ----------------------------- | ----------- | --------------------------------------------- |
| `enable_lifecycle_enforcer` | `false` | Deploy the Cloud Function and Cloud Scheduler |
| `lifecycle_enforcer_dry_run` | `true` | Log actions without executing |
| `lifecycle_enforcer_schedule` | `0 * * * *` | Cloud Scheduler cron expression (hourly) |

### Environment variables (Cloud Function)

| Variable | Default | Description |
| ---------------- | ---------------- | --------------------------------------------- |
| `PROJECT_ID` | `hcm-hyperfleet` | GCP project to scan for clusters |
| `DRY_RUN` | `true` | Set to `false` to execute enforcement actions |

## Development

### Run tests

```bash
make test-lifecycle-function
```

### Build

```bash
make build-lifecycle-function
```

### Lint

```bash
make lint-lifecycle-function
```

### Code structure

| File | Purpose |
| ------------------ | ---------------------------------------------------------------------------- |
| `decision.go` | Pure enforcement decision logic — no GCP SDK dependency, fully unit-testable |
| `decision_test.go` | Table-driven tests covering all enforcement scenarios |
| `function.go` | Cloud Function entry point, GKE/Compute API client, action executor |

The decision logic is intentionally separated from the GKE API interaction. `EvaluateCluster()` is a pure function that takes a `ClusterInfo` struct and returns a `Decision` — no mocking needed for tests.
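
To illustrate why a pure function needs no mocks, here is a small table-driven check of the exempt rule alone. It is a sketch, not an excerpt from `decision_test.go`: `isExempt`, `exemptCase`, and `exemptCases` are hypothetical names, and the table is run from `main` so the example is runnable on its own. In a real `_test.go` file the same table would typically be looped over with `t.Run`.

```go
package main

import (
	"fmt"
	"strings"
)

// isExempt mirrors the exempt rule. It is pure: the result depends only on
// its arguments, so no GCP client or mock is involved.
func isExempt(name string, labels map[string]string) bool {
	return labels["environment"] == "cicd" ||
		strings.HasPrefix(name, "hyperfleet-dev-ci-infra-")
}

// exemptCase is one row of the test table.
type exemptCase struct {
	desc   string
	name   string
	labels map[string]string
	want   bool
}

// exemptCases returns the table of inputs and expected results.
func exemptCases() []exemptCase {
	return []exemptCase{
		{"cicd environment label", "hyperfleet-dev-alice", map[string]string{"environment": "cicd"}, true},
		{"CI infra name prefix", "hyperfleet-dev-ci-infra-1", nil, true},
		{"regular dev cluster", "hyperfleet-dev-alice", map[string]string{"environment": "dev"}, false},
		{"no labels at all", "hyperfleet-dev-bob", nil, false},
	}
}

func main() {
	failed := 0
	for _, tc := range exemptCases() {
		status := "ok"
		if isExempt(tc.name, tc.labels) != tc.want {
			status = "FAIL"
			failed++
		}
		fmt.Printf("%-4s %s\n", status, tc.desc)
	}
	fmt.Printf("%d failed\n", failed)
}
```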