sync to refactored break by brownzebra · Pull Request #2 · brownzebra/cluster-forge

brownzebra · 2024-12-17T08:07:44Z

No description provided.


…crets.enabled

Add externalSecrets.enabled (default true) values key to keycloak-old and minio-tenant-config charts. Wrap each ExternalSecret/Secret/ClusterSecretStore template with {{- if .Values.externalSecrets.enabled }} so that consumers can opt out and provide credentials as plain Kubernetes Secrets. Default value true keeps render output bit-equal to existing behavior.

install_base.sh:
- Pass --set externalSecrets.enabled=false to keycloak-old and minio-tenant-config helm template calls (manual install does not use the external-secrets operator)
- Drop the rm -f workarounds that previously stripped es-* and *-clustersecretstore.yaml templates after copying them to a tmpdir
- Drop the TEMP_MINIO_CONFIG_DIR mktemp dance (no patches remain for minio-tenant-config so the chart can be templated directly)

Affected charts:
- sources/keycloak-old: 4 ExternalSecret templates (es-airm-realm-credentials, es-keycloak-credentials, es-keycloak-cnpg-user-credentials, es-keycloak-cnpg-superuser-credentials)
- sources/minio-tenant-config: 1 Secret + 2 ExternalSecret + 1 ClusterSecretStore templates (minio-secret-example, minio-es-env-config, minio-es-default-user, minio-tls-clustersecretstore)

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…d + postgresql.*)

Decouple keycloak-old from the in-cluster CNPG Cluster resource so that the chart can also point at an external PostgreSQL.

values.yaml additions:
- cnpg.enabled (default true): whether to render the in-cluster CNPG Cluster
- cnpg.instances and cnpg.storage.storageClass: CNPG Cluster sizing/storage
- postgresql.{host,port,database,username,userSecretName}: connection used by keycloak-deployment in both modes; defaults match the in-cluster CNPG service so existing renders are bit-equal

templates/keycloak-cnpg.yaml:
- Wrap the entire Cluster resource with {{- if .Values.cnpg.enabled }}
- Parametrize bootstrap.initdb.{database,owner,postInitSQL,secret.name} from postgresql.* so the Cluster matches what keycloak-deployment connects to
- Parametrize storage.storageClass and walStorage.storageClass from cnpg.storage.storageClass (replaces the install_base.sh sed patch)
- Parametrize instances from cnpg.instances (the previous --set cnpg.instances install_base.sh argument was a dead value)

templates/keycloak-deployment.yaml:
- KC_DB_URL_HOST/PORT/DATABASE read from .Values.postgresql.*
- KC_DB_USERNAME/PASSWORD secret name reads from postgresql.userSecretName

templates/es-keycloak-cnpg-{user,superuser}-credentials.yaml:
- Extend the externalSecrets.enabled gate with .Values.cnpg.enabled (CNPG user/superuser secrets are not relevant when the Cluster is not rendered)
- Parametrize ExternalSecret target.name and metadata.name from postgresql.userSecretName so user-supplied secret names propagate

install_base.sh:
- Drop the sed patch that rewrote storageClass: default in keycloak-cnpg.yaml
- Pass --set cnpg.storage.storageClass=${DEFAULT_STORAGE_CLASS_NAME}
- Drop the dead --set storageClassName argument (no chart template reads it)

Pluggable mode usage:

    kubectl create secret generic keycloak-db-user -n keycloak \
      --from-literal=username=keycloak --from-literal=password=...

    helm template keycloak sources/keycloak-old \
      --set cnpg.enabled=false --set externalSecrets.enabled=false \
      --set postgresql.host=mydb.example.com \
      --set postgresql.userSecretName=keycloak-db-user \
      --namespace keycloak | kubectl apply --server-side -f -

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…ed patches

Move three remaining install_base.sh sed patches into the chart so the manual install can helm template the upstream chart directly without copying it to a tmpdir first.

templates/keycloak-realm-templates-cm.yaml:
- Fix the placeholder on line 109 (admin-client-id-value -> __AIRM_ADMIN_CLIENT_ID__). This was an outright bug: every other reference to the AIRM admin client id in the realm template uses the runtime placeholder __AIRM_ADMIN_CLIENT_ID__ that the init container sed-replaces with $AIRM_ADMIN_CLIENT_ID at startup. Line 109 had the example string admin-client-id-value, so the realm-import policy entry never resolved to the real admin client id.

templates/keycloak-deployment.yaml:
- KC_HOSTNAME reads from .Values.hostname with a printf default of https://kc.<domain>. install_base.sh was already passing --set hostname=${KC_HOSTNAME} but no template consumed it, so it relied on a sed patch to overwrite the hardcoded value.
- Add -y and "*/dev/*" / "*/sys/*" / "*/proc/*" excludes to the zip command in init-auth-extensions so it doesn't choke on symlinks or system paths when packaging the SilogenExtensionPackage jar.

values.yaml:
- Add hostname: "" placeholder so the override is documented.

Chart.yaml:
- keycloak-old 0.1.0 -> 0.2.0, minio-tenant-config 0.1.0 -> 0.2.0 (minor bumps for the new values keys added in this PR series; defaults keep render output bit-equal to 0.1.0).

install_base.sh:
- Drop the four sed patches and the TEMP_KC_DIR mktemp dance that copied keycloak-old to a tmpdir just to rewrite four lines. Helm templates the chart directly from ${SOURCES_DIR}/keycloak-old now.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>


…b.sh

Apply the chart-side parametrization landed earlier in this PR series so the manual install can connect AIWB and Keycloak to a user-supplied PostgreSQL without any post-install patching.

install_base.sh:
- Add the same env vars db.sh used to read (POSTGRES_HOST/PORT, AIWB_DB_*, KEYCLOAK_DB_*) plus AIWB_DB_SECRET_NAME and KEYCLOAK_DB_SECRET_NAME (default aiwb-db-user / keycloak-db-user, distinct from the cnpg-managed defaults).
- When PLUGGABLE_DB=true, create the credentials secrets in aiwb and keycloak namespaces (kubectl create secret) so the deployments find them at startup.
- Pass --set postgresql.{host,port,database,username,userSecretName} to both the keycloak-old and aiwb helm template invocations. The chart templates (which were already parametrized in earlier commits in this PR series) consume them for KC_DB_URL_*, the wait-for-db init container, the liquibase-migrate init container, the aiwb-api environment, and all secret references, replacing every kubectl set env / kubectl patch db.sh used to perform after the fact.
- Update the comment above the CNPG operator install and the success message at the end of the script to reflect that no follow-up db.sh run is needed.

db.sh:
- Replace the patching script with a deprecation notice. All cluster mutations (kubectl delete cluster, kubectl patch secret, kubectl set env, kubectl patch deployment, kubectl rollout restart) are gone; they are now redundant because install_base.sh applies the same configuration at helm template time. The file can be deleted once external consumers have migrated.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
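A rough sketch of how the env vars described in this commit might be turned into a single `--set` value for the keycloak-old chart. The postgresql.* keys, POSTGRES_HOST/PORT and KEYCLOAK_DB_SECRET_NAME come from the commit message; the function name, the KEYCLOAK_DB_NAME and KEYCLOAK_DB_USER variable names, and the fallback values 5432 and keycloak are assumptions made for illustration and may not match install_base.sh.

```shell
# Hypothetical helper, not taken from install_base.sh.
# Prints one comma-separated string suitable for a single `helm --set` flag.
# Values containing commas would need escaping; this sketch does not handle that.
keycloak_db_values() {
  # POSTGRES_HOST has no sensible default, so fail loudly when it is missing.
  printf 'postgresql.host=%s,' "${POSTGRES_HOST:?POSTGRES_HOST must be set}"
  # Assumed fallbacks; adjust to whatever the real script uses.
  printf 'postgresql.port=%s,' "${POSTGRES_PORT:-5432}"
  printf 'postgresql.database=%s,' "${KEYCLOAK_DB_NAME:-keycloak}"
  printf 'postgresql.username=%s,' "${KEYCLOAK_DB_USER:-keycloak}"
  # Default secret name as stated in the commit message.
  printf 'postgresql.userSecretName=%s\n' "${KEYCLOAK_DB_SECRET_NAME:-keycloak-db-user}"
}
```

Under those assumptions, the result could be passed along as `helm template keycloak sources/keycloak-old --set "$(keycloak_db_values)" --namespace keycloak | kubectl apply --server-side -f -`.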


Extract CNPG-specific Secrets into secrets-aiwb-cnpg.yaml and MinIO-related Secrets into secrets-aiwb-minio.yaml so install_base.sh can apply each only when the corresponding in-cluster component is being installed. Update secrets.md with the new sources and per-secret pluggable mode notes.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…tall_base.sh

In PLUGGABLE_S3=true mode, install_base.sh now:
- skips secrets-aiwb-minio.yaml (placeholder MinIO secrets)
- creates minio-credentials in aiwb and workbench from MINIO_ACCESS_KEY / MINIO_SECRET_KEY env vars
- creates the in-cluster minio.minio-tenant-default redirect Service whose Endpoints point to MINIO_HOST_IP:MINIO_PORT (used by aim-performance and any other consumer that hardcodes the in-cluster MinIO URL)

Mirrors the existing PLUGGABLE_DB=true env-based secret creation pattern. Conditional applies for the new secrets-aiwb-cnpg.yaml / secrets-aiwb-minio.yaml files added in the same place.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
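The redirect described above is the standard Kubernetes pattern of a Service without a selector paired with hand-written Endpoints. A minimal sketch of what such a manifest could look like, assuming the Service is named minio in namespace minio-tenant-default (inferred from the host name minio.minio-tenant-default) and that the Service exposes the same port as the external MinIO; the function name and the port name "http" are hypothetical.

```shell
# Hypothetical sketch, not the manifest install_base.sh actually applies.
# Renders a selectorless Service and matching Endpoints to stdout.
render_minio_redirect() {
  # Both values are required; abort with a message when either is missing.
  : "${MINIO_HOST_IP:?MINIO_HOST_IP must be set}"
  : "${MINIO_PORT:?MINIO_PORT must be set}"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: minio-tenant-default
spec:
  # No selector: Kubernetes will not manage the Endpoints for this Service.
  ports:
    - name: http
      port: ${MINIO_PORT}
      targetPort: ${MINIO_PORT}
---
apiVersion: v1
kind: Endpoints
metadata:
  # Must match the Service name so traffic is routed to these addresses.
  name: minio
  namespace: minio-tenant-default
subsets:
  - addresses:
      - ip: ${MINIO_HOST_IP}
    ports:
      - name: http
        port: ${MINIO_PORT}
EOF
}
```

With those assumptions, `MINIO_HOST_IP=10.0.0.5 MINIO_PORT=9999 render_minio_redirect | kubectl apply --server-side -f -` would create both objects. Endpoints addresses must be IP addresses rather than host names, which fits the MINIO_HOST_IP variable named in the commit.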

s3.sh previously combined a destructive in-cluster MinIO Tenant migration with live patches to Secrets, Deployments, and a redirect Service. The redirect Service and credential creation now live in install_base.sh's PLUGGABLE_S3 branch, the AIWB --set minio.url already drove the env update, and the Tenant deletion / namespace label fix were one-shot migration steps.

What remains is a non-destructive verifier that prints AIWB's MINIO_URL / MINIO_BUCKET, the access keys in the aiwb / workbench minio-credentials Secrets, the redirect Endpoints, and runs a curl pod against the in-cluster URL's /minio/health/live to confirm the redirect resolves to a healthy external MinIO.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The previous guides documented a manual post-install patching workflow (delete CNPG / MinIO Tenant, kubectl patch Secrets, kubectl set env on Deployments) that no longer matches reality: install_base.sh now drives both PLUGGABLE_DB and PLUGGABLE_S3 end to end. Both pages now describe:
- prerequisites (external Postgres / S3, required databases or buckets)
- the env vars consumed by install_base.sh
- pluggable.sh as the single entry point
- what install_base.sh does in pluggable mode
- verification
- limitations (no migration path, single-host assumption, etc.)

Cross-references to EXTERNAL_FIXES.md added for the aim-performance BUCKET_STORAGE_HOST workaround and the AIWB pg_isready -U postgres noise.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

`export PLUGGABLE_DB=false` unconditionally overwrote whatever the caller had exported, so `PLUGGABLE_DB=true ./pluggable.sh <DOMAIN>` silently fell back to the in-cluster Postgres / MinIO path. Switch to the ${VAR:-default} pattern so pre-set values pass through.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
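The difference between the two assignment styles can be reproduced in isolation. This is a minimal sketch; the function names old_style and new_style are hypothetical and do not appear in pluggable.sh.

```shell
# Hypothetical reproduction of the bug and the fix described above.

# Old behaviour: the unconditional export clobbers the caller's value.
old_style() {
  export PLUGGABLE_DB=false
  echo "$PLUGGABLE_DB"
}

# New behaviour: keep a pre-set, non-empty value and fall back to "false"
# only when PLUGGABLE_DB is unset or empty.
new_style() {
  export PLUGGABLE_DB="${PLUGGABLE_DB:-false}"
  echo "$PLUGGABLE_DB"
}
```

Note that `${VAR:-default}` also substitutes the default when the variable is set but empty; `${VAR-default}` (without the colon) would only do so when it is unset.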

When upgrading from a PLUGGABLE_S3=false install (where the MinIO Operator created a selector-backed Service named "minio") to PLUGGABLE_S3=true, the selectorless redirect Service apply produced server-side-apply conflicts. Delete Service and Endpoints first with --ignore-not-found so fresh installs remain a no-op and re-installs over an existing in-cluster MinIO succeed.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
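A sketch of the delete-before-apply step, assuming the Service and Endpoints are both named minio and live in the minio-tenant-default namespace (the namespace is inferred, not stated in this commit). The function name and the KUBECTL and MINIO_NAMESPACE override variables are hypothetical conveniences for this example.

```shell
# Hypothetical sketch, not copied from install_base.sh.
# KUBECTL can be overridden, for example KUBECTL="echo kubectl", to print
# the commands instead of running them against a cluster.
KUBECTL="${KUBECTL:-kubectl}"
MINIO_NAMESPACE="${MINIO_NAMESPACE:-minio-tenant-default}"

reset_minio_redirect() {
  # --ignore-not-found makes each delete succeed when the object does not
  # exist, so a fresh install passes straight through.
  $KUBECTL delete service minio -n "$MINIO_NAMESPACE" --ignore-not-found
  $KUBECTL delete endpoints minio -n "$MINIO_NAMESPACE" --ignore-not-found
}
```

Running `KUBECTL="echo kubectl" reset_minio_redirect` is a cheap way to inspect the two commands before pointing the function at a real cluster.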

The printed instructions referenced ./pluggable/scripts/s3.sh, a path that does not exist, and treated s3.sh as the install entry point. After the recent refactor, install_base.sh handles all PLUGGABLE_S3=true setup and s3.sh is only a post-install verifier. Point users at scripts/pluggable.sh with the right env vars and mention s3.sh as an optional verification step.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

The PLUGGABLE_S3=true secret-creation block referenced MINIO_ACCESS_KEY / MINIO_SECRET_KEY without defaults, tripping `set -u` when callers hadn't exported them. Default to `examplepass` to match s3_minio_container.sh's own defaults so the stock dev flow works out of the box.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
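A small illustration of why the defaults matter under `set -u`. The variable names and the `examplepass` default come from the commit message; the helper name resolve_minio_credentials is hypothetical.

```shell
# Hypothetical helper, not taken from install_base.sh.
# ${VAR:-default} is safe under `set -u`, whereas a bare "$VAR" on an unset
# variable aborts the script with an "unbound variable" error.
resolve_minio_credentials() {
  MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:-examplepass}"
  MINIO_SECRET_KEY="${MINIO_SECRET_KEY:-examplepass}"
}

# Run in a subshell so `set -u` and the unset do not leak into the caller.
(
  set -u
  unset MINIO_ACCESS_KEY MINIO_SECRET_KEY
  resolve_minio_credentials
  echo "access key: ${MINIO_ACCESS_KEY}"
)
```

Callers who export their own MINIO_ACCESS_KEY / MINIO_SECRET_KEY before running the script keep those values; the default only applies when nothing was provided.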

Re-running install_base.sh failed because kubectl-patch took ownership of .rules, conflicting with the next server-side helm apply, and the JSON patch added a duplicate tokenreviews rule. Force-conflict the helm apply so kubectl reclaims ownership, and skip the patch when the rule is already present.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
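One way the "skip the patch when the rule is already present" check could look. The commit does not name the patched resource, so this sketch assumes a ClusterRole passed in by name; the function names, the KUBECTL override, and the exact rule contents (apiGroup authentication.k8s.io, verb create) are assumptions for illustration only.

```shell
# Hypothetical sketch, not the code from install_base.sh.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds when the role JSON on stdin already mentions tokenreviews.
has_tokenreviews_rule() {
  grep -q '"tokenreviews"'
}

# $1: name of the ClusterRole to check and, if needed, patch.
ensure_tokenreviews_rule() {
  if $KUBECTL get clusterrole "$1" -o json | has_tokenreviews_rule; then
    echo "tokenreviews rule already present, skipping patch"
    return 0
  fi
  # Append the rule only when it is missing, so re-runs do not duplicate it.
  $KUBECTL patch clusterrole "$1" --type=json -p \
    '[{"op":"add","path":"/rules/-","value":{"apiGroups":["authentication.k8s.io"],"resources":["tokenreviews"],"verbs":["create"]}}]'
}
```

The force-conflict half of the fix is presumably a matter of adding `--force-conflicts` to the `kubectl apply --server-side` call that consumes the helm output, which lets the apply take back fields previously owned by kubectl-patch.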

scripts/bootstrap.sh:
- render_actual_helm_manifests() function that renders actual Helm manifests instead of ArgoCD Applications in template mode
- Proper handling of valuesFiles using JSON to avoid bash array syntax issues
- Fixed symlink resolution with readlink -f

root/values.yaml:
- Kaiwo version from v0.2.0-rc11 to v0.2.0-rc12 to support GPU preemption features


…orge (#734)

* EAI-1500 update documentation

* EAI-5821 add ai-gateway config

* fix: render argocd application template for oci and https

* EAI-5821 fix sbom components

* feat(envoy-gateway-config): add tls-passthrough gateway for k8s API

  Replace the :6443 k8s-passthrough listener on the shared `https` gateway with a dedicated `tls-passthrough` gateway on :443 that owns the external MetalLB LoadBalancer and does SNI-based TLS passthrough: k8s.<domain> -> kube API service, *.<domain> -> apps gateway. The apps gateway moves to ClusterIP behind it. The listener and TLSRoutes carry explicit hostnames: Envoy Gateway TLS passthrough builds SNI filter chains from hostnames, so an empty hostname yields an empty Envoy config that never routes. Listening on :443 instead of :6443 avoids hijacking pod->apiserver traffic where the node IP equals the MetalLB pool IP.

  Refs: EAI-5821

* fix(envoy-gateway-config): invert gateway front door while debugging

* fix(envoy-gateway): scope AI extension to its own listeners

  Set extensionManager listener.includeAll=false so the AI Gateway xDS translation hook only receives listeners generated for its own resources (AIGatewayRoute/AIServiceBackend/InferencePool). With includeAll=true the hook also received the L4 tls-passthrough listener and tried to insert its request-header-metadata HTTP filter into a TCP filter chain that has no HTTPConnectionManager. That failed xDS translation for the entire GatewayClass, so the passthrough data plane got an empty snapshot and never left initialization.

* fix(envoy-gateway-config): restore passthrough gateway as front door

  Revert the debug inversion: the tls-passthrough gateway owns the external MetalLB LoadBalancer on :443 (SNI passthrough) and the apps gateway drops back to ClusterIP behind it. The inversion was a workaround for the passthrough data plane not starting, which is now fixed.

* EAI-5821: Wire API key auth and metrics for AI gateway

  - Bump cluster-auth to 0.6.0-rc2, which injects x-api-key-id and x-auth-username on every authenticated request and supports SecurityPolicy contextExtensions for per-IS group enforcement (required by the ai-gateway-discovery controller)
  - Add api_key_id and aim_service_id to access log fields so every AI gateway request is attributed to the originating API key and AIM service in structured logs

* EAI-5821: Bump cluster-auth to 0.6.0-rc3

* fix(envoy-gateway): wire EPP into shared listener

  Set listener.includeAll=true so the AI extension injects the EPP ext_proc filter into the shared https :443 listener. InferencePool routes were returning 503 (no healthy upstream) because nothing set x-gateway-destination-endpoint on that Gateway-owned listener. Add failOpen=true so the extension erroring on the tls-passthrough L4 listener (an HTTP filter can't splice into a TCP chain) no longer fails that proxy's xDS translation and leaves it stuck in init. mergeGateways is off, so each gateway is a separate translation pass: the https proxy gets the filter, the passthrough proxy keeps its original xDS.

* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4

* Revert "EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4"

  This reverts commit 28a58c4.

* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to 1.0.8 and cluster-auth to 0.6.0-rc4

* EAI-5821: Bump cluster-auth to 0.6.0-rc5

* EAI-5821: Bump cluster-auth image from 0.6.0-rc5 to 0.6.0-rc6

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* EAI-5821: bump cluster-auth to 0.6.0-rc7

* EAI-5821: bump cluster-auth to 0.6.0-rc8

* fix: reorder ext_proc before ext_authz in EnvoyProxy

  Without filterOrder, ext_authz (cluster-auth) runs before ext_proc (AI Gateway) sets x-ai-eg-model from the request body. This causes cluster-auth to fall back to a catch-all route with no annotations, firing defaultAction: allow and bypassing per-model authorization. Fixes EAI-6805.

* fix: correct filterOrder syntax for EnvoyProxy CRD v1.7.1 CRD uses name+before/after, not relativeTo+position.

* EAI-6805: set x-ai-eg-model via Lua before ext_authz

  The AI Gateway's ext_proc is inserted via xdsTranslator post-hooks, which run after filterOrder is applied, so filterOrder cannot reorder it relative to ext_authz. Without x-ai-eg-model set, cluster-auth falls through to a catch-all route with no annotations and allows any model. Add a Lua EnvoyExtensionPolicy that buffers the request body, extracts the "model" JSON field, and sets x-ai-eg-model before ext_authz runs. Order the Lua filter before ext_authz via EnvoyProxy.spec.filterOrder.

* EAI-6805: split AI traffic onto a dedicated ai-gateway

* fix: scrape ai-gateway-extproc on all gateway pods

  The ai-gateway-extproc scrape job was filtering to only the 'https' gateway pod. After splitting inference traffic to a separate 'ai-gateway', the extproc sidecar on that pod was never scraped, causing zero metrics for workbench deployments. Remove the gateway name filter: namespace + app name is sufficient to select all Envoy proxy pods.

* EAI-5821: set pod label on ai-gateway-extproc scrape for API metrics queries

* EAI-5821: fix ai-gateway-extproc scrape - consolidate to metrics-k8s, cover all gateway pods

---------

Co-authored-by: Mika Ranta <Mika.Ranta@amd.com>
Co-authored-by: John Lybeck <john.lybeck@amd.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tomas Saaristola <tsaarist@amd.com>


…is present

When x-ai-eg-backend is set (from body or as a direct header), x-ai-eg-model is intentionally skipped. This prevents model-name fallback rules in other HTTPRoutes from matching: Envoy Gateway maintains per-HTTPRoute ordering so an older route's model-name rule fires before a newer route's UUID rule when both have equal specificity. Without x-ai-eg-model, only the UUID rule and catch-all rules remain; the UUID rule (2 conditions) beats catch-all (0 conditions) by specificity regardless of creation order.


pre and others added 30 commits April 28, 2026 13:12

fix(byo): Fix conditional of PLUGGABLE_GW

cbee48f

doc(BYO): Fix links in README

03e0e42

doc(BYO): Write section for Known Workarounds

ec6412a

It is already linked from the content.

Merge pull request #694 from silogen/EAI-5784_byok_documentation

3fe0baf

EAI-5784 Document how to install manually with Helm

fix(byo): Change Minio listen port to 9999 on host

5551ab6

Distinguish from default 9000 and avoid locally otherwise reserved port 9000.

chore(byo): Skip cnpg-operator install when PLUGGABLE_DB=true

63821e1

chore(byo): Remove deprecated db.sh

1d45447

chore(byo): Skip installing cnpg-operator when PLUGGABLE_DB=true

7d7bc80

fix(keycloak-old): Do not hide and ignore errors

74fecb0

Merge pull request #697 from silogen/EAI-6050_refactor_helm_1

f77662c

EAI-6050 Replace workarounds for BYO DB and ExternalSecrets

chore(byok): Read default value for DEFAULT_STORAGE_CLASS_NAME

cb234d0

fix(byok): Use "main" branch for cluster-forge

020f7d0

fix(byok): Always initialize the repo from scratch with FORCE_UPDATE

4601747

Switching branches while developing was brittle.

chore(byok): Read default value for CLUSTER_FORGE_BRANCH

3c7ec21

Merge pull request #698 from silogen/EAI-6050_fix-default-values

722366c

EAI-6050 Fix default values in Helm reference install script

Merge branch 'main' into EAI-5893_feat_multiple_helm_value_files

7554115

blankdots and others added 30 commits June 16, 2026 10:39

Merge pull request #744 from silogen/EAI-6030-gpu-stack-family

12055a1

EAI-6030 Select GPU Operator and DeviceConfig by GPU family

Merge pull request #743 from silogen/EAI-6864-fix-kyverno-admission-c…

4a2b613

…ontroller-replicas fix(kyverno): set explicit replicas=1 for all controllers (EAI-6864)

Merge pull request #739 from silogen/fix-node-exporter-port-template

78179fd

Fix node-exporter port conflict by templatizing hardcoded port 9100

Upgrade seaweedfs operator to 0.1.29

3cf7211

Add missing seaweed apps in enabled apps for values large (#751)

06af72e

Disable seaweed ingress, as we create HTTPRoutes

a8577b6

Merge pull request #752 from silogen/fix-disable-seaweed-ingress

9091b72

Disable seaweed ingress, as we create HTTPRoutes

Merge pull request #753 from silogen/fix-upgrade-seaweed-0-1-29

9e4e61c

Upgrade seaweedfs operator to 0.1.29

Revert "Upgrade seaweedfs operator to 0.1.29"

02cc550

This reverts commit 3cf7211.

Revert "Disable seaweed ingress, as we create HTTPRoutes"

b83ae70

This reverts commit a8577b6.

Merge pull request #755 from silogen/revert-seaweed-upgrade

330e472

Revert seaweed upgrade

Update backup_and_restore.md

e7ea2a0

Fix formatting of a list and fix broken link.

feat: add ai-gateway-discovery

8383688

fix: correct routeHostname for ai-gateway-discovery in root/values.yaml

d8386f1

fix: correct routeHostname for ai-gateway-discovery in root/values.yaml

5e72418

fix: aiGateway: enabled for envoy-gateway-config in root/values.yaml,…

f97ad89

… and add ai-gateway-discovery to enabledApps in all cluster sizes

fix: aiGateway: enabled for envoy-gateway-config in root/values.yaml,…

03993b9

… and add ai-gateway-discovery to enabledApps in all cluster sizes

fix: remove unnecessary comment

8ea9a5e

Merge pull request #758 from silogen/docs-formatting

1a33279

Update backup_and_restore.md

fix: update SBOM with ai-gateway-discovery

44ec388

Merge pull request #760 from silogen/ai-gw-int-test

a9ba75c

add ai-gateway-discovery app, and add envoy/ai-gateway overrides to root/values.yaml

Add documentation for how to migrate from minio to seaweedfs (#763)

d1ae644

EAI-7043: preserve model attribution in access logs via x-ai-eg-model…

a31bc97

…-log header

EAI-7043: switch to 3-condition UUID rule, Lua sets both backend and …

c6f0f73

…model headers

bump cluster-auth to 0.6.0-rc9

810bda7

bump cluster-auth to 0.6.0-rc10

80bea9c

Merge pull request #764 from silogen/EAI-7043-fix-ai-gateway-backend-…

73b3ca4

…route-specificity EAI-7043: fix UUID-based AIM routing via Lua filter

Merge pull request #749 from silogen/EAI-1500_ci_gh-release-target-sha

27945ff

update release pipeline to respect the target

brownzebra wants to merge 2318 commits into brownzebra:clean-working-directory from silogen:main


14 participants
