sync to refactored break #2
Open
brownzebra wants to merge 2318 commits into
It is already linked from the content.
EAI-5784 Document how to install manually with Helm
…crets.enabled
Add externalSecrets.enabled (default true) values key to keycloak-old and
minio-tenant-config charts. Wrap each ExternalSecret/Secret/ClusterSecretStore
template with {{- if .Values.externalSecrets.enabled }} so that consumers can
opt out and provide credentials as plain Kubernetes Secrets.
Default value true keeps render output bit-equal to existing behavior.
install_base.sh:
- Pass --set externalSecrets.enabled=false to keycloak-old and
minio-tenant-config helm template calls (manual install does not use the
external-secrets operator)
- Drop the rm -f workarounds that previously stripped es-* and
*-clustersecretstore.yaml templates after copying them to a tmpdir
- Drop the TEMP_MINIO_CONFIG_DIR mktemp dance (no patches remain for
minio-tenant-config so the chart can be templated directly)
Affected charts:
- sources/keycloak-old: 4 ExternalSecret templates (es-airm-realm-credentials,
es-keycloak-credentials, es-keycloak-cnpg-user-credentials,
es-keycloak-cnpg-superuser-credentials)
- sources/minio-tenant-config: 1 Secret + 2 ExternalSecret + 1 ClusterSecretStore
templates (minio-secret-example, minio-es-env-config, minio-es-default-user,
minio-tls-clustersecretstore)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…d + postgresql.*)
Decouple keycloak-old from the in-cluster CNPG Cluster resource so that the
chart can also point at an external PostgreSQL.
values.yaml additions:
- cnpg.enabled (default true): whether to render the in-cluster CNPG Cluster
- cnpg.instances and cnpg.storage.storageClass: CNPG Cluster sizing/storage
- postgresql.{host,port,database,username,userSecretName}: connection used
by keycloak-deployment in both modes; defaults match the in-cluster CNPG
service so existing renders are bit-equal
templates/keycloak-cnpg.yaml:
- Wrap the entire Cluster resource with {{- if .Values.cnpg.enabled }}
- Parametrize bootstrap.initdb.{database,owner,postInitSQL,secret.name} from
postgresql.* so the Cluster matches what keycloak-deployment connects to
- Parametrize storage.storageClass and walStorage.storageClass from
cnpg.storage.storageClass (replaces the install_base.sh sed patch)
- Parametrize instances from cnpg.instances (the previous --set
cnpg.instances install_base.sh argument was a dead value)
templates/keycloak-deployment.yaml:
- KC_DB_URL_HOST/PORT/DATABASE read from .Values.postgresql.*
- KC_DB_USERNAME/PASSWORD secret name reads from postgresql.userSecretName
templates/es-keycloak-cnpg-{user,superuser}-credentials.yaml:
- Extend the externalSecrets.enabled gate with .Values.cnpg.enabled (CNPG
user/superuser secrets are not relevant when the Cluster is not rendered)
- Parametrize ExternalSecret target.name and metadata.name from
postgresql.userSecretName so user-supplied secret names propagate
install_base.sh:
- Drop the sed patch that rewrote storageClass: default in keycloak-cnpg.yaml
- Pass --set cnpg.storage.storageClass=${DEFAULT_STORAGE_CLASS_NAME}
- Drop the dead --set storageClassName argument (no chart template reads it)
Pluggable mode usage:
kubectl create secret generic keycloak-db-user -n keycloak \
--from-literal=username=keycloak --from-literal=password=...
helm template keycloak sources/keycloak-old \
--set cnpg.enabled=false --set externalSecrets.enabled=false \
--set postgresql.host=mydb.example.com \
--set postgresql.userSecretName=keycloak-db-user \
--namespace keycloak | kubectl apply --server-side -f -
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…ed patches
Move three remaining install_base.sh sed patches into the chart so the manual
install can helm template the upstream chart directly without copying it to a
tmpdir first.
templates/keycloak-realm-templates-cm.yaml:
- Fix the placeholder on line 109 (admin-client-id-value ->
__AIRM_ADMIN_CLIENT_ID__). This was an outright bug: every other reference to
the AIRM admin client id in the realm template uses the runtime placeholder
__AIRM_ADMIN_CLIENT_ID__ that the init container sed-replaces with
$AIRM_ADMIN_CLIENT_ID at startup. Line 109 had the example string
admin-client-id-value, so the realm-import policy entry never resolved to the
real admin client id.
templates/keycloak-deployment.yaml:
- KC_HOSTNAME reads from .Values.hostname with a printf default of
https://kc.<domain>. install_base.sh was already passing --set
hostname=${KC_HOSTNAME} but no template consumed it, so it relied on a sed
patch to overwrite the hardcoded value.
- Add -y and "*/dev/*" / "*/sys/*" / "*/proc/*" excludes to the zip command in
init-auth-extensions so it doesn't choke on symlinks or system paths when
packaging the SilogenExtensionPackage jar.
values.yaml:
- Add hostname: "" placeholder so the override is documented.
Chart.yaml:
- keycloak-old 0.1.0 -> 0.2.0, minio-tenant-config 0.1.0 -> 0.2.0 (minor bumps
for the new values keys added in this PR series; defaults keep render output
bit-equal to 0.1.0).
install_base.sh:
- Drop the four sed patches and the TEMP_KC_DIR mktemp dance that copied
keycloak-old to a tmpdir just to rewrite four lines. Helm templates the chart
directly from ${SOURCES_DIR}/keycloak-old now.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Distinguish from the default 9000 and avoid port 9000, which is otherwise reserved locally.
…b.sh
Apply the chart-side parametrization landed earlier in this PR series so the
manual install can connect AIWB and Keycloak to a user-supplied PostgreSQL
without any post-install patching.
install_base.sh:
- Add the same env vars db.sh used to read (POSTGRES_HOST/PORT, AIWB_DB_*,
KEYCLOAK_DB_*) plus AIWB_DB_SECRET_NAME and KEYCLOAK_DB_SECRET_NAME (default
aiwb-db-user / keycloak-db-user, distinct from the cnpg-managed defaults).
- When PLUGGABLE_DB=true, create the credentials secrets in aiwb and keycloak
namespaces (kubectl create secret) so the deployments find them at startup.
- Pass --set postgresql.{host,port,database,username,userSecretName} to both
the keycloak-old and aiwb helm template invocations. The chart templates
(which were already parametrized in earlier commits in this PR series)
consume them for KC_DB_URL_*, the wait-for-db init container, the
liquibase-migrate init container, the aiwb-api environment, and all secret
references — replacing every kubectl set env / kubectl patch db.sh used to
perform after the fact.
- Update the comment above the CNPG operator install and the success message
at the end of the script to reflect that no follow-up db.sh run is needed.
db.sh:
- Replace the patching script with a deprecation notice. All cluster mutations
(kubectl delete cluster, kubectl patch secret, kubectl set env, kubectl
patch deployment, kubectl rollout restart) are gone — they are now redundant
because install_base.sh applies the same configuration at helm template
time. The file can be deleted once external consumers have migrated.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
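The env-var-to-`--set` mapping described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual install_base.sh code: `KEYCLOAK_DB_NAME`, `KEYCLOAK_DB_USER`, the host and port defaults, and the `HELM` override are assumptions made for the example. Only `POSTGRES_HOST`/`POSTGRES_PORT`, the `KEYCLOAK_DB_*` prefix, `KEYCLOAK_DB_SECRET_NAME` (default `keycloak-db-user`), and the `postgresql.*` keys come from the commit messages.

```shell
#!/bin/sh
set -eu

# Variables marked "assumed" are illustrative; the commit message names only
# POSTGRES_HOST/PORT, the KEYCLOAK_DB_* prefix, and KEYCLOAK_DB_SECRET_NAME.
POSTGRES_HOST="${POSTGRES_HOST:-mydb.example.com}"   # assumed default
POSTGRES_PORT="${POSTGRES_PORT:-5432}"               # assumed default
KEYCLOAK_DB_NAME="${KEYCLOAK_DB_NAME:-keycloak}"     # assumed variable name
KEYCLOAK_DB_USER="${KEYCLOAK_DB_USER:-keycloak}"     # assumed variable name
KEYCLOAK_DB_SECRET_NAME="${KEYCLOAK_DB_SECRET_NAME:-keycloak-db-user}"

# HELM is a hypothetical override so the command line can be previewed
# (HELM=echo) without a helm binary or a cluster.
render_keycloak() {
  "${HELM:-helm}" template keycloak sources/keycloak-old \
    --namespace keycloak \
    --set cnpg.enabled=false \
    --set externalSecrets.enabled=false \
    --set "postgresql.host=${POSTGRES_HOST}" \
    --set "postgresql.port=${POSTGRES_PORT}" \
    --set "postgresql.database=${KEYCLOAK_DB_NAME}" \
    --set "postgresql.username=${KEYCLOAK_DB_USER}" \
    --set "postgresql.userSecretName=${KEYCLOAK_DB_SECRET_NAME}"
}
```

Calling `render_keycloak` with `HELM=echo` prints the assembled command line, which makes it easy to check that a pre-exported value such as `KEYCLOAK_DB_SECRET_NAME` reaches the chart.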
EAI-6050 Replace workarounds for BYO DB and ExternalSecrets
Switching branches while developing was brittle.
EAI-6050 Fix default values in Helm reference install script
Extract CNPG-specific Secrets into secrets-aiwb-cnpg.yaml and MinIO-related
Secrets into secrets-aiwb-minio.yaml so install_base.sh can apply each only when
the corresponding in-cluster component is being installed. Update secrets.md
with the new sources and per-secret pluggable mode notes.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…tall_base.sh
In PLUGGABLE_S3=true mode, install_base.sh now:
- skips secrets-aiwb-minio.yaml (placeholder MinIO secrets)
- creates minio-credentials in aiwb and workbench from MINIO_ACCESS_KEY /
MINIO_SECRET_KEY env vars
- creates the in-cluster minio.minio-tenant-default redirect Service whose
Endpoints point to MINIO_HOST_IP:MINIO_PORT (used by aim-performance and any
other consumer that hardcodes the in-cluster MinIO URL)
Mirrors the existing PLUGGABLE_DB=true env-based secret creation pattern.
Conditional applies for the new secrets-aiwb-cnpg.yaml / secrets-aiwb-minio.yaml
files added in the same place.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
s3.sh previously combined a destructive in-cluster MinIO Tenant migration with
live patches to Secrets, Deployments, and a redirect Service. The redirect
Service and credential creation now live in install_base.sh's PLUGGABLE_S3
branch, the AIWB --set minio.url already drove the env update, and the Tenant
deletion / namespace label fix were one-shot migration steps. What remains is a
non-destructive verifier that prints AIWB's MINIO_URL / MINIO_BUCKET, the access
keys in the aiwb / workbench minio-credentials Secrets, the redirect Endpoints,
and runs a curl pod against the in-cluster URL's /minio/health/live to confirm
the redirect resolves to a healthy external MinIO.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The previous guides documented a manual post-install patching workflow (delete
CNPG / MinIO Tenant, kubectl patch Secrets, kubectl set env on Deployments) that
no longer matches reality: install_base.sh now drives both PLUGGABLE_DB and
PLUGGABLE_S3 end to end. Both pages now describe:
- prerequisites (external Postgres / S3, required databases or buckets)
- the env vars consumed by install_base.sh
- pluggable.sh as the single entry point
- what install_base.sh does in pluggable mode
- verification
- limitations (no migration path, single-host assumption, etc.)
Cross-references to EXTERNAL_FIXES.md added for the aim-performance
BUCKET_STORAGE_HOST workaround and the AIWB pg_isready -U postgres noise.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
`export PLUGGABLE_DB=false` unconditionally overwrote whatever the caller
had exported, so `PLUGGABLE_DB=true ./pluggable.sh <DOMAIN>` silently fell
back to the in-cluster Postgres / MinIO path. Switch to the
${VAR:-default} pattern so pre-set values pass through.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
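The difference between the two forms in the commit above can be shown with a small sketch. The function name `apply_pluggable_defaults` is hypothetical, and applying the same treatment to `PLUGGABLE_S3` is an assumption based on the message mentioning the MinIO path.

```shell
#!/bin/sh
set -eu

# Before (buggy): an unconditional export discards whatever the caller set.
#   export PLUGGABLE_DB=false
#
# After: keep a pre-set value; fall back to "false" only when the variable
# is unset or empty. The function wrapper is hypothetical, for illustration.
apply_pluggable_defaults() {
  export PLUGGABLE_DB="${PLUGGABLE_DB:-false}"
  export PLUGGABLE_S3="${PLUGGABLE_S3:-false}"   # assumed to follow the same pattern
}

apply_pluggable_defaults
echo "PLUGGABLE_DB=${PLUGGABLE_DB} PLUGGABLE_S3=${PLUGGABLE_S3}"
```

Because `${VAR:-default}` is also safe under `set -u`, the same expansion avoids unbound-variable errors for callers that export nothing.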
When upgrading from a PLUGGABLE_S3=false install (where the MinIO Operator
created a selector-backed Service named "minio") to PLUGGABLE_S3=true, the
selectorless redirect Service apply produced server-side-apply conflicts. Delete
Service and Endpoints first with --ignore-not-found so fresh installs remain a
no-op and re-installs over an existing in-cluster MinIO succeed.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
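A minimal sketch of the delete-before-apply step described above. The Service name `minio` and namespace `minio-tenant-default` are assumptions inferred from the in-cluster host `minio.minio-tenant-default` mentioned elsewhere in this PR, and the function name is hypothetical; the actual install_base.sh code may differ.

```shell
#!/bin/sh
set -eu

# Remove any existing Service/Endpoints pair before the selectorless redirect
# Service is applied. --ignore-not-found makes this a no-op on fresh installs.
# Assumed names: Service "minio" in namespace "minio-tenant-default".
reset_minio_redirect() {
  kubectl delete service,endpoints minio \
    --namespace minio-tenant-default \
    --ignore-not-found
}
```

The function would be called immediately before the redirect Service manifest is applied, so an Operator-owned Service left over from a PLUGGABLE_S3=false install cannot conflict with it.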
The printed instructions referenced ./pluggable/scripts/s3.sh (a path that does
not exist) and treated s3.sh as the install entry point. After the recent
refactor, install_base.sh handles all PLUGGABLE_S3=true setup and s3.sh is only
a post-install verifier. Point users at scripts/pluggable.sh with the right env
vars and mention s3.sh as an optional verification step.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The PLUGGABLE_S3=true secret-creation block referenced MINIO_ACCESS_KEY /
MINIO_SECRET_KEY without defaults, tripping `set -u` when callers hadn't
exported them. Default to `examplepass` to match s3_minio_container.sh's own
defaults so the stock dev flow works out of the box.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Re-running install_base.sh failed because kubectl-patch took ownership of
.rules, conflicting with the next server-side helm apply, and the JSON patch
added a duplicate tokenreviews rule. Force-conflict the helm apply so kubectl
reclaims ownership, and skip the patch when the rule is already present.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
scripts/bootstrap.sh:
- render_actual_helm_manifests() function that renders actual Helm manifests
instead of ArgoCD Applications in template mode
- Proper handling of valuesFiles using JSON to avoid bash array syntax issues
- Fixed symlink resolution with readlink -f
root/values.yaml:
- Kaiwo version from v0.2.0-rc11 to v0.2.0-rc12 to support GPU preemption
features
EAI-6030 Select GPU Operator and DeviceConfig by GPU family
…ontroller-replicas
fix(kyverno): set explicit replicas=1 for all controllers (EAI-6864)
Fix node-exporter port conflict by templatizing hardcoded port 9100
Disable seaweed ingress, as we create HTTPRoutes
Upgrade seaweedfs operator to 0.1.29
This reverts commit 3cf7211.
This reverts commit a8577b6.
Revert seaweed upgrade
Fix formatting of a list and fix broken link.
…orge (#734)
* EAI-1500 update documentation
* EAI-5821 add ai-gateway config
* fix: render argocd application template for oci and https
* EAI-5821 fix sbom components
* feat(envoy-gateway-config): add tls-passthrough gateway for k8s API
Replace the :6443 k8s-passthrough listener on the shared `https` gateway with a
dedicated `tls-passthrough` gateway on :443 that owns the external MetalLB
LoadBalancer and does SNI-based TLS passthrough: k8s.<domain> -> kube API
service, *.<domain> -> apps gateway. The apps gateway moves to ClusterIP behind
it.
The listener and TLSRoutes carry explicit hostnames: Envoy Gateway TLS
passthrough builds SNI filter chains from hostnames, so an empty hostname yields
an empty Envoy config that never routes. Listening on :443 instead of :6443
avoids hijacking pod->apiserver traffic where the node IP equals the MetalLB
pool IP.
Refs: EAI-5821
* fix(envoy-gateway-config): invert gateway front door while debugging
* fix(envoy-gateway): scope AI extension to its own listeners
Set extensionManager listener.includeAll=false so the AI Gateway xDS translation
hook only receives listeners generated for its own resources
(AIGatewayRoute/AIServiceBackend/InferencePool). With includeAll=true the hook
also received the L4 tls-passthrough listener and tried to insert its
request-header-metadata HTTP filter into a TCP filter chain that has no
HTTPConnectionManager. That failed xDS translation for the entire GatewayClass,
so the passthrough data plane got an empty snapshot and never left
initialization.
* fix(envoy-gateway-config): restore passthrough gateway as front door
Revert the debug inversion: the tls-passthrough gateway owns the external
MetalLB LoadBalancer on :443 (SNI passthrough) and the apps gateway drops back
to ClusterIP behind it. The inversion was a workaround for the passthrough data
plane not starting, which is now fixed.
* EAI-5821: Wire API key auth and metrics for AI gateway
- Bump cluster-auth to 0.6.0-rc2, which injects x-api-key-id and
x-auth-username on every authenticated request and supports SecurityPolicy
contextExtensions for per-IS group enforcement (required by the
ai-gateway-discovery controller)
- Add api_key_id and aim_service_id to access log fields so every AI gateway
request is attributed to the originating API key and AIM service in structured
logs
* EAI-5821: Bump cluster-auth to 0.6.0-rc3
* fix(envoy-gateway): wire EPP into shared listener
Set listener.includeAll=true so the AI extension injects the EPP ext_proc filter
into the shared https :443 listener. InferencePool routes were returning 503 (no
healthy upstream) because nothing set x-gateway-destination-endpoint on that
Gateway-owned listener.
Add failOpen=true so the extension erroring on the tls-passthrough L4 listener
(an HTTP filter can't splice into a TCP chain) no longer fails that proxy's xDS
translation and leaves it stuck in init. mergeGateways is off, so each gateway
is a separate translation pass: the https proxy gets the filter, the passthrough
proxy keeps its original xDS.
* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4
* Revert "EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4"
This reverts commit 28a58c4.
* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to 1.0.8 and cluster-auth to 0.6.0-rc4
* EAI-5821: Bump cluster-auth to 0.6.0-rc5
* EAI-5821: Bump cluster-auth image from 0.6.0-rc5 to 0.6.0-rc6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* EAI-5821: bump cluster-auth to 0.6.0-rc7
* EAI-5821: bump cluster-auth to 0.6.0-rc8
* fix: reorder ext_proc before ext_authz in EnvoyProxy
Without filterOrder, ext_authz (cluster-auth) runs before ext_proc (AI Gateway)
sets x-ai-eg-model from the request body. This causes cluster-auth to fall back
to a catch-all route with no annotations, firing defaultAction: allow and
bypassing per-model authorization.
Fixes EAI-6805.
* fix: correct filterOrder syntax for EnvoyProxy CRD
v1.7.1 CRD uses name+before/after, not relativeTo+position.
* EAI-6805: set x-ai-eg-model via Lua before ext_authz
The AI Gateway's ext_proc is inserted via xdsTranslator post-hooks, which run
after filterOrder is applied, so filterOrder cannot reorder it relative to
ext_authz. Without x-ai-eg-model set, cluster-auth falls through to a catch-all
route with no annotations and allows any model.
Add a Lua EnvoyExtensionPolicy that buffers the request body, extracts the
"model" JSON field, and sets x-ai-eg-model before ext_authz runs. Order the Lua
filter before ext_authz via EnvoyProxy.spec.filterOrder.
* EAI-6805: split AI traffic onto a dedicated ai-gateway
* fix: scrape ai-gateway-extproc on all gateway pods
The ai-gateway-extproc scrape job was filtering to only the 'https' gateway pod.
After splitting inference traffic to a separate 'ai-gateway', the extproc
sidecar on that pod was never scraped, causing zero metrics for workbench
deployments. Remove the gateway name filter; namespace + app name is sufficient
to select all Envoy proxy pods.
* EAI-5821: set pod label on ai-gateway-extproc scrape for API metrics queries
* EAI-5821: fix ai-gateway-extproc scrape - consolidate to metrics-k8s, cover all gateway pods
---------
Co-authored-by: Mika Ranta <Mika.Ranta@amd.com>
Co-authored-by: John Lybeck <john.lybeck@amd.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tomas Saaristola <tsaarist@amd.com>
… and add ai-gateway-discovery to enabledApps in all cluster sizes
… and add ai-gateway-discovery to enabledApps in all cluster sizes
Update backup_and_restore.md
add ai-gateway-discovery app, and add envoy/ai-gateway overrides to root/values.yaml
…is present
When x-ai-eg-backend is set (from body or as a direct header), x-ai-eg-model is
intentionally skipped. This prevents model-name fallback rules in other
HTTPRoutes from matching: Envoy Gateway maintains per-HTTPRoute ordering, so an
older route's model-name rule fires before a newer route's UUID rule when both
have equal specificity. Without x-ai-eg-model, only the UUID rule and catch-all
rules remain; the UUID rule (2 conditions) beats catch-all (0 conditions) by
specificity regardless of creation order.
…route-specificity
EAI-7043: fix UUID-based AIM routing via Lua filter
update release pipeline to respect the target
No description provided.