
sync to refactored break #2

Open
brownzebra wants to merge 2318 commits into brownzebra:clean-working-directory from silogen:main

Conversation

@brownzebra

Owner

No description provided.

pre and others added 30 commits April 28, 2026 13:12
It is already linked from the content.
EAI-5784 Document how to install manually with Helm
…crets.enabled

Add externalSecrets.enabled (default true) values key to keycloak-old and
minio-tenant-config charts. Wrap each ExternalSecret/Secret/ClusterSecretStore
template with {{- if .Values.externalSecrets.enabled }} so that consumers can
opt out and provide credentials as plain Kubernetes Secrets.

Default value true keeps render output bit-equal to existing behavior.

install_base.sh:
- Pass --set externalSecrets.enabled=false to keycloak-old and
  minio-tenant-config helm template calls (manual install does not use the
  external-secrets operator)
- Drop the rm -f workarounds that previously stripped es-* and
  *-clustersecretstore.yaml templates after copying them to a tmpdir
- Drop the TEMP_MINIO_CONFIG_DIR mktemp dance (no patches remain for
  minio-tenant-config so the chart can be templated directly)

Affected charts:
- sources/keycloak-old: 4 ExternalSecret templates (es-airm-realm-credentials,
  es-keycloak-credentials, es-keycloak-cnpg-user-credentials,
  es-keycloak-cnpg-superuser-credentials)
- sources/minio-tenant-config: 1 Secret + 2 ExternalSecret + 1 ClusterSecretStore
  templates (minio-secret-example, minio-es-env-config, minio-es-default-user,
  minio-tls-clustersecretstore)

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…d + postgresql.*)

Decouple keycloak-old from the in-cluster CNPG Cluster resource so that the
chart can also point at an external PostgreSQL.

values.yaml additions:
- cnpg.enabled (default true): whether to render the in-cluster CNPG Cluster
- cnpg.instances and cnpg.storage.storageClass: CNPG Cluster sizing/storage
- postgresql.{host,port,database,username,userSecretName}: connection used
  by keycloak-deployment in both modes; defaults match the in-cluster CNPG
  service so existing renders are bit-equal

templates/keycloak-cnpg.yaml:
- Wrap the entire Cluster resource with {{- if .Values.cnpg.enabled }}
- Parametrize bootstrap.initdb.{database,owner,postInitSQL,secret.name} from
  postgresql.* so the Cluster matches what keycloak-deployment connects to
- Parametrize storage.storageClass and walStorage.storageClass from
  cnpg.storage.storageClass (replaces the install_base.sh sed patch)
- Parametrize instances from cnpg.instances (the previous --set
  cnpg.instances install_base.sh argument was a dead value)

templates/keycloak-deployment.yaml:
- KC_DB_URL_HOST/PORT/DATABASE read from .Values.postgresql.*
- KC_DB_USERNAME/PASSWORD secret name reads from postgresql.userSecretName

templates/es-keycloak-cnpg-{user,superuser}-credentials.yaml:
- Extend the externalSecrets.enabled gate with .Values.cnpg.enabled (CNPG
  user/superuser secrets are not relevant when the Cluster is not rendered)
- Parametrize ExternalSecret target.name and metadata.name from
  postgresql.userSecretName so user-supplied secret names propagate

install_base.sh:
- Drop the sed patch that rewrote storageClass: default in keycloak-cnpg.yaml
- Pass --set cnpg.storage.storageClass=${DEFAULT_STORAGE_CLASS_NAME}
- Drop the dead --set storageClassName argument (no chart template reads it)

Pluggable mode usage:
  kubectl create secret generic keycloak-db-user -n keycloak \
    --from-literal=username=keycloak --from-literal=password=...
  helm template keycloak sources/keycloak-old \
    --set cnpg.enabled=false --set externalSecrets.enabled=false \
    --set postgresql.host=mydb.example.com \
    --set postgresql.userSecretName=keycloak-db-user \
    --namespace keycloak | kubectl apply --server-side -f -

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…ed patches

Move three remaining install_base.sh sed patches into the chart so the
manual install can helm template the upstream chart directly without
copying it to a tmpdir first.

templates/keycloak-realm-templates-cm.yaml:
- Fix the placeholder on line 109 (admin-client-id-value -> __AIRM_ADMIN_CLIENT_ID__).
  This was an outright bug: every other reference to the AIRM admin client id
  in the realm template uses the runtime placeholder __AIRM_ADMIN_CLIENT_ID__
  that the init container sed-replaces with $AIRM_ADMIN_CLIENT_ID at startup.
  Line 109 had the example string admin-client-id-value, so the realm-import
  policy entry never resolved to the real admin client id.

templates/keycloak-deployment.yaml:
- KC_HOSTNAME reads from .Values.hostname with a printf default of
  https://kc.<domain>. install_base.sh was already passing
  --set hostname=${KC_HOSTNAME} but no template consumed it, so it relied on
  a sed patch to overwrite the hardcoded value.
- Add -y and "*/dev/*" / "*/sys/*" / "*/proc/*" excludes to the zip command
  in init-auth-extensions so it doesn't choke on symlinks or system paths
  when packaging the SilogenExtensionPackage jar.

values.yaml:
- Add hostname: "" placeholder so the override is documented.

Chart.yaml: keycloak-old 0.1.0 -> 0.2.0, minio-tenant-config 0.1.0 -> 0.2.0
(minor bumps for the new values keys added in this PR series; defaults keep
render output bit-equal to 0.1.0).

install_base.sh:
- Drop the four sed patches and the TEMP_KC_DIR mktemp dance that copied
  keycloak-old to a tmpdir just to rewrite four lines. Helm templates the
  chart directly from ${SOURCES_DIR}/keycloak-old now.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Distinguish from the default port 9000 and avoid port 9000, which is
otherwise reserved locally.
…b.sh

Apply the chart-side parametrization landed earlier in this PR series so the
manual install can connect AIWB and Keycloak to a user-supplied PostgreSQL
without any post-install patching.

install_base.sh:
- Add the same env vars db.sh used to read (POSTGRES_HOST/PORT, AIWB_DB_*,
  KEYCLOAK_DB_*) plus AIWB_DB_SECRET_NAME and KEYCLOAK_DB_SECRET_NAME (default
  aiwb-db-user / keycloak-db-user, distinct from the cnpg-managed defaults).
- When PLUGGABLE_DB=true, create the credentials secrets in aiwb and keycloak
  namespaces (kubectl create secret) so the deployments find them at startup.
- Pass --set postgresql.{host,port,database,username,userSecretName} to both
  the keycloak-old and aiwb helm template invocations. The chart templates
  (which were already parametrized in earlier commits in this PR series)
  consume them for KC_DB_URL_*, the wait-for-db init container, the
  liquibase-migrate init container, the aiwb-api environment, and all secret
  references — replacing every kubectl set env / kubectl patch db.sh used to
  perform after the fact.
- Update the comment above the CNPG operator install and the success message
  at the end of the script to reflect that no follow-up db.sh run is needed.
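
The install_base.sh behaviour described above can be sketched as a dry run. This is a minimal illustration, not the script itself: the helper name `keycloak_template_cmd`, the placeholder defaults, and printing the command instead of running it are assumptions. The `--set` keys and the shape of the `helm template` call come from the commit messages in this PR series. Only a subset of the postgresql.* keys is shown, and whether the real script passes them unconditionally is not stated here.

```shell
#!/bin/sh
set -eu

# Defaults below are placeholders for illustration, not the script's real defaults.
PLUGGABLE_DB="${PLUGGABLE_DB:-false}"
POSTGRES_HOST="${POSTGRES_HOST:-mydb.example.com}"
POSTGRES_PORT="${POSTGRES_PORT:-5432}"
KEYCLOAK_DB_SECRET_NAME="${KEYCLOAK_DB_SECRET_NAME:-keycloak-db-user}"

# Hypothetical helper: prints the helm template command instead of running it.
keycloak_template_cmd() {
  # Manual installs do not use the external-secrets operator.
  set -- helm template keycloak sources/keycloak-old \
    --namespace keycloak \
    --set externalSecrets.enabled=false
  if [ "$PLUGGABLE_DB" = "true" ]; then
    # Skip the in-cluster CNPG Cluster and point Keycloak at the external database.
    set -- "$@" \
      --set cnpg.enabled=false \
      --set "postgresql.host=${POSTGRES_HOST}" \
      --set "postgresql.port=${POSTGRES_PORT}" \
      --set "postgresql.userSecretName=${KEYCLOAK_DB_SECRET_NAME}"
  fi
  echo "$*"
}

keycloak_template_cmd
```

Building the argument list with `set --` keeps each value as a separate word, so the same list could be executed with `"$@"` instead of printed once the output looks right.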

db.sh:
- Replace the patching script with a deprecation notice. All cluster mutations
  (kubectl delete cluster, kubectl patch secret, kubectl set env, kubectl
  patch deployment, kubectl rollout restart) are gone — they are now redundant
  because install_base.sh applies the same configuration at helm template
  time. The file can be deleted once external consumers have migrated.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
EAI-6050 Replace workarounds for BYO DB and ExternalSecrets
Switching branches while developing was brittle.
EAI-6050 Fix default values in Helm reference install script
Extract CNPG-specific Secrets into secrets-aiwb-cnpg.yaml and MinIO-related
Secrets into secrets-aiwb-minio.yaml so install_base.sh can apply each only
when the corresponding in-cluster component is being installed. Update
secrets.md with the new sources and per-secret pluggable mode notes.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…tall_base.sh

In PLUGGABLE_S3=true mode, install_base.sh now:
- skips secrets-aiwb-minio.yaml (placeholder MinIO secrets)
- creates minio-credentials in aiwb and workbench from MINIO_ACCESS_KEY /
  MINIO_SECRET_KEY env vars
- creates the in-cluster minio.minio-tenant-default redirect Service whose
  Endpoints point to MINIO_HOST_IP:MINIO_PORT (used by aim-performance and
  any other consumer that hardcodes the in-cluster MinIO URL)
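
A selectorless Service plus a hand-written Endpoints object is the usual way to build such a redirect. The sketch below only renders the manifests to stdout. The helper name `render_minio_redirect`, the Service port 80, the port name `http`, and the placeholder address are assumptions for illustration; the Service name and namespace are inferred from the `minio.minio-tenant-default` hostname mentioned above.

```shell
#!/bin/sh
set -eu

# Placeholder values for illustration; real installs export their own.
MINIO_HOST_IP="${MINIO_HOST_IP:-192.0.2.10}"
MINIO_PORT="${MINIO_PORT:-9000}"

# Hypothetical helper: prints a selectorless Service and matching Endpoints.
render_minio_redirect() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: minio-tenant-default
spec:
  # No selector, so Kubernetes does not manage the Endpoints for this Service.
  ports:
    - name: http
      port: 80
      targetPort: ${MINIO_PORT}
---
apiVersion: v1
kind: Endpoints
metadata:
  # Must match the Service name so traffic is forwarded to these addresses.
  name: minio
  namespace: minio-tenant-default
subsets:
  - addresses:
      - ip: ${MINIO_HOST_IP}
    ports:
      - name: http
        port: ${MINIO_PORT}
EOF
}

render_minio_redirect
```

The output could be piped to `kubectl apply --server-side -f -`, the same apply style used elsewhere in this PR series.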

Mirrors the existing PLUGGABLE_DB=true env-based secret creation pattern.
Conditional applies for the new secrets-aiwb-cnpg.yaml / secrets-aiwb-minio.yaml
files added in the same place.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
s3.sh previously combined a destructive in-cluster MinIO Tenant migration
with live patches to Secrets, Deployments, and a redirect Service. The
redirect Service and credential creation now live in install_base.sh's
PLUGGABLE_S3 branch, the AIWB --set minio.url already drove the env update,
and the Tenant deletion / namespace label fix were one-shot migration steps.

What remains is a non-destructive verifier that prints AIWB's MINIO_URL /
MINIO_BUCKET, the access keys in the aiwb / workbench minio-credentials
Secrets, the redirect Endpoints, and runs a curl pod against the in-cluster
URL's /minio/health/live to confirm the redirect resolves to a healthy
external MinIO.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The previous guides documented a manual post-install patching workflow
(delete CNPG / MinIO Tenant, kubectl patch Secrets, kubectl set env on
Deployments) that no longer matches reality — install_base.sh now drives
both PLUGGABLE_DB and PLUGGABLE_S3 end to end.

Both pages now describe:
- prerequisites (external Postgres / S3, required databases or buckets)
- the env vars consumed by install_base.sh
- pluggable.sh as the single entry point
- what install_base.sh does in pluggable mode
- verification
- limitations (no migration path, single-host assumption, etc.)

Cross-references to EXTERNAL_FIXES.md added for the aim-performance
BUCKET_STORAGE_HOST workaround and the AIWB pg_isready -U postgres noise.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
`export PLUGGABLE_DB=false` unconditionally overwrote whatever the caller
had exported, so `PLUGGABLE_DB=true ./pluggable.sh <DOMAIN>` silently fell
back to the in-cluster Postgres / MinIO path. Switch to the
${VAR:-default} pattern so pre-set values pass through.
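
A minimal sketch of the difference, using two hypothetical helper functions rather than the actual contents of pluggable.sh:

```shell
#!/bin/sh

# Old behaviour: always overwrites whatever the caller exported.
set_defaults_clobbering() {
  export PLUGGABLE_DB=false
}

# New behaviour: keep a pre-set value, fall back to "false" when unset or empty.
set_defaults_preserving() {
  export PLUGGABLE_DB="${PLUGGABLE_DB:-false}"
}

PLUGGABLE_DB=true
set_defaults_clobbering
clobbered="$PLUGGABLE_DB"

PLUGGABLE_DB=true
set_defaults_preserving
preserved="$PLUGGABLE_DB"

unset PLUGGABLE_DB
set_defaults_preserving
defaulted="$PLUGGABLE_DB"

echo "clobbered=$clobbered preserved=$preserved defaulted=$defaulted"
# prints: clobbered=false preserved=true defaulted=false
```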

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
When upgrading from a PLUGGABLE_S3=false install (where the MinIO Operator
created a selector-backed Service named "minio") to PLUGGABLE_S3=true, the
selectorless redirect Service apply produced server-side-apply conflicts.
Delete Service and Endpoints first with --ignore-not-found so fresh installs
remain a no-op and re-installs over an existing in-cluster MinIO succeed.
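
The delete-before-apply step can be sketched as follows. The `run` wrapper, the `DRY_RUN` switch, and the function name `reset_minio_redirect` are assumptions added so the sketch can be tried without a cluster; the namespace is inferred from the `minio.minio-tenant-default` hostname used in this PR series. `--ignore-not-found` is what makes the deletes a no-op on fresh installs.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default here) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Remove any operator-created Service and its Endpoints before the
# selectorless redirect Service is applied.
reset_minio_redirect() {
  run kubectl delete service minio -n minio-tenant-default --ignore-not-found
  run kubectl delete endpoints minio -n minio-tenant-default --ignore-not-found
}

reset_minio_redirect
```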

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The printed instructions referenced ./pluggable/scripts/s3.sh — a path that
does not exist — and treated s3.sh as the install entry point. After the
recent refactor, install_base.sh handles all PLUGGABLE_S3=true setup and
s3.sh is only a post-install verifier. Point users at scripts/pluggable.sh
with the right env vars and mention s3.sh as an optional verification step.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The PLUGGABLE_S3=true secret-creation block referenced MINIO_ACCESS_KEY /
MINIO_SECRET_KEY without defaults, tripping `set -u` when callers hadn't
exported them. Default to `examplepass` to match s3_minio_container.sh's
own defaults so the stock dev flow works out of the box.
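
A short sketch of the failure and the fix. It assumes both variables take the same `examplepass` default, which the message above does not spell out, and it runs the failing case in a child shell so the abort can be observed.

```shell
#!/bin/sh

# Under `set -u`, expanding an unset variable aborts the shell.
if sh -c 'set -u; unset MINIO_ACCESS_KEY; : "$MINIO_ACCESS_KEY"' 2>/dev/null; then
  strict_without_default=ok
else
  strict_without_default=failed
fi

# With a default in the expansion, the same reference is safe under `set -u`.
set -u
unset MINIO_ACCESS_KEY MINIO_SECRET_KEY
MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:-examplepass}"
MINIO_SECRET_KEY="${MINIO_SECRET_KEY:-examplepass}"

echo "without default: $strict_without_default"
echo "with default: access key and secret key are set"
```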

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Re-running install_base.sh failed because kubectl-patch took ownership of
.rules, conflicting with the next server-side helm apply, and the JSON
patch added a duplicate tokenreviews rule. Force-conflict the helm apply
so kubectl reclaims ownership, and skip the patch when the rule is
already present.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
scripts/bootstrap.sh:
- Add a render_actual_helm_manifests() function that renders actual Helm
  manifests instead of ArgoCD Applications in template mode
- Handle valuesFiles as JSON to avoid bash array syntax issues
- Fix symlink resolution with readlink -f

root/values.yaml:
- Bump Kaiwo from v0.2.0-rc11 to v0.2.0-rc12 to support GPU preemption features
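
For the symlink fix, a minimal sketch of what `readlink -f` does. The directory layout and file name are made up for the demonstration and are not taken from bootstrap.sh.

```shell
#!/bin/sh
set -eu

# Throwaway layout: a real directory and a symlink that points at it.
workdir="$(mktemp -d)"
mkdir "$workdir/real"
: > "$workdir/real/values.yaml"
ln -s "$workdir/real" "$workdir/link"

# readlink -f follows every symlink in the path and returns the canonical location.
resolved="$(readlink -f "$workdir/link/values.yaml")"
expected="$(readlink -f "$workdir/real")/values.yaml"

echo "resolved: $resolved"
rm -rf "$workdir"
```

Canonicalising both sides guards against the temporary directory itself sitting behind a symlink, which is common on some systems.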
blankdots and others added 30 commits June 16, 2026 10:39
EAI-6030 Select GPU Operator and DeviceConfig by GPU family
…ontroller-replicas

fix(kyverno): set explicit replicas=1 for all controllers (EAI-6864)
Fix node-exporter port conflict by templatizing hardcoded port 9100
Disable seaweed ingress, as we create HTTPRoutes
Fix formatting of a list and fix broken link.
…orge (#734)

* EAI-1500 update documentation

* EAI-5821 add ai-gateway config

* fix: render argocd application template for oci and https

* EAI-5821 fix sbom components

* feat(envoy-gateway-config): add tls-passthrough gateway for k8s API

  Replace the :6443 k8s-passthrough listener on the shared `https` gateway
  with a dedicated `tls-passthrough` gateway on :443 that owns the external
  MetalLB LoadBalancer and does SNI-based TLS passthrough:
  k8s.<domain> -> kube API service, *.<domain> -> apps gateway. The apps
  gateway moves to ClusterIP behind it.

  The listener and TLSRoutes carry explicit hostnames: Envoy Gateway TLS
  passthrough builds SNI filter chains from hostnames, so an empty hostname
  yields an empty Envoy config that never routes. Listening on :443 instead
  of :6443 avoids hijacking pod->apiserver traffic where the node IP equals
  the MetalLB pool IP.

  Refs: EAI-5821

* fix(envoy-gateway-config): invert gateway front door while debugging

* fix(envoy-gateway): scope AI extension to its own listeners

  Set extensionManager listener.includeAll=false so the AI Gateway xDS
  translation hook only receives listeners generated for its own resources
  (AIGatewayRoute/AIServiceBackend/InferencePool).

  With includeAll=true the hook also received the L4 tls-passthrough
  listener and tried to insert its request-header-metadata HTTP filter into
  a TCP filter chain that has no HTTPConnectionManager. That failed xDS
  translation for the entire GatewayClass, so the passthrough data plane
  got an empty snapshot and never left initialization.

* fix(envoy-gateway-config): restore passthrough gateway as front door

  Revert the debug inversion: the tls-passthrough gateway owns the external
  MetalLB LoadBalancer on :443 (SNI passthrough) and the apps gateway drops
  back to ClusterIP behind it. The inversion was a workaround for the
  passthrough data plane not starting, which is now fixed.

* EAI-5821: Wire API key auth and metrics for AI gateway

- Bump cluster-auth to 0.6.0-rc2, which injects x-api-key-id and
  x-auth-username on every authenticated request and supports
  SecurityPolicy contextExtensions for per-IS group enforcement
  (required by the ai-gateway-discovery controller)
- Add api_key_id and aim_service_id to access log fields so every
  AI gateway request is attributed to the originating API key and
  AIM service in structured logs

* EAI-5821: Bump cluster-auth to 0.6.0-rc3

* fix(envoy-gateway): wire EPP into shared listener

  Set listener.includeAll=true so the AI extension injects the EPP
  ext_proc filter into the shared https :443 listener. InferencePool
  routes were returning 503 (no healthy upstream) because nothing set
  x-gateway-destination-endpoint on that Gateway-owned listener.

  Add failOpen=true so the extension erroring on the tls-passthrough
  L4 listener (an HTTP filter can't splice into a TCP chain) no longer
  fails that proxy's xDS translation and leaves it stuck in init.
  mergeGateways is off, so each gateway is a separate translation pass:
  the https proxy gets the filter, the passthrough proxy keeps its
  original xDS.

* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4

* Revert "EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to v1.0.8 and cluster-auth to 0.6.0-rc4"

This reverts commit 28a58c4.

* EAI-5821: Add ext-proc metrics scraping, bump otel-lgtm-stack to 1.0.8 and cluster-auth to 0.6.0-rc4

* EAI-5821: Bump cluster-auth to 0.6.0-rc5

* EAI-5821: Bump cluster-auth image from 0.6.0-rc5 to 0.6.0-rc6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* EAI-5821: bump cluster-auth to 0.6.0-rc7

* EAI-5821: bump cluster-auth to 0.6.0-rc8

* fix: reorder ext_proc before ext_authz in EnvoyProxy

Without filterOrder, ext_authz (cluster-auth) runs before ext_proc
(AI Gateway) sets x-ai-eg-model from the request body. This causes
cluster-auth to fall back to a catch-all route with no annotations,
firing defaultAction: allow and bypassing per-model authorization.

Fixes EAI-6805.

* fix: correct filterOrder syntax for EnvoyProxy CRD v1.7.1

CRD uses name+before/after, not relativeTo+position.

* EAI-6805: set x-ai-eg-model via Lua before ext_authz

The AI Gateway's ext_proc is inserted via xdsTranslator post-hooks,
which run after filterOrder is applied, so filterOrder cannot reorder it
relative to ext_authz. Without x-ai-eg-model set, cluster-auth falls
through to a catch-all route with no annotations and allows any model.

Add a Lua EnvoyExtensionPolicy that buffers the request body, extracts
the "model" JSON field, and sets x-ai-eg-model before ext_authz runs.
Order the Lua filter before ext_authz via EnvoyProxy.spec.filterOrder.

* EAI-6805: split AI traffic onto a dedicated ai-gateway

* fix: scrape ai-gateway-extproc on all gateway pods

The ai-gateway-extproc scrape job was filtering to only the 'https'
gateway pod. After splitting inference traffic to a separate 'ai-gateway',
the extproc sidecar on that pod was never scraped, causing zero metrics
for workbench deployments. Remove the gateway name filter — namespace +
app name is sufficient to select all Envoy proxy pods.

* EAI-5821: set pod label on ai-gateway-extproc scrape for API metrics queries

* EAI-5821: fix ai-gateway-extproc scrape - consolidate to metrics-k8s, cover all gateway pods

---------

Co-authored-by: Mika Ranta <Mika.Ranta@amd.com>
Co-authored-by: John Lybeck <john.lybeck@amd.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Tomas Saaristola <tsaarist@amd.com>
… and add ai-gateway-discovery to enabledApps in all cluster sizes
… and add ai-gateway-discovery to enabledApps in all cluster sizes
add ai-gateway-discovery app, and add envoy/ai-gateway overrides to root/values.yaml
…is present

When x-ai-eg-backend is set (from body or as a direct header), x-ai-eg-model
is intentionally skipped. This prevents model-name fallback rules in other
HTTPRoutes from matching — Envoy Gateway maintains per-HTTPRoute ordering so
an older route's model-name rule fires before a newer route's UUID rule when
both have equal specificity. Without x-ai-eg-model, only the UUID rule and
catch-all rules remain; the UUID rule (2 conditions) beats catch-all (0
conditions) by specificity regardless of creation order.
…route-specificity

EAI-7043: fix UUID-based AIM routing via Lua filter
update release pipeline to respect the target

Labels

None yet

Projects

None yet
