Add K8s Agent: Streamlit-based on-prem Kubernetes cluster management UI #16
devin-ai-integration[bot] wants to merge 31 commits into
- Profile Manager: CRUD for cluster profiles with node definitions (control-plane/worker), SSH credentials
- Cluster Creation: SSH-based provisioning with CRI-O, Flannel CNI, kubeadm, best practices hardening
- Cluster Debugger: Diagnostic commands with AI-powered root cause analysis and recommendations
- Monitoring Setup: One-click Prometheus + Grafana deployment with dashboards and alerting rules
- Log Analysis: Multi-source log collection, error pattern extraction, cross-source correlation
- AI Assistant: Chat interface powered by LLM for Kubernetes questions
- Integrated with Infosys AI Gateway for LLM capabilities
- Remove unused NodeInfo import from app.py
- Remove unused pyyaml and pandas from requirements.txt
- Add crio_root, crio_runroot, kubelet_root, log_root fields to ClusterProfile
- Add http_proxy, https_proxy, no_proxy, http_proxy_alt, https_proxy_alt fields
- Update generated scripts to configure CRI-O storage paths via crio.conf.d
- Update control-plane init script to use custom audit log dir and kubelet root
- Add proxy env vars to common setup and control-plane init scripts
- Add Storage Paths and Proxy Settings sections to Profile Manager UI
- Show storage/proxy details in Manage Profiles view and profile summary
…est uploads
- Add is_llm_configured() helper to detect when LLM is not set up
- Make all LLM imports lazy to avoid errors when LLM deps missing
- Guard all AI-powered UI features with is_llm_configured() checks
- Show informative fallback messages when LLM is not configured
- Add Offline Manifests tab for uploading Flannel YAML and other files
- Add flannel_manifest_path/prometheus_manifest_path to ClusterProfile
- SCP user-provided Flannel manifest to nodes during provisioning
- Core features (cluster creation, debugging, monitoring, logs) work without LLM
…anations, graphical events timeline
…silent reset on submit
…, fix disk usage for imported clusters
…ral-storage per namespace
…amp window to correlate_errors
…do for crictl, test summary
…rror persists after rerun
…ubelet version, OS, container runtime, cluster-info
    with open(kc_path, "w") as f:
        f.write(kubeconfig_content)
🔴 Kubeconfig credentials written to disk with world-readable permissions
All temporary kubeconfig files are written with plain open(path, "w"), which creates files with mode 0o666 masked by the process umask (typically resulting in 0o644: owner rw, group r, others r). This means any user on the system can read the kubeconfig, which often contains bearer tokens or client certificates granting full cluster access.
This contrasts with profile_manager.py:85 which correctly uses os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600) for profile files (that also embed kubeconfig_content). The same restricted-permission pattern should be used for all kubeconfig files.
All affected locations writing kubeconfig with insecure permissions:
- config.py:146-147 (fetch_namespaces)
- modules/cluster_debugger.py:78-79 (_run_local_kubectl)
- modules/log_analyzer.py:43-44 (_run_local_shell)
- modules/monitoring_setup.py:39-40 (_run_local_shell)
- modules/cluster_creator.py:1403-1404 (run_kubectl)
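A minimal sketch of the restricted-permission write described above, assuming a POSIX system. The helper name write_private_file is hypothetical and is not taken from the PR; only the os.open(..., 0o600) pattern comes from the review comment.

```python
import os


def write_private_file(path: str, content: str) -> None:
    """Write content so that only the owning user can read or write the file."""
    # The 0o600 mode passed to os.open only applies when the file is newly created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Tighten permissions in case the file already existed with a looser mode.
        os.fchmod(fd, 0o600)
        f = os.fdopen(fd, "w")
    except Exception:
        os.close(fd)
        raise
    with f:
        f.write(content)
```

The explicit os.fchmod call is a design choice for files left behind by earlier runs: without it, a pre-existing 0o644 file would keep its old mode even though the open call requests 0o600.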
… Restart Tracker, Network Policy Visualizer, PVC/Storage Dashboard
…n, pattern mining, summarization
…env in subsequent steps
… in kubeconfig path
    kubeconfig_path = config.get_kubeconfig_path("_debug_temp")
    with open(kubeconfig_path, "w") as f:
        f.write(kubeconfig_content)
    full_cmd = f"{kubectl} --kubeconfig={kubeconfig_path} {kubectl_args}"
🔴 Race condition: shared temp kubeconfig files cause operations to target wrong cluster
The _run_local_kubectl and _run_local_shell functions in the debugger, log analyzer, and monitoring modules all write kubeconfig content to fixed-name temp files (_debug_temp, _log_temp, _monitor_temp), then execute kubectl commands referencing those files. In a multi-user Streamlit deployment, concurrent sessions operating on different imported clusters will overwrite each other's kubeconfig files. If session A writes cluster-A's kubeconfig to _debug_temp.kubeconfig and session B overwrites it with cluster-B's kubeconfig before session A's kubectl command reads it, session A's command runs against cluster B. This could cause diagnostics, monitoring commands, or log collection to execute against the wrong cluster.
Affected locations:
- modules/cluster_debugger.py:77 (uses _debug_temp)
- modules/log_analyzer.py:42 (uses _log_temp)
- modules/monitoring_setup.py:38 (uses _monitor_temp)
- config.py:117 (uses _ns_fetch)
Note that cluster_creator.py:1398-1400 correctly uses a per-profile filename, showing the fix pattern.
Prompt for agents
The _run_local_kubectl function in cluster_debugger.py (and the identical _run_local_shell functions in log_analyzer.py and monitoring_setup.py) writes kubeconfig content to a shared temp file with a static name (_debug_temp, _log_temp, _monitor_temp respectively), then runs kubectl referencing that file. This creates a TOCTOU race condition in multi-session Streamlit deployments.
The fix pattern already exists in cluster_creator.py:1398-1400 where run_kubectl uses a per-profile filename. Apply the same approach to all four affected locations:
1. In cluster_debugger.py _run_local_kubectl (line 77): Instead of get_kubeconfig_path('_debug_temp'), use a unique path per invocation. Options: use tempfile.NamedTemporaryFile with delete=False (and clean up after), or use a session-specific or profile-specific name.
2. In log_analyzer.py _run_local_shell (line 42): Same fix.
3. In monitoring_setup.py _run_local_shell (line 38): Same fix.
4. In config.py fetch_namespaces (line 117): Same fix.
The simplest approach is to use Python's tempfile.NamedTemporaryFile(suffix='.kubeconfig', dir=kc_dir, delete=False) to create a unique file per call, then clean up in a try/finally block. Alternatively, pass the profile name through to these functions and use it in the filename, similar to cluster_creator.py.
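The tempfile-based option can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: run_with_temp_kubeconfig and the build_argv callback are hypothetical names, and the command is passed as an argv list rather than the shell string the modules currently build.

```python
import os
import subprocess
import tempfile
from typing import Callable, List


def run_with_temp_kubeconfig(
    kubeconfig_content: str,
    build_argv: Callable[[str], List[str]],
    kc_dir: str,
) -> subprocess.CompletedProcess:
    """Run a command against a kubeconfig file that is unique to this call."""
    # NamedTemporaryFile picks a fresh name per call and creates the file with
    # owner-only permissions, so concurrent sessions never share a path.
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".kubeconfig", dir=kc_dir, delete=False
    )
    try:
        with tmp:
            tmp.write(kubeconfig_content)
        # build_argv maps the temp path to the full command line.
        return subprocess.run(build_argv(tmp.name), capture_output=True, text=True)
    finally:
        # Always remove the credentials file, even if the command fails.
        os.remove(tmp.name)
```

A caller might pass something like lambda p: ["kubectl", f"--kubeconfig={p}", "get", "pods"]; using an argv list also sidesteps the path-quoting concerns raised elsewhere in this PR.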
    """
    is_helm = command.strip().startswith("helm ")

    if profile.kubeconfig_content:
🔴 run_kubectl routes on kubeconfig_content alone, inconsistent with all other routing functions
The run_kubectl function at k8s-agent/modules/cluster_creator.py:1392 checks only if profile.kubeconfig_content: to decide whether to run commands locally vs. via SSH. Every other routing function in the codebase — _run_on_cluster in log_analyzer.py:66, _run_on_cluster in monitoring_setup.py:62, run_diagnostic in cluster_debugger.py:159, run_custom_command in cluster_debugger.py:227, and check_pod_issues in cluster_debugger.py:335 — consistently checks both profile.cluster_source == "imported" and profile.kubeconfig_content. This means a provisioned cluster that also has its kubeconfig stored (e.g., fetched post-provisioning) would use local kubectl in run_kubectl but SSH in all other functions. If the local machine can't reach the API server directly (e.g., it's in a private network only reachable via SSH), run_kubectl commands would fail while debugger/monitoring/log commands succeed via SSH. The function's own docstring also states the intent is routing for imported clusters.
Suggested change:
    - if profile.kubeconfig_content:
    + if profile.cluster_source == "imported" and profile.kubeconfig_content:
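One way to keep this condition consistent across modules is a single shared predicate. The sketch below is illustrative only: Profile is a hypothetical stand-in for ClusterProfile, uses_local_kubectl is an invented name, and the "provisioned" default value is an assumption (the review only confirms the "imported" value).

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical stand-in for ClusterProfile with only the routing fields."""
    cluster_source: str = "provisioned"  # assumed value for SSH-provisioned clusters
    kubeconfig_content: str = ""


def uses_local_kubectl(profile: Profile) -> bool:
    """Return True only for imported clusters that carry a kubeconfig."""
    return profile.cluster_source == "imported" and bool(profile.kubeconfig_content)
```

With a helper like this, run_kubectl and the _run_on_cluster functions would all route a provisioned cluster with a stored kubeconfig over SSH.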
…codes, per-path/upstream breakdowns, slow requests
… Limits, add pod count per node, fix Set Active profile button
…te kubeconfig paths in shell commands
…ectbox renders, add on_change callback, delete profile_selector on all profile state changes
…ama connection, model fetching, streaming support
Summary
Adds a new k8s-agent/ Streamlit application for managing on-premises Kubernetes clusters. The tool provides a web UI with the following capabilities:
… /var/lib. Proxy/alternate proxy settings are injected into the master node environment during provisioning. Includes a Reset Cluster tab to tear down and optionally re-provision.
All remote operations go through subprocess-based SSH (provisioned clusters) or local kubectl with kubeconfig (imported clusters). LLM calls support both OpenAI-compatible and Ollama chat APIs.
Updates since last revision
New: Ollama LLM support. The LLM integration now supports two providers, selectable from the sidebar LLM Settings panel:
… http://10.73.98.113:11434). No API key required.
Ollama-specific features:
- … GET /api/tags and populates a model dropdown
- … POST /api/chat with {"message": {"content": "..."}} response format
- … options.temperature; max tokens via options.num_predict
- … config.LLM_PROVIDER, config.OLLAMA_BASE_URL, config.OLLAMA_MODEL module-level variables)
Configuration via env vars: LLM_PROVIDER=ollama, OLLAMA_BASE_URL=http://..., OLLAMA_MODEL=llama3. Or configure entirely via the sidebar UI.
Files changed: config.py (new OLLAMA_BASE_URL, OLLAMA_MODEL, get_active_llm_url(), get_active_model()), modules/llm_client.py (refactored into _build_messages/_build_headers/_build_payload helpers, provider-aware query_llm/stream_llm, new list_ollama_models()), app.py (sidebar LLM Settings panel redesigned with provider toggle).
Previous updates
Fix: profile switching broken (Set Active button + sidebar dropdown). The previous fix (deleting the profile_selector widget key before st.rerun()) was insufficient because Streamlit ignores the index parameter when a widget key already exists in session state; it reads the stale value from session_state[key] instead. The new fix:
- … session_state["profile_selector"] with active_profile before the selectbox widget is instantiated, so Streamlit reads the correct value.
- … on_change callback to keep active_profile in sync when the user changes the dropdown directly.
- … profile_selector key in all code paths that modify active_profile (Set Active button, Create Profile, Import Cluster, Delete Profile); the previous fix only covered the Set Active button.
Tested locally by creating two profiles and swapping between them via both the sidebar dropdown and the "Set Active" button in Manage Profiles; see recording below.
View original video (rec-9a37d52576914ae9aab09d68fb7260e5-edited.mp4)
"Collect pod logs" mode in Smart Log Analysis. Smart Log Analysis now has three modes: "Collect from cluster" (system-level logs), "Collect pod logs", and "Paste logs". The new mode provides a guided workflow to analyze Istio sidecar access logs or any pod's logs:
- … istio-proxy is pre-selected when present, making Istio access log analysis one-click
Fix: "Set Active" profile button crash. The previous fix (directly assigning st.session_state.profile_selector = profile.name) caused a StreamlitAPIException because Streamlit prohibits modifying widget state after the widget is instantiated. New fix: the button now deletes the profile_selector widget key from session state before calling st.rerun(), allowing the sidebar selectbox to reinitialize from the updated active_profile value via its index parameter.
Fix: kubeconfig path not quoted in shell commands. cluster_debugger.py:_run_local_kubectl and config.py:fetch_namespaces now quote the kubeconfig path in shell commands (--kubeconfig="{path}"), consistent with cluster_creator.py. Prevents breakage if DATA_DIR contains spaces.
Fix: profile JSON files written with 0600 permissions. save_profile() now uses os.open() with 0o600 mode instead of default open(), restricting read access to the file owner. Profile JSON files contain embedded kubeconfig credentials for imported clusters.
Istio/Envoy access log analysis. Smart Log Analysis now auto-detects Istio/Envoy access logs and shows comprehensive response time analytics:
Removed Helm Releases and Network Policies tabs from Resource Viewer per user request. These tabs and all associated code have been removed.
Removed init containers from Resource Requests/Limits — The "Container Resource Requests & Limits" tab now only shows application containers, excluding init containers. The "Init" column has been removed from the table and TSV export.
Pod count per node in Node Containers tab — For imported clusters, the Node Containers tab now shows a Pod Distribution Across Nodes summary with per-node pod counts displayed as metrics, total/avg/min/max/spread statistics, and a distribution health check (warns if spread exceeds 50% of the average, indicating uneven scheduling). Each node expander also shows its pod count in the header.
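The distribution health check described above can be sketched as follows. This is a guess at the arithmetic, not the PR's code: the function name pod_distribution is hypothetical, and "spread" is assumed to mean max minus min.

```python
from typing import Dict


def pod_distribution(pods_per_node: Dict[str, int]) -> dict:
    """Summarize per-node pod counts and flag uneven scheduling."""
    counts = list(pods_per_node.values())
    if not counts:
        return {"total": 0, "avg": 0.0, "min": 0, "max": 0, "spread": 0, "uneven": False}
    total = sum(counts)
    avg = total / len(counts)
    # Assumption: spread is the gap between the busiest and the emptiest node.
    spread = max(counts) - min(counts)
    return {
        "total": total,
        "avg": avg,
        "min": min(counts),
        "max": max(counts),
        "spread": spread,
        # Warn when the spread exceeds 50% of the average pod count.
        "uneven": spread > 0.5 * avg,
    }
```

For example, counts of 10, 12 and 11 give a spread of 2 against an average of 11, which stays under the threshold, while counts of 2 and 10 give a spread of 8 against an average of 6 and would trigger the warning.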
Pod & container dropdowns in Pod Logs tab — The Pod Logs tab (Log Analysis page) now has a "Load Pods" button that fetches all pods in the selected namespace. Once loaded, the pod name input switches to a dropdown showing pod names with their status. Selecting a pod also populates a container dropdown.
Smart Log Analysis (LogAI-inspired) — ML-powered log analysis (TF-IDF + DBSCAN clustering, anomaly detection, Drain-style pattern mining, auto-summarization, log volume timeline). Uses scikit-learn natively.
Multi-Cluster Dashboard, Certificate Manager, Cost Estimator, Pod Restart Tracker, PVC/Storage Dashboard — Six advanced features added.
Fix: unquoted paths in rm -rf during cluster reset. All reset rm -rf commands now quote the path.
Fix: unsanitized profile name in kubeconfig path. Profile name is now sanitized via re.sub(r"[^\w.-]", "_", profile.name).
Fix: proxy /etc/environment format for pam_env. Generates plain KEY=VALUE lines (not export KEY=VALUE).
kubectl/helm auto-detection, namespace auto-fetch, kubeconfig import, K8s/CRI-O version 1.35, Pod Security Standard explanations, LLM made fully optional, offline manifest uploads, step-by-step SSH provisioning, enriched cluster details, flash messages, feedback messages, metrics components, deployment scaling, pod shell, resource name dropdowns, node containers (crictl), cluster reset.
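The three fixes above (path quoting, profile-name sanitization, /etc/environment format) can be illustrated with a short sketch. Only the re.sub pattern is taken from the PR description; the function names are hypothetical, and shlex.quote is swapped in here as one standard way to quote a path, whereas the PR may use plain double quotes.

```python
import re
import shlex
from typing import Dict


def sanitize_profile_name(name: str) -> str:
    """Replace anything other than word characters, dots and hyphens."""
    return re.sub(r"[^\w.-]", "_", name)


def reset_command(path: str) -> str:
    """Build an rm -rf command with the path quoted for the shell."""
    return f"rm -rf {shlex.quote(path)}"


def environment_lines(proxies: Dict[str, str]) -> str:
    """Render plain KEY=VALUE lines (no 'export'), the format pam_env reads."""
    return "\n".join(f"{key}={value}" for key, value in proxies.items())
```

Because slashes and spaces are replaced, a sanitized profile name cannot introduce extra path components when it is embedded in a kubeconfig filename.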
Review & Testing Checklist for Human
- run_ssh_command passes user-supplied strings (hostnames, IPs, custom commands, namespace names, storage paths, proxy URLs) directly into shell commands via f-string interpolation with no sanitization. run_custom_command allows arbitrary shell execution. The Pod Shell tab passes user-typed commands to kubectl exec. The Node Containers tab allows free-text CRI commands passed directly to run_ssh_command. While reset paths, kubeconfig paths, and profile names are now quoted/sanitized, the broader command injection surface remains.
- … /api/chat, /api/tags, NDJSON streaming) was written from Ollama API docs but has not been tested against the actual Ollama endpoint at http://10.73.98.113:11434. Response format parsing (especially streaming NDJSON vs OpenAI SSE) could fail silently or produce garbled output. The list_ollama_models function swallows all exceptions and returns an empty list, which could hide connectivity issues.
- … config.LLM_PROVIDER, config.OLLAMA_BASE_URL, and config.OLLAMA_MODEL at module level on every Streamlit rerun. This works for single-user sessions but could cause race conditions if Streamlit is serving multiple concurrent users (each rerun overwrites the same module globals). The OpenAI provider settings (LLM_API_URL, LLM_API_KEY, LLM_MODEL) are still read-only from env vars.
- … kubectl top pods --containers text by column position; (d) Smart Log Analysis TF-IDF + DBSCAN on real log volumes; (e) "Collect pod logs" mode parses get_pod_list output by whitespace column positions, which may break if pod names/statuses contain unexpected formatting.
- … session_state["profile_selector"] before instantiation and uses an on_change callback. Four separate code paths (Set Active, Create Profile, Import Cluster, Delete Profile) all delete profile_selector before st.rerun(). If a future code path sets active_profile without deleting the widget key, the stale selectbox value will silently override the intended profile.
- … kubeconfig_content (full kubeconfig YAML including cluster credentials) is stored in profile JSON files at data/profiles/. File permissions are now restricted to 0600, but the content itself is unencrypted.
- … kubeadm reset, rm -rf on kubelet/CRI-O/etcd data, and flushes all iptables rules. The only safeguard is typing "RESET" in a text input.
Suggested test plan:
- … http://10.73.98.113:11434 → click "Fetch available models" → verify model list populates → select a model → go to AI Assistant → send a message → verify streaming response renders correctly. Then try Cluster Debugger AI analysis and Log Analysis AI analysis with Ollama to verify non-streaming query_llm also works.
- … StreamlitAPIException or stale profile state.
- … istio-proxy is pre-selected in container dropdown → click "Fetch Pod Logs & Analyze" → verify logs are fetched and Istio access log analysis is triggered automatically.
- … data/profiles/ have 600 permissions (not world-readable).
Notes
- … logai Python package is not imported at runtime. Smart Log Analysis reimplements key algorithms using scikit-learn directly.
- … _run_on_cluster() pattern is duplicated across three modules (cluster_debugger.py, monitoring_setup.py, log_analyzer.py). Consider extracting to a shared utility.
- SSH_ONLY_COMMANDS and SSH_ONLY_LOG_SOURCES are hardcoded sets that must be updated if new commands/sources are added.
- … config.* module globals at runtime for Ollama settings. OpenAI settings remain env-var-based and read-only.
- … health_color variable is assigned but never used.
- … kubectl get pods --field-selector spec.nodeName=... which is not equivalent to crictl ps -a.
- scikit-learn>=1.3.0, numpy>=1.24.0, and pandas>=2.0.0 significantly increase install size.
- import re as _re inside run_kubectl is a function-level import; consider moving to the top-level import block.
- … http://10.73.98.113:11434) is hardcoded in config.py. This is a user-specific internal IP that should likely be empty by default for other deployments.
Link to Devin session: https://partner-workshops.devinenterprise.com/sessions/f7a6f611aedb4141ba2e24adecaee0b3
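For the Ollama items in the checklist above (request options and NDJSON streaming), a minimal sketch of the payload shape and stream parsing may help when testing against a real endpoint. The function names are hypothetical and are not the PR's _build_payload or stream_llm; the field names follow the public Ollama chat API, and raising on an "error" field is an assumption about how failures should surface.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional


def build_ollama_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    stream: bool = False,
) -> dict:
    """Build a POST /api/chat body; sampling settings go under 'options'."""
    payload: dict = {"model": model, "messages": messages, "stream": stream}
    options: dict = {}
    if temperature is not None:
        options["temperature"] = temperature
    if max_tokens is not None:
        options["num_predict"] = max_tokens  # Ollama's name for the token limit
    if options:
        payload["options"] = options
    return payload


def iter_ollama_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from an NDJSON stream (one JSON object per line)."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        chunk = json.loads(line)  # malformed lines raise instead of failing silently
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break
```

Letting parse errors propagate, rather than swallowing them, is a deliberate choice here: it addresses the concern that a format mismatch could otherwise produce garbled or empty output without any visible failure.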