diff --git a/docs/self_hosted/architecture/architecture.md b/docs/self_hosted/architecture/architecture.md new file mode 100644 index 0000000000..2b4e13b340 --- /dev/null +++ b/docs/self_hosted/architecture/architecture.md @@ -0,0 +1,24 @@ +# Reference Architecture + +The Sekoia Self-Hosted architecture is built for resilience and scalability within a customer-managed perimeter. + +## Core components +The platform is delivered as a single digitally signed archive containing all necessary applications: + +* **Deployer**: Handles the Kubernetes installation and support services. +* **Microservices**: Includes the core platform logic and Sekoia applications. +* **Databases**: Integrated data persistence layers. +* **Rules Catalog**: The collection of Sekoia detection rules. + +## High availability logic +The architecture is designed for production-grade reliability: + +* **Redundancy**: Built-in component redundancy to prevent single points of failure. +* **Self-healing**: Automated recovery mechanisms for continuous operations. +* **Scalability**: Support for multi-node compute clusters to handle high-volume ingestion. + +## Data Workflow + +* **Ingestion**: Event logs are sent to forwarders which point to a unique domain/port redirected to your Load Balancer. +* **Processing**: Microservices are decoupled to prevent cascading failures. +* **Storage**: Data is written to an S3-compatible storage managed by the customer. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_configuration.md b/docs/self_hosted/deployment/deployment_configuration.md new file mode 100644 index 0000000000..e8c46592c3 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_configuration.md @@ -0,0 +1,259 @@ +# Deployment configuration reference + +The `config.yml` file is the central manifest used to describe your environment and service requirements. It dictates how the self-hosted controller orchestrates infrastructure deployment and application settings. 
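Several values in the configuration example of this reference use the form `env.NAME` (for instance `env.SERVERS_SSH_KEY`) to point at an environment variable. As a rough aid before running the controller, the sketch below lists the `env.` references found in a manifest and prints the ones that are unset or empty in the current shell. It is only an illustration: the function name `list_unset_env_refs` is hypothetical, the pattern assumes variable names made of uppercase letters, digits, and underscores, and the controller's own resolution logic may differ.

```shell
#!/bin/bash
# Hypothetical helper, not part of the Self-Hosted Controller:
# print every env.NAME reference in a manifest whose variable is unset or empty.
list_unset_env_refs() {
  local config_file="$1" name value
  # Extract the unique variable names that follow the "env." prefix.
  grep -oE 'env\.[A-Z_][A-Z0-9_]*' "$config_file" | sed 's/^env\.//' | sort -u |
    while read -r name; do
      # The name is restricted to [A-Z0-9_] by the pattern above, so eval is safe here.
      eval "value=\${$name:-}"
      if [ -z "$value" ]; then
        echo "$name"
      fi
    done
}
```

For example, `list_unset_env_refs config.yml` prints one variable name per line and prints nothing when every referenced variable has a value.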
+ +## Configuration example + +You can use this complete example as a template for your own deployment. Ensure you replace the placeholders and environment variable references with your actual values. + +```yaml +global: + dev: false + emit_mm_notif: false + host: "app.sekoia.local" + alternative_hosts: "api.sekoia.local" + version: + platform: + version: "v0.0.1" + path: /opt/sekoia/platform/ + skip_existing_local: false + skip_existing_manifest: false + manifest_max_age: 300 + push_workers: 4 + data: + detection-rules: + version: "010126" + path: /opt/sekoia/data/ + intake-formats: + version: "010126" + path: /opt/sekoia/data/ + playbook-library: + version: "010126" + path: /opt/sekoia/data/ + cti: + version: "010126" + path: /opt/sekoia/data/ +utils: + ansible: + datadir: "resources/ansible" + ssh-key: env.SERVERS_SSH_KEY + user: root + password: env.SERVERS_SUDO_PASSWORD + inventory: + managers: + - 10.0.0.1 + workers: + - 10.0.0.2 + - 10.0.0.3 + git: + auth_method: "http" + repo_url: "" + http: + username: env.GIT_HTTP_USERNAME + password: env.GIT_HTTP_PASSWORD + proxy: "" + ssh: + key_path: env.GIT_SSH_KEY_PATH + kubernetes: + kubeconfig_path: "/tmp/self-hosted-controller/kubeconfig.yml" + autologin: false + oci_registry: + url: env.OCI_REGISTRY_URL + username: env.OCI_REGISTRY_USERNAME + password: env.OCI_REGISTRY_PASSWORD + check_repo: "my-project/shc-probe" + chart_repo: "my-project/charts" + image_repo: "my-project/images" + prometheus: + url: env.PROMETHEUS_URL + query_window: "1h" + query_timeout: 10 + default_label_filters: { + "platform": "app.dev1.sekoia.io" + } + argocd: + namespace: "argocd" + root_app_name: "root" + notification: + url: "http://localhost:6666" + channel: "mi-self-hosted" + thread_id: "deploy-job" + platform_installer: + image: "registry.sekoia.io/sekoialab/platform-installer:self-hosted-v0.14.0" + +modules: + k3s_install: + k3s_release: "v1.31.12+k3s1" + k3s_tls_san: [] + kube_manager_fqdn: "10.0.0.1" + k3s_extra_args: "" + 
k3s_extra_labels: {} + k3s_extra_taints: [] + registry_url: https://registry.sekoia.io + registry_subpath: "" + registry_username: env.REGISTRY_USERNAME + registry_password: env.REGISTRY_PASSWORD + pull_images_with_proxy: false + k3s_http_proxy: "" + k3s_https_proxy: "" + k3s_no_proxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,.lab" + reboot_after_install: false + push_argo_stacks: + repo_path: "/tmp/argo-stacks" + helm_install: + kube_manager_fqdn: "10.0.0.1" + forward_dns: "10.X.X.X" + wipe_storage: + enabled: false + kube_crash_recovery: + pod_ready_timeout: 300 + poll_interval: 10 + platform_configuration: + config: + global: + host: "app.sekoia.local" + alternative_hosts: "api.sekoia.local" + delivery_host: "app.sekoia.local" + proxy: + http_proxy: "http://proxy.lab:3128" + https_proxy: "http://proxy.lab:3128" + no_proxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,.lab" + grafana: + root_url: "https://app.sekoia.local/grafana" + email: + email_sender: "noreply@sekoia.local" + smtp: + host: "mail.server.local" + user: "smtp-user" + password: "smtp-password" + port: "25" + tls: "False" + starttls: "True" + local_argocd: + repo_name: "" + repo_url: "" + helm_repo_url: "" + git_username: env.GIT_HTTP_USERNAME + git_password: env.GIT_HTTP_PASSWORD + oci_username: env.REGISTRY_USERNAME + oci_password: env.REGISTRY_PASSWORD +``` + +## Manifest parameters + +### 1. Global +This section defines the platform identity, notification behavior, and how the controller fetches release assets from remote storage. + +* **global.dev** (boolean): Enables development mode behaviors, such as verbose logging and extended error reporting. +* **global.emit_mm_notif** (boolean): Enables the sending of installation progress notifications to Mattermost. +* **global.host** (string): The primary FQDN used to access the Sekoia.io platform. 
+* **global.alternative_hosts** (string): Secondary FQDNs used for API access or auxiliary services. + +#### global.version.fetch (optional: online deployment only) +* **global.version.fetch.access-key** (string): S3 access key required to authenticate with the release bucket. +* **global.version.fetch.secret-key** (string): S3 secret key required to authenticate with the release bucket. +* **global.version.fetch.endpoint** (string): The S3 API endpoint URL (e.g., Linode, AWS, or MinIO). +* **global.version.fetch.region** (string): The geographical region of the S3 bucket. +* **global.version.fetch.bucket** (string): The name of the S3 bucket containing the self-hosted release assets. + +#### global.version.platform +* **global.version.platform.version** (string): The specific version tag of the platform to deploy. +* **global.version.platform.path** (string): Local absolute path on the admin node where the release archive is stored. +* **global.version.platform.skip_existing_local** (boolean): If `true`, the controller skips downloading files already present on the local disk. +* **global.version.platform.skip_existing_manifest** (boolean): If `true`, the controller always uses the local manifest copy, bypassing age checks. +* **global.version.platform.manifest_max_age** (integer): Time in seconds before the local manifest is considered expired and re-downloaded. +* **global.version.platform.push_workers** (integer): Number of parallel threads used to push images and Helm charts to the local registry. + +#### global.version.data +* **global.version.data.detection-rules** (object): Version and local path for the detection logic bundle. +* **global.version.data.intake-formats** (object): Version and local path for log parsing formats. +* **global.version.data.playbook-library** (object): Version and local path for automation playbooks. +* **global.version.data.cti** (object): Version and local path for Cyber Threat Intelligence data. + +### 2. 
Utils +This section configures the underlying tools (Ansible, Git, Kubernetes) and external service integrations required for the deployment. + +#### utils.ansible +* **utils.ansible.datadir** (string): Path to the directory containing Ansible playbooks, roles, and inventories. +* **utils.ansible.ssh-key** (string): Path or environment variable for the SSH private key used to manage nodes. +* **utils.ansible.user** (string): The remote user used for SSH connections (e.g., `root` or `debian`). +* **utils.ansible.password** (string): The sudo password or environment variable for privilege escalation. +* **utils.ansible.inventory.managers** (list): IP addresses or FQDNs of the Kubernetes manager nodes. +* **utils.ansible.inventory.workers** (list): IP addresses or FQDNs of the Kubernetes worker nodes. + +#### utils.git +* **utils.git.auth_method** (string): Protocol used to authenticate with Git repositories (`http` or `ssh`). +* **utils.git.repo_url** (string): The remote URL of the Git repository containing environment manifests. +* **utils.git.http.username** (string): Username for HTTP-based Git authentication. +* **utils.git.http.password** (string): Password or Token for HTTP-based Git authentication. +* **utils.git.http.proxy** (string): Optional proxy URL specifically for Git HTTP operations. +* **utils.git.ssh.key_path** (string): Path to the SSH key for Git authentication. + +#### utils.kubernetes +* **utils.kubernetes.kubeconfig_path** (string): Destination path where the generated cluster `kubeconfig` will be stored. +* **utils.kubernetes.autologin** (boolean): If `true`, automatically performs a CLI login to the cluster after deployment. + +#### utils.oci_registry +* **utils.oci_registry.url** (string): The full URL of the OCI-compliant container registry. +* **utils.oci_registry.username** (string): Registry authentication username. +* **utils.oci_registry.password** (string): Registry authentication password. 
+* **utils.oci_registry.check_repo** (string): Full path to an image used for registry health-check probes. +* **utils.oci_registry.chart_repo** (string): Base repository path for Helm chart storage. +* **utils.oci_registry.image_repo** (string): Base repository path for Docker image storage. + +#### utils.prometheus +* **utils.prometheus.url** (string): The endpoint URL for the Prometheus monitoring server. +* **utils.prometheus.query_window** (string): The default time window applied to metric queries (e.g., `1h`). +* **utils.prometheus.query_timeout** (integer): Maximum duration in seconds for a Prometheus query to complete. +* **utils.prometheus.default_label_filters** (object): Set of default labels used to filter all outgoing Prometheus queries. + +#### utils.argocd +* **utils.argocd.namespace** (string): The Kubernetes namespace where ArgoCD services are deployed. +* **utils.argocd.root_app_name** (string): The name assigned to the "App-of-Apps" root manifest. + +#### utils.notification +* **utils.notification.url** (string): Base URL for the internal notification service. +* **utils.notification.channel** (string): The Mattermost channel identifier for posting updates. +* **utils.notification.thread_id** (string): A logical identifier used to group notification messages into threads. + +#### utils.platform_installer +* **utils.platform_installer.image** (string): The full Docker image URI for the Sekoia.io platform installer. + +### 3. Modules +This section provides granular configuration for each functional phase of the platform installation. + +#### modules.k3s_install +* **modules.k3s_install.k3s_release** (string): The Kubernetes version tag to be installed. +* **modules.k3s_install.k3s_tls_san** (list): List of additional SANs (Subject Alternative Names) for the API server certificate. +* **modules.k3s_install.kube_manager_fqdn** (string): The FQDN or IP of the primary manager node used for cluster orchestration. 
+* **modules.k3s_install.k3s_extra_args** (string): Additional command-line arguments passed to the K3s server/agent process. +* **modules.k3s_install.k3s_extra_labels** (object): Key-value pairs to be applied as labels to the Kubernetes nodes. +* **modules.k3s_install.k3s_extra_taints** (list): List of taints to be applied to the Kubernetes nodes. +* **modules.k3s_install.registry_url** (string): The private registry URL used by the nodes to pull images. +* **modules.k3s_install.registry_subpath** (string): Optional prefix path for rewriting registry image locations. +* **modules.k3s_install.pull_images_with_proxy** (boolean): Enables the use of an HTTP proxy for `containerd` image pulls. +* **modules.k3s_install.k3s_http_proxy** (string): The HTTP proxy URL for the K3s runtime. +* **modules.k3s_install.k3s_https_proxy** (string): The HTTPS proxy URL for the K3s runtime. +* **modules.k3s_install.k3s_no_proxy** (string): List of CIDRs and domains that bypass the proxy. +* **modules.k3s_install.reboot_after_install** (boolean): If `true`, the host nodes are rebooted immediately following the K3s installation. + +#### modules.push_argo_stacks +* **modules.push_argo_stacks.repo_path** (string): Local directory path where the ArgoCD manifest repositories are synchronized. + +#### modules.helm_install +* **modules.helm_install.kube_manager_fqdn** (string): FQDN or IP of the manager node for Helm deployment tasks. +* **modules.helm_install.forward_dns** (string): The upstream DNS server IP used by the cluster's CoreDNS. + +#### modules.wipe_storage +* **modules.wipe_storage.enabled** (boolean): Authorizes the controller to format and wipe disks (required for Ceph/Rook storage setup). + +#### modules.kube_crash_recovery +* **modules.kube_crash_recovery.pod_ready_timeout** (integer): Seconds to wait for all pods to reach the `Ready` state before a deployment phase times out. 
+* **modules.kube_crash_recovery.poll_interval** (integer): Interval in seconds between pod status checks during recovery phases. + +#### modules.platform_configuration +* **modules.platform_configuration.config.global** (object): Copy of the platform FQDNs, used for the internal application context. +* **modules.platform_configuration.config.proxy** (object): Application-layer HTTP/HTTPS/NO_PROXY settings. +* **modules.platform_configuration.config.grafana.root_url** (string): The external URL used to access the Grafana dashboard. +* **modules.platform_configuration.config.email.smtp** (object): SMTP server details (`host`, `port`, `user`, `password`, `tls`, `starttls`) for platform alerts and notifications. +* **modules.platform_configuration.config.local_argocd** (object): Git and OCI credentials used by the ArgoCD instance to sync internal application manifests. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_guide.md b/docs/self_hosted/deployment/deployment_guide.md new file mode 100644 index 0000000000..b4b5c99b4c --- /dev/null +++ b/docs/self_hosted/deployment/deployment_guide.md @@ -0,0 +1,135 @@ +# Deploy the platform + +This guide describes how to install the Sekoia Self-Hosted platform using the orchestration CLI. + +## Prerequisites +* An orchestration node (the admin node) with Docker installed. +* Access to the digitally signed Sekoia installer archive. + +## Preparation work + +### 1. Download and Extract + +1. **Download**: Follow the dedicated procedure and use your specific credentials to download the Self-Hosted release. +2. **Transfer**: Copy the archive to the admin node. +3. **Integrity Check**: Always verify the archive checksum before proceeding. +4. **Extract**: + ```bash + tar -xvf sekoia-archive.tar.gz -C "$SEKOIA_LOCAL_DIR" + ``` +!!! note "Disk space requirements" + Ensure the destination directory provides at least 100 GB of available disk space to extract the Sekoia release. + +### 2. 
Initialize the Controller + +For the first installation, the Self-Hosted Controller (SHC) is not available as a service. You must load the Docker image manually to initialize the environment: + +```bash + docker load -i "$SEKOIA_LOCAL_DIR"/v0.0.1/images/self-hosted-controller.tgz +``` + +### 3. Write your configuration file + +Prepare your `config.yml` manifest. This file acts as the single source of truth for your infrastructure, service, and network parameters. Ensure this file is ready before moving to the next step, as it will be mounted into the container during script execution. Refer to the [deployment configuration reference](./deployment_configuration.md) to define your parameters. + +### 4. Create the Execution Script +To simplify Sekoia installation commands and manage environment variables, create a local script called `run-shc.sh` with the following content: + +```bash +#!/bin/bash +DOCKER_IMAGE="sekoialab/self-hosted-controller-cli:latest" + +# Pass credentials to the container and mount the configuration file and the extracted release. +docker run --rm \ +-e SERVERS_SUDO_PASSWORD="$SERVERS_SUDO_PASSWORD" \ +-e SERVERS_SSH_KEY="$SERVERS_SSH_KEY" \ +-e REGISTRY_USERNAME="$REGISTRY_USERNAME" \ +-e REGISTRY_PASSWORD="$REGISTRY_PASSWORD" \ +-e GIT_HTTP_USERNAME="$GIT_HTTP_USERNAME" \ +-e GIT_HTTP_PASSWORD="$GIT_HTTP_PASSWORD" \ +--network=host \ +-v "$SEKOIA_CONFIG_FILE":/tmp/config.yaml \ +-v "$SEKOIA_LOCAL_DIR":/opt/sekoia \ +"${DOCKER_IMAGE}" -c /tmp/config.yaml "$@" +``` + +!!! note "Environment variable configuration" + You can define these variables directly in your configuration file or as environment variables on the host system. + +| Variable | Description | +| :--- | :--- | +| `SEKOIA_LOCAL_DIR` | The directory used to extract the local Sekoia release. | +| `SEKOIA_CONFIG_FILE` | The path to the local configuration file for the self-hosted controller. | +| `SERVERS_SUDO_PASSWORD` | The sudo password to access target machines (optional, depends on your configuration). | +| `SERVERS_SSH_KEY` | The SSH key used to access remote machines. 
| +| `REGISTRY_USERNAME` | The username for the OCI registry. | +| `REGISTRY_PASSWORD` | The password for the OCI registry. | +| `GIT_HTTP_USERNAME` | The username for the code repository. | +| `GIT_HTTP_PASSWORD` | The password for the code repository. | + + +Once created, make the script executable and test your configuration: + +```bash +chmod +x run-shc.sh +./run-shc.sh list +``` + +## Deployment Options + +### Option 1: Bundle Deployment + +This method executes all commands sequentially, providing the simplest installation path. + +Trigger the deployment: + +```bash +./run-shc.sh exec Install +``` + +Wait for the final convergence report. + +### Option 2: Step-by-Step Deployment + +If you prefer manual control, you can execute the deployment stages individually. + +#### 1. Run Preflight Checks +You must validate the environment before proceeding. The tool checks the Self-Hosted configuration, local Git, local OCI registry, release files, server availability, and server specifications. + +```bash +./run-shc.sh exec CheckRequiredConfigItems +./run-shc.sh exec CheckLocalGit +./run-shc.sh exec CheckLocalOCIRegistry +./run-shc.sh exec CheckLocalReleaseFiles +./run-shc.sh exec CheckServersAreReachable +./run-shc.sh exec CheckServerSpec +``` + +!!! warning "Preflight block" + The installation will not proceed if critical validation checks fail. + +#### 2. Provision Local Registries + +```bash +./run-shc.sh exec PushImages +./run-shc.sh exec PushCharts +./run-shc.sh exec PushArgoStacks +``` + +#### 3. Install Kube Stack + +```bash +./run-shc.sh exec K3SInstall +./run-shc.sh exec GetKubeconfig +./run-shc.sh exec HelmInstall +./run-shc.sh exec CheckKubernetesCluster +``` + +#### 4. Deploy Sekoia Platform + +```bash +./run-shc.sh exec sekoia +``` + +## Results + +The Sekoia platform is now operational. You can access the interface via the Load Balancer URL provided in your configuration. 
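The preflight checks from Option 2 can also be chained so that the run stops at the first failure. The sketch below is a minimal illustration under one assumption that this guide does not confirm: that `run-shc.sh` returns a non-zero exit status when a check fails. The function `run_preflight` and the `SHC_RUNNER` variable are hypothetical names introduced for this example.

```shell
#!/bin/bash
# Hypothetical wrapper around the preflight commands listed in this guide.
# SHC_RUNNER is an illustrative variable; it defaults to the run-shc.sh script.
SHC_RUNNER="${SHC_RUNNER:-./run-shc.sh}"

run_preflight() {
  local check
  for check in CheckRequiredConfigItems CheckLocalGit CheckLocalOCIRegistry \
               CheckLocalReleaseFiles CheckServersAreReachable CheckServerSpec; do
    echo "Running $check"
    # Assumption: a failed check makes the runner exit with a non-zero status.
    if ! "$SHC_RUNNER" exec "$check"; then
      echo "Preflight check failed: $check" >&2
      return 1
    fi
  done
  echo "All preflight checks passed"
}
```

Calling `run_preflight` before the registry provisioning step gives a single pass or fail result instead of six separate invocations.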
\ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_prerequisites.md b/docs/self_hosted/deployment/deployment_prerequisites.md new file mode 100644 index 0000000000..83d1590df2 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_prerequisites.md @@ -0,0 +1,41 @@ +# Self-Hosted technical requirements + +Sekoia Self-Hosted is an enterprise-grade solution that requires specific infrastructure, network, and operational readiness. As a customer-operated platform, you are responsible for the end-to-end lifecycle, including provisioning, management, and monitoring. + +## Infrastructure Prerequisites +You must provision and manage the following dedicated infrastructure components: + +| Component | CPU | RAM | Storage / Throughput | +| :--- | :--- | :--- | :--- | +| **Load Balancer** (e.g., HAProxy, Nginx) | 4 | 8GB | 50MB/s (validate per customer) | +| **Local Image Registry** (e.g., Harbor) | 4 | 8GB | 5TB | +| **Local Code Registry** (e.g., Git) | 4 | 8GB | 100GB | +| **Orchestration Node** (Admin) | 4 | 8GB | 100GB | + +## Networking and Storage +* **Networking**: You must provide internal DNS resolution, synchronized NTP infrastructure, and SMTP servers for alerts. +* **S3 Storage**: You must provide an S3-compatible bucket for event storage. + * **Calculation**: Total capacity = Raw daily ingested events (GB) x retention days. + * **Note**: The 4TB local SSD on compute nodes is strictly for system use, not long-term event storage. + +## Scaling Table +The following table outlines the estimated hardware footprint per cluster based on asset count and daily ingestion volume. + +| Assets | Daily Volume | Compute Nodes | GPU Nodes | +| :--- | :--- | :--- | :--- | +| 5,000 | 500 GB | 6 | 1 | +| 10,000 | 1 TB | 12 | 1 | +| 20,000 | 2 TB | 24 | 2 | +| 50,000 | 5 TB | 60 | 2 | +| 100,000 | 10 TB | 120 | 2 | + +!!! warning "Theoretical Sizing" + The figures provided in the scaling table are estimates for guidance. 
Infrastructure requirements vary based on specific replication constraints, data retention policies, and query patterns. These figures must be validated by Sekoia prior to deployment. + +!!! warning "Hardware Specifications" + Each compute node must meet the minimum specification: 44 CPU (3.2 GHz minimum), 128GB RAM, and 4TB SSD (Debian 11). GPU nodes require an NVIDIA H100. + +## Operational Readiness +* **Operations Team**: A dedicated in-house or partner team must handle deployment, configuration, updates, and 24/7 monitoring. +* **Infrastructure Management**: You must have the ability to provision and manage your own private cloud, on-premises servers, or virtual machines. +* **Licensing**: A valid license must be provided, defining the maximum supervised assets and daily event-ingest capacity. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_process.md b/docs/self_hosted/deployment/deployment_process.md new file mode 100644 index 0000000000..0a858d5ed1 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_process.md @@ -0,0 +1,19 @@ +# The Deployment Process + +A central tool called the **Self-Hosted Controller** manages the platform lifecycle. This tool takes a configuration file as a parameter, with all settings described in the configuration section. + +All available commands can be listed by running the `list` command. Specific details for these commands are provided in the relevant sections below (Deployment, Debug, Check). Before proceeding, please consult the list of prerequisites for a Self-Hosted deployment. + +## Core Principles +The deployment engine is designed around three fundamental pillars to ensure stability and predictability: + +* **Preflight Validation**: The CLI performs comprehensive checks before any execution. It verifies OS versions, network connectivity, and file checksums, blocking execution until the environment meets all minimum requirements. 
+* **Declarative Configuration**: The platform state is defined by a `config.yml` manifest. This file acts as the single source of truth for infrastructure settings (IPs, load balancers, DNS), service configurations (SMTP, feature toggles, AI modules), and scaling parameters (node counts, resource quotas). +* **Orchestration**: A core engine computes the difference between the actual and desired state to execute only necessary tasks. + +## Workflow Execution +The CLI supports both online and air-gapped environments: + +* **Online**: Fetches artifacts directly from authorized registries. +* **Air-gapped**: Operates in fully disconnected mode using pre-staged packages and locally cached manifests. + diff --git a/docs/self_hosted/index.md b/docs/self_hosted/index.md new file mode 100644 index 0000000000..6cced18484 --- /dev/null +++ b/docs/self_hosted/index.md @@ -0,0 +1,54 @@ +# Sekoia Self-Hosted + +Sekoia Self-Hosted is a customer-operated deployment of the Sekoia AI SOC platform designed for environments with high regulatory, sovereignty, or connectivity constraints. It provides cloud-grade detection and automation capabilities while keeping data processing and infrastructure entirely under your authority. + +## Purpose and logic + +Highly regulated sectors often operate under legal or technical constraints that make standard SaaS models impractical. Sekoia Self-Hosted addresses these challenges by allowing you to maintain full control over your security operations. + +The platform ensures: + +* **Data Sovereignty**: Data remains within your infrastructure to meet residency and sovereignty laws. +* **Network Independence**: Operations can run in restricted or fully air-gapped networks with no external connectivity. +* **Infrastructure Control**: You manage the deployment, monitoring, and lifecycle of the platform on your own infrastructure. 
+ +## Key benefits + +* **Regulatory Compliance**: The platform is compatible with national and sector-specific regulations, including public procurement frameworks. +* **Enterprise-Grade Operations**: You can provision and scale environments with automated orchestration and built-in autoscaling. +* **High Availability**: The architecture includes redundancy and self-healing capabilities to ensure continuous security operations. +* **Continuous Protection**: The business layer is decoupled from the technical foundation, allowing you to receive daily updates for threat intelligence, security rules, integrations, and automations. This ensures your detection capabilities improve continuously without requiring platform upgrades or service disruption. + +## Intended audience + +Sekoia Self-Hosted is specifically designed for organizations that require maximum control over their security stack. + +Target organizations include: + +* **Highly Regulated Industries**: Sectors such as finance, government, and healthcare. +* **Sovereign Entities**: Organizations with strict national data requirements or those using sovereign service providers. +* **Restricted Environments**: Teams operating in air-gapped or classified infrastructures. +* **Large Enterprises**: Organizations running dedicated security operations with an internal platform team or a trusted MSSP. + +!!! note "Operational Responsibility" + Sekoia Self-Hosted requires a dedicated operations team to manage the end-to-end lifecycle, including deployment, configuration, and monitoring. + +## Documentation map + +Use the following links to navigate the technical documentation and manage your deployment lifecycle. + +### Architecture +* [Reference Architecture](architecture/architecture.md): Explore the technical design, component interactions, and data flow of the self-hosted solution. 
+ +### Deployment +* [Deployment Process](deployment/deployment_process.md): Understand the high-level stages of a standard installation. +* [Deployment Prerequisites](deployment/deployment_prerequisites.md): Verify hardware, software, and network requirements before you begin. +* [Deployment Guide](deployment/deployment_guide.md): Follow step-by-step instructions to install and initialize the platform. +* [Deployment Configuration](deployment/deployment_configuration.md): Consult the comprehensive reference for the `config.yml` manifest parameters. + +### Operations and Support +* [Monitoring Guide](monitoring/monitoring_guide.md): Learn how to observe platform health and performance metrics. +* [Debug Tool](troubleshooting/debug_tool.md): Use the built-in diagnostic tools to identify and resolve common issues. + +### Release updates +* [Release Notes](release_notes/0.0.1.md): Review the latest features, improvements, and fixes included in each version. \ No newline at end of file diff --git a/docs/self_hosted/monitoring/monitoring_guide.md b/docs/self_hosted/monitoring/monitoring_guide.md new file mode 100644 index 0000000000..3d29d8548a --- /dev/null +++ b/docs/self_hosted/monitoring/monitoring_guide.md @@ -0,0 +1,247 @@ +# Monitoring + +Sekoia Self-Hosted ships with a complete observability stack so you can monitor cluster health, application status, database availability, and resource consumption in real time. This page explains how the stack is organized, when to use which tool, and gives a full command reference for on-demand diagnostics. + +This page is for platform administrators operating a Sekoia Self-Hosted instance. It covers both daily monitoring and incident response. 
+ +## Monitoring at a glance + +Sekoia Self-Hosted provides two complementary monitoring layers: + +| Layer | Tool | Purpose | When to use | +|---|---|---|---| +| **Continuous monitoring** | Grafana, Loki, Prometheus, Alertmanager | Real-time dashboards, log search, metric collection, alerting | Daily operations, capacity planning, alert-driven response | +| **On-demand diagnostics** | Self-Hosted Controller (SHC) CLI | Targeted health checks on cluster, applications, databases, and resources | Incident triage, post-deployment validation, troubleshooting | + +In practice, you will spend most of your time in Grafana — dashboards and alerts surface issues as they happen. The SHC CLI is the tool you reach for when an alert fires, when a deployment looks wrong, or when you need a structured snapshot of platform state. + +--- + +## Continuous monitoring: Grafana, Loki, Prometheus + +The observability stack is deployed as part of the platform — no additional installation is required. + +### Built-in dashboards + +A set of Grafana dashboards is shipped with every release. They cover: + +- **Cluster health** — node status, control-plane availability, Kubernetes events +- **Resource utilization** — CPU, memory, disk, and network usage per node and namespace +- **Throughput** — event ingestion rate, processing latency, queue depth +- **Error rates** — service-level error counts, HTTP status distribution +- **Security events** — authentication failures, audit-log highlights + +You can customize these dashboards or add your own to fit your monitoring conventions. + +### Centralized logs + +Promtail agents run on every node and ship all container logs to a Loki cluster. Use Grafana's **Explore** view to search, filter, and inspect logs by service, namespace, pod, or time range. + +Typical investigation flow: +1. Open Grafana → **Explore** → select the Loki data source +2. Filter by `namespace` and `pod` (or `app` label) +3. Narrow the time range to the incident window +4. 
Refine with substring or regex filters + +### Metrics and alerting + +Prometheus collects system- and application-level metrics across the platform and evaluates the alerting rules. Alertmanager groups and routes the resulting alerts and dispatches notifications. + +- All metrics and alerts are visualized in Grafana. +- You can tune the default thresholds or add new rules to match your SLAs. +- Notifications can be routed to email (via the SMTP server configured at install time) or to any Alertmanager-compatible receiver. + +### Retention and external forwarding + +You can adjust log and metric retention to meet your compliance requirements. The stack also supports forwarding to external observability systems: + +- **Metrics**: via Prometheus `remote_write` +- **Logs**: via the Loki HTTP API + +--- + +## On-demand diagnostics: Self-Hosted Controller (SHC) + +The Self-Hosted Controller (SHC) is a CLI that runs a set of targeted health checks against the cluster, ArgoCD, databases, and the Kubernetes Metrics API. It is the right tool for incident triage and structured platform snapshots. + +### Prerequisites + +- SHC commands are executed from the **admin node** provisioned during installation (see [Deployment prerequisites](../deployment/deployment_prerequisites.md)). +- All commands are invoked via the `run-shc.sh` wrapper script using the `exec` subcommand. +- The admin node must have network access to the Kubernetes API of the target cluster. +- For `DebugResourceAllocation`, the Kubernetes **Metrics Server** must be deployed. If it is not, the command exits with an actionable error; run the `HelmInstall` module first. 
+ +### When to use SHC vs Grafana + +| Situation | Use | +|---|---| +| Routine monitoring, trend analysis, alert review | Grafana | +| An alert just fired and you need to confirm what's broken | SHC (`DebugArgoCD` first) | +| You just deployed or upgraded and want a clean health snapshot | SHC (full workflow, see below) | +| You want to inspect log content in detail | Grafana / Loki | +| You suspect over- or under-provisioned memory requests | SHC (`DebugResourceAllocation`) | + +### Cluster node health + +```bash +./run-shc.sh exec CheckKubernetesCluster +``` + +Connects to the Kubernetes API and performs two checks: + +- The number of nodes in the cluster matches the number of unique hosts in the Ansible inventory. +- Every node reports a `Ready=True` condition. + +On success: + +``` +Kubernetes cluster is healthy nodes=3 expected=3 +``` + +Run this first to confirm the cluster itself is healthy before investigating application-level issues. + +### Application status + +```bash +./run-shc.sh exec DebugArgoCD +``` + +Connects to the ArgoCD API and renders three terminal panels in sequence. + +**Panel 1 — Repositories** + +Lists every Git or Helm repository registered in ArgoCD: + +| Column | Description | +|---|---| +| Name | Repository alias | +| URL | Remote URL | +| Type | `git` or `helm` | + +**Panel 2 — Root application** + +Displays the sync and health status of the root app-of-apps as a single-line summary: + +``` +Sync: Synced | Health: Healthy +``` + +**Panel 3 — Applications** + +Lists all managed ArgoCD applications, sorted alphabetically: + +| Column | Values | +|---|---| +| Name | Application name (with `-on-self-hosted` suffix stripped) | +| Sync Status | `Synced` (green) · `OutOfSync` (red) · `Unknown` (dim) | +| Health Status | `Healthy` (green) · `Progressing` (yellow) · `Degraded` / `Missing` (red) · `Suspended` (cyan) | + +**Reading the output:** + +- A row that is green on both columns is nominal. 
+- `OutOfSync` means the live cluster state diverges from the ArgoCD git source — run `DebugArgoCDSyncAll` to recover. +- `Progressing` is transient and expected during deployments; if it persists beyond a few minutes, check the pod logs for the affected application. +- `Degraded` or `Missing` indicates a service that failed to start or whose resources could not be created — use `DebugDatabases` and pod logs to investigate further. + +> **Tip:** This is the first command to run when investigating a broken or partially deployed platform. + +### Database health + +```bash +./run-shc.sh exec DebugDatabases +``` + +Inspects the `support` namespace and renders two tables covering all database workloads. + +**StatefulSets table** + +Covers non-CNPG databases (e.g. Redis, ClickHouse, MinIO): + +| Column | Description | +|---|---| +| Name | StatefulSet name | +| Expected | Declared replica count | +| Ready | Currently ready replicas | +| Status | `Healthy` (green) · `Warning` (yellow) · `Unhealthy` (red) | +| Restarts | Total container restarts across all pods | +| Last Restart | Human-readable age of the most recent restart (e.g. `5m ago`) | +| Issues | Per-pod detail: `not ready`, `waiting (CrashLoopBackOff)`, `crashed Xm ago` | + +**CNPG Clusters table** + +Same columns for all CloudNativePG (`postgresql.cnpg.io/v1`) clusters (platform PostgreSQL instances). + +**Reading the output:** + +- **Healthy** — all replicas are running and ready, no recent restarts. +- **Warning** — all replicas are running, but recent restarts (within the past 10 minutes) or waiting containers were detected. Monitor and re-run to see if the issue clears. +- **Unhealthy** — at least one replica is not ready or not in `Running` phase. Check the `Issues` column for the specific pod and reason, then inspect logs with `kubectl logs -n support `. 
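To make the three status levels concrete, here is a small sketch of the decision rule described above. It is an illustration written for this guide, not the SHC implementation: the function name and its four numeric arguments are assumptions, and the `Running` phase check is folded into the ready count for simplicity.

```bash
#!/bin/sh
# Hypothetical re-statement of the status rule; not taken from the SHC source.
# Usage: db_status EXPECTED READY RECENT_RESTARTS WAITING
#   EXPECTED         declared replica count
#   READY            replicas currently ready
#   RECENT_RESTARTS  restarts within the past 10 minutes
#   WAITING          containers currently in a waiting state
db_status() {
    expected=$1
    ready=$2
    recent_restarts=$3
    waiting=$4
    if [ "$ready" -lt "$expected" ]; then
        # At least one replica is not ready.
        echo "Unhealthy"
    elif [ "$recent_restarts" -gt 0 ] || [ "$waiting" -gt 0 ]; then
        # Everything is ready, but something restarted recently or is waiting.
        echo "Warning"
    else
        echo "Healthy"
    fi
}
```

For example, `db_status 3 3 1 0` prints `Warning`: all three replicas are ready, but one restart happened inside the 10-minute window, so the workload is worth re-checking a few minutes later.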
+ +### Resource usage + +```bash +./run-shc.sh exec DebugResourceAllocation +``` + +Queries the Kubernetes Metrics API (`metrics.k8s.io/v1beta1`) for live memory consumption across all namespaces and compares it against each pod's declared memory requests. + +> **Prerequisite:** Metrics Server must be deployed. If the Metrics API is unavailable, the command exits with an actionable error — run the `HelmInstall` module first. + +**RAM allocation — waste report** + +All pods with a memory request declared, sorted by wasted RAM (highest first): + +| Column | Description | +|---|---| +| Namespace | Kubernetes namespace | +| Pod | Pod name | +| Request | Declared `resources.requests.memory` | +| Usage | Live consumption from the Metrics API | +| Waste | Request − Usage | +| Waste % | Green < 50 % · Yellow ≥ 50 % · Red ≥ 80 % | + +**Pods without memory requests** + +A separate warning panel lists all pods for which no `resources.requests.memory` is set, alongside their current live usage. These pods are a scheduling risk — Kubernetes cannot make informed placement or eviction decisions for them. + +**Reading the output:** + +- Rows highlighted in red (≥ 80 % waste) indicate pods whose memory requests are significantly over-provisioned. Adjusting their requests downward reduces scheduling pressure across the cluster. +- Pods in the "Without Memory Requests" panel should be reviewed — if they consume significant memory without a declared request, they risk being evicted under node pressure. + +--- + +## Recommended workflows + +### Daily monitoring + +For continuous operations, Grafana is the primary surface: + +1. Review the **Cluster health** dashboard — confirm all nodes are Ready and the control plane is responsive. +2. Check the **Resource utilization** dashboard — watch for nodes or namespaces trending toward saturation. +3. Check the **Throughput** dashboard — verify ingestion rate is within expected range and no queue is backing up. +4. 
Review active alerts in Alertmanager and acknowledge or escalate as needed. + +If any dashboard surfaces an anomaly, switch to the SHC workflow below to confirm the platform state before investigating further. + +### Incident response + +When an alert fires or a dashboard shows degradation, run the following sequence for a complete health snapshot: + +```bash +# 1. Confirm all cluster nodes are ready +./run-shc.sh exec CheckKubernetesCluster + +# 2. Check application sync and health status +./run-shc.sh exec DebugArgoCD + +# 3. Inspect database availability +./run-shc.sh exec DebugDatabases + +# 4. Review resource allocation efficiency +./run-shc.sh exec DebugResourceAllocation +``` + +This order is deliberate: cluster → applications → databases → resources. Each step rules out a layer before moving on, so when something fails you know roughly where to focus. + +If applications appear `OutOfSync` or `Degraded` after running `DebugArgoCD`, see [Troubleshooting and debugging](../troubleshooting/debug_tool.md) for recovery procedures. \ No newline at end of file diff --git a/docs/self_hosted/release_notes/0.0.1.md b/docs/self_hosted/release_notes/0.0.1.md new file mode 100644 index 0000000000..6bac9f17eb --- /dev/null +++ b/docs/self_hosted/release_notes/0.0.1.md @@ -0,0 +1,71 @@ +# Release Notes v0.0.1 (MVP) + +This release introduces the initial version of Sekoia Self-Hosted, providing a robust security operations foundation for disconnected and regulated environments. + +## New features + +* **Air-gap support**: Enable full deployment and operational capabilities in restricted or isolated environments with no external connectivity. +* **Deployment CLI**: Access a unified orchestration tool designed to manage the platform lifecycle, from initialization to upgrades. + +## Technical foundation + +* **Kubernetes stack**: The platform is built on a lightweight and certified K3s distribution for optimized resource management. 
+* **OS Support**: This release is officially certified for deployment on **Debian 11 (Bullseye)**. + +## Functional scope + +The functional scope of Sekoia Self-Hosted aligns with the **Defend Core** subscription, with specific exceptions related to air-gapped environment constraints. + +| Feature | Available | Description | +| :--- | :--- | :--- | +| **Meta-playbooks** | Yes | Supports advanced automation workflows. | +| **OC Notifications** | Yes | Operations Center notification system. | +| **Observable Tags Enrichment** | No | Automatic enrichment of events with observable tags. | +| **Cloud-to-Cloud Ingestion** | No | Not supported for air-gapped deployments. | +| **Encrypted Ingestion** | Yes | Supports Syslog TLS, RELP TLS, and HTTPS ingestion. | +| **Custom Intake Formats** | Yes | Allows creation and management of custom parsing formats. | +| **Sigma Correlation** | Yes | Full support for Sigma-based correlation rules. | +| **Playbooks** | Yes | Built-in automation and orchestration capabilities. | +| **Automatic Asset Discovery** | Yes | Identifies assets within the monitored perimeter. | +| **Retrohunt** | Yes | Search for past indicators in historical data. | +| **Anomaly Detection Engine** | Yes | Statistical and behavioral anomaly detection. | +| **Case Management** | Yes | Standard security incident tracking and management. | +| **Hot Storage** | Yes | High-performance storage for active investigation. | +| **Sekoia Endpoint Agent** | Yes | Support for host-level visibility and response. | +| **Contextualized Alerts** | No | Requires real-time CTI embedding (unsupported in air-gap). | +| **SOL Query Builder** | Yes | Visual and syntax-based search interface. | +| **Detection Rules** | Yes | Access to the standard Sekoia.io detection library. | +| **Event Drop Detection** | Yes | Monitoring of log ingestion continuity. | +| **Cases Custom Status** | Yes | Ability to define specific incident lifecycles. 
| +| **Investigation Graph** | Yes | Visual representation of security incidents and entities. | +| **Notebooks** | Yes | Collaborative workspaces for threat hunting. | +| **Sigma Pattern Validation** | Yes | Built-in syntax checking for Sigma rules. | +| **SOL Dataset** | Yes | Logical grouping of event data. | +| **Dashboard Filters** | Yes | Dynamic filtering for visualization modules. | +| **Roy Assistant** | No | The AI assistant is not compatible with air-gapped environments. | +| **Dashboards** | Yes | Customizable visual monitoring interfaces. | +| **APIs** | Yes | Full programmatic access to platform functions. | +| **Member Management** | Yes | RBAC and user administration. | +| **Usage Reporting** | Yes | Statistics on data volume and platform usage. | +| **Subscription Management** | Yes | Internal license and subscription tracking. | +| **SSO / MFA** | Yes | Integration with identity providers for secure access. | +| **Region Threat Telemetry** | Yes | Geographic-based threat visualization. | + +## Specific environment constraints + +Deploying in air-gapped or restricted environments introduces the following operational changes: + +### Threat Intelligence and Detection +While the standalone **Threat Intelligence (CTI)** research module is not available in air-gapped deployments, the platform remains fully powered by Sekoia.io intelligence. + +* **Detection Rules**: All rules (Sigma, patterns) are embedded in the release and fully operational. +* **CTI Context**: Live cloud-based enrichment and manual exploration of the CTI database (threat actors, malware, reports) are not supported without external connectivity. + +### Security content delivery +To ensure continuous protection, every product release includes the latest version of: + +* Sekoia detection rules. +* Integration connectors (Intake formats). 
+* Automation library (SOAR modules) + +This ensures your deployment remains up to date with the latest threat detection logic even without external connectivity. \ No newline at end of file diff --git a/docs/self_hosted/troubleshooting/debug_tool.md b/docs/self_hosted/troubleshooting/debug_tool.md new file mode 100644 index 0000000000..d542fc0b06 --- /dev/null +++ b/docs/self_hosted/troubleshooting/debug_tool.md @@ -0,0 +1,254 @@ +# Troubleshooting and Debugging + +A centralized set of diagnostic tools is available through the Self-Hosted Controller (SHC). These commands let you validate your configuration, inspect the health of your deployment, audit secrets, diagnose database issues, and identify resource pressure on the cluster. + +All commands below are executed via the `run-shc.sh` wrapper script using the `exec` subcommand, which instantiates the named module inside the controller container and runs it to completion. + +--- + +## Configuration Validation + +### Validate the local configuration file + +```bash +./run-shc.sh exec CheckLocalConfig +``` + +Loads `config.yml` and validates every key against the schema defined in `help.yaml`. The check covers: + +- **Missing required fields** — any key marked `required: true` in the schema that is absent from your config will be reported with an error log line. +- **Format mismatches** — values that do not match the expected regex pattern are flagged individually. +- **Schema integrity** — ensures no required key is hidden from the help output (which would indicate an authoring error in `help.yaml`). + +If all checks pass, the command exits cleanly with: + +``` +Validating configuration... 
+Configuration is valid entries_checked= +``` + +If errors are found, each is logged individually before the command exits with a non-zero status: + +``` +Missing required config key=global.version.platform.version +Format mismatch key=utils.oci_registry.url value=… expected_format=^https?://… +Configuration validation failed: 2 error(s) found +``` + +### Inspect resolved environment variables (verbose mode) + +```bash +./run-shc.sh -v exec CheckLocalConfig +``` + +Adding `-v` enables debug-level logging. Before running the schema validation, the controller dumps the **fully resolved** in-memory config tree — including every `env.VAR_NAME` value substituted with its actual environment variable content. This is the most reliable way to confirm that secrets injected via environment variables (e.g. `REGISTRY_PASSWORD`, `SERVERS_SSH_KEY`) are correctly loaded into the controller process. + +--- + +## Infrastructure Connectivity + +### Check SSH connectivity to all servers + +```bash +./run-shc.sh exec CheckServersAreReachable +``` + +Uses Ansible to run a `ping` playbook against all hosts declared in `utils.ansible.inventory`. This confirms that the controller can open an SSH connection to every node, which is a prerequisite for any Ansible-based operation (installation, node reboots, etc.). + +On success: + +``` +Pinging all configured servers via Ansible... +All configured servers are reachable +``` + +On failure, the Ansible output is surfaced and the command exits with a non-zero status indicating which host(s) could not be reached. + +### Check Kubernetes cluster health + +```bash +./run-shc.sh exec CheckKubernetesCluster +``` + +Fetches the kubeconfig, connects to the cluster API, and verifies: + +- The number of nodes in the cluster matches the number of unique hosts in the Ansible inventory. +- Every node has a `Ready=True` condition. 
+ +On success: + +``` +Kubernetes cluster is healthy nodes=3 expected=3 +``` + +Fails immediately if any node is not Ready, or if the node count does not match the inventory (which may indicate a node that joined or was evicted without being reflected in the config). + +--- + +## Local Artifact Checks + +### Verify release files on disk + +```bash +./run-shc.sh exec CheckLocalReleaseFiles +``` + +Iterates over every entry under `global.version` in your config (e.g. `platform`, `data.detection-rules`) and checks that: + +1. The base directory (`path`) exists. +2. The version subdirectory (`path/`) exists and is non-empty. + +Useful after a `DownloadReleaseFiles` run to confirm that all expected archives are present before attempting an installation or update. + +### Verify git repository access + +```bash +./run-shc.sh exec CheckLocalGit +``` + +Clones the configured `utils.git.repo_url` into a temporary directory and tests both **pull** and **push** access using the configured credentials. Use this to validate that the git token has sufficient permissions before running any module that writes ArgoCD stacks to the repository. + +### Verify OCI registry access + +```bash +./run-shc.sh exec CheckLocalOCIRegistry +``` + +Connects to `utils.oci_registry.url` and verifies **push**, **pull**, and **delete** access. Push is tested first; if it fails, pull and delete are skipped and reported as untested. Use this to confirm registry credentials before pushing Helm charts or images. 
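The configuration, connectivity, and local artifact checks above are natural candidates for a single pre-installation pass. The helper below is a sketch under two assumptions: `run_checks` is a hypothetical name that is not shipped with SHC, and the wrapper is expected to return a non-zero exit status whenever a module fails (documented above for `CheckLocalConfig` and `CheckServersAreReachable`, assumed for the others).

```bash
#!/bin/sh
# Hypothetical helper; not part of SHC.
# Usage: run_checks RUNNER MODULE...
# Runs each module through "RUNNER exec MODULE" and stops at the first failure.
run_checks() {
    runner=$1
    shift
    for module in "$@"; do
        echo "==> $module"
        if ! "$runner" exec "$module"; then
            echo "FAILED: $module" >&2
            return 1
        fi
    done
    echo "All checks passed"
}

# Intended use on the admin node (shown as a comment, not executed here):
#   run_checks ./run-shc.sh CheckLocalConfig CheckServersAreReachable \
#       CheckKubernetesCluster CheckLocalReleaseFiles CheckLocalGit \
#       CheckLocalOCIRegistry
```

Stopping at the first failure keeps the output short and mirrors the layered logic of the checks: there is little value in testing registry access while the configuration file itself is still invalid.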
+ +--- + +## Service and Application Health + +### ArgoCD status dashboard + +```bash +./run-shc.sh exec DebugArgoCD +``` + +Connects to the ArgoCD API and renders three tables in the terminal: + +| Table | Content | +|---|---| +| **Repositories** | Name, URL, and type of every registered ArgoCD repository | +| **Root Application** | Sync and health status of the root app-of-apps | +| **Applications** | Per-application sync status and health status, sorted alphabetically | + +Sync statuses are colour-coded: `Synced` in green, `OutOfSync` in red. Health statuses: `Healthy` in green, `Progressing` in yellow, `Degraded`/`Missing` in red. + +This is the first command to run when investigating a broken or partially deployed platform. + +### Force a full re-synchronisation + +```bash +./run-shc.sh exec DebugArgoCDSyncAll +``` + +Triggers a three-phase synchronisation of **all** ArgoCD applications in parallel: + +1. **Partial sync** — for each application, syncs only resources whose `kind` matches a configurable regex (default: `secretgenerator|configmap`). This refreshes ConfigMaps and SecretGenerator objects before secrets are regenerated. +2. **Operator restart** — rolls out the `sekoiaio-secret-operator` deployment in the `support` namespace and waits for it to come back up (default: 60 s), ensuring it picks up the refreshed SecretGenerators. +3. **Full sync** — syncs every application in its entirety. + +Use this when applications are stuck in `OutOfSync`, after a manual change to the ArgoCD git repository, or after a platform upgrade. + +> **Note:** The sync timeout per application defaults to 300 s and up to 32 applications are synced concurrently. Both values can be overridden in `config.yml` under `modules.debug_argocd_sync_all`. 
+ +--- + +## Secret Diagnostics + +### Audit missing or incomplete secrets + +```bash +./run-shc.sh exec DebugMissingSecrets +``` + +Compares the desired state (every `SecretGenerator` CRD declared via `secretoperator.sekoia.io/v1alpha1`) against the actual Kubernetes `Secret` objects present in the cluster, and reports: + +- Secrets that are entirely missing (the CRD exists but no `Secret` was created). +- Secrets that exist but have incomplete keys. +- For each missing key: the vault path where the value is expected to come from. + +Additionally, the module runs a `platform-installer dumpconfig` job and cross-references each missing secret against the platform-installer configuration, helping distinguish between secrets that were never defined versus secrets that failed to be generated. + +### Scan ArgoCD stacks for unrendered template placeholders + +```bash +./run-shc.sh exec DebugKustomizeStacksTemplates +``` + +Clones (or pulls) the ArgoCD git repository and recursively scans every YAML file for values that still contain the `SH_TMPL` marker — the sentinel string used by the stack templater to indicate a value that should have been substituted. For each match, the following is logged: + +- File path and document index within a multi-document YAML file +- Kubernetes `kind` and `metadata.name` of the affected object +- Dot-separated YAML path to the offending field and its current value + +A summary of occurrences per YAML path (sorted by frequency) is printed at the end. Use this when services fail to start due to misconfigured values that were not properly rendered during installation. + +--- + +## Database Diagnostics + +### Inspect database pod health + +```bash +./run-shc.sh exec DebugDatabases +``` + +Inspects the `support` namespace and renders two tables: + +**StatefulSets** — covers non-CNPG databases (e.g. Redis, ClickHouse): + +| Column | Description | +|---|---| +| Name | StatefulSet name | +| Expected / Ready | Declared vs. 
ready replica count | +| Status | `Healthy` / `Warning` / `Unhealthy` (colour-coded) | +| Restarts | Total container restarts across all pods | +| Last Restart | Human-readable age of the most recent restart (e.g. `5m ago`) | +| Issues | Per-pod issues: `not ready`, `waiting (CrashLoopBackOff)`, `crashed Xm ago`, etc. | + +**CNPG Clusters** — same columns for all CloudNativePG `postgresql.cnpg.io/v1` clusters (e.g. the platform PostgreSQL instances). + +A pod is flagged as having a **recent restart** if its last termination timestamp is within the past 10 minutes. Status is `Unhealthy` if any replica is not ready or not in `Running` phase; `Warning` if all replicas are running but recent restarts or waiting containers were detected. + +--- + +## Cluster Resource Management + +### RAM allocation waste report + +```bash +./run-shc.sh exec DebugResourceAllocation +``` + +Queries the Kubernetes Metrics API (`metrics.k8s.io/v1beta1`) for live memory consumption across all namespaces, then compares it against each pod's declared memory `requests`. The output is two tables: + +**RAM Allocation — Waste Report** — all pods that declare a memory request, sorted by wasted RAM descending: + +| Column | Description | +|---|---| +| Namespace / Pod | Pod identity | +| Request | Declared memory request | +| Usage | Live memory consumption from the Metrics API | +| Waste | `request − usage` | +| Waste % | Colour-coded: green < 50 %, yellow ≥ 50 %, red ≥ 80 % | + +**Pods Without Memory Requests** — pods for which no `resources.requests.memory` is set, shown with their current usage. These pods are a scheduling risk as Kubernetes cannot make informed placement decisions for them. + +> **Prerequisites:** The Metrics Server must be deployed in the cluster. If the Metrics API is unavailable, the command exits with an error suggesting you run the `HelmInstall` module first. 
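The waste columns follow simple arithmetic, which the sketch below reproduces for a single pod. It is an illustration only: `waste_report` is a hypothetical name, both inputs are assumed to be whole MiB values with a non-zero request, and the percentage is assumed to be waste divided by request, rounded down.

```bash
#!/bin/sh
# Hypothetical re-statement of the report arithmetic; not taken from the SHC source.
# Usage: waste_report REQUEST_MIB USAGE_MIB
# Prints "<waste in MiB> <waste %> <colour>".
waste_report() {
    request=$1
    usage=$2
    waste=$((request - usage))
    # Integer percentage of the request that is unused (assumes request > 0).
    pct=$((waste * 100 / request))
    if [ "$pct" -ge 80 ]; then
        colour=red
    elif [ "$pct" -ge 50 ]; then
        colour=yellow
    else
        colour=green
    fi
    echo "$waste $pct $colour"
}
```

A pod requesting 1024 MiB while using 128 MiB yields `waste_report 1024 128`, which prints `896 87 red`: 896 MiB of the request is idle, so lowering the request would free schedulable memory on the node.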
+ +--- + +## Platform Installer Debug + +### Launch a paused installer job + +```bash +./run-shc.sh exec DebugPlatformInstallation +``` + +Deploys the `platform-installer` Helm chart with a `pause` command override instead of the normal install/update sequence. The job pod starts and stays alive without performing any changes, giving you an interactive shell to inspect the installer environment, mounted secrets, and configuration files. Any stale release from a previous debug session is cleaned up automatically before deploying. + +Use this when you need to manually inspect or test the platform-installer's runtime context without triggering a full installation. diff --git a/mkdocs.yml b/mkdocs.yml index 093ef33ffc..9619fbdf09 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -75,6 +75,21 @@ nav: - Roy AI Assistant: getting_started/ai_assistant.md - Best practices: getting_started/best_practices.md - Troubleshooting tips: getting_started/get_troubleshooting_tips.md +- Sekoia Self Hosted: + - Introduction: self_hosted/index.md + - Architecture: + - Reference Architecture: self_hosted/architecture/architecture.md + - Deployment: + - Deployment Process: self_hosted/deployment/deployment_process.md + - Deployment Prerequisites: self_hosted/deployment/deployment_prerequisites.md + - Deployment Guide: self_hosted/deployment/deployment_guide.md + - Deployment Configuration: self_hosted/deployment/deployment_configuration.md + - Monitoring: + - Monitoring Guide: self_hosted/monitoring/monitoring_guide.md + - Troubleshooting: + - Debug Tool: self_hosted/troubleshooting/debug_tool.md + - Release Notes: + - 0.0.1: self_hosted/release_notes/0.0.1.md - Sekoia Defend (XDR): - Introduction: xdr/index.md - Quick start guide: xdr/xdr_quick_start.md