From 5da8c5ca1dd89b51e014db0da396de366af4e00d Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Fri, 13 Mar 2026 17:04:21 +0100 Subject: [PATCH 1/9] initial commit --- docs/self_hosted/architecture/architecture.md | 19 ++++++++++ .../deployment/deployment_configuration.md | 19 ++++++++++ .../deployment/deployment_guide.md | 37 +++++++++++++++++++ .../deployment/deployment_prerequisites.md | 30 +++++++++++++++ .../deployment/deployment_process.md | 13 +++++++ docs/self_hosted/index.md | 34 +++++++++++++++++ docs/self_hosted/monitoring/metrics.md | 12 ++++++ .../monitoring/monitoring_guide.md | 16 ++++++++ docs/self_hosted/release_notes/0.1.0.md | 13 +++++++ .../self_hosted/troubleshooting/debug_tool.md | 16 ++++++++ mkdocs.yml | 16 ++++++++ 11 files changed, 225 insertions(+) create mode 100644 docs/self_hosted/architecture/architecture.md create mode 100644 docs/self_hosted/deployment/deployment_configuration.md create mode 100644 docs/self_hosted/deployment/deployment_guide.md create mode 100644 docs/self_hosted/deployment/deployment_prerequisites.md create mode 100644 docs/self_hosted/deployment/deployment_process.md create mode 100644 docs/self_hosted/index.md create mode 100644 docs/self_hosted/monitoring/metrics.md create mode 100644 docs/self_hosted/monitoring/monitoring_guide.md create mode 100644 docs/self_hosted/release_notes/0.1.0.md create mode 100644 docs/self_hosted/troubleshooting/debug_tool.md diff --git a/docs/self_hosted/architecture/architecture.md b/docs/self_hosted/architecture/architecture.md new file mode 100644 index 0000000000..2f5a45cab3 --- /dev/null +++ b/docs/self_hosted/architecture/architecture.md @@ -0,0 +1,19 @@ +# Reference Architecture + +The Sekoia Self-Hosted architecture is built for resilience and scalability within a customer-managed perimeter. 
+ +## Core components +The platform is delivered as a single digitally signed archive containing all necessary applications: + +* **Deployer**: Handles the Kubernetes installation and support services. +* **Microservices**: Includes the core platform logic and Sekoia applications. +* **Databases**: Integrated data persistence layers. +* **Rules Catalog**: The collection of Sekoia detection rules. + +## High availability logic +The architecture is designed for production-grade reliability: + +* **Redundancy**: Built-in component redundancy to prevent single points of failure. +* **Self-healing**: Automated recovery mechanisms for continuous operations. +* **Scalability**: Support for multi-node compute clusters to handle high-volume ingestion. + diff --git a/docs/self_hosted/deployment/deployment_configuration.md b/docs/self_hosted/deployment/deployment_configuration.md new file mode 100644 index 0000000000..11ab6f27d2 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_configuration.md @@ -0,0 +1,19 @@ +# Deployment configuration + +The `customer.yml` file is the central manifest used to describe your specific environment and service requirements. + +## Manifest parameters +The following sections must be defined in your configuration file: + +### Infrastructure settings +* **Nodes**: List of IP addresses for compute and GPU servers. +* **DNS**: Domain names and resolution settings for the platform. +* **Load Balancer**: Target addresses for Web UI and ingestion flows. + +### Service settings +* **SMTP**: Configuration for system notifications and alerts. +* **Feature Toggles**: Activation of specific modules, such as AI features. +* **Resource Quotas**: Scaling parameters and node counts based on your sizing requirements. + +!!! note "Validating the Manifest" + The deployment CLI validates the syntax and integrity of the `customer.yml` file during the preflight stage. 
\ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_guide.md b/docs/self_hosted/deployment/deployment_guide.md new file mode 100644 index 0000000000..9d86595ba6 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_guide.md @@ -0,0 +1,37 @@ +# Deploy the platform + +This guide describes how to install the Sekoia Self-Hosted platform using the orchestration CLI. + +## Prerequisites +* An orchestration node running Debian 12 with Python 3.11. +* A completed `customer.yml` manifest. +* Access to the digitally signed Sekoia installer archive. + +## Procedure + +### 1. Extract the installer +To begin, set up the deployment environment on your orchestration node. + +1. Navigate to your installation directory. +2. Extract the signed archive. +3. Initialize the deployment CLI. + +### 2. Run preflight checks +You must validate the environment before proceeding with the installation. + +1. Execute the CLI `check` command. +2. Review the output to confirm all hardware and network requirements are met. +3. Resolve any reported failures. + +!!! warning "Preflight block" + The installation will not proceed if critical validation checks fail. + +### 3. Launch orchestration +Trigger the deployment to configure the platform components. + +1. Load the `customer.yml` manifest. +2. Run the `deploy` command. +3. Wait for the final convergence report. + +## Results +The Sekoia platform is now operational. You can access the interface via the load balancer URL provided in your configuration. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_prerequisites.md b/docs/self_hosted/deployment/deployment_prerequisites.md new file mode 100644 index 0000000000..4687493601 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_prerequisites.md @@ -0,0 +1,30 @@ +# Self-Hosted technical requirements + +Sekoia Self-Hosted requires specific infrastructure, network, and operational readiness. 
As a customer-operated platform, you are responsible for the end-to-end lifecycle. + +## General requirements +Ensure your organization meets these operational prerequisites: + +* **Operations team**: You must maintain a dedicated team responsible for deployment, configuration, and monitoring. +* **Infrastructure management**: You must be able to provision and manage your own infrastructure, whether on-premises or via virtual machines. +* **Licensing**: You must possess a valid license defining the maximum supervised assets and event-ingest capacity. + +## Hardware footprint +The following specifications are estimated for a deployment of approximately 500GB/day. + +| Component | Specification | +| :--- | :--- | +| **Compute servers (6x)** | 44 CPU (3.2 GHz min), 128GB RAM, 2TB SSD, Debian 11 | +| **GPU server (1x)** | NVIDIA H100 dedicated for AI processing | +| **Storage** | S3-compatible bucket | +| **Admin node (1x)** | Debian 12, Python 3.11, 4 CPU / 8GB RAM | + +!!! warning "Consult Sekoia for sizing" + These figures are estimates provided for guidance. Contact Sekoia for a full requirements review before you commence deployment. + +## Network and connectivity +Configure the internal network to allow communication between platform components: + +* **DNS and Time**: Provide DNS resolution for all nodes and a synchronized NTP infrastructure. +* **Load balancer**: Configure a load balancer for Web UI access (Ports 80/443) and event ingestion (Ports 10514/11514). +* **Internal traffic**: Allow communication for Kubernetes management (Ports 6443, 2379-2380) and VXLAN overlay (Port 8472). 
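As a quick sanity check of the port requirements above, the following sketch probes TCP ports from the orchestration node. It is an illustration, not part of the Sekoia tooling: the function name `check_tcp_port` is hypothetical, and it assumes bash's `/dev/tcp` redirection and the coreutils `timeout` command are available. The VXLAN overlay on port 8472 uses UDP, so a TCP probe does not cover it.

```shell
#!/bin/bash
# Hypothetical helper: report whether a TCP port on a node accepts connections.
check_tcp_port() {
    local host="$1" port="$2"
    # bash opens a TCP connection when a redirection targets /dev/tcp/<host>/<port>;
    # timeout bounds the attempt so a filtered port cannot hang the check.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open ${host}:${port}"
    else
        echo "unreachable ${host}:${port}"
        return 1
    fi
}
```

For example, `for port in 443 6443 10514 11514; do check_tcp_port 10.0.0.10 "$port"; done` would probe the Web UI, Kubernetes API, and ingestion ports on a node, where `10.0.0.10` is a placeholder address.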
\ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_process.md b/docs/self_hosted/deployment/deployment_process.md new file mode 100644 index 0000000000..f04cb33d96 --- /dev/null +++ b/docs/self_hosted/deployment/deployment_process.md @@ -0,0 +1,13 @@ +# The Deployment process + +The Sekoia Self-Hosted platform uses an idempotent, agentless CLI to manage the platform lifecycle. This process unifies installation, configuration, and drift reconciliation. + +## Core principles +The deployment engine operates on three main pillars: + +* **Preflight validation**: The CLI verifies OS versions, network connectivity, and credentials before any change is applied. +* **Declarative configuration**: You define the desired state of your infrastructure in a `customer.yml` manifest. +* **Orchestration**: A core engine computes the difference between the actual and desired state to execute only necessary tasks. + +## Artifact delivery +Each release is delivered as a single digitally signed archive that guarantees tamper evidence and verifiability. This archive contains the core platform and Sekoia applications, including the deployer, Kubernetes stack, and microservices. \ No newline at end of file diff --git a/docs/self_hosted/index.md b/docs/self_hosted/index.md new file mode 100644 index 0000000000..6261e85b97 --- /dev/null +++ b/docs/self_hosted/index.md @@ -0,0 +1,34 @@ +# Sekoia Self Hosted + +Sekoia Self-Hosted is a customer-operated deployment of the Sekoia AI SOC platform designed for environments with high regulatory, sovereignty, or connectivity constraints. It provides cloud-grade detection and automation capabilities while keeping data processing and infrastructure entirely under your authority. + +## Purpose and logic + +Highly regulated sectors often operate under legal or technical constraints that make standard SaaS models impractical. 
Sekoia Self-Hosted addresses these challenges by allowing you to maintain full control over your security operations. + +The platform ensures: + +* **Data Sovereignty**: Data remains within your infrastructure to meet residency and sovereignty laws. +* **Network Independence**: Operations can run in restricted or fully air-gapped networks with no external connectivity. +* **Infrastructure Control**: You manage the deployment, monitoring, and lifecycle of the platform on your own infrastructure. + +## Key benefits + +* **Regulatory Compliance**: The platform is compatible with national and sector-specific regulations, including public procurement frameworks. +* **Enterprise-Grade Operations**: You can provision and scale environments with automated orchestration and built-in autoscaling. +* **High Availability**: The architecture includes redundancy and self-healing capabilities to ensure continuous security operations. +* **Decoupled Intelligence**: While the technical foundation follows a structured release cycle, security content and threat intelligence are updated daily to minimize intelligence gaps. + +## Intended audience + +Sekoia Self-Hosted is specifically designed for organizations that require maximum control over their security stack. + +Target organizations include: + +* **Highly Regulated Industries**: Sectors such as finance, government, and healthcare. +* **Sovereign Entities**: Organizations with strict national data requirements or those using sovereign service providers. +* **Restricted Environments**: Teams operating in air-gapped or classified infrastructures. +* **Large Enterprises**: Organizations running dedicated security operations with an internal platform team or a trusted MSSP. + +!!! note "Operational Responsibility" + Sekoia Self-Hosted requires a dedicated operations team to manage the end-to-end lifecycle, including deployment, configuration, and monitoring. 
\ No newline at end of file diff --git a/docs/self_hosted/monitoring/metrics.md b/docs/self_hosted/monitoring/metrics.md new file mode 100644 index 0000000000..2971137d2e --- /dev/null +++ b/docs/self_hosted/monitoring/metrics.md @@ -0,0 +1,12 @@ +# Platform metrics + +The platform exposes technical metrics to monitor the health of the underlying Kubernetes cluster and microservices. + +## Key performance indicators +| Metric | Description | +| :--- | :--- | +| **Node Health** | CPU, memory, and disk utilization per node. | +| **Pod Status** | Availability and restart counts for microservices. | +| **Ingestion Volume** | Daily GB ingested vs. licensed capacity. | +| **Detection Latency** | Time taken between event ingestion and rule matching. | + diff --git a/docs/self_hosted/monitoring/monitoring_guide.md b/docs/self_hosted/monitoring/monitoring_guide.md new file mode 100644 index 0000000000..d6070f4eb5 --- /dev/null +++ b/docs/self_hosted/monitoring/monitoring_guide.md @@ -0,0 +1,16 @@ +# Monitor cluster health + +Sekoia Self-Hosted includes integrated tools to provide real-time visibility into cluster health and performance. + +## Built-in dashboards +You can access monitoring dashboards within the platform to view: +* Cluster resource usage (CPU, RAM, Storage). +* Microservices status and logs. +* Event ingestion rates and processing latency. + +## Diagnostic collection +To troubleshoot performance issues, you can generate a diagnostic bundle using the deployment CLI. + +1. Connect to the orchestration node. +2. Run the CLI `diagnose` command. +3. Export the resulting log bundle for analysis. \ No newline at end of file diff --git a/docs/self_hosted/release_notes/0.1.0.md b/docs/self_hosted/release_notes/0.1.0.md new file mode 100644 index 0000000000..4ca2b68062 --- /dev/null +++ b/docs/self_hosted/release_notes/0.1.0.md @@ -0,0 +1,13 @@ +# Release Note 0.1.0 + +This is the initial release of the Sekoia Self-Hosted platform MVP. 
+ +## New features +* **Air-gap support**: Full deployment capability in restricted environments. +* **Deployment CLI**: Introduction of the unified orchestration tool. +* **Signed Artifacts**: Verifiable and tamper-evident delivery model. + +## Technical foundation +* **Kubernetes stack**: Based on K3s. +* **OS Support**: Certified for Debian 11. +* **AI SOC Core**: Inclusion of built-in dashboards and diagnostic tools. \ No newline at end of file diff --git a/docs/self_hosted/troubleshooting/debug_tool.md b/docs/self_hosted/troubleshooting/debug_tool.md new file mode 100644 index 0000000000..acf0609a24 --- /dev/null +++ b/docs/self_hosted/troubleshooting/debug_tool.md @@ -0,0 +1,16 @@ +# Troubleshooting with the Debug Tool + +The Sekoia Self-Hosted CLI includes specialized diagnostic tools to assist in troubleshooting deployment or operational issues. + +## Debug capabilities +The debug tool can perform the following actions: + +* **Connectivity checks**: Verify network paths between compute nodes and storage. +* **Integrity checks**: Confirm that all deployed artifacts match the expected signatures. +* **State reconciliation**: Identify configuration drift and force a synchronization. + +## Running a debug session +To initiate a debug session: +1. Access the orchestration node. +2. Run the CLI with the `debug` flag. +3. Review the automated report for any "failed" status indicators. 
\ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 093ef33ffc..2338b0c61f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -75,6 +75,22 @@ nav: - Roy AI Assistant: getting_started/ai_assistant.md - Best practices: getting_started/best_practices.md - Troubleshooting tips: getting_started/get_troubleshooting_tips.md +- Sekoia Self Hosted: + - Introduction: self_hosted/index.md + - Architecture: + - Reference Architecture: self_hosted/architecture/architecture.md + - Deployment: + - Deployment Process: self_hosted/deployment/deployment_process.md + - Deployment Prerequisites: self_hosted/deployment/deployment_prerequisites.md + - Deployment Configuration: self_hosted/deployment/deployment_configuration.md + - Deployment Guide: self_hosted/deployment/deployment_guide.md + - Monitoring: + - Monitoring Guide: self_hosted/monitoring/monitoring_guide.md + - Metrics: self_hosted/monitoring/metrics.md + - Troubleshooting: + - Debug Tool: self_hosted/troubleshooting/debug_tool.md + - Release Notes: + - 0.1.0: self_hosted/release_notes/0.1.0.md - Sekoia Defend (XDR): - Introduction: xdr/index.md - Quick start guide: xdr/xdr_quick_start.md From 1361a5f34de3969ddbeaedb4946c6c0b2f9d595f Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Wed, 29 Apr 2026 12:20:30 +0200 Subject: [PATCH 2/9] update deployment config --- docs/self_hosted/architecture/architecture.md | 5 + .../deployment/deployment_configuration.md | 191 ++++++++++++++++-- .../deployment/deployment_guide.md | 119 +++++++++-- .../deployment/deployment_prerequisites.md | 67 +++--- .../deployment/deployment_process.md | 22 +- docs/self_hosted/index.md | 2 +- docs/self_hosted/monitoring/metrics.md | 12 -- .../release_notes/{0.1.0.md => 0.0.1.md} | 5 +- mkdocs.yml | 5 +- 9 files changed, 341 insertions(+), 87 deletions(-) delete mode 100644 docs/self_hosted/monitoring/metrics.md rename docs/self_hosted/release_notes/{0.1.0.md => 0.0.1.md} (67%) diff --git a/docs/self_hosted/architecture/architecture.md 
b/docs/self_hosted/architecture/architecture.md index 2f5a45cab3..2b4e13b340 100644 --- a/docs/self_hosted/architecture/architecture.md +++ b/docs/self_hosted/architecture/architecture.md @@ -17,3 +17,8 @@ The architecture is designed for production-grade reliability: * **Self-healing**: Automated recovery mechanisms for continuous operations. * **Scalability**: Support for multi-node compute clusters to handle high-volume ingestion. +## Data Workflow + +* **Ingestion**: Event logs are sent to forwarders which point to a unique domain/port redirected to your Load Balancer. +* **Processing**: Microservices are decoupled to prevent cascading failures. +* **Storage**: Data is written to an S3-compatible storage managed by the customer. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_configuration.md b/docs/self_hosted/deployment/deployment_configuration.md index 11ab6f27d2..ecc9ca30ef 100644 --- a/docs/self_hosted/deployment/deployment_configuration.md +++ b/docs/self_hosted/deployment/deployment_configuration.md @@ -1,19 +1,184 @@ # Deployment configuration -The `customer.yml` file is the central manifest used to describe your specific environment and service requirements. +The `config.yml` file is the central manifest used to describe your specific environment and service requirements. -## Manifest parameters -The following sections must be defined in your configuration file: +## 1. Example `config.yml` -### Infrastructure settings -* **Nodes**: List of IP addresses for compute and GPU servers. -* **DNS**: Domain names and resolution settings for the platform. -* **Load Balancer**: Target addresses for Web UI and ingestion flows. 
+```yaml
+global:
+  dev: true
+  host: "app.sekoia.local"
+  alternative_hosts: "api.sekoia.local"
+  version:
+    platform:
+      version: "v0.0.1"
+      path: /opt/sekoia/platform/
+      skip_existing_local: true
+      push_workers: 4
+    data:
+      detection-rules:
+        version: "010126"
+        path: /opt/sekoia/data/
+      intake-formats:
+        version: "010126"
+        path: /opt/sekoia/data/
+      playbook-library:
+        version: "010126"
+        path: /opt/sekoia/data/
+      cti:
+        version: "010126"
+        path: /opt/sekoia/data/
+utils:
+  ansible:
+    datadir: "resources/ansible"
+    ssh-key: env.SERVERS_SSH_KEY
+    user: debian
+    password: ""
+    silent: false
+    verbosity: 2
+    inventory:
+      managers: ["TBD"]
+      workers: ["TBD"]
+  oci_registry:
+    url: https://registry.lab/
+    username: env.REGISTRY_USERNAME
+    password: env.REGISTRY_PASSWORD
+    check_repo: "TBD"
+    chart_repo: "TBD"
+    image_repo: "TBD"
+  git:
+    repo_url: ""
+    auth_method: "http"
+    http:
+      username: env.GIT_HTTP_USERNAME
+      password: env.GIT_HTTP_PASSWORD
+    ssh:
+      key_path: env.GIT_SSH_KEY_PATH
+  kubernetes:
+    autologin: true
+  platform_installer:
+    image: "?"
+modules:
+  k3s_install:
+    k3s_release: "v1.31.12+k3s1"
+    k3s_tls_san: []
+    kube_manager_fqdn: "TBD"
+    k3s_extra_args: ""
+    k3s_extra_labels: {}
+    k3s_extra_taints: []
+    registry_url: https://registry.lab
+    registry_subpath: "TBD"
+    registry_username: env.REGISTRY_USERNAME
+    registry_password: env.REGISTRY_PASSWORD
+    pull_images_with_proxy: true
+    k3s_no_proxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,.lab"
+    reboot_after_install: false
+  helm_install:
+    kube_manager_fqdn: "TBD"
+    forward_dns: "TBD"
+  push_argo_stacks:
+    repo_path: "/tmp/argo-stacks"
+    base_branch: "master"
+  wipe_storage:
+    enabled: true
+  platform_configuration:
+    config:
+      global:
+        host: "app.sekoia.local"
+        alternative_hosts: "api.sekoia.local"
+        delivery_host: "app.sekoia.local"
+      proxy:
+        http_proxy: "http://proxy.lab:3128"
+        https_proxy: "http://proxy.lab:3128"
+        no_proxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,.lab"
+      grafana:
+        root_url: "https://app.sekoia.local/grafana"
+      email:
+        email_sender: "noreply@sekoia.local"
+        smtp:
+          host: "mail.server.local"
+          user: "smtp-user"
+          password: "smtp-password"
+          port: "25"
+          tls: "False"
+          starttls: "True"
+      local_argocd:
+        repo_name: "self-hosted"
+        repo_url: "TBD"
+        helm_repo_url: "TBD"
+        git_username: env.GIT_HTTP_USERNAME
+        git_password: env.GIT_HTTP_PASSWORD
+        oci_username: env.REGISTRY_USERNAME
+        oci_password: env.REGISTRY_PASSWORD
+```
+
+
+## 2. Manifest parameters
+
+## 2.1 Global
+Defines the platform identity and version management.
+
+* **`dev`** (boolean): Enables development mode (e.g., debug logging).
+* **`host`** (string): The primary fully qualified domain name (FQDN) for the platform.
+* **`alternative_hosts`** (string): Additional FQDNs for API or auxiliary access.
+* **`version`**: Defines software and data bundle versions.
+    * **`platform`**: Platform release details.
+        * `version`: The release tag (e.g., "v0.0.1").
+        * `path`: Local absolute path on the admin node where the release archive is stored.
+        * `skip_existing_local`: Avoids re-copying existing files.
+        * `push_workers`: Number of parallel workers used when pushing images.
+    * **`data`**: Configuration for data modules (`detection-rules`, `intake-formats`, `playbook-library`, `cti`).
+        * Each sub-item requires a `version` tag and a local `path`.
+
+## 2.2 Utils
+Contains the infrastructure "plumbing" (credentials, paths, and registry settings).
+
+* **`ansible`**:
+    * `datadir`: Local path to Ansible inventory/roles.
+    * `ssh-key`: Reference to the SSH key (`env.SERVERS_SSH_KEY`).
+    * `user`: SSH user (typically `debian`).
+    * `password`: Sudo password (`env.SERVERS_SUDO_PASSWORD`).
+    * `inventory`: Lists of `managers` and `workers` node IPs/hostnames.
+* **`oci_registry`**: Container registry settings.
+    * `url`: The OCI registry URL.
+    * `username` / `password`: Auth credentials (`env.REGISTRY_USERNAME`, `env.REGISTRY_PASSWORD`).
+    * `check_repo`, `chart_repo`, `image_repo`: Repositories used for specific operations.
+* **`git`**: Authentication for ArgoCD repositories.
+    * `auth_method`: Authentication type (`http` or `ssh`).
+* **`platform_installer`**: URI for the controller installer image.
+
+## 2.3 Modules
+Configures the functional installation of the platform stacks.
+
+### k3s_install
+Handles the underlying Kubernetes layer.
+* **`k3s_release`**: The K3s release tag (e.g., "v1.31.12+k3s1").
+* **`registry_url` / `registry_subpath`**: Internal image registry configuration.
+* **`pull_images_with_proxy`**: Routes container image pulls through the HTTP proxy.
+* **`k3s_no_proxy`**: CIDR ranges and domains excluded from proxy traffic.
+* **`k3s_extra_args`**: Additional CLI flags for the K3s server.
+
+### helm_install
+Handles Kubernetes application deployment.
+
+* **`kube_manager_fqdn`**: FQDN of the primary manager.
+* **`forward_dns`**: DNS nameservers for CoreDNS.
+
+### push_argo_stacks
+
+* **`repo_path`**: Local directory of the Git checkout.
+* **`base_branch`**: Branch used to synchronize stacks (e.g., "master").
+
+### wipe_storage
+
+* **`enabled`**: If `true`, allows the controller to format disk storage.
+
+### platform_configuration
+Application-layer configuration.
+
+* **`global`**: Repeats FQDN definitions for the internal context.
+* **`proxy`**: Proxy settings for application traffic (http/https/no_proxy).
+* **`grafana`**: `root_url` for dashboard access.
+* **`email`**: SMTP configuration (`host`, `port`, `user`, `password`, `tls`, `starttls`).
+* **`local_argocd`**: Git and OCI credentials used by the internal ArgoCD instance to synchronize application manifests.
\ No newline at end of file
diff --git a/docs/self_hosted/deployment/deployment_guide.md b/docs/self_hosted/deployment/deployment_guide.md
index 9d86595ba6..6092c12e76 100644
--- a/docs/self_hosted/deployment/deployment_guide.md
+++ b/docs/self_hosted/deployment/deployment_guide.md
@@ -3,35 +3,116 @@
 This guide describes how to install the Sekoia Self-Hosted platform using the orchestration CLI.
 
 ## Prerequisites
-* An orchestration node running Debian 12 with Python 3.11.
-* A completed `customer.yml` manifest.
+* An orchestration node running Debian 11 (also compatible with 12/13) with Python 3.11 and Docker installed.
 * Access to the digitally signed Sekoia installer archive.
 
-## Procedure
+## Preparation work
 
-### 1. Extract the installer
-To begin, set up the deployment environment on your orchestration node.
+### 1. Download and Extract
 
-1. Navigate to your installation directory.
-2. Extract the signed archive.
-3. Initialize the deployment CLI.
+1. **Download**: Follow the dedicated procedure and use your specific credentials to download the Self-Hosted release.
+2. **Transfer**: Copy the archive to the admin node.
+3. **Integrity Check**: Always verify the archive checksum before proceeding.
+4. **Extract**:
+    ```bash
+    tar -xvf sekoia-archive.tar.gz
+    ```
 
-### 2. Run preflight checks
-You must validate the environment before proceeding with the installation.
+### 2. Initialize the Controller
 
-1. Execute the CLI `check` command.
-2. Review the output to confirm all hardware and network requirements are met.
-3. Resolve any reported failures.
+
+For the first installation, the Self-Hosted Controller (SHC) is not available as a service. You must load the Docker image manually to initialize the environment:
+
+```bash
+docker load -i "$SEKOIA_ARCHIVE/v0.0.1/images/self-hosted-controller.tgz"
+```
+
+### 3. Write your configuration file
+
+Prepare your `config.yml` manifest. This file acts as the single source of truth for your infrastructure, service, and network parameters. Ensure this file is ready before moving to the next step, as it is mounted into the container when the script runs.
+
+### 4. Create the Execution Script
+To simplify Sekoia installation commands and manage environment variables, create a local script called `run-shc.sh` with the following content. The script expects `CONFIG_HOST` (the host path of your `config.yml`) and `SEKOIA_ARCHIVE` (the directory of the extracted release) to be set in your shell:
+
+```bash
+#!/bin/bash
+DOCKER_IMAGE="sekoialab/self-hosted-controller-cli:latest"
+
+docker run --rm \
+    -e SERVERS_SUDO_PASSWORD="" \
+    -e SERVERS_SSH_KEY="$SERVERS_SSH_KEY" \
+    -e REGISTRY_USERNAME="$REGISTRY_USERNAME" \
+    -e REGISTRY_PASSWORD="$REGISTRY_PASSWORD" \
+    -e GIT_HTTP_USERNAME="$GIT_HTTP_USERNAME" \
+    -e GIT_HTTP_PASSWORD="$GIT_HTTP_PASSWORD" \
+    --network=host \
+    -v "$CONFIG_HOST":/tmp/config.yaml \
+    -v "$SEKOIA_ARCHIVE":/opt/sekoia \
+    "${DOCKER_IMAGE}" -c /tmp/config.yaml "$@"
+```
+
+Once created, make the script executable and test your configuration:
+
+```bash
+chmod +x run-shc.sh
+./run-shc.sh list
+```
+
+## Deployment Options
+
+### Option 1: Bundle Deployment
+
+This method executes all commands sequentially, providing the simplest installation path.
+
+Trigger the deployment:
+
+```bash
+./run-shc.sh exec Install
+```
+
+Wait for the final convergence report.
+
+### Option 2: Step-by-Step Deployment
+
+If you prefer manual control, you can execute the deployment stages individually.
+
+#### 1. Run Preflight Checks
+You must validate the environment before proceeding. The tool checks the Self-Hosted configuration, local Git, local OCI registry, release files, and server availability.
+
+```bash
+./run-shc.sh exec CheckRequiredConfigItems
+./run-shc.sh exec CheckLocalGit
+./run-shc.sh exec CheckLocalOCIRegistry
+./run-shc.sh exec CheckLocalReleaseFiles
+./run-shc.sh exec CheckServersAreReachable
+./run-shc.sh exec CheckServerSpec
+```
 
 !!! warning "Preflight block"
     The installation will not proceed if critical validation checks fail.
 
-### 3. Launch orchestration
-Trigger the deployment to configure the platform components.
+#### 2. Provision Local Registries
 
-1. Load the `customer.yml` manifest.
-2. Run the `deploy` command.
-3. Wait for the final convergence report.
+```bash
+./run-shc.sh exec PushImages
+./run-shc.sh exec PushCharts
+./run-shc.sh exec PushArgoStacks
+```
+
+#### 3. Install Kube Stack
+
+```bash
+./run-shc.sh exec K3SInstall
+./run-shc.sh exec GetKubeconfig
+./run-shc.sh exec HelmInstall
+./run-shc.sh exec CheckKubernetesCluster
+```
+
+#### 4. Deploy Sekoia Platform
+
+```bash
+./run-shc.sh exec sekoia
+```
 
 ## Results
-The Sekoia platform is now operational. You can access the interface via the load balancer URL provided in your configuration.
\ No newline at end of file
+
+The Sekoia platform is now operational. You can access the interface via the Load Balancer URL provided in your configuration.
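Before invoking `run-shc.sh`, it can help to confirm that the variables the script forwards to the container are actually set. The following is a minimal sketch, not part of the Sekoia tooling: the function name `check_shc_env` is hypothetical, and the variable list is simply the set of names referenced by the execution script in the preparation steps.

```shell
#!/bin/bash
# Hypothetical pre-check (not shipped with the controller): confirm that every
# variable referenced by run-shc.sh is set and non-empty before launching it.
REQUIRED_VARS="CONFIG_HOST SEKOIA_ARCHIVE SERVERS_SSH_KEY REGISTRY_USERNAME REGISTRY_PASSWORD GIT_HTTP_USERNAME GIT_HTTP_PASSWORD"

check_shc_env() {
    local name value missing=0
    for name in $REQUIRED_VARS; do
        # Indirect lookup: read the variable whose name is held in $name.
        eval "value=\${${name}:-}"
        if [ -z "$value" ]; then
            echo "missing: ${name}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical use would be `check_shc_env && ./run-shc.sh list`, so that the controller is only started once every variable has a value.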
\ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_prerequisites.md b/docs/self_hosted/deployment/deployment_prerequisites.md index 4687493601..83d1590df2 100644 --- a/docs/self_hosted/deployment/deployment_prerequisites.md +++ b/docs/self_hosted/deployment/deployment_prerequisites.md @@ -1,30 +1,41 @@ # Self-Hosted technical requirements -Sekoia Self-Hosted requires specific infrastructure, network, and operational readiness. As a customer-operated platform, you are responsible for the end-to-end lifecycle. - -## General requirements -Ensure your organization meets these operational prerequisites: - -* **Operations team**: You must maintain a dedicated team responsible for deployment, configuration, and monitoring. -* **Infrastructure management**: You must be able to provision and manage your own infrastructure, whether on-premises or via virtual machines. -* **Licensing**: You must possess a valid license defining the maximum supervised assets and event-ingest capacity. - -## Hardware footprint -The following specifications are estimated for a deployment of approximately 500GB/day. - -| Component | Specification | -| :--- | :--- | -| **Compute servers (6x)** | 44 CPU (3.2 GHz min), 128GB RAM, 2TB SSD, Debian 11 | -| **GPU server (1x)** | NVIDIA H100 dedicated for AI processing | -| **Storage** | S3-compatible bucket | -| **Admin node (1x)** | Debian 12, Python 3.11, 4 CPU / 8GB RAM | - -!!! warning "Consult Sekoia for sizing" - These figures are estimates provided for guidance. Contact Sekoia for a full requirements review before you commence deployment. - -## Network and connectivity -Configure the internal network to allow communication between platform components: - -* **DNS and Time**: Provide DNS resolution for all nodes and a synchronized NTP infrastructure. -* **Load balancer**: Configure a load balancer for Web UI access (Ports 80/443) and event ingestion (Ports 10514/11514). 
-* **Internal traffic**: Allow communication for Kubernetes management (Ports 6443, 2379-2380) and VXLAN overlay (Port 8472). \ No newline at end of file +Sekoia Self-Hosted is an enterprise-grade solution that requires specific infrastructure, network, and operational readiness. As a customer-operated platform, you are responsible for the end-to-end lifecycle, including provisioning, management, and monitoring. + +## Infrastructure Prerequisites +You must provision and manage the following dedicated infrastructure components: + +| Component | CPU | RAM | Storage / Throughput | +| :--- | :--- | :--- | :--- | +| **Load Balancer** (e.g., HAProxy, Nginx) | 4 | 8GB | 50MB/s (validate per customer) | +| **Local Image Registry** (e.g., Harbor) | 4 | 8GB | 5TB | +| **Local Code Registry** (e.g., Git) | 4 | 8GB | 100GB | +| **Orchestration Node** (Admin) | 4 | 8GB | 100GB | + +## Networking and Storage +* **Networking**: You must provide internal DNS resolution, synchronized NTP infrastructure, and SMTP servers for alerts. +* **S3 Storage**: You must provide an S3-compatible bucket for event storage. + * **Calculation**: Total capacity = Raw daily ingested events (GB) x retention days. + * **Note**: The 4TB local SSD on compute nodes is strictly for system use, not long-term event storage. + +## Scaling Table +The following table outlines the estimated hardware footprint per cluster based on asset count and daily ingestion volume. + +| Assets | Daily Volume | Compute Nodes | GPU Nodes | +| :--- | :--- | :--- | :--- | +| 5,000 | 500 GB | 6 | 1 | +| 10,000 | 1 TB | 12 | 1 | +| 20,000 | 2 TB | 24 | 2 | +| 50,000 | 5 TB | 60 | 2 | +| 100,000 | 10 TB | 120 | 2 | + +!!! warning "Theoretical Sizing" + The figures provided in the scaling table are estimates for guidance. Infrastructure requirements vary based on specific replication constraints, data retention policies, and query patterns. These figures must be validated by Sekoia prior to deployment. + +!!! 
warning "Hardware Specifications" + Each compute node must meet the minimum specification: 44 CPU (3.2 GHz minimum), 128GB RAM, and 4TB SSD (Debian 11). GPU nodes require an NVIDIA H100. + +## Operational Readiness +* **Operations Team**: A dedicated in-house or partner team must handle deployment, configuration, updates, and 24/7 monitoring. +* **Infrastructure Management**: You must have the ability to provision and manage your own private cloud, on-premises servers, or virtual machines. +* **Licensing**: A valid license must be provided, defining the maximum supervised assets and daily event-ingest capacity. \ No newline at end of file diff --git a/docs/self_hosted/deployment/deployment_process.md b/docs/self_hosted/deployment/deployment_process.md index f04cb33d96..0a858d5ed1 100644 --- a/docs/self_hosted/deployment/deployment_process.md +++ b/docs/self_hosted/deployment/deployment_process.md @@ -1,13 +1,19 @@ -# The Deployment process +# The Deployment Process -The Sekoia Self-Hosted platform uses an idempotent, agentless CLI to manage the platform lifecycle. This process unifies installation, configuration, and drift reconciliation. +A central tool called the **Self-Hosted Controller** manages the platform lifecycle. This tool takes a configuration file as a parameter, with all settings described in the configuration section. -## Core principles -The deployment engine operates on three main pillars: +All available commands can be listed by running the `list` command. Specific details for these commands are provided in the relevant sections below (Deployment, Debug, Check). Before proceeding, please consult the list of prerequisites for a Self-Hosted deployment. -* **Preflight validation**: The CLI verifies OS versions, network connectivity, and credentials before any change is applied. -* **Declarative configuration**: You define the desired state of your infrastructure in a `customer.yml` manifest. 
+## Core Principles +The deployment engine is designed around three fundamental pillars to ensure stability and predictability: + +* **Preflight Validation**: The CLI performs comprehensive checks before any execution. It verifies OS versions, network connectivity, and file checksums, blocking execution until the environment meets all minimum requirements. +* **Declarative Configuration**: The platform state is defined by a `config.yml` manifest. This file acts as the single source of truth for infrastructure settings (IPs, load-balancers, DNS), service configurations (SMTP, feature toggles, AI modules), and scaling parameters (node counts, resource quotas). * **Orchestration**: A core engine computes the difference between the actual and desired state to execute only necessary tasks. -## Artifact delivery -Each release is delivered as a single digitally signed archive that guarantees tamper evidence and verifiability. This archive contains the core platform and Sekoia applications, including the deployer, Kubernetes stack, and microservices. \ No newline at end of file +## Workflow Execution +The CLI supports both online and air-gapped environments: + +* **Online**: Fetches artifacts directly from authorized registries. +* **Air-gapped**: Operates in fully disconnected modes using pre-staged packages and locally cached manifests. + diff --git a/docs/self_hosted/index.md b/docs/self_hosted/index.md index 6261e85b97..cdcdb7ac94 100644 --- a/docs/self_hosted/index.md +++ b/docs/self_hosted/index.md @@ -17,7 +17,7 @@ The platform ensures: * **Regulatory Compliance**: The platform is compatible with national and sector-specific regulations, including public procurement frameworks. * **Enterprise-Grade Operations**: You can provision and scale environments with automated orchestration and built-in autoscaling. * **High Availability**: The architecture includes redundancy and self-healing capabilities to ensure continuous security operations. 
-* **Decoupled Intelligence**: While the technical foundation follows a structured release cycle, security content and threat intelligence are updated daily to minimize intelligence gaps. +* **Continuous Protection**: The business layer is decoupled from the technical foundation, allowing you to receive daily updates for threat intelligence, security rules, integrations, and automations. This ensures your detection capabilities improve continuously without requiring platform upgrades or service disruption. ## Intended audience diff --git a/docs/self_hosted/monitoring/metrics.md b/docs/self_hosted/monitoring/metrics.md deleted file mode 100644 index 2971137d2e..0000000000 --- a/docs/self_hosted/monitoring/metrics.md +++ /dev/null @@ -1,12 +0,0 @@ -# Platform metrics - -The platform exposes technical metrics to monitor the health of the underlying Kubernetes cluster and microservices. - -## Key performance indicators -| Metric | Description | -| :--- | :--- | -| **Node Health** | CPU, memory, and disk utilization per node. | -| **Pod Status** | Availability and restart counts for microservices. | -| **Ingestion Volume** | Daily GB ingested vs. licensed capacity. | -| **Detection Latency** | Time taken between event ingestion and rule matching. | - diff --git a/docs/self_hosted/release_notes/0.1.0.md b/docs/self_hosted/release_notes/0.0.1.md similarity index 67% rename from docs/self_hosted/release_notes/0.1.0.md rename to docs/self_hosted/release_notes/0.0.1.md index 4ca2b68062..d9ad4312f8 100644 --- a/docs/self_hosted/release_notes/0.1.0.md +++ b/docs/self_hosted/release_notes/0.0.1.md @@ -1,13 +1,12 @@ -# Release Note 0.1.0 +# Release Note 0.0.1 (MVP) This is the initial release of the Sekoia Self-Hosted platform MVP. ## New features * **Air-gap support**: Full deployment capability in restricted environments. * **Deployment CLI**: Introduction of the unified orchestration tool. -* **Signed Artifacts**: Verifiable and tamper-evident delivery model. 
## Technical foundation * **Kubernetes stack**: Based on K3s. * **OS Support**: Certified for Debian 11. -* **AI SOC Core**: Inclusion of built-in dashboards and diagnostic tools. \ No newline at end of file +* **SOC Core**: Sekoia Defend Core capabilitiess \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 2338b0c61f..840bcf67a6 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -82,15 +82,14 @@ nav: - Deployment: - Deployment Process: self_hosted/deployment/deployment_process.md - Deployment Prerequisites: self_hosted/deployment/deployment_prerequisites.md - - Deploymennt Configuration: self_hosted/deployment/deployment_configuration.md - Deployment Guide: self_hosted/deployment/deployment_guide.md + - Deploymennt Configuration: self_hosted/deployment/deployment_configuration.md - Monitoring: - Monitoring Guide: self_hosted/monitoring/monitoring_guide.md - - Metrics: self_hosted/monitoring/metrics.md - Troubleshooting: - Debug Tool: self_hosted/troubleshooting/debug_tool.md - Release Notes: - - 0.1.0: self_hosted/release_notes/0.1.0.md + - 0.0.1: self_hosted/release_notes/0.0.1.md - Sekoia Defend (XDR): - Introduction: xdr/index.md - Quick start guide: xdr/xdr_quick_start.md From 8d653bf09a5f3a55ae409fd77a44e4db953c9a41 Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Wed, 29 Apr 2026 17:52:33 +0200 Subject: [PATCH 3/9] Minor update --- .../deployment/deployment_guide.md | 29 +++++++++++++++---- 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/docs/self_hosted/deployment/deployment_guide.md b/docs/self_hosted/deployment/deployment_guide.md index 6092c12e76..b73823693d 100644 --- a/docs/self_hosted/deployment/deployment_guide.md +++ b/docs/self_hosted/deployment/deployment_guide.md @@ -3,7 +3,7 @@ This guide describes how to install the Sekoia Self-Hosted platform using the orchestration CLI. 
## Prerequisites -* An orchestration node running Debian 11 (also compatible with 12/13) with Python 3.11 and docker installed +* An orchestration node with Docker installed. * Access to the digitally signed Sekoia installer archive. ## Preparation work @@ -15,20 +15,22 @@ This guide describes how to install the Sekoia Self-Hosted platform using the or 3. **Integrity Check**: Always verify the archive checksum before proceeding. 4. **Extract**: ```bash - tar -xvf sekoia-archive.tar.gz + tar -xvf sekoia-archive.tar.gz -C $SEKOIA_LOCAL_DIR ``` +!!! note "Disk space requirements" + Ensure the destination directory provides at least 50 GB of available disk space to extract the Sekoia release. ### 2. Initialize the Controller For the first installation, the Self-Hosted Controller (SHC) is not available as a service. You must load the Docker image manually to initialize the environment: ```bash - docker load -i $SEKOIA_ARCHIVE/v0.0.1/images/self-hosted-controller.tgz + docker load -i $SEKOIA_LOCAL_DIR/v0.0.1/images/self-hosted-controller.tgz ``` ### 3. Write your configuration file -Prepare your config.yml manifest. This file acts as the single source of truth for your infrastructure, service, and network parameters. Ensure this file is ready before moving to the next step, as it will be mapped to the container during script execution. +Prepare your `config.yml` manifest. This file acts as the single source of truth for your infrastructure, service, and network parameters. Ensure this file is ready before moving to the next step, as it will be mapped to the container during script execution. See the [deployment configuration reference](./deployment_configuration.md) to define your parameters. ### 4. 
Create the Execution Script To simplify Sekoia installation commands and manage environment variables, create a local script called run-shc.sh with the following content: @@ -38,7 +40,7 @@ To simplify Sekoia installation commands and manage environment variables, creat DOCKER_IMAGE="sekoialab/self-hosted-controller-cli:latest" docker run --rm \ --e SERVERS_SUDO_PASSWORD="" \ +-e SERVERS_SUDO_PASSWORD="$SERVERS_SUDO_PASSWORD" \ -e SERVERS_SSH_KEY="$SERVERS_SSH_KEY" \ -e REGISTRY_USERNAME="$REGISTRY_USERNAME" \ -e REGISTRY_PASSWORD="$REGISTRY_PASSWORD" \ @@ -46,10 +48,25 @@ docker run --rm \ -e GIT_HTTP_PASSWORD="$GIT_HTTP_PASSWORD" \ --network=host \ -v $CONFIG_HOST:/tmp/config.yaml \ --v $SEKOIA_ARCHIVE:/opt/sekoia \ +-v $SEKOIA_LOCAL_DIR:/opt/sekoia \ ${DOCKER_IMAGE} -c /tmp/config.yaml "$@" ``` +!!! note "Environment variable configuration" + You can define these variables directly in your configuration file or as environment variables on the host system. + +| Variable | Description | +| :--- | :--- | +| `SEKOIA_LOCAL_DIR` | The directory used to extract the local Sekoia release. | +| `SEKOIA_CONFIG_FILE` | The path to the local configuration file for the self-hosted controller. | +| `SERVERS_SUDO_PASSWORD` | The sudo password to access target machines (optional, depends on your configuration). | +| `SERVERS_SSH_KEY` | The SSH key used to access remote machines. | +| `REGISTRY_USERNAME` | The username for the OCI registry. | +| `REGISTRY_PASSWORD` | The password for the OCI registry. | +| `GIT_HTTP_USERNAME` | The username for the code repository. | +| `GIT_HTTP_PASSWORD` | The password for the code repository. 
| + + Once created, make the script executable and test your configuration: ```bash From 601f58c8e099c9fe6747d8c5e70a5bd30ae38571 Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Wed, 29 Apr 2026 18:17:02 +0200 Subject: [PATCH 4/9] improve config --- .../deployment/deployment_configuration.md | 277 +++++++++++------- mkdocs.yml | 2 +- 2 files changed, 177 insertions(+), 102 deletions(-) diff --git a/docs/self_hosted/deployment/deployment_configuration.md b/docs/self_hosted/deployment/deployment_configuration.md index ecc9ca30ef..e8c46592c3 100644 --- a/docs/self_hosted/deployment/deployment_configuration.md +++ b/docs/self_hosted/deployment/deployment_configuration.md @@ -1,19 +1,24 @@ -# Deployment configuration +# Deployment configuration reference -The `config.yml` file is the central manifest used to describe your specific environment and service requirements. +The `config.yml` file is the central manifest used to describe your environment and service requirements. It dictates how the self-hosted controller orchestrates infrastructure deployment and application settings. -## 1. Example `config.yml` +## Configuration example + +You can use this complete example as a template for your own deployment. Ensure you replace the placeholders and environment variable references with your actual values. 
```yaml global: - dev: true + dev: false + emit_mm_notif: false host: "app.sekoia.local" alternative_hosts: "api.sekoia.local" version: platform: version: "v0.0.1" path: /opt/sekoia/platform/ - skip_existing_local: true + skip_existing_local: false + skip_existing_manifest: false + manifest_max_age: 300 push_workers: 4 data: detection-rules: @@ -32,56 +37,77 @@ utils: ansible: datadir: "resources/ansible" ssh-key: env.SERVERS_SSH_KEY - user: debian - password: "" - silent: false - verbosity: 2 + user: root + password: env.SERVERS_SUDO_PASSWORD inventory: - managers: ["TBD"] - workers: ["TBD"] - oci_registry: - url: https://registry.lab/ - username: env.REGISTRY_USERNAME - password: env.REGISTRY_PASSWORD - check_repo: "TBD" - chart_repo: "TBD" - image_repo: "TBD" + managers: + - 10.0.0.1 + workers: + - 10.0.0.2 + - 10.0.0.3 git: - repo_url: "" auth_method: "http" + repo_url: "" http: username: env.GIT_HTTP_USERNAME password: env.GIT_HTTP_PASSWORD + proxy: "" ssh: key_path: env.GIT_SSH_KEY_PATH kubernetes: - autologin: true + kubeconfig_path: "/tmp/self-hosted-controller/kubeconfig.yml" + autologin: false + oci_registry: + url: env.OCI_REGISTRY_URL + username: env.OCI_REGISTRY_USERNAME + password: env.OCI_REGISTRY_PASSWORD + check_repo: "my-project/shc-probe" + chart_repo: "my-project/charts" + image_repo: "my-project/images" + prometheus: + url: env.PROMETHEUS_URL + query_window: "1h" + query_timeout: 10 + default_label_filters: { + "platform": "app.dev1.sekoia.io" + } + argocd: + namespace: "argocd" + root_app_name: "root" + notification: + url: "http://localhost:6666" + channel: "mi-self-hosted" + thread_id: "deploy-job" platform_installer: - image: "?" 
+ image: "registry.sekoia.io/sekoialab/platform-installer:self-hosted-v0.14.0" modules: k3s_install: k3s_release: "v1.31.12+k3s1" k3s_tls_san: [] - kube_manager_fqdn: "TBD" + kube_manager_fqdn: "10.0.0.1" k3s_extra_args: "" k3s_extra_labels: {} k3s_extra_taints: [] - registry_url: https://registry.lab - registry_subpath: "TBD" + registry_url: https://registry.sekoia.io + registry_subpath: "" registry_username: env.REGISTRY_USERNAME registry_password: env.REGISTRY_PASSWORD - pull_images_with_proxy: true + pull_images_with_proxy: false + k3s_http_proxy: "" + k3s_https_proxy: "" k3s_no_proxy: "127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,.lab" reboot_after_install: false - helm_install: - kube_manager_fqdn: "TBD" - forward_dns: "TBD" push_argo_stacks: repo_path: "/tmp/argo-stacks" - base_branch: "master" + helm_install: + kube_manager_fqdn: "10.0.0.1" + forward_dns: "10.X.X.X" wipe_storage: - enabled: true + enabled: false + kube_crash_recovery: + pod_ready_timeout: 300 + poll_interval: 10 platform_configuration: config: global: @@ -104,81 +130,130 @@ modules: tls: "False" starttls: "True" local_argocd: - repo_name: "self-hosted" - repo_url: "TBD" - helm_repo_url: "TBD" + repo_name: "" + repo_url: "" + helm_repo_url: "" git_username: env.GIT_HTTP_USERNAME git_password: env.GIT_HTTP_PASSWORD oci_username: env.REGISTRY_USERNAME oci_password: env.REGISTRY_PASSWORD ``` +## Manifest parameters + +### 1. Global +This section defines the platform identity, notification behavior, and how the controller fetches release assets from remote storage. + +* **global.dev** (boolean): Enables development mode behaviors, such as verbose logging and extended error reporting. +* **global.emit_mm_notif** (boolean): Enables the sending of installation progress notifications to Mattermost. +* **global.host** (string): The primary FQDN used to access the Sekoia.io platform. 
+* **global.alternative_hosts** (string): Secondary FQDNs used for API access or auxiliary services. + +#### global.version.fetch (optional: online deployment only) +* **global.version.fetch.access-key** (string): S3 access key required to authenticate with the release bucket. +* **global.version.fetch.secret-key** (string): S3 secret key required to authenticate with the release bucket. +* **global.version.fetch.endpoint** (string): The S3 API endpoint URL (e.g., Linode, AWS, or MinIO). +* **global.version.fetch.region** (string): The geographical region of the S3 bucket. +* **global.version.fetch.bucket** (string): The name of the S3 bucket containing the self-hosted release assets. + +#### global.version.platform +* **global.version.platform.version** (string): The specific version tag of the platform to deploy. +* **global.version.platform.path** (string): Local absolute path on the admin node where the release archive is stored. +* **global.version.platform.skip_existing_local** (boolean): If `true`, the controller skips downloading files already present on the local disk. +* **global.version.platform.skip_existing_manifest** (boolean): If `true`, the controller always uses the local manifest copy, bypassing age checks. +* **global.version.platform.manifest_max_age** (integer): Time in seconds before the local manifest is considered expired and re-downloaded. +* **global.version.platform.push_workers** (integer): Number of parallel threads used to push images and Helm charts to the local registry. + +#### global.version.data +* **global.version.data.detection-rules** (object): Version and local path for the detection logic bundle. +* **global.version.data.intake-formats** (object): Version and local path for log parsing formats. +* **global.version.data.playbook-library** (object): Version and local path for automation playbooks. +* **global.version.data.cti** (object): Version and local path for Cyber Threat Intelligence data. + +### 2. 
Utils +This section configures the underlying tools (Ansible, Git, Kubernetes) and external service integrations required for the deployment. + +#### utils.ansible +* **utils.ansible.datadir** (string): Path to the directory containing Ansible playbooks, roles, and inventories. +* **utils.ansible.ssh-key** (string): Path or environment variable for the SSH private key used to manage nodes. +* **utils.ansible.user** (string): The remote user used for SSH connections (e.g., `root` or `debian`). +* **utils.ansible.password** (string): The sudo password or environment variable for privilege escalation. +* **utils.ansible.inventory.managers** (list): IP addresses or FQDNs of the Kubernetes manager nodes. +* **utils.ansible.inventory.workers** (list): IP addresses or FQDNs of the Kubernetes worker nodes. + +#### utils.git +* **utils.git.auth_method** (string): Protocol used to authenticate with Git repositories (`http` or `ssh`). +* **utils.git.repo_url** (string): The remote URL of the Git repository containing environment manifests. +* **utils.git.http.username** (string): Username for HTTP-based Git authentication. +* **utils.git.http.password** (string): Password or Token for HTTP-based Git authentication. +* **utils.git.http.proxy** (string): Optional proxy URL specifically for Git HTTP operations. +* **utils.git.ssh.key_path** (string): Path to the SSH key for Git authentication. + +#### utils.kubernetes +* **utils.kubernetes.kubeconfig_path** (string): Destination path where the generated cluster `kubeconfig` will be stored. +* **utils.kubernetes.autologin** (boolean): If `true`, automatically performs a CLI login to the cluster after deployment. + +#### utils.oci_registry +* **utils.oci_registry.url** (string): The full URL of the OCI-compliant container registry. +* **utils.oci_registry.username** (string): Registry authentication username. +* **utils.oci_registry.password** (string): Registry authentication password. 
+* **utils.oci_registry.check_repo** (string): Full path to an image used for registry health-check probes. +* **utils.oci_registry.chart_repo** (string): Base repository path for Helm chart storage. +* **utils.oci_registry.image_repo** (string): Base repository path for Docker image storage. + +#### utils.prometheus +* **utils.prometheus.url** (string): The endpoint URL for the Prometheus monitoring server. +* **utils.prometheus.query_window** (string): The default time window applied to metric queries (e.g., `1h`). +* **utils.prometheus.query_timeout** (integer): Maximum duration in seconds for a Prometheus query to complete. +* **utils.prometheus.default_label_filters** (object): Set of default labels used to filter all outgoing Prometheus queries. + +#### utils.argocd +* **utils.argocd.namespace** (string): The Kubernetes namespace where ArgoCD services are deployed. +* **utils.argocd.root_app_name** (string): The name assigned to the "App-of-Apps" root manifest. + +#### utils.notification +* **utils.notification.url** (string): Base URL for the internal notification service. +* **utils.notification.channel** (string): The Mattermost channel identifier for posting updates. +* **utils.notification.thread_id** (string): A logical identifier used to group notification messages into threads. + +#### utils.platform_installer +* **utils.platform_installer.image** (string): The full Docker image URI for the Sekoia.io platform installer. + +### 3. Modules +This section provides granular configuration for each functional phase of the platform installation. + +#### modules.k3s_install +* **modules.k3s_install.k3s_release** (string): The Kubernetes version tag to be installed. +* **modules.k3s_install.k3s_tls_san** (list): List of additional SANs (Subject Alternative Names) for the API server certificate. +* **modules.k3s_install.kube_manager_fqdn** (string): The FQDN or IP of the primary manager node used for cluster orchestration. 
+* **modules.k3s_install.k3s_extra_args** (string): Additional command-line arguments passed to the K3s server/agent process. +* **modules.k3s_install.k3s_extra_labels** (object): Key-value pairs to be applied as labels to the Kubernetes nodes. +* **modules.k3s_install.k3s_extra_taints** (list): List of taints to be applied to the Kubernetes nodes. +* **modules.k3s_install.registry_url** (string): The private registry URL used by the nodes to pull images. +* **modules.k3s_install.registry_subpath** (string): Optional prefix path for rewriting registry image locations. +* **modules.k3s_install.pull_images_with_proxy** (boolean): Enables the use of an HTTP proxy for `containerd` image pulls. +* **modules.k3s_install.k3s_http_proxy** (string): The HTTP proxy URL for the K3s runtime. +* **modules.k3s_install.k3s_https_proxy** (string): The HTTPS proxy URL for the K3s runtime. +* **modules.k3s_install.k3s_no_proxy** (string): List of CIDRs and domains that bypass the proxy. +* **modules.k3s_install.reboot_after_install** (boolean): If `true`, the host nodes are rebooted immediately following the K3s installation. + +#### modules.push_argo_stacks +* **modules.push_argo_stacks.repo_path** (string): Local directory path where the ArgoCD manifest repositories are synchronized. + +#### modules.helm_install +* **modules.helm_install.kube_manager_fqdn** (string): FQDN or IP of the manager node for Helm deployment tasks. +* **modules.helm_install.forward_dns** (string): The upstream DNS server IP used by the cluster's CoreDNS. + +#### modules.wipe_storage +* **modules.wipe_storage.enabled** (boolean): Authorizes the controller to format and wipe disks (required for Ceph/Rook storage setup). + +#### modules.kube_crash_recovery +* **modules.kube_crash_recovery.pod_ready_timeout** (integer): Seconds to wait for all pods to reach the `Ready` state before a deployment phase times out. 
+* **modules.kube_crash_recovery.poll_interval** (integer): Frequency in seconds for checking pod status during recovery phases. -## 2. Manifest parameters - -## 1. Global -Defines the platform identity and version management. - -* **`dev`** (boolean): Enables development mode (e.g., debug logging). -* **`host`** (string): The primary FQDN (Fully Qualified Domain Name) for the platform. -* **`alternative_hosts`** (string): Additional FQDNs for API or auxiliary access. -* **`version`**: Defines software and data bundle versions. - * **`platform`**: Platform release details. - * `version`: The release tag (e.g., "v0.0.1"). - * `path`: Local absolute path on the admin node where the release archive is stored. - * `skip_existing_local`: Avoids re-copying existing files. - * `push_workers`: Parallel workers for image deployment. - * **`data`**: Configuration for data modules (`detection-rules`, `intake-formats`, `playbook-library`, `cti`). - * Each sub-item requires a `version` tag and a local `path`. - -## 2. Utils -Contains the infrastructure "plumbing" (credentials, paths, and registry settings). - -* **`ansible`**: - * `datadir`: Local path to Ansible inventory/roles. - * `ssh-key`: Reference to the SSH key (`env.SERVERS_SSH_KEY`). - * `user`: SSH user (typically `debian`). - * `password`: Sudo password (`env.SERVERS_SUDO_PASSWORD`). - * `inventory`: Lists of `managers` and `workers` node IPs/hostnames. -* **`oci_registry`**: Container registry settings. - * `url`: The OCI registry URL. - * `username` / `password`: Auth credentials (`env.REGISTRY_USERNAME`, `env.REGISTRY_PASSWORD`). - * `check_repo`, `chart_repo`, `image_repo`: Repositories used for specific operations. -* **`git`**: Authentication for ArgoCD repositories. - * `auth_method`: Authentication type (`http` or `ssh`). -* **`platform_installer`**: URI for the controller installer image. - -## 3. Modules -Configures the functional installation of the platform stacks. 
- -### k3s_install -Handles the underlying Kubernetes layer. -* **`k3s_release`**: The Kubernetes version tag. -* **`registry_url` / `registry_subpath`**: Internal image registry configuration. -* **`pull_images_with_proxy`**: Enables HTTP proxy for container pulls. -* **`k3s_no_proxy`**: CIDR/domains excluded from proxy traffic. -* **`k3s_extra_args`**: Additional CLI flags for the K3s server. - -### helm_install -Handles Kubernetes application deployment. - -* **`kube_manager_fqdn`**: FQDN of the primary manager. -* **`forward_dns`**: DNS nameservers for CoreDNS. - -### push_argo_stacks - -* **`repo_path`**: Local directory of the Git checkout. -* **`base_branch`**: Branch used to synchronize stacks (e.g., "master"). - -### wipe_storage - -* **`enabled`**: If `true`, allows the controller to format disk storage. - -### platform_configuration -Application-layer configuration. - -* **`global`**: Repeats FQDN definitions for the internal context. -* **`proxy`**: Proxy settings for application traffic (http/https/no_proxy). -* **`grafana`**: `root_url` for dashboard access. -* **`email`**: SMTP configuration (`host`, `port`, `user`, `password`, `tls`, `starttls`). -* **`local_argocd`**: Git and OCI credentials used by the internal ArgoCD instance to synchronize application manifests. \ No newline at end of file +#### modules.platform_configuration +* **modules.platform_configuration.config.global** (object): Duplication of platform FQDNs for internal application context. +* **modules.platform_configuration.config.proxy** (object): Application-layer HTTP/HTTPS/NO_PROXY settings. +* **modules.platform_configuration.config.grafana.root_url** (string): The external URL used to access the Grafana dashboard. +* **modules.platform_configuration.config.email.smtp** (object): SMTP server details (`host`, `port`, `user`, `password`, `tls`, `starttls`) for platform alerts and notifications. 
+* **modules.platform_configuration.config.local_argocd** (object): Git and OCI credentials specifically for the ArgoCD instance to sync internal application manifests. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 840bcf67a6..9619fbdf09 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -83,7 +83,7 @@ nav: - Deployment Process: self_hosted/deployment/deployment_process.md - Deployment Prerequisites: self_hosted/deployment/deployment_prerequisites.md - Deployment Guide: self_hosted/deployment/deployment_guide.md - - Deploymennt Configuration: self_hosted/deployment/deployment_configuration.md + - Deployment Configuration: self_hosted/deployment/deployment_configuration.md - Monitoring: - Monitoring Guide: self_hosted/monitoring/monitoring_guide.md - Troubleshooting: From db75edecd5ba88a682204c41e3e99c7c78912737 Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Wed, 29 Apr 2026 18:23:11 +0200 Subject: [PATCH 5/9] improve index --- docs/self_hosted/index.md | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/docs/self_hosted/index.md b/docs/self_hosted/index.md index cdcdb7ac94..6cced18484 100644 --- a/docs/self_hosted/index.md +++ b/docs/self_hosted/index.md @@ -31,4 +31,24 @@ Target organizations include: * **Large Enterprises**: Organizations running dedicated security operations with an internal platform team or a trusted MSSP. !!! note "Operational Responsibility" - Sekoia Self-Hosted requires a dedicated operations team to manage the end-to-end lifecycle, including deployment, configuration, and monitoring. \ No newline at end of file + Sekoia Self-Hosted requires a dedicated operations team to manage the end-to-end lifecycle, including deployment, configuration, and monitoring. + +## Documentation map + +Use the following links to navigate the technical documentation and manage your deployment lifecycle. 
+ +### Architecture +* [Reference Architecture](architecture/architecture.md): Explore the technical design, component interactions, and data flow of the self-hosted solution. + +### Deployment +* [Deployment Process](deployment/deployment_process.md): Understand the high-level stages of a standard installation. +* [Deployment Prerequisites](deployment/deployment_prerequisites.md): Verify hardware, software, and network requirements before you begin. +* [Deployment Guide](deployment/deployment_guide.md): Follow step-by-step instructions to install and initialize the platform. +* [Deployment Configuration](deployment/deployment_configuration.md): Consult the comprehensive reference for the `config.yml` manifest parameters. + +### Operations and Support +* [Monitoring Guide](monitoring/monitoring_guide.md): Learn how to observe platform health and performance metrics. +* [Debug Tool](troubleshooting/debug_tool.md): Use the built-in diagnostic tools to identify and resolve common issues. + +### Release updates +* [Release Notes](release_notes/0.0.1.md): Review the latest features, improvements, and fixes included in each version. \ No newline at end of file From a25c24aa1291a8482d00978d26a4fe026694ff49 Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Thu, 30 Apr 2026 08:57:36 +0200 Subject: [PATCH 6/9] update release note --- docs/self_hosted/release_notes/0.0.1.md | 73 ++++++++++++++++++++++--- 1 file changed, 66 insertions(+), 7 deletions(-) diff --git a/docs/self_hosted/release_notes/0.0.1.md b/docs/self_hosted/release_notes/0.0.1.md index d9ad4312f8..6bac9f17eb 100644 --- a/docs/self_hosted/release_notes/0.0.1.md +++ b/docs/self_hosted/release_notes/0.0.1.md @@ -1,12 +1,71 @@ -# Release Note 0.0.1 (MVP) +# Release Notes v0.0.1 (MVP) -This is the initial release of the Sekoia Self-Hosted platform MVP. +This release introduces the initial version of Sekoia Self-Hosted, providing a robust security operations foundation for disconnected and regulated environments. 
## New features -* **Air-gap support**: Full deployment capability in restricted environments. -* **Deployment CLI**: Introduction of the unified orchestration tool. + +* **Air-gap support**: Enable full deployment and operational capabilities in restricted or isolated environments with no external connectivity. +* **Deployment CLI**: Access a unified orchestration tool designed to manage the platform lifecycle, from initialization to upgrades. ## Technical foundation -* **Kubernetes stack**: Based on K3s. -* **OS Support**: Certified for Debian 11. -* **SOC Core**: Sekoia Defend Core capabilitiess \ No newline at end of file + +* **Kubernetes stack**: The platform is built on a lightweight and certified K3s distribution for optimized resource management. +* **OS Support**: This release is officially certified for deployment on **Debian 11 (Bullseye)**. + +## Functional scope + +The functional scope of Sekoia Self-Hosted aligns with the **Defend Core** subscription, with specific exceptions related to air-gapped environment constraints. + +| Feature | Available | Description | +| :--- | :--- | :--- | +| **Meta-playbooks** | Yes | Supports advanced automation workflows. | +| **OC Notifications** | Yes | Operations Center notification system. | +| **Observable Tags Enrichment** | No | Automatic enrichment of events with observable tags. | +| **Cloud-to-Cloud Ingestion** | No | Not supported for air-gapped deployments. | +| **Encrypted Ingestion** | Yes | Supports Syslog TLS, RELP TLS, and HTTPS ingestion. | +| **Custom Intake Formats** | Yes | Allows creation and management of custom parsing formats. | +| **Sigma Correlation** | Yes | Full support for Sigma-based correlation rules. | +| **Playbooks** | Yes | Built-in automation and orchestration capabilities. | +| **Automatic Asset Discovery** | Yes | Identifies assets within the monitored perimeter. | +| **Retrohunt** | Yes | Search for past indicators in historical data. 
| +| **Anomaly Detection Engine** | Yes | Statistical and behavioral anomaly detection. | +| **Case Management** | Yes | Standard security incident tracking and management. | +| **Hot Storage** | Yes | High-performance storage for active investigation. | +| **Sekoia Endpoint Agent** | Yes | Support for host-level visibility and response. | +| **Contextualized Alerts** | No | Requires real-time CTI embedding (unsupported in air-gapped deployments). | +| **SOL Query Builder** | Yes | Visual and syntax-based search interface. | +| **Detection Rules** | Yes | Access to the standard Sekoia.io detection library. | +| **Event Drop Detection** | Yes | Monitoring of log ingestion continuity. | +| **Cases Custom Status** | Yes | Ability to define specific incident lifecycles. | +| **Investigation Graph** | Yes | Visual representation of security incidents and entities. | +| **Notebooks** | Yes | Collaborative workspaces for threat hunting. | +| **Sigma Pattern Validation** | Yes | Built-in syntax checking for Sigma rules. | +| **SOL Dataset** | Yes | Logical grouping of event data. | +| **Dashboard Filters** | Yes | Dynamic filtering for visualization modules. | +| **Roy Assistant** | No | AI assistant, not compatible with air-gapped environments. | +| **Dashboards** | Yes | Customizable visual monitoring interfaces. | +| **APIs** | Yes | Full programmatic access to platform functions. | +| **Member Management** | Yes | RBAC and user administration. | +| **Usage Reporting** | Yes | Statistics on data volume and platform usage. | +| **Subscription Management** | Yes | Internal license and subscription tracking. | +| **SSO / MFA** | Yes | Integration with identity providers for secure access. | +| **Region Threat Telemetry** | Yes | Geographic-based threat visualization. 
| + +## Specific environment constraints + +Deploying in air-gapped or restricted environments introduces the following operational changes: + +### Threat Intelligence and Detection +While the standalone **Threat Intelligence (CTI)** research module is not available in air-gapped deployments, the platform remains fully powered by Sekoia.io intelligence. + +* **Detection Rules**: All rules (Sigma, patterns) are embedded in the release and fully operational. +* **CTI Context**: Live cloud-based enrichment and manual exploration of the CTI database (threat actors, malware, reports) are not supported without external connectivity. + +### Security content delivery +To ensure continuous protection, every product release includes the latest version of: + +* Sekoia detection rules. +* Integration connectors (Intake formats). +* Automation library (SOAR modules). + +This ensures your deployment remains up to date with the latest threat detection logic even without external connectivity. \ No newline at end of file From 41abd4d0cba35babe79f2718ec838da9237ea40a Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Thu, 30 Apr 2026 11:19:39 +0200 Subject: [PATCH 7/9] update minimum size disk --- docs/self_hosted/deployment/deployment_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/self_hosted/deployment/deployment_guide.md b/docs/self_hosted/deployment/deployment_guide.md index b73823693d..b4b5c99b4c 100644 --- a/docs/self_hosted/deployment/deployment_guide.md +++ b/docs/self_hosted/deployment/deployment_guide.md @@ -18,7 +18,7 @@ This guide describes how to install the Sekoia Self-Hosted platform using the or tar -xvf sekoia-archive.tar.gz -C $SEKOIA_LOCAL_DIR ``` !!! note "Disk space requirements" - Ensure the destination directory provides at least 50 GB of available disk space to extract the Sekoia release. + Ensure the destination directory provides at least 100 GB of available disk space to extract the Sekoia release. ### 2. 
Initialize the Controller From 689d3332f4f144096c69aef7cee47cb3f8eae2f9 Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Tue, 5 May 2026 16:48:10 +0200 Subject: [PATCH 8/9] Debug tool --- .../self_hosted/troubleshooting/debug_tool.md | 262 +++++++++++++++++- 1 file changed, 250 insertions(+), 12 deletions(-) diff --git a/docs/self_hosted/troubleshooting/debug_tool.md b/docs/self_hosted/troubleshooting/debug_tool.md index acf0609a24..d542fc0b06 100644 --- a/docs/self_hosted/troubleshooting/debug_tool.md +++ b/docs/self_hosted/troubleshooting/debug_tool.md @@ -1,16 +1,254 @@ -# Troubleshooting with the Debug Tool +# Troubleshooting and Debugging -The Sekoia Self-Hosted CLI includes specialized diagnostic tools to assist in troubleshooting deployment or operational issues. +A centralized set of diagnostic tools is available through the Self-Hosted Controller (SHC). These commands let you validate your configuration, inspect the health of your deployment, audit secrets, diagnose database issues, and identify resource pressure on the cluster. -## Debug capabilities -The debug tool can perform the following actions: +All commands below are executed via the `run-shc.sh` wrapper script using the `exec` subcommand, which instantiates the named module inside the controller container and runs it to completion. -* **Connectivity checks**: Verify network paths between compute nodes and storage. -* **Integrity checks**: Confirm that all deployed artifacts match the expected signatures. -* **State reconciliation**: Identify configuration drift and force a synchronization. +--- -## Running a debug session -To initiate a debug session: -1. Access the orchestration node. -2. Run the CLI with the `debug` flag. -3. Review the automated report for any "failed" status indicators. 
\ No newline at end of file +## Configuration Validation + +### Validate the local configuration file + +```bash +./run-shc.sh exec CheckLocalConfig +``` + +Loads `config.yml` and validates every key against the schema defined in `help.yaml`. The check covers: + +- **Missing required fields** — any key marked `required: true` in the schema that is absent from your config will be reported with an error log line. +- **Format mismatches** — values that do not match the expected regex pattern are flagged individually. +- **Schema integrity** — ensures no required key is hidden from the help output (which would indicate an authoring error in `help.yaml`). + +If all checks pass, the command exits cleanly with: + +``` +Validating configuration... +Configuration is valid entries_checked= +``` + +If errors are found, each is logged individually before the command exits with a non-zero status: + +``` +Missing required config key=global.version.platform.version +Format mismatch key=utils.oci_registry.url value=… expected_format=^https?://… +Configuration validation failed: 2 error(s) found +``` + +### Inspect resolved environment variables (verbose mode) + +```bash +./run-shc.sh -v exec CheckLocalConfig +``` + +Adding `-v` enables debug-level logging. Before running the schema validation, the controller dumps the **fully resolved** in-memory config tree — including every `env.VAR_NAME` value substituted with its actual environment variable content. This is the most reliable way to confirm that secrets injected via environment variables (e.g. `REGISTRY_PASSWORD`, `SERVERS_SSH_KEY`) are correctly loaded into the controller process. + +--- + +## Infrastructure Connectivity + +### Check SSH connectivity to all servers + +```bash +./run-shc.sh exec CheckServersAreReachable +``` + +Uses Ansible to run a `ping` playbook against all hosts declared in `utils.ansible.inventory`. 
This confirms that the controller can open an SSH connection to every node, which is a prerequisite for any Ansible-based operation (installation, node reboots, etc.). + +On success: + +``` +Pinging all configured servers via Ansible... +All configured servers are reachable +``` + +On failure, the Ansible output is surfaced and the command exits with a non-zero status indicating which host(s) could not be reached. + +### Check Kubernetes cluster health + +```bash +./run-shc.sh exec CheckKubernetesCluster +``` + +Fetches the kubeconfig, connects to the cluster API, and verifies: + +- The number of nodes in the cluster matches the number of unique hosts in the Ansible inventory. +- Every node has a `Ready=True` condition. + +On success: + +``` +Kubernetes cluster is healthy nodes=3 expected=3 +``` + +Fails immediately if any node is not Ready, or if the node count does not match the inventory (which may indicate a node that joined or was evicted without being reflected in the config). + +--- + +## Local Artifact Checks + +### Verify release files on disk + +```bash +./run-shc.sh exec CheckLocalReleaseFiles +``` + +Iterates over every entry under `global.version` in your config (e.g. `platform`, `data.detection-rules`) and checks that: + +1. The base directory (`path`) exists. +2. The version subdirectory (`path/`) exists and is non-empty. + +Useful after a `DownloadReleaseFiles` run to confirm that all expected archives are present before attempting an installation or update. + +### Verify git repository access + +```bash +./run-shc.sh exec CheckLocalGit +``` + +Clones the configured `utils.git.repo_url` into a temporary directory and tests both **pull** and **push** access using the configured credentials. Use this to validate that the git token has sufficient permissions before running any module that writes ArgoCD stacks to the repository. 
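The checks described up to this point can be chained into a single preflight pass. The following is a minimal sketch and not part of the SHC tooling: `run_preflight` and the `SHC` variable are hypothetical names, and the sketch assumes every module exits with a non-zero status on failure, which the text above states explicitly only for some of them.

```bash
# Hypothetical helper (not shipped with SHC): run each check module in turn
# and report PASS/FAIL per module instead of stopping at the first error.
# SHC points at the wrapper script; override it if the script lives elsewhere.
SHC="${SHC:-./run-shc.sh}"

run_preflight() {
  local failed=0 module
  for module in "$@"; do
    # Assumption: a module signals failure through a non-zero exit status.
    if "$SHC" exec "$module"; then
      echo "PASS $module"
    else
      echo "FAIL $module"
      failed=$((failed + 1))
    fi
  done
  # Return non-zero if at least one check failed.
  [ "$failed" -eq 0 ]
}
```

A typical call would be `run_preflight CheckLocalConfig CheckServersAreReachable CheckKubernetesCluster CheckLocalReleaseFiles CheckLocalGit`. Running every check rather than stopping early gives a complete list of what needs fixing before an installation or update.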
+ +### Verify OCI registry access + +```bash +./run-shc.sh exec CheckLocalOCIRegistry +``` + +Connects to `utils.oci_registry.url` and verifies **push**, **pull**, and **delete** access. Push is tested first; if it fails, pull and delete are skipped and reported as untested. Use this to confirm registry credentials before pushing Helm charts or images. + +--- + +## Service and Application Health + +### ArgoCD status dashboard + +```bash +./run-shc.sh exec DebugArgoCD +``` + +Connects to the ArgoCD API and renders three tables in the terminal: + +| Table | Content | +|---|---| +| **Repositories** | Name, URL, and type of every registered ArgoCD repository | +| **Root Application** | Sync and health status of the root app-of-apps | +| **Applications** | Per-application sync status and health status, sorted alphabetically | + +Sync statuses are colour-coded: `Synced` in green, `OutOfSync` in red. Health statuses: `Healthy` in green, `Progressing` in yellow, `Degraded`/`Missing` in red. + +This is the first command to run when investigating a broken or partially deployed platform. + +### Force a full re-synchronisation + +```bash +./run-shc.sh exec DebugArgoCDSyncAll +``` + +Triggers a three-phase synchronisation of **all** ArgoCD applications in parallel: + +1. **Partial sync** — for each application, syncs only resources whose `kind` matches a configurable regex (default: `secretgenerator|configmap`). This refreshes ConfigMaps and SecretGenerator objects before secrets are regenerated. +2. **Operator restart** — rolls out the `sekoiaio-secret-operator` deployment in the `support` namespace and waits for it to come back up (default: 60 s), ensuring it picks up the refreshed SecretGenerators. +3. **Full sync** — syncs every application in its entirety. + +Use this when applications are stuck in `OutOfSync`, after a manual change to the ArgoCD git repository, or after a platform upgrade. 
+ +> **Note:** The sync timeout per application defaults to 300 s and up to 32 applications are synced concurrently. Both values can be overridden in `config.yml` under `modules.debug_argocd_sync_all`. + +--- + +## Secret Diagnostics + +### Audit missing or incomplete secrets + +```bash +./run-shc.sh exec DebugMissingSecrets +``` + +Compares the desired state (every `SecretGenerator` CRD declared via `secretoperator.sekoia.io/v1alpha1`) against the actual Kubernetes `Secret` objects present in the cluster, and reports: + +- Secrets that are entirely missing (the CRD exists but no `Secret` was created). +- Secrets that exist but have incomplete keys. +- For each missing key: the vault path where the value is expected to come from. + +Additionally, the module runs a `platform-installer dumpconfig` job and cross-references each missing secret against the platform-installer configuration, helping distinguish between secrets that were never defined versus secrets that failed to be generated. + +### Scan ArgoCD stacks for unrendered template placeholders + +```bash +./run-shc.sh exec DebugKustomizeStacksTemplates +``` + +Clones (or pulls) the ArgoCD git repository and recursively scans every YAML file for values that still contain the `SH_TMPL` marker — the sentinel string used by the stack templater to indicate a value that should have been substituted. For each match, the following is logged: + +- File path and document index within a multi-document YAML file +- Kubernetes `kind` and `metadata.name` of the affected object +- Dot-separated YAML path to the offending field and its current value + +A summary of occurrences per YAML path (sorted by frequency) is printed at the end. Use this when services fail to start due to misconfigured values that were not properly rendered during installation. 
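For a quick manual cross-check on a local clone of the ArgoCD repository, a plain text search for the same marker gives a rough approximation. This is only a sketch: `find_unrendered` is a hypothetical helper, it matches raw lines rather than parsed YAML values, and it reports none of the `kind`, `metadata.name`, or YAML path details that the module logs.

```bash
# Hypothetical helper (not shipped with SHC): list YAML lines that still
# contain the SH_TMPL marker in a local clone of the ArgoCD repository.
# Usage: find_unrendered <path-to-local-clone>
find_unrendered() {
  local repo_dir="$1"
  # -r: recurse into subdirectories, -n: print line numbers,
  # --include: restrict the search to YAML files.
  # grep exits with status 1 when nothing matches, i.e. no leftovers found.
  grep -rn --include='*.yaml' --include='*.yml' 'SH_TMPL' "$repo_dir"
}
```

Each output line has the form `file:line:content`, which is enough to locate a leftover placeholder before running the full module for the per-object detail.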
+ +--- + +## Database Diagnostics + +### Inspect database pod health + +```bash +./run-shc.sh exec DebugDatabases +``` + +Inspects the `support` namespace and renders two tables: + +**StatefulSets** — covers non-CNPG databases (e.g. Redis, ClickHouse): + +| Column | Description | +|---|---| +| Name | StatefulSet name | +| Expected / Ready | Declared vs. ready replica count | +| Status | `Healthy` / `Warning` / `Unhealthy` (colour-coded) | +| Restarts | Total container restarts across all pods | +| Last Restart | Human-readable age of the most recent restart (e.g. `5m ago`) | +| Issues | Per-pod issues: `not ready`, `waiting (CrashLoopBackOff)`, `crashed Xm ago`, etc. | + +**CNPG Clusters** — same columns for all CloudNativePG `postgresql.cnpg.io/v1` clusters (e.g. the platform PostgreSQL instances). + +A pod is flagged as having a **recent restart** if its last termination timestamp is within the past 10 minutes. Status is `Unhealthy` if any replica is not ready or not in `Running` phase; `Warning` if all replicas are running but recent restarts or waiting containers were detected. + +--- + +## Cluster Resource Management + +### RAM allocation waste report + +```bash +./run-shc.sh exec DebugResourceAllocation +``` + +Queries the Kubernetes Metrics API (`metrics.k8s.io/v1beta1`) for live memory consumption across all namespaces, then compares it against each pod's declared memory `requests`. The output is two tables: + +**RAM Allocation — Waste Report** — all pods that declare a memory request, sorted by wasted RAM descending: + +| Column | Description | +|---|---| +| Namespace / Pod | Pod identity | +| Request | Declared memory request | +| Usage | Live memory consumption from the Metrics API | +| Waste | `request − usage` | +| Waste % | Colour-coded: green < 50 %, yellow ≥ 50 %, red ≥ 80 % | + +**Pods Without Memory Requests** — pods for which no `resources.requests.memory` is set, shown with their current usage. 
These pods are a scheduling risk as Kubernetes cannot make informed placement decisions for them. + +> **Prerequisites:** The Metrics Server must be deployed in the cluster. If the Metrics API is unavailable, the command exits with an error suggesting you run the `HelmInstall` module first. + +--- + +## Platform Installer Debug + +### Launch a paused installer job + +```bash +./run-shc.sh exec DebugPlatformInstallation +``` + +Deploys the `platform-installer` Helm chart with a `pause` command override instead of the normal install/update sequence. The job pod starts and stays alive without performing any changes, giving you an interactive shell to inspect the installer environment, mounted secrets, and configuration files. Any stale release from a previous debug session is cleaned up automatically before deploying. + +Use this when you need to manually inspect or test the platform-installer's runtime context without triggering a full installation. From 17374d16d0ee0819ca850611e725568c25bc8c6d Mon Sep 17 00:00:00 2001 From: Pierre Penhouet Date: Thu, 21 May 2026 14:45:36 +0200 Subject: [PATCH 9/9] add monitoring doc --- .../monitoring/monitoring_guide.md | 255 +++++++++++++++++- 1 file changed, 243 insertions(+), 12 deletions(-) diff --git a/docs/self_hosted/monitoring/monitoring_guide.md b/docs/self_hosted/monitoring/monitoring_guide.md index d6070f4eb5..3d29d8548a 100644 --- a/docs/self_hosted/monitoring/monitoring_guide.md +++ b/docs/self_hosted/monitoring/monitoring_guide.md @@ -1,16 +1,247 @@ -# Monitor cluster health +# Monitoring -Sekoia Self-Hosted includes integrated tools to provide real-time visibility into cluster health and performance. +Sekoia Self-Hosted ships with a complete observability stack so you can monitor cluster health, application status, database availability, and resource consumption in real time. This page explains how the stack is organized, when to use which tool, and gives a full command reference for on-demand diagnostics. 
-## Built-in dashboards -You can access monitoring dashboards within the platform to view: -* Cluster resource usage (CPU, RAM, Storage). -* Microservices status and logs. -* Event ingestion rates and processing latency. +This page is for platform administrators operating a Sekoia Self-Hosted instance. It covers both daily monitoring and incident response. -## Diagnostic collection -To troubleshoot performance issues, you can generate a diagnostic bundle using the deployment CLI. +## Monitoring at a glance -1. Connect to the orchestration node. -2. Run the CLI `diagnose` command. -3. Export the resulting log bundle for analysis. \ No newline at end of file +Sekoia Self-Hosted provides two complementary monitoring layers: + +| Layer | Tool | Purpose | When to use | +|---|---|---|---| +| **Continuous monitoring** | Grafana, Loki, Prometheus, Alertmanager | Real-time dashboards, log search, metric collection, alerting | Daily operations, capacity planning, alert-driven response | +| **On-demand diagnostics** | Self-Hosted Controller (SHC) CLI | Targeted health checks on cluster, applications, databases, and resources | Incident triage, post-deployment validation, troubleshooting | + +In practice, you will spend most of your time in Grafana — dashboards and alerts surface issues as they happen. The SHC CLI is the tool you reach for when an alert fires, when a deployment looks wrong, or when you need a structured snapshot of platform state. + +--- + +## Continuous monitoring: Grafana, Loki, Prometheus + +The observability stack is deployed as part of the platform — no additional installation is required. + +### Built-in dashboards + +A set of Grafana dashboards is shipped with every release. 
They cover: + +- **Cluster health** — node status, control-plane availability, Kubernetes events +- **Resource utilization** — CPU, memory, disk, and network usage per node and namespace +- **Throughput** — event ingestion rate, processing latency, queue depth +- **Error rates** — service-level error counts, HTTP status distribution +- **Security events** — authentication failures, audit-log highlights + +You can customize these dashboards or add your own to fit your monitoring conventions. + +### Centralized logs + +Promtail agents run on every node and ship all container logs to a Loki cluster. Use Grafana's **Explore** view to search, filter, and inspect logs by service, namespace, pod, or time range. + +Typical investigation flow: +1. Open Grafana → **Explore** → select the Loki data source +2. Filter by `namespace` and `pod` (or `app` label) +3. Narrow the time range to the incident window +4. Refine with substring or regex filters + +### Metrics and alerting + +Prometheus collects system- and application-level metrics across the platform. Alertmanager evaluates alerting rules and dispatches notifications. + +- All metrics and alerts are visualized in Grafana. +- You can tune the default thresholds or add new rules to match your SLAs. +- Notifications can be routed to email (via the SMTP server configured at install time) or to any Alertmanager-compatible receiver. + +### Retention and external forwarding + +You can adjust log and metric retention to meet your compliance requirements. The stack also supports forwarding to external observability systems: + +- **Metrics** — via Prometheus `remote_write` +- **Logs** — via the Loki HTTP API + +--- + +## On-demand diagnostics: Self-Hosted Controller (SHC) + +The Self-Hosted Controller (SHC) is a CLI that runs a set of targeted health checks against the cluster, ArgoCD, databases, and the Kubernetes Metrics API. It is the right tool for incident triage and structured platform snapshots. 
+ +### Prerequisites + +- SHC commands are executed from the **admin node** provisioned during installation (see [Deployment prerequisites](../deployment/deployment_prerequisites.md)). +- All commands are invoked via the `run-shc.sh` wrapper script using the `exec` subcommand. +- The admin node must have network access to the Kubernetes API of the target cluster. +- For `DebugResourceAllocation`, the Kubernetes **Metrics Server** must be deployed. If it is not, the command exits with an actionable error — run the `HelmInstall` module first. + +### When to use SHC vs Grafana + +| Situation | Use | +|---|---| +| Routine monitoring, trend analysis, alert review | Grafana | +| An alert just fired and you need to confirm what's broken | SHC (`DebugArgoCD` first) | +| You just deployed or upgraded and want a clean health snapshot | SHC (full workflow, see below) | +| You want to inspect log content in detail | Grafana / Loki | +| You suspect over- or under-provisioned memory requests | SHC (`DebugResourceAllocation`) | + +### Cluster node health + +```bash +./run-shc.sh exec CheckKubernetesCluster +``` + +Connects to the Kubernetes API and performs two checks: + +- The number of nodes in the cluster matches the number of unique hosts in the Ansible inventory. +- Every node reports a `Ready=True` condition. + +On success: + +``` +Kubernetes cluster is healthy nodes=3 expected=3 +``` + +Run this first to confirm the cluster itself is healthy before investigating application-level issues. + +### Application status + +```bash +./run-shc.sh exec DebugArgoCD +``` + +Connects to the ArgoCD API and renders three terminal panels in sequence. 
+ +**Panel 1 — Repositories** + +Lists every Git or Helm repository registered in ArgoCD: + +| Column | Description | +|---|---| +| Name | Repository alias | +| URL | Remote URL | +| Type | `git` or `helm` | + +**Panel 2 — Root application** + +Displays the sync and health status of the root app-of-apps as a single-line summary: + +``` +Sync: Synced | Health: Healthy +``` + +**Panel 3 — Applications** + +Lists all managed ArgoCD applications, sorted alphabetically: + +| Column | Values | +|---|---| +| Name | Application name (with `-on-self-hosted` suffix stripped) | +| Sync Status | `Synced` (green) · `OutOfSync` (red) · `Unknown` (dim) | +| Health Status | `Healthy` (green) · `Progressing` (yellow) · `Degraded` / `Missing` (red) · `Suspended` (cyan) | + +**Reading the output:** + +- A row that is green on both columns is nominal. +- `OutOfSync` means the live cluster state diverges from the ArgoCD git source — run `DebugArgoCDSyncAll` to recover. +- `Progressing` is transient and expected during deployments; if it persists beyond a few minutes, check the pod logs for the affected application. +- `Degraded` or `Missing` indicates a service that failed to start or whose resources could not be created — use `DebugDatabases` and pod logs to investigate further. + +> **Tip:** This is the first command to run when investigating a broken or partially deployed platform. + +### Database health + +```bash +./run-shc.sh exec DebugDatabases +``` + +Inspects the `support` namespace and renders two tables covering all database workloads. + +**StatefulSets table** + +Covers non-CNPG databases (e.g. Redis, ClickHouse, MinIO): + +| Column | Description | +|---|---| +| Name | StatefulSet name | +| Expected | Declared replica count | +| Ready | Currently ready replicas | +| Status | `Healthy` (green) · `Warning` (yellow) · `Unhealthy` (red) | +| Restarts | Total container restarts across all pods | +| Last Restart | Human-readable age of the most recent restart (e.g. 
`5m ago`) | +| Issues | Per-pod detail: `not ready`, `waiting (CrashLoopBackOff)`, `crashed Xm ago` | + +**CNPG Clusters table** + +Same columns for all CloudNativePG (`postgresql.cnpg.io/v1`) clusters (platform PostgreSQL instances). + +**Reading the output:** + +- **Healthy** — all replicas are running and ready, no recent restarts. +- **Warning** — all replicas are running, but recent restarts (within the past 10 minutes) or waiting containers were detected. Monitor and re-run to see if the issue clears. +- **Unhealthy** — at least one replica is not ready or not in `Running` phase. Check the `Issues` column for the specific pod and reason, then inspect logs with `kubectl logs -n support `. + +### Resource usage + +```bash +./run-shc.sh exec DebugResourceAllocation +``` + +Queries the Kubernetes Metrics API (`metrics.k8s.io/v1beta1`) for live memory consumption across all namespaces and compares it against each pod's declared memory requests. + +> **Prerequisite:** Metrics Server must be deployed. If the Metrics API is unavailable, the command exits with an actionable error — run the `HelmInstall` module first. + +**RAM allocation — waste report** + +All pods with a memory request declared, sorted by wasted RAM (highest first): + +| Column | Description | +|---|---| +| Namespace | Kubernetes namespace | +| Pod | Pod name | +| Request | Declared `resources.requests.memory` | +| Usage | Live consumption from the Metrics API | +| Waste | Request − Usage | +| Waste % | Green < 50 % · Yellow ≥ 50 % · Red ≥ 80 % | + +**Pods without memory requests** + +A separate warning panel lists all pods for which no `resources.requests.memory` is set, alongside their current live usage. These pods are a scheduling risk — Kubernetes cannot make informed placement or eviction decisions for them. + +**Reading the output:** + +- Rows highlighted in red (≥ 80 % waste) indicate pods whose memory requests are significantly over-provisioned. 
Adjusting their requests downward reduces scheduling pressure across the cluster. +- Pods in the "Without Memory Requests" panel should be reviewed — if they consume significant memory without a declared request, they risk being evicted under node pressure. + +--- + +## Recommended workflows + +### Daily monitoring + +For continuous operations, Grafana is the primary surface: + +1. Review the **Cluster health** dashboard — confirm all nodes are Ready and the control plane is responsive. +2. Check the **Resource utilization** dashboard — watch for nodes or namespaces trending toward saturation. +3. Check the **Throughput** dashboard — verify that the ingestion rate is within the expected range and no queue is backing up. +4. Review active alerts in Alertmanager — acknowledge or escalate as needed. + +If any dashboard surfaces an anomaly, switch to the SHC workflow below to confirm the platform state before investigating further. + +### Incident response + +When an alert fires or a dashboard shows degradation, run the following sequence for a complete health snapshot: + +```bash +# 1. Confirm all cluster nodes are ready +./run-shc.sh exec CheckKubernetesCluster + +# 2. Check application sync and health status +./run-shc.sh exec DebugArgoCD + +# 3. Inspect database availability +./run-shc.sh exec DebugDatabases + +# 4. Review resource allocation efficiency +./run-shc.sh exec DebugResourceAllocation +``` + +This order is deliberate: cluster → applications → databases → resources. Each step rules out a layer before moving on, so when something fails you know roughly where to focus. + +If applications appear `OutOfSync` or `Degraded` after running `DebugArgoCD`, see [Troubleshooting and debugging](../troubleshooting/debug_tool.md) for recovery procedures. \ No newline at end of file
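The incident-response sequence can also be captured to a file, so that the state at the time of the alert can be compared with a later run. A minimal sketch under stated assumptions: `snapshot` and the `SHC` variable are hypothetical names, not SHC features, and the exit-status handling assumes that a failing module returns non-zero, which this page does not state for the `Debug*` modules.

```bash
# Hypothetical helper (not shipped with SHC): run the four diagnostics in the
# documented order and keep their combined output in one file.
# SHC points at the wrapper script; override it if the script lives elsewhere.
SHC="${SHC:-./run-shc.sh}"

snapshot() {
  local out_file="$1" module
  : > "$out_file"   # start from an empty file
  for module in CheckKubernetesCluster DebugArgoCD DebugDatabases DebugResourceAllocation; do
    echo "=== $module ===" >> "$out_file"
    # Keep going on failure so the snapshot covers every layer;
    # record the exit status of a failing module instead of aborting.
    "$SHC" exec "$module" >> "$out_file" 2>&1 || echo "exit status: $?" >> "$out_file"
  done
}
```

For example, `snapshot "shc-$(date +%Y%m%dT%H%M%S).log"` writes one timestamped file per run. Unlike an interactive session, the helper does not stop at the first failing layer, because a full record tends to be more useful after the fact.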