diff --git a/docs/network.mdx b/docs/network.mdx new file mode 100644 index 000000000..113f86bd3 --- /dev/null +++ b/docs/network.mdx @@ -0,0 +1,70 @@ +--- +id: network-resources +title: Network Resources +sidebar_label: Network Resources +sidebar_position: 9 +--- +import Zoom from "react-medium-image-zoom"; + +# Network Resources + +This documentation provides a clear and simple overview of the network architecture that SleakOps deploys in client environments. It explains how the network is organized, how resources are protected, and how both internal and external communication is managed. + +> ❓ **Note:** The network is designed to ensure security, scalability, and high availability. It enables environment separation, protects sensitive data, and exposes services in a secure and controlled way. + +## 1. Overview of the Architecture + +The SleakOps network infrastructure is based on the following key components: + +- **VPC (Virtual Private Cloud):** Segregates networks by environment (Management, Production, Development). +- **Subnets:** + - *Public:* exposed to the Internet. + - *Private:* restricted access, Internet access via NAT Gateway. + - *Persistence:* for databases and storage. +- **Internet Gateway:** Enables communication between the VPC and the Internet. +- **Route Tables:** Define routing paths between subnets and to/from the Internet. +- **Security Groups:** Virtual firewalls that control inbound and outbound traffic for resources. +- **Internal DNS:** Allows internal resources to communicate using hostnames instead of IP addresses. +- **External-DNS:** Runs inside each Kubernetes (EKS) cluster and automatically manages public DNS records in Route53 for exposed services. + +## 2. Typical Communication Flow + +The following illustrates a typical flow of network traffic in SleakOps: + +1. **Access from the Internet:** + A user accesses a publicly exposed service (e.g., an API). Traffic reaches the Internet Gateway and is routed to the public subnet. 
+ +2. **Access Control:** + The Security Group associated with the resource evaluates whether the connection is allowed. + +3. **Internal Communication:** + Internal services (in private or persistence subnets) communicate using internal DNS, under Security Group rules. + +4. **Service Exposure:** + If a service within a Kubernetes cluster needs to be publicly accessible (e.g., an API), it is exposed via an Application Load Balancer, and External-DNS registers the public domain automatically in Route53. + +> This segmentation and control ensure that only necessary services are exposed while keeping sensitive data protected. + + + reference-architecture + + +## 3. External-DNS and Route53 + +An automated solution is used to manage public DNS records for deployed services, integrating the infrastructure with external DNS providers like Route53. + +- External-DNS **does not expose services directly**. It automates DNS record management for resources that are already exposed (e.g., via an Application Load Balancer). +- This allows services to be securely and easily accessible from the Internet. + +## 4. Cross-Environment Connectivity via VPC Peering + +To enable controlled communication between environments (e.g., between Management and Production), SleakOps sets up **VPC Peering connections** between the different VPCs. + +- **VPC Peering** enables two VPCs to exchange internal traffic as if they were part of the same network. +- **It does not require** Internet, NAT Gateway, or VPN traffic routing. +- It is a direct connection between two networks. + +> 💡 Besides Internet Gateway access, SleakOps also supports other connectivity options such as **Pritunl VPN**, **NAT Gateway**, and **Transit Gateway**, depending on use case and required isolation level. 
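The peering setup above can be sketched with the AWS CLI. The VPC, route table, peering connection IDs, and CIDR block below are illustrative placeholders, not values provisioned by SleakOps:

```bash
# 1. Request a peering connection from the Management VPC to the Production VPC
aws ec2 create-vpc-peering-connection \
  --vpc-id vpc-0aaa1111 \
  --peer-vpc-id vpc-0bbb2222

# 2. Accept the request (run from the account that owns the peer VPC)
aws ec2 accept-vpc-peering-connection \
  --vpc-peering-connection-id pcx-0ccc3333

# 3. Route each VPC's traffic for the other VPC's CIDR through the peering link
aws ec2 create-route \
  --route-table-id rtb-0ddd4444 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id pcx-0ccc3333
```

Traffic only flows once the route tables on **both** sides reference the peering connection and the relevant Security Groups allow it. Note that VPC Peering is non-transitive: each pair of VPCs that must communicate needs its own connection.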
diff --git a/docs/troubleshooting/_category_.json b/docs/troubleshooting/_category_.json new file mode 100644 index 000000000..533678cc3 --- /dev/null +++ b/docs/troubleshooting/_category_.json @@ -0,0 +1,11 @@ +{ + "label": "Troubleshooting", + "position": 8, + "collapsible": true, + "collapsed": true, + "description": "Common issues and solutions for SleakOps users", + "link": { + "type": "doc", + "id": "troubleshooting/index" + } +} \ No newline at end of file diff --git a/docs/troubleshooting/api-access-troubleshooting-private-services.mdx b/docs/troubleshooting/api-access-troubleshooting-private-services.mdx new file mode 100644 index 000000000..dfa550979 --- /dev/null +++ b/docs/troubleshooting/api-access-troubleshooting-private-services.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 3 +title: "Private API Service Access Issues" +description: "Troubleshooting connectivity issues with private API services in Kubernetes clusters" +date: "2024-12-19" +category: "user" +tags: ["api", "vpn", "private-service", "connectivity", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Private API Service Access Issues + +**Date:** December 19, 2024 +**Category:** User +**Tags:** API, VPN, Private Service, Connectivity, Troubleshooting + +## Problem Description + +**Context:** Users experiencing connectivity issues when trying to access private API services deployed in Kubernetes clusters through SleakOps platform. 
+ +**Observed Symptoms:** + +- Unable to connect to private API services +- API requests timing out or failing +- Intermittent connectivity issues +- Services accessible internally but not externally + +**Relevant Configuration:** + +- Service type: Private API service +- Network: Kubernetes cluster with private networking +- Access method: VPN connection required +- Service exposure: Internal cluster networking + +**Error Conditions:** + +- Connection failures when VPN is not active +- API unreachable from external networks +- No response from private service endpoints +- Network timeout errors + +## Detailed Solution + + + +First, ensure your VPN connection is active and properly configured: + +1. **Check VPN Status:** + + - Verify VPN client is connected + - Check connection status in your VPN client + - Ensure you're connected to the correct VPN profile + +2. **Test VPN Connectivity:** + + ```bash + # Test connectivity to cluster internal networks + ping <cluster-internal-ip> + + # Check if you can reach cluster DNS + nslookup <service-name>.<namespace>.svc.cluster.local + ``` + +3. **Verify Network Routes:** + ```bash + # Check routing table + route -n # Linux + netstat -rn # macOS + route print # Windows + ``` + + + + + +Check if the API service is properly configured for private access: + +1. **Check Service Type:** + + ```bash + kubectl get svc <service-name> -n <namespace> + ``` + +2. **Verify Service Endpoints:** + + ```bash + kubectl get endpoints <service-name> -n <namespace> + ``` + +3. **Check Service Configuration:** + ```yaml + # Example private service configuration + apiVersion: v1 + kind: Service + metadata: + name: private-api-service + namespace: default + spec: + type: ClusterIP # Internal access only + ports: + - port: 80 + targetPort: 8080 + selector: + app: api-service + ``` + + + + + +Examine pod logs to identify potential issues: + +1. **Check API Pod Logs:** + + ```bash + # Get pod logs + kubectl logs <pod-name> -n <namespace> + + # Follow logs in real-time + kubectl logs -f <pod-name> -n <namespace> + + # Get logs from all containers in pod + kubectl logs <pod-name> -n <namespace> --all-containers + ``` + +2. 
**Look for Common Issues:** + + - Connection timeouts + - Authentication failures + - Resource constraints + - Database connectivity issues + +3. **Check Pod Status:** + ```bash + kubectl get pods -n <namespace> + kubectl describe pod <pod-name> -n <namespace> + ``` + + + + + +Perform network-level troubleshooting: + +1. **Test from Within Cluster:** + + ```bash + # Create a debug pod + kubectl run debug-pod --image=nicolaka/netshoot -it --rm + + # Test connectivity from inside cluster + curl http://<service-name>.<namespace>.svc.cluster.local + ``` + +2. **Check Network Policies:** + + ```bash + kubectl get networkpolicies -n <namespace> + ``` + +3. **Verify DNS Resolution:** + + ```bash + # From debug pod + nslookup <service-name>.<namespace>.svc.cluster.local + ``` + +4. **Test Port Connectivity:** + ```bash + # Test specific port + telnet <service-ip> <port> + nc -zv <service-ip> <port> + ``` + + + + + +If you need external access to the private API: + +1. **Create Ingress Resource:** + + ```yaml + apiVersion: networking.k8s.io/v1 + kind: Ingress + metadata: + name: private-api-ingress + namespace: default + annotations: + nginx.ingress.kubernetes.io/rewrite-target: / + spec: + rules: + - host: api.yourdomain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: private-api-service + port: + number: 80 + ``` + +2. **Configure TLS (Optional):** + ```yaml + spec: + tls: + - hosts: + - api.yourdomain.com + secretName: api-tls-secret + ``` + + + + + +**Quick Fixes:** + +1. **Restart VPN Connection:** + + - Disconnect and reconnect VPN + - Try different VPN servers if available + +2. **Clear DNS Cache:** + + ```bash + # Linux (systemd-resolved) + sudo resolvectl flush-caches + + # macOS + sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder + + # Windows + ipconfig /flushdns + ``` + +3. 
**Check Firewall Rules:** + - Ensure local firewall allows VPN traffic + - Verify corporate firewall settings + +**Best Practices:** + +- Always connect to VPN before accessing private services +- Use service discovery instead of hardcoded IPs +- Implement proper health checks for API services +- Monitor service logs regularly +- Set up alerts for service availability + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/aws-cost-monitoring-and-optimization.mdx b/docs/troubleshooting/aws-cost-monitoring-and-optimization.mdx new file mode 100644 index 000000000..d5592abb8 --- /dev/null +++ b/docs/troubleshooting/aws-cost-monitoring-and-optimization.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "AWS Cost Monitoring and Optimization" +description: "Understanding and managing AWS cost increases in SleakOps environments" +date: "2025-01-16" +category: "provider" +tags: ["aws", "costs", "monitoring", "optimization", "billing"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# AWS Cost Monitoring and Optimization + +**Date:** January 16, 2025 +**Category:** Provider +**Tags:** AWS, Costs, Monitoring, Optimization, Billing + +## Problem Description + +**Context:** Users experiencing gradual monthly increases in AWS costs (approximately $50/month) in their SleakOps-managed environments and need to understand the root causes and forecast future expenses. 
+ +**Observed Symptoms:** + +- Monthly cost increases of approximately $50 +- Costs rising consistently over several months +- Uncertainty about whether increases will continue +- Need for accurate cost forecasting + +**Relevant Configuration:** + +- Multiple AWS environments managed through SleakOps +- Recent deployments: databases, Grafana, Loki monitoring +- NodePools configuration changes in November +- Multiple AWS accounts potentially involved + +**Error Conditions:** + +- Cost increases not directly attributable to infrastructure changes +- Potential costs from failed account creation attempts +- Possible traffic-related cost increases +- Forecasting discrepancies between different views + +## Detailed Solution + + + +### Step-by-Step Cost Analysis + +1. **Establish a baseline period**: Identify when your environments became stable +2. **Track incremental changes**: Document each infrastructure change with dates +3. **Separate traffic from infrastructure costs**: AWS charges for networking traffic +4. **Account for deployment timing**: Mid-month deployments affect next month's full costs + +### Key factors that affect costs: + +- **Infrastructure changes**: New databases, monitoring tools +- **Traffic increases**: More users = higher networking costs +- **Deployment timing**: Partial month vs. 
full month billing +- **Multiple accounts**: Costs distributed across different AWS accounts + + + + + +### Infrastructure Components and Their Cost Impact + +| Component | Typical Monthly Cost | Notes | +| ------------------------- | --------------------- | -------------------- | +| Small RDS instances | $15-30 | Per database | +| Grafana + Loki monitoring | ~$15 static + traffic | Monitoring stack | +| NodePool changes | Minimal | Usually <$5/month | +| EKS cluster base | $72/month | Per cluster | +| Load balancers | $16-25/month | Per ALB/NLB | + +### Traffic-Related Costs + +- **Data transfer out**: $0.09/GB (first 1GB free) +- **Inter-AZ transfer**: $0.01/GB +- **NAT Gateway**: $0.045/GB processed + + + + + +### Using AWS Cost Explorer + +1. **Access Cost Explorer** from your root AWS account +2. **Filter by account ID** to isolate costs per account +3. **Group by service** to identify which AWS services are increasing +4. **Set date ranges** to compare month-over-month + +### Key metrics to analyze: + +```bash +# Example cost breakdown to look for: +- EC2 instances (compute) +- RDS (databases) +- EKS (Kubernetes service) +- Data Transfer (networking) +- EBS (storage) +- Load Balancers +``` + +### Questions to ask: + +- When did environments become stable? +- What was deployed and when? +- Has application traffic increased? +- Are there unused resources in failed accounts? + + + + + +### AWS Cost Forecasting Limitations + +- **Early month forecasts** (first 3-5 days) can be inaccurate +- **Seasonal variations** affect predictions +- **New deployments** skew forecasting algorithms +- **Traffic spikes** create temporary forecast inflation + +### Best practices for forecasting: + +1. **Wait until mid-month** for more accurate forecasts +2. **Use 3-month trends** rather than single month comparisons +3. **Account for known changes** when projecting +4. 
**Monitor daily spend** to catch anomalies early + +### SleakOps Cost Monitoring + +SleakOps provides cost visibility through: + +- Real-time cost dashboards +- Monthly cost breakdowns by environment +- Resource utilization metrics +- Optimization recommendations + + + + + +### Identifying Unnecessary Costs + +1. **Failed account resources**: Check for resources in accounts that failed during initial setup +2. **Unused databases**: Identify databases with no connections +3. **Over-provisioned instances**: Right-size based on utilization +4. **Orphaned load balancers**: Remove unused ALBs/NLBs + +### Cleanup process: + +```bash +# Example resources to check: +- Unused EBS volumes +- Stopped but not terminated EC2 instances +- Load balancers with no targets +- RDS instances with no connections +- Elastic IPs not associated with instances +``` + +### Optimization strategies: + +- **Reserved instances** for predictable workloads +- **Spot instances** for non-critical workloads +- **Auto-scaling** to match demand +- **Storage optimization** (gp3 vs gp2, lifecycle policies) + + + + + +### AWS Cost Alerts + +Set up billing alerts for: + +- Monthly budget thresholds +- Unusual spending patterns +- Per-service cost increases + +### SleakOps Monitoring + +- Review monthly cost reports +- Monitor resource utilization +- Track cost per environment +- Set up notifications for significant changes + +### Regular review process: + +1. **Weekly**: Check for cost anomalies +2. **Monthly**: Review cost trends and forecasts +3. **Quarterly**: Optimize resource allocation +4. 
**Annually**: Review reserved instance strategy + + + +--- + +_This FAQ was automatically generated on January 16, 2025 based on a real user query._ diff --git a/docs/troubleshooting/aws-ec2-public-ip-assignment.mdx b/docs/troubleshooting/aws-ec2-public-ip-assignment.mdx new file mode 100644 index 000000000..ceb688d01 --- /dev/null +++ b/docs/troubleshooting/aws-ec2-public-ip-assignment.mdx @@ -0,0 +1,171 @@ +--- +sidebar_position: 3 +title: "EC2 Instance Not Getting Public IPv4 Address" +description: "Solution for EC2 instances created in VPC without public IP assignment" +date: "2024-01-15" +category: "provider" +tags: ["aws", "ec2", "vpc", "networking", "public-ip"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# EC2 Instance Not Getting Public IPv4 Address + +**Date:** January 15, 2024 +**Category:** Provider +**Tags:** AWS, EC2, VPC, Networking, Public IP + +## Problem Description + +**Context:** User creates an EC2 instance within a production VPC through SleakOps but the instance doesn't receive a public IPv4 address that can be accessed externally by third-party providers. + +**Observed Symptoms:** + +- EC2 instance created successfully in the specified VPC +- No public IPv4 address assigned to the instance +- Instance cannot be reached from external networks +- Third-party providers cannot access the instance + +**Relevant Configuration:** + +- Environment: Production VPC +- Instance type: EC2 +- Network: VPC-based deployment +- Access requirement: External provider access needed + +**Error Conditions:** + +- Public IP not automatically assigned during instance creation +- Instance only has private IP within VPC +- External connectivity not available + +## Detailed Solution + + + +The most common reason an EC2 instance doesn't get a public IP is that it's launched in a private subnet or a public subnet without auto-assign public IP enabled. + +**To verify subnet configuration:** + +1. 
Go to **AWS Console** → **VPC** → **Subnets** +2. Find the subnet where your instance was launched +3. Check the **"Auto-assign public IPv4 address"** setting +4. If it's disabled, this explains why no public IP was assigned + + + + + +The recommended solution is to assign an Elastic IP (EIP) to your instance: + +**Steps to assign Elastic IP:** + +1. Go to **AWS Console** → **EC2** → **Elastic IPs** +2. Click **"Allocate Elastic IP address"** +3. Choose **Amazon's pool of IPv4 addresses** +4. Click **"Allocate"** +5. Select the newly created EIP +6. Click **"Actions"** → **"Associate Elastic IP address"** +7. Select your EC2 instance +8. Click **"Associate"** + +```bash +# Using AWS CLI +aws ec2 allocate-address --domain vpc +aws ec2 associate-address --instance-id i-1234567890abcdef0 --allocation-id eipalloc-12345678 +``` + + + + + +Once you have a public IP, ensure your security groups allow the necessary traffic: + +**For HTTP/HTTPS access:** + +``` +Type: HTTP +Protocol: TCP +Port Range: 80 +Source: 0.0.0.0/0 + +Type: HTTPS +Protocol: TCP +Port Range: 443 +Source: 0.0.0.0/0 +``` + +**For specific provider access:** + +``` +Type: Custom TCP +Protocol: TCP +Port Range: [YOUR_APPLICATION_PORT] +Source: [PROVIDER_IP_RANGE] +``` + + + + + +Ensure your subnet's route table has a route to an Internet Gateway: + +1. Go to **AWS Console** → **VPC** → **Route Tables** +2. Find the route table associated with your subnet +3. Check for a route like: + - **Destination:** `0.0.0.0/0` + - **Target:** `igw-xxxxxxxxx` (Internet Gateway) + +If this route doesn't exist, your instance won't have internet access even with a public IP. + + + + + +If you're managing this through SleakOps, you can configure public IP assignment: + +1. In your **Project Configuration** +2. Go to **Infrastructure** → **Compute** +3. Enable **"Assign Public IP"** for your EC2 instances +4. 
Or configure **Elastic IP** allocation in the advanced settings + +```yaml +# Example SleakOps configuration +compute: + ec2_instances: + - name: "production-instance" + instance_type: "t3.medium" + subnet_type: "public" + assign_public_ip: true + elastic_ip: true +``` + + + + + +**Important cost information:** + +- **Elastic IPs are free** when associated with a running instance +- **Elastic IPs cost $0.005/hour** when not associated with an instance +- **Data transfer charges** apply for traffic going out to the internet +- Always release unused Elastic IPs to avoid charges + + + + + +If you don't want to use Elastic IPs, consider these alternatives: + +1. **Application Load Balancer (ALB)**: For web applications +2. **Network Load Balancer (NLB)**: For TCP/UDP traffic +3. **NAT Gateway**: For outbound-only internet access +4. **VPC Endpoints**: For AWS service access without internet + +These solutions can provide external access without directly assigning public IPs to instances. + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/aws-marketplace-login-setup-issue.mdx b/docs/troubleshooting/aws-marketplace-login-setup-issue.mdx new file mode 100644 index 000000000..16ea18499 --- /dev/null +++ b/docs/troubleshooting/aws-marketplace-login-setup-issue.mdx @@ -0,0 +1,164 @@ +--- +sidebar_position: 3 +title: "AWS Marketplace Login and Setup Issue" +description: "Solution for login problems after subscribing to SleakOps through AWS Marketplace" +date: "2024-01-15" +category: "user" +tags: ["aws-marketplace", "authentication", "login", "setup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# AWS Marketplace Login and Setup Issue + +**Date:** January 15, 2024 +**Category:** User +**Tags:** AWS Marketplace, Authentication, Login, Setup + +## Problem Description + +**Context:** User subscribed to SleakOps through AWS Marketplace but 
encounters authentication issues when trying to complete the product setup. + +**Observed Symptoms:** + +- After subscribing through AWS Marketplace, clicking "Setup Product" redirects to registration page +- System doesn't recognize the user as logged in +- Manual login redirects to home page without completing setup +- Issue persists across multiple attempts in the same browser session +- Setup process never completes successfully + +**Relevant Configuration:** + +- Subscription method: AWS Marketplace +- Browser: Same browser used throughout the process +- Login status: User appears logged in but system doesn't recognize it +- Setup stage: Initial product setup after marketplace subscription + +**Error Conditions:** + +- Error occurs immediately after AWS Marketplace subscription +- Happens when clicking "Setup Product" button +- Persists after manual login attempts +- Occurs consistently across multiple browser sessions + +## Detailed Solution + + + +When you subscribe to SleakOps through AWS Marketplace, there's a specific authentication flow that needs to be completed: + +1. **AWS Marketplace Subscription**: You subscribe through AWS +2. **Redirect to SleakOps**: AWS redirects you to our platform with special tokens +3. **Account Linking**: Your AWS account gets linked to a SleakOps account +4. **Setup Completion**: The product setup process begins + +If this flow is interrupted, authentication issues can occur. + + + + + +The most common cause is browser cache conflicts. Follow these steps: + +1. **Clear SleakOps cookies**: + + - Go to your browser settings + - Find "Cookies and site data" + - Search for `sleakops.com` and delete all cookies + +2. **Clear AWS Marketplace cookies**: + + - Also clear cookies for `aws.amazon.com` + - This ensures a clean authentication state + +3. **Clear browser cache**: + + - Clear cached images and files + - This prevents old authentication tokens from interfering + +4. 
**Restart the process**: + - Go back to AWS Marketplace + - Click "Setup Product" again + + + + + +To isolate browser-related issues: + +1. **Open incognito/private window** +2. **Go to AWS Marketplace** +3. **Navigate to your SleakOps subscription** +4. **Click "Setup Product"** + +If this works, the issue is definitely browser cache/cookies related. + + + + + +If the automatic flow fails, you can manually link your accounts: + +1. **Create a SleakOps account** (if you don't have one): + + - Go to `https://app.sleakops.com/register` + - Use the same email address as your AWS account + +2. **Contact support** with: + + - Your AWS account ID + - Your SleakOps account email + - Your AWS Marketplace subscription details + +3. **We'll manually link** your accounts and activate your subscription + + + + + +Some browsers have stricter security policies that can interfere with cross-domain authentication: + +**Recommended browsers:** + +- Chrome (latest version) +- Firefox (latest version) +- Safari (if on macOS) + +**Browsers to avoid:** + +- Older versions of Internet Explorer +- Browsers with aggressive ad blockers +- Browsers with strict privacy settings + +**Browser settings to check:** + +- Disable ad blockers for AWS and SleakOps domains +- Allow third-party cookies temporarily +- Ensure JavaScript is enabled + + + + + +If you're still having issues, follow this complete recovery process: + +1. **Logout from all AWS services** +2. **Clear all browser data** (cookies, cache, local storage) +3. **Restart your browser** +4. **Login to AWS Console** +5. **Go to AWS Marketplace → Manage subscriptions** +6. **Find SleakOps subscription** +7. **Click "Setup Product"** +8. 
**Complete the authentication flow without interruption** + +If this still doesn't work, contact our support team with: + +- Your AWS account ID +- Screenshots of the error +- Browser and version information + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/aws-waf-application-load-balancer-protection.mdx b/docs/troubleshooting/aws-waf-application-load-balancer-protection.mdx new file mode 100644 index 000000000..a2f8e75fe --- /dev/null +++ b/docs/troubleshooting/aws-waf-application-load-balancer-protection.mdx @@ -0,0 +1,361 @@ +--- +sidebar_position: 15 +title: "AWS WAF Configuration for Application Load Balancer Protection" +description: "How to configure AWS WAF to protect your Application Load Balancer from malicious traffic and bot attacks" +date: "2024-12-19" +category: "provider" +tags: ["aws", "waf", "security", "load-balancer", "bot-protection"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# AWS WAF Configuration for Application Load Balancer Protection + +**Date:** December 19, 2024 +**Category:** Provider +**Tags:** AWS, WAF, Security, Load Balancer, Bot Protection + +## Problem Description + +**Context:** Users experiencing malicious bot traffic or fake registrations on their platform need to implement protection at the infrastructure level using AWS WAF (Web Application Firewall). 
+ +**Observed Symptoms:** + +- Fake user registrations appearing in the platform +- Suspicious bot traffic patterns +- Potential security threats from automated attacks +- Need for traffic filtering before it reaches the application + +**Relevant Configuration:** + +- Platform: SleakOps on AWS +- Load Balancer: Application Load Balancer (ALB) +- Service: AWS WAF v2 +- Protection needed: Bot detection and traffic filtering + +**Error Conditions:** + +- Malicious traffic bypassing application-level security +- Automated bot attacks on registration endpoints +- Need for proactive traffic filtering + +## Detailed Solution + + + +AWS WAF (Web Application Firewall) is a cloud-based firewall service that helps protect your web applications from common web exploits and bots. While SleakOps doesn't have native WAF integration yet, you can easily configure it manually to protect your Application Load Balancer. + +**Key Benefits:** + +- Blocks malicious traffic before it reaches your application +- Provides bot detection and mitigation +- Offers rate limiting capabilities +- Includes managed rule sets for common attack patterns + + + + + +Before configuring AWS WAF, ensure you have: + +**Required Access:** + +- AWS Console access with appropriate permissions +- WAF administrator permissions +- Ability to modify load balancer configurations + +**Information to Gather:** + +1. **Application Load Balancer ARN** + + ```bash + # Find your ALB in AWS Console or via CLI + aws elbv2 describe-load-balancers --query 'LoadBalancers[*].[LoadBalancerName,LoadBalancerArn]' + ``` + +2. **Application Details** + + - Primary domain name(s) + - Critical endpoints that need protection (e.g., /register, /login) + - Expected legitimate traffic patterns + +3. **Security Requirements** + + - Geographic restrictions needed + - Rate limiting requirements + - Bot detection sensitivity + + + + + +Create a new Web ACL to define your protection rules: + +1. 
**Navigate to AWS WAF Console** + + - Go to AWS Console → WAF & Shield + - Click "Create web ACL" + +2. **Configure Basic Settings** + + ``` + Name: sleakops-alb-protection + Description: WAF protection for SleakOps ALB + Resource type: Application Load Balancer + ``` + +3. **Add Resource** + + - Select your Application Load Balancer + - Choose the region where your ALB is located + +4. **Set Default Action** + + - Default action: "Allow" (recommended for initial setup) + - This allows traffic unless blocked by specific rules + + + + + +Add essential protection rules to your Web ACL: + +**1. AWS Managed Rule Sets (Recommended):** + +```yaml +# Core Rule Set - Protects against OWASP Top 10 +Rule Name: AWSManagedRulesCommonRuleSet +Priority: 1 +Action: Block + +# Known Bad Inputs - Protects against malicious requests +Rule Name: AWSManagedRulesKnownBadInputsRuleSet +Priority: 2 +Action: Block + +# Bot Control - Advanced bot detection +Rule Name: AWSManagedRulesBotControlRuleSet +Priority: 3 +Action: Block +``` + +**2. Custom Rate Limiting Rule:** + +```yaml +Rule Name: RateLimitingRule +Priority: 4 +Condition: Rate-based rule +Rate limit: 2000 requests per 5 minutes +Action: Block +Scope: All requests from single IP +``` + +**3. Geographic Restrictions (Optional):** + +```yaml +Rule Name: GeoBlockRule +Priority: 5 +Condition: Geographic match +Countries: [List of countries to block] +Action: Block +``` + + + + + +Connect your Web ACL to your Application Load Balancer: + +1. **In the Web ACL Configuration** + + - Go to "Associated AWS resources" + - Click "Add AWS resources" + +2. **Select Your ALB** + + - Resource type: Application Load Balancer + - Select your specific load balancer + - Click "Add" + +3. **Verify Association** + + ```bash + # Verify WAF is associated with ALB + aws wafv2 list-web-acls --scope REGIONAL --region us-east-1 + ``` + + + + + +Enable logging to monitor blocked requests and tune your rules: + +1. 
**Create CloudWatch Log Group** + + ```bash + # Create log group for WAF logs + aws logs create-log-group --log-group-name aws-waf-logs-sleakops + ``` + +2. **Enable WAF Logging** + + - In WAF Console, go to your Web ACL + - Click "Logging and metrics" + - Enable logging + - Choose your CloudWatch log group + +3. **Configure Log Analysis** + + ```json + { + "logDestinationConfigs": [ + "arn:aws:logs:us-east-1:123456789012:log-group:aws-waf-logs-sleakops" + ], + "logFormat": "json", + "managedByFirewallManager": false + } + ``` + + + + + +Test your WAF configuration to ensure it's working correctly: + +**1. Test Legitimate Traffic** + +```bash +# Test normal requests should pass through +curl -I https://your-domain.com/ + +# Should return normal response headers +``` + +**2. Test Rate Limiting** + +```bash +# Generate multiple requests to test rate limiting +for i in {1..100}; do + curl -s -o /dev/null -w "%{http_code}\n" https://your-domain.com/ +done + +# Should show 403 responses after hitting rate limit +``` + +**3. Monitor WAF Metrics** + +- Go to CloudWatch → Metrics → WAF +- Monitor blocked requests and allowed requests +- Check for false positives + +**4. Review Logs** + +```bash +# Query WAF logs for blocked requests +aws logs filter-log-events \ + --log-group-name aws-waf-logs-sleakops \ + --filter-pattern "\"action\":\"BLOCK\"" +``` + + + + + +Optimize your WAF rules based on real traffic patterns: + +**1. Review False Positives** + +- Monitor legitimate requests being blocked +- Adjust rule sensitivity if needed +- Add exception rules for specific endpoints + +**2. 
Custom Rules for Your Application** + +```yaml +# Block requests to admin endpoints from public IPs +Rule Name: AdminProtection +Priority: 10 +Condition: Path matches "/admin/*" AND NOT source IP in allowed range +Action: Block + +# Protect registration endpoint with stricter rate limiting +Rule Name: RegistrationProtection +Priority: 11 +Condition: Path matches "/register" +Rate limit: 5 requests per 5 minutes +Action: Block +``` + +**3. Monitor and Adjust** + +- Weekly review of blocked traffic patterns +- Adjust rate limits based on legitimate usage +- Update geographic restrictions as needed + + + + + +**AWS WAF Pricing (Approximate):** + +- Web ACL: $1.00 per month +- Rule evaluations: $0.60 per million requests +- Managed rule groups: $1.00-$10.00 per month each +- Bot Control managed rules: $10.00 per month + $0.80 per million requests + +**Cost Optimization Tips:** + +1. **Start with Essential Rules** + + - Begin with Core Rule Set and Known Bad Inputs + - Add Bot Control only if needed + - Monitor costs before adding additional managed rules + +2. **Rule Priority Optimization** + + ```yaml + # Order rules by likelihood of match (most specific first) + Priority 1: Geographic restrictions (if applicable) + Priority 2: Rate limiting + Priority 3: AWS Managed Rules + ``` + +3. 
**Regular Review** + + - Monthly cost analysis + - Remove unused rules + - Optimize rate limiting thresholds + + + + + +**Issue 1: Legitimate Traffic Being Blocked** + +```bash +# Check WAF logs for specific blocked requests +aws logs filter-log-events \ + --log-group-name aws-waf-logs-sleakops \ + --filter-pattern "\"action\":\"BLOCK\"" \ + --start-time $(date -d '1 hour ago' +%s)000 + +# Solution: Add exception rules or adjust rule sensitivity +``` + +**Issue 2: WAF Not Blocking Expected Traffic** + +- Verify Web ACL is associated with correct ALB +- Check rule order and priorities +- Ensure default action is properly configured + +**Issue 3: High False Positive Rate** + +- Review specific managed rule groups causing issues +- Implement count mode before block mode for new rules +- Add custom exception rules for legitimate patterns + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx b/docs/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx new file mode 100644 index 000000000..4de07a342 --- /dev/null +++ b/docs/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx @@ -0,0 +1,192 @@ +--- +sidebar_position: 3 +title: "Build Job Failing with New Relic pkg_resources Deprecation Warning" +description: "Solution for build failures caused by New Relic pkg_resources deprecation warnings in Python environments" +date: "2025-06-10" +category: "project" +tags: ["build", "python", "newrelic", "pkg_resources", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Build Job Failing with New Relic pkg_resources Deprecation Warning + +**Date:** June 10, 2025 +**Category:** Project +**Tags:** Build, Python, New Relic, pkg_resources, Deployment + +## Problem Description + +**Context:** Users experience build job failures when trying to deploy to 
production, with error messages related to New Relic's use of deprecated pkg_resources API, even when New Relic is not explicitly used in their application. + +**Observed Symptoms:** + +- Build jobs fail during production deployment +- Warning message about pkg_resources deprecation appears +- Error originates from New Relic configuration module +- Issue blocks critical production deployments + +**Relevant Configuration:** + +- Python version: 3.9 +- New Relic package location: `/usr/local/lib/python3.9/site-packages/newrelic/config.py` +- Setuptools version: Likely 81 or higher +- Build environment: SleakOps managed containers + +**Error Conditions:** + +- Error occurs during build process +- Appears in production deployment pipeline +- Warning references pkg_resources deprecation scheduled for 2025-11-30 +- Issue persists even when New Relic is not actively used + +## Detailed Solution + + + +This issue occurs because: + +1. **New Relic agent is installed** in the base Python environment used by SleakOps build containers +2. **pkg_resources is deprecated** in newer versions of setuptools (81+) +3. **New Relic hasn't updated** their code to use the newer importlib.metadata API +4. The warning is treated as an error in the build process + +Even if you don't use New Relic directly, it may be installed as part of the base container image for monitoring purposes. 
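The failure mode can be reproduced with nothing but Python's `warnings` machinery. The sketch below is self-contained — it simulates the New Relic import rather than performing it — and shows why a pipeline that promotes warnings to errors fails on the import, while a targeted filter (the same effect as the `PYTHONWARNINGS` workaround described in the solutions) lets the build proceed:

```python
import warnings

def import_newrelic_config():
    # Stand-in for "import newrelic.config", which on setuptools 81+
    # emits this same UserWarning via pkg_resources (simulated here so
    # the example runs without New Relic installed).
    warnings.warn("pkg_resources is deprecated as an API.", UserWarning)

# 1. A build that promotes warnings to errors fails on the import
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        import_newrelic_config()
        build_failed = False
    except UserWarning:
        build_failed = True  # this is what breaks the pipeline
print("strict build failed:", build_failed)

# 2. A targeted filter (what PYTHONWARNINGS="ignore::UserWarning:newrelic.config"
#    configures) suppresses only this warning, so the build proceeds
with warnings.catch_warnings():
    warnings.simplefilter("error")
    warnings.filterwarnings("ignore", message="pkg_resources is deprecated.*")
    import_newrelic_config()  # no exception raised
    filtered_ok = True
print("filtered build passed:", filtered_ok)
```

Because the filter matches only this message, other legitimate warnings in your build still surface.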
+ + + + + +Add this to your `requirements.txt` or `pyproject.toml` to pin setuptools to a version before the deprecation: + +**For requirements.txt:** + +```txt +setuptools<81 +``` + +**For pyproject.toml:** + +```toml +[build-system] +requires = ["setuptools<81", "wheel"] + +[project] +dependencies = [ + "setuptools<81", + # your other dependencies +] +``` + +**For Dockerfile:** + +```dockerfile +RUN pip install "setuptools<81" +``` + + + + + +You can suppress the specific warning by setting environment variables in your build configuration: + +**In your deployment configuration:** + +```yaml +environment: + PYTHONWARNINGS: "ignore::UserWarning:newrelic.config" +``` + +**Or suppress all UserWarnings (less recommended):** + +```yaml +environment: + PYTHONWARNINGS: "ignore::UserWarning" +``` + +**In Dockerfile:** + +```dockerfile +ENV PYTHONWARNINGS="ignore::UserWarning:newrelic.config" +``` + + + + + +If New Relic is installed but not used: + +**Option 1: Remove New Relic completely** + +```bash +pip uninstall newrelic +``` + +**Option 2: Update to latest New Relic version** + +```bash +pip install --upgrade newrelic +``` + +**Option 3: Add to requirements.txt with latest version** + +```txt +newrelic>=9.0.0 +``` + +Check the [New Relic Python agent releases](https://github.com/newrelic/newrelic-python-agent/releases) for the latest version that addresses this issue. + + + + + +Here's a complete Dockerfile approach that addresses the issue: + +```dockerfile +FROM python:3.9-slim + +# Pin setuptools to avoid pkg_resources deprecation warnings +RUN pip install --upgrade pip "setuptools<81" wheel + +# Set environment variable to suppress New Relic warnings if needed +ENV PYTHONWARNINGS="ignore::UserWarning:newrelic.config" + +# Copy and install requirements +COPY requirements.txt . +RUN pip install -r requirements.txt + +# Rest of your Dockerfile +COPY . . +CMD ["python", "app.py"] +``` + + + + + +For a permanent fix: + +1. 
**Monitor New Relic updates**: Keep track of when New Relic releases a version that uses `importlib.metadata` instead of `pkg_resources` + +2. **Update base images regularly**: Ensure your base container images are updated with compatible versions + +3. **Dependency management**: Use dependency management tools like `pip-tools` or `poetry` to lock versions: + +```bash +# Using pip-tools +pip-compile requirements.in +``` + +4. **CI/CD pipeline adjustment**: Add checks in your pipeline to catch these warnings early: + +```yaml +# In your CI/CD configuration: import the module that emits the warning +# (adjust the import to your application's entry module) +script: + - python -W error::UserWarning -c "import newrelic.config" && echo "No warnings" || echo "Warnings detected" +``` + + + +--- + +_This FAQ was automatically generated on June 10, 2025 based on a real user query._ diff --git a/docs/troubleshooting/build-pods-stuck-creation.mdx b/docs/troubleshooting/build-pods-stuck-creation.mdx new file mode 100644 index 000000000..7fdb7fb02 --- /dev/null +++ b/docs/troubleshooting/build-pods-stuck-creation.mdx @@ -0,0 +1,183 @@ +--- +sidebar_position: 3 +title: "Build Pods Stuck in Creation State" +description: "Solution for builds that get stuck with pods never completing creation" +date: "2025-03-22" +category: "project" +tags: ["build", "pods", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Build Pods Stuck in Creation State + +**Date:** March 22, 2025 +**Category:** Project +**Tags:** Build, Pods, Deployment, Troubleshooting + +## Problem Description + +**Context:** User experiences builds that remain stuck in an incomplete state, with pods that never finish creating and are not visible in the interface. 
+ +**Observed Symptoms:** + +- Builds appear "bugged" or stuck +- Pods are not visible in the interface +- Pods never complete their creation process +- Issue persists for extended periods (12+ hours) +- Build process appears frozen + +**Relevant Configuration:** + +- Platform: SleakOps +- Build system: Kubernetes-based builds +- Duration: Extended periods without resolution +- User interface: Pods not appearing in dashboard + +**Error Conditions:** + +- Builds initiated but never complete +- Pod creation process hangs indefinitely +- No visible progress in build status +- Issue requires manual intervention to resolve + +## Detailed Solution + + + +When builds get stuck with pods not appearing, this typically indicates: + +1. **Resource constraints**: Insufficient cluster resources to schedule pods +2. **Image pull issues**: Problems downloading container images +3. **Node scheduling problems**: Pods cannot be assigned to available nodes +4. **Persistent volume issues**: Storage-related problems preventing pod startup +5. **Network connectivity**: Issues with cluster networking + + + + + +To diagnose and resolve stuck build pods: + +**Step 1: Check cluster resources** + +```bash +# Check node resources +kubectl top nodes + +# Check pod status in the build namespace +kubectl get pods -n <build-namespace> --show-labels + +# Describe stuck pods for detailed information +kubectl describe pod <pod-name> -n <build-namespace> +``` + +**Step 2: Check for pending pods** + +```bash +# List all pending pods +kubectl get pods --all-namespaces --field-selector=status.phase=Pending + +# Check events for scheduling issues +kubectl get events --sort-by=.metadata.creationTimestamp +``` + + + + + +**Solution 1: Restart stuck builds** + +In SleakOps dashboard: + +1. Navigate to **Projects** → **Your Project** +2. Go to **Builds** section +3. Find the stuck build +4. Click **Cancel Build** +5. Start a new build + +**Solution 2: Clear build cache** + +If builds consistently get stuck: + +1. Go to **Project Settings** +2. 
Navigate to **Build Configuration** +3. Enable **"Clear build cache"** option +4. Trigger a new build + +**Solution 3: Check resource limits** + +Verify your project's resource allocation: + +```yaml +# Example resource configuration +resources: + requests: + memory: "512Mi" + cpu: "500m" + limits: + memory: "1Gi" + cpu: "1000m" +``` + + + + + +**Best practices to avoid stuck builds:** + +1. **Monitor resource usage**: + + - Regularly check cluster resource consumption + - Set appropriate resource requests and limits + - Monitor build queue length + +2. **Optimize build configuration**: + + - Use smaller base images when possible + - Implement proper build caching strategies + - Set reasonable build timeouts + +3. **Regular maintenance**: + - Periodically clean up old builds + - Monitor and clean unused Docker images + - Keep build environments updated + +**Example build configuration:** + +```yaml +build: + timeout: 30m + resources: + requests: + memory: 1Gi + cpu: 500m + cache: + enabled: true + ttl: 24h +``` + + + + + +Contact SleakOps support if: + +- Builds remain stuck after trying the above solutions +- Multiple projects are affected simultaneously +- The issue persists for more than 2 hours +- You see cluster-wide resource issues + +**Information to provide when contacting support:** + +- Project name and build ID +- Duration of the issue +- Recent changes to build configuration +- Screenshots of stuck build status +- Any error messages from build logs + + + +--- + +_This FAQ was automatically generated on March 22, 2025 based on a real user query._ diff --git a/docs/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx b/docs/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx new file mode 100644 index 000000000..ee2d6bb86 --- /dev/null +++ b/docs/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx @@ -0,0 +1,169 @@ +--- +sidebar_position: 3 +title: "Build Status Discrepancy Between Platform and Lens" +description: "Solution 
for build status showing as 'creating' in platform while Lens shows successful completion" +date: "2024-01-15" +category: "project" +tags: ["build", "deployment", "lens", "status", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Build Status Discrepancy Between Platform and Lens + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Build, Deployment, Lens, Status, Troubleshooting + +## Problem Description + +**Context:** User experiences a discrepancy between the build status shown in the SleakOps platform and what is observed in Lens (Kubernetes IDE). The platform shows the build as stuck in "creating" status while Lens indicates the build completed successfully. + +**Observed Symptoms:** + +- Build status stuck at "creating" in SleakOps platform +- Lens shows the build completed successfully +- Status synchronization issue between platform UI and actual Kubernetes state +- Build appears to be functioning despite platform status + +**Relevant Configuration:** + +- Environment: Production +- Platform: SleakOps +- Monitoring tool: Lens +- Build process: Appears to complete successfully in cluster + +**Error Conditions:** + +- Status discrepancy occurs during build process +- Platform UI does not reflect actual Kubernetes state +- Issue persists even after successful build completion +- May affect deployment workflow visibility + +## Detailed Solution + + + +To confirm the real status of your build: + +1. **Check Kubernetes resources directly:** + + ```bash + kubectl get pods -n <namespace> + kubectl get deployments -n <namespace> + kubectl describe deployment <deployment-name> -n <namespace> + ``` + +2. **In Lens:** + + - Navigate to Workloads → Deployments + - Check the status of your application + - Verify pod readiness and running state + +3. **Check build logs:** + ```bash + kubectl logs -f deployment/<deployment-name> -n <namespace> + ``` + + + + + +To resolve the status synchronization issue: + +1. 
**Refresh the SleakOps dashboard:** + + - Hard refresh the browser (Ctrl+F5 or Cmd+Shift+R) + - Clear browser cache if necessary + +2. **Trigger a status sync:** + + - Navigate to your project in SleakOps + - Click on "Refresh Status" if available + - Or trigger a minor configuration update to force a sync + +3. **Check platform logs:** + - Contact support to verify if there are any controller sync issues + - Platform may need to reconcile the actual cluster state + + + + + +This issue typically occurs due to: + +1. **Controller synchronization delays:** + + - Platform controller may be experiencing delays + - Network connectivity issues between platform and cluster + +2. **Webhook or event processing issues:** + + - Kubernetes events not properly reaching the platform + - Event processing queue backlog + +3. **Resource state caching:** + + - Platform may be showing cached state + - Actual cluster state has progressed beyond cached version + +4. **API rate limiting:** + - Platform may be rate-limited when querying cluster status + - Causes delayed status updates + + + + + +If your build is actually working (confirmed via Lens): + +1. **Continue with your workflow:** + + - The application is likely functioning correctly + - Platform status will eventually sync + +2. **Monitor application health:** + + ```bash + # Check application endpoints + kubectl get svc -n <namespace> + + # Test application connectivity + kubectl port-forward svc/<service-name> 8080:80 + ``` + +3. **Document the issue:** + - Take screenshots of both platform and Lens status + - Note the timestamp when the discrepancy was observed + - This helps the support team investigate the root cause + + + + + +To minimize status discrepancy issues: + +1. **Use multiple monitoring tools:** + + - Don't rely solely on platform UI + - Keep Lens or kubectl handy for verification + +2. **Set up monitoring alerts:** + + - Configure alerts based on actual pod/deployment status + - Use Prometheus/Grafana for independent monitoring + +3. 
**Regular platform updates:** + + - Ensure SleakOps platform is updated to latest version + - Updates often include sync improvements + +4. **Report status issues promptly:** + - Early reporting helps identify patterns + - Assists platform team in improving synchronization + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/celery-beat-duplicate-execution.mdx b/docs/troubleshooting/celery-beat-duplicate-execution.mdx new file mode 100644 index 000000000..8e1362c50 --- /dev/null +++ b/docs/troubleshooting/celery-beat-duplicate-execution.mdx @@ -0,0 +1,367 @@ +--- +sidebar_position: 3 +title: "Celery Beat Duplicate Task Execution" +description: "Solution for preventing Celery Beat tasks from running multiple times when scaling backend pods" +date: "2024-12-23" +category: "workload" +tags: ["celery", "cronjob", "scaling", "backend", "duplicate-tasks"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Celery Beat Duplicate Task Execution + +**Date:** December 23, 2024 +**Category:** Workload +**Tags:** Celery, Cronjob, Scaling, Backend, Duplicate Tasks + +## Problem Description + +**Context:** When running a backend application with multiple pods in Kubernetes, Celery Beat scheduled tasks execute multiple times - once for each running pod instance. 
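The multiplication is purely mechanical: each replica embeds its own scheduler, so one schedule entry fires once per replica. A toy simulation of the effect (no Celery required; the task names match the examples used later in this guide):

```python
# Toy model of the bug: every backend pod embeds its own Celery Beat,
# so each scheduler tick enqueues every scheduled task once per pod.
schedule = ["send-daily-report", "cleanup-old-data"]
executions = {task: 0 for task in schedule}

replicas = 4  # four backend pods, each running an embedded Beat scheduler

for _ in range(replicas):   # each pod's scheduler fires independently...
    for task in schedule:   # ...enqueueing every due task
        executions[task] += 1

print(executions)
# {'send-daily-report': 4, 'cleanup-old-data': 4}
```

A Kubernetes CronJob, by contrast, is a single cluster-level schedule, so the execution count stays at one no matter how many backend replicas run.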
+ +**Observed Symptoms:** + +- Scheduled Celery Beat tasks run multiple times (e.g., 4 times if there are 4 backend pods) +- Each pod instance runs its own Celery Beat scheduler +- Tasks that should run once are duplicated across all pod replicas +- Potential data inconsistency or resource waste due to duplicate executions + +**Relevant Configuration:** + +- Backend deployment: Multiple pods (e.g., 4 replicas) +- Task scheduler: Celery Beat integrated within backend pods +- Platform: SleakOps Kubernetes environment +- Workload type: Backend service with scheduled tasks + +**Error Conditions:** + +- Problem occurs when backend is scaled beyond 1 replica +- Each pod runs Celery Beat independently +- No coordination between pod instances for scheduled tasks +- Tasks execute N times where N = number of backend pod replicas + +## Detailed Solution + + + +The recommended approach is to stop using Celery Beat and migrate to Kubernetes CronJobs. This ensures tasks run only once regardless of backend pod count. + +**Benefits of CronJobs:** + +- Guaranteed single execution per schedule +- Native Kubernetes scheduling +- Better resource isolation +- Easier monitoring and debugging +- No dependency on backend pod scaling + + + + + +**Step 1: Identify Current Celery Beat Tasks** + +List all your current scheduled tasks in your Celery configuration: + +```python +# Example current celery beat configuration +from celery.schedules import crontab + +beat_schedule = { + 'send-daily-report': { + 'task': 'myapp.tasks.send_daily_report', + 'schedule': crontab(hour=9, minute=0), + }, + 'cleanup-old-data': { + 'task': 'myapp.tasks.cleanup_old_data', + 'schedule': crontab(hour=2, minute=0), + }, +} +``` + +**Step 2: Create CronJob Executions in SleakOps** + +For each Celery Beat task, create a separate CronJob execution: + +1. Go to your project in SleakOps +2. Navigate to **Executions** section +3. Click **Add Execution** +4. Select **CronJob** type +5. 
Configure the schedule and command + + + + + +**Example 1: Daily Report CronJob** + +```yaml +# SleakOps CronJob configuration +name: daily-report-cronjob +type: cronjob +schedule: "0 9 * * *" # Every day at 9:00 AM +image: your-backend-image:latest +command: ["python", "manage.py", "send_daily_report"] +resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "200m" +``` + +**Example 2: Data Cleanup CronJob** + +```yaml +# SleakOps CronJob configuration +name: cleanup-cronjob +type: cronjob +schedule: "0 2 * * *" # Every day at 2:00 AM +image: your-backend-image:latest +command: ["python", "manage.py", "cleanup_old_data"] +resources: + requests: + memory: "128Mi" + cpu: "50m" + limits: + memory: "256Mi" + cpu: "100m" +``` + +**Cron Schedule Format:** + +- `"0 9 * * *"` - Daily at 9:00 AM +- `"*/15 * * * *"` - Every 15 minutes +- `"0 */6 * * *"` - Every 6 hours +- `"0 0 * * 0"` - Weekly on Sunday at midnight + + + + + +After creating CronJobs, remove Celery Beat from your backend: + +**Step 1: Update Backend Configuration** + +```python +# Remove or comment out beat_schedule +# beat_schedule = { +# 'send-daily-report': { +# 'task': 'myapp.tasks.send_daily_report', +# 'schedule': crontab(hour=9, minute=0), +# }, +# } + +# Keep only the Celery app configuration for async tasks +app = Celery('myapp') +app.config_from_object('django.conf:settings', namespace='CELERY') +``` + +**Step 2: Update Deployment** + +Ensure your backend deployment no longer starts Celery Beat: + +```dockerfile +# Remove celery beat from your startup command +# OLD: CMD ["celery", "-A", "myapp", "beat", "--loglevel=info"] +# NEW: Only run the web server +CMD ["gunicorn", "myapp.wsgi:application"] +``` + + + + + +If you must continue using Celery Beat, here are alternative approaches: + +**Option 1: Dedicated Celery Beat Pod** + +Create a separate deployment just for Celery Beat with replica count = 1: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + 
name: celery-beat +spec: + replicas: 1 # Always keep this at 1 + selector: + matchLabels: + app: celery-beat + template: + metadata: + labels: + app: celery-beat + spec: + containers: + - name: celery-beat + image: your-backend-image:latest + command: ["celery", "-A", "myapp", "beat", "--loglevel=info"] +``` + +**Option 2: Leader Election (Complex)** + +Implement leader election so only one pod runs Celery Beat, but this adds complexity and is not recommended. + +**Why CronJobs are Better:** + +- Simpler architecture +- Native Kubernetes feature +- Better resource management +- Easier troubleshooting +- No single point of failure + + + + + +After migrating to CronJobs, monitor their execution to ensure everything works correctly: + +**Step 1: Check CronJob Status in SleakOps** + +1. Go to your project's **Executions** section +2. Verify that your CronJobs appear in the list +3. Check the **Last Run** and **Next Run** times +4. Monitor the **Status** column for any failures + +**Step 2: View CronJob Logs** + +```bash +# Check CronJob execution history +kubectl get cronjobs + +# View recent job executions +kubectl get jobs + +# Check logs of a specific job execution +kubectl logs job/daily-report-cronjob-<job-id> +``` + +**Step 3: Set Up Alerts (Optional)** + +Configure alerts for failed CronJob executions: + +```yaml +# Example alert configuration +apiVersion: v1 +kind: ConfigMap +metadata: + name: cronjob-alerts +data: + alert-rules.yaml: | + groups: + - name: cronjob.rules + rules: + - alert: CronJobFailed + expr: kube_job_status_failed > 0 + for: 0m + labels: + severity: warning + annotations: + summary: "CronJob {{ $labels.job_name }} failed" +``` + +**Step 4: Verify No Duplicate Executions** + +Monitor your application logs to confirm tasks are no longer running multiple times: + +```bash +# Check application logs for duplicate task executions +kubectl logs -l app=your-backend-app | grep "task_name" + +# Should see a single execution per scheduled time, not multiple +``` + + + + + +Use this checklist to verify 
your migration was successful: + +**Pre-Migration Checklist:** + +- [ ] Document all current Celery Beat tasks and their schedules +- [ ] Identify the commands needed to run each task +- [ ] Plan the migration schedule to avoid service disruption +- [ ] Prepare rollback plan if needed + +**Post-Migration Checklist:** + +- [ ] All CronJobs are created and visible in SleakOps +- [ ] CronJob schedules match the original Celery Beat schedules +- [ ] First execution of each CronJob completes successfully +- [ ] No duplicate task executions in application logs +- [ ] Celery Beat configuration removed from backend code +- [ ] Backend deployment no longer starts Celery Beat process +- [ ] Resource usage is optimized (no idle Celery Beat processes) + +**Testing Checklist:** + +- [ ] Scale backend pods up and down - verify tasks still run once +- [ ] Manually trigger a CronJob to test execution +- [ ] Verify CronJob failure handling and retry logic +- [ ] Check that scheduled tasks maintain expected timing +- [ ] Confirm database/external service interactions work correctly + + + + + +**Issue 1: CronJob Not Executing** + +```bash +# Check CronJob configuration +kubectl describe cronjob your-cronjob-name + +# Common causes: +# - Incorrect cron schedule format +# - Missing required environment variables +# - Image pull errors +# - Resource constraints +``` + +**Solution:** +- Verify cron schedule syntax using online cron validators +- Ensure all environment variables are properly configured +- Check that the container image is accessible +- Review resource requests and limits + +**Issue 2: CronJob Fails but Celery Task Would Succeed** + +Common differences when migrating from Celery Beat: + +```python +# Celery Beat runs in application context +# CronJob runs as separate container - ensure: + +# 1. Database connections are properly configured +DATABASES = { + 'default': { + 'ENGINE': 'django.db.backends.postgresql', + 'HOST': os.environ.get('DB_HOST'), + # ... 
other settings + } +} + +# 2. All required environment variables are available +# 3. Task can run independently without Celery worker context +``` + +**Issue 3: Different Timezone Behavior** + +```yaml +# Ensure consistent timezone in CronJob +spec: + schedule: "0 9 * * *" + timeZone: "UTC" # Explicitly set timezone + jobTemplate: + spec: + template: + spec: + containers: + - name: task + env: + - name: TZ + value: "UTC" +``` + + + +_This FAQ was automatically generated on December 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ci-cd-build-stage-failure.mdx b/docs/troubleshooting/ci-cd-build-stage-failure.mdx new file mode 100644 index 000000000..8b4985729 --- /dev/null +++ b/docs/troubleshooting/ci-cd-build-stage-failure.mdx @@ -0,0 +1,179 @@ +--- +sidebar_position: 3 +title: "CI/CD Build Stage Failure" +description: "Troubleshooting CI/CD pipeline build failures in specific environments" +date: "2024-10-10" +category: "project" +tags: ["ci-cd", "build", "pipeline", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CI/CD Build Stage Failure + +**Date:** October 10, 2024 +**Category:** Project +**Tags:** CI/CD, Build, Pipeline, Deployment, Troubleshooting + +## Problem Description + +**Context:** User experiences CI/CD pipeline build failures in the 'stage' environment while the same code works correctly in 'develop' environment and locally. The build job fails before reaching the compilation stage. 
+ +**Observed Symptoms:** + +- Build fails in 'stage' environment but works in 'develop' +- Local compilation and execution work correctly +- Error occurs before the build job reaches the cluster +- Pipeline fails at the build stage, preventing deployment +- Last successful commit continues running while new commits fail to build + +**Relevant Configuration:** + +- Project: byron-backoffice-service-dev-develop +- Environment: 'stage' (failing) vs 'develop' (working) +- Last successful commit: d419a838b44f1c11b22adf68f6cf984170def38f +- CI/CD pipeline configured through SleakOps + +**Error Conditions:** + +- Error occurs during CI/CD build stage +- Failure happens before compilation attempt +- Only affects 'stage' environment +- Prevents new deployments from being created + +## Detailed Solution + + + +When CI/CD builds fail in one environment but work in another, the issue is typically related to: + +1. **Environment-specific CI/CD configuration files** +2. **Different environment variables or secrets** +3. **Branch-specific pipeline configurations** +4. **Resource constraints in the target environment** + +Start by comparing the CI/CD configuration files between environments. + + + + + +To check if the CI/CD file is correctly configured: + +1. **Access your SleakOps project dashboard** +2. **Navigate to the CI/CD section** +3. **Compare the pipeline configuration** between 'develop' and 'stage' +4. **Verify the CI/CD file was copied correctly** from SleakOps + +**Common issues to check:** + +```yaml +# Check for environment-specific differences +stages: + - build + - test + - deploy + +variables: + ENVIRONMENT: "stage" # Make sure this matches + +build: + stage: build + script: + - echo "Building for $ENVIRONMENT" + # Verify build commands are identical +``` + + + + + +To trigger a fresh build: + +**Option 1: From SleakOps Dashboard** + +1. Go to your project in SleakOps +2. Navigate to the **Deployments** section +3. 
Click the **"Build"** button for the stage environment +4. Monitor the build logs for specific error messages + +**Option 2: From Git Repository** + +1. Make a small commit (like updating a comment) +2. Push to the branch that triggers the stage pipeline +3. Monitor the pipeline execution + + + + + +Since 'develop' works but 'stage' fails, compare these configurations: + +**Environment Variables:** + +- Check if all required environment variables are set for 'stage' +- Verify secrets and credentials are properly configured +- Ensure database connections and external service URLs are correct + +**Resource Allocation:** + +- Verify the 'stage' environment has sufficient resources +- Check if there are any resource quotas or limits +- Ensure the cluster has available capacity + +**Branch Configuration:** + +```yaml +# Check if pipeline triggers are correctly configured +only: + - develop # for develop environment + - stage # for stage environment +``` + + + + + +To get more detailed information about the failure: + +1. **Access the build logs** in your CI/CD platform (GitHub Actions, GitLab CI, etc.) +2. **Look for the exact error message** that occurs before compilation +3. **Check for common pre-build failures:** + - Docker image pull failures + - Missing dependencies or tools + - Permission issues + - Network connectivity problems + +**Common pre-build error patterns:** + +```bash +# Docker-related errors +Error: Failed to pull image +Error: Cannot connect to Docker daemon + +# Permission errors +Permission denied +Access denied + +# Network/connectivity errors +Connection timeout +DNS resolution failed +``` + + + + + +If you need immediate functionality while troubleshooting: + +1. **Keep the current working deployment** (commit d419a838b44f1c11b22adf68f6cf984170def38f) +2. **Create a hotfix branch** from the last working commit if urgent changes are needed +3. 
**Test fixes in 'develop' first** before applying to 'stage' + +This ensures your application remains functional while investigating the root cause. + + + +--- + +_This FAQ was automatically generated on October 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ci-cd-github-actions-not-triggering.mdx b/docs/troubleshooting/ci-cd-github-actions-not-triggering.mdx new file mode 100644 index 000000000..7b546d946 --- /dev/null +++ b/docs/troubleshooting/ci-cd-github-actions-not-triggering.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "GitHub Actions CI/CD Pipeline Not Triggering" +description: "Solution for when continuous integration builds stop working after push to repository" +date: "2024-01-15" +category: "project" +tags: ["github", "ci-cd", "deployment", "troubleshooting", "pipeline"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Actions CI/CD Pipeline Not Triggering + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** GitHub, CI/CD, Deployment, Troubleshooting, Pipeline + +## Problem Description + +**Context:** User experiences issues with continuous integration pipeline in SleakOps where GitHub Actions workflows stop triggering automatic builds and deployments after pushing to the development branch. 
+ +**Observed Symptoms:** + +- Push to development branch no longer triggers automatic builds +- Previously working CI/CD pipeline has stopped functioning +- No visible workflow execution in GitHub Actions +- Deployments are not being created automatically + +**Relevant Configuration:** + +- Repository: GitHub +- Branch: dev (development branch) +- Project: mx-simplee-web +- Pipeline: GitHub Actions with YAML configuration +- Platform: SleakOps integration + +**Error Conditions:** + +- Pipeline worked previously but stopped suddenly +- No apparent changes to YAML configuration file +- Push events are not triggering workflow execution +- No error messages visible in repository + +## Detailed Solution + + + +First, verify the current status of your GitHub Actions workflows: + +1. Go to your GitHub repository +2. Click on the **Actions** tab +3. Look for the latest workflow runs +4. Check if workflows are: + - Not being triggered at all + - Failing during execution + - Queued but not running + +This will help identify whether the issue is with triggering or execution. + + + + + +Even if you haven't changed the YAML file recently, verify its current state: + +1. Check the workflow file location: `.github/workflows/[workflow-name].yml` +2. Verify the trigger configuration: + +```yaml +on: + push: + branches: [dev, main] + pull_request: + branches: [dev, main] +``` + +3. Ensure the branch name matches exactly (case-sensitive) +4. Check for any syntax errors using a YAML validator + + + + + +Branch protection rules can prevent workflows from running: + +1. Go to **Settings** → **Branches** in your repository +2. Check if there are protection rules on your `dev` branch +3. Ensure "Restrict pushes that create files" is not blocking your workflow +4. Verify that required status checks aren't preventing execution + + + + + +Check if GitHub Actions has the necessary permissions: + +1. Go to **Settings** → **Actions** → **General** +2. 
Ensure "Actions permissions" is set to allow workflows +3. Check "Workflow permissions" - should be "Read and write permissions" +4. Verify that "Allow GitHub Actions to create and approve pull requests" is enabled if needed + + + + + +Check the integration between GitHub and SleakOps: + +1. In SleakOps dashboard, go to your project settings +2. Verify the connected repository URL is correct +3. Check that the target branch is set to `dev` +4. Ensure the webhook is still active in GitHub: + - Go to **Settings** → **Webhooks** in your repository + - Look for SleakOps webhook and verify it's active + - Check recent deliveries for any failed requests + + + + + +If the issue persists, try recreating the workflow: + +1. Create a new branch from `dev` +2. Make a small change to trigger the workflow +3. Create a simple test workflow: + +```yaml +name: Test CI +on: + push: + branches: [dev] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Test + run: echo "Workflow is working" +``` + +4. Push the changes and verify the workflow triggers +5. If successful, gradually add back your original workflow steps + + + + + +Based on common scenarios, try these solutions: + +1. **Re-push to trigger**: Make a small commit and push again +2. **Check quotas**: Ensure you haven't exceeded GitHub Actions minutes +3. **Restart workflows**: Cancel any stuck workflows and retry +4. **Update checkout action**: Use the latest version `actions/checkout@v4` +5. 
**Clear cache**: Delete workflow caches if they're corrupted + +```bash +# Force trigger workflow with empty commit +git commit --allow-empty -m "Trigger workflow" +git push origin dev +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx b/docs/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx new file mode 100644 index 000000000..1cde6ffa9 --- /dev/null +++ b/docs/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx @@ -0,0 +1,189 @@ +--- +sidebar_position: 3 +title: "CI/CD Pipeline Not Triggering on Branch Push" +description: "Troubleshooting guide for CI/CD pipelines that don't trigger automatic builds and deployments" +date: "2024-03-14" +category: "project" +tags: ["ci-cd", "pipeline", "git", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CI/CD Pipeline Not Triggering on Branch Push + +**Date:** March 14, 2024 +**Category:** Project +**Tags:** CI/CD, Pipeline, Git, Deployment, Troubleshooting + +## Problem Description + +**Context:** User pushes code changes to a development branch but the CI/CD pipeline doesn't automatically trigger to create new container images and deploy updates to the environment. 
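Branch-name matching is a frequent culprit here: pipeline triggers compare branch names as exact, case-sensitive strings. A minimal shell sketch of that comparison (branch names invented for illustration):

```shell
# Trigger matching is an exact, case-sensitive string comparison:
# "develop" and "Develop" are different branches to the pipeline.
configured="develop"   # branch named in the pipeline configuration
pushed="Develop"       # branch actually being pushed
if [ "$configured" = "$pushed" ]; then
  echo "pipeline would trigger"
else
  echo "no trigger: '$configured' does not match '$pushed'"
fi
```

If the mismatch case describes your setup, the target-branch check in the first solution step will surface it.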
+ +**Observed Symptoms:** + +- Code pushes to dev branch don't trigger pipeline execution +- New container images are not being built automatically +- Deployments don't update with the latest code changes +- Pipeline appears to be inactive or misconfigured + +**Relevant Configuration:** + +- Target branch: Development branch (e.g., `dev`, `develop`) +- Repository: Git-based source control +- Pipeline: SleakOps CI/CD integration +- Expected behavior: Automatic build and deployment on push + +**Error Conditions:** + +- Pipeline doesn't trigger after git push +- No build activity visible in project dashboard +- Deployment remains on previous version +- No error messages or pipeline logs generated + +## Detailed Solution + + + +First, ensure your project is configured to watch the correct branch: + +1. Navigate to **Project > Settings > General Config** +2. Check the **Target Branch** setting +3. Verify it matches the branch you're pushing to (e.g., `dev`, `develop`, `main`) +4. Save changes if you need to update the branch name + +**Common Issues:** + +- Project configured for `main` but pushing to `dev` +- Branch name mismatch (e.g., `develop` vs `development`) +- Case sensitivity issues in branch names + + + + + +Ensure your repository has a properly configured pipeline file: + +1. Go to **Project > Settings > Git Pipelines** +2. Review the YAML example provided +3. Create or update your pipeline file in your repository +4. 
Common file names: `.sleakops.yml`, `.sleakops/pipeline.yml` + +**Example Pipeline Configuration:** + +```yaml +version: "1" +pipeline: + triggers: + - branch: dev + on: push + stages: + - name: build + steps: + - name: build-image + action: docker-build + dockerfile: Dockerfile + - name: deploy + steps: + - name: deploy-to-dev + action: deploy + environment: development +``` + +**Key Points:** + +- Ensure the `branch` in triggers matches your target branch +- Include both `build` and `deploy` stages +- Verify the pipeline file is in the repository root or `.sleakops/` directory + + + + + +Set up the required API key for pipeline authentication: + +1. Go to **Settings > CLI** in your SleakOps dashboard +2. Generate or copy your API key +3. In your Git repository settings, add a new environment variable: + - **Name:** `SLEAKOPS_KEY` + - **Value:** Your API key from step 2 + - **Scope:** Available to pipeline/CI processes + +**For Different Git Providers:** + +**GitHub:** + +- Go to Repository > Settings > Secrets and variables > Actions +- Add new repository secret: `SLEAKOPS_KEY` + +**GitLab:** + +- Go to Project > Settings > CI/CD > Variables +- Add variable: `SLEAKOPS_KEY` (mark as protected if needed) + +**Bitbucket:** + +- Go to Repository > Repository settings > Pipelines > Repository variables +- Add variable: `SLEAKOPS_KEY` + + + + + +After completing the setup, verify everything is working: + +1. **Check Pipeline Status:** + + - Go to your project dashboard + - Look for pipeline activity in the **Deployments** or **Builds** section + +2. **Test with a Small Change:** + + - Make a minor change to your code + - Commit and push to your target branch + - Monitor the pipeline execution + +3. **Review Logs:** + + - Check pipeline logs for any error messages + - Verify build and deployment steps are executing + +4. 
**Common Verification Points:** + - Pipeline file is committed to the repository + - Branch names match exactly (case-sensitive) + - API key is valid and has necessary permissions + - Repository webhook is configured (usually automatic) + + + + + +If the pipeline still doesn't trigger: + +**Check Webhook Configuration:** + +- Verify your Git provider has the correct webhook URL +- Test webhook delivery in your Git provider's settings + +**Validate API Key Permissions:** + +- Ensure the API key has project deployment permissions +- Try regenerating the API key if it's old + +**Review Pipeline Syntax:** + +- Validate YAML syntax using an online YAML validator +- Check for indentation errors in the pipeline file + +**Contact Support:** +If issues persist, provide the following information: + +- Project name and ID +- Target branch name +- Recent commit SHA that should have triggered the pipeline +- Any error messages from the dashboard + + + +--- + +_This FAQ was automatically generated on March 14, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cicd-pip-to-pipx-installation.mdx b/docs/troubleshooting/cicd-pip-to-pipx-installation.mdx new file mode 100644 index 000000000..1ba53a42f --- /dev/null +++ b/docs/troubleshooting/cicd-pip-to-pipx-installation.mdx @@ -0,0 +1,188 @@ +--- +sidebar_position: 3 +title: "CI/CD Pipeline Error - SleakOps Installation Method" +description: "Fix for CI/CD pipeline failures when installing SleakOps CLI tool" +date: "2024-10-15" +category: "project" +tags: ["cicd", "pipeline", "installation", "pipx", "python"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CI/CD Pipeline Error - SleakOps Installation Method + +**Date:** October 15, 2024 +**Category:** Project +**Tags:** CI/CD, Pipeline, Installation, pipx, Python + +## Problem Description + +**Context:** CI/CD workflows fail when attempting to install the SleakOps CLI tool using the traditional `pip install` 
method in GitHub Actions or other CI/CD environments. + +**Observed Symptoms:** + +- CI/CD pipeline failures during SleakOps installation step +- Installation errors in workflow execution +- Build process interruption at dependency installation phase + +**Relevant Configuration:** + +- Installation method: `pip install sleakops` (incorrect) +- Environment: CI/CD pipeline (GitHub Actions, GitLab CI, etc.) +- Python package manager: pip vs pipx + +**Error Conditions:** + +- Error occurs during workflow execution +- Happens specifically at the SleakOps installation step +- May be related to dependency conflicts or isolation issues + +## Detailed Solution + + + +The solution is to replace `pip install sleakops` with `pipx install sleakops` in your CI/CD workflow configuration. + +**Before (incorrect):** + +```yaml +- name: Install SleakOps + run: pip install sleakops +``` + +**After (correct):** + +```yaml +- name: Install SleakOps + run: pipx install sleakops +``` + + + + + +`pipx` is the recommended method for installing Python CLI applications because: + +1. **Isolation**: Creates isolated environments for each application +2. **No conflicts**: Prevents dependency conflicts with other Python packages +3. **Clean installation**: Keeps your global Python environment clean +4. 
**Better for CLI tools**: Specifically designed for command-line applications + + + + + +Here's a complete example of how to properly install SleakOps in a GitHub Actions workflow: + +```yaml +name: Deploy with SleakOps + +on: + push: + branches: [main] + +jobs: + deploy: + runs-on: ubuntu-latest + + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: "3.9" + + - name: Install pipx + run: | + python -m pip install --upgrade pip + python -m pip install pipx + python -m pipx ensurepath + + - name: Install SleakOps + run: pipx install sleakops + + - name: Deploy application + run: | + sleakops deploy + env: + SLEAKOPS_TOKEN: ${{ secrets.SLEAKOPS_TOKEN }} +``` + + + + + +**GitLab CI (.gitlab-ci.yml):** + +```yaml +stages: + - deploy + +deploy: + stage: deploy + image: python:3.9 + before_script: + - pip install pipx + - pipx install sleakops + script: + - sleakops deploy +``` + +**Jenkins Pipeline:** + +```groovy +pipeline { + agent any + stages { + stage('Install SleakOps') { + steps { + sh 'pip install pipx' + sh 'pipx install sleakops' + } + } + stage('Deploy') { + steps { + sh 'sleakops deploy' + } + } + } +} +``` + + + + + +If you still encounter issues after switching to pipx: + +1. **Ensure pipx is installed:** + + ```bash + python -m pip install pipx + python -m pipx ensurepath + ``` + +2. **Check Python version compatibility:** + + - SleakOps requires Python 3.7 or higher + - Use `python --version` to verify + +3. **Clear pipx cache if needed:** + + ```bash + pipx uninstall sleakops + pipx install sleakops + ``` + +4. 
**Verify installation:** + ```bash + sleakops --version + ``` + + + +--- + +_This FAQ was automatically generated on October 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cloudfront-existing-s3-bucket.mdx b/docs/troubleshooting/cloudfront-existing-s3-bucket.mdx new file mode 100644 index 000000000..b7f0e601f --- /dev/null +++ b/docs/troubleshooting/cloudfront-existing-s3-bucket.mdx @@ -0,0 +1,170 @@ +--- +sidebar_position: 3 +title: "CloudFront CDN for Existing S3 Bucket" +description: "How to create a CloudFront distribution for an S3 bucket already created in SleakOps" +date: "2024-12-19" +category: "dependency" +tags: ["cloudfront", "s3", "cdn", "aws", "storage"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CloudFront CDN for Existing S3 Bucket + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** CloudFront, S3, CDN, AWS, Storage + +## Problem Description + +**Context:** User has an S3 bucket already created through SleakOps and wants to add a CloudFront CDN distribution to it. The CloudFront option is available during bucket creation but not visible for existing buckets. + +**Observed Symptoms:** + +- CloudFront option available only during S3 bucket creation +- No visible option to add CloudFront to existing buckets in SleakOps interface +- Need to enable CDN for already deployed S3 storage + +**Relevant Configuration:** + +- S3 bucket: Already created through SleakOps +- Platform: AWS +- Service: CloudFront CDN distribution needed +- Current limitation: Dependencies editing not enabled in SleakOps + +**Error Conditions:** + +- Cannot modify existing S3 bucket dependencies in SleakOps +- No direct option to add CloudFront after bucket creation +- User wants to avoid recreating the existing bucket + +## Detailed Solution + + + +Currently, SleakOps does not support editing dependencies for existing resources. 
This means you cannot add CloudFront to an S3 bucket after it has been created through the platform interface. + +The CloudFront option is only available during the initial S3 bucket creation process. + + + + + +You can create a CloudFront distribution manually using the AWS Console: + +1. **Access AWS Console** + + - Log into your AWS account + - Navigate to **CloudFront** service + +2. **Create Distribution** + + - Click **Create Distribution** + - Select **Web** distribution type + +3. **Configure Origin** + + - **Origin Domain**: Select your existing S3 bucket from the dropdown + - **Origin Path**: Leave empty (unless you want to serve from a specific folder) + - **Origin Access**: Choose **Origin Access Control (OAC)** for better security + +4. **Distribution Settings** + + - **Price Class**: Choose based on your geographic needs + - **Alternate Domain Names (CNAMEs)**: Add your custom domain if needed + - **SSL Certificate**: Use default or upload custom certificate + +5. **Deploy** + - Click **Create Distribution** + - Wait for deployment (usually 15-20 minutes) + + + + + +After creating the CloudFront distribution, update your S3 bucket policy to allow CloudFront access: + +1. **Get CloudFront Distribution ID** + + - Copy the Distribution ID from CloudFront console + +2. **Update Bucket Policy** + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "AllowCloudFrontServicePrincipal", + "Effect": "Allow", + "Principal": { + "Service": "cloudfront.amazonaws.com" + }, + "Action": "s3:GetObject", + "Resource": "arn:aws:s3:::your-bucket-name/*", + "Condition": { + "StringEquals": { + "AWS:SourceArn": "arn:aws:cloudfront::your-account-id:distribution/your-distribution-id" + } + } + } + ] + } + ``` + +3. **Apply Policy** + - Go to S3 bucket permissions + - Update bucket policy with the JSON above + - Replace placeholders with your actual values + + + + + +Verify your CloudFront distribution is working: + +1. 
**Get CloudFront URL** + + - Copy the Distribution Domain Name from CloudFront console + +2. **Test Access** + + ```bash + # Test with curl + curl https://your-distribution-domain.cloudfront.net/your-file.txt + + # Or test in browser + https://your-distribution-domain.cloudfront.net/your-file.txt + ``` + +3. **Verify Cache Headers** + ```bash + curl -I https://your-distribution-domain.cloudfront.net/your-file.txt + ``` + + + + + +If manual setup is too complex, you can recreate the S3 bucket with CloudFront: + +1. **Backup Data** + + - Download all files from existing bucket + - Note current bucket configuration + +2. **Create New S3 Dependency** + + - In SleakOps, create new S3 bucket + - Enable CloudFront option during creation + - Upload your backed-up data + +3. **Update Application** + - Update your application to use new bucket + - Test connectivity and functionality + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-addons-after-migration.mdx b/docs/troubleshooting/cluster-addons-after-migration.mdx new file mode 100644 index 000000000..94d41214f --- /dev/null +++ b/docs/troubleshooting/cluster-addons-after-migration.mdx @@ -0,0 +1,178 @@ +--- +sidebar_position: 3 +title: "Cluster Addons Missing After Migration" +description: "How to locate and restore cluster addons after platform migration" +date: "2024-10-15" +category: "cluster" +tags: ["migration", "addons", "loki", "grafana", "monitoring"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cluster Addons Missing After Migration + +**Date:** October 15, 2024 +**Category:** Cluster +**Tags:** Migration, Addons, Loki, Grafana, Monitoring + +## Problem Description + +**Context:** After a cluster migration in SleakOps, users cannot locate previously installed addons such as Loki, Grafana, and other monitoring tools in the new interface. 
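While waiting for the interface to catch up, the command line gives a ground-truth view of what is actually running. This small script only prints the kubectl checks to run against the migrated cluster; the `monitoring` and `logging` namespaces are common defaults for Grafana and Loki installs, not guaranteed names:

```shell
# Prints a post-migration checklist; run the emitted commands once
# kubectl points at the migrated cluster. Namespace names are
# assumptions -- adjust them to your installation.
for ns in monitoring logging; do
  echo "kubectl get pods -n $ns"
  echo "kubectl get events -n $ns --sort-by=.lastTimestamp"
done
```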
+ +**Observed Symptoms:** + +- Cluster addons (Loki, Grafana, etc.) are not visible in the new interface +- Previously configured monitoring tools appear to be missing +- Users cannot access monitoring dashboards that were available before migration +- Uncertainty about addon status after platform migration + +**Relevant Configuration:** + +- Platform: SleakOps with new interface +- Affected addons: Loki, Grafana, and other monitoring tools +- Migration context: Recent cluster migration performed +- Interface: Updated/new SleakOps interface + +**Error Conditions:** + +- Addons not visible immediately after migration +- Occurs when accessing the new interface post-migration +- Affects monitoring and logging capabilities +- May impact operational visibility + +## Detailed Solution + + + +During cluster migrations, SleakOps temporarily deactivates addons to ensure: + +1. **Data integrity**: Prevents data corruption during the migration process +2. **Resource management**: Avoids conflicts between old and new cluster configurations +3. **Clean migration**: Ensures addons are properly reconfigured for the new environment +4. **State consistency**: Maintains consistent addon states across the migration + +This is a standard procedure and addons are reactivated once the migration is complete. + + + + + +To find your addons in the updated SleakOps interface: + +1. **Navigate to Cluster Management**: + + - Go to your cluster dashboard + - Look for the "Addons" or "Extensions" section + +2. **Check the Monitoring Section**: + + - Access the "Monitoring" tab + - Look for Grafana, Loki, and other monitoring tools + +3. **Verify Addon Status**: + + - Check if addons show as "Active" or "Pending" + - Some addons may need a few minutes to fully initialize + +4. **Access Grafana Dashboard**: + ``` + # Typical Grafana access pattern + https://grafana.[your-cluster-domain] + ``` + + + + + +If addons are still not visible: + +1. 
**Wait for automatic reactivation**: + + - Most addons reactivate automatically within 10-15 minutes + - Check the cluster status for any pending operations + +2. **Manual reactivation if needed**: + + - Go to Cluster Settings → Addons + - Toggle off and on any addons that appear inactive + - Save the configuration + +3. **Verify addon health**: + + ```bash + # Check addon pods status + kubectl get pods -n monitoring + kubectl get pods -n logging + ``` + +4. **Contact support if issues persist**: + - If addons don't appear after 30 minutes + - If you encounter errors during reactivation + + + + + +After addons are visible, verify they're working correctly: + +1. **Grafana Dashboard Access**: + + - Log into Grafana using your SleakOps credentials + - Verify dashboards are displaying data + - Check that data sources are connected + +2. **Loki Log Aggregation**: + + - Verify logs are being collected + - Check log retention policies + - Test log queries in Grafana + +3. **Monitoring Alerts**: + + - Verify alert rules are active + - Test notification channels + - Check alert history + +4. **Performance Metrics**: + - Confirm metrics collection is working + - Verify historical data availability + - Check metric retention settings + + + + + +If addons remain unavailable: + +1. **Check cluster resources**: + + ```bash + kubectl get nodes + kubectl top nodes + ``` + +2. **Verify namespace status**: + + ```bash + kubectl get namespaces + kubectl get pods --all-namespaces + ``` + +3. **Review addon logs**: + + ```bash + kubectl logs -n monitoring deployment/grafana + kubectl logs -n logging deployment/loki + ``` + +4. 
**Common solutions**: + - Restart addon deployments + - Check for resource constraints + - Verify network policies + - Review storage class availability + + + +--- + +_This FAQ was automatically generated on October 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx b/docs/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx new file mode 100644 index 000000000..a47975a60 --- /dev/null +++ b/docs/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx @@ -0,0 +1,231 @@ +--- +sidebar_position: 3 +title: "Cluster Automatic Shutdown and Startup Issues" +description: "Troubleshooting problems with automatic cluster shutdown/startup causing API failures" +date: "2025-02-14" +category: "cluster" +tags: ["cluster", "automation", "shutdown", "startup", "step-functions", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cluster Automatic Shutdown and Startup Issues + +**Date:** February 14, 2025 +**Category:** Cluster +**Tags:** Cluster, Automation, Shutdown, Startup, Step Functions, AWS + +## Problem Description + +**Context:** SleakOps clusters configured with automatic shutdown/startup schedules may experience issues where the cluster fails to start properly, causing API failures and application downtime. 
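When this happens, recovery hinges on quickly finding the right state machine. The snippet below sketches how the startup machine is told apart from its shutdown counterpart by the `-up-sfn-` naming pattern the solution steps mention; the names here are invented, and in practice the list would come from `aws stepfunctions list-state-machines`:

```shell
# Given state machine names (hypothetical examples), the startup machine
# follows the "-up-sfn-" pattern; its shutdown twin uses "-down-sfn-".
names="dev-cluster-up-sfn-x1
dev-cluster-down-sfn-x1"
startup=$(printf '%s\n' "$names" | grep -- '-up-sfn-')
echo "$startup"
# The matching machine is then executed manually, e.g.:
#   aws stepfunctions start-execution --state-machine-arn <its ARN>
```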
+ +**Observed Symptoms:** + +- Applications showing connection errors or timeouts +- API endpoints returning error responses +- Backend services failing to respond to health checks +- All applications in the cluster appearing as unavailable +- Issues occurring after scheduled cluster shutdown periods + +**Relevant Configuration:** + +- Cluster type: Development/staging environments +- Automatic shutdown: Configured for nighttime hours +- Automatic startup: Configured for business hours +- AWS Step Functions: Used for cluster lifecycle management +- Region: us-east-1 (typically) + +**Error Conditions:** + +- Errors appear in the morning after automatic startup +- Cluster appears to be running but applications are not accessible +- Step Function execution may have failed or completed with errors +- Manual intervention required to restore service + +## Detailed Solution + + + +If you're experiencing this issue right now, you can resolve it by manually triggering the cluster startup: + +1. **Access AWS Console** in your development account +2. **Navigate to Step Functions** service +3. **Find the cluster startup Step Function** (usually named with pattern: `*-up-sfn-*`) +4. **Execute the Step Function** manually +5. **Wait for completion** (typically 5-10 minutes) +6. **Verify applications** are accessible again + +**Direct link format:** + +``` +https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/[YOUR_STEP_FUNCTION_ARN] +``` + + + + + +This issue typically occurs due to: + +1. **Step Function execution failures**: The automatic startup process encounters errors +2. **Timing issues**: Dependencies between services during startup +3. **Resource constraints**: Insufficient resources during cluster initialization +4. **Network connectivity**: Temporary network issues during startup +5. 
**Configuration drift**: Changes in cluster configuration affecting automation + +**Common failure points:** + +- Node group scaling issues +- Pod scheduling problems +- Service discovery delays +- Load balancer health check failures + + + + + +To prevent this issue from recurring: + +**1. Monitor Step Function executions:** + +```bash +# Check recent executions +aws stepfunctions list-executions --state-machine-arn [YOUR_ARN] --max-items 10 +``` + +**2. Set up CloudWatch alarms:** + +- Step Function execution failures +- Cluster startup duration exceeding thresholds +- Application health check failures + +**3. Implement retry mechanisms:** + +- Configure Step Functions with retry logic +- Add exponential backoff for failed steps +- Include manual approval steps for critical failures + +**4. Health check improvements:** + +- Extend health check timeout periods +- Add dependency checks between services +- Implement graceful startup sequences + + + + + +**Step 1: Check Step Function status** + +```bash +# Get execution details +aws stepfunctions describe-execution --execution-arn [EXECUTION_ARN] +``` + +**Step 2: Verify cluster status** + +```bash +# Check cluster status +kubectl get nodes +kubectl get pods --all-namespaces +``` + +**Step 3: Check application logs** + +```bash +# Check pod logs for errors +kubectl logs -f deployment/[YOUR_APP] -n [NAMESPACE] +``` + +**Step 4: Verify service connectivity** + +```bash +# Test internal service connectivity +kubectl exec -it [POD_NAME] -- curl http://[SERVICE_NAME]:8080/health +``` + +**Step 5: Check ingress and load balancer** + +```bash +# Verify ingress status +kubectl get ingress +kubectl describe ingress [INGRESS_NAME] +``` + + + + + +**Startup sequence optimization:** + +1. **Staggered startup**: Don't start all services simultaneously +2. **Dependency ordering**: Start databases before applications +3. **Health check delays**: Allow sufficient time for services to initialize +4. 
**Resource reservations**: Ensure adequate CPU/memory during startup + +**Step Function configuration:** + +```json +{ + "Comment": "Cluster startup with retry logic", + "StartAt": "StartCluster", + "States": { + "StartCluster": { + "Type": "Task", + "Resource": "arn:aws:states:::aws-sdk:eks:updateCluster", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 30, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], + "Next": "WaitForCluster" + } + } +} +``` + +**Monitoring configuration:** + +- Set up alerts for failed executions +- Monitor cluster resource utilization +- Track application startup times +- Log all automation events + + + + + +If the automatic shutdown/startup continues to cause issues: + +**Option 1: Adjust shutdown/startup times** + +- Extend startup time before business hours +- Add buffer time for complete initialization +- Stagger shutdown of different services + +**Option 2: Implement health checks** + +- Add comprehensive health checks before marking cluster as ready +- Include application-level health verification +- Implement automatic rollback on health check failures + +**Option 3: Use cluster autoscaling** + +- Configure cluster autoscaler for automatic scaling +- Use node auto-provisioning for cost optimization +- Implement pod disruption budgets + +**Option 4: Consider always-on for critical environments** + +- Keep production-like environments always running +- Use cost optimization through right-sizing instead of shutdown +- Implement resource quotas and limits + + + +--- + +_This FAQ was automatically generated on February 14, 2025 based on a real user query._ diff --git a/docs/troubleshooting/cluster-aws-vcpu-quota-limit.mdx b/docs/troubleshooting/cluster-aws-vcpu-quota-limit.mdx new file mode 100644 index 000000000..25ccb3a73 --- /dev/null +++ b/docs/troubleshooting/cluster-aws-vcpu-quota-limit.mdx @@ -0,0 +1,200 @@ +--- +sidebar_position: 3 +title: "AWS vCPU Quota Limit Error with Karpenter" +description: 
"Solution for VcpuLimitExceeded error when using GPU instances in EKS clusters" +date: "2024-04-24" +category: "cluster" +tags: ["aws", "karpenter", "quota", "gpu", "instances", "eks"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# AWS vCPU Quota Limit Error with Karpenter + +**Date:** April 24, 2024 +**Category:** Cluster +**Tags:** AWS, Karpenter, Quota, GPU, Instances, EKS + +## Problem Description + +**Context:** User attempts to configure a nodepool with GPU instances (g4ad.xlarge) for high-performance computing workloads but encounters vCPU quota limitations when Karpenter tries to provision the instances. + +**Observed Symptoms:** + +- Karpenter fails to launch NodeClaim with "VcpuLimitExceeded" error +- Error message: "You have requested more vCPU capacity than your current vCPU limit of 0 allows" +- GPU instances (g4ad.xlarge) cannot be provisioned +- Standard instances (c7a.large, c7a.xlarge) work correctly +- Nodepool configuration appears correct but nodes are not created + +**Relevant Configuration:** + +- Instance types: g4ad.xlarge (GPU instances) +- Karpenter NodeClaim provisioning +- AWS Service Quota initially set to 0 for GPU instance families +- Nodepool selector configuration using node.kubernetes.io/instance-type + +**Error Conditions:** + +- Error occurs during Karpenter node provisioning +- Specific to GPU instance types (g4ad family) +- AWS Service Quota blocks instance creation +- Standard compute instances work without issues + +## Detailed Solution + + + +AWS implements Service Quotas to control resource usage across different instance families. 
GPU instances have separate quotas from standard compute instances: + +- **Standard instances** (t3, c5, m5, etc.): Usually have default quotas +- **GPU instances** (g4, p3, p4, etc.): Often start with 0 quota for security +- **Specialized instances**: May require explicit quota requests + +The error "VcpuLimitExceeded" indicates your account doesn't have sufficient quota for the requested instance type. + + + + + +To request a quota increase for GPU instances: + +1. **Access AWS Service Quotas Console**: + + - Go to [AWS Service Quotas Console](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-FD8E9B9A) + - Navigate to **Amazon Elastic Compute Cloud (Amazon EC2)** + +2. **Find the correct quota**: + + - Search for "Running On-Demand G and VT instances" + - This quota covers g4ad.xlarge instances + +3. **Request increase**: + + - Click "Request quota increase" + - Start with a conservative number (e.g., 32-64 vCPUs) + - Provide business justification + +4. 
**For NVIDIA instances**, there's a separate quota: + - "Running On-Demand P instances" + - Required for p3, p4 instance families + + + + + +Calculate the vCPUs needed based on your instance requirements: + +```yaml +# Example calculation for g4ad.xlarge +Instance: g4ad.xlarge +vCPUs per instance: 4 +Desired instances: 10 +Total vCPUs needed: 40 +# Request quota: 64 vCPUs (with buffer) +``` + +**Common GPU instance vCPU counts**: + +- g4ad.xlarge: 4 vCPUs +- g4ad.2xlarge: 8 vCPUs +- g4ad.4xlarge: 16 vCPUs +- g4dn.xlarge: 4 vCPUs +- p3.2xlarge: 8 vCPUs + + + + + +Ensure your nodepool is correctly configured for GPU instances: + +```yaml +# Nodepool configuration example +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: gpu-nodepool +spec: + template: + spec: + requirements: + - key: node.kubernetes.io/instance-type + operator: In + values: + - g4ad.xlarge + - g4ad.2xlarge + - key: karpenter.sh/capacity-type + operator: In + values: + - on-demand # GPU instances work better with on-demand + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: gpu-nodeclass +``` + +**Important considerations**: + +- Use `on-demand` capacity type for GPU instances +- Ensure AMI supports GPU drivers +- Configure appropriate taints/tolerations for GPU workloads + + + + + +**Monitor quota usage**: + +```bash +# Check current quota usage +aws service-quotas get-service-quota \ + --service-code ec2 \ + --quota-code L-FD8E9B9A + +# Monitor Karpenter logs +kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter +``` + +**Common issues and solutions**: + +1. **Quota approved but still failing**: + + - Wait 15-30 minutes for quota to propagate + - Try different availability zones + +2. **Instance not available**: + + - Check instance availability in your region + - Consider alternative instance types + +3. 
**AMI compatibility**: + - Ensure AMI supports GPU drivers + - Use EKS-optimized AMI with GPU support + + + + + +**Quota management**: + +- Request quotas proactively before deployment +- Start with conservative numbers and increase as needed +- Monitor usage to avoid unexpected limits + +**Cost optimization**: + +- Use Spot instances for non-critical GPU workloads +- Implement proper node scaling policies +- Consider mixed instance types in nodepools + +**Configuration management**: + +- Use Infrastructure as Code (Terraform) for quota requests +- Document quota requirements in deployment guides +- Set up monitoring for quota utilization + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-connection-troubleshooting.mdx b/docs/troubleshooting/cluster-connection-troubleshooting.mdx new file mode 100644 index 000000000..a513d783e --- /dev/null +++ b/docs/troubleshooting/cluster-connection-troubleshooting.mdx @@ -0,0 +1,230 @@ +--- +sidebar_position: 3 +title: "Cluster Connection Issues" +description: "Troubleshooting guide for cluster connectivity problems" +date: "2024-12-19" +category: "cluster" +tags: ["connection", "kubectl", "aws-sdk", "troubleshooting", "network"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cluster Connection Issues + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Connection, kubectl, AWS SDK, Troubleshooting, Network + +## Problem Description + +**Context:** User is unable to connect to a Kubernetes cluster through SleakOps platform, despite the cluster appearing to be running normally. 
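When connection attempts hang without a clear error, bounding each attempt with an explicit time limit makes the failure mode visible instead of silent. A minimal bash sketch — the `with_timeout` helper and the 10-second limit are illustrative choices, not part of the SleakOps platform:

```shell
# Illustrative helper: run any client command under a hard time limit,
# so an unreachable cluster endpoint fails fast instead of hanging.
with_timeout() {
  if timeout 10 "$@"; then   # 10s is an arbitrary limit; tune to your network
    echo "ok"
  else
    echo "failed or timed out (exit $?)"
  fi
}

# Example (assumes kubectl is installed and configured):
#   with_timeout kubectl get nodes
```

`timeout` ships with GNU coreutils and exits with status 124 when the command is killed for running too long, which distinguishes a hang from an immediate authentication error.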
+ +**Observed Symptoms:** + +- Unable to establish connection to the cluster +- Connection attempts fail without clear error messages +- Cluster status appears normal on the platform +- Issue persists across multiple connection attempts + +**Relevant Configuration:** + +- Platform: SleakOps Kubernetes cluster +- Required tools: kubectl, AWS SDK, potential VPN client +- Network: Variable (different internet connections) +- Local environment: User's local machine + +**Error Conditions:** + +- Connection failures occur during cluster access attempts +- Issue may be related to outdated local dependencies +- Network connectivity may be a contributing factor +- Problem persists despite cluster being operational + +## Detailed Solution + + + +Ensure all required tools are updated to their latest versions: + +**kubectl:** + +```bash +# Check current version +kubectl version --client + +# Update kubectl (Linux/macOS) +curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + +# Update kubectl (Windows) +choco upgrade kubernetes-cli +``` + +**AWS CLI:** + +```bash +# Check current version +aws --version + +# Update AWS CLI +pip install --upgrade awscli +# or +curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +unzip awscliv2.zip +sudo ./aws/install --update +``` + +**Pritunl (if using VPN):** + +- Download the latest version from the official Pritunl website +- Uninstall the old version before installing the new one + + + + + +Perform these network diagnostics: + +**1. Test basic connectivity:** + +```bash +# Test DNS resolution +nslookup your-cluster-endpoint.eks.amazonaws.com + +# Test port connectivity +telnet your-cluster-endpoint.eks.amazonaws.com 443 +``` + +**2. 
Check firewall and proxy settings:** + +```bash +# Check if behind corporate firewall +curl -I https://your-cluster-endpoint.eks.amazonaws.com + +# Test with different network (mobile hotspot) +# Switch to mobile data and retry connection +``` + +**3. VPN connection verification:** + +- Ensure VPN is connected if required +- Try disconnecting/reconnecting VPN +- Check VPN client logs for errors + + + + + +Update your local cluster configuration: + +**1. Re-download cluster config:** + +```bash +# For AWS EKS clusters +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verify config +kubectl config current-context +kubectl config view +``` + +**2. Test cluster connectivity:** + +```bash +# Test basic cluster access +kubectl cluster-info +kubectl get nodes +kubectl get pods --all-namespaces +``` + +**3. Clear and regenerate credentials:** + +```bash +# Clear AWS credentials cache +rm -rf ~/.aws/cli/cache/ + +# Re-authenticate if needed +aws configure +``` + + + + + +If standard connection fails, try these alternatives: + +**1. Use SleakOps Web Terminal:** + +- Access cluster through the SleakOps web interface +- Use the built-in terminal feature +- This bypasses local configuration issues + +**2. Port forwarding for specific services:** + +```bash +# Forward specific service ports +kubectl port-forward service/your-service 8080:80 +``` + +**3. 
**Temporary cluster access:**

```bash
# Generate a temporary kubeconfig entry
kubectl config set-cluster temp-cluster --server=https://your-endpoint
kubectl config set-context temp-context --cluster=temp-cluster
kubectl config use-context temp-context
```

Follow this step-by-step diagnosis process:

**Step 1: Environment Check**

```bash
# Check all tool versions
kubectl version --client
aws --version
helm version  # only if Helm is used
```

**Step 2: Connectivity Test**

```bash
# Test cluster endpoint
curl -k https://your-cluster-endpoint.eks.amazonaws.com/version
```

**Step 3: Authentication Verification**

```bash
# Verify AWS credentials
aws sts get-caller-identity

# Test cluster authentication
kubectl auth can-i get pods
```

**Step 4: Network Analysis**

- Try from a different network (mobile data)
- Check corporate firewall rules
- Verify VPN connectivity if required

**Step 5: Configuration Validation**

```bash
# Review the active kubeconfig (kubectl has no "config validate" subcommand)
kubectl config view --minify

# Check contexts
kubectl config get-contexts
```

---

_This FAQ was automatically generated on December 19, 2024 based on a real user query._

diff --git a/docs/troubleshooting/cluster-critical-addons-node-failure.mdx b/docs/troubleshooting/cluster-critical-addons-node-failure.mdx
new file mode 100644
index 000000000..e3bf9591b
--- /dev/null
+++ b/docs/troubleshooting/cluster-critical-addons-node-failure.mdx
@@ -0,0 +1,182 @@
---
sidebar_position: 3
title: "Critical Addons Node Failure in Production"
description: "Solution for CriticalAddonsOnly node failures causing production downtime"
date: "2024-01-15"
category: "cluster"
tags:
  ["eks", "critical-addons", "high-availability", "production", "autoscaling"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Critical Addons Node Failure in Production

**Date:** January 15, 2024
**Category:** Cluster
**Tags:** EKS, Critical Addons, High
Availability, Production, AutoScaling

## Problem Description

**Context:** Production EKS cluster experiences downtime due to a missing CriticalAddonsOnly node, causing system failures and service unavailability.

**Observed Symptoms:**

- Production systems are down
- CriticalAddonsOnly node is missing from the cluster
- Critical Kubernetes addons cannot be scheduled
- Service disruption affecting end users

**Relevant Configuration:**

- Environment: Production EKS cluster
- Node type: CriticalAddonsOnly dedicated node
- Current cost: ~$10/month
- High availability cost: ~$50/month

**Error Conditions:**

- Single point of failure in critical addons scheduling
- No backup nodes available for critical system components
- AutoScaling group lacks instance type diversity

## Detailed Solution

For immediate resolution without additional costs:

1. **Access AWS Console**

   - Navigate to EC2 → Auto Scaling Groups
   - Find your cluster's CriticalAddons AutoScaling Group

2. **Edit Launch Template**

   ```text
   # Example instance types to add
   - t3.medium
   - t3.large
   - m5.large
   - m5.xlarge
   ```

3. **Update AutoScaling Group**

   - Go to the "Instance Types" section
   - Add multiple compatible instance types
   - This provides fallback options when the primary type is unavailable

4. **Trigger Node Replacement**
   ```bash
   # Force new node creation (replace <node-name> with the node to cycle)
   kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
   kubectl delete node <node-name>
   ```

For production environments, implement high availability:

1. **Enable HA in SleakOps Dashboard**

   - Go to Cluster Settings
   - Navigate to the "Critical Addons" section
   - Enable the "High Availability" option
   - Cost increase: ~$10 → $50/month

2. **Benefits of HA Setup**

   - Multiple CriticalAddons nodes across AZs
   - Automatic failover capabilities
   - Zero downtime for critical components
   - Production-grade reliability

3.
**Configuration Example** + ```yaml + critical_addons: + high_availability: true + min_nodes: 2 + max_nodes: 3 + instance_types: ["t3.medium", "t3.large", "m5.large"] + availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"] + ``` + + + + + +**For Production Clusters:** + +1. **Always Enable High Availability** + + - Critical for production workloads + - Prevents single points of failure + - Minimal cost increase for maximum reliability + +2. **Instance Type Diversity** + + ```yaml + # Good practice: multiple instance types + instance_types: + - "t3.medium" # Primary choice + - "t3.large" # Fallback option + - "m5.large" # Alternative family + - "m5.xlarge" # Larger fallback + ``` + +3. **Multi-AZ Distribution** + + - Spread nodes across availability zones + - Protects against zone-level failures + - Ensures addon availability during outages + +4. **Monitoring Setup** + + ```bash + # Monitor critical addon pods + kubectl get pods -n kube-system -l k8s-app=critical-addon + + # Check node readiness + kubectl get nodes -l node-role.kubernetes.io/critical-addons + ``` + + + + + +**Timing for Changes:** + +- **During Business Hours**: Safe for HA configurations and instance type additions +- **No Downtime Required**: These are minor modifications +- **Real-time Monitoring**: Can be performed while monitoring systems + +**Steps for Safe Implementation:** + +1. **Pre-change Verification** + + ```bash + # Check current cluster state + kubectl get nodes + kubectl get pods -n kube-system + ``` + +2. **Implement Changes** + + - Add instance types to AutoScaling Group + - Or enable High Availability through SleakOps + +3. 
**Post-change Validation**

   ```bash
   # Verify the new configuration (replace <node-name> with an actual node)
   kubectl get nodes -o wide
   kubectl describe node <node-name>
   ```

---

_This FAQ was automatically generated on January 15, 2024 based on a real user query._

diff --git a/docs/troubleshooting/cluster-eks-spot-instances-unavailable.mdx b/docs/troubleshooting/cluster-eks-spot-instances-unavailable.mdx
new file mode 100644
index 000000000..e30db1cc4
--- /dev/null
+++ b/docs/troubleshooting/cluster-eks-spot-instances-unavailable.mdx
@@ -0,0 +1,210 @@
---
sidebar_position: 3
title: "EKS Spot Instances Unavailable During Nodegroup Creation"
description: "Solution for EKS nodegroup failures due to unavailable Spot instances"
date: "2024-01-15"
category: "cluster"
tags: ["eks", "spot-instances", "nodegroup", "aws", "troubleshooting"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# EKS Spot Instances Unavailable During Nodegroup Creation

**Date:** January 15, 2024
**Category:** Cluster
**Tags:** EKS, Spot Instances, Nodegroup, AWS, Troubleshooting

## Problem Description

**Context:** User experiences issues with EKS cluster nodegroup creation when using Spot instances, particularly after automatic cluster start/stop operations during weekends.
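The core mitigation in this guide is instance-type diversification: when the preferred Spot pool has no capacity, provisioning falls back to the next compatible type in the list. The selection logic can be sketched in bash — the `pick_instance_type` helper and the type lists are illustrative only, not how EKS implements it internally:

```shell
# Illustrative sketch of fallback selection: print the first instance
# type from a priority-ordered list that is not currently exhausted.
pick_instance_type() {
  local unavailable="$1"; shift   # space-separated list of exhausted types
  local t
  for t in "$@"; do
    case " $unavailable " in
      *" $t "*) continue ;;       # skip pools with no Spot capacity
      *) echo "$t"; return 0 ;;   # first available type wins
    esac
  done
  return 1                        # nothing left: provisioning would fail
}

pick_instance_type "m5.large m5a.large" m5.large m5a.large m4.large c5.large
# prints: m4.large
```

The takeaway mirrors the configuration advice below: the more types in the list, the less likely the function (and the nodegroup) is to reach the failure case.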
+ +**Observed Symptoms:** + +- Nodegroup fails to launch due to unavailable Spot instances +- Cluster automatic start/stop functionality causes node provisioning issues +- Critical add-ons nodes (criticaladdonsonly) fail to start +- Cluster becomes partially unavailable affecting staging environment + +**Relevant Configuration:** + +- Platform: AWS EKS +- Instance type: Spot instances +- Environment: Staging (STG) +- Feature: Automatic cluster start/stop for weekends +- Region: us-east-1 + +**Error Conditions:** + +- Error occurs during automatic cluster restart after weekend shutdown +- Spot instances of required type are not available in the region +- Manual cluster restart partially resolves the issue but some nodes remain unavailable +- Problem appears to be recurring + +## Detailed Solution + + + +Spot instances in AWS have variable availability based on: + +1. **Current demand**: High demand reduces availability +2. **Instance type**: Some types are more scarce than others +3. **Availability zone**: Different zones have different capacity +4. **Time of day/week**: Demand patterns affect availability + +When AWS doesn't have enough Spot capacity, nodegroup creation fails with capacity errors. + + + + + +To resolve the current issue: + +1. **Check Spot instance availability**: + + - Go to AWS Console → EC2 → Spot Requests + - Check current Spot prices and availability + +2. **Modify nodegroup configuration**: + + ```yaml + # Add multiple instance types for better availability + instance_types: + - "m5.large" + - "m5a.large" + - "m4.large" + - "c5.large" + ``` + +3. 
**Use mixed instance policy**: + - Combine On-Demand and Spot instances + - Set a percentage split (e.g., 20% On-Demand, 80% Spot) + + + + + +Configure your nodegroup with these best practices: + +```yaml +# Recommended configuration +nodegroup_config: + instance_types: + - "m5.large" + - "m5a.large" + - "m4.large" + - "c5.large" + - "c4.large" + capacity_type: "SPOT" + scaling_config: + min_size: 1 + max_size: 10 + desired_size: 3 + update_config: + max_unavailable_percentage: 25 + # Diversify across multiple AZs + subnets: + - "subnet-xxx" # us-east-1a + - "subnet-yyy" # us-east-1b + - "subnet-zzz" # us-east-1c +``` + +**Key recommendations**: + +- Use 4-5 different instance types +- Spread across multiple availability zones +- Consider similar performance characteristics + + + + + +For critical system components that must always be available: + +1. **Create a dedicated On-Demand nodegroup**: + + ```yaml + critical_nodegroup: + capacity_type: "ON_DEMAND" + instance_types: ["t3.medium"] + scaling_config: + min_size: 2 + max_size: 3 + desired_size: 2 + taints: + - key: "CriticalAddonsOnly" + value: "true" + effect: "NoSchedule" + ``` + +2. **Use node selectors for critical workloads**: + ```yaml + # In your critical workload manifests + nodeSelector: + node.kubernetes.io/instance-type: "t3.medium" + tolerations: + - key: "CriticalAddonsOnly" + operator: "Equal" + value: "true" + effect: "NoSchedule" + ``` + + + + + +To prevent issues with automatic cluster start/stop: + +1. **Implement graceful startup sequence**: + + - Start critical nodegroups first + - Wait for system pods to be ready + - Then start application nodegroups + +2. **Configure startup health checks**: + + ```bash + # Add to startup script + kubectl wait --for=condition=Ready nodes --all --timeout=300s + kubectl wait --for=condition=Ready pods -n kube-system --all --timeout=300s + ``` + +3. 
**Consider disabling auto start/stop temporarily**: + + - Until Spot availability improves + - Or until mixed instance configuration is implemented + +4. **Set up monitoring alerts**: + - Alert when nodegroups fail to start + - Monitor Spot instance interruption rates + + + + + +If Spot instance issues persist: + +1. **Hybrid approach**: + + - Use On-Demand for system components + - Use Spot for application workloads + +2. **Multi-region deployment**: + + - Consider spreading workloads across regions + - Use regions with better Spot availability + +3. **Reserved instances**: + + - For predictable workloads + - Better cost control than On-Demand + +4. **Fargate for critical workloads**: + - No instance management required + - Higher cost but better reliability + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-eks-upgrade-schedule-process.mdx b/docs/troubleshooting/cluster-eks-upgrade-schedule-process.mdx new file mode 100644 index 000000000..a072500c8 --- /dev/null +++ b/docs/troubleshooting/cluster-eks-upgrade-schedule-process.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "EKS Cluster Upgrade Schedule and Process" +description: "Understanding EKS cluster upgrade frequency, process, and management in SleakOps" +date: "2025-02-06" +category: "cluster" +tags: ["eks", "upgrade", "kubernetes", "maintenance", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# EKS Cluster Upgrade Schedule and Process + +**Date:** February 6, 2025 +**Category:** Cluster +**Tags:** EKS, Upgrade, Kubernetes, Maintenance, AWS + +## Problem Description + +**Context:** Users need to understand how EKS cluster upgrades are managed in SleakOps, including frequency, process, and whether manual intervention is required. 
+ +**Observed Symptoms:** + +- Questions about upgrade frequency and timing +- Concerns about downtime during upgrades +- Uncertainty about manual vs automatic upgrade processes +- Need for clarity on upgrade roadmap and version progression + +**Relevant Configuration:** + +- Current EKS version: 1.29 +- Target version: 1.32 (planned for first semester) +- Upgrade frequency: 2 major upgrades per year +- Platform: AWS EKS managed by SleakOps + +**Error Conditions:** + +- Potential service interruption during upgrades +- Risk of compatibility issues with manual cluster modifications +- Need for coordination between SleakOps team and users + +## Detailed Solution + + + +SleakOps manages EKS cluster upgrades with the following schedule: + +- **Frequency**: 2 major Kubernetes version upgrades per year +- **Current Status**: Version 1.29 +- **Target**: Version 1.32 by end of first semester 2025 +- **Remaining Upgrades**: 3 version upgrades (1.29 → 1.30 → 1.31 → 1.32) + +**Upgrade Timeline:** + +- Upgrades are typically scheduled during maintenance windows +- Users receive advance notification via email +- Specific dates are communicated prior to each upgrade + + + + + +SleakOps handles all EKS cluster upgrades: + +- **Automatic Management**: SleakOps team manages and executes all upgrades +- **No Manual Intervention**: Users do not need to perform manual upgrades through AWS Console +- **Pre-testing**: Extensive testing is performed before production upgrades +- **Monitoring**: SleakOps team monitors the upgrade process + +**On-Demand Upgrades:** +In some cases, SleakOps provides a workflow within the platform for users to trigger upgrades on-demand when needed. 
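Since EKS only moves one minor version at a time, the remaining hops from the current to the target version form a fixed sequence. A small bash sketch of that arithmetic — the `upgrade_path` helper is illustrative, not a SleakOps tool:

```shell
# Illustrative helper: list the sequential EKS minor-version hops needed
# to reach the target, since EKS upgrades one minor version at a time.
upgrade_path() {
  local current_minor="${1#1.}" target_minor="${2#1.}"
  local path="$1" m
  for ((m = current_minor + 1; m <= target_minor; m++)); do
    path="$path -> 1.$m"
  done
  echo "$path"
}

upgrade_path 1.29 1.32
# prints: 1.29 -> 1.30 -> 1.31 -> 1.32
```

This matches the three remaining upgrades described above; the parsing assumes the pre-2.x "1.NN" Kubernetes version scheme.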
+ + + + + +**Expected Downtime:** + +- Upgrades are designed to minimize downtime +- Extensive pre-upgrade testing reduces risk of issues +- Most upgrades should not cause service interruption + +**Potential Risks:** + +- Manual modifications to clusters may cause compatibility issues +- Custom configurations not managed by SleakOps could be affected +- Users should avoid manual changes to prevent upgrade complications + +**Recommendations:** + +- Stay alert during upgrade windows +- Avoid manual cluster modifications +- Report any issues immediately to SleakOps support + + + + + +**Before Upgrades:** + +1. **Review Applications**: Ensure your applications are compatible with the target Kubernetes version +2. **Backup Critical Data**: Although managed by SleakOps, ensure your application data is backed up +3. **Monitor Communications**: Watch for upgrade notifications from SleakOps +4. **Avoid Manual Changes**: Do not make manual modifications to the cluster before upgrades + +**During Upgrades:** + +1. **Stay Available**: Be available to respond to any issues +2. **Monitor Applications**: Check application health after upgrade completion +3. **Report Issues**: Contact SleakOps immediately if problems arise + +**After Upgrades:** + +1. **Verify Functionality**: Test critical application features +2. **Check Logs**: Review application and system logs for any issues +3. 
**Update Dependencies**: Ensure application dependencies are compatible with new Kubernetes version + + + + + +**Current Upgrade Path:** + +``` +Current: 1.29 → Target: 1.32 +Upgrade sequence: 1.29 → 1.30 → 1.31 → 1.32 +``` + +**Compatibility Considerations:** + +- Each version upgrade maintains backward compatibility for most features +- Deprecated APIs may be removed in newer versions +- Custom resources and operators should be verified for compatibility +- Third-party integrations may require updates + +**Best Practices:** + +- Keep applications updated to use current Kubernetes APIs +- Avoid using deprecated features +- Test applications in development environments with newer Kubernetes versions +- Follow Kubernetes deprecation policies + + + +--- + +_This FAQ was automatically generated on February 6, 2025 based on a real user query._ diff --git a/docs/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx b/docs/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx new file mode 100644 index 000000000..83ac46c84 --- /dev/null +++ b/docs/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "EKS Cluster Upgrade Failed Due to Orphaned Volume" +description: "Solution for EKS cluster upgrade failures caused by volumes attached to nodes but not in use" +date: "2024-12-19" +category: "cluster" +tags: ["eks", "upgrade", "volumes", "nodegroup", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# EKS Cluster Upgrade Failed Due to Orphaned Volume + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** EKS, Upgrade, Volumes, NodeGroup, Troubleshooting + +## Problem Description + +**Context:** During an EKS cluster upgrade to version 1.31, the NodeGroup upgrade process fails, leaving old nodes running while preventing the upgrade from completing successfully. 
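The root cause here is a volume that is attached to a node but no longer referenced by any PersistentVolume. Given the two ID lists, the orphans are a simple set difference; a bash sketch with a hypothetical `find_orphans` helper (in practice the inputs would come from `aws ec2 describe-volumes` and `kubectl get pv`):

```shell
# Illustrative sketch: given newline-separated volume IDs attached to a
# node and volume IDs backing in-use PersistentVolumes, print the
# orphans (attached but unreferenced).
find_orphans() {
  # $1 = attached volume IDs, $2 = in-use volume IDs
  comm -23 <(echo "$1" | sort) <(echo "$2" | sort)
}

find_orphans $'vol-aaa\nvol-bbb\nvol-ccc' $'vol-aaa\nvol-ccc'
# prints: vol-bbb
```

`comm -23` keeps lines unique to the first input, which is exactly the "attached but not in use" set described in this issue.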
+ +**Observed Symptoms:** + +- EKS cluster upgrade fails during NodeGroup upgrade phase +- Old nodes remain active instead of being replaced +- CI/CD pipelines stop functioning properly +- Deployment processes are affected +- Volume attachment conflicts prevent node replacement + +**Relevant Configuration:** + +- EKS cluster version: Upgrading to 1.31 +- Affected volume: `/app/certs` (orphaned volume) +- Volume status: Attached to node but not in use by any pods +- Volume state in SleakOps: Marked as "deleted" but still physically attached + +**Error Conditions:** + +- NodeGroup upgrade fails due to volume attachment conflicts +- Occurs when volumes are attached to nodes but not actively used +- Problem appears during the node replacement phase of the upgrade +- Affects clusters with previously deleted but still attached volumes + +## Detailed Solution + + + +The upgrade failure occurs because: + +1. **Orphaned volumes**: Volumes that were deleted in SleakOps but remain physically attached to EC2 instances +2. **Node replacement conflict**: During EKS upgrades, AWS needs to replace nodes, but attached volumes prevent clean node termination +3. **Volume state mismatch**: The volume exists in AWS but is not tracked in the current deployment configuration + +This is a common issue when volumes are removed from SleakOps configuration but the underlying AWS EBS volumes remain attached to the nodes. + + + + + +To fix the upgrade issue: + +1. **Identify the problematic volume**: + + ```bash + # Check attached volumes on the node + kubectl describe nodes + # Look for volumes in AWS EC2 console + aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=i-xxxxxxxxx" + ``` + +2. **Detach the orphaned volume**: + + ```bash + # Detach volume from EC2 instance + aws ec2 detach-volume --volume-id vol-xxxxxxxxx + ``` + +3. 
**Retry the cluster upgrade**:

   - The upgrade should proceed normally once the volume conflict is resolved
   - Monitor the NodeGroup replacement process

Important considerations for data safety:

1. **Volume data is preserved**: Detaching the volume does not delete the data
2. **Volume remains in AWS**: The EBS volume stays intact in your AWS account
3. **Recovery options**: You can reattach the volume later if needed

```bash
# Check volume status after detachment
aws ec2 describe-volumes --volume-ids vol-xxxxxxxxx

# Volume will show status as "available" instead of "in-use"
```

**Best practice**: Take a snapshot before detaching if the volume contains critical data:

```bash
# Create a snapshot for safety
aws ec2 create-snapshot --volume-id vol-xxxxxxxxx --description "Backup before cluster upgrade"
```

To avoid this issue in future upgrades:

1. **Clean volume removal**: When removing volumes in SleakOps, ensure they are properly detached:

   - Remove the volume from the SleakOps configuration
   - Verify the volume is detached from nodes
   - Optionally delete the volume if it is no longer needed

2. **Pre-upgrade checklist**:

   ```bash
   # Check for orphaned volumes before upgrading
   kubectl get pv
   kubectl get pvc --all-namespaces

   # Verify no unused volumes are attached
   aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=i-*"
   ```

3. **Regular maintenance**: Periodically audit and clean up unused volumes

After resolving the volume issue and completing the upgrade:

1. **Verify cluster version**:

   ```bash
   kubectl version  # note: the old "--short" flag was removed in recent kubectl releases
   aws eks describe-cluster --name your-cluster-name --query 'cluster.version'
   ```

2. **Check node status**:

   ```bash
   kubectl get nodes
   # Verify all nodes are running the new version
   ```

3.
**Test CI/CD functionality**: + + - Deploy a test application + - Verify deployment pipelines work correctly + - Check that all services are accessible + +4. **Monitor for issues**: + - Watch cluster logs for any anomalies + - Verify all workloads are running normally + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-kubernetes-upgrade-process.mdx b/docs/troubleshooting/cluster-kubernetes-upgrade-process.mdx new file mode 100644 index 000000000..4f212e3a1 --- /dev/null +++ b/docs/troubleshooting/cluster-kubernetes-upgrade-process.mdx @@ -0,0 +1,168 @@ +--- +sidebar_position: 3 +title: "Kubernetes Cluster Upgrade Process" +description: "How to handle cluster upgrades in SleakOps platform" +date: "2024-12-19" +category: "cluster" +tags: ["kubernetes", "upgrade", "cluster", "maintenance"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Cluster Upgrade Process + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Kubernetes, Upgrade, Cluster, Maintenance + +## Problem Description + +**Context:** Users see an "Upgrade Required" notification in the Cluster section of their SleakOps dashboard and need guidance on how to proceed with the Kubernetes cluster upgrade. 
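A key pre-upgrade check is that workloads run more than one replica, since single-pod services can see brief interruptions during a rolling node upgrade. A bash sketch of that check — the `flag_single_replica` helper is illustrative; real input would come from something like `kubectl get deploy -A -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas`:

```shell
# Illustrative sketch: flag workloads with fewer than 2 replicas, which
# may be briefly interrupted during a rolling node upgrade.
# Input lines are "name replicas".
flag_single_replica() {
  local name replicas
  while read -r name replicas; do
    if [ "$replicas" -lt 2 ]; then
      echo "WARN: $name has $replicas replica(s)"
    fi
  done
}

printf 'api 3\nworker 1\n' | flag_single_replica
# prints: WARN: worker has 1 replica(s)
```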
+ +**Observed Symptoms:** + +- "Upgrade Required" banner appears in the Cluster - Production section +- Uncertainty about the upgrade process and potential downtime +- Concerns about service interruption during the upgrade +- Questions about code, version, or database compatibility + +**Relevant Configuration:** + +- Platform: SleakOps managed Kubernetes cluster +- Environment: Production cluster +- Upgrade type: Kubernetes version upgrade with core components and addons + +**Error Conditions:** + +- No actual errors, but upgrade notification requires action +- Potential compatibility issues with external resources +- Risk of service disruption if not properly planned + +## Detailed Solution + + + +When you see the "Upgrade Required" notification: + +1. **Click Accept** on the upgrade notification in your SleakOps dashboard +2. The system will automatically begin the upgrade process +3. **No manual intervention** is required during the upgrade + +The upgrade process is fully automated and managed by SleakOps. + + + + + +**Duration:** Approximately 1 hour for the complete upgrade process + +**Downtime:** + +- **No downtime expected** for properly configured services +- Services with multiple pod replicas will continue running +- Single-pod services may experience brief interruptions during node upgrades + +**Process order:** + +1. Core SleakOps nodes are upgraded one by one (rolling upgrade) +2. Internal addons are upgraded after nodes complete +3. 
All components listed in the upgrade notification are updated + + + + + +**Application Readiness:** + +- Ensure your applications have **multiple pod replicas** for high availability +- Verify that your services can handle rolling restarts +- Check that your applications don't rely on specific node assignments + +**External Resources:** + +- Review any **external installations** in your cluster (not managed by SleakOps) +- Check the Kubernetes changelog for deprecated resources +- Verify compatibility of custom operators or third-party tools + +**Database Considerations:** + +- No special database preparations needed for SleakOps managed upgrades +- Ensure database connections can handle brief network interruptions +- Verify that persistent volumes are properly configured + + + + + +Before accepting the upgrade: + +1. **Review the Kubernetes changelog** provided in the upgrade notification +2. **Check for deprecated APIs** that your external tools might use: + ```bash + kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found + ``` +3. **Update any external tools** that use deprecated Kubernetes APIs +4. **Test critical applications** in a staging environment if possible + +**Common deprecated resources to check:** + +- Old API versions for Deployments, Services, Ingress +- Custom Resource Definitions (CRDs) with deprecated schemas +- Network policies using deprecated APIs + + + + + +During the upgrade: + +1. **Monitor your applications** through your usual monitoring tools +2. **Check the SleakOps dashboard** for upgrade progress +3. 
**Watch for any alerts** from your monitoring systems + +**Signs of successful upgrade:** + +- Nodes show updated Kubernetes version +- All pods are running and healthy +- Services respond normally +- No persistent error logs + +**If issues occur:** + +- Contact SleakOps support immediately +- Provide specific error messages or symptoms +- Include the upgrade notification details + + + + + +After the upgrade completes: + +1. **Verify cluster status:** + + ```bash + kubectl get nodes + kubectl get pods --all-namespaces + ``` + +2. **Check application health:** + + - Test critical application endpoints + - Verify database connectivity + - Confirm monitoring and logging are working + +3. **Review upgrade logs:** + + - Check SleakOps dashboard for upgrade summary + - Review any warnings or notices + +4. **Update documentation:** + - Record the new Kubernetes version + - Update any version-specific configurations + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-lens-connection-timeout.mdx b/docs/troubleshooting/cluster-lens-connection-timeout.mdx new file mode 100644 index 000000000..ed7384458 --- /dev/null +++ b/docs/troubleshooting/cluster-lens-connection-timeout.mdx @@ -0,0 +1,196 @@ +--- +sidebar_position: 3 +title: "Lens Connection Timeout to Kubernetes Cluster" +description: "Solution for timeout errors when connecting to Kubernetes clusters through Lens IDE" +date: "2024-04-29" +category: "user" +tags: ["lens", "kubernetes", "connection", "timeout", "vpn", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Lens Connection Timeout to Kubernetes Cluster + +**Date:** April 29, 2024 +**Category:** User +**Tags:** Lens, Kubernetes, Connection, Timeout, VPN, Troubleshooting + +## Problem Description + +**Context:** Users experience timeout errors when attempting to connect to Kubernetes clusters through 
Lens IDE, despite following all setup steps correctly. + +**Observed Symptoms:** + +- Connection timeout errors in Lens IDE +- Unable to access cluster resources through Lens +- Successful authentication but failed cluster communication +- Error occurs for specific team members while others can connect normally + +**Relevant Configuration:** + +- Tool: Lens Kubernetes IDE +- Authentication: AWS IAM user credentials +- Connection method: Through SleakOps VPN +- Cluster type: EKS (Amazon Elastic Kubernetes Service) + +**Error Conditions:** + +- Timeout occurs during cluster connection attempt +- Error persists despite correct configuration steps +- Problem appears to be network-related rather than authentication-related +- Issue may be intermittent or affect specific users + +## Detailed Solution + + + +Lens connection timeouts to Kubernetes clusters typically occur due to: + +1. **Network connectivity issues**: VPN connection problems or firewall restrictions +2. **DNS resolution problems**: Unable to resolve cluster endpoint +3. **Authentication token expiration**: AWS credentials or kubeconfig tokens expired +4. **Cluster endpoint accessibility**: EKS cluster endpoint not reachable from current network +5. **Lens configuration issues**: Incorrect kubeconfig or context settings + + + + + +First, ensure your VPN connection is working properly: + +1. **Check VPN status**: Verify you're connected to the SleakOps VPN +2. **Test connectivity**: Try pinging internal resources +3. 
**DNS resolution**: Ensure you can resolve internal hostnames + +```bash +# Test VPN connectivity +ping internal-resource.sleakops.local + +# Check DNS resolution +nslookup your-cluster-endpoint.eks.amazonaws.com +``` + + + + + +Update your kubeconfig file to ensure fresh credentials: + +```bash +# Update kubeconfig for EKS cluster +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verify the configuration +kubectl config current-context + +# Test basic connectivity +kubectl get nodes +``` + +If using AWS CLI profiles: + +```bash +# Specify the profile +aws eks update-kubeconfig --region your-region --name your-cluster-name --profile your-profile +``` + + + + + +Ensure Lens is configured correctly: + +1. **Import kubeconfig**: Go to **File** → **Add Cluster** → **From kubeconfig** +2. **Select correct context**: Choose the right cluster context +3. **Verify proxy settings**: Check if Lens proxy settings match your network configuration + +**Lens Proxy Configuration:** + +- Go to **Preferences** → **Proxy** +- Ensure proxy settings match your VPN configuration +- Try disabling proxy if using VPN + + + + + +If the problem persists, perform these network diagnostics: + +```bash +# Check if cluster endpoint is reachable +telnet your-cluster-endpoint.eks.amazonaws.com 443 + +# Test with curl +curl -k https://your-cluster-endpoint.eks.amazonaws.com/version + +# Check routing +traceroute your-cluster-endpoint.eks.amazonaws.com +``` + +**Common network issues:** + +- Corporate firewall blocking Kubernetes API ports (443, 6443) +- VPN not routing traffic to AWS regions properly +- DNS not resolving EKS endpoints correctly + + + + + +If Lens continues to timeout, try these alternatives: + +1. **Use kubectl directly**: Test if kubectl works without Lens +2. **Try different network**: Test from a different network location +3. **Use AWS CloudShell**: Access cluster through AWS Console CloudShell +4. 
**Port forwarding**: Use kubectl port-forward for specific services + +```bash +# Test direct kubectl access +kubectl get pods --all-namespaces + +# Port forward for specific services +kubectl port-forward service/your-service 8080:80 +``` + + + + + +Ensure the AWS IAM user has proper permissions: + +1. **Check IAM policies**: Verify EKS access policies are attached +2. **Verify aws-auth ConfigMap**: Ensure user is mapped in the cluster +3. **Test AWS CLI access**: Confirm AWS credentials work + +```bash +# Test AWS credentials +aws sts get-caller-identity + +# Check EKS cluster access +aws eks describe-cluster --name your-cluster-name + +# Verify kubectl access +kubectl auth can-i get pods +``` + + + + + +While the VPN server issue is being resolved: + +1. **Use AWS Console**: Access Kubernetes resources through AWS EKS console +2. **AWS CloudShell**: Use CloudShell for kubectl commands +3. **Local port forwarding**: Forward specific services to localhost +4. **Alternative VPN**: If available, try connecting through a different VPN endpoint + +```bash +# Example port forward for dashboard access +kubectl port-forward -n kubernetes-dashboard service/kubernetes-dashboard 8443:443 +``` + + + +--- + +_This FAQ was automatically generated on 4/29/2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx b/docs/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx new file mode 100644 index 000000000..4e927b22b --- /dev/null +++ b/docs/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx @@ -0,0 +1,167 @@ +--- +sidebar_position: 3 +title: "Cluster Manual Shutdown Requires Scheduled Feature" +description: "Solution for manual cluster shutdown when scheduled shutdown feature is not enabled" +date: "2024-12-12" +category: "cluster" +tags: ["cluster", "shutdown", "scheduled", "manual", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; 
+ +# Cluster Manual Shutdown Requires Scheduled Feature + +**Date:** December 12, 2024 +**Category:** Cluster +**Tags:** Cluster, Shutdown, Scheduled, Manual, Configuration + +## Problem Description + +**Context:** Users want to manually stop their SleakOps cluster at specific times (for testing, cost optimization, etc.) but find that the manual shutdown button is not available or functional. + +**Observed Symptoms:** + +- Manual cluster shutdown button is not available in the interface +- Cannot stop cluster manually even when needed for specific situations +- Cluster remains active consuming resources when not needed +- No option to pause cluster for undefined periods + +**Relevant Configuration:** + +- SleakOps cluster deployed and running +- Scheduled Shutdown feature: Not enabled +- Manual shutdown requirement: Immediate or flexible timing +- Use case: Testing, development, cost optimization + +**Error Conditions:** + +- Manual shutdown not possible without scheduled feature enabled +- Resources continue consuming costs during inactive periods +- Cannot pause cluster for undefined time periods +- Inflexibility for ad-hoc cluster management needs + +## Detailed Solution + + + +To enable manual cluster shutdown, you must first activate the "Scheduled Shutdown" feature: + +1. Go to your **Cluster Configuration** +2. Look for the **"Scheduled Shutdown"** option +3. **Enable** this feature +4. Configure basic schedule settings (can be minimal) + +Once enabled, this feature runs a background module that allows both scheduled and manual shutdown operations. + + + + + +You can set up a minimal schedule that gives you maximum manual control: + +```yaml +# Example minimal configuration +scheduled_shutdown: + enabled: true + schedule: + # Set a very permissive schedule or off-hours only + weekdays: [] + weekend: [] + timezone: "UTC" +``` + +**Steps:** + +1. Enable Scheduled Shutdown +2. Set minimal or no automatic schedules +3. Use manual controls as needed +4. 
The feature can be disabled later if not needed + + + + + +After enabling the Scheduled Shutdown feature: + +1. Navigate to your cluster dashboard +2. Look for the **shutdown/stop button** (now available) +3. Click to manually stop the cluster +4. Start the cluster manually when needed +5. No predefined schedule required for manual operations + +**Benefits:** + +- Stop cluster during testing breaks +- Pause for undefined periods +- Resume when needed +- Optimize costs for development environments + + + + + +For development clusters that need flexible on/off control: + +```yaml +cluster_config: + name: "dev-cluster" + scheduled_shutdown: + enabled: true + auto_schedule: false # No automatic scheduling + manual_control: true # Enable manual start/stop + +# This allows: +# - Manual shutdown anytime +# - Manual startup when needed +# - No automatic operations +# - Cost optimization for dev environments +``` + +**Use Cases:** + +- Development and testing environments +- Clusters used sporadically +- Cost-sensitive projects +- Temporary project pauses + + + + + +**Known Issue:** This requirement is currently a platform inconsistency that should be improved. + +**Current Behavior:** + +- Manual shutdown requires Scheduled Shutdown feature enabled +- Background module must be running for shutdown capabilities +- SleakOps runs various resources continuously by default + +**Expected Future Improvement:** + +- Manual shutdown should be available without scheduling requirements +- More flexible cluster management options +- Better separation between scheduled and manual operations + + + + + +Until the platform limitation is resolved: + +1. **Enable Scheduled Shutdown** even if you don't need automatic scheduling +2. **Set minimal schedules** that don't interfere with your workflow +3. **Use manual controls** as your primary shutdown method +4. 
**Document your usage patterns** to optimize future configurations + +**Cost Optimization Tips:** + +- Enable the feature immediately after cluster creation +- Stop clusters during known inactive periods +- Monitor resource usage to identify optimization opportunities +- Consider multiple smaller clusters instead of one large always-on cluster + + + +--- + +_This FAQ was automatically generated on December 12, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-node-errors-web-api-unavailable.mdx b/docs/troubleshooting/cluster-node-errors-web-api-unavailable.mdx new file mode 100644 index 000000000..08f192877 --- /dev/null +++ b/docs/troubleshooting/cluster-node-errors-web-api-unavailable.mdx @@ -0,0 +1,178 @@ +--- +sidebar_position: 3 +title: "Web and API Unavailable Due to Node Errors" +description: "Solution for web and API services failing due to Kubernetes node issues" +date: "2024-01-15" +category: "cluster" +tags: ["nodes", "web", "api", "troubleshooting", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web and API Unavailable Due to Node Errors + +**Date:** January 15, 2024 +**Category:** Cluster +**Tags:** Nodes, Web, API, Troubleshooting, Kubernetes + +## Problem Description + +**Context:** User reports that both web and API services are not functioning properly, with error messages indicating node-related issues in the Kubernetes cluster. 
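Node trouble like this usually surfaces first as nodes leaving the `Ready` state. As a quick triage aid, here is a minimal sketch that counts non-`Ready` nodes from `kubectl get nodes` output; the node names and states in the sample are illustrative, and in a real cluster you would pipe live output in (`kubectl get nodes | count_not_ready`):

```bash
# Count nodes whose STATUS column is not exactly "Ready".
# Assumes the default `kubectl get nodes` table layout (STATUS is column 2).
count_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}

# Sample output for illustration only (shape of `kubectl get nodes`)
sample='NAME     STATUS     ROLES    AGE   VERSION
node-a   Ready      <none>   12d   v1.28.3
node-b   NotReady   <none>   12d   v1.28.3
node-c   Ready      <none>   3d    v1.28.3'

printf '%s\n' "$sample" | count_not_ready   # prints 1
```

A result greater than zero is a strong hint that the web/API outage is node-related rather than an application bug, and the steps below apply.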
+ +**Observed Symptoms:** + +- Web service is not accessible +- API endpoints are not responding +- Error messages mentioning node problems +- Complete service unavailability + +**Relevant Configuration:** + +- Platform: SleakOps Kubernetes cluster +- Affected services: Web frontend and API backend +- Error type: Node-related errors + +**Error Conditions:** + +- Both web and API services fail simultaneously +- Errors point to underlying node infrastructure issues +- Services remain unavailable until node issues are resolved + +## Detailed Solution + + + +First, check the status of your cluster nodes to identify the specific issue: + +1. **Access SleakOps Dashboard** +2. Navigate to **Clusters** → **Your Cluster** +3. Go to **Nodes** section +4. Check for nodes with status: + - `NotReady` + - `Unknown` + - `SchedulingDisabled` + +Alternatively, if you have kubectl access: + +```bash +kubectl get nodes +kubectl describe nodes +``` + + + + + +**Node Resource Exhaustion:** + +- **Symptom**: Nodes show high CPU/memory usage +- **Solution**: Scale up nodepool or add more nodes + +**Network Connectivity Issues:** + +- **Symptom**: Nodes can't communicate with control plane +- **Solution**: Check security groups and network configuration + +**Disk Space Issues:** + +- **Symptom**: Nodes running out of disk space +- **Solution**: Clean up unused images or increase disk size + +**Node Failure:** + +- **Symptom**: Nodes completely unresponsive +- **Solution**: Replace failed nodes through SleakOps + + + + + +Once node issues are resolved, restart your web and API services: + +1. **In SleakOps Dashboard:** + + - Go to **Projects** → **Your Project** + - Find your web and API workloads + - Click **Restart** on each service + +2. 
**Via kubectl (if available):** + +```bash +# Restart web deployment +kubectl rollout restart deployment/web-app -n your-namespace + +# Restart API deployment +kubectl rollout restart deployment/api-app -n your-namespace + +# Check rollout status +kubectl rollout status deployment/web-app -n your-namespace +kubectl rollout status deployment/api-app -n your-namespace +``` + + + + + +If the issue is related to insufficient node capacity: + +1. **Access SleakOps Dashboard** +2. Go to **Clusters** → **Your Cluster** +3. Navigate to **Nodepools** +4. Select the affected nodepool +5. Increase **Desired Size** or **Max Size** +6. Click **Update Nodepool** + +The system will automatically provision new nodes and redistribute workloads. + + + + + +After implementing fixes, monitor the recovery: + +1. **Check node status** until all show `Ready` +2. **Verify pod status**: + ```bash + kubectl get pods -n your-namespace + ``` +3. **Test web service** by accessing your application URL +4. **Test API endpoints** using curl or your preferred tool: + ```bash + curl -X GET https://your-api-url/health + ``` +5. **Monitor logs** for any remaining errors: + ```bash + kubectl logs -f deployment/web-app -n your-namespace + kubectl logs -f deployment/api-app -n your-namespace + ``` + + + + + +To prevent similar issues: + +1. **Set up monitoring alerts** for node health +2. **Configure auto-scaling** for nodepools +3. **Implement resource requests and limits** for your applications +4. **Regular health checks** on critical services +5. 
**Monitor cluster metrics** regularly through SleakOps dashboard + +**Recommended monitoring setup:** + +```yaml +# Example resource limits +resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx b/docs/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx new file mode 100644 index 000000000..264575ea2 --- /dev/null +++ b/docs/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx @@ -0,0 +1,225 @@ +--- +sidebar_position: 3 +title: "Deployment Failure Due to Nodepool Memory Limits" +description: "Solution for deployment failures caused by nodepool reaching memory capacity limits" +date: "2024-03-13" +category: "cluster" +tags: ["deployment", "nodepool", "memory", "scaling", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Failure Due to Nodepool Memory Limits + +**Date:** March 13, 2024 +**Category:** Cluster +**Tags:** Deployment, Nodepool, Memory, Scaling, Troubleshooting + +## Problem Description + +**Context:** User experiences deployment failures in QA environment where builds and deployments take over 50 minutes and eventually timeout, preventing successful application updates. 
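When a nodepool is at capacity, the telltale sign is pods (such as migration jobs) stuck in `Pending`. The following sketch filters those out of `kubectl get pods -A` output; the namespace and pod names are made up for illustration, and with a live cluster you would pipe real output in instead of the sample:

```bash
# Print the names of pods stuck in Pending.
# Assumes the default `kubectl get pods -A` table layout (STATUS is column 4).
list_pending() {
  awk 'NR > 1 && $4 == "Pending" { print $2 }'
}

# Sample output for illustration only
sample='NAMESPACE   NAME            READY   STATUS    RESTARTS   AGE
qa          backend-1       1/1     Running   0          5m
qa          migrate-job-x   0/1     Pending   0          51m'

printf '%s\n' "$sample" | list_pending   # prints migrate-job-x
```

Following up with `kubectl describe pod <name>` on any pod this prints will typically show `FailedScheduling` / `Insufficient memory` events, confirming the capacity diagnosis before you resize the nodepool.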
+ +**Observed Symptoms:** + +- Deployment builds taking more than 50 minutes +- Deployment process stops/times out before completion +- Migration pods cannot be scheduled +- New application versions fail to deploy + +**Relevant Configuration:** + +- Environment: QA cluster +- Application: Backend service +- Deployment timeout: ~50 minutes +- Nodepool: Limited memory capacity +- Migration pods requiring additional resources + +**Error Conditions:** + +- Nodepool reaches memory limit when trying to add new nodes +- Insufficient resources to schedule migration pods +- Deployment pipeline timeout due to resource constraints +- New pods cannot be created due to capacity limits + +## Detailed Solution + + + +The deployment failure occurs because: + +1. **Nodepool Memory Limit**: The nodepool has reached its maximum memory allocation +2. **Resource Scheduling**: Kubernetes cannot schedule new pods (like migration pods) due to insufficient resources +3. **Deployment Dependencies**: The deployment process requires additional resources that aren't available +4. **Capacity Planning**: The current nodepool configuration doesn't account for peak resource usage during deployments + + + + + +To resolve the immediate issue: + +1. **Access SleakOps Dashboard** +2. Navigate to **Cluster Management** → **Nodepools** +3. Select the affected nodepool +4. **Increase Memory Allocation**: + - Go to **Configuration** tab + - Increase **Max Memory** limit + - Or increase **Max Nodes** if using node-based scaling +5. **Apply Changes** and wait for new nodes to be provisioned + +```yaml +# Example nodepool configuration +nodepool_config: + min_nodes: 2 + max_nodes: 8 # Increased from previous limit + instance_type: "t3.large" # Or upgrade instance type + max_memory_gb: 32 # Increased memory limit +``` + + + + + +To prevent future issues, monitor your cluster resources: + +1. 
**Use SleakOps Nodepool Dashboard**: + + - Check CPU/Memory utilization graphs + - Monitor trends during deployment times + - Set up alerts for high resource usage + +2. **Key Metrics to Watch**: + + - Memory utilization > 80% + - CPU utilization > 70% + - Number of pending pods + - Node capacity vs. usage + +3. **Access Kubecost** (if installed): + - Ensure VPN connection is active + - Navigate to cost analysis dashboard + - Review resource allocation efficiency + + + + + +To improve deployment reliability: + +1. **Resource Requests and Limits**: + +```yaml +# Set appropriate resource requests +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +2. **Deployment Strategy**: + +```yaml +# Use rolling updates with proper resource management +strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 +``` + +3. **Migration Job Configuration**: + +```yaml +# Ensure migration jobs have appropriate resources +apiVersion: batch/v1 +kind: Job +metadata: + name: migration-job +spec: + template: + spec: + containers: + - name: migrate + resources: + requests: + memory: "256Mi" + cpu: "100m" +``` + + + + + +For sustainable cluster management: + +1. **Calculate Peak Usage**: + + - Normal operation resources + - Deployment-time additional resources + - Migration and maintenance job resources + - Buffer for unexpected spikes (20-30%) + +2. **Nodepool Sizing Strategy**: + + - **Development/QA**: 2-4 nodes with auto-scaling + - **Production**: 3-6 nodes minimum with higher limits + - **Consider instance types**: Balance cost vs. performance + +3. **Auto-scaling Configuration**: + +```yaml +autoscaling: + enabled: true + min_nodes: 2 + max_nodes: 10 + target_cpu_percent: 70 + target_memory_percent: 80 +``` + + + + + +When deployments fail: + +1. **Check Nodepool Status**: + + ```bash + kubectl get nodes + kubectl describe nodes + ``` + +2. 
**Check Pod Status**: + + ```bash + kubectl get pods --all-namespaces + kubectl describe pod + ``` + +3. **Check Resource Usage**: + + ```bash + kubectl top nodes + kubectl top pods --all-namespaces + ``` + +4. **Check Events**: + + ```bash + kubectl get events --sort-by=.metadata.creationTimestamp + ``` + +5. **Common Error Messages**: + - `Insufficient memory` + - `Insufficient cpu` + - `0/X nodes are available` + - `FailedScheduling` + + + +--- + +_This FAQ was automatically generated on March 13, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx b/docs/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx new file mode 100644 index 000000000..c807a6d1d --- /dev/null +++ b/docs/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx @@ -0,0 +1,143 @@ +--- +sidebar_position: 3 +title: "Missing Nodepools After Cluster Shutdown" +description: "Solution for nodepools disappearing after cluster shutdown/startup cycle" +date: "2025-02-20" +category: "cluster" +tags: ["nodepool", "shutdown", "cluster", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Missing Nodepools After Cluster Shutdown + +**Date:** February 20, 2025 +**Category:** Cluster +**Tags:** Nodepool, Shutdown, Cluster, Troubleshooting + +## Problem Description + +**Context:** Users experience missing nodepools after a cluster shutdown and startup cycle, particularly when using the nighttime shutdown feature in SleakOps. 
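One quick way to confirm this failure mode after a restart is to compare the nodepools you expect against the ones that actually came back. The sketch below assumes Karpenter-style pool names and is purely illustrative — substitute your own pool list, and feed it the observed pools from something like `kubectl get nodes -L karpenter.sh/nodepool`:

```bash
# Hypothetical expected pool list -- replace with your cluster's actual pools
expected_pools="spot-amd64 ondemand-amd64"

# Print each expected pool that is absent from the observed list
missing_pools() {
  observed="$1"
  for pool in $expected_pools; do
    case " $observed " in
      *" $pool "*) ;;              # pool came back after restart
      *) printf '%s\n' "$pool" ;;  # pool missing -- report it
    esac
  done
}

# Example: only spot-amd64 returned after the shutdown/startup cycle
missing_pools "spot-amd64"   # prints ondemand-amd64
```

Any pool this reports as missing is worth including in your support ticket alongside the shutdown/startup timestamps requested below.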
+ +**Observed Symptoms:** + +- Build processes fail after cluster restart +- Nodepools are missing from the cluster +- Applications cannot be deployed due to lack of compute resources +- Cluster appears to be running but without worker nodes + +**Relevant Configuration:** + +- Cluster with nighttime shutdown enabled +- Multiple nodepools configured +- Build processes dependent on nodepool availability + +**Error Conditions:** + +- Error occurs after cluster shutdown/startup cycle +- Affects clusters with automatic shutdown features +- Nodepools do not automatically restore after cluster restart +- Build and deployment operations fail + +## Detailed Solution + + + +If you're experiencing missing nodepools after a cluster shutdown: + +1. **Contact SleakOps Support** immediately to restore nodepools +2. **Avoid using builds** until nodepools are restored +3. **Check cluster status** in the SleakOps dashboard +4. **Verify nodepool configuration** once restored + +The SleakOps team can manually restore your nodepools while the permanent fix is being implemented. + + + + + +This issue is caused by a known bug in the nighttime shutdown feature: + +- **Shutdown process**: Properly shuts down the cluster +- **Startup process**: Restarts the cluster but fails to restore nodepools +- **Impact**: Leaves the cluster without worker nodes +- **Status**: Bug is currently being fixed by the SleakOps team + + + + + +Until the fix is available: + +1. **Disable nighttime shutdown** if possible: + + - Go to **Cluster Settings** + - Navigate to **Power Management** + - Disable **Automatic Shutdown** + +2. **Manual shutdown alternative**: + + - Shut down cluster manually when needed + - Ensure you're available to verify nodepool status after restart + +3. 
**Monitor cluster status**: + - Check nodepool availability before starting builds + - Verify worker nodes are present in Kubernetes dashboard + + + + + +After nodepools are restored, verify they're working correctly: + +```bash +# Check nodes in cluster +kubectl get nodes + +# Verify nodepool status +kubectl get nodes --show-labels + +# Check if pods can be scheduled +kubectl get pods --all-namespaces +``` + +Expected output should show: + +- Multiple worker nodes in "Ready" state +- Proper node labels indicating nodepool membership +- Pods successfully scheduled across nodes + + + + + +To avoid this issue in the future: + +1. **Wait for the fix**: SleakOps is working on a permanent solution +2. **Monitor announcements**: Watch for updates about the fix availability +3. **Re-enable shutdown carefully**: Only re-enable automatic shutdown after the fix is deployed +4. **Test thoroughly**: After the fix, test the shutdown/startup cycle in a non-production environment first + + + + + +Contact SleakOps support immediately if: + +- Nodepools are missing after cluster restart +- Builds are failing due to insufficient resources +- You cannot deploy applications +- Cluster shows as running but has no worker nodes + +Provide the following information: + +- Cluster name and ID +- Time when the shutdown/startup occurred +- Screenshots of the error messages +- Current nodepool configuration + + + +--- + +_This FAQ was automatically generated on February 20, 2025 based on a real user query._ diff --git a/docs/troubleshooting/cluster-production-check-node-scaling.mdx b/docs/troubleshooting/cluster-production-check-node-scaling.mdx new file mode 100644 index 000000000..ebf2166f3 --- /dev/null +++ b/docs/troubleshooting/cluster-production-check-node-scaling.mdx @@ -0,0 +1,217 @@ +--- +sidebar_position: 3 +title: "Production Check Node Scaling and Resource Changes" +description: "Understanding node scaling and resource changes when enabling Production Check in SleakOps clusters" 
+date: "2024-11-20" +category: "cluster" +tags: ["production", "scaling", "karpenter", "nodes", "taints", "availability"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Check Node Scaling and Resource Changes + +**Date:** November 20, 2024 +**Category:** Cluster +**Tags:** Production, Scaling, Karpenter, Nodes, Taints, Availability + +## Problem Description + +**Context:** User enabled the "Production" check on their SleakOps cluster and configured an on-demand node pool, but noticed significant increases in node count and infrastructure resource usage. + +**Observed Symptoms:** + +- Node count increased significantly after enabling Production check +- 7 total nodes with 4 assigned to infrastructure (non-application workloads) +- All nodes now have taints applied (previously only non-container nodes had taints) +- Higher resource consumption and billing costs +- Infrastructure workloads separated from application workloads + +**Relevant Configuration:** + +- Production check: Enabled +- Node pool type: On-demand instances only +- SleakOps version: 1.7.0 (with taint changes) +- Karpenter: Enabled with consolidation policies +- High availability setup: Multi-AZ deployment + +**Error Conditions:** + +- Increased operational costs due to more nodes +- Resource over-provisioning in some workloads +- Confusion about infrastructure vs application node allocation + +## Detailed Solution + + + +When you enable the Production check in SleakOps, the following changes are automatically applied to increase cluster availability: + +**Node Infrastructure Changes:** + +- Adds an additional node to the core node group in a different availability zone +- Ensures multi-AZ redundancy for critical cluster components +- Separates infrastructure workloads from application workloads using taints + +**Critical System Redundancy:** + +- Adds redundancy to critical cluster systems like Karpenter and ALB Controller +- 
Results in more pods running for infrastructure components +- Ensures high availability for cluster management services + +**Application Protection:** + +- Adds Pod Disruption Budgets (PDBs) to deployments +- Prevents Karpenter's consolidation policies from impacting running services +- Protects against node rotation affecting application availability + + + + + +Starting with SleakOps version 1.7.0, the platform implements a taint-based workload separation strategy: + +**Infrastructure Workloads:** + +- Run on dedicated nodes with specific taints +- Use Graviton instances (more cost-effective) +- Handle cluster management, monitoring, and logging components + +**Application Workloads:** + +- Run on separate nodes without infrastructure taints +- Use standard instance types optimized for your applications +- Isolated from infrastructure resource competition + +**Cost Impact:** + +- Infrastructure nodes use cheaper Graviton instances +- Application nodes maintain performance characteristics +- Overall cost may be similar despite more nodes + +```yaml +# Example taint configuration +infrastructure_nodes: + taints: + - key: "sleakops.com/infrastructure" + value: "true" + effect: "NoSchedule" + instance_types: ["t4g.medium", "t4g.large"] # Graviton instances + +application_nodes: + taints: [] # No taints for application workloads + instance_types: ["t3.medium", "t3.large"] # Standard instances +``` + + + + + +To optimize your cluster resources after enabling Production check: + +**Memory Optimization:** + +- Review memory requests vs actual usage +- Example: Worker jobs using 2GB peak but requesting 8GB total +- Adjust resource requests to match actual consumption + +```yaml +# Before optimization +resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "1000m" + +# After optimization based on actual usage +resources: + requests: + memory: "512Mi" # Reduced based on actual usage + cpu: "250m" + limits: + memory: "2Gi" # More realistic limit 
+ cpu: "500m" +``` + +**Infrastructure Component Tuning:** + +- Grafana resource requirements can be optimized +- Loki storage and memory allocation can be adjusted +- Anti-affinity rules may need correction for specific use cases + + + + + +Karpenter automatically optimizes instance selection and consolidation: + +**Instance Selection:** + +- Always selects the cheapest instances that meet workload requirements +- Considers spot vs on-demand pricing when spot is enabled +- Balances performance requirements with cost + +**Consolidation Policy:** + +- Automatically consolidates workloads when possible +- Respects Pod Disruption Budgets (PDBs) to maintain availability +- May not always result in fewer instances, but ensures cost efficiency + +**Availability vs Cost Trade-off:** + +- Production check prioritizes availability over cost +- On-demand instances increase billing but provide stability +- Multi-AZ deployment ensures resilience but requires more resources + +```yaml +# Karpenter NodePool configuration for cost optimization +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: cost-optimized +spec: + requirements: + - key: "karpenter.sh/capacity-type" + operator: In + values: ["spot", "on-demand"] # Prefer spot when available + - key: "node.kubernetes.io/instance-type" + operator: In + values: ["t3.medium", "t3.large", "t4g.medium", "t4g.large"] + disruption: + consolidationPolicy: WhenUnderutilized + consolidateAfter: 30s +``` + + + + + +To better understand and optimize your cluster: + +**Resource Monitoring:** + +- Use Grafana dashboards to monitor actual vs requested resources +- Track node utilization across different node groups +- Monitor cost trends after Production check enablement + +**Optimization Actions:** + +1. Review and adjust resource requests for all workloads +2. Consider enabling spot instances for non-critical workloads +3. Monitor PDB configurations to ensure they're not too restrictive +4. 
Regular review of Karpenter consolidation metrics + +**Expected Outcomes:** + +- Higher availability and resilience +- More predictable performance +- Optimized resource utilization over time +- Better separation of concerns between infrastructure and applications + + + +--- + +_This FAQ was automatically generated on November 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-production-mode-spot-instances.mdx b/docs/troubleshooting/cluster-production-mode-spot-instances.mdx new file mode 100644 index 000000000..b7b00e6b2 --- /dev/null +++ b/docs/troubleshooting/cluster-production-mode-spot-instances.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "Cluster Production Mode and Instance Types" +description: "Understanding production mode, SPOT vs On-Demand instances, and Karpenter configuration" +date: "2024-12-19" +category: "cluster" +tags: ["production", "spot-instances", "on-demand", "karpenter", "eks"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cluster Production Mode and Instance Types + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Production, SPOT Instances, On-Demand, Karpenter, EKS + +## Problem Description + +**Context:** Users need to understand the difference between production and development cluster modes in SleakOps, particularly regarding instance types (SPOT vs On-Demand) and their impact on cluster stability and costs. + +**Observed Symptoms:** + +- Confusion about when to use production mode +- Uncertainty about SPOT vs On-Demand instances for nodepools +- Questions about whether production mode automatically changes nodepool configurations +- Pod scheduling issues related to node taints in Karpenter + +**Relevant Configuration:** + +- Cluster mode: Production vs Development +- Instance types: SPOT vs On-Demand +- Karpenter nodepool configuration +- Critical system workloads (ALB, external-dns, etc.) 
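Because the SPOT vs On-Demand choice listed above is ultimately a cost/stability trade-off, a rough monthly estimate helps frame the decision. The hourly rates below are placeholder assumptions, not real AWS quotes — check current pricing for your region and instance type:

```bash
# Back-of-envelope monthly nodepool cost (730 hours/month average).
# awk is used for floating-point arithmetic, which plain sh lacks.
monthly_cost() {
  hourly="$1"; nodes="$2"
  awk -v h="$hourly" -v n="$nodes" 'BEGIN { printf "%.2f\n", h * n * 730 }'
}

monthly_cost 0.0832 3   # assumed on-demand rate, 3 nodes: prints 182.21
monthly_cost 0.0250 3   # assumed spot rate, 3 nodes: prints 54.75
```

The gap between the two numbers is what you are paying for guaranteed availability; the sections below explain which workloads justify it.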
+ +**Error Conditions:** + +- Pods failing to schedule due to node taints +- Cluster instability with SPOT instances +- Confusion about cost optimization vs reliability + +## Detailed Solution + + + +Production mode in SleakOps affects **critical cluster infrastructure**, not your application workloads: + +**What Production Mode Changes:** + +- Uses On-Demand instances for critical system components (Karpenter, ALB Controller, External-DNS, etc.) +- Applies additional configurations that improve cluster availability +- Ensures system stability for production workloads + +**What Production Mode Does NOT Change:** + +- Your application nodepools remain unchanged +- You can still use SPOT instances for your workloads +- Application deployment configurations are not affected + + + + + +**SPOT Instances:** + +- **Pros:** Up to 90% cost savings, good for fault-tolerant workloads +- **Cons:** Can be terminated by AWS with 2-minute notice +- **Best for:** Development environments, batch processing, stateless applications + +**On-Demand Instances:** + +- **Pros:** Guaranteed availability, predictable costs +- **Cons:** Higher cost (full price) +- **Best for:** Critical applications, databases, production workloads requiring high availability + +**Recommendation for Nodepools:** + +```yaml +# For applications that can handle node rotation +instance_type: spot + +# For critical applications requiring guaranteed availability +instance_type: on-demand +``` + + + + + +Karpenter automatically manages different node types with specific taints: + +**Common Karpenter Taints:** + +- `karpenter.sh/nodepool: spot-amd64` - SPOT instances with AMD64 architecture +- `karpenter.sh/nodepool: spot-arm64` - SPOT instances with ARM64 architecture +- `karpenter.sh/nodepool: ondemand-amd64` - On-Demand instances with AMD64 architecture +- `karpenter.sh/nodepool: ondemand-arm64` - On-Demand instances with ARM64 architecture + +**If pods can't schedule due to taints:** + +1. 
Check your pod tolerations +2. Verify nodepool configuration +3. Ensure Karpenter can provision appropriate nodes + +```yaml +# Example toleration for SPOT instances +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + + + + + +**For Production Clusters:** + +1. **Enable Production Mode:** + + - Go to Cluster Settings + - Enable "Production Mode" + - This secures critical system components + +2. **Nodepool Strategy:** + + - Use **On-Demand** for critical applications (databases, APIs) + - Use **SPOT** for fault-tolerant workloads (workers, batch jobs) + - Consider mixed nodepools for cost optimization + +3. **Application Preparation for SPOT:** + - Implement graceful shutdown handling + - Use horizontal pod autoscaling + - Design for node rotation tolerance + - Implement proper health checks + +**Example Mixed Configuration:** + +```yaml +# Critical workloads nodepool +critical_nodepool: + instance_type: on-demand + min_size: 2 + max_size: 10 + +# Batch processing nodepool +batch_nodepool: + instance_type: spot + min_size: 0 + max_size: 50 +``` + + + + + +If pods in `kube-system` namespace fail to schedule: + +**1. Check Node Availability:** + +```bash +kubectl get nodes +kubectl describe nodes +``` + +**2. Check Pod Status:** + +```bash +kubectl get pods -n kube-system +kubectl describe pod -n kube-system +``` + +**3. Check Karpenter Logs:** + +```bash +kubectl logs -n karpenter deployment/karpenter +``` + +**4. Verify Nodepool Configuration:** + +- Ensure nodepools are properly configured +- Check if Karpenter can provision new nodes +- Verify AWS quotas and permissions + +**5. 
Common Solutions:** + +- Restart Karpenter deployment +- Check AWS EC2 quotas +- Verify subnet capacity +- Review security group configurations + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-prometheus-memory-issues.mdx b/docs/troubleshooting/cluster-prometheus-memory-issues.mdx new file mode 100644 index 000000000..8d731413c --- /dev/null +++ b/docs/troubleshooting/cluster-prometheus-memory-issues.mdx @@ -0,0 +1,204 @@ +--- +sidebar_position: 3 +title: "Prometheus Memory Issues Causing Pod Hangs" +description: "Solution for Prometheus memory issues that cause application pods to hang in production" +date: "2024-12-23" +category: "cluster" +tags: ["prometheus", "memory", "monitoring", "troubleshooting", "pods"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Prometheus Memory Issues Causing Pod Hangs + +**Date:** December 23, 2024 +**Category:** Cluster +**Tags:** Prometheus, Memory, Monitoring, Troubleshooting, Pods + +## Problem Description + +**Context:** Users experience application pods hanging in production environments, where pods appear healthy in monitoring tools like Lens but API requests become completely unresponsive. The issue is typically related to Prometheus memory consumption affecting node stability. 
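
The starvation mechanism described here comes down to simple arithmetic: once Prometheus grows past its allocation, whatever memory remains on its node is all the co-located application pods get. A minimal sketch with assumed numbers (illustrative only, not values from a real cluster):

```shell
# Rough node headroom check (all numbers are assumptions for illustration):
# a node with 4 GiB allocatable memory hosting a Prometheus pod that has
# grown to 2200 MiB leaves little room for application pods.
ALLOCATABLE_MIB=4096
PROMETHEUS_MIB=2200
SYSTEM_RESERVED_MIB=512   # assumed kubelet/OS overhead
HEADROOM=$((ALLOCATABLE_MIB - PROMETHEUS_MIB - SYSTEM_RESERVED_MIB))
echo "Remaining memory for application pods: ${HEADROOM} MiB"
```

When the remaining headroom drops below what the co-located pods need, they stall rather than crash, which is why they still look healthy in monitoring tools.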
+ +**Observed Symptoms:** + +- Application pods show as "green" or healthy in Kubernetes monitoring tools +- API requests hang completely without returning responses +- Only some pods hang while others in the same deployment continue working normally +- Load balancer distributes traffic to both working and hanging pods +- Issue occurs primarily in production with high traffic, not in development +- Pods don't crash or restart, they just become unresponsive + +**Relevant Configuration:** + +- Prometheus addon installed in cluster +- Multiple application replicas (e.g., 15 replicas) +- Production environment with significant traffic load +- Pods distributed across multiple nodes + +**Error Conditions:** + +- Prometheus pod consuming excessive memory +- Node resource exhaustion causing pod hangs +- Intermittent failures affecting subset of pods +- Poor user experience due to request timeouts + +## Detailed Solution + + + +This issue typically occurs when: + +1. **Prometheus memory exhaustion**: The Prometheus pod starts consuming more RAM than allocated +2. **Node resource starvation**: When Prometheus exhausts node resources, it affects all pods on that node +3. **Partial service degradation**: Only pods on affected nodes hang, while others continue working +4. **Third-party service latency**: External API calls may contribute to the problem by creating bottlenecks + +The key indicator is that pods appear healthy but become unresponsive to requests. + + + + + +To resolve the immediate issue: + +1. **Access Prometheus addon configuration**: + + - Go to your cluster's **Addons** section + - Find **Prometheus** in the addon list + - Click to open configuration + +2. **Increase minimum RAM allocation**: + + - Locate the "RAM minimum" setting + - Increase the value to **2200 MB** (or higher based on your needs) + - Save the configuration + +3. 
**Apply the changes**: + - The system will update Prometheus with the new memory allocation + - This process may take up to 20 minutes to complete + +```yaml +# Example Prometheus configuration +prometheus: + resources: + requests: + memory: "2200Mi" + limits: + memory: "4000Mi" +``` + + + + + +If nodes are already affected: + +1. **Identify the problematic Prometheus pod**: + + - Look for Prometheus pod with containers showing orange/warning status + - Note which node is hosting this pod + +2. **Delete the affected node**: + + - Click on the node name to open node details + - Select "Delete" to remove the node + - **Important**: Do this at the same time or just before updating Prometheus + +3. **Verify recovery**: + - Wait for the Prometheus pod to be rescheduled on a new node + - Check that all containers show green/healthy status + - Monitor application pods for restored functionality + + + + + +To prevent future issues and better diagnose problems: + +1. **Install Application Performance Monitoring (APM)**: + + - **OpenTelemetry** (available as cluster addon, currently in beta) + - **New Relic** (external service) + - **Datadog** (external service) + +2. **Configure OpenTelemetry in SleakOps**: + + - Go to **Cluster** → **Addons** + - Install **OpenTelemetry** addon + - Apply to your project services + +3. 
**Monitor third-party service latency**: + - Check response times from external APIs + - Identify potential bottlenecks in service dependencies + - Set up alerts for high latency or timeouts + +```yaml +# Example OpenTelemetry configuration +opentelemetry: + enabled: true + exporters: + - jaeger + - prometheus + sampling_rate: 0.1 +``` + + + + + +**Resource Management:** + +- Set appropriate resource requests and limits for all applications +- Monitor resource usage trends over time +- Scale Prometheus resources proactively based on cluster size + +**Monitoring Setup:** + +- Implement comprehensive APM to track application performance +- Set up alerts for resource exhaustion +- Monitor third-party service dependencies + +**Troubleshooting Process:** + +- Create support tickets when issues occur with specific timeframes +- Document exact symptoms and error conditions +- Include relevant configuration details + +**Load Testing:** + +- Test applications under production-like traffic loads +- Validate resource allocation in staging environments +- Monitor for memory leaks and performance degradation + + + + + +When experiencing similar issues: + +**Immediate Actions:** + +1. Check Prometheus pod status and resource usage +2. Identify which nodes are affected +3. Increase Prometheus memory allocation +4. Delete affected nodes if necessary + +**Investigation Steps:** + +1. Note exact timeframes when issues occur +2. Check Grafana dashboards for resource metrics +3. Review application logs for errors or timeouts +4. Monitor third-party service response times + +**Documentation:** + +1. Create support tickets with specific timeframes +2. Include pod names and node information +3. Describe exact symptoms observed +4. 
Note any recent configuration changes + + + +--- + +_This FAQ was automatically generated on December 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx b/docs/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx new file mode 100644 index 000000000..e653d9f0f --- /dev/null +++ b/docs/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx @@ -0,0 +1,202 @@ +--- +sidebar_position: 3 +title: "Understanding EC2 Instances and Costs in EKS Clusters" +description: "Explanation of EC2 instances created by SleakOps and their cost implications" +date: "2025-01-29" +category: "cluster" +tags: ["eks", "aws", "ec2", "costs", "karpenter", "vpn"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Understanding EC2 Instances and Costs in EKS Clusters + +**Date:** January 29, 2025 +**Category:** Cluster +**Tags:** EKS, AWS, EC2, Costs, Karpenter, VPN + +## Problem Description + +**Context:** Users notice multiple EC2 instances running in their AWS account alongside their EKS cluster and want to understand the purpose of each instance and associated costs. 
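
As background for the cost questions, the fixed part of the bill can be estimated with back-of-envelope arithmetic. The rate used below is an assumption — verify it against current AWS pricing for your region:

```shell
# Fixed monthly baseline for an EKS cluster (rate is an assumption —
# check current AWS pricing for your region before relying on it).
EKS_RATE_PER_HOUR="0.10"   # EKS control plane, per cluster, per hour
HOURS_PER_MONTH=730
EKS_MONTHLY=$(awk -v r="$EKS_RATE_PER_HOUR" -v h="$HOURS_PER_MONTH" \
    'BEGIN { printf "%.0f", r * h }')
echo "EKS control plane: ~\$${EKS_MONTHLY}/month"
```

Worker-node and VPN instance costs come on top of this baseline and, unlike it, vary with workload.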
+ +**Observed Symptoms:** + +- Multiple EC2 instances appearing in AWS console (e.g., t4g.medium, t3a.small, c7g.xlarge) +- Uncertainty about which instances are for VPN vs cluster nodes +- Questions about cost implications as workloads scale +- Need to understand relationship between pods and EC2 instances + +**Relevant Configuration:** + +- EKS cluster with Karpenter node management +- Spot instances configuration for pods +- VPN instance for secure access +- Various instance types (t4g.medium, t3a.small, c7g.xlarge) + +**Error Conditions:** + +- No technical errors, but confusion about infrastructure costs +- Need for cost optimization understanding +- Scaling behavior clarification needed + +## Detailed Solution + + + +**VPN Instance:** + +- **Purpose**: Provides secure VPN access to your cluster and resources +- **Typical size**: Usually t3a.small or similar small instance +- **Cost**: Fixed cost, runs continuously regardless of workload +- **Scaling**: Does not scale with your application workload + +**Cluster Worker Nodes:** + +- **Purpose**: Run your application pods and Kubernetes workloads +- **Managed by**: Karpenter (automatic node provisioning) +- **Instance types**: Various sizes (t4g.medium, c7g.xlarge, etc.) 
+- **Cost**: Variable, scales with your workload demands + + + + + +Karpenter automatically manages your cluster nodes: + +```yaml +# Example Nodepool configuration +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: default +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.sh/capacity-type + operator: In + values: ["spot", "on-demand"] + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: default +``` + +**Key behaviors:** + +- **Automatic scaling**: Creates instances when pods need resources +- **Cost optimization**: Prefers spot instances when available +- **Right-sizing**: Selects appropriate instance types for workload requirements +- **Cleanup**: Terminates unused instances to reduce costs + + + + + +**Fixed Costs:** + +- VPN instance: ~$10-20/month (depending on instance size) +- EKS control plane: $0.10/hour (~$73/month) + +**Variable Costs (scale with workload):** + +- Worker node instances: Depends on: + - Number of pods running + - Resource requirements (CPU/memory) + - Instance types selected by Karpenter + - Spot vs On-Demand pricing + +**Cost optimization tips:** + +```bash +# Monitor your node utilization +kubectl top nodes + +# Check pod resource requests +kubectl describe pods -A | grep -A 5 "Requests:" + +# View Karpenter node decisions +kubectl logs -n karpenter deployment/karpenter +``` + + + + + +**Scaling Process:** + +1. **Pod Creation**: You deploy more pods to your cluster +2. **Resource Assessment**: Karpenter evaluates resource requirements +3. **Node Provisioning**: If existing nodes can't accommodate new pods: + - Karpenter provisions new EC2 instances + - Selects optimal instance type based on requirements + - Prefers spot instances for cost savings +4. **Pod Scheduling**: Kubernetes schedules pods on available nodes +5. 
**Scale Down**: When pods are deleted, Karpenter removes unused nodes + +**Example scaling scenario:** + +```yaml +# If you have pods with these requirements: +resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi +``` + +Karpenter will: + +- Calculate total resource needs +- Select cost-effective instance types +- Provision minimum number of instances needed + + + + + +**Cost Monitoring Tools:** + +1. **AWS Cost Explorer**: Track EC2 spending by instance type +2. **Karpenter Metrics**: Monitor node provisioning decisions +3. **SleakOps Dashboard**: View cluster resource utilization + +**Cost Control Strategies:** + +```yaml +# Set resource limits in your deployments +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + containers: + - name: app + resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi +``` + +**Nodepool cost controls:** + +- Configure maximum instance sizes +- Set node pool limits +- Use spot instances when possible +- Configure appropriate resource requests + +For detailed nodepool configuration, see: [Nodepool Documentation](https://docs.sleakops.com/cluster/nodepools) + + + +--- + +_This FAQ was automatically generated on January 29, 2025 based on a real user query._ diff --git a/docs/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx b/docs/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx new file mode 100644 index 000000000..64384a615 --- /dev/null +++ b/docs/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx @@ -0,0 +1,247 @@ +--- +sidebar_position: 3 +title: "Cluster Upgrade and Maintenance Scheduling" +description: "Best practices for scheduling cluster upgrades and maintenance windows" +date: "2024-12-19" +category: "cluster" +tags: ["cluster", "upgrade", "maintenance", "scheduling", "downtime"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cluster Upgrade and Maintenance 
Scheduling + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Cluster, Upgrade, Maintenance, Scheduling, Downtime + +## Problem Description + +**Context:** Organizations need to coordinate cluster upgrades and maintenance activities to minimize service disruption and ensure proper communication between SleakOps team and clients. + +**Observed Symptoms:** + +- Unexpected service interruptions during maintenance windows +- Lack of clear communication about upgrade schedules +- Incomplete cluster upgrades causing platform inconsistencies +- Services becoming unavailable without prior notice +- Confusion about maintenance start and end times + +**Relevant Configuration:** + +- Cluster upgrade processes (Kubernetes version updates) +- Infrastructure migration activities +- Domain and DNS configuration changes +- Load balancer and ingress updates + +**Error Conditions:** + +- Services failing during unscheduled maintenance +- Partial upgrades leaving clusters in inconsistent states +- DNS and ingress configuration issues post-upgrade +- Certificate validation problems after infrastructure changes + +## Detailed Solution + + + +### Pre-Maintenance Planning + +Before scheduling any cluster upgrade or maintenance: + +1. **Assess Impact**: Identify all services and workloads that will be affected +2. **Define Scope**: Clearly outline what will be upgraded or modified +3. **Estimate Duration**: Provide realistic time estimates for the maintenance window +4. 
**Prepare Rollback Plan**: Have a clear rollback strategy in case of issues + +### Communication Protocol + +- **Advance Notice**: Notify clients at least 48-72 hours before maintenance +- **Detailed Schedule**: Provide specific start and end times +- **Contact Information**: Ensure emergency contacts are available during maintenance +- **Status Updates**: Send regular updates during the maintenance window + + + + + +### Cluster Upgrade Best Practices + +```yaml +# Example maintenance schedule communication +Maintenance Window: + Start: "2024-12-20 02:00 UTC" + End: "2024-12-20 06:00 UTC" + Timezone: "UTC-3 (Argentina Time: 23:00 - 03:00)" + +Affected Services: + - Kubernetes cluster upgrade (1.28 → 1.29) + - Node pool updates + - Load balancer configuration + - DNS and certificate renewal + +Expected Downtime: + - Estimated: 15-30 minutes + - Maximum: 2 hours +``` + +### Coordination Steps + +1. **Schedule Agreement**: Confirm maintenance window with client +2. **Pre-upgrade Checklist**: Verify all prerequisites are met +3. **Backup Creation**: Ensure all critical data is backed up +4. **Staged Approach**: Upgrade non-production environments first +5. **Monitoring Setup**: Have monitoring and alerting ready + + + + + +### Pre-Maintenance Notification + +```markdown +Subject: Scheduled Maintenance - [Cluster Name] - [Date] + +Dear [Client], + +We have scheduled maintenance for your cluster: + +**Maintenance Details:** + +- Date: [Date] +- Start Time: [Time] [Timezone] +- End Time: [Time] [Timezone] +- Expected Duration: [Duration] + +**What will be done:** + +- [List of activities] + +**Expected Impact:** + +- [Description of potential service interruptions] + +**Preparation Required:** + +- [Any actions needed from client] + +We will send updates during the maintenance window. 
+ +Best regards, +SleakOps Team +``` + +### During Maintenance Updates + +```markdown +Subject: Maintenance Update - [Status] - [Time] + +**Status:** [In Progress/Completed/Issue] +**Time:** [Current Time] +**Progress:** [Description of current activities] +**Next Steps:** [What happens next] +**Estimated Completion:** [Updated time if different] +``` + + + + + +### Post-Maintenance Checklist + +1. **Service Verification** + + - Test all critical services + - Verify DNS resolution + - Check SSL certificates + - Validate load balancer rules + +2. **Performance Monitoring** + + - Monitor cluster performance for 24-48 hours + - Check resource utilization + - Verify auto-scaling functionality + +3. **Client Communication** + - Send completion notification + - Provide summary of changes made + - Include any post-maintenance instructions + +### Completion Notification Template + +```markdown +Subject: Maintenance Completed - [Cluster Name] + +Dear [Client], + +The scheduled maintenance has been completed successfully. + +**Summary:** + +- Start Time: [Actual start time] +- End Time: [Actual end time] +- Duration: [Actual duration] + +**Changes Made:** + +- [List of completed activities] + +**Verification:** + +- All services are operational +- Performance metrics are normal +- No issues detected + +**Next Steps:** + +- Monitor your applications for the next 24 hours +- Contact us immediately if you notice any issues + +Thank you for your patience. + +Best regards, +SleakOps Team +``` + + + + + +### If Issues Occur During Maintenance + +1. **Immediate Assessment** + + - Stop current activities if safe to do so + - Assess the severity of the issue + - Determine if rollback is necessary + +2. **Client Communication** + + - Notify client immediately of any issues + - Provide realistic timeline for resolution + - Explain available options (continue, rollback, pause) + +3. 
**Escalation Process** + - Have senior team members available + - Establish clear escalation paths + - Maintain detailed logs of all activities + +### Emergency Contact Protocol + +```yaml +Emergency Contacts: + Primary: "[Primary contact and phone]" + Secondary: "[Secondary contact and phone]" + Escalation: "[Management contact]" + +Communication Channels: + - Email (immediate) + - Phone/SMS (critical issues) + - Slack/Teams (real-time updates) +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cluster-upgrade-zero-downtime.mdx b/docs/troubleshooting/cluster-upgrade-zero-downtime.mdx new file mode 100644 index 000000000..4fd7a2ece --- /dev/null +++ b/docs/troubleshooting/cluster-upgrade-zero-downtime.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "Kubernetes Cluster Upgrade with Zero Downtime" +description: "How to perform cluster upgrades in SleakOps without service interruption" +date: "2025-02-19" +category: "cluster" +tags: ["upgrade", "kubernetes", "downtime", "maintenance", "production"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Cluster Upgrade with Zero Downtime + +**Date:** February 19, 2025 +**Category:** Cluster +**Tags:** Upgrade, Kubernetes, Downtime, Maintenance, Production + +## Problem Description + +**Context:** Users need to upgrade their Kubernetes clusters in SleakOps but are concerned about potential downtime during critical business hours, especially during high-traffic periods or end-of-month operations. 
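
A concrete way to reason about zero-downtime upgrades is a Kubernetes PodDisruptionBudget, which caps how many replicas may be evicted at once while nodes are rotated. The manifest below is a generic sketch — the name and labels are hypothetical, and it is not SleakOps' exact policy:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-service-pdb        # hypothetical name
spec:
  minAvailable: 33%            # keep at least a third of replicas running during evictions
  selector:
    matchLabels:
      app: web-service         # hypothetical label on the deployment's pods
```

With at least 2 replicas behind such a budget, a rolling node rotation can never drain a service to zero.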
+ +**Observed Symptoms:** + +- Concern about service interruption during cluster upgrades +- Need to schedule upgrades during low-traffic hours +- Uncertainty about the upgrade process and its impact +- Questions about self-service upgrade capabilities + +**Relevant Configuration:** + +- Production clusters with multiple replicas +- Web services and workers with high availability requirements +- Critical business operations during specific time windows +- Deployment policies configured for rolling updates + +**Error Conditions:** + +- Risk of downtime during business-critical hours +- Potential service degradation during upgrade process +- Need for immediate support if issues arise during upgrade + +## Detailed Solution + + + +SleakOps implements zero-downtime cluster upgrades through: + +1. **Rolling Node Updates**: Nodes are rotated gradually, not all at once +2. **Deployment Policies**: 33% of each deployment must remain active during upgrades +3. **Pod Distribution**: Services are distributed across multiple nodes +4. **Health Checks**: Continuous monitoring ensures service availability + +**Requirements for Zero Downtime:** + +- Production mode cluster +- At least 2 replicas per web service or worker +- Proper resource allocation and pod anti-affinity rules + + + + + +To run the upgrade yourself: + +1. **Access the Console**: Go to your cluster's settings page +2. **Navigate to General Settings**: Find the upgrade section +3. **Review Upgrade Details**: Check the target version and changes +4. **Initiate Upgrade**: Click the upgrade button when ready + +**Console Path:** + +``` +SleakOps Console → Clusters → [Your Cluster] → Settings → General Settings +``` + +**Best Practices:** + +- Schedule during low-traffic hours (early morning recommended) +- Notify your team before starting +- Monitor application logs during the process +- Have rollback plan ready if needed + + + + + +During the upgrade process, monitor: + +1. 
**Application Health**: Check your application endpoints +2. **Pod Status**: Ensure pods are being recreated successfully +3. **Resource Usage**: Monitor CPU and memory during transition +4. **Logs**: Watch for any error messages in application logs + +**Key Indicators of Successful Upgrade:** + +- All pods show "Running" status +- Application endpoints respond normally +- No error spikes in monitoring dashboards +- Database connections remain stable + + + + + +**Recommended Timing:** + +- Early morning hours (2-6 AM local time) +- Weekends or maintenance windows +- Avoid end-of-month or high-traffic periods +- Consider time zones for global applications + +**Pre-Upgrade Checklist:** + +- [ ] Verify cluster has multiple replicas +- [ ] Check recent application deployments are stable +- [ ] Ensure monitoring is active +- [ ] Notify relevant team members +- [ ] Plan for immediate support if needed + + + + + +SleakOps provides: + +1. **Priority Support**: Same-day priority for upgrade-related issues +2. **Proactive Monitoring**: Team monitors critical upgrades +3. **Immediate Response**: Quick resolution for any problems +4. **Post-Upgrade Validation**: Verification that all services are healthy + +**When to Contact Support:** + +- If upgrade takes longer than expected +- Application errors appear during upgrade +- Pods fail to restart properly +- Performance degradation after upgrade + + + + + +If issues occur during upgrade: + +1. **Immediate Actions**: + + - Contact SleakOps support immediately + - Document any error messages + - Avoid manual interventions unless directed + +2. **Rollback Process**: + + - SleakOps can initiate rollback if necessary + - Applications will be restored to previous versions + - Data integrity is maintained throughout + +3. 
**Post-Incident**: + - Root cause analysis provided + - Recommendations for future upgrades + - Schedule retry with additional precautions + + + +--- + +_This FAQ was automatically generated on February 19, 2025 based on a real user query._ diff --git a/docs/troubleshooting/connecting-to-cluster-with-lens.mdx b/docs/troubleshooting/connecting-to-cluster-with-lens.mdx new file mode 100644 index 000000000..9b5bd1c95 --- /dev/null +++ b/docs/troubleshooting/connecting-to-cluster-with-lens.mdx @@ -0,0 +1,205 @@ +--- +sidebar_position: 3 +title: "Connecting to Kubernetes Cluster with Lens IDE" +description: "Step-by-step guide to connect to your SleakOps cluster using Lens IDE" +date: "2024-12-19" +category: "cluster" +tags: ["lens", "kubernetes", "ide", "connection", "kubectl"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Connecting to Kubernetes Cluster with Lens IDE + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Lens, Kubernetes, IDE, Connection, kubectl + +## Problem Description + +**Context:** Users need to connect to their SleakOps Kubernetes cluster from their local development environment to monitor workloads, debug issues, and manage resources. + +**Observed Symptoms:** + +- Need to access cluster resources from local machine +- Want to use a visual interface instead of command line only +- Require ability to view pods, logs, and workload status +- Need to troubleshoot deployment issues + +**Relevant Configuration:** + +- SleakOps cluster with connection data available +- Local development machine +- Kubernetes IDE preference (Lens recommended) +- kubectl CLI as alternative option + +**Use Cases:** + +- Monitoring pod status and logs +- Debugging deployment issues +- Managing Kubernetes resources visually +- Troubleshooting application health checks + +## Detailed Solution + + + +Lens is a Kubernetes IDE that provides a visual interface for managing Kubernetes clusters. 
It's recommended by SleakOps because it offers: + +- **Visual cluster management**: Easy-to-use graphical interface +- **Real-time monitoring**: Live view of pods, services, and deployments +- **Log aggregation**: Centralized log viewing from multiple pods +- **Resource management**: Create, edit, and delete Kubernetes resources +- **Multi-cluster support**: Connect to multiple clusters simultaneously + +**Alternatives:** + +- kubectl CLI (command-line interface) +- Other Kubernetes IDEs (k9s, Octant, etc.) +- Cloud provider consoles + + + + + +Before connecting to your cluster, ensure you have these 4 dependencies installed on your local machine: + +1. **Lens IDE**: Download from [k8slens.dev](https://k8slens.dev) +2. **kubectl**: Kubernetes command-line tool +3. **AWS CLI**: For AWS authentication (if using AWS) +4. **Docker**: Container runtime (optional but recommended) + +### Installation Commands + +**macOS (using Homebrew):** + +```bash +# Install kubectl +brew install kubectl + +# Install AWS CLI +brew install awscli + +# Install Docker Desktop +brew install --cask docker +``` + +**Linux (Ubuntu/Debian):** + +```bash +# Install kubectl +curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + +# Install AWS CLI +curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +unzip awscliv2.zip +sudo ./aws/install +``` + +**Windows:** + +- Download Lens from the official website +- Install kubectl using chocolatey: `choco install kubernetes-cli` +- Install AWS CLI from AWS documentation + + + + + +To connect to your SleakOps cluster, you need to authenticate and obtain the necessary credentials: + +1. **Access SleakOps Dashboard**: Log into your SleakOps account +2. **Navigate to Cluster**: Go to your specific cluster +3. **Find Connection Data**: Look for "Connection Data" in the cluster options +4. 
**Get AWS Credentials**: Follow the provided link to obtain: + - AWS Access Key ID + - AWS Secret Access Key + - Session Token (if using temporary credentials) + +### Configure AWS Credentials + +```bash +# Configure AWS CLI with your credentials +aws configure +# Enter your Access Key ID, Secret Access Key, and region + +# Or set environment variables +export AWS_ACCESS_KEY_ID="your-access-key" +export AWS_SECRET_ACCESS_KEY="your-secret-key" +export AWS_SESSION_TOKEN="your-session-token" # if applicable +``` + + + + + +The kubeconfig file contains the cluster connection information: + +1. **Get kubeconfig YAML**: Copy the kubeconfig YAML from the SleakOps connection data +2. **Save to file**: Save it to your local kubeconfig file + +### Method 1: Direct file replacement + +```bash +# Backup existing kubeconfig (if any) +cp ~/.kube/config ~/.kube/config.backup + +# Create the .kube directory if it doesn't exist +mkdir -p ~/.kube + +# Save the new kubeconfig +# Paste the YAML content from SleakOps into ~/.kube/config +``` + +### Method 2: Merge with existing config + +```bash +# If you have multiple clusters, merge configurations +export KUBECONFIG=~/.kube/config:~/path/to/sleakops-config.yaml +kubectl config view --flatten > ~/.kube/config.new +mv ~/.kube/config.new ~/.kube/config +``` + +### Verify connection + +```bash +# Test the connection +kubectl cluster-info +kubectl get nodes +``` + + + + + +Once you have the kubeconfig configured: + +1. **Open Lens IDE** +2. **Add Cluster**: Click on "+" or "Add Cluster" +3. **Paste kubeconfig**: Paste the YAML content from SleakOps +4. 
**Connect**: Lens will automatically detect and connect to your cluster + +### Navigating in Lens + +- **Workloads → Pods**: View all pods in your cluster +- **Workloads → Deployments**: Manage your deployments +- **Network → Services**: View and manage services +- **Storage**: Manage persistent volumes and claims +- **Namespaces**: Your project environment will appear as `project-name-environment` + +### Finding Your Application + +Your application pods will be in a namespace following this pattern: + +``` +[project-name]-[environment-name] +``` + +For example: `myapp-production` or `myapp-staging` + + + +--- + +_This FAQ was automatically generated on December 19, 2025 based on a real user query._ diff --git a/docs/troubleshooting/cors-configuration-troubleshooting.mdx b/docs/troubleshooting/cors-configuration-troubleshooting.mdx new file mode 100644 index 000000000..d5abeebaa --- /dev/null +++ b/docs/troubleshooting/cors-configuration-troubleshooting.mdx @@ -0,0 +1,208 @@ +--- +sidebar_position: 3 +title: "CORS Configuration and Troubleshooting" +description: "How to diagnose and fix CORS errors in applications deployed on SleakOps" +date: "2024-12-19" +category: "workload" +tags: ["cors", "frontend", "backend", "api", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CORS Configuration and Troubleshooting + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** CORS, Frontend, Backend, API, Troubleshooting + +## Problem Description + +**Context:** Users experience CORS (Cross-Origin Resource Sharing) errors when their frontend application tries to communicate with their backend API, even though CORS appears to be properly configured. 
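It helps to recognize this failure exactly as the browser reports it. A typical console message looks like the following (this is Chrome's wording; other browsers phrase it differently, and the URLs are placeholders):

```
Access to fetch at 'https://api.example.com/data' from origin 'https://app.example.com'
has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on
the requested resource.
```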
+ +**Observed Symptoms:** + +- CORS error messages in browser console when making API calls +- Frontend requests to backend API are blocked +- Error occurs despite having CORS configuration in place +- Application worked previously but now shows CORS errors + +**Relevant Configuration:** + +- Frontend and backend deployed on SleakOps +- CORS configuration present in backend code +- API endpoints accessible but blocked by browser CORS policy + +**Error Conditions:** + +- Error appears when frontend makes requests to backend +- May occur on all requests or specific endpoints +- Browser blocks the request before it reaches the server +- Error persists despite apparent correct CORS setup + +## Detailed Solution + + + +SleakOps infrastructure does not interfere with CORS configuration. CORS is handled entirely at the application level, meaning: + +- **Infrastructure Level**: SleakOps load balancers and ingress controllers pass through CORS headers without modification +- **Application Level**: Your backend application is responsible for setting proper CORS headers +- **Browser Enforcement**: CORS is enforced by the browser, not by the server infrastructure + +This means CORS issues are typically related to your application configuration, not the platform. 
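Because the allow/deny decision lives entirely in application code, it can be reduced to a pure function. The sketch below (origins are hypothetical, and any web framework would wrap this logic in middleware) shows the per-request header computation a backend is responsible for:

```python
# App-level CORS in miniature: the server echoes back an allowed origin,
# and the browser blocks the response if the header is missing or mismatched.
ALLOWED_ORIGINS = {"http://localhost:3000", "https://app.example.com"}  # hypothetical

def cors_response_headers(request_origin: str) -> dict:
    """Headers to attach for an allowed origin; an empty dict means the browser will block."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            "Access-Control-Allow-Origin": request_origin,
            "Access-Control-Allow-Credentials": "true",
            "Vary": "Origin",  # caches must not reuse this response for other origins
        }
    return {}

print(cors_response_headers("https://app.example.com")["Access-Control-Allow-Origin"])
# → https://app.example.com
```

Note the `Vary: Origin` header: when the allowed origin is echoed dynamically, omitting it lets a CDN or proxy cache a response issued for one origin and serve it to another.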
+ + + + + +One of the most common CORS issues is related to Content-Type headers: + +**Check your CORS configuration allows the Content-Type you're sending:** + +```javascript +// Example: Express.js CORS configuration +const cors = require("cors"); + +app.use( + cors({ + origin: ["http://localhost:3000", "https://your-frontend-domain.com"], + methods: ["GET", "POST", "PUT", "DELETE", "OPTIONS"], + allowedHeaders: [ + "Content-Type", + "Authorization", + "X-Requested-With", + "Accept", + ], + credentials: true, + }) +); +``` + +**Common Content-Type values that need explicit permission:** + +- `application/json` +- `application/x-www-form-urlencoded` +- `multipart/form-data` +- `text/plain` + + + + + +**Check if recent changes affected CORS settings:** + +1. **Verify CORS middleware is properly configured:** + +```python +# Example: Flask-CORS configuration +from flask_cors import CORS + +app = Flask(__name__) +CORS(app, origins=[ + "http://localhost:3000", + "https://your-frontend-domain.com" +], supports_credentials=True) +``` + +2. **Ensure CORS headers are set for all relevant endpoints:** + +```javascript +// Manual CORS headers (if not using middleware) +app.use((req, res, next) => { + res.header("Access-Control-Allow-Origin", "https://your-frontend-domain.com"); + res.header("Access-Control-Allow-Methods", "GET,PUT,POST,DELETE,OPTIONS"); + res.header("Access-Control-Allow-Headers", "Content-Type, Authorization"); + res.header("Access-Control-Allow-Credentials", "true"); + + if (req.method === "OPTIONS") { + res.sendStatus(200); + } else { + next(); + } +}); +``` + +3. **Check for environment-specific configurations:** + +```yaml +# Example: Environment variables for CORS +CORS_ALLOWED_ORIGINS: "https://your-frontend.sleakops.app,http://localhost:3000" +CORS_ALLOW_CREDENTIALS: "true" +``` + + + + + +**Determine if the error affects all requests or specific ones:** + +1. 
**Test different endpoints:** + +```bash +# Test with curl to see server response +curl -H "Origin: https://your-frontend.com" \ + -H "Access-Control-Request-Method: POST" \ + -H "Access-Control-Request-Headers: Content-Type" \ + -X OPTIONS \ + https://your-api.sleakops.app/api/endpoint +``` + +2. **Check browser developer tools:** + + - Open Network tab + - Look for preflight OPTIONS requests + - Check response headers for CORS headers + - Verify the actual request headers being sent + +3. **Common patterns:** + - **Simple requests** (GET, POST with simple content-types) may work + - **Complex requests** (custom headers, JSON content-type) may fail + - **Preflight requests** (OPTIONS) may be missing proper responses + + + + + +**Check if frontend libraries are sending unexpected headers:** + +1. **Axios configuration:** + +```javascript +// Ensure axios is not adding problematic headers +const api = axios.create({ + baseURL: "https://your-api.sleakops.app", + withCredentials: true, + headers: { + "Content-Type": "application/json", + // Remove any custom headers that might trigger preflight + }, +}); +``` + +2. **Fetch API configuration:** + +```javascript +// Proper fetch configuration +fetch("https://your-api.sleakops.app/api/endpoint", { + method: "POST", + credentials: "include", // Only if you need cookies + headers: { + "Content-Type": "application/json", + }, + body: JSON.stringify(data), +}); +``` + +3. 
**Check for authentication headers:** + +```javascript +// Ensure Authorization header is allowed in CORS +const token = localStorage.getItem("token"); +if (token) { + headers["Authorization"] = `Bearer ${token}`; +} +``` + + + +_This FAQ was automatically generated on December 25, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cost-analysis-and-resource-optimization.mdx b/docs/troubleshooting/cost-analysis-and-resource-optimization.mdx new file mode 100644 index 000000000..f49e06773 --- /dev/null +++ b/docs/troubleshooting/cost-analysis-and-resource-optimization.mdx @@ -0,0 +1,179 @@ +--- +sidebar_position: 3 +title: "Cost Analysis and Resource Optimization in SleakOps" +description: "Guide for analyzing costs and optimizing RAM resources for WebServices and databases" +date: "2024-12-19" +category: "general" +tags: + [ + "cost-analysis", + "aws", + "grafana", + "resource-optimization", + "webservice", + "database", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cost Analysis and Resource Optimization in SleakOps + +**Date:** December 19, 2024 +**Category:** General +**Tags:** Cost Analysis, AWS, Grafana, Resource Optimization, WebService, Database + +## Problem Description + +**Context:** Users need to analyze monthly costs and optimize resource allocation for WebServices and databases in SleakOps, particularly when considering RAM adjustments. 
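The monthly view asked for here can also be produced locally from a daily cost export (for example, a CSV downloaded from Cost Explorer). A minimal sketch, assuming records shaped as `(ISO date, service, USD cost)` with made-up figures:

```python
from collections import defaultdict

# Hypothetical daily cost records: (ISO date, AWS service, USD cost)
daily = [
    ("2024-11-03", "AmazonRDS", 4.20),
    ("2024-11-17", "AmazonRDS", 4.20),
    ("2024-12-01", "AmazonEKS", 2.40),
]

def monthly_totals(records):
    """Roll daily (date, service, cost) rows up to per-month, per-service totals."""
    totals = defaultdict(float)
    for date, service, cost in records:
        totals[(date[:7], service)] += cost  # key on the "YYYY-MM" prefix
    return dict(totals)

print(monthly_totals(daily))
# → {('2024-11', 'AmazonRDS'): 8.4, ('2024-12', 'AmazonEKS'): 2.4}
```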
+ +**Observed Symptoms:** + +- Need for monthly cost analysis instead of daily breakdown +- Uncertainty about RAM resource optimization impact on costs +- Difficulty in determining optimal resource allocation for WebServices and databases +- Need to understand cost implications before making resource changes + +**Relevant Configuration:** + +- Platform: SleakOps with AWS backend +- Resources: WebServices and databases requiring RAM optimization +- Monitoring: Grafana dashboards available +- Billing: AWS console with detailed cost breakdown + +**Error Conditions:** + +- Inadequate cost visibility for decision making +- Potential resource over-allocation or under-allocation +- Difficulty predicting cost impact of resource changes + +## Detailed Solution + + + +For the most detailed cost analysis, use the AWS Console: + +1. **Navigate to AWS Cost Explorer or Billing Dashboard** +2. **Apply the crucial filter**: Set `Charge Type = Usage` + + - This filter is essential because AWS credits can show $0 costs even when resources are being used + - The Usage filter shows actual resource consumption regardless of credits applied + +3. **View detailed cost breakdown table**: + - Group by service, resource, or time period + - Analyze trends over monthly periods + - Export data for further analysis if needed + +**Important**: Always use the `Charge Type = Usage` filter to get accurate cost visibility, especially if your account has AWS credits that might mask actual usage costs. + + + + + +To determine if your WebService needs more RAM: + +1. **Access Grafana Dashboard**: + + - Navigate to `Kubernetes / Compute Resources / Namespace (Pods)` + +2. **Analyze current RAM usage**: + + - Look for your WebService pods + - Check RAM utilization percentage during different load periods + - Pay attention to both idle and peak usage patterns + +3. 
**Interpretation guidelines**: + - **Idle usage around 20%**: Generally acceptable baseline + - **Peak usage above 80%**: Consider increasing RAM + - **Consistent usage below 30%**: Potential for RAM reduction + +```yaml +# Example resource configuration +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + + + + + +When considering RAM adjustments: + +**For small increases (e.g., 300MB per pod)**: + +- If running 3 pods: 300MB × 3 = ~1GB total increase +- Impact on cluster costs may be minimal or zero +- Existing cluster nodes might have sufficient unused RAM capacity +- Cost increase depends on whether additional nodes are required + +**Cost estimation process**: + +1. **Check current node utilization** in Grafana +2. **Calculate total RAM increase** across all pods +3. **Verify if existing nodes can accommodate** the increase +4. **Estimate additional node costs** if scaling is required + +**Best practices**: + +- Start with small incremental changes +- Monitor impact before making larger adjustments +- Consider both WebService and database optimization together +- Test changes in staging environment first + + + + + +When optimizing database resources: + +1. **Monitor database performance metrics**: + + - Query execution times + - Connection pool utilization + - Memory usage patterns + +2. **Common optimization scenarios**: + + - **Over-allocated database RAM**: Can often be reduced if WebService RAM is increased + - **Database caching**: More WebService RAM can reduce database load + - **Connection pooling**: Optimize based on actual concurrent connections + +3. **Coordinated optimization approach**: + - Increase WebService RAM for better application performance + - Monitor database load reduction + - Gradually decrease database RAM if metrics support it + - Validate performance throughout the process + + + + + +For better monthly cost visibility: + +1. 
**AWS Cost Explorer**: + + - Set time range to monthly periods + - Group costs by service or resource tags + - Create custom reports for SleakOps resources + +2. **SleakOps billing dashboard**: + + - Use date range selectors for monthly views + - Export data for trend analysis + - Set up cost alerts for budget management + +3. **Regular monitoring routine**: + - Weekly resource utilization reviews + - Monthly cost analysis and optimization + - Quarterly resource allocation assessments + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/cronjob-timezone-deployment-behavior.mdx b/docs/troubleshooting/cronjob-timezone-deployment-behavior.mdx new file mode 100644 index 000000000..d189b1e86 --- /dev/null +++ b/docs/troubleshooting/cronjob-timezone-deployment-behavior.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 3 +title: "CronJob Timezone and Deployment Behavior" +description: "Understanding CronJob timezone configuration and execution behavior during deployments" +date: "2024-10-24" +category: "workload" +tags: ["cronjob", "timezone", "deployment", "scheduling", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CronJob Timezone and Deployment Behavior + +**Date:** October 24, 2024 +**Category:** Workload +**Tags:** CronJob, Timezone, Deployment, Scheduling, Kubernetes + +## Problem Description + +**Context:** User needs to configure CronJobs to run at specific times (e.g., 17:30 daily) and wants to understand timezone handling and execution behavior during deployments. 
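The UTC conversion at the heart of this problem can be computed rather than done by hand. A sketch using Python's standard `zoneinfo` module (the calendar date is arbitrary; Argentina does not observe DST, so its offset is stable):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_cron_for(hour: int, minute: int, tz: str) -> str:
    """Translate a daily local wall-clock time into a UTC cron expression."""
    local = datetime(2024, 10, 24, hour, minute, tzinfo=ZoneInfo(tz))
    utc = local.astimezone(ZoneInfo("UTC"))
    return f"{utc.minute} {utc.hour} * * *"

print(utc_cron_for(17, 30, "America/Argentina/Buenos_Aires"))
# → 30 20 * * *
```

For zones that do observe DST, the offset depends on the date, so a fixed UTC cron expression drifts by an hour twice a year.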
+ +**Observed Symptoms:** + +- CronJob appears to execute during deployment/release +- Uncertainty about cluster timezone configuration +- Need for precise scheduling at specific local times +- Concern about unintended executions + +**Relevant Configuration:** + +- Cron expression: `30 17 * * *` (daily at 17:30) +- Target timezone: Argentina (UTC-3) +- Cluster timezone: UTC +- Deployment triggers: Manual releases + +**Error Conditions:** + +- CronJob executes during deployment instead of scheduled time +- Timezone confusion between UTC and local time +- Multiple executions when only one daily execution is desired + +## Detailed Solution + + + +Kubernetes clusters in SleakOps are configured with **UTC timezone** by default. This means: + +- All CronJob schedules are interpreted in UTC time +- Your cron expression `30 17 * * *` runs at 17:30 UTC +- To run at 17:30 Argentina time (UTC-3), you need `30 20 * * *` + +**Time Conversion Example:** + +- Argentina time: 17:30 (UTC-3) +- UTC equivalent: 20:30 +- Correct cron expression: `30 20 * * *` + + + + + +CronJobs should **NOT** execute during deployments. If this is happening, it indicates a configuration issue: + +**Normal Behavior:** + +- CronJobs only execute based on their schedule +- Deployments update the CronJob definition but don't trigger execution +- Existing scheduled jobs continue running + +**If CronJob runs during deployment, check:** + +1. **RestartPolicy**: Should be `OnFailure` or `Never` +2. **Job History**: Verify `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` +3. 
**Concurrency**: Set `concurrencyPolicy: Forbid` to prevent overlapping jobs + + + + + +For a daily job at 17:30 Argentina time: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: daily-job +spec: + schedule: "30 20 * * *" # 20:30 UTC = 17:30 Argentina + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 1 + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: job-container + image: your-image + command: ["your-command"] +``` + +**Key Settings:** + +- `concurrencyPolicy: Forbid`: Prevents multiple instances +- `successfulJobsHistoryLimit: 3`: Keeps last 3 successful jobs +- `restartPolicy: OnFailure`: Only restarts on failure + + + + + +If you need more precise timezone control: + +**Option 1: Kubernetes 1.25+ TimeZone Support** + +```yaml +spec: + schedule: "30 17 * * *" + timeZone: "America/Argentina/Buenos_Aires" +``` + +**Option 2: Application-Level Timezone Handling** + +```bash +# In your container +export TZ=America/Argentina/Buenos_Aires +date # Will show Argentina time +``` + +**Option 3: Manual UTC Calculation** + +- Always calculate UTC equivalent of your desired local time +- Account for daylight saving time changes +- Update cron expressions when DST changes + + + + + +**Check CronJob Status:** + +```bash +kubectl get cronjobs +kubectl describe cronjob your-cronjob-name +``` + +**View Job History:** + +```bash +kubectl get jobs +kubectl logs job/your-job-name +``` + +**Common Issues:** + +1. **Wrong timezone calculation**: Double-check UTC conversion +2. **Deployment triggers execution**: Review job configuration +3. **Multiple executions**: Set `concurrencyPolicy: Forbid` +4. 
**Jobs not running**: Verify cron syntax and cluster time + +**Verify Cluster Time:** + +```bash +kubectl run temp-pod --image=busybox --rm -it -- date +``` + + + +--- + +_This FAQ was automatically generated on October 24, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-credentials-access.mdx b/docs/troubleshooting/database-credentials-access.mdx new file mode 100644 index 000000000..ec2bad699 --- /dev/null +++ b/docs/troubleshooting/database-credentials-access.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Database Credentials Access in SleakOps" +description: "How to access database credentials stored as Kubernetes secrets" +date: "2024-12-19" +category: "dependency" +tags: ["database", "credentials", "secrets", "postgres", "vargroup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Credentials Access in SleakOps + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Database, Credentials, Secrets, Postgres, VarGroup + +## Problem Description + +**Context:** Users need to access database credentials (specifically dbMaster user password) for administrative tasks like disabling triggers and data restoration, but the password is not directly visible in the SleakOps interface. 
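Because Kubernetes stores secret values base64-encoded, whichever retrieval path you use ends with a decode step. A sketch of decoding the `data` map as it appears in `kubectl get secret -o json` output (the encoded values here are made up for illustration):

```python
import base64

# Hypothetical `data` field of a {dbName}-postgres secret; values are base64-encoded
secret_data = {"username": "ZGJNYXN0ZXI=", "password": "czNjcjN0"}

# Decode every field back to plain text
creds = {key: base64.b64decode(value).decode("utf-8") for key, value in secret_data.items()}
print(creds["username"])
# → dbMaster
```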
+ +**Observed Symptoms:** + +- Database username is visible but password is not shown +- Need credentials for administrative database operations +- Password not available in SleakOps database interface +- Required for tasks like trigger management and data restoration + +**Relevant Configuration:** + +- Database type: PostgreSQL +- User: dbMaster +- Storage: Kubernetes Secret in cluster +- Secret naming pattern: `{dbName}-postgres` +- Location: Project namespace + +**Error Conditions:** + +- Cannot perform administrative database tasks without credentials +- Password not displayed in standard SleakOps interface +- Need access across multiple environments (development, staging, production) + +## Detailed Solution + + + +SleakOps doesn't store database passwords in its own database for security reasons. Instead, credentials are stored as Kubernetes Secrets within your cluster: + +- **Storage location**: Kubernetes Secret in the project namespace +- **Secret name format**: `{dbName}-postgres` +- **Security**: Follows Kubernetes secret management best practices +- **Access**: Available through VarGroup editing interface + + + + + +To view database credentials in SleakOps: + +1. **Navigate to your project** +2. **Go to VarGroup section** +3. **Click "Edit" on the relevant VarGroup** +4. 
**View the credentials**: SleakOps fetches the secret from the cluster and displays it + +```bash +# The secret structure typically looks like: +apiVersion: v1 +kind: Secret +metadata: + name: {dbName}-postgres + namespace: {project-namespace} +data: + username: + password: +``` + + + + + +If you have cluster access, you can retrieve credentials directly: + +```bash +# List secrets in the project namespace +kubectl get secrets -n {project-namespace} + +# Get the specific database secret +kubectl get secret {dbName}-postgres -n {project-namespace} -o yaml + +# Decode the password directly +kubectl get secret {dbName}-postgres -n {project-namespace} -o jsonpath='{.data.password}' | base64 --decode +``` + +**Note**: This method requires kubectl access to your cluster. + + + + + +For accessing credentials in different environments: + +1. **Development Environment**: + + - Navigate to development project + - Follow VarGroup editing process + +2. **Staging/Production**: + - Repeat the same process for each environment + - Each environment has its own namespace and secrets + - Secret names follow the same pattern: `{dbName}-postgres` + +```bash +# Example for different environments +# Development +kubectl get secret myapp-postgres -n myproject-dev + +# Staging +kubectl get secret myapp-postgres -n myproject-staging + +# Production +kubectl get secret myapp-postgres -n myproject-prod +``` + + + + + +Once you have the credentials, you can use them for: + +**Database Connection**: + +```bash +# Connect to PostgreSQL +psql -h {database-host} -U dbMaster -d {database-name} +``` + +**Common Administrative Tasks**: + +```sql +-- Disable triggers +ALTER TABLE {table_name} DISABLE TRIGGER ALL; + +-- Enable triggers +ALTER TABLE {table_name} ENABLE TRIGGER ALL; + +-- Create database backup +pg_dump -h {host} -U dbMaster {database_name} > backup.sql + +-- Restore database +psql -h {host} -U dbMaster {database_name} < backup.sql +``` + + + + + +**Important Security Notes**: + +- 
Credentials are stored securely as Kubernetes Secrets +- Access is controlled through SleakOps project permissions +- Never store credentials in plain text files +- Use environment-specific credentials +- Rotate passwords regularly + +**Access Control**: + +- Only users with project access can view credentials +- VarGroup editing requires appropriate permissions +- Cluster access is restricted to authorized users + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-maintenance-window-strategies.mdx b/docs/troubleshooting/database-maintenance-window-strategies.mdx new file mode 100644 index 000000000..aaf5504a8 --- /dev/null +++ b/docs/troubleshooting/database-maintenance-window-strategies.mdx @@ -0,0 +1,524 @@ +--- +sidebar_position: 15 +title: "Database Maintenance Window Strategies" +description: "Solutions for performing database migrations without blocking operations" +date: "2024-01-15" +category: "dependency" +tags: + [ + "database", + "migrations", + "maintenance", + "mysql", + "postgresql", + "read-replicas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Maintenance Window Strategies + +**Date:** January 15, 2024 +**Category:** Dependency +**Tags:** Database, Migrations, Maintenance, MySQL, PostgreSQL, Read-replicas + +## Problem Description + +**Context:** Users need to perform database migrations (like ALTER TABLE operations) on production databases that have high connection volumes from multiple sources including web services, cronjobs, and serverless functions. 
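One common way to keep an ALTER from queuing indefinitely behind long-lived readers is to make it fail fast and retry with backoff (in PostgreSQL, pairing this with `SET lock_timeout = '5s'` on the migration session). A sketch of the retry half; the `operation` callable is hypothetical and would wrap your migration statement:

```python
import time

def retry_with_backoff(operation, attempts: int = 5, base_delay: float = 0.1):
    """Run `operation`, retrying with exponential backoff on failure (e.g., a lock timeout)."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

The point of the short lock timeout is that the ALTER gives up quickly instead of holding its place in the lock queue and blocking every new reader behind it; the retry loop then tries again in a quieter moment.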
+ +**Observed Symptoms:** + +- Database migrations get blocked due to table locks +- ALTER operations cannot complete due to concurrent connections +- High number of active connections from various services +- Migration timeouts or failures during peak usage + +**Relevant Configuration:** + +- Database: Production database with high connection volume +- Migration type: ALTER TABLE operations +- Connection sources: Web services, cronjobs, Lambda functions +- Table size: Small tables (2 records) but high query frequency + +**Error Conditions:** + +- Migrations fail due to table locking conflicts +- ALTER operations wait indefinitely for table locks +- Database performance degradation during migration attempts + +## Detailed Solution + + + +Since SleakOps doesn't currently provide a built-in maintenance window feature, you can implement this at the application level: + +**Option 1: Feature Flag Approach** + +```javascript +// Environment variable or config +const MAINTENANCE_MODE = process.env.MAINTENANCE_MODE === "true"; + +// In your application routes +app.use((req, res, next) => { + if (MAINTENANCE_MODE && req.path !== "/health") { + return res.status(503).json({ + message: "System under maintenance. Please try again later.", + retryAfter: 300, // seconds + }); + } + next(); +}); +``` + +**Option 2: Database Connection Pooling Control** + +```javascript +// Temporarily reduce connection pool size +const pool = mysql.createPool({ + host: "localhost", + user: "user", + password: "password", + database: "mydb", + connectionLimit: MAINTENANCE_MODE ? 
1 : 10, +}); +``` + + + + + +Implementing a master-slave architecture can significantly reduce load on your primary database: + +**Database Configuration:** + +```yaml +# In your SleakOps database configuration +database: + type: mysql # or postgresql + master: + instance_class: db.t3.medium + allocated_storage: 100 + read_replicas: + - instance_class: db.t3.small + allocated_storage: 100 + region: same # or different for geographic distribution +``` + +**Application Code Changes:** + +```javascript +// Separate connection pools +const masterPool = mysql.createPool({ + host: process.env.DB_MASTER_HOST, + // ... master config +}); + +const replicaPool = mysql.createPool({ + host: process.env.DB_REPLICA_HOST, + // ... replica config +}); + +// Use replica for read operations +function getCountries() { + return replicaPool.query("SELECT * FROM countries"); +} + +// Use master for write operations +function updateCountry(id, data) { + return masterPool.query("UPDATE countries SET ? WHERE id = ?", [data, id]); +} +``` + +**Important Considerations:** + +- Read replicas have eventual consistency (few seconds delay) +- Critical reads that need immediate consistency should use master +- Route all writes to master database + + + + + +For small, rarely-changing datasets like country information, application-level caching is highly effective: + +**In-Memory Caching:** + +```javascript +class CountryCache { + constructor() { + this.cache = new Map(); + this.lastUpdate = null; + this.TTL = 60 * 60 * 1000; // 1 hour + } + + async getCountries() { + const now = Date.now(); + + if (!this.lastUpdate || now - this.lastUpdate > this.TTL) { + const countries = await this.fetchFromDatabase(); + this.cache.set("countries", countries); + this.lastUpdate = now; + return countries; + } + + return this.cache.get("countries"); + } + + async fetchFromDatabase() { + // Use replica for this read + return replicaPool.query("SELECT * FROM countries"); + } +} + +const countryCache = new 
CountryCache(); +``` + +**Redis Caching (Alternative):** + +```javascript +const redis = require("redis"); +const client = redis.createClient(process.env.REDIS_URL); + +async function getCachedCountries() { + const cached = await client.get("countries"); + + if (cached) { + return JSON.parse(cached); + } + + const countries = await replicaPool.query("SELECT * FROM countries"); + await client.setex("countries", 3600, JSON.stringify(countries)); // 1 hour TTL + + return countries; +} +``` + + + + + +**Strategy 1: Off-Peak Execution** + +```bash +# Schedule migrations during low-traffic periods +# Use cron or your CI/CD pipeline +0 2 * * * /path/to/migration-script.sh +``` + +**Strategy 2: Online Schema Changes (MySQL)** + +```sql +-- Use pt-online-schema-change for large tables +pt-online-schema-change \ + --alter "ADD COLUMN new_field VARCHAR(255)" \ + --execute \ + D=database_name,t=table_name +``` + +**Strategy 3: Blue-Green Database Deployment** + +```bash +# 1. Create new database instance +# 2. Apply migrations to new instance +# 3. Set up replication from old to new +# 4. Switch application to new database +# 5. 
Verify and cleanup old instance +``` + +**Strategy 4: Connection Throttling** + +```javascript +// Temporarily limit connections before migration +const connectionSemaphore = new Semaphore(2); // Allow only 2 concurrent connections + +async function executeQuery(query) { + await connectionSemaphore.acquire(); + try { + return await pool.query(query); + } finally { + connectionSemaphore.release(); + } +} +``` + + + + + +**Key Metrics to Monitor:** + +```javascript +// Database connection monitoring +const dbMetrics = { + activeConnections: () => pool.pool._allConnections.length, + availableConnections: () => pool.pool._freeConnections.length, + pendingConnections: () => pool.pool._connectionQueue.length, + queryQueueLength: () => pool.pool._acquiringConnections.length +}; + +// Log metrics during migration +setInterval(() => { + console.log('DB Metrics:', { + active: dbMetrics.activeConnections(), + available: dbMetrics.availableConnections(), + pending: dbMetrics.pendingConnections(), + queue: dbMetrics.queryQueueLength() + }); +}, 5000); // Every 5 seconds +``` + +**Database Performance Queries:** + +```sql +-- MySQL: Monitor running queries +SHOW PROCESSLIST; + +-- Check for locked tables +SHOW OPEN TABLES WHERE In_use > 0; + +-- Monitor slow queries +SHOW FULL PROCESSLIST; + +-- PostgreSQL: Monitor active connections +SELECT pid, usename, application_name, client_addr, state, query +FROM pg_stat_activity +WHERE state = 'active'; + +-- Check for locks +SELECT blocked_locks.pid AS blocked_pid, + blocked_activity.usename AS blocked_user, + blocking_locks.pid AS blocking_pid, + blocking_activity.usename AS blocking_user, + blocked_activity.query AS blocked_statement, + blocking_activity.query AS current_statement_in_blocking_process +FROM pg_catalog.pg_locks blocked_locks +JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid +JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype +AND 
blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE +JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid +WHERE NOT blocked_locks.GRANTED; +``` + +**AWS CloudWatch Integration:** + +```javascript +const AWS = require('aws-sdk'); +const cloudwatch = new AWS.CloudWatch(); + +async function publishCustomMetric(metricName, value) { + const params = { + Namespace: 'SleakOps/Database', + MetricData: [{ + MetricName: metricName, + Value: value, + Unit: 'Count', + Timestamp: new Date() + }] + }; + + await cloudwatch.putMetricData(params).promise(); +} + +// Example usage during migration +await publishCustomMetric('ActiveConnections', dbMetrics.activeConnections()); +``` + + + + + +**Connection Killing Procedures:** + +```sql +-- MySQL: Kill problematic connections +-- First, identify the connection +SHOW PROCESSLIST; + +-- Kill specific connection +KILL CONNECTION ; + +-- Kill all connections from specific user (emergency) +SELECT CONCAT('KILL CONNECTION ', id, ';') as kill_command +FROM INFORMATION_SCHEMA.PROCESSLIST +WHERE USER = 'problem_user'; + +-- PostgreSQL: Terminate connections +-- Find problematic connections +SELECT pid, usename, application_name, client_addr, state, query_start, query +FROM pg_stat_activity +WHERE state = 'active'; + +-- Terminate specific connection +SELECT pg_terminate_backend(); + +-- Terminate all connections from specific application +SELECT pg_terminate_backend(pid) +FROM pg_stat_activity +WHERE application_name = 'problem_app'; +``` + +**Migration Rollback Procedures:** + +```bash +#!/bin/bash +# Emergency rollback script + +# 1. Stop application traffic +kubectl scale deployment myapp --replicas=0 + +# 2. Restore from backup +aws rds restore-db-instance-from-db-snapshot \ + --db-instance-identifier mydb-rollback \ + --db-snapshot-identifier mydb-before-migration + +# 3. Update DNS or connection strings +# 4. 
Scale application back up +kubectl scale deployment myapp --replicas=3 + +# 5. Verify functionality +curl -f http://myapp.example.com/health || exit 1 +``` + +**Automated Circuit Breaker:** + +```javascript +class DatabaseCircuitBreaker { + constructor(threshold = 5, timeout = 60000) { + this.failureCount = 0; + this.threshold = threshold; + this.timeout = timeout; + this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN + this.nextAttempt = Date.now(); + } + + async execute(operation) { + if (this.state === 'OPEN') { + if (Date.now() < this.nextAttempt) { + throw new Error('Circuit breaker is OPEN'); + } + this.state = 'HALF_OPEN'; + } + + try { + const result = await operation(); + this.onSuccess(); + return result; + } catch (error) { + this.onFailure(); + throw error; + } + } + + onSuccess() { + this.failureCount = 0; + this.state = 'CLOSED'; + } + + onFailure() { + this.failureCount++; + if (this.failureCount >= this.threshold) { + this.state = 'OPEN'; + this.nextAttempt = Date.now() + this.timeout; + } + } +} +``` + + + + + +**Pre-Migration Checklist:** + +1. **Backup Verification:** + ```bash + # Verify backup integrity + aws rds describe-db-snapshots --db-instance-identifier mydb + aws rds restore-db-instance-from-db-snapshot --dry-run + ``` + +2. **Performance Baseline:** + ```bash + # Capture performance metrics before migration + aws cloudwatch get-metric-statistics \ + --namespace AWS/RDS \ + --metric-name DatabaseConnections \ + --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%S) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ + --period 300 \ + --statistics Average + ``` + +3. 
**Communication Plan:** + ```markdown + ## Maintenance Window Notification Template + + **Subject:** Scheduled Database Maintenance - [Date/Time] + + **What:** Database schema update for improved performance + **When:** [Date] at [Time] ([Timezone]) + **Duration:** Estimated 30 minutes + **Impact:** Minimal - read-only mode during migration + **Rollback:** Available within 15 minutes if needed + ``` + +**Post-Migration Verification:** + +```javascript +// Automated verification script +async function verifyMigration() { + const checks = [ + { name: 'Database Connection', test: () => pool.query('SELECT 1') }, + { name: 'Schema Validation', test: () => pool.query('DESCRIBE countries') }, + { name: 'Data Integrity', test: () => pool.query('SELECT COUNT(*) FROM countries') }, + { name: 'Performance Test', test: () => Promise.race([ + pool.query('SELECT * FROM countries LIMIT 10'), + new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 5000)) + ]) } + ]; + + for (const check of checks) { + try { + await check.test(); + console.log(`✅ ${check.name} - PASSED`); + } catch (error) { + console.error(`❌ ${check.name} - FAILED:`, error.message); + throw new Error(`Verification failed: ${check.name}`); + } + } +} +``` + +**Long-term Optimization:** + +1. **Query Optimization:** + ```sql + -- Add appropriate indexes + CREATE INDEX idx_countries_name ON countries(name); + CREATE INDEX idx_countries_code ON countries(code); + + -- Analyze query performance + EXPLAIN SELECT * FROM countries WHERE name = 'United States'; + ``` + +2. 
**Connection Pool Tuning:** + ```javascript + const optimizedPool = mysql.createPool({ + host: process.env.DB_HOST, + user: process.env.DB_USER, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + connectionLimit: 20, // Based on RDS instance capacity + acquireTimeout: 60000, + timeout: 60000, + reconnect: true, + charset: 'utf8mb4' + }); + ``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-migration-data-integrity-issues.mdx b/docs/troubleshooting/database-migration-data-integrity-issues.mdx new file mode 100644 index 000000000..74ba434c9 --- /dev/null +++ b/docs/troubleshooting/database-migration-data-integrity-issues.mdx @@ -0,0 +1,256 @@ +--- +sidebar_position: 3 +title: "Database Migration Data Integrity Issues" +description: "Troubleshooting incomplete data transfer during database migrations" +date: "2024-12-19" +category: "dependency" +tags: ["database", "migration", "data-integrity", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Migration Data Integrity Issues + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Database, Migration, Data Integrity, Troubleshooting + +## Problem Description + +**Context:** During production database migration processes in SleakOps, users may experience issues where the migration completes without errors but the data integrity is compromised. 
+ +**Observed Symptoms:** + +- Migration process completes successfully without error messages +- Recent data is missing from the target database +- Some tables appear incomplete after migration +- Data inconsistencies between source and destination databases + +**Relevant Configuration:** + +- Migration type: Production database migration +- Environment: Production to production transfer +- Migration tool: SleakOps database migration utilities +- Database type: Not specified (PostgreSQL/MySQL/etc.) + +**Error Conditions:** + +- Migration appears successful but data validation fails +- Most recent records are not transferred +- Incomplete table transfers occur sporadically +- No explicit error messages during migration process + +## Detailed Solution + + + +Before starting any database migration, perform these validation steps: + +1. **Record count verification**: + + ```sql + -- Check total records in source database + SELECT table_name, + (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count + FROM ( + SELECT table_name, + query_to_xml(format('select count(*) as cnt from %I.%I', + table_schema, table_name), false, true, '') as xml_count + FROM information_schema.tables + WHERE table_schema = 'public' + ) t; + ``` + +2. **Identify latest timestamps**: + + ```sql + -- Find most recent records per table + SELECT table_name, MAX(created_at) as latest_record + FROM your_table_name + GROUP BY table_name; + ``` + +3. **Create migration checklist**: + - Document current record counts + - Note latest timestamps + - Identify critical tables + - Plan validation queries + + + + + +When migrations complete but data is missing: + +1. **Check migration logs**: + + ```bash + # Review SleakOps migration logs + kubectl logs -n sleakops-system deployment/migration-controller + + # Look for specific patterns + grep -i "error\|warning\|timeout" migration.log + ``` + +2. 
**Verify connection timeouts**: + + - Database connection may timeout during large transfers + - Check network stability between source and destination + - Review database connection pool settings + +3. **Transaction isolation issues**: + ```sql + -- Check for long-running transactions + SELECT pid, now() - pg_stat_activity.query_start AS duration, query + FROM pg_stat_activity + WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'; + ``` + + + + + +After migration, perform these validation steps: + +1. **Record count comparison**: + + ```bash + # Script to compare record counts + #!/bin/bash + + SOURCE_DB="source_connection_string" + TARGET_DB="target_connection_string" + + for table in $(psql $SOURCE_DB -t -c "SELECT tablename FROM pg_tables WHERE schemaname='public';"); do + source_count=$(psql $SOURCE_DB -t -c "SELECT COUNT(*) FROM $table;") + target_count=$(psql $TARGET_DB -t -c "SELECT COUNT(*) FROM $table;") + + if [ "$source_count" != "$target_count" ]; then + echo "MISMATCH: $table - Source: $source_count, Target: $target_count" + fi + done + ``` + +2. **Data freshness verification**: + + ```sql + -- Check if recent data was migrated + SELECT table_name, + MAX(created_at) as latest_migrated, + NOW() - MAX(created_at) as data_age + FROM ( + -- Union all tables with timestamp columns + SELECT 'users' as table_name, created_at FROM users + UNION ALL + SELECT 'orders' as table_name, created_at FROM orders + -- Add other tables as needed + ) combined + GROUP BY table_name; + ``` + +3. **Referential integrity check**: + ```sql + -- Verify foreign key relationships + SELECT conname, conrelid::regclass, confrelid::regclass + FROM pg_constraint + WHERE contype = 'f' + AND NOT EXISTS ( + SELECT 1 FROM pg_constraint c2 + WHERE c2.conname = pg_constraint.conname + AND c2.connamespace != pg_constraint.connamespace + ); + ``` + + + + + +If data integrity issues are detected: + +1. 
**Incremental data sync**: + + ```sql + -- Sync missing recent records + INSERT INTO target_table + SELECT * FROM source_table + WHERE created_at > (SELECT MAX(created_at) FROM target_table) + ON CONFLICT (id) DO UPDATE SET + column1 = EXCLUDED.column1, + updated_at = EXCLUDED.updated_at; + ``` + +2. **Table-specific re-migration**: + + ```bash + # Re-migrate specific tables + pg_dump source_db -t table_name | psql target_db + ``` + +3. **Point-in-time recovery setup**: + + - Enable WAL archiving before future migrations + - Create database snapshots before migration + - Implement automated backup verification + +4. **Migration rollback procedure**: + + ```bash + # Restore from pre-migration backup + pg_restore -d target_database backup_file.dump + + # Verify restoration + psql target_database -c "SELECT COUNT(*) FROM critical_table;" + ``` + + + + + +To prevent future migration data integrity issues: + +1. **Implement migration testing**: + + - Always test migrations on staging environment first + - Use production data snapshots for testing + - Validate data integrity in test environment + +2. **Set up monitoring**: + ```yaml + # SleakOps monitoring configuration + apiVersion: v1 + kind: ConfigMap + metadata: + name: migration-monitoring + data: + config.yaml: | + checks: + - name: record_count_validation + query: "SELECT COUNT(*) FROM critical_table" + expected_min: 1000 + - name: data_freshness + query: "SELECT MAX(created_at) FROM events WHERE created_at > NOW() - INTERVAL '1 hour'" + expected_result: true + ``` + +3. **Create rollback procedures**: + + - Document exact steps to revert migration + - Test rollback procedures in staging + - Keep backup restoration scripts ready + +4. **Use incremental migration strategies**: + + - Break large migrations into smaller chunks + - Validate each chunk before proceeding + - Implement checkpointing for long-running migrations + +5. 
**Document everything**: + - Record all migration steps and timing + - Document any deviations from planned process + - Keep detailed logs for troubleshooting + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-migration-environment-variables.mdx b/docs/troubleshooting/database-migration-environment-variables.mdx new file mode 100644 index 000000000..ff2c244ec --- /dev/null +++ b/docs/troubleshooting/database-migration-environment-variables.mdx @@ -0,0 +1,455 @@ +--- +sidebar_position: 15 +title: "Database Migration Environment Variables Configuration" +description: "Solution for database migration issues with environment variables in SleakOps" +date: "2024-12-11" +category: "project" +tags: ["database", "migration", "environment-variables", "dotnet", "postgres"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Migration Environment Variables Configuration + +**Date:** December 11, 2024 +**Category:** Project +**Tags:** Database, Migration, Environment Variables, .NET, PostgreSQL + +## Problem Description + +**Context:** User encounters multiple issues when running .NET Entity Framework database migrations in SleakOps, specifically with PostgreSQL database connections and environment variable configurations. 
+ +**Observed Symptoms:** + +- Database migration command fails due to incorrect project path +- Database connection errors due to missing password authentication +- Migration hooks fail to find required environment variables like `CORS_ALLOWED_ORIGINS` +- Environment variables not accessible to migration processes + +**Relevant Configuration:** + +- Framework: .NET Entity Framework +- Database: PostgreSQL +- Environment: SleakOps platform +- Variable Groups: Backend project environment variables +- Migration command: `dotnet ef database update` + +**Error Conditions:** + +- Incorrect project path in migration command +- Mismatched environment variable names (`POSTGRES_PASS` vs `POSTGRES_PASSWORD`) +- Variable Groups scoped to specific services instead of global scope +- Migration hooks unable to access required environment variables + +## Detailed Solution + + + +The correct command syntax for running Entity Framework migrations is: + +```bash +dotnet ef database update --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj +``` + +**Common mistakes:** + +- Typos in the project path +- Missing or incorrect relative path references +- Wrong project file extension or name + +**Verification steps:** + +1. Verify the project file exists at the specified path +2. Check that the path is relative to your current working directory +3. 
Ensure the `.csproj` file name matches exactly + +```bash +# Verify project file exists +ls -la ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj + +# Check working directory structure +pwd && ls -la +``` + +**Alternative approaches:** + +```bash +# Run from project directory +cd Netdo.Firev.WebApi +dotnet ef database update + +# Specify connection string directly +dotnet ef database update --connection "Host=localhost;Database=mydb;Username=user;Password=pass" + +# Use configuration from specific environment +dotnet ef database update --environment Production +``` + + + + + +Environment variable naming must match exactly between your application configuration and Variable Groups: + +**Problem:** Application expects `POSTGRES_PASSWORD` but Variable Group contains `POSTGRES_PASS` + +**Solution:** + +1. **Check your application's configuration files** (appsettings.json, appsettings.Development.json): + +```json +{ + "ConnectionStrings": { + "DefaultConnection": "Host=${POSTGRES_HOST};Port=${POSTGRES_PORT};Database=${POSTGRES_DB};Username=${POSTGRES_USER};Password=${POSTGRES_PASSWORD};" + } +} +``` + +2. **Update Variable Groups to match exactly:** + +```env +# Variable Group: Backend Environment Variables +POSTGRES_HOST=your-db-host +POSTGRES_PORT=5432 +POSTGRES_DB=your-database-name +POSTGRES_USER=your-username +POSTGRES_PASSWORD=your-password # Match this name exactly +CORS_ALLOWED_ORIGINS=https://yourdomain.com +``` + +3. **Common naming patterns to standardize:** + +| Application Config | Variable Group | Status | +| ------------------- | ------------------- | ----------- | +| `POSTGRES_PASSWORD` | `POSTGRES_PASS` | ❌ Mismatch | +| `POSTGRES_PASSWORD` | `POSTGRES_PASSWORD` | ✅ Correct | +| `DATABASE_URL` | `DB_URL` | ❌ Mismatch | +| `DATABASE_URL` | `DATABASE_URL` | ✅ Correct | + + + + + +Variable Groups must be accessible to the migration process. 
Check the scope configuration: + +**Problem:** Variable Groups scoped only to specific workloads + +**Solution:** + +1. **Make Variable Groups globally accessible** within the project: + +```yaml +# In SleakOps Variable Group configuration +variable_group: + name: "backend-env-vars" + scope: "project" # Not "workload-specific" + environment: "production" + variables: + POSTGRES_HOST: "db.example.com" + POSTGRES_PASSWORD: "secure-password" + CORS_ALLOWED_ORIGINS: "https://yourdomain.com" +``` + +2. **Verify Variable Group inheritance:** + +- Global variables: Available to all workloads in the project +- Project variables: Available to all workloads in specific project +- Workload variables: Only available to specific workload + +**Best practice:** Use project-scoped Variable Groups for database connections that multiple workloads need to access. + + + + + +Ensure migration hooks can access the required environment variables: + +**Hook Configuration Example:** + +```yaml +# In your workload configuration +workload: + type: job + name: database-migration + image: mcr.microsoft.com/dotnet/sdk:8.0 + command: | + echo "Starting database migration..." 
+ echo "Database host: $POSTGRES_HOST" + + # Verify all required variables are present + if [ -z "$POSTGRES_PASSWORD" ]; then + echo "Error: POSTGRES_PASSWORD not found" + exit 1 + fi + + if [ -z "$CORS_ALLOWED_ORIGINS" ]; then + echo "Error: CORS_ALLOWED_ORIGINS not found" + exit 1 + fi + + # Run migration + dotnet ef database update --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj + + environment_variables: + # Reference Variable Groups + - variable_group: backend-env-vars + # Override or add specific variables if needed + - name: ASPNETCORE_ENVIRONMENT + value: Production +``` + +**Pre-migration variable verification script:** + +```bash +#!/bin/bash +# verify-env-vars.sh + +required_vars=( + "POSTGRES_HOST" + "POSTGRES_PORT" + "POSTGRES_DB" + "POSTGRES_USER" + "POSTGRES_PASSWORD" + "CORS_ALLOWED_ORIGINS" +) + +echo "Verifying environment variables..." + +missing_vars=() +for var in "${required_vars[@]}"; do + if [ -z "${!var}" ]; then + missing_vars+=("$var") + else + echo "✅ $var is set" + fi +done + +if [ ${#missing_vars[@]} -ne 0 ]; then + echo "❌ Missing required environment variables:" + printf '%s\n' "${missing_vars[@]}" + exit 1 +fi + +echo "✅ All required environment variables are present" +``` + + + + + +Proper configuration of .NET Entity Framework connection strings: + +**1. appsettings.json configuration:** + +```json +{ + "Logging": { + "LogLevel": { + "Default": "Information", + "Microsoft.AspNetCore": "Warning", + "Microsoft.EntityFrameworkCore": "Information" + } + }, + "AllowedHosts": "*", + "ConnectionStrings": { + "DefaultConnection": "Host=${POSTGRES_HOST};Port=${POSTGRES_PORT:-5432};Database=${POSTGRES_DB};Username=${POSTGRES_USER};Password=${POSTGRES_PASSWORD};Include Error Detail=true" + }, + "CorsSettings": { + "AllowedOrigins": "${CORS_ALLOWED_ORIGINS}" + } +} +``` + +**2. 
Program.cs configuration:** + +```csharp +var builder = WebApplication.CreateBuilder(args); + +// Replace environment variables in connection string +var connectionString = builder.Configuration.GetConnectionString("DefaultConnection"); +connectionString = Environment.ExpandEnvironmentVariables(connectionString); + +builder.Services.AddDbContext(options => + options.UseNpgsql(connectionString, npgsqlOptions => + { + npgsqlOptions.EnableRetryOnFailure( + maxRetryCount: 3, + maxRetryDelay: TimeSpan.FromSeconds(10), + errorCodesToAdd: null); + })); + +// Configure CORS +var corsOrigins = Environment.GetEnvironmentVariable("CORS_ALLOWED_ORIGINS")?.Split(',') + ?? new[] { "http://localhost:3000" }; + +builder.Services.AddCors(options => +{ + options.AddDefaultPolicy(policy => + { + policy.WithOrigins(corsOrigins) + .AllowAnyHeader() + .AllowAnyMethod() + .AllowCredentials(); + }); +}); +``` + +**3. Migration-specific connection configuration:** + +```bash +# Create a migration-specific connection string +export MIGRATION_CONNECTION_STRING="Host=${POSTGRES_HOST};Port=${POSTGRES_PORT};Database=${POSTGRES_DB};Username=${POSTGRES_USER};Password=${POSTGRES_PASSWORD};" + +# Run migration with explicit connection string +dotnet ef database update --connection "$MIGRATION_CONNECTION_STRING" --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj +``` + + + + + +**Step 1: Verify Database Connectivity** + +```bash +# Test database connection using psql +export PGPASSWORD="$POSTGRES_PASSWORD" +psql -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT version();" +``` + +**Step 2: Enable detailed logging** + +```json +{ + "Logging": { + "LogLevel": { + "Default": "Debug", + "Microsoft.EntityFrameworkCore.Database.Command": "Information", + "Microsoft.EntityFrameworkCore.Infrastructure": "Information" + } + } +} +``` + +**Step 3: Run migrations with verbose output** + +```bash +# Enable verbose Entity Framework logging +dotnet ef database update 
--verbose --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj + +# Check migration status +dotnet ef migrations list --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj +``` + +**Step 4: Manual migration verification** + +```bash +# Generate SQL script instead of applying directly +dotnet ef migrations script --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --output migration.sql + +# Review the generated SQL +cat migration.sql + +# Apply manually if needed +psql -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f migration.sql +``` + +**Step 5: Debug environment variables in container** + +```bash +# Add debugging commands to your migration job +echo "=== Environment Variables Debug ===" +env | grep -E "(POSTGRES|CORS)" | sort +echo "=== End Debug ===" + +# Test variable expansion +echo "Connection would be: Host=${POSTGRES_HOST};Port=${POSTGRES_PORT};Database=${POSTGRES_DB};Username=${POSTGRES_USER};Password=***" +``` + + + + + +**1. Use dedicated migration jobs:** + +```yaml +workload: + type: job + name: db-migration + schedule: null # Run manually or via CI/CD + restart_policy: Never + image: your-app-image:latest + command: | + # Pre-migration checks + ./scripts/verify-env-vars.sh + ./scripts/test-db-connection.sh + + # Backup before migration (if needed) + ./scripts/backup-database.sh + + # Run migration + dotnet ef database update --project src/YourApp.csproj + + # Post-migration verification + ./scripts/verify-migration.sh +``` + +**2. 
Environment-specific Variable Groups:** + +```yaml +# Development environment +variable_group: + name: "backend-dev" + environment: "development" + variables: + POSTGRES_HOST: "dev-db.internal" + POSTGRES_DB: "app_development" + CORS_ALLOWED_ORIGINS: "http://localhost:3000,http://localhost:3001" + +# Production environment +variable_group: + name: "backend-prod" + environment: "production" + variables: + POSTGRES_HOST: "prod-db.internal" + POSTGRES_DB: "app_production" + CORS_ALLOWED_ORIGINS: "https://yourdomain.com" +``` + +**3. Migration rollback strategy:** + +```bash +#!/bin/bash +# rollback-migration.sh + +# Get current migration +current_migration=$(dotnet ef migrations list --project src/YourApp.csproj --json | jq -r '.[-1].safeName') + +# Rollback to previous migration +previous_migration=$(dotnet ef migrations list --project src/YourApp.csproj --json | jq -r '.[-2].safeName') + +echo "Rolling back from $current_migration to $previous_migration" +dotnet ef database update "$previous_migration" --project src/YourApp.csproj +``` + +**4. 
Automated testing:** + +```bash +#!/bin/bash +# test-migration.sh + +# Apply migration to test database +export POSTGRES_DB="${POSTGRES_DB}_test" +dotnet ef database update --project src/YourApp.csproj + +# Run integration tests +dotnet test --filter Category=Integration + +# Cleanup test database +dropdb "${POSTGRES_DB}_test" +``` + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-migration-heroku-to-aws-rds.mdx b/docs/troubleshooting/database-migration-heroku-to-aws-rds.mdx new file mode 100644 index 000000000..44c7afccf --- /dev/null +++ b/docs/troubleshooting/database-migration-heroku-to-aws-rds.mdx @@ -0,0 +1,442 @@ +--- +sidebar_position: 3 +title: "Database Migration from Heroku to AWS RDS" +description: "Complete guide for migrating PostgreSQL databases from Heroku to AWS RDS using DMS and pg_restore" +date: "2024-12-19" +category: "dependency" +tags: ["database", "migration", "heroku", "aws", "rds", "postgresql", "dms"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Migration from Heroku to AWS RDS + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Database, Migration, Heroku, AWS, RDS, PostgreSQL, DMS + +## Problem Description + +**Context:** Users need to migrate large PostgreSQL databases from Heroku to AWS RDS through SleakOps platform, dealing with incomplete dumps and size discrepancies between source and restored databases. 
+ +**Observed Symptoms:** + +- Heroku dump restoration results in smaller database size (94GB vs 385GB production) +- Standard curl-based dump download may be incomplete +- Need for external database access for migration tools +- Requirement for AWS DMS setup to complete migration + +**Relevant Configuration:** + +- Source: Heroku PostgreSQL database (385GB) +- Target: AWS RDS PostgreSQL +- Restored dump size: 94GB (incomplete) +- Migration method: pg_restore with --jobs=8 --no-owner --no-acl + +**Error Conditions:** + +- Incomplete data transfer using standard Heroku dump methods +- Size mismatch indicating data loss during migration +- Need for continuous replication to catch up missing data + +## Detailed Solution + + + +Heroku's standard dump process may not capture all data due to: + +1. **Timeout limitations**: Large databases may timeout during dump generation +2. **Active transactions**: Data written during dump creation might be missed +3. **Compression issues**: Some data types may not compress/decompress properly +4. **Connection limits**: Heroku may limit long-running dump operations + +**Verification steps:** + +```bash +# Check table row counts in both databases +psql -h heroku-host -c "SELECT schemaname,tablename,n_tup_ins-n_tup_del as rowcount FROM pg_stat_user_tables ORDER BY rowcount DESC;" +psql -h rds-host -c "SELECT schemaname,tablename,n_tup_ins-n_tup_del as rowcount FROM pg_stat_user_tables ORDER BY rowcount DESC;" +``` + + + + + +AWS Database Migration Service (DMS) can handle large database migrations with minimal downtime: + +**Prerequisites:** + +1. Create DMS replication instance +2. Configure source endpoint (Heroku PostgreSQL) +3. Configure target endpoint (AWS RDS) +4. 
Set up external database access + +**DMS Configuration:** + +```json +{ + "replication-instance-class": "dms.t3.large", + "allocated-storage": 100, + "apply-immediately": true, + "auto-minor-version-upgrade": true, + "multi-az": false, + "publicly-accessible": true, + "vpc-security-group-ids": ["sg-xxxxxxxxxxxx"] +} +``` + +**Source Endpoint (Heroku):** + +```bash +# Create Heroku source endpoint +aws dms create-endpoint \ + --endpoint-identifier heroku-source \ + --endpoint-type source \ + --engine-name postgres \ + --server-name your-heroku-host.amazonaws.com \ + --port 5432 \ + --database-name your_database \ + --username your_username \ + --password your_password \ + --ssl-mode require +``` + +**Target Endpoint (RDS):** + +```bash +# Create RDS target endpoint +aws dms create-endpoint \ + --endpoint-identifier rds-target \ + --endpoint-type target \ + --engine-name postgres \ + --server-name your-rds-endpoint.amazonaws.com \ + --port 5432 \ + --database-name postgres \ + --username rds_username \ + --password rds_password +``` + + + + + +**Full Load + CDC Task:** + +```json +{ + "replication-task-identifier": "heroku-to-rds-migration", + "source-endpoint-arn": "arn:aws:dms:region:account:endpoint:heroku-source", + "target-endpoint-arn": "arn:aws:dms:region:account:endpoint:rds-target", + "replication-instance-arn": "arn:aws:dms:region:account:rep:replication-instance", + "migration-type": "full-load-and-cdc", + "table-mappings": { + "rules": [ + { + "rule-type": "selection", + "rule-id": "1", + "rule-name": "1", + "object-locator": { + "schema-name": "public", + "table-name": "%" + }, + "rule-action": "include" + } + ] + } +} +``` + +**Task Settings:** + +```json +{ + "TargetMetadata": { + "TargetSchema": "", + "SupportLobs": true, + "FullLobMode": false, + "LobChunkSize": 0, + "LimitedSizeLobMode": true, + "LobMaxSize": 32, + "InlineLobMaxSize": 0, + "LoadMaxFileSize": 0, + "ParallelLoadThreads": 0, + "ParallelLoadBufferSize": 0, + "BatchApplyEnabled": false, 
+ "TaskRecoveryTableEnabled": false + }, + "FullLoadSettings": { + "TargetTablePrepMode": "DROP_AND_CREATE", + "CreatePkAfterFullLoad": false, + "StopTaskCachedChangesApplied": false, + "StopTaskCachedChangesNotApplied": false, + "MaxFullLoadSubTasks": 8, + "TransactionConsistencyTimeout": 600, + "CommitRate": 10000 + }, + "ChangeProcessingTuning": { + "BatchApplyPreserveTransaction": true, + "BatchApplyTimeoutMin": 1, + "BatchApplyTimeoutMax": 30, + "BatchApplyMemoryLimit": 500, + "BatchSplitSize": 0, + "MinTransactionSize": 1000, + "CommitTimeout": 1, + "MemoryLimitTotal": 1024, + "MemoryKeepTime": 60, + "StatementCacheSize": 50 + } +} +``` + + + + + +To allow DMS to access both Heroku and RDS: + +**Heroku Database Configuration:** + +1. **Enable external connections** in Heroku PostgreSQL settings +2. **Whitelist DMS replication instance IP** in Heroku +3. **Configure SSL requirements**: + +```sql +-- Check Heroku SSL configuration +SELECT name, setting FROM pg_settings WHERE name LIKE '%ssl%'; + +-- Verify connection requirements +SHOW ssl; +``` + +**RDS Security Group Configuration:** + +```bash +# Allow DMS replication instance access +aws ec2 authorize-security-group-ingress \ + --group-id sg-xxxxxxxxxxxx \ + --protocol tcp \ + --port 5432 \ + --source-group sg-dms-replication-instance +``` + +**Network Configuration:** + +```bash +# Create VPC endpoints for DMS if needed +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxxxxxx \ + --service-name com.amazonaws.region.dms \ + --route-table-ids rtb-xxxxxxxxx +``` + + + + + +**Monitor DMS Task:** + +```bash +# Check task status +aws dms describe-replication-tasks \ + --filters Name=replication-task-id,Values=heroku-to-rds-migration + +# Monitor CloudWatch metrics +aws cloudwatch get-metric-statistics \ + --namespace AWS/DMS \ + --metric-name CDCLatencySource \ + --dimensions Name=ReplicationTaskArn,Value=your-task-arn \ + --start-time 2024-12-19T00:00:00Z \ + --end-time 2024-12-19T23:59:59Z \ + --period 300 \ + 
--statistics Average +``` + +**Data Validation Queries:** + +```sql +-- Compare row counts +SELECT + 'source' as database, + schemaname, + tablename, + n_tup_ins - n_tup_del as estimated_rows +FROM pg_stat_user_tables +WHERE schemaname = 'public' +ORDER BY estimated_rows DESC; + +-- Compare data sizes +SELECT pg_size_pretty(pg_database_size(current_database())) as database_size; + +-- Check for missing sequences +SELECT sequence_name, last_value +FROM information_schema.sequences s +JOIN pg_sequences ps ON s.sequence_name = ps.sequencename; +``` + +**Performance Monitoring:** + +```sql +-- Monitor active connections during migration +SELECT + client_addr, + state, + query_start, + state_change, + query +FROM pg_stat_activity +WHERE state != 'idle' +ORDER BY query_start; + +-- Check for locks +SELECT + blocked_locks.pid AS blocked_pid, + blocked_activity.usename AS blocked_user, + blocking_locks.pid AS blocking_pid, + blocking_activity.usename AS blocking_user, + blocked_activity.query AS blocked_statement, + blocking_activity.query AS current_statement_in_blocking_process +FROM pg_catalog.pg_locks blocked_locks +JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid +JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype +AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE +AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation +WHERE NOT blocked_locks.GRANTED; +``` + + + + + +**Pre-cutover Checklist:** + +1. **Verify data consistency**: + + ```sql + -- Final row count comparison + SELECT tablename, + (SELECT count(*) FROM tablename) as current_count + FROM pg_tables + WHERE schemaname = 'public'; + ``` + +2. **Check CDC lag**: + + ```bash + aws dms describe-replication-tasks --query 'ReplicationTasks[0].ReplicationTaskStats' + ``` + +3. 
**Prepare rollback plan**: + - Document current Heroku connection strings + - Prepare DNS changes + - Test rollback procedures + +**Cutover Steps:** + +1. **Put application in maintenance mode** +2. **Stop CDC replication**: + + ```bash + aws dms stop-replication-task --replication-task-arn your-task-arn + ``` + +3. **Final data verification** +4. **Update application configuration**: + + ```yaml + # Update SleakOps VariableGroup + DATABASE_URL: postgresql://username:password@rds-endpoint:5432/database + DATABASE_HOST: rds-endpoint.amazonaws.com + DATABASE_PORT: 5432 + ``` + +5. **Deploy application updates** +6. **Perform smoke tests** +7. **Remove maintenance mode** + +**Post-cutover Validation:** + +```bash +# Test application connectivity +curl -I https://your-app.com/health + +# Monitor application logs +kubectl logs -f deployment/your-app -n namespace + +# Check database connections +psql -h rds-endpoint -c "SELECT count(*) FROM pg_stat_activity;" +``` + + + + + +**Large Object (LOB) Issues:** + +```sql +-- Find tables with LOB columns +SELECT table_name, column_name, data_type +FROM information_schema.columns +WHERE data_type IN ('text', 'bytea', 'oid') +AND table_schema = 'public'; + +-- Check LOB sizes +SELECT + tablename, + attname, + n_distinct, + most_common_vals +FROM pg_stats +WHERE tablename IN (SELECT tablename FROM pg_tables WHERE schemaname = 'public') +AND attname IN (SELECT column_name FROM information_schema.columns WHERE data_type = 'text'); +``` + +**Permission Issues:** + +```sql +-- Grant necessary permissions to DMS user +GRANT CONNECT ON DATABASE your_database TO dms_user; +GRANT USAGE ON SCHEMA public TO dms_user; +GRANT SELECT ON ALL TABLES IN SCHEMA public TO dms_user; +GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO dms_user; + +-- For target RDS +GRANT CREATE ON DATABASE your_database TO dms_user; +GRANT ALL PRIVILEGES ON SCHEMA public TO dms_user; +``` + +**Network Connectivity Issues:** + +```bash +# Test connectivity from DMS 
subnet
+aws dms test-connection \
+  --replication-instance-arn your-replication-instance-arn \
+  --endpoint-arn your-endpoint-arn
+```
+
+**Performance Optimization:**
+
+1. **Increase DMS instance size** if needed
+2. **Optimize target RDS instance**:
+
+   ```sql
+   -- Temporarily increase write capacity. On self-managed PostgreSQL only:
+   -- RDS does not permit ALTER SYSTEM, so set these values in a DB parameter
+   -- group instead (shared_buffers also requires a restart, not a reload).
+   ALTER SYSTEM SET shared_buffers = '2GB';
+   ALTER SYSTEM SET effective_cache_size = '8GB';
+   ALTER SYSTEM SET maintenance_work_mem = '1GB';
+   SELECT pg_reload_conf();
+   ```
+
+3. **Parallel loading settings**:
+   ```json
+   {
+     "MaxFullLoadSubTasks": 8,
+     "ParallelLoadThreads": 8,
+     "CommitRate": 50000
+   }
+   ```
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/database-migrations-execution-hooks.mdx b/docs/troubleshooting/database-migrations-execution-hooks.mdx
new file mode 100644
index 000000000..21698d4fe
--- /dev/null
+++ b/docs/troubleshooting/database-migrations-execution-hooks.mdx
@@ -0,0 +1,191 @@
+---
+sidebar_position: 15
+title: "Database Migrations with Execution Hooks"
+description: "How to configure and manage database migrations using SleakOps Execution Hooks"
+date: "2024-12-19"
+category: "workload"
+tags: ["database", "migrations", "hooks", "executions", "deployment"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Database Migrations with Execution Hooks
+
+**Date:** December 19, 2024
+**Category:** Workload
+**Tags:** Database, Migrations, Hooks, Executions, Deployment
+
+## Problem Description
+
+**Context:** Users need to run database migrations as part of their deployment process in SleakOps, specifically for .NET applications using Entity Framework migrations. 
+
+**Observed Symptoms:**
+
+- Need to execute `dotnet ef database update` commands during deployments
+- Uncertainty about when and how migrations should run in the CI/CD pipeline
+- Questions about running migrations independently of deployments
+- Need for automated migration execution before code updates
+
+**Relevant Configuration:**
+
+- Environment: Development and Production environments
+- Framework: .NET with Entity Framework
+- Migration command: `dotnet ef database update`
+- Hook type: `pre-upgrade`
+- Execution type: Hook and Job
+
+**Error Conditions:**
+
+- Migrations not running automatically during deployments
+- Need to run migrations without triggering full deployment
+- Integration of migration commands in build process
+
+## Detailed Solution
+
+
+
+SleakOps provides built-in database migration support through Execution Hooks:
+
+**How it works:**
+
+1. **Pre-upgrade Hooks**: Automatically created as `db-migration` hooks in your environments
+2. **Automatic execution**: Runs before each deployment when you push code to `develop` or `main` branches
+3. **Command execution**: Executes the configured migration command (`dotnet ef database update`) before updating application code
+
+**Configuration:**
+
+```yaml
+# Hook configuration (automatically created)
+name: db-migration
+type: pre-upgrade
+command: dotnet ef database update --project YourProject.csproj
+```
+
+**Process flow:**
+
+1. Push code to `develop` or `main` branch
+2. CI/CD triggers Build process
+3. Deployment starts
+4. **Pre-upgrade hook runs** → Database migration executes
+5. Application code gets updated in the cluster
+
+This means migrations run automatically with every deployment, so manual migration execution before builds is typically unnecessary.
+
+
+
+
+
+For running migrations independently of deployments:
+
+**Create a Job-type Execution:**
+
+1. Go to your project in SleakOps
+2. Navigate to **Executions** section
+3. Create new execution with type **Job**
+4. 
Configure the migration command
+
+**Job configuration:**
+
+```yaml
+name: manual-db-migration
+type: job
+command: dotnet ef database update --project YourProject.csproj
+```
+
+**Characteristics:**
+
+- **One-time execution**: Runs only when manually triggered
+- **Independent**: Doesn't affect deployments or builds
+- **On-demand**: Execute whenever you need to run migrations manually
+
+**Use cases:**
+
+- Emergency database fixes
+- Testing migrations in staging
+- Rollback scenarios
+- Initial database setup
+
+
+
+
+
+For running migrations during the build process (not recommended as primary approach):
+
+**Add migration command to Dockerfile:**
+
+```dockerfile
+# Your existing Dockerfile content
+WORKDIR /app
+COPY . .
+
+# Add migration command before CMD
+# (requires the dotnet-ef tool to be available in the build image)
+RUN dotnet ef database update --project YourProject.csproj
+
+# Your existing CMD instruction
+CMD ["dotnet", "YourApp.dll"]
+```
+
+**Important considerations:**
+
+- **Database connectivity**: Ensure database is accessible during build
+- **Connection strings**: Must be available at build time
+- **Build environment**: Database server must be reachable from build environment
+- **Security**: Avoid exposing production credentials in build process
+
+**Alternative approach using multi-stage builds:**
+
+```dockerfile
+# Build stage
+FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
+WORKDIR /src
+COPY . .
+RUN dotnet restore
+RUN dotnet publish -c Release -o /src/published
+
+# Migration stage (optional)
+FROM build AS migration
+RUN dotnet ef database update --project YourProject.csproj
+
+# Runtime stage
+FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS runtime
+WORKDIR /app
+COPY --from=build /src/published .
+CMD ["dotnet", "YourApp.dll"]
+```
+
+
+
+
+
+**Recommended approach:**
+
+1. **Use pre-upgrade hooks** (default SleakOps behavior) for automatic migrations
+2. **Create Job-type executions** for manual/emergency migrations
+3. 
**Avoid Dockerfile migrations** unless specific requirements demand it + +**Migration strategy checklist:** + +- ✅ Verify pre-upgrade hooks are configured in all environments +- ✅ Test migrations in development environment first +- ✅ Create Job-type execution for manual migration capability +- ✅ Ensure database connection strings are properly configured +- ✅ Monitor migration execution logs during deployments + +**Troubleshooting common issues:** + +- **Hook not running**: Check if hook exists in environment configuration +- **Connection failures**: Verify database connectivity from cluster +- **Permission errors**: Ensure service account has database access +- **Migration conflicts**: Review Entity Framework migration history + +**Environment-specific considerations:** + +- **Development**: Frequent migrations, use automatic hooks +- **Staging**: Test migrations before production, use both hooks and jobs +- **Production**: Careful migration planning, backup before migration + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-performance-optimization.mdx b/docs/troubleshooting/database-performance-optimization.mdx new file mode 100644 index 000000000..b5e8c89c7 --- /dev/null +++ b/docs/troubleshooting/database-performance-optimization.mdx @@ -0,0 +1,207 @@ +--- +sidebar_position: 3 +title: "Database Performance Optimization and Scaling" +description: "Solutions for database performance issues, endpoint timeouts, and scaling strategies" +date: "2024-01-15" +category: "dependency" +tags: ["database", "performance", "scaling", "timeout", "optimization"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Performance Optimization and Scaling + +**Date:** January 15, 2024 +**Category:** Dependency +**Tags:** Database, Performance, Scaling, Timeout, Optimization + +## Problem Description + +**Context:** Users 
experience slow endpoint response times and timeout errors despite increasing application resources. The issue affects user experience and requires performance optimization strategies. + +**Observed Symptoms:** + +- Endpoint timeout errors +- Slow response times for API calls +- User reports of application slowness +- Performance issues persist after scaling application resources + +**Relevant Configuration:** + +- Database: AWS RDS instance +- Application resources: Already increased CPU and RAM +- Workload: Multiple pods running +- Platform: SleakOps on AWS + +**Error Conditions:** + +- Timeouts occur on specific endpoints +- Performance degradation affects multiple users +- Issues persist despite resource scaling +- Problem appears to be database-related + +## Detailed Solution + + + +To identify the root cause of performance issues, you need to implement proper monitoring and analysis: + +**Recommended APM Tools:** + +- **New Relic**: Comprehensive application performance monitoring +- **DataDog**: Full-stack monitoring with database insights +- **AWS X-Ray**: Distributed tracing for AWS applications + +**Key Metrics to Monitor:** + +- Database query execution times +- Connection pool utilization +- CPU and memory usage on database +- Network latency between application and database +- Slow query logs + + + + + +**Vertical Scaling (Scale Up):** + +You can scale your RDS instance directly from AWS Console: + +1. Go to **AWS RDS Console** +2. Select your database instance +3. Click **Modify** +4. Choose a larger instance class +5. Apply changes immediately or during maintenance window + +**Note:** Scaling will cause 20-30 minutes of potential slowness during the transition. 
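+The console steps above can also be scripted. The following is a minimal sketch, assuming a configured AWS CLI; `my-db-instance` and `db.t3.large` are placeholder values, and the command is only echoed so it can be reviewed before running:
+
+```shell
+#!/bin/sh
+# Scale up an RDS instance class from the CLI.
+# "my-db-instance" and "db.t3.large" are placeholder values.
+DB_INSTANCE_ID="my-db-instance"
+NEW_CLASS="db.t3.large"
+
+# Build the command first so it can be reviewed; remove the echo to execute.
+CMD="aws rds modify-db-instance \
+  --db-instance-identifier $DB_INSTANCE_ID \
+  --db-instance-class $NEW_CLASS \
+  --apply-immediately"
+echo "$CMD"
+```
+
+Dropping `--apply-immediately` defers the change to the next maintenance window, which avoids the transition slowness noted above.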
+ +**Horizontal Scaling Options:** + +- **Read Replicas**: For read-heavy workloads +- **Database Sharding**: For write-heavy applications +- **Connection Pooling**: Optimize connection management + +```yaml +# Example database configuration +database: + instance_class: "db.t3.large" # Scale up from db.t3.medium + allocated_storage: 100 + max_connections: 200 + connection_timeout: 30 +``` + + + + + +**Query Optimization:** + +- Identify and optimize slow queries +- Add proper database indexes +- Implement query caching strategies +- Use connection pooling + +**Code-Level Improvements:** + +- Implement asynchronous processing for heavy operations +- Add caching layers (Redis, Memcached) +- Optimize database connection handling +- Use batch operations where possible + +**SleakOps Workload Scaling:** + +```yaml +# Scale your workloads in SleakOps +workload: + replicas: 5 # Increase number of pods + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1000m" + memory: "2Gi" +``` + + + + + +**Database Monitoring:** + +1. Enable **AWS Performance Insights** for RDS +2. Set up **CloudWatch** custom metrics +3. Configure slow query logging +4. Monitor connection counts and wait events + +**Application Monitoring:** + +1. Integrate APM tool (New Relic, DataDog) +2. Add custom metrics for business logic +3. Implement distributed tracing +4. 
Set up alerting for performance thresholds + +**Key Performance Indicators:** + +- Average response time < 200ms +- Database query time < 100ms +- Error rate < 1% +- CPU utilization < 70% + + + + + +**Database-Related Issues:** + +- Inefficient queries without proper indexes +- Too many concurrent connections +- Lock contention and blocking queries +- Insufficient database resources + +**Application-Related Issues:** + +- Synchronous calls to external APIs +- Memory leaks causing garbage collection pressure +- Inefficient data serialization +- Lack of proper caching strategies + +**Infrastructure Issues:** + +- Network latency between services +- Insufficient application resources +- Poor load balancing configuration +- Inadequate connection pooling + + + + + +**Is it safe to scale RDS directly in AWS?** + +Yes, you can safely modify your RDS instance directly from the AWS Console even when using SleakOps: + +**Steps:** + +1. Go to **AWS RDS Console** +2. Select your database instance +3. Click **Modify** +4. Change instance class or storage +5. Choose **Apply Immediately** or schedule for maintenance window + +**Important Considerations:** + +- **Downtime**: 20-30 minutes of potential slowness +- **Connection Impact**: Existing connections may be dropped +- **SleakOps Compatibility**: Changes won't affect SleakOps functionality +- **Monitoring**: Watch performance during and after the change + +**Why this option isn't in SleakOps Console:** +SleakOps focuses on application deployment and management. Database infrastructure changes are typically handled directly in the cloud provider's console for more granular control. 
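+As a quick way to watch performance during and after such a change, CPU can also be checked from the CLI. A hedged sketch, assuming a configured AWS CLI; `my-db-instance` is a placeholder identifier, and the command is echoed for review rather than executed:
+
+```shell
+#!/bin/sh
+# Inspect RDS CPUUtilization for the last hour (e.g. around an instance-class change).
+# "my-db-instance" is a placeholder identifier.
+DB_INSTANCE_ID="my-db-instance"
+END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+# GNU date first, BSD/macOS date as fallback
+START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
+
+# Remove the echo to execute.
+echo aws cloudwatch get-metric-statistics \
+  --namespace AWS/RDS \
+  --metric-name CPUUtilization \
+  --dimensions Name=DBInstanceIdentifier,Value="$DB_INSTANCE_ID" \
+  --start-time "$START" --end-time "$END" \
+  --period 300 --statistics Average
+```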
+ + + +--- + +_This FAQ section was automatically generated on January 15, 2024, based on a real user inquiry._ diff --git a/docs/troubleshooting/database-performance-sql-query-optimization.mdx b/docs/troubleshooting/database-performance-sql-query-optimization.mdx new file mode 100644 index 000000000..0bd30195f --- /dev/null +++ b/docs/troubleshooting/database-performance-sql-query-optimization.mdx @@ -0,0 +1,208 @@ +--- +sidebar_position: 15 +title: "Database Performance Issues and SQL Query Optimization" +description: "Troubleshooting slow database queries causing application timeouts" +date: "2024-12-19" +category: "dependency" +tags: ["database", "performance", "sql", "timeout", "optimization"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Performance Issues and SQL Query Optimization + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Database, Performance, SQL, Timeout, Optimization + +## Problem Description + +**Context:** Production websites experiencing severe performance degradation with widespread timeout errors, traced back to problematic SQL queries causing high database load. 
**Observed Symptoms:**
+
+- Production sites running extremely slowly
+- Timeout errors occurring across multiple endpoints
+- High CPU consumption on database instances
+- Application response times significantly degraded
+- Users unable to complete normal operations
+
+**Relevant Configuration:**
+
+- Environment: Production
+- Database type: SQL-based (PostgreSQL/MySQL)
+- Application tier: Multiple web services
+- Monitoring: CPU usage alerts triggered
+
+**Error Conditions:**
+
+- Timeouts occur during peak usage periods
+- Database CPU utilization spikes correlate with slow queries
+- Problem affects multiple application components simultaneously
+- Performance degradation impacts user experience significantly
+
+## Detailed Solution
+
+
+
+When experiencing widespread application timeouts, start with database performance analysis:
+
+1. **Check database metrics** in your monitoring dashboard
+2. **Identify slow queries** using database logs or monitoring tools
+3. **Monitor CPU and memory usage** on database instances
+4. **Review recent deployments** that might have introduced problematic queries
+
+```sql
+-- Check PostgreSQL slow queries (requires the pg_stat_statements extension;
+-- on PostgreSQL 13+ the columns are mean_exec_time / total_exec_time)
+SELECT query, mean_time, calls, total_time
+FROM pg_stat_statements
+ORDER BY mean_time DESC
+LIMIT 10;
+
+-- Check MySQL slow queries
+SHOW PROCESSLIST;
+SELECT * FROM information_schema.processlist
+WHERE command != 'Sleep' ORDER BY time DESC;
+```
+
+
+
+
+
+Once problematic queries are identified, apply these optimization strategies:
+
+**1. Index Analysis:**
+
+```sql
+-- Check missing indexes
+EXPLAIN ANALYZE SELECT * FROM your_table WHERE problematic_column = 'value';
+
+-- Create appropriate indexes
+CREATE INDEX idx_column_name ON table_name(column_name);
+```
+
+**2. Query Rewriting:**
+
+- Avoid SELECT \* statements
+- Use LIMIT clauses for large result sets
+- Optimize JOIN operations
+- Consider query caching
+
+**3. 
Database Configuration:** + +```yaml +# Example PostgreSQL optimization +shared_buffers: 256MB +effective_cache_size: 1GB +work_mem: 4MB +maintenance_work_mem: 64MB +``` + + + + + +Implement comprehensive monitoring to prevent future issues: + +**1. Query Performance Monitoring:** + +- Enable slow query logging +- Set up alerts for long-running queries +- Monitor query execution plans + +**2. Resource Monitoring:** + +```yaml +# Prometheus metrics for database monitoring +postgresql_exporter: + enabled: true + metrics: + - pg_stat_database + - pg_stat_statements + - pg_stat_activity +``` + +**3. Application-Level Monitoring:** + +- Database connection pool metrics +- Query response time distribution +- Error rate tracking + + + + + +Implement these practices to prevent future database performance issues: + +**1. Code Review Process:** + +- Review all database queries before deployment +- Use query analysis tools in development +- Test with production-like data volumes + +**2. Database Maintenance:** + +```bash +# Regular maintenance tasks +# PostgreSQL +VACUUM ANALYZE; +REINDEX DATABASE your_database; + +# MySQL +OPTIMIZE TABLE your_table; +ANALYZE TABLE your_table; +``` + +**3. Performance Testing:** + +- Load test database queries before production +- Monitor query performance in staging environments +- Set up automated performance regression tests + +**4. Connection Management:** + +```yaml +# Database connection pool configuration +database: + pool_size: 20 + max_connections: 100 + connection_timeout: 30s + idle_timeout: 300s +``` + + + + + +When facing critical database performance issues: + +**1. Immediate Actions:** + +- Identify and kill long-running queries if necessary +- Scale database resources temporarily +- Enable query caching if available +- Redirect traffic to backup systems if possible + +**2. 
**Communication:**
+
+- Notify stakeholders about the issue
+- Provide regular updates on resolution progress
+- Document the incident for post-mortem analysis
+
+**3. Recovery Steps:**
+
+```sql
+-- Kill problematic queries (PostgreSQL); exclude this session's own backend
+SELECT pg_terminate_backend(pid)
+FROM pg_stat_activity
+WHERE state = 'active'
+  AND pid <> pg_backend_pid()
+  AND query_start < NOW() - INTERVAL '5 minutes';
+
+-- Reload configuration without restarting the server
+SELECT pg_reload_conf();
+```
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/database-postgresql-restore-errors.mdx b/docs/troubleshooting/database-postgresql-restore-errors.mdx
new file mode 100644
index 000000000..cb8aac367
--- /dev/null
+++ b/docs/troubleshooting/database-postgresql-restore-errors.mdx
@@ -0,0 +1,161 @@
+---
+sidebar_position: 3
+title: "PostgreSQL Database Restore Errors"
+description: "Solutions for common PostgreSQL database restore errors including constraint and index conflicts"
+date: "2024-01-15"
+category: "dependency"
+tags: ["postgresql", "database", "restore", "constraints", "indexes"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# PostgreSQL Database Restore Errors
+
+**Date:** January 15, 2024
+**Category:** Dependency
+**Tags:** PostgreSQL, Database, Restore, Constraints, Indexes
+
+## Problem Description
+
+**Context:** When attempting to restore a PostgreSQL database dump, users encounter errors related to constraint dependencies and index conflicts during the restoration process. 
+ +**Observed Symptoms:** + +- Error: "cannot drop index active_storage_blobs_pkey because constraint active_storage_blobs_pkey on table active_storage_blobs requires it" +- Error: "cannot drop constraint users_pkey on table public.users because other objects depend on it" +- Error: "column 'id' does not exist" when recreating indexes +- Restoration process fails during index and constraint manipulation + +**Relevant Configuration:** + +- Database: PostgreSQL (RDS instance) +- Environment variables: `SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS`, `SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME`, `SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME` +- Restoration tool: `pg_restore` +- Database contains Rails ActiveStorage tables and complex foreign key relationships + +**Error Conditions:** + +- Occurs during database restoration from dump files +- Happens when trying to drop primary key constraints that have dependent foreign keys +- Appears when restoration script tries to drop and recreate indexes in incorrect order +- Error persists across multiple restoration attempts + +## Detailed Solution + + + +When encountering constraint and index errors, the most reliable solution is to start with a completely clean database: + +```bash +# Connect to RDS instance and drop/recreate the database +psql -h "${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS}" \ +-U "${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME}" \ +-d postgres \ +-c "DROP DATABASE IF EXISTS ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME};" \ +-c "CREATE DATABASE ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME};" +``` + +This approach eliminates all existing constraints and indexes that might conflict with the restoration process. + + + + + +Some applications require specific schemas to exist before restoration. 
Create necessary schemas: + +```bash +# Connect to your RDS and create required schemas +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "CREATE SCHEMA IF NOT EXISTS heroku_ext;" +``` + +This is particularly important for applications migrated from Heroku or other platforms that use custom schemas. + + + + + +If you need to modify the restoration script, ensure constraints are dropped in the correct order: + +1. **Drop foreign key constraints first** +2. **Then drop primary key constraints** +3. **Finally drop indexes** + +The error occurs because the script tries to drop primary keys before dropping the foreign keys that depend on them. + +```sql +-- Correct order example: +-- 1. Drop foreign key constraints +ALTER TABLE campaign_agency_users DROP CONSTRAINT IF EXISTS fk_rails_80e17a26a2; + +-- 2. Drop primary key constraints +ALTER TABLE users DROP CONSTRAINT IF EXISTS users_pkey; + +-- 3. Drop indexes +DROP INDEX IF EXISTS active_storage_blobs_pkey CASCADE; +``` + + + + + +If the standard restoration continues to fail, try these alternatives: + +**Method 1: Use pg_restore with specific options** + +```bash +pg_restore --verbose --clean --no-acl --no-owner \ +-h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +your_dump_file.dump +``` + +**Method 2: Restore data only (skip schema)** + +```bash +pg_restore --data-only --verbose \ +-h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +your_dump_file.dump +``` + +**Method 3: Manual schema creation** + +1. Create the database schema manually using Rails migrations or SQL scripts +2. 
Then restore only the data using `--data-only` option + + + + + +After restoration, verify the database integrity: + +```bash +# Check if all tables exist +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "\dt" + +# Check constraints +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "SELECT conname FROM pg_constraint WHERE contype = 'f';" + +# Verify data integrity +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "SELECT COUNT(*) FROM users;" +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/database-restore-pod-management.mdx b/docs/troubleshooting/database-restore-pod-management.mdx new file mode 100644 index 000000000..8aa4cade2 --- /dev/null +++ b/docs/troubleshooting/database-restore-pod-management.mdx @@ -0,0 +1,232 @@ +--- +sidebar_position: 3 +title: "Database Restore Pod Management" +description: "Managing long-running database restore pods and their impact on deployments" +date: "2025-03-26" +category: "dependency" +tags: ["database", "restore", "pod", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Restore Pod Management + +**Date:** March 26, 2025 +**Category:** Dependency +**Tags:** Database, Restore, Pod, Deployment, Troubleshooting + +## Problem Description + +**Context:** Users experience deployment issues when database restore pods are running for extended periods, causing conflicts with build and deployment processes in SleakOps. 
+ +**Observed Symptoms:** + +- Build timeouts occurring while `restoredb` pod is running +- Deployment failures due to conflicting database operations +- Long-running restore pods (running for days) +- Build processes failing to complete successfully +- Multiple replica sets appearing for web services + +**Relevant Configuration:** + +- Pod type: `restoredb` (database restore operation) +- Duration: Running for multiple days (3-4 days) +- Impact: Affects build and deployment pipeline +- Platform: SleakOps Kubernetes environment + +**Error Conditions:** + +- Build timeouts when restore pod is active +- Deployment pipeline blocked by ongoing restore operations +- Resource conflicts between restore and application pods +- Inability to scale down restore pods when not needed + +## Detailed Solution + + + +Database restore pods can interfere with normal deployment operations because: + +1. **Resource Lock**: The restore process may lock database resources needed by the application +2. **Network Conflicts**: Database connections may be monopolized by the restore operation +3. **Memory/CPU Usage**: Long-running restore operations consume cluster resources +4. **State Conflicts**: Application pods may fail health checks while database is being restored + +This is why builds and deployments often fail while a restore pod is running. + + + + + +To monitor your restore pod status: + +**Using SleakOps Dashboard:** + +1. Go to **Workloads** → **Jobs** +2. Look for `restoredb` or similar restore jobs +3. 
Check the **Status** and **Duration** columns
+
+**Using Lens or kubectl:**
+
+```bash
+# List all pods with restore in the name
+kubectl get pods | grep restore
+
+# Check specific restore pod logs
+kubectl logs -f <restore-pod-name>
+
+# Check pod details and status
+kubectl describe pod <restore-pod-name>
+```
+
+
+
+
+
+**When you need the restore pod later:**
+
+If you plan to use the restore functionality soon (like tomorrow morning), you can temporarily scale down:
+
+```bash
+# Scale down the restore job (if it's a deployment)
+kubectl scale deployment restoredb --replicas=0
+
+# Or delete the specific pod (if it's a standalone pod)
+kubectl delete pod <restore-pod-name>
+```
+
+**When you want to stop it completely:**
+
+```bash
+# Delete the job entirely
+kubectl delete job <restore-job-name>
+
+# Or through SleakOps dashboard
+# Go to Workloads → Jobs → Delete the restore job
+```
+
+**Important:** Always ensure the restore operation is complete or can be safely interrupted before stopping it.
+
+
+
+
+
+To avoid conflicts with regular operations:
+
+**1. Schedule during maintenance windows:**
+
+- Plan restores during low-traffic periods
+- Coordinate with deployment schedules
+- Notify team members about planned restore operations
+
+**2. Use separate environments:**
+
+- Perform restores in staging first
+- Test the restore process before production
+- Keep production deployments separate from restore operations
+
+**3. Monitor resource usage:**
+
+```yaml
+# Example resource limits for restore jobs
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: database-restore
+spec:
+  template:
+    spec:
+      containers:
+        - name: restore
+          resources:
+            limits:
+              memory: "2Gi"
+              cpu: "1000m"
+            requests:
+              memory: "1Gi"
+              cpu: "500m"
+```
+
+
+
+
+
+**If builds are timing out:**
+
+1. **Check if restore is still needed:**
+
+   - Verify the restore operation status
+   - Determine if it can be safely stopped
+
+2. 
**Temporary solution:**
+
+   ```bash
+   # Stop the restore pod temporarily
+   kubectl delete pod <restore-pod-name>
+
+   # Retry your build
+   # The restore can be restarted later if needed
+   ```
+
+3. **Long-term solution:**
+   - Schedule restores during maintenance windows
+   - Use separate database instances for restore testing
+   - Implement restore job timeouts
+
+**If you see multiple replica sets:**
+
+This is normal during deployments but can indicate issues:
+
+```bash
+# Check replica set status
+kubectl get rs
+
+# Clean up old replica sets if needed
+kubectl delete rs <old-replica-set-name>
+```
+
+
+
+
+
+**1. Implement proper job management:**
+
+```yaml
+# Add activeDeadlineSeconds to prevent infinite running
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: database-restore
+spec:
+  activeDeadlineSeconds: 3600 # 1 hour timeout
+  backoffLimit: 3
+  template:
+    # ... rest of job spec
+```
+
+**2. Use resource quotas:**
+
+```yaml
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+  name: restore-quota
+spec:
+  hard:
+    requests.cpu: "2"
+    requests.memory: 4Gi
+    limits.cpu: "4"
+    limits.memory: 8Gi
+```
+
+**3. 
Set up monitoring alerts:** + +- Alert when restore jobs run longer than expected +- Monitor resource usage during restore operations +- Set up notifications for failed deployments + + + +--- + +_This FAQ was automatically generated on March 26, 2025 based on a real user query._ diff --git a/docs/troubleshooting/database-restore-pod-procedures.mdx b/docs/troubleshooting/database-restore-pod-procedures.mdx new file mode 100644 index 000000000..eb0d0d907 --- /dev/null +++ b/docs/troubleshooting/database-restore-pod-procedures.mdx @@ -0,0 +1,685 @@ +--- +sidebar_position: 15 +title: "Database Restore in Pod Environment" +description: "Procedures for restoring database dumps in Kubernetes pods with connection resilience" +date: "2024-03-21" +category: "dependency" +tags: ["database", "restore", "dump", "pod", "tmux", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Database Restore in Pod Environment + +**Date:** March 21, 2024 +**Category:** Dependency +**Tags:** Database, Restore, Dump, Pod, Tmux, Kubernetes + +## Problem Description + +**Context:** When performing database restoration operations in Kubernetes pods, users need reliable procedures to handle large dump files while maintaining connection stability during long-running restore processes. 
+ +**Observed Symptoms:** + +- Connection drops during long database restore operations +- Volume space issues due to accumulated old dump files +- Need for session persistence during restore processes +- Requirement for monitoring restore progress + +**Relevant Configuration:** + +- Environment: Production database restoration +- Platform: Kubernetes pods +- Tools: Database dump files, tmux for session management +- Storage: Pod volumes with limited space + +**Error Conditions:** + +- Connection timeouts during restore operations +- Insufficient disk space for dump files +- Process interruption due to network issues +- Loss of restore progress when connection drops + +## Detailed Solution + + + +The improved restore script includes several optimizations: + +```bash +#!/bin/bash +# Enhanced database restore script + +set -e + +# Configuration +DUMP_DIR="/data/dumps" +LOG_FILE="/data/logs/restore_$(date +%Y%m%d_%H%M%S).log" +MAX_DUMP_AGE_DAYS=7 + +# Function to clean old dumps +clean_old_dumps() { + echo "Cleaning dumps older than ${MAX_DUMP_AGE_DAYS} days..." | tee -a $LOG_FILE + find $DUMP_DIR -name "*.sql" -type f -mtime +$MAX_DUMP_AGE_DAYS -delete + find $DUMP_DIR -name "*.dump" -type f -mtime +$MAX_DUMP_AGE_DAYS -delete + echo "Old dumps cleaned successfully" | tee -a $LOG_FILE +} + +# Function to check available space +check_disk_space() { + AVAILABLE_SPACE=$(df $DUMP_DIR | awk 'NR==2 {print $4}') + REQUIRED_SPACE=1048576 # 1GB in KB + + if [ $AVAILABLE_SPACE -lt $REQUIRED_SPACE ]; then + echo "Warning: Low disk space. 
Available: ${AVAILABLE_SPACE}KB" | tee -a $LOG_FILE
+        clean_old_dumps
+    fi
+}
+
+# Main restore function
+restore_database() {
+    local dump_file=$1
+    local database_name=$2
+
+    echo "Starting database restore: $dump_file -> $database_name" | tee -a $LOG_FILE
+    echo "Start time: $(date)" | tee -a $LOG_FILE
+
+    # Restore with progress monitoring (requires the pv tool;
+    # DB_HOST and DB_USER are expected in the environment)
+    pv $dump_file | psql -h $DB_HOST -U $DB_USER -d $database_name 2>&1 | tee -a $LOG_FILE
+
+    echo "Restore completed at: $(date)" | tee -a $LOG_FILE
+}
+
+# Pre-restore checks
+check_disk_space
+clean_old_dumps
+
+# Execute restore
+restore_database "$1" "$2"
+```
+
+
+
+
+
+To handle connection drops during long restore operations, use tmux:
+
+```bash
+# Start a new tmux session for the restore
+kubectl exec -it <pod-name> -- tmux new-session -d -s restore
+
+# Attach to the session
+kubectl exec -it <pod-name> -- tmux attach-session -t restore
+
+# Inside the tmux session, run the restore
+./restore_script.sh /data/dumps/production_dump.sql production_db
+
+# Detach from session (Ctrl+b, then d)
+# Session continues running even if connection drops
+
+# Reattach later to check progress
+kubectl exec -it <pod-name> -- tmux attach-session -t restore
+
+# List all sessions
+kubectl exec -it <pod-name> -- tmux list-sessions
+```
+
+**Benefits of using tmux:**
+
+- Session persistence across connection drops
+- Ability to monitor progress remotely
+- Multiple windows for parallel operations
+- Session sharing between team members
+
+
+
+
+
+To prevent volume space issues during restore operations:
+
+```yaml
+# Pod configuration with adequate storage
+apiVersion: v1
+kind: Pod
+metadata:
+  name: db-restore-pod
+spec:
+  containers:
+    - name: restore-container
+      image: postgres:14
+      volumeMounts:
+        - name: dump-storage
+          mountPath: /data/dumps
+        - name: logs-storage
+          mountPath: /data/logs
+      resources:
+        requests:
+          # Dump capacity is sized on the dump-pvc claim; this request
+          # covers the emptyDir logs volume
+          ephemeral-storage: "2Gi"
+  volumes:
+    - name: dump-storage
+      persistentVolumeClaim:
+        claimName: dump-pvc
+    - name: logs-storage
+      emptyDir: 
{} +``` + +**Space management commands:** + +```bash +# Check current usage +kubectl exec -it -- df -h /data/dumps + +# Clean old dumps manually +kubectl exec -it -- find /data/dumps -name "*.sql" -mtime +7 -delete + +# Monitor space during restore +kubectl exec -it -- watch "df -h /data/dumps" +``` + + + + + +To monitor the restore process effectively: + +```bash +# Using pv (pipe viewer) for progress monitoring +kubectl exec -it -- pv /data/dumps/large_dump.sql | psql -h localhost -U postgres -d target_db + +# Monitor logs in real-time +kubectl exec -it -- tail -f /data/logs/restore_*.log + +# Check database size growth +kubectl exec -it -- psql -h localhost -U postgres -c "SELECT pg_size_pretty(pg_database_size('target_db'));" + +# Monitor active connections +kubectl exec -it -- psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE datname='target_db';" +``` + +**Progress monitoring script:** + +```bash +#!/bin/bash +# progress_monitor.sh + +DB_NAME=$1 +while true; do + SIZE=$(psql -h localhost -U postgres -t -c "SELECT pg_size_pretty(pg_database_size('$DB_NAME'));") + echo "$(date): Database size: $SIZE" + sleep 30 +done +``` + + + + + +For complex restoration scenarios, use this enhanced procedure: + +**1. Pre-restore validation:** + +```bash +#!/bin/bash +# validate_restore.sh + +DB_HOST=${DB_HOST:-localhost} +DB_PORT=${DB_PORT:-5432} +DB_USER=${DB_USER:-postgres} +DUMP_FILE=$1 + +# Validate dump file +if [ ! -f "$DUMP_FILE" ]; then + echo "ERROR: Dump file $DUMP_FILE not found" + exit 1 +fi + +# Check file size and format +file_size=$(du -h "$DUMP_FILE" | cut -f1) +echo "Dump file size: $file_size" + +# Verify dump file integrity +pg_restore --list "$DUMP_FILE" > /dev/null 2>&1 +if [ $? -eq 0 ]; then + echo "Dump file is valid (custom format)" + RESTORE_CMD="pg_restore" +else + # Check if it's a plain SQL dump + head -20 "$DUMP_FILE" | grep -q "PostgreSQL database dump" + if [ $? 
-eq 0 ]; then + echo "Dump file is valid (SQL format)" + RESTORE_CMD="psql" + else + echo "ERROR: Invalid dump file format" + exit 1 + fi +fi + +echo "Pre-restore validation completed successfully" +``` + +**2. Database preparation:** + +```bash +#!/bin/bash +# prepare_database.sh + +DB_NAME=$1 +DB_HOST=${DB_HOST:-localhost} +DB_USER=${DB_USER:-postgres} + +# Terminate existing connections +psql -h $DB_HOST -U $DB_USER -d postgres -c " +SELECT pg_terminate_backend(pid) +FROM pg_stat_activity +WHERE datname = '$DB_NAME' AND pid <> pg_backend_pid();" + +# Drop and recreate database +psql -h $DB_HOST -U $DB_USER -d postgres -c "DROP DATABASE IF EXISTS $DB_NAME;" +psql -h $DB_HOST -U $DB_USER -d postgres -c "CREATE DATABASE $DB_NAME;" + +echo "Database $DB_NAME prepared for restore" +``` + +**3. Comprehensive restore script:** + +```bash +#!/bin/bash +# comprehensive_restore.sh + +set -e # Exit on any error + +DB_NAME=$1 +DUMP_FILE=$2 +DB_HOST=${DB_HOST:-localhost} +DB_USER=${DB_USER:-postgres} +TMUX_SESSION="db_restore_$(date +%s)" + +# Logging setup +LOG_FILE="/tmp/restore_${DB_NAME}_$(date +%Y%m%d_%H%M%S).log" +exec > >(tee -a "$LOG_FILE") 2>&1 + +echo "Starting database restore at $(date)" +echo "Database: $DB_NAME" +echo "Dump file: $DUMP_FILE" +echo "Log file: $LOG_FILE" + +# Validate inputs +if [ -z "$DB_NAME" ] || [ -z "$DUMP_FILE" ]; then + echo "Usage: $0 " + exit 1 +fi + +# Check if tmux session exists +if tmux has-session -t "$TMUX_SESSION" 2>/dev/null; then + echo "ERROR: Tmux session $TMUX_SESSION already exists" + exit 1 +fi + +# Start tmux session and run restore +tmux new-session -d -s "$TMUX_SESSION" bash -c " + set -e + echo 'Starting restore in tmux session: $TMUX_SESSION' + + # Set up environment + export PGPASSWORD=\$POSTGRES_PASSWORD + + # Run pre-restore validation + ./validate_restore.sh '$DUMP_FILE' + + # Prepare database + ./prepare_database.sh '$DB_NAME' + + # Start progress monitoring in background + ./progress_monitor.sh '$DB_NAME' & + 
MONITOR_PID=\$! + + # Perform restore based on file type + if pg_restore --list '$DUMP_FILE' > /dev/null 2>&1; then + echo 'Restoring from custom format dump...' + pg_restore -h $DB_HOST -U $DB_USER -d '$DB_NAME' -v --no-owner --no-acl '$DUMP_FILE' + else + echo 'Restoring from SQL dump...' + psql -h $DB_HOST -U $DB_USER -d '$DB_NAME' -f '$DUMP_FILE' + fi + + # Stop monitoring + kill \$MONITOR_PID 2>/dev/null || true + + echo 'Restore completed successfully at \$(date)' + + # Post-restore validation + ./post_restore_validation.sh '$DB_NAME' + + echo 'Press any key to exit tmux session...' + read +" + +echo "Restore started in tmux session: $TMUX_SESSION" +echo "To attach: tmux attach-session -t $TMUX_SESSION" +echo "To check progress: tmux capture-pane -t $TMUX_SESSION -p" +``` + + + + + +After restore completion, validate the restored database: + +**1. Data integrity validation:** + +```bash +#!/bin/bash +# post_restore_validation.sh + +DB_NAME=$1 +DB_HOST=${DB_HOST:-localhost} +DB_USER=${DB_USER:-postgres} + +echo "Starting post-restore validation for $DB_NAME" + +# Check database size +DB_SIZE=$(psql -h $DB_HOST -U $DB_USER -d $DB_NAME -t -c "SELECT pg_size_pretty(pg_database_size('$DB_NAME'));") +echo "Final database size: $DB_SIZE" + +# Count tables and records +TABLES=$(psql -h $DB_HOST -U $DB_USER -d $DB_NAME -t -c " +SELECT count(*) FROM information_schema.tables +WHERE table_schema = 'public' AND table_type = 'BASE TABLE';") +echo "Number of tables: $TABLES" + +# Check for any errors in postgres log +echo "Recent PostgreSQL errors (if any):" +psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c " +SELECT message, detail, hint +FROM pg_stat_database_conflicts +WHERE datname = '$DB_NAME';" 2>/dev/null || echo "No conflicts found" + +# Validate critical tables (customize for your schema) +echo "Validating critical tables..." 
+psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
+SELECT
+    schemaname,
+    tablename,
+    n_tup_ins as inserts,
+    n_tup_upd as updates,
+    n_tup_del as deletes
+FROM pg_stat_user_tables
+ORDER BY n_tup_ins DESC
+LIMIT 10;"
+
+# Check for missing indexes
+echo "Checking for missing indexes..."
+psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
+SELECT
+    schemaname,
+    tablename,
+    indexname,
+    indexdef
+FROM pg_indexes
+WHERE schemaname = 'public'
+ORDER BY tablename, indexname;"
+
+echo "Post-restore validation completed"
+```
+
+**2. Performance optimization:**
+
+```sql
+-- Run after large restore operations
+-- Update table statistics
+ANALYZE;
+
+-- Rebuild indexes if needed
+REINDEX DATABASE your_database_name;
+
+-- Update vacuum statistics
+VACUUM ANALYZE;
+
+-- Check for bloated tables
+SELECT
+    schemaname,
+    tablename,
+    n_dead_tup,
+    n_live_tup,
+    round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) as dead_percentage
+FROM pg_stat_user_tables
+WHERE n_dead_tup > 0
+ORDER BY dead_percentage DESC;
+```
+
+
+
+
+
+**Common restore problems and solutions:**
+
+**1. Out of disk space:**
+
+```bash
+# Check available space before restore
+df -h /var/lib/postgresql/data
+
+# Clean up old dumps
+find /tmp -name "*.sql" -mtime +7 -delete
+find /tmp -name "*.dump" -mtime +7 -delete
+
+# Monitor space during restore
+watch -n 30 'df -h | grep -E "(Use%|/var/lib/postgresql)"'
+```
+
+**2. Permission errors:**
+
+```bash
+# Fix ownership issues
+chown -R postgres:postgres /var/lib/postgresql/data
+
+# Check PostgreSQL user permissions
+psql -c "SELECT rolname, rolsuper, rolcreaterole, rolcreatedb FROM pg_roles WHERE rolname = 'postgres';"
+
+# Grant necessary permissions (these are SQL statements, so run them through psql)
+psql -d your_db -c "GRANT CREATE ON DATABASE your_db TO your_user;"
+psql -c "ALTER USER your_user CREATEDB;"
+```
+
+**3. Connection timeouts:**
+
+```bash
+# Increase connection timeout
+export PGCONNECT_TIMEOUT=300
+
+# Set statement timeout for long operations (applies to the current session only)
+psql -c "SET statement_timeout = '1h';"
+
+# max_connections cannot be changed with SET; check the current value, then
+# raise it in postgresql.conf (or the RDS parameter group) and restart
+psql -c "SHOW max_connections;"
+```
+
+**4. Memory issues with large dumps:**
+
+```bash
+# Use streaming restore for large files
+pg_restore --clean --no-acl --no-owner --verbose \
+    --jobs=4 \
+    --dbname=postgresql://user:pass@host:port/dbname \
+    dump_file.dump
+
+# Monitor memory usage
+while true; do
+    echo "$(date): Memory usage:"
+    free -h
+    echo "PostgreSQL processes:"
+    ps aux | grep postgres | grep -v grep
+    sleep 60
+done
+```
+
+
+
+
+
+**Real-time restore monitoring:**
+
+```bash
+#!/bin/bash
+# advanced_monitor.sh
+
+DB_NAME=$1
+SESSION="restore_monitor"
+
+# Write the metrics query to a file so it survives the layers of quoting
+QUERY_FILE=$(mktemp)
+cat > "$QUERY_FILE" <<SQL
+SELECT
+    pg_size_pretty(pg_database_size('$DB_NAME')) AS db_size,
+    (SELECT count(*) FROM pg_stat_activity WHERE datname = '$DB_NAME') AS connections,
+    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active') AS active_queries;
+SQL
+
+# Build the dashboard from outside the session to avoid nested escaping
+tmux new-session -d -s "$SESSION" -n Progress
+
+tmux new-window -t "$SESSION" -n DB_Metrics
+tmux send-keys -t "$SESSION":DB_Metrics "watch -n 5 \"psql -d $DB_NAME -f $QUERY_FILE\"" Enter
+
+tmux new-window -t "$SESSION" -n Resources
+tmux send-keys -t "$SESSION":Resources "htop" Enter
+
+tmux new-window -t "$SESSION" -n Disk
+tmux send-keys -t "$SESSION":Disk "watch -n 10 'df -h /var/lib/postgresql /tmp'" Enter
+
+# Return to first window
+tmux select-window -t "$SESSION":Progress
+
+echo "Monitoring dashboard started. Attach with: tmux attach-session -t $SESSION"
+```
+
+**Recovery procedures for failed restores:**
+
+```bash
+#!/bin/bash
+# restore_recovery.sh
+
+DB_NAME=$1
+BACKUP_DB="${DB_NAME}_backup_$(date +%Y%m%d_%H%M%S)"
+
+echo "Starting restore recovery procedure..."
+
+# 1. Create backup of current state
+echo "Creating backup of current database state..."
+pg_dump -h localhost -U postgres $DB_NAME > "${BACKUP_DB}.sql"
+
+# 2. Analyze what went wrong
+echo "Analyzing restore failure..."
+tail -100 /var/log/postgresql/postgresql.log | grep ERROR
+
+# 3. Check for partial data
+echo "Checking for partial restore data..."
+psql -d $DB_NAME -c "
+SELECT
+    schemaname,
+    tablename,
+    n_tup_ins as row_count
+FROM pg_stat_user_tables
+WHERE n_tup_ins > 0
+ORDER BY n_tup_ins DESC;"
+
+# 4. Clean up corrupted state
+echo "Cleaning up corrupted state..."
+psql -d postgres -c "
+SELECT pg_terminate_backend(pid)
+FROM pg_stat_activity
+WHERE datname = '$DB_NAME' AND pid <> pg_backend_pid();"
+
+# 5. Reset database
+echo "Resetting database for retry..."
+dropdb $DB_NAME
+createdb $DB_NAME
+
+echo "Recovery procedure completed. Database ready for restore retry."
+```
+
+
+
+
+
+**For production restore operations:**
+
+**1. Maintenance window procedures:**
+
+```bash
+#!/bin/bash
+# production_restore.sh
+
+DB_NAME=$1
+
+# Pre-maintenance checks
+echo "=== PRE-MAINTENANCE CHECKS ==="
+echo "Current time: $(date)"
+echo "Database size: $(psql -t -c "SELECT pg_size_pretty(pg_database_size('$DB_NAME'));")"
+echo "Active connections: $(psql -t -c "SELECT count(*) FROM pg_stat_activity WHERE datname = '$DB_NAME';")"
+
+# Set maintenance mode
+echo "=== SETTING MAINTENANCE MODE ==="
+# Update application config or load balancer
+
+# Backup current state
+echo "=== CREATING SAFETY BACKUP ==="
+pg_dump $DB_NAME > "safety_backup_$(date +%Y%m%d_%H%M%S).sql"
+
+# Perform restore
+echo "=== STARTING RESTORE ==="
+# Run restore procedure here
+
+# Validate restore
+echo "=== POST-RESTORE VALIDATION ==="
+# Run validation scripts
+
+# Remove maintenance mode
+echo "=== REMOVING MAINTENANCE MODE ==="
+# Update application config
+
+echo "=== MAINTENANCE COMPLETED ==="
+```
+
+**2. Rollback procedures:**
+
+```bash
+#!/bin/bash
+# rollback_restore.sh
+
+DB_NAME=$1
+ORIGINAL_BACKUP=$2
+
+if [ -z "$DB_NAME" ] || [ -z "$ORIGINAL_BACKUP" ]; then
+    echo "ERROR: Usage: $0 <database_name> <original_backup_file>"
+    exit 1
+fi
+
+echo "EMERGENCY ROLLBACK INITIATED"
+echo "Rolling back to: $ORIGINAL_BACKUP"
+
+# Quick rollback
+dropdb $DB_NAME
+createdb $DB_NAME
+psql $DB_NAME < "$ORIGINAL_BACKUP"
+
+echo "Rollback completed. Service should be restored."
+```
+
+**3. Documentation and audit trail:**
+
+```bash
+# Maintain restore log
+cat >> restore_audit.log << EOF
+Date: $(date)
+Database: $DB_NAME
+Dump File: $DUMP_FILE
+Operator: $(whoami)
+Result: SUCCESS/FAILURE
+Duration: $DURATION
+Notes: $NOTES
+EOF
+```
+
+
+
+---
+
+_This FAQ was automatically generated on March 21, 2024 based on a real user query._
diff --git a/docs/troubleshooting/database-typeorm-ssl-connection-error.mdx b/docs/troubleshooting/database-typeorm-ssl-connection-error.mdx
new file mode 100644
index 000000000..ff9898069
--- /dev/null
+++ b/docs/troubleshooting/database-typeorm-ssl-connection-error.mdx
@@ -0,0 +1,234 @@
+---
+sidebar_position: 3
+title: "TypeORM SSL Connection Error with RDS"
+description: "Solution for pg_hba.conf SSL connection errors when connecting TypeORM to RDS databases"
+date: "2024-01-15"
+category: "dependency"
+tags: ["typeorm", "rds", "ssl", "database", "postgresql"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# TypeORM SSL Connection Error with RDS
+
+**Date:** January 15, 2024
+**Category:** Dependency
+**Tags:** TypeORM, RDS, SSL, Database, PostgreSQL
+
+## Problem Description
+
+**Context:** User experiences SSL connection issues when trying to connect TypeORM to an RDS PostgreSQL database in the production environment, while the same configuration works in development.
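The failing handshake surfaces as a `pg_hba.conf` error rather than an explicit TLS error, which makes it easy to misread as a host allow-list problem. A small guard in migration or startup tooling can flag the SSL cause directly — a sketch, where the helper name is illustrative and not part of TypeORM:

```javascript
// Classify the error described above: RDS with rds.force_ssl enabled rejects
// plaintext connections with "no pg_hba.conf entry ... no encryption".
function isSslEnforcementError(err) {
  const msg = String((err && err.message) || "");
  return msg.includes("no pg_hba.conf entry") && msg.includes("no encryption");
}

// Example: wrap a hypothetical migration run and print an actionable hint.
// runMigrations().catch((err) => {
//   if (isSslEnforcementError(err)) {
//     console.error("RDS requires SSL: add an `ssl` option to the TypeORM config.");
//   }
//   throw err;
// });
```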
+
+**Observed Symptoms:**
+
+- Error during migration run: `no pg_hba.conf entry for host "10.130.96.232", user "postgres", database "rattlesnake", no encryption`
+- Migrations fail to execute in production environment
+- Same configuration works in development/staging environments
+- Application cannot connect to the database
+
+**Relevant Configuration:**
+
+- Database: PostgreSQL on AWS RDS
+- ORM: TypeORM
+- Environment: Production vs Development difference
+- SSL enforcement: RDS requires SSL connections
+
+**Error Conditions:**
+
+- Error occurs during TypeORM migration execution
+- Happens when application attempts to connect to database
+- Specific to production environment
+- Related to SSL/encryption requirements
+
+## Detailed Solution
+
+
+
+The error `no pg_hba.conf entry for host... no encryption` indicates that:
+
+1. **RDS requires SSL connections** in production
+2. **TypeORM is not configured** to use SSL
+3. **Development environments** may have different SSL requirements
+4. **pg_hba.conf** on RDS is configured to reject non-encrypted connections
+
+This is a common security difference between development and production database configurations.
+
+
+
+
+
+Add SSL configuration to your TypeORM connection options:
+
+```typescript
+// TypeScript configuration
+const connectionOptions: ConnectionOptions = {
+  type: "postgres",
+  host: process.env.DB_HOST,
+  port: parseInt(process.env.DB_PORT || "5432"),
+  username: process.env.DB_USERNAME,
+  password: process.env.DB_PASSWORD,
+  database: process.env.DB_NAME,
+  // Add SSL configuration
+  ssl: {
+    rejectUnauthorized: false,
+  },
+  // ... other options
+};
+```
+
+```javascript
+// JavaScript configuration
+module.exports = {
+  type: "postgres",
+  host: process.env.DB_HOST,
+  port: process.env.DB_PORT || 5432,
+  username: process.env.DB_USERNAME,
+  password: process.env.DB_PASSWORD,
+  database: process.env.DB_NAME,
+  ssl: {
+    rejectUnauthorized: false,
+  },
+};
+```
+
+
+
+
+
+To handle different SSL requirements across environments:
+
+```typescript
+const sslConfig =
+  process.env.NODE_ENV === "production"
+    ? {
+        ssl: {
+          rejectUnauthorized: false,
+        },
+      }
+    : {};
+
+const connectionOptions: ConnectionOptions = {
+  type: "postgres",
+  host: process.env.DB_HOST,
+  port: parseInt(process.env.DB_PORT || "5432"),
+  username: process.env.DB_USERNAME,
+  password: process.env.DB_PASSWORD,
+  database: process.env.DB_NAME,
+  ...sslConfig,
+  // ... other options
+};
```
+
+Or use environment variables:
+
+```typescript
+const connectionOptions: ConnectionOptions = {
+  // ... other config
+  ssl:
+    process.env.DB_SSL_ENABLED === "true"
+      ? {
+          rejectUnauthorized:
+            process.env.DB_SSL_REJECT_UNAUTHORIZED !== "false",
+        }
+      : false,
+};
+```
+
+
+
+
+
+When running migrations, ensure your TypeORM CLI configuration includes SSL:
+
+```json
+// ormconfig.json
+{
+  "type": "postgres",
+  "host": "your-rds-endpoint",
+  "port": 5432,
+  "username": "postgres",
+  "password": "your-password",
+  "database": "your-database",
+  "ssl": {
+    "rejectUnauthorized": false
+  },
+  "migrations": ["src/migrations/*.ts"],
+  "cli": {
+    "migrationsDir": "src/migrations"
+  }
+}
+```
+
+Then run migrations:
+
+```bash
+# Using npm/pnpm
+pnpm typeorm migration:run
+
+# Or with explicit config
+pnpm typeorm migration:run -f ormconfig.json
+```
+
+
+
+
+
+**Current Solution Security:**
+
+- `rejectUnauthorized: false` disables certificate validation
+- Still uses encrypted connection (SSL/TLS)
+- Acceptable for most production scenarios with RDS
+
+**More Secure Alternatives:**
+
+1. **Use RDS CA Certificate:**
+
+```typescript
+// requires: import fs from "fs";
+ssl: {
+  ca: fs.readFileSync('rds-ca-2019-root.pem').toString(),
+  rejectUnauthorized: true
+}
+```
+
+2. **Environment-based certificate validation:**
+
+```typescript
+// node-postgres has no `mode` option inside `ssl`; enforce certificate
+// validation per environment instead (or set sslmode in the connection URL)
+ssl: {
+  rejectUnauthorized: process.env.NODE_ENV === 'production'
+}
+```
+
+3. **Use AWS IAM Database Authentication** (for enhanced security)
+
+
+
+
+
+If the SSL configuration doesn't resolve the issue:
+
+1. **Verify RDS SSL settings:**
+
+   - Check if `rds.force_ssl` parameter is enabled
+   - Verify security group allows connections from your application
+
+2. **Test connection manually:**
+
+```bash
+psql "host=your-rds-endpoint port=5432 dbname=your-db user=postgres sslmode=require"
+```
+
+3. **Check TypeORM version compatibility:**
+
+   - Ensure you're using a recent version of TypeORM
+   - Some older versions had SSL configuration issues
+
+4. **Verify environment variables:**
+   - Ensure all database connection variables are correctly set in production
+   - Check that the database endpoint is accessible from your production environment
+
+
+
+---
+
+_This FAQ was automatically generated on January 15, 2024 based on a real user query._
diff --git a/docs/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx b/docs/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx
new file mode 100644
index 000000000..aac550f20
--- /dev/null
+++ b/docs/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx
@@ -0,0 +1,128 @@
+---
+sidebar_position: 3
+title: "Dependencies Monitoring Graphs Not Displaying"
+description: "Solution for RDS and OpenSearch monitoring graphs showing empty or not loading properly"
+date: "2025-01-22"
+category: "dependency"
+tags: ["monitoring", "rds", "opensearch", "graphs", "dependencies"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Dependencies Monitoring Graphs Not Displaying
+
+**Date:** January 22, 2025
+**Category:** Dependency +**Tags:** Monitoring, RDS, OpenSearch, Graphs, Dependencies + +## Problem Description + +**Context:** When accessing the monitoring section for dependencies (RDS, OpenSearch, etc.) in SleakOps, the monitoring page loads but the usage graphs do not display any data or remain empty. + +**Observed Symptoms:** + +- Monitoring page loads successfully +- Graphs appear empty or show no data +- Issue affects both RDS and OpenSearch dependencies +- Problem occurs specifically with older dependencies + +**Relevant Configuration:** + +- Dependency type: RDS, OpenSearch +- Monitoring: Enabled +- Dependencies created before a specific platform update +- Graph visualization: Empty/blank + +**Error Conditions:** + +- Occurs on dependencies created before monitoring variable storage was implemented +- Affects multiple dependency types (RDS, OpenSearch) +- Monitoring data collection works but visualization fails +- Problem is consistent across affected dependencies + +## Detailed Solution + + + +This issue occurs due to a missing variable in the platform's database that is required for monitoring graph visualization. When certain dependencies were created, this monitoring variable was not stored, causing the graphs to fail to display data even though the monitoring system is collecting metrics. + +The problem specifically affects: + +- Dependencies created before the monitoring variable storage was implemented +- Both RDS and OpenSearch dependencies +- Any dependency that relies on this specific monitoring configuration + + + + + +While waiting for the platform fix, you can try these workarounds: + +1. **Check CloudWatch directly**: Access your AWS CloudWatch console to view RDS and OpenSearch metrics directly +2. 
**Use AWS CLI**: Query metrics using AWS CLI commands: + +```bash +# For RDS metrics +aws cloudwatch get-metric-statistics \ + --namespace AWS/RDS \ + --metric-name CPUUtilization \ + --dimensions Name=DBInstanceIdentifier,Value=your-db-instance \ + --start-time 2025-01-22T00:00:00Z \ + --end-time 2025-01-22T23:59:59Z \ + --period 3600 \ + --statistics Average + +# For OpenSearch metrics +aws cloudwatch get-metric-statistics \ + --namespace AWS/ES \ + --metric-name CPUUtilization \ + --dimensions Name=DomainName,Value=your-domain,Name=ClientId,Value=your-account-id \ + --start-time 2025-01-22T00:00:00Z \ + --end-time 2025-01-22T23:59:59Z \ + --period 3600 \ + --statistics Average +``` + +3. **Set up temporary monitoring**: Create custom CloudWatch dashboards for affected dependencies + + + + + +The SleakOps team is working on a comprehensive solution that will: + +1. **Identify all affected dependencies**: Scan the database for dependencies missing the monitoring variable +2. **Populate missing variables**: Add the required monitoring configuration to existing dependencies +3. **Prevent future occurrences**: Ensure all new dependencies include the monitoring variable from creation + +**Expected timeline**: The fix is being implemented as a complete solution rather than individual patches to ensure all affected users benefit simultaneously. + + + + + +To check if your dependencies are affected by this issue: + +1. **Navigate to Dependencies** in your SleakOps dashboard +2. **Select your RDS or OpenSearch dependency** +3. **Go to the Monitoring tab** +4. **Check if graphs display data**: + - If graphs are empty but the page loads → You're affected by this issue + - If graphs show data → Your dependency is working correctly + - If the page doesn't load → This might be a different issue + + + + + +For new dependencies created after the platform fix: + +1. 
**Verify monitoring setup**: After creating a new dependency, check that monitoring graphs display data within 5-10 minutes +2. **Test different time ranges**: Ensure graphs work for different time periods (1 hour, 24 hours, 7 days) +3. **Contact support early**: If new dependencies show this issue, report it immediately as it might indicate the fix needs adjustment + + + +--- + +_This FAQ was automatically generated on January 22, 2025 based on a real user query._ diff --git a/docs/troubleshooting/dependency-connection-timeout-mysql-redis.mdx b/docs/troubleshooting/dependency-connection-timeout-mysql-redis.mdx new file mode 100644 index 000000000..6414626b1 --- /dev/null +++ b/docs/troubleshooting/dependency-connection-timeout-mysql-redis.mdx @@ -0,0 +1,237 @@ +--- +sidebar_position: 3 +title: "MySQL and Redis Connection Timeout Issues" +description: "Troubleshooting connection timeouts to MySQL and Redis dependencies in production environments" +date: "2024-01-15" +category: "dependency" +tags: + ["mysql", "redis", "timeout", "connection", "troubleshooting", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# MySQL and Redis Connection Timeout Issues + +**Date:** January 15, 2024 +**Category:** Dependency +**Tags:** MySQL, Redis, Timeout, Connection, Troubleshooting, Networking + +## Problem Description + +**Context:** Production API service experiencing simultaneous connection timeouts to both MySQL database and Redis cache, while other services in the same environment work correctly and external connections (VPN) succeed. 
+
+**Observed Symptoms:**
+
+- MySQL connection timeout: `Error: connect ETIMEDOUT`
+- Redis connection timeout: `ConnectionTimeoutError: Connection timeout`
+- Only affects specific production API service
+- Other services can connect to the same dependencies successfully
+- External connections via VPN work normally
+- Database and Redis services are running and accessible
+
+**Relevant Configuration:**
+
+- Environment: Production API service
+- Affected dependencies: MySQL database and Redis cache
+- Connection method: Internal cluster networking
+- Secrets and credentials: Present and loaded correctly
+- External access: Working via VPN
+
+**Error Conditions:**
+
+- Errors occur simultaneously for both MySQL and Redis
+- Problem isolated to one specific service/pod
+- Intermittent behavior - works sometimes, fails others
+- No recent configuration changes made
+
+## Detailed Solution
+
+
+
+When a single service loses connectivity to multiple dependencies while others work fine, this typically indicates:
+
+1. **Pod networking issues**: The specific pod may have network connectivity problems
+2. **Resource constraints**: Memory/CPU limits causing connection pool exhaustion
+3. **DNS resolution problems**: Service discovery issues within the cluster
+4. **Security group/firewall changes**: Network policies blocking specific pod traffic
+
+
+
+
+
+**Step 1: Restart the affected service**
+
+1. In SleakOps dashboard, go to your project
+2. Find the affected API service
+3. Click **Restart** to recreate the pods
+4. Monitor logs for connection recovery
+
+**Step 2: Check pod resource usage**
+
+```bash
+# Check pod resource consumption
+kubectl top pods -n your-namespace
+
+# Check pod events for resource issues
+kubectl describe pod your-api-pod -n your-namespace
+```
+
+**Step 3: Verify network connectivity from pod**
+
+```bash
+# Test connectivity from inside the pod
+kubectl exec -it your-api-pod -n your-namespace -- sh
+
+# Test MySQL connection
+telnet mysql-service 3306
+
+# Test Redis connection
+telnet redis-service 6379
+
+# Check DNS resolution
+nslookup mysql-service
+nslookup redis-service
+```
+
+
+
+
+
+Connection timeouts often occur due to exhausted connection pools:
+
+**MySQL Connection Pool Settings:**
+
+```javascript
+// Recommended MySQL connection configuration (mysql2 options; note that
+// `acquireTimeout` and `reconnect` from the old mysql driver are not
+// supported by mysql2 and are silently ignored)
+const mysql = require("mysql2/promise");
+
+const pool = mysql.createPool({
+  host: process.env.MYSQL_HOST,
+  user: process.env.MYSQL_USER,
+  password: process.env.MYSQL_PASSWORD,
+  database: process.env.MYSQL_DATABASE,
+  connectionLimit: 10,
+  waitForConnections: true,
+  queueLimit: 0,
+  connectTimeout: 60000,
+});
+```
+
+**Redis Connection Settings:**
+
+```javascript
+// Recommended Redis connection configuration (node-redis v4 style:
+// connection options live under `socket`)
+const redis = require("redis");
+
+const client = redis.createClient({
+  socket: {
+    host: process.env.REDIS_HOST,
+    port: process.env.REDIS_PORT,
+    connectTimeout: 60000,
+    reconnectStrategy: (retries) => Math.min(retries * 100, 3000),
+  },
+});
+```
+
+
+
+
+
+Insufficient resources can cause connection issues:
+
+**In SleakOps:**
+
+1. Go to your **API service configuration**
+2. Check **Resource Limits**:
+   - Memory: Increase if close to limit
+   - CPU: Ensure adequate allocation
+
+**Recommended minimums for API services:**
+
+```yaml
+resources:
+  requests:
+    memory: "512Mi"
+    cpu: "250m"
+  limits:
+    memory: "1Gi"
+    cpu: "500m"
+```
+
+**Monitor resource usage:**
+
+```bash
+# Check current resource usage
+kubectl top pod your-api-pod -n your-namespace
+
+# Check resource events
+kubectl get events -n your-namespace --sort-by='.lastTimestamp'
+```
+
+
+
+
+
+Check if network policies are blocking connections:
+
+**Verify network policies:**
+
+```bash
+# List network policies
+kubectl get networkpolicies -n your-namespace
+
+# Check specific policy details
+kubectl describe networkpolicy policy-name -n your-namespace
+```
+
+**Test service connectivity:**
+
+```bash
+# Test from another pod in the same namespace
+kubectl run test-pod --image=busybox --rm -it --restart=Never -- sh
+
+# Inside the test pod:
+telnet mysql-service.your-namespace.svc.cluster.local 3306
+telnet redis-service.your-namespace.svc.cluster.local 6379
+```
+
+
+
+
+
+To prevent future issues, implement monitoring:
+
+**Connection monitoring:**
+
+```javascript
+// Add connection health checks (uses the `pool` and `client` objects
+// from the configuration blocks above)
+const healthCheck = async () => {
+  try {
+    // Test MySQL
+    await pool.execute("SELECT 1");
+
+    // Test Redis
+    await client.ping();
+
+    console.log("Dependencies healthy");
+  } catch (error) {
+    console.error("Dependency health check failed:", error);
+  }
+};
+
+// Run health check every 30 seconds
+setInterval(healthCheck, 30000);
+```
+
+**Add to your application:**
+
+1. Implement connection retry logic
+2. Add circuit breaker patterns
+3. Monitor connection pool metrics
+4. Set up alerts for connection failures
+
+
+
+---
+
+_This FAQ was automatically generated on January 15, 2024 based on a real user query._
diff --git a/docs/troubleshooting/deployment-automatic-updates-github.mdx b/docs/troubleshooting/deployment-automatic-updates-github.mdx
new file mode 100644
index 000000000..e354dfae3
--- /dev/null
+++ b/docs/troubleshooting/deployment-automatic-updates-github.mdx
@@ -0,0 +1,201 @@
+---
+sidebar_position: 3
+title: "Automatic Deployment Updates from GitHub"
+description: "How to configure automatic deployments when pushing code to GitHub repository"
+date: "2024-08-02"
+category: "project"
+tags: ["deployment", "github", "ci-cd", "automation", "updates"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Automatic Deployment Updates from GitHub
+
+**Date:** August 2, 2024
+**Category:** Project
+**Tags:** Deployment, GitHub, CI/CD, Automation, Updates
+
+## Problem Description
+
+**Context:** User wants to understand how code changes pushed to their GitHub repository are automatically deployed to their SleakOps environment during testing phases.
+ +**Observed Symptoms:** + +- Uncertainty about whether code changes are automatically deployed +- Need to understand the deployment workflow +- Questions about manual intervention requirements +- Testing phase deployment concerns + +**Relevant Configuration:** + +- Source: GitHub repository +- Platform: SleakOps +- Environment: Development/Testing +- Domain: firev.com.ar + +**Error Conditions:** + +- Unclear deployment process +- Potential manual steps required +- Testing workflow uncertainty + +## Detailed Solution + + + +SleakOps provides automatic deployment capabilities when properly configured: + +**Default Behavior:** + +- Code pushed to the connected branch triggers automatic builds +- Successful builds are automatically deployed to the target environment +- No manual intervention required in SleakOps dashboard + +**Requirements:** + +- GitHub repository must be properly connected to SleakOps +- Webhook configuration must be active +- Build configuration must be valid + + + + + +To ensure your repository is properly connected: + +1. **Check Repository Connection:** + + - Go to your Project in SleakOps + - Navigate to **Settings** → **Repository** + - Verify GitHub repository URL is correct + - Check that webhook is active (green status) + +2. **Verify Branch Configuration:** + + - Ensure the correct branch is selected for deployment + - Common branches: `main`, `master`, `develop` + +3. **Test the Connection:** + - Make a small change to your repository + - Push to the configured branch + - Check SleakOps **Executions** tab for new build + + + + + +The typical SleakOps deployment workflow: + +```mermaid +graph LR + A[Push to GitHub] --> B[Webhook Trigger] + B --> C[SleakOps Build] + C --> D[Build Success?] + D -->|Yes| E[Auto Deploy] + D -->|No| F[Build Failed] + E --> G[Application Updated] +``` + +**Steps:** + +1. **Code Push**: Developer pushes code to GitHub +2. **Webhook Trigger**: GitHub notifies SleakOps of changes +3. 
**Build Process**: SleakOps builds the application +4. **Automatic Deployment**: If build succeeds, deployment happens automatically +5. **Live Update**: Application is updated with new code + + + + + +For testing phases, consider these practices: + +**1. Use Development Environment:** + +```yaml +# Recommended setup +Environments: + - develop (for testing) + - production (for live site) +``` + +**2. Branch Strategy:** + +- Use `develop` branch for testing +- Use `main`/`master` for production +- Test changes in develop before merging to main + +**3. Monitoring Deployments:** + +- Check **Executions** tab after each push +- Monitor build logs for errors +- Verify application functionality after deployment + +**4. Rollback Strategy:** + +- Keep previous versions available +- Test rollback procedures +- Document known good states + + + + + +If automatic deployments aren't working: + +**1. Check Build Status:** + +- Go to **Executions** tab +- Look for failed builds (red status) +- Review build logs for errors + +**2. Verify Webhook:** + +- Check GitHub repository settings +- Look for SleakOps webhook in **Settings** → **Webhooks** +- Ensure webhook is active and receiving payloads + +**3. Branch Configuration:** + +- Confirm you're pushing to the correct branch +- Verify branch name matches SleakOps configuration + +**4. Build Configuration:** + +- Check Dockerfile syntax +- Verify environment variables +- Ensure all dependencies are properly defined + + + + + +You may need manual action in these cases: + +**1. Build Failures:** + +- Fix code issues and push again +- Update build configuration if needed + +**2. Environment Variables:** + +- Update variables in SleakOps dashboard +- Restart executions if variables changed + +**3. Infrastructure Changes:** + +- Scaling requirements +- Resource limit adjustments +- New dependencies + +**4. 
Domain Configuration:** + +- DNS changes +- SSL certificate updates +- Custom domain setup + + + +--- + +_This FAQ was automatically generated on August 2, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-build-failed-production.mdx b/docs/troubleshooting/deployment-build-failed-production.mdx new file mode 100644 index 000000000..55b12599f --- /dev/null +++ b/docs/troubleshooting/deployment-build-failed-production.mdx @@ -0,0 +1,198 @@ +--- +sidebar_position: 3 +title: "Production Build Failure with Log Loading Issues" +description: "Solution for production build failures when logs won't load in SleakOps dashboard" +date: "2024-01-15" +category: "project" +tags: ["build", "production", "logs", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Build Failure with Log Loading Issues + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Build, Production, Logs, Deployment, Troubleshooting + +## Problem Description + +**Context:** Production build process has stopped working and when attempting to view error logs through the SleakOps dashboard, the loading indicator appears indefinitely without displaying the actual logs. 
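When the dashboard's log viewer hangs, the underlying request is often just failing transiently; retrying with a short backoff frequently succeeds where a single attempt spins forever. A generic sketch of that pattern (the command being retried is a placeholder for your actual log retrieval, e.g. a CLI or curl call):

```shell
#!/usr/bin/env bash
# Retry a flaky command with exponential backoff. The real command would be
# your log fetch; `true` below is a stand-in so the sketch is self-contained.
retry() {
  local max_attempts="$1"; shift
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # backoff: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

retry 4 true && echo "logs fetched"
```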
+ +**Observed Symptoms:** + +- Production build process fails to complete +- SleakOps dashboard shows perpetual loading when trying to access build logs +- Unable to view error details to diagnose the build failure +- Build worked previously but suddenly stopped functioning + +**Relevant Configuration:** + +- Environment: Production +- Platform: SleakOps +- Issue type: Build failure + Log access problem +- Status: Both build and log viewing are affected + +**Error Conditions:** + +- Build failure occurs during production deployment +- Log loading gets stuck in loading state +- Problem prevents proper troubleshooting +- Issue appeared suddenly after previously working builds + +## Detailed Solution + + + +When facing both build failures and log loading issues, follow this diagnostic approach: + +1. **Check build status**: Verify if the build is actually running or completely failed +2. **Browser refresh**: Try refreshing the SleakOps dashboard +3. **Different browser**: Test log access from an incognito window or different browser +4. 
**Network connectivity**: Ensure stable internet connection + + + + + +If the dashboard logs won't load, try these alternatives: + +**Via CLI (if available):** + +```bash +# Check recent builds +sleakops builds list --project your-project-name + +# Get specific build logs +sleakops builds logs --build-id +``` + +**Via API:** + +```bash +# Get build information +curl -H "Authorization: Bearer YOUR_TOKEN" \ + https://api.sleakops.com/v1/projects/PROJECT_ID/builds +``` + +**Check email notifications:** + +- Review any build failure emails that might contain error details + + + + + +Typical reasons for sudden production build failures: + +**Resource Issues:** + +- Insufficient memory or CPU during build +- Disk space exhaustion +- Build timeout exceeded + +**Configuration Changes:** + +- Environment variables modified or missing +- Dockerfile changes that broke the build +- Dependency version conflicts + +**External Dependencies:** + +- Package registry issues (npm, pip, etc.) +- Base image unavailability +- Network connectivity problems + +**Code Issues:** + +- Recent commits that introduced build-breaking changes +- Missing files or incorrect file paths +- Compilation errors + + + + + +When SleakOps dashboard logs won't load: + +**Browser-related fixes:** + +1. Clear browser cache and cookies +2. Disable browser extensions temporarily +3. Try incognito/private browsing mode +4. Update your browser to the latest version + +**Dashboard-specific solutions:** + +1. Log out and log back into SleakOps +2. Check if other dashboard sections work properly +3. Try accessing from a different device +4. Wait a few minutes and retry (temporary server issues) + +**If problem persists:** + +- Contact SleakOps support with specific build ID +- Provide browser console errors (F12 → Console tab) +- Include screenshot of the loading issue + + + + + +To get your production builds working again: + +**1. 
Identify the last working build:** + +```bash +# Find recent successful builds +sleakops builds list --status success --limit 5 +``` + +**2. Compare configurations:** + +- Check what changed between working and failing builds +- Review recent commits and configuration updates +- Verify environment variables haven't changed + +**3. Rollback if necessary:** + +```bash +# Deploy from last known good build +sleakops deploy --build-id +``` + +**4. Test with minimal changes:** + +- Try building with a simple change first +- Gradually add complexity to identify the breaking point + + + + + +To avoid similar problems: + +**Monitoring:** + +- Set up build failure notifications +- Monitor build duration trends +- Track resource usage during builds + +**Best practices:** + +- Test builds in staging before production +- Use dependency version pinning +- Implement build caching strategies +- Regular backup of working configurations + +**Documentation:** + +- Document build dependencies and requirements +- Keep track of working environment configurations +- Maintain rollback procedures + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-build-failures-during-updates.mdx b/docs/troubleshooting/deployment-build-failures-during-updates.mdx new file mode 100644 index 000000000..ae263c9c4 --- /dev/null +++ b/docs/troubleshooting/deployment-build-failures-during-updates.mdx @@ -0,0 +1,155 @@ +--- +sidebar_position: 3 +title: "Deployment Build Failures During Platform Updates" +description: "Solution for build failures that occur during scheduled platform updates" +date: "2024-10-14" +category: "project" +tags: ["deployment", "build", "updates", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Build Failures During Platform Updates + +**Date:** October 14, 2024 +**Category:** Project +**Tags:** Deployment, 
Build, Updates, Troubleshooting + +## Problem Description + +**Context:** Users experience deployment failures when trying to make their projects public during scheduled platform updates. The build process fails temporarily while the SleakOps platform undergoes maintenance or updates. + +**Observed Symptoms:** + +- Deployment errors when changing project visibility to public +- Build failures during specific time windows +- Error occurs without any downtime to running services +- Temporary inability to deploy new changes + +**Relevant Configuration:** + +- Project type: Development environment +- Action attempted: Changing project visibility to public +- Platform status: Under scheduled maintenance/update +- Service availability: No downtime for running services + +**Error Conditions:** + +- Error occurs during scheduled platform updates +- Build process fails temporarily +- Deployment pipeline is affected +- Running services remain unaffected + +## Detailed Solution + + + +Build failures during platform updates are temporary issues that occur when: + +1. **Platform components are being updated**: Core build infrastructure undergoes maintenance +2. **Build pipelines are temporarily unavailable**: CI/CD systems may be restarting +3. **Resource allocation changes**: Build resources are temporarily redistributed +4. **Configuration updates**: Platform settings are being modified + +**Important**: These failures do not affect running applications - only new deployments are impacted. + + + + + +When encountering build failures during updates: + +1. **Wait for update completion**: Most platform updates complete within 15-30 minutes +2. **Retry the deployment**: Once updates are complete, retry your original action +3. **Check platform status**: Monitor SleakOps status page or notifications +4. 
**Verify project state**: Ensure your project configuration remains intact + +```bash +# Example: Retry deployment after update +# No special commands needed - simply retry through the UI +``` + + + + + +To minimize impact from scheduled updates: + +1. **Schedule deployments appropriately**: + + - Avoid deployments during announced maintenance windows + - Plan critical deployments outside update schedules + +2. **Monitor platform communications**: + + - Subscribe to SleakOps status updates + - Check email notifications for scheduled maintenance + +3. **Implement deployment strategies**: + - Use staging environments for testing + - Deploy during low-traffic periods + - Have rollback plans ready + + + + + +If build failures persist after platform updates complete: + +1. **Verify project configuration**: + + ```yaml + # Check your project settings + visibility: public + environment: development + build_status: ready + ``` + +2. **Clear build cache**: + + - Go to Project Settings + - Navigate to Build Configuration + - Select "Clear Build Cache" + - Retry deployment + +3. **Check resource quotas**: + + - Verify your account limits + - Ensure sufficient build minutes available + - Check storage quotas + +4. **Contact support if needed**: + - Provide specific error messages + - Include project details and timing + - Use "Reply to all" for faster resolution + + + + + +During platform updates, monitor your services: + +1. **Running services remain available**: + + - No downtime expected for active deployments + - Existing applications continue running normally + - Only new builds/deployments are affected + +2. **Health check verification**: + + ```bash + # Verify your services are still responding + curl -I https://your-app.sleakops.dev + # Should return 200 OK + ``` + +3. 
**Log monitoring**: + - Check application logs for any anomalies + - Monitor resource usage during updates + - Verify database connections remain stable + + + +--- + +_This FAQ was automatically generated on October 14, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-environment-variables-migration-issues.mdx b/docs/troubleshooting/deployment-environment-variables-migration-issues.mdx new file mode 100644 index 000000000..68c3bba32 --- /dev/null +++ b/docs/troubleshooting/deployment-environment-variables-migration-issues.mdx @@ -0,0 +1,176 @@ +--- +sidebar_position: 3 +title: "Environment Variables Not Available After Migration" +description: "Solution for environment variables becoming unavailable during platform migrations affecting builds with default Dockerfile arguments" +date: "2024-04-24" +category: "deployment" +tags: ["environment-variables", "migration", "dockerfile", "secrets", "build"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Environment Variables Not Available After Migration + +**Date:** April 24, 2024 +**Category:** Deployment +**Tags:** Environment Variables, Migration, Dockerfile, Secrets, Build + +## Problem Description + +**Context:** During platform migrations that affect how secrets are exposed in environments, applications may lose access to environment variables, particularly affecting builds with default arguments in Dockerfiles. 
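A cheap way to catch this class of failure early is to validate required variables at container start, before the application runs, so a migration that silently drops a variable fails loudly instead of surfacing later as a broken endpoint. A sketch for an entrypoint script (the variable names are examples, not a required set):

```shell
#!/usr/bin/env bash
# Fail fast if required environment variables are missing. Variable names
# here are examples; list whatever your application actually needs.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "Missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# In a real entrypoint you would `|| exit 1` here to abort the start-up.
if require_vars API_ENDPOINT DATABASE_URL; then
  echo "all required variables present"
else
  echo "refusing to start" >&2
fi
```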
+ +**Observed Symptoms:** + +- Environment variables suddenly become unavailable after deployment +- Applications fail to connect to external services (e.g., login endpoints) +- Variables that were previously working stop being received +- Issue affects multiple environments (development, staging) +- Problem occurs after merging PRs and triggering redeployments + +**Relevant Configuration:** + +- Platform: SleakOps on AWS +- Affected environments: Development and Staging +- Build type: Frontend applications with Dockerfile builds +- Variable type: Environment variables used for API endpoints + +**Error Conditions:** + +- Occurs during platform migrations affecting secret exposure +- Affects builds with default Dockerfile arguments +- Problem persists across multiple deployments +- Issue appears after successful PR merges and redeployments + +## Detailed Solution + + + +During platform migrations, SleakOps may update how secrets and environment variables are exposed to applications. This can temporarily affect: + +- **Dockerfile builds with default arguments**: Variables passed as build arguments may not be available +- **Runtime environment variables**: Variables needed during application execution +- **Secret mounting**: How secrets are made available to containers + +The migration process ensures improved security and consistency but may cause temporary disruptions. + + + + + +When environment variables become unavailable: + +1. **Check variable groups configuration**: + + - Verify that variable groups are properly configured + - Ensure variables are assigned to the correct environments + +2. **Review Dockerfile build arguments**: + + ```dockerfile + # Ensure ARG declarations are present + ARG API_ENDPOINT + ARG DATABASE_URL + + # Use environment variables properly + ENV API_ENDPOINT=${API_ENDPOINT} + ENV DATABASE_URL=${DATABASE_URL} + ``` + +3. 
**Validate environment-specific settings**: + - Check that variables are defined for each environment (dev, staging, prod) + - Verify variable names match exactly between configuration and code + + + + + +After migration completion: + +1. **Update variable groups**: + + - Review and update variable group assignments + - Ensure all required variables are properly configured + - Test variable availability in each environment + +2. **Redeploy applications**: + + ```bash + # Trigger a fresh deployment to pick up new variable configuration + git commit --allow-empty -m "Trigger redeploy after migration" + git push origin develop + ``` + +3. **Verify application functionality**: + - Test all endpoints that depend on environment variables + - Check application logs for any remaining variable-related errors + - Validate connectivity to external services + + + + + +To avoid issues during future migrations: + +```dockerfile +# 1. Declare all required build arguments +ARG NODE_ENV=production +ARG API_BASE_URL +ARG DATABASE_URL + +# 2. Set environment variables from build arguments +ENV NODE_ENV=${NODE_ENV} +ENV API_BASE_URL=${API_BASE_URL} +ENV DATABASE_URL=${DATABASE_URL} + +# 3. Provide fallback values where appropriate +ENV API_TIMEOUT=${API_TIMEOUT:-30000} + +# 4. Validate critical variables +RUN test -n "$API_BASE_URL" || (echo "API_BASE_URL is required" && exit 1) +``` + + + + + +To prevent similar issues: + +1. **Set up environment variable validation**: + + ```javascript + // Add validation in your application startup + const requiredVars = ["API_BASE_URL", "DATABASE_URL", "JWT_SECRET"]; + + requiredVars.forEach((varName) => { + if (!process.env[varName]) { + console.error(`Missing required environment variable: ${varName}`); + process.exit(1); + } + }); + ``` + +2. 
**Implement health checks**: + + ```javascript + // Health check endpoint to verify configuration + app.get("/health", (req, res) => { + const config = { + apiEndpoint: !!process.env.API_BASE_URL, + databaseConnected: !!process.env.DATABASE_URL, + // Don't expose actual values for security + }; + + res.json({ status: "ok", config }); + }); + ``` + +3. **Monitor deployment notifications**: + - Subscribe to platform migration announcements + - Test applications immediately after platform updates + - Maintain staging environments that mirror production configuration + + + +--- + +_This FAQ was automatically generated on April 24, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx b/docs/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx new file mode 100644 index 000000000..8d21f4588 --- /dev/null +++ b/docs/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx @@ -0,0 +1,198 @@ +--- +sidebar_position: 3 +title: "Fargate vCPU Quota Limit During Deployment" +description: "Solution for deployment failures due to Fargate vCPU quota limitations" +date: "2025-02-13" +category: "deployment" +tags: ["fargate", "aws", "quota", "vcpu", "deployment", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fargate vCPU Quota Limit During Deployment + +**Date:** February 13, 2025 +**Category:** Deployment +**Tags:** Fargate, AWS, Quota, vCPU, Deployment, Troubleshooting + +## Problem Description + +**Context:** Users experience deployment failures in SleakOps when using AWS Fargate as the deployment mode, specifically due to vCPU quota limitations that prevent successful deployments. 
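The failure mode is simple arithmetic: each Fargate task reserves vCPUs against a per-account, per-region on-demand quota, and a deployment that would push the running total past the quota is rejected. A sketch of that check with illustrative numbers (the quota and usage values are assumptions; read the real ones from the Service Quotas console):

```shell
#!/usr/bin/env bash
# Illustrative quota arithmetic; replace the numbers with your real values.
QUOTA_VCPUS=6      # "Fargate On-Demand vCPU resource count" quota
IN_USE_VCPUS=5     # sum of vCPUs reserved by currently running tasks

fits_in_quota() {
  local requested_vcpus="$1"
  [ $((IN_USE_VCPUS + requested_vcpus)) -le "$QUOTA_VCPUS" ]
}

if fits_in_quota 2; then
  echo "deployment fits under the quota"
else
  echo "deployment would exceed the quota: request an increase or free capacity"
fi
```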
+ +**Observed Symptoms:** + +- Deployment fails during execution +- Error messages related to insufficient Fargate capacity +- Unable to complete deployment process +- Previous deployments may have worked but suddenly start failing +- Issue affects multiple deployment attempts + +**Relevant Configuration:** + +- Deployment mode: AWS Fargate +- Platform: SleakOps +- Service: Container deployments +- Resource type: vCPU allocation + +**Error Conditions:** + +- Error occurs during deployment execution +- Happens when Fargate vCPU quota is reached +- May coincide with attempts to delete variable groups or other resources +- Prevents both new deployments and cleanup operations + +## Detailed Solution + + + +Fargate vCPU quota limits can cause deployment failures. To identify if this is your issue: + +1. **Check AWS Service Quotas Console:** + + - Navigate to AWS Console → Service Quotas + - Search for "AWS Fargate" + - Look for "Fargate On-Demand vCPU resource count" + +2. **Review deployment logs:** + + - Look for error messages mentioning "insufficient capacity" + - Check for Fargate-specific error codes + - Monitor resource allocation failures + +3. **Check current usage:** + - AWS Console → ECS → Clusters + - Review running Fargate tasks and their vCPU allocation + + + + + +To request a Fargate vCPU quota increase: + +1. **Access Service Quotas:** + + ```bash + # Via AWS CLI (optional) + aws service-quotas get-service-quota \ + --service-code fargate \ + --quota-code L-3032A538 + ``` + +2. **Submit quota increase request:** + + - Go to AWS Console → Service Quotas + - Find "Fargate On-Demand vCPU resource count" + - Click "Request quota increase" + - Specify the new limit needed + - Provide business justification + +3. **Typical processing time:** + - Standard requests: 24-48 hours + - Urgent requests: Can be expedited through AWS Support + + + + + +While waiting for quota approval, try these workarounds: + +1. 
**Optimize resource allocation:** + + ```yaml + # Reduce vCPU allocation in your deployment configuration + resources: + limits: + cpu: "0.25" # Instead of "0.5" or "1.0" + memory: "512Mi" + requests: + cpu: "0.1" + memory: "256Mi" + ``` + +2. **Clean up unused resources:** + + - Stop unnecessary Fargate tasks + - Remove idle deployments + - Delete unused ECS services + +3. **Use different deployment strategy:** + + - Deploy in smaller batches + - Implement rolling deployments with lower concurrency + - Consider using EC2 launch type temporarily + +4. **Regional alternatives:** + - Deploy to a different AWS region with available capacity + - Use multiple regions to distribute load + + + + + +To prevent future Fargate vCPU quota issues: + +1. **Set up monitoring:** + + ```yaml + # CloudWatch alarm for Fargate usage + FargateVCPUUsageAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: FargateVCPUUsageHigh + MetricName: CPUUtilization + Namespace: AWS/ECS + Statistic: Average + Threshold: 80 + ComparisonOperator: GreaterThanThreshold + ``` + +2. **Implement resource governance:** + + - Set default resource limits for deployments + - Implement approval workflows for high-resource deployments + - Regular audit of resource usage + +3. **Plan capacity:** + + - Monitor usage trends + - Request quota increases proactively + - Maintain buffer capacity for peak usage + +4. **Documentation and alerts:** + - Document current quota limits + - Set up alerts at 70% and 85% usage + - Create runbooks for quota management + + + + + +When quota issues prevent cleanup operations (like deleting variable groups): + +1. **Temporary deployment for cleanup:** + + - Deploy minimal resources to enable cleanup operations + - Use smallest possible vCPU allocation + - Execute cleanup tasks immediately after deployment + +2. 
**Manual cleanup via AWS Console:** + + ```bash + # List ECS services that might be blocking cleanup + aws ecs list-services --cluster your-cluster-name + + # Stop services if safe to do so + aws ecs update-service --cluster your-cluster-name \ + --service your-service-name --desired-count 0 + ``` + +3. **Coordinate with SleakOps support:** + - Report the specific resource causing issues + - Request manual intervention if needed + - Provide deployment logs and error messages + + + +--- + +_This FAQ was automatically generated on February 13, 2025 based on a real user query._ diff --git a/docs/troubleshooting/deployment-helm-selector-immutable-error.mdx b/docs/troubleshooting/deployment-helm-selector-immutable-error.mdx new file mode 100644 index 000000000..128a58618 --- /dev/null +++ b/docs/troubleshooting/deployment-helm-selector-immutable-error.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Helm Deployment Selector Immutable Error" +description: "Solution for Kubernetes deployment selector field immutable error during Helm upgrades" +date: "2024-12-19" +category: "workload" +tags: ["helm", "deployment", "kubernetes", "selector", "upgrade"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Helm Deployment Selector Immutable Error + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** Helm, Deployment, Kubernetes, Selector, Upgrade + +## Problem Description + +**Context:** User encounters deployment failure when trying to deploy a project through SleakOps platform using Helm charts. 
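The root cause is that Kubernetes treats a Deployment's `spec.selector` as immutable: the API server compares the selector in the incoming patch against the live object and rejects any difference. A toy sketch of that comparison (the label strings are illustrative, not taken from a real cluster):

```shell
#!/usr/bin/env bash
# Mimics the API server's immutability check on spec.selector: a patch is
# only accepted if the selector is byte-for-byte unchanged.
selector_patch_allowed() {
  local live="$1" incoming="$2"
  [ "$incoming" = "$live" ]
}

live="app.kubernetes.io/instance=crawler-scheduler"
incoming="app.kubernetes.io/instance=crawler-scheduler-v2"

if selector_patch_allowed "$live" "$incoming"; then
  echo "patch accepted"
else
  echo "Error: spec.selector: field is immutable" >&2
fi
```

This is why the fix is delete-and-recreate rather than patch: recreating the Deployment is the only way to get a new selector accepted.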
+ +**Observed Symptoms:** + +- Deployment fails with "UPGRADE FAILED" error +- Error message indicates "field is immutable" for spec.selector +- The deployment cannot be patched due to selector label changes +- Helm upgrade process is blocked + +**Relevant Configuration:** + +- Platform: SleakOps with Kubernetes cluster +- Deployment tool: Helm +- Error type: `spec.selector: Invalid value` with `field is immutable` +- Affected resource: Kubernetes Deployment + +**Error Conditions:** + +- Error occurs during Helm upgrade process +- Happens when deployment selector labels have been modified +- Prevents successful deployment of updated applications +- Typically occurs after manual changes or conflicts + +## Detailed Solution + + + +The error occurs because Kubernetes Deployment's `spec.selector` field is immutable after creation. This means: + +1. **Selector labels cannot be changed** once a Deployment is created +2. **Helm tries to update** the selector during upgrade +3. **Kubernetes rejects** the change due to immutability rules +4. **Manual modifications** or conflicts can trigger this issue + +The error message shows that Helm is trying to patch a deployment with different selector labels than what currently exists. + + + + + +The quickest solution is to delete the existing deployment and let SleakOps recreate it: + +**Using Lens (Kubernetes IDE):** + +1. Open Lens and connect to your cluster +2. Navigate to **Workloads** → **Deployments** +3. Find the problematic deployment (e.g., `velo-crawler-scheduler-produce-production-crawler-scheduler`) +4. Right-click and select **Delete** +5. Confirm the deletion + +**Using kubectl:** + +```bash +kubectl delete deployment velo-crawler-scheduler-produce-production-crawler-scheduler -n +``` + +**After deletion:** + +- Trigger a new deployment from SleakOps dashboard +- Or push code to trigger CI/CD pipeline +- The deployment will be recreated with correct selectors + + + + + +To avoid this issue in the future: + +**1. 
Avoid manual modifications:** + +- Don't manually edit deployments through Lens or kubectl +- Always use SleakOps dashboard or CI/CD for changes + +**2. Handle conflicts properly:** + +- When merge conflicts occur in Helm charts, ensure selector labels remain consistent +- Review changes in deployment templates before merging + +**3. Use proper Helm practices:** + +```yaml +# In your Helm template, ensure consistent labeling +apiVersion: apps/v1 +kind: Deployment +metadata: + name: { { include "chart.fullname" . } } + labels: { { - include "chart.labels" . | nindent 4 } } +spec: + selector: + matchLabels: { { - include "chart.selectorLabels" . | nindent 6 } } + template: + metadata: + labels: { { - include "chart.selectorLabels" . | nindent 8 } } +``` + + + + + +If you cannot delete the deployment immediately: + +**1. Scale down to zero:** + +```bash +kubectl scale deployment --replicas=0 -n +``` + +**2. Use Helm uninstall and reinstall:** + +```bash +# Uninstall the release +helm uninstall -n + +# Reinstall from SleakOps +# Trigger new deployment through platform +``` + +**3. Manual selector fix (advanced):** +If you need to preserve the deployment, you can: + +- Export the current deployment YAML +- Delete the deployment +- Modify the YAML to match expected selectors +- Apply the corrected version +- Then proceed with normal SleakOps deployment + + + + + +After applying the solution: + +**1. Check deployment status:** + +```bash +kubectl get deployments -n +kubectl describe deployment -n +``` + +**2. Verify pods are running:** + +```bash +kubectl get pods -n -l app.kubernetes.io/instance= +``` + +**3. Check application logs:** + +```bash +kubectl logs -l app.kubernetes.io/instance= -n +``` + +**4. 
Test application functionality:** + +- Verify the application is responding correctly +- Check any health endpoints +- Confirm expected behavior + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-image-not-updating-develop-branch.mdx b/docs/troubleshooting/deployment-image-not-updating-develop-branch.mdx new file mode 100644 index 000000000..9a72ed3e0 --- /dev/null +++ b/docs/troubleshooting/deployment-image-not-updating-develop-branch.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 3 +title: "Deployment Not Reflecting Latest Changes After Merge" +description: "Solution for when deployments don't update with latest code changes after merging to develop branch" +date: "2024-12-19" +category: "project" +tags: ["deployment", "build", "docker", "ci-cd", "image-caching"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Not Reflecting Latest Changes After Merge + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Deployment, Build, Docker, CI/CD, Image Caching + +## Problem Description + +**Context:** Development team merges code changes to the develop branch, triggering automatic deployment, but the deployed application doesn't reflect the latest changes. The issue appears to be related to Docker image building not using the latest code. 
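One way to make staleness visible is to tag every image with something that changes on each build, such as a timestamp plus the commit hash, so comparing the deployed tag against `git log` answers "is this the latest code?" at a glance. A sketch (the variable names mirror common CI conventions; `GITHUB_SHA` is an assumption from GitHub Actions):

```shell
#!/usr/bin/env bash
# Derive a per-build image tag from the build time and the commit hash.
# GITHUB_SHA is set by GitHub Actions; other CI systems use different names.
commit="${GITHUB_SHA:-unknown}"
build_date="$(date +%Y%m%d-%H%M%S)"
image_tag="${build_date}-${commit:0:7}"

echo "building myapp:${image_tag}"
```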
+ +**Observed Symptoms:** + +- Deployment process executes successfully after merge to develop branch +- Application shows old functionality/content instead of latest changes +- Same behavior occurs across multiple repositories (3 repos affected) +- Docker image appears to be using cached or outdated content + +**Relevant Configuration:** + +- Branch: `develop` (auto-deployment enabled) +- Affected repositories: Multiple (3 repositories) +- Deployment trigger: Git merge/push to develop +- Platform: SleakOps automated deployment + +**Error Conditions:** + +- Issue occurs consistently after merging to develop branch +- Problem affects multiple repositories simultaneously +- No build errors reported, but changes not reflected +- Suspected Docker image caching issue + +## Detailed Solution + + + +The most common cause is Docker layer caching preventing the build from using the latest code: + +1. **Check build logs** in SleakOps deployment section +2. Look for messages like "Using cache" in Docker build steps +3. Verify the image tag/hash is different between builds +4. Check if the Dockerfile COPY commands are properly invalidating cache + +```dockerfile +# Problematic - cache may not invalidate +COPY . /app + +# Better - copy package files first, then source +COPY package*.json /app/ +RUN npm install +COPY . /app +``` + + + + + +To force SleakOps to rebuild without cache: + +1. Go to your **Project Dashboard** +2. Navigate to **Deployments** → **Build Settings** +3. Enable **"Force Rebuild"** option +4. Or add `--no-cache` flag in build configuration +5. 
Trigger a new deployment + +**Alternative method:** + +- Make a small commit (like updating a comment) +- Push to develop branch to trigger fresh build + + + + + +Ensure your Dockerfile properly invalidates cache when code changes: + +```dockerfile +# Good practice for Node.js apps +FROM node:18-alpine +WORKDIR /app + +# Copy package files first (cache layer) +COPY package*.json ./ +RUN npm ci --only=production + +# Copy source code (invalidates cache when code changes) +COPY . . + +# Add build timestamp to ensure fresh builds +ARG BUILD_DATE +ENV BUILD_DATE=${BUILD_DATE} + +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +Add build arguments that change with each build: + +1. **In your CI/CD configuration:** + +```yaml +build_args: + BUILD_DATE: "$(date +%Y%m%d-%H%M%S)" + GIT_COMMIT: "${GITHUB_SHA}" +``` + +2. **In your Dockerfile:** + +```dockerfile +ARG BUILD_DATE +ARG GIT_COMMIT +ENV BUILD_INFO="${BUILD_DATE}-${GIT_COMMIT}" + +# This ensures the layer is rebuilt every time +RUN echo "Build: ${BUILD_INFO}" > /app/build-info.txt +``` + + + + + +To confirm the issue is resolved: + +1. **Check image tags** in SleakOps: + + - Go to **Workloads** → **Your Service** + - Verify the image tag matches latest build + +2. **Add version endpoint** to your application: + +```javascript +// Add to your app +app.get("/version", (req, res) => { + res.json({ + version: process.env.BUILD_DATE || "unknown", + commit: process.env.GIT_COMMIT || "unknown", + timestamp: new Date().toISOString(), + }); +}); +``` + +3. **Test the endpoint** after deployment to confirm changes + + + + + +Since this affects 3 repositories, apply these changes systematically: + +1. **Create a template Dockerfile** with proper cache invalidation +2. **Update all affected repositories** with the same pattern +3. **Enable force rebuild** for all projects temporarily +4. 
**Test each repository** individually after changes + +**Batch update script example:** + +```bash +#!/bin/bash +REPOS=("repo1" "repo2" "repo3") + +for repo in "${REPOS[@]}"; do + echo "Updating $repo..." + cd $repo + # Copy optimized Dockerfile + cp ../templates/Dockerfile . + git add Dockerfile + git commit -m "Fix: Update Dockerfile for proper cache invalidation" + git push origin develop + cd .. +done +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-migration-hooks-troubleshooting.mdx b/docs/troubleshooting/deployment-migration-hooks-troubleshooting.mdx new file mode 100644 index 000000000..1916f3228 --- /dev/null +++ b/docs/troubleshooting/deployment-migration-hooks-troubleshooting.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 3 +title: "Migration Hooks Deployment Issues" +description: "Troubleshooting deployment timeouts and migration hook failures in SleakOps" +date: "2024-12-11" +category: "project" +tags: ["deployment", "migration", "hooks", "troubleshooting", "timeout"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Migration Hooks Deployment Issues + +**Date:** December 11, 2024 +**Category:** Project +**Tags:** Deployment, Migration, Hooks, Troubleshooting, Timeout + +## Problem Description + +**Context:** Users experience deployment timeouts when migration hooks fail to complete successfully, causing deployments to remain in "STALLED" status indefinitely. 
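The timeout behaviour described above boils down to a polling loop: the deployer watches the hook pod's phase and only proceeds once it reaches `Succeeded`. A hook that hangs never reaches a terminal phase, which produces exactly the indefinite "STALLED" state. The sketch below is illustrative only; the function name and timeouts are assumptions, not SleakOps internals:

```python
import time

def wait_for_hook(get_phase, timeout_s=1800, poll_s=5):
    """Poll a migration-hook pod's phase until it succeeds, fails, or times out.

    get_phase is any callable returning a pod phase string
    ("Pending", "Running", "Succeeded", "Failed").
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase == "Succeeded":
            return "deploy-continues"
        if phase == "Failed":
            return "deploy-fails"
        time.sleep(poll_s)
    # A hook that hangs (e.g. a migration waiting on a database lock) never
    # reaches a terminal phase, so the deployment sits here until the timeout.
    return "stalled"
```

A hook that exits non-zero at least fails fast and visibly; one that hangs is what leaves the deployment stalled with no error in the logs.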
+
+**Observed Symptoms:**
+
+- Deployment shows timeout error after 30 minutes
+- Migration hook pod never reaches "Succeeded" status
+- Deployment remains in "STALLED" state
+- Migration errors are not visible in deployment logs
+- Manual migration execution in pod shows different results than hook execution
+
+**Relevant Configuration:**
+
+- Platform: SleakOps deployment system
+- Hook type: Database migration hooks
+- Deployment method: Kubernetes Jobs
+- Log visibility: Limited error reporting in deployment logs
+
+**Error Conditions:**
+
+- Migration hook fails but doesn't report an explicit error
+- Database migration encounters constraint violations
+- Hook execution doesn't properly exit with status codes
+- Logs from migration hooks don't appear in stdout
+
+## Detailed Solution
+
+
+
+To check whether your migration hook is the cause of the deployment timeout:
+
+1. **Check hook pod status** in the cluster:
+
+   ```bash
+   kubectl get pods -l job-name=migration-hook
+   kubectl describe pod <migration-hook-pod>
+   ```
+
+2. **Look for hook completion status**:
+
+   - Hook should show `Completed` status
+   - If stuck in `Running`, the migration never finished
+   - If showing `Failed`, check the pod logs
+
+3. **Verify hook logs**:
+   ```bash
+   kubectl logs <migration-hook-pod>
+   ```
+
+
+
+
+
+When hook logs don't show the actual error, run migrations manually:
+
+1. **Access the application pod**:
+
+   ```bash
+   kubectl exec -it <application-pod> -- /bin/bash
+   ```
+
+2. **Run migrations manually**:
+
+   ```bash
+   # For Rails applications
+   bundle exec rails db:migrate
+
+   # For Django applications
+   python manage.py migrate
+
+   # For Node.js applications
+   npm run migrate
+   ```
+
+3. 
**Check for specific errors**: + - Database constraint violations + - Missing columns or tables + - Data type conflicts + - Null constraint violations + + + + + +**Null constraint violations:** + +```sql +-- Error: Attempting to add NOT NULL column to table with existing data +-- Solution: Add column as nullable first, then update values +ALTER TABLE users ADD COLUMN email VARCHAR(255); +UPDATE users SET email = 'default@example.com' WHERE email IS NULL; +ALTER TABLE users ALTER COLUMN email SET NOT NULL; +``` + +**Data type conflicts:** + +```sql +-- Error: Cannot change column type with existing data +-- Solution: Create new column, migrate data, drop old column +ALTER TABLE products ADD COLUMN price_new DECIMAL(10,2); +UPDATE products SET price_new = CAST(price AS DECIMAL(10,2)); +ALTER TABLE products DROP COLUMN price; +ALTER TABLE products RENAME COLUMN price_new TO price; +``` + + + + + +Make sure your migration scripts exit properly: + +**For shell scripts:** + +```bash +#!/bin/bash +set -e # Exit on any error + +# Run your migrations +echo "Running database migrations..." +bundle exec rails db:migrate + +# Explicit success exit +echo "Migrations completed successfully" +exit 0 +``` + +**For Python scripts:** + +```python +import sys +import subprocess + +try: + # Run migration command + result = subprocess.run(['python', 'manage.py', 'migrate'], + check=True, capture_output=True, text=True) + print("Migrations completed successfully") + sys.exit(0) +except subprocess.CalledProcessError as e: + print(f"Migration failed: {e.stderr}") + sys.exit(1) +``` + + + + + +To get better visibility into migration hook failures: + +1. **Add verbose logging to your migration scripts**: + + ```bash + #!/bin/bash + echo "[$(date)] Starting database migrations" + echo "[$(date)] Current database status:" + # Add database status check here + + echo "[$(date)] Running migrations..." 
+ bundle exec rails db:migrate 2>&1 | tee /tmp/migration.log + + if [ ${PIPESTATUS[0]} -eq 0 ]; then + echo "[$(date)] Migrations completed successfully" + exit 0 + else + echo "[$(date)] Migrations failed" + cat /tmp/migration.log + exit 1 + fi + ``` + +2. **Ensure all output goes to stdout/stderr**: + - Redirect all command output to stdout + - Use `2>&1` to capture stderr + - Avoid writing logs only to files + + + + + +When a deployment is stuck due to failed migration hooks: + +1. **Delete the failed migration pod**: + + ```bash + kubectl delete pod + ``` + +2. **Fix the underlying migration issue**: + + - Manually resolve database conflicts + - Update migration scripts if needed + - Test migrations in a staging environment + +3. **Retry the deployment**: + + - The hook will be recreated automatically + - Monitor the new hook pod status + - Check logs for successful completion + +4. **If issues persist**: + - Consider temporarily disabling the migration hook + - Run migrations manually after deployment + - Update hook configuration to be more resilient + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx b/docs/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx new file mode 100644 index 000000000..2510c737a --- /dev/null +++ b/docs/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Deployment Missing Tolerations After Nodepool Changes" +description: "Solution for pods that can't schedule after nodepool modifications" +date: "2025-03-05" +category: "cluster" +tags: ["nodepool", "tolerations", "deployment", "scheduling", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Missing Tolerations After Nodepool Changes + +**Date:** March 5, 2025 
+**Category:** Cluster +**Tags:** Nodepool, Tolerations, Deployment, Scheduling, Kubernetes + +## Problem Description + +**Context:** After cluster upgrades or nodepool modifications, existing deployments may lose the ability to schedule pods on available nodes due to missing tolerations. + +**Observed Symptoms:** + +- Pods remain in Pending state despite available nodes +- Deployments that were previously working cannot schedule +- New nodepools are created but pods don't use them +- Error messages about pod scheduling failures + +**Relevant Configuration:** + +- Multiple nodepools in cluster (e.g., production and development) +- Different node types: on-demand and spot instances +- Deployments scaled to 0 and then scaled back up +- Recent cluster upgrades or nodepool changes + +**Error Conditions:** + +- Occurs after removing old default nodepools +- Happens when deployments haven't been updated after nodepool changes +- Affects deployments that were scaled down during nodepool transitions +- Problem persists until tolerations are manually updated + +## Detailed Solution + + + +This issue typically occurs when: + +1. **Default nodepool removal**: The original default nodepool is removed during cluster upgrades +2. **Missing tolerations**: Existing deployments don't have the correct tolerations for new nodepools +3. **Stale configurations**: Deployments retain old scheduling configurations that no longer match available nodes + +Kubernetes requires pods to have matching tolerations for nodes with taints, which are commonly used to separate different types of workloads (production vs development, on-demand vs spot instances). + + + + + +The easiest way to resolve this issue is through the SleakOps platform: + +1. **Navigate to your service** in the SleakOps dashboard +2. **Click "Edit"** on the affected deployment +3. **Don't modify any values** - keep everything as is +4. 
**Click "Deploy"** to trigger a new deployment

This process will:

- Automatically add the correct tolerations
- Update the deployment configuration
- Allow pods to schedule on the appropriate nodepools

```yaml
# The system will automatically add tolerations like:
tolerations:
- key: "node-type"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"
```





To understand your nodepool setup:

1. **Check available nodepools**:

   ```bash
   kubectl get nodes --show-labels
   ```

2. **Verify node taints**:

   ```bash
   kubectl describe nodes | grep -A5 -B5 Taints
   ```

3. **Check pod scheduling status**:
   ```bash
   kubectl describe pod <pod-name> | grep -A10 Events
   ```

Common taint configurations:

- Production nodes: `node-type=ondemand:NoSchedule`
- Development nodes: `node-type=spot:NoSchedule`





If you have multiple deployments affected:

1. **Identify all affected services** in the SleakOps dashboard
2. **Repeat the edit/deploy process** for each service
3. **Prioritize critical services** first
4. **Monitor pod scheduling** after each update

For services that were scaled to 0:

1. First apply the fix (edit without changes + deploy)
2. Then scale up the deployment as needed





To avoid this issue in the future:

1. **Plan nodepool changes**: Coordinate with your SleakOps team before major changes
2. **Update deployments proactively**: When nodepools change, update affected deployments
3. **Use consistent tainting strategies**: Maintain consistent node labeling and tainting
4. 
**Monitor after upgrades**: Check deployment status after cluster upgrades + +**Best practices:** + +- Keep deployment configurations updated +- Test scaling operations after nodepool changes +- Document your nodepool taint strategy +- Set up monitoring for pod scheduling failures + + + +--- + +_This FAQ was automatically generated on March 5, 2025 based on a real user query._ diff --git a/docs/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx b/docs/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx new file mode 100644 index 000000000..fc138fc34 --- /dev/null +++ b/docs/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx @@ -0,0 +1,182 @@ +--- +sidebar_position: 3 +title: "Production Deployment Failure with New Relic pkg_resources Warning" +description: "Solution for deployment failures caused by New Relic pkg_resources deprecation warning" +date: "2024-12-11" +category: "project" +tags: + ["deployment", "build", "newrelic", "python", "pkg_resources", "setuptools"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Deployment Failure with New Relic pkg_resources Warning + +**Date:** December 11, 2024 +**Category:** Project +**Tags:** Deployment, Build, NewRelic, Python, pkg_resources, Setuptools + +## Problem Description + +**Context:** User attempts to deploy a Python project to production using SleakOps CLI but encounters build failures related to New Relic's use of deprecated pkg_resources API. + +**Observed Symptoms:** + +- Build process starts normally but fails with "Something went wrong" message +- UserWarning about pkg_resources being deprecated appears in diagnostics +- Build gets stuck in "Building project..." 
loop before failing +- Deployment to production environment fails completely + +**Relevant Configuration:** + +- Project: `api-mecubro` +- Branch: `master` +- Python version: `3.9` +- New Relic integration enabled +- Command used: `sleakops build -p api-mecubro -b master -w` + +**Error Conditions:** + +- Error occurs during production deployment build process +- Warning originates from `/usr/local/lib/python3.9/site-packages/newrelic/config.py:4555` +- pkg\_resources deprecation warning suggests pinning Setuptools\<81 +- Build failure prevents successful deployment + +## Detailed Solution + + + +The quickest solution is to pin the Setuptools version in your project's requirements: + +1. **Add to requirements.txt**: + +```txt +setuptools<81 +``` + +2. **Or add to pyproject.toml** (if using): + +```toml +[build-system] +requires = ["setuptools<81", "wheel"] +``` + +3. **Rebuild the project**: + +```bash +sleakops build -p api-mecubro -b master -w +``` + + + + + +If you have control over the Dockerfile, you can pin Setuptools during the build process: + +```dockerfile +# Before installing other dependencies +RUN pip install "setuptools<81" + +# Then install your requirements +COPY requirements.txt . +RUN pip install -r requirements.txt +``` + +Or install it alongside other dependencies: + +```dockerfile +RUN pip install "setuptools<81" -r requirements.txt +``` + + + + + +The best long-term solution is to update the New Relic Python agent to a version that doesn't use pkg_resources: + +1. **Check current New Relic version**: + +```bash +pip show newrelic +``` + +2. **Update to latest version**: + +```txt +# In requirements.txt +newrelic>=9.0.0 +``` + +3. **Verify compatibility** with your Python version and application + +4. **Test thoroughly** in a development environment before deploying to production + + + + + +As a temporary workaround, you can suppress the specific warning: + +1. 
**Add environment variable in your deployment configuration**: + +```bash +export PYTHONWARNINGS="ignore::UserWarning:pkg_resources" +``` + +2. **Or add to your application's environment variables**: + +```yaml +# In your SleakOps project configuration +environment: + PYTHONWARNINGS: "ignore::UserWarning:pkg_resources" +``` + +**Note**: This only suppresses the warning but doesn't fix the underlying issue. + + + + + +After implementing any solution: + +1. **Test the build locally** (if possible): + +```bash +sleakops build -p api-mecubro -b master -w +``` + +2. **Monitor the build logs** for the warning message + +3. **Verify New Relic functionality** after deployment: + + - Check that metrics are being collected + - Verify APM data is appearing in New Relic dashboard + - Test error tracking functionality + +4. **Check application performance** to ensure no regressions + + + + + +To prevent this issue in new projects: + +1. **Use modern dependency management**: + + - Use `pyproject.toml` instead of `setup.py` when possible + - Pin critical dependencies including build tools + +2. **Regular dependency updates**: + + - Keep New Relic agent updated + - Monitor deprecation warnings in CI/CD + +3. 
**Build environment consistency**: + - Use specific Python and package versions + - Test builds in environments similar to production + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-old-pods-not-terminating.mdx b/docs/troubleshooting/deployment-old-pods-not-terminating.mdx new file mode 100644 index 000000000..7d74be652 --- /dev/null +++ b/docs/troubleshooting/deployment-old-pods-not-terminating.mdx @@ -0,0 +1,604 @@ +--- +sidebar_position: 3 +title: "Deployment Issue - Old Pods Not Terminating" +description: "Solution for when new deployments create pods but old pods remain active and receive traffic" +date: "2024-06-10" +category: "workload" +tags: ["deployment", "kubernetes", "pods", "traffic-routing", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Issue - Old Pods Not Terminating + +**Date:** June 10, 2024 +**Category:** Workload +**Tags:** Deployment, Kubernetes, Pods, Traffic Routing, Troubleshooting + +## Problem Description + +**Context:** User experiences issues when deploying new builds where new pods are created successfully but old pods from previous deployments (created 2 days ago) remain active and continue receiving traffic instead of the new pods. 
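The pattern above (new pods created, old pods lingering and still serving traffic) follows from how a RollingUpdate drains the old ReplicaSet: an old pod may only be terminated while the total count of ready pods stays at or above `desired - maxUnavailable`. The sketch below illustrates that rule only; it is not the real deployment controller:

```python
def drain_old_pods(old_ready, new_ready, desired, max_unavailable=1):
    """Return how many old pods remain once a rolling update settles.

    An old pod may be terminated only while the remaining ready pods
    (old + new) would stay at or above desired - max_unavailable.
    """
    while old_ready > 0 and (old_ready - 1) + new_ready >= desired - max_unavailable:
        old_ready -= 1
    return old_ready
```

With `desired=3` and no new pod ever becoming Ready, only one old pod can be removed (`maxUnavailable=1`) and the other two keep serving traffic indefinitely; once all three new pods are Ready, the old set drains to zero. That is why a rollout stalls in exactly this shape when new pods fail readiness.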
+ +**Observed Symptoms:** + +- New pod is created with the latest image tag +- Old pods (2+ days old) are not being terminated +- All traffic continues to be routed to old pods instead of new ones +- New pod works correctly when accessed directly via port-forward +- Health checks on new pod are passing +- Attempting to manually delete old pods results in them restarting + +**Relevant Configuration:** + +- Project: `rattlesnake` +- Environment: `development` +- Image Tag: `7edb7463a22198fc4d79bac76cfcb2c0f94b3755` +- Platform: Kubernetes (viewed through Lens) + +**Error Conditions:** + +- Issue occurs during deployment process +- Old pods restart when manually deleted +- Traffic routing does not switch to new pods +- Problem has occurred previously with other projects + +## Detailed Solution + + + +This issue typically occurs due to one of these Kubernetes deployment problems: + +1. **Deployment strategy misconfiguration**: Rolling update settings may be preventing old pods from terminating +2. **Resource constraints**: Insufficient resources preventing new pods from becoming ready +3. **Service selector mismatch**: Service is not pointing to the new pods +4. **Readiness probe failures**: New pods may not be passing readiness checks +5. 
**Multiple deployment controllers**: Conflicting controllers managing the same pods + + + + + +First, verify the current state of your deployment: + +```bash +# Check deployment status +kubectl get deployments -n +kubectl describe deployment -n + +# Check pods and their ages +kubectl get pods -n --show-labels +kubectl get pods -n -o wide + +# Check replica sets +kubectl get replicasets -n +``` + +Look for: + +- Multiple replica sets with pods +- Pod readiness status +- Deployment rollout status + + + + + +Check if the service is correctly pointing to the new pods: + +```bash +# Check service endpoints +kubectl get endpoints -n +kubectl describe service -n + +# Compare service selector with pod labels +kubectl get service -n -o yaml +kubectl get pods -n --show-labels +``` + +The service selector should match the labels on your new pods, not the old ones. + + + + + +Even though port-forward works, check if readiness probes are configured correctly: + +```bash +# Check pod readiness status +kubectl describe pod -n + +# Look for readiness probe configuration +kubectl get pod -n -o yaml | grep -A 10 readinessProbe +``` + +If readiness probes are failing, the service won't route traffic to the new pods. 
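Causes 3 and 4 above reduce to a single rule: a Service's endpoints are exactly the Ready pods whose labels contain every key/value pair of the Service selector. A minimal sketch of that matching (pod names and labels below are made up for illustration):

```python
def service_endpoints(selector, pods):
    """Pods receive traffic only if Ready and their labels are a
    superset of the Service selector."""
    return [
        p["name"]
        for p in pods
        if p["ready"] and selector.items() <= p["labels"].items()
    ]

pods = [
    {"name": "app-old-1", "ready": True,
     "labels": {"app": "rattlesnake", "pod-template-hash": "old123"}},
    {"name": "app-new-1", "ready": False,  # failing readiness probe
     "labels": {"app": "rattlesnake", "pod-template-hash": "new456"}},
]
# The new pod is excluded until its readiness probe passes,
# so all traffic keeps flowing to the old pod:
print(service_endpoints({"app": "rattlesnake"}, pods))  # → ['app-old-1']
```

Note that port-forward bypasses the Service entirely, which is why the new pod can look healthy via port-forward while receiving no real traffic.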
+ + + + + +If the deployment is stuck, force a complete rollout: + +```bash +# Check rollout status +kubectl rollout status deployment/ -n + +# Force restart the deployment +kubectl rollout restart deployment/ -n + +# Wait for rollout to complete +kubectl rollout status deployment/ -n --timeout=300s + +# Check if old replica sets are being terminated +kubectl get replicasets -n +``` + +**Alternative approach - Scale down and up:** + +```bash +# Scale down to 0 replicas +kubectl scale deployment --replicas=0 -n + +# Wait for all pods to terminate +kubectl get pods -n -w + +# Scale back up +kubectl scale deployment --replicas= -n +``` + + + + + +Verify there aren't multiple controllers managing the same pods: + +```bash +# Check for multiple deployments with same selector +kubectl get deployments -n -o wide + +# Check for other controllers (DaemonSets, StatefulSets, etc.) +kubectl get all -n + +# Look for duplicate resource names or selectors +kubectl get pods -n -o yaml | grep -A 5 "ownerReferences" + +# Check for orphaned replica sets +kubectl get replicasets -n --show-labels +``` + +**If you find multiple controllers:** + +```bash +# Delete duplicate or old deployments +kubectl delete deployment -n + +# Delete orphaned replica sets +kubectl delete replicaset -n +``` + + + + + +Verify that the cluster has enough resources for the new pods: + +```bash +# Check node resource usage +kubectl top nodes + +# Check resource requests vs limits +kubectl describe nodes | grep -A 5 "Allocated resources" + +# Check pod resource consumption +kubectl top pods -n + +# Check for pending pods due to resource constraints +kubectl get pods -n | grep Pending +kubectl describe pod -n +``` + +**Common resource issues:** + +- CPU or memory requests too high +- Node resource exhaustion +- Storage constraints +- GPU or other specialized resource unavailability + + + + + +Check why new pods might not be becoming ready: + +```bash +# Check pod events and logs +kubectl describe pod -n 
+kubectl logs -n --previous + +# Check init containers if present +kubectl logs -c -n + +# Check readiness probe failures +kubectl get events -n --field-selector involvedObject.name= +``` + +**Common startup issues:** + +```yaml +# Example of proper readiness probe configuration +readinessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 +``` + + + + + +Check and fix your deployment strategy: + +```bash +# Check current deployment strategy +kubectl get deployment -n -o yaml | grep -A 10 strategy +``` + +**Recommended rolling update strategy:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: your-app +spec: + replicas: 3 + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 # Maximum pods that can be unavailable + maxSurge: 1 # Maximum extra pods during update + template: + spec: + terminationGracePeriodSeconds: 30 + containers: + - name: your-app + image: your-app:latest + readinessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 5 + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 +``` + +**Apply the fix:** + +```bash +# Edit deployment directly +kubectl edit deployment -n + +# Or apply from file +kubectl apply -f deployment.yaml -n +``` + + + + + +Ensure your service is correctly configured to route to new pods: + +```bash +# Check service configuration +kubectl get service -n -o yaml +``` + +**Verify service selector matches pod labels:** + +```yaml +# Service configuration +apiVersion: v1 +kind: Service +metadata: + name: your-app-service +spec: + selector: + app: your-app # Must match pod labels + version: current # Avoid version-specific selectors + ports: + - port: 80 + targetPort: 8080 +``` + +**Check pod labels:** + +```bash +# Verify pod labels match service selector +kubectl get pods -n --show-labels | grep + +# If labels don't match, update 
them +kubectl label pod version=current -n +``` + +**Force service endpoint refresh:** + +```bash +# Delete and recreate endpoints if stuck +kubectl delete endpoints -n + +# Service will automatically recreate endpoints +kubectl get endpoints -n +``` + + + + + +Remove old replica sets and pods that are no longer needed: + +```bash +# List all replica sets +kubectl get replicasets -n + +# Delete old replica sets (keep latest 2-3 for rollback) +kubectl delete replicaset -n + +# Clean up failed pods +kubectl get pods -n | grep -E "(Error|CrashLoopBackOff|ImagePullBackOff)" +kubectl delete pod -n + +# Set revision history limit to prevent accumulation +kubectl patch deployment -n -p '{"spec":{"revisionHistoryLimit":3}}' +``` + +**Automated cleanup script:** + +```bash +#!/bin/bash +# cleanup-old-resources.sh + +NAMESPACE=${1:-default} +DEPLOYMENT=${2} + +echo "Cleaning up old resources for deployment: $DEPLOYMENT in namespace: $NAMESPACE" + +# Get current replica set +CURRENT_RS=$(kubectl get deployment $DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}') + +# Delete old replica sets (keep latest 3) +kubectl get replicasets -n $NAMESPACE -l app=$DEPLOYMENT --sort-by='.metadata.creationTimestamp' | tail -n +4 | awk '{print $1}' | xargs -r kubectl delete replicaset -n $NAMESPACE + +# Force delete stuck pods +kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT | grep Terminating | awk '{print $1}' | xargs -r kubectl delete pod --force --grace-period=0 -n $NAMESPACE + +echo "Cleanup completed" +``` + + + + + +If this issue occurs in production and needs immediate resolution: + +**1. Emergency traffic routing:** + +```bash +# Create temporary service to route to specific pod +kubectl expose pod --port=80 --target-port=8080 --name=emergency-service -n + +# Update main service to point to emergency service +kubectl patch service -n -p '{"spec":{"selector":{"pod-name":""}}}' +``` + +**2. 
Force delete old pods:** + +```bash +# Force delete old pods immediately +kubectl delete pod --force --grace-period=0 -n + +# If pods are stuck in terminating state +kubectl patch pod -n -p '{"metadata":{"finalizers":null}}' +``` + +**3. Backup and restore approach:** + +```bash +# Backup current deployment +kubectl get deployment -n -o yaml > deployment-backup.yaml + +# Create new deployment with different name +sed 's/name: /name: -new/' deployment-backup.yaml | kubectl apply -f - + +# Update service to point to new deployment +kubectl patch service -n -p '{"spec":{"selector":{"app":"-new"}}}' + +# Delete old deployment once new one is stable +kubectl delete deployment -n +``` + + + + + +**1. Deployment monitoring:** + +```bash +# Monitor deployment rollouts +kubectl get events -n --sort-by='.lastTimestamp' | grep Deployment + +# Set up alerts for stuck deployments +kubectl get deployments -n -o json | jq '.items[] | select(.status.readyReplicas != .status.replicas)' +``` + +**2. Automated health checks:** + +```yaml +# Enhanced readiness probe +readinessProbe: + httpGet: + path: /health/ready + port: 8080 + httpHeaders: + - name: Host + value: localhost + initialDelaySeconds: 15 + periodSeconds: 5 + timeoutSeconds: 3 + successThreshold: 1 + failureThreshold: 3 + +# Startup probe for slow-starting containers +startupProbe: + httpGet: + path: /health/startup + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 30 # 5 minutes max startup time +``` + +**3. 
Deployment automation script:** + +```bash +#!/bin/bash +# safe-deploy.sh + +DEPLOYMENT=$1 +NAMESPACE=${2:-default} +IMAGE=$3 + +echo "Starting safe deployment of $DEPLOYMENT with image $IMAGE" + +# Update deployment +kubectl set image deployment/$DEPLOYMENT container=$IMAGE -n $NAMESPACE + +# Wait for rollout +kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE --timeout=300s + +# Verify health +READY_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n $NAMESPACE -o jsonpath='{.status.readyReplicas}') +DESIRED_REPLICAS=$(kubectl get deployment $DEPLOYMENT -n $NAMESPACE -o jsonpath='{.spec.replicas}') + +if [ "$READY_REPLICAS" = "$DESIRED_REPLICAS" ]; then + echo "✅ Deployment successful" + # Clean up old replica sets + kubectl delete replicaset $(kubectl get replicasets -n $NAMESPACE -l app=$DEPLOYMENT --sort-by='.metadata.creationTimestamp' | tail -n +4 | awk '{print $1}') -n $NAMESPACE 2>/dev/null || true +else + echo "❌ Deployment failed, rolling back" + kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE + exit 1 +fi +``` + + + + + +**1. 
Deployment configuration best practices:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: your-app + labels: + app: your-app +spec: + replicas: 3 + revisionHistoryLimit: 3 # Limit old replica sets + selector: + matchLabels: + app: your-app + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% # Allow 25% of pods to be unavailable + maxSurge: 25% # Allow 25% extra pods during update + template: + metadata: + labels: + app: your-app + version: "{{ .Values.version }}" # Use templating for versions + spec: + terminationGracePeriodSeconds: 30 + containers: + - name: your-app + image: your-app:{{ .Values.tag }} + ports: + - containerPort: 8080 + readinessProbe: + httpGet: + path: /health/ready + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 + livenessProbe: + httpGet: + path: /health/live + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi +``` + +**2. Service configuration best practices:** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: your-app-service + labels: + app: your-app +spec: + selector: + app: your-app # Don't include version in selector + ports: + - port: 80 + targetPort: 8080 + protocol: TCP + type: ClusterIP +``` + +**3. 
CI/CD integration:** + +```yaml +# GitLab CI example +deploy: + stage: deploy + script: + - kubectl set image deployment/your-app your-app=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA + - kubectl rollout status deployment/your-app --timeout=300s + - kubectl get pods -l app=your-app + only: + - main + environment: + name: production + when: manual +``` + + + +--- + +_This FAQ was automatically generated on June 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-pending-nodepool-changes.mdx b/docs/troubleshooting/deployment-pending-nodepool-changes.mdx new file mode 100644 index 000000000..a96078a24 --- /dev/null +++ b/docs/troubleshooting/deployment-pending-nodepool-changes.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 3 +title: "Deployment Pending After Nodepool Changes" +description: "Solution for deployments that remain pending after nodepool configuration changes" +date: "2024-01-23" +category: "cluster" +tags: ["deployment", "nodepool", "karpenter", "aws", "scaling"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Pending After Nodepool Changes + +**Date:** January 23, 2024 +**Category:** Cluster +**Tags:** Deployment, Nodepool, Karpenter, AWS, Scaling + +## Problem Description + +**Context:** After making nodepool configuration changes in SleakOps, deployments remain in a pending state and don't automatically apply the new configuration. The application stops functioning until manual intervention is required. 
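One common aggravating factor is AWS quota pressure: a node provisioner such as Karpenter can only add capacity while the account's EC2 vCPU quota has headroom; otherwise pods simply stay Pending without a loud error. The sketch below shows only the shape of that interaction; it is not Karpenter's actual algorithm, and all numbers are illustrative:

```python
def provisioning_outcome(pending_pods, vcpus_in_use, vcpu_quota, vcpus_per_node=4):
    """Decide what happens to pending pods when a new node is requested."""
    if not pending_pods:
        return "no-op"
    if vcpus_in_use + vcpus_per_node <= vcpu_quota:
        return "provision-node"
    # EC2 refuses the instance request; pods stay Pending with no obvious error.
    return "pending-quota-exceeded"
```

This is why keeping memory limits sane matters: unnecessary autoscaling eats quota headroom, and the next legitimate nodepool change then sits pending.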
+ +**Observed Symptoms:** + +- Nodepool changes remain pending and don't automatically apply +- Application stops functioning after nodepool modifications +- Deployments require manual triggering to apply pending changes +- Issue may go unnoticed for extended periods + +**Relevant Configuration:** + +- Platform: AWS EKS with Karpenter +- Autoscaling: Enabled +- Resource limits: Potentially insufficient RAM allocation +- AWS EC2 CPU quotas: May be at limit + +**Error Conditions:** + +- Occurs after nodepool configuration changes +- Deployments remain pending until manually triggered +- May be related to AWS resource quota limits +- Autoscaling triggers unnecessary scaling due to low memory limits + +## Detailed Solution + + + +When nodepool changes are pending: + +1. **Access SleakOps Dashboard** +2. **Navigate to your project** +3. **Go to Deployments section** +4. **Manually trigger the pending deployment** +5. **Verify the application is functioning** + +This will immediately apply the pending nodepool changes and restore application functionality. + + + + + +To prevent unnecessary autoscaling that can trigger AWS quota limits: + +**Recommended Memory Settings:** + +- **Minimum RAM**: 512MB +- **Maximum RAM**: 1024MB + +```yaml +# Example resource configuration +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1024Mi" + cpu: "500m" +``` + +**Why this helps:** + +- Prevents autoscaling from triggering due to memory pressure +- Reduces unnecessary EC2 instance provisioning +- Avoids hitting AWS CPU quota limits + + + + + +If you encounter AWS quota limits: + +1. **Identify the quota limit**: + + - Go to AWS Console → Service Quotas + - Search for "Amazon Elastic Compute Cloud (Amazon EC2)" + - Check "Running On-Demand" quotas + +2. **Request quota increase**: + + - Click "Request quota increase" + - Specify the new limit needed + - Provide business justification + - Submit the request + +3. 
**Monitor the request**: + - AWS typically responds within 24-48 hours + - You'll receive email notifications about status + +**Note**: Make the request from your production AWS account for faster processing. + + + + + +To avoid missing pending deployments: + +1. **Set up notifications**: + + - Configure alerts for deployment status changes + - Monitor deployment pipeline regularly + +2. **Regular checks**: + + - Review pending deployments daily + - Verify application functionality after nodepool changes + +3. **Automated monitoring**: + + ```bash + # List deployments whose READY count is below the desired replica count + kubectl get deployments --all-namespaces | awk 'NR>1 {split($3,a,"/"); if (a[1]!=a[2]) print}' + + # Monitor pod status + kubectl get pods --all-namespaces | grep Pending + ``` + + + + + +Common reasons for pending deployments after nodepool changes: + +1. **Resource constraints**: + + - Insufficient CPU/memory quotas + - Node capacity limitations + +2. **Karpenter provisioning delays**: + + - AWS API rate limits + - Instance type availability + +3. **Configuration conflicts**: + + - Incompatible resource requests + - Scheduling constraints + +4.
**Timing issues**: + - Changes applied during high-traffic periods + - Concurrent deployment conflicts + +**Prevention strategies**: + +- Apply nodepool changes during low-traffic periods +- Ensure adequate AWS quotas before scaling +- Monitor resource utilization trends +- Test changes in staging environment first + + + +--- + +_This FAQ was automatically generated on January 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-redis-communication-error.mdx b/docs/troubleshooting/deployment-redis-communication-error.mdx new file mode 100644 index 000000000..67e5834b4 --- /dev/null +++ b/docs/troubleshooting/deployment-redis-communication-error.mdx @@ -0,0 +1,158 @@ +--- +sidebar_position: 3 +title: "Deployment Failure Due to Redis Communication Error" +description: "Solution for deployment failures caused by Redis connectivity issues" +date: "2024-01-15" +category: "deployment" +tags: ["deployment", "redis", "communication", "troubleshooting", "push"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Failure Due to Redis Communication Error + +**Date:** January 15, 2024 +**Category:** Deployment +**Tags:** Deployment, Redis, Communication, Troubleshooting, Push + +## Problem Description + +**Context:** User experiences deployment failures when pushing code to the repository. The automatic deployment process gets stuck and doesn't complete successfully.
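To see why a Redis outage leaves deployments stuck rather than failed, it helps to picture the queue mechanics. The sketch below is a toy stand-in (plain Python, not SleakOps code): a git push enqueues a job, and if no worker ever drains the queue, the job just sits there with no error surfaced:

```python
from collections import deque

class DeployQueue:
    """Toy stand-in for a Redis list used as a deployment queue."""
    def __init__(self):
        self.jobs = deque()

    def push(self, job):
        # Roughly LPUSH deploy:queue <job>, triggered by a git push.
        self.jobs.appendleft(job)

    def pop(self):
        # Roughly BRPOP; only happens if the worker can reach Redis.
        return self.jobs.pop() if self.jobs else None

q = DeployQueue()
q.push({"project": "simplee-web-dev", "env": "dev"})
# Worker never runs (Redis communication error), so nothing is consumed:
print(len(q.jobs))  # 1 -- the deployment stays pending with no error shown
```

This is why the symptom is silence rather than a failure message: the producing side (the push) succeeds, and only the consuming side is broken.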
+ +**Observed Symptoms:** + +- Deployment process gets stuck after git push +- Automatic deployment doesn't trigger properly +- Deploy action remains in pending state +- No error messages visible to the user initially + +**Relevant Configuration:** + +- Project: simplee-web-dev +- Environment: dev +- Deployment method: Automatic deployment via git push +- Platform: SleakOps + +**Error Conditions:** + +- Error occurs during automatic deployment process +- Triggered by git push to repository +- Deployment pipeline fails to complete +- Issue persists across multiple push attempts + +## Detailed Solution + + + +This issue is typically caused by Redis communication errors within the SleakOps infrastructure. Redis is used for: + +- Managing deployment queues +- Storing deployment state information +- Coordinating between different services +- Caching deployment configurations + +When Redis communication fails, deployments get stuck in the queue and cannot proceed. + + + + + +If you encounter this issue: + +1. **Contact SleakOps Support**: This is typically an infrastructure issue that requires admin intervention +2. **Wait for confirmation**: Support team will resolve the Redis communication issue +3. **Retry deployment**: Once confirmed fixed, retry your deployment +4. **Clean conflicting deployments**: Support may need to clean up stuck deployments + +**Example support interaction:** + +``` +Subject: Deployment stuck - Redis communication error +Project: [your-project-name] +Environment: [environment-name] +Issue: Deployment gets stuck after git push +``` + + + + + +Once the Redis issue is resolved: + +1. **Verify the fix**: Confirm with support that the issue is resolved +2. **Clean local cache**: + ```bash + git fetch --all + git status + ``` +3. **Trigger new deployment**: + ```bash + # Make a small change or use empty commit + git commit --allow-empty -m "Retry deployment after Redis fix" + git push origin main + ``` +4. 
**Monitor deployment**: Watch the deployment progress in SleakOps dashboard + + + + + +To minimize impact of similar issues: + +1. **Monitor deployment status**: Always check deployment completion in SleakOps dashboard +2. **Set up notifications**: Configure deployment status notifications +3. **Have rollback plan**: Keep previous working version ready +4. **Regular health checks**: Monitor your application health after deployments + +**Deployment monitoring checklist:** + +- [ ] Deployment started successfully +- [ ] Build process completed +- [ ] Container deployment successful +- [ ] Application health check passed +- [ ] Traffic routing updated + + + + + +If the problem persists after Redis fix: + +1. **Check deployment logs**: + + - Go to SleakOps dashboard + - Navigate to your project + - Check deployment history and logs + +2. **Verify repository configuration**: + + ```yaml + # Check .sleakops/config.yml + version: "1" + environments: + dev: + cluster: your-cluster + namespace: your-namespace + ``` + +3. **Test with minimal change**: + + ```bash + # Update timestamp or version + echo "# Updated $(date)" >> README.md + git add README.md + git commit -m "Test deployment" + git push + ``` + +4. 
**Contact support with details**: + - Project name and environment + - Exact time of deployment attempt + - Any error messages from logs + - Git commit hash that failed + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-service-name-length-limit.mdx b/docs/troubleshooting/deployment-service-name-length-limit.mdx new file mode 100644 index 000000000..1d3045be9 --- /dev/null +++ b/docs/troubleshooting/deployment-service-name-length-limit.mdx @@ -0,0 +1,174 @@ +--- +sidebar_position: 3 +title: "Kubernetes Service Name Length Limit Error" +description: "Solution for service name exceeding 63 character limit during deployment" +date: "2025-02-13" +category: "project" +tags: ["deployment", "kubernetes", "service", "naming", "helm"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Service Name Length Limit Error + +**Date:** February 13, 2025 +**Category:** Project +**Tags:** Deployment, Kubernetes, Service, Naming, Helm + +## Problem Description + +**Context:** User encounters a deployment failure when trying to deploy a project with a long name in SleakOps. The automatically generated Kubernetes service name exceeds the 63-character limit imposed by Kubernetes. 
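The overflow can be reproduced ahead of a deploy. Assuming the `{project-name}-{environment}-{project-name}-svc` pattern this guide describes, a few lines are enough to check a candidate name:

```python
KUBE_NAME_LIMIT = 63  # DNS label limit enforced by Kubernetes for Service names

def generated_service_name(project: str, environment: str) -> str:
    # SleakOps-style pattern: {project-name}-{environment}-{project-name}-svc
    return f"{project}-{environment}-{project}-svc"

name = generated_service_name("velo-contacto-email-sender", "production")
print(len(name), len(name) <= KUBE_NAME_LIMIT)  # 68 False -- this deploy will fail
```

Running the same check with a shortened project name such as `velo-email-sender` yields 50 characters, comfortably under the limit.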
+ +**Observed Symptoms:** + +- Deployment fails during Helm upgrade process +- Error message: "UPGRADE FAILED: failed to create resource" +- Service name validation error: "must be no more than 63 characters" +- Automatically generated service names are too long + +**Relevant Configuration:** + +- Project name: `velo-contacto-email-sender` +- Environment: `production` +- Generated service name: `velo-contacto-email-sender-production-velo-contacto-email-sender-svc` +- Character count: 68 characters, over the 63-character limit + +**Error Conditions:** + +- Error occurs during deployment process +- Happens with projects that have long names +- Affects automatically generated Kubernetes resource names +- Cannot be directly controlled by user configuration + +## Detailed Solution + + + +Kubernetes has strict naming conventions for resources: + +- **Service names**: Maximum 63 characters (DNS label) +- **Pod names**: Maximum 253 characters (the hostname derived from a pod name is truncated to 63) +- **ConfigMap/Secret names**: Maximum 253 characters +- Names must be lowercase alphanumeric with hyphens + +SleakOps automatically generates service names using the pattern: +`{project-name}-{environment}-{project-name}-svc` + +This can easily exceed the 63-character limit with longer project names. + + + + + +The quickest solution is to rename your project to a shorter name: + +1. **Go to Project Settings** in SleakOps dashboard +2. **Rename the project** to something shorter (e.g., `velo-email-sender`) +3. **Redeploy** the project + +**Example:** + +``` +Original: velo-contacto-email-sender (26 chars) +Shorter: velo-email-sender (17 chars) +``` + +This would generate: +`velo-email-sender-production-velo-email-sender-svc` (50 chars) + + + + + +If your project supports custom service naming, you can override the default: + +1. **In your project configuration**, look for service settings +2. **Add a custom service name**: + +```yaml +# sleakops.yaml or equivalent configuration +service: + name: "velo-contact-svc" # Custom short name + type: ClusterIP + port: 80 +``` + +3.
**Redeploy** the project + + + + + +If you have access to Helm chart customization: + +1. **Create a values override file**: + +```yaml +# values-override.yaml +# In charts scaffolded with `helm create`, the overrides are top-level values +fullnameOverride: "velo-contact-svc" +nameOverride: "contact-svc" +service: + name: "velo-contact-svc" +``` + +2. **Apply during deployment**: + +```bash +helm upgrade --install my-release ./chart \ + -f values-override.yaml +``` + + + + + +To avoid this issue in the future: + +**Recommended naming patterns:** + +- Use abbreviations: `velo-contact-sender` instead of `velo-contacto-email-sender` +- Keep project names under 20 characters when possible +- Use meaningful but concise names +- Avoid redundant words + +**Examples of good project names:** + +``` +✅ velo-email-svc +✅ contact-sender +✅ notification-api +✅ user-auth + +❌ velo-contacto-email-sender-service +❌ my-super-long-project-name-with-details +❌ company-department-team-project-service +``` + + + + + +**Before creating projects:** + +1. **Calculate the final service name length**: + + - Project name + environment + "svc" + separators + - Should be under 60 characters for safety + +2. **Use a naming calculator**: + + ``` + {project-name}-{environment}-{project-name}-svc + + Example calculation: + "my-project" + "-production-" + "my-project" + "-svc" + = 10 + 12 + 10 + 4 = 36 characters ✅ + ``` + +3.
**Test in staging first** with the same project name + + + +--- + +_This FAQ was automatically generated on February 13, 2025 based on a real user query._ diff --git a/docs/troubleshooting/deployment-stuck-creating-status.mdx b/docs/troubleshooting/deployment-stuck-creating-status.mdx new file mode 100644 index 000000000..375914e28 --- /dev/null +++ b/docs/troubleshooting/deployment-stuck-creating-status.mdx @@ -0,0 +1,597 @@ +--- +sidebar_position: 3 +title: "Deployment Stuck in Creating Status" +description: "Solution for deployments that get stuck during the deployment phase after successful build" +date: "2025-01-27" +category: "project" +tags: ["deployment", "build", "troubleshooting", "stuck", "creating"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Stuck in Creating Status + +**Date:** January 27, 2025 +**Category:** Project +**Tags:** Deployment, Build, Troubleshooting, Stuck, Creating + +## Problem Description + +**Context:** User pushes code to a branch and triggers a deployment through SleakOps. The build process completes successfully, but the deployment phase gets stuck in "creating" status for extended periods. + +**Observed Symptoms:** + +- Build process completes successfully +- Deployment remains in "creating" status for more than 14 minutes +- No visible progress or error messages during deployment phase +- Application may or may not actually deploy despite stuck status +- Issue occurs intermittently on specific branches + +**Relevant Configuration:** + +- Deployment type: `deploy_build_dev` +- Platform: SleakOps +- Build status: Successful +- Deployment status: Stuck in "creating" + +**Error Conditions:** + +- Occurs after successful build completion +- Deployment phase hangs without timeout +- No clear error messages in the UI +- May require manual intervention to resolve + +## Detailed Solution + + + +When a deployment gets stuck in "creating" status: + +1. 
**Check deployment logs**: Look for any error messages in the deployment logs +2. **Verify cluster status**: Ensure the target cluster is healthy and responsive +3. **Check pod status**: Verify if pods are actually being created despite the UI showing "creating" +4. **Monitor resource usage**: Check if there are resource constraints (CPU, memory, storage) + +```bash +# Check pod status in the cluster +kubectl get pods -n your-namespace + +# Check deployment status +kubectl get deployments -n your-namespace + +# Check events for any issues +kubectl get events -n your-namespace --sort-by='.lastTimestamp' +``` + + + + + +**Resource Constraints:** + +- Insufficient CPU or memory in the cluster +- Storage quota exceeded +- Node capacity issues + +**Solution:** Scale up cluster resources or optimize application resource requests + +**Image Pull Issues:** + +- Container image not found or inaccessible +- Registry authentication problems +- Network connectivity issues + +**Solution:** Verify image exists and registry credentials are valid + +**Configuration Problems:** + +- Invalid environment variables +- Missing secrets or config maps +- Incorrect service configurations + +**Solution:** Review and validate all configuration files + + + + + +Contact SleakOps support if: + +1. **Deployment stuck for more than 15 minutes** without any progress +2. **Cluster appears healthy** but deployments consistently fail +3. **No clear error messages** in logs or UI +4. 
**Multiple deployments affected** across different projects + +When contacting support, provide: + +- Project name and branch +- Deployment timestamp +- Any error messages or logs +- Recent infrastructure or application changes + + + + + +**Resource Management:** + +```yaml +# Set appropriate resource limits +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "256Mi" + cpu: "200m" +``` + +**Health Checks:** + +```yaml +# Configure proper health checks +livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + +**Deployment Strategy:** + +```yaml +# Use rolling updates for safer deployments +strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 +``` + + + + + +If deployment remains stuck: + +1. **Cancel current deployment** (if option available in UI) +2. **Wait for platform team intervention** to unlock stuck deployments +3. **Retry deployment** after confirmation from support +4. **Monitor closely** for similar issues in subsequent deployments + +**Post-resolution actions:** + +- Verify application is running correctly +- Check all services are accessible +- Monitor application logs for any issues +- Document any configuration changes made + + + + + +For complex deployment issues that require deeper investigation: + +1. **Deployment status analysis**: + +```bash +# Get detailed deployment information +kubectl describe deployment -n + +# Check replica set status +kubectl get replicasets -n +kubectl describe replicaset -n + +# Analyze deployment conditions +kubectl get deployment -o yaml -n | grep -A 10 conditions +``` + +2. 
**Pod creation troubleshooting**: + +```bash +# Check if pods are being created +kubectl get pods -l app= -n --watch + +# Look for pod events +kubectl describe pods -l app= -n + +# Check pod specifications +kubectl get pods -l app= -n -o yaml +``` + +3. **Resource availability verification**: + +```bash +# Check cluster resource availability +kubectl top nodes +kubectl describe nodes + +# Check namespace resource quotas +kubectl describe quota -n +kubectl describe limitrange -n + +# Check persistent volume claims +kubectl get pvc -n +kubectl describe pvc -n +``` + +4. **Network and service troubleshooting**: + +```bash +# Check service configuration +kubectl get services -n +kubectl describe service -n + +# Verify ingress configuration +kubectl get ingress -n +kubectl describe ingress -n + +# Check endpoints +kubectl get endpoints -n +``` + + + + + +Implement robust deployment patterns to prevent issues: + +1. **Blue-Green Deployment**: + +```yaml +# Blue deployment (current) +apiVersion: apps/v1 +kind: Deployment +metadata: + name: app-blue + labels: + version: blue +spec: + replicas: 3 + selector: + matchLabels: + app: myapp + version: blue + template: + metadata: + labels: + app: myapp + version: blue + spec: + containers: + - name: app + image: myapp:v1.0 + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "256Mi" + cpu: "200m" +``` + +2. **Canary Deployment Configuration**: + +```yaml +# Canary deployment with traffic splitting +apiVersion: argoproj.io/v1alpha1 +kind: Rollout +metadata: + name: app-rollout +spec: + replicas: 5 + strategy: + canary: + steps: + - setWeight: 20 + - pause: { duration: 10m } + - setWeight: 40 + - pause: { duration: 10m } + - setWeight: 60 + - pause: { duration: 10m } + - setWeight: 80 + - pause: { duration: 10m } + trafficRouting: + istio: + virtualService: + name: app-virtualservice +``` + +3. 
**Rolling Update with Careful Configuration**: + +```yaml +# Optimized rolling update strategy +spec: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% + minReadySeconds: 10 + progressDeadlineSeconds: 600 +``` + + + + + +Set up comprehensive monitoring to detect deployment issues early: + +1. **Deployment metrics to monitor**: + +```promql +# Deployment availability +kube_deployment_status_replicas_available / kube_deployment_spec_replicas + +# Deployment rollout duration +kube_deployment_status_observed_generation - kube_deployment_metadata_generation + +# Pod restart rate +rate(kube_pod_container_status_restarts_total[5m]) + +# Resource utilization +container_memory_usage_bytes / container_spec_memory_limit_bytes +``` + +2. **Alerting rules for deployment issues**: + +```yaml +groups: + - name: deployment-alerts + rules: + - alert: DeploymentReplicasMismatch + expr: kube_deployment_status_replicas_available != kube_deployment_spec_replicas + for: 15m + labels: + severity: warning + annotations: + summary: "Deployment {{ $labels.deployment }} has mismatched replicas" + description: "Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }} has {{ $value }} available replicas out of {{ $labels.replicas }} desired" + + - alert: DeploymentRolloutStuck + expr: kube_deployment_status_condition{condition="Progressing", status="false"} == 1 + for: 10m + labels: + severity: critical + annotations: + summary: "Deployment {{ $labels.deployment }} rollout is stuck" + description: "Deployment {{ $labels.deployment }} in namespace {{ $labels.namespace }} has been stuck for more than 10 minutes" +``` + +3. 
**Health check implementation**: + +```yaml +# Comprehensive health checks +spec: + containers: + - name: app + image: myapp:latest + ports: + - containerPort: 8080 + livenessProbe: + httpGet: + path: /health + port: 8080 + scheme: HTTP + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + successThreshold: 1 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /ready + port: 8080 + scheme: HTTP + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + successThreshold: 1 + failureThreshold: 3 + startupProbe: + httpGet: + path: /startup + port: 8080 + scheme: HTTP + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 5 + successThreshold: 1 + failureThreshold: 30 +``` + + + + + +Emergency procedures for critical deployment failures: + +1. **Immediate rollback procedure**: + +```bash +# Quick rollback to previous version +kubectl rollout undo deployment/ -n + +# Check rollback status +kubectl rollout status deployment/ -n + +# Verify application is working +kubectl get pods -l app= -n +curl -f https://your-app-url.com/health +``` + +2. **Force deployment restart**: + +```bash +# Force restart all pods in deployment +kubectl rollout restart deployment/ -n + +# Scale down and up if restart doesn't work +kubectl scale deployment --replicas=0 -n +kubectl scale deployment --replicas=3 -n +``` + +3. **Manual pod replacement**: + +```bash +# Delete stuck pods manually +kubectl delete pod -n --force --grace-period=0 + +# Check if new pods are created +kubectl get pods -l app= -n --watch +``` + +4. **Configuration recovery**: + +```bash +# Backup current configuration +kubectl get deployment -o yaml -n > deployment-backup.yaml + +# Apply known working configuration +kubectl apply -f known-working-deployment.yaml + +# Verify deployment status +kubectl describe deployment -n +``` + + + + + +Solutions specific to SleakOps platform deployment issues: + +1. 
**SleakOps deployment state management**: + +- Deployments in SleakOps may get stuck due to platform state management issues +- The platform tracks deployment progress through multiple internal states +- Sometimes manual intervention by SleakOps support is required to reset state + +2. **Platform integration checks**: + +```bash +# Verify SleakOps platform connectivity +# Check if platform can communicate with cluster +# This requires SleakOps support team verification +``` + +3. **Build and deployment pipeline verification**: + +- Ensure build completed successfully before deployment +- Check if image was pushed to registry correctly +- Verify deployment configuration matches build output + +4. **Environment-specific considerations**: + +```yaml +# SleakOps environment configuration +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config + namespace: production +data: + NODE_ENV: "production" + DATABASE_URL: "postgresql://..." + REDIS_URL: "redis://..." + # Ensure all required environment variables are set +``` + +5. **Resource quota management**: + +- SleakOps environments may have specific resource quotas +- Check if deployment exceeds allocated resources +- Consider optimizing resource requests and limits + +6. **Contact support escalation path**: + +```bash +# Information to provide to SleakOps support: +- Project name and environment +- Deployment ID (if available) +- Build ID that triggered the deployment +- Exact timestamp when deployment got stuck +- Any error messages from platform UI +- Recent changes to project configuration +``` + + + + + +Implement proactive measures to prevent deployment issues: + +1. **Pre-deployment validation**: + +```bash +# Validate Kubernetes manifests before deployment +kubectl apply --dry-run=client -f deployment.yaml +kubectl apply --dry-run=server -f deployment.yaml + +# Validate resource requests +kubectl describe quota -n +kubectl describe limitrange -n +``` + +2. 
**Deployment testing strategy**: + +```bash +# Test deployment in staging environment first +# Use identical configuration to production +# Validate all dependent services are available +``` + +3. **Monitoring and alerting setup**: + +```yaml +# Set up comprehensive monitoring +monitoring: + enabled: true + alerts: + deployment_stuck: + condition: "deployment_status != 'available'" + duration: "10m" + severity: "critical" + resource_exhaustion: + condition: "resource_usage > 80%" + duration: "5m" + severity: "warning" +``` + +4. **Documentation and runbooks**: + +- Maintain updated deployment procedures +- Document known issues and solutions +- Keep contact information for support escalation +- Regular review and testing of emergency procedures + +5. **Automated recovery mechanisms**: + +```yaml +# Implement automated health checks and recovery +spec: + template: + spec: + containers: + - name: app + image: myapp:latest + lifecycle: + preStop: + exec: + command: ["/bin/sh", "-c", "sleep 15"] + # Configure graceful shutdown + terminationGracePeriodSeconds: 30 +``` + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/deployment-stuck-starting-state.mdx b/docs/troubleshooting/deployment-stuck-starting-state.mdx new file mode 100644 index 000000000..a2f6453b8 --- /dev/null +++ b/docs/troubleshooting/deployment-stuck-starting-state.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 3 +title: "Deployment Stuck in Starting State" +description: "Solution for deployments stuck in starting state and unable to cancel or modify configuration" +date: "2025-01-27" +category: "project" +tags: ["deployment", "dockerfile", "build", "troubleshooting", "starting-state"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Stuck in Starting State + +**Date:** January 27, 2025 +**Category:** Project +**Tags:** Deployment, Dockerfile, Build, 
Troubleshooting, Starting State + +## Problem Description + +**Context:** User experiences a deployment that becomes stuck in "starting" state for hours, preventing cancellation and blocking configuration changes to Dockerfile settings. + +**Observed Symptoms:** + +- Deployment shows "starting" status for extended periods (hours instead of minutes) +- Cannot cancel the stuck deployment +- Unable to modify Dockerfile configuration settings while deployment is in starting state +- Build process appears to hang or fail silently +- Platform shows "a few seconds ago" but actual time elapsed is much longer + +**Relevant Configuration:** + +- Multi-stage Dockerfile with Gradle build +- Private GitHub repository dependencies requiring authentication +- Docker build arguments: `GITHUB_USER` and `GITHUB_TOKEN` +- Java application with Spring Boot +- Complex startup script with environment variable validation + +**Error Conditions:** + +- Dockerfile detection issues causing inconsistent state +- Missing or incorrectly configured Docker build arguments +- Dockerfile not found in repository root +- Build process fails but doesn't properly report failure status + +## Detailed Solution + + + +When a deployment gets stuck in starting state: + +1. **Contact Support**: The platform may need manual intervention to unlock the stuck state +2. **Wait for Manual Unlock**: Support team can manually destuck the deployment from the backend +3. **Verify Status**: Once unlocked, the deployment should either complete or show proper error status +4. **Check Application Health**: Ensure the application has proper health checks configured + +**Note**: This typically requires platform administrator intervention and cannot be resolved by users directly. + + + + + +For applications requiring private GitHub repository access: + +```dockerfile +# Build Stage +FROM gradle:jdk21-alpine AS build + +WORKDIR /workspace +COPY . . 
+ +# Define build arguments for GitHub authentication +ARG GITHUB_USER +ARG GITHUB_TOKEN + +# Set environment variables for GitHub access +ENV GITHUB_USER=$GITHUB_USER +ENV GITHUB_TOKEN=$GITHUB_TOKEN + +# Configure Gradle properties for private repository access +RUN echo "gpr.user=$GITHUB_USER" >> ~/.gradle/gradle.properties && \ + echo "gpr.key=$GITHUB_TOKEN" >> ~/.gradle/gradle.properties + +# Build the application +RUN gradle clean assemble --no-daemon +``` + +**Important**: Ensure Docker build arguments are properly configured in SleakOps project settings. + + + + + +To configure Docker build arguments: + +1. Go to **Project Settings** → **Dockerfile Settings** +2. Add the required build arguments: + - `GITHUB_USER`: Your GitHub username + - `GITHUB_TOKEN`: Personal access token with repository access +3. **Save Changes**: Ensure the configuration is properly saved +4. **Verify**: Check that the arguments appear in the configuration + +**Troubleshooting**: + +- If save button is disabled, contact support to unlock the configuration +- Ensure tokens have proper permissions for private repository access +- Use fine-grained personal access tokens when possible + + + + + +Common Dockerfile detection problems: + +1. **Dockerfile Location**: Must be in the repository root or specified path +2. **Branch Selection**: Ensure you're deploying from the correct branch +3. **File Naming**: Use exact name `Dockerfile` (case-sensitive) +4. 
**Path Configuration**: If Dockerfile is in a subdirectory, specify the build context path + +**Verification Steps**: + +```bash +# Check if Dockerfile exists in root +ls -la Dockerfile + +# Verify current branch +git branch --show-current + +# Check file permissions +ls -la Dockerfile +``` + + + + + +Ensure your application has proper health checks: + +```dockerfile +# Add health check to Dockerfile +HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ + CMD curl -f http://localhost:${HEALTH_PORT}/health || exit 1 +``` + +**In your startup script**: + +```bash +# Validate required environment variables +REQUIRED_VARS=( + "PORT" + "HEALTH_PORT" + "PRIVATE_API_KEY" + "MONGODB_URI" + "MONGODB_DATABASE" + # ... other required vars +) + +for var in "${REQUIRED_VARS[@]}"; do + if [[ -z "${!var:-}" ]]; then + echo "[ERROR] Environment variable $var is not set." >&2 + exit 1 + fi +done +``` + + + + + +To prevent deployments from getting stuck: + +1. **Test Dockerfile Locally**: Always test your Dockerfile builds locally first +2. **Validate Environment Variables**: Ensure all required environment variables are set +3. **Use Proper Health Checks**: Implement comprehensive health checks +4. **Monitor Build Logs**: Check build logs for early error detection +5. **Gradual Deployment**: Test with minimal configuration first, then add complexity + +**Local Testing**: + +```bash +# Build locally to test +docker build --build-arg GITHUB_USER=your_user --build-arg GITHUB_TOKEN=your_token . 
+ +# Test run locally +docker run -p 8080:8080 your-image +``` + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/deployment-stuck-state-resolution.mdx b/docs/troubleshooting/deployment-stuck-state-resolution.mdx new file mode 100644 index 000000000..3d41e3f72 --- /dev/null +++ b/docs/troubleshooting/deployment-stuck-state-resolution.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Deployment Stuck in Pending State" +description: "Solution for deployments that get stuck and block subsequent deployments" +date: "2024-12-28" +category: "project" +tags: ["deployment", "stuck", "troubleshooting", "build"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Stuck in Pending State + +**Date:** December 28, 2024 +**Category:** Project +**Tags:** Deployment, Stuck, Troubleshooting, Build + +## Problem Description + +**Context:** User experiences a deployment that gets stuck in a pending state, preventing any new deployments from being created. The deployment pod is not created and builds remain in "starting" status indefinitely. 
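Because the platform reports "starting" indefinitely, a client-side timeout is the most reliable way to turn a silent hang into an actionable failure. A hedged sketch follows; the `get_status` callable is hypothetical, standing in for whatever CLI or API poll your team uses:

```python
import time

def wait_for_build(get_status, timeout_s=1800, poll_s=30):
    """Poll a build-status callable until it leaves 'starting' or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status != "starting":
            return status
        time.sleep(poll_s)
    raise TimeoutError("build stuck in 'starting' -- collect IDs and escalate")

# Simulated poll: two 'starting' responses, then success.
statuses = iter(["starting", "starting", "succeeded"])
print(wait_for_build(lambda: next(statuses), timeout_s=5, poll_s=0))  # succeeded
```

A raised `TimeoutError` here is the cue to gather the project name, build IDs, and timestamps described below and contact support, instead of waiting on the UI.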
+ +**Observed Symptoms:** + +- Deployment stuck in pending state without creating pods +- Cannot create new deployments for the same project +- Multiple builds stuck in "starting" status +- No error messages displayed in the UI +- Deployment appears to be blocking the entire deployment pipeline + +**Relevant Configuration:** + +- Environment: Production +- Project: Any project experiencing stuck deployments +- Build status: "Starting" (indefinitely) +- Deployment status: Pending/Stuck + +**Error Conditions:** + +- Occurs during deployment creation process +- Blocks subsequent deployment attempts +- Builds fail to progress beyond "starting" status +- No automatic recovery mechanism triggers + +## Detailed Solution + + + +If you encounter a stuck deployment, contact the SleakOps support team immediately. The resolution typically involves: + +1. **Task State Update**: Support team will identify tasks with outdated states +2. **Manual State Correction**: Update task states to reflect current reality +3. **Pipeline Unblocking**: Clear the deployment queue to allow new deployments + +**Contact Information:** + +- Email: support@sleakops.com +- Include your project name and deployment details + + + + + +To minimize the occurrence of stuck deployments: + +**Monitor Your Deployments:** + +```bash +# Check deployment status regularly +sleakops deployment list --project your-project + +# Monitor build progress +sleakops build list --status starting +``` + +**Set Reasonable Timeouts:** + +- Configure appropriate build timeouts in your project settings +- Monitor long-running builds that may indicate issues + +**Regular Health Checks:** + +- Verify deployment pipeline health before critical releases +- Test deployments in staging environment first + + + + + +Before contacting support, try these diagnostic steps: + +**1. Check Deployment Logs:** + +```bash +sleakops deployment logs --deployment-id +``` + +**2. 
Verify Resource Availability:** + +- Check if your cluster has sufficient resources +- Verify node capacity and pod limits + +**3. Review Build Logs:** + +```bash +sleakops build logs --build-id +``` + +**4. Check Project Status:** + +```bash +sleakops project status --project +``` + +**5. Verify Dependencies:** + +- Ensure all required services are running +- Check database connectivity +- Verify external service availability + + + + + +Escalate to SleakOps support when: + +**Immediate Escalation Required:** + +- Production deployments are blocked +- Multiple builds stuck for more than 30 minutes +- Critical business operations are affected + +**Information to Include:** + +- Project name and environment +- Deployment ID and timestamp +- Build IDs that are stuck +- Any error messages or logs +- Business impact description + +**Support Response:** + +- Production issues: Within 2 hours +- Non-production issues: Within 24 hours +- Critical outages: Immediate response + + + + + +After support resolves the issue: + +**1. Verify Deployment Pipeline:** + +```bash +# Test a simple deployment +sleakops deploy --project --environment staging +``` + +**2. Check Build System:** + +```bash +# Trigger a new build +sleakops build create --project +``` + +**3. Monitor for Recurrence:** + +- Watch subsequent deployments closely +- Report any similar issues immediately +- Document any patterns observed + +**4. 
Update Monitoring:** + +- Set up alerts for stuck deployments +- Configure notifications for long-running builds + + + +--- + +_This FAQ was automatically generated on December 28, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-teams-instances-not-creating.mdx b/docs/troubleshooting/deployment-teams-instances-not-creating.mdx new file mode 100644 index 000000000..9001cfc52 --- /dev/null +++ b/docs/troubleshooting/deployment-teams-instances-not-creating.mdx @@ -0,0 +1,453 @@ +--- +sidebar_position: 3 +title: "Deployment and Teams Instances Not Creating" +description: "Troubleshooting deployment failures and Teams instance creation issues" +date: "2024-01-15" +category: "project" +tags: ["deployment", "teams", "instances", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment and Teams Instances Not Creating + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Deployment, Teams, Instances, Troubleshooting + +## Problem Description + +**Context:** User reports that deployments are not executing and Teams instances are not being created in the SleakOps platform. + +**Observed Symptoms:** + +- Deployments are not being triggered or completed +- Teams instances are not being created +- No visible progress in deployment pipeline +- Services may appear stuck in pending state + +**Relevant Configuration:** + +- Domain: teams.simplee.cl +- Platform: SleakOps +- Deployment type: Teams instances + +**Error Conditions:** + +- Deployment process fails to start or complete +- Instance creation process does not execute +- May affect multiple services or environments + +## Detailed Solution + + + +First, verify the current status of your deployments: + +1. **Navigate to your Project Dashboard** +2. **Check the Deployments section** for any failed or pending deployments +3. **Review deployment logs** for error messages +4. 
**Verify the build status** in the CI/CD pipeline + +Look for common indicators: + +- Red status indicators +- Error messages in logs +- Stuck "pending" states +- Resource allocation issues + + + + + +Check your Teams instance configuration: + +1. **Go to Workloads** → **Teams Services** +2. **Verify the configuration** for teams.simplee.cl +3. **Check resource allocation**: + - CPU and memory limits + - Storage requirements + - Network configuration + +```yaml +# Example Teams configuration +apiVersion: v1 +kind: Service +metadata: + name: teams-service +spec: + selector: + app: teams + ports: + - port: 80 + targetPort: 8080 + type: LoadBalancer +``` + + + + + +Verify that your cluster has sufficient resources: + +1. **Check Node Resources**: + + - Available CPU and memory + - Disk space + - Network capacity + +2. **Verify Quotas**: + + - Namespace resource quotas + - Cluster-wide limits + - Provider-specific quotas (AWS, Azure, GCP) + +3. **Review Pod Status**: + ```bash + kubectl get pods -n your-namespace + kubectl describe pod -n your-namespace + ``` + + + + + +For Teams instances, networking configuration is critical: + +1. **Check DNS Configuration**: + + - Verify teams.simplee.cl DNS records + - Ensure proper A/CNAME records point to load balancer + +2. **Verify Load Balancer**: + + - Check if load balancer is provisioned + - Verify security groups and firewall rules + +3. **Test Connectivity**: + ```bash + nslookup teams.simplee.cl + curl -I https://teams.simplee.cl + ``` + + + + + +**Restart Deployment Process**: + +1. Navigate to your project +2. Go to **Deployments** tab +3. Click **Redeploy** on the failed deployment + +**Clear Stuck Resources**: + +1. Delete stuck pods manually if needed +2. Restart deployment controllers +3. Check for resource locks or conflicts + +**Update Configuration**: + +1. Verify all environment variables +2. Check secrets and configmaps +3. Ensure image repositories are accessible + +**Scale Resources**: + +1. 
Increase node capacity if needed +2. Adjust resource requests and limits +3. Consider using different instance types + + + + + +Escalate to SleakOps support if: + +1. **Infrastructure Issues**: + + - Cluster nodes are not responding + - Provider-level service outages + - Persistent resource allocation failures + +2. **Platform Issues**: + + - SleakOps dashboard not responding + - Deployment pipeline completely broken + - Authentication or authorization failures + +3. **Data Loss Concerns**: + - Persistent volumes not mounting + - Database connectivity issues + - Backup and recovery problems + +**Information to provide when escalating**: + +- Project name and environment +- Specific error messages from logs +- Timeline of when the issue started +- Recent changes made to configuration +- Screenshots of error states + +**Support Contact Methods**: + +- SleakOps Support Portal: Create a ticket with "URGENT" priority for production issues +- Email: support@sleakops.com with detailed problem description +- Emergency escalation: Include business impact assessment + + + + + +To prevent deployment and Teams instance issues: + +1. **Set up monitoring and alerts**: + +```yaml +# Example monitoring configuration +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: teams-monitoring +spec: + selector: + matchLabels: + app: teams + endpoints: + - port: metrics + interval: 30s + path: /metrics +``` + +2. **Implement health checks**: + +```yaml +# Add health check to Teams deployment +spec: + template: + spec: + containers: + - name: teams-app + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + +3. 
**Configure resource monitoring**: + +```bash +# Monitor cluster resources regularly +kubectl top nodes +kubectl top pods -n your-namespace +kubectl get events --sort-by=.metadata.creationTimestamp +``` + +4. **Set up automated scaling**: + +```yaml +# Horizontal Pod Autoscaler for Teams service +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: teams-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: teams-deployment + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + + + +For complex deployment issues that require deeper investigation: + +1. **Deep cluster analysis**: + +```bash +# Check cluster events for errors +kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp + +# Analyze node conditions +kubectl describe nodes + +# Check system pod status +kubectl get pods -n kube-system + +# Review cluster resource consumption +kubectl describe nodes | grep -A 5 "Allocated resources" +``` + +2. **Network diagnostics**: + +```bash +# Test DNS resolution from within cluster +kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup teams.simplee.cl + +# Check service endpoints +kubectl get endpoints -n your-namespace + +# Test connectivity between pods +kubectl exec -it -- wget -qO- http://teams-service:80/health +``` + +3. **Log aggregation and analysis**: + +```bash +# Collect logs from all related components +kubectl logs deployment/teams-deployment -n your-namespace --all-containers=true + +# Check ingress controller logs +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + +# Review load balancer logs (if applicable) +# This depends on your cloud provider and setup +``` + +4. 
**Container image debugging**: + +```bash +# Verify image availability +kubectl describe pod | grep -i image + +# Check image pull status +kubectl get events | grep -i "image" + +# Test image locally (if accessible) +docker pull your-teams-image:tag +docker run --rm your-teams-image:tag /health-check +``` + + + + + +When deployments fail, follow these recovery steps: + +1. **Immediate rollback to last working version**: + +```bash +# Check deployment history +kubectl rollout history deployment/teams-deployment -n your-namespace + +# Rollback to previous version +kubectl rollout undo deployment/teams-deployment -n your-namespace + +# Monitor rollback progress +kubectl rollout status deployment/teams-deployment -n your-namespace +``` + +2. **Configuration recovery**: + +```bash +# Backup current configuration +kubectl get deployment teams-deployment -o yaml > teams-backup.yaml + +# Apply known working configuration +kubectl apply -f teams-working-config.yaml + +# Verify deployment health +kubectl get pods -l app=teams -n your-namespace +``` + +3. **Data and state recovery**: + +```bash +# Check persistent volume status +kubectl get pv,pvc -n your-namespace + +# Verify database connectivity (if applicable) +kubectl exec -it -- pg_isready # For PostgreSQL +kubectl exec -it -- mysqladmin ping # For MySQL +``` + +4. **Service restoration verification**: + +```bash +# Test service endpoints +curl -f https://teams.simplee.cl/health + +# Check load balancer status +kubectl get svc teams-service -n your-namespace + +# Verify ingress configuration +kubectl get ingress -n your-namespace +kubectl describe ingress teams-ingress -n your-namespace +``` + + + + + +Follow these best practices to ensure reliable Teams deployments: + +1. **Deployment strategy**: + +```yaml +# Use rolling updates for zero-downtime deployments +spec: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% +``` + +2. 
**Resource management**: + +```yaml +# Always set resource requests and limits +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +3. **Environment-specific configurations**: + +- Use ConfigMaps for environment variables +- Store secrets securely using Kubernetes Secrets +- Implement proper RBAC controls +- Use namespaces to isolate environments + +4. **Testing and validation**: + +- Deploy to staging environment first +- Run automated tests before production deployment +- Implement canary deployments for critical changes +- Monitor key metrics during and after deployment + +5. **Documentation and processes**: + +- Document all configuration changes +- Maintain deployment runbooks +- Create incident response procedures +- Regular backup and recovery testing + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/deployment-timeout-database-migrations.mdx b/docs/troubleshooting/deployment-timeout-database-migrations.mdx new file mode 100644 index 000000000..700190688 --- /dev/null +++ b/docs/troubleshooting/deployment-timeout-database-migrations.mdx @@ -0,0 +1,499 @@ +--- +sidebar_position: 3 +title: "Deployment Timeout During Database Migrations" +description: "Solution for deployment failures caused by database migration timeouts in pre-deploy tasks" +date: "2024-04-25" +category: "project" +tags: ["deployment", "database", "migrations", "timeout", "pre-deploy"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deployment Timeout During Database Migrations + +**Date:** April 25, 2024 +**Category:** Project +**Tags:** Deployment, Database, Migrations, Timeout, Pre-deploy + +## Problem Description + +**Context:** Backend service deployment fails during the pre-deploy phase when executing database migrations, causing the entire deployment process to timeout and fail. 
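The timeout behaviour described here can be reproduced locally with a small sketch: run the pre-deploy command under an explicit wall-clock budget and record whether it finished or was killed. A trivial `python -c` command stands in for the real migration command:

```python
import subprocess
import sys
import time

def run_pre_deploy(cmd, timeout_seconds):
    """Run a pre-deploy command with an explicit wall-clock budget.

    Mirrors what a configurable pre-deploy timeout does: the process is
    killed once the budget is exhausted instead of hanging forever.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, check=True, timeout=timeout_seconds)
        return ("ok", time.monotonic() - start)
    except subprocess.TimeoutExpired:
        return ("timeout", time.monotonic() - start)

# Fast stand-in for "python manage.py migrate":
status, elapsed = run_pre_deploy(
    [sys.executable, "-c", "print('migrated')"], timeout_seconds=60
)
print(status)  # ok
```

Timing the real migration command this way in staging gives you a defensible number for the timeout value you configure in production.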
+ +**Observed Symptoms:** + +- Deployment fails during pre-deploy task execution +- Database migrations take too long to complete +- Process terminates due to timeout limits +- Multiple deployment attempts show similar timeout behavior +- Secondary application errors may appear (ImportError) after timeout failures + +**Relevant Configuration:** + +- Service type: Backend application +- Pre-deploy task: Database migrations +- Infrastructure: Karpenter-managed nodes (no provisioning issues detected) +- Database: Production database with potentially large datasets + +**Error Conditions:** + +- Error occurs during pre-deploy phase +- Migrations exceed configured timeout limits +- Problem appears consistently across deployment attempts +- May be followed by application import errors in subsequent attempts + +## Detailed Solution + + + +Database migration timeouts typically occur due to: + +1. **Large data migrations**: Operations on tables with millions of records +2. **Schema changes**: Adding indexes or columns to large tables +3. **Lock contention**: Migrations conflicting with active database connections +4. **Resource constraints**: Insufficient database CPU/memory during migration +5. **Network latency**: Slow connection between application and database + +To diagnose the specific cause: + +```bash +# Check migration logs +kubectl logs -f deployment/your-backend-service -c pre-deploy + +# Monitor database performance during migration +# (AWS RDS example) +aws rds describe-db-instances --db-instance-identifier your-db +``` + + + + + +In SleakOps, you can configure longer timeout limits for pre-deploy tasks: + +1. Go to your **Project Settings** +2. Navigate to **Deployment Configuration** +3. Find **Pre-deploy Task Settings** +4. 
Increase the **Timeout** value: + +```yaml +# sleakops.yaml example +services: + backend: + pre_deploy: + timeout: 1800 # 30 minutes instead of default 10 minutes + command: "python manage.py migrate" +``` + +Recommended timeout values: + +- Small applications: 600 seconds (10 minutes) +- Medium applications: 1200 seconds (20 minutes) +- Large applications: 1800+ seconds (30+ minutes) + + + + + +To make migrations faster and more reliable: + +**1. Split large migrations:** + +```python +# Instead of one large migration +class Migration(migrations.Migration): + operations = [ + # 50 operations here + ] + +# Split into smaller migrations +class Migration001(migrations.Migration): + operations = [ + # 10 operations here + ] + +class Migration002(migrations.Migration): + operations = [ + # 10 more operations here + ] +``` + +**2. Use database-specific optimizations:** + +```python +# PostgreSQL example - add indexes concurrently +from django.contrib.postgres.operations import AddIndexConcurrently + +class Migration(migrations.Migration): + atomic = False # Required for concurrent operations + operations = [ + AddIndexConcurrently( + model_name='yourmodel', + index=models.Index(fields=['field_name'], name='idx_field_name'), + ), + ] +``` + +**3. Run heavy migrations offline:** + +```bash +# For very large migrations, run manually during maintenance windows +kubectl exec -it deployment/backend-service -- python manage.py migrate --plan +kubectl exec -it deployment/backend-service -- python manage.py migrate app_name migration_number +``` + + + + + +If timeouts persist, consider these deployment strategies: + +**1. 
Blue-Green Deployment with Manual Migration:** + +```yaml +# sleakops.yaml +services: + backend: + deployment_strategy: blue_green + pre_deploy: + enabled: false # Disable automatic migrations + health_check: + path: /health + timeout: 30 +``` + +Then run migrations manually: + +```bash +# After blue environment is ready +kubectl exec -it deployment/backend-service-blue -- python manage.py migrate +# Switch traffic after migration completes +``` + +**2. Rolling Deployment with Migration Jobs:** + +```yaml +# Create separate job for migrations +apiVersion: batch/v1 +kind: Job +metadata: + name: db-migration-job +spec: + template: + spec: + containers: + - name: migrate + image: your-backend-image + command: ["python", "manage.py", "migrate"] + restartPolicy: Never + backoffLimit: 3 +``` + +**3. Database Connection Pooling:** + +```python +# settings.py - Optimize database connections +DATABASES = { + 'default': { + 'ENGINE': 'django.db.backends.postgresql', + 'CONN_MAX_AGE': 600, # Connection pooling + 'OPTIONS': { + 'MAX_CONNS': 20, + 'MIN_CONNS': 5, + } + } +} +``` + + + + + +To prevent future timeout issues: + +**1. Monitor migration performance:** + +```bash +# Add logging to migrations +import logging +logger = logging.getLogger(__name__) + +class Migration(migrations.Migration): + def apply_migration(self, project_state, schema_editor, collect_sql=False): + logger.info(f"Starting migration {self.name}") + start_time = time.time() + result = super().apply_migration(project_state, schema_editor, collect_sql) + duration = time.time() - start_time + logger.info(f"Migration {self.name} completed in {duration:.2f} seconds") + return result +``` + +**2. 
Set up alerts:** + +```yaml +# Alert when migrations take too long +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: migration-alerts +spec: + groups: + - name: migrations + rules: + - alert: MigrationTimeout + expr: increase(django_migration_duration_seconds[5m]) > 300 + labels: + severity: warning + annotations: + summary: "Database migration taking too long" + description: "Migration has been running for more than 5 minutes" + - alert: MigrationFailure + expr: increase(django_migration_failed_total[5m]) > 0 + labels: + severity: critical + annotations: + summary: "Database migration failed" + description: "One or more migrations have failed in the last 5 minutes" +``` + +**3. Test migrations in staging:** + +```bash +# Always test with production-like data volumes +# Use database snapshots for realistic testing +pg_dump production_db > production_snapshot.sql +createdb staging_test_db +psql staging_test_db < production_snapshot.sql + +# Time the migration on staging +time python manage.py migrate --verbosity=2 + +# Check for locking issues +SELECT * FROM pg_stat_activity WHERE state = 'active'; +``` + + + + + +After resolving migration timeout issues: + +**1. Clean up migration files:** + +```bash +# Remove unnecessary migration files +python manage.py showmigrations --list | grep -E "\[X\].*0001_initial" + +# Squash migrations if needed (in development) +python manage.py squashmigrations app_name 0001 0005 +``` + +**2. Optimize database performance:** + +```sql +-- Update table statistics after large migrations +ANALYZE table_name; + +-- Check for missing indexes +SELECT schemaname, tablename, attname, n_distinct, correlation +FROM pg_stats +WHERE schemaname = 'public' AND n_distinct > 100; + +-- Rebuild indexes if necessary +REINDEX TABLE table_name; +``` + +**3. 
Review and optimize future migrations:** + +```python +# Use RunPython with proper transaction handling +from django.db import migrations, transaction + +def forwards_func(apps, schema_editor): + with transaction.atomic(): + # Migration logic here + pass + +def reverse_func(apps, schema_editor): + with transaction.atomic(): + # Reverse logic here + pass + +class Migration(migrations.Migration): + operations = [ + migrations.RunPython(forwards_func, reverse_func), + ] +``` + +**4. Document migration procedures:** + +```markdown +# Migration Checklist +- [ ] Test on staging with production data volume +- [ ] Estimate runtime based on staging tests +- [ ] Schedule during low-traffic window +- [ ] Have rollback plan ready +- [ ] Monitor database locks during migration +- [ ] Verify data integrity post-migration +``` + + + + + +If a migration is stuck and blocking deployments: + +**1. Immediate assessment:** + +```bash +# Check current migration status +python manage.py showmigrations + +# Identify running processes +ps aux | grep "manage.py migrate" + +# Check database locks +SELECT * FROM pg_locks WHERE NOT granted; +``` + +**2. Safe termination procedures:** + +```bash +# For Django migrations +# DO NOT kill -9 unless absolutely necessary +kill -TERM + +# For database locks +SELECT pg_cancel_backend(); +# Only if cancel doesn't work: +SELECT pg_terminate_backend(); +``` + +**3. Recovery steps:** + +```sql +-- Check migration table state +SELECT * FROM django_migrations ORDER BY applied DESC LIMIT 10; + +-- Manually mark migration as unapplied if needed (DANGEROUS) +DELETE FROM django_migrations +WHERE app = 'your_app' AND name = 'problematic_migration'; +``` + +**4. 
Alternative deployment strategy:** + +```bash +# Deploy without migrations first +python manage.py migrate --fake + +# Run migrations separately +python manage.py migrate --run-syncdb + +# Or use multiple smaller migration batches +python manage.py migrate app_name 0001 +python manage.py migrate app_name 0002 +# Continue incrementally +``` + + + + + +**Migration Development Best Practices:** + +1. **Design efficient migrations:** + +```python +# Avoid these patterns in large tables: +class Migration(migrations.Migration): + operations = [ + # BAD: Adds column with default to large table + migrations.AddField( + model_name='largetable', + name='new_field', + field=models.CharField(default='value'), + ), + + # BETTER: Add nullable first, populate separately + migrations.AddField( + model_name='largetable', + name='new_field', + field=models.CharField(null=True), + ), + ] + +# Then populate in separate migration: +def populate_field(apps, schema_editor): + LargeTable = apps.get_model('app', 'LargeTable') + for obj in LargeTable.objects.iterator(): + obj.new_field = 'value' + obj.save(update_fields=['new_field']) +``` + +2. 
**Use database-level optimizations:** + +```sql +-- Use concurrent index creation for large tables +CREATE INDEX CONCURRENTLY idx_table_field ON table_name(field); + +-- Add NOT NULL constraints in steps for large tables +-- Step 1: Add column as nullable +-- Step 2: Populate data +-- Step 3: Add NOT NULL constraint +ALTER TABLE table_name ALTER COLUMN field SET NOT NULL; +``` + +**Deployment Integration:** + +```yaml +# Kubernetes deployment with migration timeout +apiVersion: batch/v1 +kind: Job +metadata: + name: django-migrate +spec: + activeDeadlineSeconds: 1800 # 30 minutes timeout + template: + spec: + containers: + - name: migrate + image: your-app:latest + command: ["python", "manage.py", "migrate"] + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + restartPolicy: Never +``` + +**Monitoring and Alerting Setup:** + +```python +# Django management command with metrics +from django.core.management.base import BaseCommand +from django.db import connection +import time +import logging + +class Command(BaseCommand): + def handle(self, *args, **options): + start_time = time.time() + try: + # Run migration + call_command('migrate', verbosity=1) + duration = time.time() - start_time + # Log success metrics + logging.info(f"Migration completed in {duration:.2f} seconds") + except Exception as e: + duration = time.time() - start_time + logging.error(f"Migration failed after {duration:.2f} seconds: {e}") + raise +``` + + + +--- + +_This FAQ was automatically generated on April 25, 2024 based on a real user query._ diff --git a/docs/troubleshooting/django-celery-appregistrynotready-error.mdx b/docs/troubleshooting/django-celery-appregistrynotready-error.mdx new file mode 100644 index 000000000..f56ffde1e --- /dev/null +++ b/docs/troubleshooting/django-celery-appregistrynotready-error.mdx @@ -0,0 +1,212 @@ +--- +sidebar_position: 3 +title: "Django Celery AppRegistryNotReady Error" +description: "Solution for Django 
AppRegistryNotReady error when running Celery cron jobs" +date: "2024-12-26" +category: "workload" +tags: ["django", "celery", "cron", "appregistrynotready", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Django Celery AppRegistryNotReady Error + +**Date:** December 26, 2024 +**Category:** Workload +**Tags:** Django, Celery, Cron, AppRegistryNotReady, Troubleshooting + +## Problem Description + +**Context:** User is experiencing errors when running Celery cron jobs in a Django application deployed on SleakOps platform. + +**Observed Symptoms:** + +- `django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.` error +- Error occurs when using `celery -A (App) call function` command +- Cron jobs fail to execute properly +- Application appears to not be fully initialized when Celery tasks run + +**Relevant Configuration:** + +- Framework: Django with Celery +- Task execution: Cron jobs +- Command format: `celery -A (App) call function` +- Platform: SleakOps + +**Error Conditions:** + +- Error occurs during Celery task execution +- Happens specifically with cron job scheduling +- Django apps are not properly loaded when task runs +- May be related to Django settings configuration + +## Detailed Solution + + + +Ensure Django is properly initialized before Celery starts: + +```python +# celery.py +import os +from celery import Celery +from django.conf import settings + +# Set the default Django settings module +os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'your_project.settings') + +# Initialize Django +import django +django.setup() + +app = Celery('your_project') +app.config_from_object('django.conf:settings', namespace='CELERY') +app.autodiscover_tasks() +``` + + + + + +Check your Django `settings.py` to ensure all required apps are included: + +```python +# settings.py +INSTALLED_APPS = [ + 'django.contrib.admin', + 'django.contrib.auth', + 'django.contrib.contenttypes', + 
'django.contrib.sessions', + 'django.contrib.messages', + 'django.contrib.staticfiles', + + # Third-party apps + 'celery', + 'django_celery_beat', # If using periodic tasks + 'django_celery_results', # If storing results in Django DB + + # Your apps + 'your_app_name', + # Add any missing apps here +] +``` + + + + + +Ensure your Celery tasks are properly defined: + +```python +# tasks.py +from celery import shared_task +from django.apps import apps + +@shared_task +def your_cron_task(): + # Ensure Django is ready + if not apps.ready: + import django + django.setup() + + # Your task logic here + return "Task completed" +``` + +For manual task execution, use: + +```bash +# Instead of: celery -A your_project call your_app.tasks.your_task +# Use: +celery -A your_project worker --loglevel=info + +# Or for one-time execution: +python manage.py shell -c "from your_app.tasks import your_task; your_task.delay()" +``` + + + + + +Verify that Django settings are properly loaded: + +```python +# In your task or celery.py +import os +print(f"DJANGO_SETTINGS_MODULE: {os.environ.get('DJANGO_SETTINGS_MODULE')}") + +from django.conf import settings +print(f"Settings configured: {settings.configured}") +print(f"Installed apps: {settings.INSTALLED_APPS}") +``` + +In SleakOps, ensure your environment variables are set: + +```yaml +# In your deployment configuration +environment: + DJANGO_SETTINGS_MODULE: "your_project.settings" + CELERY_BROKER_URL: "redis://redis:6379/0" + CELERY_RESULT_BACKEND: "redis://redis:6379/0" +``` + + + + + +For SleakOps deployments, ensure your worker configuration is correct: + +```yaml +# sleakops.yaml or similar configuration +workloads: + - name: celery-worker + type: worker + image: your-django-app + command: ["celery", "-A", "your_project", "worker", "--loglevel=info"] + environment: + DJANGO_SETTINGS_MODULE: "your_project.settings" + + - name: celery-beat + type: worker + image: your-django-app + command: ["celery", "-A", "your_project", "beat", 
"--loglevel=info"] + environment: + DJANGO_SETTINGS_MODULE: "your_project.settings" +``` + + + + + +To debug the issue: + +1. **Test locally first:** + + ```bash + # Set environment variable + export DJANGO_SETTINGS_MODULE=your_project.settings + + # Test Django setup + python -c "import django; django.setup(); print('Django setup successful')" + + # Test Celery task + python manage.py shell -c "from your_app.tasks import your_task; print(your_task())" + ``` + +2. **Check Celery worker logs:** + + ```bash + kubectl logs -f deployment/celery-worker + ``` + +3. **Verify Redis/Broker connection:** + ```python + from celery import current_app + print(current_app.control.inspect().stats()) + ``` + + + +--- + +_This FAQ was automatically generated on December 26, 2024 based on a real user query._ diff --git a/docs/troubleshooting/django-migration-conflicts-database-import.mdx b/docs/troubleshooting/django-migration-conflicts-database-import.mdx new file mode 100644 index 000000000..3e0d102aa --- /dev/null +++ b/docs/troubleshooting/django-migration-conflicts-database-import.mdx @@ -0,0 +1,184 @@ +--- +sidebar_position: 3 +title: "Django Migration Conflicts During Database Import" +description: "Solution for Django migration errors when importing database dumps with different migration states" +date: "2024-12-20" +category: "project" +tags: ["django", "migrations", "database", "import", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Django Migration Conflicts During Database Import + +**Date:** December 20, 2024 +**Category:** Project +**Tags:** Django, Migrations, Database, Import, Troubleshooting + +## Problem Description + +**Context:** When importing a database dump into a Django application deployed on SleakOps, the pre-upgrade hook responsible for running Django migrations fails because the system doesn't recognize that certain migrations have already been applied. 
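The mismatch can be made concrete by diffing the migration names in the codebase against the rows recorded in Django's `django_migrations` table. The sketch below uses illustrative `{app: set_of_names}` dicts; in practice you would build them from the `migrations/` directories and a query on the imported database:

```python
def migration_mismatches(on_disk, in_db):
    """Compare migrations found in the codebase against rows recorded
    in `django_migrations`; both inputs are {app: set_of_names} dicts."""
    report = {}
    for app in sorted(set(on_disk) | set(in_db)):
        missing_in_db = on_disk.get(app, set()) - in_db.get(app, set())
        missing_on_disk = in_db.get(app, set()) - on_disk.get(app, set())
        if missing_in_db or missing_on_disk:
            report[app] = {
                "not_recorded": sorted(missing_in_db),  # candidates for --fake
                "no_file": sorted(missing_on_disk),     # dump from another branch
            }
    return report

# Illustrative data only:
on_disk = {"cobranza": {"0001_initial", "0002_initial"}}
in_db = {"cobranza": {"0001_initial"}, "oldapp": {"0001_initial"}}
print(migration_mismatches(on_disk, in_db))
```

Migrations listed under `not_recorded` whose schema changes already exist in the dump are the ones to mark with `--fake`, as described in the solution below.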
+ +**Observed Symptoms:** + +- Pre-upgrade hook fails during deployment +- Django migration system doesn't recognize previously applied migrations +- Database import process fails +- Migration conflicts between dump state and current codebase + +**Relevant Configuration:** + +- Framework: Django with Django REST Framework +- Database: PostgreSQL (typical for SleakOps deployments) +- Deployment: Kubernetes with pre-upgrade hooks +- Migration apps: cobranza, django_celery_results, and others + +**Error Conditions:** + +- Error occurs during pre-upgrade hook execution +- Happens when database dump contains different migration states than current code +- Affects multiple Django apps with pending migrations +- Problem persists until migration states are synchronized + +## Detailed Solution + + + +This issue occurs when: + +1. **Database dump origin**: The dump was created from a database where the project was running on a different branch +2. **Migration state mismatch**: The migration history in the dump doesn't match the current codebase migrations +3. **Django migration tracking**: Django's `django_migrations` table contains records that don't align with current migration files + +The key indicator is seeing migrations marked as `[ ]` (not applied) when running `python manage.py showmigrations`, even though the database contains the actual schema changes. 
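The `[ ]` indicator above is easy to scan for by hand on a small project, but on one with many apps a tiny parser over the `showmigrations` text helps. This is a sketch against the command's plain-text format (app name unindented, migrations indented with `[X]`/`[ ]` markers):

```python
def unapplied(showmigrations_output):
    """Parse `python manage.py showmigrations` text and return, per app,
    the migrations still marked `[ ]` (not applied)."""
    pending = {}
    app = None
    for raw in showmigrations_output.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if not line.startswith(" "):
            app = line.strip()              # unindented line = app name
        elif "[ ]" in line and app:
            pending.setdefault(app, []).append(line.split("[ ]", 1)[1].strip())
    return pending

sample = """\
cobranza
 [X] 0001_initial
 [ ] 0002_initial
django_celery_results
 [X] 0001_initial
"""
print(unapplied(sample))  # {'cobranza': ['0002_initial']}
```

Any app that shows both `[X]` and `[ ]` entries after a dump import is a candidate for the fake-migration fix in the next step.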
+ + + + + +To diagnose the issue, connect to your application pod and run: + +```bash +# Check current migration status +python manage.py showmigrations + +# Look for apps with mixed states like: +# cobranza +# [X] 0001_initial +# [ ] 0002_initial # <-- This indicates a conflict +``` + +Also check the database directly: + +```sql +-- Connect to your database and check migration records +SELECT app, name, applied FROM django_migrations +WHERE app IN ('cobranza', 'django_celery_results') +ORDER BY app, applied; +``` + + + + + +The most effective solution is to mark the conflicting migrations as applied without actually running them: + +```bash +# Connect to your application pod +kubectl exec -it -- bash + +# Mark specific migrations as fake-applied +python manage.py migrate cobranza 0002_initial --fake +python manage.py migrate django_celery_results 0011_taskresult_periodic_task_name --fake + +# Verify the fix +python manage.py showmigrations +``` + +For multiple migrations: + +```bash +# Mark all pending migrations as fake for a specific app +python manage.py migrate cobranza --fake + +# Or mark all pending migrations across all apps +python manage.py migrate --fake +``` + + + + + +To avoid this issue in the future: + +1. **Consistent branch dumps**: Always create database dumps from the same branch you'll deploy to + +2. **Migration synchronization**: Before creating dumps, ensure migrations are up to date: + + ```bash + python manage.py migrate + python manage.py showmigrations # Verify all are applied + ``` + +3. **Clean import process**: When importing dumps: + + ```bash + # After import, immediately check migration state + python manage.py showmigrations + # Fix any conflicts before deployment + ``` + +4. **Environment consistency**: Use the same Django version and app versions between dump creation and import environments + + + + + +For SleakOps deployments with pre-upgrade hooks: + +1. 
**Temporary hook modification**: If you need to deploy immediately, you can temporarily modify the pre-upgrade hook to use `--fake-initial`: + + ```yaml + # In your deployment configuration + preUpgradeHook: + command: | + python manage.py migrate --fake-initial + ``` + +2. **Post-deployment fix**: After successful deployment, connect to the pod and run the proper fake migrations as described above + +3. **Restore normal hook**: Once fixed, restore the normal migration hook: + ```yaml + preUpgradeHook: + command: | + python manage.py migrate + ``` + + + + + +After applying the fix: + +1. **Check migration status**: + + ```bash + python manage.py showmigrations + # All migrations should show [X] + ``` + +2. **Test application functionality**: + + - Verify database operations work correctly + - Check that all models are accessible + - Test critical application features + +3. **Successful deployment**: + - The next deployment should complete without migration errors + - Pre-upgrade hooks should execute successfully + + + +--- + +_This FAQ was automatically generated on December 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/dns-cloudflare-route53-configuration.mdx b/docs/troubleshooting/dns-cloudflare-route53-configuration.mdx new file mode 100644 index 000000000..2b3294f29 --- /dev/null +++ b/docs/troubleshooting/dns-cloudflare-route53-configuration.mdx @@ -0,0 +1,196 @@ +--- +sidebar_position: 3 +title: "DNS Configuration with Cloudflare and Route53" +description: "How to configure DNS records when using Cloudflare with AWS Route53 delegation" +date: "2024-12-19" +category: "provider" +tags: ["dns", "cloudflare", "route53", "aws", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Configuration with Cloudflare and Route53 + +**Date:** December 19, 2024 +**Category:** Provider +**Tags:** DNS, Cloudflare, Route53, AWS, Configuration + +## Problem Description + 
+**Context:** Users need to configure DNS records when using Cloudflare as their DNS provider while having AWS services that require Route53 delegation for subdomains. + +**Observed Symptoms:** + +- DNS resolution fails for subdomains pointing to AWS Load Balancers +- Services are not accessible through their configured domain names +- DNS propagation issues between Cloudflare and Route53 +- CNAME/A record configuration conflicts + +**Relevant Configuration:** + +- DNS Provider: Cloudflare +- AWS Service: ELB (Elastic Load Balancer) +- Domain delegation: Route53 for specific subdomains +- Record types: CNAME or A records + +**Error Conditions:** + +- Domain resolution timeouts +- Services unreachable via configured domains +- DNS delegation not working properly +- Load balancer endpoints not resolving + +## Detailed Solution + + + +When using Cloudflare as your primary DNS provider but need Route53 for AWS services: + +1. **In Route53 (AWS Console):** + + - Create a hosted zone for your subdomain + - Note the NS (Name Server) records provided by Route53 + +2. **In Cloudflare:** + + - Create NS records pointing your subdomain to Route53 name servers + - Example: `subdomain.yourdomain.com` → Route53 NS records + +3. **Verify delegation:** + ```bash + dig NS subdomain.yourdomain.com + ``` + + + + + +For AWS Load Balancers, you need to create the appropriate DNS records: + +**In Cloudflare (for direct configuration):** + +``` +Type: CNAME +Name: corebackupgenerator +Value: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +TTL: Auto +Proxy status: DNS only (gray cloud) +``` + +**In Route53 (if using delegation):** + +``` +Type: A (Alias) +Name: corebackupgenerator.autolab.com.co +Alias Target: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +Routing Policy: Simple +``` + +**Important:** Use A (Alias) records in Route53 for better performance and automatic IP resolution. 
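One reason A (Alias) records are preferred at the domain root is that DNS forbids CNAME records at a zone apex, and that mistake is easy to lint for before applying changes. A minimal sketch, using an illustrative zone and record list rather than real configuration:

```shell
#!/bin/sh
# Lint a planned record set before applying it: DNS forbids a CNAME at
# the zone apex, which is why Route53 offers A (Alias) records there.
# The zone and records below are illustrative, not real configuration.
ZONE="autolab.com.co"

cat > /tmp/records.txt <<'EOF'
corebackupgenerator.autolab.com.co CNAME internal-k8s-example.us-east-1.elb.amazonaws.com
autolab.com.co CNAME internal-k8s-example.us-east-1.elb.amazonaws.com
EOF

errors=0
while read -r name type target; do
  if [ "$name" = "$ZONE" ] && [ "$type" = "CNAME" ]; then
    echo "ERROR: CNAME at zone apex ($name) -- use an A (Alias) record instead"
    errors=$((errors + 1))
  fi
done < /tmp/records.txt

echo "$errors apex CNAME problem(s) found"
```

Running the same check over records destined for Cloudflare or Route53 catches the apex mistake before it causes resolution failures.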
+ + + + + +When configuring DNS records in Cloudflare for AWS services: + +1. **Disable Cloudflare Proxy (Gray Cloud):** + + - Click the orange cloud icon to make it gray + - This ensures direct connection to AWS services + - Required for non-HTTP services or custom ports + +2. **SSL/TLS Settings:** + + - Set SSL/TLS encryption mode to "Full" or "Full (strict)" + - Ensure certificates match between Cloudflare and AWS + +3. **Page Rules (if needed):** + - Create page rules to bypass Cloudflare for specific subdomains + - Useful for API endpoints or WebSocket connections + + + + + +**Diagnostic Commands:** + +1. **Check DNS propagation:** + + ```bash + dig corebackupgenerator.autolab.com.co + nslookup corebackupgenerator.autolab.com.co + ``` + +2. **Test from different DNS servers:** + + ```bash + dig @8.8.8.8 corebackupgenerator.autolab.com.co + dig @1.1.1.1 corebackupgenerator.autolab.com.co + ``` + +3. **Check delegation:** + ```bash + dig NS autolab.com.co + dig NS corebackupgenerator.autolab.com.co + ``` + +**Common Issues:** + +- **TTL too high:** Reduce TTL to 300 seconds during changes +- **Proxy enabled:** Disable Cloudflare proxy for AWS services +- **Wrong record type:** Use CNAME for external domains, A for IP addresses +- **Missing delegation:** Ensure NS records are properly configured + + + + + +**For Cloudflare + AWS Setup:** + +1. **Use subdomain delegation:** + + - Delegate specific subdomains to Route53 + - Keep main domain management in Cloudflare + +2. **Record type selection:** + + - Use A (Alias) records in Route53 for AWS resources + - Use CNAME records in Cloudflare for external domains + +3. **Monitoring and validation:** + + - Set up DNS monitoring + - Regularly test resolution from different locations + - Monitor SSL certificate validity + +4. 
**Documentation:** + - Document all DNS delegations + - Keep track of which records are managed where + - Maintain emergency contact information + +**Example Configuration:** + +```yaml +# Cloudflare Records +autolab.com.co: + - Type: A + Value: 192.168.1.1 + - Type: NS (for k8s subdomain) + Value: Route53 NS servers + +# Route53 Records (k8s.autolab.com.co zone) +k8s.autolab.com.co: + - Type: A (Alias) + Target: ELB DNS name + +corebackupgenerator.autolab.com.co: + - Type: A (Alias) + Target: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/dns-delegation-route53-domain-configuration.mdx b/docs/troubleshooting/dns-delegation-route53-domain-configuration.mdx new file mode 100644 index 000000000..d421873d3 --- /dev/null +++ b/docs/troubleshooting/dns-delegation-route53-domain-configuration.mdx @@ -0,0 +1,335 @@ +--- +sidebar_position: 3 +title: "DNS Delegation Issues with Route53 and Domain Providers" +description: "Solution for DNS delegation problems when configuring custom domains with Route53 and external domain registrars" +date: "2024-12-19" +category: "provider" +tags: ["dns", "route53", "aws", "domain", "delegation", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Delegation Issues with Route53 and Domain Providers + +**Date:** December 19, 2024 +**Category:** Provider +**Tags:** DNS, Route53, AWS, Domain, Delegation, Troubleshooting + +## Problem Description + +**Context:** User is trying to configure a custom domain (ordenapp.com.ar) to point to their SleakOps deployment, but DNS delegation is not working properly despite correct Route53 configuration. 
+
+**Observed Symptoms:**
+
+- DNS delegation appears correctly configured in Route53
+- Domain registrar (DonWeb) reports delegation is configured but "not propagating because DNS servers are not configured in the hosting provider"
+- NS records are not replicating properly across DNS servers
+- SSL certificate validation may be failing
+- Domain does not resolve to the intended AWS infrastructure
+
+**Relevant Configuration:**
+
+- Domain: Custom domain (.com.ar)
+- DNS Provider: AWS Route53
+- Domain Registrar: DonWeb (or similar)
+- SSL: AWS Certificate Manager (ACM)
+- Target: SleakOps deployment with Load Balancer
+
+**Error Conditions:**
+
+- DNS propagation fails despite correct NS delegation
+- Certificate validation issues
+- Domain does not resolve to target infrastructure
+- NS records show inconsistent results across different DNS checkers
+
+## Detailed Solution
+
+
+
+First, ensure your Route53 hosted zone is properly configured:
+
+1. **Check Hosted Zone Creation:**
+
+   - Go to AWS Console → Route53 → Hosted Zones
+   - Verify the hosted zone for your domain exists
+   - Note down the 4 NS (Name Server) records provided by AWS
+
+2. **Verify Required Records:**
+
+   ```
+   Type: NS
+   Name: your-domain.com
+   Value:
+   ns-123.awsdns-12.com
+   ns-456.awsdns-45.net
+   ns-789.awsdns-78.org
+   ns-012.awsdns-01.co.uk
+
+   Type: A
+   Name: your-domain.com
+   Value: [Load Balancer IP or Alias]
+
+   Type: CNAME (for ACM SSL validation)
+   Name: [validation record name generated by ACM]
+   Value: [validation record value generated by ACM]
+   ```
+
+
+
+
+
+The issue often occurs when the domain registrar's DNS delegation is incomplete:
+
+1. **Update Name Servers at Registrar:**
+
+   - Log into your domain registrar (DonWeb, GoDaddy, etc.)
+   - Go to DNS Management or Name Servers section
+   - Replace default name servers with the 4 AWS Route53 NS records
+   - **Important:** Use ALL 4 name servers, not just 2
+
+2. 
**Remove Conflicting DNS Records:**
+
+   - Delete any existing A, CNAME, or other records at the registrar
+   - Only keep the NS delegation records
+   - Some registrars maintain their own DNS records even after delegation
+
+3. **Wait for Propagation:**
+   - DNS changes can take 24-48 hours to fully propagate
+   - Use tools like `dig` or online DNS checkers to monitor progress
+
+
+
+
+
+If delegation still doesn't work, follow these troubleshooting steps:
+
+1. **Check DNS Propagation:**
+
+   ```bash
+   # Check NS records from different locations
+   dig NS your-domain.com @8.8.8.8
+   dig NS your-domain.com @1.1.1.1
+
+   # Check if Route53 responds
+   dig A your-domain.com @ns-123.awsdns-12.com
+   ```
+
+2. **Verify with Online Tools:**
+
+   - Use https://dnschecker.org to check global propagation
+   - Check NS records specifically: `https://dnschecker.org/#NS/your-domain.com`
+   - Look for inconsistencies between different regions
+
+3. **Common Issues and Solutions:**
+   - **Partial delegation**: Ensure all 4 NS records are configured
+   - **TTL conflicts**: Old DNS records may be cached (wait for TTL expiry)
+   - **Registrar DNS interference**: Some registrars maintain parallel DNS records
+
+
+
+
+
+SSL certificate issues often accompany DNS delegation problems:
+
+1. **ACM Certificate Validation:**
+
+   - Ensure the CNAME record for domain validation is in Route53
+   - The record name is generated by ACM (e.g. `_<hash>.your-domain.com`); copy it exactly from the ACM console
+   - Value should match exactly what ACM provides
+
+2. **Certificate Status Check:**
+
+   ```bash
+   # Check certificate validation status
+   aws acm describe-certificate --certificate-arn your-cert-arn
+   ```
+
+3. **Re-request Certificate if Needed:**
+   - If validation fails repeatedly, delete and re-request the certificate
+   - Ensure DNS delegation is working before requesting
+
+
+
+
+
+For SleakOps deployments, ensure proper domain configuration:
+
+1. 
**Load Balancer Configuration:** + + - Verify the A record points to the correct Load Balancer + - Use Alias records when possible instead of IP addresses + - Check that the Load Balancer is in the same region as Route53 + +2. **SleakOps Domain Settings:** + + - Update your SleakOps project configuration to use the custom domain + - Ensure SSL certificate is properly associated + - Verify ingress configuration accepts the new domain + +3. **Testing Domain Resolution:** + + ```bash + # Test domain resolution + curl -I https://your-domain.com + + # Check SSL certificate + openssl s_client -connect your-domain.com:443 -servername your-domain.com + ``` + + + + + +If DNS delegation still fails after following all steps: + +1. **Contact Registrar Support with Specific Information:** + + - Provide the 4 AWS Route53 name servers + - Explain that you need complete DNS delegation (not just forwarding) + - Ask them to verify no conflicting DNS records exist + - Request they check their DNS servers are not overriding delegation + +2. **Common Registrar Issues:** + + - Some registrars maintain "parking" DNS records + - Incomplete delegation (only 2 NS records instead of 4) + - DNS forwarding instead of true delegation + - Cached old DNS records on their servers + +3. **Documentation to Provide:** + - Screenshot of Route53 hosted zone + - Results from DNS checker tools + - Specific error messages or symptoms + +### Escalation Process: + +If the registrar cannot resolve the issue: + +1. **Consider Alternative Approaches:** + + - Transfer the domain to a more AWS-friendly registrar (Route53, Cloudflare, etc.) + - Use CNAME delegation for subdomains instead of NS delegation + - Implement DNS proxying through services like Cloudflare + +2. 
**Alternative Solution - CNAME Approach:**
+
+   If NS delegation fails, consider using CNAME records for subdomains:
+
+   ```
+   # At your registrar, create:
+   www CNAME your-app.sleakops.com
+   api CNAME your-api.sleakops.com
+   ```
+
+   This approach requires each subdomain to be individually configured but avoids NS delegation issues.
+
+
+
+
+
+### DNS Delegation Testing
+
+Once delegation is configured, perform comprehensive testing:
+
+1. **DNS Resolution Testing:**
+
+   ```bash
+   # Test from multiple DNS servers
+   nslookup ordenapp.com.ar 8.8.8.8
+   nslookup ordenapp.com.ar 1.1.1.1
+   nslookup ordenapp.com.ar 208.67.222.222
+
+   # Test NS record delegation
+   dig NS ordenapp.com.ar @8.8.8.8
+
+   # Test specific records
+   dig A www.ordenapp.com.ar @8.8.8.8
+   dig CNAME api.ordenapp.com.ar @8.8.8.8
+   ```
+
+2. **SSL Certificate Validation:**
+
+   ```bash
+   # Test SSL certificate
+   echo | openssl s_client -servername ordenapp.com.ar -connect ordenapp.com.ar:443 2>/dev/null | openssl x509 -noout -dates
+
+   # Check certificate chain
+   curl -I https://ordenapp.com.ar
+   ```
+
+3. **Web-based Testing Tools:**
+
+   - Use https://dnschecker.org/ for global DNS propagation
+   - Use https://www.whatsmydns.net/ for worldwide DNS resolution
+   - Use https://www.ssllabs.com/ssltest/ for SSL configuration testing
+
+### Monitoring and Alerting
+
+Set up monitoring to detect DNS delegation issues:
+
+1. **CloudWatch Alarms:**
+
+   ```yaml
+   # Route53 Health Check (CloudFormation); settings nest under HealthCheckConfig
+   Type: AWS::Route53::HealthCheck
+   Properties:
+     HealthCheckConfig:
+       Type: HTTPS
+       ResourcePath: /health
+       FullyQualifiedDomainName: ordenapp.com.ar
+       Port: 443
+       RequestInterval: 30
+       FailureThreshold: 3
+   ```
+
+2. **External Monitoring:**
+
+   - Set up monitoring with services like Pingdom, UptimeRobot, or DataDog
+   - Monitor both DNS resolution and HTTP response times
+   - Alert on certificate expiration
+
+3. 
**Automated DNS Validation:** + + ```bash + #!/bin/bash + # DNS delegation monitoring script + + DOMAIN="ordenapp.com.ar" + EXPECTED_NS="ns-1234.awsdns-01.com" + + # Check if NS delegation is working + CURRENT_NS=$(dig +short NS $DOMAIN @8.8.8.8 | head -1) + + if [[ "$CURRENT_NS" == *"awsdns"* ]]; then + echo "DNS delegation working correctly" + exit 0 + else + echo "DNS delegation failed - NS: $CURRENT_NS" + # Send alert notification + exit 1 + fi + ``` + +### Best Practices for DNS Management + +1. **Documentation:** + + - Keep records of all DNS changes with timestamps + - Document delegation contact information for domain registrar + - Maintain backup of DNS configurations + +2. **Change Management:** + + - Test DNS changes in staging environments first + - Implement DNS changes during low-traffic periods + - Have rollback procedures documented + +3. **Security Considerations:** + - Enable DNSSEC if supported by registrar + - Use DNS monitoring to detect hijacking attempts + - Regularly audit DNS records for unauthorized changes + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/dns-delegation-ssl-certificate-validation.mdx b/docs/troubleshooting/dns-delegation-ssl-certificate-validation.mdx new file mode 100644 index 000000000..75929f4f5 --- /dev/null +++ b/docs/troubleshooting/dns-delegation-ssl-certificate-validation.mdx @@ -0,0 +1,176 @@ +--- +sidebar_position: 3 +title: "DNS Delegation and SSL Certificate Validation Issues" +description: "Troubleshooting DNS delegation and AWS SSL certificate validation delays" +date: "2024-09-10" +category: "provider" +tags: ["dns", "ssl", "aws", "certificate", "delegation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Delegation and SSL Certificate Validation Issues + +**Date:** September 10, 2024 +**Category:** Provider +**Tags:** DNS, SSL, AWS, Certificate, Delegation + 
+## Problem Description + +**Context:** After completing DNS delegation setup, domains still appear as not delegated and SSL certificates are not being validated by AWS, despite DNS propagation being expected to complete. + +**Observed Symptoms:** + +- Domains show as not delegated in SleakOps dashboard +- SSL certificates remain unvalidated by AWS +- DNS propagation appears incomplete despite sufficient time passing +- Certificate validation process is stuck in pending state + +**Relevant Configuration:** + +- DNS delegation: Recently configured +- SSL certificates: AWS-managed certificates +- Provider: AWS +- Domain validation method: DNS validation + +**Error Conditions:** + +- Issue persists after expected DNS propagation time +- Certificates remain in pending validation state +- Domain delegation status not updating in platform + +## Detailed Solution + + + +DNS propagation can take up to 48 hours to complete globally. To check the current status: + +1. **Use online DNS propagation tools:** + + - Visit https://www.whatsmydns.net/ + - Enter your domain name + - Check if NS records are propagated globally + +2. **Command line verification:** + + ```bash + # Check NS records + dig NS yourdomain.com + + # Check from different DNS servers + dig @8.8.8.8 NS yourdomain.com + dig @1.1.1.1 NS yourdomain.com + ``` + +3. **Expected results:** + - All queries should return the same NS records + - Records should point to AWS Route53 name servers + + + + + +AWS Certificate Manager (ACM) validation can take additional time after DNS propagation: + +1. **Validation timeline:** + + - DNS propagation: Up to 48 hours + - AWS validation: Additional 24-72 hours after propagation + - Total time: Up to 5 days in some cases + +2. **Check certificate status in AWS Console:** + + ```bash + # Using AWS CLI + aws acm list-certificates --region us-east-1 + aws acm describe-certificate --certificate-arn your-cert-arn + ``` + +3. 
**Validation record verification:** + - Ensure CNAME validation records are present + - Check that records match exactly what AWS requires + - Verify no conflicting DNS records exist + + + + + +If delegation appears incomplete after 24-48 hours: + +1. **Verify nameserver configuration:** + + ```bash + # Check current nameservers + whois yourdomain.com | grep "Name Server" + ``` + +2. **Common issues to check:** + + - Nameservers not updated at domain registrar + - Typos in nameserver entries + - Old DNS records cached locally + - Conflicting A/CNAME records + +3. **SleakOps platform refresh:** + - Navigate to your domain settings + - Click "Refresh Status" or "Check Delegation" + - Wait for platform to re-verify delegation status + + + + + +Consider regenerating certificates if: + +1. **Validation has been pending for more than 5 days** +2. **DNS records were modified during validation process** +3. **Certificate shows validation errors in AWS Console** + +**Steps to regenerate:** + +1. In SleakOps dashboard: + + - Go to SSL Certificates section + - Select the problematic certificate + - Click "Regenerate Certificate" + +2. **Manual regeneration via AWS:** + + ```bash + # Delete old certificate (if not in use) + aws acm delete-certificate --certificate-arn old-cert-arn + + # Request new certificate + aws acm request-certificate \ + --domain-name yourdomain.com \ + --validation-method DNS + ``` + + + + + +**Set up monitoring to track progress:** + +1. **AWS CloudWatch events for certificate status changes** +2. **Regular DNS propagation checks** +3. 
**SleakOps platform notifications**
+
+**Expected timeline:**
+
+- Hour 0-24: DNS propagation begins
+- Hour 24-48: DNS fully propagated globally
+- Hour 48-120: AWS certificate validation completes
+- Hour 120+: Consider regeneration if still pending
+
+**Signs of successful completion:**
+
+- Domain shows as "Delegated" in SleakOps
+- Certificate status changes to "Issued" in AWS
+- HTTPS access works for your domain
+
+
+
+---
+
+_This FAQ was automatically generated on September 10, 2024 based on a real user query._
diff --git a/docs/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx b/docs/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx
new file mode 100644
index 000000000..0e7167791
--- /dev/null
+++ b/docs/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx
@@ -0,0 +1,221 @@
+---
+sidebar_position: 3
+title: "DNS Domain Delegation Issues Between CloudFlare and Route53"
+description: "Resolving domain delegation and validation issues when using external DNS providers with SleakOps"
+date: "2024-12-21"
+category: "provider"
+tags: ["dns", "route53", "cloudflare", "domain-delegation", "aws"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# DNS Domain Delegation Issues Between CloudFlare and Route53
+
+**Date:** December 21, 2024
+**Category:** Provider
+**Tags:** DNS, Route53, CloudFlare, Domain Delegation, AWS
+
+## Problem Description
+
+**Context:** Users experience domain validation issues when trying to use SleakOps with domains managed by external DNS providers like CloudFlare, while SleakOps expects domains to be delegated to AWS Route53. 
+ +**Observed Symptoms:** + +- Domain validation fails for main domains managed in CloudFlare +- Subdomain aliases cannot be validated (e.g., teams.simplee.cl) +- SSL certificate creation fails with validation errors +- Inconsistent behavior where some subdomains work while others don't +- NS records mismatch between what SleakOps expects and actual delegation + +**Relevant Configuration:** + +- Main domain: Managed in CloudFlare +- DNS provider: External (CloudFlare) vs Expected (Route53) +- SSL validation: Requires proper domain delegation +- SleakOps validation: Checks delegation once during setup + +**Error Conditions:** + +- Domain validation fails when NS records don't point to Route53 +- SSL certificate validation errors occur +- Subdomain creation blocked due to delegation issues +- Mixed DNS management causes inconsistent behavior + +## Detailed Solution + + + +**How SleakOps manages DNS:** + +1. **Centralized Management**: SleakOps centralizes all DNS management in your AWS accounts through Route53 +2. **Automatic Service Management**: Services deployed with SleakOps manage their DNS records automatically +3. **Manual Configuration**: Additional rules (email validation, external services) must be configured manually in Route53 +4. **Single Validation**: SleakOps validates domain delegation only once during initial setup + +**Where to view generated DNS records:** + +- All DNS records generated by SleakOps are visible in the platform +- Records include all environments, executions, and aliases you create +- Records are automatically managed for SleakOps services + + + + + +**For proper functionality, domains must be:** + +1. **Fully delegated to Route53**: Main domain NS records point to Route53 nameservers +2. 
**Subdomains of SleakOps-managed domains**: Created as subdomains of already delegated domains + +**Validation process:** + +```bash +# Check current NS records +dig NS simplee.cl + +# Compare with Route53 nameservers +# Should match the NS records shown in SleakOps platform +``` + +**Common delegation issues:** + +- NS records in domain registrar don't match Route53 nameservers +- Partial delegation (some subdomains work, others don't) +- CloudFlare proxy interfering with validation + + + + + +**Complete migration steps:** + +1. **Export existing DNS records from CloudFlare** + + - Download all existing DNS records + - Document any special configurations + +2. **Create Route53 hosted zone** + + - SleakOps creates this automatically when you add a domain + - Note the provided nameservers + +3. **Update domain registrar** + + ``` + # Change NS records at your domain registrar to: + ns-xxx.awsdns-xx.com + ns-xxx.awsdns-xx.co.uk + ns-xxx.awsdns-xx.net + ns-xxx.awsdns-xx.org + ``` + +4. **Recreate necessary records in Route53** + + - Manually add any custom DNS records + - Configure email validation records + - Set up external service records + +5. **Wait for propagation** + - DNS changes can take up to 48 hours to propagate + - Verify with `dig` or online DNS checkers + + + + + +**SSL certificate validation requirements:** + +1. **Remove CloudFlare proxy**: Disable orange cloud (proxy) in CloudFlare for validation records +2. **Proper delegation**: Ensure domain is properly delegated to Route53 +3. 
**DNS propagation**: Wait for DNS changes to propagate globally + +**Troubleshooting SSL validation:** + +```bash +# Check if domain resolves to correct nameservers +dig NS your-domain.com + +# Verify TXT records for SSL validation +dig TXT _acme-challenge.your-domain.com + +# Test domain resolution +nslookup your-domain.com +``` + +**Common fixes:** + +- Disable CloudFlare proxy during certificate validation +- Ensure all NS records point to Route53 +- Wait for DNS propagation (up to 48 hours) + + + + + +**If you need to keep some DNS in CloudFlare:** + +1. **Subdomain delegation**: Delegate only specific subdomains to Route53 + + ``` + # In CloudFlare, create NS records for subdomains: + app.yourdomain.com NS ns-xxx.awsdns-xx.com + app.yourdomain.com NS ns-xxx.awsdns-xx.co.uk + app.yourdomain.com NS ns-xxx.awsdns-xx.net + app.yourdomain.com NS ns-xxx.awsdns-xx.org + ``` + +2. **Use SleakOps subdomains**: Create all SleakOps services under a dedicated subdomain + - Example: `*.apps.yourdomain.com` managed by Route53 + - Main domain remains in CloudFlare + +**Potential issues with mixed management:** + +- Validation bugs may occur with frequent domain changes +- Inconsistent behavior between different subdomains +- SSL certificate validation complications + + + + + +**Common validation problems:** + +1. **NS record mismatch** + + ```bash + # Check what NS records are actually set + dig +short NS yourdomain.com + + # Compare with SleakOps expected NS records + # (visible in SleakOps platform) + ``` + +2. **Propagation delays** + - Use multiple DNS checkers: whatsmydns.net + - Wait 24-48 hours for full propagation + - Test from different geographic locations + +3. 
**CloudFlare interference** + - Disable proxy (orange cloud) for validation records + - Temporarily pause CloudFlare if needed + - Check CloudFlare DNS settings + +**Validation debugging:** + +```bash +# Test DNS resolution from different servers +dig @8.8.8.8 yourdomain.com +dig @1.1.1.1 yourdomain.com + +# Check SOA record +dig SOA yourdomain.com + +# Verify delegation chain +dig +trace yourdomain.com +``` + + + +--- + +_This FAQ section was automatically generated on December 21, 2024, based on a real user inquiry._ diff --git a/docs/troubleshooting/dns-migration-donweb-to-aws.mdx b/docs/troubleshooting/dns-migration-donweb-to-aws.mdx new file mode 100644 index 000000000..6eb01b39e --- /dev/null +++ b/docs/troubleshooting/dns-migration-donweb-to-aws.mdx @@ -0,0 +1,283 @@ +--- +sidebar_position: 3 +title: "DNS Migration from DonWeb to AWS" +description: "Complete guide for migrating DNS records from DonWeb to AWS Route 53" +date: "2024-01-15" +category: "provider" +tags: ["aws", "dns", "route53", "migration", "donweb"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Migration from DonWeb to AWS + +**Date:** January 15, 2024 +**Category:** Provider +**Tags:** AWS, DNS, Route 53, Migration, DonWeb + +## Problem Description + +**Context:** User needs to migrate DNS records from DonWeb hosting provider to AWS Route 53, including WordPress landing page and corporate email services. 
+ +**Observed Symptoms:** + +- Current DNS managed by DonWeb hosting provider +- WordPress landing page hosted on DonWeb +- Corporate email services hosted on DonWeb +- Need to consolidate all services in AWS +- Uncertainty about where to configure DNS records in AWS + +**Relevant Configuration:** + +- Current provider: DonWeb +- Services: WordPress website + Corporate email +- Target platform: AWS +- DNS service needed: AWS Route 53 + +**Error Conditions:** + +- Risk of service interruption during migration +- Need to maintain email service continuity +- WordPress site must remain accessible + +## Detailed Solution + + + +AWS Route 53 is the DNS service where you'll manage your domain records: + +1. **Access Route 53 Console**: + + - Go to AWS Console → Route 53 + - Click on "Hosted zones" + - Click "Create hosted zone" + +2. **Create Hosted Zone**: + + - Enter your domain name (e.g., `yourcompany.com`) + - Select "Public hosted zone" + - Click "Create hosted zone" + +3. **Note the Name Servers**: + - AWS will provide 4 name servers + - Save these for later use with your domain registrar + + + + + +Before migrating, you need to export your current DNS configuration: + +1. **Access DonWeb Control Panel**: + + - Log into your DonWeb account + - Navigate to DNS management section + - Look for "DNS Zone" or "DNS Records" + +2. **Document Current Records**: + Create a list of all current DNS records: + + ``` + A Records: + - @ (root domain) → IP_ADDRESS + - www → IP_ADDRESS + + MX Records (Email): + - @ → mail.donweb.com (priority 10) + + CNAME Records: + - Any subdomains pointing to other services + + TXT Records: + - SPF records for email + - Any verification records + ``` + +3. **Export Options**: + - Look for "Export" or "Download" option + - Some providers offer zone file export + - If not available, manually document all records + + + + + +For WordPress migration to AWS, you have several options: + +**Option 1: AWS Lightsail (Recommended for simple sites)** + +1. 
Create AWS Lightsail instance with WordPress +2. Migrate your WordPress files and database +3. Update DNS A record to point to new IP + +**Option 2: EC2 with RDS** + +1. Set up EC2 instance for web server +2. Set up RDS for MySQL database +3. Migrate WordPress files and database +4. Configure security groups and load balancer + +**Option 3: AWS App Runner or ECS** + +1. Containerize your WordPress application +2. Deploy using App Runner or ECS +3. Use RDS for database + +```bash +# Example: Creating Lightsail WordPress instance +aws lightsail create-instances \ + --instance-names "wordpress-site" \ + --availability-zone "us-east-1a" \ + --blueprint-id "wordpress" \ + --bundle-id "nano_2_0" +``` + + + + + +For corporate email migration, consider these AWS options: + +**Option 1: Amazon WorkMail** + +1. Set up Amazon WorkMail organization +2. Create user accounts +3. Configure MX records to point to WorkMail +4. Migrate existing emails (if needed) + +**Option 2: Third-party email with AWS DNS** + +1. Choose email provider (Google Workspace, Microsoft 365) +2. Configure MX records in Route 53 +3. Update SPF/DKIM records + +**MX Record Configuration Example:** + +``` +Type: MX +Name: @ (or leave blank) +Value: 10 inbound-smtp.us-east-1.amazonaws.com (for WorkMail) +TTL: 300 +``` + +**SPF Record Example:** + +``` +Type: TXT +Name: @ (or leave blank) +Value: "v=spf1 include:amazonses.com ~all" +TTL: 300 +``` + + + + + +Once your services are ready in AWS, configure the DNS records: + +1. **A Records for Website**: + + ``` + Type: A + Name: @ (root domain) + Value: YOUR_AWS_IP_ADDRESS + TTL: 300 + + Type: A + Name: www + Value: YOUR_AWS_IP_ADDRESS + TTL: 300 + ``` + +2. **MX Records for Email**: + + ``` + Type: MX + Name: @ + Value: 10 your-mail-server.amazonaws.com + TTL: 300 + ``` + +3. **CNAME Records** (if needed): + + ``` + Type: CNAME + Name: subdomain + Value: target.domain.com + TTL: 300 + ``` + +4. 
**TXT Records for Email Authentication**: + ``` + Type: TXT + Name: @ + Value: "v=spf1 include:amazonses.com ~all" + TTL: 300 + ``` + + + + + +**Phase 1: Preparation** + +1. Export all DNS records from DonWeb +2. Set up AWS services (Lightsail/EC2 for WordPress, WorkMail for email) +3. Create Route 53 hosted zone +4. Test new services before DNS switch + +**Phase 2: Migration** + +1. **Lower TTL values** (24-48 hours before migration): + - Change TTL to 300 seconds for faster propagation + - This allows quick rollback if needed + +2. **Update nameservers at domain registrar**: + - Change NS records to Route 53 nameservers + - Wait for propagation (up to 48 hours) + +3. **Monitor services**: + - Check website accessibility + - Test email functionality + - Monitor for any issues + +**Phase 3: Post-Migration** + +1. **Verify all services are working** +2. **Update TTL values back to normal** (3600 seconds) +3. **Clean up old DonWeb services** (after confirming everything works) +4. **Document the new configuration** for future reference + + + + + +**Issue: Email stops working after migration** + +- Check MX records are correctly configured +- Verify SPF/DKIM records are in place +- Test email sending and receiving + +**Issue: Website shows SSL certificate errors** + +- Ensure SSL certificate is properly configured in AWS +- Check that domain validation is complete +- Verify certificate covers both www and non-www versions + +**Issue: Some services still point to old provider** + +- Check all DNS records are migrated +- Look for hardcoded IP addresses in applications +- Verify CDN configurations if applicable + +**Issue: Slow DNS propagation** + +- Use multiple DNS checkers to verify propagation +- Clear local DNS cache +- Wait up to 48 hours for full global propagation + + + +--- + +_This FAQ section was automatically generated on January 15, 2024, based on a real user inquiry._ diff --git a/docs/troubleshooting/dns-propagation-public-deployment.mdx 
b/docs/troubleshooting/dns-propagation-public-deployment.mdx new file mode 100644 index 000000000..c42a6c916 --- /dev/null +++ b/docs/troubleshooting/dns-propagation-public-deployment.mdx @@ -0,0 +1,202 @@ +--- +sidebar_position: 3 +title: "DNS Resolution Issues for Public Deployments" +description: "Troubleshooting DNS propagation and resolution problems for public deployments in SleakOps" +date: "2024-12-10" +category: "project" +tags: ["dns", "deployment", "public", "domain", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Resolution Issues for Public Deployments + +**Date:** December 10, 2024 +**Category:** Project +**Tags:** DNS, Deployment, Public, Domain, Troubleshooting + +## Problem Description + +**Context:** User creates a public deployment in SleakOps but the assigned domain is not resolving properly, preventing access to the deployed application. + +**Observed Symptoms:** + +- Public deployment created successfully in SleakOps +- Domain URL is generated (e.g., https://site.develop.velo.la) +- DNS resolution fails when accessing the URL +- Application is not accessible via the public domain + +**Relevant Configuration:** + +- Deployment type: Public deployment +- Domain format: `https://[app].[environment].[domain].la` +- DNS provider: Managed by SleakOps +- SSL/TLS: HTTPS enabled + +**Error Conditions:** + +- DNS queries return no results or incorrect IP addresses +- Browser shows "This site can't be reached" or similar errors +- DNS propagation may still be in progress +- Issue occurs immediately after deployment creation + +## Detailed Solution + + + +DNS changes can take time to propagate across the internet: + +- **Local DNS cache**: 5-15 minutes +- **ISP DNS servers**: 30 minutes to 2 hours +- **Global propagation**: Up to 24-48 hours (rare) +- **Typical resolution time**: 15-30 minutes + +This is normal behavior and not a platform issue. 
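
Rather than refreshing a browser while you wait, the propagation check can be scripted. Below is a minimal sketch (Python, standard library only) that polls the system resolver until the name resolves or a deadline passes; the hostname in the commented example is the placeholder domain from this guide and should be replaced with your own deployment's URL.

```python
import socket
import time

def wait_for_dns(hostname, timeout=1800, interval=30):
    """Poll DNS until `hostname` resolves or `timeout` seconds elapse.

    Returns the first resolved IP address, or None on timeout. Uses the
    system resolver, so the result reflects what *this* machine currently
    sees; other networks may still be waiting on propagation.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            infos = socket.getaddrinfo(hostname, None)
            return infos[0][4][0]  # address from the first result tuple
        except socket.gaierror:
            time.sleep(interval)
    return None

# Example (placeholder domain from this guide):
# wait_for_dns("site.develop.velo.la")
```

Run it from the same machine you will browse from; once it returns an address, any remaining failure is more likely a cache or application issue than propagation.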
+ + + + + +Use these tools to verify DNS propagation: + +**Online DNS checkers:** + +```bash +# Check from multiple locations +https://dnschecker.org/ +https://www.whatsmydns.net/ +``` + +**Command line tools:** + +```bash +# Check DNS resolution +nslookup site.develop.velo.la + +# Check from different DNS servers +nslookup site.develop.velo.la 8.8.8.8 +nslookup site.develop.velo.la 1.1.1.1 +``` + +**Expected result:** + +``` +Name: site.develop.velo.la +Address: [IP_ADDRESS] +``` + + + + + +If DNS has propagated but you still can't access the site: + +**Windows:** + +```cmd +ipconfig /flushdns +``` + +**macOS:** + +```bash +sudo dscacheutil -flushcache +sudo killall -HUP mDNSResponder +``` + +**Linux:** + +```bash +sudo systemctl restart systemd-resolved +# or +sudo service nscd restart +``` + +**Browser cache:** + +- Chrome: Settings → Privacy → Clear browsing data → Cached images and files +- Firefox: Settings → Privacy & Security → Clear Data + + + + + +Ensure your deployment is actually running: + +1. **Check deployment status in SleakOps dashboard:** + + - Go to your project + - Verify deployment shows as "Running" + - Check for any error messages + +2. **Verify application logs:** + + ```bash + # Check if application is starting correctly + kubectl logs -f deployment/[your-app-name] + ``` + +3. **Check service configuration:** + - Ensure the service is exposed correctly + - Verify port configuration matches your application + + + + + +While waiting for DNS propagation: + +**1. Use direct IP access:** + +```bash +# Get the load balancer IP +kubectl get services +# Access via IP: http://[EXTERNAL-IP] +``` + +**2. Modify local hosts file:** + +```bash +# Add to /etc/hosts (Linux/Mac) or C:\Windows\System32\drivers\etc\hosts (Windows) +[EXTERNAL-IP] site.develop.velo.la +``` + +**3. 
Use kubectl port-forward:** + +```bash +kubectl port-forward service/[service-name] 8080:80 +# Access via http://localhost:8080 +``` + + + + + +If DNS issues persist after 2 hours: + +**1. Check domain configuration:** + +- Verify the domain is correctly configured in SleakOps +- Ensure no typos in the domain name +- Check if custom domain settings are correct + +**2. Verify SSL certificate:** + +```bash +# Check SSL certificate status +openssl s_client -connect site.develop.velo.la:443 -servername site.develop.velo.la +``` + +**3. Contact support:** +If the issue persists beyond normal propagation times, contact SleakOps support with: + +- Deployment name and project +- Expected domain URL +- DNS checker results +- Error messages or screenshots + + + +--- + +_This FAQ was automatically generated on December 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/dns-resolution-failure-mysql-redis.mdx b/docs/troubleshooting/dns-resolution-failure-mysql-redis.mdx new file mode 100644 index 000000000..3fa789639 --- /dev/null +++ b/docs/troubleshooting/dns-resolution-failure-mysql-redis.mdx @@ -0,0 +1,292 @@ +--- +sidebar_position: 3 +title: "DNS Resolution Failure for MySQL and Redis Connections" +description: "Solution for DNS resolution failures causing MySQL and Redis connection errors" +date: "2024-12-19" +category: "dependency" +tags: ["dns", "mysql", "redis", "connection", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DNS Resolution Failure for MySQL and Redis Connections + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** DNS, MySQL, Redis, Connection, Troubleshooting + +## Problem Description + +**Context:** Application experiencing slow performance due to DNS resolution failures when attempting to connect to MySQL and Redis services in a Kubernetes environment. 

**Observed Symptoms:**

- Application running very slowly
- Repeated MySQL connection failures
- Redis connection failures
- "Temporary failure in name resolution" errors
- "No alive nodes found in your cluster" messages
- Connection timeouts to AWS ElastiCache Redis instance

**Relevant Configuration:**

- MySQL connection: Using hostname resolution
- Redis connection: `redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com:6379`
- Environment: AWS production cluster
- Error pattern: `php_network_getaddresses: getaddrinfo failed`

**Error Conditions:**

- DNS resolution fails intermittently
- Errors occur during high traffic periods
- Both MySQL and Redis affected simultaneously
- Application becomes unresponsive due to connection timeouts

## Detailed Solution

The "Temporary failure in name resolution" error indicates DNS resolution problems. This can happen due to:

1. **DNS server overload**: Too many concurrent DNS queries
2. **Network connectivity issues**: Problems reaching DNS servers
3. **DNS cache issues**: Stale or corrupted DNS cache
4. **CoreDNS problems**: Issues with the Kubernetes DNS service

To diagnose:

```bash
# Check DNS resolution from within a pod
kubectl exec -it [pod-name] -- nslookup mysql-hostname
kubectl exec -it [pod-name] -- nslookup redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
```

Increase CoreDNS replicas to handle more DNS queries:

```bash
# Check the current CoreDNS deployment
kubectl get deployment coredns -n kube-system

# Scale CoreDNS replicas
kubectl scale deployment coredns --replicas=3 -n kube-system

# Verify scaling
kubectl get pods -n kube-system -l k8s-app=kube-dns
```

For high-traffic applications, consider 3-5 CoreDNS replicas. 
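
To check whether scaling actually reduced failures, you can sample resolution success and latency from inside an application pod. This is a minimal sketch (Python, standard library only); the hostname in the commented example is a placeholder for one of your own in-cluster services.

```python
import socket
import statistics
import time

def sample_dns(hostname, attempts=20):
    """Resolve `hostname` repeatedly; report the failure count and the
    median latency in milliseconds (None if nothing resolved)."""
    latencies = []
    failures = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, None)
            latencies.append(time.perf_counter() - start)
        except socket.gaierror:
            failures += 1
    return {
        "attempts": attempts,
        "failures": failures,
        "median_ms": round(statistics.median(latencies) * 1000, 2)
        if latencies else None,
    }

# Example (placeholder service name):
# print(sample_dns("mysql-service.default.svc.cluster.local"))
```

Run it before and after scaling CoreDNS; a falling failure count and a median latency of a few milliseconds or less indicate the resolver is keeping up.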
+ + + + + +Implement DNS caching at the application level to reduce DNS queries: + +**For PHP applications:** + +```php +// Add to your database configuration +'mysql' => [ + 'host' => env('DB_HOST', 'localhost'), + 'options' => [ + PDO::ATTR_PERSISTENT => true, + PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true, + ], + // Enable connection pooling + 'pool' => [ + 'min_connections' => 5, + 'max_connections' => 20, + ] +], + +// For Redis connections +'redis' => [ + 'client' => 'predis', + 'options' => [ + 'cluster' => env('REDIS_CLUSTER', 'redis'), + 'prefix' => env('REDIS_PREFIX', Str::slug(env('APP_NAME', 'laravel'), '_').'_database_'), + // Add connection pooling + 'persistent' => true, + ], +] +``` + + + + + +Configure DNS settings in your deployment: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: your-app +spec: + template: + spec: + # Configure DNS policy + dnsPolicy: ClusterFirst + dnsConfig: + options: + # Reduce DNS timeout + - name: timeout + value: "1" + # Increase attempts + - name: attempts + value: "3" + # Enable DNS caching + - name: use-vc + - name: ndots + value: "2" + containers: + - name: app + image: your-app:latest + # Add DNS-related environment variables + env: + - name: DB_HOST + value: "mysql-service.default.svc.cluster.local" + - name: REDIS_HOST + value: "redis-service.default.svc.cluster.local" +``` + + + + + +Create Kubernetes services to avoid external DNS resolution: + +**For MySQL:** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: mysql-external +spec: + type: ExternalName + externalName: your-mysql-hostname.amazonaws.com + ports: + - port: 3306 + targetPort: 3306 +``` + +**For Redis:** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: redis-external +spec: + type: ExternalName + externalName: redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com + ports: + - port: 6379 + targetPort: 6379 +``` + +Then update your application configuration: + +```bash +# Use service names instead of 
external hostnames +DB_HOST=mysql-external.default.svc.cluster.local +REDIS_HOST=redis-external.default.svc.cluster.local +``` + + + + + +Add monitoring to detect DNS issues early: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: dns-monitor +data: + monitor.sh: | + #!/bin/bash + while true; do + # Test DNS resolution + if ! nslookup mysql-service.default.svc.cluster.local > /dev/null 2>&1; then + echo "$(date): DNS resolution failed for MySQL" + fi + if ! nslookup redis-service.default.svc.cluster.local > /dev/null 2>&1; then + echo "$(date): DNS resolution failed for Redis" + fi + sleep 30 + done +``` + +Deploy as a sidecar container or separate monitoring pod. + +**Monitoring Deployment:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: dns-monitor +spec: + replicas: 1 + selector: + matchLabels: + app: dns-monitor + template: + metadata: + labels: + app: dns-monitor + spec: + containers: + - name: monitor + image: alpine:latest + command: ["/bin/sh"] + args: ["/scripts/monitor.sh"] + volumeMounts: + - name: monitor-script + mountPath: /scripts + resources: + requests: + memory: "64Mi" + cpu: "50m" + limits: + memory: "128Mi" + cpu: "100m" + volumes: + - name: monitor-script + configMap: + name: dns-monitor + defaultMode: 0755 +``` + +**Alert Configuration:** + +```yaml +# Add to Prometheus AlertManager +groups: + - name: dns-monitoring + rules: + - alert: DNSResolutionFailure + expr: increase(dns_resolution_failures_total[5m]) > 0 + for: 2m + labels: + severity: warning + annotations: + summary: "DNS resolution failures detected" + description: "Application experiencing DNS resolution issues for {{ $labels.service }}" +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/docker-build-cache-issues.mdx b/docs/troubleshooting/docker-build-cache-issues.mdx new file mode 100644 index 000000000..8ec1e3625 --- /dev/null +++ 
b/docs/troubleshooting/docker-build-cache-issues.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Docker Build Cache Issues in Production" +description: "Solution for Docker build cache preventing code changes from being deployed" +date: "2024-01-15" +category: "project" +tags: ["docker", "cache", "deployment", "build", "ecr"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Docker Build Cache Issues in Production + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Docker, Cache, Deployment, Build, ECR + +## Problem Description + +**Context:** User made code changes that work correctly in local development environment, but after deploying to production, the old behavior persists despite the code being correctly updated in the repository. + +**Observed Symptoms:** + +- Code changes work correctly in local development +- After deployment, production still shows old behavior/errors +- Source code files are correctly updated in production +- The issue appears to be related to Docker build caching + +**Relevant Configuration:** + +- Platform: SleakOps with Docker builds +- Environment: Production deployment +- Build system: Docker with layer caching +- Registry: AWS ECR for image storage + +**Error Conditions:** + +- Problem occurs after code deployment +- Local environment works correctly +- Production deployment doesn't reflect code changes +- Issue persists across multiple deployment attempts + +## Detailed Solution + + + +The most common cause is Docker build cache not detecting changes in your application code. 
To force cache invalidation, add this line to your Dockerfile before copying your application files:

```dockerfile
# Add this before COPY commands
# Cache invalidator — bump the version string to bust the cache
RUN echo "Frontend cache bust: v2" > /dev/null

# Then your normal COPY commands
COPY ./ClientApp /app/ClientApp
```

Docker caches each instruction by its exact text, so changing the version string (`v2` → `v3`, and so on) invalidates this layer and forces every subsequent layer to rebuild, ensuring your code changes are included.

If cache invalidation doesn't work, you may need to clear the Docker images stored in AWS ECR:

1. **Access AWS Console**

   - Switch to your production AWS account
   - Navigate to **Amazon ECR** service

2. **Find your repository**

   - Locate the repository containing your project's Docker images
   - It will typically be named after your project

3. **Delete cached images**

   - Select all images in the repository
   - Delete them to force a complete rebuild

4. **Deploy with new commit**
   - Make a new commit (you can remove the cache invalidation line if desired)
   - Deploy the changes

To prevent this issue in the future, structure your Dockerfile to maximize cache efficiency:

```dockerfile
# Good practice: Copy dependency files first
COPY package.json package-lock.json ./
RUN npm install

# Copy application code last (changes most frequently)
COPY ./ClientApp ./ClientApp
COPY ./ServerApp ./ServerApp

# Build your application
RUN npm run build
```

This way, dependency installation is cached separately from your application code.

If the problem persists, try these additional steps:

1. **Force rebuild without cache**

   ```bash
   # If using Docker directly
   docker build --no-cache -t your-image .
   ```

2. **Check build logs**

   - Review the deployment logs in SleakOps
   - Look for "Using cache" messages that might indicate stale layers

3. **Verify file timestamps**

   - Ensure your code changes have recent timestamps
   - Check if the build process is picking up the correct files

4. 
**Test with minimal changes** + - Make a small, visible change (like adding a console.log) + - Deploy and verify the change appears in production + + + + + +To avoid this problem in the future: + +1. **Use .dockerignore properly** + + ``` + node_modules + .git + .env.local + *.log + ``` + +2. **Implement proper cache busting** + + - Use build arguments with timestamps + - Include version numbers in your builds + +3. **Monitor build processes** + + - Check deployment logs regularly + - Verify that builds are actually rebuilding changed layers + +4. **Use multi-stage builds** + + ```dockerfile + FROM node:16 AS builder + COPY package*.json ./ + RUN npm install + COPY . . + RUN npm run build + + FROM nginx:alpine + COPY --from=builder /app/dist /usr/share/nginx/html + ``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/docker-build-cache-no-cache-option.mdx b/docs/troubleshooting/docker-build-cache-no-cache-option.mdx new file mode 100644 index 000000000..5fc8d1940 --- /dev/null +++ b/docs/troubleshooting/docker-build-cache-no-cache-option.mdx @@ -0,0 +1,168 @@ +--- +sidebar_position: 3 +title: "Docker Build Cache Issues with Build Arguments" +description: "Solution for Docker build cache not updating when changing build arguments" +date: "2025-02-10" +category: "project" +tags: ["docker", "build", "cache", "arguments", "no-cache"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Docker Build Cache Issues with Build Arguments + +**Date:** February 10, 2025 +**Category:** Project +**Tags:** Docker, Build, Cache, Arguments, No-cache + +## Problem Description + +**Context:** When changing Docker build arguments in SleakOps projects, the build process may use cached layers and not reflect the new argument values in the final image, even though a new build is triggered. 

**Observed Symptoms:**

- Build arguments are changed in the project configuration
- New build is triggered successfully
- Deployment uses the same image tag
- Container still contains old files/configurations
- Build cache prevents argument changes from taking effect

**Relevant Configuration:**

- Docker build arguments (e.g., `MANIFEST_FILE_URL`)
- Build process using Docker layer caching
- SleakOps build system with custom tagging

**Error Conditions:**

- Occurs when modifying build arguments without code changes
- Docker cache reuses layers from previous builds
- New argument values don't propagate to the final image
- Problem persists until repository code is modified

## Detailed Solution

The most direct solution is to use the `--no-cache` flag when building. This forces Docker to rebuild all layers without using the cache.

**Current Status in SleakOps:**

- The platform is being prepared to support build flags like `--no-cache`
- This will be available as a manual option in the frontend
- Users will be able to decide when to force a complete rebuild

**When to use:**

- After changing build arguments
- When external resources referenced by arguments have changed
- When troubleshooting cache-related issues

Modify your Dockerfile to reduce cache interference:

```dockerfile
# Instead of copying external resources during build:
# COPY external-resource.json /app/

# Move the download process to runtime
RUN echo "#!/bin/sh" > /app/download.sh && \
    echo "curl -o /app/resource.json \$RESOURCE_URL" >> /app/download.sh && \
    chmod +x /app/download.sh

# Use environment variables at runtime
ENV RESOURCE_URL=""
# Exec-form CMD does not interpret "&&", so wrap the commands in a shell
CMD ["/bin/sh", "-c", "/app/download.sh && exec your-app"]
```

**Benefits:**

- Build arguments don't affect Docker layer caching
- External resources are fetched at runtime
- Changes to URLs don't require image rebuilds

For automated builds, you can configure your CI/CD to always use
`--no-cache` when build arguments change: + +```yaml +# Example CI/CD configuration +build: + script: + - | + if [ "$BUILD_ARGS_CHANGED" = "true" ]; then + docker build --no-cache -t $IMAGE_TAG . + else + docker build -t $IMAGE_TAG . + fi +``` + +**Considerations:** + +- Longer build times when using `--no-cache` +- Increased resource usage +- Recommended for production deployments with argument changes + + + + + +**1. Use environment variables instead of build arguments:** + +```dockerfile +# Instead of ARG +# ARG MANIFEST_FILE_URL +# Use ENV at runtime +ENV MANIFEST_FILE_URL="" +``` + +**2. Include argument values in cache-busting layers:** + +```dockerfile +ARG MANIFEST_FILE_URL +# Add a layer that changes when the argument changes +RUN echo "Cache bust: $MANIFEST_FILE_URL" > /tmp/cache-bust +RUN curl -o /app/manifest.json "$MANIFEST_FILE_URL" +``` + +**3. Use multi-stage builds:** + +```dockerfile +FROM alpine as downloader +ARG MANIFEST_FILE_URL +RUN curl -o /tmp/manifest.json "$MANIFEST_FILE_URL" + +FROM your-base-image +COPY --from=downloader /tmp/manifest.json /app/ +``` + + + + + +SleakOps is developing enhanced build control features: + +**Planned Features:** + +- `--no-cache` flag option in the frontend +- `--flush-cache` support for specific scenarios +- Automatic cache invalidation detection +- Build argument change detection + +**Timeline:** + +- Backend infrastructure is ready +- Frontend integration in development +- Manual control will be available first +- Automatic detection features planned for later releases + + + +--- + +_This FAQ was automatically generated on February 10, 2025 based on a real user query._ diff --git a/docs/troubleshooting/docker-build-environment-variables.mdx b/docs/troubleshooting/docker-build-environment-variables.mdx new file mode 100644 index 000000000..69135682c --- /dev/null +++ b/docs/troubleshooting/docker-build-environment-variables.mdx @@ -0,0 +1,194 @@ +--- +sidebar_position: 3 +title: "Docker Build Environment 
Variables Not Available" +description: "Solution for environment variables not being accessible during Docker build process" +date: "2025-03-10" +category: "project" +tags: ["docker", "dockerfile", "build", "environment-variables", "rails"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Docker Build Environment Variables Not Available + +**Date:** March 10, 2025 +**Category:** Project +**Tags:** Docker, Dockerfile, Build, Environment Variables, Rails + +## Problem Description + +**Context:** User is experiencing a Docker build failure where environment variables defined in SleakOps Docker settings are not accessible during the build process, specifically for Rails master key decryption. + +**Observed Symptoms:** + +- Build fails with "Missing encryption key to decrypt file" error +- Rails cannot find RAILS_MASTER_KEY environment variable +- Environment variables are defined in SleakOps Docker settings +- Build process cannot access the required encryption key + +**Relevant Configuration:** + +- Framework: Ruby on Rails +- Error: Missing RAILS_MASTER_KEY for SOPS decryption +- Environment variables defined in SleakOps Docker settings +- Build context: Docker container build process + +**Error Conditions:** + +- Error occurs during Docker build phase +- Environment variables are not passed to build context +- Rails credentials decryption fails +- Build process terminates with encryption key error + +## Detailed Solution + + + +Environment variables from SleakOps settings are only available at runtime, not during build. 
To use them during build, you must define them as ARG in your Dockerfile: + +```dockerfile +# Define the argument that will receive the environment variable +ARG RAILS_MASTER_KEY + +# Use the argument in your build process +RUN echo "Master key: $RAILS_MASTER_KEY" + +# Optional: Set it as environment variable for runtime +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +``` + + + + + +In SleakOps, you need to configure build arguments separately from environment variables: + +1. Go to your **Project Settings** +2. Navigate to **Docker Configuration** +3. In the **Build Arguments** section (not Environment Variables) +4. Add your build argument: + ``` + RAILS_MASTER_KEY=${RAILS_MASTER_KEY} + ``` + +This passes the environment variable value as a build argument. + + + + + +For handling secrets during build: + +```dockerfile +# Method 1: Using ARG (recommended for non-sensitive data) +ARG RAILS_MASTER_KEY +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY + +# Method 2: Using Docker secrets (for sensitive data) +# RUN --mount=type=secret,id=rails_master_key \ +# RAILS_MASTER_KEY=$(cat /run/secrets/rails_master_key) && \ +# # your build commands here + +# Method 3: Multi-stage build to avoid exposing secrets +FROM ruby:3.0 as builder +ARG RAILS_MASTER_KEY +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +# Build and decrypt here + +FROM ruby:3.0 as runtime +# Copy only necessary files, not the secrets +COPY --from=builder /app /app +``` + + + + + +For Rails applications with encrypted credentials: + +```dockerfile +# Define the master key as build argument +ARG RAILS_MASTER_KEY +ARG RAILS_ENV=production + +# Set environment variables +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +ENV RAILS_ENV=$RAILS_ENV + +# Install dependencies +RUN bundle install + +# Precompile assets (this step needs the master key) +RUN RAILS_MASTER_KEY=$RAILS_MASTER_KEY rails assets:precompile + +# Alternative: Create the key file +# RUN mkdir -p config && echo "$RAILS_MASTER_KEY" > config/master.key +``` + +Make sure your 
`config/credentials.yml.enc` file is included in your Docker context. + + + + + +If the problem persists: + +1. **Verify the ARG is defined before use:** + + ```dockerfile + ARG RAILS_MASTER_KEY + RUN echo "Key length: ${#RAILS_MASTER_KEY}" # Should not be 0 + ``` + +2. **Check build logs for the argument:** + + ```bash + docker build --build-arg RAILS_MASTER_KEY=your_key_here . + ``` + +3. **Verify in SleakOps build logs:** + + - Look for "Step X: ARG RAILS_MASTER_KEY" + - Check if the build argument is being passed + +4. **Test locally:** + ```bash + # Test the build with the same argument + docker build --build-arg RAILS_MASTER_KEY="$(cat config/master.key)" . + ``` + + + + + +**Important security notes:** + +- Build arguments are visible in Docker history (`docker history`) +- For production, consider using Docker secrets or multi-stage builds +- Never commit master keys to version control +- Use different keys for different environments + +**Recommended approach for production:** + +```dockerfile +# Use multi-stage build +FROM ruby:3.0 as builder +ARG RAILS_MASTER_KEY +WORKDIR /app +COPY . . 
+RUN bundle install +RUN RAILS_MASTER_KEY=$RAILS_MASTER_KEY rails assets:precompile + +# Production stage without secrets +FROM ruby:3.0 as production +WORKDIR /app +COPY --from=builder /app/public/assets ./public/assets +COPY --from=builder /app/vendor/bundle ./vendor/bundle +# Don't copy the master key to final image +``` + + + +--- + +_This FAQ was automatically generated on March 10, 2025 based on a real user query._ diff --git a/docs/troubleshooting/docker-daphne-logging-configuration.mdx b/docs/troubleshooting/docker-daphne-logging-configuration.mdx new file mode 100644 index 000000000..28e96cc9b --- /dev/null +++ b/docs/troubleshooting/docker-daphne-logging-configuration.mdx @@ -0,0 +1,553 @@ +--- +sidebar_position: 3 +title: "Docker Logging Configuration for Daphne Applications" +description: "How to configure Docker containers to capture and display Daphne application logs" +date: "2025-01-21" +category: "workload" +tags: ["docker", "daphne", "logging", "django", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Docker Logging Configuration for Daphne Applications + +**Date:** January 21, 2025 +**Category:** Workload +**Tags:** Docker, Daphne, Logging, Django, Troubleshooting + +## Problem Description + +**Context:** User needs to configure a Docker container running a Django application with Daphne ASGI server to properly capture and display application logs through Docker's logging system. 
+ +**Observed Symptoms:** + +- Application logs are not visible in Docker container logs +- Daphne server logs are not being captured by Docker +- Standard Docker logging parameters (`stdin_open: true`, `tty: true`) are not sufficient +- Application has logging configured but output is not reaching Docker's stdout + +**Relevant Configuration:** + +- Application: Django with Daphne ASGI server +- Container orchestration: Docker Compose +- Current parameters: `stdin_open: true` and `tty: true` +- Logger configured for Daphne in the application + +**Error Conditions:** + +- Logs are generated by the application but not visible in Docker logs +- Standard Docker logging configuration is insufficient for Daphne +- Need to redirect application logs to Docker's stdout/stderr streams + +## Detailed Solution + + + +The primary solution is to configure Daphne to output logs directly to `/dev/stdout`, which Docker can capture: + +**Method 1: Command line configuration** + +```dockerfile +# In your Dockerfile or docker-compose.yml command +CMD ["daphne", "-b", "0.0.0.0", "-p", "8000", "--access-log", "/dev/stdout", "--proxy-headers", "myproject.asgi:application"] +``` + +**Method 2: Environment variable** + +```yaml +# docker-compose.yml +services: + web: + environment: + - DAPHNE_ACCESS_LOG=/dev/stdout + - DAPHNE_ERROR_LOG=/dev/stderr +``` + + + + + +Update your Django `settings.py` to ensure logs are directed to stdout: + +```python +# settings.py +LOGGING = { + 'version': 1, + 'disable_existing_loggers': False, + 'formatters': { + 'verbose': { + 'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}', + 'style': '{', + }, + 'simple': { + 'format': '{levelname} {message}', + 'style': '{', + }, + }, + 'handlers': { + 'console': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'verbose', + }, + 'daphne': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'verbose', + }, + }, + 'root': { + 
'handlers': ['console'], + 'level': 'INFO', + }, + 'loggers': { + 'daphne': { + 'handlers': ['daphne'], + 'level': 'INFO', + 'propagate': False, + }, + 'django': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + }, +} +``` + + + + + +Here's a complete example of how to configure your `docker-compose.yml`: + +```yaml +version: "3.8" + +services: + web: + build: . + ports: + - "8000:8000" + environment: + - DEBUG=1 + - PYTHONUNBUFFERED=1 # Important for real-time log output + command: > + daphne + --bind 0.0.0.0 + --port 8000 + --access-log /dev/stdout + --proxy-headers + myproject.asgi:application + logging: + driver: "json-file" + options: + max-size: "10m" + max-file: "3" + # Remove these if not needed for your use case + # stdin_open: true + # tty: true +``` + + + + + +Optimize your Dockerfile for better logging: + +```dockerfile +FROM python:3.11-slim + +# Set environment variables for better logging +ENV PYTHONUNBUFFERED=1 +ENV PYTHONDONTWRITEBYTECODE=1 + +WORKDIR /app + +# Copy requirements and install dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy application code +COPY . . + +# Create a non-root user +RUN useradd --create-home --shell /bin/bash app +USER app + +# Expose port +EXPOSE 8000 + +# Start Daphne with proper logging +CMD ["daphne", "-b", "0.0.0.0", "-p", "8000", "--access-log", "/dev/stdout", "--proxy-headers", "myproject.asgi:application"] +``` + + + + + +**Verify logs are working:** + +```bash +# View container logs +docker-compose logs -f web + +# Or for a specific container +docker logs -f +``` + +**Common issues and solutions:** + +1. **Logs still not appearing**: Ensure `PYTHONUNBUFFERED=1` is set +2. **Partial logs**: Check if your application is using print statements instead of proper logging +3. 
**Performance issues**: Consider using structured logging (JSON format) + +**Test logging configuration:** + +```python +# Add this to your Django views or management commands +import logging +logger = logging.getLogger(__name__) + +def test_view(request): + logger.info("Test log message from Django view") + return HttpResponse("Check Docker logs for the message") +``` + +**Advanced logging with structured output:** + +```python +# For JSON structured logs +LOGGING = { + 'version': 1, + 'disable_existing_loggers': False, + 'formatters': { + 'json': { + 'format': '{"timestamp": "%(asctime)s", "level": "%(levelname)s", "logger": "%(name)s", "message": "%(message)s"}', + }, + 'simple': { + 'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', + }, + }, + 'handlers': { + 'console': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'json', + }, + 'daphne': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'simple', + }, + }, + 'loggers': { + 'django': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + 'daphne': { + 'handlers': ['daphne'], + 'level': 'INFO', + 'propagate': False, + }, + 'django.channels': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + }, + 'root': { + 'handlers': ['console'], + 'level': 'INFO', + }, +} +``` + + + + + +Configure your Dockerfile to ensure Daphne logs are captured by Docker: + +**Method 1: Using environment variables** + +```dockerfile +FROM python:3.9-slim + +# Install dependencies +COPY requirements.txt . +RUN pip install -r requirements.txt + +# Copy application code +COPY . 
/app +WORKDIR /app + +# Set environment variables for logging +ENV PYTHONUNBUFFERED=1 +ENV DJANGO_LOG_LEVEL=INFO +ENV DAPHNE_LOG_LEVEL=INFO + +# Expose port +EXPOSE 8000 + +# Start Daphne with explicit logging configuration +CMD ["daphne", "-b", "0.0.0.0", "-p", "8000", "--verbosity", "2", "your_project.asgi:application"] +``` + +**Method 2: Using custom entrypoint script** + +```dockerfile +FROM python:3.9-slim + +# Install dependencies +COPY requirements.txt . +RUN pip install -r requirements.txt + +# Copy application and entrypoint +COPY . /app +COPY entrypoint.sh /entrypoint.sh +RUN chmod +x /entrypoint.sh + +WORKDIR /app + +# Use custom entrypoint +ENTRYPOINT ["/entrypoint.sh"] +``` + +**Entrypoint script (entrypoint.sh):** + +```bash +#!/bin/bash +set -e + +# Ensure stdout/stderr are unbuffered +export PYTHONUNBUFFERED=1 + +# Configure logging level from environment +export DJANGO_LOG_LEVEL=${DJANGO_LOG_LEVEL:-INFO} + +# Start Daphne with proper logging +exec daphne \ + --bind 0.0.0.0 \ + --port 8000 \ + --verbosity 2 \ + --access-log - \ + --proxy-headers \ + your_project.asgi:application 2>&1 +``` + + + + + +Configure Docker Compose to properly handle Daphne logs: + +```yaml +version: '3.8' + +services: + web: + build: . + ports: + - "8000:8000" + environment: + - PYTHONUNBUFFERED=1 + - DJANGO_LOG_LEVEL=INFO + - DAPHNE_LOG_LEVEL=INFO + # Remove stdin_open and tty for production + # stdin_open: true + # tty: true + logging: + driver: "json-file" + options: + max-size: "10m" + max-file: "3" + volumes: + - .:/app + depends_on: + - redis + - db + + redis: + image: redis:alpine + + db: + image: postgres:13 + environment: + POSTGRES_DB: myproject + POSTGRES_USER: myuser + POSTGRES_PASSWORD: mypassword +``` + +**Alternative with log aggregation:** + +```yaml +version: '3.8' + +services: + web: + build: . 
+ ports: + - "8000:8000" + environment: + - PYTHONUNBUFFERED=1 + logging: + driver: "syslog" + options: + syslog-address: "tcp://localhost:514" + tag: "daphne-app" + # Or use fluentd for centralized logging + # logging: + # driver: "fluentd" + # options: + # fluentd-address: localhost:24224 + # tag: docker.daphne +``` + + + + + +**Common problems and solutions:** + +**1. Logs not appearing in Docker:** + +```bash +# Check if the container is running +docker ps + +# Check container logs +docker logs container_name + +# Check if stdout is being captured +docker exec -it container_name /bin/bash +python -c "import sys; print('Test stdout', file=sys.stdout); print('Test stderr', file=sys.stderr)" +``` + +**2. Buffered output issues:** + +```dockerfile +# Add to Dockerfile +ENV PYTHONUNBUFFERED=1 + +# Or in Python code +import sys +sys.stdout.reconfigure(line_buffering=True) +sys.stderr.reconfigure(line_buffering=True) +``` + +**3. Daphne not starting properly:** + +```bash +# Test Daphne manually inside container +docker exec -it container_name daphne --help + +# Check if ASGI application is valid +python manage.py check --deploy + +# Test ASGI application +python -c " +import os +os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'your_project.settings') +from your_project.asgi import application +print('ASGI application loaded successfully') +" +``` + +**4. Permission issues:** + +```dockerfile +# Ensure proper user permissions +RUN adduser --disabled-password --gecos '' appuser +USER appuser + +# Or fix permissions +RUN chown -R appuser:appuser /app +``` + + + + + +**1. 
Structured logging configuration:** + +```python +# Production-ready logging setup +import logging.config +import json + +class JSONFormatter(logging.Formatter): + def format(self, record): + log_entry = { + 'timestamp': self.formatTime(record), + 'level': record.levelname, + 'logger': record.name, + 'message': record.getMessage(), + 'module': record.module, + 'function': record.funcName, + 'line': record.lineno, + } + + # Add exception info if present + if record.exc_info: + log_entry['exception'] = self.formatException(record.exc_info) + + return json.dumps(log_entry) + +LOGGING = { + 'version': 1, + 'disable_existing_loggers': False, + 'formatters': { + 'json': { + '()': JSONFormatter, + }, + }, + 'handlers': { + 'console': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'json', + }, + }, + 'loggers': { + 'daphne': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + 'django': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + }, + 'root': { + 'handlers': ['console'], + 'level': 'INFO', + }, +} +``` + +**2. Log rotation and cleanup:** + +```yaml +# docker-compose.yml with log rotation +version: '3.8' +services: + web: + build: . 
+ logging: + driver: "json-file" + options: + max-size: "50m" + max-file: "5" + compress: "true" +``` + + + +--- + +_This FAQ was automatically generated on January 21, 2025 based on a real user query._ diff --git a/docs/troubleshooting/docker-exec-format-error-troubleshooting.mdx b/docs/troubleshooting/docker-exec-format-error-troubleshooting.mdx new file mode 100644 index 000000000..0cff92702 --- /dev/null +++ b/docs/troubleshooting/docker-exec-format-error-troubleshooting.mdx @@ -0,0 +1,274 @@ +--- +sidebar_position: 3 +title: "Docker Exec Format Error in Jobs" +description: "Solution for exec format error when running Docker containers in SleakOps jobs" +date: "2025-01-27" +category: "workload" +tags: ["docker", "jobs", "containers", "exec-format-error", "architecture"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Docker Exec Format Error in Jobs + +**Date:** January 27, 2025 +**Category:** Workload +**Tags:** Docker, Jobs, Containers, Exec Format Error, Architecture + +## Problem Description + +**Context:** User attempts to create a job in SleakOps using a Node.js TypeScript Docker image but encounters architecture compatibility issues. + +**Observed Symptoms:** + +- Error message: `exec /bin/sleep: exec format error` +- `ImagePullBackOff` error when tag is not specified +- Job fails to start properly +- Issue occurs specifically with `efrecon/ts-node:9.1.1` image + +**Relevant Configuration:** + +- Docker image: `efrecon/ts-node` +- Tag: `9.1.1` +- Platform: SleakOps Jobs +- Target: Node.js with TypeScript execution + +**Error Conditions:** + +- Error occurs during container startup +- Happens when using specific Docker image tags +- Related to architecture mismatch between image and cluster nodes +- Prevents job execution + +## Detailed Solution + + + +The `exec /bin/sleep: exec format error` typically indicates an architecture mismatch: + +1. 
**Architecture incompatibility**: The Docker image was built for a different CPU architecture (e.g., ARM vs x86_64) +2. **Platform mismatch**: The image doesn't match your cluster's node architecture +3. **Multi-arch support**: The specific tag may not support your cluster's architecture + +**Common scenarios:** + +- Image built for ARM64 running on x86_64 nodes +- Image built for x86_64 running on ARM64 nodes (less common) +- Missing multi-architecture support in the specific tag + + + + + +Before using a Docker image, verify its architecture compatibility: + +1. **Check Docker Hub**: Visit the image page on Docker Hub +2. **Look for architecture tags**: Check if multi-arch builds are available +3. **Use docker manifest** (if you have Docker CLI access): + +```bash +docker manifest inspect efrecon/ts-node:9.1.1 +``` + +4. **Check supported platforms**: Look for `linux/amd64`, `linux/arm64`, etc. + + + + + +Instead of `efrecon/ts-node`, consider these well-maintained alternatives: + +**Option 1: Official Node.js image with ts-node** + +```yaml +image: node:18-alpine +command: ["/bin/sh", "-c"] +args: ["npm install -g ts-node typescript && ts-node your-script.ts"] +``` + +**Option 2: Custom Dockerfile approach** + +```dockerfile +FROM node:18-alpine +RUN npm install -g ts-node typescript +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +CMD ["ts-node", "src/index.ts"] +``` + +**Option 3: Multi-stage build** + +```dockerfile +FROM node:18-alpine as builder +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +RUN npm run build + +FROM node:18-alpine +WORKDIR /app +COPY --from=builder /app/dist ./dist +COPY package*.json ./ +RUN npm install --only=production +CMD ["node", "dist/index.js"] +``` + + + + + +When configuring your job in SleakOps: + +1. **Use reliable base images**: + + - `node:18-alpine` (lightweight) + - `node:18` (full Ubuntu base) + - `node:18-slim` (minimal Ubuntu) + +2. 
**Job configuration example**: + +```yaml +name: typescript-job +image: node:18-alpine +tag: latest +command: ["/bin/sh"] +args: ["-c", "npm install -g ts-node typescript && ts-node /app/script.ts"] +resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" +``` + +3. **Environment variables** (if needed): + +```yaml +env: + - name: NODE_ENV + value: "production" + - name: TS_NODE_PROJECT + value: "/app/tsconfig.json" +``` + + + + + +**Step 1: Verify cluster architecture** + +- Check your SleakOps cluster configuration +- Most SleakOps clusters run on x86_64 (AMD64) architecture + +**Step 2: Test with official images first** + +```yaml +image: node:18-alpine +command: ["node"] +args: ["-v"] +``` + +**Step 3: Add TypeScript support gradually** + +```yaml +image: node:18-alpine +command: ["/bin/sh"] +args: ["-c", "npm install -g typescript && tsc --version"] +``` + +**Step 4: Full TypeScript execution** + +```yaml +image: node:18-alpine +command: ["/bin/sh"] +args: + [ + "-c", + 'npm install -g ts-node typescript && echo ''console.log("Hello TypeScript")'' > test.ts && ts-node test.ts', + ] +``` + +**Step 5: Mount your actual code** + +- Use ConfigMaps or Secrets for small scripts +- Use persistent volumes for larger codebases +- Consider building custom images for complex applications + + + + + +**Image Selection:** + +- Use official Node.js images when possible +- Prefer Alpine variants for smaller size +- Always specify exact versions (avoid `latest`) + +**TypeScript Handling:** + +- Pre-compile TypeScript for production jobs +- Use ts-node only for development/testing +- Consider using esbuild for faster compilation + +**Resource Management:** + +```yaml +resources: + requests: + memory: "128Mi" # Minimum for Node.js + cpu: "50m" + limits: + memory: "1Gi" # Adjust based on your app + cpu: "1000m" +``` + +**Error Handling:** + +- Always include proper exit codes in your scripts +- Use health checks when appropriate +- Log 
errors to stdout/stderr for SleakOps monitoring + +**Security Considerations:** + +- Don't run containers as root unless necessary +- Use minimal base images to reduce attack surface +- Regularly update base images for security patches + + + + + +**If you must use the specific image:** + +1. **Check for multi-arch versions**: + - Look for tags with `-amd64` or `-arm64` suffixes + - Try different version tags that might have better architecture support + +2. **Build your own image**: + ```dockerfile + FROM node:18-alpine + RUN npm install -g ts-node typescript + WORKDIR /app + COPY . . + CMD ["ts-node", "index.ts"] + ``` + +3. **Use SleakOps build process**: + - Let SleakOps build your application + - Use the built artifacts in a simple runtime container + +**For complex TypeScript projects:** + +- Consider using a build step to compile TypeScript to JavaScript +- Use multi-stage Docker builds for optimization +- Implement proper dependency caching for faster builds + + + +--- + +_This FAQ section was automatically generated on January 27, 2025, based on a real user inquiry._ diff --git a/docs/troubleshooting/dockerfile-dotnet-build-errors.mdx b/docs/troubleshooting/dockerfile-dotnet-build-errors.mdx new file mode 100644 index 000000000..689a1d01f --- /dev/null +++ b/docs/troubleshooting/dockerfile-dotnet-build-errors.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 3 +title: "Dockerfile Build Errors with .NET Applications" +description: "Troubleshooting Docker build failures and application startup issues in .NET projects" +date: "2024-01-15" +category: "project" +tags: ["dockerfile", "dotnet", "build", "troubleshooting", "docker"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Dockerfile Build Errors with .NET Applications + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Dockerfile, .NET, Build, Troubleshooting, Docker + +## Problem Description + +**Context:** Users experience build failures 
when deploying .NET applications through SleakOps, particularly during the Docker build process or when the application fails to start properly after deployment. + +**Observed Symptoms:** + +- Docker build fails during the deployment process +- Application fails to start with .NET runtime errors +- Build logs show errors related to `dotnet` command execution +- Pull request deployments fail unexpectedly + +**Relevant Configuration:** + +- Application type: .NET Web API +- Build process: Docker-based deployment +- Command: `dotnet Netdo.Firev.WebApi.dll` +- Deployment trigger: Pull request from develop to main branch + +**Error Conditions:** + +- Error occurs during Docker build process +- May be related to Dockerfile configuration +- Could involve application-level issues (Main class problems) +- Appears during automated deployment workflows + +## Detailed Solution + + + +To diagnose the problem, first test your application locally using Docker: + +1. **Run the container locally:** + + ```bash + docker compose run --rm --name api-shell api + ``` + +2. **Inside the container, test the application command:** + + ```bash + dotnet Netdo.Firev.WebApi.dll + ``` + +3. **Check for any runtime errors or missing dependencies** + +This will help identify if the issue is with the application code or the Docker configuration. + + + + + +Common Dockerfile issues with .NET applications: + +```dockerfile +# Ensure you're using the correct base image +FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base +WORKDIR /app +EXPOSE 80 +EXPOSE 443 + +# Build stage +FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build +WORKDIR /src +COPY ["YourProject.csproj", "./"] +RUN dotnet restore "YourProject.csproj" +COPY . . +RUN dotnet build "YourProject.csproj" -c Release -o /app/build + +# Publish stage +FROM build AS publish +RUN dotnet publish "YourProject.csproj" -c Release -o /app/publish + +# Final stage +FROM base AS final +WORKDIR /app +COPY --from=publish /app/publish . 
+ENTRYPOINT ["dotnet", "YourProject.dll"] +``` + +**Key points to verify:** + +- Correct .NET runtime version +- Proper file copying and build steps +- Correct entry point configuration +- All necessary dependencies included + + + + + +If the error is related to the Main class or application startup: + +1. **Check your Program.cs file:** + + ```csharp + // For .NET 6+ (minimal hosting model) + var builder = WebApplication.CreateBuilder(args); + + // Add services + builder.Services.AddControllers(); + + var app = builder.Build(); + + // Configure pipeline + app.UseRouting(); + app.MapControllers(); + + app.Run(); + ``` + +2. **For older .NET versions, ensure proper Main method:** + + ```csharp + public class Program + { + public static void Main(string[] args) + { + CreateHostBuilder(args).Build().Run(); + } + + public static IHostBuilder CreateHostBuilder(string[] args) => + Host.CreateDefaultBuilder(args) + .ConfigureWebHostDefaults(webBuilder => + { + webBuilder.UseStartup<Startup>(); + }); + } + ``` + +3. **Verify your project file (.csproj):** + ```xml + <Project Sdk="Microsoft.NET.Sdk.Web"> + <PropertyGroup> + <TargetFramework>net6.0</TargetFramework> + <OutputType>Exe</OutputType> + </PropertyGroup> + </Project> + ``` + + + + + +To get detailed information about the build failure: + +1. **Access SleakOps dashboard** +2. **Navigate to your project's deployment logs** +3. **Look for specific error messages in the build phase** +4. **Check for:** + - Missing dependencies + - Compilation errors + - Runtime configuration issues + - File permission problems + +**Common error patterns:** + +- `Unable to find a matching executable` +- `Assembly not found` +- `Configuration errors` +- `Port binding issues` + + + + + +Ensure your application is properly configured for the deployment environment: + +1. **Check appsettings.json:** + ```json + { + "Logging": { + "LogLevel": { + "Default": "Information", + "Microsoft.AspNetCore": "Warning" + } + }, + "AllowedHosts": "*", + "Urls": "http://0.0.0.0:80" + } + ``` + +2. 
**Environment-specific configuration:** + ```json + // appsettings.Production.json + { + "Logging": { + "LogLevel": { + "Default": "Warning" + } + } + } + ``` + +3. **Verify port configuration:** + - Ensure your application listens on the correct port + - Check that EXPOSE directive in Dockerfile matches application port + - Verify SleakOps port configuration + + + + + +Common dependency-related problems: + +1. **Missing runtime dependencies:** + ```dockerfile + # Add any required system dependencies + RUN apt-get update && apt-get install -y \ + libgdiplus \ + && rm -rf /var/lib/apt/lists/* + ``` + +2. **NuGet package restore issues:** + ```dockerfile + # Clear NuGet cache if needed + RUN dotnet nuget locals all --clear + RUN dotnet restore --no-cache + ``` + +3. **Version compatibility:** + - Ensure all packages are compatible with target framework + - Check for deprecated packages + - Verify package versions in .csproj file + +4. **Runtime identifier issues:** + ```xml + <PropertyGroup> + <RuntimeIdentifier>linux-x64</RuntimeIdentifier> + <SelfContained>false</SelfContained> + </PropertyGroup> + ``` + + + +--- + +_This FAQ section was automatically generated on January 15, 2024, based on a real user inquiry._ diff --git a/docs/troubleshooting/domain-certificate-delegation-error.mdx b/docs/troubleshooting/domain-certificate-delegation-error.mdx new file mode 100644 index 000000000..49597910c --- /dev/null +++ b/docs/troubleshooting/domain-certificate-delegation-error.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "Domain Certificate Delegation Error During Deployment" +description: "Solution for ACMModule DoesNotExist error when deploying with domain certificates" +date: "2024-12-19" +category: "project" +tags: ["deployment", "domain", "certificate", "acm", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Domain Certificate Delegation Error During Deployment + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Deployment, Domain, Certificate, ACM, Troubleshooting + +## 
Problem Description + +**Context:** User experiences deployment failure when trying to register a production domain in SleakOps. The error occurs during the certificate delegation process for domain management. + +**Observed Symptoms:** + +- Deployment fails with `Modules.DoesNotExist` error +- Error specifically mentions `ACMModule` not found +- Issue prevents domain registration for production environment +- Error occurs during ingress values generation + +**Relevant Configuration:** + +- Domain: `hub.supra.social` (production domain) +- Certificate management: AWS ACM integration +- Deployment process: Helm chart values generation +- Platform: SleakOps domain management system + +**Error Conditions:** + +- Error occurs during `create_values_config_map` execution +- Specifically fails at `domain.acm_certificate_info(raise_exceptions=True)` +- ACMModule query returns no results +- Prevents completion of deployment process + +## Detailed Solution + + + +When encountering certificate delegation issues, follow these steps: + +1. **Access SleakOps Console** + + - Navigate to your project dashboard + - Go to **Domains** section + +2. **Re-delegate Certificates** + + - Find the affected domain (`hub.supra.social`) + - Click on **Certificate Management** + - Select **Re-delegate Certificate** + - Follow the console prompts to complete delegation + +3. **Verify Delegation Status** + - Check that certificate status shows as "Active" + - Ensure ACM module is properly associated with the domain + + + + + +If certificate delegation fails, try relaunching the HZ (Hosted Zone) task: + +1. **Access Task Management** + + - Go to **Infrastructure** → **Tasks** + - Look for HZ-related tasks + +2. **Relaunch HZ Task** + + - Select the most recent HZ task + - Click **Relaunch** or **Retry** + - Wait for task completion + +3. 
**Proceed with Deployment** + - After HZ task completes successfully + - Attempt deployment again + - Certificate delegation should now work properly + + + + + +To prevent conflicts with unused domains: + +1. **Identify Unused Domains** + + - Review domain list in SleakOps console + - Identify domains that were never properly registered + +2. **Remove Unused Entries** + + - Select unused domain entries (e.g., Disker domains) + - Click **Delete** or **Remove** + - Confirm removal + +3. **Verify Clean State** + - Ensure only active, properly configured domains remain + - Check that each domain has proper certificate delegation + + + + + +Before attempting deployment, verify DNS settings: + +1. **Check DNS Records** + + - Verify that DNS records for `supra.social` are correct + - Ensure no external changes were made outside SleakOps + +2. **Validate Subdomain Configuration** + + - Check that `hub.supra.social` is properly configured + - Verify subdomain delegation is working + +3. **Test Resolution** + + ```bash + # Test DNS resolution + nslookup hub.supra.social + + # Check certificate status + openssl s_client -connect hub.supra.social:443 -servername hub.supra.social + ``` + + + + + +If the problem persists after certificate re-delegation: + +1. **Check Module Status** + + - Verify ACMModule is enabled for your project + - Contact support if module appears missing + +2. **Review Recent Changes** + + - Check if any DNS changes were made externally + - Verify no conflicting certificate requests exist + +3. **Monitor Deployment Logs** + + - Check deployment logs for additional error details + - Look for certificate validation failures + +4. 
**Contact Support** + - If issue persists, provide: + - Domain name affected + - Complete error traceback + - Recent changes made to DNS or certificates + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/domain-delegation-production-environments.mdx b/docs/troubleshooting/domain-delegation-production-environments.mdx new file mode 100644 index 000000000..3d84c3433 --- /dev/null +++ b/docs/troubleshooting/domain-delegation-production-environments.mdx @@ -0,0 +1,206 @@ +--- +sidebar_position: 3 +title: "Domain Delegation Issues in Production Environments" +description: "Solution for domain delegation problems in production environments" +date: "2024-12-20" +category: "project" +tags: ["domain", "dns", "delegation", "production", "environment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Domain Delegation Issues in Production Environments + +**Date:** December 20, 2024 +**Category:** Project +**Tags:** Domain, DNS, Delegation, Production, Environment + +## Problem Description + +**Context:** Users experience issues with domain delegation when configuring production environments in SleakOps, where main domains are not being properly delegated despite correct configuration. 
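A frequent source of confusion in this scenario is that registrars and DNS tools report name servers with trailing dots or mixed casing, which can make a correct delegation look wrong (or a wrong one look correct). A minimal sketch of a normalization check follows; the SleakOps name servers shown are placeholders in the same shape as the examples used later in this guide, not real values:

```python
# Placeholder name servers, not real SleakOps infrastructure.
EXPECTED_NS = ["ns1.sleakops.com", "ns2.sleakops.com"]

def normalize_ns(records):
    """Lowercase and strip trailing dots so 'NS1.Example.com.' matches 'ns1.example.com'."""
    return sorted(r.strip().lower().rstrip(".") for r in records)

def delegation_matches(observed, expected=EXPECTED_NS):
    """True when the observed NS set (e.g. from `dig NS yourdomain.com`) equals the expected set."""
    return normalize_ns(observed) == normalize_ns(expected)

print(delegation_matches(["NS2.sleakops.com.", "ns1.SLEAKOPS.com"]))  # True
```

Feed it the NS records returned by `dig NS yourdomain.com` to confirm whether the delegation described below has actually taken effect, independent of formatting differences.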
+ +**Observed Symptoms:** + +- Main domains are not delegated in production environments +- Domain delegation appears to be configured but doesn't take effect +- DNS propagation may be delayed or incomplete +- Standard DNS record management may not be sufficient for full domain delegation + +**Relevant Configuration:** + +- Environment type: Production +- Domain type: Main/root domain (not subdomain) +- DNS provider: Various domain registrars +- Configuration method: Standard DNS record management + +**Error Conditions:** + +- Domain delegation fails despite correct DNS configuration +- Issue occurs specifically with main domains in production +- Problem may be related to domain registrar-specific delegation processes +- DNS propagation delays can mask the actual issue + +## Detailed Solution + + + +First, check if the issue is related to DNS propagation delays: + +1. Use online DNS propagation checkers: + + - whatsmydns.net + - dnschecker.org + - dns-lookup.com + +2. Check from different locations and DNS servers +3. DNS propagation can take up to 48 hours for full global propagation +4. Use `dig` or `nslookup` commands to verify delegation: + +```bash +# Check NS records for your domain +dig NS yourdomain.com + +# Check from specific DNS servers +dig @8.8.8.8 NS yourdomain.com +dig @1.1.1.1 NS yourdomain.com +``` + + + + + +Many domain registrars require domain delegation to be configured through a separate process: + +**For complete domain delegation:** + +1. **Log into your domain registrar's control panel** +2. **Look for domain delegation options:** + + - "Domain Delegation" + - "Change Name Servers" + - "DNS Management" → "Delegate Domain" + - "Advanced DNS Settings" + +3. **Common registrar-specific locations:** + + - **GoDaddy**: Domain Settings → Name Servers → Change + - **Namecheap**: Domain List → Manage → Name Servers + - **Route53**: Hosted Zones → NS Record Set + - **Cloudflare**: DNS → Name Servers + +4. 
**Set the name servers provided by SleakOps** + + + + + +To delegate your domain to SleakOps: + +1. **In SleakOps Dashboard:** + + - Go to your **Project Settings** + - Navigate to **Domains** section + - Find **Name Servers** information + +2. **Typical SleakOps name servers format:** + + ``` + ns1.sleakops.com + ns2.sleakops.com + ns3.sleakops.com + ns4.sleakops.com + ``` + +3. **Copy these name servers exactly as shown** +4. **Configure them in your domain registrar** + + + + + +After configuring domain delegation: + +1. **Wait for propagation** (up to 48 hours) + +2. **Verify delegation with dig:** + + ```bash + # Should return SleakOps name servers + dig NS yourdomain.com + ``` + +3. **Check domain resolution:** + + ```bash + # Should resolve to your SleakOps environment + dig A yourdomain.com + ``` + +4. **Test in browser:** + - Visit your domain directly + - Verify SSL certificate is valid + - Check that it points to your SleakOps environment + + + + + +**Issue 1: Mixed DNS Management** + +- Problem: Some DNS records managed at registrar, others at SleakOps +- Solution: Ensure complete delegation - all DNS managed by SleakOps + +**Issue 2: Incorrect Name Server Format** + +- Problem: Name servers entered with trailing dots or incorrect format +- Solution: Copy name servers exactly as provided by SleakOps + +**Issue 3: Partial Delegation** + +- Problem: Only some name servers configured +- Solution: Configure all provided name servers (typically 2-4) + +**Issue 4: Registrar Caching** + +- Problem: Registrar caches old DNS settings +- Solution: Contact registrar support to clear DNS cache + +**Issue 5: Domain Lock Status** + +- Problem: Domain is locked and prevents delegation changes +- Solution: Unlock domain in registrar settings before changing name servers + + + + + +If delegation still doesn't work: + +1. 
**Contact domain registrar support:** + + - Explain you need to delegate the entire domain + - Provide SleakOps name servers + - Ask about domain-specific delegation procedures + +2. **Check domain status:** + + ```bash + whois yourdomain.com | grep -i status + ``` + +3. **Verify domain isn't expired or locked** + +4. **Check for DNSSEC conflicts:** + + - Disable DNSSEC at registrar if enabled + - SleakOps will manage DNSSEC after delegation + +5. **Test with subdomain first:** + - Create a test subdomain (test.yourdomain.com) + - Verify SleakOps can manage subdomains correctly + + + +--- + +_This FAQ was automatically generated on December 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/domain-migration-custom-domains.mdx b/docs/troubleshooting/domain-migration-custom-domains.mdx new file mode 100644 index 000000000..bda83cc96 --- /dev/null +++ b/docs/troubleshooting/domain-migration-custom-domains.mdx @@ -0,0 +1,228 @@ +--- +sidebar_position: 3 +title: "Domain Migration and Custom Domain Configuration" +description: "How to migrate applications to new domains and configure custom domains in SleakOps" +date: "2024-01-15" +category: "project" +tags: ["domain", "migration", "dns", "alias", "custom-domain"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Domain Migration and Custom Domain Configuration + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Domain, Migration, DNS, Alias, Custom Domain + +## Problem Description + +**Context:** Users need to migrate their applications from one domain to another or configure custom domains for their SleakOps deployments. 
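One recurring detail in such migrations is keeping the old domain alive and redirecting each URL to its counterpart on the new domain, so existing links and bookmarks keep working. A minimal sketch of a path-preserving rewrite, using the example domains from this page (`redirect_target` is a hypothetical helper for illustration, not part of SleakOps):

```python
from urllib.parse import urlsplit, urlunsplit

OLD_DOMAIN = "byroncode.com"      # example domains from this page
NEW_DOMAIN = "ordenapp.com.ar"

def redirect_target(url: str) -> str:
    """Map a URL on the old domain to the same path/query on the new domain."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == OLD_DOMAIN or host.endswith("." + OLD_DOMAIN):
        # Preserve any subdomain prefix (www.byroncode.com -> www.ordenapp.com.ar).
        prefix = host[: -len(OLD_DOMAIN)]
        host = prefix + NEW_DOMAIN
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))

print(redirect_target("http://www.byroncode.com/app?x=1"))
# https://www.ordenapp.com.ar/app?x=1
```

Whatever still serves the old domain (a small redirect service or an edge rule) can then answer with a 301 pointing at the computed target.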
+ +**Observed Symptoms:** + +- Need to change application URLs from old domain to new domain +- Want to use newly acquired domain for existing applications +- Require guidance on DNS configuration for custom domains +- Need to understand migration options and their implications + +**Relevant Configuration:** + +- Current domain: `byroncode.com` +- New domain: `ordenapp.com.ar` +- Application: Frontend/landing page +- Platform: SleakOps execution environment + +**Error Conditions:** + +- Applications currently accessible only through old domain +- Need to transition without service disruption +- DNS records need manual configuration + +## Detailed Solution + + + +For simple URL changes without full migration, you can add a domain alias to your existing execution: + +**Steps:** + +1. Navigate to your execution in the SleakOps dashboard +2. Go to **Settings** → **Domain Configuration** +3. Click **Add Alias** +4. Enter your new domain (e.g., `ordenapp.com.ar`) +5. Save the configuration + +**DNS Configuration:** +After creating the alias, you'll receive DNS records that need to be manually added to your domain registrar: + +``` +Type: CNAME +Name: www +Value: [provided-by-sleakops].sleakops.io + +Type: A +Name: @ +Value: [IP-address-provided] +``` + +**Benefits:** + +- Quick implementation +- No service disruption +- Maintains existing configuration +- Both domains work simultaneously + + + + + +For complete migration of all services from the old domain to the new one: + +**Prerequisites:** + +- Coordinate with SleakOps support team +- Schedule maintenance window +- Backup current configuration + +**Migration Process:** + +1. **Pre-migration planning:** + + - Document all current services and configurations + - Identify dependencies + - Plan rollback strategy + +2. **Domain configuration:** + + - Update primary domain in project settings + - Reconfigure SSL certificates + - Update environment variables + +3. 
**DNS updates:** + + - Point new domain to SleakOps infrastructure + - Update all subdomains + - Configure proper TTL values + +4. **Testing and validation:** + - Verify all services are accessible + - Test SSL certificate validity + - Confirm all integrations work + +**Note:** This option requires coordination with the SleakOps team and may involve service downtime. + + + + + +With the latest SleakOps version, you can create new environments with custom URLs: + +**When to use:** + +- For development environments +- When you want a fresh start +- For testing new domain configuration + +**Steps:** + +1. Create a new environment in SleakOps +2. During setup, specify your custom domain +3. Configure DNS records as provided +4. Deploy your application to the new environment + +**Considerations:** + +- Dependencies need to be recreated +- Database migrations may be required +- Environment variables need reconfiguration +- Suitable mainly for development environments + +```yaml +# Example environment configuration +environment: + name: "production-new-domain" + domain: "ordenapp.com.ar" + ssl: true + auto_deploy: true +``` + + + + + +**Common DNS Records for SleakOps:** + +``` +# Root domain +Type: A +Name: @ +Value: [IP provided by SleakOps] +TTL: 300 + +# WWW subdomain +Type: CNAME +Name: www +Value: [hostname].sleakops.io +TTL: 300 + +# API subdomain (if applicable) +Type: CNAME +Name: api +Value: [api-hostname].sleakops.io +TTL: 300 +``` + +**DNS Propagation:** + +- Changes can take 24-48 hours to fully propagate +- Use tools like `dig` or online DNS checkers to verify +- Lower TTL values speed up changes but increase DNS queries + +**Verification Commands:** + +```bash +# Check A record +dig ordenapp.com.ar A + +# Check CNAME record +dig www.ordenapp.com.ar CNAME + +# Check from specific DNS server +dig @8.8.8.8 ordenapp.com.ar +``` + + + + + +**For Production Environments:** + +1. **Use domain aliases first** - Test the new domain alongside the old one +2. 
**Plan maintenance windows** - For full migrations, schedule during low-traffic periods +3. **Monitor after changes** - Watch for SSL certificate issues and DNS propagation +4. **Keep old domain active** - Maintain redirects for SEO and user experience + +**SSL Certificate Considerations:** + +- SleakOps automatically provisions SSL certificates for custom domains +- Certificate generation may take 5-10 minutes +- Ensure DNS records are correct before SSL provisioning + +**Rollback Strategy:** + +- Keep original DNS records documented +- Have a plan to revert changes quickly +- Test rollback procedure in development first + +**Communication:** + +- Notify users of domain changes in advance +- Update documentation and links +- Configure proper redirects from old to new domain + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/domain-ns-delegation-replication-issues.mdx b/docs/troubleshooting/domain-ns-delegation-replication-issues.mdx new file mode 100644 index 000000000..b23f91e0f --- /dev/null +++ b/docs/troubleshooting/domain-ns-delegation-replication-issues.mdx @@ -0,0 +1,210 @@ +--- +sidebar_position: 3 +title: "Domain NS Delegation Replication Issues" +description: "Troubleshooting NS record delegation and replication problems with domain providers" +date: "2024-01-15" +category: "general" +tags: ["domain", "dns", "ns-records", "delegation", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Domain NS Delegation Replication Issues + +**Date:** January 15, 2024 +**Category:** General +**Tags:** Domain, DNS, NS Records, Delegation, Troubleshooting + +## Problem Description + +**Context:** User is attempting to configure a custom domain (ordenapp.com.ar) in SleakOps but encounters NS (Name Server) delegation replication failures with their domain provider (DonWeb). 
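A replication problem like this one usually shows up as different public resolvers returning different NS sets for the same domain. A minimal sketch of that comparison — the lookups are stubbed with sample data here, and in practice you would fill `answers` from `dig @<resolver> NS <domain> +short`:

```python
# Sketch: spot incomplete NS replication by comparing the NS records that
# different public resolvers return for the same domain.

def delegation_consistent(answers: dict) -> bool:
    """answers maps resolver IP -> set of NS hostnames that resolver returned."""
    record_sets = list(answers.values())
    return all(record_set == record_sets[0] for record_set in record_sets)

answers = {
    "8.8.8.8": {"ns1.sleakops.com.", "ns2.sleakops.com."},
    "1.1.1.1": {"ns1.sleakops.com.", "ns2.sleakops.com."},
    "9.9.9.9": set(),  # this resolver still sees no delegation at all
}
print(delegation_consistent(answers))  # -> False
```

When the provider fixes the replication issue, all resolvers converge on the same NS set and the check returns `True`.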
+ +**Observed Symptoms:** + +- NS delegation configuration failing at domain provider level +- Replication issues with delegated name servers +- Domain not resolving properly to SleakOps infrastructure +- Provider (DonWeb) acknowledging the technical issue + +**Relevant Configuration:** + +- Domain: ordenapp.com.ar (Argentine domain) +- Domain Provider: DonWeb +- Target Platform: SleakOps +- DNS Configuration: NS delegation setup + +**Error Conditions:** + +- NS record replication fails at provider level +- Domain delegation not propagating correctly +- Issue occurs during initial domain configuration +- Provider confirms technical problem on their end + +## Detailed Solution + + + +NS (Name Server) delegation is the process where your domain provider points your domain to external name servers (in this case, SleakOps' DNS servers). + +The process involves: + +1. **SleakOps provides NS records** (e.g., ns1.sleakops.com, ns2.sleakops.com) +2. **You configure these at your domain provider** (DonWeb) +3. **Provider replicates the changes** to root DNS servers +4. **DNS propagation occurs globally** (24-48 hours) + + + + + +While waiting for provider resolution: + +1. **Verify NS records in SleakOps:** + + - Go to your project settings + - Check the "Custom Domain" section + - Note down the provided NS records + +2. **Check current DNS status:** + + ```bash + # Check current NS records + dig NS ordenapp.com.ar + + # Check if domain resolves + nslookup ordenapp.com.ar + ``` + +3. **Verify provider configuration:** + - Log into DonWeb control panel + - Confirm NS records are correctly entered + - Check for any error messages or warnings + + + + + +When dealing with provider issues: + +1. **Document the problem clearly:** + + - Provide exact NS records from SleakOps + - Include screenshots of configuration + - Note any error messages + +2. 
**Request specific technical details:** + + - Ask for replication logs + - Request timeline for resolution + - Get confirmation of NS record acceptance + +3. **Escalate if necessary:** + - Request technical support escalation + - Ask to speak with DNS/domain specialists + - Consider temporary workarounds + + + + + +If provider issues persist: + +1. **Subdomain delegation:** + + ``` + # Instead of ordenapp.com.ar, use: + app.ordenapp.com.ar + ``` + + Configure only the subdomain with SleakOps NS records. + +2. **CNAME approach (if supported):** + + ``` + # Create CNAME record pointing to SleakOps + www.ordenapp.com.ar CNAME your-app.sleakops.io + ``` + +3. **Transfer domain to different provider:** + + - Consider providers with better DNS management + - Popular options: Cloudflare, Route53, Namecheap + +4. **Use SleakOps subdomain temporarily:** + ``` + # Use provided subdomain while resolving + your-app.sleakops.io + ``` + + + + + +Once the provider resolves the replication issue: + +1. **Check DNS propagation:** + + ```bash + # Check NS records globally + dig +trace NS ordenapp.com.ar + + # Verify from different locations + nslookup ordenapp.com.ar 8.8.8.8 + nslookup ordenapp.com.ar 1.1.1.1 + ``` + +2. **Test domain resolution:** + + ```bash + # Test HTTP resolution + curl -I http://ordenapp.com.ar + + # Test HTTPS if configured + curl -I https://ordenapp.com.ar + ``` + +3. **Monitor propagation:** + + - Use online tools like whatsmydns.net + - Check from multiple global locations + - Allow 24-48 hours for full propagation + +4. **Update SleakOps configuration:** + - Confirm domain status in SleakOps dashboard + - Test SSL certificate generation + - Verify application accessibility + + + + + +To avoid similar problems: + +1. **Choose reliable DNS providers:** + + - Research provider DNS reliability + - Check support quality and response times + - Consider managed DNS services + +2. 
**Document DNS configurations:** + + - Keep records of all NS configurations + - Screenshot important settings + - Maintain contact information for technical support + +3. **Test in staging first:** + + - Use test subdomains before production + - Verify DNS changes in non-critical environments + - Have rollback plans ready + +4. **Monitor DNS health:** + - Set up monitoring for domain resolution + - Use tools like UptimeRobot or Pingdom + - Configure alerts for DNS failures + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/domain-setup-alert-explanation.mdx b/docs/troubleshooting/domain-setup-alert-explanation.mdx new file mode 100644 index 000000000..0cc4d4115 --- /dev/null +++ b/docs/troubleshooting/domain-setup-alert-explanation.mdx @@ -0,0 +1,157 @@ +--- +sidebar_position: 3 +title: "Domain Setup Alert Explanation" +description: "Understanding the 'Setup domain' alert and when action is required" +date: "2024-01-15" +category: "project" +tags: ["domain", "alert", "setup", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Domain Setup Alert Explanation + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Domain, Alert, Setup, Troubleshooting + +## Problem Description + +**Context:** Users receive "Setup domain" alerts for domains that appear to be functioning correctly, causing confusion about whether action is required. 
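The core triage question behind this alert — "does the domain resolve to the address SleakOps expects?" — can be expressed as a one-line check. In this sketch the resolver is injected so it runs offline; the IPs are placeholder test addresses, and in practice you would pass `socket.gethostbyname` or parse `dig` output:

```python
# Sketch: decide whether a "Setup domain" alert needs action by checking
# where the domain actually resolves versus the address SleakOps provided.

def alert_actionable(domain: str, expected_ip: str, resolve) -> bool:
    """True if the domain does not resolve to the SleakOps-provided IP.

    Note: a True result may still be fine if the domain intentionally
    points at an external service, as described in this guide.
    """
    return resolve(domain) != expected_ip

# Stubbed resolver: the domain points at an external host.
fake_dns = {"www.mocona.com.ar": "198.51.100.7"}
print(alert_actionable("www.mocona.com.ar", "203.0.113.10", fake_dns.get))  # -> True
```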
+ +**Observed Symptoms:** + +- Multiple domains showing "Setup domain" alert +- Domains are accessible and functioning normally (e.g., https://www.mocona.com.ar/) +- Alert appears despite proper domain functionality +- Uncertainty about required actions + +**Relevant Configuration:** + +- Domain status: Functional and accessible +- Alert type: "Setup domain" +- Domain examples: External domains like mocona.com.ar +- Platform: SleakOps domain management + +**Error Conditions:** + +- Alert appears for working domains +- No apparent functional issues with the domains +- Alert persists despite normal operation + +## Detailed Solution + + + +The "Setup domain" alert in SleakOps indicates one of the following conditions: + +1. **DNS Configuration Incomplete**: The domain's DNS records are not properly configured to point to your SleakOps services +2. **SSL Certificate Pending**: The SSL certificate for the domain is still being provisioned or validated +3. **Domain Verification Pending**: The platform is waiting for domain ownership verification +4. **Configuration Mismatch**: There's a discrepancy between the expected and actual domain configuration + +This alert can appear even when the domain is accessible because it may be pointing to a different server or service. + + + + + +You should take action if: + +- **Your application is not accessible** through the domain +- **SSL certificate errors** appear when accessing the domain +- **You want the domain to point to your SleakOps deployment** instead of its current destination +- **Email notifications** indicate failed deployments or services + +You can **ignore the alert** if: + +- The domain is working as expected +- It's pointing to the correct destination (even if not SleakOps) +- You're using the domain for external services + + + + + +To determine if you need to take action: + +1. **Check where your domain points**: + + ```bash + nslookup www.mocona.com.ar + # or + dig www.mocona.com.ar + ``` + +2. 
**Verify SSL certificate**: + + - Visit your domain in a browser + - Check if there are SSL warnings + - Verify the certificate issuer + +3. **Check SleakOps deployment status**: + + - Go to your project dashboard + - Verify if services are running + - Check deployment logs for errors + +4. **Test application functionality**: + - Access all critical pages + - Test forms and interactive elements + - Verify API endpoints if applicable + + + + + +If you determine action is needed: + +**For DNS Configuration:** + +1. Go to your domain registrar's DNS management +2. Update A records to point to SleakOps IP addresses +3. Update CNAME records as specified in your SleakOps project + +**For SSL Certificate Issues:** + +1. In SleakOps dashboard, go to **Domain Settings** +2. Click **Regenerate SSL Certificate** +3. Wait for validation (can take up to 24 hours) + +**For Domain Verification:** + +1. Check your email for verification requests +2. Follow the verification link or add required DNS TXT records +3. Contact support if verification fails + +**Configuration Example:** + +```dns +# DNS Records for SleakOps +www.yourdomain.com A 1.2.3.4 +yourdomain.com A 1.2.3.4 +*.yourdomain.com CNAME your-project.sleakops.com +``` + + + + + +If the domain is working correctly and you don't want it managed by SleakOps: + +1. Go to **Project Settings** → **Domains** +2. Find the domain with the alert +3. Click **Remove Domain** or **Disable Monitoring** +4. 
Confirm the action + +Alternatively, you can: + +- **Mute notifications** for specific domains +- **Configure alert thresholds** to reduce false positives +- **Contact support** to whitelist external domains + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/doppler-integration-troubleshooting.mdx b/docs/troubleshooting/doppler-integration-troubleshooting.mdx new file mode 100644 index 000000000..b428b3d55 --- /dev/null +++ b/docs/troubleshooting/doppler-integration-troubleshooting.mdx @@ -0,0 +1,627 @@ +--- +sidebar_position: 3 +title: "Doppler Integration Issues in SleakOps" +description: "Troubleshooting Doppler configuration and environment variable synchronization issues" +date: "2024-12-19" +category: "dependency" +tags: + ["doppler", "environment-variables", "secrets", "configuration", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Doppler Integration Issues in SleakOps + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Doppler, Environment Variables, Secrets, Configuration, Deployment + +## Problem Description + +**Context:** Users integrating Doppler as an external service for environment variable management in SleakOps deployments experience issues where variables are not being updated during deployments, despite successful configuration. 
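A quick way to confirm the staleness is to diff what Doppler currently holds against what a running pod reports. The sketch below works on plain dicts — fill them from `doppler secrets download --no-file --format json` and the output of `kubectl exec <pod> -- env`; the variable names and URLs are illustrative:

```python
# Sketch: find stale environment variables by diffing the Doppler config
# against a pod's actual environment.

def stale_vars(doppler: dict, pod_env: dict) -> dict:
    """Map each out-of-date variable to (current pod value, expected Doppler value)."""
    return {
        key: (pod_env.get(key), value)
        for key, value in doppler.items()
        if pod_env.get(key) != value
    }

doppler = {"API_URL": "https://staging.example.com", "LOG_LEVEL": "debug"}
pod_env = {"API_URL": "https://old.example.com", "LOG_LEVEL": "debug"}
print(stale_vars(doppler, pod_env))
# -> {'API_URL': ('https://old.example.com', 'https://staging.example.com')}
```

An empty result after a deployment means the pods picked up the current Doppler config; a non-empty one points at exactly which variables the rollout failed to refresh.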
+ +**Observed Symptoms:** + +- Environment variables not updating during recent deployments +- Doppler service configured but variables remain stale +- Deployments complete successfully but use outdated configuration +- Rolling update strategy shows temporary replica count increase + +**Relevant Configuration:** + +- External service: Doppler for environment variable management +- Deployment strategy: Rolling Update +- Docker args: `DOPPLER_CONFIG = staging` +- Project secrets and Doppler references configured in SleakOps + +**Error Conditions:** + +- Variables not refreshing on new deployments +- Occurs specifically with Doppler-managed environment variables +- Problem persists across multiple deployment attempts +- Local secrets may still work while Doppler integration fails + +## Detailed Solution + + + +First, verify your Doppler configuration is correctly set up in SleakOps: + +1. **Check Docker Arguments:** + + ```yaml + # In your SleakOps project configuration + docker_args: + DOPPLER_CONFIG: "staging" # or your environment name + DOPPLER_TOKEN: "${DOPPLER_TOKEN}" # should reference secret + ``` + +2. **Verify Doppler Token Secret:** + + - Go to **Project Settings** → **Secrets** + - Ensure `DOPPLER_TOKEN` is properly configured + - Token should have read access to the specified config + +3. **Check Doppler Config Reference:** + - Verify the config name matches exactly in Doppler dashboard + - Common configs: `dev`, `staging`, `production` + + + + + +Ensure your Doppler token has the correct permissions: + +1. **Test Token Locally:** + + ```bash + # Test if token can access the config + curl -H "Authorization: Bearer YOUR_DOPPLER_TOKEN" \ + "https://api.doppler.com/v3/configs/config/secrets" \ + -G -d project=YOUR_PROJECT -d config=staging + ``` + +2. **Check Token Scope:** + + - Token must have `read` access to the specific config + - Verify the token hasn't expired + - Ensure it's a **Service Token**, not a **Personal Token** + +3. 
**Regenerate Token if Needed:** + - Go to Doppler Dashboard → **Access** → **Service Tokens** + - Create new token with appropriate scope + - Update the secret in SleakOps + + + + + +To ensure environment variables are refreshed during deployment: + +1. **Trigger Complete Redeployment:** + + ```bash + # Force restart all pods to pick up new variables + kubectl rollout restart deployment/your-app-name + ``` + +2. **Check Pod Environment Variables:** + + ```bash + # Verify variables are loaded correctly + kubectl exec -it pod/your-pod-name -- env | grep YOUR_VAR + ``` + +3. **Use Deployment Annotations:** + Add a timestamp annotation to force pod recreation: + ```yaml + # This forces Kubernetes to recreate pods + spec: + template: + metadata: + annotations: + deployment.kubernetes.io/revision: "$(date +%s)" + ``` + + + + + +If variables still aren't updating: + +1. **Check Doppler Logs:** + + ```bash + # Check if Doppler CLI is working in your container + kubectl logs deployment/your-app -c your-container | grep -i doppler + ``` + +2. **Verify Doppler CLI Installation:** + + ```dockerfile + # Ensure Doppler CLI is installed in your Docker image + RUN curl -Ls https://cli.doppler.com/install.sh | sh + + # Use Doppler to run your application + CMD ["doppler", "run", "--", "your-app-command"] + ``` + +3. **Test Manual Sync:** + + ```bash + # Inside your container, test manual sync + doppler secrets download --no-file --format env + ``` + +4. **Check Network Connectivity:** + - Ensure your cluster can reach `api.doppler.com` + - Verify no firewall rules block the connection + - Test DNS resolution: `nslookup api.doppler.com` + + + + + +If Doppler integration continues to fail: + +1. **Use Kubernetes Secrets as Backup:** + + ```yaml + # Create a Kubernetes secret with critical variables + apiVersion: v1 + kind: Secret + metadata: + name: app-secrets + data: + DATABASE_URL: + ``` + +2. 
**Implement Doppler Webhook:** + + - Set up Doppler webhooks to trigger redeployments + - Automatically update secrets when Doppler config changes + +3. **Use Init Container Pattern:** + + ```yaml + # Fetch secrets before main container starts + initContainers: + - name: doppler-sync + image: dopplerhq/cli:latest + command: + ["doppler", "secrets", "download", "--format", "env", "--no-file"] + volumeMounts: + - name: secrets-volume + mountPath: /secrets + ``` + +4. **Hybrid Approach:** + - Use SleakOps secrets for critical variables + - Use Doppler for non-critical configuration + - Implement fallback mechanisms in your application + + + + + +The temporary increase in replica count is normal during deployments: + +1. **Rolling Update Process:** + + - Kubernetes creates new pods with updated configuration + - Keeps old pods running until new ones are ready + - Gradually shifts traffic to new pods + - Terminates old pods once new ones are healthy + +2. **Expected Behavior:** + + - Temporary replica count: `desired + maxSurge` + - For 2 replicas with default settings: up to 3 pods temporarily + - Returns to desired count (2) after deployment completes + +3. **Monitor Deployment Progress:** + ```bash + kubectl rollout status deployment/your-app-name + kubectl get pods -w # Watch pod status changes + ``` + +4. **Control Rolling Update Parameters:** + ```yaml + spec: + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 1 # Maximum extra pods during update + maxUnavailable: 0 # Minimum pods that must remain available + ``` + + + + + +**1. 
Doppler Sidecar Container:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: app-with-doppler +spec: + template: + spec: + containers: + - name: app + image: your-app:latest + env: + - name: DATABASE_URL + valueFrom: + secretKeyRef: + name: doppler-secrets + key: DATABASE_URL + + - name: doppler-sync + image: dopplerhq/cli:latest + command: + - /bin/sh + - -c + - | + while true; do + doppler secrets download --no-file --format k8s-secret | kubectl apply -f - + sleep 300 # Sync every 5 minutes + done + env: + - name: DOPPLER_TOKEN + valueFrom: + secretKeyRef: + name: doppler-token + key: token +``` + +**2. Doppler Kubernetes Operator:** + +```yaml +# Install Doppler Operator +kubectl apply -f https://github.com/DopplerHQ/kubernetes-operator/releases/latest/download/recommended.yaml + +# Create DopplerSecret resource +apiVersion: secrets.doppler.com/v1alpha1 +kind: DopplerSecret +metadata: + name: doppler-secret +spec: + tokenSecret: + name: doppler-token-secret + managedSecret: + name: doppler-managed-secret + namespace: default + project: your-project + config: staging +``` + +**3. Automated Refresh with ConfigMap:** + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: doppler-sync-script +data: + sync.sh: | + #!/bin/bash + set -e + + echo "Syncing secrets from Doppler..." + doppler secrets download --no-file --format env > /tmp/secrets.env + + # Compare with existing secrets + if ! cmp -s /tmp/secrets.env /shared/secrets.env; then + echo "Secrets changed, updating..." 
+ cp /tmp/secrets.env /shared/secrets.env + + # Signal main container to reload + kill -HUP $(pgrep -f "your-main-process") + fi +--- +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + volumes: + - name: shared-secrets + emptyDir: {} + - name: sync-script + configMap: + name: doppler-sync-script + defaultMode: 0755 + + initContainers: + - name: doppler-init + image: dopplerhq/cli:latest + command: ["/scripts/sync.sh"] + volumeMounts: + - name: shared-secrets + mountPath: /shared + - name: sync-script + mountPath: /scripts +``` + + + + + +**Environment Verification:** + +```bash +# 1. Verify Doppler CLI is available +kubectl exec -it deployment/your-app -- which doppler + +# 2. Test Doppler authentication +kubectl exec -it deployment/your-app -- doppler me + +# 3. List available configs +kubectl exec -it deployment/your-app -- doppler configs + +# 4. Test secret retrieval +kubectl exec -it deployment/your-app -- doppler secrets + +# 5. Check environment variables +kubectl exec -it deployment/your-app -- env | sort +``` + +**Network Connectivity Tests:** + +```bash +# Test Doppler API connectivity from pod +kubectl exec -it deployment/your-app -- curl -v https://api.doppler.com/v3/auth/me \ + -H "Authorization: Bearer $DOPPLER_TOKEN" + +# Check DNS resolution +kubectl exec -it deployment/your-app -- nslookup api.doppler.com + +# Test with verbose output +kubectl exec -it deployment/your-app -- doppler secrets --debug +``` + +**Configuration Validation:** + +```bash +# Verify token format (should be dp.st.xxx) +echo $DOPPLER_TOKEN | grep -E '^dp\.st\.' + +# Check token permissions +curl -H "Authorization: Bearer $DOPPLER_TOKEN" \ + "https://api.doppler.com/v3/me" | jq '.access' + +# Validate project and config exist +curl -H "Authorization: Bearer $DOPPLER_TOKEN" \ + "https://api.doppler.com/v3/configs?project=YOUR_PROJECT" | jq '.configs' +``` + +**SleakOps Integration Check:** + +1. 
**Verify Secret Configuration:** + - Secret name matches reference in Docker args + - Secret value is properly base64 encoded + - Secret has correct scope (project/environment) + +2. **Check Deployment Logs:** + ```bash + kubectl logs deployment/your-app --previous | grep -i error + kubectl describe deployment your-app + ``` + +3. **Validate Environment Variables:** + ```bash + # Check if Doppler variables are loaded + kubectl exec deployment/your-app -- env | grep -v '^KUBERNETES' + ``` + + + + + +**1. Doppler Sync Monitoring:** + +```yaml +# Create a monitoring job +apiVersion: batch/v1 +kind: CronJob +metadata: + name: doppler-health-check +spec: + schedule: "*/5 * * * *" # Every 5 minutes + jobTemplate: + spec: + template: + spec: + containers: + - name: health-check + image: dopplerhq/cli:latest + command: + - /bin/sh + - -c + - | + if ! doppler secrets --no-file > /dev/null 2>&1; then + echo "Doppler sync failed at $(date)" + # Send alert to monitoring system + curl -X POST "$WEBHOOK_URL" -d '{"text":"Doppler sync failed"}' + exit 1 + fi + echo "Doppler sync successful at $(date)" +``` + +**2. Application Health Endpoint:** + +```javascript +// Add health check endpoint to your application +app.get('/health/secrets', (req, res) => { + const requiredVars = ['DATABASE_URL', 'API_KEY', 'JWT_SECRET']; + const missing = requiredVars.filter(varName => !process.env[varName]); + + if (missing.length > 0) { + return res.status(503).json({ + status: 'unhealthy', + missing_variables: missing, + last_sync: process.env.DOPPLER_LAST_SYNC || 'unknown' + }); + } + + res.json({ + status: 'healthy', + secrets_loaded: requiredVars.length, + last_sync: process.env.DOPPLER_LAST_SYNC || 'unknown' + }); +}); +``` + +**3. 
Prometheus Metrics:** + +```yaml +# Monitor secret sync metrics +apiVersion: v1 +kind: ConfigMap +metadata: + name: doppler-exporter +data: + exporter.py: | + import time + import requests + import os + from prometheus_client import Gauge, start_http_server + + secret_count = Gauge('doppler_secrets_total', 'Total number of secrets') + sync_timestamp = Gauge('doppler_last_sync_timestamp', 'Last successful sync timestamp') + + def collect_metrics(): + try: + # Use Doppler API to get secret count + response = requests.get( + 'https://api.doppler.com/v3/configs/config/secrets', + headers={'Authorization': f'Bearer {os.environ["DOPPLER_TOKEN"]}'}, + params={'project': os.environ['DOPPLER_PROJECT'], 'config': os.environ['DOPPLER_CONFIG']} + ) + data = response.json() + secret_count.set(len(data['secrets'])) + sync_timestamp.set(time.time()) + except Exception as e: + print(f"Error collecting metrics: {e}") + + if __name__ == '__main__': + start_http_server(8000) + while True: + collect_metrics() + time.sleep(60) +``` + + + + + +**1. Security Best Practices:** + +```bash +# Use least privilege tokens +# Create environment-specific tokens +doppler service-tokens create staging-token --config staging --access read + +# Rotate tokens regularly +doppler service-tokens delete old-token +doppler service-tokens create new-token --config staging --access read +``` + +**2. Configuration Management:** + +```yaml +# Use separate configs for each environment +environments: + development: + doppler: + project: myapp + config: dev + staging: + doppler: + project: myapp + config: staging + production: + doppler: + project: myapp + config: prod +``` + +**3. 
Fallback Strategies:** + +```javascript +// Implement graceful fallback in your application +function getConfig(key, fallback = null) { + // Try Doppler first + let value = process.env[key]; + + // Fallback to Kubernetes secrets + if (!value && process.env[`K8S_SECRET_${key}`]) { + value = process.env[`K8S_SECRET_${key}`]; + } + + // Fallback to default value + if (!value && fallback !== null) { + value = fallback; + } + + if (!value) { + throw new Error(`Required configuration ${key} not found`); + } + + return value; +} + +// Usage +const databaseUrl = getConfig('DATABASE_URL'); +const apiKey = getConfig('API_KEY', 'development-key'); +``` + +**4. Testing and Validation:** + +```bash +#!/bin/bash +# test-doppler-integration.sh + +set -e + +echo "Testing Doppler integration..." + +# Test 1: Authentication +echo "1. Testing authentication..." +doppler me > /dev/null || { + echo "❌ Authentication failed" + exit 1 +} +echo "✅ Authentication successful" + +# Test 2: Config access +echo "2. Testing config access..." +doppler secrets --no-file > /dev/null || { + echo "❌ Config access failed" + exit 1 +} +echo "✅ Config access successful" + +# Test 3: Required secrets present +echo "3. Checking required secrets..." +required_secrets=("DATABASE_URL" "API_KEY" "JWT_SECRET") +for secret in "${required_secrets[@]}"; do + if doppler secrets get "$secret" --no-file > /dev/null 2>&1; then + echo "✅ $secret found" + else + echo "❌ $secret missing" + exit 1 + fi +done + +echo "🎉 All tests passed!" 
+``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/eks-pod-scheduling-tolerations.mdx b/docs/troubleshooting/eks-pod-scheduling-tolerations.mdx new file mode 100644 index 000000000..934dece61 --- /dev/null +++ b/docs/troubleshooting/eks-pod-scheduling-tolerations.mdx @@ -0,0 +1,229 @@ +--- +sidebar_position: 3 +title: "EKS Pod Scheduling Issues with Spot Instances" +description: "Solution for pods failing to schedule on spot instance nodepools in EKS" +date: "2025-02-21" +category: "cluster" +tags: ["eks", "scheduling", "tolerations", "spot-instances", "karpenter"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# EKS Pod Scheduling Issues with Spot Instances + +**Date:** February 21, 2025 +**Category:** Cluster +**Tags:** EKS, Scheduling, Tolerations, Spot Instances, Karpenter + +## Problem Description + +**Context:** After upgrading to a newer version of EKS in SleakOps, pods (such as Elasticsearch) are failing to schedule properly on spot instance nodepools due to missing tolerations configuration. + +**Observed Symptoms:** + +- Pods failing to start or remain in Pending state +- Scheduling errors related to node taints +- Applications not being deployed to spot instance nodepools +- Pod scheduling failures after EKS version upgrades + +**Relevant Configuration:** + +- EKS cluster with spot instance nodepools +- Nodepool name: `spot-amd64` +- Karpenter-managed node provisioning +- Taint key: `karpenter.sh/nodepool` + +**Error Conditions:** + +- Occurs after EKS version upgrades +- Affects deployments without proper tolerations +- Impacts both direct kubectl deployments and Helm-managed applications + +## Detailed Solution + + + +With newer versions of EKS and Karpenter, spot instance nodepools are automatically tainted to prevent regular workloads from being scheduled on them unless explicitly configured. 
This ensures better resource management and cost optimization. + +The taint applied is: + +```yaml +key: karpenter.sh/nodepool +value: spot-amd64 +effect: NoSchedule +``` + +Pods need matching tolerations to be scheduled on these nodes. + + + + + +If you deployed your application directly using `kubectl apply`, add the following tolerations to your deployment YAML: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: elasticsearch +spec: + template: + spec: + tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule + containers: + - name: elasticsearch + image: elasticsearch:7.17.0 + # ... rest of your container configuration +``` + +Apply the updated configuration: + +```bash +kubectl apply -f your-deployment.yaml +``` + + + + + +If your application was deployed using Helm, you need to update the values file or pass the tolerations as parameters: + +**Option 1: Update values.yaml** + +```yaml +# values.yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + +**Option 2: Pass tolerations during Helm install/upgrade** + +```bash +helm upgrade elasticsearch elastic/elasticsearch \ + --set tolerations[0].key=karpenter.sh/nodepool \ + --set tolerations[0].operator=Equal \ + --set tolerations[0].value=spot-amd64 \ + --set tolerations[0].effect=NoSchedule +``` + +**Option 3: Create a custom values file** + +```yaml +# custom-tolerations.yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + +Then apply: + +```bash +helm upgrade elasticsearch elastic/elasticsearch -f custom-tolerations.yaml +``` + + + + + +After applying the tolerations, verify that your pods are scheduling correctly: + +1. **Check pod status:** + +```bash +kubectl get pods -l app=elasticsearch +``` + +2. **Verify pod scheduling:** + +```bash +kubectl describe pod +``` + +3. 
**Check which node the pod is running on:** + +```bash +kubectl get pods -o wide +``` + +4. **Verify the node is part of the spot nodepool:** + +```bash +kubectl describe node | grep -i taint +``` + + + + + +If you have multiple spot nodepools or want to allow scheduling on both spot and on-demand instances, you can add multiple tolerations: + +```yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule + - key: karpenter.sh/nodepool + operator: Equal + value: spot-arm64 + effect: NoSchedule + - key: karpenter.sh/nodepool + operator: Equal + value: on-demand + effect: NoSchedule +``` + +Alternatively, use the `Exists` operator to tolerate any nodepool: + +```yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Exists + effect: NoSchedule +``` + + + + + +**Best Practices:** + +1. **Always include tolerations in your deployment templates** when using spot instances +2. **Use Helm charts with configurable tolerations** for easier management +3. **Test deployments after EKS upgrades** to ensure compatibility +4. **Document your nodepool configuration** for team reference + +**Template for future deployments:** + +```yaml +# deployment-template.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.name }} +spec: + template: + spec: + tolerations: + {{- if .Values.tolerations }} + {{- toYaml .Values.tolerations | nindent 6 }} + {{- end }} + containers: + - name: {{ .Values.name }} + # ... 
container configuration +``` + + + +--- + +_This FAQ was automatically generated on February 21, 2025 based on a real user query._ diff --git a/docs/troubleshooting/environment-specific-configuration-files.mdx b/docs/troubleshooting/environment-specific-configuration-files.mdx new file mode 100644 index 000000000..4415c0c35 --- /dev/null +++ b/docs/troubleshooting/environment-specific-configuration-files.mdx @@ -0,0 +1,1159 @@ +--- +sidebar_position: 3 +title: "Environment-Specific Configuration Files" +description: "How to manage configuration files that change between environments (prod/dev/qa)" +date: "2024-01-15" +category: "project" +tags: ["configuration", "environment", "files", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Environment-Specific Configuration Files + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Configuration, Environment, Files, Deployment + +## Problem Description + +**Context:** User needs to deploy configuration files that vary between different environments (production, development, QA) in SleakOps platform. 
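+
+For example, the same `config.json` may need to ship with different contents per environment (values here mirror the examples used later in this page):
+
+```json
+{ "api_endpoint": "https://api-dev.example.com", "debug_mode": true }
+```
+
+while the production copy would need `"api_endpoint": "https://api.example.com"` and `"debug_mode": false`. Environment variables alone cannot express this when the application reads whole files.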
+ +**Observed Symptoms:** + +- Need to upload different configuration files per environment +- Environment variables are available but insufficient for file-based configuration +- Uncertainty about how to handle file variations across environments +- Configuration files contain environment-specific settings + +**Relevant Configuration:** + +- Multiple environments: prod, dev, qa +- Configuration stored in files (not just environment variables) +- Environment variables are already available +- File content changes based on deployment target + +**Error Conditions:** + +- Unable to deploy different configuration files per environment +- Configuration files contain hardcoded values for specific environments +- Need dynamic file content based on deployment context + +## Detailed Solution + + + +The recommended approach is to use Kubernetes ConfigMaps with environment-specific configurations: + +1. **Create separate ConfigMaps** for each environment +2. **Use environment variables** to reference the correct ConfigMap +3. 
**Mount ConfigMaps as files** in your containers + +```yaml +# config-dev.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config-dev +data: + config.json: | + { + "database_url": "dev-db.example.com", + "api_endpoint": "https://api-dev.example.com", + "debug_mode": true, + "log_level": "debug", + "cache_ttl": 300 + } + app.properties: | + server.port=8080 + spring.datasource.url=jdbc:postgresql://dev-db.example.com:5432/myapp + spring.jpa.hibernate.ddl-auto=update + logging.level.root=DEBUG +--- +# config-prod.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config-prod +data: + config.json: | + { + "database_url": "prod-db.example.com", + "api_endpoint": "https://api.example.com", + "debug_mode": false, + "log_level": "warn", + "cache_ttl": 3600 + } + app.properties: | + server.port=8080 + spring.datasource.url=jdbc:postgresql://prod-db.example.com:5432/myapp + spring.jpa.hibernate.ddl-auto=validate + logging.level.root=WARN +``` + +Then reference these in your deployment: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app +spec: + template: + spec: + containers: + - name: app + image: my-app:latest + volumeMounts: + - name: config-volume + mountPath: /app/config + env: + - name: ENVIRONMENT + value: "dev" # or "prod" depending on environment + volumes: + - name: config-volume + configMap: + name: app-config-dev # Switch based on environment +``` + + + + + +Use template files with placeholders that get replaced with environment variables: + +### 1. 
Configuration Template Approach + +```json +// config.template.json +{ + "database": { + "host": "${DB_HOST}", + "port": ${DB_PORT}, + "name": "${DB_NAME}", + "ssl": ${DB_SSL_ENABLED} + }, + "api": { + "base_url": "${API_BASE_URL}", + "timeout": ${API_TIMEOUT}, + "rate_limit": ${API_RATE_LIMIT} + }, + "features": { + "debug_mode": ${DEBUG_MODE}, + "analytics": ${ANALYTICS_ENABLED}, + "new_feature": ${NEW_FEATURE_FLAG} + }, + "logging": { + "level": "${LOG_LEVEL}", + "format": "${LOG_FORMAT}" + } +} +``` + +### 2. Initialization Script + +```bash +#!/bin/bash +# init-config.sh - Replace template variables with environment values + +CONFIG_TEMPLATE="/app/config/config.template.json" +CONFIG_FILE="/app/config/config.json" + +# Set default values if not provided +export DB_HOST=${DB_HOST:-localhost} +export DB_PORT=${DB_PORT:-5432} +export DB_SSL_ENABLED=${DB_SSL_ENABLED:-false} +export API_TIMEOUT=${API_TIMEOUT:-30000} +export DEBUG_MODE=${DEBUG_MODE:-false} +export LOG_LEVEL=${LOG_LEVEL:-info} + +# Replace environment variables in template +envsubst < "$CONFIG_TEMPLATE" > "$CONFIG_FILE" + +echo "Configuration file generated:" +cat "$CONFIG_FILE" + +# Start the application +exec "$@" +``` + +### 3. Docker Implementation + +```dockerfile +# Dockerfile +FROM node:18-alpine + +WORKDIR /app + +# Copy template and initialization script +COPY config.template.json /app/config/ +COPY init-config.sh /usr/local/bin/ +RUN chmod +x /usr/local/bin/init-config.sh + +# Install envsubst for template substitution +RUN apk add --no-cache gettext + +# Copy application files +COPY . . +RUN npm install + +# Use init script as entrypoint +ENTRYPOINT ["init-config.sh"] +CMD ["node", "server.js"] +``` + +### 4. 
SleakOps Variable Groups Configuration
+
+```yaml
+# Development Environment Variables
+DB_HOST: "dev-postgres.cluster.local"
+DB_PORT: "5432"
+DB_NAME: "myapp_dev"
+DB_SSL_ENABLED: "false"
+API_BASE_URL: "https://api-dev.example.com"
+API_TIMEOUT: "30000"
+API_RATE_LIMIT: "1000"
+DEBUG_MODE: "true"
+ANALYTICS_ENABLED: "false"
+NEW_FEATURE_FLAG: "true"
+LOG_LEVEL: "debug"
+LOG_FORMAT: "pretty"
+---
+# Production Environment Variables
+DB_HOST: "prod-postgres.cluster.local"
+DB_PORT: "5432"
+DB_NAME: "myapp_prod"
+DB_SSL_ENABLED: "true"
+API_BASE_URL: "https://api.example.com"
+API_TIMEOUT: "60000"
+API_RATE_LIMIT: "10000"
+DEBUG_MODE: "false"
+ANALYTICS_ENABLED: "true"
+NEW_FEATURE_FLAG: "false"
+LOG_LEVEL: "warn"
+LOG_FORMAT: "json"
+```
+
+
+
+
+
+
+For complex applications with multiple configuration files:
+
+### 1. Directory Structure Approach
+
+```
+config/
+├── base/
+│   ├── app.yaml
+│   ├── logging.yaml
+│   └── security.yaml
+├── development/
+│   ├── database.yaml
+│   ├── cache.yaml
+│   └── overrides.yaml
+├── production/
+│   ├── database.yaml
+│   ├── cache.yaml
+│   └── overrides.yaml
+└── qa/
+    ├── database.yaml
+    ├── cache.yaml
+    └── overrides.yaml
+```
+
+### 2. 
Configuration Merger Script + +```python +#!/usr/bin/env python3 +# merge-config.py +import os +import yaml +import json +import sys +from pathlib import Path + +def load_yaml_file(file_path): + """Load YAML file and return parsed content""" + try: + with open(file_path, 'r') as f: + return yaml.safe_load(f) + except FileNotFoundError: + return {} + except yaml.YAMLError as e: + print(f"Error parsing YAML file {file_path}: {e}") + sys.exit(1) + +def deep_merge(base_dict, override_dict): + """Deep merge two dictionaries""" + result = base_dict.copy() + + for key, value in override_dict.items(): + if (key in result and + isinstance(result[key], dict) and + isinstance(value, dict)): + result[key] = deep_merge(result[key], value) + else: + result[key] = value + + return result + +def merge_configurations(environment, config_dir="/app/config"): + """Merge base configuration with environment-specific overrides""" + config_path = Path(config_dir) + base_path = config_path / "base" + env_path = config_path / environment + + # Load base configurations + base_config = {} + if base_path.exists(): + for config_file in base_path.glob("*.yaml"): + file_config = load_yaml_file(config_file) + base_config = deep_merge(base_config, file_config) + + # Load environment-specific configurations + env_config = {} + if env_path.exists(): + for config_file in env_path.glob("*.yaml"): + file_config = load_yaml_file(config_file) + env_config = deep_merge(env_config, file_config) + + # Merge configurations + final_config = deep_merge(base_config, env_config) + + return final_config + +def main(): + environment = os.getenv('ENVIRONMENT', 'development') + output_format = os.getenv('CONFIG_FORMAT', 'json') + output_file = os.getenv('CONFIG_OUTPUT', '/app/config/merged-config.json') + + print(f"Merging configuration for environment: {environment}") + + merged_config = merge_configurations(environment) + + # Write merged configuration + os.makedirs(os.path.dirname(output_file), exist_ok=True) + + 
if output_format.lower() == 'yaml': + with open(output_file, 'w') as f: + yaml.dump(merged_config, f, default_flow_style=False) + else: + with open(output_file, 'w') as f: + json.dump(merged_config, f, indent=2) + + print(f"Configuration written to: {output_file}") + +if __name__ == "__main__": + main() +``` + +### 3. Base Configuration Files + +```yaml +# config/base/app.yaml +app: + name: "MyApplication" + version: "1.0.0" + port: 8080 + +logging: + level: "info" + format: "json" + +security: + session_timeout: 3600 + max_login_attempts: 5 +``` + +```yaml +# config/base/logging.yaml +logging: + handlers: + console: + enabled: true + level: "info" + file: + enabled: false + path: "/var/log/app.log" + level: "debug" + max_size: "100MB" + max_files: 5 +``` + +### 4. Environment-Specific Overrides + +```yaml +# config/development/database.yaml +database: + host: "localhost" + port: 5432 + name: "myapp_dev" + username: "dev_user" + password: "dev_password" + ssl_mode: "disable" + pool_size: 10 + +cache: + type: "redis" + host: "localhost" + port: 6379 + database: 0 + ttl: 300 +``` + +```yaml +# config/production/database.yaml +database: + host: "${DB_HOST}" + port: "${DB_PORT}" + name: "${DB_NAME}" + username: "${DB_USERNAME}" + password: "${DB_PASSWORD}" + ssl_mode: "require" + pool_size: 50 + connection_timeout: 30000 + +cache: + type: "redis" + host: "${REDIS_HOST}" + port: "${REDIS_PORT}" + database: "${REDIS_DB}" + password: "${REDIS_PASSWORD}" + ttl: 3600 + cluster_mode: true +``` + +### 5. Docker Integration + +```dockerfile +FROM python:3.9-alpine + +WORKDIR /app + +# Install dependencies +RUN pip install pyyaml + +# Copy configuration merger script +COPY merge-config.py /usr/local/bin/ +RUN chmod +x /usr/local/bin/merge-config.py + +# Copy configuration files +COPY config/ /app/config/ + +# Copy application +COPY . 
/app + +# Merge configuration and start app +CMD ["sh", "-c", "python /usr/local/bin/merge-config.py && python app.py"] +``` + + + + + +For configurations containing sensitive data like passwords and API keys: + +### 1. Kubernetes Secrets Integration + +```yaml +# secrets.yaml +apiVersion: v1 +kind: Secret +metadata: + name: app-secrets-dev +type: Opaque +data: + db-password: ZGV2X3Bhc3N3b3Jk # base64 encoded + api-key: YWJjZGVmZ2hpams= + jwt-secret: c3VwZXJfc2VjcmV0X2tleQ== +--- +apiVersion: v1 +kind: Secret +metadata: + name: app-secrets-prod +type: Opaque +data: + db-password: cHJvZF9wYXNzd29yZA== # base64 encoded + api-key: eHl6MTIzNDU2Nzg5 + jwt-secret: cHJvZF9qd3Rfc2VjcmV0 +``` + +### 2. Configuration with Secret References + +```json +{ + "database": { + "host": "${DB_HOST}", + "port": "${DB_PORT}", + "username": "${DB_USERNAME}", + "password": "${DB_PASSWORD}", + "ssl": true + }, + "api": { + "base_url": "${API_BASE_URL}", + "key": "${API_KEY}", + "timeout": 30000 + }, + "jwt": { + "secret": "${JWT_SECRET}", + "expiration": "24h" + } +} +``` + +### 3. Deployment with Secrets + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app +spec: + template: + spec: + containers: + - name: app + image: my-app:latest + env: + - name: DB_PASSWORD + valueFrom: + secretKeyRef: + name: app-secrets-dev + key: db-password + - name: API_KEY + valueFrom: + secretKeyRef: + name: app-secrets-dev + key: api-key + - name: JWT_SECRET + valueFrom: + secretKeyRef: + name: app-secrets-dev + key: jwt-secret + volumeMounts: + - name: config-volume + mountPath: /app/config + - name: secret-volume + mountPath: /app/secrets + readOnly: true + volumes: + - name: config-volume + configMap: + name: app-config-dev + - name: secret-volume + secret: + secretName: app-secrets-dev +``` + +### 4. 
Secure Configuration Loader + +```python +# secure_config.py +import os +import json +import base64 +from pathlib import Path + +class SecureConfigLoader: + def __init__(self, config_path="/app/config", secrets_path="/app/secrets"): + self.config_path = Path(config_path) + self.secrets_path = Path(secrets_path) + + def load_secret(self, secret_name): + """Load secret from mounted secret volume""" + secret_file = self.secrets_path / secret_name + if secret_file.exists(): + return secret_file.read_text().strip() + return os.getenv(secret_name.upper().replace('-', '_')) + + def substitute_secrets(self, config_data): + """Replace secret placeholders in configuration""" + if isinstance(config_data, dict): + for key, value in config_data.items(): + config_data[key] = self.substitute_secrets(value) + elif isinstance(config_data, list): + for i, item in enumerate(config_data): + config_data[i] = self.substitute_secrets(item) + elif isinstance(config_data, str): + if config_data.startswith("${") and config_data.endswith("}"): + env_var = config_data[2:-1] + secret_value = self.load_secret(env_var.lower().replace('_', '-')) + if secret_value: + return secret_value + return os.getenv(env_var, config_data) + return config_data + + def load_config(self, config_file="config.json"): + """Load and process configuration file""" + config_path = self.config_path / config_file + + with open(config_path, 'r') as f: + config = json.load(f) + + # Substitute secrets and environment variables + return self.substitute_secrets(config) + +# Usage example +loader = SecureConfigLoader() +config = loader.load_config() +``` + + + + + +### 1. 
Configuration Schema Validation
+
+```python
+# config_validator.py
+import json
+import os
+import sys
+import jsonschema
+from jsonschema import validate, ValidationError
+
+# Define configuration schema
+CONFIG_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "database": {
+            "type": "object",
+            "properties": {
+                "host": {"type": "string"},
+                "port": {"type": "integer", "minimum": 1, "maximum": 65535},
+                "name": {"type": "string"},
+                "username": {"type": "string"},
+                "password": {"type": "string"},
+                "ssl": {"type": "boolean"}
+            },
+            "required": ["host", "port", "name", "username", "password"]
+        },
+        "api": {
+            "type": "object",
+            "properties": {
+                "base_url": {"type": "string", "format": "uri"},
+                "key": {"type": "string"},
+                "timeout": {"type": "integer", "minimum": 1000}
+            },
+            "required": ["base_url", "key"]
+        },
+        "logging": {
+            "type": "object",
+            "properties": {
+                "level": {"enum": ["debug", "info", "warn", "error"]},
+                "format": {"enum": ["json", "text", "pretty"]}
+            }
+        }
+    },
+    "required": ["database", "api"]
+}
+
+def validate_config(config_data):
+    """Validate configuration against schema"""
+    try:
+        validate(instance=config_data, schema=CONFIG_SCHEMA)
+        print("✓ Configuration validation passed")
+        return True
+    except ValidationError as e:
+        print(f"✗ Configuration validation failed: {e.message}")
+        print(f"Failed at path: {' -> '.join(str(p) for p in e.path)}")
+        return False
+
+def validate_environment_vars(required_vars):
+    """Validate that required environment variables are set"""
+    missing_vars = []
+    for var in required_vars:
+        if not os.getenv(var):
+            missing_vars.append(var)
+
+    if missing_vars:
+        print(f"✗ Missing required environment variables: {', '.join(missing_vars)}")
+        return False
+
+    print("✓ All required environment variables are set")
+    return True
+
+# Usage in application startup
+def startup_validation():
+    """Perform comprehensive startup validation"""
+    required_env_vars = ["DB_HOST", "DB_PASSWORD", "API_KEY", "JWT_SECRET"]
+
+    # Validate 
environment variables + if not validate_environment_vars(required_env_vars): + sys.exit(1) + + # Load and validate configuration + config = load_config() + if not validate_config(config): + sys.exit(1) + + return config +``` + +### 2. Configuration Change Detection + +```python +# config_monitor.py +import os +import hashlib +import time +import signal +import sys +from pathlib import Path + +class ConfigMonitor: + def __init__(self, config_files, callback=None): + self.config_files = [Path(f) for f in config_files] + self.file_hashes = {} + self.callback = callback + self.running = False + + def calculate_file_hash(self, file_path): + """Calculate MD5 hash of file content""" + try: + with open(file_path, 'rb') as f: + return hashlib.md5(f.read()).hexdigest() + except FileNotFoundError: + return None + + def initialize_hashes(self): + """Initialize file hashes""" + for config_file in self.config_files: + self.file_hashes[config_file] = self.calculate_file_hash(config_file) + + def check_for_changes(self): + """Check if any configuration files have changed""" + changes = [] + + for config_file in self.config_files: + current_hash = self.calculate_file_hash(config_file) + stored_hash = self.file_hashes.get(config_file) + + if current_hash != stored_hash: + changes.append(config_file) + self.file_hashes[config_file] = current_hash + + return changes + + def start_monitoring(self, interval=5): + """Start monitoring configuration files""" + self.running = True + self.initialize_hashes() + + print(f"Monitoring {len(self.config_files)} configuration files...") + + while self.running: + try: + changes = self.check_for_changes() + + if changes: + print(f"Configuration changes detected: {[str(f) for f in changes]}") + if self.callback: + self.callback(changes) + + time.sleep(interval) + + except KeyboardInterrupt: + self.stop_monitoring() + + def stop_monitoring(self): + """Stop monitoring""" + self.running = False + print("Configuration monitoring stopped") + +# Usage +def 
on_config_change(changed_files):
+    """Handle configuration file changes"""
+    print("Reloading configuration...")
+    try:
+        new_config = load_config()
+        if validate_config(new_config):
+            # Apply new configuration
+            apply_config(new_config)
+            print("Configuration reloaded successfully")
+        else:
+            print("Invalid configuration detected, keeping current configuration")
+    except Exception as e:
+        print(f"Error reloading configuration: {e}")
+
+# Start monitoring in a separate thread
+import threading
+
+monitor = ConfigMonitor([
+    "/app/config/config.json",
+    "/app/config/database.yaml",
+    "/app/config/logging.yaml"
+], callback=on_config_change)
+
+monitor_thread = threading.Thread(target=monitor.start_monitoring)
+monitor_thread.daemon = True
+monitor_thread.start()
+```
+
+### 3. Health Check with Configuration Status
+
+```python
+# health_check.py
+import datetime
+import os
+
+from flask import Flask, jsonify
+
+app = Flask(__name__)
+
+@app.route('/health/config')
+def config_health():
+    """Health check endpoint that includes configuration status"""
+    try:
+        # Load current configuration
+        config = load_config()
+
+        # Validate configuration
+        is_valid = validate_config(config)
+
+        # Check configuration file timestamps
+        config_files = [
+            "/app/config/config.json",
+            "/app/config/database.yaml"
+        ]
+
+        file_info = {}
+        for config_file in config_files:
+            if os.path.exists(config_file):
+                stat = os.stat(config_file)
+                file_info[config_file] = {
+                    "exists": True,
+                    "size": stat.st_size,
+                    "modified": datetime.datetime.fromtimestamp(stat.st_mtime).isoformat()
+                }
+            else:
+                file_info[config_file] = {"exists": False}
+
+        return jsonify({
+            "status": "healthy" if is_valid else "unhealthy",
+            "timestamp": datetime.datetime.utcnow().isoformat(),
+            "configuration": {
+                "valid": is_valid,
+                "environment": os.getenv("ENVIRONMENT", "unknown"),
+                "files": file_info
+            }
+        }), 200 if is_valid else 503
+
+    except Exception as e:
+        return jsonify({
+            "status": "unhealthy",
+            "error": str(e),
+            
"timestamp": datetime.datetime.utcnow().isoformat() + }), 503 + +if __name__ == "__main__": + app.run(host="0.0.0.0", port=8080) +``` + + + + + +### 1. Configuration Hierarchy + +Implement a clear configuration hierarchy: + +``` +1. Default/Base Configuration (lowest priority) +2. Environment-Specific Configuration +3. ConfigMap/Volume Mounts +4. Environment Variables +5. Runtime Arguments (highest priority) +``` + +### 2. Security Best Practices + +```python +# secure_config_practices.py + +class SecureConfigManager: + """Best practices for secure configuration management""" + + SENSITIVE_KEYS = ['password', 'secret', 'key', 'token', 'credential'] + + def __init__(self): + self.config = {} + + def load_config_securely(self, config_file): + """Load configuration with security checks""" + config = self.load_raw_config(config_file) + + # 1. Validate no secrets in configuration files + self.validate_no_hardcoded_secrets(config) + + # 2. Substitute environment variables + config = self.substitute_env_vars(config) + + # 3. Load secrets from secure sources + config = self.load_secrets_securely(config) + + # 4. 
Validate final configuration + self.validate_config(config) + + return config + + def validate_no_hardcoded_secrets(self, config, path=""): + """Ensure no hardcoded secrets in configuration""" + if isinstance(config, dict): + for key, value in config.items(): + current_path = f"{path}.{key}" if path else key + + # Check if key suggests sensitive data + if any(sensitive in key.lower() for sensitive in self.SENSITIVE_KEYS): + if isinstance(value, str) and not value.startswith("${"): + print(f"WARNING: Potential hardcoded secret at {current_path}") + + self.validate_no_hardcoded_secrets(value, current_path) + elif isinstance(config, list): + for i, item in enumerate(config): + self.validate_no_hardcoded_secrets(item, f"{path}[{i}]") + + def mask_sensitive_config(self, config): + """Create a safe version of config for logging""" + safe_config = {} + + def mask_value(key, value): + if any(sensitive in key.lower() for sensitive in self.SENSITIVE_KEYS): + return "*" * 8 + return value + + def process_dict(d, result): + for key, value in d.items(): + if isinstance(value, dict): + result[key] = {} + process_dict(value, result[key]) + else: + result[key] = mask_value(key, value) + + process_dict(config, safe_config) + return safe_config +``` + +### 3. 
Deployment Best Practices + +```yaml +# deployment-best-practices.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: my-app + labels: + app: my-app + version: "1.0.0" +spec: + replicas: 3 + selector: + matchLabels: + app: my-app + template: + metadata: + labels: + app: my-app + annotations: + config.version: "v1.2.3" # Track config version + spec: + containers: + - name: app + image: my-app:latest + + # Environment variables for non-sensitive config + env: + - name: ENVIRONMENT + value: "production" + - name: LOG_LEVEL + value: "warn" + - name: CONFIG_FILE + value: "/app/config/config.json" + + # Sensitive data from secrets + - name: DB_PASSWORD + valueFrom: + secretKeyRef: + name: app-secrets + key: db-password + + # Configuration from ConfigMaps + volumeMounts: + - name: config-volume + mountPath: /app/config + readOnly: true + - name: secret-volume + mountPath: /app/secrets + readOnly: true + + # Health checks that verify configuration + readinessProbe: + httpGet: + path: /health/config + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + + # Resource limits + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" + + volumes: + - name: config-volume + configMap: + name: app-config + - name: secret-volume + secret: + secretName: app-secrets + + # Security context + securityContext: + runAsNonRoot: true + runAsUser: 1000 + fsGroup: 2000 +``` + +### 4. 
Testing Configuration + +```python +# test_configuration.py +import pytest +import tempfile +import json +import os +from pathlib import Path + +class TestConfiguration: + """Test suite for configuration management""" + + @pytest.fixture + def temp_config_dir(self): + """Create temporary config directory""" + with tempfile.TemporaryDirectory() as temp_dir: + yield Path(temp_dir) + + @pytest.fixture + def sample_configs(self, temp_config_dir): + """Create sample configuration files""" + # Base config + base_config = { + "app": {"name": "testapp", "port": 8080}, + "database": {"host": "${DB_HOST}", "port": 5432} + } + + # Environment configs + dev_config = {"database": {"host": "dev-db.local"}} + prod_config = {"database": {"host": "prod-db.local"}} + + # Write configs + base_dir = temp_config_dir / "base" + base_dir.mkdir() + (base_dir / "app.json").write_text(json.dumps(base_config)) + + dev_dir = temp_config_dir / "development" + dev_dir.mkdir() + (dev_dir / "overrides.json").write_text(json.dumps(dev_config)) + + prod_dir = temp_config_dir / "production" + prod_dir.mkdir() + (prod_dir / "overrides.json").write_text(json.dumps(prod_config)) + + return temp_config_dir + + def test_config_merging(self, sample_configs): + """Test configuration merging logic""" + # Set environment + os.environ["ENVIRONMENT"] = "development" + + # Load merged config + merged = merge_configurations("development", str(sample_configs)) + + assert merged["app"]["name"] == "testapp" + assert merged["database"]["host"] == "dev-db.local" + assert merged["database"]["port"] == 5432 + + def test_environment_substitution(self, sample_configs): + """Test environment variable substitution""" + os.environ["DB_HOST"] = "custom-db.local" + os.environ["ENVIRONMENT"] = "production" + + config = load_and_substitute_config(str(sample_configs)) + + assert config["database"]["host"] == "custom-db.local" + + def test_config_validation(self): + """Test configuration validation""" + valid_config = { + 
"database": {
+                "host": "localhost",
+                "port": 5432,
+                "name": "testdb",
+                "username": "user",
+                "password": "pass"
+            },
+            "api": {
+                "base_url": "https://api.example.com",
+                "key": "test-key"
+            }
+        }
+
+        assert validate_config(valid_config) == True
+
+        # Test invalid config
+        invalid_config = {"database": {"host": "localhost"}}  # Missing required fields
+        assert validate_config(invalid_config) == False
+```
+
+### 5. Documentation Template
+
+````markdown
+# Configuration Management Guide
+
+## Overview
+
+This application uses a hierarchical configuration system supporting multiple environments.
+
+## Configuration Files
+
+### Structure
+
+```
+config/
+├── base/         # Base configuration (all environments)
+├── development/  # Development overrides
+├── staging/      # Staging overrides
+└── production/   # Production overrides
+```
+
+### Environment Variables
+
+| Variable | Required | Description | Default |
+|----------|----------|-------------|---------|
+| ENVIRONMENT | Yes | Deployment environment | development |
+| DB_HOST | Yes | Database hostname | localhost |
+| DB_PASSWORD | Yes | Database password | (none) |
+| API_KEY | Yes | External API key | (none) |
+
+### Configuration Validation
+
+- All configuration is validated against JSON schema on startup
+- Missing required fields will cause startup failure
+- Invalid values will be logged as warnings
+
+### Security
+
+- Sensitive values must use environment variables or mounted secrets
+- Configuration files should never contain hardcoded secrets
+- All secret access is logged for audit purposes
+
+### Monitoring
+
+- Configuration health endpoint: `/health/config`
+- Configuration changes are detected and logged
+- Metrics include config validation status and file timestamps
+````
+
+
+
+---
+
+_This FAQ was automatically generated on January 15, 2024 based on a real user query._ 
diff --git a/docs/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx 
b/docs/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx new file mode 100644 index 000000000..46e8ec971 --- /dev/null +++ b/docs/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx @@ -0,0 +1,225 @@ +--- +sidebar_position: 3 +title: "Environment Variables Not Accessible in NuxtJS Application" +description: "Solution for environment variables not being received in NuxtJS applications deployed on SleakOps" +date: "2024-01-15" +category: "project" +tags: ["nuxtjs", "environment-variables", "deployment", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Environment Variables Not Accessible in NuxtJS Application + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** NuxtJS, Environment Variables, Deployment, Configuration + +## Problem Description + +**Context:** User has deployed a NuxtJS application on SleakOps and configured environment variables through the platform, but the application cannot access these variables at runtime. + +**Observed Symptoms:** + +- Environment variables are configured in SleakOps platform +- Variables are not accessible within the NuxtJS application +- Application behavior suggests missing environment configuration +- Variables may appear as undefined or empty + +**Relevant Configuration:** + +- Framework: NuxtJS +- Platform: SleakOps +- Environment variables: Configured through platform interface +- Deployment type: Container-based deployment + +**Error Conditions:** + +- Variables are not available at application runtime +- Issue occurs after successful deployment +- Problem persists across application restarts +- May affect both server-side and client-side variable access + +## Detailed Solution + + + +NuxtJS has specific requirements for environment variables: + +1. **Server-side variables**: Available in `process.env` +2. 
**Client-side variables**: Must be explicitly exposed using `publicRuntimeConfig` or `privateRuntimeConfig` +3. **Build-time variables**: Need to be available during the build process + +```javascript +// nuxt.config.js +export default { + // Server-side only + privateRuntimeConfig: { + apiSecret: process.env.API_SECRET, + }, + // Exposed to client-side + publicRuntimeConfig: { + apiUrl: process.env.API_URL || "https://api.example.com", + }, +}; +``` + + + + + +To properly configure environment variables in SleakOps: + +1. **Go to your Project Settings** +2. **Navigate to Environment Variables section** +3. **Add variables with proper naming**: + + - Use `NUXT_` prefix for automatic Nuxt recognition + - Example: `NUXT_API_URL`, `NUXT_DATABASE_URL` + +4. **Ensure variables are available at build time**: + - Mark critical variables as "Build Time" if they're needed during compilation + - Mark runtime variables as "Runtime" for application execution + + + + + +Ensure your Dockerfile properly handles environment variables: + +```dockerfile +# Dockerfile example for NuxtJS +FROM node:18-alpine + +WORKDIR /app + +# Copy package files +COPY package*.json ./ +RUN npm ci --only=production + +# Copy source code +COPY . . 
+
+# Build the application (environment variables needed here)
+ARG NUXT_API_URL
+ARG NUXT_APP_NAME
+ENV NUXT_API_URL=$NUXT_API_URL
+ENV NUXT_APP_NAME=$NUXT_APP_NAME
+
+RUN npm run build
+
+# Expose port
+EXPOSE 3000
+
+# Start the application
+CMD ["npm", "start"]
+```
+
+
+
+
+
+
+For Nuxt 3 applications, use the new runtime configuration:
+
+```typescript
+// nuxt.config.ts
+export default defineNuxtConfig({
+  runtimeConfig: {
+    // Private keys (only available on server-side)
+    apiSecret: process.env.API_SECRET,
+    // Public keys (exposed to client-side)
+    public: {
+      apiUrl: process.env.NUXT_PUBLIC_API_URL || "https://api.example.com",
+      appName: process.env.NUXT_PUBLIC_APP_NAME || "My App",
+    },
+  },
+});
+```
+
+Access variables in your application:
+
+```vue
+<script setup>
+// useRuntimeConfig() is Nuxt 3's built-in composable for reading runtimeConfig
+const config = useRuntimeConfig();
+</script>
+
+<template>
+  <div>
+    <h1>{{ config.public.appName }}</h1>
+    <p>API base URL: {{ config.public.apiUrl }}</p>
+  </div>
+</template>
+```
+
+
+
+
+
+
+1. **Verify variable names in SleakOps**:
+
+   - Check for typos in variable names
+   - Ensure consistent naming between platform and code
+
+2. **Check build logs**:
+
+   - Look for environment variable availability during build
+   - Verify no build-time errors related to missing variables
+
+3. **Test locally**:
+
+   ```bash
+   # Create .env file for local testing
+   NUXT_PUBLIC_API_URL=https://api.example.com
+   NUXT_PUBLIC_APP_NAME=Test App
+   API_SECRET=your-secret-key
+
+   # Run locally
+   npm run dev
+   ```
+
+4. 
**Debug in production**: + ```javascript + // Add temporary logging + console.log("Environment check:", { + nodeEnv: process.env.NODE_ENV, + apiUrl: process.env.NUXT_PUBLIC_API_URL, + hasSecret: !!process.env.API_SECRET, + }); + ``` + + + + + +**Solution 1: Update variable naming** + +- Prefix public variables with `NUXT_PUBLIC_` +- Use consistent casing (UPPER_CASE for env vars) + +**Solution 2: Rebuild application** + +- After changing environment variables, trigger a new deployment +- Ensure build process picks up new variables + +**Solution 3: Check variable scope** + +- Verify if variables should be build-time or runtime +- Configure accordingly in SleakOps platform + +**Solution 4: Update nuxt.config** + +- Ensure all required variables are properly configured +- Test both server-side and client-side access + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/environment-variables-not-working-during-migration.mdx b/docs/troubleshooting/environment-variables-not-working-during-migration.mdx new file mode 100644 index 000000000..6f8c8dc7a --- /dev/null +++ b/docs/troubleshooting/environment-variables-not-working-during-migration.mdx @@ -0,0 +1,184 @@ +--- +sidebar_position: 3 +title: "Environment Variables Not Working During Platform Migration" +description: "Solution for environment variables not reaching applications during SleakOps maintenance migrations" +date: "2024-04-24" +category: "project" +tags: + [ + "environment-variables", + "secrets", + "migration", + "deployment", + "troubleshooting", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Environment Variables Not Working During Platform Migration + +**Date:** April 24, 2024 +**Category:** Project +**Tags:** Environment Variables, Secrets, Migration, Deployment, Troubleshooting + +## Problem Description + +**Context:** During SleakOps platform maintenance 
migrations, environment variables and secrets may temporarily become unavailable to applications, causing deployment and runtime issues. + +**Observed Symptoms:** + +- Environment variables not being sent to applications after build/deploy +- Variables showing as undefined or empty in application logs +- Error when trying to edit variables in the SleakOps platform +- Applications failing to connect to APIs or external services +- Variables that were working before suddenly stop functioning + +**Relevant Configuration:** + +- Platform: SleakOps +- Affected: Environment variables and secrets +- Timing: During maintenance migrations +- Impact: Applications cannot access configuration + +**Error Conditions:** + +- Occurs during platform maintenance windows +- Variables become inaccessible after new builds/deploys +- Edit functionality disabled temporarily +- Applications may point to wrong environments (QA instead of production) + +## Detailed Solution + + + +During SleakOps maintenance migrations: + +1. **Secrets backup process**: The platform creates backups of secrets in your AWS account +2. **Temporary edit disable**: Variable editing is disabled during migration +3. **Existing secrets remain**: Variables still exist in the cluster but may not propagate to new deployments +4. **Build/deploy impact**: New builds may not receive updated environment variables + +**Important**: Avoid triggering new builds/deploys during active migrations. + + + + + +If you need immediate functionality during migration: + +1. **Application-level configuration**: Temporarily hardcode critical values in your application +2. **Environment switching**: Point development/staging to production APIs as temporary measure +3. **Avoid new deployments**: Don't trigger builds/deploys until migration completes +4. 
**Cache existing deployments**: Use currently running instances that have the variables
+
+```javascript
+// Example: Temporary hardcoded fallback
+const API_URL = process.env.API_URL || "https://api.production.example.com";
+const API_KEY = process.env.API_KEY || "fallback-key-for-emergency";
+```
+
+
+
+
+
+Once migration is complete:
+
+1. **Check variable availability**: Verify all environment variables are accessible
+2. **Test in build process**: Trigger a test build to confirm variables are injected
+3. **Validate application startup**: Ensure applications receive all required configuration
+4. **Monitor logs**: Check application logs for any missing variables
+
+```javascript
+// Example: Debug environment variables in your application
+console.log('Environment variables check:');
+console.log('API_URL:', process.env.API_URL);
+console.log('DATABASE_URL:', process.env.DATABASE_URL ? 'SET' : 'MISSING');
+console.log('SECRET_KEY:', process.env.SECRET_KEY ? 'SET' : 'MISSING');
+```
+
+
+
+
+
+After migration completion:
+
+1. **Edit access restored**: Variable editing functionality will be re-enabled
+2. **Update if needed**: Make any necessary changes to environment variables
+3. **Deploy with new variables**: Trigger new deployment to apply any updates
+4. **Verify propagation**: Confirm variables reach all application instances
+
+**Steps to verify edit functionality**:
+
+1. Go to your project's environment variables section
+2. Try editing a non-critical variable
+3. Save changes and deploy
+4. Verify the change is reflected in your application
+
+
+
+
+
+To minimize impact during future migrations:
+
+1. **Monitor maintenance announcements**: Stay informed about planned migrations
+2. **Implement graceful fallbacks**: Design applications to handle missing variables
+3. **Use configuration management**: Implement proper config management patterns
+4. **Schedule deployments**: Avoid deployments during known maintenance windows
+5. 
**Test variable dependencies**: Regularly test what happens when variables are missing + +```javascript +// Example: Robust configuration handling +class Config { + constructor() { + this.apiUrl = this.getRequiredEnv("API_URL"); + this.dbUrl = this.getRequiredEnv("DATABASE_URL"); + this.secretKey = this.getRequiredEnv("SECRET_KEY"); + } + + getRequiredEnv(key) { + const value = process.env[key]; + if (!value) { + console.error(`Missing required environment variable: ${key}`); + // Implement fallback or graceful degradation + return this.getFallbackValue(key); + } + return value; + } + + getFallbackValue(key) { + // Implement appropriate fallback logic + const fallbacks = { + API_URL: "https://api.fallback.example.com", + // Add other fallbacks as needed + }; + return fallbacks[key] || null; + } +} +``` + + + + + +When experiencing variable issues during migrations: + +1. **Contact support immediately**: Report the issue with specific details +2. **Provide context**: Include which variables are affected and when the issue started +3. **Avoid multiple deployments**: Don't repeatedly try to deploy until issue is resolved +4. **Document workarounds**: Keep track of any temporary fixes applied +5. 
**Wait for confirmation**: Get confirmation that migration is complete before resuming normal operations + +**Information to include in support requests**: + +- Project name and environment affected +- Specific variables that aren't working +- When the issue was first noticed +- Any error messages from the platform +- Screenshots of variable configuration if helpful + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/extending-charts-custom-ingress.mdx b/docs/troubleshooting/extending-charts-custom-ingress.mdx new file mode 100644 index 000000000..7ba056b66 --- /dev/null +++ b/docs/troubleshooting/extending-charts-custom-ingress.mdx @@ -0,0 +1,261 @@ +--- +sidebar_position: 3 +title: "Adding Custom Ingress Configuration Using Extended Charts" +description: "How to configure persistent custom Ingress resources using SleakOps Extended Charts feature" +date: "2024-12-26" +category: "project" +tags: ["ingress", "helm", "extended-charts", "aws-load-balancer", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Adding Custom Ingress Configuration Using Extended Charts + +**Date:** December 26, 2024 +**Category:** Project +**Tags:** Ingress, Helm, Extended Charts, AWS Load Balancer, Kubernetes + +## Problem Description + +**Context:** Users need to add custom Ingress configurations to their SleakOps projects that persist across deployments and updates. The standard project configuration may not cover all specific Ingress requirements like custom annotations, multiple hosts, or specific AWS Load Balancer Controller configurations. 
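A detail worth checking early: the ALB `actions.*` annotations used for redirects in this guide carry their configuration as a JSON string embedded in YAML, and a malformed payload is a common reason a hand-written Ingress silently misbehaves. A minimal local sanity check in Node.js — the annotation value and field names here are illustrative:

```javascript
// ALB action annotations (alb.ingress.kubernetes.io/actions.*) embed their
// configuration as a JSON string inside YAML. Parsing the value locally
// catches typos before the manifest ever reaches the cluster.
const redirectAction = `{
  "Type": "redirect",
  "RedirectConfig": {
    "Protocol": "HTTPS",
    "Port": "443",
    "StatusCode": "HTTP_301",
    "Host": "example.com"
  }
}`;

function checkAlbAction(raw) {
  const action = JSON.parse(raw); // throws with a useful message on bad JSON
  if (action.Type === "redirect" && !action.RedirectConfig) {
    throw new Error("redirect action is missing RedirectConfig");
  }
  return action;
}

console.log(checkAlbAction(redirectAction).RedirectConfig.StatusCode); // → HTTP_301
```

A stray comma or unbalanced brace fails fast here instead of producing a confusing ALB state later.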
+ +**Observed Symptoms:** + +- Need to manually configure Ingress resources after each deployment +- Custom Ingress configurations get lost during project updates +- Requirement for specific AWS ALB annotations and configurations +- Need to handle multiple domains and SSL certificates + +**Relevant Configuration:** + +- Platform: SleakOps with Kubernetes +- Load Balancer: AWS Application Load Balancer (ALB) +- Ingress Controller: AWS Load Balancer Controller +- SSL/TLS: AWS Certificate Manager integration + +**Error Conditions:** + +- Ingress configurations don't persist across deployments +- Manual configurations are overwritten during updates +- Need for complex routing rules and redirects + +## Detailed Solution + + + +To add persistent custom Ingress configurations: + +1. Navigate to your project in SleakOps +2. Go to the **"Extended Charts"** section +3. This feature allows you to add custom Kubernetes resources that will be included in your Helm deployment + + + + + +Once in the Extended Charts section: + +1. Click on **"Templates"** or **"Add Template"** +2. Add your custom Ingress YAML configuration +3. 
The configuration will be automatically included in your Helm chart + + + + + +Here's a complete example of an Ingress configuration with AWS ALB annotations: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: websimpleecommx + namespace: mx-simplee-web-prod-mx-2 + labels: + app.kubernetes.io/instance: 1.0.1 + app.kubernetes.io/name: websimpleecommx + annotations: + # AWS Load Balancer Controller annotations + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]' + alb.ingress.kubernetes.io/ssl-redirect: "443" + + # Certificate configuration + alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/your-cert-id + + # Health check configuration + alb.ingress.kubernetes.io/healthcheck-path: /health + alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30" + alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5" + alb.ingress.kubernetes.io/healthy-threshold-count: "2" + alb.ingress.kubernetes.io/unhealthy-threshold-count: "2" + + # Redirect configuration (example for www to non-www) + alb.ingress.kubernetes.io/actions.redirect-to-simplee: > + {"Type": "redirect", + "RedirectConfig": { + "Protocol": "HTTPS", + "Port": "443", + "StatusCode": "HTTP_301", + "Host": "simplee.com.mx" + }} +spec: + rules: + # Handle www redirect + - host: www.simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: redirect-to-simplee + port: + name: use-annotation + # Main domain + - host: simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: websimpleecommx-service + port: + number: 80 +``` + + + + + +### Multiple Hosts with Different Backends + +```yaml +spec: + rules: + - host: api.example.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: api-service + port: + number: 
8080
+    - host: admin.example.com
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: admin-service
+                port:
+                  number: 3000
+```
+
+### Path-based Routing
+
+```yaml
+spec:
+  rules:
+    - host: example.com
+      http:
+        paths:
+          - path: /api
+            pathType: Prefix
+            backend:
+              service:
+                name: api-service
+                port:
+                  number: 8080
+          - path: /admin
+            pathType: Prefix
+            backend:
+              service:
+                name: admin-service
+                port:
+                  number: 3000
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: frontend-service
+                port:
+                  number: 80
+```
+
+### Custom Redirect Rules
+
+```yaml
+annotations:
+  alb.ingress.kubernetes.io/actions.custom-redirect: >
+    {"Type": "redirect",
+    "RedirectConfig": {
+      "Protocol": "HTTPS",
+      "Port": "443",
+      "StatusCode": "HTTP_301",
+      "Path": "/new-path"
+    }}
+```
+
+
+
+
+
+### Verify Ingress Deployment
+
+After adding your Ingress configuration to Extended Charts:
+
+```bash
+# Check if Ingress was created
+kubectl get ingress -n <namespace>
+
+# Verify Ingress details
+kubectl describe ingress <ingress-name> -n <namespace>
+
+# Check ALB creation in AWS
+aws elbv2 describe-load-balancers --region <region>
+```
+
+### Common Issues and Solutions
+
+1. **Ingress not creating ALB:**
+
+   - Verify AWS Load Balancer Controller is installed
+   - Check service account permissions
+   - Ensure correct annotations are used
+
+2. **SSL certificate issues:**
+
+   - Verify certificate ARN is correct
+   - Ensure certificate is in the same region
+   - Check certificate status in AWS Certificate Manager
+
+3. 
**Routing not working:** + - Verify service names and ports match + - Check target group health in AWS console + - Validate security group rules + +### Testing Your Configuration + +```bash +# Test HTTP response +curl -I http://your-domain.com + +# Test HTTPS redirect +curl -I -L http://your-domain.com + +# Test specific paths +curl -I https://your-domain.com/api +``` + + + +--- + +_This FAQ was automatically generated on December 26, 2024 based on a real user query._ diff --git a/docs/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx b/docs/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx new file mode 100644 index 000000000..309fffe85 --- /dev/null +++ b/docs/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx @@ -0,0 +1,290 @@ +--- +sidebar_position: 3 +title: "AWS Fargate Pods Not Terminating - Cost Increase Issue" +description: "Solution for AWS Fargate pods that don't terminate properly causing unexpected cost increases" +date: "2024-03-14" +category: "cluster" +tags: ["fargate", "aws", "cost", "pods", "termination", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# AWS Fargate Pods Not Terminating - Cost Increase Issue + +**Date:** March 14, 2024 +**Category:** Cluster +**Tags:** Fargate, AWS, Cost, Pods, Termination, Troubleshooting + +## Problem Description + +**Context:** Users experience unexpected cost increases in their AWS EKS cluster when Fargate pods running deployments fail to terminate properly after job completion. 
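To make the "small pods, big bill" dynamic concrete, here is a back-of-the-envelope cost model. Fargate bills per vCPU-hour and per GB-hour; the rates below are illustrative placeholders, not current AWS pricing:

```javascript
// Rough Fargate cost model for leaked pods. Rates are illustrative
// placeholders, not current AWS pricing.
const PRICE_PER_VCPU_HOUR = 0.04;
const PRICE_PER_GB_HOUR = 0.004;

function monthlyCostUsd(podCount, vcpuPerPod, gbPerPod) {
  const hours = 24 * 30;
  return (
    podCount *
    hours *
    (vcpuPerPod * PRICE_PER_VCPU_HOUR + gbPerPod * PRICE_PER_GB_HOUR)
  );
}

// A single stuck 0.25 vCPU / 0.5 GB pod costs only a few dollars a month...
console.log(monthlyCostUsd(1, 0.25, 0.5).toFixed(2));

// ...but a few hundred leaked pods accumulate into a four-figure bill.
console.log(monthlyCostUsd(300, 0.25, 0.5).toFixed(2));
```

This is why the symptom shows up in the cost forecast long before any individual pod looks expensive.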
+ +**Observed Symptoms:** + +- Dramatic increase in AWS costs over several days +- Multiple Fargate replicas visible in Jobs listing +- Pods accumulating without proper cleanup +- Cost forecast showing significant spikes +- Individual pod costs are low but accumulate over time + +**Relevant Configuration:** + +- Cluster type: AWS EKS with Fargate +- Workload type: Deployments running on Fargate nodes +- Monitoring: Prometheus with insufficient RAM allocation (1250MB) +- Cost monitoring: Enabled with forecasting + +**Error Conditions:** + +- Fargate pods don't terminate after deployment completion +- Prometheus memory issues causing node allocation problems +- Unused nodes remain active generating costs +- Problem appears intermittently without clear trigger + +## Detailed Solution + + + +This cost increase issue typically stems from two main problems: + +1. **Fargate Pod Lifecycle Issue**: Fargate pods running deployments don't always terminate properly, causing them to accumulate over time +2. **Prometheus Resource Constraints**: Insufficient RAM allocation (1250MB) causes Prometheus to allocate nodes that become unusable but remain active + +Both issues result in resources staying active longer than necessary, generating unexpected costs. + + + + + +To resolve the Prometheus memory issue: + +1. **Access the Prometheus addon configuration** in your cluster +2. **Increase the minimum RAM allocation** from 1250MB to 2GB (2048MB) +3. **Apply the configuration changes** + +```yaml +# Prometheus addon configuration +resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" +``` + +This prevents Prometheus from entering nodes with insufficient resources and creating unusable but active nodes. + + + + + +To address the Fargate pod accumulation: + +1. **Check current Fargate pods**: + +```bash +kubectl get pods --all-namespaces -o wide | grep fargate +``` + +2. 
**Identify stuck pods**: + +```bash +kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded +kubectl get pods --all-namespaces --field-selector=status.phase=Failed +``` + +3. **Clean up completed pods**: + +```bash +# Remove completed pods +kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded +kubectl delete pods --all-namespaces --field-selector=status.phase=Failed +``` + + + + + +To prevent future accumulation, implement automated cleanup: + +1. **Create a cleanup CronJob**: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: pod-cleanup + namespace: kube-system +spec: + schedule: "0 2 * * *" # Run daily at 2 AM + jobTemplate: + spec: + template: + spec: + serviceAccountName: pod-cleanup + containers: + - name: kubectl + image: bitnami/kubectl:latest + command: + - /bin/sh + - -c + - | + kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded + kubectl delete pods --all-namespaces --field-selector=status.phase=Failed + restartPolicy: OnFailure +``` + +2. **Create the necessary RBAC**: + +```yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: pod-cleanup + namespace: kube-system +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: pod-cleanup +rules: + - apiGroups: [""] + resources: ["pods"] + verbs: ["list", "delete"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: pod-cleanup +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: pod-cleanup +subjects: + - kind: ServiceAccount + name: pod-cleanup + namespace: kube-system +``` + + + + + +To prevent future cost surprises: + +1. **Set up cost alerts** in AWS: + + - Go to AWS Billing → Budgets + - Create a budget for your EKS cluster + - Set alerts at 80% and 100% of expected costs + +2. 
**Monitor Fargate usage**: + +```bash +# Check Fargate pod count +kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.nodeName | contains("fargate")) | .metadata.name' | wc -l +``` + +3. **Regular cost reviews**: + - Check AWS Cost Explorer weekly + - Monitor EKS costs by service + - Review Fargate usage patterns + + + + + +**Deployment Configuration:** + +- Set appropriate `activeDeadlineSeconds` for jobs +- Use `ttlSecondsAfterFinished` for automatic cleanup +- Configure proper resource limits + +**Monitoring:** + +- Set up Prometheus alerts for pod accumulation +- Monitor cluster resource usage regularly +- Implement cost tracking dashboards + +**Maintenance:** + +- Schedule regular cluster health checks +- Implement automated cleanup processes +- Review and update resource allocations periodically + +```yaml +# Example job with automatic cleanup +apiVersion: batch/v1 +kind: Job +metadata: + name: example-job +spec: + ttlSecondsAfterFinished: 300 # Clean up 5 minutes after completion + activeDeadlineSeconds: 3600 # Kill job after 1 hour + template: + spec: + restartPolicy: Never + containers: + - name: job-container + image: your-image + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + +**Resource Optimization:** + +- Right-size your Fargate pods to avoid over-provisioning +- Use spot instances where appropriate for cost savings +- Implement horizontal pod autoscaling for variable workloads + +**Cluster Maintenance:** + +- Regularly review and clean up unused resources +- Monitor cluster health and performance metrics +- Keep Kubernetes and addon versions up to date + + + + + +**Resource Right-Sizing:** + +1. **Analyze actual resource usage**: + ```bash + kubectl top pods --all-namespaces + kubectl top nodes + ``` + +2. **Adjust resource requests and limits** based on actual usage +3. 
**Use Vertical Pod Autoscaler** for automatic right-sizing
+
+**Scheduling Optimization:**
+
+- Use node affinity to optimize pod placement
+- Implement pod disruption budgets for controlled updates
+- Consider using mixed instance types for cost optimization
+
+**Monitoring and Alerting:**
+
+- Set up comprehensive cost monitoring dashboards
+- Implement alerts for unusual cost spikes
+- Regular cost reviews and optimization sessions
+
+**Automation:**
+
+- Automate resource cleanup processes
+- Implement policy-based resource management
+- Use tools like Karpenter for intelligent node provisioning
+
+
+
+---
+
+_This FAQ section was automatically generated on March 14, 2024, based on a real user inquiry._
diff --git a/docs/troubleshooting/frontend-environment-variables-docker-build.mdx b/docs/troubleshooting/frontend-environment-variables-docker-build.mdx
new file mode 100644
index 000000000..56b3024f6
--- /dev/null
+++ b/docs/troubleshooting/frontend-environment-variables-docker-build.mdx
@@ -0,0 +1,172 @@
+---
+sidebar_position: 3
+title: "Frontend Environment Variables Not Working During Build"
+description: "Solution for frontend projects not receiving environment variables during Docker build process"
+date: "2024-12-23"
+category: "project"
+tags: ["frontend", "environment-variables", "docker", "build", "configuration"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Frontend Environment Variables Not Working During Build
+
+**Date:** December 23, 2024
+**Category:** Project
+**Tags:** Frontend, Environment Variables, Docker, Build, Configuration
+
+## Problem Description
+
+**Context:** Frontend projects in SleakOps are not receiving environment variables during the build process, affecting multiple applications across different frontend frameworks. 
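The root cause is easiest to see in miniature: frontend bundlers statically replace `process.env.*` references with literal values while producing the bundle, so a variable that only exists at container runtime can never reach the browser. A toy version of that substitution (the variable names are examples):

```javascript
// Toy model of what a bundler does at build time: every process.env.NAME
// expression in the source is replaced with the literal value available
// *during the build*. Whatever is missing at that moment is baked in as
// `undefined` forever.
function bakeEnv(source, env) {
  return source.replace(/process\.env\.(\w+)/g, (_, name) =>
    JSON.stringify(env[name])
  );
}

const source = 'fetch(process.env.REACT_APP_API_URL + "/users");';

// Variable present at build time: the URL is embedded in the bundle.
console.log(bakeEnv(source, { REACT_APP_API_URL: "https://api.example.com" }));

// Variable only set at runtime (after the build): the bundle ships "undefined".
console.log(bakeEnv(source, {}));
```

This is why the symptoms below appear even though the same variables work fine in backend services, which read the environment when the process starts.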
+ +**Observed Symptoms:** + +- Environment variables are not available in frontend applications +- Variables that work in other project types fail in frontend projects +- Build process completes successfully but variables are undefined at runtime +- Issue affects multiple frontend applications consistently + +**Relevant Configuration:** + +- Project type: Frontend applications +- Build process: Docker-based builds +- Environment variables: Configured in SleakOps project settings +- Variable access: Required during build time, not just runtime + +**Error Conditions:** + +- Variables are undefined or null in the built application +- Environment variables work in backend services but not frontend +- Issue occurs across different frontend frameworks (React, Vue, Angular, etc.) +- Problem persists regardless of variable naming conventions + +## Detailed Solution + + + +Frontend projects require environment variables during the **build process**, not just at runtime. Unlike backend services that can access environment variables when the container starts, frontend applications need these variables to be "baked in" during the build step because: + +1. Frontend code runs in the browser, not on the server +2. Environment variables must be embedded into the static files during build +3. The build process needs access to variables to generate the final bundle + + + + + +To fix this issue, you need to configure Docker Args in your project settings: + +1. Go to your **Projects** list in SleakOps +2. Find your frontend project and click **Edit** +3. Navigate to the **Docker Args** section +4. 
Add your environment variables as build arguments + +**Example configuration:** + +```bash +# In Docker Args section +REACT_APP_API_URL=${API_URL} +REACT_APP_ENV=${ENVIRONMENT} +VUE_APP_BASE_URL=${BASE_URL} +NEXT_PUBLIC_API_KEY=${API_KEY} +``` + + + + + +Your Dockerfile must be configured to accept and use these build arguments: + +```dockerfile +# Accept build arguments +ARG REACT_APP_API_URL +ARG REACT_APP_ENV +ARG VUE_APP_BASE_URL +ARG NEXT_PUBLIC_API_KEY + +# Set them as environment variables during build +ENV REACT_APP_API_URL=$REACT_APP_API_URL +ENV REACT_APP_ENV=$REACT_APP_ENV +ENV VUE_APP_BASE_URL=$VUE_APP_BASE_URL +ENV NEXT_PUBLIC_API_KEY=$NEXT_PUBLIC_API_KEY + +# Run your build command +RUN npm run build +``` + + + + + +Different frontend frameworks have specific naming conventions for environment variables: + +**React:** + +- Must start with `REACT_APP_` +- Example: `REACT_APP_API_URL` + +**Vue.js:** + +- Must start with `VUE_APP_` +- Example: `VUE_APP_BASE_URL` + +**Next.js:** + +- Must start with `NEXT_PUBLIC_` for client-side access +- Example: `NEXT_PUBLIC_API_KEY` + +**Angular:** + +- No specific prefix required, but often use custom configuration +- Access through `environment.ts` files + +**Vite-based projects:** + +- Must start with `VITE_` +- Example: `VITE_API_ENDPOINT` + + + + + +To verify your environment variables are working: + +1. **Check build logs**: Look for the variables being set during build +2. **Browser console**: Use `console.log(process.env.REACT_APP_API_URL)` in your code +3. **Network tab**: Verify API calls are using the correct URLs +4. **Build output**: Check if variables appear in the bundled files (be careful with sensitive data) + +**Test code example:** + +```javascript +// Add this temporarily to verify +console.log("Environment variables:", { + apiUrl: process.env.REACT_APP_API_URL, + environment: process.env.REACT_APP_ENV, +}); +``` + + + + + +If variables still don't work after configuration: + +1. 
**Check variable names**: Ensure they follow framework conventions +2. **Verify Docker Args syntax**: Make sure the syntax is correct in SleakOps +3. **Rebuild the project**: Changes to Docker Args require a full rebuild +4. **Check Dockerfile**: Ensure ARG and ENV statements are present +5. **Validate variable values**: Ensure the source environment variables have values + +**Common mistakes:** + +- Forgetting framework-specific prefixes +- Not rebuilding after configuration changes +- Missing ARG statements in Dockerfile +- Incorrect variable substitution syntax + + + +--- + +_This FAQ was automatically generated on December 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/github-actions-automatic-deployment.mdx b/docs/troubleshooting/github-actions-automatic-deployment.mdx new file mode 100644 index 000000000..4186d6bb0 --- /dev/null +++ b/docs/troubleshooting/github-actions-automatic-deployment.mdx @@ -0,0 +1,199 @@ +--- +sidebar_position: 3 +title: "GitHub Actions Automatic Deployment Setup" +description: "How to configure automatic deployments with GitHub Actions in SleakOps" +date: "2024-12-17" +category: "project" +tags: ["github-actions", "ci-cd", "deployment", "automation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Actions Automatic Deployment Setup + +**Date:** December 17, 2024 +**Category:** Project +**Tags:** GitHub Actions, CI/CD, Deployment, Automation + +## Problem Description + +**Context:** User wants to configure automatic deployments that trigger when pushing code to a GitHub repository using SleakOps platform. 
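Conceptually, the missing piece is a trigger that maps push events onto deployments; in a GitHub Actions workflow this is the `on.push.branches` filter. Its behavior can be sketched as a predicate (branch names are examples):

```javascript
// Sketch of the GitHub Actions trigger logic: deploy only when the event is
// a push to one of the listed release branches.
const RELEASE_BRANCHES = ["main", "master"];

function shouldDeploy(event) {
  return event.type === "push" && RELEASE_BRANCHES.includes(event.branch);
}

console.log(shouldDeploy({ type: "push", branch: "main" }));          // → true
console.log(shouldDeploy({ type: "push", branch: "feature/login" })); // → false
console.log(shouldDeploy({ type: "pull_request", branch: "main" }));  // → false
```

Everything that follows is wiring this trigger up to the SleakOps CLI so the deployment itself is automated.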
+ +**Observed Symptoms:** + +- Need to set up CI/CD pipeline for automatic deployments +- Want deployments to trigger on Git push events +- Looking for integration between GitHub and SleakOps +- Need guidance on GitHub Actions configuration + +**Relevant Configuration:** + +- Platform: SleakOps +- Version Control: GitHub +- CI/CD Tool: GitHub Actions +- Deployment Target: SleakOps managed infrastructure + +**Error Conditions:** + +- Manual deployment process currently in use +- No automated pipeline configured +- Need to establish connection between GitHub and SleakOps + +## Detailed Solution + + + +SleakOps provides a CLI tool specifically designed for CI/CD integration. You can find the complete documentation at: https://docs.sleakops.com/cli + +The CLI allows you to: + +- Deploy applications automatically +- Manage deployments from CI/CD pipelines +- Integrate with various CI/CD platforms including GitHub Actions + + + + + +To generate the GitHub Actions workflow file: + +1. **Access SleakOps Console** + + - Log into your SleakOps dashboard + - Navigate to your project + +2. **Generate CI/CD Configuration** + + - Look for the CI/CD or GitHub Actions section + - The console will provide a pre-configured workflow file + - This file includes all necessary steps for deployment + +3. **Download Configuration** + - Copy the generated workflow file + - Note the required secret keys that will be displayed + + + + + +Once you have the workflow file from SleakOps console: + +1. **Create Workflow Directory** + + ```bash + mkdir -p .github/workflows + ``` + +2. **Add Workflow File** + + - Create a new file: `.github/workflows/deploy.yml` + - Paste the configuration provided by SleakOps console + +3. 
**Basic Workflow Structure** + + ```yaml + name: Deploy to SleakOps + + on: + push: + branches: [main, master] + + jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Deploy to SleakOps + run: | + # SleakOps CLI commands will be here + env: + SLEAKOPS_TOKEN: ${{ secrets.SLEAKOPS_TOKEN }} + SLEAKOPS_PROJECT_ID: ${{ secrets.SLEAKOPS_PROJECT_ID }} + ``` + + + + + +To set up the required secrets in GitHub: + +1. **Access Repository Settings** + + - Go to your GitHub repository + - Click on **Settings** tab + - Navigate to **Secrets and variables** → **Actions** + +2. **Add Required Secrets** + The SleakOps console will show you the exact secrets needed, typically: + + - `SLEAKOPS_TOKEN`: Your SleakOps API token + - `SLEAKOPS_PROJECT_ID`: Your project identifier + - Any additional environment-specific secrets + +3. **Create New Secret** + - Click **New repository secret** + - Enter the secret name exactly as shown in SleakOps console + - Paste the corresponding value + - Click **Add secret** + + + + + +To verify your setup works correctly: + +1. **Make a Test Commit** + + ```bash + git add . + git commit -m "Test automatic deployment" + git push origin main + ``` + +2. **Monitor GitHub Actions** + + - Go to your repository's **Actions** tab + - Watch the workflow execution + - Check for any errors in the logs + +3. **Verify Deployment** + - Check SleakOps console for deployment status + - Verify your application is updated + - Test the deployed application + + + + + +If the deployment fails: + +1. **Check Secrets** + + - Verify all required secrets are configured + - Ensure secret names match exactly + - Check that tokens haven't expired + +2. **Review Workflow Logs** + + - Check GitHub Actions logs for specific errors + - Look for authentication or permission issues + - Verify CLI commands are executing correctly + +3. 
**Validate SleakOps Configuration** + + - Ensure your SleakOps project is properly configured + - Check that the CLI has necessary permissions + - Verify project ID is correct + +4. **Test CLI Locally** + ```bash + # Install SleakOps CLI locally to test + npm install -g @sleakops/cli + sleakops deploy --help + ``` + + + +--- + +_This FAQ was automatically generated on December 17, 2024 based on a real user query._ diff --git a/docs/troubleshooting/github-actions-multi-project-deployment.mdx b/docs/troubleshooting/github-actions-multi-project-deployment.mdx new file mode 100644 index 000000000..2c472d88c --- /dev/null +++ b/docs/troubleshooting/github-actions-multi-project-deployment.mdx @@ -0,0 +1,220 @@ +--- +sidebar_position: 3 +title: "GitHub Actions Multi-Project Deployment" +description: "Configure GitHub Actions to deploy multiple SleakOps projects automatically" +date: "2025-01-15" +category: "project" +tags: ["github-actions", "ci-cd", "deployment", "automation", "sleakops-cli"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Actions Multi-Project Deployment + +**Date:** January 15, 2025 +**Category:** Project +**Tags:** GitHub Actions, CI/CD, Deployment, Automation, SleakOps CLI + +## Problem Description + +**Context:** User needs to configure GitHub Actions to automatically build and deploy multiple SleakOps projects when pushing to the production branch, specifically wanting to deploy both `simplee-drf-aws-cl` and `simplee-drf-aws-mx` projects simultaneously. 
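The core of the fix is GitHub Actions' `strategy.matrix`, which fans a single job definition out into one run per listed value. A toy expansion of the two-project matrix, mirroring the SleakOps CLI commands used in this guide:

```javascript
// Toy expansion of a GitHub Actions build matrix: one job template becomes
// one concrete build/deploy command pair per project.
const matrix = { project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] };

const commands = matrix.project.map((p) => [
  `sleakops build -p ${p} -b prod -w`,
  `sleakops deploy -p ${p} -e dev -w`,
]);

console.log(commands.flat().join("\n"));
```

Adding a third project is then a one-line change to the matrix rather than a copy of the whole job.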
+ +**Observed Symptoms:** + +- Current CI/CD pipeline only deploys one project +- Need to trigger multiple project deployments from a single push +- Requirement to duplicate build and deploy commands for additional projects +- Environment parameter confusion (using `-e main` instead of `-e dev`) + +**Relevant Configuration:** + +- Platform: GitHub Actions +- SleakOps CLI commands: `sleakops build` and `sleakops deploy` +- Projects: `simplee-drf-aws-cl`, `simplee-drf-aws-mx` +- Branch trigger: `prod` branch +- Environment: `dev` (not `main`) + +**Error Conditions:** + +- Incorrect environment parameter in deploy command +- Single project deployment instead of multi-project +- Need for workflow optimization for multiple projects + +## Detailed Solution + + + +Here's how to configure GitHub Actions to deploy multiple SleakOps projects: + +```yaml +name: Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Run SleakOps build for ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p ${{ matrix.project }} -b prod -w + + deploy: + needs: [build] + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Run SleakOps deploy for ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p ${{ matrix.project }} -e dev -w +``` + + + + + +If you prefer sequential deployment (one project after another), use this configuration: + +```yaml +name: Sequential Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build-and-deploy-cl: + runs-on: 
ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build simplee-drf-aws-cl + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p simplee-drf-aws-cl -b prod -w + + - name: Deploy simplee-drf-aws-cl + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p simplee-drf-aws-cl -e dev -w + + build-and-deploy-mx: + runs-on: ubuntu-latest + needs: [build-and-deploy-cl] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build simplee-drf-aws-mx + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p simplee-drf-aws-mx -b prod -w + + - name: Deploy simplee-drf-aws-mx + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p simplee-drf-aws-mx -e dev -w +``` + + + + + +**Important:** The environment parameter in the deploy command should match your SleakOps environment name, not the git branch. + +Common mistakes: + +- ❌ `sleakops deploy -p project-name -e main -w` +- ✅ `sleakops deploy -p project-name -e dev -w` +- ✅ `sleakops deploy -p project-name -e prod -w` + +To verify your environment name: + +1. Go to your SleakOps console +2. Navigate to your project +3. Check the environment section +4. 
Use the exact environment name in the `-e` parameter + + + + + +For production environments, consider adding error handling and notifications: + +```yaml +name: Production Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + fail-fast: false + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + echo "Building ${{ matrix.project }}..." + sleakops build -p ${{ matrix.project }} -b prod -w + echo "Build completed for ${{ matrix.project }}" + + deploy: + needs: [build] + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + fail-fast: false + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Deploy ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + echo "Deploying ${{ matrix.project }}..." + sleakops deploy -p ${{ matrix.project }} -e dev -w + echo "Deploy completed for ${{ matrix.project }}" +``` + + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/github-actions-quota-management.mdx b/docs/troubleshooting/github-actions-quota-management.mdx new file mode 100644 index 000000000..6ee250662 --- /dev/null +++ b/docs/troubleshooting/github-actions-quota-management.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 15 +title: "GitHub Actions Quota Management and CI/CD Optimization" +description: "Managing GitHub Actions quotas and optimizing CI/CD pipeline efficiency in SleakOps" +date: "2024-12-11" +category: "project" +tags: ["github-actions", "ci-cd", "quota", "optimization", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Actions Quota Management and CI/CD Optimization + +**Date:** December 11, 2024 +**Category:** Project +**Tags:** GitHub Actions, CI/CD, Quota, Optimization, Deployment + +## Problem Description
+ +**Context:** Organizations using SleakOps with GitHub Actions for CI/CD pipelines may encounter quota limitations, especially when running multiple deployments across development and production environments. + +**Observed Symptoms:** + +- GitHub Actions quota consumed faster than expected +- Deployments taking significantly longer than usual (50+ minutes) +- Reduced number of successful builds compared to previous months +- Unexpected quota exhaustion mid-month + +**Relevant Configuration:** + +- Multiple environments: Development and Production +- Historical performance: ~100 builds per month previously +- Current issue: Quota exhaustion with fewer builds +- Deployment supervision: Manual monitoring required + +**Error Conditions:** + +- Quota exhaustion occurs earlier in the month +- Longer deployment times consuming more minutes +- Unsupervised deployments leading to unexpected quota usage + +## Detailed Solution + + + +GitHub has made several changes to their free tier quotas over time: + +**Current GitHub Free Tier Limits (2024):** + +- **Public repositories**: Unlimited minutes +- **Private repositories**: 2,000 minutes/month for free accounts +- **Organization accounts**: 2,000 minutes/month (free tier) + +**Recent Changes:** + +- GitHub has not significantly reduced free quotas recently +- However, they've improved billing granularity and monitoring +- Usage patterns may have changed due to longer job execution times + +**Verification Steps:** + +1. Go to GitHub → Settings → Billing and plans +2. Check "Usage this month" under Actions +3. Review historical usage patterns + + + + + +To reduce GitHub Actions minutes consumption: + +**1. 
Optimize Build Processes:** + +```yaml +# .github/workflows/deploy.yml +name: Optimized Deploy +on: + push: + branches: [main, develop] + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + # Use caching to reduce build time + - name: Cache dependencies + uses: actions/cache@v3 + with: + path: ~/.npm + key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} + + # Parallel jobs for different environments + - name: Build and test + run: | + npm ci + npm run build + npm test +``` + +**2. Environment-Specific Deployments:** + +```yaml +# Separate workflows for dev/prod +strategy: + matrix: + environment: [development, production] +steps: + - name: Deploy to ${{ matrix.environment }} + if: | + (matrix.environment == 'development' && github.ref == 'refs/heads/develop') || + (matrix.environment == 'production' && github.ref == 'refs/heads/main') +``` + + + + + +**Automated Monitoring Configuration:** + +1. **Slack/Teams Notifications:** + +```yaml +- name: Notify deployment start + uses: 8398a7/action-slack@v3 + with: + status: custom + custom_payload: | + { + text: "🚀 Deployment started for ${{ github.repository }}", + blocks: [ + { + type: "section", + text: { + type: "mrkdwn", + text: "Environment: ${{ matrix.environment }}\nBranch: ${{ github.ref }}" + } + } + ] + } +``` + +2. **Timeout Configuration:** + +```yaml +jobs: + deploy: + timeout-minutes: 15 # Prevent runaway jobs + steps: + - name: Deploy with timeout + timeout-minutes: 10 + run: | + # Your deployment commands +``` + +3. **SleakOps Dashboard Integration:** + +- Enable deployment notifications in SleakOps dashboard +- Set up alerts for long-running deployments +- Configure automatic rollback on timeout + + + + + +**1. Usage Monitoring:** + +```bash +# Check current usage via GitHub CLI +gh api /user/settings/billing/actions + +# Monitor workflow runs +gh run list --limit 50 --json status,conclusion,createdAt,name +``` + +**2. 
Cost Optimization:** + +- Use self-hosted runners for intensive builds +- Implement conditional deployments +- Cache dependencies and build artifacts +- Use matrix strategies efficiently + +**3. Alternative Solutions:** + +```yaml +# Conditional deployment based on changes +name: Smart Deploy +on: + push: + paths: + - "src/**" + - "package.json" + - "Dockerfile" + +jobs: + check-changes: + outputs: + should-deploy: ${{ steps.changes.outputs.src }} + steps: + - uses: dorny/paths-filter@v2 + id: changes + with: + filters: | + src: + - 'src/**' + - 'package.json' + + deploy: + needs: check-changes + if: needs.check-changes.outputs.should-deploy == 'true' + # ... deployment steps +``` + + + + + +**1. Cluster-Specific Configurations:** + +- Configure separate workflows for QA and Production +- Use SleakOps environment variables for conditional logic +- Implement progressive deployment strategies + +**2. Resource Management:** + +```yaml +# SleakOps optimized workflow +env: + SLEAKOPS_ENV: ${{ github.ref == 'refs/heads/main' && 'production' || 'development' }} + +jobs: + deploy: + steps: + - name: Deploy to SleakOps + run: | + # Use SleakOps CLI with optimized settings + sleakops deploy \ + --environment $SLEAKOPS_ENV \ + --timeout 300 \ + --max-retries 2 +``` + +**3. 
Monitoring Integration:** + +- Enable SleakOps deployment webhooks +- Set up automated rollback policies +- Configure resource limits for build pods + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/github-actions-timeout-limits.mdx b/docs/troubleshooting/github-actions-timeout-limits.mdx new file mode 100644 index 000000000..288bae0ae --- /dev/null +++ b/docs/troubleshooting/github-actions-timeout-limits.mdx @@ -0,0 +1,207 @@ +--- +sidebar_position: 3 +title: "GitHub Actions Timeout and Monthly Limits" +description: "Solution for GitHub Actions deployments timing out and exceeding monthly limits" +date: "2024-01-15" +category: "project" +tags: ["github-actions", "deployment", "timeout", "ci-cd", "limits"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Actions Timeout and Monthly Limits + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** GitHub Actions, Deployment, Timeout, CI/CD, Limits + +## Problem Description + +**Context:** Users experience deployment failures when using GitHub Actions for CI/CD, with builds taking significantly longer than expected and potentially exceeding GitHub's monthly usage limits. 
+ +**Observed Symptoms:** + +- Deployments that normally take 10-14 minutes are running for 45+ minutes +- GitHub Actions jobs failing after 8 seconds on push +- Multiple deployments getting stuck in "running" state +- CI/CD pipeline failures without clear error messages + +**Relevant Configuration:** + +- Platform: GitHub Actions integration with SleakOps +- Normal deployment time: 10-14 minutes +- Observed deployment time: 45+ minutes +- Project type: Web application deployment + +**Error Conditions:** + +- Error occurs immediately after git push (within 8 seconds) +- Multiple deployments running simultaneously +- Monthly GitHub Actions limits potentially exceeded +- CI.yml configuration appears correct + +## Detailed Solution + + + +To verify if you've exceeded GitHub Actions limits: + +1. Go to your **GitHub Organization Settings** +2. Navigate to **Settings** → **Billing** +3. Check the **Actions & Packages** section +4. Review: + - Minutes used this month + - Available minutes remaining + - Current plan limits + +**Free Plan Limits:** + +- 2,000 minutes/month for private repositories +- Unlimited for public repositories + +**Paid Plan Limits:** + +- Varies by plan (Pro: 3,000 min/month, Team: 3,000 min/month) + + + + + +Instead of relying on GitHub Actions, you can use SleakOps' native build and deployment system: + +1. **Access SleakOps Dashboard** +2. Navigate to your project +3. Go to **Build & Deploy** section +4. 
Configure **Direct Deploy** from SleakOps: + +```yaml +# Example SleakOps deployment configuration +build: + strategy: "sleakops-native" + timeout: "15m" + +deploy: + auto_deploy: true + branch: "dev" + environment: "development" +``` + +**Benefits:** + +- No GitHub Actions minute consumption +- More reliable build times +- Better integration with SleakOps features +- Detailed build logs and monitoring + + + + + +If you prefer to continue using GitHub Actions, optimize your workflow: + +```yaml +# .github/workflows/ci.yml +name: CI/CD Pipeline + +on: + push: + branches: [dev, main] + +jobs: + build-and-deploy: + runs-on: ubuntu-latest + timeout-minutes: 20 # Set explicit timeout + + steps: + - uses: actions/checkout@v3 + + # Use caching to reduce build time + - name: Cache dependencies + uses: actions/cache@v3 + with: + path: ~/.npm + key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} + + # Optimize Docker builds + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + + - name: Build and push + uses: docker/build-push-action@v3 + with: + cache-from: type=gha + cache-to: type=gha,mode=max +``` + + + + + +To prevent future issues, implement monitoring: + +**1. GitHub Actions Monitoring:** + +```yaml +# Add to your workflow +- name: Notify on long builds + if: ${{ job.status == 'failure' || job.duration > 1200 }} + uses: 8398a7/action-slack@v3 + with: + status: custom + custom_payload: | + { + text: "Build taking longer than expected: ${{ job.duration }}s" + } +``` + +**2. SleakOps Monitoring:** + +- Enable build time alerts in SleakOps dashboard +- Set up notifications for builds exceeding 15 minutes +- Monitor resource usage trends + +**3. Usage Tracking:** + +- Weekly review of GitHub Actions usage +- Set up alerts when approaching 80% of monthly limit +- Consider upgrading GitHub plan if consistently hitting limits + + + + + +For immediate resolution: + +**1. 
Cancel Running Actions:** + +```bash +# Using GitHub CLI +gh run list --status in_progress +gh run cancel [RUN_ID] +``` + +**2. Check Workflow Status:** + +- Go to **Actions** tab in your repository +- Cancel any stuck workflows +- Review logs for specific error messages + +**3. Temporary Workaround:** + +- Disable GitHub Actions temporarily +- Use SleakOps manual deployment +- Re-enable once limits reset (monthly) + +**4. Emergency Deployment:** +If you need immediate deployment: + +1. Access SleakOps dashboard +2. Go to project → **Manual Deploy** +3. Select branch and environment +4. Deploy directly through SleakOps + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/github-automatic-deployments-failing.mdx b/docs/troubleshooting/github-automatic-deployments-failing.mdx new file mode 100644 index 000000000..da5589591 --- /dev/null +++ b/docs/troubleshooting/github-automatic-deployments-failing.mdx @@ -0,0 +1,176 @@ +--- +sidebar_position: 3 +title: "GitHub Automatic Deployments Not Working" +description: "Solution for automatic deployments from GitHub that have stopped working" +date: "2024-01-15" +category: "project" +tags: ["github", "deployment", "automation", "ci-cd", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitHub Automatic Deployments Not Working + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** GitHub, Deployment, Automation, CI/CD, Troubleshooting + +## Problem Description + +**Context:** Automatic deployments from GitHub repositories have stopped working in SleakOps platform, preventing code changes from being automatically deployed to the target environment. 
+ +**Observed Symptoms:** + +- Automatic deployments from GitHub are not triggering +- Error messages appearing in deployment logs +- Code changes pushed to repository are not reflected in deployed application +- Issue has been occurring for several days + +**Relevant Configuration:** + +- Platform: SleakOps +- Source: GitHub repository +- Deployment type: Automatic deployments +- Duration: Multiple days of failure + +**Error Conditions:** + +- Deployments fail consistently over multiple days +- Error occurs after code push to GitHub +- Automatic deployment pipeline is not executing +- Manual intervention may be required + +## Detailed Solution + + + +The most common cause of automatic deployment failures is webhook configuration issues: + +1. **Verify webhook status in GitHub:** + + - Go to your repository → Settings → Webhooks + - Check if the SleakOps webhook is active + - Look for recent delivery failures + +2. **Check webhook URL:** + + - Ensure the webhook URL points to the correct SleakOps endpoint + - Verify the URL format: `https://api.sleakops.com/webhooks/github/{project-id}` + +3. **Validate webhook secret:** + - Confirm the webhook secret matches your SleakOps project configuration + - If unsure, regenerate the webhook in SleakOps settings + + + + + +Check if GitHub access tokens or permissions have expired: + +1. **GitHub App permissions:** + + - Go to SleakOps → Project Settings → GitHub Integration + - Verify the GitHub App is still connected + - Check if permissions include repository access + +2. **Personal Access Token (if used):** + + - Verify the token hasn't expired + - Ensure token has necessary scopes: `repo`, `workflow` + - Regenerate token if needed + +3. **Repository access:** + - Confirm SleakOps still has access to the repository + - Check if repository was moved or renamed + + + + + +Examine the deployment logs for specific error messages: + +1. 
**Access deployment logs:** + + - Go to SleakOps Dashboard → Your Project → Deployments + - Click on the failed deployment attempts + - Review error messages and stack traces + +2. **Common error patterns:** + + - `Authentication failed`: Token or webhook issues + - `Repository not found`: Access or naming issues + - `Build failed`: Code compilation or dependency issues + - `Timeout`: Resource or network connectivity issues + +3. **Pipeline configuration:** + - Verify the deployment pipeline configuration is correct + - Check if any recent changes were made to deployment settings + + + + + +Try triggering a manual deployment to isolate the issue: + +1. **Manual deployment test:** + + - Go to SleakOps → Your Project → Deployments + - Click "Deploy Now" or "Manual Deploy" + - Select the branch you want to deploy + +2. **If manual deployment works:** + + - The issue is specifically with automatic triggers + - Focus on webhook and GitHub integration troubleshooting + +3. **If manual deployment fails:** + - The issue is with the deployment process itself + - Check build configuration, dependencies, and resource allocation + + + + + +If other solutions don't work, try resetting the GitHub integration: + +1. **Disconnect and reconnect:** + + - Go to SleakOps → Project Settings → Integrations + - Disconnect GitHub integration + - Reconnect and reauthorize access + +2. **Reconfigure webhook:** + + - Delete existing webhook from GitHub repository + - Let SleakOps recreate the webhook automatically + - Test with a new commit + +3. 
**Verify configuration:** + - Ensure branch settings are correct + - Confirm deployment triggers are enabled + - Check if any deployment rules or conditions are blocking execution + + + + + +Contact SleakOps support if: + +- All troubleshooting steps have been attempted +- Manual deployments also fail +- Error messages are unclear or system-related +- The issue affects multiple projects + +**Information to provide:** + +- Project name and ID +- GitHub repository URL +- Screenshots of error messages +- Timeline of when the issue started +- Recent changes made to the project or repository + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/gitlab-self-hosted-redirect-issue.mdx b/docs/troubleshooting/gitlab-self-hosted-redirect-issue.mdx new file mode 100644 index 000000000..e4c2eb0f3 --- /dev/null +++ b/docs/troubleshooting/gitlab-self-hosted-redirect-issue.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "GitLab Self-Hosted Redirect Issue" +description: "Solution for GitLab self-hosted repository connection issues with redirect loops" +date: "2025-02-25" +category: "project" +tags: ["gitlab", "self-hosted", "git", "connection", "redirect"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# GitLab Self-Hosted Redirect Issue + +**Date:** February 25, 2025 +**Category:** Project +**Tags:** GitLab, Self-hosted, Git, Connection, Redirect + +## Problem Description + +**Context:** User attempts to connect a self-hosted GitLab repository to SleakOps but encounters continuous redirects to the GitLab page instead of successful authentication. 
+ +**Observed Symptoms:** + +- Cannot connect to self-hosted GitLab repository +- Continuous redirect to GitLab page during authentication +- Connection process fails to complete +- Unable to access repository through SleakOps platform + +**Relevant Configuration:** + +- Repository type: GitLab self-hosted +- Connection method: Settings → Authorizations +- Platform: SleakOps +- Authentication flow: OAuth/redirect-based + +**Error Conditions:** + +- Redirect loop occurs during authentication +- Problem persists across multiple connection attempts +- Issue specific to self-hosted GitLab instances +- Standard GitLab.com connections may work normally + +## Detailed Solution + + + +Before troubleshooting the connection, ensure your GitLab instance is properly configured: + +1. **Check GitLab OAuth application settings**: + + - Go to your GitLab instance: `Admin Area → Applications` + - Verify the OAuth application for SleakOps exists + - Confirm the redirect URI matches SleakOps callback URL + +2. **Verify GitLab accessibility**: + - Ensure your GitLab instance is accessible from external networks + - Check if HTTPS is properly configured + - Verify SSL certificates are valid + + + + + +Follow these steps to connect your self-hosted GitLab: + +1. **In SleakOps platform**: + + - Navigate to **Settings → Authorizations** + - Click **Add Git Provider** + - Select **GitLab Self-Hosted** + +2. **Configure GitLab instance**: + + ``` + GitLab URL: https://your-gitlab-instance.com + Application ID: [from GitLab OAuth app] + Application Secret: [from GitLab OAuth app] + ``` + +3. **Complete OAuth flow**: + - Click **Connect** + - You'll be redirected to your GitLab instance + - Authorize the SleakOps application + - You should be redirected back to SleakOps + + + + + +If you're experiencing redirect loops, check these common issues: + +**1. 
Incorrect Redirect URI**: + +- In GitLab OAuth app settings, ensure redirect URI is: + ``` + https://app.sleakops.com/auth/gitlab/callback + ``` + +**2. Network connectivity**: + +- Verify SleakOps can reach your GitLab instance +- Check firewall rules and network policies +- Ensure no proxy is interfering with the connection + +**3. GitLab instance configuration**: + +- Verify `external_url` in GitLab configuration +- Check if GitLab is behind a reverse proxy +- Ensure proper HTTPS configuration + + + + + +If the OAuth application doesn't exist, create it: + +1. **In GitLab Admin Area**: + + - Go to **Admin Area → Applications** + - Click **New Application** + +2. **Application settings**: + + ``` + Name: SleakOps + Redirect URI: https://app.sleakops.com/auth/gitlab/callback + Scopes: + ✓ api + ✓ read_user + ✓ read_repository + ✓ write_repository + ``` + +3. **Save and note credentials**: + - Copy the Application ID + - Copy the Secret + - Use these in SleakOps configuration + + + + + +If OAuth continues to fail, try these alternatives: + +**1. Personal Access Token**: + +- Create a GitLab Personal Access Token +- Use token-based authentication instead of OAuth +- Configure in SleakOps with token credentials + +**2. SSH Key Authentication**: + +- Generate SSH key pair in SleakOps +- Add public key to GitLab user/project +- Use SSH-based repository URLs + +**3. Contact Support**: +If issues persist, contact SleakOps support with: + +- GitLab instance URL +- Network configuration details +- Error logs from both SleakOps and GitLab + + + + + +After configuration, verify the connection works: + +1. **Test repository access**: + + - Try to browse repositories in SleakOps + - Attempt to create a new project from GitLab repo + - Verify webhooks are properly configured + +2. 
**Check permissions**: + - Ensure SleakOps can read repository content + - Verify write permissions for CI/CD operations + - Test webhook delivery + + + +--- + +_This FAQ was automatically generated on February 25, 2025 based on a real user query._ diff --git a/docs/troubleshooting/global-variables-update-error.mdx b/docs/troubleshooting/global-variables-update-error.mdx new file mode 100644 index 000000000..ca416592b --- /dev/null +++ b/docs/troubleshooting/global-variables-update-error.mdx @@ -0,0 +1,279 @@ +--- +sidebar_position: 3 +title: "Global Variables Update Error in SleakOps" +description: "Solution for errors when updating global variables in SleakOps platform" +date: "2024-03-27" +category: "project" +tags: ["variables", "configuration", "error", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Global Variables Update Error in SleakOps + +**Date:** March 27, 2024 +**Category:** Project +**Tags:** Variables, Configuration, Error, Troubleshooting + +## Problem Description + +**Context:** User encounters errors when trying to update global variables through the SleakOps platform interface, preventing configuration changes from being applied. 
+ +**Observed Symptoms:** + +- Error message appears when attempting to modify global variables +- Configuration changes cannot be saved through the UI +- Variables update process fails without specific error details +- Unable to modify existing configuration settings + +**Relevant Configuration:** + +- Platform: SleakOps web interface +- Feature: Global Variables management +- Action: Updating existing variable values +- Error type: Generic error during save operation + +**Error Conditions:** + +- Error occurs when saving variable changes +- Affects global variable modifications +- Prevents configuration updates from being applied +- Issue persists across multiple attempts + +## Detailed Solution + + + +While the platform issue is being investigated, you can update variables directly in the cluster: + +**Using Lens (Kubernetes IDE):** + +1. **Connect to your cluster** through Lens +2. **Navigate to Secrets** in the left sidebar +3. **Find the relevant secret** containing your variables +4. **Edit the secret** by clicking the edit button +5. **Update the variable values** in the data section +6. **Save the changes** +7. 
**Restart the deployment** to apply the new variables + +```bash +# Alternative: Using kubectl +kubectl edit secret -n + +# After editing, restart the deployment +kubectl rollout restart deployment -n +``` + + + + + +If you prefer using kubectl directly: + +**Step 1: Identify the secret** + +```bash +# List secrets in your namespace +kubectl get secrets -n + +# Look for secrets with your application name or environment variables +kubectl describe secret -n +``` + +**Step 2: Update the secret** + +```bash +# Edit the secret directly +kubectl edit secret -n + +# Or patch specific values +kubectl patch secret -n --type='json' -p='[{"op": "replace", "path": "/data/YOUR_VARIABLE", "value": "'"$(echo -n 'new-value' | base64)"'"}]' + +# Create a new secret if needed +kubectl create secret generic \ + --from-literal=VAR1=value1 \ + --from-literal=VAR2=value2 \ + -n +``` + +**Step 3: Verify the changes** + +```bash +# Check the secret contents +kubectl get secret -n -o yaml + +# Decode a specific value to verify +kubectl get secret -n -o jsonpath='{.data.YOUR_VARIABLE}' | base64 -d +``` + +**Step 4: Apply changes to deployment** + +```bash +# Restart deployment to pick up new variables +kubectl rollout restart deployment -n + +# Check rollout status +kubectl rollout status deployment -n +``` + + + + + +For non-sensitive configuration variables, consider using ConfigMaps: + +**Create ConfigMap:** + +```bash +# Create ConfigMap from literals +kubectl create configmap app-config \ + --from-literal=API_URL=https://api.example.com \ + --from-literal=DEBUG_MODE=false \ + --from-literal=LOG_LEVEL=info \ + -n + +# Create from file +kubectl create configmap app-config --from-file=config.properties -n +``` + +**Update existing ConfigMap:** + +```bash +# Edit ConfigMap directly +kubectl edit configmap app-config -n + +# Replace with new values +kubectl create configmap app-config \ + --from-literal=API_URL=https://api-staging.example.com \ + --from-literal=DEBUG_MODE=true \ + 
--dry-run=client -o yaml | kubectl replace -f - +``` + +**Reference in deployment:** + +```yaml +spec: + template: + spec: + containers: + - name: app + env: + - name: API_URL + valueFrom: + configMapKeyRef: + name: app-config + key: API_URL + - name: DEBUG_MODE + valueFrom: + configMapKeyRef: + name: app-config + key: DEBUG_MODE +``` + + + + + +If the SleakOps interface continues to have issues: + +**1. Check browser console:** + +```javascript +// Open browser developer tools (F12) +// Look for JavaScript errors in Console tab +// Check Network tab for failed API requests +``` + +**2. Try different approaches:** + +- Clear browser cache and cookies +- Try incognito/private browsing mode +- Use a different browser +- Check for SleakOps platform status updates + +**3. Validate variable format:** + +- Ensure no special characters that might cause parsing issues +- Check for proper JSON format if using structured data +- Verify variable names follow naming conventions +- Ensure values don't exceed length limits + +**4. Platform-specific debugging:** + +```bash +# Check if the issue is in the platform or cluster +kubectl get events -n --sort-by='.lastTimestamp' + +# Look for any platform operator issues +kubectl get pods -n sleakops-system +kubectl logs -n sleakops-system +``` + + + + + +To prevent future issues: + +**1. Version control your variables:** + +```bash +# Export current variables for backup +kubectl get secret -n -o yaml > backup-secrets.yaml +kubectl get configmap -n -o yaml > backup-configmaps.yaml +``` + +**2. Use environment-specific naming:** + +```yaml +# Good naming convention +secrets: + - name: app-secrets-dev + - name: app-secrets-staging + - name: app-secrets-prod + +configmaps: + - name: app-config-dev + - name: app-config-staging + - name: app-config-prod +``` + +**3. 
Implement proper validation:** + +```bash +# Validate before applying +kubectl apply --dry-run=client -f your-config.yaml + +# Test deployment after changes +kubectl get pods -n +kubectl logs deployment/ -n +``` + +**4. Documentation:** + +- Document all variable purposes and expected values +- Maintain a change log for variable updates +- Keep emergency rollback procedures documented + +**5. Monitoring:** + +```bash +# Set up alerts for deployment failures +kubectl get events -n -w + +# Monitor application logs for configuration errors +kubectl logs -f deployment/ -n +``` + +**6. Regular maintenance:** + +- Remove unused variables regularly +- Update variable documentation +- Review and rotate sensitive values periodically +- Test variable updates in staging before production + + + +--- + +_This FAQ was automatically generated on March 27, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-404-error-troubleshooting.mdx b/docs/troubleshooting/grafana-404-error-troubleshooting.mdx new file mode 100644 index 000000000..4866977f3 --- /dev/null +++ b/docs/troubleshooting/grafana-404-error-troubleshooting.mdx @@ -0,0 +1,207 @@ +--- +sidebar_position: 3 +title: "Grafana 404 Error After Installation" +description: "Troubleshooting Grafana 404 errors when accessing the dashboard URL" +date: "2024-01-15" +category: "dependency" +tags: ["grafana", "monitoring", "404", "dns", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana 404 Error After Installation + +**Date:** January 15, 2024 +**Category:** Dependency +**Tags:** Grafana, Monitoring, 404, DNS, Troubleshooting + +## Problem Description + +**Context:** User successfully installs OpenTelemetry and Grafana in their development cluster through SleakOps, but encounters a 404 error when trying to access the Grafana dashboard URL. 
+ +**Observed Symptoms:** + +- Grafana URL returns "This grafana.develop.takenos.com page can't be found" (404 error) +- All pods appear healthy in Kubernetes Lens +- Grafana pod shows normal logs without errors +- No alerts or obvious error indicators in the cluster +- DNS resolution works (different from "server IP address could not be found" errors) + +**Relevant Configuration:** + +- Environment: Development cluster +- URL: `https://grafana.develop.takenos.com/` +- Components: OpenTelemetry + Grafana +- Monitoring tools: Kubernetes Lens + +**Error Conditions:** + +- Error occurs when accessing Grafana dashboard URL +- Pods are running and healthy +- DNS resolves correctly (no DNS resolution errors) +- Issue persists after waiting for DNS propagation + +## Detailed Solution + + + +The key difference between error types helps diagnose the issue: + +- **404 Error**: "This grafana.develop.takenos.com page can't be found" - DNS resolves but the service/ingress isn't properly configured +- **DNS Error**: "server IP address could not be found" - DNS resolution fails + +Since you're getting a 404, the DNS is working but there's a configuration issue with the ingress or service. 
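The same distinction can be checked from a terminal. The sketch below is a hypothetical helper (not part of any SleakOps tooling) that maps the HTTP status reported by `curl` to the likely cause; the probe URL in the comment is the one from this report, so substitute your own hostname.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper: map an HTTP status code (curl reports
# "000" when the transfer fails, e.g. on a DNS error) to a likely cause.
diagnose() {
  case "$1" in
    000)   echo "DNS/connection failure: check Route53 records and External-DNS" ;;
    404)   echo "DNS OK, HTTP 404: check the ingress host/path and backend service" ;;
    2*|3*) echo "Service reachable: HTTP $1" ;;
    *)     echo "Unexpected HTTP status $1: inspect ingress controller logs" ;;
  esac
}

# Probe (requires network access; URL taken from this report):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://grafana.develop.takenos.com/)
#   diagnose "$status"
diagnose 404   # prints the 404 diagnosis
```

A 404 from this probe sends you down the ingress/service checks below rather than DNS debugging.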
+ + + + + +Check if the Grafana ingress is properly configured: + +```bash +# Check ingress resources +kubectl get ingress -A + +# Look for grafana ingress specifically +kubectl get ingress -A | grep grafana + +# Describe the grafana ingress +kubectl describe ingress grafana-ingress -n <namespace> +``` + +Verify that: + +- The ingress exists and has the correct host (`grafana.develop.takenos.com`) +- The ingress has a valid backend service +- The ingress controller is running + + + + + +Ensure the Grafana service is properly configured: + +```bash +# List all services +kubectl get svc -A + +# Check grafana service specifically +kubectl get svc -A | grep grafana + +# Describe the grafana service +kubectl describe svc grafana -n <namespace> + +# Test service connectivity +kubectl port-forward svc/grafana 3000:3000 -n <namespace> +``` + +Then test locally: `http://localhost:3000` + + + + + +Even though pods look healthy, verify they're properly connected: + +```bash +# Check pod status and readiness +kubectl get pods -A | grep grafana + +# Check pod logs for any startup issues +kubectl logs <grafana-pod-name> -n <namespace> + +# Verify service endpoints +kubectl get endpoints -A | grep grafana +``` + +Ensure the service endpoints show the pod IPs. + + + + + +Verify the ingress controller is working: + +```bash +# Check ingress controller pods +kubectl get pods -n ingress-nginx +# or +kubectl get pods -n kube-system | grep ingress + +# Check ingress controller logs +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller +``` + +Look for any errors related to your grafana ingress. + + + + + +In the SleakOps platform: + +1. **Check the Monitoring Add-on Status**: + + - Go to your cluster configuration + - Verify the monitoring add-on is fully deployed + - Check for any deployment warnings or errors + +2. **Verify DNS Configuration**: + + - Ensure your domain is properly configured in SleakOps + - Check if SSL certificates are properly issued + +3. 
**Review Deployment Logs**: + - Check the deployment history for any failed steps + - Look for ingress or service creation failures + + + + + +**Fix 1: Recreate the Ingress** + +```bash +kubectl delete ingress grafana-ingress -n <namespace> +# Wait for SleakOps to recreate it, or manually apply the correct configuration +``` + +**Fix 2: Check Grafana Configuration** + +```bash +# Check if Grafana is configured with the correct root URL +kubectl get configmap grafana-config -n <namespace> -o yaml +``` + +**Fix 3: Restart Grafana Pod** + +```bash +kubectl delete pod <grafana-pod-name> -n <namespace> +``` + +**Fix 4: Verify Path-based Routing** +Some configurations use path-based routing. Try accessing: + +- `https://grafana.develop.takenos.com/grafana/` +- `https://develop.takenos.com/grafana/` + + + + + +After applying fixes: + +1. **Wait for propagation** (2-5 minutes) +2. **Clear browser cache** or try incognito mode +3. **Test the URL**: `https://grafana.develop.takenos.com/` +4. **Check ingress status**: + ```bash + kubectl get ingress grafana-ingress -n <namespace> + ``` +5. 
**Verify SSL certificate** if using HTTPS + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-loki-datasource-configuration.mdx b/docs/troubleshooting/grafana-loki-datasource-configuration.mdx new file mode 100644 index 000000000..bf40bc6cd --- /dev/null +++ b/docs/troubleshooting/grafana-loki-datasource-configuration.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "Grafana Loki Datasource Configuration Issue" +description: "Solution for Loki datasource not being added to Grafana automatically" +date: "2024-11-15" +category: "dependency" +tags: ["grafana", "loki", "datasource", "monitoring", "logs"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Loki Datasource Configuration Issue + +**Date:** November 15, 2024 +**Category:** Dependency +**Tags:** Grafana, Loki, Datasource, Monitoring, Logs + +## Problem Description + +**Context:** Users experience issues with Grafana not displaying logs properly due to Loki datasource not being automatically configured during SleakOps deployment. + +**Observed Symptoms:** + +- Grafana dashboard loads but shows no log data +- Loki datasource missing from Grafana configuration +- Log queries return empty results +- Stream filter defaults to 'stderr' but shows no data + +**Relevant Configuration:** + +- Monitoring stack: Grafana + Loki +- Default stream filter: `stderr` +- Loki components: `loki-read` pod +- Authentication: Grafana admin user required + +**Error Conditions:** + +- Occurs during initial SleakOps deployment +- Loki datasource addition fails silently +- Requires manual intervention to resolve +- May need pod restarts to function properly + +## Detailed Solution + + + +To check if Loki datasource is properly configured in Grafana: + +1. 
**Access Grafana dashboard** + + - Use the admin credentials provided by SleakOps + - Navigate to **Configuration** → **Data Sources** + +2. **Check for Loki datasource** + + - Look for a datasource named "Loki" or similar + - Verify the URL points to your Loki service + - Test the connection using the "Save & Test" button + +3. **Expected Loki URL format:** + ``` + http://loki-gateway.monitoring.svc.cluster.local:80 + ``` + or + ``` + http://loki-read.monitoring.svc.cluster.local:3100 + ``` + + + + + +If the Loki datasource is missing, add it manually: + +1. **In Grafana, go to Configuration → Data Sources** +2. **Click "Add data source"** +3. **Select "Loki" from the list** +4. **Configure the datasource:** + + - **Name:** `Loki` + - **URL:** `http://loki-gateway.monitoring.svc.cluster.local:80` + - **Access:** `Server (default)` + +5. **Save and test the connection** + +```yaml +# Example datasource configuration +apiVersion: 1 +datasources: + - name: Loki + type: loki + access: proxy + url: http://loki-gateway.monitoring.svc.cluster.local:80 + isDefault: false +``` + + + + + +If Loki datasource exists but doesn't work properly: + +1. **Restart loki-read pod:** + + ```bash + kubectl delete pod -l app=loki-read -n monitoring + ``` + +2. **Check pod status:** + + ```bash + kubectl get pods -n monitoring | grep loki + ``` + +3. **Verify Loki logs:** + + ```bash + kubectl logs -l app=loki-read -n monitoring + ``` + +4. **Test Loki API directly:** + ```bash + kubectl port-forward svc/loki-gateway 3100:80 -n monitoring + curl http://localhost:3100/ready + ``` + + + + + +Once Loki datasource is working: + +1. **Understanding stream filters:** + + - Default filter searches for `stderr` streams + - Change to `stdout` to see application logs + - Use `{stream="stderr"}` or `{stream="stdout"}` in queries + +2. 
**Common LogQL queries:** + + ```logql + # All stderr logs + {stream="stderr"} + + # All stdout logs + {stream="stdout"} + + # Logs from specific namespace + {namespace="default"} + + # Logs from specific pod + {pod="my-app-12345"} + + # Combined filters + {namespace="default", stream="stderr"} + ``` + +3. **Adjust time range:** + - Use Grafana's time picker + - Start with "Last 1 hour" for testing + - Expand range if no logs appear + + + + + +**If logs still don't appear:** + +1. **Check if applications are generating logs:** + + ```bash + kubectl logs -n + ``` + +2. **Verify Loki is receiving logs:** + + ```bash + # Check Loki ingester logs + kubectl logs -l app=loki-write -n monitoring + ``` + +3. **Restart Grafana if needed:** + + ```bash + kubectl delete pod -l app.kubernetes.io/name=grafana -n monitoring + ``` + +4. **Check Grafana logs:** + ```bash + kubectl logs -l app.kubernetes.io/name=grafana -n monitoring + ``` + +**Prevention for future deployments:** + +- This issue is being addressed in upcoming SleakOps versions +- Automatic datasource configuration will be implemented +- Manual verification steps will be documented in deployment guides + + + +--- + +_This FAQ was automatically generated on November 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-loki-installation-troubleshooting.mdx b/docs/troubleshooting/grafana-loki-installation-troubleshooting.mdx new file mode 100644 index 000000000..51b033bef --- /dev/null +++ b/docs/troubleshooting/grafana-loki-installation-troubleshooting.mdx @@ -0,0 +1,194 @@ +--- +sidebar_position: 3 +title: "Grafana and Loki Installation Issues" +description: "Troubleshooting stuck Loki installation and Grafana timeout issues" +date: "2024-11-12" +category: "dependency" +tags: ["grafana", "loki", "monitoring", "addons", "timeout"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana and Loki Installation Issues + +**Date:** 
November 12, 2024 +**Category:** Dependency +**Tags:** Grafana, Loki, Monitoring, Addons, Timeout + +## Problem Description + +**Context:** User installed Grafana and Loki addons on their production cluster through SleakOps platform but encountered installation and access issues. + +**Observed Symptoms:** + +- Grafana shows as "installed" but returns timeout when accessing the web interface +- Loki installation appears stuck and never completes +- Grafana becomes accessible after resuming stuck Loki installation +- Access works properly when connected via VPN + +**Relevant Configuration:** + +- Environment: Production cluster +- Addons: Grafana and Loki +- Access method: VPN connection required +- Installation status: Grafana marked as installed, Loki stuck during installation + +**Error Conditions:** + +- Timeout errors when accessing Grafana web interface +- Loki installation process hangs indefinitely +- Issues resolved after manually resuming stuck installation + +## Detailed Solution + + + +To verify the current status of your addons in SleakOps: + +1. Navigate to your **Cluster Dashboard** +2. Go to the **Addons** section +3. Check the status of each addon: + - **Installing**: Still in progress + - **Installed**: Successfully deployed + - **Failed**: Installation encountered errors + - **Stuck**: Installation appears frozen + +If an addon shows as "Installing" for more than 30 minutes, it may be stuck. + + + + + +When Loki installation gets stuck: + +1. Go to **Cluster Dashboard** → **Addons** +2. Find the Loki addon with "Installing" status +3. Click on the **three dots menu** next to Loki +4. Select **"Resume Installation"** or **"Retry"** +5. Monitor the installation progress +6. 
Wait for status to change to "Installed" + +**Alternative method via kubectl:** + +```bash +# Check pod status +kubectl get pods -n loki-system + +# Check for stuck pods +kubectl describe pod -n loki-system + +# Restart stuck pods if necessary +kubectl delete pod -n loki-system +``` + + + + + +Grafana timeout issues are often related to: + +1. **Incomplete Loki installation**: Grafana may depend on Loki being fully installed +2. **Network connectivity**: Ensure you're connected via VPN +3. **Pod readiness**: Grafana pods may not be fully ready + +**Steps to resolve:** + +1. **Ensure Loki is fully installed** (as described above) +2. **Check Grafana pod status:** + ```bash + kubectl get pods -n grafana-system + kubectl logs -f -n grafana-system + ``` +3. **Verify VPN connection** is active +4. **Wait 5-10 minutes** after Loki installation completes +5. **Try accessing Grafana again** + + + + + +Grafana and other monitoring tools in SleakOps typically require VPN access for security: + +**Requirements:** + +- Active VPN connection to your cluster network +- Proper DNS resolution through VPN +- Correct firewall rules allowing access + +**Verification steps:** + +1. Confirm VPN connection is active +2. Test DNS resolution: `nslookup grafana.your-cluster.local` +3. Check if you can reach other cluster services +4. Verify your user has proper permissions for Grafana access + + + + + +To avoid similar issues in the future: + +**Installation order:** + +1. Install Loki first (logging backend) +2. Wait for Loki to be fully ready +3. 
Then install Grafana (visualization frontend) + +**Monitoring installation:** + +- Check addon status every 10-15 minutes during installation +- Don't install multiple heavy addons simultaneously +- Ensure cluster has sufficient resources before installing + +**Resource requirements:** + +```yaml +# Minimum recommended resources +Loki: + memory: 2Gi + cpu: 500m +Grafana: + memory: 1Gi + cpu: 250m +``` + + + + + +**Check overall cluster health:** + +```bash +kubectl get nodes +kubectl top nodes +kubectl get pods --all-namespaces | grep -E "(loki|grafana)" +``` + +**Check specific addon logs:** + +```bash +# Loki logs +kubectl logs -f deployment/loki -n loki-system + +# Grafana logs +kubectl logs -f deployment/grafana -n grafana-system +``` + +**Check service endpoints:** + +```bash +kubectl get svc -n grafana-system +kubectl get svc -n loki-system +``` + +**Check ingress/routes:** + +```bash +kubectl get ingress --all-namespaces +``` + + + +--- + +_This FAQ was automatically generated on November 12, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-loki-log-ingestion-issues.mdx b/docs/troubleshooting/grafana-loki-log-ingestion-issues.mdx new file mode 100644 index 000000000..2eed0f011 --- /dev/null +++ b/docs/troubleshooting/grafana-loki-log-ingestion-issues.mdx @@ -0,0 +1,443 @@ +--- +sidebar_position: 3 +title: "Grafana Loki Log Ingestion Issues" +description: "Troubleshooting missing logs and loki-write pod failures in Grafana" +date: "2024-12-19" +category: "dependency" +tags: ["grafana", "loki", "logging", "monitoring", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Loki Log Ingestion Issues + +**Date:** December 19, 2024 +**Category:** Dependency +**Tags:** Grafana, Loki, Logging, Monitoring, Troubleshooting + +## Problem Description + +**Context:** Users experience issues with Grafana's log visualization where the loki-write pod appears unable to 
write and store logs properly, resulting in missing log entries from application services. + +**Observed Symptoms:** + +- Missing initial logs from services in Grafana dashboard +- Log entries appear hours after the actual service start time +- loki-write pod experiencing write/storage failures +- Incomplete log timeline with gaps in log history +- Services monitored through Lens show logs that don't appear in Grafana + +**Relevant Configuration:** + +- Component: Grafana with Loki backend +- Affected pod: `loki-write` +- Log ingestion: Real-time application logs +- Monitoring setup: SleakOps integrated Grafana stack + +**Error Conditions:** + +- Logs missing from service startup period +- Delayed log appearance (hours after actual events) +- Inconsistent log ingestion across different services +- loki-write pod unable to persist log data + +## Detailed Solution + + + +To diagnose loki-write pod issues: + +1. **Check pod status and logs:** + +```bash +kubectl get pods -n monitoring | grep loki-write +kubectl logs -n monitoring loki-write-0 --tail=100 +``` + +2. **Verify storage configuration:** + +```bash +kubectl describe pvc -n monitoring | grep loki +``` + +3. **Check resource limits:** + +```bash +kubectl describe pod -n monitoring loki-write-0 +``` + +Common issues include: + +- Insufficient storage space +- Memory/CPU resource constraints +- PVC mounting problems +- Network connectivity issues + + + + + +If storage is the root cause: + +1. **Check available storage:** + +```bash +kubectl exec -n monitoring loki-write-0 -- df -h +``` + +2. **Verify PVC status:** + +```bash +kubectl get pvc -n monitoring +kubectl describe pvc loki-storage -n monitoring +``` + +3. **Increase storage if needed:** + +```yaml +# In your Loki configuration +persistence: + enabled: true + size: 50Gi # Increase from default + storageClass: gp3 +``` + +4. 
**Clean up old logs if storage is full:** + +```bash +# Access loki-write pod +kubectl exec -it -n monitoring loki-write-0 -- /bin/sh +# Check and clean old chunks +ls -la /loki/chunks/ +``` + + + + + +To prevent resource-related log ingestion issues: + +1. **Increase memory limits:** + +```yaml +# Loki write component resources +write: + resources: + requests: + memory: 512Mi + cpu: 100m + limits: + memory: 2Gi + cpu: 500m +``` + +2. **Configure proper retention:** + +```yaml +limits_config: + retention_period: 168h # 7 days + ingestion_rate_mb: 10 + ingestion_burst_size_mb: 20 +``` + +3. **Optimize chunk configuration:** + +```yaml +chunk_store_config: + max_look_back_period: 168h +schema_config: + configs: + - from: 2023-01-01 + store: boltdb-shipper + object_store: s3 + schema: v11 + index: + prefix: loki_index_ + period: 24h +``` + + + + + +To ensure logs are being properly ingested: + +1. **Check Promtail configuration:** + +```bash +kubectl logs -n monitoring promtail-daemonset-xxx +``` + +2. **Verify log shipping:** + +```bash +# Check if logs are being sent to Loki +kubectl exec -n monitoring promtail-xxx -- wget -qO- http://localhost:3101/metrics | grep promtail_sent +``` + +3. **Test Loki API directly:** + +```bash +# Query Loki for recent logs +kubectl port-forward -n monitoring svc/loki 3100:3100 +curl -G -s "http://localhost:3100/loki/api/v1/query" --data-urlencode 'query={job="your-service"}' --data-urlencode 'start=1h' +``` + +4. **Check service discovery:** + +```bash +# Verify Promtail is discovering your pods +kubectl exec -n monitoring promtail-xxx -- wget -qO- http://localhost:3101/targets +``` + + + + + +Ensure Grafana is properly configured to query Loki: + +1. **Verify Loki datasource:** + + - Go to Grafana → Configuration → Data Sources + - Check Loki URL: `http://loki:3100` + - Test connection + +2. **Configure proper time ranges:** + + - In Grafana dashboards, ensure time range covers the expected period + - Check timezone settings + +3. 
**Optimize query performance:** + +```logql +# Use efficient LogQL queries +{namespace="your-namespace", pod=~"your-service-.*"} |= "your-search-term" +``` + +4. **Set appropriate refresh intervals:** + - For real-time monitoring: 5-10 seconds + - For historical analysis: 1-5 minutes + + + + + +To prevent future log ingestion issues: + +1. **Monitor Loki metrics:** + +```yaml +# Add alerts for Loki health +- alert: LokiWriteErrors + expr: increase(loki_ingester_chunks_flushed_total{status="failed"}[5m]) > 0 + for: 2m + labels: + severity: warning + annotations: + summary: "Loki write errors detected" +``` + +2. **Set up storage monitoring:** + +```yaml +- alert: LokiStorageFull + expr: (kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*loki.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*loki.*"}) * 100 < 15 + for: 5m + labels: + severity: warning + annotations: + summary: "Loki storage is running low" + description: "Loki storage has less than 15% space available" + +- alert: LokiIngestionRate + expr: rate(loki_distributor_lines_received_total[5m]) > 1000 + for: 2m + labels: + severity: info + annotations: + summary: "High log ingestion rate detected" + description: "Loki is receiving logs at {{ $value }} lines/second" +``` + +3. **Monitor log pipeline health:** + +```bash +# Create a dashboard to monitor: +# - Log ingestion rate +# - Storage usage +# - Write/read latency +# - Error rates +# - Memory/CPU usage +``` + + + + + +Configure proper log retention to prevent storage issues: + +1. **Set up automatic log cleanup:** + +```yaml +# Configure retention policies +compactor: + working_directory: /loki/compactor + shared_store: s3 + compaction_interval: 5m + retention_enabled: true + retention_delete_delay: 2h + +limits_config: + retention_period: 336h # 14 days + enforce_metric_name: false + reject_old_samples: true + reject_old_samples_max_age: 168h +``` + +2. 
**Monitor retention effectiveness:** + +```bash +# Check compactor logs +kubectl logs -n monitoring loki-compactor-0 + +# Verify retention is working +kubectl exec -n monitoring loki-read-0 -- ls -la /loki/chunks/ +``` + +3. **Manual cleanup if needed:** + +```bash +# Emergency cleanup procedure (use with caution) +kubectl exec -it -n monitoring loki-write-0 -- /bin/sh +# Remove old chunks manually if automatic cleanup fails +find /loki/chunks -type f -mtime +7 -delete +``` + + + + + +For high-volume log environments: + +1. **Scale Loki components:** + +```yaml +# Scale write components +write: + replicas: 3 + resources: + requests: + memory: 1Gi + cpu: 200m + limits: + memory: 4Gi + cpu: 1000m + +# Scale read components +read: + replicas: 2 + resources: + requests: + memory: 512Mi + cpu: 100m + limits: + memory: 2Gi + cpu: 500m +``` + +2. **Optimize ingestion performance:** + +```yaml +# Tune ingestion settings +ingester: + chunk_block_size: 262144 + chunk_target_size: 1572864 + max_chunk_age: 2h + chunk_encoding: snappy + + lifecycler: + ring: + kvstore: + store: etcd + etcd: + endpoints: + - etcd:2379 +``` + +3. 
**Configure load balancing:** + +```yaml +# Distribute load across ingesters +distributor: + ring: + kvstore: + store: etcd + etcd: + endpoints: + - etcd:2379 +``` + + + + + +Use this checklist when logs are missing or delayed: + +**Step 1: Basic Health Check** + +```bash +□ Check all Loki pods are running +□ Verify no pods are in CrashLoopBackOff +□ Check pod resource usage (memory, CPU) +□ Verify storage availability +``` + +**Step 2: Configuration Verification** + +```bash +□ Validate Loki configuration syntax +□ Check Promtail configuration for target discovery +□ Verify Grafana datasource configuration +□ Confirm log shipping endpoints +``` + +**Step 3: Network and Connectivity** + +```bash +□ Test connectivity between Promtail and Loki +□ Verify service discovery is working +□ Check firewall/network policies +□ Test Loki API endpoints +``` + +**Step 4: Data Pipeline** + +```bash +□ Verify logs are being generated by applications +□ Check Promtail is discovering log files/streams +□ Confirm logs are reaching Loki (check metrics) +□ Test querying logs directly via API +``` + +**Step 5: Resource and Performance** + +```bash +□ Monitor ingestion rate vs capacity +□ Check for storage space issues +□ Verify retention policies are working +□ Review alert configurations +``` + +**Emergency Recovery Procedures:** + +1. Restart loki-write pods if stuck +2. Clear storage space if full +3. Reset Promtail if not shipping logs +4. Recreate Grafana datasource if queries fail +5. 
Scale up resources if performance issues persist + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-loki-log-viewing-issues.mdx b/docs/troubleshooting/grafana-loki-log-viewing-issues.mdx new file mode 100644 index 000000000..3f0d1a499 --- /dev/null +++ b/docs/troubleshooting/grafana-loki-log-viewing-issues.mdx @@ -0,0 +1,550 @@ +--- +sidebar_position: 3 +title: "Grafana Loki Log Viewing Issues" +description: "Solutions for incomplete log viewing and timezone issues in Grafana Loki dashboard" +date: "2025-01-15" +category: "general" +tags: ["grafana", "loki", "logs", "timezone", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Loki Log Viewing Issues + +**Date:** January 15, 2025 +**Category:** General +**Tags:** Grafana, Loki, Logs, Timezone, Troubleshooting + +## Problem Description + +**Context:** Users experience issues when viewing logs in Grafana Loki dashboard, particularly with CronJob pods and services where the complete log output is not visible, making it difficult to determine if jobs completed successfully. 
+ +**Observed Symptoms:** + +- Incomplete log display - final lines of logs not showing +- Cannot see job completion status (success/failure) +- Grafana dashboard freezes or shows infinite loading +- Timeout errors when accessing Loki logs +- Missing final "OK" or completion messages in logs +- Issues occur both in SleakOps dashboard and Lens + +**Relevant Configuration:** + +- Platform: SleakOps with Grafana/Loki integration +- Workload type: CronJobs and regular Pods +- Log viewing: Through Grafana dashboard and Lens +- Timezone: UTC vs local timezone mismatches + +**Error Conditions:** + +- Error occurs when querying logs outside specific time ranges +- Dashboard breaks when there are gaps in log data +- Timezone configuration causes log viewing issues +- Query to Loki fails when time range includes periods without logs + +## Detailed Solution + + + +The primary cause of log viewing issues is timezone mismatch between dashboards: + +**Solution:** + +1. **Set both dashboards to the same timezone**: + + - Change the main dashboard timezone to **UTC 0** + - Or set both dashboards to **UTC-3** (local timezone) + +2. **How to change timezone in Grafana**: + + - Go to dashboard settings (gear icon) + - Navigate to **General** → **Time options** + - Set **Timezone** to desired value + - Save dashboard + +3. **Verify time ranges match**: + - Ensure both log explorer and metrics dashboards use the same time range + - Use absolute time ranges when possible + + + + + +To avoid query failures, filter logs within valid time ranges: + +**Steps:** + +1. **Identify valid log periods**: + + - First check the metrics dashboard to see when your service actually ran + - Note the exact time range with log activity + +2. **Apply conservative time filters**: + + - Use the time range picker in Grafana + - Set **From** and **To** times to cover only periods with known log activity + - Avoid extending beyond the range where logs exist + +3. 
**Example time filtering**: + ``` + From: 2025-01-15 14:00:00 + To: 2025-01-15 16:00:00 + ``` + +**Note:** Extending the time range beyond periods with actual logs will cause the Loki query to fail. + + + + + +When logs appear incomplete (missing final lines): + +**Possible causes:** + +1. **Pod termination timing**: Logs may be cut off if the pod terminates before all logs are flushed +2. **Loki ingestion delay**: There might be a delay between log generation and availability in Loki +3. **Buffer issues**: Log buffers may not be fully flushed before pod termination + +**Diagnostic steps:** + +1. **Check pod status**: + + ```bash + kubectl get pods -n + kubectl describe pod -n + ``` + +2. **Verify job completion**: + + ```bash + kubectl get jobs -n + kubectl describe job -n + ``` + +3. **Check alternative log sources**: + - Use `kubectl logs` directly + - Check Lens pod logs + - Verify file outputs if the service writes to files + + + + + +For CronJobs, the pod status indicates job success/failure: + +**Pod status meanings:** + +- **Completed**: Job finished successfully (exit code 0) +- **Failed**: Job failed (exit code 1 or other non-zero) +- **Running**: Job still executing + +**Best practices for CronJob logging:** + +1. **Always include explicit completion messages**: + + ```bash + echo "Job started at $(date)" + # Your job logic here + echo "Job completed successfully at $(date)" + exit 0 + ``` + +2. **Use proper exit codes**: + + ```bash + # For success + exit 0 + + # For failure + echo "Error: Something went wrong" + exit 1 + ``` + +3. **Check job history**: + ```bash + kubectl get jobs -n --show-labels + kubectl describe cronjob -n + ``` + + + + + +While permanent fixes are being developed: + +**Immediate workarounds:** + +1. **Use kubectl for complete logs**: + + ```bash + kubectl logs -n --tail=-1 + ``` + +2. **Check multiple log sources**: + + - SleakOps dashboard + - Lens application + - Direct kubectl commands + - File outputs (if applicable) + +3. 
**Verify job completion through pod status**: + + ```bash + kubectl get pods -n --field-selector=status.phase=Succeeded + kubectl get pods -n --field-selector=status.phase=Failed + ``` + +4. **Use conservative time ranges**: + - Only query time periods where you know logs exist + - Avoid large time ranges that might include gaps + +**Long-term solutions in development:** + +- Fix for Loki query handling when log gaps exist +- Improved timezone handling in dashboards +- Better log buffer flushing for terminating pods + + + + + +Use these LogQL query patterns to improve log viewing: + +1. **Query with specific time boundaries**: + +```logql +{namespace="your-namespace", pod=~"your-pod.*"} +| json +| timestamp >= "2025-01-15T14:00:00Z" +| timestamp <= "2025-01-15T16:00:00Z" +``` + +2. **Filter by log level to find completion messages**: + +```logql +{namespace="your-namespace"} +|~ "(?i)(completed|finished|success|done|exit)" +| line_format "{{.timestamp}}: {{.message}}" +``` + +3. **Query the last N log entries**: + +```logql +{namespace="your-namespace", pod=~"your-service.*"} +| tail 100 +``` + +4. **Find error patterns and completion status**: + +```logql +{namespace="your-namespace"} +|~ "(?i)(error|fail|exception|completed|success)" +| json +| line_format "{{.level}}: {{.message}}" +``` + +5. **Time-based log aggregation**: + +```logql +sum by (pod) ( + count_over_time({namespace="your-namespace"}[5m]) +) +``` + + + + + +Properly configure Loki to ensure log availability: + +1. **Check current retention configuration**: + +```bash +# View Loki configuration +kubectl get configmap loki-config -n monitoring -o yaml + +# Check retention settings +kubectl exec -n monitoring loki-read-0 -- cat /etc/loki/config.yaml | grep -A 10 retention +``` + +2. 
**Configure appropriate retention periods**: + +```yaml +# Example Loki retention configuration +limits_config: + retention_period: 168h # 7 days + enforce_metric_name: false + reject_old_samples: true + reject_old_samples_max_age: 168h + +compactor: + working_directory: /loki/compactor + shared_store: s3 + compaction_interval: 5m + retention_enabled: true + retention_delete_delay: 2h +``` + +3. **Monitor log storage usage**: + +```bash +# Check storage utilization +kubectl exec -n monitoring loki-write-0 -- df -h /loki + +# Check number of chunks +kubectl exec -n monitoring loki-write-0 -- find /loki/chunks -name "*.gz" | wc -l +``` + + + + + +Debug issues with specific Loki components: + +1. **Loki Write component issues**: + +```bash +# Check write component logs +kubectl logs -n monitoring loki-write-0 --tail=100 + +# Check write metrics +kubectl exec -n monitoring loki-write-0 -- wget -qO- http://localhost:3100/metrics | grep loki_ingester + +# Verify write component health +kubectl exec -n monitoring loki-write-0 -- wget -qO- http://localhost:3100/ready +``` + +2. **Loki Read component issues**: + +```bash +# Check read component logs +kubectl logs -n monitoring loki-read-0 --tail=100 + +# Test query functionality +kubectl exec -n monitoring loki-read-0 -- wget -qO- "http://localhost:3100/loki/api/v1/labels" + +# Check read component metrics +kubectl exec -n monitoring loki-read-0 -- wget -qO- http://localhost:3100/metrics | grep loki_querier +``` + +3. **Promtail (log shipper) issues**: + +```bash +# Check Promtail pods +kubectl get pods -n monitoring | grep promtail + +# Check Promtail logs +kubectl logs -n monitoring daemonset/promtail --tail=100 + +# Verify Promtail configuration +kubectl get configmap promtail-config -n monitoring -o yaml + +# Check Promtail targets +kubectl exec -n monitoring promtail-xxx -- wget -qO- http://localhost:3101/targets +``` + + + + + +Improve log ingestion performance and reliability: + +1. 
**Optimize Promtail configuration**: + +```yaml +# Enhanced Promtail configuration +scrape_configs: + - job_name: kubernetes-pods + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + action: keep + regex: true + - source_labels: [__meta_kubernetes_pod_container_name] + target_label: container + - source_labels: [__meta_kubernetes_pod_name] + target_label: pod + - source_labels: [__meta_kubernetes_namespace] + target_label: namespace + pipeline_stages: + - json: + expressions: + timestamp: timestamp + message: message + level: level + - timestamp: + source: timestamp + format: RFC3339 + - labels: + level: +``` + +2. **Configure log buffering**: + +```yaml +# Improved buffering configuration +limits_config: + ingestion_rate_mb: 10 + ingestion_burst_size_mb: 20 + max_streams_per_user: 10000 + max_line_size: 256000 + +server: + http_listen_port: 3100 + grpc_listen_port: 9095 + http_server_read_timeout: 30s + http_server_write_timeout: 30s +``` + +3. **Implement log sampling for high-volume services**: + +```yaml +# Sample high-volume logs +pipeline_stages: + - match: + selector: '{level="debug"}' + stages: + - sampling: + rate: 0.1 # Keep only 10% of debug logs + - match: + selector: '{level="info"}' + stages: + - sampling: + rate: 0.5 # Keep 50% of info logs +``` + + + + + +When logs are completely inaccessible through Grafana: + +1. **Direct kubectl log access**: + +```bash +# Get logs from all containers in a pod +kubectl logs -n --all-containers=true + +# Get logs from previous pod instance +kubectl logs -n --previous + +# Get logs with timestamps +kubectl logs -n --timestamps=true + +# Follow live logs +kubectl logs -f -n +``` + +2. **Export logs to files**: + +```bash +# Export recent logs to file +kubectl logs -n --since=1h > pod-logs-$(date +%Y%m%d-%H%M%S).log + +# Export all available logs +kubectl logs -n --tail=-1 > complete-pod-logs.log +``` + +3. 
**Access logs via node filesystem** (if pods write to hostPath): + +```bash +# Connect to node and access log files +kubectl debug node/ -it --image=busybox +# Navigate to /host/var/log/pods/ to find pod logs +``` + +4. **Query Loki API directly**: + +```bash +# Port-forward to Loki +kubectl port-forward -n monitoring svc/loki 3100:3100 & + +# Query logs directly +curl -G -s "http://localhost:3100/loki/api/v1/query_range" \ + --data-urlencode 'query={namespace="your-namespace"}' \ + --data-urlencode 'start=2025-01-15T14:00:00Z' \ + --data-urlencode 'end=2025-01-15T16:00:00Z' \ + --data-urlencode 'limit=1000' | jq '.data.result' +``` + + + + + +Set up monitoring to detect log ingestion issues: + +1. **Key metrics to monitor**: + +```promql +# Log ingestion rate +rate(loki_distributor_lines_received_total[5m]) + +# Log ingestion errors +rate(loki_distributor_lines_received_total{status="error"}[5m]) + +# Query performance +histogram_quantile(0.95, rate(loki_request_duration_seconds_bucket[5m])) + +# Storage utilization +loki_ingester_memory_chunks / loki_ingester_memory_chunks_max +``` + +2. **Alerting rules for log pipeline**: + +```yaml +groups: + - name: loki-pipeline + rules: + - alert: LokiLogIngestionStopped + expr: rate(loki_distributor_lines_received_total[5m]) == 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Loki log ingestion has stopped" + + - alert: LokiHighErrorRate + expr: rate(loki_distributor_lines_received_total{status="error"}[5m]) > 10 + for: 2m + labels: + severity: critical + annotations: + summary: "High error rate in Loki log ingestion" +``` + +3. 
**Dashboard panels for log health**: + +```json +{ + "title": "Log Pipeline Health", + "panels": [ + { + "title": "Logs Ingested per Second", + "targets": [{ "expr": "rate(loki_distributor_lines_received_total[5m])" }] + }, + { + "title": "Log Ingestion Errors", + "targets": [ + { + "expr": "rate(loki_distributor_lines_received_total{status=\"error\"}[5m])" + } + ] + }, + { + "title": "Query Latency (95th percentile)", + "targets": [ + { + "expr": "histogram_quantile(0.95, rate(loki_request_duration_seconds_bucket[5m]))" + } + ] + } + ] +} +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/grafana-loki-not-loading-logs.mdx b/docs/troubleshooting/grafana-loki-not-loading-logs.mdx new file mode 100644 index 000000000..dd9a3fd6f --- /dev/null +++ b/docs/troubleshooting/grafana-loki-not-loading-logs.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Grafana Loki Not Loading Logs or Options" +description: "Solution for Grafana Loki when dashboards and explore view don't load log options" +date: "2024-10-21" +category: "dependency" +tags: ["grafana", "loki", "logs", "monitoring", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Loki Not Loading Logs or Options + +**Date:** October 21, 2024 +**Category:** Dependency +**Tags:** Grafana, Loki, Logs, Monitoring, Troubleshooting + +## Problem Description + +**Context:** Users experience issues with Grafana Loki where log dashboards and the explore view stop working, preventing access to log data and options. 
+
+**Observed Symptoms:**
+
+- Grafana Loki stops functioning suddenly
+- Log dashboards don't load any options
+- Explore view with Loki data source shows no available options
+- Loki service status appears normal
+- Problem affects all log-related functionality in Grafana
+
+**Relevant Configuration:**
+
+- Service: Grafana with Loki data source
+- Platform: SleakOps managed environment
+- Affected components: Log dashboards and Explore view
+- External status: Loki service status shows as operational
+
+**Error Conditions:**
+
+- Error occurs intermittently without clear trigger
+- Affects all users accessing log functionality
+- Problem persists until manual intervention
+- No obvious configuration changes preceding the issue
+
+## Detailed Solution
+
+
+
+The quickest way to resolve this issue is to restart the `loki-read` pod:
+
+**Using kubectl:**
+
+```bash
+# Find the loki-read pod
+kubectl get pods -n monitoring | grep loki-read
+
+# Restart the pod by deleting it (it will be recreated automatically)
+kubectl delete pod <loki-read-pod-name> -n monitoring
+
+# Verify the new pod is running
+kubectl get pods -n monitoring | grep loki-read
+```
+
+**Using SleakOps Dashboard:**
+
+1. Navigate to your cluster's workloads
+2. Find the `loki-read` deployment in the monitoring namespace
+3. Restart the deployment or delete the pod
+4. Wait for the new pod to be ready
+
+
+
+
+
+After restarting the loki-read pod:
+
+1. **Wait 2-3 minutes** for the pod to fully initialize
+2. **Access Grafana** and go to the Explore view
+3. **Select Loki** as the data source
+4. **Check if log labels** and options are now loading
+5. **Test a simple query** like `{job="your-app-name"}`
+6. 
**Verify dashboards** are showing log data again + + + + + +This issue typically occurs due to: + +- **Memory pressure** on the loki-read component +- **Connection timeouts** between Grafana and Loki +- **Index corruption** in Loki's temporary storage +- **Resource exhaustion** during high log volume periods + +The SleakOps team is working on a permanent solution to prevent this issue from recurring. + + + + + +To minimize the occurrence of this issue: + +**Monitor resource usage:** + +```bash +# Check loki-read pod resource usage +kubectl top pod -n monitoring | grep loki-read + +# Check pod logs for errors +kubectl logs -n monitoring --tail=100 +``` + +**Set up alerts:** + +- Monitor Loki query response times +- Alert on high memory usage in loki-read pods +- Set up health checks for Grafana data source connectivity + +**Best practices:** + +- Regularly review log retention policies +- Monitor log ingestion rates +- Consider log sampling for high-volume applications + + + + + +Contact SleakOps support if: + +- Restarting the loki-read pod doesn't resolve the issue +- The problem recurs frequently (more than once per week) +- You see persistent errors in loki-read pod logs +- Log data appears to be missing or corrupted +- Performance degradation affects other monitoring components + +Include the following information: + +- Timestamp when the issue started +- Any recent changes to log volume or configuration +- Screenshots of the Grafana interface showing the problem +- Output of `kubectl logs -n monitoring ` + + + +--- + +_This FAQ was automatically generated on November 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/grafana-prometheus-datasource-configuration.mdx b/docs/troubleshooting/grafana-prometheus-datasource-configuration.mdx new file mode 100644 index 000000000..41b9e36fb --- /dev/null +++ b/docs/troubleshooting/grafana-prometheus-datasource-configuration.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Grafana Prometheus 
DataSource Configuration" +description: "How to configure Prometheus datasource in Grafana within SleakOps platform" +date: "2024-12-11" +category: "dependency" +tags: ["grafana", "prometheus", "monitoring", "datasource", "thanos"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Prometheus DataSource Configuration + +**Date:** December 11, 2024 +**Category:** Dependency +**Tags:** Grafana, Prometheus, Monitoring, DataSource, Thanos + +## Problem Description + +**Context:** Users attempting to configure Prometheus as a data source in Grafana within the SleakOps platform encounter connection issues and are unaware of pre-configured data sources. + +**Observed Symptoms:** + +- Unable to connect to Prometheus from Grafana interface +- Manual Prometheus configuration attempts fail +- Users create duplicate data sources unnecessarily +- Connection timeouts when testing Prometheus data source + +**Relevant Configuration:** + +- Grafana is accessible via VPN connection +- Prometheus is installed within the Kubernetes cluster +- Default data sources are pre-configured by SleakOps +- Thanos is used as a proxy for long-term storage in S3 + +**Error Conditions:** + +- Data source test connection fails +- Prometheus endpoint not reachable from Grafana +- Users attempt manual configuration instead of using defaults + +## Detailed Solution + + + +SleakOps automatically configures a Prometheus data source in Grafana. **You don't need to create a new one manually.** + +**Steps to use the default data source:** + +1. Access Grafana through your SleakOps VPN connection +2. Go to **Configuration** → **Data Sources** +3. Look for the existing data source named **"Prometheus"** +4. 
This data source is already configured and ready to use + +**Key benefits of the default configuration:** + +- Pre-configured with correct endpoints +- Integrated with Thanos for long-term metric storage +- Works with all default SleakOps dashboards +- No manual configuration required + + + + + +The default Prometheus data source uses **Thanos** as a target instead of connecting directly to Prometheus: + +**What is Thanos?** + +- A highly available Prometheus setup with long-term storage capabilities +- Stores metrics in AWS S3 for extended retention +- Provides the same query interface as Prometheus +- Enables historical data analysis beyond Prometheus's local retention + +**Configuration details:** + +```yaml +# Default data source configuration (managed by SleakOps) +name: Prometheus +type: prometheus +url: http://thanos-query:9090 +access: proxy +isDefault: true +``` + +**Why use Thanos instead of direct Prometheus?** + +- Extended metric retention (stored in S3) +- Better performance for historical queries +- High availability setup +- Seamless integration with SleakOps monitoring stack + + + + + +If you're experiencing connection issues with Grafana: + +**1. Verify VPN connection:** + +```bash +# Check if you can reach Grafana +curl -I https://grafana.your-project.sleakops.com +``` + +**2. Check data source status:** + +- Go to **Configuration** → **Data Sources** +- Click on the **"Prometheus"** data source +- Click **"Save & Test"** to verify connectivity + +**3. Common connection problems:** + +- **VPN not connected**: Ensure you're connected to the SleakOps VPN +- **Ingress not ready**: Wait a few minutes after cluster creation +- **DNS resolution**: Clear browser cache and try again + +**4. Alternative access methods:** + +```bash +# Port-forward to access Grafana directly (if VPN issues persist) +kubectl port-forward svc/grafana 3000:80 -n monitoring +``` + + + + + +**You should only create custom data sources if:** + +1. 
**External Prometheus instances**: Connecting to Prometheus outside your cluster +2. **Different retention policies**: Need different query timeouts or intervals +3. **Custom integrations**: Specific monitoring tools or external services +4. **Development/testing**: Separate data sources for different environments + +**If creating custom data sources:** + +```yaml +# Example custom Prometheus configuration +name: prometheus-custom +type: prometheus +url: http://prometheus-server.monitoring.svc.cluster.local:80 +access: proxy +timeout: 60s +``` + +**Best practices:** + +- Use descriptive names (e.g., "prometheus-dev", "prometheus-external") +- Test connectivity before saving +- Document the purpose of custom data sources +- Avoid duplicating the default configuration + + + + + +**Default SleakOps dashboards:** + +- Are configured to use the default "Prometheus" data source +- Work automatically without modification +- Include cluster metrics, application metrics, and infrastructure monitoring + +**If using custom data sources:** + +- You may need to modify dashboard queries +- Update data source references in dashboard JSON +- Test all panels to ensure data is displayed correctly + +**Recommended approach:** + +1. Start with the default "Prometheus" data source +2. Explore available dashboards and metrics +3. Only create custom data sources when specifically needed +4. 
Keep the default data source as your primary monitoring source + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx b/docs/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx new file mode 100644 index 000000000..282a8055c --- /dev/null +++ b/docs/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx @@ -0,0 +1,205 @@ +--- +sidebar_position: 3 +title: "Gunicorn Timeout Configuration with Kubernetes Readiness Probes" +description: "How to configure Gunicorn timeouts without causing readiness probe failures in SleakOps" +date: "2025-01-15" +category: "workload" +tags: ["gunicorn", "timeout", "readiness-probe", "python", "healthcheck"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gunicorn Timeout Configuration with Kubernetes Readiness Probes + +**Date:** January 15, 2025 +**Category:** Workload +**Tags:** Gunicorn, Timeout, Readiness Probe, Python, Healthcheck + +## Problem Description + +**Context:** When configuring Gunicorn with a custom timeout setting in SleakOps, pods fail to start properly due to readiness probe failures, even though the application works fine without the timeout configuration. 
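The interaction is easier to reason about with the timing written out. A readiness probe only marks a pod NotReady after `failureThreshold` consecutive misses spaced `periodSeconds` apart, so the question is whether a Gunicorn worker killed by `--timeout` can re-fork before that window closes. A minimal sketch (the probe values and the 3-second worker restart time are hypothetical, not SleakOps defaults):

```python
# Worst-case seconds before Kubernetes marks a pod NotReady:
# the probe must miss failureThreshold consecutive checks,
# spaced periodSeconds apart.
def seconds_until_not_ready(period_seconds: int, failure_threshold: int) -> int:
    return period_seconds * failure_threshold

# With --timeout 10, Gunicorn kills any worker busy longer than 10s;
# assume (hypothetically) the worker takes ~3s to re-fork afterwards.
outage = 10 + 3

# A strict probe (one miss = NotReady) trips during the worker restart:
print(seconds_until_not_ready(10, 1) < outage)  # True -> pod flaps

# A tolerant probe rides out the restart:
print(seconds_until_not_ready(10, 5) < outage)  # False -> pod stays Ready
```

This is why raising `failureThreshold` (as shown in the solution steps) stabilizes the pod even though the underlying worker restarts still happen.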
+ +**Observed Symptoms:** + +- Pods show as "Not Ready" when Gunicorn timeout is configured +- Readiness probe failures prevent pods from receiving traffic +- Application works correctly when timeout parameter is removed +- API requests are taking longer than expected (20-30 seconds) + +**Relevant Configuration:** + +- Gunicorn command with timeout: `gunicorn --bind 0.0.0.0:8000 --timeout 10 backend:app` +- Framework: Python with Gunicorn WSGI server +- Deployment: Kubernetes pods in SleakOps platform +- Desired timeout: 5-10 seconds for API requests + +**Error Conditions:** + +- Pods fail readiness checks when `--timeout` parameter is added to Gunicorn +- Issue occurs during pod startup and health checking +- Problem persists until timeout configuration is removed + +## Detailed Solution + + + +The issue occurs because: + +1. **Gunicorn timeout** kills worker processes that take longer than specified time +2. **Kubernetes readiness probe** expects consistent responses from the health endpoint +3. When Gunicorn kills workers due to timeout, health checks may fail intermittently +4. Failed readiness probes prevent the pod from being marked as "Ready" + + + + + +In SleakOps, adjust the readiness probe configuration: + +1. Go to your **Workload** configuration +2. Edit the **Healthcheck** settings +3. Click on **Advanced Options** +4. 
Configure the following parameters: + +```yaml +readinessProbe: + httpGet: + path: /health # Your health endpoint + port: 8000 + initialDelaySeconds: 30 # Wait before first check + periodSeconds: 10 # Check every 10 seconds + timeoutSeconds: 5 # Timeout for each check + successThreshold: 1 # Consecutive successes needed + failureThreshold: 5 # Consecutive failures before marking as failed +``` + +**Key adjustments:** + +- Increase `failureThreshold` to allow more tolerance +- Increase `timeoutSeconds` to match or exceed your Gunicorn timeout +- Adjust `periodSeconds` to reduce check frequency + + + + + +For better compatibility with Kubernetes, use this Gunicorn configuration: + +```bash +# Recommended Gunicorn command +newrelic-admin run-program gunicorn \ + --bind 0.0.0.0:8000 \ + --limit-request-line 0 \ + --max-requests 3000 \ + --max-requests-jitter 200 \ + --timeout 30 \ + --keep-alive 5 \ + --worker-class sync \ + --workers 4 \ + backend:app +``` + +**Important parameters:** + +- `--timeout 30`: Set higher than your expected request time +- `--keep-alive 5`: Maintain connections for health checks +- `--workers 4`: Multiple workers for redundancy +- `--worker-class sync`: Use sync workers for predictable behavior + + + + + +Set the load balancer timeout to handle longer requests: + +1. In your **Ingress** configuration +2. Add the following annotation: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + annotations: + alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=60 + alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30 +spec: + # Your ingress configuration +``` + +**Timeout hierarchy:** + +1. Gunicorn timeout: 30 seconds (worker timeout) +2. Load balancer timeout: 60 seconds (connection timeout) +3. Readiness probe timeout: 5 seconds (health check timeout) + + + + + +To debug timeout issues: + +1. 
**Check pod logs:** + + ```bash + kubectl logs -f deployment/your-app-name + ``` + +2. **Monitor readiness probe:** + + ```bash + kubectl describe pod your-pod-name + ``` + +3. **Test health endpoint directly:** + + ```bash + kubectl port-forward pod/your-pod-name 8000:8000 + curl -v http://localhost:8000/health + ``` + +4. **Temporarily disable readiness probe:** + - Edit deployment using Lens or kubectl + - Remove `readinessProbe` section temporarily + - Test if application responds correctly + + + + + +To address the root cause of slow requests: + +1. **Identify slow endpoints:** + + - Use APM tools (New Relic, as shown in your config) + - Add logging to measure request duration + - Profile database queries + +2. **Database optimization:** + + ```python + # Add connection pooling + from sqlalchemy import create_engine + engine = create_engine( + 'postgresql://...', + pool_size=10, + max_overflow=20, + pool_timeout=30 + ) + ``` + +3. **Async processing:** + + - Move long-running tasks to background jobs + - Use Celery or similar task queue + - Return immediate response with job ID + +4. 
**Caching:** + - Implement Redis caching for frequent queries + - Use HTTP caching headers + - Cache expensive computations + + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/hpa-scaling-issues-troubleshooting.mdx b/docs/troubleshooting/hpa-scaling-issues-troubleshooting.mdx new file mode 100644 index 000000000..4e7a6c7ab --- /dev/null +++ b/docs/troubleshooting/hpa-scaling-issues-troubleshooting.mdx @@ -0,0 +1,769 @@ +--- +sidebar_position: 3 +title: "HPA Scaling Issues - Pods Not Scaling Down" +description: "Troubleshooting HPA scaling problems when pods accumulate over time" +date: "2024-12-19" +category: "cluster" +tags: ["hpa", "scaling", "kubernetes", "troubleshooting", "memory-leak", "cpu"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# HPA Scaling Issues - Pods Not Scaling Down + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** HPA, Scaling, Kubernetes, Troubleshooting, Memory Leak, CPU + +## Problem Description + +**Context:** Production environment experiencing abnormal pod scaling behavior where the Horizontal Pod Autoscaler (HPA) creates many more pods than usual and fails to scale down properly over time. 
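The accumulation described above follows directly from how the HPA computes its target: `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`. If leaked memory or stuck processes keep measured utilization at or above the target, the ratio never drops below 1 and the controller never proposes fewer pods. A small illustration of the standard formula (the utilization numbers are made up):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float) -> int:
    # Standard HPA scaling rule: ceil(current * current/target)
    return math.ceil(current * current_util / target_util)

# Healthy app: usage falls after a traffic spike, so HPA scales back down
print(desired_replicas(8, current_util=30, target_util=70))  # 4

# Leaky app: utilization never drops below the target, so the pod count
# ratchets upward and never shrinks -- the cumulative effect observed here
print(desired_replicas(8, current_util=75, target_util=70))  # 9
```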
+ +**Observed Symptoms:** + +- Significantly more pods running than normal in production +- Pods normalize temporarily after deployment but then accumulate again +- HPA appears to be scaling up but not scaling down effectively +- Problem occurs repeatedly over time, creating a cumulative scaling effect + +**Relevant Configuration:** + +- Environment: Production Kubernetes cluster +- Autoscaling: HPA enabled +- Monitoring: Lens being used to observe pod counts +- Deployment cycle: Temporary normalization after deployments + +**Error Conditions:** + +- HPA fails to scale down pods when resource usage should decrease +- Cumulative pod growth over time +- Resource consumption remains high preventing normal downscaling +- Problem persists across multiple deployment cycles + +## Detailed Solution + + + +The most likely cause is that your application is maintaining open processes or has constant CPU/memory consumption that prevents HPA from scaling down normally. Over time, this creates cumulative scaling where pods keep getting added but never removed. + +Common causes include: + +- **Memory leaks**: Application doesn't release memory properly +- **Long-running processes**: Background tasks that keep CPU/memory usage high +- **Open connections**: Database or external service connections not being closed +- **Session management**: Long-lived user sessions consuming resources +- **Inefficient code**: Bottlenecks that cause constant resource usage + + + + + +To identify the specific issue, use application performance monitoring (APM) tools: + +### Recommended Tools + +1. **Atatus** + + - Real-time application monitoring + - Memory leak detection + - Performance bottleneck identification + +2. **New Relic** + + - Comprehensive APM solution + - CPU and memory profiling + - Database query analysis + +3. 
**Blackfire.io** + - PHP profiling (if applicable) + - Performance optimization insights + - Real-time monitoring + +These tools help identify problems like memory leaks, slow processes, long sessions, or bottlenecks in real-time without needing to reproduce traffic conditions. + + + + + +When analyzing your application metrics, focus on these key questions: + +### CPU Usage Patterns + +**Question**: Does CPU consumption start high (around 70%) right after deployment? + +- **If YES**: Likely a CPU configuration issue +- **Solution**: Review and adjust CPU requests/limits in your deployment + +```yaml +resources: + requests: + cpu: "100m" # Adjust based on actual needs + memory: "128Mi" + limits: + cpu: "500m" # Set appropriate limits + memory: "512Mi" +``` + +### Memory Usage Patterns + +**Question**: Does memory consumption increase over time and never decrease? + +- **If YES**: Likely a memory leak or processes not being released +- **Solution**: Profile your application to find memory leaks + +### Timing Patterns + +**Question**: Does the problem occur at specific times? 
+ +- **If YES**: Check logs and metrics during those periods +- **Solution**: Correlate with business logic, cron jobs, or external integrations + + + + + +Check your HPA configuration to ensure it's properly set up: + +```bash +# Check current HPA status +kubectl get hpa + +# Get detailed HPA information +kubectl describe hpa + +# Check HPA events +kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler +``` + +Ensure your HPA has proper scaling policies: + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: your-app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 + - type: Resource + resource: + name: memory + target: + type: Utilization + averageUtilization: 80 + behavior: + scaleDown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodSeconds: 60 + scaleUp: + stabilizationWindowSeconds: 60 + policies: + - type: Percent + value: 50 + periodSeconds: 60 +``` + + + + + +### Step 1: Check Current Resource Usage + +```bash +# Check pod resource usage +kubectl top pods -n + +# Check node resource usage +kubectl top nodes +``` + +### Step 2: Analyze Application Logs + +```bash +# Check for memory-related errors +kubectl logs | grep -i "memory\|oom\|killed" + +# Check for connection/resource leaks +kubectl logs | grep -i "connection\|timeout\|leak" +``` + +### Step 3: Monitor HPA Behavior + +```bash +# Watch HPA scaling decisions in real-time +kubectl get hpa -w + +# Check HPA scaling events +kubectl describe hpa +``` + +### Step 4: Temporary Mitigation + +If the problem is critical, implement temporary mitigation: + +```bash +# Manually scale down pods to a reasonable number +kubectl scale deployment --replicas=3 + +# Temporarily disable HPA if necessary +kubectl delete hpa + +# Monitor resource 
usage without HPA +kubectl top pods -n --sort-by=cpu +kubectl top pods -n --sort-by=memory +``` + +### Step 5: Force Pod Restart + +```bash +# Force restart to clear any accumulated resource issues +kubectl rollout restart deployment + +# Wait for rollout to complete +kubectl rollout status deployment +``` + + + + + +### Memory Leak Detection + +**1. Monitor memory usage over time:** + +```bash +# Create a script to monitor memory usage +#!/bin/bash +while true; do + echo "$(date): $(kubectl top pods | grep your-app)" + sleep 300 # Check every 5 minutes +done > memory-usage.log +``` + +**2. Analyze memory patterns:** + +```bash +# Check for consistently increasing memory +grep "your-app" memory-usage.log | awk '{print $3}' | sed 's/Mi//' > memory-values.txt + +# Plot or analyze the trend +python3 -c " +import sys +values = [int(line.strip()) for line in open('memory-values.txt')] +print(f'Memory trend: start={values[0]}Mi, end={values[-1]}Mi, increase={values[-1]-values[0]}Mi') +if len(values) > 1: + avg_increase = (values[-1] - values[0]) / (len(values) - 1) + print(f'Average increase per measurement: {avg_increase:.2f}Mi') +" +``` + +### CPU Usage Analysis + +**1. Check CPU patterns:** + +```bash +# Monitor CPU spikes and sustained usage +kubectl top pods --sort-by=cpu | head -10 + +# Check for specific processes consuming CPU +kubectl exec -it -- top -n 1 +``` + +**2. Analyze application bottlenecks:** + +```bash +# Check for hung processes or infinite loops +kubectl exec -it -- ps aux + +# Check thread usage +kubectl exec -it -- cat /proc/loadavg +``` + +### Network Connection Analysis + +**1. Check for connection leaks:** + +```bash +# Monitor active connections +kubectl exec -it -- netstat -an | wc -l + +# Check for connections in specific states +kubectl exec -it -- netstat -an | grep TIME_WAIT | wc -l +kubectl exec -it -- netstat -an | grep ESTABLISHED | wc -l +``` + +**2. 
Database connection monitoring:** + +```bash +# For applications using databases, check connection pools +kubectl exec -it -- curl localhost:8080/health/database 2>/dev/null | jq '.connections' + +# Or check application-specific health endpoints +kubectl exec -it -- curl localhost:8080/metrics 2>/dev/null | grep connection_pool +``` + + + + + +### Memory Management + +**1. Implement proper garbage collection (language-specific):** + +```javascript +// Node.js example +setInterval(() => { + if (global.gc) { + global.gc(); + } +}, 60000); // Force GC every minute + +// Monitor memory usage +setInterval(() => { + const memUsage = process.memoryUsage(); + console.log( + `Memory: RSS=${memUsage.rss / 1024 / 1024}MB, Heap=${ + memUsage.heapUsed / 1024 / 1024 + }MB` + ); +}, 30000); +``` + +```python +# Python example +import gc +import psutil +import threading +import time + +def memory_monitor(): + while True: + gc.collect() # Force garbage collection + process = psutil.Process() + memory_mb = process.memory_info().rss / 1024 / 1024 + print(f"Memory usage: {memory_mb:.2f} MB") + time.sleep(60) + +# Start memory monitoring thread +thread = threading.Thread(target=memory_monitor, daemon=True) +thread.start() +``` + +### Connection Pool Management + +**1. 
Database connection pooling:**
+
+```javascript
+// Node.js with connection pool management
+const mysql = require("mysql2/promise");
+
+const pool = mysql.createPool({
+  host: process.env.DB_HOST,
+  user: process.env.DB_USER,
+  password: process.env.DB_PASSWORD,
+  database: process.env.DB_NAME,
+  waitForConnections: true,
+  connectionLimit: 10, // Limit concurrent connections
+  queueLimit: 0,
+  connectTimeout: 60000, // Timeout for establishing a connection
+  idleTimeout: 300000, // Close idle connections after 5 minutes
+  enableKeepAlive: true,
+});
+
+// Monitor pool status (uses mysql2 internals; for observability only)
+setInterval(() => {
+  console.log(
+    `Pool status: Active=${pool.pool._allConnections.length}, Free=${pool.pool._freeConnections.length}`
+  );
+}, 30000);
+```
+
+Note: mysql1-era options such as `acquireTimeout`, `timeout`, and `reconnect` are not supported by `mysql2` and should be omitted.
+
+### Background Task Optimization
+
+**1. Implement proper task cleanup:**
+
+```python
+# Python background task management
+import asyncio
+import signal
+
+class BackgroundTaskManager:
+    def __init__(self):
+        self.tasks = set()
+        self.running = True
+
+    def add_task(self, coro):
+        task = asyncio.create_task(coro)
+        self.tasks.add(task)
+        task.add_done_callback(self.tasks.discard)
+        return task
+
+    async def cleanup(self):
+        self.running = False
+        for task in list(self.tasks):  # copy: done-callbacks mutate the set
+            task.cancel()
+        await asyncio.gather(*self.tasks, return_exceptions=True)
+
+task_manager = BackgroundTaskManager()
+
+async def main():
+    loop = asyncio.get_running_loop()
+    # Trigger cleanup on SIGTERM/SIGINT instead of exiting abruptly
+    for sig in (signal.SIGTERM, signal.SIGINT):
+        loop.add_signal_handler(
+            sig, lambda: asyncio.create_task(task_manager.cleanup())
+        )
+    # ... schedule application work with task_manager.add_task(...)
+    while task_manager.running:
+        await asyncio.sleep(1)
+
+asyncio.run(main())
+```
+
+
+
+
+
+### Resource Requests and Limits Tuning
+
+**1. 
Right-size your containers:**
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: your-app
+spec:
+  template:
+    spec:
+      containers:
+        - name: your-app
+          image: your-app:latest
+          resources:
+            requests:
+              # Set based on actual baseline usage
+              cpu: "100m" # 0.1 CPU cores
+              memory: "256Mi" # 256 MB
+            limits:
+              # Set 2-3x higher than requests
+              cpu: "300m" # 0.3 CPU cores
+              memory: "512Mi" # 512 MB
+          env:
+            - name: NODE_OPTIONS
+              value: "--max-old-space-size=400" # For Node.js apps
+```
+
+### HPA Behavior Tuning
+
+**1. Optimize scaling behavior:**
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: your-app-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: your-app
+  minReplicas: 1
+  maxReplicas: 8
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 60 # Lower threshold for faster scaling
+    - type: Resource
+      resource:
+        name: memory
+        target:
+          type: Utilization
+          averageUtilization: 70
+  behavior:
+    scaleDown:
+      stabilizationWindowSeconds: 600 # Wait 10 minutes before scaling down
+      policies:
+        - type: Percent
+          value: 25 # Scale down 25% at a time
+          periodSeconds: 300 # Every 5 minutes
+    scaleUp:
+      stabilizationWindowSeconds: 60 # Scale up quickly
+      policies:
+        - type: Percent
+          value: 100 # Double the pods if needed
+          periodSeconds: 60
+        - type: Pods
+          value: 2 # Or add 2 pods, whichever is smaller
+          periodSeconds: 60
+      selectPolicy: Min # Apply the smaller of the two policies (default is Max)
+```
+
+### Pod Disruption Budget
+
+**1. Ensure graceful scaling:**
+
+```yaml
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: your-app-pdb
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: your-app
+```
+
+
+
+
+
+### Prometheus Metrics
+
+**1. 
Application-level metrics:** + +```javascript +// Node.js with Prometheus metrics +const promClient = require("prom-client"); + +// Memory usage metric +const memoryUsage = new promClient.Gauge({ + name: "app_memory_usage_bytes", + help: "Memory usage in bytes", +}); + +// Active connections metric +const activeConnections = new promClient.Gauge({ + name: "app_active_connections", + help: "Number of active database connections", +}); + +// Update metrics periodically +setInterval(() => { + const memUsage = process.memoryUsage(); + memoryUsage.set(memUsage.rss); + + // Update connection count if you have access to pool + if (pool && pool.pool) { + activeConnections.set(pool.pool._allConnections.length); + } +}, 15000); + +// Expose metrics endpoint (metrics() returns a Promise in prom-client v13+) +app.get("/metrics", async (req, res) => { + res.set("Content-Type", promClient.register.contentType); + res.end(await promClient.register.metrics()); +}); +``` + +### Grafana Dashboard + +**1. Create HPA monitoring dashboard:** + +```json +{ + "dashboard": { + "title": "HPA Scaling Monitoring", + "panels": [ + { + "title": "Pod Count", + "type": "graph", + "targets": [ + { + "expr": "kube_deployment_status_replicas{deployment=\"your-app\"}" + } + ] + }, + { + "title": "CPU Usage", + "type": "graph", + "targets": [ + { + "expr": "rate(container_cpu_usage_seconds_total{pod=~\"your-app-.*\"}[5m]) * 100" + } + ] + }, + { + "title": "Memory Usage", + "type": "graph", + "targets": [ + { + "expr": "container_memory_working_set_bytes{pod=~\"your-app-.*\"} / 1024 / 1024" + } + ] + } + ] + } +} +``` + +### Alerting Rules + +**1. 
Set up alerts for scaling issues:** + +```yaml +groups: + - name: hpa-scaling + rules: + - alert: HPANotScalingDown + expr: kube_deployment_status_replicas > 5 + for: 30m # `for` is its own field, not part of the PromQL expression + labels: + severity: warning + annotations: + summary: "HPA not scaling down {{ $labels.deployment }}" + description: "Deployment {{ $labels.deployment }} has been running more than 5 replicas for 30 minutes" + + - alert: MemoryLeakDetected + expr: delta(container_memory_working_set_bytes[1h]) > 100*1024*1024 # delta() for gauges; increase() is for counters + labels: + severity: critical + annotations: + summary: "Possible memory leak in {{ $labels.pod }}" + description: "Memory usage increased by more than 100MB in the last hour" +``` + + + + + +### Development Best Practices + +**1. Code review checklist:** + +- [ ] Proper connection management (close connections in finally blocks) +- [ ] Memory cleanup in long-running processes +- [ ] Timeout configurations for external calls +- [ ] Graceful shutdown handling +- [ ] Resource monitoring and health checks + +**2. Testing strategies:** + +```bash +#!/bin/bash +# Load testing script to validate scaling behavior + +echo "Starting load test to validate HPA behavior..." + +# Baseline measurement +echo "Baseline pod count:" +kubectl get pods | grep your-app | wc -l + +# Apply load (no --rm/-it: the pods run detached and are deleted below) +echo "Applying load..." +for i in {1..10}; do + kubectl run load-test-$i --image=busybox --restart=Never -- \ + wget -q --spider http://your-app-service/api/health & +done + +# Monitor scaling +echo "Monitoring scaling behavior..." +for i in {1..20}; do + echo "$(date): Pods=$(kubectl get pods | grep your-app | wc -l), CPU=$(kubectl top pods | grep your-app | awk '{sum+=$2} END {print sum "m"}')" + sleep 30 +done + +# Cleanup +echo "Cleaning up load test..." +kubectl get pods | grep load-test | awk '{print $1}' | xargs kubectl delete pod + +# Monitor scale down +echo "Monitoring scale down..." 
+for i in {1..40}; do + echo "$(date): Pods=$(kubectl get pods | grep your-app | wc -l)" + sleep 30 +done +``` + +### Operational Procedures + +**1. Regular maintenance:** + +```bash +#!/bin/bash +# Weekly HPA health check script + +echo "=== HPA Health Check - $(date) ===" + +# Check HPA status +echo "1. HPA Status:" +kubectl get hpa + +# Check recent scaling events +echo "2. Recent scaling events:" +kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler --sort-by='.lastTimestamp' | tail -10 + +# Check resource usage trends +echo "3. Current resource usage:" +kubectl top pods | grep your-app + +# Check for any stuck pods +echo "4. Pod status check:" +kubectl get pods | grep your-app | grep -v Running + +# Verify metrics server +echo "5. Metrics server status:" +kubectl top nodes > /dev/null && echo "✓ Metrics server working" || echo "✗ Metrics server issue" + +echo "=== Health check complete ===" +``` + +**2. Emergency runbook:** + +## HPA Scaling Emergency Runbook + +### Symptoms: Too many pods, not scaling down + +1. **Immediate action** (< 5 minutes): + + ```bash + # Check current status + kubectl get hpa + kubectl top pods + + # If critical, manually scale down + kubectl scale deployment your-app --replicas=3 + ``` + +2. **Investigation** (5-15 minutes): + + ```bash + # Check HPA behavior + kubectl describe hpa your-app-hpa + + # Check application metrics + kubectl logs deployment/your-app | tail -100 + + # Check resource usage + kubectl exec -it deployment/your-app -- top + ``` + +3. **Resolution** (15-30 minutes): + - Apply identified fixes + - Monitor for 30 minutes + - Re-enable HPA if disabled + +4. 
**Follow-up** (1-24 hours): + - Review APM tool insights + - Implement permanent fixes + - Update monitoring/alerting + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ +``` diff --git a/docs/troubleshooting/index.mdx b/docs/troubleshooting/index.mdx new file mode 100644 index 000000000..9d0467583 --- /dev/null +++ b/docs/troubleshooting/index.mdx @@ -0,0 +1,55 @@ +--- +title: User Issues +description: Common problems and solutions that users encounter when using the platform. +--- + +import DocCardList from "@theme/DocCardList"; + +# User Issues - Troubleshooting Guide + +This section contains solutions to the most common problems that users encounter when using the platform. Here you'll find detailed guides organized by categories to help you quickly resolve any issues. + +## 🚀 Most Common Issues + +If this is your first time experiencing a problem, these are the most frequent issues: + +- **[Cluster Connection Issues](./cluster-connection-troubleshooting)** - Diagnosing cluster connectivity problems +- **[Deployment Stuck](./deployment-stuck-state-resolution)** - Deployments not progressing +- **[Build Failures](./deployment-build-failed-production)** - Build process errors +- **[VPN Connection](./vpn-connection-disconnection-issues)** - VPN disconnections and tunnel problems +- **[Database Connections](./database-credentials-access)** - Database access issues + +## 📋 Problem Categories + +### 🏗️ Infrastructure & Clusters +Issues related to EKS clusters, nodepools, scaling, and infrastructure resources. + +### 🚀 Deployments & CI/CD +Issues with deployments, builds, CI/CD pipelines, and GitHub Actions. + +### 🗄️ Database & Storage +Problems with PostgreSQL databases, migrations, connections, and storage. + +### 🌐 Networking & DNS +Domain configuration, SSL, DNS, load balancers, and routing. 
+ +### 📊 Monitoring & Logging +Issues with Grafana, Prometheus, Loki, and monitoring tools. + +### 🔐 Security & Access +VPN issues, authentication, permissions, and access control. + +### ⚡ Performance & Resources +Memory optimization, CPU, performance, and resource management. + +## 💡 General Troubleshooting Tips + +1. **Check the logs** - Always start by reviewing your application and pod logs +2. **Verify status** - Use `kubectl get pods` to check the status of your resources +3. **Review events** - Kubernetes events often provide important clues +4. **Check metrics** - Grafana and Prometheus can show resource issues + +--- + +## All Troubleshooting Articles + +<DocCardList /> + diff --git a/docs/troubleshooting/infrastructure-architecture-diagrams.mdx b/docs/troubleshooting/infrastructure-architecture-diagrams.mdx new file mode 100644 index 000000000..3f4ef8b5a --- /dev/null +++ b/docs/troubleshooting/infrastructure-architecture-diagrams.mdx @@ -0,0 +1,252 @@ +--- +sidebar_position: 15 +title: "Infrastructure Architecture Diagrams" +description: "Understanding SleakOps infrastructure architecture and request flow" +date: "2024-01-15" +category: "general" +tags: ["architecture", "infrastructure", "diagrams", "networking", "vpc"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Infrastructure Architecture Diagrams + +**Date:** January 15, 2024 +**Category:** General +**Tags:** Architecture, Infrastructure, Diagrams, Networking, VPC + +## Problem Description + +**Context:** Users need to understand the infrastructure architecture created by SleakOps when deploying applications, including how requests flow from users to applications and the relationship between different components. 
+ +**Observed Symptoms:** + +- Lack of detailed infrastructure architecture diagrams +- Difficulty understanding request flow from browser to application +- Unclear component relationships within the VPC +- Generic diagrams that don't show real component details + +**Relevant Configuration:** + +- SleakOps deployed applications +- Load balancers and DNS configuration +- VPC and networking components +- Multiple application deployments + +**Error Conditions:** + +- Missing detailed architectural documentation +- Inadequate visual representation of infrastructure components +- Unclear boundaries between internal and external components + +## Detailed Solution + + + +SleakOps creates a comprehensive infrastructure architecture when deploying applications. Here's the typical architecture: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Internet │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ DNS (Route 53) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ app1.yourdomain.com → ALB │ │ +│ │ app2.yourdomain.com → ALB │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Application Load Balancer (ALB) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ SSL Termination │ │ +│ │ Path-based routing │ │ +│ │ Health checks │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ EKS Cluster │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ Node Group 1 │ │ Node Group 2 │ │ │ +│ │ │ ┌───────────┐ │ │ 
┌───────────┐ │ │ │ +│ │ │ │ Pod 1 │ │ │ │ Pod 3 │ │ │ │ +│ │ │ │ Pod 2 │ │ │ │ Pod 4 │ │ │ │ +│ │ │ └───────────┘ │ │ └───────────┘ │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + + + + + +Here's how a user request flows through the SleakOps infrastructure: + +**1. DNS Resolution** + +- User enters `app.yourdomain.com` in browser +- DNS (Route 53) resolves to the Application Load Balancer + +**2. Load Balancer Processing** + +- Request hits the ALB (which lives in the VPC's public subnets; the diagram above simplifies this) +- SSL termination occurs at ALB +- ALB performs health checks on targets +- Routes request based on path/host rules + +**3. VPC Entry** + +- Request passes from the ALB target groups into the private subnets +- Traffic flows to EKS cluster nodes + +**4. Kubernetes Processing** + +- Request reaches Kubernetes Service +- Service load balances to healthy Pods +- Application processes the request + +**5. Response Path** + +- Application sends response back through same path +- ALB handles SSL encryption for response +- Response reaches user's browser + +```mermaid +sequenceDiagram + participant User + participant DNS + participant ALB + participant EKS + participant App + + User->>DNS: app.domain.com + DNS->>User: ALB IP + User->>ALB: HTTPS Request + ALB->>EKS: HTTP Request (VPC) + EKS->>App: Forward to Pod + App->>EKS: Response + EKS->>ALB: Response + ALB->>User: HTTPS Response +``` + + + + + +**Public-facing components:** + +- **Route 53 DNS**: Domain name resolution (a global service, outside the VPC) +- **Application Load Balancer**: SSL termination, routing, health checks (deployed in the public subnets) +- **Internet Gateway**: Attached to the VPC to provide internet access + +**Inside VPC:** + +- **EKS Cluster**: Managed Kubernetes control plane +- **Node Groups**: EC2 instances running Kubernetes nodes +- **Pods**: Application containers +- **Services**: Kubernetes load balancing +- **Ingress Controllers**: Route external traffic to services + +**Networking Components:** + +- **Public 
Subnets**: ALB and NAT Gateways +- **Private Subnets**: EKS nodes and Pods +- **Security Groups**: Firewall rules +- **NACLs**: Subnet-level access control + +**Storage & Data:** + +- **EBS Volumes**: Persistent storage for Pods +- **RDS/Database**: If configured +- **S3 Buckets**: Object storage + + + + + +When deploying multiple applications with SleakOps, the architecture scales efficiently: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Application Load Balancer │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Host-based routing: │ │ +│ │ • app1.domain.com → Target Group 1 │ │ +│ │ • app2.domain.com → Target Group 2 │ │ +│ │ • api.domain.com → Target Group 3 │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ EKS Cluster │ │ +│ │ │ │ +│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ +│ │ │ Namespace 1 │ │ Namespace 2 │ │ Namespace 3 │ │ │ +│ │ │ │ │ │ │ │ │ │ +│ │ │ App1 Pods │ │ App2 Pods │ │ API Pods │ │ │ +│ │ │ Service │ │ Service │ │ Service │ │ │ +│ │ │ Ingress │ │ Ingress │ │ Ingress │ │ │ +│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Key Benefits:** + +- **Resource Sharing**: Multiple apps share the same EKS cluster +- **Isolation**: Each app runs in its own namespace +- **Efficient Routing**: Single ALB handles all applications +- **Cost Optimization**: Shared infrastructure reduces costs + + + + + +SleakOps implements multiple layers of security: + +**Network Security:** + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Internet │ 
+└─────────────────────┬───────────────────────────────────────────┘ + │ (HTTPS only) +┌─────────────────────┴───────────────────────────────────────────┐ +│ ALB │ +│ Security Groups: Allow 80/443 from 0.0.0.0/0 │ +└─────────────────────┬───────────────────────────────────────────┘ + │ (HTTP to VPC) +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Private Subnets │ │ +│ │ Security Groups: Allow traffic only from ALB │ │ +│ │ ┌─────────────────────────────────────────────────┐ │ │ +│ │ │ EKS Nodes │ │ │ +│ │ │ Pod Security: Network policies, RBAC │ │ │ +│ │ └─────────────────────────────────────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Security Layers:** + +1. **SSL/TLS Termination**: All external traffic encrypted +2. **Security Groups**: Firewall rules at instance level +3. **Network ACLs**: Subnet-level access control +4. **Kubernetes RBAC**: Pod-level access control +5. 
**Network Policies**: Inter-pod communication rules + + + +--- + +_This FAQ section was automatically generated on January 15, 2024, based on a real user inquiry._ diff --git a/docs/troubleshooting/ingress-multiple-domains-configuration.mdx b/docs/troubleshooting/ingress-multiple-domains-configuration.mdx new file mode 100644 index 000000000..686b26822 --- /dev/null +++ b/docs/troubleshooting/ingress-multiple-domains-configuration.mdx @@ -0,0 +1,212 @@ +--- +sidebar_position: 3 +title: "Configuring Multiple Domains in Kubernetes Ingress" +description: "Manual configuration of Kubernetes Ingress for multiple domains when SleakOps environment root configuration has issues" +date: "2024-12-23" +category: "workload" +tags: ["ingress", "kubernetes", "domains", "lens", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuring Multiple Domains in Kubernetes Ingress + +**Date:** December 23, 2024 +**Category:** Workload +**Tags:** Ingress, Kubernetes, Domains, Lens, Troubleshooting + +## Problem Description + +**Context:** When SleakOps has issues with multiple environment root configurations in production, manual Ingress configuration becomes necessary to properly route multiple domains to the same service. 
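The desired end state described above can also be checked mechanically before touching the cluster: every host routed by the Ingress should appear in its TLS section. A minimal standalone sketch (plain Python on a manifest loaded as a dict; no cluster access needed, field names follow the `networking.k8s.io/v1` Ingress schema):

```python
# Check that every spec.rules host in an Ingress manifest is covered by a
# spec.tls entry, so no domain ends up served without a certificate.

def missing_tls_hosts(ingress):
    """Return rule hosts that no spec.tls entry covers."""
    spec = ingress.get("spec", {})
    tls_hosts = {h for entry in spec.get("tls", []) for h in entry.get("hosts", [])}
    rule_hosts = [r["host"] for r in spec.get("rules", []) if "host" in r]
    return [h for h in rule_hosts if h not in tls_hosts]

# Example: www.simplee.com.mx is routed but missing from the TLS hosts
ingress = {
    "spec": {
        "tls": [{"hosts": ["web.simplee.com.mx", "simplee.com.mx"],
                 "secretName": "your-tls-secret"}],
        "rules": [{"host": h} for h in
                  ["web.simplee.com.mx", "simplee.com.mx", "www.simplee.com.mx"]],
    }
}

print(missing_tls_hosts(ingress))  # ['www.simplee.com.mx']
```

Running this against the YAML you are about to apply (e.g., via `yaml.safe_load`) catches the most common mistake in hand-edited multi-domain Ingresses: adding a rule without extending the TLS hosts list.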
+ +**Observed Symptoms:** + +- Environment domain configuration showing the wrong domain (e.g., simplee.cl instead of simplee.com.mx) +- Unable to create new services due to incorrect domain routing +- Multiple domains need to point to the same backend service +- Production environment affected by the environment root configuration bug + +**Relevant Configuration:** + +- Multiple domains: web.simplee.com.mx, simplee.com.mx, www.simplee.com.mx +- Backend service: mx-simplee-web-prod-mx-2-mx-simplee-web-svc +- Service port: 7500 +- Tool used: Lens for Kubernetes management + +**Error Conditions:** + +- SleakOps bug triggered by two environment roots in production +- Domain routing pointing to the wrong environment +- Service creation blocked due to incorrect domain configuration + +## Detailed Solution + + + +When SleakOps has environment root configuration issues, you can manually configure the Ingress resource using Lens: + +1. **Open Lens** and connect to your cluster +2. **Navigate to Network** → **Ingresses** +3. **Find the target Ingress** in the appropriate namespace +4. **Edit the Ingress resource** by clicking the edit button +5. 
**Update the configuration** with the correct rules and TLS settings + + + + + +Here's the complete configuration for multiple domains pointing to the same service: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: your-ingress-name + namespace: your-namespace +spec: + tls: + - hosts: + - web.simplee.com.mx + - simplee.com.mx + - www.simplee.com.mx + secretName: your-tls-secret + rules: + - host: web.simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 + - host: simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 + - host: www.simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 +``` + + + + + +The TLS section ensures all domains are covered by SSL certificates: + +```yaml +tls: + - hosts: + - web.simplee.com.mx + - simplee.com.mx + - www.simplee.com.mx + secretName: your-tls-secret-name +``` + +**Important considerations:** + +- All domains must be listed in the TLS hosts section +- The TLS secret must contain certificates for all listed domains +- Certificate must be valid for all subdomains and the root domain + + + + + +Before applying the Ingress configuration, verify your backend service: + +1. **Check service exists:** + + ```bash + kubectl get svc mx-simplee-web-prod-mx-2-mx-simplee-web-svc -n your-namespace + ``` + +2. **Verify service port:** + + ```bash + kubectl describe svc mx-simplee-web-prod-mx-2-mx-simplee-web-svc -n your-namespace + ``` + +3. **Test service connectivity:** + ```bash + kubectl port-forward svc/mx-simplee-web-prod-mx-2-mx-simplee-web-svc 8080:7500 -n your-namespace + ``` + + + + + +**If domains are not resolving:** + +1. 
**Check DNS configuration:** + + ```bash + nslookup simplee.com.mx + nslookup web.simplee.com.mx + nslookup www.simplee.com.mx + ``` + +2. **Verify Ingress Controller:** + + ```bash + kubectl get pods -n ingress-nginx + kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + ``` + +3. **Check Ingress status:** + ```bash + kubectl describe ingress your-ingress-name -n your-namespace + ``` + +**If SSL certificates are not working:** + +1. **Check certificate secret:** + + ```bash + kubectl get secret your-tls-secret -n your-namespace -o yaml + ``` + +2. **Verify certificate validity:** + ```bash + openssl x509 -in <(kubectl get secret your-tls-secret -n your-namespace -o jsonpath='{.data.tls\.crt}' | base64 -d) -text -noout + ``` + + + + + +This manual configuration serves as a temporary workaround while the SleakOps team fixes the environment root configuration bug: + +1. **Monitor for platform updates** that resolve the multiple environment root issue +2. **Document the manual changes** made for future reference +3. **Test all domain endpoints** after applying the configuration +4. 
**Set up monitoring** to ensure all domains remain accessible + +**Testing checklist:** + +- [ ] https://web.simplee.com.mx loads correctly +- [ ] https://simplee.com.mx loads correctly +- [ ] https://www.simplee.com.mx loads correctly +- [ ] SSL certificates are valid for all domains +- [ ] All domains route to the correct service + + + +--- + +_This FAQ was automatically generated on December 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ingress-route53-records-not-created.mdx b/docs/troubleshooting/ingress-route53-records-not-created.mdx new file mode 100644 index 000000000..4f72affbb --- /dev/null +++ b/docs/troubleshooting/ingress-route53-records-not-created.mdx @@ -0,0 +1,192 @@ +--- +sidebar_position: 3 +title: "Ingress Route53 Records Not Being Created" +description: "Solution for missing DNS records in Route53 for public webservices with ingress configuration" +date: "2024-05-23" +category: "workload" +tags: ["ingress", "route53", "dns", "webservice", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Ingress Route53 Records Not Being Created + +**Date:** May 23, 2024 +**Category:** Workload +**Tags:** Ingress, Route53, DNS, Webservice, Networking + +## Problem Description + +**Context:** Users have configured webservices with public ingress in SleakOps production environment, but the corresponding DNS records are not being automatically created in Route53. 
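As background for this class of problem, External-DNS derives the records it creates from the Ingress itself: the hosts in `spec.rules` plus any explicit `external-dns.alpha.kubernetes.io/hostname` annotation. A rough standalone sketch of that derivation (simplified model for reasoning about expectations; the real controller supports many more sources and flags):

```python
# Approximate the DNS names External-DNS would register for an Ingress:
# the comma-separated hostname annotation plus hosts in spec.rules.

HOSTNAME_ANNOTATION = "external-dns.alpha.kubernetes.io/hostname"

def expected_dns_names(ingress):
    names = []
    annotation = (ingress.get("metadata", {})
                         .get("annotations", {})
                         .get(HOSTNAME_ANNOTATION, ""))
    names += [n.strip() for n in annotation.split(",") if n.strip()]
    for rule in ingress.get("spec", {}).get("rules", []):
        host = rule.get("host")
        if host and host not in names:
            names.append(host)
    return names

# Hypothetical manifest for the organizations-api example
ingress = {
    "metadata": {"annotations": {
        HOSTNAME_ANNOTATION: "organizations-api.yourdomain.com"}},
    "spec": {"rules": [{"host": "organizations-api.yourdomain.com"}]},
}

print(expected_dns_names(ingress))  # ['organizations-api.yourdomain.com']
```

If this list is empty for your Ingress (no rule hosts, no annotation), External-DNS has nothing to register, which is one quick explanation for a missing Route53 record before digging into permissions.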
+ +**Observed Symptoms:** + +- Webservice configured with public ingress (e.g., organizations-api) +- No corresponding DNS record created in Route53 +- Service appears as "internal" instead of "public" in some cases +- Multiple services potentially affected by the same issue + +**Relevant Configuration:** + +- Environment: Production +- Service type: Webservice with public ingress +- DNS provider: AWS Route53 +- Ingress controller: Kubernetes ingress + +**Error Conditions:** + +- DNS records missing for public webservices +- Services not accessible via configured domain names +- Ingress configuration appears correct but DNS resolution fails + +## Detailed Solution + + + +First, check if your webservice ingress is properly configured: + +1. Go to your **Project** → **Workloads** → **Webservices** +2. Select the affected service (e.g., organizations-api) +3. Check the **Networking** section +4. Ensure **Ingress Type** is set to **Public** +5. Verify the **Domain** configuration is correct + +```yaml +# Expected configuration +ingress: + enabled: true + type: public + domain: organizations-api.yourdomain.com + tls: true +``` + + + + + +SleakOps uses External-DNS to automatically create Route53 records. Check its status: + +1. Access your cluster via **kubectl** +2. Check External-DNS pods status: + +```bash +kubectl get pods -n kube-system | grep external-dns +kubectl logs -n kube-system deployment/external-dns +``` + +3. Look for errors related to: + - AWS permissions + - Route53 access + - Ingress annotation processing + + + + + +Ensure your cluster has proper permissions to manage Route53 records: + +1. Check if the External-DNS service account has the correct IAM role +2. 
Required permissions include: + - `route53:ChangeResourceRecordSets` + - `route53:ListHostedZones` + - `route53:ListResourceRecordSets` + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["route53:ChangeResourceRecordSets"], + "Resource": "arn:aws:route53:::hostedzone/*" + }, + { + "Effect": "Allow", + "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"], + "Resource": "*" + } + ] +} +``` + + + + + +To manually verify and potentially fix DNS records: + +1. **Check current Route53 records:** + + - Go to AWS Console → Route53 → Hosted Zones + - Select your domain's hosted zone + - Look for missing A/CNAME records + +2. **Get the Load Balancer endpoint:** + + ```bash + kubectl get ingress -n your-namespace + kubectl describe ingress your-service-ingress -n your-namespace + ``` + +3. **Manually create the record if needed:** + - Record type: A (Alias) or CNAME + - Name: organizations-api.yourdomain.com + - Value: Load Balancer DNS name + + + + + +If your service appears as "internal" instead of "public": + +1. Go to **SleakOps Dashboard** → **Your Project** +2. Navigate to **Workloads** → **Webservices** +3. Select the service (organizations-api) +4. In **Networking** section: + + - Change **Ingress Type** from "Internal" to "Public" + - Ensure **Domain** is properly configured + - Save changes + +5. Wait 5-10 minutes for DNS propagation +6. Verify the Route53 record is created + + + + + +If the issue persists: + +1. **Check ingress annotations:** + + ```bash + kubectl get ingress your-service -o yaml + ``` + + Look for External-DNS annotations like: + + ```yaml + annotations: + external-dns.alpha.kubernetes.io/hostname: organizations-api.yourdomain.com + ``` + +2. **Restart External-DNS:** + + ```bash + kubectl rollout restart deployment/external-dns -n kube-system + ``` + +3. 
**Check for conflicting records:** + + - Ensure no manual Route53 records conflict with automatic ones + - Remove any duplicate or conflicting DNS entries + +4. **Verify hosted zone:** + - Confirm the correct hosted zone exists in Route53 + - Check if the domain delegation is properly configured + + + +--- + +_This FAQ was automatically generated on May 23, 2024 based on a real user query._ diff --git a/docs/troubleshooting/job-management-cleanup-logs.mdx b/docs/troubleshooting/job-management-cleanup-logs.mdx new file mode 100644 index 000000000..3167e7a0a --- /dev/null +++ b/docs/troubleshooting/job-management-cleanup-logs.mdx @@ -0,0 +1,500 @@ +--- +sidebar_position: 15 +title: "Job Management and Log Viewing" +description: "How to manage job executions, clean up job history, and view logs for completed jobs" +date: "2025-02-04" +category: "workload" +tags: ["jobs", "logs", "cleanup", "management", "execution"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Job Management and Log Viewing + +**Date:** February 4, 2025 +**Category:** Workload +**Tags:** Jobs, Logs, Cleanup, Management, Execution + +## Problem Description + +**Context:** Users need to manage job executions in SleakOps, including cleaning up old job runs and viewing logs for both successful and failed job executions. + +**Observed Symptoms:** + +- Job execution history accumulates over time with no cleanup option +- Logs are visible for failed jobs but not for successful ones +- Users cannot manually delete old job executions from the interface +- Job history becomes cluttered with old executions that are no longer relevant + +**Relevant Configuration:** + +- Platform: SleakOps job management interface +- Job types: All job types (scheduled, manual, etc.) 
+- Log visibility: Currently limited to failed executions + +**Error Conditions:** + +- No built-in cleanup mechanism for job history +- Inconsistent log viewing between successful and failed jobs +- Manual cleanup requires support team intervention + +## Detailed Solution + + + +Currently, SleakOps has the following limitations in job management: + +1. **No self-service cleanup**: Users cannot delete old job executions through the interface +2. **Limited log access**: Logs are automatically shown for failed jobs but not for successful ones +3. **Manual intervention required**: Cleanup requires contacting support for manual deletion + +These are known limitations that are being addressed in future platform updates. + + + + + +To clean up old job executions: + +1. **Contact Support**: Send an email to support@sleakops.com +2. **Specify jobs to delete**: Provide details about which job executions you want removed +3. **Include project context**: Mention your project name and job names +4. **Wait for confirmation**: Support team will manually remove the specified executions + +**Email template:** + +``` +Subject: Job Execution Cleanup Request + +Hi SleakOps Team, + +I would like to request cleanup of old job executions for: +- Project: [Your Project Name] +- Jobs: [Specific job names or "all old executions"] +- Time range: [Optional: executions older than X days] + +Thank you! +``` + + + + + +Currently, logs for successful jobs are not automatically displayed in the interface. To access them: + +1. **Re-run the job**: Execute the same job with identical configuration +2. **Monitor during execution**: Watch the logs while the job is running +3. 
**Contact support**: Request specific log access for completed successful jobs + +**Workaround for log retention:** + +- Consider adding logging to external systems within your job scripts +- Use job output files that persist beyond execution +- Implement custom logging mechanisms in your job code + + + + + +To re-run a job with the same configuration: + +1. **Navigate to Jobs section** in your SleakOps project +2. **Find the job** you want to re-execute +3. **Click on the job name** to view details +4. **Use the "Run Again" or "Execute" button** (if available) +5. **Verify configuration** matches your previous execution +6. **Monitor logs** during execution to capture output + +If the re-execution option is not available in the interface, you may need to: + +- Recreate the job configuration manually +- Use the same parameters and settings as the previous execution + + + + + +**For better job management:** + +1. **Regular cleanup**: Request cleanup monthly or quarterly +2. **Meaningful job names**: Use descriptive names to identify jobs easily +3. **External logging**: Implement logging to external systems for long-term retention +4. **Documentation**: Keep track of important job configurations externally + +**For log management:** + +1. **Capture during execution**: Monitor jobs while they run to see output +2. **Export important logs**: Save critical log information externally +3. **Use job outputs**: Design jobs to write important results to files +4. **Implement notifications**: Set up alerts for job completion status + + + + + +SleakOps is working on implementing the following job management features: + +1. **Self-service cleanup**: Users will be able to delete old job executions +2. **Enhanced log viewing**: Logs will be accessible for both successful and failed jobs +3. **Job history management**: Better filtering and organization of job executions +4. 
**Log retention policies**: Configurable log retention settings + +These improvements are planned for future releases to enhance the user experience. + + + + + +Common issues and solutions when managing jobs: + +1. **Jobs stuck in pending state**: + +```bash +# Check job status and events +kubectl get jobs -n your-namespace +kubectl describe job <job-name> -n your-namespace + +# Look for resource constraints +kubectl get events -n your-namespace | grep <job-name> +``` + +2. **Jobs failing to start**: + +- Verify Docker image availability and accessibility +- Check resource quotas and limits +- Validate environment variables and secrets +- Ensure proper RBAC permissions + +3. **Jobs running indefinitely**: + +```yaml +# Set appropriate timeout in job configuration +spec: + activeDeadlineSeconds: 3600 # 1 hour timeout + backoffLimit: 3 # Maximum retries +``` + +4. **Job logs not showing**: + +```bash +# Access logs directly via kubectl +kubectl logs job/<job-name> -n your-namespace +kubectl logs <pod-name> -n your-namespace +``` + + + + + +Configure jobs for better management and monitoring: + +1. **Job templates with proper labeling**: + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: data-processing-job + labels: + app: data-processor + version: "1.0" + environment: production +spec: + template: + metadata: + labels: + app: data-processor + job-type: scheduled + spec: + restartPolicy: OnFailure + containers: + - name: processor + image: your-app:latest + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + env: + - name: LOG_LEVEL + value: "INFO" + - name: OUTPUT_FORMAT + value: "json" +``` + +2. **Job monitoring configuration**: + +```yaml +# Add monitoring annotations +metadata: + annotations: + prometheus.io/scrape: "true" + prometheus.io/port: "9090" + prometheus.io/path: "/metrics" +``` + +3. 
**Persistent job outputs**: + +```yaml +# Mount persistent volume for job outputs +spec: + template: + spec: + volumes: + - name: job-output + persistentVolumeClaim: + claimName: job-storage-pvc + containers: + - name: job-container + volumeMounts: + - name: job-output + mountPath: /output +``` + + + + + +Set up robust logging for job executions: + +1. **Structured logging within jobs**: + +```python +# Python example for job logging +import logging +import json +from datetime import datetime + +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' +) + +logger = logging.getLogger(__name__) + +def log_job_event(event_type, message, metadata=None): + log_entry = { + "timestamp": datetime.utcnow().isoformat(), + "event_type": event_type, + "message": message, + "metadata": metadata or {} + } + logger.info(json.dumps(log_entry)) + +# Usage in job +log_job_event("job_start", "Data processing job started") +log_job_event("processing", "Processing 1000 records", {"record_count": 1000}) +log_job_event("job_complete", "Job completed successfully", {"duration": "120s"}) +``` + +2. **External log shipping**: + +```yaml +# Fluentd sidecar for log shipping +spec: + template: + spec: + containers: + - name: main-job + image: your-job-image + volumeMounts: + - name: log-volume + mountPath: /var/log + - name: log-shipper + image: fluentd:latest + volumeMounts: + - name: log-volume + mountPath: /var/log + env: + - name: FLUENTD_CONF + value: fluent.conf +``` + +3. 
**Database logging**:

```python
# Example: Log job status to database
import psycopg2
from datetime import datetime

def log_to_database(job_name, status, details):
    # Credentials are inlined for brevity; in practice read them from
    # environment variables or a mounted Kubernetes Secret
    conn = psycopg2.connect(
        host="your-db-host",
        database="job_logs",
        user="user",
        password="password"
    )
    try:
        with conn.cursor() as cursor:
            cursor.execute("""
                INSERT INTO job_executions (job_name, status, details, timestamp)
                VALUES (%s, %s, %s, %s)
            """, (job_name, status, details, datetime.utcnow()))
        conn.commit()
    finally:
        conn.close()
```




Implement automated job management:

1. **CronJob configuration**:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-job
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: cleanup-image:latest
              command:
                - /bin/sh
                - -c
                - |
                  echo "Starting cleanup at $(date)"
                  # Cleanup logic here
                  echo "Cleanup completed at $(date)"
          restartPolicy: OnFailure
```

2. **Job dependency management**:

```yaml
# Example of a job that waits for another job.
# Note: busybox does not ship kubectl, so use an image that does,
# and run the pod with a ServiceAccount allowed to get Jobs.
apiVersion: batch/v1
kind: Job
metadata:
  name: dependent-job
spec:
  template:
    spec:
      serviceAccountName: job-reader # example name; needs RBAC to "get" jobs
      initContainers:
        - name: wait-for-prerequisite
          image: bitnami/kubectl:latest
          command:
            - /bin/sh
            - -c
            - |
              while ! kubectl get job prerequisite-job -o jsonpath='{.status.succeeded}' | grep -q 1; do
                echo "Waiting for prerequisite job to complete..."
                sleep 30
              done
      containers:
        - name: main-task
          image: your-task-image
```

3. 
**Job cleanup automation**: + +```bash +#!/bin/bash +# Script to cleanup old jobs automatically + +NAMESPACE="your-namespace" +RETENTION_DAYS=7 + +# Delete jobs older than retention period +kubectl get jobs -n $NAMESPACE -o json | \ +jq -r ".items[] | select(.metadata.creationTimestamp | fromdateiso8601 < (now - ($RETENTION_DAYS * 24 * 3600))) | .metadata.name" | \ +while read job; do + echo "Deleting old job: $job" + kubectl delete job $job -n $NAMESPACE +done +``` + + + + + +Monitor job performance and set up alerts: + +1. **Prometheus metrics for jobs**: + +```yaml +# ServiceMonitor for job metrics +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: job-metrics +spec: + selector: + matchLabels: + app: job-monitoring + endpoints: + - port: metrics +``` + +2. **Alerting rules**: + +```yaml +groups: + - name: job-alerts + rules: + - alert: JobFailed + expr: kube_job_status_failed > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Job {{ $labels.job_name }} has failed" + description: "Job {{ $labels.job_name }} in namespace {{ $labels.namespace }} has been failing for more than 5 minutes" + + - alert: JobRunningTooLong + expr: time() - kube_job_status_start_time > 3600 + for: 0m + labels: + severity: warning + annotations: + summary: "Job {{ $labels.job_name }} running too long" + description: "Job {{ $labels.job_name }} has been running for more than 1 hour" +``` + +3. 
**Dashboard configuration**:

```json
{
  "dashboard": {
    "title": "Job Monitoring Dashboard",
    "panels": [
      {
        "title": "Job Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(kube_job_status_succeeded) / (sum(kube_job_status_succeeded) + sum(kube_job_status_failed)) * 100"
          }
        ]
      },
      {
        "title": "Job Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "kube_job_status_completion_time - kube_job_status_start_time"
          }
        ]
      }
    ]
  }
}
```



---

_This FAQ was automatically generated on February 4, 2025 based on a real user query._
diff --git a/docs/troubleshooting/job-reexecution-kubernetes.mdx b/docs/troubleshooting/job-reexecution-kubernetes.mdx
new file mode 100644
index 000000000..a9d60fcac
--- /dev/null
+++ b/docs/troubleshooting/job-reexecution-kubernetes.mdx
@@ -0,0 +1,191 @@
+---
+sidebar_position: 3
+title: "Job Re-execution in Kubernetes"
+description: "How to re-run Kubernetes Jobs with the same parameters without creating new ones"
+date: "2024-01-15"
+category: "workload"
+tags: ["kubernetes", "jobs", "reexecution", "batch"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Job Re-execution in Kubernetes
+
+**Date:** January 15, 2024
+**Category:** Workload
+**Tags:** Kubernetes, Jobs, Re-execution, Batch
+
+## Problem Description
+
+**Context:** User needs to re-execute a Kubernetes Job with the same parameters after it has completed, but the platform doesn't provide a direct "re-run" option, forcing the creation of new Jobs each time.
+ +**Observed Symptoms:** + +- No "re-execute" or "re-run" button available for completed Jobs +- Need to create a new Job each time with the same parameters +- Manual process required to duplicate Job configuration +- Loss of execution history and relationship between Job runs + +**Relevant Configuration:** + +- Job type: Kubernetes Batch Job +- Job status: Completed +- Platform: SleakOps Kubernetes environment +- Desired behavior: Re-run existing Job with same parameters + +**Error Conditions:** + +- Occurs after Job completion +- No built-in re-execution mechanism available +- Manual recreation required for each execution + +## Detailed Solution + + + +Kubernetes Jobs are designed to run once and complete. Once a Job finishes successfully, it cannot be "restarted" in the traditional sense. This is by design in Kubernetes: + +- Jobs are immutable once created +- Completed Jobs remain for history and log access +- Re-execution requires creating a new Job object + +This behavior ensures consistency and prevents accidental re-runs of critical batch processes. + + + + + +In SleakOps, you can create reusable Job templates to simplify re-execution: + +1. **Save Job as Template**: + + - Go to your completed Job + - Click **"Save as Template"** + - Give it a descriptive name + +2. **Create from Template**: + + - Go to **Jobs** → **Create New** + - Select **"From Template"** + - Choose your saved template + - Modify parameters if needed + +3. 
**Template Benefits**:
   - Preserves all configuration
   - Allows parameter modification
   - Maintains execution history
   - Faster than manual recreation




If you need to manually recreate a Job using kubectl:

```bash
# Get the original Job configuration
kubectl get job my-job -o yaml > job-backup.yaml

# Edit the file to remove status and metadata.uid
# Change metadata.name to avoid conflicts
sed -i 's/name: my-job/name: my-job-rerun/' job-backup.yaml

# Remove status section and other runtime fields
kubectl create -f job-backup.yaml
```

**Important**: Remove these fields from the YAML:

- `metadata.uid`
- `metadata.resourceVersion`
- `metadata.creationTimestamp`
- `status` section
- `spec.selector` (the auto-generated `controller-uid` selector)
- the auto-generated `controller-uid` and `job-name` labels under `spec.template.metadata.labels`

Leaving the auto-generated selector and labels in place typically makes `kubectl create` fail validation, because a new Job cannot reuse the old Job's unique `controller-uid`.




If you need to run the same Job multiple times, consider using a CronJob:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-repeated-job
spec:
  schedule: "@daily" # or manual trigger
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: my-container
              image: my-image
              command: ["my-command"]
          restartPolicy: OnFailure
```

**Benefits of CronJob**:

- Can be triggered manually (`kubectl create job --from=cronjob/my-repeated-job manual-run-1`)
- Maintains job history
- Supports scheduled execution
- Better for repeated tasks




For optimal Job re-execution in SleakOps:

1. **During Job Creation**:

   - Use descriptive names with version/date
   - Document parameters in Job description
   - Save as template immediately after creation

2. **For Re-execution**:

   - Use **"Create from Template"** option
   - Update name with new timestamp/version
   - Verify parameters before execution
   - Keep original Job for reference

3. 
**Best Practices**: + - Name Jobs with timestamps: `data-processing-2024-01-15` + - Use labels to group related Job executions + - Document parameter changes between runs + - Clean up old Jobs periodically + + + + + +Common issues when recreating Jobs: + +**Name Conflicts**: + +```bash +# Error: Job already exists +# Solution: Use different name +metadata: + name: my-job-v2 # or my-job-20240115 +``` + +**Resource Conflicts**: + +- Ensure previous Job's pods are cleaned up +- Check for persistent volume claims +- Verify service account permissions + +**Parameter Issues**: + +- Double-check environment variables +- Verify secret and configmap references +- Ensure image versions are correct + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/jobs-timeout-display-issue.mdx b/docs/troubleshooting/jobs-timeout-display-issue.mdx new file mode 100644 index 000000000..54494906d --- /dev/null +++ b/docs/troubleshooting/jobs-timeout-display-issue.mdx @@ -0,0 +1,188 @@ +--- +sidebar_position: 3 +title: "Jobs Display Timeout Error Despite Successful Execution" +description: "Solution for jobs showing timeout error in UI while actually running successfully" +date: "2025-02-10" +category: "workload" +tags: ["jobs", "timeout", "ui", "display", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Jobs Display Timeout Error Despite Successful Execution + +**Date:** February 10, 2025 +**Category:** Workload +**Tags:** Jobs, Timeout, UI, Display, Troubleshooting + +## Problem Description + +**Context:** Long-running jobs in SleakOps that take several hours to complete are displaying timeout errors in the UI, even though they are executing successfully in the background. 
+ +**Observed Symptoms:** + +- Jobs show "Time Out" error status in the SleakOps dashboard +- Jobs are actually running and completing successfully +- The timeout error appears to be a display issue, not an actual execution failure +- Long-running jobs (several hours) are more likely to experience this issue + +**Relevant Configuration:** + +- Job type: Long-running batch jobs +- Execution time: Several hours +- Platform: SleakOps job management system +- Status display: Shows timeout error incorrectly + +**Error Conditions:** + +- Error appears in UI after extended execution time +- Jobs continue running despite timeout display +- Issue affects job status visibility and monitoring +- Problem occurs with jobs that exceed certain time thresholds + +## Detailed Solution + + + +This is a known UI display issue where: + +1. **Jobs continue executing normally** in the background +2. **The UI timeout is cosmetic** - it doesn't affect actual job execution +3. **Job completion status** may not update correctly in real-time +4. **The underlying job infrastructure** continues working as expected + +This is a platform-level issue that requires a fix from the SleakOps development team. + + + + + +To confirm your job is executing properly despite the timeout display: + +1. **Check job logs directly:** + + ```bash + # Access job logs through kubectl if available + kubectl logs -f job/your-job-name + ``` + +2. **Monitor resource usage:** + + - Check CPU/memory usage in the cluster + - Verify if your job pods are still active + +3. **Check job output:** + + - Monitor any output files or databases your job writes to + - Verify intermediate results are being generated + +4. **Use kubectl to check job status:** + ```bash + kubectl get jobs + kubectl describe job your-job-name + ``` + + + + + +While waiting for the platform fix, you can: + +1. 
**Set up external monitoring:** + + ```yaml + # Add health check endpoints to your job + apiVersion: batch/v1 + kind: Job + metadata: + name: long-running-job + spec: + template: + spec: + containers: + - name: worker + image: your-image + # Add periodic status updates + command: ["/bin/sh"] + args: + [ + "-c", + "your-job-command && echo 'Job completed successfully' > /tmp/status", + ] + ``` + +2. **Implement progress logging:** + + - Add regular progress updates to your job code + - Use structured logging to track job phases + - Consider using external status endpoints + +3. **Use job completion notifications:** + - Configure webhooks or notifications for job completion + - Set up alerts for actual job failures vs. UI timeouts + + + + + +To minimize issues with long-running jobs: + +1. **Implement checkpointing:** + + ```python + # Example: Save progress periodically + def save_checkpoint(progress_data): + with open('/tmp/checkpoint.json', 'w') as f: + json.dump(progress_data, f) + + def load_checkpoint(): + try: + with open('/tmp/checkpoint.json', 'r') as f: + return json.load(f) + except FileNotFoundError: + return None + ``` + +2. **Break down large jobs:** + + - Consider splitting very long jobs into smaller chunks + - Use job dependencies to chain smaller jobs together + - Implement proper error handling and retry logic + +3. **Add health checks:** + ```yaml + # Add liveness and readiness probes + livenessProbe: + exec: + command: + - /bin/sh + - -c + - "test -f /tmp/job-alive" + initialDelaySeconds: 30 + periodSeconds: 60 + ``` + + + + + +The SleakOps development team is aware of this issue and is working on a fix. The timeline includes: + +1. **Short-term (few days):** Platform fix deployment +2. **Medium-term:** Enhanced job monitoring and status display +3. 
**Long-term:** Improved handling of long-running workloads + +**What the fix will address:** + +- Correct timeout handling for long-running jobs +- Improved UI status updates +- Better real-time job monitoring +- Enhanced job lifecycle management + +**No action required from users** - the fix will be applied automatically to the platform. + + + +--- + +_This FAQ was automatically generated on February 10, 2025 based on a real user query._ diff --git a/docs/troubleshooting/karpenter-pod-scheduling-warnings.mdx b/docs/troubleshooting/karpenter-pod-scheduling-warnings.mdx new file mode 100644 index 000000000..7d5ef1972 --- /dev/null +++ b/docs/troubleshooting/karpenter-pod-scheduling-warnings.mdx @@ -0,0 +1,238 @@ +--- +sidebar_position: 3 +title: "Karpenter Pod Scheduling Warnings and Node Provisioning" +description: "Understanding Karpenter warnings when pods cannot be scheduled and how node provisioning works" +date: "2024-04-17" +category: "cluster" +tags: + [ + "karpenter", + "pod-scheduling", + "node-provisioning", + "warnings", + "troubleshooting", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Karpenter Pod Scheduling Warnings and Node Provisioning + +**Date:** April 17, 2024 +**Category:** Cluster +**Tags:** Karpenter, Pod Scheduling, Node Provisioning, Warnings, Troubleshooting + +## Problem Description + +**Context:** Users experience pod scheduling warnings in production environments when using Karpenter for node autoscaling, potentially causing 502 errors and service disruptions. 
+ +**Observed Symptoms:** + +- Pods show "running" status but display scheduling warnings +- Intermittent 502 errors affecting end users +- Warning messages appear when no nodes are available for pod placement +- Service disruptions during node provisioning periods + +**Relevant Configuration:** + +- Environment: Production +- Autoscaler: Karpenter +- Pod status: Running with warnings +- Node provisioning: Automatic + +**Error Conditions:** + +- Warnings appear when cluster lacks available nodes for new pods +- Occurs during peak traffic or scaling events +- May correlate with user-facing 502 errors +- Temporary condition during node provisioning process + +## Detailed Solution + + + +When you see pod scheduling warnings with Karpenter, this is typically normal behavior: + +1. **Warning trigger**: The alert appears when no existing nodes have sufficient resources for new pods +2. **Automatic response**: Karpenter detects this condition and begins provisioning new nodes +3. **Temporary state**: This is a transitional period, not a permanent error +4. **Resolution time**: The process usually takes 2-3 minutes to complete + +The warning indicates Karpenter is working correctly, not that there's a problem. + + + + + +Karpenter's node provisioning follows this sequence: + +1. **Detection**: Karpenter identifies unschedulable pods +2. **Instance selection**: Chooses appropriate EC2 instance type based on requirements +3. **Instance launch**: Requests new EC2 instance from AWS +4. **Node initialization**: Instance boots and installs necessary components +5. **Cluster registration**: New node joins the Kubernetes cluster +6. **Pod scheduling**: Pending pods are scheduled to the new node + +**Timeline**: This entire process typically takes 2-3 minutes. + + + + + +When you see scheduling warnings, you can monitor the node provisioning process: + +**In SleakOps Dashboard:** + +1. Navigate to **Cluster** → **Nodes** +2. Look for nodes with status "Creating" or "Pending" +3. 
Monitor the node list for new additions + +**Using kubectl:** + +```bash +# Watch nodes being created +kubectl get nodes -w + +# Check Karpenter logs +kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter + +# View pending pods +kubectl get pods --field-selector=status.phase=Pending +``` + +**Expected behavior:** You should see new nodes appearing within 2-3 minutes of the warning. + + + + + +The correlation between pod scheduling warnings and 502 errors can occur when: + +1. **Resource exhaustion**: Existing pods consume all available resources +2. **Scaling delay**: New requests arrive during the 2-3 minute provisioning window +3. **Load balancer behavior**: Some requests may fail while new capacity comes online + +**Mitigation strategies:** + +- Configure appropriate resource requests and limits +- Implement proper health checks and readiness probes +- Consider pre-scaling for predictable traffic patterns +- Use Horizontal Pod Autoscaler (HPA) alongside Karpenter + + + + + +To minimize scheduling warnings and improve response times: + +**1. Configure Karpenter NodePool properly:** + +```yaml +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: default +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.sh/capacity-type + operator: In + values: ["spot", "on-demand"] + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: default + disruption: + consolidationPolicy: WhenEmpty + consolidateAfter: 30s +``` + +**2. Set appropriate resource requests:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + containers: + - name: app + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "256Mi" + cpu: "200m" +``` + +**3. 
Use Horizontal Pod Autoscaler:** + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + + + +If scheduling warnings persist or 502 errors continue: + +**1. Check Karpenter configuration:** + +```bash +# Verify Karpenter is running +kubectl get pods -n karpenter + +# Check Karpenter events +kubectl get events -n karpenter --sort-by='.lastTimestamp' +``` + +**2. Verify AWS permissions:** + +- Ensure Karpenter has proper IAM permissions +- Check EC2 service quotas and limits +- Verify subnet and security group configurations + +**3. Monitor resource utilization:** + +```bash +# Check node resource usage +kubectl top nodes + +# Check pod resource usage +kubectl top pods --all-namespaces +``` + +**4. Review application configuration:** + +- Verify resource requests match actual usage +- Check readiness and liveness probes +- Ensure proper graceful shutdown handling + + + +--- + +_This FAQ was automatically generated on April 17, 2024 based on a real user query._ diff --git a/docs/troubleshooting/keda-installation-guide.mdx b/docs/troubleshooting/keda-installation-guide.mdx new file mode 100644 index 000000000..c3c0a6171 --- /dev/null +++ b/docs/troubleshooting/keda-installation-guide.mdx @@ -0,0 +1,308 @@ +--- +sidebar_position: 15 +title: "Installing KEDA in SleakOps Kubernetes Clusters" +description: "Complete guide to install and configure KEDA for autoscaling workloads in SleakOps clusters" +date: "2025-02-13" +category: "cluster" +tags: ["keda", "autoscaling", "kubernetes", "workload", "scheduler"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Installing KEDA in SleakOps Kubernetes Clusters + +**Date:** February 13, 2025 +**Category:** 
Cluster +**Tags:** KEDA, Autoscaling, Kubernetes, Workload, Scheduler + +## Problem Description + +**Context:** Users need to implement advanced autoscaling capabilities in their SleakOps Kubernetes clusters, including the ability to scale workloads to zero and schedule workload scaling based on time or events. KEDA (Kubernetes Event-driven Autoscaling) provides these capabilities but is not natively supported in SleakOps yet. + +**Observed Symptoms:** + +- Need to manually start/stop workloads to save resources +- Requirement for scheduled scaling (e.g., API services that only run during business hours) +- Lack of event-driven autoscaling capabilities +- Need for cost optimization through workload scheduling + +**Relevant Configuration:** + +- Platform: SleakOps Kubernetes clusters +- KEDA version: Latest stable (2.11+) +- Kubernetes version: Compatible with SleakOps clusters +- Workload types: Web Services and Workers + +**Error Conditions:** + +- Manual resource management is time-consuming +- Resources running 24/7 when only needed during specific hours +- No native scheduling capabilities in SleakOps for workload scaling + +## Detailed Solution + + + +KEDA (Kubernetes Event-driven Autoscaling) is a single-purpose and lightweight component that can be added to any Kubernetes cluster. It provides: + +- **Scale to Zero**: Scale deployments down to zero replicas when not needed +- **Event-driven Scaling**: Scale based on various metrics and events +- **Cron-based Scaling**: Schedule scaling operations based on time +- **Multiple Scalers**: Support for various data sources (queues, databases, HTTP, etc.) 
+ +This is particularly useful for: + +- Cost optimization by stopping unused services +- Scheduled workloads (batch jobs, APIs with specific usage patterns) +- Event-driven microservices + + + + + +The recommended way to install KEDA in your SleakOps cluster is using Helm: + +```bash +# Add the KEDA Helm repository +helm repo add kedacore https://kedacore.github.io/charts +helm repo update + +# Install KEDA in the keda namespace +helm install keda kedacore/keda --namespace keda --create-namespace + +# Verify the installation +kubectl get pods -n keda +``` + +Expected output: + +``` +NAME READY STATUS RESTARTS AGE +keda-admission-webhooks-xxx 1/1 Running 0 2m +keda-operator-xxx 1/1 Running 0 2m +keda-operator-metrics-apiserver-xxx 1/1 Running 0 2m +``` + + + + + +Alternatively, you can install KEDA using kubectl: + +```bash +# Install KEDA +kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.11.2/keda-2.11.2.yaml + +# Verify the installation +kubectl get pods -n keda +``` + +**Note**: Replace `v2.11.2` with the latest version available. 
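The Cron trigger configured in the next step toggles a workload between `minReplicaCount` and `desiredReplicas` based on a start/end window. The window logic can be sanity-checked in plain Python before applying the manifest (a hypothetical helper for reasoning about the schedule, not part of the KEDA API; it assumes the Mon-Fri 08:00-18:00 window used in this guide and a clock already in the trigger's timezone):

```python
# Hypothetical helper mirroring a KEDA cron trigger's active window
# (Mon-Fri, 08:00-18:00). Not part of the KEDA API; `now` is assumed
# to already be in the trigger's configured timezone.
from datetime import datetime

def desired_replicas(now, active=2, minimum=0):
    """Replica count a Mon-Fri 08:00-18:00 cron trigger would request."""
    in_window = now.weekday() < 5 and 8 <= now.hour < 18
    return active if in_window else minimum

print(desired_replicas(datetime(2025, 2, 13, 10, 0)))  # Thursday 10:00 -> 2
print(desired_replicas(datetime(2025, 2, 15, 10, 0)))  # Saturday 10:00 -> 0
```

Outside the window the trigger falls back to `minReplicaCount` (zero here), which is what enables scale-to-zero on a schedule.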
+ + + + + +To implement scheduled scaling (start/stop workloads at specific times), use the Cron scaler: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: api-service-scheduler + namespace: your-namespace +spec: + scaleTargetRef: + name: your-api-deployment # Replace with your deployment name + minReplicaCount: 0 # Scale to zero when not needed + maxReplicaCount: 3 # Maximum replicas during active hours + triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires # Adjust to your timezone + start: "0 8 * * 1-5" # Start at 8 AM, Monday to Friday + end: "0 18 * * 1-5" # Stop at 6 PM, Monday to Friday + desiredReplicas: "2" # Number of replicas during active hours +``` + +Apply the configuration: + +```bash +kubectl apply -f scaled-object.yaml +``` + + + + + +For manual control over workload scaling, you can pause and resume ScaledObjects: + +```bash +# Pause scaling (keeps current replica count) +kubectl annotate scaledobject api-service-scheduler autoscaling.keda.sh/paused=true + +# Resume scaling +kubectl annotate scaledobject api-service-scheduler autoscaling.keda.sh/paused- + +# Scale to zero manually +kubectl patch scaledobject api-service-scheduler --type merge -p '{"spec":{"minReplicaCount":0}}' + +# Check ScaledObject status +kubectl get scaledobject -n your-namespace +kubectl describe scaledobject api-service-scheduler +``` + +You can also create simple ScaledObjects that allow manual scaling to zero: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: manual-scaler + namespace: your-namespace +spec: + scaleTargetRef: + name: your-deployment + minReplicaCount: 0 + maxReplicaCount: 10 + # No triggers means manual scaling only +``` + + + + + +### Monitoring KEDA Operations + +```bash +# Check KEDA operator logs +kubectl logs -n keda -l app.kubernetes.io/name=keda-operator + +# Check metrics server logs +kubectl logs -n keda -l app.kubernetes.io/name=keda-operator-metrics-apiserver + 

# View ScaledObject events
kubectl describe scaledobject your-scaledobject-name -n your-namespace

# Monitor HPA created by KEDA
kubectl get hpa -n your-namespace
```

### Common Issues and Solutions

1. **ScaledObject not creating HPA:**

   - Verify KEDA is running: `kubectl get pods -n keda`
   - Check ScaledObject syntax and indentation
   - Ensure target deployment exists

2. **Cron scaling not working:**

   - Verify timezone configuration
   - Check cron expression syntax
   - Ensure KEDA operator has correct time

3. **Scaling not happening:**
   - Check if the deployment has sufficient resources
   - Verify cluster has enough capacity
   - Review ScaledObject and HPA status

### Best Practices

1. **Resource Management:**

   - Set appropriate resource requests/limits on deployments
   - Monitor resource usage after implementing KEDA

2. **Scheduling:**

   - Use meaningful names for ScaledObjects
   - Document your scaling schedules
   - Test scaling behavior in development first

3. **Monitoring:**
   - Set up alerts for scaling events
   - Monitor application startup times
   - Track resource cost savings




### HTTP-based Scaling

Scale based on HTTP traffic:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: http-scaler
spec:
  scaleTargetRef:
    name: web-service
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        threshold: "10"
        query: rate(http_requests_total[1m])
```

### Queue-based Scaling

Scale based on message queue length:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-scaler
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
        queueLength: "5"
        awsRegion: "us-east-1"
```

### Multiple Triggers

Combine multiple scaling
triggers: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: multi-trigger-scaler +spec: + scaleTargetRef: + name: api-service + minReplicaCount: 1 + maxReplicaCount: 20 + triggers: + - type: cron + metadata: + timezone: UTC + start: "0 8 * * *" + end: "0 20 * * *" + desiredReplicas: "3" + - type: cpu + metadata: + type: Utilization + value: "70" +``` + + + +--- + +_This FAQ was automatically generated on February 13, 2025 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx b/docs/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx new file mode 100644 index 000000000..b879a5a15 --- /dev/null +++ b/docs/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 3 +title: "Kubernetes CronJob Duplicate Execution Issue" +description: "Solution for CronJobs running multiple times due to failed pods and retry configuration" +date: "2025-01-03" +category: "workload" +tags: ["cronjob", "kubernetes", "retry", "backofflimit", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes CronJob Duplicate Execution Issue + +**Date:** January 3, 2025 +**Category:** Workload +**Tags:** CronJob, Kubernetes, Retry, BackOffLimit, Troubleshooting + +## Problem Description + +**Context:** CronJobs in Kubernetes cluster are executing multiple times instead of running once at their scheduled time. This issue appears to affect specific CronJobs that have been manually executed through Lens or have experienced failures. 

**Observed Symptoms:**

- CronJobs execute twice instead of once
- Two pods are created for each CronJob execution
- The first pod appears to fail, triggering a second execution
- No visible errors in the logs of the failed pods
- Issue persists across deployments and releases
- Problem affects specific CronJobs that were previously executed manually

**Relevant Configuration:**

- Jobs created by these CronJobs run with `backoffLimit: 1` (note: the upstream Kubernetes default is 6)
- CronJobs affected after manual execution via Lens
- Jobs show failed status despite no apparent errors

**Error Conditions:**

- Occurs consistently for affected CronJobs
- Happens regardless of the actual job success/failure
- Persists after new releases are deployed
- Affects production scheduled jobs

## Detailed Solution



The duplicate execution issue is caused by Kubernetes' retry mechanism for failed Jobs:

1. **BackOffLimit**: The affected CronJobs create Jobs with a `backoffLimit` of 1, meaning Kubernetes will retry once if they fail (if the field were left unset, the Kubernetes default of 6 would allow even more retries)
2. **Job Failure Detection**: Even if your application code runs successfully, the Job might be marked as failed due to:

   - Exit code issues
   - Resource constraints
   - Timeout configurations
   - Signal handling problems

3. **Retry Behavior**: When a Job fails, Kubernetes automatically creates a new pod to retry the execution



To immediately stop the duplicate executions, set the `backoffLimit` to 0:

**Using Lens (GUI method):**

1. Open Lens and connect to your cluster
2. Navigate to **Workloads** → **CronJobs** in the sidebar
3. Find your affected CronJob and click on it
4. Click **Edit** or go to the YAML view
5. Locate the `backoffLimit` field (usually set to 1)
6. Change it to `0`:

```yaml
spec:
  jobTemplate:
    spec:
      backoffLimit: 0 # Changed from 1 to 0
      template:
        spec:
          # ... rest of your job spec
```

7. 
Save the changes + + + + + +While disabling retries fixes the symptom, the proper solution is to ensure your jobs complete successfully: + +**1. Check your application exit codes:** + +```bash +# In your job container, ensure proper exit +exit 0 # Success +# or +exit 1 # Failure (will trigger retry if backoffLimit > 0) +``` + +**2. Review job logs for hidden errors:** + +```bash +kubectl logs --previous +``` + +**3. Common issues that cause "silent" failures:** + +- Database connection timeouts +- Missing environment variables +- Permission issues +- Memory/CPU limits exceeded +- Improper signal handling + + + + + +In SleakOps, you can configure the retry behavior: + +**Current workaround:** +Use Lens as described above until SleakOps adds native support for BackOffLimit configuration. + +**Future configuration (when available):** +The SleakOps team is working on adding this configuration option directly in the platform interface. + +**Best practices for SleakOps CronJobs:** + +```yaml +# Recommended CronJob configuration +apiVersion: batch/v1 +kind: CronJob +metadata: + name: your-cronjob +spec: + schedule: "0 5 * * *" # Daily at 5 AM + jobTemplate: + spec: + backoffLimit: 0 # No retries + activeDeadlineSeconds: 300 # 5 minute timeout + template: + spec: + restartPolicy: Never + containers: + - name: your-job + image: your-image + command: ["/bin/sh"] + args: ["-c", "your-command && exit 0"] +``` + + + + + +**Monitor your CronJobs:** + +1. **Check job status regularly:** + +```bash +kubectl get cronjobs +kubectl get jobs +``` + +2. 
**Set up alerts for failed jobs:** + +```yaml +# Example Prometheus alert +- alert: CronJobFailed + expr: kube_job_status_failed > 0 + for: 0m + labels: + severity: warning + annotations: + summary: "CronJob {{ $labels.job_name }} failed" +``` + +**Prevention strategies:** + +- Always test CronJobs in development first +- Use proper exit codes in your scripts +- Implement proper error handling +- Set reasonable resource limits +- Use health checks when possible + + + + + +If you're still experiencing duplicate executions: + +**1. Verify the BackOffLimit setting:** + +```bash +kubectl get cronjob -o yaml | grep backoffLimit +``` + +**2. Check recent job history:** + +```bash +kubectl get jobs --sort-by=.metadata.creationTimestamp +``` + +**3. Examine failed job details:** + +```bash +kubectl describe job +``` + +**4. Review pod events:** + +```bash +kubectl get events --sort-by=.metadata.creationTimestamp +``` + +**5. Check for resource constraints:** + +```bash +kubectl top pods +kubectl describe nodes +``` + + + +--- + +_This FAQ was automatically generated on January 3, 2025 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-dns-label-length-limit.mdx b/docs/troubleshooting/kubernetes-dns-label-length-limit.mdx new file mode 100644 index 000000000..c86809c15 --- /dev/null +++ b/docs/troubleshooting/kubernetes-dns-label-length-limit.mdx @@ -0,0 +1,209 @@ +--- +sidebar_position: 3 +title: "Kubernetes DNS Label Length Limit Error" +description: "Solution for DNS resolution errors due to label length exceeding 63 characters" +date: "2025-02-21" +category: "cluster" +tags: ["kubernetes", "dns", "service", "naming", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes DNS Label Length Limit Error + +**Date:** February 21, 2025 +**Category:** Cluster +**Tags:** Kubernetes, DNS, Service, Naming, Troubleshooting + +## Problem Description + +**Context:** When 
attempting to resolve internal Kubernetes service DNS names, users encounter DNS resolution failures due to domain label length restrictions. + +**Observed Symptoms:** + +- DNS resolution fails with "not a legal IDNA2008 name" error +- Error message indicates "domain label longer than 63 characters" +- Services cannot communicate internally using generated DNS names +- `dig` command fails to resolve service FQDN + +**Relevant Configuration:** + +- Service name: Generated automatically by SleakOps +- Namespace: Long descriptive names (e.g., `velo-contact-email-sender-production`) +- DNS format: `..svc.cluster.local` +- Error occurs when total label length exceeds 63 characters + +**Error Conditions:** + +- Occurs when service names are auto-generated with long prefixes +- Happens with descriptive namespace names +- DNS resolution fails completely +- Services become unreachable via internal DNS + +## Detailed Solution + + + +Kubernetes DNS names must comply with RFC standards: + +- **Maximum label length**: 63 characters per DNS label +- **Total FQDN length**: 253 characters maximum +- **Label format**: Must contain only lowercase letters, numbers, and hyphens +- **Service FQDN format**: `..svc.cluster.local` + +In the example error: + +``` +velo-contact-email-sender-production-velo-contact-email-sender-svc.velo-contact-email-sender-production.svc.cluster.local +``` + +The service name part `velo-contact-email-sender-production-velo-contact-email-sender-svc` exceeds 63 characters. + + + + + +To find the actual service name in your cluster: + +1. **Using kubectl:** + +```bash +kubectl get services -n velo-contact-email-sender-production +``` + +2. **Using Lens (as suggested by support):** + + - Navigate to the specific namespace + - Go to **Services** section + - Find your service and note the actual name + +3. 
**Check service manifest:** + Look at the `metadata.name` field in your service definition: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: contact-email-sen-svc # This is the actual service name + namespace: velo-contact-email-sender-production +``` + + + + + +Based on the service manifest provided, the correct DNS name should be: + +``` +contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local +``` + +**Format breakdown:** + +- **Service name**: `contact-email-sen-svc` (from `metadata.name`) +- **Namespace**: `velo-contact-email-sender-production` +- **Domain suffix**: `svc.cluster.local` + +**Testing the resolution:** + +```bash +# From within a pod in the cluster +dig contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local + +# Or use nslookup +nslookup contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local + +# Simple connectivity test +telnet contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local 5001 +``` + + + + + +For services within the same namespace, you can use shorter forms: + +1. **Same namespace** (recommended): + +``` +contact-email-sen-svc:5001 +``` + +2. **Cross-namespace but same cluster**: + +``` +contact-email-sen-svc.velo-contact-email-sender-production:5001 +``` + +3. **Full FQDN** (when needed): + +``` +contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local:5001 +``` + + + + + +To avoid this issue in future deployments: + +1. **Use shorter service names** in SleakOps configuration +2. **Avoid redundant prefixes** in service names +3. **Consider namespace length** when naming projects +4. 
**Test DNS resolution** after deployment + +**Recommended naming pattern:** + +- Project: `velo-email-sender` +- Service: `email-svc` +- Result: `email-svc.velo-email-sender.svc.cluster.local` + +**Configuration example:** + +```yaml +# In SleakOps workload configuration +name: email-sender # Keep it short and descriptive +type: webservice +internal: true +port: 5001 +``` + + + + + +If DNS issues persist, follow these steps: + +1. **Verify service exists:** + +```bash +kubectl get svc -n velo-contact-email-sender-production +``` + +2. **Check DNS from within cluster:** + +```bash +# Get a shell in any pod +kubectl exec -it -n -- /bin/bash + +# Test DNS resolution +nslookup contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local +``` + +3. **Verify CoreDNS is working:** + +```bash +kubectl get pods -n kube-system | grep coredns +``` + +4. **Check service endpoints:** + +```bash +kubectl get endpoints contact-email-sen-svc -n velo-contact-email-sender-production +``` + + + +--- + +_This FAQ was automatically generated on February 21, 2025 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx b/docs/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx new file mode 100644 index 000000000..8ae97d2df --- /dev/null +++ b/docs/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx @@ -0,0 +1,204 @@ +--- +sidebar_position: 3 +title: "Kubernetes Pod Restarts Due to Memory Limits" +description: "Solution for pods restarting due to insufficient memory configuration" +date: "2024-12-19" +category: "workload" +tags: ["kubernetes", "memory", "pod-restarts", "resources", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Pod Restarts Due to Memory Limits + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** Kubernetes, Memory, Pod Restarts, Resources, Troubleshooting + +## Problem Description + 
+**Context:** A service in Kubernetes suddenly stops working without any changes from the user side. The pod keeps restarting with a "Back-off restarting failed container" message, and the service restarts every time it's accessed. + +**Observed Symptoms:** + +- Pod shows "Back-off restarting failed container" message +- Service restarts automatically when accessed +- No changes were made to the application code +- Issue appears suddenly without apparent cause + +**Relevant Configuration:** + +- MemoryMin and MemoryMax values appear to be set too low +- Pod is being killed by Kubernetes when exceeding memory limits +- Service is accessible but unstable due to constant restarts + +**Error Conditions:** + +- Error occurs when the pod exceeds defined memory limits +- Kubernetes kills the pod assuming the excess memory usage is incorrect +- Problem manifests during service access or high load + +## Detailed Solution + + + +Kubernetes manages pod resources through requests and limits: + +- **Memory Request (MemoryMin)**: Guaranteed memory allocation +- **Memory Limit (MemoryMax)**: Maximum memory the pod can use + +When a pod exceeds its memory limit, Kubernetes terminates it with an OOMKilled (Out of Memory Killed) status and restarts it automatically. + + + + + +To analyze current memory usage: + +1. Access your Grafana dashboard +2. Navigate to **'Kubernetes / Compute Resources / Namespace (Pods)'** +3. Select your namespace and pod +4. Review the memory usage patterns: + - Current memory consumption + - Memory spikes during operation + - Comparison with configured limits + +```yaml +# Example of what to look for in metrics +Memory Usage: 512Mi +Memory Limit: 256Mi # This would cause OOMKilled +Memory Request: 128Mi +``` + + + + + +To resolve the issue, increase the memory configuration: + +**In SleakOps Dashboard:** + +1. Go to your service configuration +2. Navigate to **Resource Settings** +3. Increase **Memory Limit** (MemoryMax) +4. 
Optionally increase **Memory Request** (MemoryMin) +5. Deploy the changes + +**Example Configuration:** + +```yaml +resources: + requests: + memory: "512Mi" # MemoryMin + cpu: "250m" + limits: + memory: "1Gi" # MemoryMax + cpu: "500m" +``` + +**Recommended Starting Values:** + +- For small applications: 512Mi - 1Gi +- For medium applications: 1Gi - 2Gi +- For large applications: 2Gi - 4Gi + + + + + +**Step 1: Check Current Pod Status** + +```bash +kubectl get pods -n your-namespace +kubectl describe pod your-pod-name -n your-namespace +``` + +**Step 2: Check Pod Events** + +```bash +kubectl get events -n your-namespace --sort-by='.lastTimestamp' +``` + +**Step 3: Review Pod Logs** + +```bash +kubectl logs your-pod-name -n your-namespace --previous +``` + +**Step 4: Monitor Resource Usage** + +```bash +kubectl top pods -n your-namespace +``` + +**Step 5: Update Resource Limits** + +- Increase memory limits based on observed usage +- Add 20-50% buffer above peak usage +- Test with gradual increases + + + + + +**Memory Sizing Guidelines:** + +1. **Start Conservative**: Begin with reasonable limits and monitor +2. **Monitor Regularly**: Use Grafana dashboards to track usage patterns +3. 
**Set Appropriate Ratios**: + - Memory Request: 70-80% of typical usage + - Memory Limit: 150-200% of typical usage + +**Configuration Example:** + +```yaml +# For a Node.js application +resources: + requests: + memory: "256Mi" # Guaranteed allocation + cpu: "100m" + limits: + memory: "512Mi" # Maximum allowed + cpu: "200m" + +# For a Java application +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +**Monitoring Alerts:** + +- Set up alerts when memory usage exceeds 80% of limits +- Monitor for frequent restarts +- Track memory usage trends over time + + + + + +Contact SleakOps support if: + +- Memory issues persist after increasing limits +- You observe unusual memory consumption patterns +- The application worked previously without configuration changes +- You need help interpreting Grafana metrics +- Resource increases don't resolve the restart issue + +Provide this information: + +- Service name and namespace +- Current resource configuration +- Grafana screenshots showing memory usage +- Pod logs and events +- Timeline of when the issue started + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx b/docs/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx new file mode 100644 index 000000000..34df2f3ef --- /dev/null +++ b/docs/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx @@ -0,0 +1,171 @@ +--- +sidebar_position: 3 +title: "Kubernetes Pod Scheduling Failures - Insufficient Resources" +description: "Solution for pod scheduling failures due to insufficient CPU and memory resources in Kubernetes clusters" +date: "2024-03-12" +category: "cluster" +tags: ["kubernetes", "scheduling", "resources", "nodepool", "karpenter"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Pod 
Scheduling Failures - Insufficient Resources + +**Date:** March 12, 2024 +**Category:** Cluster +**Tags:** Kubernetes, Scheduling, Resources, Nodepool, Karpenter + +## Problem Description + +**Context:** A deployment that was previously working suddenly fails to schedule pods in a Kubernetes cluster managed by SleakOps, showing resource insufficiency errors. + +**Observed Symptoms:** + +- Pods fail to schedule with "0/X nodes are available" error +- "Insufficient cpu" and "Insufficient memory" messages +- Error mentions specific resource requirements not being met +- Previously working deployments suddenly stop functioning + +**Relevant Configuration:** + +- Pod resource requirements: CPU 665m, Memory 2000Mi +- Nodepool: spot-arm64 with ARM64 architecture +- Cluster using Karpenter for node provisioning +- Minimum node requirements: 2GB RAM, 2 CPU cores + +**Error Conditions:** + +- Error occurs during pod scheduling phase +- Appears when nodepool reaches resource limits +- Affects deployments that were previously successful +- Multiple nodepools show incompatibility issues + +## Detailed Solution + + + +The error message indicates several issues: + +1. **Insufficient Resources**: Nodes don't have enough CPU (665m required) or memory (2000Mi required) +2. **Nodepool Limits**: The nodepool has reached its configured resource limits +3. **Taint Incompatibility**: Some nodes have taints that prevent pod scheduling +4. **Instance Type Constraints**: No available instance types meet the resource and constraint requirements + +``` +0/8 nodes are available: +- 2 Insufficient cpu +- 4 Insufficient memory +- Taints preventing scheduling on certain nodes +``` + + + + + +To resolve this issue immediately: + +1. **Access Nodepool Settings**: + + - Go to SleakOps Console + - Navigate to Clusters → Your Cluster → Settings → Node Pools + - Select the affected nodepool (e.g., "spot-arm64") + +2. 
**Increase Resource Limits**: + + - **CPU Limit**: Increase from current value to accommodate more pods + - **Memory Limit**: Increase total memory allocation (e.g., from 32GB to 64GB) + - **Node Count**: Ensure maximum node count allows for scaling + +3. **Save and Apply Changes**: + - The cluster will automatically provision new nodes if needed + - Existing pods should reschedule within a few minutes + + + + + +**Section 1 - Resource Limits (Security/Cost Cap)**: + +- Acts as a safety limit to prevent unexpected costs +- Doesn't affect costs unless you increase pod count or enable autoscaling +- Prevents runaway resource consumption + +**Section 2 - Node Configuration**: + +- **Storage**: Configures disk space for each node +- **Advanced Settings**: Sets minimum CPU and memory per node + - Default: 2GB RAM, 2 CPU cores minimum + - Prevents scheduling on very small instances (t4g.nano, t4g.micro) + - Ensures daemonsets can run properly + +```yaml +# Example nodepool configuration +resource_limits: + max_cpu: "32" + max_memory: "64Gi" + max_nodes: 10 + +node_requirements: + min_cpu: "2" + min_memory: "2Gi" + storage: "20Gi" +``` + + + + + +**Important**: Increasing resource limits does NOT immediately increase costs. + +**What affects costs**: + +- **Actual resource usage**: Only pay for nodes that are actually created +- **Pod scaling**: More pods = more nodes = higher costs +- **Autoscaling**: If enabled, automatic scaling based on demand + +**What doesn't affect costs**: + +- **Increasing limits**: Just raises the ceiling, doesn't create resources +- **Higher memory/CPU caps**: Only affects maximum possible usage + +**Best practices**: + +- Set reasonable limits based on expected maximum load +- Monitor actual usage vs. limits regularly +- Use spot instances for cost optimization when possible + + + + + +**Monitor Resource Usage**: + +1. **SleakOps Dashboard**: Check cluster resource utilization +2. 
**Set Alerts**: Configure notifications for high resource usage +3. **Regular Reviews**: Periodically review and adjust limits + +**Prevention Strategies**: + +- **Gradual Scaling**: Increase limits gradually based on actual needs +- **Resource Requests**: Ensure pods have appropriate resource requests set +- **Multiple Nodepools**: Use different nodepools for different workload types +- **Spot vs On-Demand**: Balance cost and reliability needs + +**Troubleshooting Commands**: + +```bash +# Check node resources +kubectl describe nodes + +# Check pod resource requests +kubectl describe pod -n + +# View nodepool status +kubectl get nodepools +``` + + + +--- + +_This FAQ was automatically generated on March 12, 2024 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx b/docs/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx new file mode 100644 index 000000000..770954826 --- /dev/null +++ b/docs/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx @@ -0,0 +1,202 @@ +--- +sidebar_position: 3 +title: "Kubernetes Secrets for SSL Private Keys" +description: "How to securely store and mount SSL private keys using Kubernetes Secrets in SleakOps" +date: "2024-12-19" +category: "project" +tags: ["kubernetes", "secrets", "ssl", "security", "volumes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Secrets for SSL Private Keys + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Kubernetes, Secrets, SSL, Security, Volumes + +## Problem Description + +**Context:** Users need to securely store SSL private keys in their Kubernetes environment and mount them as files within containers with proper access restrictions. 
+ +**Observed Symptoms:** + +- Need to store SSL private keys securely in Kubernetes +- Requirement to mount secrets as files (not environment variables) +- Need to restrict access to authorized pods only +- Requirement for proper file permissions on mounted secrets + +**Relevant Configuration:** + +- Platform: SleakOps Kubernetes environment +- Secret type: SSL private keys +- Mount requirement: File-based (not environment variables) +- Security level: High priority with restricted access + +**Error Conditions:** + +- Default VariableGroup creates environment variables, not files +- Need specific configuration for file-based secret mounting +- Requires proper security practices for SSL key management + +## Detailed Solution + + + +The standard approach in SleakOps is to use VariableGroups, which expose secrets as environment variables: + +1. **Create a VariableGroup:** + + - Go to your project in SleakOps + - Navigate to **VariableGroups** + - Create a new VariableGroup + - Add your SSL private key as a variable + +2. **Assign to Execution:** + - Assign the VariableGroup to specific executions + - Or leave it "global" to expose to all executions in the project + +**Note:** This method exposes the secret as an environment variable, not as a file. + + + + + +For file-based secret mounting (recommended for SSL keys), use the volume approach: + +1. **Navigate to Project Details:** + + - Go to **Projects** → **Details** + - Find the **Volumes** section + +2. **Create a Secret Volume:** + + - Create a new volume of type "Secret" + - Upload or paste your SSL private key content + - Configure the mount path where the file should appear + +3. **Example Configuration:** + ```yaml + # Volume configuration + name: ssl-private-key + type: secret + mountPath: /etc/ssl/private/ + fileName: private.key + permissions: 0600 + ``` + + + + + +When working with SSL private keys in Kubernetes: + +1. 
**File Permissions:** + + - Set restrictive permissions (0600 or 0400) + - Ensure only the application user can read the key + +2. **Access Control:** + + - Use Kubernetes RBAC to limit pod access + - Only mount secrets to pods that need them + - Avoid exposing keys as environment variables + +3. **Storage Security:** + + - Use Kubernetes native secret encryption at rest + - Rotate keys regularly + - Monitor access to secret resources + +4. **Example Secure Configuration:** + ```yaml + # Secure volume mount + volumes: + - name: ssl-key-volume + secret: + secretName: ssl-private-key + defaultMode: 0400 + items: + - key: private.key + path: private.key + mode: 0400 + ``` + + + + + +**Step 1: Prepare Your SSL Key** + +- Ensure your private key is in PEM format +- Remove any extra whitespace or formatting +- Test the key validity before uploading + +**Step 2: Create the Secret Volume in SleakOps** + +1. Navigate to **Projects** → **[Your Project]** → **Details** +2. Scroll to **Volumes** section +3. Click **Add Volume** +4. Select **Secret** type +5. 
Configure: + - **Name:** `ssl-private-key` + - **Mount Path:** `/etc/ssl/private/` + - **File Name:** `private.key` + - **Content:** Paste your private key + - **Permissions:** `0600` + +**Step 3: Reference in Your Application** + +```dockerfile +# In your Dockerfile or application +# The key will be available at /etc/ssl/private/private.key +COPY --from=secrets /etc/ssl/private/private.key /app/ssl/ +``` + +**Step 4: Verify Access** + +- Deploy your application +- Check that the file exists at the specified path +- Verify file permissions are correct +- Test SSL functionality + + + + + +**Issue 1: File Not Found** + +- Verify the mount path is correct +- Check that the volume is properly attached to the pod +- Ensure the secret was created successfully + +**Issue 2: Permission Denied** + +- Check file permissions (should be 0600 or 0400) +- Verify the application user has read access +- Ensure the mount path directory exists + +**Issue 3: Invalid Key Format** + +- Verify the key is in proper PEM format +- Check for extra characters or formatting issues +- Test the key outside of Kubernetes first + +**Debugging Commands:** + +```bash +# Check if secret exists +kubectl get secrets + +# Verify volume mount +kubectl describe pod + +# Check file permissions inside pod +kubectl exec -- ls -la /etc/ssl/private/ +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx b/docs/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx new file mode 100644 index 000000000..96c6fe74c --- /dev/null +++ b/docs/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx @@ -0,0 +1,748 @@ +--- +sidebar_position: 3 +title: "Kubernetes Shared Volumes Across Namespaces" +description: "Solution for sharing volumes between pods in different namespaces and alternative approaches" +date: "2025-01-30" +category: "cluster" +tags: ["kubernetes", 
"volumes", "namespaces", "storage", "s3"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Kubernetes Shared Volumes Across Namespaces + +**Date:** January 30, 2025 +**Category:** Cluster +**Tags:** Kubernetes, Volumes, Namespaces, Storage, S3 + +## Problem Description + +**Context:** User needs to share data between a cronjob that generates files and an nginx service that serves those files, but they are running in different namespaces in a Kubernetes cluster. + +**Observed Symptoms:** + +- Cannot mount the same volume across different namespaces +- Need to share generated files between cronjob and web server +- Current setup works on EC2 with shared volume mount +- Looking for Kubernetes equivalent of shared volume access + +**Relevant Configuration:** + +- Platform: SleakOps on AWS EKS +- Use case: Cronjob generates files, Nginx serves them +- Current setup: EC2 with shared volume between containers +- Target: Kubernetes pods in different namespaces + +**Error Conditions:** + +- Kubernetes limitation: same volume cannot be used across namespaces +- Need alternative solution for file sharing +- Performance requirements for large file generation (5GB+) + +## Detailed Solution + + + +Kubernetes has a fundamental limitation: **the same PersistentVolume cannot be mounted by pods in different namespaces**. This is by design for security and isolation purposes. + +This means your current EC2 approach of sharing a volume between containers won't work directly in Kubernetes when pods are in different namespaces. + + + + + +The recommended approach is to use **Amazon S3** as intermediate storage: + +### Architecture: + +1. **Cronjob**: Generates files locally → Uploads to S3 +2. 
**Nginx service**: Downloads files from S3 → Serves them + +### Benefits: + +- Works across namespaces +- Scalable and reliable +- Cost-effective for large files +- Built-in authentication via SleakOps service accounts + +### Configuration example: + +```yaml +# Cronjob configuration +apiVersion: batch/v1 +kind: CronJob +metadata: + name: file-generator + namespace: jobs +spec: + schedule: "0 2 * * *" + jobTemplate: + spec: + template: + spec: + containers: + - name: generator + image: your-app:latest + env: + - name: S3_BUCKET + value: "your-bucket-name" + - name: AWS_REGION + value: "us-east-1" + volumeMounts: + - name: temp-storage + mountPath: /tmp/files + volumes: + - name: temp-storage + emptyDir: + sizeLimit: 60Gi +``` + + + + + +For large file generation, configure your nodepool with sufficient EBS storage: + +### In SleakOps: + +1. Go to **Cluster Configuration** +2. Select your **Nodepool** +3. Modify **Node Configuration**: + - **EBS Volume Size**: 50-60 GB + - **Volume Type**: gp3 (faster and cheaper) + +### Configuration example: + +```yaml +nodepool_config: + instance_type: "t3.medium" + disk_size: 60 # GB + disk_type: "gp3" + min_nodes: 1 + max_nodes: 5 +``` + + + + + +SleakOps automatically configures S3 authentication via **service accounts**. You don't need to manage AWS credentials manually. 
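+On EKS, this kind of credential-free S3 access is typically implemented with IAM Roles for Service Accounts (IRSA). As an illustrative sketch only — the service-account name and role ARN below are placeholders, not values SleakOps generates — the wiring looks like this:
+
+```yaml
+# Hypothetical IRSA wiring: the annotation binds the Kubernetes service
+# account to an IAM role carrying the S3 permissions, so pods using this
+# service account receive temporary credentials automatically.
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: file-generator-sa
+  namespace: jobs
+  annotations:
+    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/file-generator-s3-role
+```
+
+A pod that sets `serviceAccountName: file-generator-sa` can then call S3 through the default AWS SDK credential chain, which is what the examples below rely on.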
+ +### Java example for S3 access: + +```java +import software.amazon.awssdk.services.s3.S3Client; +import software.amazon.awssdk.regions.Region; +import software.amazon.awssdk.services.s3.model.*; + +public class S3FileUploader { + public static void main(String[] args) { + // SleakOps handles authentication automatically + S3Client s3 = S3Client.builder() + .region(Region.US_EAST_1) + .build(); + + // Upload file to S3 + PutObjectRequest putRequest = PutObjectRequest.builder() + .bucket("your-bucket-name") + .key("generated-files/data.zip") + .build(); + + s3.putObject(putRequest, + RequestBody.fromFile(new File("/tmp/files/data.zip"))); + } +} +``` + +### Python example: + +```python +import boto3 +import os + +# SleakOps handles authentication via service account +s3_client = boto3.client('s3') +bucket_name = os.environ['S3_BUCKET'] + +# Upload generated file +s3_client.upload_file( + '/tmp/files/generated_data.zip', + bucket_name, + 'generated-files/generated_data.zip' +) +``` + + + + + +For serving files from S3 through Nginx, you have several options: + +### Option 1: Nginx with S3 proxy + +```nginx +server { + listen 80; + server_name your-domain.com; + + location /files/ { + proxy_pass https://your-bucket.s3.amazonaws.com/; + proxy_set_header Host your-bucket.s3.amazonaws.com; + proxy_hide_header x-amz-id-2; + proxy_hide_header x-amz-request-id; + } +} +``` + +### Option 2: Download and serve locally + +```bash +#!/bin/bash +# Init script for nginx container +aws s3 sync s3://your-bucket/generated-files/ /usr/share/nginx/html/files/ +nginx -g "daemon off;" +``` + +### Option 3: Use S3 static website hosting + +Enable static website hosting on your S3 bucket and point your domain directly to S3. + + + + + +If you must keep everything within Kubernetes: + +### Option 1: Same namespace + +Move both cronjob and nginx to the same namespace to share volumes. 
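+As a minimal sketch of this variant (all names are placeholders): one PersistentVolumeClaim mounted by both workloads in a single namespace. Note that EBS-backed storage classes are `ReadWriteOnce`, so the CronJob pod and the nginx pod must land on the same node; if that constraint is a problem, prefer the EFS approach in Option 2.
+
+```yaml
+# Shared PVC in a single namespace ("web" is a placeholder name).
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: shared-files
+  namespace: web
+spec:
+  accessModes:
+    - ReadWriteOnce   # EBS: writable from one node at a time
+  resources:
+    requests:
+      storage: 60Gi
+---
+# The CronJob writes its output into the shared claim.
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: file-generator
+  namespace: web       # same namespace as the nginx Deployment
+spec:
+  schedule: "0 2 * * *"
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: Never
+          containers:
+            - name: generator
+              image: your-app:latest
+              volumeMounts:
+                - name: shared
+                  mountPath: /output
+          volumes:
+            - name: shared
+              persistentVolumeClaim:
+                claimName: shared-files
+```
+
+The nginx Deployment in the same namespace then mounts the same `shared-files` claim read-only (e.g. at `/usr/share/nginx/html`) and serves the generated files directly.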
+ +### Option 2: NFS or EFS + +Use Amazon EFS (Elastic File System) which can be mounted across namespaces: + +```yaml +apiVersion: v1 +kind: PersistentVolume +metadata: + name: efs-pv +spec: + capacity: + storage: 100Gi + accessModes: + - ReadWriteMany + persistentVolumeReclaimPolicy: Retain + storageClassName: efs-sc + csi: + driver: efs.csi.aws.com + volumeHandle: fs-12345678.efs.us-west-2.amazonaws.com +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: efs-pvc-cronjob + namespace: jobs +spec: + accessModes: + - ReadWriteMany + storageClassName: efs-sc + resources: + requests: + storage: 100Gi +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: efs-pvc-nginx + namespace: web +spec: + accessModes: + - ReadWriteMany + storageClassName: efs-sc + resources: + requests: + storage: 100Gi +``` + +**Pros:** + +- True shared filesystem +- POSIX-compliant file operations +- Good performance for concurrent access + +**Cons:** + +- Additional AWS cost for EFS +- More complex setup +- Network dependency + +### Option 3: Kubernetes Volumes with Copy Pattern + +Use init containers to copy data between pods: + +```yaml +# Nginx deployment with init container +apiVersion: apps/v1 +kind: Deployment +metadata: + name: nginx-with-data + namespace: web +spec: + replicas: 1 + selector: + matchLabels: + app: nginx + template: + metadata: + labels: + app: nginx + spec: + initContainers: + - name: data-fetcher + image: amazon/aws-cli:latest + command: + - sh + - -c + - | + aws s3 sync s3://your-bucket/generated-files/ /shared-data/ + volumeMounts: + - name: shared-volume + mountPath: /shared-data + env: + - name: AWS_DEFAULT_REGION + value: "us-west-2" + containers: + - name: nginx + image: nginx:latest + ports: + - containerPort: 80 + volumeMounts: + - name: shared-volume + mountPath: /usr/share/nginx/html + readOnly: true + volumes: + - name: shared-volume + emptyDir: {} +``` + + + + + +When dealing with large files (5GB+), consider these 
optimization strategies: + +### 1. Streaming and Chunked Uploads + +```python +import boto3 +from botocore.config import Config +import os + +# Optimized S3 client configuration +s3_config = Config( + retries={'max_attempts': 10, 'mode': 'adaptive'}, + max_pool_connections=50 +) + +s3_client = boto3.client('s3', config=s3_config) + +def upload_large_file(file_path, bucket, key): + """Upload large file with multipart upload""" + try: + # Use multipart upload for files > 100MB + file_size = os.path.getsize(file_path) + if file_size > 100 * 1024 * 1024: # 100MB + print(f"Using multipart upload for {file_size / (1024*1024):.2f}MB file") + + s3_client.upload_file( + file_path, + bucket, + key, + Config=boto3.s3.transfer.TransferConfig( + multipart_threshold=1024 * 25, # 25MB + max_concurrency=10, + multipart_chunksize=1024 * 25, + use_threads=True + ) + ) + print(f"Successfully uploaded {key}") + return True + except Exception as e: + print(f"Upload failed: {e}") + return False + +def download_large_file(bucket, key, file_path): + """Download large file with optimization""" + try: + s3_client.download_file( + bucket, + key, + file_path, + Config=boto3.s3.transfer.TransferConfig( + multipart_threshold=1024 * 25, + max_concurrency=10, + multipart_chunksize=1024 * 25, + use_threads=True + ) + ) + print(f"Successfully downloaded {key}") + return True + except Exception as e: + print(f"Download failed: {e}") + return False +``` + +### 2. 
Compression and Optimization + +```bash +#!/bin/bash +# Script for cronjob to optimize file transfer + +# Generate your files +generate_files.sh + +# Compress files before upload +tar -czf generated-files-$(date +%Y%m%d).tar.gz /path/to/generated/files/ + +# Upload compressed file +aws s3 cp generated-files-$(date +%Y%m%d).tar.gz s3://your-bucket/compressed/ + +# Upload individual files for nginx access +aws s3 sync /path/to/generated/files/ s3://your-bucket/files/ --delete + +# Cleanup local files +rm -rf /path/to/generated/files/* +rm generated-files-$(date +%Y%m%d).tar.gz +``` + +### 3. Nginx Configuration for S3 Proxy + +Instead of downloading files, configure Nginx to proxy S3 requests: + +```nginx +# nginx.conf +server { + listen 80; + server_name your-domain.com; + + location /files/ { + # Proxy to S3 bucket + proxy_pass https://your-bucket.s3.amazonaws.com/; + proxy_set_header Host your-bucket.s3.amazonaws.com; + proxy_set_header Authorization ""; + + # Cache settings + proxy_cache_valid 200 1h; + proxy_cache_key $uri; + + # Hide S3 headers + proxy_hide_header x-amz-id-2; + proxy_hide_header x-amz-request-id; + proxy_hide_header Set-Cookie; + + # Add custom headers + add_header Cache-Control "public, max-age=3600"; + } + + location /health { + access_log off; + return 200 "healthy\n"; + add_header Content-Type text/plain; + } +} +``` + + + + + +### CloudWatch Monitoring + +Set up monitoring for your file sharing system: + +```yaml +# CloudWatch alarm for S3 upload failures +apiVersion: v1 +kind: ConfigMap +metadata: + name: monitoring-config +data: + cloudwatch-config.json: | + { + "alarms": [ + { + "name": "S3-Upload-Errors", + "metric": "4xxErrors", + "namespace": "AWS/S3", + "threshold": 5, + "period": 300, + "dimensions": { + "BucketName": "your-bucket" + } + } + ] + } +``` + +### Application-Level Monitoring + +```python +import time +import logging +from prometheus_client import Counter, Histogram, start_http_server + +# Metrics +upload_counter = 
Counter('file_uploads_total', 'Total file uploads', ['status']) +upload_duration = Histogram('file_upload_duration_seconds', 'File upload duration') +download_counter = Counter('file_downloads_total', 'Total file downloads', ['status']) + +def monitored_upload(file_path, bucket, key): + start_time = time.time() + try: + result = upload_large_file(file_path, bucket, key) + status = 'success' if result else 'failure' + upload_counter.labels(status=status).inc() + + duration = time.time() - start_time + upload_duration.observe(duration) + + logging.info(f"Upload {status}: {key} in {duration:.2f}s") + return result + except Exception as e: + upload_counter.labels(status='error').inc() + logging.error(f"Upload error: {e}") + return False + +# Start metrics server +start_http_server(8000) +``` + +### Log Aggregation + +```yaml +# Fluent Bit configuration for log collection +apiVersion: v1 +kind: ConfigMap +metadata: + name: fluent-bit-config +data: + fluent-bit.conf: | + [INPUT] + Name tail + Path /var/log/cronjob/*.log + Parser json + Tag cronjob.* + + [INPUT] + Name tail + Path /var/log/nginx/*.log + Parser nginx + Tag nginx.* + + [OUTPUT] + Name cloudwatch_logs + Match * + region us-west-2 + log_group_name /k8s/file-sharing + auto_create_group true +``` + + + + + +### Common Issues and Solutions + +1. **S3 Upload Timeouts:** + + ```python + import boto3 + from botocore.config import Config + + # Increase timeout values + config = Config( + read_timeout=900, # 15 minutes + connect_timeout=60, + retries={'max_attempts': 10} + ) + s3_client = boto3.client('s3', config=config) + ``` + +2. **Permission Issues:** + + ```bash + # Check IAM permissions + aws sts get-caller-identity + aws s3 ls s3://your-bucket/ --recursive + + # Test upload permission + echo "test" | aws s3 cp - s3://your-bucket/test.txt + ``` + +3. 
**Network Connectivity:**
+
+   ```bash
+   # Test from a pod (replace <pod-name> with an actual pod name)
+   kubectl exec -it <pod-name> -- nslookup s3.amazonaws.com
+   kubectl exec -it <pod-name> -- curl -I https://s3.amazonaws.com
+   ```
+
+4. **File Synchronization Issues:**
+
+   ```python
+   import hashlib
+   import boto3
+
+   def verify_file_integrity(local_file, bucket, s3_key):
+       # Calculate local file hash
+       with open(local_file, 'rb') as f:
+           local_hash = hashlib.md5(f.read()).hexdigest()
+
+       # Get S3 object ETag (matches the MD5 only for single-part uploads)
+       s3 = boto3.client('s3')
+       response = s3.head_object(Bucket=bucket, Key=s3_key)
+       s3_etag = response['ETag'].strip('"')
+
+       if local_hash == s3_etag:
+           print("File integrity verified")
+           return True
+       else:
+           print(f"Hash mismatch: local={local_hash}, s3={s3_etag}")
+           return False
+   ```
+
+5. **Performance Issues:**
+
+   ```bash
+   # Monitor pod resource usage and disk I/O
+   kubectl top pods
+   kubectl exec -it <pod-name> -- iostat -x 1
+
+   # Rough check of the network path to S3 (iperf3 needs a dedicated
+   # iperf server, so use curl timings instead)
+   kubectl exec -it <pod-name> -- curl -o /dev/null -s -w 'time_total: %{time_total}s\n' https://s3.amazonaws.com
+   ```
+
+### Emergency Recovery Procedures
+
+```bash
+#!/bin/bash
+# Emergency recovery script
+
+BUCKET="your-bucket"
+BACKUP_BUCKET="your-backup-bucket"
+
+# Check if primary upload failed
+if ! aws s3 ls s3://$BUCKET/latest/ > /dev/null 2>&1; then
+  echo "Primary upload failed, initiating recovery..."
+
+  # Attempt to restore from backup
+  aws s3 sync s3://$BACKUP_BUCKET/latest/ s3://$BUCKET/latest/
+
+  # Trigger nginx refresh
+  kubectl rollout restart deployment/nginx -n web
+
+  # Send alert
+  curl -X POST "https://hooks.slack.com/..." \
+    -H 'Content-type: application/json' \
+    --data '{"text":"File sharing recovery completed"}'
+fi
+```
+
+
+
+
+
+### Security Best Practices
+
+1. 
**Use least privilege IAM policies:** + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["s3:GetObject", "s3:PutObject"], + "Resource": "arn:aws:s3:::your-bucket/files/*" + }, + { + "Effect": "Allow", + "Action": ["s3:ListBucket"], + "Resource": "arn:aws:s3:::your-bucket", + "Condition": { + "StringLike": { + "s3:prefix": ["files/*"] + } + } + } + ] + } + ``` + +2. **Enable S3 versioning and lifecycle policies:** + + ```json + { + "Rules": [ + { + "ID": "FileRetention", + "Status": "Enabled", + "Filter": { "Prefix": "files/" }, + "Transitions": [ + { + "Days": 30, + "StorageClass": "STANDARD_IA" + }, + { + "Days": 90, + "StorageClass": "GLACIER" + } + ], + "Expiration": { + "Days": 365 + } + } + ] + } + ``` + +3. **Implement proper error handling and retries:** + + ```python + import time + import random + + def retry_with_backoff(func, max_retries=5, base_delay=1): + for attempt in range(max_retries): + try: + return func() + except Exception as e: + if attempt == max_retries - 1: + raise e + + delay = base_delay * (2 ** attempt) + random.uniform(0, 1) + print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s") + time.sleep(delay) + ``` + +### Performance Best Practices + +1. **Use appropriate instance types for file operations** +2. **Configure proper resource requests and limits** +3. **Implement caching strategies for frequently accessed files** +4. **Use S3 Transfer Acceleration for global deployments** +5. **Monitor and optimize network bandwidth usage** + +### Operational Best Practices + +1. **Implement comprehensive logging and monitoring** +2. **Set up automated backups and disaster recovery** +3. **Document file naming conventions and data retention policies** +4. **Regular testing of file sharing mechanisms** +5. 
**Maintain up-to-date documentation of the architecture** + + + +--- + +_This FAQ was automatically generated on January 30, 2025 based on a real user query._ diff --git a/docs/troubleshooting/lens-cluster-connection-troubleshooting.mdx b/docs/troubleshooting/lens-cluster-connection-troubleshooting.mdx new file mode 100644 index 000000000..d4c1ccae1 --- /dev/null +++ b/docs/troubleshooting/lens-cluster-connection-troubleshooting.mdx @@ -0,0 +1,215 @@ +--- +sidebar_position: 3 +title: "Lens Cluster Connection Issues" +description: "Troubleshooting guide for Lens IDE connection problems with Kubernetes clusters" +date: "2025-01-27" +category: "cluster" +tags: ["lens", "connection", "troubleshooting", "kubernetes", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Lens Cluster Connection Issues + +**Date:** January 27, 2025 +**Category:** Cluster +**Tags:** Lens, Connection, Troubleshooting, Kubernetes, Networking + +## Problem Description + +**Context:** Users experience connectivity issues when trying to access their Kubernetes cluster through Lens IDE, despite the cluster services running normally and being accessible from external sources. 
+ +**Observed Symptoms:** + +- Lens IDE cannot connect to the Kubernetes cluster +- Cluster services continue running normally +- External access to services works correctly +- Connection issues appear suddenly without configuration changes +- Lens shows connection timeout or authentication errors + +**Relevant Configuration:** + +- Tool: Lens IDE (Kubernetes IDE) +- Cluster: SleakOps managed Kubernetes cluster +- Network: Local development environment +- Access method: kubectl configuration + +**Error Conditions:** + +- Connection fails specifically through Lens IDE +- Problem occurs intermittently or after period of inactivity +- Other kubectl commands may also be affected +- VPN or network connectivity issues may be involved + +## Detailed Solution + + + +The most common solution is to restart your network connection: + +**For Windows:** + +1. Open Command Prompt as Administrator +2. Run the following commands: + ```cmd + ipconfig /release + ipconfig /renew + ipconfig /flushdns + ``` +3. Restart your network adapter from Network Settings + +**For macOS:** + +1. Turn off Wi-Fi from the menu bar +2. Wait 10 seconds +3. Turn Wi-Fi back on +4. Or use Terminal: + ```bash + sudo dscacheutil -flushcache + sudo killall -HUP mDNSResponder + ``` + +**For Linux:** + +```bash +sudo systemctl restart NetworkManager +# or +sudo service network-manager restart +``` + + + + + +1. **Refresh cluster connection in Lens:** + + - Right-click on your cluster in Lens + - Select "Refresh" + - Wait for the connection to re-establish + +2. **Clear Lens cache:** + + - Close Lens completely + - Clear application cache: + - **Windows:** `%APPDATA%\Lens` + - **macOS:** `~/Library/Application Support/Lens` + - **Linux:** `~/.config/Lens` + - Restart Lens + +3. 
**Re-add cluster configuration:** + - Remove the cluster from Lens + - Re-import your kubeconfig file + - Test the connection + + + + + +Before troubleshooting Lens, verify that kubectl works: + +```bash +# Test basic connectivity +kubectl cluster-info + +# Check if you can list nodes +kubectl get nodes + +# Verify authentication +kubectl auth can-i get pods +``` + +If kubectl doesn't work, the issue is with your cluster configuration, not Lens specifically. + + + + + +If you're using a VPN to access your cluster: + +1. **Verify VPN connection:** + + - Check VPN client status + - Ensure you're connected to the correct VPN server + - Try disconnecting and reconnecting + +2. **Test network connectivity:** + + ```bash + # Ping your cluster API server + ping your-cluster-api-server.com + + # Test port connectivity + telnet your-cluster-api-server.com 443 + ``` + +3. **DNS resolution issues:** + + ```bash + # Check DNS resolution + nslookup your-cluster-api-server.com + + # Try using different DNS servers + # Google DNS: 8.8.8.8, 8.8.4.4 + # Cloudflare DNS: 1.1.1.1, 1.0.0.1 + ``` + + + + + +If your cluster uses temporary credentials (like AWS EKS), they may have expired: + +**For EKS clusters:** + +```bash +# Update kubeconfig +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verify the update +kubectl get nodes +``` + +**For other cloud providers:** + +- **GKE:** `gcloud container clusters get-credentials` +- **AKS:** `az aks get-credentials` + +**Check token expiration:** + +```bash +# View current context +kubectl config current-context + +# View detailed config +kubectl config view --minify +``` + + + + + +1. **Firewall settings:** + + - Ensure Lens is allowed through your firewall + - Check if ports 443 and 6443 are open + - Temporarily disable firewall to test + +2. 
**Corporate proxy:** + + - Configure proxy settings in Lens preferences + - Set environment variables: + ```bash + export HTTP_PROXY=http://proxy.company.com:8080 + export HTTPS_PROXY=http://proxy.company.com:8080 + export NO_PROXY=localhost,127.0.0.1 + ``` + +3. **Certificate issues:** + - Check if your organization uses custom certificates + - Import certificates into your system's trust store + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/lens-vpn-dns-resolution-issue.mdx b/docs/troubleshooting/lens-vpn-dns-resolution-issue.mdx new file mode 100644 index 000000000..51f56346b --- /dev/null +++ b/docs/troubleshooting/lens-vpn-dns-resolution-issue.mdx @@ -0,0 +1,156 @@ +--- +sidebar_position: 3 +title: "Lens VPN DNS Resolution Issue" +description: "Solution for DNS resolution problems when using Lens with Pritunl VPN" +date: "2025-03-05" +category: "user" +tags: ["lens", "vpn", "pritunl", "dns", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Lens VPN DNS Resolution Issue + +**Date:** March 5, 2025 +**Category:** User +**Tags:** Lens, VPN, Pritunl, DNS, Troubleshooting + +## Problem Description + +**Context:** Users experience DNS resolution issues when trying to connect to Kubernetes clusters through Lens while connected to Pritunl VPN. The system resolves public DNS instead of internal DNS, preventing proper cluster access. 
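To see whether this is happening on your machine, a short standard-library Python check shows which addresses the system resolver actually returns for a hostname and whether they fall in a private range. `"localhost"` below is only a stand-in for your real cluster endpoint:

```python
import ipaddress
import socket

def resolve(host: str):
    """Return the unique IP addresses the system resolver yields for host."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})

def is_private(addr: str) -> bool:
    """True if the address is in a private range (RFC 1918, loopback, ULA)."""
    return ipaddress.ip_address(addr).is_private

# Replace "localhost" with your cluster endpoint; while the VPN DNS is
# working, the endpoint should resolve to private addresses, not public ones.
for addr in resolve("localhost"):
    print(addr, "private" if is_private(addr) else "PUBLIC")
```

If the cluster endpoint prints a PUBLIC address while the VPN is connected, the public resolver is answering instead of Pritunl's internal DNS, which matches the symptom described above.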
+ +**Observed Symptoms:** + +- Lens cannot connect to Kubernetes cluster despite being connected to VPN +- DNS resolution errors when accessing cluster resources +- Public DNS is being resolved instead of internal/private DNS +- Connection works intermittently or not at all + +**Relevant Configuration:** + +- Tool: Lens (Kubernetes IDE) +- VPN: Pritunl +- Connection: VPN is active and connected +- Kubeconfig: Properly configured and imported + +**Error Conditions:** + +- Error occurs when Lens tries to resolve cluster endpoints +- Problem persists even with active VPN connection +- DNS resolution defaults to public instead of private +- Issue commonly occurs after system restarts or network changes + +## Detailed Solution + + + +This is a common issue that occurs when the system resolves public DNS instead of the internal VPN DNS. Follow these steps to resolve it: + +1. **Close Lens completely** + + - Make sure Lens is fully closed (check system tray) + - End any remaining Lens processes if necessary + +2. **Reconnect to Pritunl VPN** + + - Disconnect from the current VPN connection + - Wait a few seconds + - Reconnect to the VPN + +3. **Reset DNS service in Pritunl** + + - Open Pritunl client + - Go to **Options** or **Settings** + - Look for **"Reset DNS Service"** option + - Click on it to reset the DNS configuration + +4. **Reopen Lens** + - Launch Lens again + - Try connecting to your cluster + + + + + +To confirm the DNS resolution is working correctly: + +1. **Check VPN status** + + ```bash + # On Windows + nslookup your-cluster-endpoint + + # On macOS/Linux + dig your-cluster-endpoint + ``` + +2. **Test cluster connectivity** + + ```bash + kubectl cluster-info + kubectl get nodes + ``` + +3. 
**Verify in Lens** + - Open Lens + - Connect to your cluster + - Check if resources load properly + + + + + +If the DNS reset doesn't work, try these alternatives: + +**Option 1: Restart network services** + +```bash +# Windows (run as administrator) +ipconfig /flushdns +ipconfig /release +ipconfig /renew + +# macOS +sudo dscacheutil -flushcache +sudo killall -HUP mDNSResponder + +# Linux +sudo systemctl restart systemd-resolved +``` + +**Option 2: Manual DNS configuration** + +- Configure your system to use the VPN's DNS servers +- Add the internal DNS servers to your network configuration +- Ensure VPN DNS takes priority over system DNS + +**Option 3: Use kubectl directly** + +- If Lens continues having issues, use kubectl from terminal +- This bypasses Lens DNS resolution +- Configure your kubeconfig properly for direct access + + + + + +To minimize DNS resolution problems: + +1. **Always connect to VPN before opening Lens** +2. **Use Pritunl's DNS reset feature regularly** +3. **Keep Pritunl client updated** +4. **Configure static DNS if needed** +5. **Monitor network changes that might affect DNS** + +**Pro tip:** Create a script or routine: + +1. Connect to VPN +2. Reset DNS service +3. Wait 10 seconds +4. 
Open Lens + + + +--- + +_This FAQ was automatically generated on March 5, 2025 based on a real user query._ diff --git a/docs/troubleshooting/lens-wsl-aws-cli-configuration.mdx b/docs/troubleshooting/lens-wsl-aws-cli-configuration.mdx new file mode 100644 index 000000000..2163ca67b --- /dev/null +++ b/docs/troubleshooting/lens-wsl-aws-cli-configuration.mdx @@ -0,0 +1,236 @@ +--- +sidebar_position: 3 +title: "Lens with WSL and AWS CLI Configuration" +description: "Solution for Lens authentication issues when using WSL with AWS CLI" +date: "2024-12-19" +category: "user" +tags: ["lens", "wsl", "aws-cli", "authentication", "windows"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Lens with WSL and AWS CLI Configuration + +**Date:** December 19, 2024 +**Category:** User +**Tags:** Lens, WSL, AWS CLI, Authentication, Windows + +## Problem Description + +**Context:** Team member using Windows with WSL is trying to access a Kubernetes cluster through Lens but encounters authentication errors due to AWS CLI not being found in the Windows environment. 
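A quick way to confirm this mismatch is to check where the `aws` executable is visible from each side. The Windows `where` command is shown only as a comment, since it exists only there:

```bash
# Inside WSL: locate the AWS CLI that Lens (running on Windows) cannot see
which aws || echo "aws not found in this shell"
aws --version 2>/dev/null || echo "aws CLI not runnable here"

# From Windows PowerShell or cmd, the equivalent check would be:
#   where aws
# If `where aws` prints nothing on Windows while `which aws` succeeds in WSL,
# Lens cannot invoke the credential plugin, because it spawns Windows processes.
```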
+
+**Observed Symptoms:**
+
+- Lens cannot authenticate with the Kubernetes cluster
+- Error message: "executable aws not found"
+- Authentication proxy fails to start properly
+- Connection to cluster fails with Internal Server Error (500)
+
+**Relevant Configuration:**
+
+- Operating System: Windows with WSL
+- Tool: Lens (Kubernetes IDE)
+- Authentication: AWS CLI credential plugin
+- Environment: Mixed Windows/WSL setup
+
+**Error Conditions:**
+
+- Lens opens a Windows terminal instead of a WSL terminal
+- AWS CLI is installed in WSL but not in Windows
+- Credential plugin cannot find the AWS executable
+- Authentication fails during cluster connection
+
+## Detailed Solution
+
+
+
+The best approach is to install Lens directly within the WSL environment:
+
+### Prerequisites
+
+- WSL2 with Ubuntu or similar Linux distribution
+- X11 forwarding or WSLg for GUI applications
+
+### Installation Steps
+
+1. **Enable GUI support in WSL:**
+
+   ```bash
+   # For WSL2 with WSLg (Windows 11)
+   # WSLg is included by default, no additional setup needed
+
+   # For older Windows versions, install X11 server
+   # Install VcXsrv or similar X11 server on Windows
+   export DISPLAY=:0
+   ```
+
+2. **Install Lens in WSL:**
+
+   ```bash
+   # Download Lens AppImage
+   wget https://api.k8slens.dev/binaries/Lens-2023.12.151757-latest.x86_64.AppImage
+
+   # Make it executable
+   chmod +x Lens-2023.12.151757-latest.x86_64.AppImage
+
+   # Run Lens
+   ./Lens-2023.12.151757-latest.x86_64.AppImage
+   ```
+
+3. **Verify AWS CLI access:**
+   ```bash
+   # Ensure AWS CLI is properly configured
+   aws --version
+   aws sts get-caller-identity
+   ```
+
+
+
+
+
+Ensure your kubeconfig is properly configured within WSL:
+
+1. **Update kubeconfig:**
+
+   ```bash
+   # Update kubeconfig for your EKS cluster
+   aws eks update-kubeconfig --region <region> --name <cluster-name>
+   ```
+
+2. **Verify cluster access:**
+
+   ```bash
+   # Test cluster connectivity
+   kubectl get nodes
+   kubectl cluster-info
+   ```
+
+3. 
**Check kubeconfig location:**
+   ```bash
+   # Ensure kubeconfig is in the expected location
+   ls -la ~/.kube/config
+   ```
+
+
+
+
+
+If you prefer to keep Lens on Windows, install AWS CLI on Windows:
+
+### Installation Steps
+
+1. **Download AWS CLI for Windows:**
+
+   - Visit: https://aws.amazon.com/cli/
+   - Download the Windows installer
+   - Run the installer with administrator privileges
+
+2. **Configure AWS CLI:**
+
+   ```cmd
+   # Open Command Prompt or PowerShell
+   aws configure
+   # Enter your AWS Access Key ID, Secret Access Key, region, and output format
+   ```
+
+3. **Copy WSL credentials (if needed):**
+
+   ```cmd
+   # Copy credentials from WSL to Windows
+   # From WSL, copy ~/.aws/ directory contents to Windows %USERPROFILE%\.aws\
+   ```
+
+4. **Update kubeconfig on Windows:**
+   ```cmd
+   aws eks update-kubeconfig --region <region> --name <cluster-name>
+   ```
+
+
+
+
+
+### Common Issues and Solutions
+
+**Issue 1: X11 forwarding not working**
+
+```bash
+# Install required packages
+sudo apt update
+sudo apt install x11-apps
+
+# Test X11 forwarding
+xeyes
+```
+
+**Issue 2: AWS credentials not found**
+
+```bash
+# Check AWS credentials
+aws configure list
+cat ~/.aws/credentials
+cat ~/.aws/config
+```
+
+**Issue 3: Kubeconfig path issues**
+
+```bash
+# Set KUBECONFIG environment variable
+export KUBECONFIG=~/.kube/config
+
+# Add to ~/.bashrc for persistence
+echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc
+```
+
+**Issue 4: Permission denied errors**
+
+```bash
+# Fix kubeconfig permissions
+chmod 600 ~/.kube/config
+chown $USER:$USER ~/.kube/config
+```
+
+
+
+
+
+### Recommended Setup
+
+1. **Use WSL2 for all Kubernetes tools:**
+
+   - Install kubectl, helm, aws-cli in WSL
+   - Keep all configurations in WSL environment
+   - Use WSL for all cluster interactions
+
+2. **Environment consistency:**
+
+   ```bash
+   # Add to ~/.bashrc
+   export KUBECONFIG=~/.kube/config
+   export AWS_PROFILE=default
+   export AWS_REGION=us-west-2
+   ```
+
+3. 
**Tool installation script:** + + ```bash + #!/bin/bash + # install-k8s-tools.sh + + # Install kubectl + curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" + sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + + # Install AWS CLI + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + + # Install Helm + curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash + ``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/load-balancer-host-header-routing.mdx b/docs/troubleshooting/load-balancer-host-header-routing.mdx new file mode 100644 index 000000000..2d6f6cbe4 --- /dev/null +++ b/docs/troubleshooting/load-balancer-host-header-routing.mdx @@ -0,0 +1,170 @@ +--- +sidebar_position: 3 +title: "Load Balancer Host Header Routing Issue" +description: "Solution for HTTP 404 errors when using custom domains with load balancers" +date: "2024-01-15" +category: "workload" +tags: ["load-balancer", "nginx", "host-header", "routing", "cloudflare"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Load Balancer Host Header Routing Issue + +**Date:** January 15, 2024 +**Category:** Workload +**Tags:** Load Balancer, Nginx, Host Header, Routing, Cloudflare + +## Problem Description + +**Context:** User has deployed an Nginx service listening on port 80, configured to serve content for specific domains ("velo.la" and "www.velo.la"). The service is accessible via the SleakOps-provided domain but returns HTTP 404 when accessed through custom domains via Cloudflare CNAME or direct host header manipulation. 
+ +**Observed Symptoms:** + +- Service works correctly when accessed via SleakOps domain: `main-proxy.develop.us-east-2.velo.la` +- HTTP 404 error when accessing via Cloudflare CNAME redirect from `velo.la` to the SleakOps domain +- HTTP 404 error when using `/etc/hosts` to point `velo.la` to the load balancer IP +- `curl -H "Host: velo.la"` returns HTTP 404, while direct curl to SleakOps domain returns 200 OK +- Nginx logs show requests arrive for direct domain access but not for custom host headers + +**Relevant Configuration:** + +- Service: Nginx listening on port 80 +- Expected domains: `velo.la`, `www.velo.la` +- SleakOps domain: `main-proxy.develop.us-east-2.velo.la` +- DNS: Cloudflare CNAME pointing custom domains to SleakOps domain + +**Error Conditions:** + +- HTTP 404 occurs when Host header differs from the SleakOps-provided domain +- Load balancer appears to filter requests based on Host header +- Requests with custom Host headers don't reach the Nginx container + +## Detailed Solution + + + +The issue occurs because SleakOps load balancers perform **host-based routing** by default. When you access the service using a custom domain (via Host header or CNAME), the load balancer doesn't recognize the custom domain and returns a 404 before the request reaches your Nginx container. + +This is why: + +- `curl https://main-proxy.develop.us-east-2.velo.la` works (recognized domain) +- `curl -H "Host: velo.la" https://main-proxy.develop.us-east-2.velo.la` fails (unrecognized domain) + + + + + +To fix this issue, you need to configure your custom domains in the SleakOps service configuration: + +1. **Access your service configuration** in the SleakOps dashboard +2. **Navigate to the Networking section** +3. 
**Add custom domains** to the allowed hosts list: + +```yaml +# Service configuration example +networking: + public: true + domains: + - "main-proxy.develop.us-east-2.velo.la" # Default SleakOps domain + - "velo.la" # Your custom domain + - "www.velo.la" # Your www subdomain + ports: + - port: 80 + protocol: HTTP +``` + +4. **Apply the configuration** and wait for the deployment to update + + + + + +Ensure your Nginx configuration properly handles the custom domains: + +```nginx +server { + listen 80; + server_name velo.la www.velo.la main-proxy.develop.us-east-2.velo.la; + + location / { + # Your application configuration + root /usr/share/nginx/html; + index index.html index.htm; + } +} +``` + +If you're using a custom Nginx configuration file, make sure it includes all the domains you want to serve. + + + + + +For Cloudflare configuration, ensure you're using the correct setup: + +1. **CNAME Records:** + + - `velo.la` → `main-proxy.develop.us-east-2.velo.la` + - `www.velo.la` → `main-proxy.develop.us-east-2.velo.la` + +2. **SSL/TLS Settings:** + + - Set SSL/TLS encryption mode to **"Full"** or **"Full (strict)"** + - Ensure **"Always Use HTTPS"** is enabled if needed + +3. **Proxy Status:** + - You can keep the orange cloud (proxied) enabled + - Or disable it (gray cloud) for direct DNS resolution + + + + + +After configuring custom domains, test the setup: + +1. **Test direct access:** + +```bash +curl -v https://velo.la +curl -v https://www.velo.la +``` + +2. **Test with explicit Host header:** + +```bash +curl -v -H "Host: velo.la" https://main-proxy.develop.us-east-2.velo.la +``` + +3. **Test with different tools:** + +```bash +# Test with wget +wget --server-response --header="Host: velo.la" https://main-proxy.develop.us-east-2.velo.la + +# Test with different User-Agent +curl -v -H "Host: velo.la" -H "User-Agent: Mozilla/5.0" https://main-proxy.develop.us-east-2.velo.la +``` + +4. 
**Monitor service logs:** + + - Check application logs for incoming requests + - Monitor load balancer access logs if available + - Look for any SSL/TLS certificate issues + +5. **Validate DNS configuration:** + - Ensure CNAME records are properly configured + - Verify DNS propagation using tools like `dig` or `nslookup` + - Check for conflicting DNS records + +**Important Notes:** + +- Changes to load balancer configuration may require SleakOps support assistance +- Custom domain configuration should be consistent across all environments +- Always test configuration changes in development before applying to production + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/loki-grafana-connection-issues.mdx b/docs/troubleshooting/loki-grafana-connection-issues.mdx new file mode 100644 index 000000000..a176fd836 --- /dev/null +++ b/docs/troubleshooting/loki-grafana-connection-issues.mdx @@ -0,0 +1,159 @@ +--- +sidebar_position: 3 +title: "Loki and Grafana Connection Issues" +description: "Troubleshooting Loki pod configuration bugs affecting Grafana connectivity" +date: "2024-11-22" +category: "dependency" +tags: ["loki", "grafana", "monitoring", "troubleshooting", "pods"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Loki and Grafana Connection Issues + +**Date:** November 22, 2024 +**Category:** Dependency +**Tags:** Loki, Grafana, Monitoring, Troubleshooting, Pods + +## Problem Description + +**Context:** Users experience connectivity issues between Grafana and Loki logging service in SleakOps clusters, preventing proper log visualization and monitoring functionality. 
+
+**Observed Symptoms:**
+
+- Grafana cannot connect to Loki data source
+- Loki service appears unresponsive or unreachable
+- Log queries fail or time out in the Grafana interface
+- Monitoring dashboards show no log data
+
+**Relevant Configuration:**
+
+- Service: Loki logging stack
+- Component affected: `loki-read` pod
+- Interface: Grafana dashboard
+- Management tool: Lens Kubernetes IDE
+
+**Error Conditions:**
+
+- Error occurs after cluster updates or pod restarts
+- Problem persists until manual pod recreation
+- Affects log aggregation and monitoring capabilities
+- May impact multiple users accessing Grafana dashboards
+
+## Detailed Solution
+
+
+
+To immediately resolve the connection issue:
+
+1. **Open Lens Kubernetes IDE**
+2. **Navigate to your SleakOps cluster**
+3. **Go to Workloads → Pods**
+4. **Search for the `loki-read` pod**
+5. **Right-click on the pod and select "Delete"**
+6. **Wait for the pod to be automatically recreated**
+7. **Test Grafana connectivity**
+
+The pod will be automatically recreated by the deployment controller, which should resolve the configuration bug.
+
+
+
+
+
+After deleting the pod, verify it's properly recreated:
+
+```bash
+# Check pod status
+kubectl get pods -n monitoring | grep loki-read
+
+# Verify pod is running and ready
+kubectl describe pod <loki-read-pod-name> -n monitoring
+
+# Check pod logs for any errors
+kubectl logs <loki-read-pod-name> -n monitoring
+```
+
+The pod should show status `Running` with all containers ready (e.g., `1/1`).
+
+
+
+
+
+After the pod recreation:
+
+1. **Access Grafana dashboard**
+2. **Go to Configuration → Data Sources**
+3. **Find the Loki data source**
+4. **Click "Test" to verify connectivity**
+5. **Try running a simple log query**:
+   ```
+   {namespace="default"}
+   ```
+
+If successful, you should see log entries appearing in the query results. 
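If the Grafana test is inconclusive, you can also query Loki's HTTP API directly to rule out Grafana itself. The service name `loki-read` and port `3100` below are assumptions — adjust them to match your installation:

```bash
# Forward the Loki read service to your machine (service name/port assumed)
kubectl port-forward -n monitoring svc/loki-read 3100:3100 >/dev/null 2>&1 &
PF_PID=$!
sleep 2

# A healthy read path answers the readiness probe with "ready"
curl -s http://localhost:3100/ready || echo "Loki not reachable"

# Issue the same label query Grafana would send
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="default"}' || echo "query failed"

kill "$PF_PID" 2>/dev/null || true
```

A "ready" response combined with a failing query points at the Loki read path rather than at Grafana's data source configuration.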
+
+
+
+
+
+If you prefer using kubectl instead of Lens:
+
+```bash
+# List loki pods
+kubectl get pods -n monitoring | grep loki
+
+# Delete the loki-read pod
+kubectl delete pod <loki-read-pod-name> -n monitoring
+
+# Watch pod recreation
+kubectl get pods -n monitoring -w | grep loki-read
+```
+
+Replace `<loki-read-pod-name>` with the actual pod name from the first command.
+
+
+
+
+
+To monitor for this issue in the future:
+
+1. **Set up alerts for Loki pod restarts**
+2. **Monitor Grafana data source health**
+3. **Check pod logs regularly for configuration errors**
+
+```yaml
+# Example alert rule for Loki pod issues
+groups:
+  - name: loki-alerts
+    rules:
+      - alert: LokiPodDown
+        expr: up{job="loki-read"} == 0
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Loki read pod is down"
+```
+
+
+
+
+
+Escalate to SleakOps support if:
+
+- Pod recreation doesn't resolve the issue
+- Problem recurs frequently (more than once per day)
+- Multiple Loki components are affected
+- Grafana shows persistent connection errors after pod restart
+
+Include in your support request:
+
+- Pod logs before and after recreation
+- Grafana error messages
+- Cluster and namespace information
+
+
+
+---
+
+_This FAQ was automatically generated on November 22, 2024 based on a real user query._
diff --git a/docs/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx b/docs/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx
new file mode 100644
index 000000000..046a5d994
--- /dev/null
+++ b/docs/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx
@@ -0,0 +1,197 @@
+---
+sidebar_position: 3
+title: "Loki Log Explorer Dashboard Loading Issue"
+description: "Solution for Grafana Log Explorer dashboard stuck in loading state"
+date: "2025-02-12"
+category: "dependency"
+tags: ["loki", "grafana", "logs", "dashboard", "troubleshooting"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Loki Log Explorer Dashboard Loading Issue
+ 
+**Date:** February 12, 2025 +**Category:** Dependency +**Tags:** Loki, Grafana, Logs, Dashboard, Troubleshooting + +## Problem Description + +**Context:** Users experience issues with Grafana Log Explorer dashboard in SleakOps platform where logs fail to load and the interface remains stuck in a loading state indefinitely. + +**Observed Symptoms:** + +- Log Explorer dashboard shows continuous loading spinner +- Logs never appear in the Grafana interface +- Dashboard remains unresponsive +- Other Grafana dashboards may work normally + +**Relevant Configuration:** + +- Component: Loki logging system +- Interface: Grafana Log Explorer dashboard +- Affected pods: `loki-backend` and `loki-read` +- Platform: SleakOps Kubernetes environment + +**Error Conditions:** + +- Occurs specifically with log-related dashboards +- Happens when `loki-backend` pods restart without `loki-read` pods restarting +- Results in communication breakdown between Loki components + +## Detailed Solution + + + +This is a known issue with Loki where the `loki-read` pods lose their ability to communicate with `loki-backend` pods after the backend restarts. 
This creates a state where:
+
+- The `loki-backend` pods are running with new configurations
+- The `loki-read` pods are still trying to use old connection parameters
+- The communication channel between components is broken
+- Log queries cannot be processed, resulting in infinite loading
+
+This issue is being tracked in the official Loki GitHub repository:
+
+- [Issue #14384](https://github.com/grafana/loki/issues/14384#issuecomment-2612675359)
+- [Issue #15191](https://github.com/grafana/loki/issues/15191)
+
+
+
+
+The immediate solution is to restart the `loki-read` pods to re-establish communication with the backend:
+
+**Using kubectl:**
+
+```bash
+# Find the loki-read pods
+kubectl get pods -n <namespace> | grep loki-read
+
+# Restart the loki-read pods
+kubectl delete pod <loki-read-pod-name> -n <namespace>
+
+# Or restart all loki-read pods at once
+kubectl delete pods -l app=loki-read -n <namespace>
+```
+
+**Using SleakOps interface:**
+
+1. Navigate to your cluster management
+2. Go to **Workloads** → **Pods**
+3. Filter by `loki-read`
+4. Select the pods and choose **Restart**
+
+The pods will automatically restart and re-establish connection with the backend.
+
+
+
+
+After restarting the `loki-read` pods:
+
+1. **Wait 2-3 minutes** for pods to fully restart
+2. **Check pod status:**
+
+   ```bash
+   kubectl get pods -n <namespace> | grep loki
+   ```
+
+   All pods should show `Running` status
+
+3. **Test the Log Explorer:**
+
+   - Open Grafana dashboard
+   - Navigate to Log Explorer
+   - Try querying recent logs
+   - Verify logs are loading properly
+
+4. 
**Check pod logs if issues persist:**
+   ```bash
+   kubectl logs -f <loki-read-pod-name> -n <namespace>
+   ```
+
+
+
+
+While this is a known Loki issue being addressed upstream, you can:
+
+**Monitor for the issue:**
+
+- Set up alerts for when Log Explorer stops responding
+- Monitor Loki pod restart events
+- Create health checks for log ingestion
+
+**Temporary workaround automation:**
+
+```yaml
+# Example monitoring script
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: loki-health-check
+spec:
+  schedule: "*/10 * * * *" # Every 10 minutes
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          containers:
+            - name: health-check
+              image: curlimages/curl
+              command:
+                - /bin/sh
+                - -c
+                - |
+                  # Check if Loki is responding
+                  if ! curl -f http://loki-gateway/ready; then
+                    echo "Loki not responding, may need intervention"
+                  fi
+          restartPolicy: OnFailure
+```
+
+**Stay updated:**
+
+- Monitor the GitHub issues mentioned above for permanent fixes
+- Update Loki when patches become available
+- Consider using Loki's distributed mode for better resilience
+
+
+
+
+If restarting `loki-read` pods doesn't solve the issue:
+
+**1. Restart all Loki components:**
+
+```bash
+# Restart all Loki pods
+kubectl delete pods -l app.kubernetes.io/name=loki -n <namespace>
+```
+
+**2. Check Loki configuration:**
+
+```bash
+# Check Loki configmap
+kubectl get configmap loki-config -n <namespace> -o yaml
+```
+
+**3. Verify network connectivity:**
+
+```bash
+# Test connectivity between pods
+kubectl exec -it <loki-read-pod-name> -n <namespace> -- nslookup loki-backend
+```
+
+**4. 
Check resource constraints:**
+
+```bash
+# Check if pods are resource-constrained
+kubectl top pods -n <namespace> | grep loki
+```
+
+
+
+---
+
+_This FAQ was automatically generated on February 12, 2025 based on a real user query._
diff --git a/docs/troubleshooting/loki-read-pods-connection-issue.mdx b/docs/troubleshooting/loki-read-pods-connection-issue.mdx
new file mode 100644
index 000000000..38698f6a8
--- /dev/null
+++ b/docs/troubleshooting/loki-read-pods-connection-issue.mdx
@@ -0,0 +1,208 @@
+---
+sidebar_position: 15
+title: "Loki Read Pods Connection Issue"
+description: "Solution for Loki read pods not reconnecting to backend after restart"
+date: "2024-04-21"
+category: "dependency"
+tags: ["loki", "grafana", "monitoring", "pods", "connection", "troubleshooting"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Loki Read Pods Connection Issue
+
+**Date:** April 21, 2024
+**Category:** Dependency
+**Tags:** Loki, Grafana, Monitoring, Pods, Connection, Troubleshooting
+
+## Problem Description
+
+**Context:** The Loki monitoring stack experiences connectivity issues in which read pods fail to reconnect to the backend service after the backend pod restarts or becomes unavailable.
+ +**Observed Symptoms:** + +- Loki read pods (`loki-read`) cannot connect to Loki backend +- Connection errors persist even after backend pods are restored +- Grafana dashboards show data retrieval issues +- Log queries fail or return incomplete results + +**Relevant Configuration:** + +- Component: Loki read pods (`loki-read`) +- Backend service: `loki-backend` +- Platform: Kubernetes-based Loki deployment +- Connection behavior: Initial connection only, no automatic reconnection + +**Error Conditions:** + +- Occurs when `loki-backend` pods restart or crash +- Read pods only attempt connection at startup +- No automatic reconnection mechanism in current Loki version +- Requires manual intervention to restore connectivity + +## Detailed Solution + + + +To resolve the connection issue immediately: + +1. **Identify the affected pods:** + + ```bash + kubectl get pods -l app=loki-read -n monitoring + ``` + +2. **Restart the Loki read pods:** + + ```bash + kubectl delete pods -l app=loki-read -n monitoring + ``` + +3. **Verify pods are running:** + + ```bash + kubectl get pods -l app=loki-read -n monitoring -w + ``` + +4. **Check connectivity:** + ```bash + kubectl logs -l app=loki-read -n monitoring --tail=50 + ``` + +The pods will automatically reconnect to the backend upon restart. + + + + + +The issue occurs because: + +1. **Single connection attempt**: Loki read pods only attempt to connect to the backend during their initialization phase +2. **No reconnection logic**: Current Loki versions lack automatic reconnection mechanisms +3. **Backend dependency**: When `loki-backend` restarts, existing connections are lost +4. **Static connection**: Read pods maintain a static connection without health checks + +This is a known limitation in older Loki versions that has been addressed in newer releases. 
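Conceptually, the reconnection logic that newer Loki releases add boils down to retrying the backend connection with exponential backoff rather than attempting it only once at startup. The sketch below is a simplified model of that pattern, not Loki's actual implementation; until you run a version with this behavior, restarting the read pods performs the "retry" manually.

```python
import time

def connect_with_retry(connect, max_attempts=5, base_delay=0.05):
    """Retry a connection with exponential backoff, instead of the single
    startup attempt older loki-read pods performed."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Simulated backend that only accepts the third attempt:
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("backend not ready")
    return "connected"

print(connect_with_retry(flaky_connect))  # connected
```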
+ + + + + +The permanent fix involves upgrading to a Loki version that includes [PR #17063](https://github.com/grafana/loki/pull/17063): + +**What the PR addresses:** + +- Adds health check capabilities to Loki read pods +- Implements automatic reconnection logic +- Improves connection resilience + +**Implementation steps:** + +1. **Check current Loki version:** + + ```bash + kubectl get deployment loki-read -n monitoring -o jsonpath='{.spec.template.spec.containers[0].image}' + ``` + +2. **Update Helm values to include health checks:** + + ```yaml + loki: + read: + livenessProbe: + enabled: true + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + enabled: true + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 15 + periodSeconds: 5 + ``` + +3. **Upgrade using Helm:** + ```bash + helm upgrade loki grafana/loki -n monitoring -f values.yaml + ``` + + + + + +To prevent and quickly detect this issue: + +**Set up monitoring alerts:** + +```yaml +# Prometheus alert rule +groups: + - name: loki-connectivity + rules: + - alert: LokiReadPodsDisconnected + expr: up{job="loki-read"} == 0 + for: 2m + labels: + severity: warning + annotations: + summary: "Loki read pods are disconnected" + description: "Loki read pods have been disconnected for more than 2 minutes" +``` + +**Health check script:** + +```bash +#!/bin/bash +# Check Loki read pods connectivity +READ_PODS=$(kubectl get pods -l app=loki-read -n monitoring --no-headers | wc -l) +READY_PODS=$(kubectl get pods -l app=loki-read -n monitoring --no-headers | grep Running | wc -l) + +if [ "$READ_PODS" -ne "$READY_PODS" ]; then + echo "Warning: Not all Loki read pods are ready" + kubectl get pods -l app=loki-read -n monitoring +fi +``` + + + + + +If the problem persists after restarting pods: + +1. 
**Check backend pod status:** + + ```bash + kubectl get pods -l app=loki-backend -n monitoring + kubectl logs -l app=loki-backend -n monitoring --tail=100 + ``` + +2. **Verify service connectivity:** + + ```bash + kubectl get svc loki-backend -n monitoring + kubectl describe svc loki-backend -n monitoring + ``` + +3. **Test internal connectivity:** + + ```bash + kubectl run test-pod --rm -i --tty --image=busybox -- /bin/sh + # Inside the pod: + nslookup loki-backend.monitoring.svc.cluster.local + wget -qO- http://loki-backend.monitoring.svc.cluster.local:3100/ready + ``` + +4. **Check network policies:** + ```bash + kubectl get networkpolicies -n monitoring + ``` + + + +--- + +_This FAQ was automatically generated on April 21, 2024 based on a real user query._ diff --git a/docs/troubleshooting/monitoring-addons-pricing-guide.mdx b/docs/troubleshooting/monitoring-addons-pricing-guide.mdx new file mode 100644 index 000000000..11204a614 --- /dev/null +++ b/docs/troubleshooting/monitoring-addons-pricing-guide.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Monitoring Addons Pricing Guide" +description: "Complete pricing breakdown for Grafana, Loki, KubeCost and OpenTelemetry monitoring addons" +date: "2024-12-19" +category: "general" +tags: + [ + "pricing", + "monitoring", + "grafana", + "loki", + "kubecost", + "opentelemetry", + "addons", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Monitoring Addons Pricing Guide + +**Date:** December 19, 2024 +**Category:** General +**Tags:** Pricing, Monitoring, Grafana, Loki, KubeCost, OpenTelemetry, Addons + +## Problem Description + +**Context:** Users need to understand the pricing structure for monitoring and observability addons in SleakOps, including individual component costs and dependencies. 
+ +**Observed Symptoms:** + +- Need for cost estimation before implementing monitoring stack +- Uncertainty about individual addon pricing +- Questions about dependencies between monitoring components +- Confusion about what drives the actual costs + +**Relevant Configuration:** + +- Monitoring stack: Grafana + Loki + KubeCost + OpenTelemetry +- Log retention: 14 days (configurable) +- Infrastructure: Spot instances within existing cluster +- Database: RDS required for Grafana + +**Error Conditions:** + +- Budget planning without clear cost breakdown +- Potential over-provisioning due to lack of pricing transparency +- Difficulty in justifying monitoring investments + +## Detailed Solution + + + +**Current Monitoring Addon Prices (Approximate):** + +- **Grafana**: $20 USD/month +- **Loki**: $10 USD/month +- **KubeCost**: $10 USD/month +- **OpenTelemetry**: $10 USD/month (upcoming addon) + +**Important Notes:** + +- Loki requires Grafana to be installed first +- Total cost for complete monitoring stack: ~$40-50 USD/month +- Prices are approximate and may vary based on usage + + + + + +**Fixed Costs:** + +- **RDS Database**: Required for Grafana (main fixed cost component) + +**Variable Costs:** + +- **Compute Resources**: Run on spot instances within your existing cluster +- **Storage**: S3 buckets for log and metrics retention +- **Network**: Data transfer costs (minimal) + +**Cost Optimization:** + +- Monitoring addons share instances with Essential addons +- In many cases, no new instances are needed +- Spot instances significantly reduce compute costs + + + + + +**Required Dependencies:** + +1. **Grafana** (standalone) + + - Requires: RDS database + - Cost: $20 USD/month + +2. **Loki** (log aggregation) + + - Requires: Grafana must be installed first + - Additional cost: $10 USD/month + - Combined with Grafana: $30 USD/month + +3. **KubeCost** (cost monitoring) + + - Can be installed independently + - Cost: $10 USD/month + +4. 
**OpenTelemetry** (upcoming) + - Part of observability stack + - Estimated cost: $10 USD/month + + + + + +**SleakOps Retention Benefits:** + +- **Standard retention**: 14 days (included in base price) +- **Extended retention**: Up to 3 months with minimal cost increase +- **Storage location**: S3 buckets (cost-effective) +- **No performance impact**: Historical data doesn't affect cluster performance + +**Comparison with other solutions:** + +```yaml +# Traditional monitoring (expensive) +retention_days: 14 +storage_type: "cluster-local" +cost_increase: "linear with retention" + +# SleakOps approach (cost-effective) +retention_days: 90 +storage_type: "s3-bucket" +cost_increase: "minimal" +``` + + + + + +**For Budget Planning:** + +1. **Minimum monitoring setup**: + + - Grafana only: $20 USD/month + - Good for basic metrics visualization + +2. **Standard monitoring setup**: + + - Grafana + Loki: $30 USD/month + - Includes log aggregation and analysis + +3. **Complete monitoring setup**: + + - Grafana + Loki + KubeCost: $40 USD/month + - Full observability with cost tracking + +4. 
**Future-ready setup**: + - All addons + OpenTelemetry: $50 USD/month + - Complete observability and tracing + +**Cost Optimization Tips:** + +- Start with essential addons to share compute resources +- Monitor actual usage before adding more components +- Take advantage of S3 storage for long-term retention +- Consider seasonal usage patterns for cost planning + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx b/docs/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx new file mode 100644 index 000000000..3feff34a7 --- /dev/null +++ b/docs/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx @@ -0,0 +1,758 @@ +--- +sidebar_position: 3 +title: "Monitoring Alternatives: DataDog vs SleakOps Addons" +description: "Comparison between DataDog and SleakOps native monitoring addons for application metrics and telemetry" +date: "2024-03-25" +category: "dependency" +tags: + ["monitoring", "datadog", "grafana", "loki", "otel", "telemetry", "metrics"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Monitoring Alternatives: DataDog vs SleakOps Addons + +**Date:** March 25, 2024 +**Category:** Dependency +**Tags:** Monitoring, DataDog, Grafana, Loki, OTEL, Telemetry, Metrics + +## Problem Description + +**Context:** Users migrating from external monitoring solutions like DataDog need to understand the available monitoring options in SleakOps and how to implement application-level metrics collection for business KPIs and performance monitoring. 
+ +**Observed Symptoms:** + +- Loss of DataDog monitoring capabilities after migration to SleakOps +- Need for application-level metrics collection for business KPIs +- Uncertainty about available monitoring alternatives in SleakOps +- Questions about cost implications of different monitoring solutions + +**Relevant Configuration:** + +- Application uses OpenTelemetry for metrics export +- Previous DataDog configuration with API keys and collectors +- Spring Boot application with metrics management configuration +- Need for custom application tags and environment-specific metrics + +**Error Conditions:** + +- Missing monitoring infrastructure after platform migration +- Inability to track business performance metrics +- Lack of visibility into application performance + +## Detailed Solution + + + +SleakOps provides a comprehensive monitoring stack through native addons: + +**Loki (Log Management):** + +- Persists logs from all cluster components +- Includes both application logs and controller logs +- Provides centralized log aggregation and search + +**Grafana (Metrics and Visualization):** + +- Collects and persists CPU, Memory, Network, and I/O metrics +- Monitors all cluster components and applications +- Provides customizable dashboards and alerting + +**OpenTelemetry (APM - Beta):** + +- Application Performance Monitoring using open standards +- Currently in beta with expanding metric capabilities +- Compatible with existing OpenTelemetry instrumentation + + + + + +**DataDog Cost Structure:** + +- Charges per instance in the cluster +- Variable costs as cluster scales up/down +- Difficult to predict and control expenses + +**Alternative: New Relic** + +- Pricing based on users and data ingestion +- Free tier includes 100GB of data and 1 user +- More predictable cost structure +- Can be used free for small teams + +**Cost Comparison:** + +``` +DataDog: $15-23/host/month (variable with cluster size) +New Relic: $0-99/user/month (predictable, free tier 
available) +SleakOps Addons: Included in platform cost +``` + + + + + +If you decide to continue with DataDog, you can add it as a custom dependency: + +**1. Add DataDog Agent to Docker Image:** + +```dockerfile +# Add to your application Dockerfile +FROM your-base-image + +# Install DataDog agent +RUN curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh | bash + +# Copy your application +COPY . /app + +# Configure DataDog environment variables +ENV DD_API_KEY=${DATADOG_API_KEY} +ENV DD_SITE="datadoghq.com" +ENV DD_LOGS_ENABLED=true +ENV DD_APM_ENABLED=true +``` + +**2. Configure Application Properties:** + +```yaml +management: + metrics: + export: + datadog: + api-key: ${DATADOG_API_KEY} + application-key: ${DATADOG_APPLICATION_KEY} + uri: https://api.datadoghq.com + enabled: ${DATADOG_TOGGLE} + api-host: ${DATADOG_COLLECTOR_HOST} + port: ${DATADOG_COLLECTOR_PORT} + tags: + appId: ${spring.application.name} + env: ${DATADOG_ENVIRONMENT} + host: ${spring.cloud.client.hostname} + security: + enabled: false +``` + +**3. Set Environment Variables in SleakOps:** + +- `DATADOG_API_KEY`: Your DataDog API key +- `DATADOG_APPLICATION_KEY`: Your DataDog application key +- `DATADOG_TOGGLE`: Enable/disable DataDog (true/false) +- `DATADOG_ENVIRONMENT`: Environment name (prod, staging, dev) +- `DATADOG_COLLECTOR_HOST`: DataDog collector endpoint +- `DATADOG_COLLECTOR_PORT`: Collector port (usually 8125) + + + + + +Since your application already uses OpenTelemetry, you can easily integrate with SleakOps OTEL addon: + +**1. Enable OTEL Addon in SleakOps:** + +- Navigate to Cluster Settings → Addons +- Enable "OpenTelemetry (Beta)" +- Configure OTEL collector endpoint + +**2. Update Application Configuration:** + +```yaml +management: + metrics: + export: + otlp: + endpoint: http://otel-collector:4317 + protocol: grpc + headers: + authorization: Bearer ${OTEL_TOKEN} + tracing: + enabled: true + sampling: + probability: 1.0 +``` + +**3. 
Custom Metrics Configuration:** + +```java +@Component +public class BusinessMetrics { + private final MeterRegistry meterRegistry; + private final Counter orderCounter; + private final Timer processTimer; + + public BusinessMetrics(MeterRegistry meterRegistry) { + this.meterRegistry = meterRegistry; + this.orderCounter = Counter.builder("business.orders.total") + .description("Total orders processed") + .tag("env", "${ENVIRONMENT}") + .register(meterRegistry); + this.processTimer = Timer.builder("business.process.duration") + .description("Business process duration") + .register(meterRegistry); + } +} +``` + + + + + +| Feature | DataDog | SleakOps Addons | New Relic | +| ----------------------------- | ---------------- | ------------------- | ------------------ | +| **Cost Model** | Per instance | Included | Per user + data | +| **Free Tier** | 14-day trial | Included | 100GB + 1 user | +| **APM** | Full featured | Beta (expanding) | Full featured | +| **Log Management** | Yes | Yes (Loki) | Yes | +| **Infrastructure Monitoring** | Yes | Yes (Grafana) | Yes | +| **Custom Dashboards** | Yes | Yes (Grafana) | Yes | +| **Alerting** | Advanced | Basic (Grafana) | Advanced | +| **Real User Monitoring** | Yes | No | Yes | +| **Synthetic Monitoring** | Yes | No | Yes | +| **Database Monitoring** | Yes | Basic | Yes | +| **Security Monitoring** | Yes | No | Basic | +| **Team Collaboration** | Advanced | Basic | Advanced | +| **API Access** | Full REST API | Limited | Full REST API | +| **Mobile App** | Yes | No | Yes | +| **OTEL Support** | Yes | Native | Yes | + +**Recommendation Matrix:** + +- **Small teams/startups**: New Relic Free Tier or SleakOps Addons +- **Cost-conscious**: SleakOps Addons (included) +- **Enterprise needs**: DataDog or New Relic Paid +- **OTEL-first**: SleakOps Addons + OpenTelemetry + + + + + +**Phase 1: Parallel Running (1-2 weeks)** + +1. 
**Enable SleakOps monitoring addons:** + +```bash +# Enable required addons +sleakops addon enable grafana +sleakops addon enable loki +sleakops addon enable opentelemetry +``` + +2. **Configure dual export in applications:** + +```yaml +management: + metrics: + export: + # Keep existing DataDog export + datadog: + enabled: ${DATADOG_ENABLED:true} + api-key: ${DATADOG_API_KEY} + # Add OpenTelemetry export + otlp: + enabled: ${OTEL_ENABLED:true} + endpoint: ${OTEL_ENDPOINT:http://otel-collector:4317} +``` + +3. **Set up basic Grafana dashboards:** + +```json +{ + "dashboard": { + "title": "Application Metrics", + "panels": [ + { + "title": "Request Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total[5m])", + "legendFormat": "{{method}} {{status}}" + } + ] + }, + { + "title": "Response Time", + "type": "graph", + "targets": [ + { + "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", + "legendFormat": "95th percentile" + } + ] + } + ] + } +} +``` + +**Phase 2: Feature Parity (2-3 weeks)** + +1. **Recreate critical DataDog dashboards in Grafana:** + +```yaml +# Business KPI Dashboard +apiVersion: v1 +kind: ConfigMap +metadata: + name: business-dashboard +data: + dashboard.json: | + { + "dashboard": { + "title": "Business KPIs", + "panels": [ + { + "title": "Orders per Hour", + "type": "stat", + "targets": [ + { + "expr": "increase(business_orders_total[1h])", + "legendFormat": "Orders" + } + ] + }, + { + "title": "Revenue Tracking", + "type": "graph", + "targets": [ + { + "expr": "sum(business_revenue_total) by (product_type)", + "legendFormat": "{{product_type}}" + } + ] + } + ] + } + } +``` + +2. 
**Set up alerting rules:** + +```yaml +groups: + - name: application.rules + rules: + - alert: HighErrorRate + expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1 + for: 2m + labels: + severity: warning + annotations: + summary: "High error rate detected" + description: "Error rate is {{ $value }} errors per second" + + - alert: HighResponseTime + expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1 + for: 5m + labels: + severity: critical + annotations: + summary: "High response time detected" + description: "95th percentile response time is {{ $value }}s" +``` + +3. **Configure notification channels:** + +```yaml +# Slack notification +apiVersion: v1 +kind: Secret +metadata: + name: grafana-slack-config +data: + notifications.yaml: | + notifiers: + - name: slack + type: slack + uid: slack + settings: + url: ${SLACK_WEBHOOK_URL} + channel: "#alerts" + title: "SleakOps Alert" + text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}" +``` + +**Phase 3: Full Migration (1 week)** + +1. **Disable DataDog exports:** + +```yaml +management: + metrics: + export: + datadog: + enabled: false + otlp: + enabled: true +``` + +2. **Remove DataDog dependencies:** + +```dockerfile +# Remove DataDog agent installation +# FROM your-base-image +# RUN curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh | bash # REMOVE THIS LINE +``` + +3. **Update monitoring documentation and runbooks** + + + + + +**1. Spring Boot Actuator + Micrometer Setup:** + +```java +@Configuration +@EnableConfigurationProperties(MetricsProperties.class) +public class MetricsConfig { + + @Bean + public MeterRegistryCustomizer metricsCommonTags() { + return registry -> registry.config() + .commonTags("application", "your-app-name") + .commonTags("environment", System.getenv("ENVIRONMENT")); + } + + @Bean + public TimedAspect timedAspect(MeterRegistry registry) { + return new TimedAspect(registry); + } +} +``` + +**2. 
Business Metrics Service:**
+
+```java
+@Service
+public class BusinessMetricsService {
+    private final MeterRegistry meterRegistry;
+    private final UserSessionService userSessionService;
+
+    public BusinessMetricsService(MeterRegistry meterRegistry,
+                                  UserSessionService userSessionService) {
+        this.meterRegistry = meterRegistry;
+        this.userSessionService = userSessionService;
+
+        // Gauges sample a value function on every scrape
+        Gauge.builder("business.users.active", this, BusinessMetricsService::getActiveUserCount)
+            .description("Currently active users")
+            .register(meterRegistry);
+    }
+
+    public void recordOrder(String productType, BigDecimal amount) {
+        // Counters with dynamic tags are resolved through the registry:
+        // each distinct tag combination becomes its own time series
+        Counter.builder("business.orders")
+            .description("Total orders processed")
+            .tags("product_type", productType,
+                  "amount_range", getAmountRange(amount))
+            .register(meterRegistry)
+            .increment();
+
+        Counter.builder("business.revenue")
+            .description("Total revenue in cents")
+            .baseUnit("cents")
+            .tags("product_type", productType)
+            .register(meterRegistry)
+            .increment(amount.multiply(BigDecimal.valueOf(100)).doubleValue());
+    }
+
+    @Timed(value = "business.checkout.duration", description = "Checkout process timer")
+    public void processCheckout(CheckoutRequest request) {
+        // Business logic here
+        recordOrder(request.getProductType(), request.getAmount());
+    }
+
+    private String getAmountRange(BigDecimal amount) {
+        if (amount.compareTo(BigDecimal.valueOf(50)) < 0) return "0-50";
+        if (amount.compareTo(BigDecimal.valueOf(200)) < 0) return "50-200";
+        return "200+";
+    }
+
+    private double getActiveUserCount() {
+        // Delegates to the application's own session tracking
+        return userSessionService.getActiveUserCount();
+    }
+}
+```
+
+**3. 
Custom Metrics Controller:**
+
+```java
+@RestController
+@RequestMapping("/api/metrics")
+public class MetricsController {
+    private final BusinessMetricsService metricsService;
+    private final MeterRegistry meterRegistry;
+
+    public MetricsController(BusinessMetricsService metricsService,
+                             MeterRegistry meterRegistry) {
+        this.metricsService = metricsService;
+        this.meterRegistry = meterRegistry;
+    }
+
+    @PostMapping("/order")
+    public ResponseEntity<String> recordOrder(@RequestBody OrderRequest request) {
+        metricsService.recordOrder(request.getProductType(), request.getAmount());
+        return ResponseEntity.ok("Metric recorded");
+    }
+
+    @GetMapping("/custom")
+    public ResponseEntity<Map<String, Object>> getCustomMetrics() {
+        Map<String, Object> metrics = new HashMap<>();
+
+        // Get current metric values
+        Counter orderCounter = meterRegistry.find("business.orders").counter();
+        if (orderCounter != null) {
+            metrics.put("total_orders", orderCounter.count());
+        }
+
+        return ResponseEntity.ok(metrics);
+    }
+}
+```
+
+**4. Prometheus Metrics Exposition:**
+
+```yaml
+# Application properties
+management:
+  endpoints:
+    web:
+      exposure:
+        include: health,info,metrics,prometheus
+  endpoint:
+    metrics:
+      enabled: true
+    prometheus:
+      enabled: true
+  metrics:
+    export:
+      prometheus:
+        enabled: true
+    distribution:
+      percentiles-histogram:
+        http.server.requests: true
+        business.checkout.duration: true
+      percentiles:
+        http.server.requests: 0.5, 0.95, 0.99
+        business.checkout.duration: 0.5, 0.95, 0.99
+```
+
+
+
+
+**1. Missing Metrics in Grafana:**
+
+```bash
+# Check if application metrics endpoint is accessible
+kubectl port-forward deployment/your-app 8080:8080
+curl http://localhost:8080/actuator/prometheus
+
+# Verify Prometheus is scraping your application
+kubectl port-forward svc/prometheus 9090:9090
+# Open http://localhost:9090/targets and verify your app is listed and UP
+
+# Check Prometheus configuration
+kubectl get configmap prometheus-config -o yaml
+```
+
+**2. 
OTEL Collector Issues:** + +```bash +# Check OTEL collector logs +kubectl logs deployment/otel-collector -n monitoring + +# Verify OTEL collector configuration +kubectl get configmap otel-collector-config -o yaml + +# Test OTEL endpoint connectivity +kubectl run test-pod --image=curlimages/curl --rm -it -- \ + curl -X POST http://otel-collector:4317/v1/metrics \ + -H "Content-Type: application/x-protobuf" \ + --data-binary @/dev/null +``` + +**3. Grafana Dashboard Issues:** + +```bash +# Check Grafana logs +kubectl logs deployment/grafana -n monitoring + +# Verify data source configuration +kubectl exec -it deployment/grafana -n monitoring -- \ + grafana-cli admin data-sources list + +# Test Prometheus connectivity from Grafana +kubectl exec -it deployment/grafana -n monitoring -- \ + curl http://prometheus:9090/api/v1/query?query=up +``` + +**4. High Cardinality Issues:** + +```java +// BAD: High cardinality metric +Counter.builder("http.requests") + .tag("user_id", userId) // This creates one metric per user! + .register(meterRegistry); + +// GOOD: Low cardinality metric +Counter.builder("http.requests") + .tag("endpoint", "/api/users") + .tag("method", "GET") + .tag("status", "200") + .register(meterRegistry); +``` + +**5. Memory Issues with Metrics:** + +```yaml +# Optimize metrics collection +management: + metrics: + export: + prometheus: + enabled: true + step: 30s # Reduce frequency if needed + distribution: + minimum-expected-value: 100ms + maximum-expected-value: 10s + expiry: 5m # Reduce memory usage + buffer-length: 3 # Reduce buffer size +``` + + + + + +**1. 
SleakOps Addons Optimization:** + +```yaml +# Optimize Grafana resource usage +grafana: + resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" + +# Optimize Loki retention +loki: + retention_period: "14d" # Reduce from default 30d + compactor: + working_directory: /tmp/compactor + shared_store: s3 + retention_enabled: true + +# Optimize Prometheus storage +prometheus: + retention: "15d" # Reduce from default 30d + storage: + tsdb: + retention.size: "10GB" +``` + +**2. Metric Sampling Strategies:** + +```java +@Configuration +public class MetricsSamplingConfig { + + @Bean + public MeterFilter samplingFilter() { + return MeterFilter.maximumAllowableMetrics(1000); + } + + @Bean + public MeterFilter highFrequencyFilter() { + // Sample high-frequency metrics + return MeterFilter.deny(id -> { + String name = id.getName(); + return name.contains("high.frequency") && + Math.random() > 0.1; // Keep only 10% of samples + }); + } +} +``` + +**3. DataDog Cost Reduction:** + +```dockerfile +# Use DataDog Agent with minimal configuration +FROM datadog/agent:latest +ENV DD_APM_ENABLED=false # Disable if not needed +ENV DD_PROCESS_AGENT_ENABLED=false # Disable process monitoring +ENV DD_LOGS_ENABLED=false # Use Loki instead +ENV DD_KUBERNETES_COLLECT_METADATA_TAGS=false +ENV DD_KUBERNETES_METADATA_TAG_UPDATE_FREQ=60 +``` + +**4. New Relic Optimization:** + +```yaml +# Use New Relic with data sampling +newrelic: + app_name: "Your App" + license_key: "${NEW_RELIC_LICENSE_KEY}" + distributed_tracing: + enabled: true + transaction_tracer: + enabled: true + transaction_threshold: 500ms # Only trace slow transactions + error_collector: + enabled: true + ignore_status_codes: "400-404" # Ignore client errors +``` + +**5. 
Monitoring Cost Calculator:** + +```python +#!/usr/bin/env python3 +""" +Calculate monitoring costs for different solutions +""" + +def calculate_datadog_cost(hosts, months=12): + """DataDog pricing: $15-23 per host per month""" + base_cost = 15 + return hosts * base_cost * months + +def calculate_newrelic_cost(users, data_gb_per_month, months=12): + """New Relic pricing: Free tier 100GB + 1 user, then $99/user/month""" + if users <= 1 and data_gb_per_month <= 100: + return 0 + + user_cost = max(0, users - 1) * 99 * months + data_overage = max(0, data_gb_per_month - 100) * 0.25 * months + return user_cost + data_overage + +def calculate_sleakops_cost(): + """SleakOps addons are included in platform cost""" + return 0 + +# Example calculation +hosts = 5 +users = 3 +data_gb_month = 150 +months = 12 + +print(f"DataDog cost (12 months): ${calculate_datadog_cost(hosts, months)}") +print(f"New Relic cost (12 months): ${calculate_newrelic_cost(users, data_gb_month, months)}") +print(f"SleakOps cost (12 months): ${calculate_sleakops_cost()}") +``` + + + +--- + +_This FAQ was automatically generated on March 25, 2024 based on a real user query._ diff --git a/docs/troubleshooting/monitoring-metrics-persistence-node-failures.mdx b/docs/troubleshooting/monitoring-metrics-persistence-node-failures.mdx new file mode 100644 index 000000000..9bb9f3953 --- /dev/null +++ b/docs/troubleshooting/monitoring-metrics-persistence-node-failures.mdx @@ -0,0 +1,216 @@ +--- +sidebar_position: 3 +title: "Metrics Loss During Node Failures in Kubernetes Clusters" +description: "Understanding and preventing metrics loss when nodes fail or are replaced in Kubernetes clusters" +date: "2024-06-26" +category: "cluster" +tags: ["monitoring", "metrics", "node-failure", "prometheus", "persistence"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Metrics Loss During Node Failures in Kubernetes Clusters + +**Date:** June 26, 2024 +**Category:** 
Cluster +**Tags:** Monitoring, Metrics, Node Failure, Prometheus, Persistence + +## Problem Description + +**Context:** Users experience loss of monitoring metrics when Kubernetes nodes fail or are replaced during deployments, affecting historical data availability for analysis and troubleshooting. + +**Observed Symptoms:** + +- Metrics disappear when nodes crash or are terminated +- Historical monitoring data becomes unavailable +- Gaps in metric continuity during node replacements +- Loss of performance data during deployments + +**Relevant Configuration:** + +- Metrics retention: 2 hours local storage before S3 persistence +- Instance types: On-demand instances (less prone to failures) +- Storage backend: S3 for long-term metric storage +- Monitoring stack: Prometheus-based metrics collection + +**Error Conditions:** + +- Node failures during the 2-hour local retention window +- Deployments that replace nodes before metrics are persisted +- Unexpected node terminations due to infrastructure issues +- Network issues preventing metric persistence to S3 + +## Detailed Solution + + + +The current metrics architecture works as follows: + +1. **Local Storage Phase**: Metrics are stored locally on each node for 2 hours +2. **Persistence Phase**: After 2 hours, metrics are automatically persisted to S3 +3. **Risk Window**: If a node fails within the 2-hour window, metrics are lost + +This design optimizes for: + +- Reduced network traffic costs +- Following official tool recommendations +- Balancing performance with storage costs + +```yaml +# Current configuration example +prometheus: + retention: + local: "2h" + remote_write: + interval: "2h" + destination: "s3://metrics-bucket" +``` + + + + + +During deployments, metrics loss can occur when: + +1. **Rolling Updates**: Old nodes are terminated before metrics are persisted +2. **Node Replacement**: New nodes replace old ones within the 2-hour window +3. 
**Scaling Operations**: Nodes are removed during scale-down operations + +**Deployment vs. Node Failure Scenarios:** + +- **Planned Deployments**: Metrics may be lost if deployment occurs within 2-hour window +- **Unplanned Failures**: Less common with on-demand instances but still possible +- **Infrastructure Issues**: Network problems, AWS service disruptions + + + + + +While waiting for platform improvements, consider these approaches: + +**1. Timing Deployments** + +```bash +# Check last metric persistence time +kubectl get configmap prometheus-config -o yaml | grep last_persist + +# Wait for next persistence cycle before deploying +echo "Waiting for metrics persistence..." +sleep 7200 # 2 hours +``` + +**2. Manual Metric Backup** + +```bash +# Export current metrics before deployment +kubectl port-forward svc/prometheus 9090:9090 +curl -G 'http://localhost:9090/api/v1/query_range' \ + --data-urlencode 'query=up' \ + --data-urlencode 'start=2024-06-26T00:00:00Z' \ + --data-urlencode 'end=2024-06-26T23:59:59Z' \ + --data-urlencode 'step=60s' > metrics_backup.json +``` + +**3. Monitoring Deployment Impact** + +```bash +# Monitor node replacement during deployment +kubectl get events --field-selector reason=NodeReady -w +``` + + + + + +The SleakOps team is actively working on solutions to prevent metrics loss: + +**Planned Improvements:** + +1. **Persistent Volume Claims**: Store metrics in persistent storage +2. **Faster Persistence**: Reduce the 2-hour window to minimize risk +3. **Graceful Node Shutdown**: Ensure metrics are saved before node termination +4. **Redundant Storage**: Multiple copies of metrics across nodes + +**Timeline:** + +- These improvements are on the product roadmap +- Updates will be communicated as they become available +- No specific ETA provided yet + + + + + +**1. 
Deployment Scheduling** + +- Schedule deployments after metric persistence cycles +- Monitor metric persistence status before deployments +- Use maintenance windows for critical deployments + +**2. Monitoring Setup** + +```yaml +# Add alerts for metric persistence issues +groups: + - name: metrics.rules + rules: + - alert: MetricsPersistenceDelay + expr: time() - prometheus_tsdb_last_successful_snapshot_timestamp > 7200 + for: 5m + annotations: + summary: "Metrics persistence is delayed" +``` + +**3. Documentation** + +- Document deployment procedures that account for metrics +- Train team members on metric persistence windows +- Create runbooks for metric recovery procedures + + + + + +For critical environments requiring zero metric loss: + +**1. External Monitoring** + +- Use external monitoring services (DataDog, New Relic) +- Implement push-based metrics to external systems +- Set up redundant monitoring infrastructure + +**2. Custom Persistence** + +```yaml +# Custom sidecar for immediate persistence +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: metrics-backup +spec: + template: + spec: + containers: + - name: backup + image: prom/prometheus:latest + command: + - /bin/sh + - -c + - | + while true; do + promtool query instant 'up' | aws s3 cp - s3://backup-metrics/$(date +%s).json + sleep 300 + done +``` + +**3. 
High-Availability Setup** + +- Deploy Prometheus in HA mode +- Use Thanos for long-term storage +- Implement cross-region metric replication + + + +--- + +_This FAQ was automatically generated on 12/19/2024 based on a real user query._ diff --git a/docs/troubleshooting/nodepool-memory-limit-troubleshooting.mdx b/docs/troubleshooting/nodepool-memory-limit-troubleshooting.mdx new file mode 100644 index 000000000..ef29332c7 --- /dev/null +++ b/docs/troubleshooting/nodepool-memory-limit-troubleshooting.mdx @@ -0,0 +1,211 @@ +--- +sidebar_position: 3 +title: "Nodepool Memory Limit Issues" +description: "Troubleshooting and resolving nodepool memory capacity problems" +date: "2025-02-27" +category: "cluster" +tags: ["nodepool", "memory", "capacity", "scaling", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Nodepool Memory Limit Issues + +**Date:** February 27, 2025 +**Category:** Cluster +**Tags:** Nodepool, Memory, Capacity, Scaling, Troubleshooting + +## Problem Description + +**Context:** Production pods failing to schedule due to nodepool memory limits being reached after adding new workloads during migration. + +**Observed Symptoms:** + +- Pods stuck in "Pending" state +- Pod scheduling failures due to insufficient memory +- Nodepool at maximum provisioned memory capacity +- Critical production workloads affected (cron jobs, applications) + +**Relevant Configuration:** + +- Nodepool type: `spot-amd64` +- Original memory limit: 120 GB +- Temporary increase: 160 GB +- New workloads: Redash and WordPress deployments + +**Error Conditions:** + +- Occurs when total pod memory requests exceed nodepool capacity +- Triggered after adding new deployments to existing nodepool +- Affects pod scheduling and application availability +- Problem escalates during peak usage or pod restarts + +## Detailed Solution + + + +For urgent situations, you can temporarily increase the nodepool memory limit: + +1. 
**Access SleakOps Console** +2. Navigate to **Cluster Management** → **Nodepools** +3. Select the affected nodepool (e.g., `spot-amd64`) +4. Go to **Configuration** → **Resources** +5. Increase the **Memory Limit** (e.g., from 120GB to 200GB) +6. Click **Apply Changes** + +**Note:** This provides immediate relief but should be followed by proper capacity planning. + + + + + +To understand your current memory utilization: + +```bash +# Check node memory usage +kubectl top nodes + +# Check pod memory requests and limits +kubectl describe nodes | grep -A 5 "Allocated resources" + +# List pods with memory requests +kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,MEMORY_REQUEST:.spec.containers[*].resources.requests.memory +``` + +This helps identify which workloads are consuming the most memory. + + + + + +For better resource management, create separate nodepools for different workload types: + +1. **In SleakOps Console:** + - Go to **Cluster Management** → **Nodepools** + - Click **Create New Nodepool** + - Configure specifications: + +```yaml +# Example: Dedicated nodepool for data workloads +name: "data-workloads" +instance_type: "m5.xlarge" +min_size: 1 +max_size: 5 +desired_size: 2 +memory_limit: "64GB" +labels: + workload-type: "data" +taints: + - key: "workload-type" + value: "data" + effect: "NoSchedule" +``` + +2. **Update deployments to use the new nodepool:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: redash-deployment +spec: + template: + spec: + nodeSelector: + workload-type: "data" + tolerations: + - key: "workload-type" + operator: "Equal" + value: "data" + effect: "NoSchedule" +``` + + + + + +Implement monitoring to prevent future capacity issues: + +1. **Enable cluster monitoring** in SleakOps +2. **Set up alerts** for nodepool memory usage: + + - Warning at 70% capacity + - Critical at 85% capacity + +3. 
**Create dashboard** to track: + - Memory utilization per nodepool + - Pod scheduling failures + - Resource requests vs. limits + +```yaml +# Example alert configuration +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: nodepool-memory-alerts +spec: + groups: + - name: nodepool.rules + rules: + - alert: NodepoolMemoryHigh + expr: (sum(kube_pod_container_resource_requests{resource="memory"}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node)) > 0.85 + for: 5m + annotations: + summary: "Nodepool memory usage is high" +``` + + + + + +**Recommended practices for nodepool capacity management:** + +1. **Reserve 20-30% buffer** for unexpected load spikes +2. **Separate critical and non-critical workloads** into different nodepools +3. **Use resource requests and limits** appropriately: + +```yaml +resources: + requests: + memory: "512Mi" # What the pod needs + cpu: "250m" + limits: + memory: "1Gi" # Maximum the pod can use + cpu: "500m" +``` + +4. **Regular capacity reviews** - monthly assessment of usage patterns +5. **Implement horizontal pod autoscaling** for variable workloads +6. **Use spot instances wisely** - ensure critical workloads have fallback options + + + + + +For deployments created outside of SleakOps platform: + +1. **Document external deployments:** + + ```bash + # List all deployments not managed by SleakOps + kubectl get deployments --all-namespaces -o yaml | grep -v "sleakops.com" + ``` + +2. **Import into SleakOps** (if possible): + + - Use SleakOps import functionality + - Recreate deployments through SleakOps interface + +3. **Create dedicated nodepool** for external workloads: + + - Label appropriately for identification + - Set appropriate resource limits + - Monitor separately from platform-managed workloads + +4. 
**Establish governance** for future deployments to prevent similar issues + + + +--- + +_This FAQ was automatically generated on February 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx b/docs/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx new file mode 100644 index 000000000..8e35d8eef --- /dev/null +++ b/docs/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx @@ -0,0 +1,218 @@ +--- +sidebar_position: 3 +title: "Nodepool OnDemand Configuration Causing Pod Scaling Issues" +description: "Pods stuck in pending state after changing nodepool from default to OnDemand configuration" +date: "2024-12-19" +category: "cluster" +tags: ["nodepool", "ondemand", "autoscaling", "pending", "scaling"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Nodepool OnDemand Configuration Causing Pod Scaling Issues + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Nodepool, OnDemand, Autoscaling, Pending, Scaling + +## Problem Description + +**Context:** User changed nodepool configuration from default (no value) to OnDemand at SleakOps team's request. After this change, autoscaling functionality is not working properly. 
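Before working through the diagnosis steps below, it helps to quantify the symptom: how many pods are stuck in Pending, and what reason the scheduler reports for each. A minimal sketch of that check (the function names and inline sample data here are illustrative; in practice you would feed it the output of `kubectl get pods -A -o json`):

```python
import json
from collections import Counter

def pending_reasons(pods_json: str) -> Counter:
    """Count Pending pods, grouped by the scheduler's reported reason."""
    reasons = Counter()
    for pod in json.loads(pods_json)["items"]:
        if pod["status"].get("phase") != "Pending":
            continue
        # The PodScheduled condition carries the reason (e.g. "Unschedulable")
        reason = next(
            (c.get("reason", "Unknown")
             for c in pod["status"].get("conditions", [])
             if c.get("type") == "PodScheduled" and c.get("status") == "False"),
            "Unknown",
        )
        reasons[reason] += 1
    return reasons

# Illustrative sample data; replace with real `kubectl get pods -A -o json` output
sample = json.dumps({"items": [
    {"status": {"phase": "Pending",
                "conditions": [{"type": "PodScheduled", "status": "False",
                                "reason": "Unschedulable"}]}},
    {"status": {"phase": "Running"}},
]})
print(pending_reasons(sample))
```

A spike in `Unschedulable` counts right after the nodepool change points at capacity or scheduling-constraint issues rather than application problems.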
+ +**Observed Symptoms:** + +- Pods created by autoscaling remain in pending state +- Only minimum required pods are active +- New pods fail to scale and show errors +- Autoscaling was working before the nodepool configuration change + +**Relevant Configuration:** + +- Nodepool type: Changed from default (no value) to OnDemand +- Autoscaling: Enabled but not functioning properly +- Minimum pods: Working correctly +- Scaling pods: Stuck in pending state + +**Error Conditions:** + +- Error occurs when autoscaler tries to create new pods +- Pods remain in pending state indefinitely +- Issue started after nodepool configuration change +- Only affects scaled pods, not minimum required pods + +## Detailed Solution + + + +When changing from default nodepool configuration to OnDemand, several aspects can affect pod scheduling: + +1. **Instance provisioning**: OnDemand instances have different provisioning characteristics +2. **Resource allocation**: OnDemand configuration may have different resource limits +3. **Scheduling constraints**: New configuration might introduce scheduling restrictions +4. **Capacity planning**: OnDemand instances may have different availability patterns + + + + + +To identify why pods are stuck in pending state: + +```bash +# Check pod status and events +kubectl get pods -o wide +kubectl describe pod + +# Check node capacity and resources +kubectl get nodes -o wide +kubectl describe nodes + +# Check cluster autoscaler logs +kubectl logs -n kube-system deployment/cluster-autoscaler +``` + +Common reasons for pending pods after nodepool changes: + +- Insufficient node capacity +- Resource constraints (CPU/Memory) +- Node selector mismatches +- Taints and tolerations issues + + + + + +Check your current nodepool configuration in SleakOps: + +1. **Navigate to Cluster Management** +2. **Select your cluster** +3. **Go to Nodepools section** +4. 
**Verify OnDemand configuration**: + - Instance types are appropriate + - Min/Max scaling limits are correct + - Availability zones are properly configured + - Resource allocations match your workload needs + +```yaml +# Example proper OnDemand nodepool configuration +nodepool: + name: "ondemand-nodepool" + instance_type: ["t3.medium", "t3.large"] + capacity_type: "ON_DEMAND" + min_size: 2 + max_size: 10 + desired_size: 3 + availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"] +``` + + + + + +Verify that the cluster autoscaler is properly configured for OnDemand instances: + +```bash +# Check autoscaler configuration +kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml + +# Check autoscaler deployment +kubectl get deployment cluster-autoscaler -n kube-system -o yaml +``` + +Ensure the autoscaler has: + +- Proper permissions for OnDemand instance management +- Correct nodepool discovery configuration +- Appropriate scaling policies + + + + + +Check if your pods have resource requests that match the OnDemand nodepool capacity: + +```yaml +# Example pod with proper resource requests +apiVersion: v1 +kind: Pod +spec: + containers: + - name: app + image: nginx + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + +Common issues: + +- Resource requests too high for available node capacity +- Missing resource requests causing scheduling issues +- Limits set too restrictively + + + + + +1. **Scale down and up manually**: + + ```bash + kubectl scale deployment --replicas=1 + kubectl scale deployment --replicas= + ``` + +2. **Force node scaling**: + + - Temporarily increase nodepool desired capacity in SleakOps + - Wait for nodes to provision + - Check if pods can schedule on new nodes + +3. **Check for node taints**: + + ```bash + kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints + ``` + +4. 
**Verify pod tolerations** if nodes have taints + + + + + +For optimal OnDemand nodepool performance: + +1. **Mixed instance types**: Use multiple instance types for better availability +2. **Proper resource planning**: Ensure pod requests align with node capacity +3. **Gradual scaling**: Configure appropriate scaling policies +4. **Monitoring**: Set up alerts for pending pods and scaling events + +```yaml +# Recommended HPA configuration +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/opensearch-iam-roles-configuration.mdx b/docs/troubleshooting/opensearch-iam-roles-configuration.mdx new file mode 100644 index 000000000..720c006aa --- /dev/null +++ b/docs/troubleshooting/opensearch-iam-roles-configuration.mdx @@ -0,0 +1,583 @@ +--- +sidebar_position: 3 +title: "OpenSearch IAM Roles Configuration" +description: "How to configure IAM roles for OpenSearch in SleakOps" +date: "2025-02-05" +category: "dependency" +tags: ["opensearch", "iam", "aws", "roles", "permissions"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# OpenSearch IAM Roles Configuration + +**Date:** February 5, 2025 +**Category:** Dependency +**Tags:** OpenSearch, IAM, AWS, Roles, Permissions + +## Problem Description + +**Context:** User is setting up an OpenSearch service in SleakOps and has questions about whether IAM roles are required for proper access configuration. 
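As background for the sections below: IAM-based access to OpenSearch ultimately means signing every HTTP request with AWS Signature Version 4. A minimal, illustrative sketch of the SigV4 signing-key derivation (for understanding only; real applications should use a signing library such as `requests-aws4auth`):

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the AWS SigV4 signing key via the documented HMAC-SHA256 chain."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date_stamp.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()

# The key is scoped to one day, one region, and one service ("es" for OpenSearch);
# the secret value below is a placeholder, not a real credential
key = sigv4_signing_key("EXAMPLE-SECRET", "20250205", "us-east-1", "es")
print(len(key))  # 32-byte HMAC-SHA256 output
```

Because the key is scoped per day, region, and service, a request signed for the wrong region or with stale credentials fails with a signature error even when the IAM policy itself is correct.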
+ +**Observed Symptoms:** + +- Uncertainty about IAM role requirements for OpenSearch +- Questions about access configuration after service creation +- Need to understand permission structure for OpenSearch in AWS + +**Relevant Configuration:** + +- Service type: OpenSearch +- Instance type: t3.medium.search +- Platform: AWS +- Access method: To be determined + +**Error Conditions:** + +- Potential access issues if IAM roles are not properly configured +- Service may be created but inaccessible without proper permissions +- Applications may fail to connect to OpenSearch without correct IAM setup + +## Detailed Solution + + + +Yes, OpenSearch in AWS typically requires IAM roles for secure access. The specific requirements depend on your access method: + +**For VPC-based access (recommended):** + +- IAM roles are required for applications to access the OpenSearch cluster +- Fine-grained access control can be configured + +**For public access:** + +- IAM policies can be used alongside IP-based restrictions +- Less secure but simpler for development environments + + + + + +SleakOps automatically configures basic IAM roles for OpenSearch: + +1. **Service Role**: Created automatically for the OpenSearch service itself +2. **Access Role**: Generated for applications within your cluster to access OpenSearch +3. 
**Master User**: Configured with administrative permissions + +**What's configured automatically:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT-ID:role/sleakops-opensearch-access-role" + }, + "Action": "es:*", + "Resource": "arn:aws:es:region:ACCOUNT-ID:domain/your-domain/*" + } + ] +} +``` + +**Automatic Configuration includes:** + +- **Read/Write permissions** for applications in your cluster +- **Index creation and management** capabilities +- **Search and aggregation** permissions +- **Basic monitoring** access + + + + + +For advanced use cases, you may need to configure custom IAM roles: + +**1. Create Custom Application Role:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "es:ESHttpGet", + "es:ESHttpPost", + "es:ESHttpPut", + "es:ESHttpDelete", + "es:ESHttpHead" + ], + "Resource": [ + "arn:aws:es:region:account:domain/your-domain/*", + "arn:aws:es:region:account:domain/your-domain" + ] + } + ] +} +``` + +**2. Configure Trust Relationship:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "ec2.amazonaws.com" + }, + "Action": "sts:AssumeRole" + }, + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT-ID:role/sleakops-cluster-role" + }, + "Action": "sts:AssumeRole" + } + ] +} +``` + +**3. Apply Role to OpenSearch Domain:** + +```bash +# Using AWS CLI to update domain access policy +aws opensearch update-domain-config \ + --domain-name your-opensearch-domain \ + --access-policies file://access-policy.json +``` + + + + + +After configuring IAM roles, verify access works correctly: + +**1. Test Connection from Application Pod:** + +```bash +# From within your application pod +curl -X GET "https://your-opensearch-endpoint/_cluster/health" \ + --aws-sigv4 "aws:amz:region:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" +``` + +**2. 
Test Index Operations:** + +```bash +# Create a test index +curl -X PUT "https://your-opensearch-endpoint/test-index" \ + --aws-sigv4 "aws:amz:region:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \ + -H 'Content-Type: application/json' \ + -d '{"settings": {"number_of_shards": 1}}' + +# Insert test document +curl -X POST "https://your-opensearch-endpoint/test-index/_doc" \ + --aws-sigv4 "aws:amz:region:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \ + -H 'Content-Type: application/json' \ + -d '{"message": "Hello OpenSearch", "timestamp": "'$(date -Iseconds)'"}' + +# Search test +curl -X GET "https://your-opensearch-endpoint/test-index/_search" \ + --aws-sigv4 "aws:amz:region:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" +``` + +**3. Check IAM Role Permissions:** + +```bash +# Simulate IAM policy evaluation +aws iam simulate-principal-policy \ + --policy-source-arn arn:aws:iam::ACCOUNT-ID:role/your-opensearch-role \ + --action-names es:ESHttpGet es:ESHttpPost \ + --resource-arns arn:aws:es:region:account:domain/your-domain/* +``` + + + + + +Configure your applications to use OpenSearch with proper IAM authentication: + +**1. Python Application Example:** + +```python +import boto3 +from opensearchpy import OpenSearch, RequestsHttpConnection +from requests_aws4auth import AWS4Auth + +# Create AWS4Auth credentials +region = 'us-east-1' # Replace with your region +service = 'es' +credentials = boto3.Session().get_credentials() +awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token) + +# Initialize OpenSearch client +client = OpenSearch( + hosts=[{'host': 'your-opensearch-endpoint', 'port': 443}], + http_auth=awsauth, + use_ssl=True, + verify_certs=True, + connection_class=RequestsHttpConnection +) + +# Test connection +try: + info = client.info() + print(f"Connected to OpenSearch: {info}") +except Exception as e: + print(f"Connection failed: {e}") +``` + +**2. 
Node.js Application Example:** + +```javascript +const AWS = require("aws-sdk"); +const { Client } = require("@opensearch-project/opensearch"); +const { AwsSigv4Signer } = require("@opensearch-project/opensearch/aws"); + +// Configure AWS credentials +AWS.config.update({ + region: "us-east-1", // Replace with your region + accessKeyId: process.env.AWS_ACCESS_KEY_ID, + secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY, +}); + +// Create OpenSearch client with AWS Sigv4 authentication +const client = new Client({ + ...AwsSigv4Signer({ + region: "us-east-1", + service: "es", + getCredentials: () => + new Promise((resolve, reject) => { + AWS.config.getCredentials((err, credentials) => { + if (err) { + reject(err); + } else { + resolve(credentials); + } + }); + }), + }), + node: "https://your-opensearch-endpoint", +}); + +// Test connection +async function testConnection() { + try { + const response = await client.info(); + console.log("Connected to OpenSearch:", response.body); + } catch (error) { + console.error("Connection failed:", error); + } +} + +testConnection(); +``` + +**3. Environment Variables Configuration:** + +```yaml +# In your SleakOps deployment +env: + - name: OPENSEARCH_ENDPOINT + value: "https://your-opensearch-endpoint" + - name: AWS_DEFAULT_REGION + value: "us-east-1" + - name: AWS_ACCESS_KEY_ID + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: access-key-id + - name: AWS_SECRET_ACCESS_KEY + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: secret-access-key +``` + + + + + +Common issues and their solutions: + +**1. 
"Access Denied" Errors:** + +```bash +# Check current IAM permissions +aws sts get-caller-identity + +# Verify role can be assumed +aws sts assume-role \ + --role-arn arn:aws:iam::ACCOUNT-ID:role/your-opensearch-role \ + --role-session-name test-session +``` + +**Solution Steps:** + +- Verify the IAM role has correct permissions +- Check resource ARNs match your domain +- Ensure trust relationships are configured properly + +**2. "Invalid Signature" Errors:** + +```bash +# Check AWS credentials are correctly configured +aws configure list + +# Test AWS API access +aws opensearch describe-domain --domain-name your-domain +``` + +**Solution Steps:** + +- Verify AWS credentials are not expired +- Check system clock is synchronized +- Ensure correct region is specified + +**3. "Forbidden" Errors:** + +Check OpenSearch access policies: + +```bash +# Get current domain configuration +aws opensearch describe-domain-config --domain-name your-domain + +# Check access policy +aws opensearch describe-domain-config --domain-name your-domain \ + --query 'DomainConfig.AccessPolicies.Options' --output text +``` + +**Solution Steps:** + +- Update domain access policy to include your IAM role +- Verify IP restrictions if using public access +- Check fine-grained access control settings + + + + + +Set up monitoring to track access and identify issues: + +**1. CloudTrail Logging:** + +```json +{ + "eventVersion": "1.05", + "userIdentity": { + "type": "AssumedRole", + "principalId": "AIDACKCEVSQ6C2EXAMPLE", + "arn": "arn:aws:sts::123456789012:assumed-role/opensearch-role/session", + "accountId": "123456789012" + }, + "eventTime": "2025-02-05T10:30:00Z", + "eventSource": "es.amazonaws.com", + "eventName": "ESHttpPost", + "resources": [ + { + "accountId": "123456789012", + "type": "AWS::ES::Domain", + "ARN": "arn:aws:es:us-east-1:123456789012:domain/your-domain/*" + } + ] +} +``` + +**2. 
CloudWatch Metrics:** + +```bash +# Monitor failed requests +aws cloudwatch get-metric-statistics \ + --namespace AWS/ES \ + --metric-name IndexingErrors \ + --dimensions Name=DomainName,Value=your-domain Name=ClientId,Value=123456789012 \ + --statistics Sum \ + --start-time 2025-02-05T00:00:00Z \ + --end-time 2025-02-05T23:59:59Z \ + --period 3600 +``` + +**3. Application-Level Monitoring:** + +```python +import logging +import time +from opensearchpy import OpenSearch + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +class MonitoredOpenSearchClient: + def __init__(self, client): + self.client = client + + def search(self, **kwargs): + start_time = time.time() + try: + result = self.client.search(**kwargs) + duration = time.time() - start_time + logger.info(f"Search completed in {duration:.2f}s") + return result + except Exception as e: + logger.error(f"Search failed: {e}") + raise + + def index(self, **kwargs): + start_time = time.time() + try: + result = self.client.index(**kwargs) + duration = time.time() - start_time + logger.info(f"Index operation completed in {duration:.2f}s") + return result + except Exception as e: + logger.error(f"Index operation failed: {e}") + raise +``` + + + + + +Follow these security best practices: + +**1. Principle of Least Privilege:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["es:ESHttpGet", "es:ESHttpPost"], + "Resource": "arn:aws:es:region:account:domain/your-domain/specific-index/*", + "Condition": { + "IpAddress": { + "aws:SourceIp": ["10.0.0.0/16"] + } + } + } + ] +} +``` + +**2. 
Use Resource-Specific Permissions:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "ReadOnlyAccess", + "Effect": "Allow", + "Action": ["es:ESHttpGet", "es:ESHttpHead"], + "Resource": "arn:aws:es:region:account:domain/your-domain/logs-*/_search" + }, + { + "Sid": "WriteAccess", + "Effect": "Allow", + "Action": ["es:ESHttpPost", "es:ESHttpPut"], + "Resource": "arn:aws:es:region:account:domain/your-domain/application-logs-*/_doc" + } + ] +} +``` + +**3. Enable Fine-Grained Access Control:** + +```bash +# Enable FGAC via AWS CLI +aws opensearch update-domain-config \ + --domain-name your-domain \ + --advanced-security-options \ + Enabled=true,InternalUserDatabaseEnabled=false,MasterUserOptions='{ + "MasterUserARN": "arn:aws:iam::ACCOUNT-ID:role/opensearch-master-role" + }' +``` + +**4. Regular Access Review:** + +```bash +# Script to audit OpenSearch access +#!/bin/bash + +DOMAIN_NAME="your-domain" +echo "=== OpenSearch Access Audit ===" + +# Get domain access policy +echo "Current Access Policy:" +aws opensearch describe-domain-config \ + --domain-name $DOMAIN_NAME \ + --query 'DomainConfig.AccessPolicies.Options' \ + --output text | jq . + +# List recent access attempts +echo "Recent Access Logs (last 24 hours):" +aws logs filter-log-events \ + --log-group-name "/aws/opensearch/domains/$DOMAIN_NAME/search-logs" \ + --start-time $(date -d '24 hours ago' +%s)000 \ + --filter-pattern "ERROR" + +# Check for failed authentication attempts +echo "Failed Authentication Attempts:" +aws logs filter-log-events \ + --log-group-name "/aws/opensearch/domains/$DOMAIN_NAME/search-logs" \ + --start-time $(date -d '24 hours ago' +%s)000 \ + --filter-pattern "403" +``` + +**5. 
Credential Rotation:** + +```yaml +# Automated credential rotation using Kubernetes CronJob +apiVersion: batch/v1 +kind: CronJob +metadata: + name: opensearch-credential-rotation +spec: + schedule: "0 2 * * 0" # Weekly on Sunday at 2 AM + jobTemplate: + spec: + template: + spec: + containers: + - name: credential-rotator + image: aws-cli:latest + command: + - /bin/bash + - -c + - | + # Rotate IAM access keys + OLD_KEY=$(aws iam list-access-keys --user-name opensearch-user --query 'AccessKeyMetadata[0].AccessKeyId' --output text) + aws iam create-access-key --user-name opensearch-user > new_key.json + + # Update Kubernetes secret + NEW_KEY=$(cat new_key.json | jq -r '.AccessKey.AccessKeyId') + NEW_SECRET=$(cat new_key.json | jq -r '.AccessKey.SecretAccessKey') + + kubectl create secret generic opensearch-credentials-new \ + --from-literal=access-key-id=$NEW_KEY \ + --from-literal=secret-access-key=$NEW_SECRET + + # Test new credentials before deleting old ones + # (Add validation logic here) + + # Delete old key after validation + aws iam delete-access-key --user-name opensearch-user --access-key-id $OLD_KEY + restartPolicy: OnFailure +``` + + + +--- + +_This FAQ was automatically generated on February 5, 2025 based on a real user query._ diff --git a/docs/troubleshooting/opensearch-pod-authentication-permissions.mdx b/docs/troubleshooting/opensearch-pod-authentication-permissions.mdx new file mode 100644 index 000000000..6da2eaf53 --- /dev/null +++ b/docs/troubleshooting/opensearch-pod-authentication-permissions.mdx @@ -0,0 +1,460 @@ +--- +sidebar_position: 3 +title: "OpenSearch Pod Authentication and Permission Issues" +description: "Troubleshooting OpenSearch connectivity and authorization errors from Kubernetes pods" +date: "2025-02-19" +category: "dependency" +tags: ["opensearch", "authentication", "permissions", "aws", "iam", "pod"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# OpenSearch Pod 
Authentication and Permission Issues + +**Date:** February 19, 2025 +**Category:** Dependency +**Tags:** OpenSearch, Authentication, Permissions, AWS, IAM, Pod + +## Problem Description + +**Context:** User attempts to connect to an OpenSearch cluster from a Kubernetes pod but encounters authorization errors despite expecting automatic permission configuration. + +**Observed Symptoms:** + +- Authorization error when making curl requests to OpenSearch URL from within a pod +- Connection fails despite OpenSearch dependency being configured in SleakOps +- Similar to S3 permission issues - credentials are found but lack adequate permissions + +**Relevant Configuration:** + +- OpenSearch dependency configured in SleakOps +- Pod running in Kubernetes cluster +- AWS-hosted infrastructure +- IAM roles and policies expected to be auto-configured + +**Error Conditions:** + +- Error occurs when making HTTP requests to OpenSearch from pod +- Authorization failure despite dependency configuration +- Problem persists across different connection attempts +- Similar pattern to S3 permission issues + +## Detailed Solution + + + +First, determine how your pod is currently authenticating with AWS: + +1. **Install AWS CLI in your pod**: + + ```bash + # If not already installed + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + ``` + +2. 
**Check current identity**: + ```bash + aws sts get-caller-identity + ``` + +This will show you: + +- Whether you're using pod identity or environment credentials +- The actual IAM role being assumed +- Account and user information + + + + + +For OpenSearch access from Kubernetes pods, your IAM role needs specific permissions: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "es:ESHttpGet", + "es:ESHttpPost", + "es:ESHttpPut", + "es:ESHttpDelete", + "es:ESHttpHead", + "es:Describe*", + "es:List*" + ], + "Resource": [ + "arn:aws:es:*:*:domain/your-opensearch-domain/*", + "arn:aws:es:*:*:domain/your-opensearch-domain" + ] + } + ] +} +``` + +**Minimum required actions:** + +- `es:ESHttpGet` - Read operations (search, get documents) +- `es:ESHttpPost` - Create operations (index documents, search) +- `es:ESHttpPut` - Update operations (update documents, create indices) +- `es:ESHttpDelete` - Delete operations (delete documents, indices) + + + + + +SleakOps should automatically configure IRSA for OpenSearch access, but you can verify the setup: + +1. **Check if the service account has annotations**: + + ```bash + kubectl get serviceaccount -o yaml | grep annotations -A 5 + ``` + + Look for: + + ```yaml + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/sleakops-opensearch-role + ``` + +2. **Verify the pod uses the correct service account**: + + ```bash + kubectl describe pod YOUR_POD_NAME | grep "Service Account" + ``` + +3. **Check environment variables in the pod**: + + ```bash + kubectl exec -it YOUR_POD_NAME -- env | grep AWS + ``` + + You should see: + + ``` + AWS_ROLE_ARN=arn:aws:iam::ACCOUNT:role/sleakops-opensearch-role + AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token + ``` + + + + + +Test the connection to ensure proper authentication: + +1. 
**Basic connectivity test**: + + ```bash + # From within your pod + curl -I https://your-opensearch-endpoint.region.es.amazonaws.com/_cluster/health + ``` + +2. **Authenticated request using AWS CLI**: + + ```bash + # Install AWS CLI if not available + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip && sudo ./aws/install + + # Test authenticated access + aws opensearch describe-domain --domain-name your-domain-name + ``` + +3. **Test with proper AWS Signature v4**: + + ```bash + # Using curl with AWS signature + curl -X GET "https://your-opensearch-endpoint/_cluster/health" \ + --aws-sigv4 "aws:amz:us-east-1:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" + ``` + +4. **Python test script** (if Python is available in the pod): + + ```python + import boto3 + import requests + from requests_aws4auth import AWS4Auth + + # Get credentials + session = boto3.Session() + credentials = session.get_credentials() + + # Create AWS4Auth object + awsauth = AWS4Auth( + credentials.access_key, + credentials.secret_key, + 'us-east-1', # Replace with your region + 'es', + session_token=credentials.token + ) + + # Test connection + url = 'https://your-opensearch-endpoint/_cluster/health' + response = requests.get(url, auth=awsauth) + print(f"Status: {response.status_code}") + print(f"Response: {response.text}") + ``` + + + + + +**1. "Access Denied" or "Unauthorized" (403/401 errors):** + +```bash +# Check current IAM identity +aws sts get-caller-identity + +# Verify domain access policy +aws opensearch describe-domain --domain-name your-domain \ + --query 'DomainStatus.AccessPolicies' --output text | jq . +``` + +**Solution steps:** + +- Verify the IAM role has the required permissions +- Check that the domain access policy allows the role +- Ensure the pod is using the correct service account + +**2. 
"Unable to locate credentials" errors:** + +```bash +# Check if AWS credentials are available +env | grep AWS + +# Check service account token +ls -la /var/run/secrets/eks.amazonaws.com/serviceaccount/ +``` + +**Solution steps:** + +- Verify IRSA is configured correctly +- Check service account annotations +- Restart the pod to refresh tokens + +**3. Network connectivity issues:** + +```bash +# Test network connectivity +nslookup your-opensearch-endpoint.region.es.amazonaws.com + +# Test port connectivity +telnet your-opensearch-endpoint.region.es.amazonaws.com 443 +``` + +**Solution steps:** + +- Check VPC security groups +- Verify network policies +- Ensure OpenSearch is in VPC if using VPC endpoints + + + + + +**Step 1: Verify basic AWS access** + +```bash +# Test if AWS CLI works +aws sts get-caller-identity + +# List OpenSearch domains +aws opensearch list-domain-names +``` + +**Step 2: Check OpenSearch domain configuration** + +```bash +# Get domain details +aws opensearch describe-domain --domain-name your-domain + +# Check access policies +aws opensearch describe-domain --domain-name your-domain \ + --query 'DomainStatus.AccessPolicies' --output text +``` + +**Step 3: Simulate IAM policy evaluation** + +```bash +# Test specific OpenSearch actions +aws iam simulate-principal-policy \ + --policy-source-arn $(aws sts get-caller-identity --query Arn --output text) \ + --action-names es:ESHttpGet es:ESHttpPost \ + --resource-arns "arn:aws:es:region:account:domain/your-domain/*" +``` + +**Step 4: Test from application code** + +Create a test script to verify application-level access: + +```python +#!/usr/bin/env python3 +import boto3 +import json +from opensearchpy import OpenSearch, RequestsHttpConnection +from requests_aws4auth import AWS4Auth + +def test_opensearch_connection(): + try: + # Get AWS credentials + session = boto3.Session() + credentials = session.get_credentials() + region = 'us-east-1' # Replace with your region + + # Create AWS4Auth + awsauth = 
AWS4Auth( + credentials.access_key, + credentials.secret_key, + region, + 'es', + session_token=credentials.token + ) + + # Create OpenSearch client + client = OpenSearch( + hosts=[{'host': 'your-opensearch-endpoint', 'port': 443}], + http_auth=awsauth, + use_ssl=True, + verify_certs=True, + connection_class=RequestsHttpConnection + ) + + # Test connection + info = client.info() + print("✅ Connection successful!") + print(f"Cluster info: {json.dumps(info, indent=2)}") + + # Test basic operations + health = client.cluster.health() + print(f"✅ Cluster health: {health['status']}") + + return True + + except Exception as e: + print(f"❌ Connection failed: {str(e)}") + return False + +if __name__ == "__main__": + test_opensearch_connection() +``` + + + + + +**Automatic Configuration in SleakOps:** + +When you add OpenSearch as a dependency in SleakOps, the following is automatically configured: + +1. **Service Account with IRSA**: Annotated with the appropriate IAM role +2. **IAM Role**: With permissions to access the OpenSearch domain +3. **Domain Access Policy**: Updated to allow the IAM role +4. 
**Environment Variables**: Injected into your pods + +**Accessing OpenSearch from your application:** + +```python +# Environment variables available in your pod +import os +opensearch_endpoint = os.getenv('OPENSEARCH_ENDPOINT') +opensearch_region = os.getenv('AWS_DEFAULT_REGION', 'us-east-1') +``` + +**Configuration in your deployment:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: your-app +spec: + template: + spec: + serviceAccountName: sleakops-opensearch-sa # Auto-created by SleakOps + containers: + - name: app + env: + - name: OPENSEARCH_ENDPOINT + value: "https://your-opensearch-endpoint" + - name: AWS_DEFAULT_REGION + value: "us-east-1" +``` + +**Best practices for SleakOps OpenSearch integration:** + +- Use the automatically configured service account +- Rely on IRSA for authentication (don't use access keys) +- Use environment variables for endpoint configuration +- Test connections during application startup +- Implement proper error handling and retries + + + + + +**Security Best Practices:** + +1. **Use VPC endpoints** for OpenSearch when possible +2. **Enable encryption** at rest and in transit +3. **Use fine-grained access control** for production environments +4. 
**Regularly rotate credentials** and review permissions + +**Monitoring and Alerting:** + +```bash +# Create CloudWatch custom metric for connection health +aws cloudwatch put-metric-data \ + --namespace "Custom/OpenSearch" \ + --metric-data MetricName=ConnectionHealth,Value=1,Unit=Count +``` + +**Application-level monitoring:** + +```python +import time +import logging +from opensearchpy import OpenSearch + +def monitor_opensearch_health(client): + """Monitor OpenSearch cluster health""" + try: + health = client.cluster.health() + status = health['status'] + + if status == 'green': + logging.info("OpenSearch cluster is healthy") + return True + elif status == 'yellow': + logging.warning("OpenSearch cluster has issues but is functional") + return True + else: + logging.error("OpenSearch cluster is unhealthy") + return False + + except Exception as e: + logging.error(f"Failed to check OpenSearch health: {e}") + return False + +# Use in your application +if not monitor_opensearch_health(opensearch_client): + # Implement fallback or alert mechanism + pass +``` + +**Performance Optimization:** + +- Use connection pooling for high-traffic applications +- Implement proper indexing strategies +- Monitor query performance and optimize slow queries +- Use batch operations for bulk data operations + + + +--- + +_This FAQ was automatically generated on February 19, 2025 based on a real user query._ diff --git a/docs/troubleshooting/opentelemetry-django-database-detection.mdx b/docs/troubleshooting/opentelemetry-django-database-detection.mdx new file mode 100644 index 000000000..2b01681e4 --- /dev/null +++ b/docs/troubleshooting/opentelemetry-django-database-detection.mdx @@ -0,0 +1,806 @@ +--- +sidebar_position: 3 +title: "OpenTelemetry Database Detection Issue in Django" +description: "Solution for OpenTelemetry not detecting configured databases in Django applications" +date: "2024-12-19" +category: "workload" +tags: ["opentelemetry", "django", "database", "monitoring", 
"instrumentation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# OpenTelemetry Database Detection Issue in Django + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** OpenTelemetry, Django, Database, Monitoring, Instrumentation + +## Problem Description + +**Context:** User has a Django application deployed in SleakOps with OpenTelemetry auto-instrumentation enabled, but OpenTelemetry is not detecting the configured database connections despite the database being properly configured in the Django settings. + +**Observed Symptoms:** + +- OpenTelemetry auto-instrumentation is not detecting database connections +- Database is properly configured and working in the Django application +- Adding `DJANGO_SETTINGS_MODULE` environment variable does not resolve the issue +- Application functions normally but lacks database telemetry data + +**Relevant Configuration:** + +- Framework: Django (Django REST Framework) +- Environment variable: `DJANGO_SETTINGS_MODULE="simplee_drf.settings"` +- Platform: SleakOps with OpenTelemetry auto-instrumentation +- Previous issue with boto library was resolved by updating version + +**Error Conditions:** + +- OpenTelemetry fails to detect database during application startup +- Missing database traces and metrics in observability data +- Issue persists after setting Django settings module environment variable + +## Detailed Solution + + + +First, ensure your Django settings are properly configured for database detection: + +1. **Verify DJANGO_SETTINGS_MODULE is correctly set:** + +```yaml +# In your SleakOps deployment configuration +environment: + DJANGO_SETTINGS_MODULE: "your_project.settings" + # Replace 'your_project' with your actual project name +``` + +2. 
**Check your Django settings file contains database configuration:**
+
+```python
+# settings.py
+import os
+
+DATABASES = {
+    'default': {
+        'ENGINE': 'django.db.backends.postgresql',  # or your database engine
+        'NAME': os.environ.get('DB_NAME'),
+        'USER': os.environ.get('DB_USER'),
+        'PASSWORD': os.environ.get('DB_PASSWORD'),
+        'HOST': os.environ.get('DB_HOST'),
+        'PORT': os.environ.get('DB_PORT', '5432'),
+    }
+}
+```
+
+
+
+
+
+Ensure OpenTelemetry Django instrumentation is properly configured:
+
+1. **Add required OpenTelemetry packages to requirements.txt:**
+
+```txt
+opentelemetry-api
+opentelemetry-sdk
+opentelemetry-instrumentation
+opentelemetry-instrumentation-django
+opentelemetry-instrumentation-psycopg2 # For PostgreSQL
+opentelemetry-instrumentation-mysql # For MySQL
+opentelemetry-instrumentation-sqlite3 # For SQLite
+opentelemetry-exporter-otlp
+opentelemetry-propagator-b3
+opentelemetry-propagator-jaeger
+opentelemetry-instrumentation-requests
+opentelemetry-instrumentation-urllib3
+```
+
+2. 
**Configure auto-instrumentation in Django settings:**
+
+```python
+# settings.py
+
+# OpenTelemetry configuration
+import os
+from opentelemetry import trace
+from opentelemetry.instrumentation.django import DjangoInstrumentor
+from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
+from opentelemetry.instrumentation.requests import RequestsInstrumentor
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+
+# Initialize OpenTelemetry. Note: trace.get_tracer_provider() returns a
+# proxy object by default (never None), so check for the SDK provider
+# explicitly instead of relying on truthiness.
+if not isinstance(trace.get_tracer_provider(), TracerProvider):
+    trace.set_tracer_provider(TracerProvider())
+
+    # Configure OTLP exporter
+    otlp_exporter = OTLPSpanExporter(
+        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
+        headers={"Authorization": f"Bearer {os.environ.get('OTEL_AUTH_TOKEN', '')}"}
+    )
+
+    # Add span processor
+    span_processor = BatchSpanProcessor(otlp_exporter)
+    trace.get_tracer_provider().add_span_processor(span_processor)
+
+# Auto-instrument Django
+DjangoInstrumentor().instrument()
+
+# Auto-instrument database connections
+Psycopg2Instrumentor().instrument()  # For PostgreSQL
+
+# Auto-instrument HTTP requests
+RequestsInstrumentor().instrument()
+```
+
+3. **Alternative: Use environment variable approach:**
+
+```yaml
+# In your deployment configuration (when launching the app with the
+# opentelemetry-instrument wrapper)
+environment:
+  OTEL_PYTHON_DJANGO_INSTRUMENT: "true"
+  # Other installed instrumentations (psycopg2, requests, ...) are enabled
+  # by default; opt out selectively with OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
+  OTEL_SERVICE_NAME: "your-django-app"
+  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
+```
+
+
+
+
+
+If auto-instrumentation doesn't work, configure manual instrumentation:
+
+1. 
**Create custom instrumentation middleware:** + +```python +# middleware/opentelemetry_middleware.py +import logging +from django.conf import settings +from django.db import connection +from opentelemetry import trace +from opentelemetry.instrumentation.django import DjangoInstrumentor +from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor + +logger = logging.getLogger(__name__) + +class OpenTelemetryMiddleware: + def __init__(self, get_response): + self.get_response = get_response + self.setup_instrumentation() + + def setup_instrumentation(self): + """Setup OpenTelemetry instrumentation""" + try: + # Instrument Django + if not hasattr(settings, '_otel_django_instrumented'): + DjangoInstrumentor().instrument() + settings._otel_django_instrumented = True + logger.info("Django instrumentation enabled") + + # Instrument database + if not hasattr(settings, '_otel_db_instrumented'): + # Detect database engine and instrument accordingly + db_engine = settings.DATABASES['default']['ENGINE'] + + if 'postgresql' in db_engine: + Psycopg2Instrumentor().instrument() + elif 'mysql' in db_engine: + from opentelemetry.instrumentation.mysql import MySQLInstrumentor + MySQLInstrumentor().instrument() + elif 'sqlite' in db_engine: + from opentelemetry.instrumentation.sqlite3 import SQLite3Instrumentor + SQLite3Instrumentor().instrument() + + settings._otel_db_instrumented = True + logger.info(f"Database instrumentation enabled for {db_engine}") + + except Exception as e: + logger.error(f"Failed to setup OpenTelemetry instrumentation: {e}") + + def __call__(self, request): + response = self.get_response(request) + return response +``` + +2. **Add middleware to Django settings:** + +```python +# settings.py +MIDDLEWARE = [ + 'your_app.middleware.opentelemetry_middleware.OpenTelemetryMiddleware', + 'django.middleware.security.SecurityMiddleware', + # ... other middleware +] +``` + +3. 
**Manual span creation for database operations:**
+
+```python
+# In your Django views or services
+from opentelemetry import trace
+from django.db import connection
+
+tracer = trace.get_tracer(__name__)
+
+def get_user_data(user_id):
+    with tracer.start_as_current_span("database.query.get_user") as span:
+        span.set_attribute("user.id", user_id)
+        span.set_attribute("db.statement", "SELECT * FROM auth_user WHERE id = %s")
+
+        with connection.cursor() as cursor:
+            cursor.execute("SELECT * FROM auth_user WHERE id = %s", [user_id])
+            result = cursor.fetchone()
+
+        span.set_attribute("db.rows_affected", 1 if result else 0)
+        return result
+```
+
+
+
+
+
+1. **Check application logs for OpenTelemetry initialization:**
+
+```bash
+# Look for these log messages in SleakOps logs
+kubectl logs -f deployment/your-app | grep -i "opentelemetry\|instrumentation"
+```
+
+2. **Test database connection manually:**
+
+```python
+# In Django shell or a test view
+from django.db import connection
+
+def test_db_connection():
+    with connection.cursor() as cursor:
+        cursor.execute("SELECT 1")
+        result = cursor.fetchone()
+    return result
+```
+
+3. **Verify telemetry data is being sent:**
+
+```python
+# Add to a Django view for testing
+from django.http import JsonResponse
+from opentelemetry import trace
+
+def test_telemetry_view(request):
+    """Test view to verify OpenTelemetry is working"""
+
+    # Create a test span
+    tracer = trace.get_tracer(__name__)
+    with tracer.start_as_current_span("test.database.query") as span:
+        span.set_attribute("test", "true")
+
+        # Perform a database query
+        from django.contrib.auth.models import User
+        user_count = User.objects.count()
+
+        span.set_attribute("user.count", user_count)
+
+    return JsonResponse({"status": "success", "user_count": user_count})
+```
+
+4. 
**Enable debug logging:**
+
+```python
+# settings.py
+import logging
+
+# Enable OpenTelemetry debug logging
+logging.basicConfig(level=logging.DEBUG)
+logging.getLogger("opentelemetry").setLevel(logging.DEBUG)
+logging.getLogger("opentelemetry.instrumentation").setLevel(logging.DEBUG)
+
+LOGGING = {
+    'version': 1,
+    'disable_existing_loggers': False,
+    'handlers': {
+        'console': {
+            'class': 'logging.StreamHandler',
+        },
+    },
+    'loggers': {
+        'opentelemetry': {
+            'handlers': ['console'],
+            'level': 'DEBUG',
+            'propagate': True,
+        },
+        'opentelemetry.instrumentation': {
+            'handlers': ['console'],
+            'level': 'DEBUG',
+            'propagate': True,
+        },
+    },
+}
+```
+
+5. **Check instrumentation status:**
+
+```python
+# Add this to your Django app startup
+def check_instrumentation_status():
+    """Check which instrumentations are active"""
+    import pkg_resources
+
+    # Check installed OpenTelemetry packages
+    otel_packages = [
+        pkg for pkg in pkg_resources.working_set
+        if 'opentelemetry' in pkg.project_name
+    ]
+
+    print("Installed OpenTelemetry packages:")
+    for pkg in otel_packages:
+        print(f"  {pkg.project_name}: {pkg.version}")
+
+    # Check if instrumentation is active. Instrumentor classes are singletons
+    # and expose the is_instrumented_by_opentelemetry property.
+    from opentelemetry.instrumentation.django import DjangoInstrumentor
+
+    print("Active instrumentations:")
+    print(f"  django: {DjangoInstrumentor().is_instrumented_by_opentelemetry}")
+
+# Call this in your Django app's ready() method
+check_instrumentation_status()
+```
+
+
+
+
+
+Configure all necessary environment variables for OpenTelemetry:
+
+1. 
**Required environment variables:** + +```yaml +# In SleakOps deployment configuration +environment: + # Django settings + DJANGO_SETTINGS_MODULE: "your_project.settings" + + # OpenTelemetry configuration + OTEL_SERVICE_NAME: "your-django-app" + OTEL_SERVICE_VERSION: "1.0.0" + OTEL_DEPLOYMENT_ENVIRONMENT: "production" # or staging/development + + # OTLP Exporter configuration + OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317" + OTEL_EXPORTER_OTLP_PROTOCOL: "grpc" + OTEL_EXPORTER_OTLP_HEADERS: "authorization=Bearer YOUR_TOKEN" + + # Instrumentation configuration + OTEL_PYTHON_DJANGO_INSTRUMENT: "true" + OTEL_PYTHON_PSYCOPG2_INSTRUMENT: "true" + OTEL_PYTHON_REQUESTS_INSTRUMENT: "true" + OTEL_PYTHON_URLLIB3_INSTRUMENT: "true" + + # Trace configuration + OTEL_TRACES_EXPORTER: "otlp" + OTEL_METRICS_EXPORTER: "otlp" + OTEL_LOGS_EXPORTER: "otlp" + + # Sampling configuration + OTEL_TRACES_SAMPLER: "parentbased_traceidratio" + OTEL_TRACES_SAMPLER_ARG: "0.1" # Sample 10% of traces + + # Resource attributes + OTEL_RESOURCE_ATTRIBUTES: "service.name=your-django-app,service.version=1.0.0" +``` + +2. **Database-specific configuration:** + +```yaml +# For PostgreSQL +environment: + OTEL_PYTHON_PSYCOPG2_INSTRUMENT: "true" + DB_NAME: "your_database" + DB_USER: "your_user" + DB_PASSWORD: "your_password" + DB_HOST: "your-db-host" + DB_PORT: "5432" + +# For MySQL +environment: + OTEL_PYTHON_MYSQL_INSTRUMENT: "true" + +# For SQLite +environment: + OTEL_PYTHON_SQLITE3_INSTRUMENT: "true" +``` + +3. 
**Advanced configuration:** + +```python +# settings.py - Advanced OpenTelemetry configuration +import os +from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_VERSION, Resource + +# Configure resource attributes +resource = Resource.create({ + SERVICE_NAME: os.environ.get("OTEL_SERVICE_NAME", "django-app"), + SERVICE_VERSION: os.environ.get("OTEL_SERVICE_VERSION", "1.0.0"), + "deployment.environment": os.environ.get("OTEL_DEPLOYMENT_ENVIRONMENT", "development"), + "service.namespace": os.environ.get("OTEL_SERVICE_NAMESPACE", "default"), +}) + +# Configure trace provider with resource +from opentelemetry.sdk.trace import TracerProvider +trace.set_tracer_provider(TracerProvider(resource=resource)) +``` + + + + + +1. **Create a test endpoint for verification:** + +```python +# views.py +from django.http import JsonResponse +from django.db import connection +from opentelemetry import trace +import time + +def test_opentelemetry(request): + """Test endpoint to verify OpenTelemetry database instrumentation""" + + tracer = trace.get_tracer(__name__) + + with tracer.start_as_current_span("test.opentelemetry.endpoint") as span: + span.set_attribute("test.type", "database_instrumentation") + + # Test database query + with tracer.start_as_current_span("database.test.query") as db_span: + db_span.set_attribute("db.statement", "SELECT COUNT(*) FROM auth_user") + + with connection.cursor() as cursor: + cursor.execute("SELECT COUNT(*) FROM auth_user") + user_count = cursor.fetchone()[0] + + db_span.set_attribute("db.rows_returned", 1) + db_span.set_attribute("user.count", user_count) + + # Test span attributes + span.set_attribute("response.user_count", user_count) + span.set_attribute("test.status", "success") + + # Add some processing time + time.sleep(0.1) + + return JsonResponse({ + "status": "success", + "message": "OpenTelemetry test completed", + "user_count": user_count, + "instrumentation": { + "django": "enabled", + "database": "enabled", + "traces": "enabled" 
+ } + }) +``` + +2. **Integration test for instrumentation:** + +```python +# tests/test_opentelemetry.py +from django.test import TestCase, Client +from django.urls import reverse +from unittest.mock import patch, MagicMock + +class OpenTelemetryInstrumentationTest(TestCase): + def setUp(self): + self.client = Client() + + @patch('opentelemetry.trace.get_tracer') + def test_database_instrumentation(self, mock_tracer): + """Test that database queries are instrumented""" + + # Mock tracer and span + mock_span = MagicMock() + mock_tracer.return_value.start_as_current_span.return_value.__enter__.return_value = mock_span + + # Make request that triggers database query + response = self.client.get('/test-opentelemetry/') + + # Verify response + self.assertEqual(response.status_code, 200) + data = response.json() + self.assertEqual(data['status'], 'success') + + # Verify span creation was called + mock_tracer.return_value.start_as_current_span.assert_called() + + # Verify span attributes were set + mock_span.set_attribute.assert_called() + + def test_database_connection(self): + """Test database connection works""" + from django.db import connection + + with connection.cursor() as cursor: + cursor.execute("SELECT 1") + result = cursor.fetchone() + + self.assertEqual(result[0], 1) +``` + +3. 
**Manual verification script:** + +```python +# verify_instrumentation.py +#!/usr/bin/env python3 + +import os +import sys +import django + +# Setup Django +os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'your_project.settings') +django.setup() + +def verify_opentelemetry_setup(): + """Verify OpenTelemetry is properly configured""" + + print("=== OpenTelemetry Django Verification ===\n") + + # Check imports + try: + from opentelemetry import trace + print("✓ OpenTelemetry API imported successfully") + except ImportError as e: + print(f"✗ Failed to import OpenTelemetry API: {e}") + return False + + # Check Django instrumentation + try: + from opentelemetry.instrumentation.django import DjangoInstrumentor + print("✓ Django instrumentation available") + except ImportError as e: + print(f"✗ Django instrumentation not available: {e}") + return False + + # Check database instrumentation + try: + from django.conf import settings + db_engine = settings.DATABASES['default']['ENGINE'] + + if 'postgresql' in db_engine: + from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor + print("✓ PostgreSQL instrumentation available") + elif 'mysql' in db_engine: + from opentelemetry.instrumentation.mysql import MySQLInstrumentor + print("✓ MySQL instrumentation available") + elif 'sqlite' in db_engine: + from opentelemetry.instrumentation.sqlite3 import SQLite3Instrumentor + print("✓ SQLite instrumentation available") + except ImportError as e: + print(f"✗ Database instrumentation not available: {e}") + return False + + # Test database connection + try: + from django.db import connection + with connection.cursor() as cursor: + cursor.execute("SELECT 1") + result = cursor.fetchone() + print("✓ Database connection successful") + except Exception as e: + print(f"✗ Database connection failed: {e}") + return False + + # Check tracer provider + try: + tracer_provider = trace.get_tracer_provider() + print(f"✓ Tracer provider configured: {type(tracer_provider)}") + except 
Exception as e: + print(f"✗ Tracer provider issue: {e}") + return False + + # Test span creation + try: + tracer = trace.get_tracer(__name__) + with tracer.start_as_current_span("test.span") as span: + span.set_attribute("test", "verification") + print("✓ Span creation successful") + except Exception as e: + print(f"✗ Span creation failed: {e}") + return False + + print("\n=== All checks passed! ===") + return True + +if __name__ == "__main__": + success = verify_opentelemetry_setup() + sys.exit(0 if success else 1) +``` + + + + + +**1. "No module named 'opentelemetry.instrumentation.django'" Error:** + +```bash +# Install the missing package +pip install opentelemetry-instrumentation-django + +# Or add to requirements.txt +echo "opentelemetry-instrumentation-django" >> requirements.txt +``` + +**2. Database instrumentation not working:** + +```python +# Force re-instrumentation in Django app's ready() method +from django.apps import AppConfig + +class YourAppConfig(AppConfig): + default_auto_field = 'django.db.models.BigAutoField' + name = 'your_app' + + def ready(self): + # Force database instrumentation + from django.conf import settings + + if not hasattr(settings, '_db_instrumented'): + db_engine = settings.DATABASES['default']['ENGINE'] + + if 'postgresql' in db_engine: + from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor + Psycopg2Instrumentor().instrument() + + settings._db_instrumented = True +``` + +**3. Spans not appearing in OpenTelemetry backend:** + +```python +# Add console exporter for debugging +from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor + +# Add console exporter (for debugging only) +console_exporter = ConsoleSpanExporter() +simple_processor = SimpleSpanProcessor(console_exporter) +trace.get_tracer_provider().add_span_processor(simple_processor) +``` + +**4. 
High memory usage from instrumentation:**
+
+```python
+# Configure sampling to reduce overhead
+import os
+os.environ["OTEL_TRACES_SAMPLER"] = "parentbased_traceidratio"
+os.environ["OTEL_TRACES_SAMPLER_ARG"] = "0.1"  # Sample 10% of traces
+```
+
+**5. Database connection pool issues:**
+
+```python
+# Configure connection reuse. Stock Django with psycopg2 does not accept
+# MAX_CONNS/MIN_CONNS in OPTIONS; persistent connections via CONN_MAX_AGE
+# are the built-in mechanism.
+DATABASES = {
+    'default': {
+        'ENGINE': 'django.db.backends.postgresql',
+        'NAME': os.environ.get('DB_NAME'),
+        # ... other settings ...
+        'CONN_MAX_AGE': 600,  # Reuse connections for up to 10 minutes
+    }
+}
+```
+
+
+
+
+
+**1. Performance Optimization:**
+
+```python
+# Configure efficient sampling and span limits via the standard SDK
+# environment variables (set before the TracerProvider is created)
+import os
+os.environ.setdefault("OTEL_TRACES_SAMPLER", "parentbased_traceidratio")
+os.environ.setdefault("OTEL_TRACES_SAMPLER_ARG", "0.1")  # 10% sampling
+os.environ.setdefault("OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT", "256")  # Limit attribute length
+os.environ.setdefault("OTEL_SPAN_EVENT_COUNT_LIMIT", "32")  # Limit events per span
+os.environ.setdefault("OTEL_SPAN_LINK_COUNT_LIMIT", "32")  # Limit links per span
+```
+
+**2. Security Considerations:**
+
+```python
+# Don't trace sensitive endpoints. The Django instrumentation reads excluded
+# URLs from an environment variable (comma-separated regexes), which must be
+# set before instrument() is called.
+import os
+os.environ["OTEL_PYTHON_DJANGO_EXCLUDED_URLS"] = ",".join([
+    "/admin",           # Django admin
+    "/api/auth/login",  # Login endpoints
+    "/api/auth/token",  # Token endpoints
+    "/health",          # Health checks
+])
+
+from opentelemetry.instrumentation.django import DjangoInstrumentor
+DjangoInstrumentor().instrument()
+```
+
+**3. Custom Attributes for Business Logic:**
+
+```python
+# Add business context to spans
+from opentelemetry import trace
+
+def process_order(order_id, user_id):
+    tracer = trace.get_tracer(__name__)
+
+    with tracer.start_as_current_span("business.process_order") as span:
+        span.set_attribute("order.id", order_id)
+        span.set_attribute("user.id", user_id)
+        span.set_attribute("business.operation", "order_processing")
+
+        # Business logic here
+        order = Order.objects.get(id=order_id)
+        span.set_attribute("order.amount", float(order.total_amount))
+        span.set_attribute("order.status", order.status)
+
+        return order
+```
+
+**4. 
Error Handling and Logging:**
+
+```python
+# Proper error handling with spans
+import logging
+
+from opentelemetry import trace
+from opentelemetry.trace import Status, StatusCode
+
+logger = logging.getLogger(__name__)
+
+def risky_database_operation():
+    tracer = trace.get_tracer(__name__)
+
+    with tracer.start_as_current_span("database.risky_operation") as span:
+        try:
+            # Database operation
+            result = perform_complex_query()
+            span.set_status(Status(StatusCode.OK))
+            return result
+
+        except Exception as e:
+            # Record error in span
+            span.record_exception(e)
+            span.set_status(Status(StatusCode.ERROR, str(e)))
+
+            # Also log the error
+            logger.error(f"Database operation failed: {e}", exc_info=True)
+            raise
+```
+
+**5. Resource Configuration:**
+
+```python
+# Comprehensive resource configuration
+import os
+import socket
+
+from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_VERSION, Resource
+
+resource = Resource.create({
+    SERVICE_NAME: "your-django-app",
+    SERVICE_VERSION: "1.0.0",
+    "service.namespace": "production",
+    "service.instance.id": f"{socket.gethostname()}-{os.getpid()}",
+    "deployment.environment": os.environ.get("ENVIRONMENT", "development"),
+    # telemetry.sdk.* attributes are populated automatically by the SDK
+    # and should not be hardcoded here
+})
+```
+
+
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx b/docs/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx
new file mode 100644
index 000000000..621f20c66
--- /dev/null
+++ b/docs/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx
@@ -0,0 +1,216 @@
+---
+sidebar_position: 3
+title: "Pod Readiness Probe Failed - Connection Refused"
+description: "Solution for pods failing readiness probes with connection refused errors"
+date: "2024-12-19"
+category: "workload"
+tags: ["pod", "readiness-probe", "connection-refused", "troubleshooting"]
+---
+
+import
TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pod Readiness Probe Failed - Connection Refused + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** Pod, Readiness Probe, Connection Refused, Troubleshooting + +## Problem Description + +**Context:** After a successful build and deployment, the new pod fails to start properly and enters a restart loop. The readiness probe cannot connect to the application's health check endpoint. + +**Observed Symptoms:** + +- Pod shows "Back-off restarting failed container" status +- Readiness probe fails with "connection refused" error +- Container image is present on the machine +- Build and deploy processes completed successfully +- Application appears to not be responding to HTTP requests + +**Relevant Configuration:** + +- Health check endpoint: `/users/sign_in` +- Application port: `3000` +- Pod internal IP: `10.130.33.83` +- Error: `dial tcp 10.130.33.83:3000: connect: connection refused` + +**Error Conditions:** + +- Occurs after successful build and deployment +- Readiness probe consistently fails +- Pod cannot reach ready state +- Application service is not accessible + +## Detailed Solution + + + +A "connection refused" error during readiness probe indicates that: + +1. **Application not started**: The HTTP server inside the container hasn't started +2. **Wrong port**: The application is listening on a different port than expected +3. **Application crash**: The application started but crashed before the probe +4. **Binding issues**: The application is only binding to localhost instead of 0.0.0.0 + +The fact that the container image is present suggests the deployment succeeded, but the application inside isn't running properly. 
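A "connection refused" is the kernel actively rejecting the dial because nothing is accepting connections on that address and port pair. The minimal Python sketch below reproduces both sides of the symptom locally, with no Kubernetes involved (all names are illustrative):

```python
import socket

def can_connect(host, port, timeout=1.0):
    """Attempt a TCP connection; True if accepted, False if refused/unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError, timeouts, etc.
        return False

# A listening socket accepts connections on the interface it is bound to.
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 -> the OS picks a free port
server.listen(1)
port = server.getsockname()[1]

print(can_connect("127.0.0.1", port))  # True: something is listening

# Once nothing listens on the port, the kernel answers with a reset --
# exactly the probe's "connect: connection refused".
server.close()
print(can_connect("127.0.0.1", port))  # False: connection refused
```

In the pod scenario the kubelet dials the pod IP, so an application bound only to `127.0.0.1` behaves like the closed socket above even though it answers on localhost.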
+
+
+
+
+First, examine the pod logs to identify why the application isn't starting:
+
+```bash
+# Get pod logs
+kubectl logs <pod-name> -n <namespace>
+
+# For previous container instance
+kubectl logs <pod-name> -n <namespace> --previous
+
+# Follow logs in real-time
+kubectl logs <pod-name> -n <namespace> -f
+```
+
+Look for:
+
+- Application startup errors
+- Database connection failures
+- Missing environment variables
+- Port binding issues
+- Dependency failures
+
+
+
+
+
+Ensure your application is configured correctly:
+
+1. **Check application binding**:
+
+   ```javascript
+   // Wrong - only binds to localhost
+   app.listen(3000, "localhost");
+
+   // Correct - binds to all interfaces
+   app.listen(3000, "0.0.0.0");
+   ```
+
+2. **Verify Dockerfile EXPOSE**:
+
+   ```dockerfile
+   EXPOSE 3000
+   ```
+
+3. **Check Kubernetes service configuration**:
+   ```yaml
+   spec:
+     ports:
+       - port: 3000
+         targetPort: 3000
+   ```
+
+
+
+
+
+If the application takes time to start, adjust the readiness probe:
+
+```yaml
+spec:
+  containers:
+    - name: app
+      readinessProbe:
+        httpGet:
+          path: /users/sign_in
+          port: 3000
+        initialDelaySeconds: 30 # Wait 30 seconds before first probe
+        periodSeconds: 10 # Check every 10 seconds
+        timeoutSeconds: 5 # 5 second timeout
+        failureThreshold: 3 # Fail after 3 consecutive failures
+        successThreshold: 1 # Success after 1 successful probe
+```
+
+
+
+
+
+If `/users/sign_in` requires authentication or complex logic, create a simpler health endpoint:
+
+```javascript
+// Add a simple health check endpoint
+app.get("/health", (req, res) => {
+  res.status(200).json({ status: "ok" });
+});
+```
+
+Then update your readiness probe:
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 3000
+```
+
+
+
+
+
+1. **Check if the container is running**:
+
+   ```bash
+   kubectl describe pod <pod-name> -n <namespace>
+   ```
+
+2. **Execute into the container** (if it stays running):
+
+   ```bash
+   kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
+   # Test if the port is listening
+   netstat -tlnp | grep 3000
+   ```
+
+3.
**Test the endpoint manually**:
+
+   ```bash
+   kubectl exec -it <pod-name> -n <namespace> -- curl http://localhost:3000/users/sign_in
+   ```
+
+4. **Check resource limits**:
+   ```bash
+   kubectl top pod <pod-name> -n <namespace>
+   ```
+
+
+
+
+
+**For Ruby on Rails applications:**
+
+```ruby
+# Ensure Rails binds to all interfaces
+# config/puma.rb
+bind "tcp://0.0.0.0:#{ENV.fetch('PORT', 3000)}"
+```
+
+**For Node.js applications:**
+
+```javascript
+// Ensure Express binds to all interfaces
+const port = process.env.PORT || 3000;
+app.listen(port, "0.0.0.0", () => {
+  console.log(`Server running on port ${port}`);
+});
+```
+
+**For environment variables:**
+
+- Verify all required environment variables are set
+- Check database connection strings
+- Ensure secrets are properly mounted
+
+
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx b/docs/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx
new file mode 100644
index 000000000..61164666d
--- /dev/null
+++ b/docs/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx
@@ -0,0 +1,226 @@
+---
+sidebar_position: 3
+title: "PostgreSQL pgvector Extension Issues with Alpine vs Debian"
+description: "Solution for pgvector extension compatibility issues between Alpine and Debian PostgreSQL images"
+date: "2024-04-24"
+category: "dependency"
+tags: ["postgresql", "pgvector", "alpine", "debian", "extensions", "database"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# PostgreSQL pgvector Extension Issues with Alpine vs Debian
+
+**Date:** April 24, 2024
+**Category:** Dependency
+**Tags:** PostgreSQL, pgvector, Alpine, Debian, Extensions, Database
+
+## Problem Description
+
+**Context:** User is implementing the pgvector extension for PostgreSQL in their SleakOps environment.
The extension works correctly in production and staging environments but fails in the local development environment. + +**Observed Symptoms:** + +- pgvector extension installs successfully in production and staging +- Local development environment cannot install or use pgvector extension +- Inconsistent behavior between environments +- Extension-related errors in local PostgreSQL instance + +**Relevant Configuration:** + +- Local environment: `postgres:14-alpine` +- Production/Staging: `postgres:14` (Debian-based) +- Extension: pgvector for vector similarity search +- Platform: SleakOps with PostgreSQL dependency + +**Error Conditions:** + +- Error occurs when trying to install pgvector extension locally +- Problem appears only in Alpine-based PostgreSQL images +- Works correctly in Debian-based PostgreSQL images +- Environment inconsistency between local and production + +## Detailed Solution + + + +The key difference between PostgreSQL images: + +- **postgres:14-alpine**: Based on Alpine Linux, minimal image with musl libc +- **postgres:14**: Based on Debian, includes glibc and more development tools + +Alpine images are smaller but may lack certain libraries and compilation tools needed for PostgreSQL extensions like pgvector. 
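Because the underlying problem is simple image drift between environments, it can be caught mechanically before it surfaces as a missing extension. The helper below is a hypothetical sketch (the function and the image map are illustrative, not a SleakOps API) that flags any environment whose image differs from the majority baseline:

```python
from collections import Counter

def find_image_drift(env_images):
    """Given {environment: image_tag}, return the environments whose image
    differs from the most common tag (the presumed baseline)."""
    baseline, _ = Counter(env_images.values()).most_common(1)[0]
    return {env for env, image in env_images.items() if image != baseline}

envs = {
    "production": "postgres:14",
    "staging": "postgres:14",
    "local": "postgres:14-alpine",  # the drifted environment
}
print(find_image_drift(envs))  # -> {'local'}
```

Run against the image tags declared per environment, a check like this fails fast in CI instead of surfacing later as an extension that installs everywhere except locally.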
+ + + + + +To maintain environment consistency, update your local PostgreSQL configuration: + +**In docker-compose.yml or equivalent:** + +```yaml +services: + postgres: + image: postgres:14 # Changed from postgres:14-alpine + environment: + POSTGRES_DB: your_database + POSTGRES_USER: your_user + POSTGRES_PASSWORD: your_password + ports: + - "5432:5432" + volumes: + - postgres_data:/var/lib/postgresql/data +``` + +**In SleakOps dependency configuration:** + +```yaml +dependencies: + postgres: + image: postgres:14 + version: "14" + extensions: + - pgvector +``` + + + + + +After switching to the Debian image, install pgvector: + +**Method 1: Using SQL commands** + +```sql +-- Connect to your database +\c your_database + +-- Create the extension +CREATE EXTENSION IF NOT EXISTS vector; + +-- Verify installation +\dx vector +``` + +**Method 2: Using init script** + +Create an initialization script: + +```sql +-- init-pgvector.sql +CREATE EXTENSION IF NOT EXISTS vector; +``` + +Mount it in your container: + +```yaml +volumes: + - ./init-pgvector.sql:/docker-entrypoint-initdb.d/init-pgvector.sql +``` + + + + + +To confirm pgvector is working correctly: + +```sql +-- Check if extension is installed +SELECT * FROM pg_extension WHERE extname = 'vector'; + +-- Test vector functionality +CREATE TABLE test_vectors ( + id SERIAL PRIMARY KEY, + embedding vector(3) +); + +-- Insert test data +INSERT INTO test_vectors (embedding) VALUES ('[1,2,3]'); +INSERT INTO test_vectors (embedding) VALUES ('[4,5,6]'); + +-- Test similarity search +SELECT id, embedding, embedding <-> '[1,2,3]' AS distance +FROM test_vectors +ORDER BY distance; + +-- Clean up test +DROP TABLE test_vectors; +``` + + + + + +**If pgvector still doesn't work after switching to Debian:** + +1. **Check PostgreSQL version compatibility:** + + ```bash + docker exec -it your_postgres_container psql -U your_user -c "SELECT version();" + ``` + +2. 
**Install pgvector manually if needed:**
+
+   ```bash
+   # Connect to container
+   docker exec -it your_postgres_container bash
+
+   # Install dependencies
+   apt-get update
+   apt-get install -y postgresql-14-pgvector
+   ```
+
+3. **Restart PostgreSQL service:**
+
+   ```bash
+   docker restart your_postgres_container
+   ```
+
+4. **Check extension availability:**
+   ```sql
+   SELECT * FROM pg_available_extensions WHERE name = 'vector';
+   ```
+
+
+
+
+
+To prevent future environment inconsistencies:
+
+**1. Use identical base images:**
+
+```yaml
+# Use the same image across all environments
+postgres:
+  image: postgres:14 # Not postgres:14-alpine
+```
+
+**2. Document dependencies:**
+
+```yaml
+# Create a dependencies.yml file
+postgres:
+  version: "14"
+  image: postgres:14
+  extensions:
+    - pgvector
+    - uuid-ossp
+  required_packages:
+    - postgresql-14-pgvector
+```
+
+**3. Use initialization scripts:**
+
+```sql
+-- init-extensions.sql (mounted into /docker-entrypoint-initdb.d/)
+CREATE EXTENSION IF NOT EXISTS vector;
+CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
+```
+
+
+
+---
+
+_This FAQ was automatically generated on April 24, 2024 based on a real user query._
diff --git a/docs/troubleshooting/postgresql-backup-restore-version-compatibility.mdx b/docs/troubleshooting/postgresql-backup-restore-version-compatibility.mdx
new file mode 100644
index 000000000..5149cc875
--- /dev/null
+++ b/docs/troubleshooting/postgresql-backup-restore-version-compatibility.mdx
@@ -0,0 +1,206 @@
+---
+sidebar_position: 3
+title: "PostgreSQL Backup Restore Version Compatibility Issue"
+description: "Solution for pg_restore version compatibility errors when restoring backups between different PostgreSQL versions"
+date: "2024-08-20"
+category: "dependency"
+tags: ["postgresql", "backup", "restore", "rds", "version-compatibility"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# PostgreSQL Backup Restore Version Compatibility Issue
+
+**Date:** August 20, 2024
+**Category:** Dependency
+**Tags:** PostgreSQL, Backup, Restore, RDS, Version Compatibility
+
+## Problem Description
+
+**Context:** User attempts to restore PostgreSQL backups created with newer versions of pg_dump to older PostgreSQL RDS instances, encountering version compatibility issues.
+
+**Observed Symptoms:**
+
+- Error: `pg_restore: error: unsupported version (1.15) in file header`
+- Backup created with PostgreSQL 16.1 and pg_dump 16.3
+- Target RDS instance running PostgreSQL 14.11
+- Restore operation fails due to format incompatibility
+
+**Relevant Configuration:**
+
+- Source PostgreSQL version: 16.1
+- pg_dump version: 16.3
+- Target RDS PostgreSQL version: 14.11
+- Backup format: Custom format (likely)
+
+**Error Conditions:**
+
+- Error occurs when attempting to restore backup files
+- Happens when the pg_dump version is newer than the pg_restore version
+- Affects custom format backups created with newer PostgreSQL versions
+
+## Detailed Solution
+
+
+
+
+
+PostgreSQL backup compatibility follows these rules:
+
+- **pg_dump**: Can create backups from same or older version servers, but not newer ones
+- **pg_restore**: Can restore backups created by same or older versions of pg_dump
+- **Format versions**: Each PostgreSQL version may introduce new backup format versions
+
+In your case:
+
+- pg_dump 16.3 created a backup with format version 1.15
+- pg_restore from PostgreSQL 14.11 doesn't support format version 1.15
+
+
+
+
+
+### Option 1: Use Plain Text Format
+
+Create a new backup using plain text format which is more compatible:
+
+```bash
+# Create plain text backup (compatible across versions)
+pg_dump -h source-host -U username -d database_name -f backup.sql
+
+# Restore using psql (works with any PostgreSQL version)
+psql -h rds-endpoint -U username -d target_database -f backup.sql
+```
+
+### Option 2: Use a Newer pg_restore
+
+pg_dump cannot dump from a server newer than its own major version, so the version-14 client tools cannot recreate this backup. Instead, restore the existing dump with pg_restore from PostgreSQL 16, which understands format version 1.15 and connects to the older 14.11 server over SQL:
+
+```bash
+# Install PostgreSQL 16 client tools
+sudo apt-get install postgresql-client-16
+
+# Restore the existing custom-format dump into the PostgreSQL 14 target
+/usr/lib/postgresql/16/bin/pg_restore -h rds-endpoint -U username -d target_database --no-owner backup.dump
+```
+
+Note that the restore can still fail if the dump relies on features that do not exist in PostgreSQL 14.
+
+
+
+
+
+### Planning RDS Upgrade to PostgreSQL 16
+
+**Prerequisites:**
+
+1. Review PostgreSQL 16 compatibility with your applications
+2. Test upgrade in a staging environment
+3. Plan for downtime (major version upgrades require restart)
+
+**Upgrade Steps:**
+
+1. **Create RDS Snapshot:**
+
+```bash
+aws rds create-db-snapshot \
+  --db-instance-identifier your-rds-instance \
+  --db-snapshot-identifier pre-upgrade-snapshot-$(date +%Y%m%d)
+```
+
+2. **Modify RDS Instance:**
+
+```bash
+aws rds modify-db-instance \
+  --db-instance-identifier your-rds-instance \
+  --engine-version 16.1 \
+  --apply-immediately
+```
+
+3. **Monitor Upgrade Progress:**
+
+```bash
+aws rds describe-db-instances \
+  --db-instance-identifier your-rds-instance \
+  --query 'DBInstances[0].DBInstanceStatus'
+```
+
+**Considerations:**
+
+- Upgrade path: 14.11 → 15.x → 16.1 (may require intermediate versions)
+- Downtime: 10-30 minutes depending on database size
+- Cost: No additional cost for major version upgrades
+
+
+
+
+
+### Updating PostgreSQL Version in SleakOps
+
+If using SleakOps managed PostgreSQL:
+
+1. **Access Database Configuration:**
+
+   - Go to your project dashboard
+   - Navigate to **Dependencies** → **Databases**
+   - Select your PostgreSQL instance
+
+2. **Update Version:**
+
+```yaml
+# In your sleakops.yaml or database configuration
+dependencies:
+  databases:
+    - name: main-db
+      type: postgresql
+      version: "16.1" # Update from 14.11
+      instance_class: db.t3.micro
+      storage: 20
+```
+
+3. **Apply Changes:**
+
+```bash
+sleakops deploy --environment production
+```
+
+**Note:** This will trigger a database upgrade with associated downtime.
+
+
+
+
+
+### Backup Strategy Best Practices
+
+1.
**Version-Aware Backups:** + +```bash +# Always specify format and version compatibility +pg_dump --version # Check your pg_dump version +pg_dump -Fc --no-owner --no-privileges -f backup_$(date +%Y%m%d).dump database_name +``` + +2. **Multiple Format Backups:** + +```bash +# Create both custom and plain text backups +pg_dump -Fc -f backup_custom.dump database_name +pg_dump -f backup_plain.sql database_name +``` + +3. **Regular Compatibility Testing:** + +- Test restore procedures in staging environments +- Maintain compatible pg_dump versions for different target environments +- Document backup and restore procedures + +4. **Version Management:** + +- Keep PostgreSQL versions synchronized between environments when possible +- Plan regular upgrade cycles to avoid large version gaps +- Use containerized pg_dump tools for consistent versions + + + +--- + +_This FAQ was automatically generated on August 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/production-environment-setup-guide.mdx b/docs/troubleshooting/production-environment-setup-guide.mdx new file mode 100644 index 000000000..711c8e0ab --- /dev/null +++ b/docs/troubleshooting/production-environment-setup-guide.mdx @@ -0,0 +1,330 @@ +--- +sidebar_position: 3 +title: "Production Environment Setup Guide" +description: "Step-by-step guide to create and configure production environments in SleakOps" +date: "2024-12-19" +category: "project" +tags: ["production", "environment", "deployment", "setup", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Environment Setup Guide + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Production, Environment, Deployment, Setup, Configuration + +## Problem Description + +**Context:** Users need to create a production environment in SleakOps to deploy applications from the main branch, separate from development environments that typically use develop 
branches. + +**Observed Symptoms:** + +- Need to create a new production environment from scratch +- Requirement to separate production deployments from development +- Need to migrate or replicate existing development configurations +- Database migration requirements for production data + +**Relevant Configuration:** + +- Environment type: Production (root environment) +- Branch strategy: main branch for production vs develop for staging +- Database: PostgreSQL with potential data migration needs +- Domain configuration: Production-specific domains + +**Error Conditions:** + +- Lack of clear step-by-step process for production setup +- Uncertainty about resource replication from development +- Database migration complexity + +## Detailed Solution + + + +### Creating the Production Environment + +1. **Navigate to Environments** + + - Go to your SleakOps dashboard + - Click on **Environments** in the sidebar + +2. **Create New Environment** + + - Click **"Create Environment"** + - Set the name as `prod` or `production` + - **Important**: Mark this environment as **"Root Environment"** + +3. **Configure Domain Settings** + - Set up your production domains + - Configure SSL certificates if needed + - Ensure DNS settings point to your production infrastructure + +### Environment Configuration Example + +```yaml +name: prod +type: root +domain_config: + primary_domain: "myapp.com" + ssl_enabled: true + certificate_type: "letsencrypt" +branch_strategy: + default_branch: "main" + auto_deploy: true +``` + + + + + +### Resource Replication Process + +You need to recreate all the resources from your development environment: + +#### Projects + +1. **Create New Project** + - Use the same configuration as development + - Change the branch from `develop` to `main` + - Update environment variables for production + +#### Dependencies + +1. 
**Database Dependencies** + + - Create PostgreSQL instances for production + - Configure with production-appropriate sizing + - Set up backup and monitoring + +2. **Cache Dependencies** + - Redis or other caching solutions + - Configure with production specifications + +#### Executions (Workloads) + +1. **Web Services** + + - Replicate web service configurations + - Adjust resource limits for production load + - Configure health checks and monitoring + +2. **Worker Services** + - Background job processors + - Scheduled tasks and cron jobs + +#### Variable Groups + +1. **Environment Variables** + - Create production-specific variable groups + - Update API keys, database connections + - Set production feature flags + +### Production Configuration Checklist + +- [ ] Environment created and marked as root +- [ ] Projects configured with main branch +- [ ] Database dependencies created +- [ ] Cache dependencies configured +- [ ] Web services replicated +- [ ] Worker services set up +- [ ] Production variable groups created +- [ ] Domain and SSL configured +- [ ] Monitoring and alerting enabled + + + + + +### Database Dump Import Process + +If you need to migrate data from development to production: + +#### Prerequisites + +1. **Create Database Dump** + + ```bash + pg_dump -h dev-db-host -U username -d database_name > production_dump.sql + ``` + +2. **Prepare Production Database** + - Ensure the production PostgreSQL instance is running + - Verify connection credentials + - Ensure sufficient storage space + +#### Import Process + +1. **Access Database Import Feature** + + - Go to your PostgreSQL dependency in production environment + - Look for "Import Database" or "Restore" option + +2. 
**Upload and Import** + - Upload your dump file + - Execute the import process + - Monitor the import progress + +#### Post-Import Verification + +```sql +-- Verify data integrity +SELECT COUNT(*) FROM your_main_table; + +-- Check user accounts +SELECT COUNT(*) FROM users; + +-- Verify application-specific data +SELECT * FROM configuration_table LIMIT 5; +``` + +### Important Considerations + +- **Data Sanitization**: Remove or anonymize sensitive development data +- **Production Secrets**: Update all API keys and passwords +- **Feature Flags**: Disable development-only features +- **Email Configuration**: Ensure production email settings + + + + + +### Pre-Production Checklist + +#### Technical Verification + +- [ ] All services are healthy and running +- [ ] Database connections are working +- [ ] External API integrations are configured +- [ ] SSL certificates are valid +- [ ] Domain routing is correct +- [ ] Monitoring and logging are active + +#### Security Checklist + +- [ ] Production secrets are updated +- [ ] Development debug modes are disabled +- [ ] Access controls are properly configured +- [ ] Backup procedures are in place +- [ ] Security scanning is complete + +### Deployment Process + +1. **Initial Deployment** + + ```bash + # Ensure main branch is ready + git checkout main + git pull origin main + + # Deploy through SleakOps + # This will automatically trigger based on your configuration + ``` + +2. **Smoke Testing** + + - Test critical user journeys + - Verify database connectivity + - Check external service integrations + - Validate monitoring and alerting + +3. 
**Go-Live Preparation** + - Schedule maintenance window if needed + - Prepare rollback procedures + - Set up real-time monitoring + - Notify stakeholders of go-live timeline + +### Monitoring Production + +```yaml +# Example monitoring configuration +monitoring: + health_checks: + - endpoint: "/health" + interval: "30s" + timeout: "5s" + alerts: + - type: "response_time" + threshold: "2s" + - type: "error_rate" + threshold: "5%" + logging: + level: "info" + retention: "30d" +``` + + + + + +### Resource Sizing Guidelines + +#### Database + +- **Development**: t3.micro or t3.small +- **Production**: t3.medium or larger based on load +- Enable automated backups +- Configure read replicas if needed + +#### Application Services + +- **CPU**: Start with 2x development environment CPU allocation +- **Memory**: Start with 2x development environment memory +- **Replicas**: Minimum 2 for high availability +- **Storage**: Use production-grade storage classes + +#### Load Balancer Configuration + +```yaml +# Production load balancer settings +ingress: + annotations: + nginx.ingress.kubernetes.io/rate-limit: "100" + nginx.ingress.kubernetes.io/rate-limit-rps: "10" + nginx.ingress.kubernetes.io/ssl-redirect: "true" +``` + +### Security Considerations + +1. **Environment Variables Security**: + + - Use SleakOps Variable Groups for sensitive data + - Never store secrets in plain text + - Rotate API keys and passwords regularly + +2. **Network Security**: + + - Enable VPC private networking + - Configure security groups appropriately + - Use SSL/TLS for all communications + +3. **Access Control**: + - Limit production environment access + - Use role-based access control (RBAC) + - Enable audit logging + +### Monitoring and Maintenance + +1. **Set up monitoring alerts**: + + - CPU and memory usage + - Application error rates + - Database performance metrics + +2. 
**Backup Strategy**: + + - Automated daily database backups + - Application configuration backups + - Disaster recovery procedures + +3. **Update Procedures**: + - Staging deployments before production + - Blue-green or rolling deployment strategies + - Rollback procedures for failed deployments + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/production-environment-setup.mdx b/docs/troubleshooting/production-environment-setup.mdx new file mode 100644 index 000000000..317ea19b2 --- /dev/null +++ b/docs/troubleshooting/production-environment-setup.mdx @@ -0,0 +1,300 @@ +--- +sidebar_position: 3 +title: "Production Environment Setup Guide" +description: "Complete guide for setting up production environments with database migration and custom domains" +date: "2024-01-15" +category: "project" +tags: + ["production", "deployment", "database", "migration", "domain", "environment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Environment Setup Guide + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Production, Deployment, Database, Migration, Domain, Environment + +## Problem Description + +**Context:** User needs to set up a production environment by replicating staging configuration, including dependencies and variable groups, while managing database migration and custom domain configuration. 
+ +**Observed Symptoms:** + +- Need to replicate staging environment configuration to production +- Requirement to modify RAILS_ENV variable from staging to production +- Need for database dump/restore procedures for off-hours deployment +- Custom domain configuration needed for production access + +**Relevant Configuration:** + +- Source environment: Staging +- Target environment: Production +- Framework: Rails application +- Custom domain: hub.supra.social +- Database: Requires dump/restore capability + +**Error Conditions:** + +- Manual environment setup required +- Database migration needed during off-hours +- Domain configuration for production access + +## Detailed Solution + + + +To replicate your staging environment configuration: + +1. **Access SleakOps Dashboard** + + - Navigate to your project + - Go to **Environments** section + +2. **Create Production Environment** + + ```bash + # Using SleakOps CLI + sleakops env create production --from-template staging + ``` + +3. **Copy Dependencies** + + - In the dashboard, go to **Dependencies** tab + - Select all dependencies from staging + - Click **Clone to Environment** → Select **Production** + +4. 
**Copy Variable Groups**
+   - Navigate to **Variables** → **Groups**
+   - Select staging variable group
+   - Click **Duplicate** → Name it for production
+   - **Important**: Change `RAILS_ENV` from `staging` to `production`
+
+
+
+
+
+### Creating Database Dump
+
+```bash
+# Connect to staging database pod
+kubectl exec -it <staging-db-pod> -- bash

+# Create dump (PostgreSQL example)
+pg_dump -U <db-user> -h localhost <database-name> > /tmp/production_dump.sql
+
+# Copy dump to local machine
+kubectl cp <staging-db-pod>:/tmp/production_dump.sql ./production_dump.sql
+```
+
+### Restoring to Production Database
+
+```bash
+# Copy dump to production database pod
+kubectl cp ./production_dump.sql <production-db-pod>:/tmp/production_dump.sql
+
+# Connect to production database pod
+kubectl exec -it <production-db-pod> -- bash
+
+# Restore database
+psql -U <db-user> -h localhost <database-name> < /tmp/production_dump.sql
+```
+
+### Automated Script for Off-Hours Deployment
+
+```bash
+#!/bin/bash
+# production-deploy.sh
+
+set -e
+
+echo "Starting production database migration at $(date)"
+
+# Step 1: Create backup of current production DB
+echo "Creating production backup..."
+kubectl exec <production-db-pod> -- pg_dump -U <db-user> <database-name> > prod_backup_$(date +%Y%m%d_%H%M%S).sql
+
+# Step 2: Apply new dump
+echo "Applying new database dump..."
+kubectl cp ./production_dump.sql <production-db-pod>:/tmp/
+kubectl exec <production-db-pod> -- psql -U <db-user> <database-name> -f /tmp/production_dump.sql
+
+# Step 3: Restart application pods
+echo "Restarting application..."
+kubectl rollout restart deployment/<deployment-name> -n <namespace>
+
+echo "Migration completed at $(date)"
+```
+
+
+
+
+
+### Configure Custom Domain in SleakOps
+
+1. **In SleakOps Dashboard:**
+
+   - Go to **Production Environment**
+   - Navigate to **Networking** → **Domains**
+   - Click **Add Custom Domain**
+   - Enter: `hub.supra.social`
+
+2. **DNS Configuration:**
+
+   ```dns
+   # Add CNAME record in your DNS provider
+   hub.supra.social. CNAME <load-balancer-dns-name>
+   ```
+
+3.
**SSL Certificate:**
+   - SleakOps will automatically provision an SSL certificate
+   - Wait for certificate validation (usually 5-10 minutes)
+
+### Verify Domain Configuration
+
+```bash
+# Check DNS resolution
+nslookup hub.supra.social
+
+# Test HTTPS access
+curl -I https://hub.supra.social
+
+# Check certificate
+openssl s_client -connect hub.supra.social:443 -servername hub.supra.social
+```
+
+
+
+
+
+### Pre-Deployment (Before 7 PM)
+
+- [ ] Verify staging environment is stable
+- [ ] Create database dump from staging
+- [ ] Test dump restoration in development
+- [ ] Prepare deployment script
+- [ ] Notify stakeholders of maintenance window
+
+### During Deployment (7 PM - Off Hours)
+
+- [ ] Create production database backup
+- [ ] Apply new database dump
+- [ ] Update environment variables (RAILS_ENV=production)
+- [ ] Deploy application with new configuration
+- [ ] Verify domain accessibility
+- [ ] Run smoke tests
+
+### Post-Deployment
+
+- [ ] Monitor application logs
+- [ ] Verify all features working
+- [ ] Check performance metrics
+- [ ] Notify team of successful deployment
+
+### Rollback Plan (If Issues Occur)
+
+```bash
+# Restore previous database backup
+kubectl cp prod_backup_<timestamp>.sql <production-db-pod>:/tmp/
+kubectl exec <production-db-pod> -- psql -U <db-user> <database-name> -f /tmp/prod_backup_<timestamp>.sql
+
+# Revert to previous application version
+kubectl rollout undo deployment/<deployment-name> -n <namespace>
+```
+
+
+
+
+
+### Database Connection Issues
+
+```bash
+# Check database pod status
+kubectl get pods -l app=database
+
+# Check database logs
+kubectl logs <db-pod-name>
+
+# Test database connectivity
+kubectl exec -it <app-pod-name> -- rails db:migrate:status
+```
+
+### Domain Not Accessible
+
+1. **Check DNS propagation:**
+
+   ```bash
+   dig hub.supra.social
+   ```
+
+2. **Verify ingress configuration:**
+
+   ```bash
+   kubectl get ingress -n <namespace>
+   kubectl describe ingress <ingress-name>
+   ```
+
+3. **Check certificate status:**
+
+   ```bash
+   kubectl get certificates -n <namespace>
+   kubectl describe certificate <certificate-name>
+   ```
+
+4.
**Verify load balancer status:**

   ```bash
   kubectl get services -n <namespace>
   kubectl describe service <service-name>
   ```

### Performance Issues

1. **Monitor resource usage:**

   ```bash
   kubectl top pods -n <namespace>
   kubectl top nodes
   ```

2. **Scale services if needed:**

   - Access SleakOps dashboard
   - Navigate to workloads
   - Adjust replica count for high-traffic services

3. **Database performance:**
   - Monitor connection pool usage
   - Check for slow queries
   - Consider read replicas for production load

### Best Practices for Production

1. **Security:**

   - Rotate all credentials after setup
   - Enable SSL/TLS for all communications
   - Implement proper access controls

2. **Monitoring:**

   - Set up alerts for critical metrics
   - Monitor application logs regularly
   - Configure uptime monitoring

3. **Backup Strategy:**

   - Schedule regular database backups
   - Test restore procedures
   - Document recovery processes

4. **Deployment Strategy:**
   - Use blue-green or rolling deployments
   - Test all changes in staging first
   - Maintain rollback procedures

---

_This FAQ was automatically generated on January 15, 2024 based on a real user query._

diff --git a/docs/troubleshooting/production-site-down-pods-not-starting.mdx b/docs/troubleshooting/production-site-down-pods-not-starting.mdx
new file mode 100644
index 000000000..4f304c457
--- /dev/null
+++ b/docs/troubleshooting/production-site-down-pods-not-starting.mdx
@@ -0,0 +1,244 @@
---
sidebar_position: 1
title: "Production Site Down - Pods Not Starting"
description: "Troubleshooting guide for production outages when pods fail to start"
date: "2024-01-15"
category: "workload"
tags: ["production", "pods", "outage", "troubleshooting", "kubernetes"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Production Site Down - Pods Not Starting

**Date:** January 15, 2024
**Category:** Workload
**Tags:** Production, Pods, Outage, 
Troubleshooting, Kubernetes

## Problem Description

**Context:** The production website is experiencing a complete outage, with pods failing to start or restart properly in the Kubernetes cluster.

**Observed Symptoms:**

- Production site is completely down
- Pods are not starting or coming online
- Application services are unavailable
- Users cannot access the production environment

**Relevant Configuration:**

- Environment: Production
- Platform: Kubernetes cluster
- Issue scope: Complete site outage
- Pod status: Failed to start

**Error Conditions:**

- Occurs in production environment
- Affects all or most application pods
- Results in complete service unavailability
- May indicate cluster-wide issues

## Detailed Solution

**Step 1: Check Pod Status**

```bash
# Check all pods in the namespace (replace the <...> placeholders with your values)
kubectl get pods -n <namespace>

# Get detailed pod information
kubectl describe pods -n <namespace>

# Check pod logs
kubectl logs <pod-name> -n <namespace> --previous
```

**Step 2: Check Node Status**

```bash
# Verify node health
kubectl get nodes

# Check node resources
kubectl top nodes

# Describe problematic nodes
kubectl describe node <node-name>
```

**Resource Exhaustion:**

- Check if nodes have sufficient CPU/Memory
- Verify storage capacity
- Look for resource quotas being exceeded

**Image Pull Issues:**

```bash
# Check if images can be pulled
kubectl describe pod <pod-name> | grep -i "image"

# Verify image registry connectivity
kubectl get events --sort-by=.metadata.creationTimestamp
```

**Configuration Issues:**

- Check ConfigMaps and Secrets
- Verify environment variables
- Validate service account permissions

**Network Problems:**

- Test cluster DNS resolution
- Check service connectivity
- Verify ingress controller status

**1. 
Check Cluster Events**

```bash
# Get recent cluster events
kubectl get events --sort-by=.metadata.creationTimestamp -A

# Filter for error events
kubectl get events --field-selector type=Warning -A
```

**2. Verify Critical System Pods**

```bash
# Check kube-system pods
kubectl get pods -n kube-system

# Check ingress controller
kubectl get pods -n ingress-nginx

# Check monitoring stack
kubectl get pods -n monitoring
```

**3. Check Resource Availability**

```bash
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check persistent volumes
kubectl get pv,pvc -A
```

**Immediate Recovery Steps:**

1. **Restart Deployments**

```bash
# Restart specific deployment (replace the <...> placeholders with your values)
kubectl rollout restart deployment/<deployment-name> -n <namespace>

# Restart all deployments in the namespace
kubectl get deployments -n <namespace> -o name | xargs -I {} kubectl rollout restart {} -n <namespace>
```

2. **Scale Resources if Needed**

```bash
# Scale up deployment
kubectl scale deployment/<deployment-name> --replicas=3 -n <namespace>

# Add more nodes if using cluster autoscaler
kubectl get nodes --show-labels
```

3. **Clear Failed Pods**

```bash
# Delete failed pods to trigger recreation
kubectl delete pods --field-selector=status.phase=Failed -n <namespace>

# Delete pods stuck in pending state
kubectl delete pods --field-selector=status.phase=Pending -n <namespace>
```

**Set Up Monitoring Alerts:**

1. **Pod Health Monitoring**

```yaml
# Example Prometheus alert rule
- alert: PodsNotReady
  expr: kube_pod_status_ready{condition="false"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pods not ready in {{ $labels.namespace }}"
```

2. 
**Resource Monitoring** + +```yaml +- alert: NodeResourceExhaustion + expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1 + for: 2m + labels: + severity: warning +``` + +**Best Practices:** + +- Implement health checks and readiness probes +- Set appropriate resource requests and limits +- Use horizontal pod autoscaling +- Maintain staging environment for testing +- Regular backup of critical configurations + + + + + +**Escalate immediately if:** + +- Multiple nodes are down +- Cluster control plane is unresponsive +- Data corruption is suspected +- Security breach is detected + +**Before escalating, gather:** + +- Cluster status output +- Recent deployment history +- Error logs and events +- Resource utilization metrics +- Timeline of when the issue started + +**Emergency Contacts:** + +- Platform team for cluster-level issues +- Infrastructure team for node/network problems +- Application team for service-specific issues + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/production-site-down-troubleshooting.mdx b/docs/troubleshooting/production-site-down-troubleshooting.mdx new file mode 100644 index 000000000..a7c645041 --- /dev/null +++ b/docs/troubleshooting/production-site-down-troubleshooting.mdx @@ -0,0 +1,589 @@ +--- +sidebar_position: 1 +title: "Production Site Down - Emergency Troubleshooting" +description: "Emergency procedures for handling production site outages and deployment issues" +date: "2025-01-27" +category: "general" +tags: ["production", "outage", "deployment", "emergency", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Production Site Down - Emergency Troubleshooting + +**Date:** January 27, 2025 +**Category:** General +**Tags:** Production, Outage, Deployment, Emergency, Troubleshooting + +## Problem Description + +**Context:** Production site experiencing 
downtime requiring immediate attention and emergency deployment procedures through the SleakOps platform.

**Observed Symptoms:**

- Complete site outage (reported as "Se nos cayó el sitio", i.e. "our site went down")
- Site showing an under-construction status
- Urgent production environment issues
- Need for emergency deployment coordination
- Multiple team members required for resolution

**Relevant Configuration:**

- Environment: Production
- Platform: SleakOps with Bitbucket integration
- Deployment method: Bitbucket pipelines
- Team coordination: Required for production changes

**Error Conditions:**

- Production site completely down
- Emergency deployment needed
- Multiple stakeholders involved in resolution
- Time-sensitive restoration required

## Detailed Solution

When facing a production outage:

1. **Verify the scope of the outage**:

   - Check if it's a complete site failure or partial functionality
   - Verify DNS resolution and basic connectivity
   - Check monitoring dashboards for alerts

2. **Gather initial information**:

   - Recent deployments or changes
   - Error logs from applications
   - Infrastructure status (pods, services, ingress)

3. **Establish communication**:
   - Create an emergency communication channel
   - Notify all relevant stakeholders
   - Document the timeline of events

For emergency deployments in SleakOps:

1. **Coordinate with team**:

   ```bash
   # Before any production deployment, ensure team coordination
   # Use communication channels to announce deployment
   ```

2. **Deploy through SleakOps**:

   - Access SleakOps dashboard
   - Navigate to the affected project
   - Select the production environment
   - Choose "Deploy" with the last known good version

3. 
**Alternative deployment via Bitbucket**: + + ```yaml + # If deploying through Bitbucket pipelines + # Ensure the pipeline targets the correct environment + pipelines: + branches: + main: + - step: + name: Emergency Deploy to Production + script: + - echo "Emergency deployment initiated" + - # Your deployment commands here + ``` + +4. **Monitor deployment progress**: + - Watch deployment logs in real-time + - Monitor application health checks + - Verify service restoration + + + + + +If the emergency deployment doesn't resolve the issue: + +1. **Immediate rollback**: + + - In SleakOps: Use "Rollback" feature to previous stable version + - Document the rollback decision and timing + +2. **Verify rollback success**: + + ```bash + # Check application status + kubectl get pods -n production + kubectl get services -n production + + # Verify application health + curl -I https://your-production-site.com + ``` + +3. **Post-rollback actions**: + - Confirm site functionality + - Notify stakeholders of temporary resolution + - Begin root cause analysis + + + + + +Effective coordination during production incidents: + +1. **Establish incident commander**: + + - Designate one person to coordinate efforts + - All deployment decisions go through this person + - Maintain clear communication channels + +2. **Role assignments**: + + - **Incident Commander**: Overall coordination + - **Technical Lead**: Hands-on troubleshooting + - **Communications**: Stakeholder updates + - **Documentation**: Timeline and actions taken + +3. **Communication protocols**: + + - Use video calls for real-time coordination + - Maintain written updates in shared channels + - Set regular update intervals (every 15-30 minutes) + +4. **Decision making**: + - Quick consensus on deployment actions + - Clear go/no-go decisions + - Document all major decisions with timestamps + + + + + +After resolving the production outage: + +1. 
**Immediate verification**: + + - Comprehensive functionality testing + - Performance monitoring for 1-2 hours + - User acceptance verification + +2. **Documentation**: + + - Complete incident timeline + - Root cause analysis + - Actions taken and their outcomes + - Lessons learned + +3. **Follow-up improvements**: + + - Review deployment procedures + - Enhance monitoring and alerting + - Update emergency response procedures + - Schedule post-mortem meeting + +4. **Stakeholder communication**: + - Send resolution notification + - Provide brief summary of cause and fix + - Share timeline for detailed post-mortem + + + + + +To minimize future production outages: + +1. **Deployment best practices**: + + - Always use staging environment first + - Implement blue-green deployments + - Maintain rollback procedures + - Use feature flags for risky changes + +2. **Monitoring and alerting**: + + ```yaml + # Example monitoring configuration + alerts: + - name: site-down + condition: http_status != 200 + duration: 1m + severity: critical + notifications: + - slack + - email + ``` + +3. **Emergency preparedness**: + + - Maintain updated emergency contact list + - Regular disaster recovery drills + - Pre-defined communication channels + - Emergency deployment procedures documentation + +4. **Team readiness**: + - Cross-training on deployment procedures + - Access permissions for key team members + - Emergency escalation procedures + - Regular review of incident response plans + + + + + +When site outages occur, perform these infrastructure checks: + +1. **Cloud provider status**: + +```bash +# Check AWS service health (if using AWS) +aws sts get-caller-identity +aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' + +# Verify load balancer status +aws elbv2 describe-load-balancers +aws elbv2 describe-target-health --target-group-arn +``` + +2. 
**Kubernetes cluster health**:

```bash
# Check cluster nodes
kubectl get nodes
kubectl describe nodes | grep -i "ready\|condition"

# Verify core services
kubectl get pods -n kube-system
kubectl get componentstatuses

# Check cluster events
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -20
```

3. **Network connectivity**:

```bash
# Test DNS resolution
dig your-domain.com
nslookup your-domain.com

# Check SSL certificates
echo | openssl s_client -connect your-domain.com:443 2>/dev/null | openssl x509 -noout -dates

# Test connectivity from different locations
curl -I https://your-domain.com
```

4. **Database connectivity**:

```bash
# Check database status (PostgreSQL example; replace the <...> placeholders with your values)
kubectl exec -it <database-pod> -- pg_isready
kubectl exec -it <database-pod> -- psql -U <db-user> -c "SELECT version();"

# Check database connections
kubectl exec -it <app-pod> -- nc -zv database-service 5432
```

Debug application-specific issues during outages:

1. **Application pod analysis**:

```bash
# Check pod status and events
kubectl get pods -n production
kubectl describe pod <pod-name> -n production

# Examine application logs
kubectl logs <pod-name> -n production --tail=100
kubectl logs <pod-name> -n production --previous # Previous container logs
```

2. **Configuration verification**:

```bash
# Check ConfigMaps and Secrets
kubectl get configmaps -n production
kubectl get secrets -n production

# Verify environment variables
kubectl exec -it <pod-name> -n production -- env | sort
```

3. **Resource utilization**:

```bash
# Check resource usage
kubectl top pods -n production
kubectl top nodes

# Check resource quotas
kubectl describe quota -n production
kubectl describe limitrange -n production
```

4. 
**Service connectivity**: + +```bash +# Test internal service connections +kubectl exec -it -n production -- wget -qO- http://internal-service:port/health + +# Check service endpoints +kubectl get endpoints -n production +kubectl describe service -n production +``` + + + + + +Use these communication templates during outages: + +**Initial incident notification:** + +``` +🚨 PRODUCTION INCIDENT ALERT 🚨 + +Status: Site down detected +Time: [TIMESTAMP] +Impact: Complete site outage +Team: Incident response team activated +ETA for update: 15 minutes + +Incident Commander: [NAME] +Next update: [TIME] +``` + +**Progress updates:** + +``` +🔄 INCIDENT UPDATE + +Status: Investigating +Time: [TIMESTAMP] +Progress: [CURRENT ACTIONS] +Impact: [CURRENT IMPACT ASSESSMENT] + +Actions taken: +- [ACTION 1] +- [ACTION 2] + +Next steps: +- [NEXT ACTION] + +Next update: [TIME] +``` + +**Resolution notification:** + +``` +✅ INCIDENT RESOLVED + +Status: Site restored +Time: [TIMESTAMP] +Duration: [TOTAL DOWNTIME] +Root cause: [BRIEF CAUSE] + +Resolution: +- [ACTIONS THAT FIXED THE ISSUE] + +Follow-up: +- Post-mortem scheduled for [DATE/TIME] +- Preventive measures to be implemented +``` + +**Escalation procedures:** + +```bash +# Emergency contact hierarchy +1. Incident Commander +2. Technical Lead/DevOps Lead +3. Engineering Manager +4. CTO/Technical Director +5. 
Business Stakeholders + +# Communication channels +- Primary: Slack #incidents +- Secondary: WhatsApp group +- External: Customer status page +``` + + + + + +Use this comprehensive checklist to verify full system recovery: + +**Application Level:** + +```bash +□ All pods are running and ready +□ Health check endpoints responding (200 OK) +□ Application logs show normal operation +□ No error spikes in monitoring dashboards +□ Resource utilization within normal ranges +``` + +**Infrastructure Level:** + +```bash +□ All cluster nodes are ready +□ Load balancer is healthy and routing traffic +□ DNS resolution is working correctly +□ SSL certificates are valid and not expired +□ CDN (if applicable) is serving content +``` + +**Data Layer:** + +```bash +□ Database connections are stable +□ Database queries are performing normally +□ Backup systems are operational +□ Data integrity checks pass +□ Cache layers are functioning +``` + +**User Experience:** + +```bash +□ Website loads completely within acceptable time +□ All critical user flows work end-to-end +□ Payment systems are operational (if applicable) +□ User authentication/login works +□ Mobile app connectivity (if applicable) +``` + +**Business Functions:** + +```bash +□ Order processing works (if e-commerce) +□ Email notifications are sending +□ Third-party integrations are working +□ Monitoring and alerting systems are operational +□ Backup and disaster recovery systems are verified +``` + +**Sign-off checklist:** + +```bash +□ Technical team confirms full restoration +□ QA team verifies critical flows +□ Business stakeholders approve go-live +□ Customer support team notified of resolution +□ Documentation updated with incident details +``` + + + + + +Set up automated systems to handle future outages: + +1. 
**Health check automation**: + +```yaml +# Kubernetes deployment with robust health checks +spec: + template: + spec: + containers: + - name: app + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 +``` + +2. **Auto-scaling configuration**: + +```yaml +# Horizontal Pod Autoscaler +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: app-deployment + minReplicas: 3 + maxReplicas: 20 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 + - type: Resource + resource: + name: memory + target: + type: Utilization + averageUtilization: 80 +``` + +3. **Circuit breaker implementation**: + +```yaml +# Example circuit breaker configuration +circuit_breaker: + failure_threshold: 5 + recovery_timeout: 30s + success_threshold: 3 + timeout: 10s +``` + +4. 
**Automated alerting**: + +```yaml +# Prometheus alerting rules +groups: + - name: production-alerts + rules: + - alert: SiteDown + expr: probe_success{job="blackbox"} == 0 + for: 1m + labels: + severity: critical + annotations: + summary: "Production site is down" + description: "Site has been down for more than 1 minute" + + - alert: HighErrorRate + expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1 + for: 2m + labels: + severity: warning + annotations: + summary: "High error rate detected" +``` + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/project-branch-name-validation-error.mdx b/docs/troubleshooting/project-branch-name-validation-error.mdx new file mode 100644 index 000000000..c3ce0bd0c --- /dev/null +++ b/docs/troubleshooting/project-branch-name-validation-error.mdx @@ -0,0 +1,155 @@ +--- +sidebar_position: 3 +title: "Project Creation Error with Branch Names Containing Special Characters" +description: "Solution for branch name validation errors when creating projects from repositories" +date: "2025-02-19" +category: "project" +tags: ["project", "repository", "branch", "validation", "git"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Project Creation Error with Branch Names Containing Special Characters + +**Date:** February 19, 2025 +**Category:** Project +**Tags:** Project, Repository, Branch, Validation, Git + +## Problem Description + +**Context:** User attempts to create a new project in SleakOps based on a Git repository but encounters validation errors when the branch name contains special characters or specific naming patterns. 
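The failure mode is easy to reproduce with a plain pattern check. The validator below is a hypothetical stand-in for the platform's branch-name validation (not SleakOps code), shown only to illustrate why a slash trips a strict character allow-list even though Git itself accepts it:

```shell
#!/bin/sh
# Hypothetical validator: accepts only letters, digits, dot, underscore, hyphen.
# Git allows slashes in branch names, so feature/SITE-1 is a perfectly valid
# branch yet fails this kind of strict check.
is_valid() {
  case "$1" in
    *[!A-Za-z0-9._-]*) echo "rejected: $1" ;;
    *) echo "accepted: $1" ;;
  esac
}

is_valid "feature-SITE-1"   # accepted
is_valid "feature/SITE-1"   # rejected: "/" is outside the allowed set
```

Git happily creates `feature/SITE-1`, so the mismatch only surfaces once a stricter system validates the name.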
+ +**Observed Symptoms:** + +- Interface error when trying to create a project from a repository +- Error occurs specifically with branch names containing forward slashes +- Branch name format: `feature/SITE-1` triggers validation error +- Project creation process fails at the branch selection step + +**Relevant Configuration:** + +- Branch name: `feature/SITE-1` +- Action: Creating new project from repository +- Platform: SleakOps web interface +- Repository type: Git repository + +**Error Conditions:** + +- Error appears during project creation wizard +- Occurs when branch names contain forward slashes (`/`) +- Validation error prevents project creation completion +- Issue affects branch selection in the interface + +## Detailed Solution + + + +While the SleakOps team works on fixing the validation issue, you can use this temporary workaround: + +1. **Change the branch name temporarily** in your repository: + + - Rename `feature/SITE-1` to something like `feature-SITE-1` or `featureSITE1` + - Use only alphanumeric characters and hyphens/underscores + +2. **Create the project** with the renamed branch + +3. **Use manual build specification** when needed: + - During build process, you can manually specify the original branch name + - This allows you to work with your original branch structure + + + + + +When you need to build from your original branch: + +1. Go to your project's **Build** section +2. Select **Manual Build** option +3. In the build configuration, specify the branch name manually: + ``` + Branch: feature/SITE-1 + ``` +4. Proceed with the build process + +**Note:** This workaround works for manual builds but won't work for continuous integration until the platform fix is implemented. 
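The temporary rename described in the workaround above can be done entirely with standard Git commands. A sketch in a throwaway repository (the two `git push` lines are what you would run against a real remote; they are commented out here because the demo repository has none):

```shell
set -e

# Demo setup in a throwaway repository; the rename itself is the three commands below.
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "init"
git checkout -q -b feature/SITE-1

# The actual workaround: rename locally, then update the remote
git branch -m feature/SITE-1 feature-SITE-1
# git push -u origin feature-SITE-1         # publish the renamed branch
# git push origin --delete feature/SITE-1   # optional: remove the old remote name

git branch --show-current   # prints: feature-SITE-1
```

Once the platform fix lands, the branch can be renamed back the same way.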
+ + + + + +To avoid similar issues in the future, consider these branch naming conventions: + +**Recommended formats:** + +- `feature-SITE-1` +- `feature_SITE_1` +- `featureSITE1` +- `site1-feature` + +**Avoid these characters in branch names:** + +- Forward slashes (`/`) +- Special characters that might cause URL encoding issues +- Spaces or other whitespace characters + +**Git branch naming that works well with SleakOps:** + +```bash +# Good examples +git checkout -b feature-user-authentication +git checkout -b bugfix-login-error +git checkout -b hotfix-security-patch + +# Avoid these patterns +git checkout -b feature/user-authentication +git checkout -b bug/fix-login +``` + + + + + +**Current limitations:** + +- Continuous integration will not work with the workaround +- Automated builds will fail until the platform fix is deployed +- Manual intervention required for each build + +**When the fix is available:** + +- Branch names with forward slashes will be supported +- Continuous integration will resume normal operation +- No changes needed to existing project configuration + +**Timeline:** + +- The SleakOps team has added this issue to their backlog +- Fix priority depends on deployment schedule and user needs +- Contact support if you need this resolved before production deployment + + + + + +If you frequently use feature branches with slashes, consider this workflow: + +1. **Keep your development branches** with original naming +2. **Create deployment branches** with SleakOps-friendly names: + + ```bash + # Your development branch + git checkout feature/SITE-1 + + # Create deployment branch + git checkout -b deploy-SITE-1 + git push origin deploy-SITE-1 + ``` + +3. **Use deployment branches** for SleakOps projects +4. 
**Merge changes** from feature branches to deployment branches as needed + + + +--- + +_This FAQ was automatically generated on February 19, 2025 based on a real user query._ diff --git a/docs/troubleshooting/project-build-branch-configuration.mdx b/docs/troubleshooting/project-build-branch-configuration.mdx new file mode 100644 index 000000000..7d4380989 --- /dev/null +++ b/docs/troubleshooting/project-build-branch-configuration.mdx @@ -0,0 +1,160 @@ +--- +sidebar_position: 3 +title: "Build Branch Configuration in Projects" +description: "Understanding how SleakOps handles branch selection during build processes" +date: "2024-07-23" +category: "project" +tags: ["build", "branch", "git", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Build Branch Configuration in Projects + +**Date:** July 23, 2024 +**Category:** Project +**Tags:** Build, Branch, Git, Configuration + +## Problem Description + +**Context:** Users need to understand how SleakOps determines which Git branch to use during the build process when no specific branch is specified in the build configuration. 
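The selection rule in question (an explicitly requested branch wins, otherwise the project's default branch is used) can be sketched as a simple shell fallback. The function and variable names are illustrative, not SleakOps configuration keys:

```shell
# Hypothetical resolver mirroring the documented behavior:
# use the branch given for this build, else fall back to the project default.
resolve_branch() {
  build_branch="$1"
  project_default="$2"
  echo "${build_branch:-$project_default}"
}

resolve_branch "" "main"              # no branch specified -> main
resolve_branch "feature-login" "main" # explicit branch wins -> feature-login
```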
+ +**Observed Symptoms:** + +- Uncertainty about which branch is used when none is explicitly specified +- Builds may use unexpected branches +- Need clarification on default branch behavior + +**Relevant Configuration:** + +- Project default branch setting +- Build configuration without explicit branch specification +- Git repository configuration + +**Error Conditions:** + +- Builds using incorrect branch when expectations differ +- Confusion about branch selection logic + +## Detailed Solution + + + +When no specific branch is defined in the build configuration, SleakOps follows this behavior: + +**Default Branch Selection:** + +- The build process automatically uses the **default branch defined in the project settings** +- This ensures consistent behavior across all builds within the project +- The project's default branch takes precedence over repository default branches + + + + + +To configure or verify your project's default branch: + +1. Navigate to your **Project Settings** +2. Go to the **Source Code** or **Repository** section +3. Look for the **Default Branch** setting +4. Set it to your preferred branch (e.g., `main`, `master`, `develop`) + +```yaml +# Example project configuration +project: + name: "my-application" + repository: + url: "https://github.com/company/my-app" + default_branch: "main" # This branch will be used when none specified +``` + + + + + +You can override the default branch for specific builds: + +**Manual Builds:** + +1. Go to **Build** section in your project +2. Click **New Build** or **Trigger Build** +3. Specify the desired branch in the branch field + +**Automated Builds:** + +```yaml +# In your build configuration +build: + trigger: + branch: "feature/new-functionality" # Override default branch + steps: + - name: "build" + command: "npm run build" +``` + + + + + +**Recommended Practices:** + +1. **Set clear default branches**: Use `main` or `master` as your project default +2. 
**Document branch strategy**: Make sure team members understand the branching model +3. **Use environment-specific branches**: + - `main` for production builds + - `develop` for staging builds + - Feature branches for testing + +**Environment Configuration Example:** + +```yaml +environments: + production: + branch: "main" + auto_deploy: true + staging: + branch: "develop" + auto_deploy: true + development: + branch: "*" # Allow any branch + auto_deploy: false +``` + + + + + +**Common Issues and Solutions:** + +1. **Wrong branch being built:** + + - Check project default branch setting + - Verify build configuration doesn't have conflicting branch specifications + +2. **Build fails with branch not found:** + + - Ensure the specified branch exists in the repository + - Check branch name spelling and case sensitivity + +3. **Inconsistent build behavior:** + - Review all build configurations for explicit branch settings + - Standardize branch naming across environments + +**Verification Steps:** + +```bash +# Check current project configuration +sleakops project show + +# List available branches +sleakops project branches + +# View build history with branch information +sleakops builds list --project --show-branches +``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx b/docs/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx new file mode 100644 index 000000000..2af1ccc4d --- /dev/null +++ b/docs/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 15 +title: "CI Pipeline Not Triggering on Branch Push" +description: "Solution for deployments not executing when pushing to configured branch" +date: "2024-03-18" +category: "project" +tags: ["ci", "pipeline", "deployment", "github", "staging"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CI Pipeline Not Triggering on Branch Push + +**Date:** March 18, 2024 +**Category:** Project +**Tags:** CI, Pipeline, Deployment, GitHub, Staging + +## Problem Description + +**Context:** User has configured a project with a specific branch (staging) for deployments, but when pushing code to that branch, the CI/CD pipeline is not executing and deployments are not being triggered. + +**Observed Symptoms:** + +- Push to staging branch does not trigger deployments +- No visible pipeline execution in GitHub Actions +- Project appears to be correctly configured with the desired branch +- Expected automated deployment process is not working + +**Relevant Configuration:** + +- Target branch: `staging` +- Platform: GitHub with GitHub Actions +- CI/CD files location: `.github/workflows/` +- Project settings configured in SleakOps + +**Error Conditions:** + +- Pipeline fails to trigger after push to staging branch +- No error messages visible initially +- Deployment automation not working as expected + +## Detailed Solution + + + +First, ensure your project is configured with the correct branch: + +1. Go to your **SleakOps Project Dashboard** +2. Navigate to **Project Settings** +3. Check **Pipeline Settings** section +4. Verify that the **Target Branch** is set to `staging` +5. Save changes if any modifications were made + +The configuration should match the branch you're pushing to. + + + + + +Verify your GitHub Actions workflow files are properly structured: + +1. Check that files exist in `.github/workflows/` directory +2. Ensure workflow files have `.yml` or `.yaml` extension +3. 
Verify the workflow is triggered on the correct branch + +Example workflow configuration: + +```yaml +name: SleakOps Deployment + +on: + push: + branches: + - staging # Make sure this matches your configured branch + pull_request: + branches: + - staging + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Deploy to SleakOps + uses: sleakops/deploy-action@v1 + with: + api-key: ${{ secrets.SLEAKOPS_API_KEY }} + project-id: ${{ secrets.SLEAKOPS_PROJECT_ID }} +``` + + + + + +Here's a complete example workflow file that you can use as-is for SleakOps: + +```yaml +# .github/workflows/sleakops-deploy.yml +name: SleakOps CI/CD Pipeline + +on: + push: + branches: + - staging + - main + pull_request: + branches: + - staging + - main + +env: + SLEAKOPS_API_URL: https://api.sleakops.com + +jobs: + build-and-deploy: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v3 + + - name: Setup Node.js + uses: actions/setup-node@v3 + with: + node-version: "18" + cache: "npm" + + - name: Install dependencies + run: npm ci + + - name: Run tests + run: npm test + + - name: Build application + run: npm run build + + - name: Deploy to SleakOps + uses: sleakops/deploy-action@v1 + with: + api-key: ${{ secrets.SLEAKOPS_API_KEY }} + project-id: ${{ secrets.SLEAKOPS_PROJECT_ID }} + environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }} + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} +``` + +Save this file as `.github/workflows/sleakops-deploy.yml` in your repository. + + + + + +After merging a PR to staging, verify the action execution: + +1. Go to your **GitHub repository** +2. Click on the **Actions** tab +3. Look for workflow runs after your recent push/merge +4. Click on any failed runs to see detailed error logs +5. 
Check the **Jobs** section for specific step failures + +Common issues to look for: + +- Missing secrets (SLEAKOPS_API_KEY, SLEAKOPS_PROJECT_ID) +- Incorrect branch names in workflow triggers +- Syntax errors in YAML files +- Permission issues with GitHub tokens + + + + + +If the pipeline still doesn't trigger: + +1. **Check repository permissions:** + + - Ensure GitHub Actions are enabled for your repository + - Verify workflow permissions in repository settings + +2. **Validate YAML syntax:** + + ```bash + # Use a YAML validator or GitHub's built-in checker + yamllint .github/workflows/sleakops-deploy.yml + ``` + +3. **Test with a simple workflow:** + + ```yaml + name: Test Workflow + on: + push: + branches: [staging] + jobs: + test: + runs-on: ubuntu-latest + steps: + - run: echo "Pipeline triggered successfully" + ``` + +4. **Check SleakOps integration:** + - Verify API keys are correctly set in GitHub Secrets + - Ensure project ID matches your SleakOps project + - Check SleakOps dashboard for any integration issues + + + + + +Ensure the following secrets are configured in your GitHub repository: + +1. Go to **Repository Settings** → **Secrets and variables** → **Actions** +2. Add the following repository secrets: + +| Secret Name | Description | Where to find | +| --------------------- | ----------------------- | ---------------------------------------- | +| `SLEAKOPS_API_KEY` | Your SleakOps API key | SleakOps Dashboard → Settings → API Keys | +| `SLEAKOPS_PROJECT_ID` | Your project identifier | SleakOps Project → Settings → General | + +3. 
Verify secrets are accessible in your workflow by checking the Actions logs + + + +--- + +_This FAQ was automatically generated on March 18, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-deletion-process.mdx b/docs/troubleshooting/project-deletion-process.mdx new file mode 100644 index 000000000..e4a323bcf --- /dev/null +++ b/docs/troubleshooting/project-deletion-process.mdx @@ -0,0 +1,195 @@ +--- +sidebar_position: 15 +title: "How to Delete a Project in SleakOps" +description: "Complete guide for safely deleting projects and understanding what gets removed" +date: "2024-03-13" +category: "project" +tags: ["project", "deletion", "management", "cleanup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# How to Delete a Project in SleakOps + +**Date:** March 13, 2024 +**Category:** Project +**Tags:** Project, Deletion, Management, Cleanup + +## Problem Description + +**Context:** Users need to understand how to properly delete projects in SleakOps and what components are affected when a project is deleted. + +**Observed Symptoms:** + +- Unable to find project deletion option in the interface +- Uncertainty about what gets deleted along with the project +- Need to clean up projects with naming issues or technical debt +- Questions about the impact on associated resources + +**Relevant Configuration:** + +- Project contains: Docker args, workloads, dependencies, vargroups +- Associated resources: Environment-specific configurations +- Dependencies: Database connections, external services +- Variable groups: Shared across projects or workload-specific + +**Error Conditions:** + +- Deletion option not visible in expected locations +- Uncertainty about cascading deletions +- Risk of losing important configuration data + +## Detailed Solution + + + +Before deleting a project, make sure to document the following components: + +**1. 
Docker Arguments** + +- Note all custom docker args configured for the project +- Document any special build configurations + +**2. Workloads** + +- List all workloads within the project +- Document their configurations and settings + +**3. Dependencies** + +- Record all project dependencies (databases, caches, etc.) +- Note connection strings and configurations + +**4. Variable Groups (VarGroups)** + +- Document vargroups associated with the project +- Note vargroups associated with specific workloads +- Check if any vargroups are shared with other projects + + + + + +To delete a project in SleakOps: + +1. Navigate to your **Project Dashboard** +2. Go to **Project Settings** +3. Click on **General Settings** +4. Scroll down to find the **Delete Project** button + +**Note:** The deletion button is typically located at the bottom of the General Settings page and may be styled as a red or warning button. + + + + + +When you delete a project, the following components are **automatically removed**: + +**✅ Deleted with Project:** + +- The project itself +- All workloads within the project +- Project-specific dependencies +- Variable groups associated exclusively with the project +- Variable groups associated with workloads in the project +- Project-specific docker configurations +- Environment configurations for this project + +**⚠️ Potentially Affected:** + +- Shared variable groups (if used by other projects) +- External dependencies (databases, services) - these may need manual cleanup + +**❌ Not Deleted:** + +- Cluster infrastructure +- Other projects in the same cluster +- Shared resources used by multiple projects + + + + + +**Step 1: Backup Important Data** + +```bash +# Export project configuration (if available via CLI) +sleakops project export --project-name your-project-name +``` + +**Step 2: Document Dependencies** + +- Take screenshots of dependency configurations +- Note down database connection strings +- Document any custom environment variables + 
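The documentation steps above can be sketched as a small shell script that writes a dated notes file before you delete anything. This is a sketch only, not a SleakOps feature: the dependency hosts, ports, and the `APP_` variable prefix are placeholder assumptions you should replace with your own values.

```bash
#!/bin/sh
# Sketch: capture what you are about to delete in a notes file so the details
# survive the deletion. All hosts/ports below are placeholder values.
PROJECT="your-project-name"
NOTES="deletion-notes-${PROJECT}.md"
{
  echo "# Pre-deletion record for ${PROJECT} ($(date +%F))"
  echo ""
  echo "## Dependencies"
  echo "- postgres: host=db.internal port=5432"
  echo "- redis: host=cache.internal port=6379"
  echo ""
  echo "## Custom environment variables (APP_ prefix assumed)"
  printenv | grep '^APP_' || echo "(none found)"
} > "$NOTES"
echo "Wrote $NOTES"
```

Running it leaves `deletion-notes-your-project-name.md` next to your working directory, which you can attach to the change ticket before performing the deletion.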
+**Step 3: Check Shared Resources** + +- Verify which vargroups are shared with other projects +- Ensure no other projects depend on this project's resources + +**Step 4: Perform Deletion** + +1. Go to **Project Settings > General Settings** +2. Scroll to the bottom of the page +3. Click **Delete Project** +4. Confirm the deletion when prompted +5. Wait for the deletion process to complete + +**Step 5: Verify Cleanup** + +- Check that the project no longer appears in your project list +- Verify that associated workloads are removed +- Confirm that exclusive vargroups are deleted + + + + + +If you're deleting a project to recreate it (e.g., to fix naming issues): + +**Naming Conventions:** + +- Avoid redundant suffixes (e.g., don't use both project name and environment in the name) +- Use clear, descriptive names: `admin` instead of `sostengo-admin-prod-prod` +- Remember that environment suffixes are added automatically + +**Configuration Management:** + +- Use the documented configurations from the deleted project +- Apply lessons learned to avoid technical debt +- Consider using Infrastructure as Code approaches for future projects + +**Timing Considerations:** + +- Plan the deletion and recreation during maintenance windows +- Ensure all team members are aware of the changes +- Have rollback plans in case of issues + + + + + +**If you can't find the deletion option:** + +- Check your user permissions - you may need admin access +- Ensure you're in the correct project context +- Try refreshing the page or clearing browser cache + +**If deletion fails:** + +- Check for active deployments that need to be stopped first +- Verify there are no running workloads +- Contact support if the project appears to be stuck in a deletion state + +**After deletion:** + +- If resources appear to still exist, wait a few minutes for propagation +- Check the activity logs for deletion status +- Contact support if cleanup appears incomplete + + + +--- + +_This FAQ was automatically 
generated on March 13, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-deletion-s3-bucket-cleanup.mdx b/docs/troubleshooting/project-deletion-s3-bucket-cleanup.mdx new file mode 100644 index 000000000..705bd35a0 --- /dev/null +++ b/docs/troubleshooting/project-deletion-s3-bucket-cleanup.mdx @@ -0,0 +1,242 @@ +--- +sidebar_position: 3 +title: "Project Deletion Stuck - S3 Bucket Cleanup Issue" +description: "Solution for projects stuck in 'pending for delete' status due to S3 bucket cleanup problems" +date: "2024-10-31" +category: "project" +tags: ["s3", "deletion", "bucket", "cleanup", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Project Deletion Stuck - S3 Bucket Cleanup Issue + +**Date:** October 31, 2024 +**Category:** Project +**Tags:** S3, Deletion, Bucket, Cleanup, AWS + +## Problem Description + +**Context:** When attempting to delete a project in SleakOps, the project becomes stuck in "pending for delete" status due to issues with S3 bucket cleanup processes. + +**Observed Symptoms:** + +- Project remains in "pending for delete" status indefinitely +- Deletion process appears to hang or timeout +- S3 buckets associated with the project are not properly cleaned up +- Error occurs specifically with buckets containing large numbers of objects + +**Relevant Configuration:** + +- Project type: Any project with associated S3 storage +- AWS S3 buckets with significant object count +- SleakOps automated cleanup processes + +**Error Conditions:** + +- Occurs during project deletion process +- Happens when S3 buckets contain many objects +- Cleanup process fails to handle large object volumes +- Deletion gets stuck before completion + +## Detailed Solution + + + +The issue occurs because AWS S3 requires all objects to be deleted from a bucket before the bucket itself can be deleted. When a bucket contains a large number of objects, the cleanup process can: + +1. 
**Timeout**: The deletion process may exceed time limits +2. **Rate limit**: AWS API rate limits may be hit during bulk deletions +3. **Memory issues**: Processing too many objects at once can cause memory problems +4. **Incomplete cleanup**: Some objects may remain, preventing bucket deletion + + + + + +To check if this is affecting your project: + +1. **Access AWS Console** +2. **Navigate to S3 service** +3. **Search for buckets** related to your project name +4. **Check object count** in each bucket +5. **Look for buckets** that should have been deleted but still exist + +```bash +# Using AWS CLI to check bucket contents +aws s3 ls s3://your-project-bucket-name --recursive --summarize +``` + + + + + +The SleakOps team has implemented fixes for this issue: + +1. **Improved batch processing**: Objects are now deleted in smaller, manageable batches +2. **Enhanced error handling**: Better retry mechanisms for failed deletions +3. **Timeout management**: Extended timeouts for large bucket cleanup +4. **Progress tracking**: Better monitoring of cleanup progress + +If your project is currently stuck: + +- **Contact support**: Report the stuck project via support ticket +- **Provide project name**: Include the exact project name showing "pending for delete" +- **Wait for resolution**: The team will manually complete the cleanup process + + + + + +To avoid this issue in future project deletions: + +1. **Regular cleanup**: Periodically clean up unnecessary files in your project +2. **Lifecycle policies**: Implement S3 lifecycle policies to automatically delete old objects +3. **Monitor storage**: Keep track of object counts in your project's S3 buckets +4. 
**Staged deletion**: For projects with large amounts of data, consider manual cleanup before deletion + +```yaml +# Example S3 lifecycle policy +LifecycleConfiguration: + Rules: + - Id: DeleteOldObjects + Status: Enabled + Filter: + Prefix: temp/ + Expiration: + Days: 30 + - Id: DeleteIncompleteMultipartUploads + Status: Enabled + AbortIncompleteMultipartUpload: + DaysAfterInitiation: 7 +``` + +**Recommended practices:** + +- Set up lifecycle policies before accumulating large amounts of data +- Use versioning judiciously to avoid exponential object growth +- Implement automated cleanup for temporary files +- Monitor storage costs regularly to identify potential issues early + + + + + +While waiting for a stuck deletion to complete: + +1. **Check project status** regularly in the SleakOps dashboard +2. **Monitor AWS CloudTrail** for S3 deletion events (if you have access) +3. **Watch for email notifications** from the SleakOps team +4. **Avoid retrying** the deletion process while it's being resolved + +**Note**: The resolution process may take some time depending on the number of objects that need to be cleaned up. + +```bash +# Monitor S3 bucket size during cleanup (if you have access) +aws s3api list-objects-v2 --bucket your-bucket-name --query 'length(Contents[])' + +# Check for stuck multipart uploads +aws s3api list-multipart-uploads --bucket your-bucket-name +``` + + + + + +If project deletion is critical and cannot wait for support resolution: + +**Option 1: Manual S3 Cleanup (Advanced)** + +```bash +# List all objects in the bucket +aws s3 ls s3://your-bucket-name --recursive + +# Delete objects in batches (be very careful!) 
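# Note: `aws s3 rm` removes only *current* object versions. If the bucket has
# (or ever had) versioning enabled, old versions and delete markers remain and
# will still block `aws s3 rb`. Purging them needs the lower-level s3api calls,
# for example (one batch of versions; repeat, and likewise for DeleteMarkers[]):
#   aws s3api delete-objects --bucket your-bucket-name \
#     --delete "$(aws s3api list-object-versions --bucket your-bucket-name \
#       --max-items 1000 \
#       --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}')"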
+aws s3 rm s3://your-bucket-name --recursive + +# Delete the bucket +aws s3 rb s3://your-bucket-name +``` + +**⚠️ Warning**: Only perform manual cleanup if: + +- You have full AWS access to the account +- You've verified the bucket only contains project-related data +- You understand the risks of data loss + +**Option 2: Request Priority Support** + +For critical production issues: + +- Contact SleakOps support via emergency channel +- Clearly mark as "URGENT - Production Impact" +- Provide business justification for priority handling + + + + + +**Check for related issues:** + +1. **Verify IAM permissions**: Ensure the SleakOps service has proper S3 deletion permissions +2. **Check bucket policies**: Look for bucket policies that might prevent deletion +3. **Review object locks**: Check if objects have legal holds or retention policies +4. **Examine cross-region replication**: Verify if bucket has replication rules that need cleanup + +**Common resolution times:** + +- Small buckets (< 1000 objects): 5-15 minutes +- Medium buckets (1K-100K objects): 30 minutes - 2 hours +- Large buckets (100K+ objects): 2-24 hours +- Very large buckets (millions of objects): May require several days + +**Signs that cleanup is progressing:** + +- Decreasing object count in AWS Console +- S3 API calls visible in CloudTrail logs +- Reduced storage costs in AWS billing +- Email updates from SleakOps support team + + + + + +**Before creating projects:** + +- Plan your data retention strategy +- Estimate potential storage usage +- Implement automated cleanup from the start +- Set up monitoring for storage costs + +**During project lifecycle:** + +- Regularly review and clean up unnecessary files +- Monitor storage growth patterns +- Use temporary buckets for short-lived data +- Implement proper logging and cleanup of temporary artifacts + +**Before project deletion:** + +- Backup any critical data +- Document any persistent data that should be preserved +- Clean up large objects manually 
if possible +- Notify team members about the planned deletion + +**Project deletion checklist:** + +```bash +# Pre-deletion verification +□ Backup critical data +□ Document important configurations +□ Clean up large files/datasets +□ Verify no external dependencies +□ Notify stakeholders +□ Choose appropriate time window +□ Have rollback plan ready +``` + + + +--- + +_This FAQ was automatically generated on October 31, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-deletion-stuck-deleting-state.mdx b/docs/troubleshooting/project-deletion-stuck-deleting-state.mdx new file mode 100644 index 000000000..de224294c --- /dev/null +++ b/docs/troubleshooting/project-deletion-stuck-deleting-state.mdx @@ -0,0 +1,467 @@ +--- +sidebar_position: 3 +title: "Project Stuck in Deleting State" +description: "Solution for projects that remain in 'deleting' status after deletion" +date: "2024-12-19" +category: "project" +tags: ["project", "deletion", "troubleshooting", "ui"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Project Stuck in Deleting State + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Project, Deletion, Troubleshooting, UI + +## Problem Description + +**Context:** User attempts to delete a project through the SleakOps platform interface, but the project remains visible in the Projects view with a "deleting" status indefinitely. 
+ +**Observed Symptoms:** + +- Project appears to be deleted successfully +- Project still visible in the Projects list +- Project status shows as "deleting" permanently +- UI does not reflect the actual deletion completion + +**Relevant Configuration:** + +- Project name: Can affect any project +- Platform: SleakOps web interface +- Action: Project deletion through UI + +**Error Conditions:** + +- Occurs after initiating project deletion +- Status remains "deleting" indefinitely +- Project resources may be actually deleted but UI state persists +- Refresh does not resolve the status + +## Detailed Solution + + + +When you delete a project in SleakOps, the system performs several cleanup operations: + +1. **Resource cleanup**: Deletes all associated cloud resources (clusters, storage, etc.) +2. **Database cleanup**: Removes project records from the platform database +3. **UI state update**: Updates the interface to reflect the deletion + +Sometimes the UI state update can lag behind the actual resource cleanup, causing the "deleting" status to persist. + + + + + +If your project is stuck in "deleting" state: + +1. **Wait for completion**: Large projects may take 10-15 minutes to fully delete +2. **Refresh the page**: Use Ctrl+F5 (or Cmd+Shift+R on Mac) for a hard refresh +3. **Clear browser cache**: Clear your browser cache and cookies for the SleakOps domain +4. **Check in incognito/private mode**: Open SleakOps in an incognito window to verify the status + + + + + +To confirm if your project is actually deleted: + +1. **Check cloud provider console**: + + - AWS: Verify no resources remain in your account + - Azure: Check resource groups are deleted + - GCP: Confirm project resources are cleaned up + +2. **Attempt to create a new project** with the same name: + + - If successful, the old project was actually deleted + - If it fails due to name conflict, the project may still exist + +3. 
**Contact support** if the issue persists after 30 minutes + + + + + +To avoid this issue in the future: + +1. **Ensure stable connection**: Maintain a stable internet connection during deletion +2. **Don't close the browser**: Keep the browser tab open until deletion completes +3. **Delete during off-peak hours**: Perform deletions when system load is lower +4. **Monitor resource cleanup**: Check your cloud provider console to confirm cleanup + + + + + +Contact SleakOps support if: + +- Project remains in "deleting" state for more than 30 minutes +- Cloud resources are not being cleaned up +- You cannot create a new project with the same name +- The project reappears after seeming to be deleted + +Provide this information when contacting support: + +- Project name +- Time when deletion was initiated +- Screenshots of the current status +- Any error messages received + + + + + +SleakOps project deletion involves multiple steps that happen sequentially: + +1. **UI State Update**: Project marked as "deleting" in the interface +2. **Kubernetes Resource Cleanup**: Deployments, services, and pods are removed +3. **Storage Cleanup**: Persistent volumes and backups are deleted +4. **Cloud Provider Cleanup**: Load balancers, security groups, and other cloud resources +5. **Database Cleanup**: Project metadata and configurations removed from SleakOps database +6. 
**Final State Update**: Project completely removed from UI + +**Why deletions get stuck:** + +- Cloud provider API rate limits or timeouts +- Large number of resources requiring cleanup +- Network connectivity issues during cleanup +- Resource dependencies that prevent immediate deletion +- Backup processes that need to complete first + + + + + +Check and manually clean up cloud resources if needed: + +**AWS Resources to Check:** + +```bash +# List EC2 instances +aws ec2 describe-instances --filters "Name=tag:Project,Values=your-project-name" + +# List load balancers +aws elbv2 describe-load-balancers + +# List security groups +aws ec2 describe-security-groups --filters "Name=tag:Project,Values=your-project-name" + +# List EBS volumes +aws ec2 describe-volumes --filters "Name=tag:Project,Values=your-project-name" + +# List S3 buckets +aws s3api list-buckets --query "Buckets[?contains(Name, 'your-project-name')]" +``` + +**Manual cleanup commands (use with caution):** + +```bash +# Delete load balancers +aws elbv2 delete-load-balancer --load-balancer-arn your-load-balancer-arn + +# Terminate EC2 instances +aws ec2 terminate-instances --instance-ids your-instance-id + +# Delete security groups (after instances are terminated) +aws ec2 delete-security-group --group-id your-security-group-id + +# Delete EBS volumes +aws ec2 delete-volume --volume-id your-volume-id +``` + +**Azure Resources:** + +```bash +# List resource groups +az group list --query "[?name=='your-project-rg']" + +# List resources in group +az resource list --resource-group your-project-rg + +# Delete resource group (includes all resources) +az group delete --name your-project-rg --yes --no-wait +``` + +**GCP Resources:** + +```bash +# List compute instances +gcloud compute instances list --filter="labels.project=your-project-name" + +# List load balancers +gcloud compute forwarding-rules list + +# Delete compute instances +gcloud compute instances delete instance-name --zone=zone-name + +# List and delete persistent disks +gcloud compute disks list 
--filter="labels.project=your-project-name" +gcloud compute disks delete disk-name --zone=zone-name +``` + + + + + +Verify Kubernetes resources are properly cleaned up: + +1. **Check cluster access** (if you have cluster credentials): + +```bash +# List all namespaces +kubectl get namespaces + +# Check if project namespace still exists +kubectl get namespace your-project-namespace + +# List resources in project namespace +kubectl get all -n your-project-namespace + +# Check persistent volumes +kubectl get pv | grep your-project +``` + +2. **Manual Kubernetes cleanup** (if necessary): + +```bash +# Delete namespace (removes all resources in it) +kubectl delete namespace your-project-namespace --force --grace-period=0 + +# Delete persistent volumes manually +kubectl delete pv your-project-pv-name + +# Delete cluster roles and bindings +kubectl get clusterrole | grep your-project +kubectl delete clusterrole your-project-role + +kubectl get clusterrolebinding | grep your-project +kubectl delete clusterrolebinding your-project-binding +``` + +3. **Check for stuck finalizers**: + +```bash +# Check for resources with finalizers +kubectl get namespace your-project-namespace -o yaml + +# Remove finalizers if needed (advanced operation) +kubectl patch namespace your-project-namespace -p '{"metadata":{"finalizers":null}}' +``` + + + + + +Understanding database-level cleanup issues: + +1. **Project metadata cleanup**: SleakOps maintains project metadata in its database +2. **Configuration cleanup**: Project configurations, environment variables, and secrets +3. **Audit trail preservation**: Some deletion records may be kept for audit purposes +4. 
**Cache invalidation**: Platform caches may need time to update + +**What gets removed from SleakOps database:** + +- Project configuration and settings +- Environment variables and secrets +- Deployment history (except audit logs) +- User access permissions for the project +- Monitoring and alerting configurations + +**What may be preserved:** + +- Audit logs of project activities +- Billing and usage history +- Security event logs +- Platform metrics and analytics + + + + + +For persistent deletion issues: + +1. **Browser-based debugging**: + +```javascript +// Check browser developer tools console for errors +// Look for network requests to deletion API endpoints +// Check for any JavaScript errors during deletion process + +// Clear all SleakOps-related data +localStorage.clear(); +sessionStorage.clear(); +// Clear cookies for sleakops.com domain +``` + +2. **Network connectivity verification**: + +```bash +# Test connectivity to SleakOps API +curl -I https://api.sleakops.com/health + +# Check DNS resolution +nslookup api.sleakops.com + +# Test from different network/location +# Use different internet connection or VPN +``` + +3. **Platform status verification**: + +```bash +# Check SleakOps status page +curl -s https://status.sleakops.com/api/v2/status.json + +# Verify platform maintenance windows +# Check SleakOps social media or documentation for known issues +``` + +4. **Alternative deletion methods**: + +- Try deletion from different browser or device +- Use SleakOps CLI if available +- Contact support for manual deletion +- Use API directly if you have access + + + + + +Set up monitoring to track deletion progress: + +1. 
**Cloud provider monitoring**: + +```bash +# AWS CloudTrail logs +aws logs describe-log-groups --log-group-name-prefix "/aws/apigateway/sleakops" + +# Monitor resource counts +watch "aws ec2 describe-instances --filters 'Name=tag:Project,Values=your-project' | jq '.Reservations | length'" + +# Monitor storage usage +watch "aws s3 ls | grep your-project" +``` + +2. **Platform monitoring**: + +```bash +# Monitor API responses +while true; do + echo "$(date): Checking project status..." + curl -s "https://api.sleakops.com/projects/your-project-id" | jq '.status' + sleep 30 +done +``` + +3. **Notification setup**: + +```bash +# Email notification when deletion completes +#!/bin/bash +PROJECT_NAME="your-project" +while kubectl get namespace $PROJECT_NAME >/dev/null 2>&1; do + sleep 60 +done +echo "Project $PROJECT_NAME deleted successfully" | mail -s "Project Deletion Complete" your-email@example.com +``` + + + + + +Important considerations for data that may be lost: + +1. **What gets permanently deleted**: + +- Application databases and data +- File storage and uploads +- Configuration files and secrets +- Log files and metrics history +- Custom SSL certificates +- Backup files and snapshots + +2. **Data backup before deletion**: + +```bash +# Backup databases +kubectl exec -it database-pod -- pg_dump database_name > backup.sql + +# Backup persistent volumes +kubectl get pv your-project-pv -o yaml > pv-backup.yaml + +# Export configurations +kubectl get configmaps -n your-namespace -o yaml > configmaps-backup.yaml +kubectl get secrets -n your-namespace -o yaml > secrets-backup.yaml + +# Backup application files +kubectl cp pod-name:/app/data ./backup-data/ +``` + +3. **Recovery options if deletion was accidental**: + +- Contact SleakOps support immediately +- Provide project details and deletion timestamp +- Check if any automated backups exist +- Restore from cloud provider snapshots if available +- Recreate project with backed up configurations + +4. 
**Prevention strategies**: + +- Enable automated backups before deletion +- Use project naming conventions to avoid accidental deletion +- Implement approval processes for production project deletions +- Document important configurations and data locations + + + + + +Implement these practices to avoid deletion issues: + +1. **Pre-deletion checklist**: + +```bash +□ Backup all important data +□ Export project configurations +□ Notify team members +□ Document any special setup or configurations +□ Check for dependencies on other projects +□ Verify no active users or services +□ Plan downtime window if needed +□ Have rollback plan ready +``` + +2. **Deletion procedures**: + +- Schedule deletions during maintenance windows +- Monitor cloud provider consoles during deletion +- Use staging/testing deletions to verify process +- Document the deletion process and any issues encountered +- Keep deletion logs and timestamps + +3. **Post-deletion verification**: + +```bash +□ Verify all cloud resources are removed +□ Check billing for reduced costs +□ Confirm no ghost resources remain +□ Update documentation and inventories +□ Notify stakeholders of completion +□ Archive any important deletion logs +``` + +4. 
**Emergency contacts and procedures**: + +- Maintain updated contact list for SleakOps support +- Document escalation procedures for stuck deletions +- Keep cloud provider support contacts available +- Have access to cloud provider root accounts for emergency cleanup + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-dependency-deletion-behavior.mdx b/docs/troubleshooting/project-dependency-deletion-behavior.mdx new file mode 100644 index 000000000..f77fb59a0 --- /dev/null +++ b/docs/troubleshooting/project-dependency-deletion-behavior.mdx @@ -0,0 +1,164 @@ +--- +sidebar_position: 3 +title: "Project Deletion and Dependencies Behavior" +description: "Understanding what happens to dependencies when deleting a project in SleakOps" +date: "2025-01-28" +category: "project" +tags: ["project", "dependencies", "redis", "deletion", "cleanup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Project Deletion and Dependencies Behavior + +**Date:** January 28, 2025 +**Category:** Project +**Tags:** Project, Dependencies, Redis, Deletion, Cleanup + +## Problem Description + +**Context:** Users need to understand the behavior of project dependencies (such as Redis, databases, caches) when a project is deleted from the SleakOps platform. + +**Observed Symptoms:** + +- Uncertainty about dependency cleanup when deleting projects +- Need to understand if dependencies are automatically removed +- Concern about data loss or orphaned resources +- Questions about cleanup procedures + +**Relevant Configuration:** + +- Project with attached dependencies (Redis, PostgreSQL, MySQL, etc.) 
+- SleakOps project management interface +- Dependency lifecycle management + +**Error Conditions:** + +- Risk of orphaned resources if dependencies aren't properly cleaned up +- Potential data loss if dependencies are unexpectedly deleted +- Billing concerns for unused resources + +## Detailed Solution + + + +**Yes, dependencies are automatically deleted when you delete a project in SleakOps.** + +When you delete a project that contains dependencies like Redis, PostgreSQL, MySQL, or other services, the following occurs: + +1. **All project dependencies are removed** along with the project +2. **Data stored in these dependencies is permanently deleted** +3. **Associated cloud resources are cleaned up** to avoid ongoing charges +4. **The deletion is irreversible** - you cannot recover the data afterward + + + + + +The deletion process follows this sequence: + +1. **Project deletion initiated** by user +2. **Workloads are stopped** (webservices, workers, cron jobs) +3. **Dependencies are identified** and marked for deletion +4. **Data backup warning** is displayed (if applicable) +5. **Dependencies are deleted** in reverse dependency order +6. **Project resources are cleaned up** +7. 
**Confirmation of complete deletion** + +```bash +# Example of what gets deleted: +- Redis instance and all cached data +- PostgreSQL database and all stored data +- MySQL database and all stored data +- Associated volumes and storage +- Network configurations +- Load balancers and ingress rules +``` + + + + + +**Important: Always backup critical data before deleting a project.** + +### For Redis: + +```bash +# Connect to your Redis instance and create a backup +redis-cli --rdb /path/to/backup.rdb + +# Or export specific keys +redis-cli --scan --pattern "*" | xargs redis-cli MGET +``` + +### For PostgreSQL: + +```bash +# Create a database dump +pg_dump -h your-host -U your-user -d your-database > backup.sql + +# Or use SleakOps CLI if available +sleakops project backup --project-id PROJECT_ID --service postgres +``` + +### For MySQL: + +```bash +# Create a database dump +mysqldump -h your-host -u your-user -p your-database > backup.sql +``` + + + + + +### Before Deleting a Project: + +1. **Review all dependencies**: + + - Go to Project Settings → Dependencies + - List all attached services (Redis, databases, etc.) + - Identify critical data that needs backup + +2. **Create backups**: + + - Export important data from Redis + - Dump database contents + - Save configuration files + +3. **Consider alternatives**: + - **Pause the project** instead of deleting (if available) + - **Migrate dependencies** to another project + - **Detach dependencies** before deletion (if supported) + +### Safety Checklist: + +- [ ] All critical data has been backed up +- [ ] Team members are aware of the deletion +- [ ] No active users depend on the services +- [ ] Alternative solutions are in place if needed + + + + + +**Unfortunately, once a project and its dependencies are deleted, recovery is generally not possible.** + +However, check these options: + +1. **Cloud provider backups**: Some cloud providers maintain automatic backups +2. 
**SleakOps support**: Contact support immediately if deletion was accidental +3. **Application-level backups**: Check if your application created its own backups + +### Prevention for the future: + +- Set up regular automated backups +- Use staging environments for testing deletions +- Implement backup strategies in your application code +- Consider using external backup services + + + +--- + +_This FAQ was automatically generated on January 28, 2025 based on a real user query._ diff --git a/docs/troubleshooting/project-dockerfile-arguments-configuration.mdx b/docs/troubleshooting/project-dockerfile-arguments-configuration.mdx new file mode 100644 index 000000000..d76b932ce --- /dev/null +++ b/docs/troubleshooting/project-dockerfile-arguments-configuration.mdx @@ -0,0 +1,184 @@ +--- +sidebar_position: 3 +title: "Dockerfile Arguments Configuration Error" +description: "Solution for configuring Dockerfile arguments in SleakOps projects" +date: "2025-03-11" +category: "project" +tags: ["dockerfile", "arguments", "configuration", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Dockerfile Arguments Configuration Error + +**Date:** March 11, 2025 +**Category:** Project +**Tags:** Dockerfile, Arguments, Configuration, Deployment + +## Problem Description + +**Context:** User encounters issues when creating a production project in SleakOps, specifically related to Dockerfile arguments configuration and project creation workflow. 
+ +**Observed Symptoms:** + +- Cannot add Dockerfile arguments during project creation +- Project creation fails in production environment +- Interface prevents argument configuration before project exists +- Error persists despite Dockerfile being present in the correct branch + +**Relevant Configuration:** + +- Branch: `sleakops-master` (production branch) +- Environment: Production +- Dockerfile: Present in repository +- Arguments: Required by Dockerfile but cannot be configured + +**Error Conditions:** + +- Error occurs during project creation to production environment +- Arguments cannot be declared before project creation +- Issue resolved by SleakOps team intervention +- Problem identified as platform-side issue + +## Detailed Solution + + + +In SleakOps, Dockerfile arguments (ARG) should be configured in the **Project Configuration** section, not in Variable Groups: + +1. Go to your **Project Settings** +2. Navigate to **Build Configuration** +3. Look for **Dockerfile Arguments** section +4. Add your arguments there before deployment + +**Important:** Dockerfile arguments are different from runtime environment variables: + +- **Dockerfile Arguments (ARG)**: Used during image build process +- **Variable Groups**: Used for runtime environment variables + + + + + +To properly create a project with Dockerfile arguments: + +1. **Ensure Dockerfile is in the correct branch** + + - Verify the Dockerfile exists in your production branch (`sleakops-master`) + - Confirm all changes are merged from development branches + +2. **Create the project first** + + - Create the project without arguments initially + - This allows access to configuration options + +3. **Configure arguments after creation** + + - Go to Project Settings → Build Configuration + - Add required Dockerfile arguments + - Save configuration + +4. 
**Redeploy with arguments** + - Trigger a new deployment + - Arguments will be passed during build process + + + + + +Before creating a production project, verify: + +```bash +# Check if Dockerfile exists in production branch +git checkout sleakops-master +ls -la | grep Dockerfile + +# Verify branch is up to date +git pull origin sleakops-master + +# Check if all necessary changes are merged +git log --oneline -10 +``` + +In SleakOps interface: + +1. Go to **Repository Settings** +2. Verify **Production Branch** is set to `sleakops-master` +3. Confirm **Dockerfile Path** is correct (usually `./Dockerfile`) + + + + + +If you encounter similar issues: + +1. **Check build logs** + + - Go to **Deployments** → **Build Logs** + - Look for specific error messages + - Verify if Dockerfile is found + +2. **Verify repository access** + + - Ensure SleakOps has access to the repository + - Check if branch permissions are correct + +3. **Contact support if needed** + - Platform-side issues may require team intervention + - Provide specific error messages and screenshots + - Include branch and repository information + +**Example Dockerfile with arguments:** + +```dockerfile +FROM node:16-alpine + +# Declare build arguments +ARG NODE_ENV=production +ARG API_URL +ARG DATABASE_URL + +# Use arguments during build +ENV NODE_ENV=$NODE_ENV +ENV API_URL=$API_URL +ENV DATABASE_URL=$DATABASE_URL + +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . + +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +To avoid similar issues in the future: + +1. **Always test in staging first** + + - Create and test projects in staging environment + - Verify all configurations work before production + +2. **Document your arguments** + + - Keep a list of required Dockerfile arguments + - Document their purpose and expected values + +3. **Use consistent branch naming** + + - Maintain clear branch naming conventions + - Ensure production branch is always up to date + +4. 
**Regular branch synchronization** + - Regularly merge changes to production branch + - Keep development and production branches in sync + + + +--- + +_This FAQ was automatically generated on March 11, 2025 based on a real user query._ diff --git a/docs/troubleshooting/project-dockerfile-path-configuration.mdx b/docs/troubleshooting/project-dockerfile-path-configuration.mdx new file mode 100644 index 000000000..43833c979 --- /dev/null +++ b/docs/troubleshooting/project-dockerfile-path-configuration.mdx @@ -0,0 +1,195 @@ +--- +sidebar_position: 3 +title: "Dockerfile Path Configuration Error in Projects" +description: "Solution for Dockerfile path issues and exceptions when editing project configuration" +date: "2025-02-20" +category: "project" +tags: ["dockerfile", "project", "configuration", "path", "repository"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Dockerfile Path Configuration Error in Projects + +**Date:** February 20, 2025 +**Category:** Project +**Tags:** Dockerfile, Project, Configuration, Path, Repository + +## Problem Description + +**Context:** Users experience issues when configuring or editing the Dockerfile path in SleakOps project settings, resulting in the path being cleared or exceptions being thrown. 
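The path rules this article works through can be captured in a few lines of validation — a sketch inferred from the guide's correct/incorrect examples (repositories that use a custom Dockerfile name would need a looser check):

```python
# Sanity-check a Dockerfile path string before entering it in the SleakOps UI.
# Rules inferred from this guide: start with "./" and include the filename itself.
def looks_like_valid_dockerfile_path(path: str) -> bool:
    return path.startswith("./") and path.rsplit("/", 1)[-1] == "Dockerfile"

print(looks_like_valid_dockerfile_path("./docker/base/Dockerfile"))  # True
print(looks_like_valid_dockerfile_path("./docker/base"))             # False: missing filename
print(looks_like_valid_dockerfile_path("docker/base/Dockerfile"))    # False: missing "./" prefix
```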
+ +**Observed Symptoms:** + +- Dockerfile path gets automatically cleared/deleted when editing +- Exception thrown when trying to modify Dockerfile path +- Configuration doesn't persist after saving +- Project build fails due to missing Dockerfile reference + +**Relevant Configuration:** + +- Dockerfile path format: Relative path from repository root +- Repository branch: Must match the branch where Dockerfile exists +- Path structure: Must include full path including filename + +**Error Conditions:** + +- Error occurs when Dockerfile is not found in specified repository and branch +- Path specification is incomplete or incorrect +- Dockerfile doesn't exist at the specified location + +## Detailed Solution + + + +The Dockerfile path in SleakOps must include the complete path from the repository root, including the filename: + +**Correct format:** + +``` +./docker/base/Dockerfile +./docker/Dockerfile +./Dockerfile +``` + +**Incorrect format:** + +``` +./docker/base # Missing filename +docker/base # Missing ./ prefix +base # Incomplete path +``` + +**Example:** +If your repository structure is: + +``` +my-repo/ +├── docker/ +│ └── base/ +│ └── Dockerfile +└── src/ +``` + +The correct path would be: `./docker/base/Dockerfile` + + + + + +Ensure that: + +1. **Repository is correctly linked** to your project +2. **Branch specification** matches where your Dockerfile exists +3. **Dockerfile exists** in the specified branch + +**Steps to verify:** + +1. Go to your **Project Settings** +2. Check the **Repository** field points to the correct repo +3. Verify the **Branch** field matches your Dockerfile's branch +4. Confirm the Dockerfile exists at the specified path in that branch + +**Common issues:** + +- Dockerfile exists in `main` branch but project is configured for `develop` +- Repository URL is incorrect or outdated +- Dockerfile was moved or renamed after initial configuration + + + + + +If the Dockerfile path keeps getting cleared: + +1. 
**Verify file existence:** + + ```bash + # Clone your repository and check + git clone + cd + git checkout + ls -la ./docker/base/Dockerfile # Replace with your path + ``` + +2. **Check file permissions:** + + - Ensure the Dockerfile is readable + - Verify repository access permissions + +3. **Update configuration step by step:** + + - First, ensure repository and branch are correct + - Then, add the complete Dockerfile path + - Save and verify the configuration persists + +4. **Test with a simple path:** + - Try placing Dockerfile in repository root: `./Dockerfile` + - If this works, gradually move to subdirectories + + + + + +**Recommended practices:** + +1. **Use consistent naming:** + + ``` + ./Dockerfile # For single service + ./docker/app/Dockerfile # For multi-service apps + ./services/api/Dockerfile # For microservices + ``` + +2. **Organize by environment:** + + ``` + ./docker/production/Dockerfile + ./docker/development/Dockerfile + ./docker/staging/Dockerfile + ``` + +3. **Keep Dockerfiles in version control:** + + - Always commit Dockerfile changes + - Use meaningful commit messages + - Tag releases that include Dockerfile changes + +4. **Test locally before configuring in SleakOps:** + ```bash + docker build -f ./docker/base/Dockerfile . + ``` + + + + + +If the problem persists: + +1. **Use repository root Dockerfile:** + + - Move your Dockerfile to the repository root + - Use path: `./Dockerfile` + +2. **Create a symbolic link:** + + ```bash + ln -s ./docker/base/Dockerfile ./Dockerfile + ``` + +3. **Use Docker Compose:** + + - Configure docker-compose.yml in repository root + - Reference Dockerfile from compose file + +4. 
**Contact support with details:** + - Repository URL + - Branch name + - Exact Dockerfile path + - Screenshots of the error + + + +--- + +_This FAQ was automatically generated on February 20, 2025 based on a real user query._ diff --git a/docs/troubleshooting/project-multi-service-setup-guide.mdx b/docs/troubleshooting/project-multi-service-setup-guide.mdx new file mode 100644 index 000000000..03b768076 --- /dev/null +++ b/docs/troubleshooting/project-multi-service-setup-guide.mdx @@ -0,0 +1,289 @@ +--- +sidebar_position: 15 +title: "Multi-Service Project Setup Guide" +description: "Complete guide for setting up Kafka, MySQL, MongoDB connections, and Nginx reverse proxy in SleakOps" +date: "2024-01-15" +category: "project" +tags: + ["kafka", "mysql", "mongodb", "nginx", "reverse-proxy", "database", "setup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Multi-Service Project Setup Guide + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** Kafka, MySQL, MongoDB, Nginx, Reverse Proxy, Database, Setup + +## Problem Description + +**Context:** Setting up a complete project environment that requires multiple services including message queuing (Kafka), relational database (MySQL), NoSQL database connection (MongoDB), and reverse proxy configuration (Nginx). 
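Every step below depends on services reaching each other over internal service names and ports, so a small connectivity probe is useful while wiring things up. This is a sketch; the hostnames in the comments are this guide's illustrative service names, not guaranteed values:

```python
# Minimal TCP reachability probe for the services wired together in this guide.
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from a workload inside the project network, e.g.:
# can_connect("kafka-service", 9092)   # Kafka broker
# can_connect("mysql-service", 3306)   # MySQL
```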
+ +**Observed Symptoms:** + +- Need to deploy Kafka service for message processing +- Require MySQL database with data loading capabilities +- Need to establish connection to production MongoDB instance +- Require Nginx configuration for reverse proxy setup + +**Relevant Configuration:** + +- Services needed: Kafka, MySQL, Nginx +- External connection: Production MongoDB +- Proxy configuration: Reverse proxy setup +- Database operations: Data loading and connectivity + +**Error Conditions:** + +- Services must be properly interconnected +- Database connections must be secure and reliable +- Reverse proxy must handle traffic correctly +- All services must be accessible within the project environment + +## Detailed Solution + + + +To add Kafka to your SleakOps project: + +1. **Navigate to your project dashboard** +2. **Go to Dependencies section** +3. **Add Kafka dependency:** + - Click "Add Dependency" + - Select "Kafka" from the list + - Configure the following settings: + +```yaml +# Kafka configuration example +kafka: + version: "3.5" + replicas: 3 + storage: "10Gi" + resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" +``` + +4. **Configure topics** (if needed): + - Add environment variables for topic configuration + - Set retention policies + - Configure partitions and replication factor + + + + + +To set up MySQL and load your database: + +1. **Add MySQL dependency:** + - Go to Dependencies → Add Dependency → MySQL + - Configure version and resources: + +```yaml +# MySQL configuration +mysql: + version: "8.0" + database: "your_database_name" + username: "your_username" + storage: "20Gi" + resources: + requests: + memory: "1Gi" + cpu: "500m" +``` + +2. **Load database data:** + + - Use the **Init Scripts** feature in MySQL configuration + - Upload your SQL dump files + - Or use environment variables to run initialization commands + +3. 
**Access credentials:** + - Database credentials are automatically generated + - Access them via environment variables in your applications + - Connection string format: `mysql://username:password@mysql-service:3306/database` + + + + + +To connect to an external MongoDB production instance: + +1. **Add connection details as secrets:** + - Go to Project Settings → Environment Variables + - Add MongoDB connection variables: + +```bash +MONGO_PROD_URI=mongodb://username:password@prod-mongo-host:27017/database +MONGO_PROD_DATABASE=your_production_database +MONGO_PROD_USERNAME=your_username +MONGO_PROD_PASSWORD=your_password +``` + +2. **Network security considerations:** + + - Ensure your SleakOps cluster can reach the production MongoDB + - Configure firewall rules if necessary + - Use VPN or private networking if available + +3. **Test connectivity:** + - Create a simple test workload to verify connection + - Use MongoDB client tools to test from within the cluster + + + + + +To set up Nginx as a reverse proxy: + +1. **Add Nginx workload:** + + - Go to Workloads → Add Workload → Web Service + - Select Nginx as the base image + +2. **Configure reverse proxy:** + - Create custom Nginx configuration: + +```nginx +# nginx.conf example +server { + listen 80; + server_name your-domain.com; + + # Proxy to your main application + location / { + proxy_pass http://your-app-service:8080; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + } + + # Proxy to Kafka management UI (if needed) + location /kafka { + proxy_pass http://kafka-ui-service:8080; + proxy_set_header Host $host; + } + + # Health check endpoint + location /health { + return 200 'OK'; + add_header Content-Type text/plain; + } +} +``` + +3. **Mount configuration:** + + - Upload your nginx.conf as a config file + - Mount it to `/etc/nginx/conf.d/default.conf` + +4. 
**Expose the service:** + - Configure ingress or load balancer + - Set up SSL/TLS if required + + + + + +To ensure all services work together: + +1. **Service discovery:** + + - Services can communicate using their service names + - Example: `kafka-service:9092`, `mysql-service:3306` + +2. **Environment variables for integration:** + +```bash +# Application environment variables +KAFKA_BROKERS=kafka-service:9092 +MYSQL_HOST=mysql-service +MYSQL_PORT=3306 +MYSQL_DATABASE=your_database +MONGO_PROD_URI=mongodb://prod-mongo-host:27017/database +NGINX_UPSTREAM=your-app-service:8080 +``` + +3. **Health checks:** + + - Configure health checks for each service + - Monitor connectivity between services + - Set up alerts for service failures + +4. **Testing the complete setup:** + - Test Kafka message production/consumption + - Verify MySQL database queries + - Confirm MongoDB production connectivity + - Test reverse proxy routing + + + + + +**Service connectivity issues:** + +- Check service names and ports +- Verify network policies +- Review firewall configurations + +**Database connection problems:** + +- Verify credentials and connection strings +- Check database service status +- Review network connectivity + +**Nginx proxy issues:** + +- Check Nginx configuration syntax +- Verify upstream service availability +- Review access logs for errors +- Test direct service access + +**Kafka connectivity problems:** + +- Verify Kafka broker addresses +- Check consumer group configurations +- Review topic permissions +- Monitor Kafka logs for errors + +**Performance and scaling:** + +```bash +# Monitor resource usage +kubectl top pods -n + +# Scale services based on load +kubectl scale deployment kafka --replicas=3 +kubectl scale deployment your-app --replicas=5 + +# Check service endpoints +kubectl get endpoints -n +``` + +**Debugging commands:** + +```bash +# Test service connectivity +kubectl exec -it -- telnet + +# Check service DNS resolution +kubectl exec -it -- nslookup + 
+# View service logs
+kubectl logs -f deployment/<deployment-name>
+
+# Debug networking
+kubectl exec -it <pod-name> -- netstat -an
+```
+
+
+
+---
+
+_This FAQ was automatically generated on January 15, 2024 based on a real user query._
diff --git a/docs/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx b/docs/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx
new file mode 100644
index 000000000..9e411398b
--- /dev/null
+++ b/docs/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx
@@ -0,0 +1,501 @@
+---
+sidebar_position: 3
+title: "Project Stuck in Analyzing State - Missing Dockerfile"
+description: "Solution for projects that remain permanently stuck in analyzing state due to missing Dockerfile"
+date: "2024-12-11"
+category: "project"
+tags: ["dockerfile", "analyzing", "deployment", "troubleshooting"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Project Stuck in Analyzing State - Missing Dockerfile
+
+**Date:** December 11, 2024
+**Category:** Project
+**Tags:** Dockerfile, Analyzing, Deployment, Troubleshooting
+
+## Problem Description
+
+**Context:** A new project in SleakOps gets stuck permanently in the "analyzing" state and cannot proceed with the deployment process.
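The most common causes can be screened for locally before reconnecting the repository. The following is a sketch whose detection rules are assumptions drawn from this article, not SleakOps' actual analyzer:

```python
# Pre-flight check for frequent "stuck analyzing" causes: a Dockerfile that is
# missing, wrongly cased, empty, or lacking a FROM instruction.
from pathlib import Path

def dockerfile_preflight(repo_root: str) -> list:
    problems = []
    df = Path(repo_root) / "Dockerfile"
    if not df.is_file():
        # The name is case-sensitive: "dockerfile" or "DockerFile" will not match
        problems.append("no file named exactly 'Dockerfile' in the repository root")
        return problems
    lines = df.read_text(errors="replace").splitlines()
    if not any(line.strip() for line in lines):
        problems.append("Dockerfile is empty")
    elif not any(line.strip().upper().startswith("FROM") for line in lines):
        problems.append("Dockerfile has no FROM instruction")
    return problems

# dockerfile_preflight(".") returning [] means nothing obviously wrong was found
```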
+ +**Observed Symptoms:** + +- Project remains in "analyzing" state indefinitely +- No progress in the deployment pipeline +- Build process does not advance to next stages +- No clear error messages displayed in the UI + +**Relevant Configuration:** + +- Project type: New project +- Status: Stuck in analyzing phase +- Repository: Connected to source code repository +- Dockerfile: Missing or not detected + +**Error Conditions:** + +- Error occurs during initial project analysis +- Happens when SleakOps cannot find a Dockerfile in the repository +- Problem persists until Dockerfile is properly configured +- May occur if Dockerfile was added after the initial analysis + +## Detailed Solution + + + +First, check if your repository contains a Dockerfile: + +1. **Check repository root**: Ensure there's a file named `Dockerfile` (case-sensitive) in the root directory +2. **Verify file contents**: The Dockerfile should contain valid Docker instructions +3. **Check file permissions**: Ensure the file is readable and properly committed to the repository + +```bash +# Example of a basic Dockerfile structure +FROM node:18-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +If your repository doesn't have a Dockerfile, you need to create one: + +1. **Create the file**: Add a file named `Dockerfile` in your repository root +2. **Choose appropriate base image**: Select based on your application technology +3. **Define build steps**: Include all necessary commands to build your application +4. **Commit and push**: Make sure to commit the Dockerfile to your repository + +**Common Dockerfile templates:** + +```dockerfile +# Node.js application +FROM node:18-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production +COPY . . +EXPOSE 3000 +CMD ["node", "server.js"] +``` + +```dockerfile +# Python application +FROM python:3.9-slim +WORKDIR /app +COPY requirements.txt . 
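# Copying requirements.txt before the app code lets Docker cache the installed-dependencies layer; it is rebuilt only when requirements.txt changes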
+RUN pip install --no-cache-dir -r requirements.txt +COPY . . +EXPOSE 8000 +CMD ["python", "app.py"] +``` + + + + + +After adding or fixing the Dockerfile: + +1. **Commit changes**: Ensure the Dockerfile is committed to your repository +2. **Trigger reanalysis**: In SleakOps dashboard: + - Go to your project + - Look for "Re-analyze" or "Refresh" button + - Click to trigger a new analysis +3. **Monitor progress**: Watch the project status change from "analyzing" to the next phase + + + + + +Before committing, validate your Dockerfile: + +```bash +# Test Dockerfile locally +docker build -t test-image . + +# Check for syntax errors +docker run --rm -i hadolint/hadolint < Dockerfile +``` + +**Common Dockerfile issues:** + +- Missing `FROM` instruction +- Incorrect file paths in `COPY` commands +- Missing executable permissions +- Wrong working directory setup + + + + + +If the project remains stuck after adding a Dockerfile: + +1. **Check repository permissions**: Ensure SleakOps has access to read the repository +2. **Verify branch**: Make sure you're working on the correct branch +3. **Clear cache**: Try disconnecting and reconnecting the repository +4. **Contact support**: If the issue persists, provide: + - Repository URL + - Dockerfile contents + - Project configuration details + + + + + +Follow these best practices when creating Dockerfiles for SleakOps: + +1. **Use appropriate base images**: + +```dockerfile +# For production, use specific versions and slim/alpine variants +FROM node:18-alpine +# Instead of: FROM node:latest +``` + +2. **Optimize layer caching**: + +```dockerfile +# Copy package files first for better caching +COPY package*.json ./ +RUN npm ci --only=production +# Then copy application code +COPY . . +``` + +3. **Set proper working directory**: + +```dockerfile +# Always set a working directory +WORKDIR /app +# Avoid using root directory +``` + +4. 
**Handle file permissions correctly**: + +```dockerfile +# Create non-root user for security +RUN addgroup -g 1001 -S nodejs +RUN adduser -S nextjs -u 1001 +USER nextjs +``` + +5. **Configure health checks**: + +```dockerfile +# Add health check for better container management +HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \ + CMD curl -f http://localhost:3000/health || exit 1 +``` + + + + + +Complete Dockerfile examples for different technologies: + +**Node.js with TypeScript:** + +```dockerfile +FROM node:18-alpine AS base + +# Install dependencies only when needed +FROM base AS deps +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production && npm cache clean --force + +# Rebuild the source code only when needed +FROM base AS builder +WORKDIR /app +COPY package*.json ./ +RUN npm ci +COPY . . +RUN npm run build + +# Production image +FROM base AS runner +WORKDIR /app +ENV NODE_ENV=production + +RUN addgroup -g 1001 -S nodejs +RUN adduser -S nextjs -u 1001 + +COPY --from=deps --chown=nextjs:nodejs /app/node_modules ./node_modules +COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist +COPY --from=builder --chown=nextjs:nodejs /app/package.json ./package.json + +USER nextjs + +EXPOSE 3000 +CMD ["node", "dist/index.js"] +``` + +**Python Django Application:** + +```dockerfile +FROM python:3.11-slim + +# Set environment variables +ENV PYTHONDONTWRITEBYTECODE=1 +ENV PYTHONUNBUFFERED=1 + +# Set work directory +WORKDIR /app + +# Install system dependencies +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ + postgresql-client \ + && rm -rf /var/lib/apt/lists/* + +# Install Python dependencies +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copy project +COPY . . 
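# A .dockerignore entry for .git, __pycache__ and local virtualenvs keeps this COPY small and the layer cache stable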
+ +# Create non-root user +RUN adduser --disabled-password --gecos '' appuser +RUN chown -R appuser:appuser /app +USER appuser + +# Run the application +EXPOSE 8000 +CMD ["gunicorn", "--bind", "0.0.0.0:8000", "myproject.wsgi:application"] +``` + +**Java Spring Boot:** + +```dockerfile +FROM openjdk:17-jdk-slim AS build + +WORKDIR /app +COPY gradle* ./ +COPY gradle ./gradle +COPY build.gradle settings.gradle ./ +RUN ./gradlew dependencies + +COPY src ./src +RUN ./gradlew build -x test + +FROM openjdk:17-jre-slim +WORKDIR /app + +RUN addgroup --system spring && adduser --system spring --ingroup spring +USER spring:spring + +COPY --from=build /app/build/libs/*.jar app.jar + +EXPOSE 8080 +HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \ + CMD curl -f http://localhost:8080/actuator/health || exit 1 + +ENTRYPOINT ["java", "-jar", "app.jar"] +``` + + + + + +Debug common Dockerfile issues: + +1. **Build context issues**: + +```bash +# Check what files are being sent to Docker build context +echo "FROM alpine" | docker build - + +# Use .dockerignore to exclude unnecessary files +cat > .dockerignore << EOF +node_modules +npm-debug.log +.git +.gitignore +README.md +.env +.nyc_output +coverage +.cache +EOF +``` + +2. **Multi-stage build optimization**: + +```dockerfile +# Separate build and runtime stages +FROM node:18-alpine AS development +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +EXPOSE 3000 +CMD ["npm", "run", "dev"] + +FROM node:18-alpine AS production +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production +COPY . . +EXPOSE 3000 +CMD ["npm", "start"] +``` + +3. **Debugging build failures**: + +```bash +# Build with verbose output +docker build --progress=plain --no-cache -t debug-image . + +# Inspect intermediate layers +docker run -it /bin/sh + +# Check build logs +docker build --progress=plain . 2>&1 | tee build.log +``` + +4. 
**Environment-specific configurations**: + +```dockerfile +# Use ARG for build-time variables +ARG NODE_ENV=production +ARG BUILD_VERSION + +# Use ENV for runtime variables +ENV NODE_ENV=${NODE_ENV} +ENV BUILD_VERSION=${BUILD_VERSION} + +# Conditional commands based on environment +RUN if [ "$NODE_ENV" = "development" ]; then npm install; else npm ci --only=production; fi +``` + + + + + +Resolve repository-related issues that prevent analysis: + +1. **Check repository connectivity**: + +```bash +# Verify SleakOps can access your repository +# In SleakOps dashboard: +# 1. Go to Project Settings +# 2. Check Repository Connection Status +# 3. Test connection if available +``` + +2. **Branch configuration**: + +- Ensure SleakOps is monitoring the correct branch +- Check if Dockerfile exists in the monitored branch +- Verify branch protection rules don't block SleakOps + +3. **Repository permissions**: + +```yaml +# For GitHub, ensure SleakOps has these permissions: +permissions: + - contents: read + - metadata: read + - pull_requests: read (if using PR workflows) +``` + +4. **Webhook configuration**: + +- Verify webhooks are configured for repository updates +- Check webhook delivery logs for failures +- Ensure webhook URLs are accessible from your repository + +5. **File path issues**: + +```bash +# Ensure Dockerfile is in the expected location +ls -la Dockerfile # Should be in repository root + +# Check for hidden characters or encoding issues +file Dockerfile +hexdump -C Dockerfile | head +``` + + + + + +Configure projects for optimal analysis: + +1. **Project structure requirements**: + +```text +your-repository/ +├── Dockerfile # Required in root +├── .dockerignore # Recommended +├── package.json # For Node.js projects +├── requirements.txt # For Python projects +├── pom.xml # For Java Maven projects +├── build.gradle # For Java Gradle projects +└── src/ # Application source code +``` + +2. 
**Environment-specific Dockerfiles**: + +```dockerfile +# Dockerfile.development +FROM node:18-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +EXPOSE 3000 +CMD ["npm", "run", "dev"] +``` + +```dockerfile +# Dockerfile.production (or just Dockerfile) +FROM node:18-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production +COPY . . +EXPOSE 3000 +CMD ["npm", "start"] +``` + +3. **Configuration file handling**: + +```dockerfile +# Handle multiple configuration files +COPY config/ ./config/ +COPY public/ ./public/ + +# Use build arguments for environment-specific configs +ARG CONFIG_ENV=production +COPY config/${CONFIG_ENV}.json ./config/active.json +``` + +4. **Resource optimization**: + +```dockerfile +# Set appropriate resource limits in Dockerfile +ENV NODE_OPTIONS="--max-old-space-size=2048" + +# Use multi-stage builds to reduce image size +FROM node:18-alpine AS deps +# ... dependency installation + +FROM node:18-alpine AS runner +COPY --from=deps /app/node_modules ./node_modules +# ... 
rest of application +``` + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/project-vargroups-environment-variables.mdx b/docs/troubleshooting/project-vargroups-environment-variables.mdx new file mode 100644 index 000000000..ae679e142 --- /dev/null +++ b/docs/troubleshooting/project-vargroups-environment-variables.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "Vargroups Environment Variables Not Available in Application" +description: "Solution for applications not receiving all expected environment variables from Vargroups" +date: "2024-12-30" +category: "project" +tags: ["vargroups", "environment-variables", "project", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Vargroups Environment Variables Not Available in Application + +**Date:** December 30, 2024 +**Category:** Project +**Tags:** Vargroups, Environment Variables, Project, Configuration + +## Problem Description + +**Context:** User has created multiple Vargroups but their application is only receiving environment variables from one specific Vargroup ('simplee-web'), while other expected environment variables are missing. 
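A quick way to make the gap visible from inside the Pod is to diff the variables the application expects against what actually reached its environment (the names below are placeholders for your Vargroup's keys):

```python
# Report which expected environment variables actually reached this process.
import os

EXPECTED = ["DATABASE_URL", "API_KEY", "ENVIRONMENT"]  # placeholder names

present = [name for name in EXPECTED if name in os.environ]
missing = [name for name in EXPECTED if name not in os.environ]
print("present:", present)
print("missing:", missing)
```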
+ +**Observed Symptoms:** + +- Application only shows environment variables from one Vargroup +- Missing environment variables that should be available +- Other Vargroups appear to not be affecting the application +- Application may not be reading environment variables correctly from the Pod + +**Relevant Configuration:** + +- Application: Web application (simplee.cl) +- Available Vargroup: 'simplee-web' +- Platform: SleakOps project environment +- Missing: Additional Vargroups with required environment variables + +**Error Conditions:** + +- Environment variables from other Vargroups are not reaching the application +- Only one Vargroup's variables are being applied +- Application functionality may be impacted by missing variables + +## Detailed Solution + + + +Vargroups in SleakOps are **project-scoped**, meaning: + +- Each Vargroup only affects the specific Project where it was created +- Vargroups from other Projects are not accessible +- You need to create all required Vargroups within the same Project + +**To verify your current Vargroups:** + +1. Navigate to your Project in SleakOps +2. Go to **Configuration** → **Vargroups** +3. Check which Vargroups exist in this specific Project + + + + + +If you're missing Vargroups in your current Project: + +1. **Navigate to Project Settings:** + + - Go to your Project dashboard + - Select **Configuration** → **Vargroups** + +2. **Create New Vargroup:** + + - Click **Add Vargroup** + - Enter a descriptive name + - Add all required environment variables + +3. **Configure Variables:** + + ```bash + # Example Vargroup configuration + DATABASE_URL=postgresql://user:pass@host:5432/db + API_KEY=your-api-key-here + ENVIRONMENT=production + ``` + +4. 
**Apply Changes:** + - Save the Vargroup + - Redeploy your application to apply the new variables + + + + + +Ensure your application code is properly reading environment variables: + +**For Node.js applications:** + +```javascript +// Check if variables are available +console.log("Available env vars:", Object.keys(process.env)); + +// Access specific variables +const dbUrl = process.env.DATABASE_URL; +const apiKey = process.env.API_KEY; +``` + +**For Python applications:** + +```python +import os + +# Check available variables +print('Available env vars:', list(os.environ.keys())) + +# Access specific variables +db_url = os.environ.get('DATABASE_URL') +api_key = os.environ.get('API_KEY') +``` + +**Debug in Pod:** + +```bash +# Connect to your pod and check environment +kubectl exec -it <pod-name> -- env | grep -i <variable-name> +``` + + + + + +**Step 1: Verify Vargroup Assignment** + +- Ensure all required Vargroups are created in the correct Project +- Check that Vargroups are properly assigned to your application + +**Step 2: Check Deployment Status** + +- After creating new Vargroups, trigger a new deployment +- Environment variables are applied at deployment time + +**Step 3: Validate Variable Names** + +- Ensure variable names match exactly (case-sensitive) +- Check for typos in variable names + +**Step 4: Application Code Review** + +- Verify your application is reading from `process.env` or equivalent +- Check if variables are being used in the correct scope + +**Step 5: Pod-level Verification** + +```bash +# Check environment variables in the running pod +kubectl get pods -n <namespace> +kubectl exec -it <pod-name> -n <namespace> -- printenv +``` + + + + + +**Issue 1: Variables Not Updated After Changes** + +- **Solution:** Redeploy the application after Vargroup changes +- Environment variables are injected at container startup + +**Issue 2: Case Sensitivity** + +- **Solution:** Ensure exact case matching between Vargroup and application code +- Linux containers are case-sensitive + +**Issue 3: Variable Overriding**
+ +- **Solution:** Check if multiple Vargroups define the same variable +- Later Vargroups may override earlier ones + +**Issue 4: Application Not Reading Environment** + +- **Solution:** Verify application framework properly loads environment variables +- Some frameworks require explicit configuration + + + +--- + +_This FAQ was automatically generated on December 30, 2024 based on a real user query._ diff --git a/docs/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx b/docs/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx new file mode 100644 index 000000000..47653e9f2 --- /dev/null +++ b/docs/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx @@ -0,0 +1,220 @@ +--- +sidebar_position: 3 +title: "Prometheus Memory Issues Causing Grafana Data Loss" +description: "Solution for Prometheus backend pod crashes due to insufficient RAM causing Grafana to show no data" +date: "2024-12-11" +category: "dependency" +tags: ["prometheus", "grafana", "memory", "monitoring", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Prometheus Memory Issues Causing Grafana Data Loss + +**Date:** December 11, 2024 +**Category:** Dependency +**Tags:** Prometheus, Grafana, Memory, Monitoring, Troubleshooting + +## Problem Description + +**Context:** Grafana dashboards showing no data or empty metrics for several days due to underlying Prometheus backend pod failures caused by insufficient memory allocation. 
+ +**Observed Symptoms:** + +- Grafana dashboards display no data or metrics +- Prometheus backend pod crashes or remains in failed state +- Missing metrics collection for extended periods (days) +- Default namespace filter showing empty results in Grafana +- Loki dashboard not displaying log information + +**Relevant Configuration:** + +- Prometheus backend pod memory limit: Below required threshold +- Default Grafana namespace filter: 'default' (empty namespace) +- Affected timeframe: Multiple days of missing data +- Memory requirement: Minimum 1250Mi RAM needed + +**Error Conditions:** + +- Prometheus pod fails due to OOMKilled (Out of Memory) +- Metrics collection stops completely +- Grafana cannot retrieve historical data from the failed period +- Problem persists until manual intervention + +## Detailed Solution + + + +**Immediate Solution:** + +1. **Identify the failed Prometheus pod:** + + ```bash + kubectl get pods -n monitoring | grep prometheus + kubectl describe pod <prometheus-pod-name> -n monitoring + ``` + +2. **Check memory usage and limits:** + + ```bash + kubectl top pod -n monitoring + kubectl get pod <prometheus-pod-name> -n monitoring -o yaml | grep -A 5 -B 5 resources + ``` + +3. **Increase memory allocation manually:** + + ```bash + # Edit the Prometheus deployment + kubectl edit deployment prometheus-server -n monitoring + ``` + + Then add or modify the resources section: + + ```yaml + resources: + requests: + memory: "1250Mi" + limits: + memory: "2Gi" + ``` + +4. **Restart the deployment:** + ```bash + kubectl rollout restart deployment prometheus-server -n monitoring + ``` + + + + + +**Problem:** Grafana opens with the 'default' namespace filter, which typically contains no deployed applications. + +**Solution:** + +1. **Access Grafana dashboard** +2. **Change namespace filter:** + + - Look for the namespace dropdown (usually at the top) + - Select a namespace that contains your applications + - Common namespaces: `kube-system`, `monitoring`, `default`, or your application namespaces + +3. 
**Set a meaningful default:** + - Choose a namespace with active workloads + - Save the dashboard with the correct namespace selected + +**Available Dashboards:** + +- Kubernetes cluster overview +- Node metrics +- Pod metrics +- Application-specific dashboards +- Network monitoring +- Storage metrics + + + + + +**Problem:** Loki dashboard not showing log information due to read component failure. + +**Solution:** + +1. **Identify the Loki read pod:** + + ```bash + kubectl get pods -n monitoring | grep loki-read + ``` + +2. **Delete the problematic pod:** + + ```bash + kubectl delete pod <loki-read-pod-name> -n monitoring + ``` + +3. **Verify automatic recreation:** + + ```bash + kubectl get pods -n monitoring | grep loki-read + kubectl logs <loki-read-pod-name> -n monitoring + ``` + +4. **Test log collection:** + - Wait 2-3 minutes for the pod to fully start + - Check Grafana Loki dashboard for new log entries + - Verify logs are being collected from your applications + + + + + +**Monitoring Setup:** + +1. **Set up alerts for Prometheus memory usage:** + + ```yaml + # Example alert rule + - alert: PrometheusHighMemoryUsage + expr: (container_memory_usage_bytes{pod=~"prometheus.*"} / container_spec_memory_limit_bytes{pod=~"prometheus.*"}) > 0.8 + for: 5m + labels: + severity: warning + annotations: + summary: "Prometheus is using high memory" + ``` + +2. **Regular memory monitoring:** + + ```bash + # Check current memory usage + kubectl top pods -n monitoring + + # Monitor over time + watch kubectl top pods -n monitoring + ``` + +3. 
**Scaling considerations:** + - As cluster grows, Prometheus memory requirements increase + - Monitor metrics retention period + - Consider Prometheus federation for large clusters + - Adjust memory limits based on cluster size and retention policies + +**Future Platform Improvements:** + +- Memory limits will be adjustable through the SleakOps frontend +- Automatic scaling based on cluster size +- Proactive monitoring and alerting for resource constraints + + + + + +**Important Notes:** + +- **Lost metrics cannot be recovered:** Data from the period when Prometheus was down is permanently lost +- **Plan for redundancy:** Consider setting up Prometheus federation or external storage for critical metrics +- **Backup strategies:** Implement regular Prometheus data backups for critical environments + +**Mitigation for Production:** + +1. **High Availability setup:** + + ```yaml + # Example HA Prometheus configuration + prometheus: + prometheusSpec: + replicas: 2 + retention: 30d + resources: + requests: + memory: 2Gi + limits: + memory: 4Gi + ``` + +2. 
**External storage:** + - Configure remote write to external TSDB + - Use Thanos for long-term storage + - Implement cross-region backup strategies + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/prometheus-memory-issues-node-allocation.mdx b/docs/troubleshooting/prometheus-memory-issues-node-allocation.mdx new file mode 100644 index 000000000..0817fde9f --- /dev/null +++ b/docs/troubleshooting/prometheus-memory-issues-node-allocation.mdx @@ -0,0 +1,218 @@ +--- +sidebar_position: 3 +title: "Prometheus Memory Issues and Node Allocation" +description: "Solution for Prometheus pod crashes due to memory constraints and dynamic node allocation" +date: "2024-12-19" +category: "cluster" +tags: ["prometheus", "memory", "monitoring", "grafana", "node-allocation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Prometheus Memory Issues and Node Allocation + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** Prometheus, Memory, Monitoring, Grafana, Node Allocation + +## Problem Description + +**Context:** Prometheus pods in SleakOps clusters experience crashes due to memory constraints when allocated to nodes with insufficient resources, causing monitoring dashboards and metrics collection to fail. 
+ +**Observed Symptoms:** + +- Prometheus pod crashes due to memory exhaustion +- Grafana dashboards show no data or become unavailable +- Metrics are not being collected or stored +- Prometheus container shows yellow status (warning state) +- Monitoring functionality is completely disrupted + +**Relevant Configuration:** + +- Prometheus has dynamic memory requirements +- Node allocation is dynamic and may place Prometheus on undersized nodes +- Grafana depends on Prometheus for metrics data +- Loki may be affected by the same node resource constraints + +**Error Conditions:** + +- Occurs when Prometheus is scheduled on nodes with insufficient memory +- Problem is intermittent due to dynamic node allocation +- Affects all monitoring and observability features +- Can recur as cluster scaling changes node availability + +## Detailed Solution + + + +The immediate solution involves configuring Prometheus to always be scheduled on nodes with sufficient resources: + +1. **Access your cluster configuration** +2. **Modify Prometheus deployment** to include node affinity rules +3. **Ensure Prometheus targets larger nodes** with adequate memory + +```yaml +# Example node affinity configuration for Prometheus +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus +spec: + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: node.kubernetes.io/instance-type + operator: In + values: + - "m5.large" + - "m5.xlarge" + - "c5.large" + - "c5.xlarge" +``` + +This prevents Prometheus from being scheduled on smaller nodes that cannot handle its memory requirements. + + + + + +After applying the fix, verify that monitoring is working correctly: + +1. **Check Prometheus pod status**: + + ```bash + kubectl get pods -n monitoring | grep prometheus + ``` + + The pod should show `Running` status with all containers green. + +2. 
**Verify Grafana dashboards**: + + - **Container Logs (Loki)**: Check if log data is available + - **Compute and RAM (Prometheus)**: Verify metrics are being collected + - **Network Traffic (Prometheus)**: Confirm network metrics are updating + +3. **Test dashboard functionality**: + - Access Grafana interface + - Navigate to different dashboards + - Confirm data is being displayed with recent timestamps + + + + + +To identify when Prometheus is experiencing issues: + +1. **Visual indicators in SleakOps dashboard**: + + - Look for yellow containers (warning state) + - Check for red containers (failed state) + - Monitor resource usage graphs + +2. **Command line monitoring**: + + ```bash + # Check Prometheus pod resource usage + kubectl top pod -n monitoring | grep prometheus + + # Check pod events for memory issues + kubectl describe pod <prometheus-pod-name> -n monitoring + + # Monitor pod logs for memory errors + kubectl logs <prometheus-pod-name> -n monitoring + ``` + +3. **Set up alerts** for Prometheus pod restarts or memory usage spikes. + + + + + +Set appropriate resource requests and limits for Prometheus: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus +spec: + template: + spec: + containers: + - name: prometheus + resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "1000m" +``` + +**Guidelines for resource sizing**: + +- **Small clusters** (< 50 pods): 2Gi memory request, 4Gi limit +- **Medium clusters** (50-200 pods): 4Gi memory request, 8Gi limit +- **Large clusters** (> 200 pods): 8Gi memory request, 16Gi limit + + + + + +To prevent this issue from recurring: + +1. **Implement node taints and tolerations**: + + ```bash + # Taint nodes reserved for monitoring workloads + kubectl taint nodes <node-name> monitoring=true:NoSchedule + ``` + + Then add a toleration to the Prometheus deployment: + + ```yaml + tolerations: + - key: "monitoring" + operator: "Equal" + value: "true" + effect: "NoSchedule" + ``` + +2. **Use dedicated node pools** for monitoring components +3. 
**Implement cluster autoscaling** with minimum node requirements +4. **Set up monitoring alerts** for resource exhaustion +5. **Regular capacity planning** based on cluster growth + + + + + +If the problem persists after applying the initial fix: + +1. **Check cluster node capacity**: + + ```bash + kubectl describe nodes | grep -A 5 "Allocated resources" + ``` + +2. **Verify Prometheus configuration**: + + - Check scrape intervals and retention policies + - Review target discovery configuration + - Validate storage configuration + +3. **Examine cluster events**: + + ```bash + kubectl get events --sort-by=.metadata.creationTimestamp + ``` + +4. **Consider Prometheus federation** for very large clusters +5. **Implement Prometheus sharding** if single instance cannot handle the load + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/rails-console-large-script-execution.mdx b/docs/troubleshooting/rails-console-large-script-execution.mdx new file mode 100644 index 000000000..aed1cf72f --- /dev/null +++ b/docs/troubleshooting/rails-console-large-script-execution.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 15 +title: "Rails Console Large Script Execution Issue" +description: "Solution for executing large scripts in Rails console through Lens" +date: "2024-01-15" +category: "workload" +tags: ["rails", "console", "lens", "kubectl", "script-execution"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Rails Console Large Script Execution Issue + +**Date:** January 15, 2024 +**Category:** Workload +**Tags:** Rails, Console, Lens, kubectl, Script Execution + +## Problem Description + +**Context:** User experiences issues when trying to paste or execute large scripts (approximately 300 lines) in Rails console through Lens interface. 
+ +**Observed Symptoms:** + +- `IOError: ungetbyte failed` when pasting large scripts +- Error occurs in `/usr/local/lib/ruby/3.1.0/reline/ansi.rb:259` +- Console becomes unresponsive with large code blocks +- Issue specifically affects Rails IRB console accessed through Lens + +**Relevant Configuration:** + +- Ruby version: 3.1.0 +- Rails console accessed through Lens +- Script size: ~300 lines +- Error location: Reline ANSI module + +**Error Conditions:** + +- Error occurs when pasting large amounts of code +- Happens specifically in Rails console (IRB) +- Problem appears when using Lens console interface +- Issue does not occur with smaller code snippets + +## Detailed Solution + + + +The most reliable solution is to copy your script as a file to the pod and then execute it: + +```bash +# Copy script from local machine to pod +kubectl cp /home/user/local/path/script.rb namespace/pod_name:/tmp/script.rb + +# Then in Rails console, load and execute the script +load '/tmp/script.rb' +``` + +**Step-by-step process:** + +1. Save your script to a local file (e.g., `my_script.rb`) +2. Use `kubectl cp` to copy it to the pod +3. Access Rails console through Lens +4. Use `load` or `require` to execute the script + + + + + +You can also execute Rails scripts directly from the terminal: + +```bash +# Access pod terminal through Lens +# Then run Rails runner with your script +rails runner /tmp/script.rb + +# Or execute Ruby code directly +ruby -e "$(cat /tmp/script.rb)" +``` + +This bypasses the IRB console limitations entirely. 
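If you want a single file that works with both approaches — `load` from the Rails console and `rails runner` from a terminal — a small guard helps avoid running side effects at load time. A sketch under stated assumptions: `run_my_script` is a hypothetical name, and the body stands in for your real logic:

```ruby
# /tmp/my_script.rb — hypothetical example script.
# Wrapping the work in a method means `load` only defines it;
# nothing executes until you call it.
def run_my_script
  # ... your ~300 lines of logic go here ...
  records_processed = 3
  "processed #{records_processed} records"
end

# Runs automatically under `ruby` or `rails runner`, but after
# `load '/tmp/my_script.rb'` in the console you invoke
# run_my_script yourself, once you are ready.
puts run_my_script if __FILE__ == $PROGRAM_NAME
```

This keeps the console interactive (you can call the method repeatedly) while the same file remains usable as a one-shot script.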
+ + + + + +**Remember that pod filesystems are ephemeral:** + +- Files copied to pods will be lost when pods restart +- Use `/tmp` directory for temporary scripts +- For persistent scripts, consider using ConfigMaps or mounted volumes + +**Creating a ConfigMap for reusable scripts:** + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: rails-scripts +data: + my-script.rb: | + # Your Ruby script content here + puts "Hello from ConfigMap script" +``` + +Then mount it in your deployment and access at `/scripts/my-script.rb`. + + + + + +**Method 1: Split large scripts into smaller chunks** + +```ruby +# In Rails console, paste smaller sections at a time +# This avoids the buffer overflow issue +``` + +**Method 2: Use heredoc syntax** + +```ruby +# In Rails console +script_content = <<~RUBY + # Your script content here + # This can handle larger blocks better +RUBY + +eval(script_content) +``` + +**Method 3: Execute from Rails application** + +```ruby +# Create a rake task or Rails runner script +# rails runner 'path/to/your/script.rb' +``` + + + + + +If `kubectl cp` doesn't work: + +**Check pod and namespace:** + +```bash +# List pods in namespace +kubectl get pods -n your-namespace + +# Verify pod is running +kubectl describe pod pod-name -n your-namespace +``` + +**Correct syntax:** + +```bash +# From local to pod +kubectl cp ./local-file.rb namespace/pod-name:/tmp/remote-file.rb + +# From pod to local +kubectl cp namespace/pod-name:/tmp/remote-file.rb ./local-file.rb +``` + +**Common issues:** + +- Ensure the target directory exists in the pod +- Check file permissions +- Verify kubectl context is correct + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/react-environment-variables-runtime.mdx b/docs/troubleshooting/react-environment-variables-runtime.mdx new file mode 100644 index 000000000..1461ee06e --- /dev/null +++ 
b/docs/troubleshooting/react-environment-variables-runtime.mdx @@ -0,0 +1,201 @@ +--- +sidebar_position: 15 +title: "React Environment Variables Not Available at Runtime" +description: "Solution for React apps where environment variables from vargroups are not accessible at runtime" +date: "2024-12-19" +category: "project" +tags: ["react", "environment-variables", "dockerfile", "build", "runtime"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# React Environment Variables Not Available at Runtime + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** React, Environment Variables, Dockerfile, Build, Runtime + +## Problem Description + +**Context:** User has a React application deployed in SleakOps where environment variables defined in vargroups are not accessible at runtime through `process.env`, even though they work locally with `.env.local` files. + +**Observed Symptoms:** + +- Environment variables from vargroups don't appear in `process.env` logs +- Variables work correctly in local development with `.env.local` files +- Application builds successfully but variables are undefined at runtime +- Static build process doesn't include runtime environment variables + +**Relevant Configuration:** + +- Application type: React SPA (Single Page Application) +- Build process: Multi-stage Docker build with Node.js and Nginx +- Deployment: Static files served by Nginx +- Environment: SleakOps platform with vargroups + +**Error Conditions:** + +- Variables are undefined when accessed via `process.env` in the browser +- Issue occurs only in deployed environment, not locally +- Problem persists even when vargroups are properly configured + +## Detailed Solution + + + +The issue occurs because React environment variables are resolved during the **build process**, not at runtime. Here's what happens: + +1. 
**Build Stage**: Environment variables are embedded into the static JavaScript files during compilation +2. **Runtime Stage**: The application runs as static files served by Nginx, with no access to server environment variables +3. **Browser Execution**: `process.env` references are replaced with literal values during build + +This is why variables work locally (available during `yarn build`) but not in production (not available during Docker build). + + + + + +To make environment variables available during the Docker build process: + +1. **Add ARG declarations** to your Dockerfile: + +```dockerfile +FROM node:20-alpine AS build + +# Declare build arguments for environment variables +ARG PUBLIC_URL +ARG NODE_ENV +ARG REACT_APP_API_URL +ARG REACT_APP_API_KEY +# Add all your environment variables as ARG + +# Set working directory +WORKDIR /workspace/app + +# Copy the code to the container +COPY . . + +# Install dependencies +RUN yarn install + +# Build the app (ARG variables will be available as ENV during build) +RUN yarn run build + +# Production stage +FROM nginx:latest +WORKDIR /usr/share/nginx/html +COPY ./deploy/nginx/default.conf /etc/nginx/conf.d/default.conf +COPY --from=build /workspace/app/build . +EXPOSE 3000 +CMD ["nginx", "-g", "daemon off;"] +``` + +2. **Configure variables in SleakOps**: + - Go to your project's **Docker Args** vargroup + - Add all the environment variables your React app needs + - These will be passed as `--build-arg` during Docker build + + + + + +In SleakOps, you need to configure variables in the correct vargroup: + +1. **Navigate to Vargroups**: Go to your project → Configuration → Vargroups +2. **Use Docker Args vargroup**: Environment variables for React build must be in the "Docker Args" vargroup, not the runtime vargroup +3. 
**Add variables**: + ``` + PUBLIC_URL=https://your-domain.com + REACT_APP_API_URL=https://api.your-domain.com + REACT_APP_ENVIRONMENT=production + ``` + +**Important**: React only includes environment variables that start with `REACT_APP_` in the build. + + + + + +For true runtime environment variables, consider migrating to a framework with SSR capabilities: + +**Next.js Migration Benefits:** + +- Environment variables available at runtime +- Server-side rendering capabilities +- Better SEO and performance +- Runtime configuration without rebuilding + +**Basic Next.js configuration:** + +```javascript +// next.config.js +module.exports = { + env: { + CUSTOM_KEY: process.env.CUSTOM_KEY, + }, + // Or use runtime configuration + publicRuntimeConfig: { + apiUrl: process.env.API_URL, + }, +}; +``` + +**Dockerfile for Next.js:** + +```dockerfile +FROM node:20-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +RUN npm run build +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +To verify your solution works: + +1. **Check build logs**: Ensure ARG variables are available during build +2. **Inspect built files**: Look for your variables in the generated JavaScript +3. **Test in browser**: Use browser dev tools to verify `process.env` values +4. 
**Console logging**: Add temporary logs to verify variable availability + +```javascript +// Add this temporarily to verify variables +console.log("Environment variables:", { + apiUrl: process.env.REACT_APP_API_URL, + environment: process.env.REACT_APP_ENVIRONMENT, + nodeEnv: process.env.NODE_ENV, +}); +``` + + + + + +**Naming Convention:** + +- Always prefix with `REACT_APP_` for client-side variables +- Use descriptive names: `REACT_APP_API_BASE_URL` instead of `REACT_APP_URL` + +**Security Considerations:** + +- Never include sensitive data (API keys, secrets) in client-side environment variables +- All `REACT_APP_` variables are publicly accessible in the browser +- Use backend proxy for sensitive API calls + +**Development vs Production:** + +- Use different vargroups for different environments +- Test with production-like environment variables during development +- Document all required environment variables in your README + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/redis-connection-configuration.mdx b/docs/troubleshooting/redis-connection-configuration.mdx new file mode 100644 index 000000000..f3a85188e --- /dev/null +++ b/docs/troubleshooting/redis-connection-configuration.mdx @@ -0,0 +1,232 @@ +--- +sidebar_position: 3 +title: "Redis Connection Configuration Issues" +description: "Solution for Redis dependency connection problems in SleakOps projects" +date: "2024-01-15" +category: "dependency" +tags: ["redis", "connection", "aws", "elasticache", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Redis Connection Configuration Issues + +**Date:** January 15, 2024 +**Category:** Dependency +**Tags:** Redis, Connection, AWS, ElastiCache, Configuration + +## Problem Description + +**Context:** User created a Redis dependency for their SleakOps project which generated the corresponding variable 
group, but the Spring Boot application cannot connect to the Redis instance despite having the correct URL and port configuration. + +**Observed Symptoms:** + +- Redis dependency created successfully in SleakOps +- Variable group generated with `CACHE_REDIS_URL` variable +- Spring Boot application fails to connect with `RedisConnectionException` +- Error shows "Unable to connect to cache-xxx.amazonaws.com:6379" +- Connection attempts timeout or are interrupted + +**Relevant Configuration:** + +- Redis instance: AWS ElastiCache +- Spring Data Redis version: 3.3.2 +- Lettuce client version: 6.3.2.RELEASE +- Current URL format: `redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379` +- Port: 6379 (standard Redis port) + +**Error Conditions:** + +- Connection fails during application startup +- Error occurs when Spring tries to establish reactive Redis connection +- Multiple URL formats attempted without success +- Problem persists across different URL configurations + +## Detailed Solution + + + +For AWS ElastiCache Redis instances in SleakOps, the correct URL format should be: + +``` +redis://[hostname]:[port]/[database_number] +``` + +**Standard configuration:** + +``` +CACHE_REDIS_URL=redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +``` + +**Key points:** + +- Always include the database number (usually `/0` for default database) +- Port `6379` is the standard Redis port +- Use `redis://` protocol prefix for non-SSL connections +- Use `rediss://` for SSL connections if your ElastiCache has encryption in transit enabled + + + + + +The connection issue might be related to network configuration: + +**1. Check Security Groups:** + +- Ensure your EKS cluster's security group allows outbound traffic on port 6379 +- Verify ElastiCache security group allows inbound traffic from your EKS cluster + +**2. Subnet Configuration:** + +- ElastiCache and EKS cluster should be in the same VPC +- Subnets should have proper routing configured + +**3. 
Test connectivity from a pod:** + +```bash +# Create a test pod +kubectl run redis-test --image=redis:alpine --rm -it -- sh + +# Inside the pod, test connection +redis-cli -h cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com -p 6379 ping +``` + + + + + +Ensure your Spring Boot application is properly configured: + +**application.yml:** + +```yaml +spring: + data: + redis: + url: ${CACHE_REDIS_URL} + timeout: 10s + lettuce: + pool: + max-active: 8 + max-idle: 8 + min-idle: 0 +``` + +**Or using individual properties:** + +```yaml +spring: + data: + redis: + host: cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com + port: 6379 + database: 0 + timeout: 10s +``` + + + + + +If your ElastiCache instance has encryption in transit enabled: + +**1. Use `rediss://` protocol:** + +``` +CACHE_REDIS_URL=rediss://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +``` + +**2. Configure SSL in Spring Boot:** + +```yaml +spring: + data: + redis: + url: ${CACHE_REDIS_URL} + ssl: + enabled: true +``` + +**3. Check ElastiCache configuration:** + +- Go to AWS Console → ElastiCache → Redis clusters +- Verify if "Encryption in transit" is enabled +- If enabled, you must use SSL connection + + + + + +In SleakOps, ensure your Redis variables are correctly configured: + +**1. Check variable group:** + +- Go to your project → Dependencies → Redis +- Verify all required variables are present: + - `CACHE_REDIS_URL` + - `CACHE_REDIS_HOST` (if using individual properties) + - `CACHE_REDIS_PORT` (if using individual properties) + +**2. Variable format:** + +``` +CACHE_REDIS_URL=redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +CACHE_REDIS_HOST=cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com +CACHE_REDIS_PORT=6379 +``` + +**3. Apply changes:** + +- After modifying variables, redeploy your application +- Variables are injected during deployment + + + + + +**1. 
Enable Redis connection logging:** + +```yaml +logging: + level: + org.springframework.data.redis: DEBUG + io.lettuce.core: DEBUG +``` + +**2. Check application logs for detailed errors:** + +```bash +kubectl logs -f deployment/your-app-name +``` + +**3. Verify ElastiCache status:** + +- AWS Console → ElastiCache → Redis clusters +- Ensure status is "Available" +- Check for any maintenance windows + +**4. Test with Redis CLI from local machine:** + +```bash +# If you have VPN access to the VPC +redis-cli -h cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com -p 6379 ping +``` + +**5. Common URL formats to try:** + +``` +# Standard connection +redis://hostname:6379/0 + +# With authentication (if AUTH is enabled) +redis://:password@hostname:6379/0 + +# SSL connection +rediss://hostname:6379/0 +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/retool-rds-private-connection.mdx b/docs/troubleshooting/retool-rds-private-connection.mdx new file mode 100644 index 000000000..2a486efbb --- /dev/null +++ b/docs/troubleshooting/retool-rds-private-connection.mdx @@ -0,0 +1,208 @@ +--- +sidebar_position: 3 +title: "Connecting Retool to Private RDS Database" +description: "How to connect external Retool service to private RDS database in SleakOps" +date: "2024-08-30" +category: "dependency" +tags: ["retool", "rds", "database", "nlb", "private", "connection"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Connecting Retool to Private RDS Database + +**Date:** August 30, 2024 +**Category:** Dependency +**Tags:** Retool, RDS, Database, NLB, Private, Connection + +## Problem Description + +**Context:** Users need to connect external Retool dashboards to RDS databases deployed through SleakOps. Previously, they could modify security groups to allow direct access, but SleakOps deploys RDS instances in private subnets for security. 
+ +**Observed Symptoms:** + +- Cannot connect Retool directly to RDS database +- RDS is deployed in private subnets without internet access +- Security group modifications don't provide external connectivity +- Retool requires specific IP allowlisting for connection + +**Relevant Configuration:** + +- RDS database: PostgreSQL in private subnets +- Retool IPs to allowlist: `35.90.103.132/30`, `44.208.168.68/30` +- SleakOps VPC with private/public subnet architecture +- Security groups managed by SleakOps + +**Error Conditions:** + +- Connection timeouts when trying to reach RDS from Retool +- DNS resolution may fail for private RDS endpoints +- Port connectivity issues (typically port 5432 for PostgreSQL) + +## Detailed Solution + + + +SleakOps deploys RDS databases in private subnets for security best practices. This means: + +- **Private subnets**: No direct internet access +- **Security isolation**: Protected from external threats +- **VPC-only access**: Only resources within the VPC can connect + +To connect external services like Retool, you need to create a bridge between the private database and the internet. + + + + + +The recommended approach is to use a Network Load Balancer (NLB) as described in [Retool's documentation](https://docs.retool.com/center-of-excellence/patterns/AWS/Connect/privateresource). + +**Steps to implement:** + +1. **Create Network Load Balancer** + + - Deploy in public subnets + - Assign Elastic IP addresses + - Configure for TCP traffic on port 5432 + +2. **Configure Target Group** + + - Type: IP addresses + - Protocol: TCP + - Port: 5432 + - Target: Private IP of RDS instance + +3. **Update Security Groups** + + - NLB security group: Allow inbound from Retool IPs + - RDS security group: Allow inbound from NLB + +4. **Configure Retool Connection** + - Use NLB's Elastic IP as database host + - Standard port 5432 + - Same database credentials + +**Cost consideration:** Approximately $20 USD per month plus data transfer costs. 
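Once the NLB is in place, you can sanity-check TCP reachability to it from any machine before configuring Retool. A small sketch; the address below is a documentation placeholder, not a real NLB:

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False

# Example (replace with your NLB's Elastic IP or DNS name):
# print(port_open("203.0.113.10", 5432))
```

If this returns False from outside AWS but True from inside the VPC, the problem is almost always the NLB security group rules rather than the database itself.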
+ + + + + +For the NLB approach, configure security groups as follows: + +**NLB Security Group:** + +``` +Inbound Rules: +- Type: Custom TCP +- Port: 5432 +- Source: 35.90.103.132/30 (Retool IP range 1) +- Source: 44.208.168.68/30 (Retool IP range 2) + +Outbound Rules: +- Type: Custom TCP +- Port: 5432 +- Destination: RDS security group ID +``` + +**RDS Security Group (add rule):** + +``` +Inbound Rules: +- Type: PostgreSQL +- Port: 5432 +- Source: NLB security group ID +``` + + + + + +If NLB is not suitable, consider these alternatives: + +**1. Self-hosted Retool in Cluster** + +- Deploy Retool directly in your Kubernetes cluster +- Requires significant resources (16GB RAM, 8 CPUs) +- Higher infrastructure costs but better security +- Direct access to private resources + +**2. RDS Read Replica in Public Subnets** + +- Create read-only replica in public subnets +- Use for reporting/dashboard purposes only +- Maintains security of primary database +- Additional cost for replica instance + +**3. 
VPN Connection (if supported)** + +- Some Retool plans support VPN connections +- Check with Retool support for availability +- Most secure option if available +- May require enterprise Retool plan + + + + + +To test connectivity before configuring Retool: + +**From your local machine with VPN:** + +```bash +# Test port connectivity +nmap -Pn -p 5432 your-rds-endpoint.amazonaws.com + +# Test database connection +psql -h your-rds-endpoint.amazonaws.com -p 5432 -U username -d database_name +``` + +**From within the cluster:** + +```bash +# Create test pod +kubectl run postgres-client --rm -it --image=postgres:13 -- bash + +# Inside the pod +psql -h your-rds-endpoint.amazonaws.com -p 5432 -U username -d database_name +``` + +**After NLB setup:** + +```bash +# Test through NLB (from internet) +nmap -Pn -p 5432 your-nlb-elastic-ip +``` + + + + + +**Connection timeouts:** + +- Verify security group rules are correctly configured +- Check NLB target group health status +- Ensure RDS is in running state + +**DNS resolution issues:** + +- Use IP addresses instead of hostnames for testing +- Verify DNS settings in your network + +**Authentication failures:** + +- Confirm database credentials are correct +- Check if database user has necessary permissions +- Verify database name is correct + +**NLB health check failures:** + +- Ensure target group points to correct RDS private IP +- Verify RDS security group allows traffic from NLB +- Check RDS subnet routing tables + + + +--- + +_This FAQ was automatically generated based on a real user query about connecting Retool to private RDS databases in SleakOps._ diff --git a/docs/troubleshooting/s3-bucket-access-authentication.mdx b/docs/troubleshooting/s3-bucket-access-authentication.mdx new file mode 100644 index 000000000..3103f37e0 --- /dev/null +++ b/docs/troubleshooting/s3-bucket-access-authentication.mdx @@ -0,0 +1,516 @@ +--- +sidebar_position: 3 +title: "S3 Bucket Access and Authentication Issues" +description: "Resolving 
AWS S3 bucket access problems with IAM roles and cross-project authentication" +date: "2025-02-11" +category: "dependency" +tags: ["s3", "aws", "authentication", "boto3", "iam", "bucket"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# S3 Bucket Access and Authentication Issues + +**Date:** February 11, 2025 +**Category:** Dependency +**Tags:** S3, AWS, Authentication, Boto3, IAM, Bucket + +## Problem Description + +**Context:** User created a private S3 bucket through SleakOps and is experiencing authentication issues when accessing it from Python applications using boto3. The bucket works with explicit AWS credentials but fails with IAM role-based authentication within the cluster. + +**Observed Symptoms:** + +- `botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden` +- Authentication works outside the cluster with explicit credentials +- Authentication fails inside the cluster without explicit credentials +- Need to access S3 bucket from different projects/services + +**Relevant Configuration:** + +- Bucket: Private S3 bucket created through SleakOps +- Library: boto3 (Python) +- Environment: EKS cluster with IAM roles +- Access pattern: Both same-project and cross-project access needed + +**Error Conditions:** + +- Error occurs when using IAM role authentication within pods +- Problem appears during HeadObject and other S3 operations +- Works with explicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY +- Fails when relying on pod identity or service account roles + +## Detailed Solution + + + +SleakOps provides automatic IAM role-based authentication for S3 buckets created within projects. This means: + +1. **Same-project access**: No explicit credentials needed +2. **Cross-project access**: Requires additional configuration +3. 
**External access**: Requires explicit credentials or presigned URLs
+
+```python
+# Within the same project - no credentials needed
+import boto3
+
+s3_client = boto3.client('s3', region_name='us-east-1')
+# This should work automatically
+```
+
+
+
+
+
+To debug authentication problems:
+
+1. **Enter a pod in your project**:
+
+```bash
+kubectl exec -it <pod-name> -- /bin/bash
+```
+
+2. **Install AWS CLI** (if not present):
+
+```bash
+apt-get update && apt-get install -y awscli
+# or
+pip install awscli
+```
+
+3. **Clear any existing credentials**:
+
+```bash
+unset AWS_ACCESS_KEY_ID
+unset AWS_SECRET_ACCESS_KEY
+unset AWS_SESSION_TOKEN
+```
+
+4. **Test authentication**:
+
+```bash
+aws s3 ls
+# Should list buckets if authentication works
+```
+
+5. **Test specific bucket access**:
+
+```bash
+aws s3 ls s3://your-bucket-name
+```
+
+
+
+
+
+For same-project S3 access:
+
+```python
+import boto3
+from botocore.exceptions import ClientError
+
+# Initialize S3 client without explicit credentials
+# The IAM role will be used automatically
+s3_client = boto3.client(
+    's3',
+    region_name='us-east-1'  # Specify your region
+)
+
+try:
+    # Test bucket access
+    response = s3_client.head_bucket(Bucket='your-bucket-name')
+    print("Bucket access successful")
+except ClientError as e:
+    print(f"Error accessing bucket: {e}")
+```
+
+For listing objects:
+
+```python
+try:
+    response = s3_client.list_objects_v2(Bucket='your-bucket-name')
+    for obj in response.get('Contents', []):
+        print(f"Object: {obj['Key']}")
+except ClientError as e:
+    print(f"Error listing objects: {e}")
+```
+
+
+
+
+
+To access an S3 bucket from a different project:
+
+1. **In SleakOps dashboard**:
+
+   - Go to the **source project** (where the bucket was created)
+   - Navigate to **Project Settings** → **Access Config**
+   - Add the target project or service account
+
+2.
**Grant specific permissions**: + + - Select the S3 bucket resource + - Choose appropriate permissions (read, write, delete) + - Save the configuration + +3. **Use the same authentication method**: + +```python +# No changes needed in code - IAM roles handle cross-project access +s3_client = boto3.client('s3', region_name='us-east-1') +``` + + + + + +For HTTP access from external services or browsers: + +```python +import boto3 +from botocore.exceptions import ClientError + +def generate_presigned_url(bucket_name, object_key, expiration=3600): + """Generate a presigned URL for S3 object access""" + s3_client = boto3.client('s3', region_name='us-east-1') + + try: + response = s3_client.generate_presigned_url( + 'get_object', + Params={'Bucket': bucket_name, 'Key': object_key}, + ExpiresIn=expiration + ) + return response + except ClientError as e: + print(f"Error generating presigned URL: {e}") + return None + +# Usage +url = generate_presigned_url('your-bucket', 'path/to/file.txt') +if url: + print(f"Download URL: {url}") +``` + +For upload URLs: + +```python +def generate_presigned_upload_url(bucket_name, object_key, expiration=3600): + """Generate a presigned URL for uploading files""" + s3_client = boto3.client('s3', region_name='us-east-1') + + try: + response = s3_client.generate_presigned_url( + 'put_object', + Params={'Bucket': bucket_name, 'Key': object_key}, + ExpiresIn=expiration + ) + return response + except ClientError as e: + print(f"Error generating upload URL: {e}") + return None +``` + + + + + +If you're still getting 403 errors: + +1. **Check IAM role permissions**: + + - Verify the pod's service account has the correct IAM role + - Ensure the role has S3 permissions for your bucket + +2. **Verify bucket policy**: + - Check if the bucket has restrictive policies + - Ensure your IAM role is allowed to access the bucket + +3. 
**Test access step by step**: + + ```python + import boto3 + import botocore + + def test_s3_access(bucket_name): + try: + s3_client = boto3.client('s3') + + # Test 1: List bucket (basic access) + print("Testing bucket listing...") + response = s3_client.list_objects_v2(Bucket=bucket_name, MaxKeys=1) + print("✓ Bucket listing successful") + + # Test 2: Head bucket (check permissions) + print("Testing bucket head...") + s3_client.head_bucket(Bucket=bucket_name) + print("✓ Bucket head successful") + + # Test 3: Get bucket location + print("Testing bucket location...") + location = s3_client.get_bucket_location(Bucket=bucket_name) + print(f"✓ Bucket location: {location.get('LocationConstraint', 'us-east-1')}") + + return True + + except botocore.exceptions.ClientError as e: + error_code = e.response['Error']['Code'] + print(f"✗ Error {error_code}: {e.response['Error']['Message']}") + return False + except Exception as e: + print(f"✗ Unexpected error: {str(e)}") + return False + + # Test your bucket + test_s3_access('your-bucket-name') + ``` + +4. **Common resolution steps**: + + ```bash + # Check current AWS identity + aws sts get-caller-identity + + # Test bucket access from CLI + aws s3 ls s3://your-bucket-name/ + + # Check IAM role attached to service account + kubectl describe serviceaccount default + kubectl describe serviceaccount + ``` + + + + + +For accessing S3 buckets across different SleakOps projects: + +### Method 1: Using Explicit Credentials (Recommended) + +1. **Create IAM user with S3 permissions**: + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:DeleteObject", + "s3:ListBucket" + ], + "Resource": [ + "arn:aws:s3:::your-bucket-name", + "arn:aws:s3:::your-bucket-name/*" + ] + } + ] + } + ``` + +2. **Store credentials in SleakOps Variable Groups**: + + ```bash + # Create a Variable Group with S3 credentials + AWS_ACCESS_KEY_ID=AKIA... + AWS_SECRET_ACCESS_KEY=... 
+ AWS_DEFAULT_REGION=us-east-1 + S3_BUCKET_NAME=your-bucket-name + ``` + +3. **Use credentials in your application**: + + ```python + import boto3 + import os + + def get_s3_client(): + return boto3.client( + 's3', + aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'], + aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'], + region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1') + ) + + # Usage + s3_client = get_s3_client() + bucket_name = os.environ['S3_BUCKET_NAME'] + ``` + +### Method 2: Cross-Account IAM Role (Advanced) + +For more advanced setups, you can configure cross-account IAM role assumptions: + +1. **Trust relationship policy on the target bucket's IAM role**: + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::SOURCE-ACCOUNT:role/SERVICE-ROLE-NAME" + }, + "Action": "sts:AssumeRole" + } + ] + } + ``` + +2. **Python code for role assumption**: + + ```python + import boto3 + from botocore.exceptions import BotoCoreError, ClientError + + def assume_role_and_get_s3_client(role_arn, session_name='s3-access'): + try: + sts_client = boto3.client('sts') + + response = sts_client.assume_role( + RoleArn=role_arn, + RoleSessionName=session_name + ) + + credentials = response['Credentials'] + + return boto3.client( + 's3', + aws_access_key_id=credentials['AccessKeyId'], + aws_secret_access_key=credentials['SecretAccessKey'], + aws_session_token=credentials['SessionToken'] + ) + except (BotoCoreError, ClientError) as e: + print(f"Error assuming role: {e}") + return None + ``` + + + + + +### Security Recommendations + +1. **Use least privilege access**: + + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "s3:GetObject" + ], + "Resource": "arn:aws:s3:::your-bucket-name/specific-prefix/*" + } + ] + } + ``` + +2. 
**Enable bucket versioning and logging**: + + ```python + # Enable versioning + s3_client.put_bucket_versioning( + Bucket='your-bucket-name', + VersioningConfiguration={'Status': 'Enabled'} + ) + + # Enable access logging + s3_client.put_bucket_logging( + Bucket='your-bucket-name', + BucketLoggingStatus={ + 'LoggingEnabled': { + 'TargetBucket': 'your-log-bucket', + 'TargetPrefix': 'access-logs/' + } + } + ) + ``` + +3. **Use presigned URLs for temporary access**: + + ```python + def generate_presigned_url(bucket_name, object_key, expiration=3600): + try: + response = s3_client.generate_presigned_url( + 'get_object', + Params={'Bucket': bucket_name, 'Key': object_key}, + ExpiresIn=expiration + ) + return response + except ClientError as e: + print(f"Error generating presigned URL: {e}") + return None + + # Generate a URL that expires in 1 hour + url = generate_presigned_url('my-bucket', 'my-file.txt', 3600) + ``` + +4. **Implement proper error handling**: + + ```python + import boto3 + from botocore.exceptions import ClientError, NoCredentialsError + + def safe_s3_operation(bucket_name, key): + try: + s3_client = boto3.client('s3') + response = s3_client.get_object(Bucket=bucket_name, Key=key) + return response['Body'].read() + + except NoCredentialsError: + print("AWS credentials not found") + return None + + except ClientError as e: + error_code = e.response['Error']['Code'] + if error_code == 'NoSuchKey': + print(f"Object {key} not found in bucket {bucket_name}") + elif error_code == 'AccessDenied': + print(f"Access denied to {key} in bucket {bucket_name}") + elif error_code == 'NoSuchBucket': + print(f"Bucket {bucket_name} not found") + else: + print(f"Error {error_code}: {e.response['Error']['Message']}") + return None + ``` + +### Monitoring and Logging + +Set up CloudWatch monitoring for S3 access: + +```python +import boto3 + +def setup_s3_monitoring(bucket_name): + cloudwatch = boto3.client('cloudwatch') + + # Create alarm for 4xx errors + 
cloudwatch.put_metric_alarm( + AlarmName=f'{bucket_name}-4xx-errors', + ComparisonOperator='GreaterThanThreshold', + EvaluationPeriods=1, + MetricName='4xxErrors', + Namespace='AWS/S3', + Period=300, + Statistic='Sum', + Threshold=10.0, + ActionsEnabled=True, + Dimensions=[ + { + 'Name': 'BucketName', + 'Value': bucket_name + }, + ] + ) +``` + + + +--- + +_This FAQ was automatically generated on February 11, 2025 based on a real user query._ diff --git a/docs/troubleshooting/security-ddos-protection-aws-waf.mdx b/docs/troubleshooting/security-ddos-protection-aws-waf.mdx new file mode 100644 index 000000000..46b61fbee --- /dev/null +++ b/docs/troubleshooting/security-ddos-protection-aws-waf.mdx @@ -0,0 +1,214 @@ +--- +sidebar_position: 3 +title: "DDoS Protection and Bot Attack Prevention for Web Services" +description: "Configure AWS WAF and CloudFront for DDoS protection and bot attack prevention in SleakOps web services" +date: "2024-12-19" +category: "workload" +tags: + ["security", "ddos", "waf", "cloudfront", "aws", "webservice", "protection"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# DDoS Protection and Bot Attack Prevention for Web Services + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** Security, DDoS, WAF, CloudFront, AWS, WebService, Protection + +## Problem Description + +**Context:** Users need to understand what DDoS and bot attack protection is available for web services deployed in SleakOps, and whether additional CDN or security measures are required. 
+ +**Observed Symptoms:** + +- Uncertainty about existing DDoS protection for web services +- Questions about whether CloudFront is already implemented +- Need to understand if additional CDN configuration is required +- Concerns about bot attack prevention + +**Relevant Configuration:** + +- Platform: AWS-based SleakOps deployment +- Service type: Web services with public load balancers +- Security requirement: DDoS and bot attack protection +- AWS services: WAF, CloudFront, Load Balancer + +**Error Conditions:** + +- Web services may be vulnerable to DDoS attacks without proper protection +- Bot attacks could impact service availability +- Lack of clarity on existing security measures + +## Detailed Solution + + + +SleakOps web services are not automatically protected against DDoS attacks by default. You need to manually configure AWS WAF (Web Application Firewall) to protect your applications. + +**Key Points:** + +- Web services use public load balancers that are exposed to the internet +- AWS provides DDoS protection through AWS WAF +- Protection must be explicitly configured and attached to your load balancer + + + + + +To configure AWS WAF for your SleakOps web services: + +1. **Access AWS Console** + + - Go to AWS Console + - Search for "WAF" in the search bar + - Select "AWS WAF & Shield" + +2. **Create Web ACL** + + - Click "Create web ACL" + - Choose "Regional resources (Application Load Balancer, API Gateway)" + - Select your region where the cluster is deployed + +3. **Associate with Load Balancer** + + - In the "Associated AWS resources" section + - Add your cluster's public load balancer + - The load balancer should appear in the dropdown list + +4. 
**Configure Rules** + ```yaml + # Example WAF configuration + Rules: + - AWS Managed Rules - Core Rule Set + - AWS Managed Rules - Known Bad Inputs + - AWS Managed Rules - SQL Database + - Rate Limiting Rule (custom) + ``` + + + + + +**AWS Managed Rules (Recommended):** + +- Pre-configured by AWS security experts +- Automatically updated for new threats +- More cost-effective than custom rules +- Cover common attack patterns: + - SQL injection + - Cross-site scripting (XSS) + - Known bad IP addresses + - Bot protection + +**Custom Rules:** + +- More expensive to maintain +- Require security expertise to configure +- Useful for specific business logic protection +- Can be combined with managed rules + +**Recommended Setup:** + +```yaml +Managed Rule Groups: + - AWSManagedRulesCommonRuleSet + - AWSManagedRulesKnownBadInputsRuleSet + - AWSManagedRulesBotControlRuleSet + - AWSManagedRulesAmazonIpReputationList +``` + + + + + +**CloudFront Benefits:** + +- Serves static files more efficiently +- Provides additional DDoS protection at edge locations +- Reduces load on your web services +- Improves global performance + +**When to Use CloudFront:** + +- Your application serves static content (images, CSS, JS) +- You have users in multiple geographic regions +- You want additional protection beyond WAF +- You need to reduce bandwidth costs + +**CloudFront + WAF Setup:** + +```yaml +# CloudFront distribution configuration +Origin: your-sleakops-loadbalancer.region.elb.amazonaws.com +Caching: + - Static files: Cache for 24 hours + - Dynamic content: No cache or short TTL +WAF: Associate the same WAF Web ACL with CloudFront +``` + + + + + +**AWS WAF Costs:** + +- Web ACL: $1.00 per month +- Rules: $0.60 per million requests +- Managed rule groups: $1.00-$10.00 per month each + +**CloudFront Costs:** + +- Data transfer: $0.085 per GB (varies by region) +- Requests: $0.0075 per 10,000 requests +- Additional WAF costs if applied to CloudFront + +**Cost-Effective Approach:** + +1. 
Start with AWS WAF on load balancer only
+2. Use AWS managed rules (cheaper than custom)
+3. Add CloudFront if you have significant static content
+4. Monitor costs and adjust rules as needed
+
+
+
+
+
+**Phase 1: Basic WAF Protection**
+
+1. Identify your cluster's load balancer ARN
+2. Create WAF Web ACL in AWS Console
+3. Add managed rule groups:
+   - Core Rule Set
+   - Known Bad Inputs
+   - IP Reputation List
+4. Associate with load balancer
+5. Test and monitor
+
+**Phase 2: Enhanced Protection (Optional)**
+
+1. Add Bot Control managed rules
+2. Configure rate limiting rules
+3. Set up CloudWatch monitoring
+4. Create custom rules if needed
+
+**Phase 3: CloudFront Integration (If Needed)**
+
+1. Create CloudFront distribution
+2. Point origin to your load balancer
+3. Configure caching policies
+4. Associate WAF with CloudFront
+5. Update DNS to point to CloudFront
+
+**Monitoring and Maintenance:**
+
+```bash
+# Monitor WAF metrics
+aws wafv2 get-sampled-requests --web-acl-arn <web-acl-arn> --rule-metric-name <rule-metric-name> --scope REGIONAL --time-window StartTime=<start-time>,EndTime=<end-time> --max-items 100
+```
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/ssl-certificate-management-multiple-domains.mdx b/docs/troubleshooting/ssl-certificate-management-multiple-domains.mdx
new file mode 100644
index 000000000..62f4b40e7
--- /dev/null
+++ b/docs/troubleshooting/ssl-certificate-management-multiple-domains.mdx
@@ -0,0 +1,181 @@
+---
+sidebar_position: 3
+title: "SSL Certificate Management for Multiple Domains"
+description: "Managing SSL certificates across regions and multiple subdomains in SleakOps"
+date: "2024-12-19"
+category: "project"
+tags: ["ssl", "certificates", "domains", "aws", "cloudfront", "acm"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# SSL Certificate Management for Multiple Domains
+
+**Date:** December 19, 2024
+**Category:** Project
+**Tags:** SSL, Certificates, Domains, AWS, CloudFront, ACM + +## Problem Description + +**Context:** Users managing multiple subdomains and SSL certificates in SleakOps may encounter issues with certificate validation, regional placement, and certificate reuse across different aliases. + +**Observed Symptoms:** + +- SSL certificate not validating for specific subdomains (e.g., media.app.develop.domain.com) +- Certificate working for some subdomains but not others (e.g., static.app.develop.domain.com works) +- Certificates appearing in wrong AWS regions (us-east-1 instead of intended region) +- Confusion about when to create individual certificates vs wildcard certificates + +**Relevant Configuration:** + +- Multiple subdomains under the same parent domain +- AWS Certificate Manager (ACM) certificates +- CloudFront distribution requirements +- Regional certificate placement requirements + +**Error Conditions:** + +- Certificate validation failures for specific subdomains +- Certificates created in us-east-1 when expected in other regions +- Uncertainty about proper certificate management strategy + +## Detailed Solution + + + +SleakOps has an automatic certificate reuse feature that: + +- **Reuses existing certificates** when adding new aliases that belong to an already managed domain +- **Automatically detects** if a subdomain belongs to a domain with an existing certificate +- **Prevents duplicate certificates** for the same domain hierarchy + +This feature was implemented to optimize certificate management and reduce AWS ACM limits. 
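A detail worth understanding here is how wildcard coverage works: a wildcard SAN matches exactly one DNS label. The helper below is purely illustrative (not SleakOps or AWS code) and sketches the matching rule that determines whether an existing certificate can cover a new alias:

```python
# Illustrative sketch: decide whether a hostname is covered by a certificate's
# subject alternative names (SANs). A wildcard like "*.develop.domain.com"
# matches exactly ONE extra label, so it covers "api.develop.domain.com"
# but NOT "media.app.develop.domain.com".

def hostname_covered(hostname, sans):
    for san in sans:
        if san.startswith("*."):
            base = san[2:]
            prefix, dot, rest = hostname.partition(".")
            # Covered only if exactly one label precedes the wildcard base
            if dot and prefix and rest == base:
                return True
        elif hostname == san:
            return True
    return False
```

This is why a certificate for `*.develop.domain.com` can be reused for `api.develop.domain.com`, while a deeper name such as `media.app.develop.domain.com` needs its own certificate entry (e.g. a `*.app.develop.domain.com` SAN). The real SANs on a certificate can be inspected with `aws acm describe-certificate`.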
+ + + + + +**CloudFront Certificate Requirement:** + +Certificates in `us-east-1` are specifically for CloudFront distributions: + +- **Static content URLs** (like `static.app.develop.domain.com`) use CloudFront +- **CloudFront requires** SSL certificates to be in the `us-east-1` region +- **This is an AWS requirement**, not a SleakOps configuration issue + +**Regional Distribution:** + +- Application certificates: Deployed in your chosen region (e.g., Ohio) +- CloudFront certificates: Always in `us-east-1` + + + + + +To fix certificate validation issues: + +**Step 1: Delete existing certificates** + +1. Go to your SleakOps project dashboard +2. Navigate to **Domains & SSL** +3. Delete the problematic certificates +4. **Important:** This will not break existing services immediately + +**Step 2: Recreate certificates** + +1. Add your domains again in SleakOps +2. The platform will automatically create optimized certificates +3. SleakOps will reuse certificates where appropriate + +**Step 3: Verify validation** + +1. Check that all subdomains validate correctly +2. Monitor the certificate status in AWS ACM +3. 
Test all affected URLs + + + + + +**Recommended Approach:** + +For domains like `develop.domain.com` with multiple subdomains: + +``` +# Optimal certificate configuration +Certificate 1: *.develop.domain.com, develop.domain.com +- Covers: app.develop.domain.com +- Covers: api.develop.domain.com +- Covers: media.develop.domain.com +- Covers: develop.domain.com (apex) + +Certificate 2: static.develop.domain.com (us-east-1 only) +- For CloudFront distribution +- Must be in us-east-1 region +``` + +**Benefits:** + +- Reduces certificate count +- Simplifies management +- Covers all current and future subdomains +- Meets AWS regional requirements + + + + + +**Common validation issues and solutions:** + +**Issue 1: DNS validation not completing** + +- Verify DNS records are properly configured +- Check that domain ownership is confirmed in AWS +- Ensure DNS propagation has completed (can take up to 24 hours) + +**Issue 2: Certificate not applying to subdomain** + +- Confirm the certificate includes the specific subdomain +- Check wildcard certificate covers the subdomain pattern +- Verify the certificate is in the correct region for the service + +**Issue 3: Mixed certificate regions** + +- Application services: Use certificates in your deployment region +- CloudFront services: Must use us-east-1 certificates +- This is normal and expected behavior + +**Verification commands:** + +```bash +# Check certificate details +aws acm describe-certificate --certificate-arn arn:aws:acm:region:account:certificate/cert-id + +# Test SSL connection +openssl s_client -connect your-domain.com:443 -servername your-domain.com +``` + + + + + +**Planning your certificate strategy:** + +1. **Group related domains**: Use wildcard certificates for multiple subdomains +2. **Understand regional requirements**: Accept that CloudFront certificates will be in us-east-1 +3. **Use SleakOps automation**: Let the platform handle certificate reuse and optimization +4. 
**Monitor expiration**: Set up alerts for certificate renewal +5. **Document your setup**: Keep track of which certificates serve which domains + +**When to recreate certificates:** + +- Validation issues that persist after DNS propagation +- Need to add many new subdomains to existing setup +- Consolidating multiple individual certificates into wildcards +- Migrating between different domain structures + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ssl-certificate-subdomain-issues.mdx b/docs/troubleshooting/ssl-certificate-subdomain-issues.mdx new file mode 100644 index 000000000..be86a6541 --- /dev/null +++ b/docs/troubleshooting/ssl-certificate-subdomain-issues.mdx @@ -0,0 +1,247 @@ +--- +sidebar_position: 3 +title: "SSL Certificate Issues with API Subdomains" +description: "Troubleshooting SSL certificate problems for API endpoints and subdomains" +date: "2024-12-19" +category: "cluster" +tags: ["ssl", "certificates", "https", "api", "security"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# SSL Certificate Issues with API Subdomains + +**Date:** December 19, 2024 +**Category:** Cluster +**Tags:** SSL, Certificates, HTTPS, API, Security + +## Problem Description + +**Context:** Users report SSL certificate warnings when accessing specific API endpoints, while the main domain appears secure. This commonly occurs when applications have multiple subdomains or API paths that need proper SSL certificate coverage. 
+ +**Observed Symptoms:** + +- Browser shows "Not Secure" warning for specific API endpoints +- Main domain (e.g., `https://api.domain.com`) shows valid certificate +- Specific API paths (e.g., `https://api.domain.com/api/v3/...`) trigger security warnings +- Mobile browsers may be more sensitive to certificate issues +- File downloads or API calls may fail due to SSL validation + +**Relevant Configuration:** + +- Domain: API subdomain with multiple endpoints +- Certificate type: Likely single domain or insufficient wildcard coverage +- Platform: Kubernetes cluster with ingress controller +- Client access: Mobile and web browsers + +**Error Conditions:** + +- Error appears on specific API endpoints but not root domain +- Problem occurs across different browsers and devices +- Issue affects file downloads and API responses +- Certificate validation fails for certain paths + +## Detailed Solution + + + +First, check your current SSL certificate setup: + +```bash +# Check certificate details for your domain +openssl s_client -connect api.yourdomain.com:443 -servername api.yourdomain.com + +# Or use online tools like SSL Labs +# https://www.ssllabs.com/ssltest/ +``` + +Look for: + +- Certificate validity dates +- Subject Alternative Names (SAN) +- Certificate chain completeness +- Cipher suite compatibility + + + + + +Ensure your Kubernetes ingress is properly configured for SSL: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: api-ingress + annotations: + kubernetes.io/ingress.class: "nginx" + cert-manager.io/cluster-issuer: "letsencrypt-prod" + nginx.ingress.kubernetes.io/ssl-redirect: "true" + nginx.ingress.kubernetes.io/force-ssl-redirect: "true" +spec: + tls: + - hosts: + - api.yourdomain.com + secretName: api-tls-secret + rules: + - host: api.yourdomain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: api-service + port: + number: 80 +``` + + + + + +If using cert-manager, ensure proper configuration: 
+ +```yaml +apiVersion: cert-manager.io/v1 +kind: Certificate +metadata: + name: api-certificate + namespace: default +spec: + secretName: api-tls-secret + issuerRef: + name: letsencrypt-prod + kind: ClusterIssuer + dnsNames: + - api.yourdomain.com + - "*.api.yourdomain.com" # Wildcard for subpaths if needed +``` + +Verify cert-manager is working: + +```bash +# Check certificate status +kubectl get certificates + +# Check certificate details +kubectl describe certificate api-certificate + +# Check cert-manager logs +kubectl logs -n cert-manager deployment/cert-manager +``` + + + + + +1. **Check certificate coverage:** + + ```bash + # Test specific endpoints + curl -I https://api.yourdomain.com/api/v3/endpoint + + # Check certificate chain + openssl s_client -connect api.yourdomain.com:443 -showcerts + ``` + +2. **Verify DNS resolution:** + + ```bash + nslookup api.yourdomain.com + dig api.yourdomain.com + ``` + +3. **Test from different locations:** + + - Use online SSL checkers + - Test from different networks + - Check mobile vs desktop browsers + +4. **Check ingress controller logs:** + ```bash + kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + ``` + + + + + +**For SleakOps managed clusters:** + +1. **Update ingress annotations:** + + ```yaml + annotations: + nginx.ingress.kubernetes.io/backend-protocol: "HTTP" + nginx.ingress.kubernetes.io/ssl-redirect: "true" + nginx.ingress.kubernetes.io/proxy-body-size: "50m" + ``` + +2. **Ensure proper service configuration:** + + ```yaml + apiVersion: v1 + kind: Service + metadata: + name: api-service + spec: + ports: + - port: 80 + targetPort: 8080 + protocol: TCP + selector: + app: api-app + ``` + +3. 
**Check certificate renewal:** + ```bash + # Force certificate renewal + kubectl delete certificate api-certificate + kubectl apply -f certificate.yaml + ``` + +**For custom domains:** + +- Ensure DNS points to correct load balancer +- Verify certificate includes all required domains +- Check certificate chain is complete + + + + + +1. **Use wildcard certificates** for multiple subdomains: + + ```yaml + dnsNames: + - yourdomain.com + - "*.yourdomain.com" + ``` + +2. **Monitor certificate expiration:** + + ```bash + # Set up monitoring alerts + kubectl get certificates -o wide + ``` + +3. **Test SSL configuration regularly:** + + ```bash + # Automated SSL testing + curl -f https://api.yourdomain.com/health + ``` + +4. **Use HSTS headers** for enhanced security: + ```yaml + annotations: + nginx.ingress.kubernetes.io/server-snippet: | + add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always; + ``` + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/ssl-certificate-validation-aws-acm.mdx b/docs/troubleshooting/ssl-certificate-validation-aws-acm.mdx new file mode 100644 index 000000000..f7573dff3 --- /dev/null +++ b/docs/troubleshooting/ssl-certificate-validation-aws-acm.mdx @@ -0,0 +1,150 @@ +--- +sidebar_position: 3 +title: "SSL Certificate Validation Error in AWS ACM" +description: "How to resolve SSL certificate validation issues when ACM certificates fail to validate within 72 hours" +date: "2024-11-07" +category: "provider" +tags: ["aws", "acm", "ssl", "certificate", "dns", "validation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# SSL Certificate Validation Error in AWS ACM + +**Date:** November 7, 2024 +**Category:** Provider +**Tags:** AWS, ACM, SSL, Certificate, DNS, Validation + +## Problem Description + +**Context:** When deploying applications with custom domains in SleakOps, AWS 
Certificate Manager (ACM) certificates may fail to validate if the DNS validation is not completed within 72 hours of certificate creation.
+
+**Observed Symptoms:**
+
+- ACM certificate shows "Failed" or "Validation timed out" status
+- Domain aliases cannot be properly configured
+- SSL/TLS connections fail for the custom domain
+- Certificate appears as "erroneous" in AWS console
+- Applications may be inaccessible via HTTPS on custom domains
+
+**Relevant Configuration:**
+
+- Certificate type: AWS Certificate Manager (ACM)
+- Validation method: DNS validation
+- Domain: Custom domain with external DNS management
+- Time limit: 72 hours from certificate creation
+
+**Error Conditions:**
+
+- Certificate validation not completed within 72-hour window
+- DNS CNAME validation records not properly configured
+- Domain ownership not verified through DNS
+- Certificate regeneration required
+
+## Detailed Solution
+
+
+
+When an ACM certificate fails validation, you need to regenerate it:
+
+1. **Access your SleakOps execution dashboard**
+2. **Navigate to the affected execution** (e.g., your landing page or web application)
+3. **Go to the SSL/Certificate section**
+4. **Click "Regenerate Certificate"** or similar option
+5. **Wait for the new certificate to be created**
+
+SleakOps will automatically create a new ACM certificate with fresh validation records.
+
+
+
+
+
+After regenerating the certificate, you must add DNS validation records (ACM DNS validation uses CNAME records):
+
+1. **Copy the validation information** from your SleakOps execution details:
+
+   - **Name**: `_[validation-hash].yourdomain.com`
+   - **Value**: `_[validation-value].acm-validations.aws.`
+   - **Type**: `CNAME`
+
+2. **Add the CNAME record to your DNS provider**:
+
+   ```
+   Name: _595d6ebeeb98358afc0357d079d068f4.yourdomain.com
+   Type: CNAME
+   Value: _139389a2dc765df9b1c6bc66a1367077.djqtsrsxkq.acm-validations.aws.
+   TTL: 300 (or your provider's minimum)
+   ```
+
+3. 
**Save the DNS record** and wait for propagation (usually 5-15 minutes)
+
+
+
+
+After adding the DNS records:
+
+1. **Wait for DNS propagation** (5-15 minutes typically)
+2. **Use the "Check Certificate" button** in SleakOps to manually trigger validation
+3. **Monitor the certificate status** in your execution dashboard
+4. **Verify DNS record propagation** using tools like:
+   ```bash
+   dig CNAME _595d6ebeeb98358afc0357d079d068f4.yourdomain.com
+   ```
+   or
+   ```bash
+   nslookup -type=CNAME _595d6ebeeb98358afc0357d079d068f4.yourdomain.com
+   ```
+
+**Expected result**: Certificate status should change to "Issued" or "Valid"
+
+
+
+
+
+**If validation continues to fail, check these common issues:**
+
+1. **Incorrect DNS record format**:
+
+   - Ensure the CNAME record name includes the full validation subdomain
+   - Don't add extra quotes around the value
+   - Use the exact values provided by AWS/SleakOps
+
+2. **DNS propagation delays**:
+
+   - Some DNS providers take longer to propagate changes
+   - Wait up to 1 hour before considering it failed
+   - Check with multiple DNS lookup tools
+
+3. **Multiple validation records**:
+
+   - Remove any old/duplicate validation records
+   - Only keep the current validation record
+
+4. **DNS provider limitations**:
+   - Some providers don't support long CNAME record names
+   - Contact your DNS provider if records aren't saving properly
+
+
+
+
+
+**To avoid certificate validation timeouts:**
+
+1. **Set up DNS records immediately** after certificate creation
+2. **Monitor certificate status** regularly in SleakOps dashboard
+3. **Use automated DNS providers** when possible (Route53, Cloudflare API)
+4. **Set calendar reminders** for certificate renewals
+5. 
**Test DNS changes** before applying them to production domains + +**Best practices:** + +- Complete DNS validation within 24 hours of certificate creation +- Keep DNS management credentials accessible to your team +- Document your DNS validation process for future reference + + + +--- + +_This FAQ was automatically generated on November 7, 2024 based on a real user query._ diff --git a/docs/troubleshooting/superset-bitnami-helm-deployment.mdx b/docs/troubleshooting/superset-bitnami-helm-deployment.mdx new file mode 100644 index 000000000..783c45d05 --- /dev/null +++ b/docs/troubleshooting/superset-bitnami-helm-deployment.mdx @@ -0,0 +1,689 @@ +--- +sidebar_position: 3 +title: "Deploying Apache Superset with Bitnami Helm Chart" +description: "Complete guide for deploying Apache Superset using Bitnami Helm chart with custom configurations and tolerations" +date: "2024-12-20" +category: "workload" +tags: ["superset", "bitnami", "helm", "kubernetes", "deployment", "tolerations"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Deploying Apache Superset with Bitnami Helm Chart + +**Date:** December 20, 2024 +**Category:** Workload +**Tags:** Superset, Bitnami, Helm, Kubernetes, Deployment, Tolerations + +## Problem Description + +**Context:** Users need to deploy Apache Superset using the Bitnami Helm chart in a Kubernetes cluster with specific configurations including database initialization tolerations and custom values. 
+
+**Observed Symptoms:**
+
+- Need to add tolerations to database initialization pods
+- Bitnami chart doesn't support tolerations for init containers via values
+- Requires post-renderer script to modify chart templates
+- Complex configuration management for Superset deployment
+
+**Relevant Configuration:**
+
+- Chart: `bitnami/superset`
+- Namespace: `superset`
+- Custom values file: `values-def.yaml`
+- Post-renderer script: `add-tolerations.sh`
+
+**Error Conditions:**
+
+- Database initialization pods may fail to schedule without proper tolerations
+- Standard Helm values don't provide sufficient customization options
+- Manual chart modification required for specific use cases
+
+## Detailed Solution
+
+
+
+First, add and update the Bitnami Helm repository:
+
+```bash
+# Add Bitnami repository
+helm repo add bitnami https://charts.bitnami.com/bitnami
+
+# Update repositories to get latest charts
+helm repo update
+
+# Verify repository is added
+helm repo list
+```
+
+
+
+
+
+Create a post-renderer script that adds tolerations to the database initialization pods. Helm pipes the fully rendered manifests to this script on stdin and deploys whatever it writes to stdout, so the script must emit every manifest, modifying only the ones it targets. Make it executable once with `chmod +x add-tolerations.sh`:
+
+```bash
+#!/bin/bash
+# File: add-tolerations.sh
+# Reads rendered manifests from stdin, adds tolerations to the
+# postgresql-init Jobs, and passes all other documents through unchanged.
+yq eval '(select(.kind == "Job" and (.metadata.name | contains("postgresql-init")))
+  | .spec.template.spec.tolerations) = [
+  {
+    "key": "node-role",
+    "operator": "Equal",
+    "value": "spot",
+    "effect": "NoSchedule"
+  },
+  {
+    "key": "kubernetes.io/arch",
+    "operator": "Equal",
+    "value": "arm64",
+    "effect": "NoSchedule"
+  }
+]' -
+```
+
+This script:
+
+- Adds tolerations for spot instances and ARM64 architecture
+- Only modifies Jobs with names containing "postgresql-init"; every other manifest passes through untouched
+- Uses `yq` (mikefarah v4) to modify YAML on the fly
+
+
+
+
+
+Create a comprehensive values file for Superset deployment:
+
+```yaml
+# File: values-def.yaml
+
+# Global configurations
+global:
+  postgresql:
+    auth:
+      existingSecret: 
"" + secretKeys: + adminPasswordKey: "" + userPasswordKey: "" + +# Superset configuration +superset: + # Image configuration + image: + repository: bitnami/superset + tag: latest + pullPolicy: IfNotPresent + + # Superset admin configuration + admin: + user: admin + password: "your-secure-password" + firstName: Admin + lastName: User + email: admin@yourdomain.com + + # Application configuration + config: + # Secret key for session encryption + secretKey: "your-superset-secret-key-here" + + # Database configuration + databaseUri: "postgresql://superset:password@superset-postgresql:5432/superset" + + # Redis configuration for caching + redisUri: "redis://superset-redis-master:6379/0" + + # Resource allocation + resources: + limits: + cpu: 2000m + memory: 4Gi + requests: + cpu: 1000m + memory: 2Gi + + # Scaling configuration + replicaCount: 2 + + # Node selection and tolerations + nodeSelector: + kubernetes.io/arch: arm64 + + tolerations: + - key: "node-role" + operator: "Equal" + value: "spot" + effect: "NoSchedule" + - key: "kubernetes.io/arch" + operator: "Equal" + value: "arm64" + effect: "NoSchedule" + +# PostgreSQL database configuration +postgresql: + enabled: true + + # Database settings + auth: + database: superset + username: superset + password: "secure-database-password" + + # Primary configuration + primary: + # Resources for primary database + resources: + limits: + cpu: 1000m + memory: 2Gi + requests: + cpu: 500m + memory: 1Gi + + # Node selection for database + nodeSelector: + kubernetes.io/arch: arm64 + + # Tolerations for database pods + tolerations: + - key: "node-role" + operator: "Equal" + value: "spot" + effect: "NoSchedule" + - key: "kubernetes.io/arch" + operator: "Equal" + value: "arm64" + effect: "NoSchedule" + + # Persistence configuration + persistence: + enabled: true + size: 20Gi + storageClass: "gp3" + + # Initialization job configuration + # Note: This requires the post-renderer script to add tolerations + initdb: + scripts: + 
01_init.sql: | + CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; + GRANT ALL PRIVILEGES ON DATABASE superset TO superset; + +# Redis configuration for caching +redis: + enabled: true + + # Architecture configuration + architecture: standalone + + # Master configuration + master: + resources: + limits: + cpu: 500m + memory: 1Gi + requests: + cpu: 250m + memory: 512Mi + + # Node selection + nodeSelector: + kubernetes.io/arch: arm64 + + # Tolerations + tolerations: + - key: "node-role" + operator: "Equal" + value: "spot" + effect: "NoSchedule" + - key: "kubernetes.io/arch" + operator: "Equal" + value: "arm64" + effect: "NoSchedule" + + # Persistence + persistence: + enabled: true + size: 8Gi + storageClass: "gp3" + +# Service configuration +service: + type: ClusterIP + port: 8088 + targetPort: 8088 + +# Ingress configuration +ingress: + enabled: true + className: "nginx" + annotations: + nginx.ingress.kubernetes.io/proxy-body-size: "50m" + nginx.ingress.kubernetes.io/proxy-read-timeout: "300" + nginx.ingress.kubernetes.io/proxy-send-timeout: "300" + hosts: + - host: superset.yourdomain.com + paths: + - path: / + pathType: Prefix + tls: + - secretName: superset-tls + hosts: + - superset.yourdomain.com +``` + + + + + +Execute the deployment with the following commands: + +```bash +# 1. Create namespace +kubectl create namespace superset + +# 2. Install Superset with post-renderer script +helm install superset bitnami/superset \ + --namespace superset \ + --values values-def.yaml \ + --post-renderer ./add-tolerations.sh \ + --timeout 15m \ + --wait + +# 3. Verify deployment status +kubectl get pods -n superset + +# 4. Check service status +kubectl get svc -n superset + +# 5. Monitor deployment progress +kubectl logs -f deployment/superset -n superset + +# 6. 
Get initial admin credentials (if not set in values)
+kubectl get secret superset -n superset -o jsonpath="{.data.admin-password}" | base64 --decode
+```
+
+**Alternative deployment using Helm upgrade:**
+
+```bash
+# For updating existing deployment
+helm upgrade superset bitnami/superset \
+  --namespace superset \
+  --values values-def.yaml \
+  --post-renderer ./add-tolerations.sh \
+  --timeout 15m \
+  --wait
+```
+
+
+
+
+
+After successful deployment, perform these verification steps:
+
+**1. Verify all pods are running:**
+
+```bash
+# Check pod status
+kubectl get pods -n superset
+
+# Expected output should show all pods in Running state:
+# superset-xxx                1/1     Running   0          5m
+# superset-postgresql-0       1/1     Running   0          5m
+# superset-redis-master-0     1/1     Running   0          5m
+```
+
+**2. Access Superset web interface:**
+
+```bash
+# Port forward to access locally (for testing)
+kubectl port-forward svc/superset 8088:8088 -n superset
+
+# Access via browser: http://localhost:8088
+```
+
+**3. Configure SSL certificate (if using ingress):**
+
+```bash
+# Create TLS secret for HTTPS
+kubectl create secret tls superset-tls \
+  --cert=path/to/your/cert.crt \
+  --key=path/to/your/cert.key \
+  -n superset
+
+# Or use cert-manager for automatic certificate generation
+# (example manifest; adjust the issuer name and domain to your setup)
+kubectl apply -f - <<EOF
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: superset-certificate
+  namespace: superset
+spec:
+  secretName: superset-tls
+  issuerRef:
+    name: letsencrypt-prod
+    kind: ClusterIssuer
+  dnsNames:
+    - superset.yourdomain.com
+EOF
+```
+
+
+
+
+
+**1. Pods stuck in Pending state:**
+
+```bash
+# Check pod events for scheduling issues
+kubectl describe pod -n superset superset-xxx
+
+# Common issues:
+# - Node selector not matching any nodes
+# - Tolerations not matching node taints
+# - Resource requests exceeding available capacity
+```
+
+**Solution:**
+
+- Verify your node labels match the nodeSelector
+- Ensure tolerations match existing node taints
+- Check resource availability on target nodes
+
+**2. 
Database connection errors:** + +```bash +# Check PostgreSQL pod logs +kubectl logs -n superset superset-postgresql-0 + +# Check Superset logs for database errors +kubectl logs -n superset deployment/superset | grep -i database +``` + +**Solution:** + +- Verify database credentials in values.yaml +- Check network policies allowing communication +- Ensure PostgreSQL is fully initialized before Superset starts + +**3. Init container tolerations not applied:** + +```bash +# Check if post-renderer script is working +helm template superset bitnami/superset \ + --values values-def.yaml \ + --post-renderer ./add-tolerations.sh | \ + grep -A 10 -B 5 tolerations +``` + +**Solution:** + +- Verify post-renderer script has execute permissions +- Check yq is installed and available +- Test script independently with sample YAML + +**4. Resource limitations:** + +```bash +# Check resource usage +kubectl top pods -n superset + +# Check resource quotas +kubectl describe quota -n superset +``` + +**Solution:** + +- Adjust resource requests/limits in values.yaml +- Increase namespace resource quotas if needed +- Consider using HPA for auto-scaling + + + + + +**Database Backup:** + +```bash +# Create database backup +kubectl exec -n superset superset-postgresql-0 -- pg_dump -U superset superset > superset-backup-$(date +%Y%m%d).sql + +# Automated backup script +#!/bin/bash +NAMESPACE="superset" +BACKUP_DIR="/backups/superset" +DATE=$(date +%Y%m%d-%H%M%S) + +mkdir -p $BACKUP_DIR +kubectl exec -n $NAMESPACE superset-postgresql-0 -- pg_dump -U superset superset | gzip > $BACKUP_DIR/superset-backup-$DATE.sql.gz + +# Cleanup old backups (keep last 7 days) +find $BACKUP_DIR -name "superset-backup-*.sql.gz" -mtime +7 -delete +``` + +**Configuration Backup:** + +```bash +# Export current Helm values +helm get values superset -n superset > superset-values-backup.yaml + +# Backup custom configurations +kubectl get configmap -n superset -o yaml > superset-configmaps-backup.yaml +kubectl get 
secret -n superset -o yaml > superset-secrets-backup.yaml
+```
+
+**Updating Superset:**
+
+```bash
+# Update Helm repository
+helm repo update
+
+# Check available versions
+helm search repo bitnami/superset --versions
+
+# Upgrade to latest version
+helm upgrade superset bitnami/superset \
+  --namespace superset \
+  --values values-def.yaml \
+  --post-renderer ./add-tolerations.sh \
+  --timeout 15m \
+  --wait
+
+# Rollback if needed
+helm rollback superset 1 -n superset
+```
+
+**Monitoring and Health Checks:**
+
+```bash
+# Set up health monitoring
+# (example manifest: a CronJob that periodically curls Superset's /health endpoint)
+kubectl apply -f - <<EOF
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: superset-healthcheck
+  namespace: superset
+spec:
+  schedule: "*/5 * * * *"
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: Never
+          containers:
+            - name: healthcheck
+              image: curlimages/curl:latest
+              args: ["-fsS", "http://superset.superset.svc.cluster.local:8088/health"]
+EOF
+```
+
+
+
+
+
+**Security Hardening:**
+
+1. **Use strong passwords and secret keys:**
+
+```yaml
+# Generate secure passwords. The $(...) commands are shell substitutions:
+# run them yourself and paste the results, or inject them via `helm --set`.
+# YAML itself does not execute commands.
+superset:
+  admin:
+    password: $(openssl rand -base64 32)
+  config:
+    secretKey: $(openssl rand -hex 32)
+
+postgresql:
+  auth:
+    password: $(openssl rand -base64 32)
+```
+
+2. **Network security:**
+
+```yaml
+# Network policies to restrict traffic
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: superset-network-policy
+  namespace: superset
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/name: superset
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              name: ingress-nginx
+      ports:
+        - protocol: TCP
+          port: 8088
+  egress:
+    - to:
+        - podSelector:
+            matchLabels:
+              app.kubernetes.io/name: postgresql
+      ports:
+        - protocol: TCP
+          port: 5432
+```
+
+**Performance Optimization:**
+
+1. **Resource sizing:**
+
+```yaml
+# Production resource recommendations
+superset:
+  resources:
+    requests:
+      cpu: 2000m
+      memory: 4Gi
+    limits:
+      cpu: 4000m
+      memory: 8Gi
+  replicaCount: 3
+
+postgresql:
+  primary:
+    resources:
+      requests:
+        cpu: 1000m
+        memory: 4Gi
+      limits:
+        cpu: 2000m
+        memory: 8Gi
+```
+
+2. 
**High availability setup:** + +```yaml +# PostgreSQL HA configuration +postgresql: + architecture: replication + readReplicas: + replicaCount: 2 + +# Redis HA configuration +redis: + architecture: replication + replica: + replicaCount: 2 +``` + +**Monitoring and Observability:** + +```yaml +# Enable metrics collection +superset: + metrics: + enabled: true + serviceMonitor: + enabled: true + namespace: monitoring + +# Add logging configuration +superset: + config: + logging: + version: 1 + disable_existing_loggers: false + formatters: + default: + format: '[%(asctime)s] [%(levelname)s] %(message)s' + handlers: + console: + class: logging.StreamHandler + formatter: default + stream: ext://sys.stdout + root: + level: INFO + handlers: [console] +``` + + + +--- + +_This FAQ was automatically generated on December 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/troubleshooting-502-errors-pod-logs.mdx b/docs/troubleshooting/troubleshooting-502-errors-pod-logs.mdx new file mode 100644 index 000000000..404e1cb5c --- /dev/null +++ b/docs/troubleshooting/troubleshooting-502-errors-pod-logs.mdx @@ -0,0 +1,359 @@ +--- +sidebar_position: 3 +title: "502 Error and Pod Logs Not Loading" +description: "Troubleshooting 502 errors when pod logs cannot be accessed through the platform" +date: "2025-01-27" +category: "workload" +tags: ["502-error", "pod-logs", "troubleshooting", "deployment", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# 502 Error and Pod Logs Not Loading + +**Date:** January 27, 2025 +**Category:** Workload +**Tags:** 502 Error, Pod Logs, Troubleshooting, Deployment, Networking + +## Problem Description + +**Context:** User experiences a 502 error when accessing their application deployed on SleakOps, combined with inability to view pod logs through the platform interface. 
+ +**Observed Symptoms:** + +- 502 Bad Gateway error when accessing the application URL +- Pod logs button in the SleakOps interface doesn't open/load logs +- Port forwarding through Lens results in blank screen with network errors +- Docker container runs successfully when tested locally +- Rollback to previous working build doesn't resolve the issue +- Logs from other projects and deployments are accessible + +**Relevant Configuration:** + +- Environment: Development +- Application: Monorepo-based application +- Platform: SleakOps Kubernetes deployment +- Local testing: Docker container works correctly +- Previous state: Application was working with earlier builds + +**Error Conditions:** + +- Error occurs when accessing the application URL +- Pod logs are specifically inaccessible for this deployment +- Port forwarding fails with network errors +- Issue persists after rollback attempts +- Problem is isolated to specific pods/deployment + +## Detailed Solution + + + +When you encounter a 502 error combined with inaccessible pod logs, this typically indicates: + +1. **Pod startup issues**: The pod may be failing to start properly or crashing during initialization +2. **Resource constraints**: Insufficient memory or CPU causing pod termination +3. **Health check failures**: Readiness or liveness probes failing +4. **Network connectivity issues**: Problems with service-to-pod communication +5. **Container runtime issues**: Problems specific to the Kubernetes environment vs local Docker + +The fact that logs are inaccessible suggests the pods may be in a crash loop or failing state. 
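
The exit code recorded for the container's last termination (shown by `kubectl describe pod` under `Last State`) is often the fastest way to tell these causes apart. The helper below is a minimal sketch (its name and messages are illustrative, not a kubectl or SleakOps feature); exit codes above 128 encode 128 plus the fatal signal number:

```shell
#!/usr/bin/env bash
# Interpret a container exit code taken from `kubectl describe pod`.
# Codes above 128 mean the process died from signal (code - 128).
explain_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    case $((code - 128)) in
      9)  echo "SIGKILL - likely OOMKilled; check memory limits" ;;
      11) echo "SIGSEGV - the application crashed (segfault)" ;;
      15) echo "SIGTERM - terminated by eviction, rollout, or failed probe" ;;
      *)  echo "killed by signal $((code - 128))" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit - check restart policy and liveness probes"
  else
    echo "application error (exit $code) - check logs with --previous"
  fi
}

explain_exit_code 137   # SIGKILL - likely OOMKilled; check memory limits
```

For example, a container that runs fine in local Docker but repeatedly exits with code 137 in Kubernetes almost always points at the pod's memory limit rather than the application itself.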
+ + + + + +When the SleakOps interface can't show logs, use kubectl directly: + +```bash +# Get pod status and events +kubectl get pods -n +kubectl describe pod -n + +# Get logs from crashed/restarting pods +kubectl logs -n --previous + +# Get real-time logs +kubectl logs -f -n + +# Check events for the namespace +kubectl get events -n --sort-by='.lastTimestamp' +``` + +Look for: + +- Pod restart counts +- Exit codes in pod description +- Recent events showing errors +- Resource limit exceeded messages + + + + + +Resource issues are common when local Docker works but Kubernetes deployment fails: + +```bash +# Check resource usage +kubectl top pods -n + +# Check resource limits in deployment +kubectl get deployment -n -o yaml | grep -A 10 resources + +# Check node resources +kubectl describe nodes +``` + +**Common solutions:** + +1. **Increase memory limits**: + +```yaml +resources: + limits: + memory: "1Gi" # Increase from default + cpu: "500m" + requests: + memory: "512Mi" + cpu: "250m" +``` + +2. **Check for memory leaks** in your application +3. **Optimize container startup** to reduce resource spikes + + + + + +Incorrect health checks can cause 502 errors: + +```yaml +# Example of proper health check configuration +livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + +readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 +``` + +**Key considerations:** + +- Ensure health check endpoints exist in your application +- Set appropriate `initialDelaySeconds` for app startup time +- Use different endpoints for liveness vs readiness if possible +- Consider disabling health checks temporarily for debugging + + + + + +When Docker works locally but Kubernetes fails, check: + +**1. Environment Variables:** + +```bash +# Compare environment variables +kubectl exec -n -- env +``` + +**2. 
File System Permissions:** + +- Kubernetes runs with different user contexts +- Check if your app writes to specific directories +- Ensure proper file permissions in Dockerfile + +**3. Network Configuration:** + +- Kubernetes networking differs from Docker +- Check if your app binds to `0.0.0.0` not `localhost` +- Verify port configurations match service definitions + +**4. Dependencies and External Services:** + +- Database connections may differ +- External API endpoints might be unreachable +- DNS resolution differences + + + + + +For SleakOps-specific issues: + +**1. Check Build Logs:** + +- Review the build process in SleakOps dashboard +- Look for any warnings or errors during image creation +- Verify all build steps completed successfully + +**2. Deployment Configuration:** + +- Check if deployment configuration changed +- Verify environment variables are properly set +- Ensure secrets and configmaps are accessible + +**3. Service Configuration:** + +- Verify service is properly routing to pods +- Check ingress configuration if applicable +- Ensure load balancer is healthy + +**4. Platform Resources:** + +- Check if cluster has sufficient resources +- Verify no platform-wide issues +- Contact SleakOps support if platform interface is unresponsive + + + + + +If the issue is urgent and needs immediate resolution: + +**1. Force Pod Restart:** + +```bash +kubectl delete pod -n +``` + +**2. Scale Down and Up:** + +```bash +kubectl scale deployment --replicas=0 -n +kubectl scale deployment --replicas=1 -n +``` + +**3. Temporary Resource Increase:** + +- Temporarily increase resource limits in deployment +- Scale down other non-critical services if cluster resources are limited + +```bash +# Quick resource patch +kubectl patch deployment -n -p '{"spec":{"template":{"spec":{"containers":[{"name":"","resources":{"limits":{"memory":"2Gi","cpu":"1000m"}}}]}}}}' +``` + +**4. 
Emergency Rollback:** + +```bash +# Check rollout history +kubectl rollout history deployment/ -n + +# Rollback to previous version +kubectl rollout undo deployment/ -n + +# Rollback to specific revision +kubectl rollout undo deployment/ --to-revision=2 -n +``` + +**5. Alternative Access Methods:** + +If the platform interface is unresponsive: + +```bash +# Port forward to access application directly +kubectl port-forward pod/ 8080:8080 -n + +# Create a temporary service for testing +kubectl expose pod --port=80 --target-port=8080 --name=temp-service -n +``` + + + + + +To prevent similar issues in the future: + +**1. Enhanced Monitoring:** + +```yaml +# Prometheus monitoring for your app +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: app-monitor +spec: + selector: + matchLabels: + app: your-app + endpoints: + - port: metrics + interval: 30s + path: /metrics +``` + +**2. Alerting Configuration:** + +```yaml +# AlertManager rules +groups: + - name: application-alerts + rules: + - alert: PodCrashLooping + expr: rate(kube_pod_container_status_restarts_total[5m]) > 0 + for: 2m + labels: + severity: critical + annotations: + summary: "Pod {{ $labels.pod }} is crash looping" + + - alert: HighPod502Errors + expr: rate(nginx_ingress_controller_requests{status="502"}[5m]) > 0.1 + for: 1m + labels: + severity: warning + annotations: + summary: "High 502 error rate detected" +``` + +**3. Health Check Best Practices:** + +- Implement comprehensive health endpoints +- Test health checks during development +- Monitor health check response times +- Use gradual rollout strategies + +**4. Resource Planning:** + +- Set appropriate resource requests and limits +- Monitor resource usage patterns +- Plan for traffic spikes +- Implement horizontal pod autoscaling + +**5. Testing Strategy:** + +```bash +# Create a testing checklist +echo "Deployment Testing Checklist: +1. Local Docker container test +2. Resource limit verification +3. 
Health check endpoint test +4. Environment variable validation +5. Network connectivity test +6. Load testing with expected traffic +7. Rollback procedure verification" > deployment-checklist.txt +``` + +**6. Documentation:** + +- Document all environment-specific configurations +- Maintain troubleshooting runbooks +- Keep emergency contact information updated +- Document rollback procedures + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx b/docs/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx new file mode 100644 index 000000000..52df262b6 --- /dev/null +++ b/docs/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx @@ -0,0 +1,255 @@ +--- +sidebar_position: 3 +title: "Application Crashes and Memory Issues in Production" +description: "Troubleshooting application crashes, memory issues, and 502 errors in SleakOps environments" +date: "2024-06-09" +category: "workload" +tags: ["memory", "crashes", "502-error", "production", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Application Crashes and Memory Issues in Production + +**Date:** June 9, 2024 +**Category:** Workload +**Tags:** Memory, Crashes, 502 Error, Production, Troubleshooting + +## Problem Description + +**Context:** Applications deployed in SleakOps experiencing crashes and connectivity issues in both development and production environments, with specific memory-related errors causing service disruptions. 
+ +**Observed Symptoms:** + +- Development environment fails to start properly +- Production application crashes after deployment +- 502 Bad Gateway errors when accessing the application +- Memory-related errors in application logs +- Port-forward functionality not working correctly +- DNS resolution issues for some subdomains + +**Relevant Configuration:** + +- Environment: Both development and production +- Application: Monorepo-based web application +- Error location: Line 805 in logs (memory issue) +- Host: app.takenos.com +- Service: web-app with tRPC endpoints + +**Error Conditions:** + +- Crashes occur during deployment to production +- Memory exhaustion at specific code locations +- Intermittent connectivity issues +- Application appears healthy in Kubernetes but returns errors + +## Detailed Solution + + + +When your application crashes due to memory issues, follow these steps: + +1. **Check current memory limits:** + + ```bash + kubectl describe pod -n + ``` + +2. **Monitor memory usage:** + + ```bash + kubectl top pod -n + ``` + +3. **Review application logs for memory errors:** + ```bash + kubectl logs -n --tail=100 + ``` + +Look for errors like: + +- "Out of memory" +- "JavaScript heap out of memory" +- "Cannot allocate memory" + + + + + +To increase memory limits for your application in SleakOps: + +1. **Access your project configuration** +2. **Navigate to the affected workload (web service)** +3. **Go to Advanced Configuration → Resources** +4. **Increase memory limits:** + +```yaml +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2Gi" # Increase this value + cpu: "1000m" +``` + +**Recommended memory increases:** + +- For Node.js applications: Start with 2Gi, increase to 4Gi if needed +- For Java applications: Start with 4Gi, increase to 8Gi if needed +- For Python applications: Start with 1Gi, increase to 2Gi if needed + + + + + +When experiencing 502 errors despite healthy-looking pods: + +1. 
**Verify pod health:** + + ```bash + kubectl get pods -n + kubectl describe pod -n + ``` + +2. **Check service endpoints:** + + ```bash + kubectl get endpoints -n + ``` + +3. **Test pod connectivity directly:** + + ```bash + kubectl port-forward pod/ 8080:8080 -n + curl http://localhost:8080/health + ``` + +4. **Verify ingress configuration:** + ```bash + kubectl describe ingress -n + ``` + +**Common causes:** + +- Application not listening on the correct port +- Health check endpoints failing +- Internal application errors not reflected in pod status +- Ingress misconfiguration + + + + + +To verify DNS and ingress configuration: + +1. **Check DNS resolution:** + + ```bash + dig A your-domain.com + nslookup your-domain.com + ``` + +2. **Verify load balancer status:** + + ```bash + kubectl get svc -n + kubectl describe svc -n + ``` + +3. **Check ingress controller logs:** + + ```bash + kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + ``` + +4. **Test direct load balancer access:** + ```bash + curl -H "Host: your-domain.com" http:// + ``` + + + + + +To prevent crashes during production deployments: + +1. **Implement proper health checks:** + + ```yaml + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + + readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 + ``` + +2. **Use rolling deployment strategy:** + + ```yaml + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 + ``` + +3. **Set appropriate resource limits:** + + - Always set both requests and limits + - Monitor actual usage and adjust accordingly + - Leave headroom for traffic spikes + +4. **Test in staging environment first:** + - Deploy to development/staging before production + - Run load tests to verify memory usage + - Monitor for memory leaks over time + + + + + +If your production application is down: + +1. 
**Immediate rollback:** + + ```bash + kubectl rollout undo deployment/ -n + ``` + +2. **Scale up replicas temporarily:** + + ```bash + kubectl scale deployment --replicas=3 -n + ``` + +3. **Check recent changes:** + + ```bash + kubectl rollout history deployment/ -n + ``` + +4. **Monitor recovery:** + + ```bash + kubectl get pods -n -w + ``` + +5. **Verify application is responding:** + ```bash + curl -I https://your-domain.com + ``` + + + +--- + +_This FAQ was automatically generated on June 9, 2024 based on a real user query._ diff --git a/docs/troubleshooting/troubleshooting-build-node-disk-space.mdx b/docs/troubleshooting/troubleshooting-build-node-disk-space.mdx new file mode 100644 index 000000000..5f5a225af --- /dev/null +++ b/docs/troubleshooting/troubleshooting-build-node-disk-space.mdx @@ -0,0 +1,782 @@ +--- +sidebar_position: 3 +title: "Build Node Disk Space Issues" +description: "Troubleshooting and resolving disk space problems on build nodes" +date: "2024-09-10" +category: "cluster" +tags: ["build", "disk-space", "nodes", "troubleshooting", "storage"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Build Node Disk Space Issues + +**Date:** September 10, 2024 +**Category:** Cluster +**Tags:** Build, Disk Space, Nodes, Troubleshooting, Storage + +## Problem Description + +**Context:** Build processes failing intermittently due to insufficient disk space on Kubernetes nodes, even after increasing node storage from 20GB to 40GB. 
+ +**Observed Symptoms:** + +- Build processes crash intermittently due to disk space issues +- Some builds fail while others succeed immediately after +- Problem persists even after increasing node disk size to 40GB +- Cluster memory usage shows approximately 60% utilization +- Issue occurs sporadically rather than consistently + +**Relevant Configuration:** + +- Node disk size: Previously 20GB, increased to 40GB +- Cluster memory usage: ~60% +- Build system: Docker-based builds on Kubernetes nodes +- Platform: SleakOps managed Kubernetes cluster + +**Error Conditions:** + +- Builds fail when nodes run out of disk space +- Problem occurs during Docker image building process +- Intermittent failures suggest temporary disk space exhaustion +- Issue affects multiple builds but not consistently + +## Detailed Solution + + + +To identify what's consuming disk space on your build nodes: + +1. **SSH into the affected node** (if you have access) +2. **Use ncdu command** to analyze disk usage: + +```bash +# Install ncdu if not available +sudo apt-get update && sudo apt-get install ncdu + +# Analyze disk usage starting from root +sudo ncdu / + +# Focus on common problem areas +sudo ncdu /var/lib/docker +sudo ncdu /var/log +sudo ncdu /tmp +``` + +3. **Check Docker-specific usage**: + +```bash +# Check Docker system usage +docker system df + +# List all containers and their sizes +docker ps -a --size + +# List all images and their sizes +docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" +``` + + + + + +Typical causes of disk space exhaustion on build nodes: + +1. **Docker layer accumulation**: + + - Unused Docker images consuming space + - Intermediate build layers not being cleaned up + - Docker build cache growing over time + +2. **Log file accumulation**: + + - Container logs in `/var/log/containers/` + - System logs in `/var/log/` + - Build logs not being rotated + +3. 
**Temporary files**: + + - Build artifacts in `/tmp` + - Package manager caches + - Application temporary files + +4. **Large Docker images**: + - Base images that are unnecessarily large + - Multi-stage builds not properly optimized + + + + + +To free up disk space immediately: + +```bash +# Clean up Docker system (removes unused containers, networks, images) +docker system prune -af + +# Remove unused volumes +docker volume prune -f + +# Clean up build cache +docker builder prune -af + +# Clean up logs (be careful with this command) +sudo journalctl --vacuum-time=7d +sudo find /var/log -name "*.log" -type f -mtime +7 -delete + +# Clean up temporary files +sudo rm -rf /tmp/* +sudo apt-get clean +``` + +**Note**: Always backup important data before running cleanup commands. + + + + + +To prevent future disk space issues, optimize your Docker builds: + +1. **Use multi-stage builds**: + +```dockerfile +# Multi-stage build example +FROM node:16 AS builder +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production + +FROM node:16-alpine AS runtime +WORKDIR /app +COPY --from=builder /app/node_modules ./node_modules +COPY . . +CMD ["npm", "start"] +``` + +2. **Use .dockerignore file**: + +```dockerignore +node_modules +.git +.gitignore +README.md +.env +.nyc_output +coverage +.cache +``` + +3. **Clean up in the same RUN command**: + +```dockerfile +RUN apt-get update && \ + apt-get install -y package-name && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* +``` + + + + + +Set up monitoring to prevent future issues: + +1. **Monitor disk usage**: + +```bash +# Check current disk usage +df -h + +# Monitor disk usage over time +watch -n 5 'df -h' + +# Set up alerts for disk usage > 80% +``` + +2. **Implement automated cleanup**: + +```bash +# Create a cleanup script +#!/bin/bash +# cleanup-docker.sh + +echo "Starting Docker cleanup..." 
+docker system prune -f +docker volume prune -f +echo "Docker cleanup completed" + +# Add to crontab to run daily +# 0 2 * * * /path/to/cleanup-docker.sh +``` + +3. **Configure log rotation**: + +```json +{ + "log-driver": "json-file", + "log-opts": { + "max-size": "50m", + "max-file": "3" + } +} +``` + +Add this to your Docker daemon configuration at `/etc/docker/daemon.json`. + + + + + +Set up monitoring to prevent future disk space issues: + +**1. Kubernetes resource monitoring:** + +```yaml +# disk-space-monitor.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: disk-monitor-script +data: + monitor.sh: | + #!/bin/bash + THRESHOLD=80 + USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//') + + if [ $USAGE -gt $THRESHOLD ]; then + echo "ALERT: Disk usage is ${USAGE}% - exceeds threshold of ${THRESHOLD}%" + # Trigger cleanup + docker system prune -f + docker volume prune -f + + # Log to stdout for monitoring + echo "$(date): Emergency cleanup performed due to disk usage: ${USAGE}%" + else + echo "$(date): Disk usage normal: ${USAGE}%" + fi + +--- +apiVersion: batch/v1 +kind: CronJob +metadata: + name: disk-space-monitor +spec: + schedule: "*/15 * * * *" # Every 15 minutes + jobTemplate: + spec: + template: + spec: + containers: + - name: monitor + image: alpine:latest + command: ["/bin/sh"] + args: ["/scripts/monitor.sh"] + volumeMounts: + - name: docker-socket + mountPath: /var/run/docker.sock + - name: host-root + mountPath: /host + readOnly: true + - name: script + mountPath: /scripts + volumes: + - name: docker-socket + hostPath: + path: /var/run/docker.sock + - name: host-root + hostPath: + path: / + - name: script + configMap: + name: disk-monitor-script + defaultMode: 0755 + restartPolicy: OnFailure +``` + +**2. 
Prometheus monitoring rules:** + +```yaml +# prometheus-rules.yaml +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: disk-space-alerts +spec: + groups: + - name: disk-space + rules: + - alert: NodeDiskSpaceHigh + expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85 + for: 5m + labels: + severity: warning + annotations: + summary: "Node {{ $labels.instance }} disk space usage is above 85%" + description: "Disk usage on {{ $labels.instance }} is {{ $value }}%" + + - alert: NodeDiskSpaceCritical + expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 95 + for: 2m + labels: + severity: critical + annotations: + summary: "Node {{ $labels.instance }} disk space critically low" + description: "Only {{ $value }}% disk space remaining on {{ $labels.instance }}" +``` + +**3. Custom monitoring dashboard:** + +```bash +#!/bin/bash +# disk-usage-dashboard.sh + +# Create simple monitoring script +cat > /usr/local/bin/disk-monitor.sh << 'EOF' +#!/bin/bash + +# Colors for output +RED='\033[0;31m' +YELLOW='\033[1;33m' +GREEN='\033[0;32m' +NC='\033[0m' # No Color + +# Function to check disk usage +check_disk_usage() { + local path=$1 + local threshold=${2:-80} + local usage=$(df "$path" | awk 'NR==2 {print $5}' | sed 's/%//') + + if [ $usage -gt 90 ]; then + echo -e "${RED}CRITICAL: $path usage: ${usage}%${NC}" + return 2 + elif [ $usage -gt $threshold ]; then + echo -e "${YELLOW}WARNING: $path usage: ${usage}%${NC}" + return 1 + else + echo -e "${GREEN}OK: $path usage: ${usage}%${NC}" + return 0 + fi +} + +# Function to check Docker space +check_docker_space() { + echo "=== Docker Space Usage ===" + docker system df + echo "" + + echo "=== Docker Image Sizes ===" + docker images --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}" | head -10 + echo "" + + echo "=== Docker Container Sizes ===" + docker ps -s --format "table 
{{.Names}}\t{{.Size}}" | head -10 +} + +# Main monitoring +echo "=== Disk Space Monitor - $(date) ===" +check_disk_usage "/" +check_disk_usage "/var/lib/docker" 85 +check_disk_usage "/tmp" 75 + +echo "" +check_docker_space + +echo "" +echo "=== Cleanup Recommendations ===" +# Check for large log files +find /var/log -type f -size +100M 2>/dev/null | head -5 | while read file; do + echo "Large log file: $file ($(du -h "$file" | cut -f1))" +done + +# Check for old docker images +IMAGES_TO_CLEAN=$(docker images -f "dangling=true" -q | wc -l) +if [ $IMAGES_TO_CLEAN -gt 0 ]; then + echo "Found $IMAGES_TO_CLEAN dangling images that can be cleaned" +fi + +CONTAINERS_TO_CLEAN=$(docker ps -aq -f status=exited | wc -l) +if [ $CONTAINERS_TO_CLEAN -gt 0 ]; then + echo "Found $CONTAINERS_TO_CLEAN exited containers that can be cleaned" +fi +EOF + +chmod +x /usr/local/bin/disk-monitor.sh + +# Add to crontab for regular monitoring +(crontab -l 2>/dev/null; echo "*/30 * * * * /usr/local/bin/disk-monitor.sh >> /var/log/disk-monitor.log 2>&1") | crontab - +``` + + + + + +Implement automated cleanup to prevent disk space issues: + +**1. Docker cleanup automation:** + +```bash +#!/bin/bash +# advanced-docker-cleanup.sh + +# Configuration +MAX_DISK_USAGE=85 +DOCKER_ROOT="/var/lib/docker" +LOG_FILE="/var/log/docker-cleanup.log" + +log() { + echo "$(date '+%Y-%m-%d %H:%M:%S') $1" | tee -a "$LOG_FILE" +} + +# Function to get disk usage percentage +get_disk_usage() { + df "$1" | awk 'NR==2 {print $5}' | sed 's/%//' +} + +# Function to cleanup Docker resources +cleanup_docker() { + local level=$1 + + case $level in + "light") + log "Performing light cleanup..." + docker container prune -f + docker image prune -f + ;; + "moderate") + log "Performing moderate cleanup..." + docker container prune -f + docker image prune -f -a + docker volume prune -f + ;; + "aggressive") + log "Performing aggressive cleanup..." 
+ docker system prune -a -f --volumes + ;; + esac +} + +# Function to cleanup build cache +cleanup_build_cache() { + log "Cleaning build cache..." + docker builder prune -f -a + + # Clean buildkit cache if available + if command -v buildctl >/dev/null 2>&1; then + buildctl prune + fi +} + +# Function to cleanup old logs +cleanup_logs() { + log "Cleaning up old logs..." + + # Clean Docker container logs + find /var/lib/docker/containers -name "*.log" -type f -mtime +7 -delete 2>/dev/null + + # Clean system logs + journalctl --vacuum-time=7d + + # Clean application logs + find /var/log -name "*.log" -type f -mtime +30 -delete 2>/dev/null + find /tmp -name "*.log" -type f -mtime +7 -delete 2>/dev/null +} + +# Main cleanup logic +main() { + log "Starting automated cleanup..." + + CURRENT_USAGE=$(get_disk_usage "$DOCKER_ROOT") + log "Current disk usage: ${CURRENT_USAGE}%" + + if [ "$CURRENT_USAGE" -gt 95 ]; then + log "CRITICAL: Disk usage above 95%, performing aggressive cleanup" + cleanup_docker "aggressive" + cleanup_build_cache + cleanup_logs + + # Stop non-essential containers if still critical + NEW_USAGE=$(get_disk_usage "$DOCKER_ROOT") + if [ "$NEW_USAGE" -gt 90 ]; then + log "Still critical, stopping non-essential containers..." + docker ps --format "{{.Names}}" | grep -E "(test|dev|temp)" | xargs -r docker stop + fi + + elif [ "$CURRENT_USAGE" -gt "$MAX_DISK_USAGE" ]; then + log "WARNING: Disk usage above ${MAX_DISK_USAGE}%, performing moderate cleanup" + cleanup_docker "moderate" + cleanup_build_cache + + elif [ "$CURRENT_USAGE" -gt 70 ]; then + log "INFO: Disk usage above 70%, performing light cleanup" + cleanup_docker "light" + else + log "INFO: Disk usage normal (${CURRENT_USAGE}%), no cleanup needed" + fi + + FINAL_USAGE=$(get_disk_usage "$DOCKER_ROOT") + log "Cleanup completed. 
Final disk usage: ${FINAL_USAGE}%" + + # Send alert if still high + if [ "$FINAL_USAGE" -gt "$MAX_DISK_USAGE" ]; then + log "ALERT: Disk usage still high after cleanup: ${FINAL_USAGE}%" + # Send notification (implement your notification system here) + # send_alert "Disk cleanup insufficient" "Usage: ${FINAL_USAGE}%" + fi +} + +# Execute main function +main "$@" +``` + +**2. Kubernetes-based cleanup job:** + +```yaml +# automated-cleanup-job.yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: node-cleanup +spec: + schedule: "0 */6 * * *" # Every 6 hours + jobTemplate: + spec: + template: + spec: + hostNetwork: true + hostPID: true + nodeSelector: + node-role.kubernetes.io/worker: "true" + containers: + - name: cleanup + image: alpine/k8s:1.24.0 + securityContext: + privileged: true + command: ["/bin/sh"] + args: + - -c + - | + set -e + + # Mount host filesystem + chroot /host /bin/bash -c ' + # Docker cleanup + docker container prune -f + docker image prune -f + docker volume prune -f + + # Log cleanup + find /var/log -name "*.log" -mtime +7 -delete 2>/dev/null || true + find /tmp -name "*.tmp" -mtime +1 -delete 2>/dev/null || true + + # Package cache cleanup + apt-get clean 2>/dev/null || true + yum clean all 2>/dev/null || true + + echo "Cleanup completed on $(hostname)" + ' + volumeMounts: + - name: host + mountPath: /host + - name: docker-socket + mountPath: /var/run/docker.sock + volumes: + - name: host + hostPath: + path: / + - name: docker-socket + hostPath: + path: /var/run/docker.sock + restartPolicy: OnFailure +``` + + + + + +When disk space becomes critically low: + +**1. Immediate emergency cleanup:** + +```bash +#!/bin/bash +# emergency-cleanup.sh + +echo "EMERGENCY: Performing immediate cleanup..." 
+ +# Stop all non-critical containers +docker ps --format "{{.Names}}" | grep -vE "(essential|critical|monitoring)" | xargs -r docker stop + +# Remove all stopped containers +docker container prune -f + +# Remove all unused images +docker image prune -a -f + +# Remove all unused volumes +docker volume prune -f + +# Remove all build cache +docker builder prune -a -f + +# Clean system +rm -rf /tmp/* +find /var/log -name "*.log" -exec truncate -s 0 {} \; + +# Report final status +df -h / +echo "Emergency cleanup completed" +``` + +**2. Disk space recovery procedures:** + +```bash +#!/bin/bash +# disk-recovery.sh + +# Function to find largest directories +find_large_dirs() { + echo "=== Largest directories ===" + du -ah / 2>/dev/null | sort -rh | head -20 +} + +# Function to find largest files +find_large_files() { + echo "=== Largest files ===" + find / -type f -size +1G 2>/dev/null | xargs -r ls -lh | sort -k5 -rh +} + +# Function to analyze Docker usage +analyze_docker() { + echo "=== Docker analysis ===" + echo "Images:" + docker images --format "table {{.Repository}}:{{.Tag}}\t{{.Size}}\t{{.CreatedAt}}" + + echo -e "\nContainers:" + docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Size}}" + + echo -e "\nVolumes:" + docker volume ls --format "table {{.Name}}\t{{.Driver}}" + + echo -e "\nSystem usage:" + docker system df -v +} + +# Function to safe cleanup with confirmation +safe_cleanup() { + echo "Preparing cleanup recommendations..." 
+ + # Identify cleanup targets + DANGLING_IMAGES=$(docker images -q -f dangling=true | wc -l) + EXITED_CONTAINERS=$(docker ps -aq -f status=exited | wc -l) + UNUSED_VOLUMES=$(docker volume ls -q -f dangling=true | wc -l) + + echo "Found:" + echo "- $DANGLING_IMAGES dangling images" + echo "- $EXITED_CONTAINERS exited containers" + echo "- $UNUSED_VOLUMES unused volumes" + + echo "Safe cleanup commands:" + echo "docker container prune -f" + echo "docker image prune -f" + echo "docker volume prune -f" + echo "docker builder prune -f" +} + +# Execute analysis +find_large_dirs +echo "" +find_large_files +echo "" +analyze_docker +echo "" +safe_cleanup +``` + + + + + +**1. Node sizing recommendations:** + +```yaml +# Recommended node configuration for builds +apiVersion: v1 +kind: Node +metadata: + name: build-node +spec: + # Minimum recommended storage for build nodes + capacity: + storage: 100Gi # Increased from 40GB + ephemeral-storage: 50Gi + + # Node taints for build workloads + taints: + - key: "workload-type" + value: "build" + effect: "NoSchedule" +``` + +**2. Build optimization strategies:** + +```dockerfile +# Multi-stage build to reduce image size +FROM node:16-alpine AS builder +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production + +FROM node:16-alpine AS runtime +WORKDIR /app +COPY --from=builder /app/node_modules ./node_modules +COPY . . +RUN npm run build && npm prune --production + +# Use .dockerignore to exclude unnecessary files +# .dockerignore +node_modules +.git +.gitignore +README.md +Dockerfile +.dockerignore +npm-debug.log +coverage/ +.nyc_output +``` + +**3. 
Storage optimization:** + +```yaml +# Use node affinity for build workloads +apiVersion: apps/v1 +kind: Deployment +metadata: + name: build-service +spec: + template: + spec: + nodeSelector: + node-type: "build-optimized" + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: storage-tier + operator: In + values: ["ssd", "high-performance"] + containers: + - name: builder + resources: + requests: + ephemeral-storage: "10Gi" + limits: + ephemeral-storage: "20Gi" +``` + + + +--- + +_This FAQ was automatically generated on September 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/troubleshooting-high-cpu-usage.mdx b/docs/troubleshooting/troubleshooting-high-cpu-usage.mdx new file mode 100644 index 000000000..d19f939bc --- /dev/null +++ b/docs/troubleshooting/troubleshooting-high-cpu-usage.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 3 +title: "High CPU Usage Investigation and Troubleshooting" +description: "Guide to investigate and resolve high CPU usage spikes in containerized applications" +date: "2025-01-27" +category: "workload" +tags: ["cpu", "performance", "monitoring", "troubleshooting", "containers"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# High CPU Usage Investigation and Troubleshooting + +**Date:** January 27, 2025 +**Category:** Workload +**Tags:** CPU, Performance, Monitoring, Troubleshooting, Containers + +## Problem Description + +**Context:** Production containerized applications experiencing unexpected CPU usage spikes that may impact performance and user experience. 
+
+**Observed Symptoms:**
+
+- Sudden increase in CPU utilization
+- Application performance degradation
+- Potential service slowdowns or timeouts
+- Resource consumption alerts triggered
+
+**Relevant Configuration:**
+
+- Container runtime: Docker/containerd
+- Orchestration: Kubernetes
+- Monitoring: Available through SleakOps dashboard
+- Application type: Production workload
+
+**Error Conditions:**
+
+- CPU usage spikes above normal baseline
+- Performance impact on application responsiveness
+- Potential resource exhaustion scenarios
+
+## Detailed Solution
+
+
+
+First, identify the scope and severity of the CPU spike:
+
+1. **Check current CPU metrics** in the SleakOps dashboard
+2. **Identify affected containers/pods**
+3. **Determine timeline** of when the spike started
+4. **Compare with historical baselines**
+
+```bash
+# Check current CPU usage for specific container
+kubectl top pods -n <namespace>
+
+# Get detailed resource usage
+kubectl describe pod <pod-name> -n <namespace>
+```
+
+
+
+
+
+**1. Application-Level Investigation:**
+
+- Check application logs for errors or unusual activity
+- Review recent deployments or configuration changes
+- Identify any new features or code changes
+
+```bash
+# Check application logs
+kubectl logs <pod-name> -n <namespace> --tail=100
+
+# Check for recent events
+kubectl get events -n <namespace> --sort-by='.lastTimestamp'
+```
+
+**2. System-Level Analysis:**
+
+- Monitor memory usage (high memory can cause CPU spikes)
+- Check disk I/O patterns
+- Review network traffic patterns
+
+**3. External Factors:**
+
+- Increased user traffic or load
+- Database performance issues
+- Third-party service dependencies
+
+
+
+
+
+**Access Performance Metrics:**
+
+1. Navigate to **SleakOps Dashboard**
+2. Go to **Monitoring** → **Workloads**
+3. Select your specific container/service
+4. 
Review **CPU Usage** graphs over different time periods
+
+**Key Metrics to Monitor:**
+
+- CPU utilization percentage
+- Memory usage patterns
+- Request/response times
+- Error rates
+- Network I/O
+
+**Set Up Alerts:**
+
+Configure alerts for future incidents:
+
+- CPU usage > 80% for 5+ minutes
+- Memory usage > 85%
+- Response time > acceptable threshold
+
+
+
+
+
+**1. Scale Resources (Quick Fix):**
+
+```yaml
+# Increase CPU limits temporarily
+resources:
+  limits:
+    cpu: "2000m" # Increase from current limit
+    memory: "4Gi"
+  requests:
+    cpu: "1000m"
+    memory: "2Gi"
+```
+
+**2. Horizontal Scaling:**
+
+```bash
+# Scale up replicas temporarily
+kubectl scale deployment <deployment-name> --replicas=5 -n <namespace>
+```
+
+**3. Load Balancing:**
+
+- Ensure traffic is distributed evenly across instances
+- Check if any single instance is receiving disproportionate load
+
+
+
+
+
+**1. Code Optimization:**
+
+- Profile application code to identify CPU-intensive operations
+- Optimize database queries
+- Implement caching strategies
+- Review algorithm efficiency
+
+**2. Resource Management:**
+
+```yaml
+# Implement proper resource requests and limits
+resources:
+  requests:
+    cpu: "500m"
+    memory: "1Gi"
+  limits:
+    cpu: "1000m"
+    memory: "2Gi"
+```
+
+**3. Auto-scaling Configuration:**
+
+```yaml
+# Configure Horizontal Pod Autoscaler
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: app-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: your-app
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+```
+
+
+
+
+
+**1. Monitoring Setup:**
+
+- Implement comprehensive monitoring for all critical metrics
+- Set up proactive alerts before issues become critical
+- Regular performance baseline reviews
+
+**2. 
Load Testing:** + +- Conduct regular load testing to identify performance bottlenecks +- Test with realistic traffic patterns +- Validate auto-scaling behavior + +**3. Resource Planning:** + +- Right-size containers based on actual usage patterns +- Plan for traffic growth and seasonal variations +- Regular capacity planning reviews + +**4. Code Review Practices:** + +- Include performance considerations in code reviews +- Monitor performance impact of new deployments +- Implement gradual rollouts for major changes + + + +--- + +_This FAQ was automatically generated on January 27, 2025 based on a real user query._ diff --git a/docs/troubleshooting/troubleshooting-web-service-dns-issues.mdx b/docs/troubleshooting/troubleshooting-web-service-dns-issues.mdx new file mode 100644 index 000000000..2480a3b6a --- /dev/null +++ b/docs/troubleshooting/troubleshooting-web-service-dns-issues.mdx @@ -0,0 +1,193 @@ +--- +sidebar_position: 3 +title: "Web Service DNS Configuration Issues" +description: "Troubleshooting DNS delegation and web service connectivity problems" +date: "2024-01-15" +category: "project" +tags: + ["dns", "web-service", "deployment", "troubleshooting", "domain-delegation"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web Service DNS Configuration Issues + +**Date:** January 15, 2024 +**Category:** Project +**Tags:** DNS, Web Service, Deployment, Troubleshooting, Domain Delegation + +## Problem Description + +**Context:** User creates a web service in SleakOps with successful build and deployment, but the service is not accessible via its domain name due to DNS configuration issues. 
+ +**Observed Symptoms:** + +- Web service returns 404 error when accessed via domain +- DNS propagation tools show no DNS records for the subdomain +- Build and deployment processes complete successfully +- Service appears healthy in Kubernetes cluster + +**Relevant Configuration:** + +- Service type: Web service +- Domain: Custom subdomain (e.g., `subdomain.domain.com`) +- Deployment status: Successful +- Build status: Successful + +**Error Conditions:** + +- DNS records not propagating globally +- Missing DNS delegation for subdomain +- Service accessible via port-forward but not via domain +- 404 errors when accessing via public URL + +## Detailed Solution + + + +Before troubleshooting DNS issues, confirm your service is running correctly: + +1. **Check pod status in Lens or kubectl:** + + - Pods should show as green/running + - Container health checks should be passing + - If pods are yellow, check logs for errors + +2. **Test local connectivity:** + + ```bash + # Port-forward to test service locally + kubectl port-forward service/your-service-name 8080:80 + # Then test: curl http://localhost:8080 + ``` + +3. **Verify health check configuration:** + - Ensure health check path is correct + - Test both HTTP and HTTPS endpoints + - Check for trailing slash requirements + + + + + +Use DNS tools to diagnose delegation issues: + +1. **Check DNS delegation with dig:** + + ```bash + # Check if subdomain is properly delegated + dig NS subdomain.yourdomain.com + + # Should return AWS Route53 nameservers like: + # subdomain.yourdomain.com. 300 IN NS ns-xxx.awsdns-xx.com. + # subdomain.yourdomain.com. 300 IN NS ns-xxx.awsdns-xx.co.uk. + ``` + +2. **Check A record resolution:** + + ```bash + # Verify A record points to correct IP + dig A subdomain.yourdomain.com + + # Should return load balancer IP + ``` + +3. 
**Use online DNS propagation tools:** + - Check https://www.whatsmydns.net/ + - Verify global DNS propagation + - Look for inconsistencies across regions + + + + + +If DNS delegation is missing or incorrect: + +1. **Identify required DNS records:** + + - Get the Route53 hosted zone nameservers from AWS Console + - Or check SleakOps dashboard for DNS configuration + +2. **Add NS records to parent domain:** + + ``` + # Add these NS records to your main domain DNS: + subdomain.yourdomain.com. IN NS ns-xxx.awsdns-xx.com. + subdomain.yourdomain.com. IN NS ns-xxx.awsdns-xx.co.uk. + subdomain.yourdomain.com. IN NS ns-xxx.awsdns-xx.net. + subdomain.yourdomain.com. IN NS ns-xxx.awsdns-xx.org. + ``` + +3. **Wait for propagation:** + - DNS changes can take 24-48 hours to fully propagate + - Use `dig` to monitor propagation progress + + + + + +**Test different URL variations:** + +```bash +# Test with and without trailing slash +curl http://subdomain.yourdomain.com/ +curl http://subdomain.yourdomain.com + +# Test both HTTP and HTTPS +curl https://subdomain.yourdomain.com/ +curl http://subdomain.yourdomain.com/ +``` + +**Check ingress configuration:** + +```bash +# Verify ingress is properly configured +kubectl get ingress -n your-namespace +kubectl describe ingress your-ingress-name -n your-namespace +``` + +**Monitor DNS propagation:** + +```bash +# Check multiple DNS servers +nslookup subdomain.yourdomain.com 8.8.8.8 +nslookup subdomain.yourdomain.com 1.1.1.1 +``` + + + + + +**Before creating web services:** + +1. **Verify domain delegation:** + + - Ensure parent domain is properly configured + - Test DNS resolution for existing subdomains + +2. **Plan subdomain structure:** + + - Use consistent naming conventions + - Consider wildcard delegation for multiple services + +3. **Monitor deployment process:** + - Check both build AND DNS configuration steps + - Verify ingress controller is running + +**After deployment:** + +1. 
**Wait for DNS propagation:** + + - Allow 15-30 minutes for initial propagation + - Use multiple DNS checking tools + +2. **Test systematically:** + - Start with port-forward testing + - Then test via load balancer IP + - Finally test via domain name + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/understanding-aws-eks-costs.mdx b/docs/troubleshooting/understanding-aws-eks-costs.mdx new file mode 100644 index 000000000..bbf380113 --- /dev/null +++ b/docs/troubleshooting/understanding-aws-eks-costs.mdx @@ -0,0 +1,166 @@ +--- +sidebar_position: 3 +title: "Understanding AWS EKS Costs in SleakOps" +description: "Breakdown of AWS EKS cluster costs including control plane and compute resources" +date: "2025-01-15" +category: "cluster" +tags: ["eks", "aws", "costs", "billing", "nodegroup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Understanding AWS EKS Costs in SleakOps + +**Date:** January 15, 2025 +**Category:** Cluster +**Tags:** EKS, AWS, Costs, Billing, NodeGroup + +## Problem Description + +**Context:** Users often have questions about AWS EKS cluster costs when using SleakOps, particularly understanding the different cost categories and what generates charges even during development or testing phases. 
+
+**Observed Symptoms:**
+
+- Unexpected charges for EKS clusters during development
+- Confusion about different AWS cost categories
+- Questions about why costs occur even with minimal usage
+- Difficulty understanding the relationship between cluster components and billing
+
+**Relevant Configuration:**
+
+- AWS EKS cluster with control plane
+- NodeGroups for worker nodes
+- Karpenter for auto-scaling
+- Various add-ons and applications deployed
+
+**Error Conditions:**
+
+- Users see charges they don't understand
+- Development environments generating unexpected costs
+- Difficulty correlating usage with billing
+
+## Detailed Solution
+
+
+
+The **Amazon Elastic Container Service for Kubernetes** charge is the base cost of running an EKS cluster:
+
+- **Fixed cost**: $73 USD per month per cluster
+- **What it covers**: The EKS control plane (master nodes)
+- **Always charged**: This cost applies regardless of workload usage
+- **Prorated**: If your cluster exists for part of a month, you pay proportionally
+
+**Example**: If your cluster runs from December 10th to December 31st (22 days), you pay approximately $51 USD (22/31 × $73).
+
+For detailed pricing information, visit the [official AWS EKS pricing page](https://aws.amazon.com/eks/pricing/).
+
+
+
+
+
+The **Elastic Compute Cloud (EC2)** charges cover the worker nodes that run your applications:
+
+**NodeGroups**:
+
+- EC2 instances that form your cluster's worker nodes
+- Charged based on instance type and running time
+- Examples: t3.medium, t3.large, m5.xlarge
+
+**Karpenter-managed instances**:
+
+- Automatically provisioned instances based on workload demands
+- Scales up when applications need resources
+- Scales down when resources are no longer needed
+
+**Cost factors**:
+
+- Instance type (CPU, memory, network performance)
+- Number of instances
+- Running time
+- Data transfer
+
+
+
+
+
+Even development environments incur costs because:
+
+1. 
**Control plane is always running**: The $73/month EKS charge applies +2. **Minimum worker nodes**: Usually need at least 1-2 nodes for basic functionality +3. **Add-ons consume resources**: Monitoring, logging, and other tools need compute power +4. **Background processes**: System pods and services run continuously + +**Cost optimization strategies**: + +```yaml +# Example cost-optimized development configuration +nodegroup_config: + instance_types: ["t3.small", "t3.medium"] + min_size: 1 + max_size: 3 + desired_size: 1 + +karpenter_config: + enabled: true + spot_instances: true # Use spot instances for cost savings + +scheduling: + auto_shutdown: "18:00" # Shut down non-essential workloads + auto_startup: "09:00" # Start workloads during work hours +``` + + + + + +**In AWS Console**: + +1. Go to **Cost Explorer** +2. Filter by service: **Amazon Elastic Kubernetes Service** +3. Group by: **Service** and **Usage Type** +4. Set up **Budget Alerts** for unexpected cost increases + +**In SleakOps**: + +- Monitor cluster resource usage in the dashboard +- Review deployed applications and their resource requests +- Use the cluster scaling settings to control maximum resources + +**Best practices**: + +- Use spot instances for development workloads +- Implement cluster autoscaling +- Regularly review and clean up unused resources +- Consider cluster hibernation for non-production environments + + + + + +For a small development cluster running for a full month: + +**Fixed costs**: + +- EKS Control Plane: $73.00 + +**Variable costs**: + +- 2 × t3.medium nodes (24/7): ~$60.00 +- Additional Karpenter instances (occasional): ~$15.00 +- Data transfer and storage: ~$5.00 + +**Total estimated monthly cost**: ~$153.00 + +**For partial month usage** (like December 10-31): + +- EKS Control Plane: $51.00 (prorated) +- Compute resources: $34.00 (prorated) +- **Total**: ~$85.00 + +These are typical costs for a development environment with minimal but continuous usage. 
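The proration above is simple arithmetic; the following sketch mirrors it (the ~$73 USD/month control-plane figure is the one used on this page — always verify current rates on the AWS EKS pricing page):

```python
# Prorate the fixed EKS control-plane charge for a partial month.
# Assumes the ~$73 USD/month figure used in the examples above.
MONTHLY_CONTROL_PLANE_USD = 73.0

def prorated_control_plane(days_running: int, days_in_month: int) -> float:
    """Return the prorated control-plane cost for a partial month."""
    return round(MONTHLY_CONTROL_PLANE_USD * days_running / days_in_month, 2)

# December 10th through 31st inclusive is 22 days of a 31-day month:
print(prorated_control_plane(22, 31))  # → 51.81, in line with the ~$51 estimate
```

The same per-day proration applies to compute: a node billed hourly simply stops accruing charges once terminated, which is why partial-month totals scale roughly linearly with uptime.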
+ + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/user-access-control-production-environments.mdx b/docs/troubleshooting/user-access-control-production-environments.mdx new file mode 100644 index 000000000..1f6c2b042 --- /dev/null +++ b/docs/troubleshooting/user-access-control-production-environments.mdx @@ -0,0 +1,279 @@ +--- +sidebar_position: 3 +title: "User Access Control in Production Environments" +description: "How to manage user permissions and access control for production resources in SleakOps" +date: "2024-02-17" +category: "user" +tags: ["access-control", "iam", "permissions", "production", "security"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# User Access Control in Production Environments + +**Date:** February 17, 2024 +**Category:** User +**Tags:** Access Control, IAM, Permissions, Production, Security + +## Problem Description + +**Context:** Organizations need to implement granular access control when using a single cluster for multiple environments (development, staging, production) while restricting production access to specific team members. 
+ +**Observed Symptoms:** + +- Need to limit production environment access to only 2 team members +- Developers require access to development and staging resources only +- Uncertainty about what resources different user roles can access +- Questions about S3 bucket and database manipulation permissions + +**Relevant Configuration:** + +- Single cluster hosting multiple environments +- Production, development, and staging environments in same AWS account +- Need for role-based access control (Viewer, Editor, Admin) +- AWS IAM integration with SleakOps + +**Error Conditions:** + +- Risk of unauthorized access to production resources +- Potential data manipulation by users with excessive permissions +- Lack of proper environment segregation + +## Detailed Solution + + + +The **Viewer** role in SleakOps has three main access requirements: + +1. **AWS Account Access**: Users get the [ReadOnlyAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/ReadOnlyAccess.html) AWS managed policy +2. **VPN Access**: Required to access internal resources +3. 
**SleakOps Platform Access**: Access to the SleakOps interface + +**What Viewer role CAN do:** + +- View VariableGroups in SleakOps +- Access RDS and other dependencies through VPN +- View cluster resources and configurations +- Access database connection details from VariableGroups or cluster secrets + +**What Viewer role CANNOT do:** + +- Edit objects in S3 through AWS console (requires Editor role) +- Modify VariableGroups (requires Editor role) +- Make changes to cluster configurations + + + + + +### Strategy 1: VPN-Only Access + +For developers who only need database access: + +```yaml +Access Level: VPN Only +Permissions: + - Connect to development/staging databases + - No AWS console access + - No SleakOps platform access +Implementation: Share database credentials directly +``` + +### Strategy 2: Custom IAM Role + +Create a custom "Developer" role in production account: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "ViewDevelopmentResources", + "Effect": "Allow", + "Action": [ + "rds:Describe*", + "s3:GetObject", + "s3:ListBucket", + "ec2:Describe*", + "eks:Describe*", + "eks:List*" + ], + "Resource": "*", + "Condition": { + "StringEquals": { + "aws:RequestedRegion": "us-east-1" + } + } + }, + { + "Sid": "DenyProductionAccess", + "Effect": "Deny", + "Action": "*", + "Resource": [ + "arn:aws:rds:*:*:db:prod-*", + "arn:aws:s3:::*-production/*", + "arn:aws:eks:*:*:cluster/production-*" + ] + } + ] +} +``` + +### Strategy 3: Environment-Based User Groups + +Organize users into environment-specific groups: + +| User Group | Development Access | Staging Access | Production Access | +|------------|-------------------|----------------|------------------| +| Developers | Full Access | Read-Only | No Access | +| QA Team | Read-Only | Full Access | No Access | +| DevOps | Full Access | Full Access | Admin Access | +| Product Team | No Access | Read-Only | Read-Only | + + + + + +### Step 1: Create User Groups in AWS IAM + +```bash +# Create 
development team group
+aws iam create-group --group-name SleakOps-Development-Team
+
+# Create production admin group
+aws iam create-group --group-name SleakOps-Production-Admins
+
+# Attach policies to groups
+aws iam attach-group-policy \
+  --group-name SleakOps-Development-Team \
+  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
+
+aws iam attach-group-policy \
+  --group-name SleakOps-Production-Admins \
+  --policy-arn arn:aws:iam::aws:policy/PowerUserAccess
+```
+
+### Step 2: Configure SleakOps User Roles
+
+1. **Access SleakOps Admin Panel**
+2. **Navigate to User Management**
+3. **Create role mappings:**
+
+```yaml
+Production Environment:
+  - Admins: production-admin@company.com, lead-devops@company.com
+  - Restricted Access: No other users
+
+Development Environment:
+  - Full Access: All developers
+  - Editor Access: Senior developers
+  - Viewer Access: Junior developers, QA team
+```
+
+### Step 3: Configure VPN Access
+
+```yaml
+# Development VPN group
+vpn_group: development
+users:
+  - developer1@company.com
+  - developer2@company.com
+  - qa1@company.com
+
+# Production VPN group (limited)
+vpn_group: production
+users:
+  - admin1@company.com
+  - admin2@company.com
+```
+
+
+
+
+
+### Separate Database Users by Environment
+
+```sql
+-- Development database users
+CREATE USER 'dev_user'@'%' IDENTIFIED BY 'dev_password';
+GRANT SELECT, INSERT, UPDATE, DELETE ON development_db.* TO 'dev_user'@'%';
+
+-- Staging database users (read-only for developers)
+CREATE USER 'staging_reader'@'%' IDENTIFIED BY 'staging_password';
+GRANT SELECT ON staging_db.* TO 'staging_reader'@'%';
+
+-- Production database users (very restricted)
+CREATE USER 'prod_admin'@'%' IDENTIFIED BY 'secure_prod_password';
+GRANT ALL PRIVILEGES ON production_db.* TO 'prod_admin'@'%';
+
+CREATE USER 'prod_readonly'@'%' IDENTIFIED BY 'prod_readonly_password';
+GRANT SELECT ON production_db.* TO 'prod_readonly'@'%';
+```
+
+### Connection String Management
+
+Store connection strings 
in environment-specific VariableGroups:
+
+```yaml
+# Development VariableGroup
+DATABASE_URL: mysql://dev_user:dev_password@dev-db:3306/development_db
+READ_ONLY_DATABASE_URL: mysql://dev_user:dev_password@dev-db:3306/development_db
+
+# Production VariableGroup (restricted access)
+DATABASE_URL: mysql://prod_admin:secure_prod_password@prod-db:3306/production_db
+READ_ONLY_DATABASE_URL: mysql://prod_readonly:prod_readonly_password@prod-db:3306/production_db
+```
+
+
+
+
+
+### Set Up Access Monitoring
+
+1. **Enable AWS CloudTrail** for all API calls
+2. **Configure VPN logging** to track connection attempts
+3. **Monitor database connections:**
+
+```sql
+-- Enable MySQL general log
+SET GLOBAL general_log = 'ON';
+SET GLOBAL general_log_file = '/var/log/mysql/general.log';
+
+-- Monitor active connections
+SELECT USER, HOST, DB, COMMAND, TIME, STATE
+FROM information_schema.PROCESSLIST;
+```
+
+### Regular Access Reviews
+
+1. **Monthly access review meetings**
+2. **Quarterly permission audits**
+3. **Automated alerts for unusual access patterns**
+
+```bash
+# Example CloudWatch alarm for unusual production access
+aws cloudwatch put-metric-alarm \
+  --alarm-name "UnusualProductionAccess" \
+  --alarm-description "Alert on unusual production database access" \
+  --metric-name DatabaseConnections \
+  --namespace AWS/RDS \
+  --statistic Sum \
+  --period 300 \
+  --threshold 10 \
+  --comparison-operator GreaterThanThreshold
+```
+
+### Best Practices Summary
+
+1. **Principle of Least Privilege**: Grant minimum necessary permissions
+2. **Environment Segregation**: Clear boundaries between dev/staging/prod
+3. **Regular Audits**: Review and update permissions regularly
+4. **Monitoring**: Track all access attempts and unusual patterns
+5. **Documentation**: Maintain clear documentation of who has access to what
+6. 
**Emergency Procedures**: Have protocols for emergency access and revocation + + + +--- + +_This FAQ was automatically generated on February 17, 2024 based on a real user query._ diff --git a/docs/troubleshooting/user-deletion-aws-account-binding-error.mdx b/docs/troubleshooting/user-deletion-aws-account-binding-error.mdx new file mode 100644 index 000000000..daf80699f --- /dev/null +++ b/docs/troubleshooting/user-deletion-aws-account-binding-error.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "User Deletion Error Due to AWS Account Binding" +description: "Solution for user deletion issues when bound to AWS accounts" +date: "2024-01-15" +category: "user" +tags: + ["user-management", "aws", "deletion", "account-binding", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# User Deletion Error Due to AWS Account Binding + +**Date:** January 15, 2024 +**Category:** User +**Tags:** User Management, AWS, Deletion, Account Binding, Troubleshooting + +## Problem Description + +**Context:** Users attempting to delete a user account from the SleakOps platform encounter errors when the user is bound to an AWS account. The deletion process fails, leaving the user in a pending state that prevents recreation. 
+ +**Observed Symptoms:** + +- User appears with error status in the interface +- Deletion process fails to complete +- User remains in pending state after attempted deletion +- Cannot recreate user with same credentials +- Error is related to AWS account binding + +**Relevant Configuration:** + +- User has active AWS account binding +- User deletion attempted through SleakOps interface +- User shows error status before deletion attempt +- Platform: SleakOps user management system + +**Error Conditions:** + +- Error occurs during user deletion process +- User has existing AWS account associations +- Deletion leaves user in inconsistent state +- Prevents user recreation with same identity + +## Detailed Solution + + + +When a user is bound to an AWS account in SleakOps, several resources may be associated: + +- IAM roles and policies +- Service accounts in EKS clusters +- VPC and networking configurations +- Resource access permissions + +These bindings must be properly cleaned up before user deletion can complete successfully. + + + + + +Before attempting to delete a user with AWS account binding: + +1. **Check Active Resources**: + + - Navigate to User Management → User Details + - Review "AWS Resources" tab + - Document any active clusters or projects + +2. **Remove AWS Associations**: + + ```bash + # Remove user from all projects first + sleakops project remove-user --user-email user@example.com --all-projects + + # Unbind AWS account + sleakops user unbind-aws --user-email user@example.com + ``` + +3. **Verify Clean State**: + - Ensure no active clusters are assigned to the user + - Confirm no pending AWS operations + - Check that IAM roles are properly cleaned up + + + + + +To safely delete a user with AWS account binding: + +1. 
**Preparation Phase**: + + ```bash + # List user's active resources + sleakops user list-resources --user-email user@example.com + + # Remove from all projects + sleakops project remove-user --user-email user@example.com --all-projects + ``` + +2. **AWS Cleanup Phase**: + + ```bash + # Unbind AWS account + sleakops user unbind-aws --user-email user@example.com --force-cleanup + + # Wait for cleanup completion + sleakops user check-cleanup-status --user-email user@example.com + ``` + +3. **Final Deletion**: + ```bash + # Delete user after cleanup + sleakops user delete --user-email user@example.com --confirm + ``` + + + + + +If a user is stuck in pending deletion state: + +1. **Check Deletion Status**: + + ```bash + sleakops user status --user-email user@example.com + ``` + +2. **Force Cleanup** (Admin only): + + ```bash + # Admin command to force cleanup + sleakops admin user force-cleanup --user-email user@example.com + + # Complete the deletion + sleakops admin user complete-deletion --user-email user@example.com + ``` + +3. **Manual AWS Cleanup** (if needed): + - Go to AWS IAM Console + - Remove any remaining roles with prefix `sleakops-user-{user-id}` + - Clean up any orphaned service accounts in EKS clusters + + + + + +Once cleanup is complete, to recreate the user: + +1. **Verify Clean State**: + + ```bash + sleakops user check --email user@example.com + # Should return "User not found" + ``` + +2. **Create New User**: + + ```bash + sleakops user create --email user@example.com --name "User Name" --role member + ``` + +3. **Rebind AWS Account** (if needed): + ```bash + sleakops user bind-aws --user-email user@example.com --aws-account-id 123456789012 + ``` + + + + + +To avoid future user deletion issues: + +1. **Regular Cleanup**: + + - Remove users from projects before deletion + - Unbind AWS accounts when no longer needed + - Monitor user resource usage regularly + +2. 
**Proper Workflow**: + + ```bash + # Recommended deletion workflow + sleakops user prepare-deletion --user-email user@example.com + sleakops user confirm-deletion --user-email user@example.com + ``` + +3. **Monitoring**: + - Set up alerts for failed user operations + - Regular audits of user-AWS bindings + - Document user resource associations + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/user-mfa-recovery-after-device-loss.mdx b/docs/troubleshooting/user-mfa-recovery-after-device-loss.mdx new file mode 100644 index 000000000..a289b1aba --- /dev/null +++ b/docs/troubleshooting/user-mfa-recovery-after-device-loss.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "MFA Recovery After Device Loss" +description: "How to recover access when MFA device is lost or stolen" +date: "2024-10-10" +category: "user" +tags: ["mfa", "2fa", "authentication", "recovery", "security"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# MFA Recovery After Device Loss + +**Date:** October 10, 2024 +**Category:** User +**Tags:** MFA, 2FA, Authentication, Recovery, Security + +## Problem Description + +**Context:** User has lost access to their MFA (Multi-Factor Authentication) device due to theft, loss, or damage, preventing them from logging into their SleakOps account. 
+ +**Observed Symptoms:** + +- Cannot complete 2FA authentication during login +- MFA codes are no longer accessible +- User is completely locked out of their account +- Standard login process fails at the MFA verification step + +**Relevant Configuration:** + +- Account has MFA/2FA enabled +- Primary authentication device is no longer available +- User email account is still accessible +- Account recovery may require administrator intervention + +**Error Conditions:** + +- MFA verification fails during login +- No backup authentication methods available +- User cannot generate valid authentication codes +- Account lockout persists until MFA is reset + +## Detailed Solution + + + +If you've lost your MFA device: + +1. **Don't panic** - Account recovery is possible +2. **Contact support immediately** via email: support@sleakops.com +3. **Provide verification details**: + - Your registered email address + - Account username + - Approximate date of last successful login + - Reason for MFA device loss (theft, damage, etc.) + +**Important**: Do not attempt multiple failed login attempts as this may trigger additional security measures. + + + + + +When contacting support for MFA reset: + +**Email Template:** + +``` +Subject: MFA Reset Request - [Your Email] + +Hello SleakOps Support, + +I need to request an MFA reset for my account due to [device loss/theft/damage]. + +Account Details: +- Email: [your-email@company.com] +- Username: [if different from email] +- Last successful login: [approximate date] +- Reason: [brief explanation] + +I can verify my identity through [alternative method if available]. + +Thank you for your assistance. +``` + +**Required Information:** + +- Registered email address +- Account verification details +- Reason for MFA device unavailability +- Alternative contact method if needed + + + + + +The support team will verify your identity through: + +1. **Email verification** - Responding from your registered email +2. 
**Account details** - Confirming account-specific information +3. **Security questions** - If previously configured +4. **Alternative verification** - Through team administrator if applicable + +**Timeline:** MFA resets are typically processed within 24-48 hours during business days. + +**Security Note:** This process is intentionally thorough to protect your account security. + + + + + +Once your MFA is reset: + +1. **Log in immediately** using your password only +2. **Set up new MFA** as soon as possible +3. **Configure backup methods**: + - Save backup codes in a secure location + - Consider multiple authenticator apps + - Set up alternative authentication methods + +**Recommended MFA Setup:** + +``` +Primary: Authenticator app (Google Authenticator, Authy) +Backup: SMS to verified phone number +Emergency: Backup codes stored securely +``` + +4. **Update security practices**: + - Store backup codes separately from your device + - Consider using cloud-based authenticators (Authy) + - Inform your team about the security incident + + + + + +**Best Practices:** + +1. **Multiple Authentication Methods:** + + - Set up at least 2 different MFA methods + - Use both app-based and SMS backup + - Save backup codes in a password manager + +2. **Secure Backup Storage:** + + - Store backup codes in encrypted password manager + - Keep physical copies in secure location + - Don't store codes on the same device as your authenticator + +3. **Regular Review:** + + - Test backup methods periodically + - Update phone numbers when changed + - Review and refresh backup codes quarterly + +4. 
**Team Coordination:**
+   - Inform team administrators of your MFA setup
+   - Ensure multiple team members have admin access
+   - Document recovery procedures for your organization
+
+
+
+
+
+**For urgent MFA recovery needs:**
+
+- **Email**: support@sleakops.com
+- **Subject Line**: "URGENT: MFA Reset Required - [Your Email]"
+- **Response Time**: 24-48 hours (business days)
+
+**Include in urgent requests:**
+
+- Clear explanation of urgency
+- Business impact if applicable
+- Alternative verification methods available
+- Best contact method for follow-up
+
+**Note**: While support aims to help quickly, security verification cannot be bypassed and may require additional time.
+
+
+
+---
+
+_This FAQ was automatically generated on October 10, 2024 based on a real user query._
diff --git a/docs/troubleshooting/user-password-reset-error.mdx b/docs/troubleshooting/user-password-reset-error.mdx
new file mode 100644
index 000000000..e017ec842
--- /dev/null
+++ b/docs/troubleshooting/user-password-reset-error.mdx
@@ -0,0 +1,115 @@
+---
+sidebar_position: 3
+title: "AWS Password Reset Error for Team Members"
+description: "Solution for 'error resetting password' when resetting AWS credentials for team members"
+date: "2025-02-03"
+category: "user"
+tags: ["aws", "password", "reset", "member", "authentication"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# AWS Password Reset Error for Team Members
+
+**Date:** February 3, 2025
+**Category:** User
+**Tags:** AWS, Password, Reset, Member, Authentication
+
+## Problem Description
+
+**Context:** When attempting to reset the AWS password for a team member through the SleakOps platform, the operation fails and leaves the user in an inconsistent state. 
+ +**Observed Symptoms:** + +- Error message: "error resetting password" appears during the reset process +- User account gets stuck with "updating" label/status +- Password reset operation does not complete successfully +- User remains unable to access AWS resources + +**Relevant Configuration:** + +- User type: Team member (not admin) +- Operation: AWS password reset +- Platform: SleakOps user management interface +- User status: Stuck in "updating" state + +**Error Conditions:** + +- Error occurs during password reset initiation +- Affects team members specifically +- User account becomes locked in updating state +- Subsequent reset attempts may also fail + +## Detailed Solution + + + +If you're experiencing this error, follow these steps: + +1. **Wait for automatic recovery**: The system may automatically resolve the "updating" status within 5-10 minutes +2. **Refresh the user management page**: Sometimes the status update is delayed in the UI +3. **Contact support**: If the issue persists, provide the specific user email and timestamp of the error + + + + + +This error typically happens due to: + +1. **AWS IAM permission conflicts**: The service account may lack sufficient permissions to reset user passwords +2. **Concurrent operations**: Multiple password reset attempts happening simultaneously +3. **AWS service temporary unavailability**: Transient issues with AWS IAM service +4. **User state inconsistency**: The user account may be in an invalid state in AWS IAM + + + + + +To avoid this problem in the future: + +1. **Single reset attempt**: Only attempt one password reset at a time per user +2. **Wait between attempts**: If a reset fails, wait at least 5 minutes before trying again +3. **Check user status**: Ensure the user is in "active" status before attempting reset +4. 
**Verify permissions**: Ensure your admin account has proper permissions for user management
+
+Before attempting a reset, confirm the user's state through the SleakOps UI:
+
+1. Navigate to Team Management
+2. Locate the user
+3. Verify status shows "Active" (not "Updating" or "Pending")
+4. Then proceed with the password reset
+
+
+
+
+
+If you're an admin and need to manually resolve this:
+
+1. **Access the SleakOps admin panel**
+2. **Navigate to User Management**
+3. **Find the affected user**
+4. **Check the user's current status**
+5. **If stuck in 'updating', try these actions:**
+   - Refresh the page and wait 2-3 minutes
+   - Try to "cancel" the current operation if available
+   - Contact SleakOps support with the user email and error details
+
+
+
+
+
+If the standard reset continues to fail:
+
+1. **Temporary workaround**: Create a new temporary user account for immediate access
+2. **Direct AWS access**: If you have AWS console access, you can reset the password directly in IAM
+3. **Remove and re-add user**: As a last resort, remove the user from the team and re-invite them
+
+**Note**: Always coordinate with your team before removing users, as this may affect their access to projects and resources. 
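The "wait at least 5 minutes between attempts" rule from the prevention steps can also be enforced in any scripting you wrap around resets. A minimal illustrative guard (a hypothetical helper, not part of the SleakOps tooling):

```python
from datetime import datetime, timedelta, timezone

RESET_COOLDOWN = timedelta(minutes=5)

def can_attempt_reset(last_attempt, now=None):
    """Allow a reset if none was attempted yet or the cooldown has elapsed."""
    if last_attempt is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_attempt >= RESET_COOLDOWN

# An attempt made 2 minutes ago is still inside the cooldown window.
previous = datetime.now(timezone.utc) - timedelta(minutes=2)
print(can_attempt_reset(previous))  # False
```

A guard like this avoids the concurrent-reset and rapid-retry conditions described above, regardless of how the reset itself is triggered.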
+ + + +--- + +_This FAQ was automatically generated on February 3, 2025 based on a real user query._ diff --git a/docs/troubleshooting/vargroup-data-loss-recovery.mdx b/docs/troubleshooting/vargroup-data-loss-recovery.mdx new file mode 100644 index 000000000..3cdd1639f --- /dev/null +++ b/docs/troubleshooting/vargroup-data-loss-recovery.mdx @@ -0,0 +1,187 @@ +--- +sidebar_position: 15 +title: "VarGroup Data Loss and Recovery" +description: "How to handle VarGroup data loss and recovery procedures in SleakOps" +date: "2025-02-26" +category: "project" +tags: ["vargroup", "data-loss", "recovery", "backup", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VarGroup Data Loss and Recovery + +**Date:** February 26, 2025 +**Category:** Project +**Tags:** VarGroup, Data Loss, Recovery, Backup, Troubleshooting + +## Problem Description + +**Context:** Users may experience VarGroup data loss due to system bugs, accidental deletions, or deployment failures. This can result in environment variables being reverted to older versions or completely lost. + +**Observed Symptoms:** + +- VarGroup shows older version instead of latest configuration +- Missing environment variables in specific environments +- Variables that were previously configured are no longer available +- Deployment failures due to missing required variables +- Duplicate VarGroups appearing in the system + +**Relevant Configuration:** + +- Environment: Development, staging, or production +- VarGroup type: Global or environment-specific +- Last known working timestamp +- User permissions and access levels + +**Error Conditions:** + +- Data loss occurs after system bugs or maintenance +- Variables revert to previous versions unexpectedly +- Failed deployments trigger VarGroup corruption +- Duplicate VarGroups cause configuration conflicts + +## Detailed Solution + + + +When VarGroup data loss is detected: + +1. 
**Document the timeline**: Note when the issue was first observed +2. **Identify affected environments**: Check which environments are impacted +3. **Review recent changes**: Look at deployment history and user activities +4. **Check VarGroup history**: Review the update timeline in the admin panel + +```bash +# Example of checking recent deployments +kubectl get deployments -n your-namespace --sort-by=.metadata.creationTimestamp +``` + + + + + +**Option 1: Platform Recovery (If Available)** + +- Contact SleakOps support immediately +- Provide the exact timestamp of the last known good configuration +- Include the VarGroup name and affected environment + +**Option 2: Manual Recreation** + +- Recreate the VarGroup from scratch +- Use documentation or team knowledge to restore variables +- Test thoroughly in development before applying to production + +**Option 3: Version Control Recovery** + +- If you maintain VarGroup configurations in version control +- Restore from your latest backup or commit +- Apply the configuration through the SleakOps interface + + + + + +**Backup Strategies:** + +1. **Export VarGroups regularly**: + + ```bash + # Export current VarGroup configuration + sleakops vargroup export --name global --environment develop > vargroup-backup.json + ``` + +2. **Version Control Integration**: + + - Store VarGroup configurations in Git repositories + - Use Infrastructure as Code (IaC) approaches + - Implement automated backups + +3. **Documentation**: + - Maintain updated documentation of all environment variables + - Document the purpose and expected values of each variable + - Keep a changelog of VarGroup modifications + +**Monitoring and Alerts:** + +- Set up alerts for VarGroup changes +- Monitor deployment failures that might indicate missing variables +- Regular audits of VarGroup configurations + + + + + +When deployments fail due to VarGroup issues: + +1. 
**Check deployment logs**: + + ```bash + kubectl logs deployment/your-app -n your-namespace + ``` + +2. **Verify VarGroup content**: + + - Log into SleakOps admin panel + - Navigate to VarGroups section + - Compare current configuration with expected values + +3. **Test variable availability**: + + ```bash + # Test if variables are properly injected + kubectl exec -it pod/your-pod -- env | grep YOUR_VARIABLE + ``` + +4. **Rollback if necessary**: + - Use previous working deployment + - Restore VarGroup to last known good state + - Redeploy after verification + + + + + +**Contact SleakOps support immediately if:** + +- Data loss affects production environments +- Multiple VarGroups are affected +- The issue appears to be platform-wide +- You cannot recreate the lost configuration + +**Information to provide:** + +- Ticket reference (if available) +- Exact timestamp when the issue was discovered +- VarGroup names and affected environments +- Recent deployment history +- Any error messages or logs +- Timeline of recent changes or activities + +**Support Contact:** + +- Email: support@sleakops.com +- Include "VarGroup Data Loss" in the subject line +- Use "Reply to all" when responding to support tickets + + + + + +After recovering VarGroup data: + +- [ ] Verify all environment variables are present +- [ ] Test application functionality in development +- [ ] Run deployment tests +- [ ] Update documentation with any changes +- [ ] Implement additional backup measures +- [ ] Review and improve monitoring +- [ ] Train team on prevention strategies +- [ ] Document lessons learned + + + +--- + +_This FAQ was automatically generated on February 26, 2025 based on a real user query._ diff --git a/docs/troubleshooting/vargroup-deployment-error-state.mdx b/docs/troubleshooting/vargroup-deployment-error-state.mdx new file mode 100644 index 000000000..c160c2fff --- /dev/null +++ b/docs/troubleshooting/vargroup-deployment-error-state.mdx @@ -0,0 +1,124 @@ +--- +sidebar_position: 3 
+title: "VarGroup Error State Preventing Deployment" +description: "Solution for VarGroup stuck in error state blocking deployments" +date: "2024-03-17" +category: "project" +tags: ["vargroup", "deployment", "variables", "error-state"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VarGroup Error State Preventing Deployment + +**Date:** March 17, 2024 +**Category:** Project +**Tags:** VarGroup, Deployment, Variables, Error State + +## Problem Description + +**Context:** User attempts to modify environment variables through VarGroup but encounters an error state that prevents both variable updates and deployments. + +**Observed Symptoms:** + +- VarGroup enters "Error" state after attempting to modify variables +- Deployment process fails with "unpublished changes" message +- Cannot publish pending changes due to VarGroup error state +- Variable modifications (like APP_DEBUG from false to true) are not applied +- System shows there are changes to publish but publishing fails + +**Relevant Configuration:** + +- Platform: SleakOps +- Component: VarGroup (Variable Groups) +- Variable example: APP_DEBUG (boolean value) +- Error occurs during Kubernetes operations + +**Error Conditions:** + +- Error appears when modifying VarGroup variables +- Kubernetes timeout causes initial failure +- VarGroup remains in error state preventing further operations +- Deployment blocked until VarGroup state is resolved + +## Detailed Solution + + + +The most direct solution is to retry the VarGroup modification: + +1. **Navigate to VarGroup section** in your project +2. **Locate the VarGroup in "Error" state** +3. **Click "Edit" on the failed VarGroup** +4. **Re-enter the same variable values** you intended to change +5. **Save the changes** - this will trigger a new execution +6. **Wait for the status** to change from "Error" to "Created" + +This retry mechanism helps resolve temporary Kubernetes timeout issues. 
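Transient timeouts like this are the classic case for retry with backoff, which is effectively what re-saving the VarGroup does by hand. A generic sketch of the pattern (illustrative only; the UI retry described above is the supported path in SleakOps):

```python
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=1.0):
    """Run operation, doubling the wait after each transient timeout."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a flaky save: fails once with a timeout, then succeeds.
calls = {"count": 0}
def flaky_save():
    calls["count"] += 1
    if calls["count"] < 2:
        raise TimeoutError("Kubernetes API timeout")
    return "Created"

print(retry_with_backoff(flaky_save, base_delay=0.01))  # Created
```

The spacing between attempts matters: retrying immediately tends to hit the same overloaded API, while a short growing delay gives the cluster time to recover.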
+ + + + + +Before retrying, verify your intended changes: + +1. **Document current values**: Note down current variable values +2. **Confirm intended changes**: Verify what values you want to set +3. **Apply changes carefully**: Make sure you're setting the correct values +4. **Validate after success**: Once VarGroup shows "Created", verify the variables are correctly set + +**Example verification:** + +``` +Before: APP_DEBUG = false +Intended: APP_DEBUG = true +After retry: Confirm APP_DEBUG = true +``` + + + + + +The error occurs because: + +1. **Kubernetes API delays**: Sometimes Kubernetes takes longer to respond +2. **Resource contention**: High cluster load can cause timeouts +3. **Network issues**: Temporary connectivity problems +4. **State synchronization**: VarGroup state gets stuck during the timeout + +**This is typically a temporary issue** that resolves with a retry. + + + + + +Once the VarGroup is fixed: + +1. **Verify VarGroup status**: Ensure it shows "Created" instead of "Error" +2. **Check for pending changes**: The "unpublished changes" message should clear +3. **Attempt deployment**: Try your deployment process again +4. **Monitor deployment**: Watch for successful completion + +If deployment still fails, check for other pending changes in: + +- Other VarGroups +- Application configuration +- Infrastructure settings + + + + + +To minimize future VarGroup errors: + +1. **Make changes during low-traffic periods**: Reduce chance of Kubernetes timeouts +2. **Modify one VarGroup at a time**: Avoid concurrent modifications +3. **Wait for completion**: Don't make additional changes while one is processing +4. **Monitor cluster health**: Check if your cluster is experiencing high load +5. 
**Use smaller batches**: If changing many variables, do it in smaller groups + + + +--- + +_This FAQ was automatically generated on March 17, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vargroup-editing-redis-connection-error.mdx b/docs/troubleshooting/vargroup-editing-redis-connection-error.mdx new file mode 100644 index 000000000..ed66cd1e4 --- /dev/null +++ b/docs/troubleshooting/vargroup-editing-redis-connection-error.mdx @@ -0,0 +1,121 @@ +--- +sidebar_position: 3 +title: "Vargroup Editing Error - Redis Connection Issue" +description: "Solution for 'something goes wrong' error when editing vargroup variables due to Redis connectivity issues" +date: "2024-11-20" +category: "project" +tags: ["vargroup", "redis", "variables", "editing", "connection"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Vargroup Editing Error - Redis Connection Issue + +**Date:** November 20, 2024 +**Category:** Project +**Tags:** Vargroup, Redis, Variables, Editing, Connection + +## Problem Description + +**Context:** User attempts to edit variables in vargroups through the SleakOps platform but encounters connection errors due to Redis service disruptions. 
+ +**Observed Symptoms:** + +- "Something goes wrong" error message when trying to edit vargroup variables +- Error occurs across multiple environments (develop and production) +- Unable to access vargroup editing interface +- Error persists across different vargroups in the same project + +**Relevant Configuration:** + +- Affected vargroups: web-mecubro-develop, web-mecubro-com (production) +- Platform component: Variable management system +- Backend dependency: Redis service +- User interface: Vargroup editor + +**Error Conditions:** + +- Error occurs when Redis service is unavailable or experiencing connectivity issues +- Affects all vargroup editing operations +- May impact multiple users simultaneously +- Temporary service disruption + +## Detailed Solution + + + +Once the Redis service has been restored: + +1. **Wait for service confirmation**: Ensure Redis service is fully operational +2. **Clear browser cache**: Refresh the page or clear browser cache +3. **Retry the operation**: Attempt to edit the vargroup variables again +4. **Check multiple environments**: Verify that both develop and production environments are accessible + +The error should resolve automatically once the Redis connection is restored. + + + + + +To check if Redis-related issues are affecting the platform: + +1. **Check platform status page**: Look for any ongoing service disruptions +2. **Try other platform features**: Test if other variable-related features work +3. **Contact support**: If the issue persists after Redis recovery, report the problem + +**Signs that Redis is working again:** + +- Other users can edit vargroups successfully +- Platform status shows all services operational +- No error messages when accessing variable management + + + + + +To minimize impact during temporary service outages: + +1. **Save work frequently**: When editing large sets of variables, save changes incrementally +2. **Use version control**: Keep track of variable changes in your documentation +3. 
**Plan critical changes**: Avoid making critical variable changes during peak hours +4. **Have rollback plan**: Keep previous variable configurations documented + +**Backup approach for critical changes:** + +```bash +# Export current variables before making changes +sleakops vargroup export --project web-mecubro --env develop > backup-variables.json + +# Apply changes when service is stable +sleakops vargroup import --project web-mecubro --env develop < new-variables.json +``` + + + + + +If you continue experiencing issues after Redis service restoration: + +1. **Clear all browser data**: + + - Clear cookies and local storage for SleakOps domain + - Try using an incognito/private browser window + - Test from a different browser + +2. **Check network connectivity**: + + - Verify your internet connection is stable + - Try accessing from a different network + - Check if corporate firewall is blocking requests + +3. **Report persistent issues**: + - Include specific error messages + - Mention which vargroups are affected + - Provide browser and network information + - Reference the original Redis outage for context + + + +--- + +_This FAQ was automatically generated on November 20, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vargroup-publish-error-without-deploy.mdx b/docs/troubleshooting/vargroup-publish-error-without-deploy.mdx new file mode 100644 index 000000000..ed1085730 --- /dev/null +++ b/docs/troubleshooting/vargroup-publish-error-without-deploy.mdx @@ -0,0 +1,148 @@ +--- +sidebar_position: 3 +title: "Variable Group Publish Error Without Deploy" +description: "Error when publishing variable groups without deploying first" +date: "2025-02-10" +category: "project" +tags: ["vargroup", "variables", "publish", "deploy", "error"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variable Group Publish Error Without Deploy + +**Date:** February 10, 2025 +**Category:** Project +**Tags:** 
Vargroup, Variables, Publish, Deploy, Error
+
+## Problem Description
+
+**Context:** When updating variable groups (vargroups) in SleakOps and attempting to publish changes without first deploying, the system throws an error on the first publish attempt.
+
+**Observed Symptoms:**
+
+- Error occurs when publishing a vargroup after updating it without deploying
+- First publish attempt fails with an error
+- Second publish attempt succeeds without issues
+- The error is reproducible following the same steps
+
+**Relevant Configuration:**
+
+- Component: Variable Groups (vargroups)
+- Action sequence: Update → Publish (without Deploy)
+- Platform: SleakOps
+- Error occurs consistently on first attempt
+
+**Error Conditions:**
+
+- Error appears specifically when skipping the deploy step
+- Occurs during the publish operation
+- First attempt always fails
+- Subsequent attempts work correctly
+- Reproducible behavior
+
+## Detailed Solution
+
+
+
+In SleakOps, variable groups follow a specific workflow:
+
+1. **Update**: Modify variable values or add/remove variables
+2. **Deploy**: Apply changes to the environment
+3. **Publish**: Make changes available to dependent services
+
+Skipping the deploy step can cause synchronization issues between the configuration state and the published state.
+
+
+
+
+
+If you encounter this error, you can resolve it immediately:
+
+1. **First attempt fails**: Note the error but don't panic
+2. **Retry the publish**: Click publish again immediately
+3. **Second attempt succeeds**: The publish should work on the second try
+
+This workaround will resolve the immediate issue, but it's recommended to follow the proper workflow to avoid the error entirely.
+
+
+
+
+
+To prevent this error from occurring, follow this sequence:
+
+```
+# Recommended sequence
+1. Update vargroup → 2. Deploy → 3. Publish
+```
+
+**Step-by-step process:**
+
+1. 
**Update your variable group**: + + - Navigate to your project's variable groups + - Make the necessary changes to variables + - Save the changes + +2. **Deploy the changes**: + + - Click the "Deploy" button + - Wait for deployment to complete successfully + - Verify the deployment status + +3. **Publish the changes**: + - Click the "Publish" button + - The publish should complete without errors + + + + + +This error happens due to a synchronization issue in the SleakOps platform: + +- **State Mismatch**: When you skip deploy, there's a mismatch between the configuration state and the runtime state +- **Validation Failure**: The publish process validates against the deployed state, which hasn't been updated +- **Cache Issues**: The system may have cached states that become inconsistent + +The second attempt works because the first attempt partially updates the internal state, allowing the second attempt to succeed. + + + + + +If you need to publish without deploying for specific reasons: + +1. **Use the retry method**: Accept that the first attempt will fail and retry +2. **Batch your changes**: Make all variable updates at once, then deploy and publish +3. **Consider the impact**: Evaluate if skipping deploy is necessary for your use case + +**When skipping deploy might be acceptable:** + +- Testing configuration changes +- Preparing variables for future deployments +- Working in development environments + +**When you should always deploy first:** + +- Production environments +- When other services depend on the variables +- When consistency is critical + + + + + +If you continue experiencing this problem: + +1. **Document the steps**: Record the exact sequence that reproduces the error +2. **Note the error message**: Copy the exact error text +3. **Environment details**: Include project name, environment, and variable group name +4. 
**Contact support**: Report this as a bug for the development team to investigate + +This appears to be a known issue that the SleakOps team is working to resolve. + + + +--- + +_This FAQ was automatically generated on February 10, 2025 based on a real user query._ diff --git a/docs/troubleshooting/variable-groups-editing-error.mdx b/docs/troubleshooting/variable-groups-editing-error.mdx new file mode 100644 index 000000000..e377bdbd0 --- /dev/null +++ b/docs/troubleshooting/variable-groups-editing-error.mdx @@ -0,0 +1,161 @@ +--- +sidebar_position: 3 +title: "Variable Groups Editing Error" +description: "Solution for variable groups editing issues and cluster secrets workaround" +date: "2024-04-21" +category: "project" +tags: + ["variable-groups", "secrets", "cluster", "configuration", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variable Groups Editing Error + +**Date:** April 21, 2024 +**Category:** Project +**Tags:** Variable Groups, Secrets, Cluster, Configuration, Troubleshooting + +## Problem Description + +**Context:** Users experience errors when attempting to edit variable groups in SleakOps, preventing them from updating configuration values such as database connection parameters. 
+ +**Observed Symptoms:** + +- Unable to edit variable groups through the standard interface +- Error occurs when trying to update variable group values +- Changes to variable groups are not saved or applied +- Specific issues with database configuration updates + +**Relevant Configuration:** + +- Variable groups containing database connection strings +- Development environment configurations +- MySQL database parameters +- Account-specific permissions + +**Error Conditions:** + +- Error appears when accessing variable group editing interface +- Problem is account-specific +- Affects ability to update database connections +- Prevents configuration changes for development environments + +## Detailed Solution + + + +While the variable groups interface has issues, you can edit the values directly through cluster secrets: + +1. **Access Cluster Configuration** + + - Navigate to your cluster in the SleakOps dashboard + - Go to the **Secrets** section + +2. **Locate Variable Group Secrets** + + - Find the secret corresponding to your variable group + - Look for secrets with names matching your variable group pattern + +3. **Edit Secret Values** + + - Click on the secret to edit + - Update the values directly in the secret configuration + - Save the changes + +4. **Restart Affected Services** + - Restart any services that depend on these variables + - This ensures the new values are picked up + + + + + +**To edit database configuration through cluster secrets:** + +1. **Navigate to Cluster Secrets** + + ``` + Dashboard → Clusters → [Your Cluster] → Secrets + ``` + +2. **Find Database Variable Group** + + - Look for secrets named similar to your variable group + - Example: `mecubrov4develop-mysql-secrets` + +3. **Edit Database Parameters** + + - Update connection strings + - Modify host, port, database name as needed + - Update credentials if required + +4. 
**Apply Changes**
+   - Save the secret configuration
+   - The changes will be automatically synchronized
+
+**Example secret structure:**
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: mecubrov4develop-mysql
+type: Opaque
+data:
+  # Values in a Secret's `data` field must be base64-encoded
+  DB_HOST: <base64-encoded-value>
+  DB_PORT: <base64-encoded-value>
+  DB_NAME: <base64-encoded-value>
+  DB_USER: <base64-encoded-value>
+  DB_PASSWORD: <base64-encoded-value>
+```
+
+
+
+
+
+**Platform Update:**
+
+A SleakOps release is scheduled that will resolve this variable group editing issue:
+
+- **Release Timeline:** Mid-week (Wednesday/Thursday)
+- **Fix Scope:** Complete resolution of variable group editing errors
+- **Impact:** Normal variable group editing functionality will be restored
+
+**After the Update:**
+
+- Variable groups can be edited directly through the standard interface
+- No need to use the cluster secrets workaround
+- All account-specific editing issues will be resolved
+
+
+
+
+
+**To avoid similar issues in the future:**
+
+1. **Regular Backups**
+
+   - Export variable group configurations regularly
+   - Keep documentation of critical variable values
+
+2. **Use Environment-Specific Groups**
+
+   - Separate development, staging, and production variables
+   - Use clear naming conventions
+
+3. **Version Control**
+
+   - Track changes to variable configurations
+   - Document reasons for configuration updates
+
+4. 
**Testing Process** + - Test configuration changes in development first + - Verify database connections after updates + - Monitor application logs after changes + + + +--- + +_This FAQ was automatically generated on April 21, 2024 based on a real user query._ diff --git a/docs/troubleshooting/variable-management-synchronization-issues.mdx b/docs/troubleshooting/variable-management-synchronization-issues.mdx new file mode 100644 index 000000000..c93c43f9a --- /dev/null +++ b/docs/troubleshooting/variable-management-synchronization-issues.mdx @@ -0,0 +1,189 @@ +--- +sidebar_position: 3 +title: "Variable Management and Synchronization Issues" +description: "Solution for variable group synchronization problems and unexpected changes in environment configurations" +date: "2024-12-19" +category: "project" +tags: + ["variables", "environment", "secrets", "synchronization", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variable Management and Synchronization Issues + +**Date:** December 19, 2024 +**Category:** Project +**Tags:** Variables, Environment, Secrets, Synchronization, Configuration + +## Problem Description + +**Context:** Users experience unexpected changes to environment variables and configuration secrets across different environments (development, production) after platform updates or system maintenance. 
+
+**Observed Symptoms:**
+
+- Environment variables modified without user authorization
+- Variable groups showing "dummy" or placeholder values
+- Inconsistencies between development and production environments
+- Missing or empty secrets in specific environments
+- Variables not updating correctly from the cluster interface
+
+**Relevant Configuration:**
+
+- Platform: SleakOps variable management system
+- Affected environments: Development and Production
+- Storage: AWS Parameter Store with encryption
+- Variable groups (vargroups) configuration
+
+**Error Conditions:**
+
+- Variables appear modified after platform updates
+- Synchronization failures between environments
+- Missing variable groups in development environment
+- Placeholder values replacing actual configuration
+
+## Detailed Solution
+
+
+
+The variable synchronization issue typically occurs when:
+
+1. **Platform updates** modify the variable management system
+2. **Missing variable groups** in specific environments cause inconsistencies
+3. **Security policies** prevent automatic restoration of sensitive values
+4. **Centralized management** conflicts with local configurations
+
+To maintain security standards, the system creates empty secrets with placeholder values when it cannot find the corresponding variable groups.
+
+
+
+
+
+To recover from variable synchronization issues:
+
+1. **Identify affected environments**:
+
+   ```bash
+   # Check variable groups status
+   sleakops env list --show-variables
+   sleakops secrets list --environment development
+   ```
+
+2. **Restore from backups**:
+
+   - Access SleakOps console
+   - Navigate to **Environment Settings** → **Variables**
+   - Look for **Backup/History** section
+   - Restore previous working configuration
+
+3. 
**Manual variable restoration**: + - Replace "dummy" values with actual configuration + - Update each variable group individually + - Verify synchronization across environments + + + + + +To prevent future variable synchronization issues: + +1. **Enable automatic backups**: + + ```yaml + # In your sleakops.yml + environments: + development: + variables: + backup_enabled: true + backup_frequency: "daily" + encryption: true + ``` + +2. **Implement variable validation**: + + - Set up integrity checks before applying changes + - Use variable templates for consistency + - Implement approval workflows for production changes + +3. **Environment isolation**: + - Keep development and production variables separate + - Use different variable groups for each environment + - Implement proper access controls + + + + + +SleakOps centralizes variable management with these features: + +1. **Parameter Store integration**: + + - All variables stored in AWS Parameter Store + - Automatic encryption for sensitive data + - Version history and rollback capabilities + +2. **Automatic synchronization**: + + - Variables sync across all environments + - Real-time updates to running applications + - Conflict resolution mechanisms + +3. **Security features**: + - No plain-text storage of credentials + - Placeholder values for missing configurations + - Audit trail for all changes + + + + + +When experiencing variable problems: + +1. **Check variable group status**: + + ```bash + sleakops vargroups status --environment development + sleakops vargroups validate --all + ``` + +2. **Verify synchronization**: + + ```bash + sleakops sync status + sleakops sync force --environment development + ``` + +3. **Review change history**: + + - Access SleakOps console + - Go to **Audit Logs** → **Variable Changes** + - Review recent modifications and their sources + +4. 
**Manual synchronization**:
+   ```bash
+   # Force sync specific variable group
+   sleakops vargroups sync --name database-config --environment development
+   ```
+
+
+
+
+
+Follow this checklist to fully recover from variable issues:
+
+- [ ] **Backup current state** before making changes
+- [ ] **Identify all affected environments** and variable groups
+- [ ] **Restore production variables** from the most recent backup
+- [ ] **Update development variables** with correct values
+- [ ] **Verify application functionality** in each environment
+- [ ] **Test variable synchronization** between environments
+- [ ] **Enable monitoring** for future variable changes
+- [ ] **Document the incident** and lessons learned
+- [ ] **Schedule regular backups** if not already configured
+- [ ] **Review access permissions** for variable management
+
+
+
+---
+
+_This FAQ was automatically generated on December 19, 2024 based on a real user query._
diff --git a/docs/troubleshooting/volume-unique-path-error.mdx b/docs/troubleshooting/volume-unique-path-error.mdx
new file mode 100644
index 000000000..65f642b25
--- /dev/null
+++ b/docs/troubleshooting/volume-unique-path-error.mdx
@@ -0,0 +1,139 @@
+---
+sidebar_position: 3
+title: "Volume Unique Path Error After Deletion"
+description: "Solution for 'Unique path' error when recreating volumes at the same mount point"
+date: "2025-02-11"
+category: "project"
+tags: ["volume", "storage", "error", "mount-point", "deletion"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Volume Unique Path Error After Deletion
+
+**Date:** February 11, 2025
+**Category:** Project
+**Tags:** Volume, Storage, Error, Mount Point, Deletion
+
+## Problem Description
+
+**Context:** User attempts to recreate a volume at the same mount point after deleting a previous volume, but encounters a "Unique path" error in the SleakOps platform. 
+ +**Observed Symptoms:** + +- "Unique path" error when trying to create a new volume +- Error occurs at the same mount point where a volume was previously deleted +- Volume appears to be properly deleted from the interface +- Unable to reuse the same mount path for new volumes + +**Relevant Configuration:** + +- Project: "velo-opensactions" +- Mount path: "/opensanctions/data" +- Volume was previously created and then deleted +- Attempting to recreate volume at identical mount point + +**Error Conditions:** + +- Error appears during volume creation process +- Occurs specifically when reusing mount paths +- Persists even after volume deletion appears successful +- May indicate incomplete cleanup of volume resources + +## Detailed Solution + + + +The "Unique path" error typically occurs when: + +1. **Incomplete volume cleanup**: The volume deletion process didn't fully remove all associated resources +2. **Mount point reservation**: The system still considers the mount path as "in use" +3. **Resource synchronization delay**: There's a delay between volume deletion and path availability +4. **Orphaned references**: Database entries or metadata still reference the deleted volume + +This is a known issue that has been identified and fixed in recent platform updates. + + + + + +While waiting for the fix to be applied, you can use these workarounds: + +**Option 1: Use a different mount path** + +```bash +# Instead of /opensanctions/data +# Try /opensanctions/data-v2 or /opensanctions/storage +``` + +**Option 2: Wait and retry** + +- Wait 10-15 minutes after volume deletion +- The system may eventually release the path reservation +- Retry volume creation with the same path + +**Option 3: Contact support for manual cleanup** + +- If the path must be reused immediately +- Support team can manually clean up orphaned references + + + + + +To avoid this issue in the future: + +1. 
**Verify complete deletion**: + + - Check that no workloads are still using the volume + - Ensure all pods using the volume are stopped + +2. **Wait before recreation**: + + - Allow 5-10 minutes between deletion and recreation + - This ensures all cleanup processes complete + +3. **Use unique mount paths**: + - Consider using timestamps or version numbers in paths + - Example: `/data/app-v1`, `/data/app-v2` + + + + + +To confirm a volume is fully deleted: + +1. **Check the Volumes section**: + + - Go to your project dashboard + - Navigate to Storage → Volumes + - Verify the volume is not listed + +2. **Check workload configurations**: + + - Review all workloads in the project + - Ensure no workload references the deleted volume + - Look for any "Volume not found" errors + +3. **Monitor for orphaned resources**: + - Check if any persistent volume claims remain + - Verify no storage classes are still referencing the volume + + + + + +This issue has been identified and resolved: + +- **Status**: Fixed in recent platform release +- **Solution**: Improved volume cleanup process +- **Impact**: Prevents orphaned mount path reservations +- **Rollout**: Automatically applied to all projects + +If you continue experiencing this issue after the fix deployment, please contact support as it may indicate a different underlying problem. 
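If you recreate volumes often, the version-suffixed mount path convention suggested above can be scripted. A minimal POSIX-shell sketch; `taken_paths` is illustrative stand-in data, not the output of a real SleakOps command:

```shell
#!/bin/sh
# Pick the first free mount path by appending -v2, -v3, ... to the base
# path until the candidate no longer appears in the reserved list.
# In practice, taken_paths would come from your project's volume listing.
taken_paths="/opensanctions/data
/opensanctions/data-v2"

next_mount_path() {
  base=$1
  candidate=$base
  n=2
  while printf '%s\n' "$taken_paths" | grep -qxF "$candidate"; do
    candidate="${base}-v${n}"
    n=$((n + 1))
  done
  printf '%s\n' "$candidate"
}

next_mount_path /opensanctions/data   # prints /opensanctions/data-v3
```

Date-based suffixes (e.g. `-$(date +%Y%m%d)`) work equally well if you prefer timestamps over version numbers.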
+
+
+
+---
+
+_This FAQ was automatically generated on February 11, 2025 based on a real user query._
diff --git a/docs/troubleshooting/vpn-addon-access-troubleshooting.mdx b/docs/troubleshooting/vpn-addon-access-troubleshooting.mdx
new file mode 100644
index 000000000..599cddcd5
--- /dev/null
+++ b/docs/troubleshooting/vpn-addon-access-troubleshooting.mdx
@@ -0,0 +1,205 @@
+---
+sidebar_position: 3
+title: "VPN Connection Issues with Cluster Addons"
+description: "Troubleshooting VPN connectivity problems preventing access to cluster addons and development environments"
+date: "2024-12-19"
+category: "user"
+tags: ["vpn", "connection", "addons", "troubleshooting", "lens"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# VPN Connection Issues with Cluster Addons
+
+**Date:** December 19, 2024
+**Category:** User
+**Tags:** VPN, Connection, Addons, Troubleshooting, Lens
+
+## Problem Description
+
+**Context:** User experiences connectivity issues with development cluster addons and Lens after working on a new project, while production services remain accessible.
+
+**Observed Symptoms:**
+
+- Cannot access any development cluster addons
+- Unable to connect to development cluster via Lens
+- Loss of access to the new project (n8n)
+- Production services (Grafana, Loki) remain accessible
+- Issues started in the morning (France timezone)
+
+**Relevant Configuration:**
+
+- Environment: Development cluster vs Production cluster
+- Tools affected: Cluster addons, Lens, project workloads
+- Geographic location: France
+- VPN: Standard SleakOps VPN connection
+
+**Error Conditions:**
+
+- Selective connectivity loss (production accessible, development not)
+- Coincides with new project work
+- Persists despite VPN reconnection and DNS clearing
+- Affects multiple services simultaneously
+
+## Detailed Solution
+
+
+
+When you can access production services but not development cluster addons, this typically indicates:
+
+1. 
**Network segmentation**: Development and production environments use different network routes +2. **VPN routing tables**: Your VPN client may have outdated or incomplete routing information +3. **DNS resolution**: Development cluster DNS entries may not be resolving correctly +4. **Authentication tokens**: Your kubeconfig or access tokens for development may have expired + + + + + +Beyond basic VPN reconnection, try these steps: + +1. **Complete VPN reset**: + + ```bash + # Disconnect VPN completely + # Clear VPN client cache/settings + # Restart VPN client application + # Reconnect with fresh configuration + ``` + +2. **Verify routing tables**: + + ```bash + # On Windows + route print + + # On macOS/Linux + netstat -rn + ``` + +3. **Test specific endpoints**: + + ```bash + # Test development cluster API + curl -k https://your-dev-cluster-api-endpoint/version + + # Test addon endpoints + nslookup your-addon-domain.sleakops.com + ``` + + + + + +For Lens connection issues: + +1. **Refresh cluster configuration**: + + - Open Lens + - Go to cluster settings + - Click "Refresh" or "Reconnect" + - Verify kubeconfig path is correct + +2. **Update kubeconfig**: + + ```bash + # Download fresh kubeconfig from SleakOps + sleakops cluster kubeconfig --cluster development + + # Or update existing config + kubectl config use-context development + kubectl cluster-info + ``` + +3. **Clear Lens cache**: + - Close Lens completely + - Clear application cache (location varies by OS) + - Restart Lens + + + + + +If you lost access to a newly created project: + +1. **Verify project permissions**: + + - Check if project deployment completed successfully + - Confirm your user has proper access rights + - Verify project is in "Running" state in SleakOps dashboard + +2. **Check project networking**: + + ```bash + # Verify project endpoints are accessible + nslookup your-project-url.sleakops.com + + # Test direct connectivity + curl -I https://your-project-url.sleakops.com + ``` + +3. 
**Re-sync project configuration**: + - Go to SleakOps dashboard + - Navigate to your project + - Click "Sync" or "Refresh configuration" + + + + + +When production works but development doesn't: + +1. **Check environment status**: + + - Verify development cluster is healthy in SleakOps dashboard + - Look for any maintenance notifications + - Check if there are ongoing deployments + +2. **Network configuration differences**: + + - Development clusters may use different IP ranges + - VPN split-tunneling might affect development routes + - Firewall rules may differ between environments + +3. **Authentication scope**: + - Your tokens might have different scopes for dev vs prod + - Development access might require additional permissions + - Check if your user role includes development cluster access + + + + + +If these steps don't resolve the issue, escalate with this information: + +1. **Network diagnostics**: + + ```bash + # Collect routing information + route print > routing_info.txt + + # DNS resolution tests + nslookup grafana.prod.sleakops.com + nslookup your-addon.dev.sleakops.com + + # Connectivity tests + traceroute your-dev-cluster-endpoint + ``` + +2. **Timeline information**: + + - Exact time when issues started + - Last successful connection time + - Any recent changes or new project creation + +3. 
**Environment details**: + - Operating system and version + - VPN client version + - Lens version + - Geographic location/timezone + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-cluster-access-troubleshooting.mdx b/docs/troubleshooting/vpn-cluster-access-troubleshooting.mdx new file mode 100644 index 000000000..2c98199b6 --- /dev/null +++ b/docs/troubleshooting/vpn-cluster-access-troubleshooting.mdx @@ -0,0 +1,229 @@ +--- +sidebar_position: 3 +title: "VPN Connection and Cluster Access Issues" +description: "Troubleshooting VPN connectivity and Kubernetes cluster access problems" +date: "2024-12-11" +category: "user" +tags: ["vpn", "cluster", "access", "troubleshooting", "connectivity"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VPN Connection and Cluster Access Issues + +**Date:** December 11, 2024 +**Category:** User +**Tags:** VPN, Cluster, Access, Troubleshooting, Connectivity + +## Problem Description + +**Context:** User attempts to access a production Kubernetes cluster through SleakOps but encounters connectivity issues despite having an active VPN connection. 
+ +**Observed Symptoms:** + +- Unable to access production cluster +- VPN appears to be connected but cluster access fails +- Ping tests to internal network addresses may fail +- Internal workload URLs are not accessible +- kubectl commands timeout or fail to connect + +**Relevant Configuration:** + +- VPN connection: Active/Connected status +- Target environment: Production cluster +- Internal network range: 10.130.0.0/16 (typical) +- Cluster endpoint: EKS cluster with private endpoint + +**Error Conditions:** + +- Error occurs when trying to access cluster resources +- Problem persists despite VPN connection showing as active +- Internal URLs and services are unreachable +- kubectl commands fail with timeout errors + +## Detailed Solution + + + +When experiencing cluster access issues with VPN, follow these diagnostic steps: + +1. **Verify VPN connection status** in your VPN client +2. **Test basic network connectivity** to internal ranges +3. **Check DNS resolution** for internal services +4. **Validate cluster endpoint accessibility** + +Start with a simple ping test to verify basic connectivity: + +```bash +# Test connectivity to internal network +ping 10.130.0.2 + +# Test DNS resolution +nslookup your-cluster-endpoint.us-east-1.eks.amazonaws.com +``` + + + + + +If the VPN shows as connected but you can't reach internal resources: + +1. **Disconnect and reconnect** the VPN client +2. **Check routing table** to ensure traffic is routed through VPN: + +```bash +# On Windows +route print + +# On macOS/Linux +route -n get 10.130.0.0 +netstat -rn | grep 10.130 +``` + +3. **Verify DNS settings** are pointing to internal DNS servers +4. **Test with different VPN protocols** if available (OpenVPN vs IKEv2) +5. **Check firewall settings** that might block VPN traffic + + + + + +To check your cluster configuration: + +1. 
**Review kubeconfig file** generated by SleakOps: + +```bash +# View current kubeconfig +kubectl config view + +# Look for the cluster server endpoint +kubectl config view -o jsonpath='{.clusters[0].cluster.server}' +``` + +2. **Verify the cluster endpoint format**: + + - Should look like: `https://XXXXXXXXXX.us-east-1.eks.amazonaws.com` + - Must be accessible through VPN + +3. **Test direct connectivity** to the endpoint: + +```bash +# Test HTTPS connectivity +curl -k https://your-cluster-endpoint.us-east-1.eks.amazonaws.com/version + +# Or use telnet to test port 443 +telnet your-cluster-endpoint.us-east-1.eks.amazonaws.com 443 +``` + + + + + +To verify internal workload connectivity: + +1. **Access SleakOps dashboard** and navigate to your production project +2. **Go to Workloads section** to see internal URLs +3. **Test internal URLs** while connected to VPN: + +```bash +# Example internal URL test +curl -I http://your-internal-service.internal.domain + +# Test with verbose output for debugging +curl -v http://your-internal-service.internal.domain +``` + +4. **Check service discovery**: + +```bash +# List services in your cluster +kubectl get services --all-namespaces + +# Test service connectivity +kubectl port-forward service/your-service 8080:80 +``` + + + + + +**Solution 1: Refresh VPN configuration** + +1. Download a fresh VPN configuration from SleakOps +2. Remove old VPN profiles from your client +3. Import the new configuration +4. Reconnect to VPN + +**Solution 2: DNS configuration** + +```bash +# Flush DNS cache (Windows) +ipconfig /flushdns + +# Flush DNS cache (macOS) +sudo dscacheutil -flushcache + +# Flush DNS cache (Linux) +sudo systemctl restart systemd-resolved +``` + +**Solution 3: Alternative connection methods** + +If VPN continues to fail: + +1. **Use kubectl proxy** for temporary access: + +```bash +kubectl proxy --port=8080 +# Access cluster through http://localhost:8080 +``` + +2. 
**Enable public endpoint** temporarily (if security allows) +3. **Use bastion host** for SSH tunneling + +**Solution 4: Network troubleshooting** + +```bash +# Check active network interfaces +ifconfig -a # macOS/Linux +ipconfig /all # Windows + +# Verify VPN interface is active and has correct IP +# Look for tun0, utun, or similar VPN interface +``` + + + + + +**Regular maintenance:** + +1. **Keep VPN client updated** to latest version +2. **Regularly refresh VPN certificates** before expiration +3. **Test connectivity** after any network changes +4. **Document working configurations** for quick recovery + +**Monitoring setup:** + +```bash +# Create a simple connectivity test script +#!/bin/bash +echo "Testing VPN connectivity..." +ping -c 3 10.130.0.2 +echo "Testing cluster endpoint..." +kubectl cluster-info +echo "Testing internal services..." +curl -I http://your-internal-service.domain +``` + +**Emergency access:** + +1. **Configure alternative access methods** (bastion host) +2. **Keep emergency contact information** for support +3. 
**Document troubleshooting steps** for your team + + + +--- + +_This FAQ was automatically generated on December 11, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-connection-disconnection-issues.mdx b/docs/troubleshooting/vpn-connection-disconnection-issues.mdx new file mode 100644 index 000000000..1ce31b84a --- /dev/null +++ b/docs/troubleshooting/vpn-connection-disconnection-issues.mdx @@ -0,0 +1,170 @@ +--- +sidebar_position: 3 +title: "VPN Connection Disconnection Issues" +description: "Troubleshooting VPN disconnections and credential updates in SleakOps" +date: "2024-08-01" +category: "user" +tags: ["vpn", "connection", "credentials", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VPN Connection Disconnection Issues + +**Date:** August 1, 2024 +**Category:** User +**Tags:** VPN, Connection, Credentials, Troubleshooting + +## Problem Description + +**Context:** Users experience intermittent VPN disconnections when accessing SleakOps services, requiring credential updates and reconnection to restore access to internal resources like Grafana dashboards. + +**Observed Symptoms:** + +- VPN connection drops unexpectedly +- Unable to access internal services (e.g., Grafana ingress) +- Connection issues resolved after credential update +- Services work correctly when VPN is properly connected + +**Relevant Configuration:** + +- VPN client: SleakOps provided VPN configuration +- Target services: Internal ingress endpoints (e.g., Grafana) +- Authentication: User credentials for VPN access + +**Error Conditions:** + +- VPN disconnects without user action +- Internal services become inaccessible +- Reconnection may fail with old credentials +- Issue resolves after credential refresh + +## Detailed Solution + + + +When experiencing VPN disconnection: + +1. **Disconnect completely** from the VPN +2. **Wait 10-15 seconds** before attempting to reconnect +3. 
**Reconnect to the VPN** using your client +4. **Test access** to internal services + +If reconnection fails, proceed to credential update steps. + + + + + +To update your VPN credentials: + +1. **Access SleakOps Dashboard** + + - Log in to your SleakOps account + - Navigate to **User Settings** or **VPN Access** + +2. **Generate new credentials** + + - Click **"Regenerate VPN Credentials"** + - Download the new configuration file + +3. **Update your VPN client** + - Remove old VPN profile + - Import the new configuration + - Connect using updated credentials + +```bash +# For OpenVPN clients +sudo openvpn --config /path/to/new-config.ovpn +``` + + + + + +After reconnecting, verify your access: + +1. **Check VPN status** + + ```bash + # Check your IP address + curl ifconfig.me + + # Should show SleakOps VPN IP range + ``` + +2. **Test internal service access** + + ```bash + # Test Grafana access (example) + curl -I https://grafana.prod.your-domain.com/login + + # Should return HTTP 200 or redirect + ``` + +3. **Verify DNS resolution** + ```bash + # Test internal DNS resolution + nslookup grafana.prod.your-domain.com + ``` + + + + + +To minimize VPN disconnections: + +1. **Enable auto-reconnect** in your VPN client settings +2. **Use keep-alive options** if available +3. **Check network stability** on your local connection +4. **Update VPN client** to the latest version + +**For OpenVPN clients:** + +```conf +# Add to your .ovpn file +keepalive 10 120 +ping-timer-rem +persist-tun +persist-key +``` + +**For network administrators:** + +- Configure firewall to allow VPN traffic +- Ensure UDP port 1194 (or configured port) is open +- Check for network equipment that might drop long connections + + + + + +**If problems persist:** + +1. **Check system logs** + + ```bash + # On Linux/macOS + tail -f /var/log/openvpn.log + + # On Windows + # Check Event Viewer > Applications and Services Logs > OpenVPN + ``` + +2. 
**Test different connection methods** + + - Try TCP instead of UDP + - Use different VPN server endpoints if available + - Test from different network locations + +3. **Contact support with details** + - VPN client version + - Operating system + - Error messages from logs + - Time when disconnection occurred + + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/vpn-dns-configuration-troubleshooting.mdx b/docs/troubleshooting/vpn-dns-configuration-troubleshooting.mdx new file mode 100644 index 000000000..ecbba3f4d --- /dev/null +++ b/docs/troubleshooting/vpn-dns-configuration-troubleshooting.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "VPN DNS Configuration Issues with Lens Cluster Access" +description: "Troubleshooting DNS configuration problems when accessing Kubernetes clusters through VPN" +date: "2024-12-19" +category: "user" +tags: ["vpn", "dns", "lens", "connectivity", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VPN DNS Configuration Issues with Lens Cluster Access + +**Date:** December 19, 2024 +**Category:** User +**Tags:** VPN, DNS, Lens, Connectivity, Troubleshooting + +## Problem Description + +**Context:** Users experience timeout issues when trying to access Kubernetes clusters through Lens after VPN connection, despite following DNS configuration guides. 
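Much of what follows comes down to nameserver ordering in `/etc/resolv.conf`: the VPC resolver has to be listed first. A small, hypothetical Python check, assuming the `10.0.0.2` VPC DNS server used in this article's examples:

```python
def vpc_dns_is_first(resolv_conf_text: str, vpc_dns: str = "10.0.0.2") -> bool:
    """Check that the VPC resolver is the first nameserver entry in resolv.conf text."""
    nameservers = [
        line.split()[1]
        for line in resolv_conf_text.splitlines()
        if line.strip().startswith("nameserver ") and len(line.split()) >= 2
    ]
    return bool(nameservers) and nameservers[0] == vpc_dns

# usage sketch:
#   vpc_dns_is_first(open("/etc/resolv.conf").read())
```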
+ +**Observed Symptoms:** + +- Persistent timeout errors when connecting to cluster via Lens +- VPN connection appears to be working +- DNS configuration in `/etc/resolv.conf` stops working after previously functioning +- Connection issues appear intermittently or after system changes + +**Relevant Configuration:** + +- Operating System: Ubuntu 24.04.2 LTS (or similar Linux distributions) +- VPN Client: Pritunl +- Kubernetes Client: Lens +- DNS Configuration: `/etc/resolv.conf` + +**Error Conditions:** + +- Timeout occurs when accessing cluster through Lens +- Problem persists even after VPN reconnection +- DNS configuration changes don't take effect +- Previously working configuration suddenly stops functioning + +## Detailed Solution + + + +The most common issue is incorrect DNS configuration in `/etc/resolv.conf`. Instead of appending DNS servers, you should replace the entire content: + +**Incorrect approach (appending):** + +```bash +# Don't do this - appending to existing content +echo "nameserver 10.0.0.2" >> /etc/resolv.conf +``` + +**Correct approach (replacing):** + +```bash +# Replace the entire content, with the VPC DNS server first +sudo tee /etc/resolv.conf > /dev/null <<EOF +nameserver 10.0.0.2 +nameserver 8.8.8.8 +nameserver 8.8.4.4 +EOF +``` + + + + + +For a more permanent solution, configure DNS through your system's network manager: + +**For Ubuntu/GNOME:** + +1. Go to **Settings** → **Network** +2. Click on your **WiFi** or **Ethernet** connection +3. Go to **IPv4** or **IPv6** tab +4. Set **DNS** to **Manual** +5. Add DNS servers: + - Primary: `10.0.0.2` (VPC DNS server) + - Secondary: `8.8.8.8` (Google DNS) + - Tertiary: `8.8.4.4` (Google DNS backup) +6. Click **Apply** + +**For command line:** + +```bash +# Using nmcli +nmcli con mod "Your-Connection-Name" ipv4.dns "10.0.0.2,8.8.8.8" +nmcli con down "Your-Connection-Name" +nmcli con up "Your-Connection-Name" +``` + + + + + +Ensure your Pritunl client is configured to handle DNS properly: + +1. Open **Pritunl Client** +2. Click on the **gear icon** next to your profile +3. 
Check **Advanced Settings**: + - Enable **DNS Suffix** if available + - Enable **Force DNS Configuration** (this forces the client to override system DNS) + - Disable **Block Outside DNS** if enabled +4. Reconnect to the VPN + +**Alternative Pritunl configuration:** + +```bash +# If using Pritunl from command line +pritunl-client enable [profile-id] +pritunl-client start [profile-id] --dns-force +``` + + + + + +After configuring DNS, verify that resolution works correctly: + +```bash +# Check current DNS configuration +cat /etc/resolv.conf + +# Test DNS resolution for your cluster +nslookup your-cluster-endpoint.amazonaws.com + +# Test with dig for more details +dig your-cluster-endpoint.amazonaws.com + +# Test connectivity to the cluster +kubectl cluster-info + +# Test from Lens - check connection status +``` + +**Expected output:** + +- `/etc/resolv.conf` should show your VPC DNS server first +- `nslookup` should resolve to internal IP addresses +- `kubectl cluster-info` should connect successfully + + + + + +If DNS configuration keeps getting overwritten: + +**Make resolv.conf immutable:** + +```bash +# After setting correct DNS configuration +sudo chattr +i /etc/resolv.conf + +# To make it mutable again later +sudo chattr -i /etc/resolv.conf +``` + +**Use systemd-resolved (Ubuntu 18.04+):** + +```bash +# Configure systemd-resolved +sudo systemctl enable systemd-resolved +sudo systemctl start systemd-resolved + +# Edit resolved configuration +sudo nano /etc/systemd/resolved.conf + +# Add your DNS servers +[Resolve] +DNS=10.0.0.2 8.8.8.8 +Domains=~. + +# Restart the service +sudo systemctl restart systemd-resolved + +# Link resolv.conf to systemd-resolved +sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf +``` + + + + + +If DNS is working but Lens still has issues: + +1. **Clear Lens cache:** + + ```bash + # Remove Lens configuration and cache + rm -rf ~/.config/Lens + rm -rf ~/.cache/Lens + ``` + +2. 
**Check kubeconfig:** + + ```bash + # Verify kubeconfig is accessible + kubectl config current-context + kubectl config view + ``` + +3. **Test direct connection:** + + ```bash + # Test if you can reach the API server + curl -k https://your-cluster-endpoint.amazonaws.com/version + ``` + +4. **Lens proxy settings:** + - Open Lens + - Go to **File** → **Preferences** → **Proxy** + - Ensure proxy settings don't interfere with VPN + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-dns-resolution-kubeconfig.mdx b/docs/troubleshooting/vpn-dns-resolution-kubeconfig.mdx new file mode 100644 index 000000000..8d125da49 --- /dev/null +++ b/docs/troubleshooting/vpn-dns-resolution-kubeconfig.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "VPN DNS Resolution Issues with Kubeconfig" +description: "Solution for DNS resolution problems when connecting to EKS clusters through VPN" +date: "2024-01-15" +category: "user" +tags: ["vpn", "dns", "kubeconfig", "pritunl", "ubuntu", "lens"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VPN DNS Resolution Issues with Kubeconfig + +**Date:** January 15, 2024 +**Category:** User +**Tags:** VPN, DNS, Kubeconfig, Pritunl, Ubuntu, Lens + +## Problem Description + +**Context:** Users experience connectivity issues when trying to access EKS clusters through VPN connection using tools like Lens or kubectl. The problem is related to DNS resolution conflicts between the VPN client (Pritunl) and certain Linux distributions like Ubuntu. 
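The IP-pinning workaround this article describes can be scripted instead of hand-editing the file. A hedged, stdlib-only sketch; the endpoint pattern and IP below are the placeholder values used in this article's examples:

```python
import re

def pin_cluster_ip(kubeconfig_text: str, ip: str) -> str:
    """Replace an EKS DNS server endpoint in kubeconfig text with a fixed internal IP."""
    pattern = r"server: https://[A-Z0-9]+\.[a-z0-9]+\.[a-z0-9-]+\.eks\.amazonaws\.com"
    return re.sub(pattern, f"server: https://{ip}", kubeconfig_text)

# usage sketch (back up ~/.kube/config before rewriting it):
#   updated = pin_cluster_ip(kubeconfig_text, "10.130.96.192")
```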
+ +**Observed Symptoms:** + +- Cannot connect to EKS cluster through Lens despite being connected to VPN +- DNS resolution fails for EKS cluster endpoints +- Connection works with direct IP addresses but not with DNS names +- Issue occurs specifically on Ubuntu and similar Linux distributions + +**Relevant Configuration:** + +- VPN Client: Pritunl +- Operating System: Ubuntu (and similar Linux distributions) +- Cluster Access: EKS clusters in AWS +- Tools: Lens, kubectl +- Connection Method: VPN + kubeconfig + +**Error Conditions:** + +- Error occurs when VPN is connected +- DNS resolution fails for EKS endpoints +- Problem persists across different clusters (Dev/Prod) +- Issue is distribution-specific (Ubuntu-based systems) + +## Detailed Solution + + + +This issue occurs because certain Linux distributions (particularly Ubuntu) have DNS configuration conflicts when using Pritunl VPN. The system cannot properly resolve the EKS cluster DNS names through the VPN tunnel, causing connection failures. + +The problem affects: + +- Ubuntu and Ubuntu-based distributions +- Systems with systemd-resolved +- Configurations where VPN DNS settings conflict with local DNS + + + + + +The most effective solution is to replace the EKS cluster DNS endpoint with its direct IP address in your kubeconfig file. + +**Steps:** + +1. **Locate your kubeconfig file** (usually at `~/.kube/config`) +2. **Find the server line** that looks like: + ```yaml + server: https://9CFEED5AD69EF4F87D19D6FF9FBF7AD9.gr7.us-east-1.eks.amazonaws.com + ``` +3. **Replace it with the IP address:** + + ```yaml + # For Production cluster + server: https://10.130.96.192 + + # For Development cluster + server: https://10.110.98.134 + ``` + +4. **Save the file** and test the connection + +**Example kubeconfig modification:** + +```yaml +apiVersion: v1 +clusters: + - cluster: + certificate-authority-data: LS0tLS1CRUdJTi... 
+ server: https://10.130.96.192 # Changed from DNS to IP + name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +contexts: + - context: + cluster: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + user: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +current-context: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +kind: Config +preferences: {} +users: + - name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + user: + exec: + apiVersion: client.authentication.k8s.io/v1beta1 + command: aws + args: + - eks + - get-token + - --cluster-name + - my-cluster +``` + + + + + +If the IP replacement doesn't work permanently, you can try resetting the DNS and network connection in Pritunl: + +**Steps:** + +1. **Open Pritunl client** +2. **Look for network options** (usually in settings or connection options) +3. **Find DNS reset buttons** - there should be options to: + - Reset DNS settings + - Reset network connection +4. **Click both reset buttons** +5. **Reconnect to the VPN** +6. **Test cluster connectivity** + +**Note:** This solution may be temporary and you might need to repeat it periodically. + + + + + +When using Lens with the modified kubeconfig: + +**Steps:** + +1. **Ensure VPN is connected** before opening Lens +2. **Import the modified kubeconfig** with IP addresses instead of DNS names +3. **Add cluster in Lens** using the updated configuration +4. **Test connection** - you should now be able to see logs and cluster resources + +**Troubleshooting Lens connection:** + +- Verify VPN connection is active +- Check that the IP address in kubeconfig matches your environment (Dev/Prod) +- Ensure AWS credentials are properly configured +- Try refreshing the cluster connection in Lens + + + + + +**Important:** Different environments use different IP addresses. 
Make sure you're using the correct IP for each environment: + +**Development Environment:** + +```yaml +server: https://10.110.98.134 +``` + +**Production Environment:** + +```yaml +server: https://10.130.96.192 +``` + +**Getting the correct IP:** +If you need to find the IP address for your specific cluster: + +1. Contact SleakOps support for the current IP addresses +2. Or use `nslookup` when not connected to VPN: + ```bash + nslookup your-cluster-endpoint.gr7.us-east-1.eks.amazonaws.com + ``` + + + + + +**Current Status:** The SleakOps team is working on a permanent solution to this DNS resolution issue. + +**Temporary Workaround:** The IP address replacement method described above is the current recommended workaround. + +**Future Updates:** Once a permanent solution is implemented, users will be notified and can revert to using DNS names in their kubeconfig files. + +**Affected Systems:** + +- Ubuntu and Ubuntu-based distributions +- Systems using systemd-resolved +- Certain network configurations that conflict with Pritunl VPN + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx b/docs/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx new file mode 100644 index 000000000..85096fb35 --- /dev/null +++ b/docs/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "VPN Connection Issues with Kubernetes Cluster Access" +description: "Troubleshooting VPN connectivity problems when accessing Kubernetes clusters through Lens" +date: "2025-01-28" +category: "user" +tags: ["vpn", "kubernetes", "lens", "connectivity", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# VPN Connection Issues with Kubernetes Cluster Access + +**Date:** January 28, 2025 +**Category:** User +**Tags:** VPN, 
Kubernetes, Lens, Connectivity, Troubleshooting + +## Problem Description + +**Context:** User successfully connects to VPN using Pritunl but cannot access the Kubernetes cluster through Lens IDE. The connection attempts result in timeout errors, and there are concerns about DNS resolution and network segment configuration. + +**Observed Symptoms:** + +- Pritunl VPN connection appears successful +- Lens IDE shows timeout errors when trying to connect to Kubernetes cluster +- Cluster resolves to public IP instead of internal IP +- VPN assigns different network segment than expected for Kubernetes cluster +- AWS console shows cluster update notifications + +**Relevant Configuration:** + +- VPN Client: Pritunl +- Kubernetes IDE: Lens +- Platform: AWS EKS +- Connection method: kubeconfig import +- DNS resolution: Forced DNS configuration attempted + +**Error Conditions:** + +- Timeout errors occur when Lens attempts to connect +- Problem persists despite successful VPN connection +- DNS resolution points to public IP instead of private cluster endpoint +- Issue occurs even with forced DNS settings + +## Detailed Solution + + + +First, verify that your VPN is properly resolving the cluster's internal endpoint: + +1. **Check your kubeconfig file** to find the cluster endpoint: + + ```bash + cat ~/.kube/config | grep server + ``` + +2. **Test DNS resolution** with the VPN connected: + + ```bash + nslookup your-cluster-endpoint.amazonaws.com + ping your-cluster-endpoint.amazonaws.com + ``` + +3. **Test HTTP connectivity**: + ```bash + curl -k https://your-cluster-endpoint.amazonaws.com + ``` + +The cluster should resolve to a private IP address (10.x.x.x or 172.x.x.x range) when VPN is active. + + + + + +Multiple VPN connections can interfere with DNS resolution: + +1. **Disconnect all other VPN connections**: + + - Check system tray for active VPN clients + - Disable any corporate VPN connections + - Close other VPN applications + +2. 
**Verify Pritunl is the only active VPN**: + + ```bash + # On Windows + ipconfig /all + + # On macOS/Linux + ifconfig + ``` + +3. **Check routing table**: + + ```bash + # Windows + route print + + # macOS/Linux + route -n + ``` + +Ensure the VPN route has priority over other network interfaces. + + + + + +Lens may cache the cluster endpoint resolution: + +1. **Close Lens completely** +2. **Ensure VPN is connected and stable** +3. **Clear Lens cache** (optional): + + - Windows: `%APPDATA%\Lens` + - macOS: `~/Library/Application Support/Lens` + - Linux: `~/.config/Lens` + +4. **Restart Lens and re-import kubeconfig**: + + - File → Add Cluster + - Import your kubeconfig file again + - Test connection + +5. **Verify cluster context**: + ```bash + kubectl config current-context + kubectl cluster-info + ``` + + + + + +If basic troubleshooting doesn't work: + +1. **Check VPN assigned IP range**: + + ```bash + # Find your VPN interface + ip addr show | grep tun + # or + ifconfig | grep -A 5 tun + ``` + +2. **Verify you can reach the cluster's network**: + + ```bash + # Try to reach the cluster's internal network + ping 10.0.0.1 # Replace with your cluster's network + ``` + +3. **Test kubectl directly**: + + ```bash + kubectl get nodes + kubectl get pods --all-namespaces + ``` + +4. **Check DNS settings**: + + ```bash + # Windows + nslookup + server + + # macOS/Linux + cat /etc/resolv.conf + ``` + + + + + +Regarding the cluster update notification in AWS console: + +1. **This is normal** - AWS regularly notifies about available updates +2. **Don't update during troubleshooting** - Focus on connectivity first +3. **Updates should be planned** - Coordinate with your SleakOps team + +**To check cluster version**: + +```bash +kubectl version --short +``` + +**Note**: Cluster updates don't typically cause VPN connectivity issues. + + + + + +If Lens continues to have issues: + +1. **Use kubectl directly**: + + ```bash + kubectl get nodes + kubectl get pods + ``` + +2. 
**Try other Kubernetes IDEs**: + + - K9s (terminal-based) + - Octant (web-based) + - VS Code with Kubernetes extension + +3. **Web-based access** (if available): + + - Kubernetes Dashboard + - SleakOps web interface + +4. **Verify with SleakOps support**: + - Confirm VPN configuration + - Check cluster accessibility settings + - Verify user permissions + + + +--- + +_This FAQ was automatically generated on January 28, 2025 based on a real user query._ diff --git a/docs/troubleshooting/vpn-mobile-access-configuration.mdx b/docs/troubleshooting/vpn-mobile-access-configuration.mdx new file mode 100644 index 000000000..e9a5c05df --- /dev/null +++ b/docs/troubleshooting/vpn-mobile-access-configuration.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Mobile VPN Access Configuration" +description: "How to configure VPN access from mobile devices using Pritunl Server" +date: "2024-03-10" +category: "user" +tags: ["vpn", "mobile", "pritunl", "openvpn", "aws", "access"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Mobile VPN Access Configuration + +**Date:** March 10, 2024 +**Category:** User +**Tags:** VPN, Mobile, Pritunl, OpenVPN, AWS, Access + +## Problem Description + +**Context:** Users need to access SleakOps VPN from mobile devices for testing applications without deploying to production environments. 
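Before transferring a profile to a phone, it can save a round trip to confirm the `.ovpn` file contains the directives a mobile OpenVPN client needs. A hypothetical sketch; the directive list is an illustrative assumption, not a SleakOps requirement:

```python
REQUIRED_DIRECTIVES = ("client", "remote", "dev", "proto")

def missing_directives(ovpn_text: str) -> list:
    """Return OpenVPN directives absent from a profile (empty list means it looks complete)."""
    present = {
        line.split()[0]
        for line in ovpn_text.splitlines()
        # skip blanks, comments, and inline blocks such as <ca>...</ca>
        if line.strip() and not line.lstrip().startswith(("#", ";", "<"))
    }
    return [d for d in REQUIRED_DIRECTIVES if d not in present]
```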
+ +**Observed Symptoms:** + +- Unable to install Pritunl client on mobile device +- Need to perform mobile testing through VPN connection +- Requirement to test applications from mobile without production deployment + +**Relevant Configuration:** + +- VPN Server: Pritunl Server running on EC2 instance +- Protocol: OpenVPN compatible +- Access method: AWS Secrets Manager for credentials +- Target environments: Development and Production + +**Error Conditions:** + +- Pritunl client not available for mobile installation +- Need alternative connection method for mobile devices +- Testing requirements from mobile devices + +## Detailed Solution + + + +SleakOps VPN infrastructure uses: + +- **Pritunl Server**: Runs on EC2 instance in AWS +- **OpenVPN Protocol**: Compatible with standard OpenVPN clients +- **Native Mobile Support**: Most mobile devices support OpenVPN natively +- **Multi-environment Access**: Separate configurations for development and production + + + + + +To get your VPN configuration: + +1. **Log into AWS Console** with your SleakOps user credentials +2. **Switch to target account** (development or production) +3. **Navigate to AWS Systems Manager**: + - Search for "Systems Manager" in AWS services + - Or go to "Secrets Manager" service +4. **Find Pritunl Secret**: + - Look for secrets related to Pritunl + - Click on the secret to view details +5. **Reveal Secret Values**: + - Click "Retrieve secret value" button + - Note down: IP address, username, and password + + + + + +Once you have the credentials: + +1. **Open web browser** on any device +2. **Navigate to Pritunl IP** (from AWS secret) +3. **Login with credentials**: + - Username: (from AWS secret) + - Password: (from AWS secret) +4. **Access Dashboard**: You'll see the Pritunl management interface + + + + + +In the Pritunl dashboard: + +1. **Find your user account** in the users list +2. 
**Download connection profile** in multiple formats: + - **ZIP file**: Contains all configuration files + - **OVPN file**: Standard OpenVPN configuration + - **URL config**: For quick mobile setup +3. **Choose mobile-friendly format**: OVPN is recommended for mobile + + + + + +### iOS Devices: + +1. **Download OpenVPN Connect** from App Store +2. **Import OVPN file**: + - Email the OVPN file to yourself + - Open from email and choose "Open in OpenVPN" + - Or use the URL config for direct import +3. **Connect**: Tap connect in OpenVPN app + +### Android Devices: + +1. **Download OpenVPN for Android** from Play Store +2. **Import configuration**: + - Transfer OVPN file to device + - Open with OpenVPN app + - Or use built-in VPN settings with OVPN file +3. **Alternative**: Use native Android VPN settings + +### Native Mobile VPN (Alternative): + +Most modern smartphones support OpenVPN natively: + +- Go to **Settings** → **VPN** +- Add new VPN configuration +- Import OVPN file or enter details manually + + + + + +Once VPN is configured: + +1. **Connect to VPN** from mobile device +2. **Access development environment**: + - Your applications will be accessible through internal URLs + - No need to deploy to production for testing +3. **Perform mobile testing**: + - Test responsive design + - Verify mobile-specific functionality + - Check performance on mobile networks +4. 
**Disconnect VPN** when testing is complete + +**Benefits:** + +- Test on real mobile devices +- Access development environment securely +- No production deployment required +- Maintain development/production separation + + + + + +**Connection Issues:** + +- Verify credentials are correct +- Check mobile data/WiFi connectivity +- Ensure VPN server is running (check AWS EC2 instance) + +**Profile Import Issues:** + +- Try different import methods (email, cloud storage, URL) +- Verify OVPN file is not corrupted +- Use ZIP format if OVPN fails + +**Performance Issues:** + +- Mobile networks may have higher latency +- Consider testing on both WiFi and cellular +- VPN adds encryption overhead + +**Alternative Solutions:** + +- Use browser-based testing tools +- Set up port forwarding for specific services +- Create temporary public endpoints for testing + + + +--- + +_This FAQ was automatically generated on March 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx b/docs/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx new file mode 100644 index 000000000..b65169107 --- /dev/null +++ b/docs/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx @@ -0,0 +1,163 @@ +--- +sidebar_position: 3 +title: "Pritunl VPN Connection Issues" +description: "Troubleshooting guide for Pritunl VPN connection problems in SleakOps" +date: "2024-12-19" +category: "user" +tags: ["vpn", "pritunl", "connection", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pritunl VPN Connection Issues + +**Date:** December 19, 2024 +**Category:** User +**Tags:** VPN, Pritunl, Connection, Troubleshooting + +## Problem Description + +**Context:** Users experiencing connectivity issues with Pritunl VPN in SleakOps production environment, where the VPN client shows "connecting" status but never establishes a successful connection. 
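When the client sits in "connecting" with no error, a raw TCP probe against the server (the same idea as the `curl` check in this guide's first step, minus TLS) helps separate network blocks from profile problems. A minimal sketch; the host and port are placeholders, and a TCP probe says nothing about UDP-mode OpenVPN:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_reach("3.82.69.46", 443)  # HTTPS/TCP port of the VPN server
```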
+ +**Observed Symptoms:** + +- Pritunl client remains in "connecting" state indefinitely +- No error messages displayed in the client +- Unable to access internal resources through VPN +- VPN server appears to be running but connections fail + +**Relevant Configuration:** + +- VPN Type: Pritunl OpenVPN +- Environment: Production +- Client: Pritunl desktop application +- Profile format: .ovpn configuration file + +**Error Conditions:** + +- Connection attempts timeout without establishing tunnel +- Issue may be intermittent or persistent +- Problem can occur after profile updates or network changes +- Local network configuration may interfere with connection + +## Detailed Solution + + + +Before troubleshooting the VPN profile, verify basic connectivity to the VPN server: + +1. **Test HTTPS connectivity** to the VPN server IP: + ```bash + # Replace with your actual VPN server IP + curl -k https://3.82.69.46/ + ``` +2. **Expected behavior**: You should see an HTTPS response (even with certificate warnings) + +3. **If no response**: The issue is likely network-related (firewall, ISP blocking, etc.) + + + + + +The most common solution is to regenerate the VPN profile: + +1. **Access SleakOps Dashboard** +2. Navigate to **VPN Settings** or **User Profile** +3. **Generate new Pritunl URL**: + - Look for "Generate VPN Profile" or similar option + - Click to create a new temporary download link +4. **Download fresh profile**: + - The generated URLs have limited validity (few hours) + - Download the new .ovpn profile immediately + + + + + +Complete profile reinstallation process: + +1. **Remove existing profile**: + + - Open Pritunl client + - Right-click on the problematic profile + - Select "Delete" or "Remove" + +2. **Clear any cached data**: + + - Close Pritunl client completely + - Restart the application + +3. 
**Install new profile**: + - Import the newly downloaded .ovpn file + - Verify profile settings are correct + - Attempt connection + + + + + +If profile regeneration doesn't work, try network-level troubleshooting: + +1. **Test different networks**: + + - Try connecting from mobile hotspot + - Test from different WiFi network + - This helps identify ISP or local network issues + +2. **Check firewall settings**: + + - Ensure OpenVPN ports are not blocked + - Common ports: 1194 (UDP), 443 (TCP) + - Temporarily disable local firewall for testing + +3. **DNS resolution**: + - Verify VPN server hostname resolves correctly + - Try using IP address instead of hostname in profile + + + + + +For administrators with access to SleakOps backend: + +1. **Access Secrets Manager**: + + - Find Pritunl server credentials + - Access Pritunl admin console + +2. **Check server logs**: + + - Review connection attempts in Pritunl logs + - Look for authentication failures or network errors + +3. **Verify server status**: + - Ensure Pritunl service is running + - Check server resource usage + - Verify network connectivity from server side + + + + + +If standard troubleshooting doesn't resolve the issue: + +1. **Try different VPN protocols**: + + - Switch between UDP and TCP if options available + - Test different ports if configurable + +2. **Update Pritunl client**: + + - Ensure you're using the latest version + - Check for compatibility issues + +3. 
**Contact support with details**: + - Provide VPN server IP and connection logs + - Include network environment details + - Specify when the issue started occurring + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx b/docs/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx new file mode 100644 index 000000000..08f23a495 --- /dev/null +++ b/docs/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "Pritunl VPN DNS Resolution Issues" +description: "Troubleshooting DNS resolution problems with Pritunl VPN connections" +date: "2024-01-15" +category: "user" +tags: ["vpn", "pritunl", "dns", "networking", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pritunl VPN DNS Resolution Issues + +**Date:** January 15, 2024 +**Category:** User +**Tags:** VPN, Pritunl, DNS, Networking, Troubleshooting + +## Problem Description + +**Context:** Users experience intermittent DNS resolution problems when connecting to SleakOps infrastructure through Pritunl VPN from different locations and operating systems. 
+ +**Observed Symptoms:** + +- Pritunl VPN connection works intermittently +- DNS resolution fails on some operating systems +- Connection issues vary by geographic location +- Problem appears inconsistently across different client systems + +**Relevant Configuration:** + +- VPN Client: Pritunl +- Connection locations: Multiple IPs across Argentina +- Production DNS server: 10.130.0.2 +- Multiple operating systems affected + +**Error Conditions:** + +- DNS resolution failures occur sporadically +- Problem varies by operating system +- Issues appear when connecting from different geographic locations +- Inconsistent behavior across connection attempts + +## Detailed Solution + + + +The first troubleshooting step is to reset Pritunl's DNS and networking configurations: + +**Steps to reset:** + +1. Open your Pritunl client +2. Go to **Settings** or **Preferences** +3. Look for **Advanced Options** or **Network Settings** +4. Click on **Reset DNS** +5. Click on **Reset Networking** +6. Restart the Pritunl client +7. Reconnect to your VPN profile + +**What this does:** + +- Clears cached DNS entries +- Resets network interface configurations +- Forces fresh DNS resolver setup + + + + + +If the reset doesn't resolve the issue, manually configure the production DNS server: + +**Add DNS Server 10.130.0.2:** + +**On Windows:** + +1. Go to **Network and Sharing Center** +2. Click on your active network connection +3. Click **Properties** +4. Select **Internet Protocol Version 4 (TCP/IPv4)** +5. Click **Properties** +6. Select **Use the following DNS server addresses** +7. Add `10.130.0.2` as primary DNS +8. Click **OK** + +**On macOS:** + +1. Go to **System Preferences** → **Network** +2. Select your network connection +3. Click **Advanced** +4. Go to **DNS** tab +5. Click **+** and add `10.130.0.2` +6. 
Click **OK** + +**On Linux:** + +```bash +# Edit resolv.conf +sudo nano /etc/resolv.conf + +# Add this line at the top +nameserver 10.130.0.2 + +# Or use systemd-resolved +sudo systemctl edit systemd-resolved + +# Add: +[Resolve] +DNS=10.130.0.2 +``` + + + + + +You can also configure DNS directly in your Pritunl connection profile: + +1. Open Pritunl client +2. Find your connection profile +3. Click the **gear icon** or **Edit** +4. Look for **DNS Settings** or **Advanced** +5. Add `10.130.0.2` to the DNS servers list +6. Save the profile +7. Reconnect + +**Alternative method:** + +```bash +# If using Pritunl from command line +pritunl-client add [profile-file] +pritunl-client start [profile-id] --dns 10.130.0.2 +``` + + + + + +**Windows Specific:** + +- Disable IPv6 if not needed: `netsh interface ipv6 set global randomizeidentifiers=disabled` +- Flush DNS cache: `ipconfig /flushdns` +- Reset Winsock: `netsh winsock reset` + +**macOS Specific:** + +- Flush DNS cache: `sudo dscacheutil -flushcache` +- Reset network settings: `sudo ifconfig en0 down && sudo ifconfig en0 up` + +**Linux Specific:** + +- Restart NetworkManager: `sudo systemctl restart NetworkManager` +- Flush DNS cache: `sudo systemd-resolve --flush-caches` +- Check DNS resolution: `nslookup [domain] 10.130.0.2` + + + + + +For users connecting from different locations in Argentina: + +**Connection Optimization:** + +1. Choose the closest Pritunl server location +2. Test different connection protocols (UDP vs TCP) +3. 
Adjust MTU size if needed: + + ```bash + # Test optimal MTU + ping -f -l 1472 [server-ip] + + # Set MTU in Pritunl + # Add to profile: tun-mtu 1500 + ``` + +**Network Quality Testing:** + +```bash +# Test DNS resolution speed +dig @10.130.0.2 [your-domain] + +# Test connection quality +ping -c 10 10.130.0.2 + +# Trace route to identify issues +traceroute 10.130.0.2 +``` + + + + + +After applying the fixes, verify DNS resolution is working: + +**Test Commands:** + +```bash +# Test DNS resolution +nslookup google.com 10.130.0.2 +dig @10.130.0.2 your-internal-domain.com + +# Test internal services +ping internal-service.sleakops.local +curl -I https://your-api.sleakops.com + +# Check current DNS servers +# On Linux/macOS: +cat /etc/resolv.conf + +# On Windows: +ipconfig /all | findstr "DNS Servers" +``` + +**Expected Results:** + +- DNS queries should resolve quickly +- Internal domains should be accessible +- No timeout errors +- Consistent resolution across multiple attempts + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/web-service-dns-404-error.mdx b/docs/troubleshooting/web-service-dns-404-error.mdx new file mode 100644 index 000000000..3b8d709d9 --- /dev/null +++ b/docs/troubleshooting/web-service-dns-404-error.mdx @@ -0,0 +1,206 @@ +--- +sidebar_position: 3 +title: "Web Service DNS 404 Error" +description: "Solution for web service returning 404 error despite pod running correctly" +date: "2024-01-15" +category: "workload" +tags: ["webservice", "dns", "404", "troubleshooting", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web Service DNS 404 Error + +**Date:** January 15, 2024 +**Category:** Workload +**Tags:** Web Service, DNS, 404, Troubleshooting, Networking + +## Problem Description + +**Context:** User creates a new web service in production environment with DNS registration. 
The pod is running correctly and responds to port forwarding, but accessing the URL returns a 404 error. + +**Observed Symptoms:** + +- Pod is running correctly in Kubernetes (visible in Lens) +- Port forwarding to the pod works properly +- DNS URL returns 404 error: "This [service-name].[domain].com page can't be found" +- Error indicates the webpage wasn't found, not that the IP couldn't be resolved +- Browser shows: "No webpage was found for the web address" + +**Relevant Configuration:** + +- Environment: Production +- Service type: Web service +- DNS registration: Configured +- Pod status: Running and responding +- Load balancer: Recently had issues (mentioned as resolved) + +**Error Conditions:** + +- Error occurs when accessing the public URL +- Pod responds correctly to direct port forwarding +- DNS resolution appears to work (no IP resolution error) +- 404 error suggests routing or ingress configuration issue + +## Detailed Solution + + + +When a pod works with port forwarding but returns 404 via DNS, the issue is typically in the ingress/routing layer: + +1. **Ingress Controller**: The ingress may not be properly configured +2. **Service Configuration**: The Kubernetes service might not be correctly exposing the pod +3. **Load Balancer**: Recent load balancer issues might have affected routing rules +4. **Path Routing**: The ingress might be expecting a different path or host configuration + + + + + +Check if the Kubernetes Service is properly configured: + +```bash +# Check if the service exists and has endpoints +kubectl get svc -n [namespace] +kubectl describe svc [service-name] -n [namespace] + +# Verify endpoints are populated +kubectl get endpoints [service-name] -n [namespace] +``` + +The service should: + +- Have the correct selector matching your pod labels +- Show endpoints pointing to your pod IP +- Use the correct port configuration + + + + + +Check the ingress configuration in SleakOps: + +1. 
**In SleakOps Dashboard**: + + - Go to your project → Web Services + - Check the DNS configuration + - Verify the path and host settings + +2. **In Kubernetes**: + +```bash +# Check ingress resources +kubectl get ingress -n [namespace] +kubectl describe ingress [ingress-name] -n [namespace] +``` + +3. **Common Issues**: + - Incorrect host configuration + - Missing or wrong path rules + - Backend service name mismatch + + + + + +Since there were recent load balancer issues: + +1. **Check Load Balancer Health**: + +```bash +# Check if load balancer is receiving traffic +kubectl get svc -n ingress-nginx +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller +``` + +2. **Verify DNS Propagation**: + +```bash +# Check if DNS is resolving to the correct IP +nslookup [your-domain].com +dig [your-domain].com +``` + +3. **Test Load Balancer Directly**: + - Get the load balancer IP + - Test with curl using Host header: + +```bash +curl -H "Host: [your-domain].com" http://[load-balancer-ip]/ +``` + + + + + +**Solution 1: Recreate the DNS registration** + +1. In SleakOps, remove the DNS configuration +2. Wait 2-3 minutes +3. Re-add the DNS configuration + +**Solution 2: Check application path configuration** + +```yaml +# Ensure your application is serving on the correct path +# If your app serves on /app, configure ingress accordingly +path: / +pathType: Prefix +``` + +**Solution 3: Verify application is listening on correct port** + +```bash +# Port forward and check what port the app is actually using +kubectl port-forward pod/[pod-name] 8080:8080 +# Test different ports if 8080 doesn't work +``` + +**Solution 4: Check for recent changes** +If this worked before, check: + +- Recent deployments that might have changed the application +- Load balancer configuration changes +- DNS or ingress rule modifications + + + + + +1. **Verify pod is healthy**: + +```bash +kubectl get pods -n [namespace] +kubectl logs [pod-name] -n [namespace] +``` + +2. 
**Test service connectivity**: + +```bash +# From within the cluster +kubectl run test-pod --image=curlimages/curl -it --rm -- sh +# Inside the pod: +curl http://[service-name].[namespace].svc.cluster.local:[port] +``` + +3. **Check ingress logs**: + +```bash +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller | grep [your-domain] +``` + +4. **Validate ingress rules**: + +```bash +kubectl get ingress [ingress-name] -o yaml +``` + +5. **Test with different paths**: + - Try accessing `https://[domain].com/health` or other known endpoints + - Check if the application has a specific base path requirement + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/web-service-domain-configuration.mdx b/docs/troubleshooting/web-service-domain-configuration.mdx new file mode 100644 index 000000000..48834aaf1 --- /dev/null +++ b/docs/troubleshooting/web-service-domain-configuration.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Web Service Domain Configuration Issue" +description: "How to fix incorrect domain configuration in web services" +date: "2024-03-10" +category: "workload" +tags: ["web-service", "domain", "configuration", "staging"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web Service Domain Configuration Issue + +**Date:** March 10, 2024 +**Category:** Workload +**Tags:** Web Service, Domain, Configuration, Staging + +## Problem Description + +**Context:** When configuring a web service in SleakOps, the domain configuration resulted in an incorrect nested subdomain structure instead of the intended clean subdomain. 
+ +**Observed Symptoms:** + +- Domain appears as `supra.staging.supra.social` (nested structure) +- Expected domain should be `staging.supra.social` (clean subdomain) +- Domain configuration created a "frankenstein" structure +- Web service is accessible but with wrong URL + +**Relevant Configuration:** + +- Current domain: `supra.staging.supra.social` +- Desired domain: `staging.supra.social` +- Service type: Web service +- Environment: Staging + +**Error Conditions:** + +- Incorrect domain configuration during web service setup +- Domain nesting issue creating redundant subdomain +- Need to reconfigure existing web service domain + +## Detailed Solution + + + +To fix the domain configuration in your web service: + +1. **Navigate to your project** in the SleakOps dashboard +2. **Go to Workloads** section +3. **Find your web service** in the list +4. **Click on the web service** to open its configuration +5. **Click Edit** or the configuration icon + + + + + +In the web service configuration: + +1. **Locate the Domain Configuration section** +2. **Clear the current domain field** if it shows the incorrect domain +3. **Enter the correct domain**: `staging.supra.social` +4. **Save the configuration** + +```yaml +# Example configuration +domain: staging.supra.social +# NOT: supra.staging.supra.social +``` + +**Important:** Make sure you enter only the desired subdomain without duplicating the base domain. + + + + + +After updating the domain configuration: + +1. **Wait 5-10 minutes** for the changes to propagate +2. **Check the deployment status** in the workload dashboard +3. **Verify the new domain** is working correctly +4. 
**Test access** to `staging.supra.social` + +If the old domain is still cached, you may need to: + +- Clear your browser cache +- Wait for DNS propagation (up to 24 hours in some cases) +- Use incognito/private browsing mode to test + + + + + +If you still experience issues after the configuration change: + +**Check DNS Configuration:** + +```bash +# Test DNS resolution +nslookup staging.supra.social + +# Check if the domain points to the correct IP +dig staging.supra.social +``` + +**Verify SSL Certificate:** + +- The SSL certificate should automatically update for the new domain +- If you see SSL warnings, wait a few more minutes for certificate provisioning + +**Common Issues:** + +- **Domain not resolving**: Check if DNS records are properly configured +- **SSL certificate errors**: Wait for automatic certificate provisioning +- **Old domain still working**: This is normal during transition period + + + + + +To prevent similar issues in the future: + +**Best Practices:** + +1. **Plan your domain structure** before creating the web service +2. **Use clear naming conventions**: `[environment].[app].[domain]` +3. **Double-check domain entries** before saving configuration +4. 
**Test domain configuration** in a development environment first + +**Domain Structure Examples:** + +``` +# Good examples +staging.myapp.com +api.myapp.com +dev.myapp.com + +# Avoid nested duplications +# Bad: myapp.staging.myapp.com +# Bad: api.myapp.api.myapp.com +``` + + + +--- + +_This FAQ was automatically generated on March 10, 2024 based on a real user query._ diff --git a/docs/troubleshooting/web-service-domain-reset-issue.mdx b/docs/troubleshooting/web-service-domain-reset-issue.mdx new file mode 100644 index 000000000..29a1c2081 --- /dev/null +++ b/docs/troubleshooting/web-service-domain-reset-issue.mdx @@ -0,0 +1,114 @@ +--- +sidebar_position: 3 +title: "Web Service Domain Reset During Replica Update" +description: "Issue where domain name resets to project name when editing web service replicas" +date: "2025-02-20" +category: "workload" +tags: ["webservice", "domain", "replicas", "frontend", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web Service Domain Reset During Replica Update + +**Date:** February 20, 2025 +**Category:** Workload +**Tags:** WebService, Domain, Replicas, Frontend, Configuration + +## Problem Description + +**Context:** When editing a Web Service configuration to modify replica count, the domain name unexpectedly changes from the current production domain to the default project name, causing potential downtime. 
+ +**Observed Symptoms:** + +- Domain name resets to project name when editing Web Service +- Occurs when making simple changes like replica count updates +- Causes temporary website downtime if not noticed immediately +- Frontend shows project name instead of current production domain + +**Relevant Configuration:** + +- Component: Web Service configuration +- Action: Editing replica count +- Expected behavior: Domain should remain unchanged +- Actual behavior: Domain resets to project name + +**Error Conditions:** + +- Occurs during Web Service editing process +- Happens when modifying any Web Service parameter +- Results in unintended domain changes +- Can cause service interruption + +## Detailed Solution + + + +To avoid accidental domain changes when editing Web Services: + +1. **Always verify the domain field** before saving changes +2. **Check that the domain matches your current production URL** +3. **If the domain has reset, manually correct it** to your production domain +4. **Save the configuration** only after verifying all fields + +**Important:** Always double-check the domain field even when making unrelated changes like replica count adjustments. + + + + + +Currently, the Web Service edit form may display the project name as the default domain instead of preserving the existing production domain. This is a known frontend issue that can cause: + +- Unexpected domain changes +- Service interruptions +- Need for manual domain correction +- Potential downtime if not caught immediately + + + + + +To minimize risk when editing Web Services: + +1. **Review all fields** before saving, not just the ones you intended to change +2. **Take note of your current domain** before starting the edit process +3. **Make changes during maintenance windows** when possible +4. **Have monitoring in place** to quickly detect domain changes +5. 
**Consider making changes in smaller batches** to limit impact + + + + + +To quickly detect if your domain has been accidentally changed: + +1. **Set up external monitoring** for your production URLs +2. **Configure alerts** for HTTP response changes +3. **Monitor DNS resolution** for your domains +4. **Use health checks** that verify the correct service is responding + +```bash +# Example monitoring script +curl -f https://your-production-domain.com/health || echo "Domain issue detected" +``` + + + + + +If you've accidentally changed the domain: + +1. **Immediately edit the Web Service again** +2. **Correct the domain field** to your production domain +3. **Save the configuration** +4. **Wait for the deployment to complete** +5. **Verify the service is accessible** at the correct URL +6. **Check that all traffic is routing correctly** + +The recovery process typically takes a few minutes for the changes to propagate. + + + +--- + +_This FAQ was automatically generated on February 20, 2025 based on a real user query._ diff --git a/docs/troubleshooting/webservice-alias-configuration.mdx b/docs/troubleshooting/webservice-alias-configuration.mdx new file mode 100644 index 000000000..2725cde9b --- /dev/null +++ b/docs/troubleshooting/webservice-alias-configuration.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "Web Service Alias Configuration" +description: "How to create and manage aliases for web services in SleakOps" +date: "2024-12-19" +category: "workload" +tags: ["webservice", "alias", "configuration", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web Service Alias Configuration + +**Date:** December 19, 2024 +**Category:** Workload +**Tags:** Web Service, Alias, Configuration, Networking + +## Problem Description + +**Context:** Users need to create custom domain aliases for their web services in SleakOps to provide alternative URLs or custom domain names for accessing their 
applications. + +**Observed Symptoms:** + +- Need for custom domain names for web services +- Requirement to access services through multiple URLs +- Desire to use branded or user-friendly domain names +- Need for load balancing across multiple endpoints + +**Relevant Configuration:** + +- Web service deployment in SleakOps +- Custom domain names or subdomains +- DNS configuration requirements +- SSL/TLS certificate management + +**Error Conditions:** + +- Default service URLs may not meet branding requirements +- Multiple access points needed for the same service +- Custom domain routing requirements + +## Detailed Solution + + + +To create aliases for your web services in SleakOps: + +1. **Navigate to Web Service Details**: + + - Go to your project dashboard + - Select the web service you want to configure + - Click on the web service name to access its details + +2. **Access Alias Configuration**: + + - In the web service detail page, look for the "Aliases" or "Networking" section + - Click on "Add Alias" or similar button + +3. **Configure the Alias**: + - Enter your custom domain name (e.g., `api.mycompany.com`) + - Select the appropriate protocol (HTTP/HTTPS) + - Configure any additional routing rules if needed + + + + + +Before creating an alias, ensure your DNS is properly configured: + +1. **CNAME Record**: Create a CNAME record pointing your custom domain to the SleakOps service endpoint +2. **A Record**: Alternatively, use an A record pointing to the service IP address +3. **Verification**: Ensure DNS propagation is complete before testing + +```bash +# Example DNS configuration +# CNAME record +api.mycompany.com. IN CNAME your-service.sleakops.io. + +# Or A record +api.mycompany.com. IN A 192.168.1.100 +``` + + + + + +For HTTPS aliases, SleakOps can automatically manage SSL certificates: + +1. **Automatic Certificates**: SleakOps can automatically provision Let's Encrypt certificates +2. 
**Custom Certificates**: Upload your own SSL certificates if required +3. **Certificate Renewal**: Automatic renewal is handled by the platform + +**Steps to enable HTTPS**: + +- Enable SSL/TLS in the alias configuration +- Choose between automatic or custom certificate +- Verify certificate installation after creation + + + + + +You can create multiple aliases for the same web service: + +1. **Different Domains**: Point multiple domains to the same service +2. **Subdomain Routing**: Use different subdomains for different purposes +3. **Environment-specific**: Create aliases for different environments + +**Example use cases**: + +- `api.mycompany.com` - Production API +- `api-staging.mycompany.com` - Staging environment +- `v1.api.mycompany.com` - Version-specific endpoint + + + + + +If your alias is not working properly: + +1. **DNS Propagation**: Wait for DNS changes to propagate (up to 48 hours) +2. **Certificate Issues**: Check SSL certificate status and validity +3. **Firewall Rules**: Ensure no firewall rules are blocking the traffic +4. 
**Service Health**: Verify the underlying web service is running correctly + +**Verification commands**: + +```bash +# Check DNS resolution +nslookup api.mycompany.com + +# Test HTTP connectivity +curl -I http://api.mycompany.com + +# Test HTTPS connectivity +curl -I https://api.mycompany.com +``` + + + + + +**Naming Conventions**: + +- Use descriptive subdomain names +- Follow consistent naming patterns +- Consider environment prefixes (prod-, staging-, dev-) + +**Security Considerations**: + +- Always use HTTPS for production services +- Implement proper certificate management +- Consider using wildcard certificates for multiple subdomains + +**Performance Optimization**: + +- Use CDN integration when available +- Configure appropriate caching headers +- Monitor alias performance and usage + + + +--- + +_This FAQ was automatically generated on December 19, 2024 based on a real user query._ diff --git a/docs/troubleshooting/websocket-mixed-content-error.mdx b/docs/troubleshooting/websocket-mixed-content-error.mdx new file mode 100644 index 000000000..07a5aebcb --- /dev/null +++ b/docs/troubleshooting/websocket-mixed-content-error.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "WebSocket Mixed Content Error - WSS Protocol Required" +description: "Solution for WebSocket mixed content error when connecting from HTTPS to WS endpoint" +date: "2024-12-18" +category: "general" +tags: ["websocket", "https", "ssl", "mixed-content", "security"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# WebSocket Mixed Content Error - WSS Protocol Required + +**Date:** December 18, 2024 +**Category:** General +**Tags:** WebSocket, HTTPS, SSL, Mixed Content, Security + +## Problem Description + +**Context:** When attempting to establish a WebSocket connection from an HTTPS webpage to a WebSocket endpoint using the `ws://` protocol, browsers block the connection due to mixed content security policies. 
+ +**Observed Symptoms:** + +- Mixed Content error in browser console +- WebSocket connection fails to establish +- Error message: "This request has been blocked; this endpoint must be available over WSS" +- Connection attempt from HTTPS page to `ws://` endpoint is rejected + +**Relevant Configuration:** + +- Frontend served over: HTTPS +- WebSocket endpoint protocol: `ws://` (insecure) +- Browser security policy: Mixed content blocking enabled +- WebSocket URL format: `ws://domain/ws/path/?token=...` + +**Error Conditions:** + +- Error occurs when HTTPS page tries to connect to `ws://` endpoint +- Modern browsers enforce mixed content policies +- Connection is blocked before establishment +- Issue affects all secure contexts (HTTPS pages) + +## Detailed Solution + + + +The primary solution is to use the secure WebSocket protocol (`wss://`) instead of the insecure protocol (`ws://`): + +**Before (causes error):** + +```javascript +const websocket = new WebSocket("ws://apiqa.simplee.cl/ws/lead/?token=..."); +``` + +**After (correct):** + +```javascript +const websocket = new WebSocket("wss://apiqa.simplee.cl/ws/lead/?token=..."); +``` + +This change ensures that: + +- The connection uses SSL/TLS encryption +- Browser mixed content policies are satisfied +- Communication remains secure end-to-end + + + + + +Verify that your WebSocket server is configured to handle secure connections: + +**For Node.js applications:** + +```javascript +const https = require("https"); +const WebSocket = require("ws"); +const fs = require("fs"); + +const server = https.createServer({ + cert: fs.readFileSync("path/to/cert.pem"), + key: fs.readFileSync("path/to/key.pem"), +}); + +const wss = new WebSocket.Server({ server }); +``` + +**For reverse proxy (Nginx):** + +```nginx +server { + listen 443 ssl; + server_name apiqa.simplee.cl; + + ssl_certificate /path/to/cert.pem; + ssl_certificate_key /path/to/key.pem; + + location /ws/ { + proxy_pass http://backend; + proxy_http_version 1.1; + 
proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_set_header Host $host; + } +} +``` + + + + + +For applications that need to work in both HTTP and HTTPS environments, use dynamic protocol selection: + +```javascript +function getWebSocketURL(path) { + const protocol = window.location.protocol === "https:" ? "wss:" : "ws:"; + const host = window.location.host; + return `${protocol}//${host}${path}`; +} + +// Usage +const wsUrl = getWebSocketURL("/ws/lead/?token=..."); +const websocket = new WebSocket(wsUrl); +``` + +Or use relative URLs that automatically inherit the page's protocol: + +```javascript +// This automatically uses wss:// on HTTPS pages and ws:// on HTTP pages +const websocket = new WebSocket( + `${window.location.protocol === "https:" ? "wss:" : "ws:"}//${ + window.location.host + }/ws/lead/?token=...` +); +``` + + + + + +If you're still experiencing issues after changing to `wss://`: + +1. **Verify SSL certificate:** + + ```bash + openssl s_client -connect apiqa.simplee.cl:443 -servername apiqa.simplee.cl + ``` + +2. **Test WebSocket endpoint:** + + ```bash + # Using websocat tool + websocat wss://apiqa.simplee.cl/ws/lead/ + ``` + +3. **Check browser developer tools:** + + - Open Network tab + - Look for WebSocket connections + - Check for SSL/TLS errors + +4. **Common issues:** + - Self-signed certificates (use valid SSL certificate) + - Port blocking (ensure port 443 is open) + - Firewall rules (allow WSS traffic) + - Load balancer configuration (ensure WebSocket support) + + + + + +When implementing WSS connections: + +1. **Always use WSS in production:** + + - Never use `ws://` in production environments + - Encrypt all WebSocket communication + +2. **Validate SSL certificates:** + + - Use certificates from trusted CAs + - Avoid self-signed certificates in production + +3. 
**Implement proper authentication:** + + ```javascript + const token = getAuthToken(); // Your auth mechanism + const websocket = new WebSocket(`wss://api.domain.com/ws/?token=${token}`); + ``` + +4. **Handle connection errors gracefully:** + ```javascript + websocket.onerror = function (error) { + console.error("WebSocket error:", error); + // Implement reconnection logic + }; + ``` + + + +--- + +_This FAQ was automatically generated on December 18, 2024 based on a real user query._ diff --git a/docs/troubleshooting/workload-502-bad-gateway-nestjs.mdx b/docs/troubleshooting/workload-502-bad-gateway-nestjs.mdx new file mode 100644 index 000000000..ea6d08c40 --- /dev/null +++ b/docs/troubleshooting/workload-502-bad-gateway-nestjs.mdx @@ -0,0 +1,803 @@ +--- +sidebar_position: 15 +title: "502 Bad Gateway Error with NestJS Application" +description: "Solution for 502 Bad Gateway errors when NestJS pods are running but API endpoints are unreachable" +date: "2025-01-15" +category: "workload" +tags: ["502", "bad-gateway", "nestjs", "api", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# 502 Bad Gateway Error with NestJS Application + +**Date:** January 15, 2025 +**Category:** Workload +**Tags:** 502, Bad Gateway, NestJS, API, Troubleshooting + +## Problem Description + +**Context:** User has a NestJS application deployed in Kubernetes that shows normal startup logs and appears to be running correctly, but API endpoints return 502 Bad Gateway errors. + +**Observed Symptoms:** + +- Pods are running and show normal NestJS startup logs +- Application modules initialize successfully (TypeORM, Config, Logger, etc.) 
+- Routes are mapped correctly (`/health`, `/session`) +- API requests return `502 Bad Gateway` error +- Both GET and POST requests fail + +**Relevant Configuration:** + +- Application: NestJS with TypeORM +- Service name: `rattlesnake-develop` +- Pod count: 2 pods running +- Routes: `/health` (GET), `/session` (POST) + +**Error Conditions:** + +- Error occurs after successful application startup +- Affects all API endpoints +- Happens despite pods showing as healthy in Kubernetes +- Problem resolved by generating a new deployment + +## Detailed Solution + + + +A 502 Bad Gateway error in Kubernetes typically indicates that the ingress controller or service can reach the pod, but the pod is not responding correctly to HTTP requests. Common causes include: + +1. **Port mismatch**: Application listening on different port than service expects +2. **Health check failures**: Readiness/liveness probes failing +3. **Application not fully ready**: App appears started but HTTP server not listening +4. 
**Service selector issues**: Service not routing to correct pods + + + + + +Check if your NestJS application is listening on the correct port: + +```typescript +// In your main.ts file +import { NestFactory } from "@nestjs/core"; +import { AppModule } from "./app.module"; + +async function bootstrap() { + const app = await NestFactory.create(AppModule); + const port = process.env.PORT || 3000; + await app.listen(port, "0.0.0.0"); // Important: bind to 0.0.0.0 + console.log(`Application is running on: ${await app.getUrl()}`); +} +bootstrap(); +``` + +Ensure your Kubernetes service matches this port: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: rattlesnake-develop +spec: + selector: + app: rattlesnake-develop + ports: + - port: 80 + targetPort: 3000 # Should match your app port + protocol: TCP +``` + + + + + +Add proper health check endpoints and configure Kubernetes probes: + +```typescript +// Add to your app controller +@Get('/health') +getHealth() { + return { + status: 'ok', + timestamp: new Date().toISOString(), + uptime: process.uptime() + }; +} + +@Get('/ready') +getReadiness() { + // Add any readiness checks (database connection, etc.) 
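+  // For example (illustrative only — assumes a TypeORM DataSource is
+  // injected into this controller as `private dataSource: DataSource`):
+  //   await this.dataSource.query('SELECT 1');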
+ return { status: 'ready' }; +} +``` + +Configure Kubernetes probes: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: + template: + spec: + containers: + - name: app + image: your-image + ports: + - containerPort: 3000 + livenessProbe: + httpGet: + path: /health + port: 3000 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /ready + port: 3000 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + + + + + +Use these commands to debug the connection: + +```bash +# Check if pods are ready +kubectl get pods -l app=rattlesnake-develop + +# Check service endpoints +kubectl get endpoints rattlesnake-develop + +# Test direct pod connectivity +kubectl port-forward pod/rattlesnake-develop-xxx 3000:3000 +# Then test: curl http://localhost:3000/health + +# Check service connectivity +kubectl port-forward service/rattlesnake-develop 8080:80 +# Then test: curl http://localhost:8080/health + +# Check pod logs for HTTP server startup +kubectl logs -f deployment/rattlesnake-develop +``` + + + + + +Ensure your NestJS application is properly configured for containerized environments: + +```typescript +// Enable graceful shutdown +async function bootstrap() { + const app = await NestFactory.create(AppModule); + + // Enable shutdown hooks + app.enableShutdownHooks(); + + // Configure CORS if needed + app.enableCors(); + + // Global prefix (optional) + app.setGlobalPrefix("api"); + + // Bind to all interfaces + await app.listen(process.env.PORT || 3000, "0.0.0.0"); +} +``` + +Check your database configuration for container environments: + +```typescript +// TypeORM configuration +@Module({ + imports: [ + TypeOrmModule.forRootAsync({ + useFactory: () => ({ + type: "postgres", + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT) || 5432, + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + synchronize: false, // Never true in production + 
retryAttempts: 3, + retryDelay: 3000, + }), + }), + ], +}) +export class AppModule {} +``` + + + + + +If you need an immediate fix, force a new deployment in SleakOps: + +1. **Trigger a new deployment**: + - Make a small change to your repository (add a comment, update a dependency) + - Push the change to trigger a new build and deployment + - This will create fresh pods with clean state + +2. **Alternative - Manual pod restart**: + ```bash + # Restart deployment to create new pods + kubectl rollout restart deployment rattlesnake-develop + + # Wait for rollout to complete + kubectl rollout status deployment rattlesnake-develop + + # Verify new pods are running + kubectl get pods -l app=rattlesnake-develop + ``` + +This often resolves the 502 error by creating fresh pods without any potential state issues. + + + + + +### Step 1: Check Pod and Service Configuration + +```bash +# Check pod status and logs +kubectl get pods -l app=rattlesnake-develop +kubectl logs --tail=50 + +# Check service configuration +kubectl get svc rattlesnake-develop -o yaml +kubectl describe svc rattlesnake-develop + +# Check endpoints +kubectl get endpoints rattlesnake-develop +``` + +### Step 2: Test Direct Pod Connectivity + +```bash +# Port forward directly to a pod +kubectl port-forward 8080:3000 + +# Test the application directly +curl http://localhost:8080/health +curl -X POST http://localhost:8080/session -H "Content-Type: application/json" -d '{}' +``` + +### Step 3: Check Application Startup Sequence + +Ensure your NestJS app is fully ready before accepting connections: + +```typescript +// Enhanced main.ts with better startup handling +import { NestFactory } from '@nestjs/core'; +import { AppModule } from './app.module'; +import { Logger } from '@nestjs/common'; + +async function bootstrap() { + const logger = new Logger('Bootstrap'); + + try { + const app = await NestFactory.create(AppModule); + + // Enable shutdown hooks + app.enableShutdownHooks(); + + // Configure CORS if 
needed + app.enableCors({ + origin: true, + credentials: true, + }); + + // Global prefix for all routes + app.setGlobalPrefix('api', { exclude: ['health'] }); + + const port = process.env.PORT || 3000; + + // Listen on all interfaces (crucial for Kubernetes) + await app.listen(port, '0.0.0.0'); + + logger.log(`Application is running on port ${port}`); + logger.log(`Health check available at: http://localhost:${port}/health`); + + // Test internal connectivity + try { + const response = await fetch(`http://localhost:${port}/health`); + logger.log(`Self-health check: ${response.status}`); + } catch (error) { + logger.error('Self-health check failed:', error.message); + } + + } catch (error) { + logger.error('Failed to start application:', error); + process.exit(1); + } +} + +bootstrap(); +``` + + + + + +### Health Check Controller + +Create a robust health check endpoint: + +```typescript +// health.controller.ts +import { Controller, Get, HttpStatus } from '@nestjs/common'; +import { HealthCheck, HealthCheckService, TypeOrmHealthIndicator } from '@nestjs/terminus'; + +@Controller('health') +export class HealthController { + constructor( + private health: HealthCheckService, + private db: TypeOrmHealthIndicator, + ) {} + + @Get() + @HealthCheck() + check() { + return this.health.check([ + () => this.db.pingCheck('database'), + ]); + } + + @Get('ready') + readiness() { + return { + status: 'ok', + timestamp: new Date().toISOString(), + uptime: process.uptime(), + }; + } + + @Get('live') + liveness() { + return { + status: 'ok', + pid: process.pid, + memory: process.memoryUsage(), + }; + } +} +``` + +### Kubernetes Readiness/Liveness Probes + +Configure proper probes in your deployment: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: + replicas: 2 + selector: + matchLabels: + app: rattlesnake-develop + template: + metadata: + labels: + app: rattlesnake-develop + spec: + containers: + - name: app + image: your-app-image + 
ports: + - containerPort: 3000 + env: + - name: PORT + value: "3000" + readinessProbe: + httpGet: + path: /health/ready + port: 3000 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + livenessProbe: + httpGet: + path: /health/live + port: 3000 + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 3 + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + + + + + +### 1. Database Connection Issues + +```typescript +// app.module.ts - Robust database configuration +import { TypeOrmModule } from '@nestjs/typeorm'; + +@Module({ + imports: [ + TypeOrmModule.forRootAsync({ + useFactory: () => ({ + type: 'postgres', + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT) || 5432, + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + autoLoadEntities: true, + synchronize: process.env.NODE_ENV === 'development', + retryAttempts: 5, + retryDelay: 3000, + maxQueryExecutionTime: 1000, + // Connection pool settings + extra: { + connectionLimit: 10, + acquireTimeoutMillis: 60000, + timeout: 60000, + }, + }), + }), + ], +}) +export class AppModule {} +``` + +### 2. 
Async Module Initialization + +```typescript +// Ensure all async modules initialize properly +@Module({ + imports: [ + ConfigModule.forRoot({ + isGlobal: true, + validationSchema: Joi.object({ + NODE_ENV: Joi.string().valid('development', 'production', 'test').required(), + PORT: Joi.number().default(3000), + DB_HOST: Joi.string().required(), + DB_PORT: Joi.number().default(5432), + DB_USERNAME: Joi.string().required(), + DB_PASSWORD: Joi.string().required(), + DB_NAME: Joi.string().required(), + }), + }), + TypeOrmModule.forRootAsync({ + inject: [ConfigService], + useFactory: async (configService: ConfigService) => { + // Add startup delay to ensure database is ready + await new Promise(resolve => setTimeout(resolve, 2000)); + + return { + type: 'postgres', + host: configService.get('DB_HOST'), + port: configService.get('DB_PORT'), + username: configService.get('DB_USERNAME'), + password: configService.get('DB_PASSWORD'), + database: configService.get('DB_NAME'), + autoLoadEntities: true, + synchronize: configService.get('NODE_ENV') === 'development', + }; + }, + }), + ], +}) +export class AppModule {} +``` + +### 3. 
Graceful Shutdown Handling + +```typescript +// main.ts - Add graceful shutdown +import { NestFactory } from '@nestjs/core'; +import { AppModule } from './app.module'; + +async function bootstrap() { + const app = await NestFactory.create(AppModule); + + // Enable graceful shutdown + app.enableShutdownHooks(); + + // Handle shutdown signals + process.on('SIGTERM', async () => { + console.log('SIGTERM received, shutting down gracefully'); + await app.close(); + process.exit(0); + }); + + process.on('SIGINT', async () => { + console.log('SIGINT received, shutting down gracefully'); + await app.close(); + process.exit(0); + }); + + const port = process.env.PORT || 3000; + await app.listen(port, '0.0.0.0'); + console.log(`Application is running on: ${await app.getUrl()}`); +} + +bootstrap().catch(error => { + console.error('Failed to start application:', error); + process.exit(1); +}); +``` + + + + + +### Application Logging + +```typescript +// logger.service.ts +import { Injectable, Logger } from '@nestjs/common'; + +@Injectable() +export class AppLogger extends Logger { + error(message: string, trace?: string, context?: string) { + // Enhanced error logging + const errorInfo = { + message, + trace, + context, + timestamp: new Date().toISOString(), + pid: process.pid, + memory: process.memoryUsage(), + }; + + console.error(JSON.stringify(errorInfo)); + super.error(message, trace, context); + } + + log(message: string, context?: string) { + const logInfo = { + level: 'info', + message, + context, + timestamp: new Date().toISOString(), + }; + + console.log(JSON.stringify(logInfo)); + super.log(message, context); + } +} +``` + +### Monitoring Endpoints + +```typescript +// monitoring.controller.ts +import { Controller, Get } from '@nestjs/common'; + +@Controller('monitoring') +export class MonitoringController { + @Get('metrics') + getMetrics() { + return { + uptime: process.uptime(), + memory: process.memoryUsage(), + cpu: process.cpuUsage(), + version: 
process.version, + platform: process.platform, + timestamp: new Date().toISOString(), + }; + } + + @Get('config') + getConfig() { + return { + nodeEnv: process.env.NODE_ENV, + port: process.env.PORT, + // Don't expose sensitive data + database: { + host: process.env.DB_HOST, + port: process.env.DB_PORT, + name: process.env.DB_NAME, + }, + }; + } +} +``` + +### Request Logging Middleware + +```typescript +// request-logger.middleware.ts +import { Injectable, NestMiddleware, Logger } from '@nestjs/common'; +import { Request, Response, NextFunction } from 'express'; + +@Injectable() +export class RequestLoggerMiddleware implements NestMiddleware { + private logger = new Logger('HTTP'); + + use(request: Request, response: Response, next: NextFunction): void { + const { ip, method, originalUrl } = request; + const userAgent = request.get('User-Agent') || ''; + const startTime = Date.now(); + + response.on('close', () => { + const { statusCode } = response; + const contentLength = response.get('Content-Length'); + const responseTime = Date.now() - startTime; + + const logData = { + method, + url: originalUrl, + statusCode, + contentLength, + responseTime, + ip, + userAgent, + }; + + if (statusCode >= 400) { + this.logger.error(`HTTP ${statusCode} ${method} ${originalUrl} - ${responseTime}ms`); + console.error(JSON.stringify(logData)); + } else { + this.logger.log(`HTTP ${statusCode} ${method} ${originalUrl} - ${responseTime}ms`); + } + }); + + next(); + } +} +``` + + + + + +### 1. 
Robust Deployment Strategy + +```yaml +# deployment.yaml with rolling update strategy +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: + replicas: 3 # Always use multiple replicas + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 + selector: + matchLabels: + app: rattlesnake-develop + template: + metadata: + labels: + app: rattlesnake-develop + spec: + containers: + - name: app + image: your-app-image + ports: + - containerPort: 3000 + readinessProbe: + httpGet: + path: /health/ready + port: 3000 + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 + livenessProbe: + httpGet: + path: /health/live + port: 3000 + initialDelaySeconds: 60 + periodSeconds: 30 + failureThreshold: 3 + lifecycle: + preStop: + exec: + command: ["/bin/sh", "-c", "sleep 15"] +``` + +### 2. Circuit Breaker Pattern + +```typescript +// circuit-breaker.service.ts +import { Injectable } from '@nestjs/common'; + +@Injectable() +export class CircuitBreakerService { + private failures = 0; + private lastFailureTime = 0; + private state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN + + async execute<T>(operation: () => Promise<T>, fallback?: () => T): Promise<T> { + if (this.state === 'OPEN') { + if (Date.now() - this.lastFailureTime > 60000) { // 1 minute timeout + this.state = 'HALF_OPEN'; + } else { + return fallback ? fallback() : Promise.reject(new Error('Circuit breaker is OPEN')); + } + } + + try { + const result = await operation(); + this.onSuccess(); + return result; + } catch (error) { + this.onFailure(); + return fallback ? fallback() : Promise.reject(error); + } + } + + private onSuccess() { + this.failures = 0; + this.state = 'CLOSED'; + } + + private onFailure() { + this.failures++; + this.lastFailureTime = Date.now(); + + if (this.failures >= 5) { + this.state = 'OPEN'; + } + } +} +``` + +### 3. 
Resource Management + +```typescript +// resource-monitor.service.ts +import { Injectable, Logger } from '@nestjs/common'; +import { Cron } from '@nestjs/schedule'; + +@Injectable() +export class ResourceMonitorService { + private readonly logger = new Logger(ResourceMonitorService.name); + + @Cron('*/30 * * * * *') // Every 30 seconds + checkResources() { + const memUsage = process.memoryUsage(); + const memUsageMB = { + rss: Math.round(memUsage.rss / 1024 / 1024), + heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024), + heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024), + external: Math.round(memUsage.external / 1024 / 1024), + }; + + // Alert if memory usage is high + if (memUsageMB.heapUsed > 400) { // 400MB threshold + this.logger.warn(`High memory usage detected: ${JSON.stringify(memUsageMB)}`); + } + + // Force garbage collection if memory is critically high + if (memUsageMB.heapUsed > 450) { + if (global.gc) { + global.gc(); + this.logger.log('Forced garbage collection'); + } + } + } +} +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/workload-cron-job-configuration.mdx b/docs/troubleshooting/workload-cron-job-configuration.mdx new file mode 100644 index 000000000..d84ad5f1f --- /dev/null +++ b/docs/troubleshooting/workload-cron-job-configuration.mdx @@ -0,0 +1,174 @@ +--- +sidebar_position: 15 +title: "Cron Job Configuration in SleakOps" +description: "How to configure cron expressions for scheduled workloads in SleakOps" +date: "2024-08-15" +category: "workload" +tags: ["cron", "scheduling", "workload", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cron Job Configuration in SleakOps + +**Date:** August 15, 2024 +**Category:** Workload +**Tags:** Cron, Scheduling, Workload, Configuration + +## Problem Description + +**Context:** Users need to configure cron expressions for scheduled 
workloads in SleakOps but may not have direct access to a text field for cron expression input in the current platform version. + +**Observed Symptoms:** + +- Need to set up scheduled jobs with specific timing requirements +- Limited interface options for cron expression configuration +- Requirement for custom scheduling patterns beyond basic presets + +**Relevant Configuration:** + +- Workload type: Scheduled jobs/cron jobs +- Platform: SleakOps workload management +- Scheduling requirements: Custom cron expressions + +**Error Conditions:** + +- Difficulty configuring complex scheduling patterns +- Limited scheduling options in current UI +- Need for precise timing control + +## Detailed Solution + + + +Currently, you can configure cron jobs in SleakOps through the workload configuration interface: + +1. **Navigate to Workloads** in your SleakOps project +2. **Create or Edit** a workload +3. **Select Workload Type**: Choose "Cron Job" or "Scheduled Task" +4. **Configure Schedule**: Use the available scheduling options + +**Available scheduling methods:** + +- Predefined intervals (hourly, daily, weekly) +- Custom time selection through UI components +- Advanced configuration through YAML editor + + + + + +For complex cron expressions, you can use the YAML editor: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: my-scheduled-job +spec: + schedule: "0 2 * * 1-5" # Run at 2 AM, Monday to Friday + jobTemplate: + spec: + template: + spec: + containers: + - name: my-container + image: my-app:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Running scheduled task'"] + restartPolicy: OnFailure +``` + +**Common cron expression patterns:** + +- `0 0 * * *` - Daily at midnight +- `0 */6 * * *` - Every 6 hours +- `0 9 * * 1-5` - Weekdays at 9 AM +- `*/15 * * * *` - Every 15 minutes + + + + + +Cron expressions in Kubernetes follow this format: + +``` +┌───────────── minute (0 - 59) +│ ┌───────────── hour (0 - 23) +│ │ ┌───────────── day of month (1 - 31) 
+│ │ │ ┌───────────── month (1 - 12) +│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday) +│ │ │ │ │ +│ │ │ │ │ +* * * * * +``` + +**Special characters:** + +- `*` - Any value +- `,` - Value list separator +- `-` - Range of values +- `/` - Step values +- `?` - No specific value (day of month/week only) + +**Examples:** + +- `0 0 1 * *` - First day of every month at midnight +- `0 */2 * * *` - Every 2 hours +- `0 9-17 * * 1-5` - Every hour from 9 AM to 5 PM, Monday to Friday + + + + + +**Enhanced Cron Configuration (Coming Soon):** + +The SleakOps team is working on improved cron job configuration features: + +1. **Direct Cron Expression Input**: A dedicated text field for entering cron expressions directly +2. **Visual Cron Builder**: Interactive interface to build cron expressions +3. **Expression Validation**: Real-time validation of cron syntax +4. **Schedule Preview**: Visual representation of when jobs will run + +**Current Workarounds:** + +- Use the YAML editor for complex expressions +- Reference cron expression generators online +- Test expressions in development environments first + + + + + +**Common issues and solutions:** + +1. **Job not running at expected times:** + + - Verify timezone settings in your cluster + - Check cron expression syntax + - Review job history in SleakOps dashboard + +2. **Jobs failing to complete:** + + - Check resource limits and requests + - Review container logs + - Verify image availability and permissions + +3. 
**Monitoring cron jobs:** + + ```bash + # Check cron job status + kubectl get cronjobs + + # View job execution history + kubectl get jobs + + # Check specific job logs + kubectl logs job/my-scheduled-job-1234567890 + ``` + + + +--- + +_This FAQ was automatically generated on January 15, 2025 based on a real user query._ diff --git a/docs/troubleshooting/workload-internal-configuration-issue.mdx b/docs/troubleshooting/workload-internal-configuration-issue.mdx new file mode 100644 index 000000000..229458a05 --- /dev/null +++ b/docs/troubleshooting/workload-internal-configuration-issue.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "Internal Workload Configuration Issue" +description: "Unable to modify internal workload configurations in SleakOps platform" +date: "2025-03-28" +category: "workload" +tags: ["workload", "internal", "configuration", "ui", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Internal Workload Configuration Issue + +**Date:** March 28, 2025 +**Category:** Workload +**Tags:** Workload, Internal, Configuration, UI, Troubleshooting + +## Problem Description + +**Context:** Users attempting to modify internal workload configurations through the SleakOps platform interface encounter UI navigation issues that prevent them from completing the configuration process. 
+ +**Observed Symptoms:** + +- "Next" button in Service Schema step becomes unresponsive +- Unable to advance through configuration wizard for internal workloads +- Problem affects multiple internal workloads consistently +- Public workloads can be modified without issues +- Configuration changes cannot be saved or applied + +**Relevant Configuration:** + +- Workload type: Internal workloads +- Affected step: Service Schema configuration step +- UI component: Next button functionality +- Working alternative: Public workloads function normally + +**Error Conditions:** + +- Error occurs specifically with internal workload types +- Problem appears during Service Schema step of configuration wizard +- Affects all internal workloads uniformly +- Does not affect public workload configurations + +## Detailed Solution + + + +While the platform UI issue is being resolved, you can modify internal workload configurations directly using Lens: + +1. **Access your cluster through Lens** +2. **Navigate to Workloads** → **Deployments** +3. **Find your internal workload** +4. 
**Edit the deployment directly** + +For common modifications: + +```yaml +# To change replica count +spec: + replicas: 3 # Change this value + +# To modify resource limits +spec: + template: + spec: + containers: + - name: your-container + resources: + limits: + cpu: "500m" + memory: "512Mi" + requests: + cpu: "250m" + memory: "256Mi" +``` + + + + + +You can also use kubectl to modify internal workloads: + +```bash +# Get current deployment configuration +kubectl get deployment -o yaml > workload-backup.yaml + +# Edit the deployment +kubectl edit deployment + +# Or apply changes from a file +kubectl apply -f modified-workload.yaml + +# Verify changes +kubectl get deployment +kubectl describe deployment +``` + + + + + +**Scaling replicas:** + +```bash +kubectl scale deployment --replicas=5 +``` + +**Updating resource limits:** + +```bash +kubectl patch deployment -p '{ + "spec": { + "template": { + "spec": { + "containers": [{ + "name": "", + "resources": { + "limits": { + "cpu": "1000m", + "memory": "1Gi" + }, + "requests": { + "cpu": "500m", + "memory": "512Mi" + } + } + }] + } + } + } +}' +``` + +**Updating environment variables:** + +```bash +kubectl set env deployment/ NEW_VAR=new_value +``` + +**Rolling restart:** + +```bash +kubectl rollout restart deployment/ +``` + +**Check rollout status:** + +```bash +kubectl rollout status deployment/ +``` + + + + + +To prevent similar issues in the future: + +1. **Monitor platform status:** + + - Check SleakOps status page for known issues + - Subscribe to platform notifications + - Join community channels for updates + +2. **Maintain kubectl access:** + + - Keep local kubectl configuration updated + - Ensure Lens is installed and configured + - Document common commands for your team + +3. **Backup configurations:** + + - Export workload configurations regularly + - Keep local copies of important deployments + - Version control your Kubernetes manifests + +4. 
**Report UI issues:** + - Contact SleakOps support when encountering UI problems + - Provide detailed reproduction steps + - Include browser and platform version information + + + +--- + +_This FAQ was automatically generated on March 28, 2025 based on a real user query._ diff --git a/docs/troubleshooting/workload-job-resource-limits-missing.mdx b/docs/troubleshooting/workload-job-resource-limits-missing.mdx new file mode 100644 index 000000000..ddefe5543 --- /dev/null +++ b/docs/troubleshooting/workload-job-resource-limits-missing.mdx @@ -0,0 +1,158 @@ +--- +sidebar_position: 3 +title: "Job Resource Limits Not Applied in Kubernetes Pod" +description: "Solution for when CPU and memory limits defined in SleakOps jobs are not applied to Kubernetes pods" +date: "2024-04-22" +category: "workload" +tags: ["job", "kubernetes", "resources", "limits", "memory", "cpu"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Job Resource Limits Not Applied in Kubernetes Pod + +**Date:** April 22, 2024 +**Category:** Workload +**Tags:** Job, Kubernetes, Resources, Limits, Memory, CPU + +## Problem Description + +**Context:** User defines CPU and memory limits for a job in SleakOps platform, but the resulting Kubernetes pod shows `resources: {}` instead of the specified resource constraints. + +**Observed Symptoms:** + +- Job configured with CPU (1000 min, 1500 max) and Memory (2500 min, 3000 max) +- Generated Kubernetes pod shows `resources: {}` in the spec +- Pod experiences out-of-memory errors due to lack of resource limits +- Error message: "The node was low on resource: memory. 
Container was using 4068000Ki, request is 0" + +**Relevant Configuration:** + +- CPU Min: 1000m, Max: 1500m +- Memory Min: 2500Mi, Max: 3000Mi +- Job type: Kubernetes Job +- Platform: SleakOps on AWS EKS + +**Error Conditions:** + +- Resource limits defined in SleakOps UI are not translated to Kubernetes pod spec +- Pod runs without resource constraints leading to potential node resource exhaustion +- Container can consume unlimited resources causing eviction + +## Detailed Solution + + + +This is a confirmed bug in the SleakOps platform where resource limits defined in the job configuration are not properly applied to the generated Kubernetes pods. + +**Status:** Fixed in upcoming release (scheduled for current week) + +The development team has identified and resolved the issue where CPU and memory limits specified in the SleakOps job configuration were not being translated into the Kubernetes pod specification. + + + + + +If you need to run the job urgently before the fix is released, you can manually add resource limits to the Kubernetes job: + +1. **Export the current job configuration** +2. **Manually edit the job YAML** to include resource specifications +3. **Apply the modified configuration** + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: your-job-name +spec: + template: + spec: + containers: + - name: your-container-name + image: your-image + resources: + requests: + cpu: "1000m" # Your minimum CPU value + memory: "2500Mi" # Your minimum memory value + limits: + cpu: "1500m" # Your maximum CPU value + memory: "3000Mi" # Your maximum memory value + # ... 
rest of your container spec +``` + + + + + +When manually adding resources, use the correct Kubernetes resource format: + +**CPU Resources:** + +- Use millicores: `1000m` = 1 CPU core +- Or decimal: `1.5` = 1.5 CPU cores + +**Memory Resources:** + +- Use standard units: `Mi` (Mebibytes), `Gi` (Gibibytes) +- Examples: `2500Mi`, `3Gi` + +**Complete Example:** + +```yaml +resources: + requests: # Minimum guaranteed resources + cpu: "1000m" + memory: "2500Mi" + limits: # Maximum allowed resources + cpu: "1500m" + memory: "3000Mi" +``` + + + + + +After applying the manual fix or when the platform update is released: + +1. **Check the pod specification:** + +```bash +kubectl get pod -o yaml | grep -A 10 resources: +``` + +2. **Verify resource allocation:** + +```bash +kubectl describe pod +``` + +3. **Monitor resource usage:** + +```bash +kubectl top pod +``` + +The output should show your specified CPU and memory limits instead of empty resources. + + + + + +**Best Practices for Resource Management:** + +1. **Always set resource limits** to prevent resource starvation +2. **Set appropriate requests** to ensure proper scheduling +3. **Monitor resource usage** to adjust limits based on actual consumption +4. 
**Use resource quotas** at namespace level for additional protection + +**Recommended Resource Strategy:** + +- Set requests to 70-80% of expected usage +- Set limits to 120-150% of expected usage +- Monitor and adjust based on actual metrics + + + +--- + +_This FAQ was automatically generated on April 22, 2024 based on a real user query._ diff --git a/docs/troubleshooting/workload-memory-limits-and-debug-pods.mdx b/docs/troubleshooting/workload-memory-limits-and-debug-pods.mdx new file mode 100644 index 000000000..b2b5a132c --- /dev/null +++ b/docs/troubleshooting/workload-memory-limits-and-debug-pods.mdx @@ -0,0 +1,214 @@ +--- +sidebar_position: 3 +title: "Increasing Memory Limits and Creating Debug Pods" +description: "How to increase memory limits for web services and create debug pods for running scripts" +date: "2024-01-15" +category: "workload" +tags: ["memory", "limits", "debug", "pods", "web-service", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Increasing Memory Limits and Creating Debug Pods + +**Date:** January 15, 2024 +**Category:** Workload +**Tags:** Memory, Limits, Debug, Pods, Web Service, Troubleshooting + +## Problem Description + +**Context:** Users need to increase memory limits for web service pods to run resource-intensive scripts and want to create dedicated debug pods for running console commands. 
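As a quick aid when choosing a new value (the best-practices section below suggests starting at roughly 2x the current limit and monitoring from there), a tiny helper can compute candidate limits. This is an illustrative sketch only — `doubleMemoryLimit` is not a SleakOps API:

```typescript
// Illustrative helper: double a Kubernetes memory quantity such as "768M"
// or "2Gi" to get a starting point for a raised limit.
function doubleMemoryLimit(limit: string): string {
  const match = /^(\d+)(Mi|Gi|M|G)$/.exec(limit);
  if (!match) {
    throw new Error(`Unsupported memory quantity: ${limit}`);
  }
  const [, value, unit] = match;
  return `${parseInt(value, 10) * 2}${unit}`;
}

// The default 768M limit mentioned above would double to "1536M".
console.log(doubleMemoryLimit("768M")); // prints "1536M"
```

Treat the result as a starting point only; the later best-practices notes still apply (monitor actual usage and adjust).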
+ +**Observed Symptoms:** + +- Web service pods hitting memory limits (768M default) +- Scripts failing due to insufficient memory +- Need for temporary pods to execute commands +- Requirement for console access to run maintenance scripts + +**Relevant Configuration:** + +- Default memory limit: 768M +- Workload type: Web Service +- Platform: SleakOps Kubernetes environment +- Need for temporary debug containers + +**Error Conditions:** + +- Memory limit exceeded during script execution +- Out of memory errors in application logs +- Need for ad-hoc command execution environment + +## Detailed Solution + + + +To increase memory limits for your web service: + +1. **Navigate to Workloads**: + + - Go to **Workloads** → **Web Service** + - Find your web service in the list + +2. **Edit the Service**: + + - Click on the 'web' service you want to modify + - Click **Edit** to open the configuration + +3. **Configure Resources**: + + - Navigate to the **last step** of the configuration forms + - In the **Resources** section, you'll find minimum and maximum resource settings + - Increase the **maximum memory** limit to your desired value (e.g., 2048M or 4096M) + +4. **Deploy Changes**: + - Make sure **Deploy** is activated/checked + - Click **Save** or **Update** + - Wait for the deployment to complete + +```yaml +# Example resource configuration +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2048Mi" # Increased from 768Mi + cpu: "1000m" +``` + +**Note**: The deployment process will update these values in the cluster automatically. + + + + + +To create a debug pod for running console commands: + +1. **Create a New Job**: + + - Go to **Workloads** → **Jobs** + - Click **Create New Job** + +2. 
**Configure the Job**: + + - **Name**: Give it a descriptive name like `debug-pod` or `script-runner` + - **Image URL**: Leave **empty** (this will use the same image as your web service) + - **Image Tag**: Leave **empty** + - **Command**: Enter `sleep infinity` or `sleep 999999` + +3. **Set Resources** (Second Step): + - Configure the memory and CPU resources this debug pod should have + - Set appropriate limits based on your script requirements + +```yaml +# Example debug job configuration +apiVersion: batch/v1 +kind: Job +metadata: + name: debug-pod +spec: + template: + spec: + containers: + - name: debug + image: your-app-image # Same as web service + command: ["sleep", "infinity"] + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2048Mi" + cpu: "1000m" + restartPolicy: Never +``` + +4. **Deploy the Job**: + - Click **Create** to deploy the debug pod + - Wait for it to be in **Running** status + + + + + +Once your debug pod is running: + +1. **Navigate to Kubernetes Dashboard**: + + - Go to the **Kubernetes** section in SleakOps + - Find your debug pod in the pods list + +2. **Open Shell Access**: + + - Click on the debug pod + - Look for the **Shell** or **Terminal** button + - Click to open a terminal session + +3. **Run Your Scripts**: + + ```bash + # Example commands you can run + php artisan migrate + php artisan cache:clear + npm run build + python manage.py migrate + ``` + +4. 
**Clean Up**: + - Once finished, delete the debug pod + - Go back to **Jobs** and delete the debug job + - This prevents unnecessary resource usage + + + + + +**Memory Limit Best Practices:** + +- Start with 2x your current limit and monitor usage +- Use monitoring tools to understand actual memory consumption +- Consider if the scripts can be optimized instead of just increasing limits +- Set both requests and limits appropriately + +**Debug Pod Best Practices:** + +- Always clean up debug pods after use +- Use descriptive names for easy identification +- Set appropriate resource limits to avoid cluster resource exhaustion +- Consider using `kubectl exec` directly if you have cluster access + +**Alternative Approaches:** + +1. **Scheduled Jobs**: For recurring scripts, create proper Kubernetes CronJobs +2. **Init Containers**: For one-time setup scripts during deployment +3. **Sidecar Containers**: For ongoing maintenance tasks + +```yaml +# Example of a proper maintenance job +apiVersion: batch/v1 +kind: CronJob +metadata: + name: maintenance-script +spec: + schedule: "0 2 * * *" # Daily at 2 AM + jobTemplate: + spec: + template: + spec: + containers: + - name: maintenance + image: your-app-image + command: + ["/bin/sh", "-c", "php artisan queue:work --stop-when-empty"] + resources: + limits: + memory: "1024Mi" + restartPolicy: OnFailure +``` + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/docs/troubleshooting/workload-public-access-configuration.mdx b/docs/troubleshooting/workload-public-access-configuration.mdx new file mode 100644 index 000000000..39046929d --- /dev/null +++ b/docs/troubleshooting/workload-public-access-configuration.mdx @@ -0,0 +1,146 @@ +--- +sidebar_position: 3 +title: "Workload Public Access Configuration" +description: "How to configure workloads for public access without VPN connection" +date: "2025-03-18" +category: "workload" +tags: ["workload", "public-access", "vpn", 
"configuration", "staging"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Workload Public Access Configuration + +**Date:** March 18, 2025 +**Category:** Workload +**Tags:** Workload, Public Access, VPN, Configuration, Staging + +## Problem Description + +**Context:** Users need to access staging environments without being connected to VPN, but workloads are configured for private access by default in SleakOps. + +**Observed Symptoms:** + +- Staging workload is only accessible when connected to VPN +- Team members cannot access staging environment without VPN connection +- Need to make workload publicly accessible for broader team access + +**Relevant Configuration:** + +- Environment: Staging +- Access method: Currently VPN-only +- Desired access: Public internet access +- Platform: SleakOps workload management + +**Error Conditions:** + +- Workload inaccessible without VPN connection +- Connection timeouts when accessing from public internet +- Need to modify workload configuration for public access + +## Detailed Solution + + + +To make a workload publicly accessible in SleakOps: + +1. **Navigate to your project** in the SleakOps dashboard +2. **Select the workload** you want to make public +3. **Go to workload settings** or configuration +4. **Find the network/access configuration section** +5. **Change the access type** from "Private" to "Public" +6. **Save and redeploy** the workload + +The workload will now be accessible from the public internet without requiring VPN connection. 
+ + + + + +**Private Access (Default):** + +- Workload is only accessible through VPN +- Higher security for internal applications +- Requires VPN connection for all access + +**Public Access:** + +- Workload is accessible from the public internet +- Suitable for staging environments and public-facing applications +- No VPN connection required +- Should be used with proper authentication and security measures + + + + + +When making workloads publicly accessible: + +1. **Enable authentication**: Ensure your application has proper login mechanisms +2. **Use HTTPS**: Always enable SSL/TLS for public workloads +3. **Implement rate limiting**: Protect against abuse and DDoS attacks +4. **Monitor access logs**: Keep track of who is accessing your application +5. **Regular security updates**: Keep your application and dependencies updated + +```yaml +# Example workload configuration +apiVersion: v1 +kind: Service +metadata: + name: staging-app +spec: + type: LoadBalancer # For public access + ports: + - port: 80 + targetPort: 3000 + selector: + app: staging-app +``` + + + + + +If you're still having issues after making the workload public: + +1. **Check DNS propagation**: It may take a few minutes for DNS changes to propagate +2. **Verify load balancer status**: Ensure the load balancer is properly configured +3. **Check security groups**: Verify that necessary ports are open +4. **Review application logs**: Look for any application-specific errors +5. 
**Test from different networks**: Verify access from multiple locations + +**Common commands for troubleshooting:** + +```bash +# Check service status +kubectl get services + +# Check ingress configuration +kubectl get ingress + +# View workload logs +kubectl logs -f deployment/your-workload-name +``` + + + + + +**Staging Environments:** + +- Can be made public for team collaboration +- Should still have basic authentication +- Consider IP whitelisting for additional security + +**Production Environments:** + +- Evaluate carefully before making public +- Implement comprehensive security measures +- Use proper monitoring and alerting +- Consider gradual rollout strategies + + + +--- + +_This FAQ was automatically generated on March 18, 2025 based on a real user query._ diff --git a/docs/troubleshooting/workload-traffic-routing-issues.mdx b/docs/troubleshooting/workload-traffic-routing-issues.mdx new file mode 100644 index 000000000..ac05280ea --- /dev/null +++ b/docs/troubleshooting/workload-traffic-routing-issues.mdx @@ -0,0 +1,244 @@ +--- +sidebar_position: 3 +title: "Traffic Routing Issues to Backend Pods" +description: "Solution for traffic routing problems when accessing backend services through subdomains" +date: "2024-01-15" +category: "workload" +tags: ["networking", "routing", "subdomain", "backend", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Traffic Routing Issues to Backend Pods + +**Date:** January 15, 2024 +**Category:** Workload +**Tags:** Networking, Routing, Subdomain, Backend, Troubleshooting + +## Problem Description + +**Context:** User has a backend service deployed in SleakOps that can connect to the database successfully, but external traffic through subdomains is not reaching the backend pods. 
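A frequent root cause behind "port-forward works but the subdomain does not" is a selector/label mismatch that leaves the Service with no endpoints. A minimal sketch of the three places that must agree (all names are placeholders):

```yaml
# If the Service selector does not match the pod labels,
# `kubectl get endpoints` shows <none> and the ingress returns
# 502/504 even though port-forwarding directly to the pod works.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  selector:
    matchLabels:
      app: your-backend-app # must match...
  template:
    metadata:
      labels:
        app: your-backend-app # ...these pod labels...
    spec:
      containers:
        - name: backend
          image: your-backend-image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: your-backend-app # ...and this selector
  ports:
    - port: 80
      targetPort: 8080
```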
+ +**Observed Symptoms:** + +- Direct port forwarding to pod works on localhost +- External access through subdomain fails with routing errors +- Backend can connect to database successfully +- Database version 1 is running properly +- HTTP requests never reach the backend service + +**Relevant Configuration:** + +- Backend service: Deployed and running +- Database: Version 1, connected successfully +- Networking: Subdomain configured but not routing traffic +- Port forwarding: Works locally via kubectl + +**Error Conditions:** + +- Subdomain access fails with redirection errors +- Traffic routing fails between ingress and backend pods +- External requests do not reach the application +- Local port forwarding works correctly + +## Detailed Solution + + + +This type of issue typically occurs due to one of these common problems: + +1. **Ingress configuration**: Missing or incorrect ingress rules +2. **Service configuration**: Service not properly exposing the backend +3. **DNS configuration**: Subdomain not properly configured +4. **Network policies**: Blocking external traffic +5. 
**Load balancer issues**: External load balancer not routing correctly + +First, verify that your service is properly configured and running: + +```bash +kubectl get services +kubectl get pods +kubectl get ingress +``` + + + + + +Check if your backend service is properly configured: + +```bash +# Check service details +kubectl describe service your-backend-service + +# Verify service endpoints +kubectl get endpoints your-backend-service +``` + +Ensure your service configuration includes: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: backend-service +spec: + selector: + app: your-backend-app + ports: + - protocol: TCP + port: 80 + targetPort: 8080 # Your backend port + type: ClusterIP +``` + + + + + +Verify your ingress is properly configured for the subdomain: + +```bash +# Check ingress status +kubectl get ingress +kubectl describe ingress your-ingress-name +``` + +Ensure your ingress configuration is correct: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: backend-ingress + annotations: + kubernetes.io/ingress.class: nginx +spec: + rules: + - host: your-subdomain.yourdomain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: backend-service + port: + number: 80 +``` + + + + + +Check if your subdomain is properly configured: + +1. **Test DNS resolution**: + +```bash +nslookup your-subdomain.yourdomain.com +dig your-subdomain.yourdomain.com +``` + +2. **Check if subdomain points to correct IP**: + + - The subdomain should point to your cluster's load balancer IP + - Get the load balancer IP: `kubectl get ingress` + +3. **Verify in SleakOps dashboard**: + - Go to your project settings + - Check the subdomain configuration + - Ensure it's properly linked to your backend service + + + + + +Test connectivity at different levels: + +1. **Test pod directly**: + +```bash +kubectl port-forward pod/your-backend-pod 8080:8080 +curl http://localhost:8080/health +``` + +2. 
**Test service internally**: + +```bash +kubectl run test-pod --image=curlimages/curl -it --rm -- sh +# Inside the pod: +curl http://backend-service/health +``` + +3. **Test ingress internally**: + +```bash +curl -H "Host: your-subdomain.yourdomain.com" http://INGRESS_IP/health +``` + +4. **Test external access**: + +```bash +curl https://your-subdomain.yourdomain.com/health +``` + + + + + +**If service is not working:** + +- Check that service selector matches pod labels +- Verify port configuration matches container port + +**If ingress is not working:** + +- Ensure ingress controller is running: `kubectl get pods -n ingress-nginx` +- Check ingress class annotation +- Verify host configuration matches your subdomain + +**If DNS is not resolving:** + +- Check subdomain configuration in SleakOps dashboard +- Verify DNS propagation (can take up to 24 hours) +- Try using a different DNS server for testing + +**If SSL/TLS issues:** + +- Check certificate status: `kubectl describe certificate` +- Verify cert-manager is working: `kubectl get pods -n cert-manager` + + + + + +In SleakOps platform: + +1. **Check service configuration**: + + - Go to your project → Services + - Verify the backend service is properly configured + - Check port mappings and health checks + +2. **Verify subdomain settings**: + + - Go to project settings → Domains + - Ensure subdomain is active and properly configured + - Check SSL certificate status + +3. **Review logs**: + + - Check backend service logs in the SleakOps dashboard + - Look for any startup errors or connectivity issues + - Check ingress controller logs if available + +4. 
**Network policies**: + - Ensure no network policies are blocking traffic + - Check if any security groups or firewall rules are interfering + + + +--- + +_This FAQ was automatically generated on January 15, 2024 based on a real user query._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current.json b/i18n/es/docusaurus-plugin-content-docs/current.json index 06ee8d685..bb58ea050 100644 --- a/i18n/es/docusaurus-plugin-content-docs/current.json +++ b/i18n/es/docusaurus-plugin-content-docs/current.json @@ -50,5 +50,9 @@ "sidebar.tutorialSidebar.category.VPN": { "message": "VPN", "description": "The label for category VPN in sidebar tutorialSidebar" + }, + "sidebar.tutorialSidebar.category.Troubleshooting": { + "message": "Solución de Problemas", + "description": "The label for category Troubleshooting in sidebar tutorialSidebar" } } diff --git a/i18n/es/docusaurus-plugin-content-docs/current/network.mdx b/i18n/es/docusaurus-plugin-content-docs/current/network.mdx new file mode 100644 index 000000000..86adb6434 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/network.mdx @@ -0,0 +1,71 @@ +--- +id: networking-y-recursos-de-red +title: Networking y Recursos de Red +sidebar_label: Networking y Recursos de Red +sidebar_position: 9 +--- +import Zoom from "react-medium-image-zoom"; + +# Networking y Recursos de Red + +Esta documentación tiene como objetivo brindar una visión clara y sencilla de la arquitectura de red que SleakOps despliega en los entornos de los clientes. Aquí encontrarás cómo se organiza la red, cómo se protegen los recursos y cómo se facilita la comunicación tanto interna como externa. + +> ❓ **Nota**: La red está diseñada para garantizar seguridad, escalabilidad y alta disponibilidad. Permite separar ambientes, proteger datos sensibles y exponer servicios de manera controlada y segura. + +## 1. 
Descripción general de la arquitectura + +La infraestructura de red en SleakOps se basa en los siguientes componentes principales: + +- **VPC (Virtual Private Cloud):** Segmenta la red por entorno (Management, Production, Development). +- **Subredes:** + - *Públicas:* expuestas a Internet. + - *Privadas:* acceso restringido, acceden a Internet a través de un NAT Gateway. + - *Persistencia:* bases de datos, almacenamiento. +- **Internet Gateway:** Permite la comunicación entre la VPC y el exterior (Internet). +- **Route Tables:** Definen las rutas de tráfico entre subredes y hacia/desde Internet. +- **Security Groups:** Firewalls virtuales que controlan el tráfico de entrada y salida de los recursos. +- **DNS Interno:** Permite que los recursos se comuniquen usando nombres en vez de IPs. +- **External-DNS:** Servicio que corre dentro de cada clúster Kubernetes (EKS), encargado de gestionar automáticamente los registros DNS públicos en Route53 para los servicios expuestos desde el clúster. + +## 2. Flujo típico de comunicación + +El siguiente es un ejemplo de cómo viaja el tráfico en la red de SleakOps: + +1. **Acceso desde Internet:** + Un usuario accede a un servicio expuesto (por ejemplo, una API). El tráfico llega al Internet Gateway y es dirigido a la subred pública. + +2. **Control de acceso:** + El Security Group asociado al recurso valida si la conexión está permitida. + +3. **Comunicación interna:** + Los servicios internos (en subredes privadas o de persistencia) pueden comunicarse entre sí usando el DNS interno, siempre bajo las reglas de los Security Groups. + +4. **Exposición de servicios:** + Si un servicio dentro del clúster Kubernetes debe ser accesible desde Internet (por ejemplo, una API), se expone a través de un Application Load Balancer y External-DNS se encarga de registrar automáticamente el nombre en Route53. 
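Como referencia, External-DNS descubre el nombre a publicar a partir del campo `host` del Ingress o de una anotación explícita; un fragmento mínimo con un dominio hipotético:

```yaml
# Fragmento hipotético: External-DNS lee esta anotación y crea
# automáticamente el registro en Route53 apuntando al ALB
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-publica
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.ejemplo.com
```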
+ +> Esta segmentación y control aseguran que solo los servicios necesarios sean expuestos y que los datos sensibles permanezcan protegidos. + + + reference-architecture + + +## 3. External-DNS y Route53 + +Se utiliza una solución automatizada para gestionar los registros DNS públicos de los servicios desplegados, integrando la infraestructura con servicios de DNS externos como Route53. + +- External-DNS **no expone servicios directamente**, sino que automatiza la gestión de registros DNS públicos para recursos ya expuestos (por ejemplo, mediante un Application Load Balancer). +- Esto permite que los servicios sean accesibles de forma segura y sencilla desde Internet. + +## 4. Conectividad entre entornos mediante VPC Peering + +Para permitir la comunicación controlada entre entornos (por ejemplo, entre Management y Production), SleakOps configura **conexiones VPC Peering** de manera explícita entre las VPCs de los distintos entornos. + +- Un **VPC Peering** permite que dos VPCs puedan intercambiar tráfico interno como si estuvieran en la misma red. +- **No requiere** pasar por Internet, NAT Gateway ni VPN. +- Es una conexión directa entre dos redes. + +> 💡 Además del acceso mediante Internet Gateway, SleakOps contempla otros mecanismos de conectividad como **Pritunl VPN**, **NAT Gateway** y **Transit Gateway**, dependiendo del caso de uso y el nivel de aislamiento requerido. 
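Un requisito práctico del VPC Peering es que los CIDRs de ambas VPCs no se solapen (AWS rechaza las rutas de un peering con rangos solapados). Un chequeo rápido y local, con CIDRs `/16` hipotéticos:

```shell
# Compara los prefijos de red de dos VPCs /16 (valores de ejemplo)
MGMT_CIDR="10.0.0.0/16"
PROD_CIDR="10.1.0.0/16"
mgmt_net=$(echo "$MGMT_CIDR" | cut -d. -f1-2)
prod_net=$(echo "$PROD_CIDR" | cut -d. -f1-2)
if [ "$mgmt_net" != "$prod_net" ]; then
  echo "sin solapamiento: el peering puede enrutar ambos rangos"
else
  echo "CIDRs solapados: el peering no podrá enrutar entre las VPCs"
fi
```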
+ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/_category_.json b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/_category_.json new file mode 100644 index 000000000..81dedad99 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/_category_.json @@ -0,0 +1,11 @@ +{ + "label": "Solución de problemas", + "position": 8, + "collapsible": true, + "collapsed": true, + "description": "Problemas comunes y soluciones para usuarios de SleakOps", + "link": { + "type": "doc", + "id": "troubleshooting/index" + } +} diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/api-access-troubleshooting-private-services.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/api-access-troubleshooting-private-services.mdx new file mode 100644 index 000000000..f3596c665 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/api-access-troubleshooting-private-services.mdx @@ -0,0 +1,255 @@ +--- +sidebar_position: 3 +title: "Problemas de Acceso a Servicios API Privados" +description: "Solución de problemas de conectividad con servicios API privados en clústeres de Kubernetes" +date: "2024-12-19" +category: "usuario" +tags: + ["api", "vpn", "servicio-privado", "conectividad", "solución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Acceso a Servicios API Privados + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Usuario +**Etiquetas:** API, VPN, Servicio Privado, Conectividad, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Usuarios que experimentan problemas de conectividad al intentar acceder a servicios API privados desplegados en clústeres de Kubernetes a través de la plataforma SleakOps. 
**Síntomas Observados:**

- Incapacidad para conectarse a servicios API privados
- Solicitudes API que agotan el tiempo de espera o fallan
- Problemas de conectividad intermitentes
- Servicios accesibles internamente pero no externamente

**Configuración Relevante:**

- Tipo de servicio: Servicio API privado
- Red: Clúster Kubernetes con red privada
- Método de acceso: Se requiere conexión VPN
- Exposición del servicio: Red interna del clúster

**Condiciones de Error:**

- Fallos de conexión cuando la VPN no está activa
- API inaccesible desde redes externas
- Sin respuesta de los endpoints del servicio privado
- Errores de tiempo de espera de red

## Solución Detallada

Primero, asegúrese de que su conexión VPN esté activa y correctamente configurada:

1. **Verificar Estado de la VPN:**

   - Confirme que el cliente VPN esté conectado
   - Revise el estado de la conexión en su cliente VPN
   - Asegúrese de estar conectado al perfil VPN correcto

2. **Probar Conectividad VPN:**

   ```bash
   # Probar conectividad a redes internas del clúster
   ping <ip-interna-del-cluster>

   # Verificar si puede resolver el DNS interno del clúster
   nslookup <servicio>.<namespace>.svc.cluster.local
   ```

3. **Verificar Rutas de Red:**

   ```bash
   # Revisar la tabla de rutas
   route -n     # Linux/Mac
   route print  # Windows
   ```

Verifique si el servicio API está correctamente configurado para acceso privado:

1. **Verificar Tipo de Servicio:**

   ```bash
   kubectl get svc -n <namespace>
   ```

2. **Verificar Endpoints del Servicio:**

   ```bash
   kubectl get endpoints <nombre-del-servicio> -n <namespace>
   ```

3. **Revisar Configuración del Servicio:**

   ```yaml
   # Ejemplo de configuración de servicio privado
   apiVersion: v1
   kind: Service
   metadata:
     name: private-api-service
     namespace: default
   spec:
     type: ClusterIP # Solo acceso interno
     ports:
       - port: 80
         targetPort: 8080
     selector:
       app: api-service
   ```

Examine los logs del pod para identificar posibles problemas:

1.
**Revisar Logs del Pod API:**

   ```bash
   # Obtener logs del pod
   kubectl logs <nombre-del-pod> -n <namespace>

   # Seguir logs en tiempo real
   kubectl logs -f <nombre-del-pod> -n <namespace>

   # Obtener logs de todos los contenedores del pod
   kubectl logs <nombre-del-pod> -n <namespace> --all-containers
   ```

2. **Buscar Problemas Comunes:**

   - Tiempos de espera en conexiones
   - Fallos de autenticación
   - Restricciones de recursos
   - Problemas de conectividad a la base de datos

3. **Verificar Estado del Pod:**

   ```bash
   kubectl get pods -n <namespace>
   kubectl describe pod <nombre-del-pod> -n <namespace>
   ```

Realice la solución de problemas a nivel de red:

1. **Probar desde Dentro del Clúster:**

   ```bash
   # Crear un pod de depuración
   kubectl run debug-pod --image=nicolaka/netshoot -it --rm

   # Probar conectividad desde dentro del clúster
   curl http://<servicio>.<namespace>.svc.cluster.local
   ```

2. **Verificar Políticas de Red:**

   ```bash
   kubectl get networkpolicies -n <namespace>
   ```

3. **Verificar Resolución DNS:**

   ```bash
   # Desde el pod de depuración
   nslookup <servicio>.<namespace>.svc.cluster.local
   ```

4. **Probar Conectividad de Puertos:**

   ```bash
   # Probar un puerto específico
   telnet <ip-del-servicio> <puerto>
   nc -zv <ip-del-servicio> <puerto>
   ```

Si necesita acceso externo a la API privada:

1. **Crear Recurso Ingress:**

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: Ingress
   metadata:
     name: private-api-ingress
     namespace: default
     annotations:
       nginx.ingress.kubernetes.io/rewrite-target: /
   spec:
     rules:
       - host: api.tudominio.com
         http:
           paths:
             - path: /
               pathType: Prefix
               backend:
                 service:
                   name: private-api-service
                   port:
                     number: 80
   ```

2. **Configurar TLS (Opcional):**

   ```yaml
   spec:
     tls:
       - hosts:
           - api.tudominio.com
         secretName: api-tls-secret
   ```

**Soluciones Rápidas:**

1. **Reiniciar Conexión VPN:**

   - Desconectar y reconectar la VPN
   - Probar con diferentes servidores VPN si están disponibles

2.
**Limpiar Caché DNS:**

   ```bash
   # Linux (systemd-resolved)
   sudo resolvectl flush-caches

   # macOS
   sudo dscacheutil -flushcache

   # Windows
   ipconfig /flushdns
   ```

3. **Verificar Reglas de Firewall:**

   - Asegurarse de que el firewall local permita tráfico VPN
   - Verificar configuraciones del firewall corporativo

**Mejores Prácticas:**

- Siempre conectarse a la VPN antes de acceder a servicios privados
- Usar descubrimiento de servicios en lugar de IPs codificadas
- Implementar chequeos de salud adecuados para servicios API
- Monitorear los logs de los servicios regularmente
- Configurar alertas para la disponibilidad del servicio

---

_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-cost-monitoring-and-optimization.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-cost-monitoring-and-optimization.mdx new file mode 100644 index 000000000..7dc9ed47f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-cost-monitoring-and-optimization.mdx @@ -0,0 +1,197 @@

---
sidebar_position: 3
title: "Monitoreo y Optimización de Costos en AWS"
description: "Comprendiendo y gestionando los incrementos de costos de AWS en entornos SleakOps"
date: "2025-01-16"
category: "proveedor"
tags: ["aws", "costos", "monitoreo", "optimización", "facturación"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Monitoreo y Optimización de Costos en AWS

**Fecha:** 16 de enero de 2025
**Categoría:** Proveedor
**Etiquetas:** AWS, Costos, Monitoreo, Optimización, Facturación

## Descripción del Problema

**Contexto:** Usuarios que experimentan incrementos mensuales graduales en los costos de AWS (aproximadamente $50/mes) en sus entornos gestionados por SleakOps y necesitan entender las causas raíz y
pronosticar gastos futuros. + +**Síntomas Observados:** + +- Incrementos mensuales de costos de aproximadamente $50 +- Costos que aumentan consistentemente durante varios meses +- Incertidumbre sobre si los incrementos continuarán +- Necesidad de pronósticos de costos precisos + +**Configuración Relevante:** + +- Múltiples entornos AWS gestionados a través de SleakOps +- Despliegues recientes: bases de datos, monitoreo Grafana, Loki +- Cambios en la configuración de NodePools en noviembre +- Múltiples cuentas AWS potencialmente involucradas + +**Condiciones de Error:** + +- Incrementos de costos no directamente atribuibles a cambios en la infraestructura +- Posibles costos por intentos fallidos de creación de cuentas +- Posibles incrementos de costos relacionados con tráfico +- Discrepancias en pronósticos entre diferentes vistas + +## Solución Detallada + + + +### Análisis de Costos Paso a Paso + +1. **Establecer un período base**: Identificar cuándo sus entornos se estabilizaron +2. **Rastrear cambios incrementales**: Documentar cada cambio en la infraestructura con fechas +3. **Separar tráfico de costos de infraestructura**: AWS cobra por tráfico de red +4. **Considerar el momento del despliegue**: Los despliegues a mitad de mes afectan el costo completo del mes siguiente + +### Factores clave que afectan los costos: + +- **Cambios en infraestructura**: Nuevas bases de datos, herramientas de monitoreo +- **Incrementos de tráfico**: Más usuarios = mayores costos de red +- **Momento del despliegue**: Facturación parcial vs. 
facturación completa del mes +- **Múltiples cuentas**: Costos distribuidos en diferentes cuentas AWS + + + + + +### Componentes de Infraestructura y su Impacto en Costos + +| Componente | Costo Mensual Típico | Notas | +| ------------------------ | -------------------- | ------------------ | +| Instancias RDS pequeñas | $15-30 | Por base de datos | +| Monitoreo Grafana + Loki | ~$15 fijo + tráfico | Stack de monitoreo | +| Cambios en NodePool | Mínimo | Usualmente <$5/mes | +| Base de clúster EKS | $72/mes | Por clúster | +| Balanceadores de carga | $16-25/mes | Por ALB/NLB | + +### Costos Relacionados con Tráfico + +- **Transferencia de datos saliente**: $0.09/GB (primer 1GB gratis) +- **Transferencia Inter-Zonas de Disponibilidad**: $0.01/GB +- **NAT Gateway**: $0.045/GB procesado + + + + + +### Uso de AWS Cost Explorer + +1. **Acceder a Cost Explorer** desde la cuenta raíz de AWS +2. **Filtrar por ID de cuenta** para aislar costos por cuenta +3. **Agrupar por servicio** para identificar qué servicios AWS están aumentando +4. **Establecer rangos de fechas** para comparar mes a mes + +### Métricas clave para analizar: + +```bash +# Ejemplo de desglose de costos para buscar: +- Instancias EC2 (cómputo) +- RDS (bases de datos) +- EKS (servicio Kubernetes) +- Transferencia de Datos (red) +- EBS (almacenamiento) +- Balanceadores de carga +``` + +### Preguntas para formular: + +- ¿Cuándo se estabilizaron los entornos? +- ¿Qué se desplegó y cuándo? +- ¿Ha aumentado el tráfico de la aplicación? +- ¿Existen recursos sin usar en cuentas fallidas? + + + + + +### Limitaciones del Pronóstico de Costos de AWS + +- **Pronósticos a principios de mes** (primeros 3-5 días) pueden ser inexactos +- **Variaciones estacionales** afectan las predicciones +- **Nuevos despliegues** sesgan los algoritmos de pronóstico +- **Picos de tráfico** crean inflación temporal en el pronóstico + +### Mejores prácticas para pronósticos: + +1. 
**Esperar hasta mitad de mes** para pronósticos más precisos +2. **Usar tendencias de 3 meses** en lugar de comparaciones de un solo mes +3. **Considerar cambios conocidos** al proyectar +4. **Monitorear gasto diario** para detectar anomalías temprano + +### Monitoreo de Costos en SleakOps + +SleakOps provee visibilidad de costos mediante: + +- Paneles de costos en tiempo real +- Desglose mensual de costos por entorno +- Métricas de utilización de recursos +- Recomendaciones de optimización + + + + + +### Identificación de Costos Innecesarios + +1. **Recursos en cuentas fallidas**: Verificar recursos en cuentas que fallaron durante la configuración inicial +2. **Bases de datos sin uso**: Identificar bases de datos sin conexiones +3. **Instancias sobredimensionadas**: Ajustar tamaño según utilización +4. **Balanceadores de carga huérfanos**: Eliminar ALBs/NLBs sin uso + +### Proceso de limpieza: + +```bash +# Ejemplo de recursos para revisar: +- Volúmenes EBS sin uso +- Instancias EC2 detenidas pero no terminadas +- Balanceadores de carga sin objetivos +- Instancias RDS sin conexiones +- IPs elásticas no asociadas a instancias +``` + +### Estrategias de optimización: + +- **Instancias reservadas** para cargas predecibles +- **Instancias Spot** para cargas no críticas +- **Autoescalado** para ajustar demanda +- **Optimización de almacenamiento** (gp3 vs gp2, políticas de ciclo de vida) + + + + + +### Alertas de Costos en AWS + +Configurar alertas de facturación para: + +- Umbrales de presupuesto mensual +- Patrones de gasto inusuales +- Incrementos de costos por servicio + +### Monitoreo en SleakOps + +- Revisar reportes mensuales de costos +- Monitorear utilización de recursos +- Rastrear costos por entorno +- Configurar notificaciones para cambios significativos + +### Proceso de revisión regular: + +1. **Semanalmente**: Revisar anomalías de costos +2. **Mensualmente**: Revisar tendencias y pronósticos de costos +3. 
**Trimestralmente**: Optimizar asignación de recursos +4. **Anualmente**: Revisar estrategia de instancias reservadas + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 16 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-ec2-public-ip-assignment.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-ec2-public-ip-assignment.mdx new file mode 100644 index 000000000..007790851 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-ec2-public-ip-assignment.mdx @@ -0,0 +1,171 @@ +--- +sidebar_position: 3 +title: "La instancia EC2 no recibe dirección IPv4 pública" +description: "Solución para instancias EC2 creadas en VPC sin asignación de IP pública" +date: "2024-01-15" +category: "proveedor" +tags: ["aws", "ec2", "vpc", "redes", "ip-publica"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# La instancia EC2 no recibe dirección IPv4 pública + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proveedor +**Etiquetas:** AWS, EC2, VPC, Redes, IP Pública + +## Descripción del problema + +**Contexto:** El usuario crea una instancia EC2 dentro de una VPC de producción a través de SleakOps, pero la instancia no recibe una dirección IPv4 pública que pueda ser accedida externamente por proveedores terceros. 
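Antes de revisar la consola, una comprobación local rápida: si la única dirección de la instancia cae en los rangos privados RFC 1918, no es alcanzable desde Internet. Un sketch con una IP hipotética:

```shell
# Determina si una IP (hipotética) pertenece a los rangos privados RFC 1918
IP="10.0.12.34"
case "$IP" in
  10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*)
    echo "privada: sin alcance directo desde Internet" ;;
  *)
    echo "pública" ;;
esac
```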
+ +**Síntomas observados:** + +- Instancia EC2 creada exitosamente en la VPC especificada +- No se asignó dirección IPv4 pública a la instancia +- La instancia no es accesible desde redes externas +- Proveedores terceros no pueden acceder a la instancia + +**Configuración relevante:** + +- Entorno: VPC de producción +- Tipo de instancia: EC2 +- Red: Despliegue basado en VPC +- Requisito de acceso: Se necesita acceso externo para proveedores + +**Condiciones de error:** + +- No se asignó IP pública automáticamente durante la creación de la instancia +- La instancia solo tiene IP privada dentro de la VPC +- No hay conectividad externa disponible + +## Solución detallada + + + +La razón más común por la que una instancia EC2 no obtiene una IP pública es que se lanza en una subred privada o en una subred pública sin habilitar la asignación automática de IP pública. + +**Para verificar la configuración de la subred:** + +1. Ir a **Consola AWS** → **VPC** → **Subredes** +2. Buscar la subred donde se lanzó la instancia +3. Revisar la configuración de **"Asignar automáticamente dirección IPv4 pública"** +4. Si está deshabilitada, esto explica por qué no se asignó IP pública + + + + + +La solución recomendada es asignar una Elastic IP (EIP) a tu instancia: + +**Pasos para asignar Elastic IP:** + +1. Ir a **Consola AWS** → **EC2** → **Elastic IPs** +2. Hacer clic en **"Asignar dirección Elastic IP"** +3. Elegir **el pool de direcciones IPv4 de Amazon** +4. Hacer clic en **"Asignar"** +5. Seleccionar la EIP recién creada +6. Hacer clic en **"Acciones"** → **"Asociar dirección Elastic IP"** +7. Seleccionar tu instancia EC2 +8. 
Hacer clic en **"Asociar"** + +```bash +# Usando AWS CLI +aws ec2 allocate-address --domain vpc +aws ec2 associate-address --instance-id i-1234567890abcdef0 --allocation-id eipalloc-12345678 +``` + + + + + +Una vez que tengas una IP pública, asegúrate de que tus grupos de seguridad permitan el tráfico necesario: + +**Para acceso HTTP/HTTPS:** + +``` +Tipo: HTTP +Protocolo: TCP +Rango de puertos: 80 +Origen: 0.0.0.0/0 + +Tipo: HTTPS +Protocolo: TCP +Rango de puertos: 443 +Origen: 0.0.0.0/0 +``` + +**Para acceso específico de proveedores:** + +``` +Tipo: TCP personalizado +Protocolo: TCP +Rango de puertos: [PUERTO_DE_TU_APLICACIÓN] +Origen: [RANGO_IP_DEL_PROVEEDOR] +``` + + + + + +Asegúrate de que la tabla de rutas de tu subred tenga una ruta hacia una puerta de enlace de Internet: + +1. Ir a **Consola AWS** → **VPC** → **Tablas de rutas** +2. Buscar la tabla de rutas asociada a tu subred +3. Verificar que exista una ruta como: + - **Destino:** `0.0.0.0/0` + - **Objetivo:** `igw-xxxxxxxxx` (Puerta de enlace de Internet) + +Si esta ruta no existe, tu instancia no tendrá acceso a internet incluso con una IP pública. + + + + + +Si gestionas esto a través de SleakOps, puedes configurar la asignación de IP pública: + +1. En la **Configuración del Proyecto** +2. Ir a **Infraestructura** → **Computación** +3. Habilitar **"Asignar IP pública"** para tus instancias EC2 +4. 
O configurar la asignación de **Elastic IP** en las opciones avanzadas + +```yaml +# Ejemplo de configuración en SleakOps +compute: + ec2_instances: + - name: "instancia-produccion" + instance_type: "t3.medium" + subnet_type: "publica" + assign_public_ip: true + elastic_ip: true +``` + + + + + +**Información importante sobre costos:** + +- **Las Elastic IPs son gratuitas** cuando están asociadas a una instancia en ejecución +- **Las Elastic IPs cuestan $0.005/hora** cuando no están asociadas a ninguna instancia +- Se aplican **cargos por transferencia de datos** para tráfico saliente a internet +- Siempre libera las Elastic IPs no usadas para evitar cargos + + + + + +Si no deseas usar Elastic IPs, considera estas alternativas: + +1. **Application Load Balancer (ALB)**: Para aplicaciones web +2. **Network Load Balancer (NLB)**: Para tráfico TCP/UDP +3. **NAT Gateway**: Para acceso a internet solo de salida +4. **Endpoints de VPC**: Para acceso a servicios AWS sin internet + +Estas soluciones pueden proporcionar acceso externo sin asignar IPs públicas directamente a las instancias. 
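Como complemento a la nota de costos anterior, un chequeo rápido (asumiendo AWS CLI con credenciales configuradas) para localizar Elastic IPs sin asociar, que son las que generan cargos por hora:

```shell
# Una EIP sin AssociationId no está asociada a ninguna instancia/ENI
# y genera cargos mientras permanezca reservada.
idle=$(aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]' \
  --output text 2>/dev/null || echo "")
echo "EIPs sin asociar: ${idle:-ninguna (o consulta fallida)}"
```

Cada `AllocationId` listado puede liberarse con `aws ec2 release-address --allocation-id <id>` si ya no se necesita.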
+ + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-marketplace-login-setup-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-marketplace-login-setup-issue.mdx new file mode 100644 index 000000000..fdf3619f7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-marketplace-login-setup-issue.mdx @@ -0,0 +1,164 @@ +--- +sidebar_position: 3 +title: "Problema de Inicio de Sesión y Configuración en AWS Marketplace" +description: "Solución para problemas de inicio de sesión después de suscribirse a SleakOps a través de AWS Marketplace" +date: "2024-01-15" +category: "usuario" +tags: ["aws-marketplace", "autenticación", "inicio de sesión", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Inicio de Sesión y Configuración en AWS Marketplace + +**Fecha:** 15 de enero de 2024 +**Categoría:** Usuario +**Etiquetas:** AWS Marketplace, Autenticación, Inicio de sesión, Configuración + +## Descripción del Problema + +**Contexto:** El usuario se suscribió a SleakOps a través de AWS Marketplace pero encuentra problemas de autenticación al intentar completar la configuración del producto. 
+ +**Síntomas observados:** + +- Después de suscribirse a través de AWS Marketplace, al hacer clic en "Configurar Producto" redirige a la página de registro +- El sistema no reconoce al usuario como conectado +- El inicio de sesión manual redirige a la página principal sin completar la configuración +- El problema persiste tras múltiples intentos en la misma sesión del navegador +- El proceso de configuración nunca se completa con éxito + +**Configuración relevante:** + +- Método de suscripción: AWS Marketplace +- Navegador: Mismo navegador usado durante todo el proceso +- Estado de inicio de sesión: El usuario parece estar conectado pero el sistema no lo reconoce +- Etapa de configuración: Configuración inicial del producto después de la suscripción en el marketplace + +**Condiciones del error:** + +- El error ocurre inmediatamente después de la suscripción en AWS Marketplace +- Sucede al hacer clic en el botón "Configurar Producto" +- Persiste después de intentos de inicio de sesión manual +- Ocurre de forma consistente en múltiples sesiones del navegador + +## Solución Detallada + + + +Cuando te suscribes a SleakOps a través de AWS Marketplace, hay un flujo de autenticación específico que debe completarse: + +1. **Suscripción en AWS Marketplace**: Te suscribes a través de AWS +2. **Redirección a SleakOps**: AWS te redirige a nuestra plataforma con tokens especiales +3. **Vinculación de Cuenta**: Tu cuenta de AWS se vincula con una cuenta de SleakOps +4. **Finalización de Configuración**: Comienza el proceso de configuración del producto + +Si este flujo se interrumpe, pueden ocurrir problemas de autenticación. + + + + + +La causa más común son conflictos con la caché del navegador. Sigue estos pasos: + +1. **Eliminar cookies de SleakOps**: + + - Ve a la configuración de tu navegador + - Busca "Cookies y datos de sitios" + - Busca `sleakops.com` y elimina todas las cookies + +2. 
**Eliminar cookies de AWS Marketplace**: + + - También elimina las cookies de `aws.amazon.com` + - Esto asegura un estado limpio de autenticación + +3. **Limpiar caché del navegador**: + + - Borra imágenes y archivos en caché + - Esto evita que tokens antiguos de autenticación interfieran + +4. **Reiniciar el proceso**: + - Regresa a AWS Marketplace + - Haz clic nuevamente en "Configurar Producto" + + + + + +Para aislar problemas relacionados con el navegador: + +1. **Abre una ventana en modo incógnito/privado** +2. **Ve a AWS Marketplace** +3. **Navega a tu suscripción de SleakOps** +4. **Haz clic en "Configurar Producto"** + +Si esto funciona, el problema está definitivamente relacionado con caché/cookies del navegador. + + + + + +Si el flujo automático falla, puedes vincular manualmente tus cuentas: + +1. **Crea una cuenta en SleakOps** (si no tienes una): + + - Ve a `https://app.sleakops.com/register` + - Usa la misma dirección de correo electrónico que tu cuenta AWS + +2. **Contacta soporte** con: + + - Tu ID de cuenta AWS + - El correo electrónico de tu cuenta SleakOps + - Detalles de tu suscripción en AWS Marketplace + +3. **Nosotros vincularemos manualmente** tus cuentas y activaremos tu suscripción + + + + + +Algunos navegadores tienen políticas de seguridad más estrictas que pueden interferir con la autenticación entre dominios: + +**Navegadores recomendados:** + +- Chrome (última versión) +- Firefox (última versión) +- Safari (si usas macOS) + +**Navegadores a evitar:** + +- Versiones antiguas de Internet Explorer +- Navegadores con bloqueadores de anuncios agresivos +- Navegadores con configuraciones estrictas de privacidad + +**Configuraciones del navegador a revisar:** + +- Desactivar bloqueadores de anuncios para los dominios AWS y SleakOps +- Permitir cookies de terceros temporalmente +- Asegurarse que JavaScript esté habilitado + + + + + +Si aún tienes problemas, sigue este proceso completo de recuperación: + +1. 
**Cerrar sesión de todos los servicios AWS** +2. **Borrar todos los datos del navegador** (cookies, caché, almacenamiento local) +3. **Reiniciar el navegador** +4. **Iniciar sesión en la Consola AWS** +5. **Ir a AWS Marketplace → Administrar suscripciones** +6. **Buscar la suscripción a SleakOps** +7. **Hacer clic en "Configurar Producto"** +8. **Completar el flujo de autenticación sin interrupciones** + +Si esto aún no funciona, contacta a nuestro equipo de soporte con: + +- Tu ID de cuenta AWS +- Capturas de pantalla del error +- Información del navegador y versión + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-waf-application-load-balancer-protection.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-waf-application-load-balancer-protection.mdx new file mode 100644 index 000000000..e02ae4207 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/aws-waf-application-load-balancer-protection.mdx @@ -0,0 +1,361 @@ +--- +sidebar_position: 15 +title: "Configuración de AWS WAF para Protección de Application Load Balancer" +description: "Cómo configurar AWS WAF para proteger tu Application Load Balancer del tráfico malicioso y ataques de bots" +date: "2024-12-19" +category: "provider" +tags: ["aws", "waf", "seguridad", "load-balancer", "proteccion-bots"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de AWS WAF para Protección de Application Load Balancer + +**Fecha:** 19 de diciembre, 2024 +**Categoría:** Provider +**Etiquetas:** AWS, WAF, Seguridad, Load Balancer, Protección de Bots + +## Descripción del Problema + +**Contexto:** Los usuarios que experimentan tráfico malicioso de bots o registros falsos en su plataforma necesitan implementar protección a nivel de infraestructura utilizando AWS 
WAF (Web Application Firewall). + +**Síntomas Observados:** + +- Registros de usuarios falsos apareciendo en la plataforma +- Patrones de tráfico de bots sospechosos +- Amenazas de seguridad potenciales de ataques automatizados +- Necesidad de filtrado de tráfico antes de que llegue a la aplicación + +**Configuración Relevante:** + +- Plataforma: SleakOps en AWS +- Load Balancer: Application Load Balancer (ALB) +- Servicio: AWS WAF v2 +- Protección necesaria: Detección de bots y filtrado de tráfico + +**Condiciones de Error:** + +- Tráfico malicioso evadiendo la seguridad a nivel de aplicación +- Ataques de bots automatizados en endpoints de registro +- Necesidad de filtrado proactivo de tráfico + +## Solución Detallada + + + +AWS WAF (Web Application Firewall) es un servicio de firewall basado en la nube que ayuda a proteger tus aplicaciones web de exploits comunes y bots. Aunque SleakOps aún no tiene integración nativa con WAF, puedes configurarlo fácilmente de forma manual para proteger tu Application Load Balancer. + +**Beneficios Clave:** + +- Bloquea tráfico malicioso antes de que llegue a tu aplicación +- Proporciona detección y mitigación de bots +- Ofrece capacidades de limitación de velocidad +- Incluye conjuntos de reglas administradas para patrones de ataque comunes + + + + + +Antes de configurar AWS WAF, asegúrate de tener: + +**Acceso Requerido:** + +- Acceso a la consola de AWS con permisos apropiados +- Permisos de administrador de WAF +- Capacidad para modificar configuraciones del load balancer + +**Información a Recopilar:** + +1. **ARN del Application Load Balancer** + + ```bash + # Encuentra tu ALB en la consola de AWS o vía CLI + aws elbv2 describe-load-balancers --query 'LoadBalancers[*].[LoadBalancerName,LoadBalancerArn]' + ``` + +2. **Detalles de la Aplicación** + + - Nombre(s) de dominio principal(es) + - Endpoints críticos que necesitan protección (ej., /register, /login) + - Patrones de tráfico legítimo esperados + +3. 
**Requerimientos de Seguridad** + + - Restricciones geográficas necesarias + - Requerimientos de limitación de velocidad + - Sensibilidad de detección de bots + + + + + +Crea una nueva Web ACL para definir tus reglas de protección: + +1. **Navegar a la Consola de AWS WAF** + + - Ve a Consola de AWS → WAF & Shield + - Haz clic en "Create web ACL" + +2. **Configurar Ajustes Básicos** + + ``` + Nombre: sleakops-alb-protection + Descripción: Protección WAF para ALB de SleakOps + Tipo de recurso: Application Load Balancer + ``` + +3. **Agregar Recurso** + + - Selecciona tu Application Load Balancer + - Elige la región donde está ubicado tu ALB + +4. **Establecer Acción Predeterminada** + + - Acción predeterminada: "Allow" (recomendado para configuración inicial) + - Esto permite el tráfico a menos que sea bloqueado por reglas específicas + + + + + +Agrega reglas de protección esenciales a tu Web ACL: + +**1. Conjuntos de Reglas Administradas de AWS (Recomendado):** + +```yaml +# Conjunto de Reglas Principales - Protege contra OWASP Top 10 +Nombre de Regla: AWSManagedRulesCommonRuleSet +Prioridad: 1 +Acción: Block + +# Entradas Maliciosas Conocidas - Protege contra solicitudes maliciosas +Nombre de Regla: AWSManagedRulesKnownBadInputsRuleSet +Prioridad: 2 +Acción: Block + +# Control de Bots - Detección avanzada de bots +Nombre de Regla: AWSManagedRulesBotControlRuleSet +Prioridad: 3 +Acción: Block +``` + +**2. Regla de Limitación de Velocidad Personalizada:** + +```yaml +Nombre de Regla: RateLimitingRule +Prioridad: 4 +Condición: Regla basada en velocidad +Límite de velocidad: 2000 solicitudes por 5 minutos +Acción: Block +Alcance: Todas las solicitudes de una sola IP +``` + +**3. Restricciones Geográficas (Opcional):** + +```yaml +Nombre de Regla: GeoBlockRule +Prioridad: 5 +Condición: Coincidencia geográfica +Países: [Lista de países a bloquear] +Acción: Block +``` + + + + + +Conecta tu Web ACL a tu Application Load Balancer: + +1. 
**En la Configuración de Web ACL** + + - Ve a "Associated AWS resources" + - Haz clic en "Add AWS resources" + +2. **Seleccionar tu ALB** + + - Tipo de recurso: Application Load Balancer + - Selecciona tu load balancer específico + - Haz clic en "Add" + +3. **Verificar Asociación** + + ```bash + # Verificar que WAF está asociado con ALB + aws wafv2 list-web-acls --scope REGIONAL --region us-east-1 + ``` + + + + + +Habilita el logging para monitorear solicitudes bloqueadas y ajustar tus reglas: + +1. **Crear Grupo de Logs de CloudWatch** + + ```bash + # Crear grupo de logs para logs de WAF + aws logs create-log-group --log-group-name aws-waf-logs-sleakops + ``` + +2. **Habilitar Logging de WAF** + + - En la Consola de WAF, ve a tu Web ACL + - Haz clic en "Logging and metrics" + - Habilita logging + - Elige tu grupo de logs de CloudWatch + +3. **Configurar Análisis de Logs** + + ```json + { + "logDestinationConfigs": [ + "arn:aws:logs:us-east-1:123456789012:log-group:aws-waf-logs-sleakops" + ], + "logFormat": "json", + "managedByFirewallManager": false + } + ``` + + + + + +Prueba tu configuración de WAF para asegurar que funciona correctamente: + +**1. Probar Tráfico Legítimo** + +```bash +# Probar que las solicitudes normales pasan +curl -I https://tu-dominio.com/ + +# Debería devolver headers de respuesta normales +``` + +**2. Probar Limitación de Velocidad** + +```bash +# Generar múltiples solicitudes para probar limitación de velocidad +for i in {1..100}; do + curl -s -o /dev/null -w "%{http_code}\n" https://tu-dominio.com/ +done + +# Debería mostrar respuestas 403 después de alcanzar el límite de velocidad +``` + +**3. Monitorear Métricas de WAF** + +- Ve a CloudWatch → Metrics → WAF +- Monitorea solicitudes bloqueadas y permitidas +- Verifica falsos positivos + +**4. 
Revisar Logs** + +```bash +# Consultar logs de WAF para solicitudes bloqueadas +aws logs filter-log-events \ + --log-group-name aws-waf-logs-sleakops \ + --filter-pattern "\"action\":\"BLOCK\"" +``` + + + + + +Optimiza tus reglas de WAF basándote en patrones de tráfico reales: + +**1. Revisar Falsos Positivos** + +- Monitorea solicitudes legítimas siendo bloqueadas +- Ajusta la sensibilidad de las reglas si es necesario +- Agrega reglas de excepción para endpoints específicos + +**2. Reglas Personalizadas para tu Aplicación** + +```yaml +# Bloquear solicitudes a endpoints de admin desde IPs públicas +Nombre de Regla: AdminProtection +Prioridad: 10 +Condición: Ruta coincide con "/admin/*" Y NO IP origen en rango permitido +Acción: Block + +# Proteger endpoint de registro con limitación de velocidad más estricta +Nombre de Regla: RegistrationProtection +Prioridad: 11 +Condición: Ruta coincide con "/register" +Límite de velocidad: 5 solicitudes por 5 minutos +Acción: Block +``` + +**3. Monitorear y Ajustar** + +- Revisión semanal de patrones de tráfico bloqueado +- Ajustar límites de velocidad basados en uso legítimo +- Actualizar restricciones geográficas según sea necesario + + + + + +**Precios de AWS WAF (Aproximados):** + +- Web ACL: $1.00 por mes +- Evaluaciones de reglas: $0.60 por millón de solicitudes +- Grupos de reglas administradas: $1.00-$10.00 por mes cada uno +- Reglas administradas de Control de Bots: $10.00 por mes + $0.80 por millón de solicitudes + +**Consejos de Optimización de Costos:** + +1. **Comenzar con Reglas Esenciales** + + - Empezar con Core Rule Set y Known Bad Inputs + - Agregar Bot Control solo si es necesario + - Monitorear costos antes de agregar reglas administradas adicionales + +2. 
**Optimización de Prioridad de Reglas** + + ```yaml + # Ordenar reglas por probabilidad de coincidencia (más específicas primero) + Prioridad 1: Restricciones geográficas (si aplica) + Prioridad 2: Limitación de velocidad + Prioridad 3: Reglas Administradas de AWS + ``` + +3. **Revisión Regular** + + - Análisis de costos mensual + - Remover reglas no utilizadas + - Optimizar umbrales de limitación de velocidad + + + + + +**Problema 1: Tráfico Legítimo Siendo Bloqueado** + +```bash +# Verificar logs de WAF para solicitudes bloqueadas específicas +aws logs filter-log-events \ + --log-group-name aws-waf-logs-sleakops \ + --filter-pattern "\"action\":\"BLOCK\"" \ + --start-time $(date -d '1 hour ago' +%s)000 + +# Solución: Agregar reglas de excepción o ajustar sensibilidad de reglas +``` + +**Problema 2: WAF No Bloquea Tráfico Esperado** + +- Verificar que Web ACL está asociada con el ALB correcto +- Revisar orden y prioridades de reglas +- Asegurar que la acción predeterminada está configurada correctamente + +**Problema 3: Alta Tasa de Falsos Positivos** + +- Revisar grupos de reglas administradas específicos que causan problemas +- Implementar modo de conteo antes del modo de bloqueo para nuevas reglas +- Agregar reglas de excepción personalizadas para patrones legítimos + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx new file mode 100644 index 000000000..98ff2fb21 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-newrelic-pkg-resources-deprecation-warning.mdx @@ -0,0 +1,192 @@ +--- +sidebar_position: 3 +title: "Fallo en el trabajo de compilación con advertencia de desaprobación de pkg_resources de New Relic" 
+description: "Solución para fallos en la compilación causados por advertencias de desaprobación de pkg_resources de New Relic en entornos Python" +date: "2025-06-10" +category: "proyecto" +tags: ["compilación", "python", "newrelic", "pkg_resources", "despliegue"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en el trabajo de compilación con advertencia de desaprobación de pkg_resources de New Relic + +**Fecha:** 10 de junio de 2025 +**Categoría:** Proyecto +**Etiquetas:** Compilación, Python, New Relic, pkg_resources, Despliegue + +## Descripción del problema + +**Contexto:** Los usuarios experimentan fallos en los trabajos de compilación al intentar desplegar en producción, con mensajes de error relacionados con el uso por parte de New Relic de la API obsoleta pkg_resources, incluso cuando New Relic no se usa explícitamente en su aplicación. + +**Síntomas observados:** + +- Los trabajos de compilación fallan durante el despliegue en producción +- Aparece un mensaje de advertencia sobre la desaprobación de pkg_resources +- El error se origina en el módulo de configuración de New Relic +- El problema bloquea despliegues críticos a producción + +**Configuración relevante:** + +- Versión de Python: 3.9 +- Ubicación del paquete New Relic: `/usr/local/lib/python3.9/site-packages/newrelic/config.py` +- Versión de setuptools: Probablemente 81 o superior +- Entorno de compilación: contenedores gestionados por SleakOps + +**Condiciones del error:** + +- El error ocurre durante el proceso de compilación +- Aparece en la tubería de despliegue a producción +- La advertencia hace referencia a la desaprobación de pkg_resources programada para el 30-11-2025 +- El problema persiste aunque New Relic no se use activamente + +## Solución detallada + + + +Este problema ocurre porque: + +1. **El agente de New Relic está instalado** en el entorno base de Python usado por los contenedores de compilación de SleakOps +2. 
**pkg_resources está desaprobado** en versiones recientes de setuptools (81+) +3. **New Relic no ha actualizado** su código para usar la nueva API importlib.metadata +4. La advertencia se trata como un error en el proceso de compilación + +Aunque no uses New Relic directamente, puede estar instalado como parte de la imagen base del contenedor para propósitos de monitoreo. + + + + + +Agrega esto a tu `requirements.txt` o `pyproject.toml` para fijar setuptools a una versión anterior a la desaprobación: + +**Para requirements.txt:** + +```txt +setuptools<81 +``` + +**Para pyproject.toml:** + +```toml +[build-system] +requires = ["setuptools<81", "wheel"] + +[project] +dependencies = [ + "setuptools<81", + # tus otras dependencias +] +``` + +**Para Dockerfile:** + +```dockerfile +RUN pip install "setuptools<81" +``` + + + + + +Puedes suprimir la advertencia específica configurando variables de entorno en tu configuración de compilación: + +**En tu configuración de despliegue:** + +```yaml +environment: + PYTHONWARNINGS: "ignore::UserWarning:newrelic.config" +``` + +**O suprimir todas las UserWarnings (menos recomendado):** + +```yaml +environment: + PYTHONWARNINGS: "ignore::UserWarning" +``` + +**En Dockerfile:** + +```dockerfile +ENV PYTHONWARNINGS="ignore::UserWarning:newrelic.config" +``` + + + + + +Si New Relic está instalado pero no se usa: + +**Opción 1: Eliminar New Relic completamente** + +```bash +pip uninstall newrelic +``` + +**Opción 2: Actualizar a la última versión de New Relic** + +```bash +pip install --upgrade newrelic +``` + +**Opción 3: Agregar a requirements.txt con la versión más reciente** + +```txt +newrelic>=9.0.0 +``` + +Consulta las [versiones del agente Python de New Relic](https://github.com/newrelic/newrelic-python-agent/releases) para obtener la última versión que soluciona este problema. 
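Antes de aplicar cualquiera de las opciones anteriores, puedes comprobar si tu entorno ya dispara la advertencia. Un chequeo mínimo (supuesto: `python3` disponible en el PATH del contenedor de build):

```shell
# Convierte las UserWarnings en errores solo durante este chequeo:
# si importar pkg_resources falla, tu setuptools ya emite la advertencia.
if python3 -W "error::UserWarning" -c "import pkg_resources" 2>/dev/null; then
  msg="pkg_resources se importa sin advertencias"
else
  msg="advertencia detectada (o pkg_resources no disponible): fija setuptools<81 o suprímela con PYTHONWARNINGS"
fi
echo "$msg"
```

El mismo comando sirve como paso previo en la pipeline para detectar el problema antes del despliegue.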
+ + + + + +Aquí tienes un enfoque completo con Dockerfile que aborda el problema: + +```dockerfile +FROM python:3.9-slim + +# Fijar setuptools para evitar advertencias de desaprobación de pkg_resources +RUN pip install --upgrade pip "setuptools<81" wheel + +# Establecer variable de entorno para suprimir advertencias de New Relic si es necesario +ENV PYTHONWARNINGS="ignore::UserWarning:newrelic.config" + +# Copiar e instalar requerimientos +COPY requirements.txt . +RUN pip install -r requirements.txt + +# Resto de tu Dockerfile +COPY . . +CMD ["python", "app.py"] +``` + + + + + +Para una solución permanente: + +1. **Monitorea actualizaciones de New Relic**: Mantente atento a cuando New Relic lance una versión que use `importlib.metadata` en lugar de `pkg_resources` + +2. **Actualiza imágenes base regularmente**: Asegúrate de que las imágenes base de tus contenedores estén actualizadas con versiones compatibles + +3. **Gestión de dependencias**: Usa herramientas de gestión de dependencias como `pip-tools` o `poetry` para fijar versiones: + +```bash +# Usando pip-tools +pip-compile requirements.in +``` + +4. 
**Ajuste en la pipeline CI/CD**: Añade chequeos en tu pipeline para detectar estas advertencias temprano:
+
+```yaml
+# En tu configuración CI/CD: importar pkg_resources es lo que dispara la advertencia
+script:
+  - python -W error::UserWarning -c "import pkg_resources; print('Sin advertencias')" || echo "Advertencias detectadas"
+```
+
+
+
+---
+
+_Esta FAQ fue generada automáticamente el 10 de junio de 2025 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-pods-stuck-creation.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-pods-stuck-creation.mdx
new file mode 100644
index 000000000..afe08e261
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-pods-stuck-creation.mdx
@@ -0,0 +1,183 @@
+---
+sidebar_position: 3
+title: "Pods de Build Atascados en Estado de Creación"
+description: "Solución para builds que se quedan atascados con pods que nunca completan la creación"
+date: "2025-03-22"
+category: "proyecto"
+tags: ["build", "pods", "despliegue", "solución de problemas"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Pods de Build Atascados en Estado de Creación
+
+**Fecha:** 22 de marzo de 2025
+**Categoría:** Proyecto
+**Etiquetas:** Build, Pods, Despliegue, Solución de problemas
+
+## Descripción del Problema
+
+**Contexto:** El usuario experimenta builds que permanecen atascados en un estado incompleto, con pods que nunca terminan de crearse y que no son visibles en la interfaz.
+ +**Síntomas Observados:** + +- Los builds parecen "fallados" o atascados +- Los pods no son visibles en la interfaz +- Los pods nunca completan su proceso de creación +- El problema persiste durante períodos prolongados (más de 12 horas) +- El proceso de build parece congelado + +**Configuración Relevante:** + +- Plataforma: SleakOps +- Sistema de build: Builds basados en Kubernetes +- Duración: Períodos extendidos sin resolución +- Interfaz de usuario: Pods no aparecen en el panel de control + +**Condiciones de Error:** + +- Builds iniciados pero nunca completados +- El proceso de creación del pod se queda colgado indefinidamente +- No hay progreso visible en el estado del build +- El problema requiere intervención manual para resolverse + +## Solución Detallada + + + +Cuando los builds se quedan atascados y los pods no aparecen, esto típicamente indica: + +1. **Restricciones de recursos**: Recursos insuficientes en el clúster para programar los pods +2. **Problemas al descargar imágenes**: Problemas al obtener las imágenes de contenedor +3. **Problemas de programación en nodos**: Los pods no pueden asignarse a nodos disponibles +4. **Problemas con volúmenes persistentes**: Problemas de almacenamiento que impiden el inicio del pod +5. 
**Conectividad de red**: Problemas con la red del clúster
+
+
+
+
+Para diagnosticar y resolver pods de build atascados:
+
+**Paso 1: Verificar recursos del clúster**
+
+```bash
+# Ver recursos de los nodos
+kubectl top nodes
+
+# Ver estado de pods en el namespace de build
+kubectl get pods -n <namespace> --show-labels
+
+# Describir pods atascados para información detallada
+kubectl describe pod <nombre-del-pod> -n <namespace>
+```
+
+**Paso 2: Verificar pods pendientes**
+
+```bash
+# Listar todos los pods pendientes
+kubectl get pods --all-namespaces --field-selector=status.phase=Pending
+
+# Ver eventos para problemas de programación
+kubectl get events --sort-by=.metadata.creationTimestamp
+```
+
+
+
+
+
+**Solución 1: Reiniciar builds atascados**
+
+En el panel de SleakOps:
+
+1. Navegar a **Proyectos** → **Tu Proyecto**
+2. Ir a la sección **Builds**
+3. Encontrar el build atascado
+4. Hacer clic en **Cancelar Build**
+5. Iniciar un nuevo build
+
+**Solución 2: Limpiar caché de build**
+
+Si los builds se quedan atascados de forma recurrente:
+
+1. Ir a **Configuración del Proyecto**
+2. Navegar a **Configuración de Build**
+3. Habilitar la opción **"Limpiar caché de build"**
+4. Disparar un nuevo build
+
+**Solución 3: Verificar límites de recursos**
+
+Verifica la asignación de recursos de tu proyecto:
+
+```yaml
+# Ejemplo de configuración de recursos
+resources:
+  requests:
+    memory: "512Mi"
+    cpu: "500m"
+  limits:
+    memory: "1Gi"
+    cpu: "1000m"
+```
+
+
+
+
+
+**Buenas prácticas para evitar builds atascados:**
+
+1. **Monitorear uso de recursos**:
+
+   - Revisar regularmente el consumo de recursos del clúster
+   - Establecer solicitudes y límites de recursos adecuados
+   - Monitorear la longitud de la cola de builds
+
+2. **Optimizar configuración de build**:
+
+   - Usar imágenes base más pequeñas cuando sea posible
+   - Implementar estrategias adecuadas de caché de build
+   - Configurar tiempos de espera razonables para builds
+
+3.
**Mantenimiento regular**: + - Limpiar builds antiguos periódicamente + - Monitorear y limpiar imágenes Docker no usadas + - Mantener los entornos de build actualizados + +**Ejemplo de configuración de build:** + +```yaml +build: + timeout: 30m + resources: + requests: + memory: 1Gi + cpu: 500m + cache: + enabled: true + ttl: 24h +``` + + + + + +Contacta al soporte de SleakOps si: + +- Los builds permanecen atascados después de intentar las soluciones anteriores +- Múltiples proyectos se ven afectados simultáneamente +- El problema persiste por más de 2 horas +- Observas problemas de recursos a nivel de clúster + +**Información para proporcionar al contactar soporte:** + +- Nombre del proyecto y ID de build +- Duración del problema +- Cambios recientes en la configuración de build +- Capturas de pantalla del estado atascado del build +- Mensajes de error de los logs de build + + + +--- + +_Esta FAQ fue generada automáticamente el 22 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx new file mode 100644 index 000000000..8ec53af96 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/build-status-discrepancy-lens-vs-platform.mdx @@ -0,0 +1,169 @@ +--- +sidebar_position: 3 +title: "Discrepancia en el Estado de Construcción Entre la Plataforma y Lens" +description: "Solución para el estado de construcción que aparece como 'creando' en la plataforma mientras Lens muestra una finalización exitosa" +date: "2024-01-15" +category: "proyecto" +tags: ["construcción", "despliegue", "lens", "estado", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Discrepancia en el Estado de Construcción Entre la Plataforma y Lens + +**Fecha:** 15 
de enero de 2024
+**Categoría:** Proyecto
+**Etiquetas:** Construcción, Despliegue, Lens, Estado, Solución de Problemas
+
+## Descripción del Problema
+
+**Contexto:** El usuario experimenta una discrepancia entre el estado de construcción mostrado en la plataforma SleakOps y lo que se observa en Lens (IDE de Kubernetes). La plataforma muestra la construcción atascada en estado "creando" mientras que Lens indica que la construcción se completó con éxito.
+
+**Síntomas Observados:**
+
+- Estado de construcción atascado en "creando" en la plataforma SleakOps
+- Lens muestra que la construcción se completó exitosamente
+- Problema de sincronización del estado entre la interfaz de la plataforma y el estado real de Kubernetes
+- La construcción parece estar funcionando a pesar del estado mostrado en la plataforma
+
+**Configuración Relevante:**
+
+- Entorno: Producción
+- Plataforma: SleakOps
+- Herramienta de monitoreo: Lens
+- Proceso de construcción: Parece completarse con éxito en el clúster
+
+**Condiciones de Error:**
+
+- La discrepancia de estado ocurre durante el proceso de construcción
+- La interfaz de la plataforma no refleja el estado real de Kubernetes
+- El problema persiste incluso después de la finalización exitosa de la construcción
+- Puede afectar la visibilidad del flujo de trabajo de despliegue
+
+## Solución Detallada
+
+
+
+Para confirmar el estado real de su construcción:
+
+1. **Verifique directamente los recursos de Kubernetes:**
+
+   ```bash
+   kubectl get pods -n <namespace>
+   kubectl get deployments -n <namespace>
+   kubectl describe deployment <nombre-del-deployment>
+   ```
+
+2. **En Lens:**
+
+   - Navegue a Workloads → Deployments
+   - Revise el estado de su aplicación
+   - Verifique la preparación y estado de ejecución de los pods
+
+3. **Revise los logs de construcción:**
+   ```bash
+   kubectl logs -f deployment/<nombre-del-deployment>
+   ```
+
+
+
+
+
+Para resolver el problema de sincronización del estado:
+
+1. 
**Refresque el panel de SleakOps:**
+
+   - Refresque el navegador con un hard refresh (Ctrl+F5 o Cmd+Shift+R)
+   - Limpie la caché del navegador si es necesario
+
+2. **Dispare una sincronización de estado:**
+
+   - Navegue a su proyecto en SleakOps
+   - Haga clic en "Actualizar Estado" si está disponible
+   - O realice una actualización menor de configuración para forzar la sincronización
+
+3. **Revise los logs de la plataforma:**
+   - Contacte soporte para verificar si existen problemas de sincronización en el controlador
+   - La plataforma puede necesitar reconciliar el estado real del clúster
+
+
+
+
+
+Este problema típicamente ocurre debido a:
+
+1. **Retrasos en la sincronización del controlador:**
+
+   - El controlador de la plataforma puede estar experimentando demoras
+   - Problemas de conectividad de red entre la plataforma y el clúster
+
+2. **Problemas con webhooks o procesamiento de eventos:**
+
+   - Eventos de Kubernetes que no llegan correctamente a la plataforma
+   - Acumulación en la cola de procesamiento de eventos
+
+3. **Caché del estado de recursos:**
+
+   - La plataforma puede estar mostrando un estado en caché
+   - El estado real del clúster ha avanzado más allá de la versión en caché
+
+4. **Limitación de tasa en la API:**
+   - La plataforma puede estar limitada en la tasa de consultas al estado del clúster
+   - Esto causa actualizaciones de estado retrasadas
+
+
+
+
+
+Si su construcción está realmente funcionando (confirmado vía Lens):
+
+1. **Continúe con su flujo de trabajo:**
+
+   - La aplicación probablemente esté funcionando correctamente
+   - El estado en la plataforma eventualmente se sincronizará
+
+2. **Monitoree la salud de la aplicación:**
+
+   ```bash
+   # Verifique los endpoints de la aplicación
+   kubectl get svc -n <namespace>
+
+   # Pruebe la conectividad de la aplicación
+   kubectl port-forward svc/<nombre-del-servicio> 8080:80
+   ```
+
+3. 
**Documente el problema:** + - Tome capturas de pantalla tanto del estado en la plataforma como en Lens + - Anote la marca de tiempo cuando se observó la discrepancia + - Esto ayuda al equipo de soporte a investigar la causa raíz + + + + + +Para minimizar problemas de discrepancia de estado: + +1. **Use múltiples herramientas de monitoreo:** + + - No dependa únicamente de la interfaz de la plataforma + - Mantenga Lens o kubectl a mano para verificación + +2. **Configure alertas de monitoreo:** + + - Configure alertas basadas en el estado real de pods/despliegues + - Use Prometheus/Grafana para monitoreo independiente + +3. **Actualizaciones regulares de la plataforma:** + + - Asegúrese de que la plataforma SleakOps esté actualizada a la última versión + - Las actualizaciones suelen incluir mejoras en la sincronización + +4. **Reporte problemas de estado puntualmente:** + - El reporte temprano ayuda a identificar patrones + - Ayuda al equipo de plataforma a mejorar la sincronización + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/celery-beat-duplicate-execution.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/celery-beat-duplicate-execution.mdx new file mode 100644 index 000000000..678c36b89 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/celery-beat-duplicate-execution.mdx @@ -0,0 +1,367 @@ +--- +sidebar_position: 3 +title: "Ejecución Duplicada de Tareas en Celery Beat" +description: "Solución para evitar que las tareas de Celery Beat se ejecuten múltiples veces al escalar pods del backend" +date: "2024-12-23" +category: "workload" +tags: ["celery", "cronjob", "escalado", "backend", "tareas-duplicadas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Ejecución Duplicada de Tareas en Celery Beat + +**Fecha:** 
23 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Celery, Cronjob, Escalado, Backend, Tareas Duplicadas + +## Descripción del Problema + +**Contexto:** Al ejecutar una aplicación backend con múltiples pods en Kubernetes, las tareas programadas con Celery Beat se ejecutan múltiples veces, una por cada instancia de pod en ejecución. + +**Síntomas Observados:** + +- Las tareas programadas de Celery Beat se ejecutan múltiples veces (p. ej., 4 veces si hay 4 pods del backend) +- Cada instancia de pod ejecuta su propio scheduler de Celery Beat +- Las tareas que deberían ejecutarse una sola vez se duplican en todas las réplicas de pods +- Posible inconsistencia de datos o desperdicio de recursos debido a ejecuciones duplicadas + +**Configuración Relevante:** + +- Despliegue backend: Múltiples pods (p. ej., 4 réplicas) +- Scheduler de tareas: Celery Beat integrado en los pods del backend +- Plataforma: Entorno Kubernetes de SleakOps +- Tipo de carga: Servicio backend con tareas programadas + +**Condiciones de Error:** + +- El problema ocurre cuando el backend se escala a más de 1 réplica +- Cada pod ejecuta Celery Beat de forma independiente +- No hay coordinación entre instancias para las tareas programadas +- Las tareas se ejecutan N veces donde N = número de réplicas del pod backend + +## Solución Detallada + + + +El enfoque recomendado es dejar de usar Celery Beat y migrar a Kubernetes CronJobs. Esto garantiza que las tareas se ejecuten solo una vez sin importar la cantidad de pods del backend. 
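+
+Como referencia, el resultado de esta migración expresado como un manifiesto nativo de Kubernetes sería aproximadamente el siguiente (boceto ilustrativo: la imagen y el comando están tomados de los ejemplos de este documento; ajústalos a tu proyecto). El campo `concurrencyPolicy: Forbid` añade además la garantía de que dos ejecuciones de la misma tarea no se solapen:
+
+```yaml
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: daily-report
+spec:
+  schedule: "0 9 * * *" # Todos los días a las 9:00 AM
+  concurrencyPolicy: Forbid # No permite ejecuciones solapadas de la misma tarea
+  successfulJobsHistoryLimit: 3
+  failedJobsHistoryLimit: 3
+  jobTemplate:
+    spec:
+      template:
+        spec:
+          restartPolicy: Never
+          containers:
+            - name: daily-report
+              image: your-backend-image:latest
+              command: ["python", "manage.py", "send_daily_report"]
+```
+
+En SleakOps esta configuración se gestiona desde la sección de **Ejecuciones** del proyecto, por lo que normalmente no es necesario escribir el manifiesto a mano.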
+ +**Beneficios de los CronJobs:** + +- Ejecución única garantizada por programación +- Programación nativa de Kubernetes +- Mejor aislamiento de recursos +- Monitoreo y depuración más sencilla +- Sin dependencia del escalado de pods del backend + + + + + +**Paso 1: Identificar las Tareas Actuales de Celery Beat** + +Lista todas tus tareas programadas actuales en la configuración de Celery: + +```python +# Ejemplo de configuración actual de celery beat +from celery.schedules import crontab + +beat_schedule = { + 'send-daily-report': { + 'task': 'myapp.tasks.send_daily_report', + 'schedule': crontab(hour=9, minute=0), + }, + 'cleanup-old-data': { + 'task': 'myapp.tasks.cleanup_old_data', + 'schedule': crontab(hour=2, minute=0), + }, +} +``` + +**Paso 2: Crear Ejecuciones CronJob en SleakOps** + +Para cada tarea de Celery Beat, crea una ejecución CronJob separada: + +1. Ve a tu proyecto en SleakOps +2. Navega a la sección **Ejecuciones** +3. Haz clic en **Agregar Ejecución** +4. Selecciona el tipo **CronJob** +5. 
Configura la programación y el comando
+
+
+
+
+
+**Ejemplo 1: CronJob para Reporte Diario**
+
+```yaml
+# Configuración CronJob en SleakOps
+name: daily-report-cronjob
+type: cronjob
+schedule: "0 9 * * *" # Todos los días a las 9:00 AM
+image: your-backend-image:latest
+command: ["python", "manage.py", "send_daily_report"]
+resources:
+  requests:
+    memory: "256Mi"
+    cpu: "100m"
+  limits:
+    memory: "512Mi"
+    cpu: "200m"
+```
+
+**Ejemplo 2: CronJob para Limpieza de Datos**
+
+```yaml
+# Configuración CronJob en SleakOps
+name: cleanup-cronjob
+type: cronjob
+schedule: "0 2 * * *" # Todos los días a las 2:00 AM
+image: your-backend-image:latest
+command: ["python", "manage.py", "cleanup_old_data"]
+resources:
+  requests:
+    memory: "128Mi"
+    cpu: "50m"
+  limits:
+    memory: "256Mi"
+    cpu: "100m"
+```
+
+**Formato de Programación Cron:**
+
+- `0 9 * * *` - Diario a las 9:00 AM
+- `*/15 * * * *` - Cada 15 minutos
+- `0 */6 * * *` - Cada 6 horas
+- `0 0 * * 0` - Semanalmente los domingos a medianoche
+
+
+
+
+
+Después de crear los CronJobs, elimina Celery Beat de tu backend:
+
+**Paso 1: Actualizar Configuración del Backend**
+
+```python
+from celery import Celery
+
+# Eliminar o comentar beat_schedule
+# beat_schedule = {
+#     'send-daily-report': {
+#         'task': 'myapp.tasks.send_daily_report',
+#         'schedule': crontab(hour=9, minute=0),
+#     },
+# }
+
+# Mantener solo la configuración de la app Celery para tareas asíncronas
+app = Celery('myapp')
+app.config_from_object('django.conf:settings', namespace='CELERY')
+```
+
+**Paso 2: Actualizar Despliegue**
+
+Asegúrate de que tu despliegue backend ya no inicie Celery Beat:
+
+```dockerfile
+# Eliminar celery beat del comando de inicio
+# ANTES: CMD ["celery", "-A", "myapp", "beat", "--loglevel=info"]
+# AHORA: Solo ejecutar el servidor web
+CMD ["gunicorn", "myapp.wsgi:application"]
+```
+
+
+
+
+
+Si debes continuar usando Celery Beat, aquí algunas opciones alternativas:
+
+**Opción 1: Pod Dedicado para Celery Beat**
+
+Crear un despliegue 
separado solo para Celery Beat con réplica = 1:
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: celery-beat
+spec:
+  replicas: 1 # Mantener siempre en 1
+  strategy:
+    type: Recreate # Evita dos instancias de Beat durante un rolling update
+  selector:
+    matchLabels:
+      app: celery-beat
+  template:
+    metadata:
+      labels:
+        app: celery-beat # Debe coincidir con el selector
+    spec:
+      containers:
+        - name: celery-beat
+          image: your-backend-image:latest
+          command: ["celery", "-A", "myapp", "beat", "--loglevel=info"]
+```
+
+**Opción 2: Elección de Líder (Compleja)**
+
+Implementar elección de líder para que solo un pod ejecute Celery Beat, pero esto añade complejidad y no es recomendable.
+
+**Por qué los CronJobs son Mejores:**
+
+- Arquitectura más simple
+- Función nativa de Kubernetes
+- Mejor gestión de recursos
+- Depuración más sencilla
+- Sin punto único de falla
+
+
+
+
+
+Después de migrar a CronJobs, monitorea su ejecución para asegurar que todo funcione correctamente:
+
+**Paso 1: Verificar Estado de CronJobs en SleakOps**
+
+1. Ve a la sección **Ejecuciones** de tu proyecto
+2. Verifica que tus CronJobs aparezcan en la lista
+3. Revisa los tiempos de **Última Ejecución** y **Próxima Ejecución**
+4. 
Monitorea la columna **Estado** para detectar fallas
+
+**Paso 2: Ver Logs de CronJobs**
+
+```bash
+# Verificar historial de ejecución de CronJobs
+kubectl get cronjobs
+
+# Ver ejecuciones de trabajos recientes
+kubectl get jobs
+
+# Revisar logs de una ejecución específica
+kubectl logs job/daily-report-cronjob-<id-de-ejecucion>
+```
+
+**Paso 3: Configurar Alertas (Opcional)**
+
+Configura alertas para ejecuciones fallidas de CronJob:
+
+```yaml
+# Ejemplo de configuración de alertas
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: cronjob-alerts
+data:
+  alert-rules.yaml: |
+    groups:
+      - name: cronjob.rules
+        rules:
+          - alert: CronJobFailed
+            expr: kube_job_status_failed > 0
+            for: 0m
+            labels:
+              severity: warning
+            annotations:
+              summary: "CronJob {{ $labels.job_name }} falló"
+```
+
+**Paso 4: Verificar No Hay Ejecuciones Duplicadas**
+
+Monitorea los logs de tu aplicación para confirmar que las tareas ya no se ejecutan múltiples veces:
+
+```bash
+# Revisar logs de aplicación para ejecuciones duplicadas de tareas
+kubectl logs -l app=your-backend-app | grep "task_name"
+
+# Deberías ver una sola ejecución por tiempo programado, no múltiples
+```
+
+
+
+
+
+Usa esta lista de verificación para confirmar que tu migración fue exitosa:
+
+**Lista Pre-Migración:**
+
+- [ ] Documentar todas las tareas actuales de Celery Beat y sus horarios
+- [ ] Identificar los comandos necesarios para ejecutar cada tarea
+- [ ] Planificar el cronograma de migración para evitar interrupciones del servicio
+- [ ] Preparar plan de rollback si es necesario
+
+**Lista Post-Migración:**
+
+- [ ] Todos los CronJobs están creados y visibles en SleakOps
+- [ ] Los horarios de CronJob coinciden con los horarios originales de Celery Beat
+- [ ] La primera ejecución de cada CronJob se completa exitosamente
+- [ ] No hay ejecuciones duplicadas de tareas en los logs de aplicación
+- [ ] Configuración de Celery Beat eliminada del código backend
+- [ ] El despliegue backend ya no inicia el proceso Celery 
Beat +- [ ] El uso de recursos está optimizado (sin procesos Celery Beat inactivos) + +**Lista de Pruebas:** + +- [ ] Escalar pods backend hacia arriba y abajo - verificar que las tareas sigan ejecutándose una vez +- [ ] Activar manualmente un CronJob para probar ejecución +- [ ] Verificar manejo de fallas y lógica de reintento de CronJob +- [ ] Confirmar que las tareas programadas mantienen el tiempo esperado +- [ ] Confirmar que las interacciones con base de datos/servicios externos funcionen correctamente + + + + + +**Problema 1: CronJob No Se Ejecuta** + +```bash +# Verificar configuración de CronJob +kubectl describe cronjob your-cronjob-name + +# Causas comunes: +# - Formato incorrecto de programación cron +# - Variables de entorno requeridas faltantes +# - Errores de pull de imagen +# - Restricciones de recursos +``` + +**Solución:** +- Verifica la sintaxis de programación cron usando validadores cron en línea +- Asegúrate de que todas las variables de entorno estén configuradas correctamente +- Verifica que la imagen de contenedor sea accesible +- Revisa las solicitudes y límites de recursos + +**Problema 2: CronJob Falla pero la Tarea de Celery Tendría Éxito** + +Diferencias comunes al migrar desde Celery Beat: + +```python +# Celery Beat se ejecuta en contexto de aplicación +# CronJob se ejecuta como contenedor separado - asegurar: + +# 1. Las conexiones de base de datos están configuradas correctamente +DATABASES = { + 'default': { + 'ENGINE': 'django.db.backends.postgresql', + 'HOST': os.environ.get('DB_HOST'), + # ... otras configuraciones + } +} + +# 2. Todas las variables de entorno requeridas están disponibles +# 3. 
La tarea puede ejecutarse independientemente sin contexto de worker Celery +``` + +**Problema 3: Comportamiento Diferente de Zona Horaria** + +```yaml +# Asegurar zona horaria consistente en CronJob +spec: + schedule: "0 9 * * *" + timeZone: "UTC" # Establecer zona horaria explícitamente + jobTemplate: + spec: + template: + spec: + containers: + - name: task + env: + - name: TZ + value: "UTC" +``` + + + +_Esta sección de preguntas frecuentes fue generada automáticamente el 23 de diciembre de 2024 basada en una consulta real de usuario._ \ No newline at end of file diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-build-stage-failure.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-build-stage-failure.mdx new file mode 100644 index 000000000..9e4efb14b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-build-stage-failure.mdx @@ -0,0 +1,180 @@ +--- +sidebar_position: 3 +title: "Fallo en la Etapa de Construcción CI/CD" +description: "Solución de problemas de fallos en la construcción de pipelines CI/CD en entornos específicos" +date: "2024-10-10" +category: "proyecto" +tags: + ["ci-cd", "construcción", "pipeline", "despliegue", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en la Etapa de Construcción CI/CD + +**Fecha:** 10 de octubre de 2024 +**Categoría:** Proyecto +**Etiquetas:** CI/CD, Construcción, Pipeline, Despliegue, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario experimenta fallos en la construcción del pipeline CI/CD en el entorno 'stage' mientras que el mismo código funciona correctamente en el entorno 'develop' y localmente. El trabajo de construcción falla antes de llegar a la etapa de compilación. 
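+
+Antes de revisar la configuración en detalle, puede ayudar comparar directamente desde Git qué cambió entre las dos ramas (boceto orientativo: los nombres de rama `develop` y `stage` y la ruta `.github/workflows/` son suposiciones de este ejemplo; ajústalos a tu repositorio):
+
+```bash
+# Descargar el estado más reciente de ambas ramas
+git fetch origin develop stage
+
+# Diferencias en los archivos de CI/CD entre develop y stage
+git diff origin/develop origin/stage -- .github/workflows/
+
+# Commits presentes en stage que no están en develop
+git log --oneline origin/develop..origin/stage
+```
+
+Si el diff muestra diferencias en los workflows, ese suele ser el punto de partida para explicar por qué una rama construye y la otra no.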
+ +**Síntomas Observados:** + +- La construcción falla en el entorno 'stage' pero funciona en 'develop' +- La compilación y ejecución local funcionan correctamente +- El error ocurre antes de que el trabajo de construcción llegue al clúster +- El pipeline falla en la etapa de construcción, impidiendo el despliegue +- El último commit exitoso continúa funcionando mientras los nuevos commits fallan al construir + +**Configuración Relevante:** + +- Proyecto: byron-backoffice-service-dev-develop +- Entorno: 'stage' (fallando) vs 'develop' (funcionando) +- Último commit exitoso: d419a838b44f1c11b22adf68f6cf984170def38f +- Pipeline CI/CD configurado a través de SleakOps + +**Condiciones de Error:** + +- El error ocurre durante la etapa de construcción CI/CD +- El fallo sucede antes del intento de compilación +- Solo afecta al entorno 'stage' +- Impide que se creen nuevos despliegues + +## Solución Detallada + + + +Cuando las construcciones CI/CD fallan en un entorno pero funcionan en otro, el problema suele estar relacionado con: + +1. **Archivos de configuración CI/CD específicos del entorno** +2. **Variables de entorno o secretos diferentes** +3. **Configuraciones de pipeline específicas por rama** +4. **Restricciones de recursos en el entorno objetivo** + +Comience comparando los archivos de configuración CI/CD entre los entornos. + + + + + +Para comprobar si el archivo CI/CD está configurado correctamente: + +1. **Acceda al panel de su proyecto en SleakOps** +2. **Navegue a la sección CI/CD** +3. **Compare la configuración del pipeline** entre 'develop' y 'stage' +4. 
**Verifique que el archivo CI/CD fue copiado correctamente** desde SleakOps + +**Problemas comunes a revisar:** + +```yaml +# Verifique diferencias específicas del entorno +stages: + - build + - test + - deploy + +variables: + ENVIRONMENT: "stage" # Asegúrese que coincida + +build: + stage: build + script: + - echo "Construyendo para $ENVIRONMENT" + # Verifique que los comandos de construcción sean idénticos +``` + + + + + +Para disparar una construcción nueva: + +**Opción 1: Desde el Panel de SleakOps** + +1. Vaya a su proyecto en SleakOps +2. Navegue a la sección **Despliegues** +3. Haga clic en el botón **"Build"** para el entorno stage +4. Monitoree los registros de construcción para mensajes de error específicos + +**Opción 2: Desde el Repositorio Git** + +1. Realice un pequeño commit (como actualizar un comentario) +2. Haga push a la rama que dispara el pipeline stage +3. Monitoree la ejecución del pipeline + + + + + +Dado que 'develop' funciona pero 'stage' falla, compare estas configuraciones: + +**Variables de Entorno:** + +- Verifique que todas las variables de entorno requeridas estén definidas para 'stage' +- Confirme que los secretos y credenciales estén configurados correctamente +- Asegúrese que las conexiones a bases de datos y URLs de servicios externos sean correctas + +**Asignación de Recursos:** + +- Verifique que el entorno 'stage' tenga recursos suficientes +- Revise si existen cuotas o límites de recursos +- Asegúrese que el clúster tenga capacidad disponible + +**Configuración de Ramas:** + +```yaml +# Verifique que los disparadores del pipeline estén configurados correctamente +only: + - develop # para entorno develop + - stage # para entorno stage +``` + + + + + +Para obtener información más detallada sobre el fallo: + +1. **Acceda a los logs de construcción** en su plataforma CI/CD (GitHub Actions, GitLab CI, etc.) +2. **Busque el mensaje exacto de error** que ocurre antes de la compilación +3. 
**Revise fallos comunes previos a la construcción:**
+   - Fallos al descargar imágenes Docker
+   - Dependencias o herramientas faltantes
+   - Problemas de permisos
+   - Problemas de conectividad de red
+
+**Patrones comunes de errores previos a la construcción:**
+
+```bash
+# Errores relacionados con Docker (los mensajes reales aparecen en inglés)
+Error: failed to pull image
+Cannot connect to the Docker daemon
+
+# Errores de permisos
+Permission denied
+Access denied
+
+# Errores de red/conectividad
+Connection timed out
+DNS resolution failed
+```
+
+
+
+
+
+Si necesita funcionalidad inmediata mientras soluciona problemas:
+
+1. **Mantenga el despliegue actual funcionando** (commit d419a838b44f1c11b22adf68f6cf984170def38f)
+2. **Cree una rama hotfix** desde el último commit que funciona si necesita cambios urgentes
+3. **Pruebe las correcciones primero en 'develop'** antes de aplicarlas en 'stage'
+
+Esto asegura que su aplicación permanezca funcional mientras investiga la causa raíz.
+
+
+
+---
+
+_Esta FAQ fue generada automáticamente el 10 de octubre de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-github-actions-not-triggering.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-github-actions-not-triggering.mdx
new file mode 100644
index 000000000..62d310f06
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-github-actions-not-triggering.mdx
@@ -0,0 +1,165 @@
+---
+sidebar_position: 3
+title: "La canalización CI/CD de GitHub Actions no se activa"
+description: "Solución para cuando las compilaciones de integración continua dejan de funcionar tras un push al repositorio"
+date: "2024-01-15"
+category: "proyecto"
+tags: ["github", "ci-cd", "despliegue", "solución de problemas", "pipeline"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# La canalización CI/CD de GitHub 
Actions no se activa + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** GitHub, CI/CD, Despliegue, Solución de problemas, Pipeline + +## Descripción del problema + +**Contexto:** El usuario experimenta problemas con la canalización de integración continua en SleakOps donde los flujos de trabajo de GitHub Actions dejan de activar compilaciones y despliegues automáticos tras hacer push a la rama de desarrollo. + +**Síntomas observados:** + +- El push a la rama de desarrollo ya no activa compilaciones automáticas +- La canalización CI/CD que funcionaba previamente ha dejado de funcionar +- No hay ejecución visible de flujos de trabajo en GitHub Actions +- No se crean despliegues automáticamente + +**Configuración relevante:** + +- Repositorio: GitHub +- Rama: dev (rama de desarrollo) +- Proyecto: mx-simplee-web +- Canalización: GitHub Actions con configuración YAML +- Plataforma: Integración con SleakOps + +**Condiciones de error:** + +- La canalización funcionaba antes pero dejó de hacerlo repentinamente +- No hay cambios aparentes en el archivo de configuración YAML +- Los eventos push no disparan la ejecución del flujo de trabajo +- No hay mensajes de error visibles en el repositorio + +## Solución detallada + + + +Primero, verifica el estado actual de tus flujos de trabajo en GitHub Actions: + +1. Ve a tu repositorio en GitHub +2. Haz clic en la pestaña **Actions** +3. Busca las ejecuciones más recientes de los flujos de trabajo +4. Comprueba si los flujos de trabajo: + - No se están activando en absoluto + - Fallan durante la ejecución + - Están en cola pero no se ejecutan + +Esto ayudará a identificar si el problema está en la activación o en la ejecución. + + + + + +Aunque no hayas cambiado el archivo YAML recientemente, verifica su estado actual: + +1. Revisa la ubicación del archivo de flujo de trabajo: `.github/workflows/[nombre-del-flujo].yml` +2. 
Verifica la configuración del disparador: + +```yaml +on: + push: + branches: [dev, main] + pull_request: + branches: [dev, main] +``` + +3. Asegúrate de que el nombre de la rama coincida exactamente (sensible a mayúsculas/minúsculas) +4. Revisa que no haya errores de sintaxis usando un validador YAML + + + + + +Las reglas de protección de rama pueden impedir que los flujos de trabajo se ejecuten: + +1. Ve a **Settings** → **Branches** en tu repositorio +2. Revisa si existen reglas de protección en tu rama `dev` +3. Asegúrate de que "Restringir pushes que crean archivos" no esté bloqueando el flujo de trabajo +4. Verifica que las verificaciones de estado requeridas no estén impidiendo la ejecución + + + + + +Comprueba si GitHub Actions tiene los permisos necesarios: + +1. Ve a **Settings** → **Actions** → **General** +2. Asegúrate de que "Permisos de Actions" esté configurado para permitir flujos de trabajo +3. Revisa "Permisos de flujo de trabajo" - debe ser "Permisos de lectura y escritura" +4. Verifica que "Permitir que GitHub Actions cree y apruebe pull requests" esté habilitado si es necesario + + + + + +Revisa la integración entre GitHub y SleakOps: + +1. En el panel de SleakOps, ve a la configuración de tu proyecto +2. Verifica que la URL del repositorio conectado sea correcta +3. Comprueba que la rama objetivo esté configurada como `dev` +4. Asegúrate de que el webhook siga activo en GitHub: + - Ve a **Settings** → **Webhooks** en tu repositorio + - Busca el webhook de SleakOps y verifica que esté activo + - Revisa las entregas recientes para detectar solicitudes fallidas + + + + + +Si el problema persiste, intenta recrear el flujo de trabajo: + +1. Crea una nueva rama desde `dev` +2. Haz un cambio pequeño para activar el flujo de trabajo +3. 
Crea un flujo de trabajo de prueba simple: + +```yaml +name: Test CI +on: + push: + branches: [dev] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Test + run: echo "Workflow está funcionando" +``` + +4. Haz push de los cambios y verifica que el flujo de trabajo se active +5. Si funciona, agrega gradualmente los pasos originales de tu flujo de trabajo + + + + + +Basado en escenarios comunes, prueba estas soluciones: + +1. **Volver a hacer push para activar**: Haz un commit pequeño y haz push de nuevo +2. **Verificar cuotas**: Asegúrate de no haber excedido los minutos de GitHub Actions +3. **Reiniciar flujos de trabajo**: Cancela cualquier flujo atascado y vuelve a intentarlo +4. **Actualizar acción de checkout**: Usa la última versión `actions/checkout@v4` +5. **Limpiar caché**: Elimina cachés de flujos de trabajo si están corruptos + +```bash +# Forzar activación del flujo con un commit vacío +git commit --allow-empty -m "Trigger workflow" +git push origin dev +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx new file mode 100644 index 000000000..4a868576c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ci-cd-pipeline-setup-troubleshooting.mdx @@ -0,0 +1,189 @@ +--- +sidebar_position: 3 +title: "La canalización CI/CD no se activa al hacer push en la rama" +description: "Guía de solución de problemas para canalizaciones CI/CD que no activan compilaciones y despliegues automáticos" +date: "2024-03-14" +category: "proyecto" +tags: ["ci-cd", "pipeline", "git", "despliegue", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# 
La canalización CI/CD no se activa al hacer push en la rama + +**Fecha:** 14 de marzo de 2024 +**Categoría:** Proyecto +**Etiquetas:** CI/CD, Pipeline, Git, Despliegue, Solución de problemas + +## Descripción del problema + +**Contexto:** El usuario hace push de cambios de código a una rama de desarrollo, pero la canalización CI/CD no se activa automáticamente para crear nuevas imágenes de contenedor y desplegar actualizaciones al entorno. + +**Síntomas observados:** + +- Los push a la rama de desarrollo no activan la ejecución de la canalización +- No se construyen automáticamente nuevas imágenes de contenedor +- Los despliegues no se actualizan con los últimos cambios de código +- La canalización parece estar inactiva o mal configurada + +**Configuración relevante:** + +- Rama objetivo: Rama de desarrollo (por ejemplo, `dev`, `develop`) +- Repositorio: Control de versiones basado en Git +- Canalización: Integración CI/CD de SleakOps +- Comportamiento esperado: Construcción y despliegue automáticos al hacer push + +**Condiciones de error:** + +- La canalización no se activa tras hacer git push +- No se observa actividad de compilación en el panel del proyecto +- El despliegue permanece en la versión anterior +- No se generan mensajes de error ni registros en la canalización + +## Solución detallada + + + +Primero, asegúrate de que tu proyecto esté configurado para monitorear la rama correcta: + +1. Navega a **Proyecto > Configuración > Configuración General** +2. Revisa el ajuste de **Rama Objetivo** +3. Verifica que coincida con la rama a la que haces push (por ejemplo, `dev`, `develop`, `main`) +4. 
Guarda los cambios si necesitas actualizar el nombre de la rama + +**Problemas comunes:** + +- Proyecto configurado para `main` pero se hace push a `dev` +- Desajuste en el nombre de la rama (por ejemplo, `develop` vs `development`) +- Problemas de sensibilidad a mayúsculas y minúsculas en nombres de ramas + + + + + +Asegúrate de que tu repositorio tenga un archivo de canalización correctamente configurado: + +1. Ve a **Proyecto > Configuración > Pipelines Git** +2. Revisa el ejemplo YAML proporcionado +3. Crea o actualiza tu archivo de canalización en el repositorio +4. Nombres comunes de archivo: `.sleakops.yml`, `.sleakops/pipeline.yml` + +**Ejemplo de configuración de canalización:** + +```yaml +version: "1" +pipeline: + triggers: + - branch: dev + on: push + stages: + - name: build + steps: + - name: build-image + action: docker-build + dockerfile: Dockerfile + - name: deploy + steps: + - name: deploy-to-dev + action: deploy + environment: development +``` + +**Puntos clave:** + +- Asegúrate de que la `branch` en triggers coincida con tu rama objetivo +- Incluye las etapas `build` y `deploy` +- Verifica que el archivo de canalización esté en la raíz del repositorio o en el directorio `.sleakops/` + + + + + +Configura la clave API requerida para la autenticación de la canalización: + +1. Ve a **Configuración > CLI** en el panel de SleakOps +2. Genera o copia tu clave API +3. 
En la configuración de tu repositorio Git, añade una nueva variable de entorno: + - **Nombre:** `SLEAKOPS_KEY` + - **Valor:** Tu clave API del paso 2 + - **Ámbito:** Disponible para procesos de pipeline/CI + +**Para diferentes proveedores de Git:** + +**GitHub:** + +- Ve a Repositorio > Configuración > Secrets and variables > Actions +- Añade un nuevo secreto de repositorio: `SLEAKOPS_KEY` + +**GitLab:** + +- Ve a Proyecto > Configuración > CI/CD > Variables +- Añade la variable: `SLEAKOPS_KEY` (márcala como protegida si es necesario) + +**Bitbucket:** + +- Ve a Repositorio > Configuración del repositorio > Pipelines > Variables del repositorio +- Añade la variable: `SLEAKOPS_KEY` + + + + + +Después de completar la configuración, verifica que todo funcione: + +1. **Revisar estado de la canalización:** + + - Ve al panel de tu proyecto + - Busca actividad de canalización en la sección **Despliegues** o **Compilaciones** + +2. **Probar con un cambio pequeño:** + + - Realiza un cambio menor en tu código + - Haz commit y push a la rama objetivo + - Monitorea la ejecución de la canalización + +3. **Revisar registros:** + + - Consulta los registros de la canalización para mensajes de error + - Verifica que se ejecuten los pasos de compilación y despliegue + +4. 
**Puntos comunes de verificación:** + - El archivo de canalización está comprometido en el repositorio + - Los nombres de las ramas coinciden exactamente (sensibles a mayúsculas/minúsculas) + - La clave API es válida y tiene permisos necesarios + - El webhook del repositorio está configurado (generalmente automático) + + + + + +Si la canalización aún no se activa: + +**Verificar configuración del webhook:** + +- Confirma que tu proveedor Git tenga la URL correcta del webhook +- Prueba la entrega del webhook en la configuración de tu proveedor Git + +**Validar permisos de la clave API:** + +- Asegúrate de que la clave API tenga permisos para despliegues del proyecto +- Intenta regenerar la clave API si es antigua + +**Revisar sintaxis de la canalización:** + +- Valida la sintaxis YAML usando un validador YAML en línea +- Verifica errores de indentación en el archivo de canalización + +**Contactar soporte:** +Si los problemas persisten, proporciona la siguiente información: + +- Nombre e ID del proyecto +- Nombre de la rama objetivo +- SHA del commit reciente que debería haber activado la canalización +- Cualquier mensaje de error desde el panel + + + +--- + +_Esta FAQ fue generada automáticamente el 14 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cicd-pip-to-pipx-installation.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cicd-pip-to-pipx-installation.mdx new file mode 100644 index 000000000..9b68d6d1c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cicd-pip-to-pipx-installation.mdx @@ -0,0 +1,188 @@ +--- +sidebar_position: 3 +title: "Error en la Pipeline CI/CD - Método de Instalación de SleakOps" +description: "Solución para fallos en la pipeline CI/CD al instalar la herramienta CLI de SleakOps" +date: "2024-10-15" +category: "proyecto" +tags: ["cicd", "pipeline", "instalación", "pipx", "python"] +--- + +import 
TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error en la Pipeline CI/CD - Método de Instalación de SleakOps + +**Fecha:** 15 de octubre de 2024 +**Categoría:** Proyecto +**Etiquetas:** CI/CD, Pipeline, Instalación, pipx, Python + +## Descripción del Problema + +**Contexto:** Los flujos de trabajo CI/CD fallan al intentar instalar la herramienta CLI de SleakOps usando el método tradicional `pip install` en GitHub Actions u otros entornos CI/CD. + +**Síntomas Observados:** + +- Fallos en la pipeline CI/CD durante el paso de instalación de SleakOps +- Errores de instalación en la ejecución del workflow +- Interrupción del proceso de construcción en la fase de instalación de dependencias + +**Configuración Relevante:** + +- Método de instalación: `pip install sleakops` (incorrecto) +- Entorno: pipeline CI/CD (GitHub Actions, GitLab CI, etc.) +- Gestor de paquetes Python: pip vs pipx + +**Condiciones de Error:** + +- El error ocurre durante la ejecución del workflow +- Sucede específicamente en el paso de instalación de SleakOps +- Puede estar relacionado con conflictos de dependencias o problemas de aislamiento + +## Solución Detallada + + + +La solución es reemplazar `pip install sleakops` por `pipx install sleakops` en la configuración de tu workflow CI/CD. + +**Antes (incorrecto):** + +```yaml +- name: Install SleakOps + run: pip install sleakops +``` + +**Después (correcto):** + +```yaml +- name: Install SleakOps + run: pipx install sleakops +``` + + + + + +`pipx` es el método recomendado para instalar aplicaciones CLI de Python porque: + +1. **Aislamiento**: Crea entornos aislados para cada aplicación +2. **Sin conflictos**: Previene conflictos de dependencias con otros paquetes Python +3. **Instalación limpia**: Mantiene limpio tu entorno global de Python +4. 
**Mejor para herramientas CLI**: Diseñado específicamente para aplicaciones de línea de comandos + + + + + +Aquí tienes un ejemplo completo de cómo instalar correctamente SleakOps en un workflow de GitHub Actions: + +```yaml +name: Deploy with SleakOps + +on: + push: + branches: [main] + +jobs: + deploy: + runs-on: ubuntu-latest + + steps: + - name: Checkout code + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: "3.9" + + - name: Install pipx + run: | + python -m pip install --upgrade pip + python -m pip install pipx + python -m pipx ensurepath + + - name: Install SleakOps + run: pipx install sleakops + + - name: Deploy application + run: | + sleakops deploy + env: + SLEAKOPS_TOKEN: ${{ secrets.SLEAKOPS_TOKEN }} +``` + + + + + +**GitLab CI (.gitlab-ci.yml):** + +```yaml +stages: + - deploy + +deploy: + stage: deploy + image: python:3.9 + before_script: + - pip install pipx + - pipx install sleakops + script: + - sleakops deploy +``` + +**Pipeline de Jenkins:** + +```groovy +pipeline { + agent any + stages { + stage('Install SleakOps') { + steps { + sh 'pip install pipx' + sh 'pipx install sleakops' + } + } + stage('Deploy') { + steps { + sh 'sleakops deploy' + } + } + } +} +``` + + + + + +Si aún encuentras problemas después de cambiar a pipx: + +1. **Asegúrate de que pipx esté instalado:** + + ```bash + python -m pip install pipx + python -m pipx ensurepath + ``` + +2. **Verifica la compatibilidad de la versión de Python:** + + - SleakOps requiere Python 3.7 o superior + - Usa `python --version` para comprobar + +3. **Limpia la caché de pipx si es necesario:** + + ```bash + pipx uninstall sleakops + pipx install sleakops + ``` + +4. 
**Verifica la instalación:** + ```bash + sleakops --version + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de octubre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cloudfront-existing-s3-bucket.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cloudfront-existing-s3-bucket.mdx new file mode 100644 index 000000000..c85d2864b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cloudfront-existing-s3-bucket.mdx @@ -0,0 +1,164 @@ +--- +sidebar_position: 3 +title: "CDN CloudFront para un Bucket S3 Existente" +description: "Cómo crear una distribución CloudFront para un bucket S3 ya creado en SleakOps" +date: "2024-12-19" +category: "dependency" +tags: ["cloudfront", "s3", "cdn", "aws", "storage"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# CDN CloudFront para un Bucket S3 Existente + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** CloudFront, S3, CDN, AWS, Almacenamiento + +## Descripción del Problema + +**Contexto:** El usuario tiene un bucket S3 ya creado a través de SleakOps y desea agregar una distribución CDN CloudFront para él. La opción CloudFront está disponible durante la creación del bucket, pero no es visible para buckets ya existentes. 
+ +**Síntomas Observados:** + +- Opción CloudFront disponible solo durante la creación del bucket S3 +- No hay opción visible para añadir CloudFront a buckets existentes en la interfaz de SleakOps +- Necesidad de habilitar CDN para almacenamiento S3 ya desplegado + +**Configuración Relevante:** + +- Bucket S3: Ya creado mediante SleakOps +- Plataforma: AWS +- Servicio: Se necesita distribución CDN CloudFront +- Limitación actual: Edición de dependencias no habilitada en SleakOps + +**Condiciones de Error:** + +- No se pueden modificar las dependencias del bucket S3 existente en SleakOps +- No hay opción directa para añadir CloudFront después de crear el bucket +- El usuario quiere evitar recrear el bucket existente + +## Solución Detallada + + + +Actualmente, SleakOps no soporta la edición de dependencias para recursos existentes. Esto significa que no se puede agregar CloudFront a un bucket S3 después de haber sido creado mediante la interfaz de la plataforma. + +La opción CloudFront solo está disponible durante el proceso inicial de creación del bucket S3. + + + + + +Puedes crear una distribución CloudFront manualmente usando la Consola de AWS: + +1. **Accede a la Consola AWS** + + - Inicia sesión en tu cuenta AWS + - Navega al servicio **CloudFront** + +2. **Crear Distribución** + + - Haz clic en **Create Distribution** + - Selecciona el tipo de distribución **Web** + +3. **Configurar Origen** + + - **Dominio de Origen**: Selecciona tu bucket S3 existente del menú desplegable + - **Ruta de Origen**: Déjala vacía (a menos que quieras servir desde una carpeta específica) + - **Acceso al Origen**: Elige **Origin Access Control (OAC)** para mayor seguridad + +4. **Configuración de la Distribución** + + - **Clase de Precio**: Elige según tus necesidades geográficas + - **Nombres de Dominio Alternativos (CNAMEs)**: Agrega tu dominio personalizado si es necesario + - **Certificado SSL**: Usa el predeterminado o sube un certificado personalizado + +5. 
**Desplegar** + - Haz clic en **Create Distribution** + - Espera el despliegue (normalmente 15-20 minutos) + + + + + +Después de crear la distribución CloudFront, actualiza la política de tu bucket S3 para permitir el acceso de CloudFront: + +1. **Obtener ID de Distribución CloudFront** + - Copia el ID de Distribución desde la consola de CloudFront + +2. **Actualizar Política del Bucket** + ```json + { + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "AllowCloudFrontServicePrincipal", + "Effect": "Allow", + "Principal": { + "Service": "cloudfront.amazonaws.com" + }, + "Action": "s3:GetObject", + "Resource": "arn:aws:s3:::nombre-de-tu-bucket/*", + "Condition": { + "StringEquals": { + "AWS:SourceArn": "arn:aws:cloudfront::tu-account-id:distribution/tu-distribution-id" + } + } + } + ] + } + ``` + +3. **Aplicar Política** + - Ve a los permisos del bucket S3 + - Actualiza la política del bucket con el JSON anterior + - Reemplaza los placeholders con tus valores reales + + + + + +Verifica que tu distribución CloudFront esté funcionando: + +1. **Obtener URL de CloudFront** + - Copia el Nombre de Dominio de la Distribución desde la consola de CloudFront + +2. **Probar Acceso** + ```bash + # Probar con curl + curl https://tu-dominio-distribucion.cloudfront.net/tu-archivo.txt + + # O probar en el navegador + https://tu-dominio-distribucion.cloudfront.net/tu-archivo.txt + ``` + +3. **Verificar Headers de Cache** + ```bash + curl -I https://tu-dominio-distribucion.cloudfront.net/tu-archivo.txt + ``` + + + + + +Si la configuración manual es muy compleja, puedes recrear el bucket S3 con CloudFront: + +1. **Respaldar Datos** + - Descarga todos los archivos del bucket existente + - Anota la configuración actual del bucket + +2. **Crear Nueva Dependencia S3** + - En SleakOps, crea un nuevo bucket S3 + - Habilita la opción CloudFront durante la creación + - Sube tus datos respaldados + +3. 
**Actualizar Aplicación** + - Actualiza tu aplicación para usar el nuevo bucket + - Prueba la conectividad y funcionalidad + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-addons-after-migration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-addons-after-migration.mdx new file mode 100644 index 000000000..282e527de --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-addons-after-migration.mdx @@ -0,0 +1,178 @@ +--- +sidebar_position: 3 +title: "Complementos del Clúster Perdidos Después de la Migración" +description: "Cómo localizar y restaurar los complementos del clúster después de la migración de la plataforma" +date: "2024-10-15" +category: "cluster" +tags: ["migración", "complementos", "loki", "grafana", "monitoreo"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Complementos del Clúster Perdidos Después de la Migración + +**Fecha:** 15 de octubre de 2024 +**Categoría:** Clúster +**Etiquetas:** Migración, Complementos, Loki, Grafana, Monitoreo + +## Descripción del Problema + +**Contexto:** Después de una migración de clúster en SleakOps, los usuarios no pueden localizar los complementos instalados previamente como Loki, Grafana y otras herramientas de monitoreo en la nueva interfaz. + +**Síntomas Observados:** + +- Los complementos del clúster (Loki, Grafana, etc.) 
no son visibles en la nueva interfaz +- Las herramientas de monitoreo configuradas previamente parecen estar ausentes +- Los usuarios no pueden acceder a los paneles de monitoreo que estaban disponibles antes de la migración +- Incertidumbre sobre el estado de los complementos tras la migración de la plataforma + +**Configuración Relevante:** + +- Plataforma: SleakOps con interfaz nueva +- Complementos afectados: Loki, Grafana y otras herramientas de monitoreo +- Contexto de migración: Migración de clúster reciente realizada +- Interfaz: Interfaz actualizada/nueva de SleakOps + +**Condiciones de Error:** + +- Complementos no visibles inmediatamente después de la migración +- Ocurre al acceder a la nueva interfaz tras la migración +- Afecta capacidades de monitoreo y registro +- Puede impactar la visibilidad operativa + +## Solución Detallada + + + +Durante las migraciones de clúster, SleakOps desactiva temporalmente los complementos para asegurar: + +1. **Integridad de los datos**: Previene la corrupción de datos durante el proceso de migración +2. **Gestión de recursos**: Evita conflictos entre las configuraciones del clúster antiguo y nuevo +3. **Migración limpia**: Garantiza que los complementos se reconfiguren correctamente para el nuevo entorno +4. **Consistencia del estado**: Mantiene estados consistentes de los complementos a lo largo de la migración + +Este es un procedimiento estándar y los complementos se reactivan una vez que la migración está completa. + + + + + +Para encontrar tus complementos en la interfaz actualizada de SleakOps: + +1. **Navega a Gestión del Clúster**: + + - Ve al panel de control de tu clúster + - Busca la sección "Complementos" o "Extensiones" + +2. **Revisa la Sección de Monitoreo**: + + - Accede a la pestaña "Monitoreo" + - Busca Grafana, Loki y otras herramientas de monitoreo + +3. 
**Verifica el Estado de los Complementos**: + + - Comprueba si los complementos aparecen como "Activos" o "Pendientes" + - Algunos complementos pueden necesitar unos minutos para inicializarse completamente + +4. **Accede al Panel de Grafana**: + ``` + # Patrón típico de acceso a Grafana + https://grafana.[tu-dominio-del-clúster] + ``` + + + + + +Si los complementos aún no son visibles: + +1. **Espera la reactivación automática**: + + - La mayoría de los complementos se reactivan automáticamente en 10-15 minutos + - Revisa el estado del clúster para cualquier operación pendiente + +2. **Reactivación manual si es necesario**: + + - Ve a Configuración del Clúster → Complementos + - Desactiva y vuelve a activar los complementos que aparezcan inactivos + - Guarda la configuración + +3. **Verifica la salud de los complementos**: + + ```bash + # Verifica el estado de los pods de los complementos + kubectl get pods -n monitoring + kubectl get pods -n logging + ``` + +4. **Contacta soporte si los problemas persisten**: + - Si los complementos no aparecen después de 30 minutos + - Si encuentras errores durante la reactivación + + + + + +Después de que los complementos sean visibles, verifica que funcionen correctamente: + +1. **Acceso al Panel de Grafana**: + + - Inicia sesión en Grafana con tus credenciales de SleakOps + - Verifica que los paneles muestren datos + - Comprueba que las fuentes de datos estén conectadas + +2. **Agregación de Logs con Loki**: + + - Verifica que los logs se estén recopilando + - Revisa las políticas de retención de logs + - Prueba consultas de logs en Grafana + +3. **Alertas de Monitoreo**: + + - Verifica que las reglas de alerta estén activas + - Prueba los canales de notificación + - Revisa el historial de alertas + +4. 
**Métricas de Rendimiento**: + - Confirma que la recopilación de métricas funcione + - Verifica la disponibilidad de datos históricos + - Revisa la configuración de retención de métricas + + + + + +Si los complementos siguen sin estar disponibles: + +1. **Revisa los recursos del clúster**: + + ```bash + kubectl get nodes + kubectl top nodes + ``` + +2. **Verifica el estado de los namespaces**: + + ```bash + kubectl get namespaces + kubectl get pods --all-namespaces + ``` + +3. **Revisa los logs de los complementos**: + + ```bash + kubectl logs -n monitoring deployment/grafana + kubectl logs -n logging deployment/loki + ``` + +4. **Soluciones comunes**: + - Reinicia los despliegues de los complementos + - Verifica restricciones de recursos + - Revisa las políticas de red + - Examina la disponibilidad de la clase de almacenamiento + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de octubre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx new file mode 100644 index 000000000..1db6d0e17 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-automatic-shutdown-startup-issues.mdx @@ -0,0 +1,232 @@ +--- +sidebar_position: 3 +title: "Problemas con el Apagado y Arranque Automático del Clúster" +description: "Solución de problemas con el apagado/arranque automático del clúster que causa fallos en la API" +date: "2025-02-14" +category: "clúster" +tags: + ["clúster", "automatización", "apagado", "arranque", "step-functions", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con el Apagado y Arranque Automático del Clúster + +**Fecha:** 14 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** 
Clúster, Automatización, Apagado, Arranque, Step Functions, AWS + +## Descripción del Problema + +**Contexto:** Los clústeres SleakOps configurados con horarios automáticos de apagado/arranque pueden experimentar problemas donde el clúster no arranca correctamente, causando fallos en la API y tiempo de inactividad en las aplicaciones. + +**Síntomas Observados:** + +- Aplicaciones mostrando errores de conexión o tiempos de espera +- Endpoints de API devolviendo respuestas de error +- Servicios backend que no responden a chequeos de salud +- Todas las aplicaciones en el clúster apareciendo como no disponibles +- Problemas que ocurren después de períodos programados de apagado del clúster + +**Configuración Relevante:** + +- Tipo de clúster: Entornos de desarrollo/pruebas +- Apagado automático: Configurado para horas nocturnas +- Arranque automático: Configurado para horas laborales +- AWS Step Functions: Usado para la gestión del ciclo de vida del clúster +- Región: us-east-1 (típicamente) + +**Condiciones de Error:** + +- Errores que aparecen en la mañana tras el arranque automático +- El clúster parece estar en ejecución pero las aplicaciones no son accesibles +- La ejecución de Step Function puede haber fallado o completado con errores +- Se requiere intervención manual para restaurar el servicio + +## Solución Detallada + + + +Si estás experimentando este problema ahora mismo, puedes resolverlo activando manualmente el arranque del clúster: + +1. **Accede a la Consola de AWS** en tu cuenta de desarrollo +2. **Navega al servicio Step Functions** +3. **Encuentra la Step Function de arranque del clúster** (normalmente nombrada con el patrón: `*-up-sfn-*`) +4. **Ejecuta la Step Function** manualmente +5. **Espera la finalización** (típicamente 5-10 minutos) +6. 
**Verifica que las aplicaciones** sean accesibles nuevamente + +**Formato de enlace directo:** + +``` +https://us-east-1.console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/[TU_STEP_FUNCTION_ARN] +``` + + + + + +Este problema normalmente ocurre debido a: + +1. **Fallos en la ejecución de Step Functions**: El proceso automático de arranque encuentra errores +2. **Problemas de sincronización**: Dependencias entre servicios durante el arranque +3. **Restricciones de recursos**: Recursos insuficientes durante la inicialización del clúster +4. **Conectividad de red**: Problemas temporales de red durante el arranque +5. **Desviación en la configuración**: Cambios en la configuración del clúster que afectan la automatización + +**Puntos comunes de fallo:** + +- Problemas con el escalado de grupos de nodos +- Problemas en la programación de pods +- Retrasos en el descubrimiento de servicios +- Fallos en los chequeos de salud del balanceador de carga + + + + + +Para prevenir que este problema se repita: + +**1. Monitorea las ejecuciones de Step Functions:** + +```bash +# Ver ejecuciones recientes +aws stepfunctions list-executions --state-machine-arn [TU_ARN] --max-items 10 +``` + +**2. Configura alarmas en CloudWatch:** + +- Fallos en ejecuciones de Step Functions +- Duración del arranque del clúster que excede los umbrales +- Fallos en chequeos de salud de aplicaciones + +**3. Implementa mecanismos de reintento:** + +- Configura Step Functions con lógica de reintentos +- Añade retroceso exponencial para pasos fallidos +- Incluye pasos de aprobación manual para fallos críticos + +**4. 
Mejoras en chequeos de salud:** + +- Extiende los tiempos de espera de chequeos de salud +- Añade chequeos de dependencia entre servicios +- Implementa secuencias de arranque ordenadas y suaves + + + + + +**Paso 1: Verifica el estado de la Step Function** + +```bash +# Obtener detalles de la ejecución +aws stepfunctions describe-execution --execution-arn [EJECUCION_ARN] +``` + +**Paso 2: Verifica el estado del clúster** + +```bash +# Comprobar estado del clúster +kubectl get nodes +kubectl get pods --all-namespaces +``` + +**Paso 3: Revisa los logs de las aplicaciones** + +```bash +# Revisar logs de pods para errores +kubectl logs -f deployment/[TU_APP] -n [NAMESPACE] +``` + +**Paso 4: Verifica la conectividad del servicio** + +```bash +# Probar conectividad interna del servicio +kubectl exec -it [NOMBRE_POD] -- curl http://[NOMBRE_SERVICIO]:8080/health +``` + +**Paso 5: Revisa el ingreso y balanceador de carga** + +```bash +# Verificar estado del ingreso +kubectl get ingress +kubectl describe ingress [NOMBRE_INGRESS] +``` + + + + + +**Optimización de la secuencia de arranque:** + +1. **Arranque escalonado:** No iniciar todos los servicios simultáneamente +2. **Orden de dependencias:** Iniciar bases de datos antes que las aplicaciones +3. **Retrasos en chequeos de salud:** Permitir tiempo suficiente para la inicialización de servicios +4. 
**Reservas de recursos:** Asegurar CPU/memoria adecuadas durante el arranque + +**Configuración de Step Functions:** + +```json +{ + "Comment": "Arranque del clúster con lógica de reintentos", + "StartAt": "StartCluster", + "States": { + "StartCluster": { + "Type": "Task", + "Resource": "arn:aws:states:::aws-sdk:eks:updateCluster", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 30, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], + "Next": "WaitForCluster" + } + } +} +``` + +**Configuración de monitoreo:** + +- Configurar alertas para ejecuciones fallidas +- Monitorear utilización de recursos del clúster +- Rastrear tiempos de arranque de aplicaciones +- Registrar todos los eventos de automatización + + + + + +Si el apagado/arranque automático sigue causando problemas: + +**Opción 1: Ajustar horarios de apagado/arranque** + +- Extender el tiempo de arranque antes de las horas laborales +- Añadir tiempo de margen para inicialización completa +- Escalonar el apagado de diferentes servicios + +**Opción 2: Implementar chequeos de salud** + +- Añadir chequeos de salud completos antes de marcar el clúster como listo +- Incluir verificación de salud a nivel de aplicación +- Implementar reversión automática ante fallos en chequeos de salud + +**Opción 3: Usar autoscaling del clúster** + +- Configurar autoscaler para escalado automático +- Usar aprovisionamiento automático de nodos para optimización de costos +- Implementar presupuestos de interrupción de pods + +**Opción 4: Considerar mantener siempre encendidos los entornos críticos** + +- Mantener entornos tipo producción siempre activos +- Usar optimización de costos mediante dimensionamiento adecuado en lugar de apagado +- Implementar cuotas y límites de recursos + + + +--- + +_Esta FAQ fue generada automáticamente el 14 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-aws-vcpu-quota-limit.mdx 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-aws-vcpu-quota-limit.mdx new file mode 100644 index 000000000..5a3182121 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-aws-vcpu-quota-limit.mdx @@ -0,0 +1,200 @@ +--- +sidebar_position: 3 +title: "Error de Límite de Cuota de vCPU en AWS con Karpenter" +description: "Solución para el error VcpuLimitExceeded al usar instancias GPU en clústeres EKS" +date: "2024-04-24" +category: "cluster" +tags: ["aws", "karpenter", "cuota", "gpu", "instancias", "eks"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Límite de Cuota de vCPU en AWS con Karpenter + +**Fecha:** 24 de abril de 2024 +**Categoría:** Clúster +**Etiquetas:** AWS, Karpenter, Cuota, GPU, Instancias, EKS + +## Descripción del Problema + +**Contexto:** El usuario intenta configurar un nodepool con instancias GPU (g4ad.xlarge) para cargas de trabajo de computación de alto rendimiento, pero encuentra limitaciones en la cuota de vCPU cuando Karpenter intenta aprovisionar las instancias. 
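Antes de solicitar un aumento de cuota, conviene estimar cuántas vCPUs se necesitan realmente. Un cálculo rápido puede hacerse así; los conteos de vCPU corresponden a los valores publicados por AWS para estas familias, y el plan de instancias es un ejemplo:

```python
import math

# vCPUs por tipo de instancia (valores publicados por AWS)
VCPUS = {"g4ad.xlarge": 4, "g4ad.2xlarge": 8, "g4ad.4xlarge": 16, "g4dn.xlarge": 4}


def vcpus_necesarias(plan: dict, margen: float = 1.5) -> int:
    """Suma las vCPUs de todas las instancias del plan y aplica
    un margen de seguridad, redondeando hacia arriba."""
    total = sum(VCPUS[tipo] * cantidad for tipo, cantidad in plan.items())
    return math.ceil(total * margen)


# Ejemplo: 10 instancias g4ad.xlarge = 40 vCPUs; con margen 1.5 -> 60
print(vcpus_necesarias({"g4ad.xlarge": 10}))  # 60
```

Ese total (con margen) es el número que tiene sentido indicar en la solicitud de aumento de cuota, como se describe en los pasos siguientes.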
+ +**Síntomas observados:** + +- Karpenter falla al lanzar NodeClaim con error "VcpuLimitExceeded" +- Mensaje de error: "Ha solicitado más capacidad de vCPU de la que su límite actual de vCPU de 0 permite" +- No se pueden aprovisionar instancias GPU (g4ad.xlarge) +- Las instancias estándar (c7a.large, c7a.xlarge) funcionan correctamente +- La configuración del nodepool parece correcta pero no se crean nodos + +**Configuración relevante:** + +- Tipos de instancia: g4ad.xlarge (instancias GPU) +- Aprovisionamiento de NodeClaim con Karpenter +- Cuota del Servicio AWS inicialmente establecida en 0 para familias de instancias GPU +- Configuración del selector del nodepool usando node.kubernetes.io/instance-type + +**Condiciones de error:** + +- El error ocurre durante el aprovisionamiento de nodos con Karpenter +- Específico para tipos de instancia GPU (familia g4ad) +- La cuota del Servicio AWS bloquea la creación de instancias +- Las instancias estándar de cómputo funcionan sin problemas + +## Solución Detallada + + + +AWS implementa Cuotas de Servicio para controlar el uso de recursos entre diferentes familias de instancias. Las instancias GPU tienen cuotas separadas de las instancias estándar de cómputo: + +- **Instancias estándar** (t3, c5, m5, etc.): Usualmente tienen cuotas por defecto +- **Instancias GPU** (g4, p3, p4, etc.): A menudo comienzan con cuota 0 por seguridad +- **Instancias especializadas**: Pueden requerir solicitudes explícitas de cuota + +El error "VcpuLimitExceeded" indica que su cuenta no tiene suficiente cuota para el tipo de instancia solicitado. + + + + + +Para solicitar un aumento de cuota para instancias GPU: + +1. **Acceda a la Consola de Cuotas de Servicio de AWS**: + + - Vaya a [Consola de Cuotas de Servicio de AWS](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-FD8E9B9A) + - Navegue a **Amazon Elastic Compute Cloud (Amazon EC2)** + +2. 
**Encuentre la cuota correcta**: + + - Busque "Running On-Demand G and VT instances" + - Esta cuota cubre las instancias g4ad.xlarge + +3. **Solicite el aumento**: + + - Haga clic en "Request quota increase" + - Comience con un número conservador (por ejemplo, 32-64 vCPUs) + - Proporcione justificación comercial + +4. **Para instancias NVIDIA**, hay una cuota separada: + - "Running On-Demand P instances" + - Requerida para familias de instancias p3, p4 + + + + + +Calcule las vCPUs necesarias según sus requerimientos de instancia: + +```yaml +# Cálculo ejemplo para g4ad.xlarge +Instancia: g4ad.xlarge +vCPUs por instancia: 4 +Instancias deseadas: 10 +Total de vCPUs necesarias: 40 +# Solicitar cuota: 64 vCPUs (con margen) +``` + +**Conteos comunes de vCPU para instancias GPU**: + +- g4ad.xlarge: 4 vCPUs +- g4ad.2xlarge: 8 vCPUs +- g4ad.4xlarge: 16 vCPUs +- g4dn.xlarge: 4 vCPUs +- p3.2xlarge: 8 vCPUs + + + + + +Asegúrese de que su nodepool esté configurado correctamente para instancias GPU: + +```yaml +# Ejemplo de configuración de nodepool +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: gpu-nodepool +spec: + template: + spec: + requirements: + - key: node.kubernetes.io/instance-type + operator: In + values: + - g4ad.xlarge + - g4ad.2xlarge + - key: karpenter.sh/capacity-type + operator: In + values: + - on-demand # Las instancias GPU funcionan mejor con on-demand + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: gpu-nodeclass +``` + +**Consideraciones importantes**: + +- Use tipo de capacidad `on-demand` para instancias GPU +- Asegure que la AMI soporte drivers GPU +- Configure taints/tolerations apropiados para cargas GPU + + + + + +**Monitoree el uso de cuota**: + +```bash +# Verificar uso actual de cuota +aws service-quotas get-service-quota \ + --service-code ec2 \ + --quota-code L-FD8E9B9A + +# Monitorear logs de Karpenter +kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter +``` + +**Problemas 
comunes y soluciones**: + +1. **Cuota aprobada pero sigue fallando**: + + - Espere 15-30 minutos para que la cuota se propague + - Intente diferentes zonas de disponibilidad + +2. **Instancia no disponible**: + + - Verifique disponibilidad de instancias en su región + - Considere tipos de instancia alternativos + +3. **Compatibilidad de AMI**: + - Asegure que la AMI soporte drivers GPU + - Use AMI optimizada para EKS con soporte GPU + + + + + +**Gestión de cuotas**: + +- Solicite cuotas proactivamente antes del despliegue +- Comience con números conservadores y aumente según sea necesario +- Monitoree el uso para evitar límites inesperados + +**Optimización de costos**: + +- Use instancias Spot para cargas GPU no críticas +- Implemente políticas adecuadas de escalado de nodos +- Considere tipos de instancia mixtos en nodepools + +**Gestión de configuración**: + +- Use Infraestructura como Código (Terraform) para solicitudes de cuota +- Documente requerimientos de cuota en guías de despliegue +- Configure monitoreo para la utilización de cuotas + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 24 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-connection-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-connection-troubleshooting.mdx new file mode 100644 index 000000000..b154b1d38 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-connection-troubleshooting.mdx @@ -0,0 +1,230 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión al Clúster" +description: "Guía para solucionar problemas de conectividad al clúster" +date: "2024-12-19" +category: "cluster" +tags: ["conexión", "kubectl", "aws-sdk", "solución de problemas", "red"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas 
de Conexión al Clúster + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Conexión, kubectl, AWS SDK, Solución de problemas, Red + +## Descripción del Problema + +**Contexto:** El usuario no puede conectarse a un clúster de Kubernetes a través de la plataforma SleakOps, a pesar de que el clúster parece estar funcionando normalmente. + +**Síntomas Observados:** + +- Imposibilidad de establecer conexión con el clúster +- Los intentos de conexión fallan sin mensajes de error claros +- El estado del clúster aparece normal en la plataforma +- El problema persiste en múltiples intentos de conexión + +**Configuración Relevante:** + +- Plataforma: Clúster Kubernetes de SleakOps +- Herramientas requeridas: kubectl, AWS SDK, posible cliente VPN +- Red: Variable (diferentes conexiones a internet) +- Entorno local: Máquina local del usuario + +**Condiciones de Error:** + +- Fallos de conexión durante los intentos de acceso al clúster +- El problema puede estar relacionado con dependencias locales desactualizadas +- La conectividad de red puede ser un factor contribuyente +- El problema persiste a pesar de que el clúster está operativo + +## Solución Detallada + + + +Asegúrese de que todas las herramientas requeridas estén actualizadas a sus últimas versiones: + +**kubectl:** + +```bash +# Verificar versión actual +kubectl version --client + +# Actualizar kubectl (Linux/macOS) +curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + +# Actualizar kubectl (Windows) +choco upgrade kubernetes-cli +``` + +**AWS CLI:** + +```bash +# Verificar versión actual +aws --version + +# Actualizar AWS CLI +pip install --upgrade awscli +# o +curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +unzip awscliv2.zip +sudo ./aws/install --update +``` + +**Pritunl (si usa VPN):** + +- Descargue la última versión desde 
el sitio oficial de Pritunl +- Desinstale la versión antigua antes de instalar la nueva + + + + + +Realice estos diagnósticos de red: + +**1. Pruebe la conectividad básica:** + +```bash +# Probar resolución DNS +nslookup your-cluster-endpoint.eks.amazonaws.com + +# Probar conectividad de puerto +telnet your-cluster-endpoint.eks.amazonaws.com 443 +``` + +**2. Verifique la configuración del firewall y proxy:** + +```bash +# Verificar si está detrás de un firewall corporativo +curl -I https://your-cluster-endpoint.eks.amazonaws.com + +# Probar con otra red (hotspot móvil) +# Cambie a datos móviles y reintente la conexión +``` + +**3. Verificación de conexión VPN:** + +- Asegúrese de que la VPN esté conectada si es necesaria +- Intente desconectar/reconectar la VPN +- Revise los registros del cliente VPN para detectar errores + + + + + +Actualice la configuración local del clúster: + +**1. Re-descargar configuración del clúster:** + +```bash +# Para clústeres AWS EKS +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verificar configuración +kubectl config current-context +kubectl config view +``` + +**2. Probar conectividad con el clúster:** + +```bash +# Probar acceso básico al clúster +kubectl cluster-info +kubectl get nodes +kubectl get pods --all-namespaces +``` + +**3. Limpiar y regenerar credenciales:** + +```bash +# Limpiar caché de credenciales AWS +rm -rf ~/.aws/cli/cache/ + +# Reautenticarse si es necesario +aws configure +``` + + + + + +Si la conexión estándar falla, pruebe estas alternativas: + +**1. Usar Terminal Web de SleakOps:** + +- Acceda al clúster a través de la interfaz web de SleakOps +- Use la función de terminal incorporada +- Esto evita problemas de configuración local + +**2. Reenvío de puertos para servicios específicos:** + +```bash +# Reenviar puertos de servicios específicos +kubectl port-forward service/your-service 8080:80 +``` + +**3. 
Acceso temporal al clúster:** + +```bash +# Generar kubeconfig temporal +kubectl config set-cluster temp-cluster --server=https://your-endpoint +kubectl config set-context temp-context --cluster=temp-cluster +kubectl config use-context temp-context +# Nota: también deberá configurar la autenticación con kubectl config set-credentials +``` + + + + + +Siga este proceso de diagnóstico paso a paso: + +**Paso 1: Verificación del Entorno** + +```bash +# Verificar versiones de todas las herramientas +kubectl version --client +aws --version +helm version # si aplica +``` + +**Paso 2: Prueba de Conectividad** + +```bash +# Probar endpoint del clúster +curl -k https://your-cluster-endpoint.eks.amazonaws.com/version +``` + +**Paso 3: Verificación de Autenticación** + +```bash +# Verificar credenciales AWS +aws sts get-caller-identity + +# Probar autenticación con el clúster +kubectl auth can-i get pods +``` + +**Paso 4: Análisis de Red** + +- Intente desde otra red (datos móviles) +- Revise reglas del firewall corporativo +- Verifique conectividad VPN si es requerida + +**Paso 5: Validación de Configuración** + +```bash +# Revisar el kubeconfig efectivo del contexto activo +kubectl config view --minify + +# Verificar contextos +kubectl config get-contexts +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-critical-addons-node-failure.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-critical-addons-node-failure.mdx new file mode 100644 index 000000000..1aa36b535 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-critical-addons-node-failure.mdx @@ -0,0 +1,188 @@ +--- +sidebar_position: 3 +title: "Fallo del Nodo de Complementos Críticos en Producción" +description: "Solución para fallos del nodo CriticalAddonsOnly que causan tiempo de inactividad en producción" +date: "2024-01-15" +category: "cluster" +tags: + [ + "eks", + "complementos-críticos", + "alta-disponibilidad",
"producción", + "autoscaling", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo del Nodo de Complementos Críticos en Producción + +**Fecha:** 15 de enero de 2024 +**Categoría:** Clúster +**Etiquetas:** EKS, Complementos Críticos, Alta Disponibilidad, Producción, AutoScaling + +## Descripción del Problema + +**Contexto:** El clúster EKS de producción experimenta tiempo de inactividad debido a la ausencia del nodo CriticalAddonsOnly, provocando fallos en el sistema y falta de disponibilidad del servicio. + +**Síntomas Observados:** + +- Sistemas de producción caídos +- Nodo CriticalAddonsOnly ausente en el clúster +- Complementos críticos de Kubernetes no pueden ser programados +- Interrupción del servicio que afecta a los usuarios finales + +**Configuración Relevante:** + +- Entorno: Clúster EKS de producción +- Tipo de nodo: Nodo dedicado CriticalAddonsOnly +- Costo actual: ~10 USD/mes +- Costo con alta disponibilidad: ~50 USD/mes + +**Condiciones de Error:** + +- Punto único de fallo en la programación de complementos críticos +- No hay nodos de respaldo para componentes críticos del sistema +- El grupo de AutoScaling carece de diversidad en tipos de instancia + +## Solución Detallada + + + +Para una resolución inmediata sin costos adicionales: + +1. **Acceder a la Consola AWS** + + - Navegar a EC2 → Grupos de Auto Scaling + - Encontrar el Grupo de Auto Scaling CriticalAddons de su clúster + +2. **Editar la Plantilla de Lanzamiento** + + ```yaml + # Ejemplo de tipos de instancia para añadir + - t3.medium + - t3.large + - m5.large + - m5.xlarge + ``` + +3. **Actualizar el Grupo de AutoScaling** + + - Ir a la sección "Tipos de Instancia" + - Añadir múltiples tipos de instancia compatibles + - Esto provee opciones de respaldo cuando el tipo principal no está disponible + +4. 
**Forzar Reemplazo del Nodo** + ```bash + # Forzar creación de nuevo nodo (reemplace <nombre-del-nodo> por el nodo afectado) + kubectl drain <nombre-del-nodo> --ignore-daemonsets --delete-emptydir-data + kubectl delete node <nombre-del-nodo> + ``` + + + + + +Para entornos de producción, implemente alta disponibilidad: + +1. **Habilitar HA en el Panel de SleakOps** + + - Ir a Configuración del Clúster + - Navegar a la sección "Complementos Críticos" + - Activar la opción "Alta Disponibilidad" + - Incremento de costo: ~10 → 50 USD/mes + +2. **Beneficios de la Configuración HA** + + - Múltiples nodos CriticalAddons distribuidos en zonas de disponibilidad + - Capacidades automáticas de conmutación por error + - Cero tiempo de inactividad para componentes críticos + - Fiabilidad de nivel producción + +3. **Ejemplo de Configuración** + ```yaml + critical_addons: + high_availability: true + min_nodes: 2 + max_nodes: 3 + instance_types: ["t3.medium", "t3.large", "m5.large"] + availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"] + ``` + + + + + +**Para Clústeres de Producción:** + +1. **Siempre Habilitar Alta Disponibilidad** + + - Crítico para cargas de trabajo en producción + - Previene puntos únicos de fallo + - Incremento mínimo de costo para máxima confiabilidad + +2. **Diversidad en Tipos de Instancia** + + ```yaml + # Buena práctica: múltiples tipos de instancia + instance_types: + - "t3.medium" # Elección primaria + - "t3.large" # Opción de respaldo + - "m5.large" # Familia alternativa + - "m5.xlarge" # Respaldo más grande + ``` + +3. **Distribución Multi-Zona (AZ)** + + - Distribuir nodos entre zonas de disponibilidad + - Protege contra fallos a nivel de zona + - Asegura disponibilidad de complementos durante interrupciones + +4. 
**Configuración de Monitoreo** + + ```bash + # Monitorear pods de complementos críticos + kubectl get pods -n kube-system -l k8s-app=critical-addon + + # Verificar estado de preparación de nodos + kubectl get nodes -l node-role.kubernetes.io/critical-addons + ``` + + + + + +**Momento para los Cambios:** + +- **Durante Horas Laborales:** Seguro para configuraciones HA y adición de tipos de instancia +- **Sin Tiempo de Inactividad Requerido:** Son modificaciones menores +- **Monitoreo en Tiempo Real:** Puede realizarse mientras se supervisan los sistemas + +**Pasos para Implementación Segura:** + +1. **Verificación Previa al Cambio** + + ```bash + # Verificar estado actual del clúster + kubectl get nodes + kubectl get pods -n kube-system + ``` + +2. **Implementar Cambios** + + - Añadir tipos de instancia al Grupo de AutoScaling + - O habilitar Alta Disponibilidad mediante SleakOps + +3. **Validación Posterior al Cambio** + ```bash + # Verificar nueva configuración + kubectl get nodes -o wide + kubectl describe node + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-spot-instances-unavailable.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-spot-instances-unavailable.mdx new file mode 100644 index 000000000..a50d663aa --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-spot-instances-unavailable.mdx @@ -0,0 +1,210 @@ +--- +sidebar_position: 3 +title: "Instancias Spot de EKS No Disponibles Durante la Creación del Nodegroup" +description: "Solución para fallos en nodegroups de EKS debido a instancias Spot no disponibles" +date: "2024-01-15" +category: "cluster" +tags: ["eks", "instancias-spot", "nodegroup", "aws", "resolución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; 
+ +# Instancias Spot de EKS No Disponibles Durante la Creación del Nodegroup + +**Fecha:** 15 de enero de 2024 +**Categoría:** Cluster +**Etiquetas:** EKS, Instancias Spot, Nodegroup, AWS, Resolución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario experimenta problemas con la creación de nodegroups en el clúster EKS al usar instancias Spot, particularmente después de operaciones automáticas de inicio/parada del clúster durante los fines de semana. + +**Síntomas Observados:** + +- El nodegroup no se lanza debido a la indisponibilidad de instancias Spot +- La funcionalidad automática de inicio/parada del clúster provoca problemas en la provisión de nodos +- Los nodos de complementos críticos (criticaladdonsonly) no se inician +- El clúster se vuelve parcialmente inaccesible afectando el entorno de staging + +**Configuración Relevante:** + +- Plataforma: AWS EKS +- Tipo de instancia: Instancias Spot +- Entorno: Staging (STG) +- Funcionalidad: Inicio/parada automática del clúster durante fines de semana +- Región: us-east-1 + +**Condiciones de Error:** + +- El error ocurre durante el reinicio automático del clúster tras el apagado del fin de semana +- No hay disponibilidad de instancias Spot del tipo requerido en la región +- El reinicio manual del clúster resuelve parcialmente el problema pero algunos nodos permanecen no disponibles +- El problema parece ser recurrente + +## Solución Detallada + + + +Las instancias Spot en AWS tienen disponibilidad variable basada en: + +1. **Demanda actual**: Alta demanda reduce la disponibilidad +2. **Tipo de instancia**: Algunos tipos son más escasos que otros +3. **Zona de disponibilidad**: Diferentes zonas tienen distinta capacidad +4. **Hora del día/semana**: Los patrones de demanda afectan la disponibilidad + +Cuando AWS no tiene suficiente capacidad Spot, la creación del nodegroup falla con errores de capacidad. + + + + + +Para resolver el problema actual: + +1. 
**Verificar disponibilidad de instancias Spot**: + + - Ir a Consola AWS → EC2 → Solicitudes Spot + - Revisar precios y disponibilidad actuales de Spot + +2. **Modificar configuración del nodegroup**: + + ```yaml + # Añadir múltiples tipos de instancia para mejor disponibilidad + instance_types: + - "m5.large" + - "m5a.large" + - "m4.large" + - "c5.large" + ``` + +3. **Usar política de instancias mixtas**: + - Combinar instancias On-Demand y Spot + - Establecer un porcentaje de división (ejemplo: 20% On-Demand, 80% Spot) + + + + + +Configure su nodegroup con estas mejores prácticas: + +```yaml +# Configuración recomendada +nodegroup_config: + instance_types: + - "m5.large" + - "m5a.large" + - "m4.large" + - "c5.large" + - "c4.large" + capacity_type: "SPOT" + scaling_config: + min_size: 1 + max_size: 10 + desired_size: 3 + update_config: + max_unavailable_percentage: 25 + # Diversificar en múltiples zonas de disponibilidad + subnets: + - "subnet-xxx" # us-east-1a + - "subnet-yyy" # us-east-1b + - "subnet-zzz" # us-east-1c +``` + +**Recomendaciones clave**: + +- Usar 4-5 tipos diferentes de instancia +- Distribuir en múltiples zonas de disponibilidad +- Considerar características de rendimiento similares + + + + + +Para componentes críticos del sistema que deben estar siempre disponibles: + +1. **Crear un nodegroup dedicado On-Demand**: + + ```yaml + critical_nodegroup: + capacity_type: "ON_DEMAND" + instance_types: ["t3.medium"] + scaling_config: + min_size: 2 + max_size: 3 + desired_size: 2 + taints: + - key: "CriticalAddonsOnly" + value: "true" + effect: "NoSchedule" + ``` + +2. **Usar selectores de nodo para cargas críticas**: + ```yaml + # En los manifiestos de cargas críticas + nodeSelector: + node.kubernetes.io/instance-type: "t3.medium" + tolerations: + - key: "CriticalAddonsOnly" + operator: "Equal" + value: "true" + effect: "NoSchedule" + ``` + + + + + +Para prevenir problemas con el inicio/parada automática del clúster: + +1. 
**Implementar secuencia de inicio gradual**: + + - Iniciar primero los nodegroups críticos + - Esperar que los pods del sistema estén listos + - Luego iniciar los nodegroups de aplicaciones + +2. **Configurar chequeos de salud en el inicio**: + + ```bash + # Añadir al script de inicio + kubectl wait --for=condition=Ready nodes --all --timeout=300s + kubectl wait --for=condition=Ready pods -n kube-system --all --timeout=300s + ``` + +3. **Considerar deshabilitar temporalmente el auto inicio/parada**: + + - Hasta que mejore la disponibilidad de Spot + - O hasta implementar configuración de instancias mixtas + +4. **Configurar alertas de monitoreo**: + - Alertar cuando los nodegroups no se inicien + - Monitorear tasas de interrupción de instancias Spot + + + + + +Si los problemas con instancias Spot persisten: + +1. **Enfoque híbrido**: + + - Usar On-Demand para componentes del sistema + - Usar Spot para cargas de trabajo de aplicaciones + +2. **Despliegue multi-región**: + + - Considerar distribuir cargas entre regiones + - Usar regiones con mejor disponibilidad Spot + +3. **Instancias reservadas**: + + - Para cargas predecibles + - Mejor control de costos que On-Demand + +4. 
**Fargate para cargas críticas**: + - No requiere gestión de instancias + - Costo mayor pero mejor confiabilidad + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-schedule-process.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-schedule-process.mdx new file mode 100644 index 000000000..e59563424 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-schedule-process.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Programa y Proceso de Actualización del Clúster EKS" +description: "Comprendiendo la frecuencia, el proceso y la gestión de actualizaciones del clúster EKS en SleakOps" +date: "2025-02-06" +category: "cluster" +tags: ["eks", "actualización", "kubernetes", "mantenimiento", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Programa y Proceso de Actualización del Clúster EKS + +**Fecha:** 6 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** EKS, Actualización, Kubernetes, Mantenimiento, AWS + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender cómo se gestionan las actualizaciones del clúster EKS en SleakOps, incluyendo la frecuencia, el proceso y si se requiere intervención manual. 
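La ruta de actualización que describe este documento (1.29 → 1.32) avanza siempre de a una versión menor por vez, como exige EKS. A modo de referencia, un sketch ilustrativo en bash que enumera los saltos intermedios entre dos versiones menores; la función `ruta_actualizacion` es hipotética y no forma parte de la plataforma SleakOps:

```shell
#!/usr/bin/env bash
# Sketch ilustrativo: calcula la ruta de actualización entre dos versiones
# menores de Kubernetes (EKS solo permite avanzar de a una versión menor).
ruta_actualizacion() {
  local actual="$1" objetivo="$2"
  local mayor="${actual%%.*}"
  local menor_actual="${actual#*.}"
  local menor_objetivo="${objetivo#*.}"
  local ruta="$actual"
  local m
  for ((m = menor_actual + 1; m <= menor_objetivo; m++)); do
    ruta+=" -> ${mayor}.${m}"
  done
  echo "$ruta"
}

ruta_actualizacion 1.29 1.32   # imprime: 1.29 -> 1.30 -> 1.31 -> 1.32
```

Con `ruta_actualizacion 1.29 1.32` se obtienen los tres saltos pendientes que, según el programa descrito más abajo, se ejecutan a razón de dos actualizaciones mayores por año.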
+ +**Síntomas Observados:** + +- Preguntas sobre la frecuencia y el momento de las actualizaciones +- Preocupaciones sobre el tiempo de inactividad durante las actualizaciones +- Incertidumbre sobre procesos de actualización manual vs automática +- Necesidad de claridad sobre la hoja de ruta de actualizaciones y progresión de versiones + +**Configuración Relevante:** + +- Versión actual de EKS: 1.29 +- Versión objetivo: 1.32 (planificada para el primer semestre) +- Frecuencia de actualizaciones: 2 actualizaciones mayores por año +- Plataforma: AWS EKS gestionado por SleakOps + +**Condiciones de Error:** + +- Posible interrupción del servicio durante las actualizaciones +- Riesgo de problemas de compatibilidad con modificaciones manuales en el clúster +- Necesidad de coordinación entre el equipo de SleakOps y los usuarios + +## Solución Detallada + + + +SleakOps gestiona las actualizaciones del clúster EKS con el siguiente programa: + +- **Frecuencia**: 2 actualizaciones mayores de versión de Kubernetes por año +- **Estado Actual**: Versión 1.29 +- **Objetivo**: Versión 1.32 para finales del primer semestre de 2025 +- **Actualizaciones Restantes**: 3 actualizaciones de versión (1.29 → 1.30 → 1.31 → 1.32) + +**Cronograma de Actualizaciones:** + +- Las actualizaciones se programan típicamente durante ventanas de mantenimiento +- Los usuarios reciben notificaciones anticipadas por correo electrónico +- Las fechas específicas se comunican antes de cada actualización + + + + + +SleakOps se encarga de todas las actualizaciones del clúster EKS: + +- **Gestión Automática**: El equipo de SleakOps administra y ejecuta todas las actualizaciones +- **Sin Intervención Manual**: Los usuarios no necesitan realizar actualizaciones manuales a través de la Consola AWS +- **Pruebas Previas**: Se realizan pruebas extensas antes de las actualizaciones en producción +- **Monitoreo**: El equipo de SleakOps supervisa el proceso de actualización + +**Actualizaciones Bajo Demanda:** +En 
algunos casos, SleakOps ofrece un flujo de trabajo dentro de la plataforma para que los usuarios puedan activar actualizaciones bajo demanda cuando sea necesario. + + + + + +**Tiempo de Inactividad Esperado:** + +- Las actualizaciones están diseñadas para minimizar el tiempo de inactividad +- Las pruebas previas a la actualización reducen el riesgo de problemas +- La mayoría de las actualizaciones no deberían causar interrupción del servicio + +**Riesgos Potenciales:** + +- Las modificaciones manuales en los clústeres pueden causar problemas de compatibilidad +- Las configuraciones personalizadas no gestionadas por SleakOps podrían verse afectadas +- Los usuarios deben evitar cambios manuales para prevenir complicaciones en la actualización + +**Recomendaciones:** + +- Mantenerse alerta durante las ventanas de actualización +- Evitar modificaciones manuales en el clúster +- Reportar cualquier problema inmediatamente al soporte de SleakOps + + + + + +**Antes de las Actualizaciones:** + +1. **Revisar Aplicaciones**: Asegurarse de que las aplicaciones sean compatibles con la versión objetivo de Kubernetes +2. **Respaldar Datos Críticos**: Aunque SleakOps gestiona esto, asegure respaldos de los datos de la aplicación +3. **Monitorear Comunicaciones**: Estar atento a las notificaciones de actualización de SleakOps +4. **Evitar Cambios Manuales**: No realizar modificaciones manuales en el clúster antes de las actualizaciones + +**Durante las Actualizaciones:** + +1. **Estar Disponible**: Estar disponible para responder a cualquier problema +2. **Monitorear Aplicaciones**: Verificar el estado de las aplicaciones tras la finalización de la actualización +3. **Reportar Problemas**: Contactar a SleakOps inmediatamente si surgen inconvenientes + +**Después de las Actualizaciones:** + +1. **Verificar Funcionalidad**: Probar las funcionalidades críticas de la aplicación +2. **Revisar Registros**: Analizar los registros de la aplicación y del sistema para detectar problemas +3. 
**Actualizar Dependencias**: Asegurar que las dependencias de la aplicación sean compatibles con la nueva versión de Kubernetes + + + + + +**Ruta Actual de Actualización:** + +``` +Actual: 1.29 → Objetivo: 1.32 +Secuencia de actualización: 1.29 → 1.30 → 1.31 → 1.32 +``` + +**Consideraciones de Compatibilidad:** + +- Cada actualización mantiene compatibilidad hacia atrás para la mayoría de funciones +- Las APIs obsoletas pueden eliminarse en versiones más nuevas +- Los recursos personalizados y operadores deben verificarse para compatibilidad +- Las integraciones de terceros pueden requerir actualizaciones + +**Mejores Prácticas:** + +- Mantener las aplicaciones actualizadas para usar las APIs actuales de Kubernetes +- Evitar usar funciones obsoletas +- Probar aplicaciones en entornos de desarrollo con versiones más nuevas de Kubernetes +- Seguir las políticas de desuso de Kubernetes + + + +--- + +_Este FAQ fue generado automáticamente el 6 de febrero de 2025 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx new file mode 100644 index 000000000..920ae7703 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-eks-upgrade-volume-attachment-issue.mdx @@ -0,0 +1,166 @@ +--- +sidebar_position: 3 +title: "Fallo en la Actualización del Clúster EKS Debido a Volumen Huérfano" +description: "Solución para fallos en la actualización del clúster EKS causados por volúmenes adjuntos a nodos pero sin uso" +date: "2024-12-19" +category: "cluster" +tags: + ["eks", "actualización", "volúmenes", "nodegroup", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en la Actualización del Clúster EKS Debido a Volumen Huérfano + +**Fecha:** 19 de 
diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** EKS, Actualización, Volúmenes, NodeGroup, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Durante una actualización del clúster EKS a la versión 1.31, el proceso de actualización del NodeGroup falla, dejando nodos antiguos en ejecución mientras se impide que la actualización se complete con éxito. + +**Síntomas Observados:** + +- La actualización del clúster EKS falla durante la fase de actualización del NodeGroup +- Los nodos antiguos permanecen activos en lugar de ser reemplazados +- Las canalizaciones CI/CD dejan de funcionar correctamente +- Los procesos de despliegue se ven afectados +- Conflictos en la conexión de volúmenes impiden el reemplazo de nodos + +**Configuración Relevante:** + +- Versión del clúster EKS: Actualización a 1.31 +- Volumen afectado: `/app/certs` (volumen huérfano) +- Estado del volumen: Adjunto al nodo pero sin uso por ningún pod +- Estado del volumen en SleakOps: Marcado como "eliminado" pero aún físicamente adjunto + +**Condiciones de Error:** + +- La actualización del NodeGroup falla debido a conflictos en la conexión de volúmenes +- Ocurre cuando los volúmenes están adjuntos a nodos pero no se usan activamente +- El problema aparece durante la fase de reemplazo de nodos en la actualización +- Afecta a clústeres con volúmenes previamente eliminados pero aún adjuntos + +## Solución Detallada + + + +La falla en la actualización ocurre porque: + +1. **Volúmenes huérfanos**: Volúmenes que fueron eliminados en SleakOps pero permanecen físicamente adjuntos a instancias EC2 +2. **Conflicto en el reemplazo de nodos**: Durante las actualizaciones de EKS, AWS necesita reemplazar nodos, pero los volúmenes adjuntos impiden la terminación limpia del nodo +3. 
**Desajuste en el estado del volumen**: El volumen existe en AWS pero no está rastreado en la configuración actual de despliegue + +Este es un problema común cuando los volúmenes se eliminan de la configuración de SleakOps pero los volúmenes EBS subyacentes de AWS permanecen adjuntos a los nodos. + + + + + +Para solucionar el problema de la actualización: + +1. **Identificar el volumen problemático**: + + ```bash + # Verificar volúmenes adjuntos en el nodo + kubectl describe nodes + # Buscar volúmenes en la consola AWS EC2 + aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=i-xxxxxxxxx" + ``` + +2. **Desconectar el volumen huérfano**: + + ```bash + # Desconectar volumen de la instancia EC2 + aws ec2 detach-volume --volume-id vol-xxxxxxxxx + ``` + +3. **Reintentar la actualización del clúster**: + - La actualización debería continuar normalmente una vez resuelto el conflicto del volumen + - Monitorear el proceso de reemplazo del NodeGroup + + + + + +Consideraciones importantes para la seguridad de los datos: + +1. **Los datos del volumen se preservan**: Desconectar el volumen no elimina los datos +2. **El volumen permanece en AWS**: El volumen EBS se mantiene intacto en tu cuenta de AWS +3. **Opciones de recuperación**: Puedes volver a conectar el volumen más tarde si es necesario + +```bash +# Verificar estado del volumen después de la desconexión +aws ec2 describe-volumes --volume-ids vol-xxxxxxxxx + +# El volumen mostrará estado "available" en lugar de "in-use" +``` + +**Mejor práctica**: Haz un snapshot antes de desconectar si el volumen contiene datos críticos: + +```bash +# Crear snapshot para seguridad +aws ec2 create-snapshot --volume-id vol-xxxxxxxxx --description "Backup antes de la actualización del clúster" +``` + + + + + +Para evitar este problema en futuras actualizaciones: + +1. 
**Eliminación limpia de volúmenes**: Al eliminar volúmenes en SleakOps, asegúrate de que se desconecten correctamente: + + - Eliminar volumen de la configuración de SleakOps + - Verificar que el volumen esté desconectado de los nodos + - Opcionalmente eliminar el volumen si ya no se necesita + +2. **Lista de verificación previa a la actualización**: + + ```bash + # Buscar volúmenes huérfanos antes de actualizar + kubectl get pv + kubectl get pvc --all-namespaces + + # Verificar que no haya volúmenes sin uso adjuntos + aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=i-*" + ``` + +3. **Mantenimiento regular**: Auditar y limpiar periódicamente volúmenes no usados + + + + + +Después de resolver el problema del volumen y completar la actualización: + +1. **Verificar versión del clúster**: + + ```bash + kubectl version + aws eks describe-cluster --name your-cluster-name --query 'cluster.version' + ``` + +2. **Revisar estado de los nodos**: + + ```bash + kubectl get nodes + # Verificar que todos los nodos estén ejecutando la nueva versión + ``` + +3. **Probar funcionalidad CI/CD**: + + - Desplegar una aplicación de prueba + - Verificar que las canalizaciones de despliegue funcionen correctamente + - Comprobar que todos los servicios sean accesibles + +4. 
**Monitorear posibles problemas**: + - Observar los logs del clúster para cualquier anomalía + - Verificar que todas las cargas de trabajo se ejecuten normalmente + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-kubernetes-upgrade-process.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-kubernetes-upgrade-process.mdx new file mode 100644 index 000000000..4419b0dee --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-kubernetes-upgrade-process.mdx @@ -0,0 +1,168 @@ +--- +sidebar_position: 3 +title: "Proceso de Actualización del Clúster Kubernetes" +description: "Cómo manejar las actualizaciones de clúster en la plataforma SleakOps" +date: "2024-12-19" +category: "clúster" +tags: ["kubernetes", "actualización", "clúster", "mantenimiento"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Proceso de Actualización del Clúster Kubernetes + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Kubernetes, Actualización, Clúster, Mantenimiento + +## Descripción del Problema + +**Contexto:** Los usuarios ven una notificación de "Actualización Requerida" en la sección Clúster de su panel de SleakOps y necesitan orientación sobre cómo proceder con la actualización del clúster Kubernetes. 
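Dado que durante la actualización los nodos se reemplazan de forma progresiva, los servicios con una sola réplica pueden sufrir breves interrupciones. El siguiente sketch ilustrativo en bash marca los Deployments con menos de dos réplicas; los datos de entrada son un ejemplo hipotético (en un clúster real provendrían de `kubectl get deployments -A -o custom-columns=NOMBRE:.metadata.name,REPLICAS:.spec.replicas --no-headers`):

```shell
#!/usr/bin/env bash
# Sketch: marca los Deployments con menos de 2 réplicas, que podrían
# sufrir interrupciones durante una actualización progresiva de nodos.
detectar_replicas_unicas() {
  # Lee pares "nombre réplicas" desde la entrada estándar.
  while read -r nombre replicas; do
    if [ "${replicas:-0}" -lt 2 ]; then
      echo "ADVERTENCIA: $nombre tiene $replicas réplica(s)"
    fi
  done
}

# Datos de ejemplo (hipotéticos):
detectar_replicas_unicas <<'EOF'
api-principal 3
worker-colas 1
frontend 2
EOF
# imprime: ADVERTENCIA: worker-colas tiene 1 réplica(s)
```

Un chequeo de este estilo antes de aceptar la actualización ayuda a decidir qué servicios conviene escalar a múltiples réplicas.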
+ +**Síntomas Observados:** + +- Aparece un banner de "Actualización Requerida" en la sección Clúster - Producción +- Incertidumbre sobre el proceso de actualización y posible tiempo de inactividad +- Preocupaciones sobre la interrupción del servicio durante la actualización +- Preguntas sobre compatibilidad de código, versión o base de datos + +**Configuración Relevante:** + +- Plataforma: Clúster Kubernetes gestionado por SleakOps +- Entorno: Clúster de producción +- Tipo de actualización: Actualización de versión de Kubernetes con componentes centrales y complementos + +**Condiciones de Error:** + +- No hay errores reales, pero la notificación de actualización requiere acción +- Posibles problemas de compatibilidad con recursos externos +- Riesgo de interrupción del servicio si no se planifica adecuadamente + +## Solución Detallada + + + +Cuando veas la notificación "Actualización Requerida": + +1. **Haz clic en Aceptar** en la notificación de actualización en tu panel de SleakOps +2. El sistema comenzará automáticamente el proceso de actualización +3. **No se requiere intervención manual** durante la actualización + +El proceso de actualización es completamente automatizado y gestionado por SleakOps. + + + + + +**Duración:** Aproximadamente 1 hora para el proceso completo de actualización + +**Tiempo de inactividad:** + +- **No se espera tiempo de inactividad** para servicios correctamente configurados +- Los servicios con múltiples réplicas de pods seguirán funcionando +- Los servicios con un solo pod pueden experimentar breves interrupciones durante la actualización de nodos + +**Orden del proceso:** + +1. Los nodos principales de SleakOps se actualizan uno a uno (actualización progresiva) +2. Los complementos internos se actualizan después de completar los nodos +3. 
Se actualizan todos los componentes listados en la notificación de actualización + + + + + +**Preparación de la Aplicación:** + +- Asegúrate de que tus aplicaciones tengan **múltiples réplicas de pods** para alta disponibilidad +- Verifica que tus servicios puedan manejar reinicios progresivos +- Comprueba que tus aplicaciones no dependan de asignaciones específicas de nodos + +**Recursos Externos:** + +- Revisa cualquier **instalación externa** en tu clúster (no gestionada por SleakOps) +- Consulta el registro de cambios de Kubernetes para recursos obsoletos +- Verifica la compatibilidad de operadores personalizados o herramientas de terceros + +**Consideraciones de Base de Datos:** + +- No se necesitan preparaciones especiales de base de datos para actualizaciones gestionadas por SleakOps +- Asegura que las conexiones a la base de datos puedan manejar breves interrupciones de red +- Verifica que los volúmenes persistentes estén correctamente configurados + + + + + +Antes de aceptar la actualización: + +1. **Revisa el registro de cambios de Kubernetes** proporcionado en la notificación de actualización +2. **Verifica las APIs obsoletas** que puedan usar tus herramientas externas: + ```bash + kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found + ``` +3. **Actualiza cualquier herramienta externa** que use APIs obsoletas de Kubernetes +4. **Prueba las aplicaciones críticas** en un entorno de pruebas si es posible + +**Recursos obsoletos comunes a verificar:** + +- Versiones antiguas de API para Deployments, Services, Ingress +- Definiciones de Recursos Personalizados (CRDs) con esquemas obsoletos +- Políticas de red que usan APIs obsoletas + + + + + +Durante la actualización: + +1. **Monitorea tus aplicaciones** mediante tus herramientas habituales de monitoreo +2. **Revisa el panel de SleakOps** para el progreso de la actualización +3. 
**Atiende cualquier alerta** de tus sistemas de monitoreo + +**Indicadores de actualización exitosa:** + +- Los nodos muestran la versión actualizada de Kubernetes +- Todos los pods están en ejecución y saludables +- Los servicios responden normalmente +- No hay registros persistentes de errores + +**Si ocurren problemas:** + +- Contacta al soporte de SleakOps inmediatamente +- Proporciona mensajes de error o síntomas específicos +- Incluye los detalles de la notificación de actualización + + + + + +Después de que la actualización finalice: + +1. **Verifica el estado del clúster:** + + ```bash + kubectl get nodes + kubectl get pods --all-namespaces + ``` + +2. **Comprueba la salud de las aplicaciones:** + + - Prueba los endpoints críticos de las aplicaciones + - Verifica la conectividad con la base de datos + - Confirma que el monitoreo y el registro funcionan + +3. **Revisa los registros de actualización:** + + - Consulta el panel de SleakOps para el resumen de la actualización + - Revisa cualquier advertencia o aviso + +4. 
**Actualiza la documentación:** + - Registra la nueva versión de Kubernetes + - Actualiza cualquier configuración específica de versión + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-lens-connection-timeout.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-lens-connection-timeout.mdx new file mode 100644 index 000000000..a4c062375 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-lens-connection-timeout.mdx @@ -0,0 +1,204 @@ +--- +sidebar_position: 3 +title: "Tiempo de espera de conexión de Lens al clúster de Kubernetes" +description: "Solución para errores de tiempo de espera al conectar con clústeres de Kubernetes mediante Lens IDE" +date: "2024-04-29" +category: "usuario" +tags: + [ + "lens", + "kubernetes", + "conexión", + "tiempo de espera", + "vpn", + "solución de problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Tiempo de espera de conexión de Lens al clúster de Kubernetes + +**Fecha:** 29 de abril de 2024 +**Categoría:** Usuario +**Etiquetas:** Lens, Kubernetes, Conexión, Tiempo de espera, VPN, Solución de problemas + +## Descripción del problema + +**Contexto:** Los usuarios experimentan errores de tiempo de espera al intentar conectarse a clústeres de Kubernetes a través de Lens IDE, a pesar de seguir correctamente todos los pasos de configuración. 
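Como contexto para el diagnóstico que sigue: Lens no habla directamente con AWS, sino que importa el kubeconfig local, cuya entrada de usuario para un clúster EKS (generada por `aws eks update-kubeconfig`) tiene aproximadamente esta forma. Los nombres de clúster, región y cuenta son ilustrativos:

```yaml
# Fragmento orientativo de ~/.kube/config para un clúster EKS
# (valores de ejemplo; los reales dependen de tu cuenta y región)
users:
  - name: arn:aws:eks:us-east-1:123456789012:cluster/mi-cluster
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args:
          - eks
          - get-token
          - --cluster-name
          - mi-cluster
          # - --profile     # descomentar si usas un perfil de AWS CLI
          # - tu-perfil
```

Si este comando `exec` falla o tarda demasiado (por ejemplo, por credenciales vencidas o falta de conectividad hacia AWS a través de la VPN), Lens mostrará un tiempo de espera aunque el kubeconfig sea correcto.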
+ +**Síntomas observados:** + +- Errores de tiempo de espera en Lens IDE +- Imposibilidad de acceder a los recursos del clúster mediante Lens +- Autenticación exitosa pero comunicación fallida con el clúster +- El error ocurre para miembros específicos del equipo mientras otros pueden conectarse normalmente + +**Configuración relevante:** + +- Herramienta: Lens Kubernetes IDE +- Autenticación: Credenciales de usuario AWS IAM +- Método de conexión: A través de VPN SleakOps +- Tipo de clúster: EKS (Amazon Elastic Kubernetes Service) + +**Condiciones del error:** + +- Tiempo de espera durante el intento de conexión al clúster +- El error persiste a pesar de los pasos de configuración correctos +- El problema parece estar relacionado con la red y no con la autenticación +- El problema puede ser intermitente o afectar a usuarios específicos + +## Solución detallada + + + +Los tiempos de espera de conexión de Lens a clústeres de Kubernetes ocurren típicamente debido a: + +1. **Problemas de conectividad de red**: Problemas con la conexión VPN o restricciones de firewall +2. **Problemas de resolución DNS**: Incapacidad para resolver el endpoint del clúster +3. **Expiración del token de autenticación**: Credenciales AWS o tokens kubeconfig expirados +4. **Accesibilidad del endpoint del clúster**: Endpoint del clúster EKS no accesible desde la red actual +5. **Problemas de configuración de Lens**: kubeconfig o configuraciones de contexto incorrectas + + + + + +Primero, asegúrate de que tu conexión VPN funcione correctamente: + +1. **Verificar estado de VPN**: Confirma que estás conectado a la VPN SleakOps +2. **Probar conectividad**: Intenta hacer ping a recursos internos +3. 
**Resolución DNS**: Asegúrate de poder resolver nombres internos + +```bash +# Probar conectividad VPN +ping internal-resource.sleakops.local + +# Verificar resolución DNS +nslookup your-cluster-endpoint.eks.amazonaws.com +``` + + + + + +Actualiza tu archivo kubeconfig para asegurar credenciales frescas: + +```bash +# Actualizar kubeconfig para clúster EKS +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verificar configuración +kubectl config current-context + +# Probar conectividad básica +kubectl get nodes +``` + +Si usas perfiles AWS CLI: + +```bash +# Especificar perfil +aws eks update-kubeconfig --region your-region --name your-cluster-name --profile your-profile +``` + + + + + +Asegúrate de que Lens esté configurado correctamente: + +1. **Importar kubeconfig**: Ve a **Archivo** → **Agregar clúster** → **Desde kubeconfig** +2. **Seleccionar contexto correcto**: Escoge el contexto adecuado del clúster +3. **Verificar configuración de proxy**: Revisa que la configuración de proxy de Lens coincida con tu red + +**Configuración de proxy en Lens:** + +- Ve a **Preferencias** → **Proxy** +- Asegúrate de que la configuración de proxy coincida con la de tu VPN +- Intenta deshabilitar el proxy si usas VPN + + + + + +Si el problema persiste, realiza estos diagnósticos de red: + +```bash +# Verificar si el endpoint del clúster es accesible +telnet your-cluster-endpoint.eks.amazonaws.com 443 + +# Probar con curl +curl -k https://your-cluster-endpoint.eks.amazonaws.com/version + +# Verificar rutas +traceroute your-cluster-endpoint.eks.amazonaws.com +``` + +**Problemas comunes de red:** + +- Firewall corporativo bloqueando puertos API de Kubernetes (443, 6443) +- VPN no enruta tráfico correctamente hacia regiones AWS +- DNS no resuelve correctamente endpoints de EKS + + + + + +Si Lens continúa con tiempos de espera, prueba estas alternativas: + +1. **Usar kubectl directamente**: Verifica si kubectl funciona sin Lens +2. 
**Probar otra red**: Intenta desde una ubicación de red diferente +3. **Usar AWS CloudShell**: Accede al clúster mediante AWS Console CloudShell +4. **Reenvío de puertos**: Usa kubectl port-forward para servicios específicos + +```bash +# Probar acceso directo con kubectl +kubectl get pods --all-namespaces + +# Reenvío de puertos para servicios específicos +kubectl port-forward service/your-service 8080:80 +``` + + + + + +Asegúrate de que el usuario AWS IAM tenga los permisos adecuados: + +1. **Revisar políticas IAM**: Verifica que las políticas de acceso a EKS estén asignadas +2. **Verificar ConfigMap aws-auth**: Confirma que el usuario esté mapeado en el clúster +3. **Probar acceso AWS CLI**: Confirma que las credenciales AWS funcionen + +```bash +# Probar credenciales AWS +aws sts get-caller-identity + +# Verificar acceso al clúster EKS +aws eks describe-cluster --name your-cluster-name + +# Verificar acceso con kubectl +kubectl auth can-i get pods +``` + + + + + +Mientras se resuelve el problema del servidor VPN: + +1. **Usar consola AWS**: Accede a recursos Kubernetes desde la consola AWS EKS +2. **AWS CloudShell**: Usa CloudShell para comandos kubectl +3. **Reenvío de puertos local**: Reenvía servicios específicos a localhost +4. 
**VPN alternativa**: Si está disponible, intenta conectar mediante otro endpoint VPN + +```bash +# Ejemplo de reenvío de puerto para acceso al dashboard +kubectl port-forward -n kubernetes-dashboard service/kubernetes-dashboard 8443:443 +``` + + + +--- + +_Este FAQ fue generado automáticamente el 29/04/2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx new file mode 100644 index 000000000..e671de56e --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-manual-shutdown-scheduled-feature.mdx @@ -0,0 +1,167 @@ +--- +sidebar_position: 3 +title: "El Apagado Manual del Clúster Requiere la Función de Apagado Programado" +description: "Solución para el apagado manual del clúster cuando la función de apagado programado no está habilitada" +date: "2024-12-12" +category: "cluster" +tags: ["clúster", "apagado", "programado", "manual", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# El Apagado Manual del Clúster Requiere la Función de Apagado Programado + +**Fecha:** 12 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Clúster, Apagado, Programado, Manual, Configuración + +## Descripción del Problema + +**Contexto:** Los usuarios desean detener manualmente su clúster SleakOps en momentos específicos (para pruebas, optimización de costos, etc.) pero encuentran que el botón de apagado manual no está disponible o no funciona. 
+ +**Síntomas Observados:** + +- El botón de apagado manual del clúster no está disponible en la interfaz +- No se puede detener el clúster manualmente aun cuando es necesario para situaciones específicas +- El clúster permanece activo consumiendo recursos cuando no se requiere +- No hay opción para pausar el clúster por períodos indefinidos + +**Configuración Relevante:** + +- Clúster SleakOps desplegado y en ejecución +- Función de Apagado Programado: No habilitada +- Requisito de apagado manual: inmediato o con horario flexible +- Caso de uso: pruebas, desarrollo, optimización de costos + +**Condiciones de Error:** + +- No es posible el apagado manual sin que la función programada esté habilitada +- Los recursos continúan generando costos durante períodos de inactividad +- No se puede pausar el clúster por tiempos indefinidos +- Falta de flexibilidad para necesidades de gestión ad-hoc del clúster + +## Solución Detallada + + + +Para habilitar el apagado manual del clúster, primero debe activar la función de "Apagado Programado": + +1. Vaya a la **Configuración del Clúster** +2. Busque la opción **"Apagado Programado"** +3. **Habilite** esta función +4. Configure ajustes básicos del horario (pueden ser mínimos) + +Una vez habilitada, esta función ejecuta un módulo en segundo plano que permite operaciones tanto programadas como manuales de apagado. + + + + + +Puede configurar un horario mínimo que le brinde máximo control manual: + +```yaml +# Ejemplo de configuración mínima +scheduled_shutdown: + enabled: true + schedule: + # Establecer un horario muy permisivo o solo fuera de horas laborales + weekdays: [] + weekend: [] + timezone: "UTC" +``` + +**Pasos:** + +1. Habilite Apagado Programado +2. Configure horarios mínimos o sin horarios automáticos +3. Use controles manuales según necesidad +4. La función puede deshabilitarse después si no es necesaria + + + + + +Después de habilitar la función de Apagado Programado: + +1. 
Navegue al panel de control de su clúster +2. Busque el **botón de apagado/detención** (ahora disponible) +3. Haga clic para detener manualmente el clúster +4. Inicie el clúster manualmente cuando sea necesario +5. No se requiere un horario predefinido para operaciones manuales + +**Beneficios:** + +- Detener el clúster durante pausas de pruebas +- Pausar por períodos indefinidos +- Reanudar cuando se necesite +- Optimizar costos en entornos de desarrollo + + + + + +Para clústeres de desarrollo que requieren control flexible de encendido/apagado: + +```yaml +cluster_config: + name: "dev-cluster" + scheduled_shutdown: + enabled: true + auto_schedule: false # Sin programación automática + manual_control: true # Habilitar inicio/detención manual + +# Esto permite: +# - Apagado manual en cualquier momento +# - Inicio manual cuando se necesite +# - Sin operaciones automáticas +# - Optimización de costos para entornos de desarrollo +``` + +**Casos de Uso:** + +- Entornos de desarrollo y pruebas +- Clústeres usados esporádicamente +- Proyectos sensibles a costos +- Pausas temporales de proyectos + + + + + +**Problema Conocido:** Este requisito es actualmente una inconsistencia de la plataforma que debería mejorarse. + +**Comportamiento Actual:** + +- El apagado manual requiere que la función de Apagado Programado esté habilitada +- El módulo en segundo plano debe estar activo para capacidades de apagado +- SleakOps ejecuta varios recursos continuamente por defecto + +**Mejora Esperada a Futuro:** + +- El apagado manual debería estar disponible sin requisitos de programación +- Opciones más flexibles de gestión del clúster +- Mejor separación entre operaciones programadas y manuales + + + + + +Hasta que se resuelva la limitación de la plataforma: + +1. **Habilite Apagado Programado** aunque no necesite programación automática +2. **Configure horarios mínimos** que no interfieran con su flujo de trabajo +3. **Use controles manuales** como método principal de apagado +4. 
**Documente sus patrones de uso** para optimizar configuraciones futuras + +**Consejos para Optimización de Costos:** + +- Habilite la función inmediatamente después de crear el clúster +- Detenga clústeres durante períodos conocidos de inactividad +- Monitoree el uso de recursos para identificar oportunidades de optimización +- Considere múltiples clústeres pequeños en lugar de uno grande siempre activo + + + +--- + +_Esta FAQ fue generada automáticamente el 12 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-node-errors-web-api-unavailable.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-node-errors-web-api-unavailable.mdx new file mode 100644 index 000000000..b5a2c89be --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-node-errors-web-api-unavailable.mdx @@ -0,0 +1,178 @@ +--- +sidebar_position: 3 +title: "Web y API no disponibles debido a errores en los nodos" +description: "Solución para servicios web y API que fallan debido a problemas en los nodos de Kubernetes" +date: "2024-01-15" +category: "cluster" +tags: ["nodos", "web", "api", "solución de problemas", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Web y API no disponibles debido a errores en los nodos + +**Fecha:** 15 de enero de 2024 +**Categoría:** Cluster +**Etiquetas:** Nodos, Web, API, Solución de problemas, Kubernetes + +## Descripción del problema + +**Contexto:** El usuario reporta que tanto los servicios web como API no funcionan correctamente, con mensajes de error que indican problemas relacionados con los nodos en el clúster de Kubernetes. 
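Como referencia para esta sección: los servicios que sobreviven mejor a errores de nodos son los que tienen varias réplicas y sondas de salud, de modo que Kubernetes pueda reprogramarlos y retirar del balanceo los pods no saludables. Un esquema orientativo (nombres, imagen y ruta de salud son ilustrativos):

```yaml
# Despliegue de ejemplo con redundancia y sonda de readiness
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-app
  namespace: tu-namespace
spec:
  replicas: 2 # al menos 2 réplicas para tolerar el fallo de un nodo
  selector:
    matchLabels:
      app: api-app
  template:
    metadata:
      labels:
        app: api-app
    spec:
      containers:
        - name: api
          image: tu-registro/api-app:latest
          readinessProbe: # retira el pod del Service si no responde
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```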
+ +**Síntomas observados:** + +- Servicio web inaccesible +- Endpoints de API sin respuesta +- Mensajes de error mencionando problemas en los nodos +- Indisponibilidad completa del servicio + +**Configuración relevante:** + +- Plataforma: clúster Kubernetes de SleakOps +- Servicios afectados: frontend web y backend API +- Tipo de error: errores relacionados con nodos + +**Condiciones de error:** + +- Fallo simultáneo de servicios web y API +- Errores apuntan a problemas en la infraestructura subyacente de los nodos +- Los servicios permanecen indisponibles hasta que se resuelven los problemas en los nodos + +## Solución detallada + + + +Primero, verifica el estado de los nodos de tu clúster para identificar el problema específico: + +1. **Accede al panel de SleakOps** +2. Navega a **Clusters** → **Tu Clúster** +3. Ve a la sección **Nodos** +4. Revisa los nodos con estado: + - `NotReady` + - `Unknown` + - `SchedulingDisabled` + +Alternativamente, si tienes acceso a kubectl: + +```bash +kubectl get nodes +kubectl describe nodes +``` + + + + + +**Agotamiento de recursos en nodos:** + +- **Síntoma**: Nodos con alto uso de CPU/memoria +- **Solución**: Escalar el grupo de nodos o añadir más nodos + +**Problemas de conectividad de red:** + +- **Síntoma**: Nodos no pueden comunicarse con el plano de control +- **Solución**: Verificar grupos de seguridad y configuración de red + +**Problemas de espacio en disco:** + +- **Síntoma**: Nodos sin espacio suficiente en disco +- **Solución**: Limpiar imágenes no usadas o aumentar el tamaño del disco + +**Fallo de nodo:** + +- **Síntoma**: Nodos completamente no responsivos +- **Solución**: Reemplazar nodos fallidos a través de SleakOps + + + + + +Una vez resueltos los problemas en los nodos, reinicia tus servicios web y API: + +1. **En el panel de SleakOps:** + + - Ve a **Proyectos** → **Tu Proyecto** + - Encuentra tus cargas de trabajo web y API + - Haz clic en **Reiniciar** en cada servicio + +2. 
**Mediante kubectl (si está disponible):**

```bash
# Reiniciar despliegue web
kubectl rollout restart deployment/web-app -n tu-namespace

# Reiniciar despliegue API
kubectl rollout restart deployment/api-app -n tu-namespace

# Verificar estado del rollout
kubectl rollout status deployment/web-app -n tu-namespace
kubectl rollout status deployment/api-app -n tu-namespace
```



Si el problema está relacionado con capacidad insuficiente de nodos:

1. **Accede al panel de SleakOps**
2. Ve a **Clusters** → **Tu Clúster**
3. Navega a **Grupos de nodos**
4. Selecciona el grupo de nodos afectado
5. Incrementa el **Tamaño deseado** o el **Tamaño máximo**
6. Haz clic en **Actualizar grupo de nodos**

El sistema aprovisionará automáticamente nuevos nodos y redistribuirá las cargas de trabajo.



Después de implementar las soluciones, monitorea la recuperación:

1. **Verifica el estado de los nodos** hasta que todos muestren `Ready`
2. **Verifica el estado de los pods**:
   ```bash
   kubectl get pods -n tu-namespace
   ```
3. **Prueba el servicio web** accediendo a la URL de tu aplicación
4. **Prueba los endpoints de la API** usando curl o tu herramienta preferida:
   ```bash
   curl https://tu-url-api/health
   ```
5. **Monitorea los logs** para detectar errores restantes:
   ```bash
   kubectl logs -f deployment/web-app -n tu-namespace
   kubectl logs -f deployment/api-app -n tu-namespace
   ```



Para evitar problemas similares:

1. **Configura alertas de monitoreo** para la salud de los nodos
2. **Configura autoescalado** para los grupos de nodos
3. **Implementa solicitudes y límites de recursos** para tus aplicaciones
4. **Realiza chequeos de salud regulares** en servicios críticos
5. 
**Monitorea métricas del clúster** regularmente desde el panel de SleakOps + +**Configuración recomendada para monitoreo:** + +```yaml +# Ejemplo de límites de recursos +resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx new file mode 100644 index 000000000..7af7af59c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-memory-limit-deployment-failure.mdx @@ -0,0 +1,226 @@ +--- +sidebar_position: 3 +title: "Fallo en el Despliegue Debido a Límites de Memoria del Nodepool" +description: "Solución para fallos en despliegues causados por que el nodepool alcanza límites de capacidad de memoria" +date: "2024-03-13" +category: "cluster" +tags: + ["despliegue", "nodepool", "memoria", "escalado", "resolución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en el Despliegue Debido a Límites de Memoria del Nodepool + +**Fecha:** 13 de marzo de 2024 +**Categoría:** Clúster +**Etiquetas:** Despliegue, Nodepool, Memoria, Escalado, Resolución de problemas + +## Descripción del Problema + +**Contexto:** El usuario experimenta fallos en despliegues en el entorno de QA donde las compilaciones y despliegues toman más de 50 minutos y eventualmente se agota el tiempo, impidiendo actualizaciones exitosas de la aplicación. 
+ +**Síntomas Observados:** + +- Compilaciones de despliegue que tardan más de 50 minutos +- El proceso de despliegue se detiene/se agota el tiempo antes de completarse +- Los pods de migración no pueden ser programados +- Las nuevas versiones de la aplicación no se despliegan + +**Configuración Relevante:** + +- Entorno: clúster de QA +- Aplicación: servicio backend +- Tiempo límite de despliegue: ~50 minutos +- Nodepool: capacidad de memoria limitada +- Pods de migración que requieren recursos adicionales + +**Condiciones de Error:** + +- El nodepool alcanza el límite de memoria al intentar añadir nuevos nodos +- Recursos insuficientes para programar pods de migración +- Tiempo de espera en la pipeline de despliegue debido a restricciones de recursos +- No se pueden crear nuevos pods por límites de capacidad + +## Solución Detallada + + + +El fallo en el despliegue ocurre porque: + +1. **Límite de Memoria del Nodepool**: El nodepool ha alcanzado su máxima asignación de memoria +2. **Programación de Recursos**: Kubernetes no puede programar nuevos pods (como los de migración) debido a recursos insuficientes +3. **Dependencias del Despliegue**: El proceso de despliegue requiere recursos adicionales que no están disponibles +4. **Planificación de Capacidad**: La configuración actual del nodepool no considera el uso máximo de recursos durante los despliegues + + + + + +Para resolver el problema inmediato: + +1. **Acceder al Panel de SleakOps** +2. Navegar a **Gestión de Clúster** → **Nodepools** +3. Seleccionar el nodepool afectado +4. **Incrementar la Asignación de Memoria**: + - Ir a la pestaña **Configuración** + - Aumentar el límite de **Memoria Máxima** + - O aumentar el número máximo de nodos si se usa escalado basado en nodos +5. 
**Aplicar Cambios** y esperar a que se provisionen los nuevos nodos + +```yaml +# Ejemplo de configuración del nodepool +nodepool_config: + min_nodes: 2 + max_nodes: 8 # Incrementado desde el límite previo + instance_type: "t3.large" # O actualizar el tipo de instancia + max_memory_gb: 32 # Límite de memoria incrementado +``` + + + + + +Para prevenir problemas futuros, monitorea los recursos de tu clúster: + +1. **Usar el Panel de Nodepool de SleakOps**: + + - Revisar gráficos de utilización de CPU/Memoria + - Monitorear tendencias durante los tiempos de despliegue + - Configurar alertas para uso alto de recursos + +2. **Métricas Clave a Vigilar**: + + - Utilización de memoria > 80% + - Utilización de CPU > 70% + - Número de pods pendientes + - Capacidad del nodo vs. uso + +3. **Acceder a Kubecost** (si está instalado): + - Asegurar que la conexión VPN esté activa + - Navegar al panel de análisis de costos + - Revisar eficiencia en la asignación de recursos + + + + + +Para mejorar la confiabilidad del despliegue: + +1. **Solicitudes y Límites de Recursos**: + +```yaml +# Establecer solicitudes apropiadas de recursos +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +2. **Estrategia de Despliegue**: + +```yaml +# Usar actualizaciones continuas con gestión adecuada de recursos +strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 +``` + +3. **Configuración de Job de Migración**: + +```yaml +# Asegurar que los jobs de migración tengan recursos adecuados +apiVersion: batch/v1 +kind: Job +metadata: + name: migration-job +spec: + template: + spec: + containers: + - name: migrate + resources: + requests: + memory: "256Mi" + cpu: "100m" +``` + + + + + +Para una gestión sostenible del clúster: + +1. 
**Calcular Uso Máximo**: + + - Recursos en operación normal + - Recursos adicionales durante despliegues + - Recursos para jobs de migración y mantenimiento + - Reserva para picos inesperados (20-30%) + +2. **Estrategia de Dimensionamiento del Nodepool**: + + - **Desarrollo/QA**: 2-4 nodos con autoescalado + - **Producción**: mínimo 3-6 nodos con límites más altos + - **Considerar tipos de instancia**: balancear costo y rendimiento + +3. **Configuración de Autoescalado**: + +```yaml +autoscaling: + enabled: true + min_nodes: 2 + max_nodes: 10 + target_cpu_percent: 70 + target_memory_percent: 80 +``` + + + + + +Cuando los despliegues fallan: + +1. **Verificar estado del Nodepool**: + + ```bash + kubectl get nodes + kubectl describe nodes + ``` + +2. **Verificar estado de los Pods**: + + ```bash + kubectl get pods --all-namespaces + kubectl describe pod + ``` + +3. **Verificar uso de recursos**: + + ```bash + kubectl top nodes + kubectl top pods --all-namespaces + ``` + +4. **Revisar eventos**: + + ```bash + kubectl get events --sort-by=.metadata.creationTimestamp + ``` + +5. 
**Mensajes de error comunes**: + - `Insufficient memory` + - `Insufficient cpu` + - `0/X nodes are available` + - `FailedScheduling` + + + +--- + +_Esta FAQ fue generada automáticamente el 13 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx new file mode 100644 index 000000000..db8929f2e --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-nodepool-missing-after-shutdown.mdx @@ -0,0 +1,143 @@ +--- +sidebar_position: 3 +title: "Nodepools Desaparecidos Después del Apagado del Clúster" +description: "Solución para nodepools que desaparecen tras el ciclo de apagado/arranque del clúster" +date: "2025-02-20" +category: "cluster" +tags: ["nodepool", "shutdown", "cluster", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Nodepools Desaparecidos Después del Apagado del Clúster + +**Fecha:** 20 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** Nodepool, Apagado, Clúster, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan la desaparición de nodepools después de un ciclo de apagado y arranque del clúster, particularmente al usar la función de apagado nocturno en SleakOps. 
+ +**Síntomas Observados:** + +- Los procesos de compilación fallan tras el reinicio del clúster +- Los nodepools faltan en el clúster +- No se pueden desplegar aplicaciones por falta de recursos computacionales +- El clúster parece estar en funcionamiento pero sin nodos trabajadores + +**Configuración Relevante:** + +- Clúster con apagado nocturno habilitado +- Múltiples nodepools configurados +- Procesos de compilación dependientes de la disponibilidad de nodepools + +**Condiciones de Error:** + +- El error ocurre tras el ciclo de apagado/arranque del clúster +- Afecta a clústeres con funciones de apagado automático +- Los nodepools no se restauran automáticamente después del reinicio del clúster +- Fallan las operaciones de compilación y despliegue + +## Solución Detallada + + + +Si experimentas nodepools desaparecidos después de un apagado del clúster: + +1. **Contacta al soporte de SleakOps** inmediatamente para restaurar los nodepools +2. **Evita usar compilaciones** hasta que los nodepools se restauren +3. **Verifica el estado del clúster** en el panel de SleakOps +4. **Confirma la configuración de los nodepools** una vez restaurados + +El equipo de SleakOps puede restaurar manualmente tus nodepools mientras se implementa la solución permanente. + + + + + +Este problema es causado por un error conocido en la función de apagado nocturno: + +- **Proceso de apagado**: Apaga correctamente el clúster +- **Proceso de arranque**: Reinicia el clúster pero no logra restaurar los nodepools +- **Impacto**: Deja el clúster sin nodos trabajadores +- **Estado**: El equipo de SleakOps está corrigiendo el error + + + + + +Hasta que la solución esté disponible: + +1. **Deshabilita el apagado nocturno** si es posible: + + - Ve a **Configuración del Clúster** + - Navega a **Gestión de Energía** + - Desactiva **Apagado Automático** + +2. 
**Alternativa de apagado manual**: + + - Apaga el clúster manualmente cuando sea necesario + - Asegúrate de estar disponible para verificar el estado de los nodepools tras el reinicio + +3. **Monitorea el estado del clúster**: + - Verifica la disponibilidad de los nodepools antes de iniciar compilaciones + - Confirma que los nodos trabajadores estén presentes en el panel de Kubernetes + + + + + +Después de que los nodepools sean restaurados, verifica que funcionen correctamente: + +```bash +# Verificar nodos en el clúster +kubectl get nodes + +# Verificar estado de los nodepools +kubectl get nodes --show-labels + +# Comprobar si los pods pueden programarse +kubectl get pods --all-namespaces +``` + +La salida esperada debe mostrar: + +- Múltiples nodos trabajadores en estado "Ready" +- Etiquetas de nodo adecuadas que indiquen la pertenencia a nodepools +- Pods programados exitosamente en los nodos + + + + + +Para evitar este problema en el futuro: + +1. **Espera la solución**: SleakOps está trabajando en una solución permanente +2. **Monitorea los anuncios**: Mantente atento a las actualizaciones sobre la disponibilidad de la solución +3. **Rehabilita el apagado con cuidado**: Solo vuelve a habilitar el apagado automático después de que la solución esté implementada +4. 
**Prueba exhaustivamente**: Después de la solución, prueba el ciclo de apagado/arranque primero en un entorno no productivo + + + + + +Contacta al soporte de SleakOps inmediatamente si: + +- Los nodepools desaparecen tras el reinicio del clúster +- Las compilaciones fallan por recursos insuficientes +- No puedes desplegar aplicaciones +- El clúster aparece en ejecución pero sin nodos trabajadores + +Proporciona la siguiente información: + +- Nombre e ID del clúster +- Hora en que ocurrió el apagado/arranque +- Capturas de pantalla de los mensajes de error +- Configuración actual de los nodepools + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-check-node-scaling.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-check-node-scaling.mdx new file mode 100644 index 000000000..b15ec5de6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-check-node-scaling.mdx @@ -0,0 +1,218 @@ +--- +sidebar_position: 3 +title: "Escalado de Nodos y Cambios en Recursos con Production Check" +description: "Comprendiendo el escalado de nodos y los cambios en recursos al habilitar Production Check en clusters de SleakOps" +date: "2024-11-20" +category: "cluster" +tags: + ["producción", "escalado", "karpenter", "nodos", "taints", "disponibilidad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Escalado de Nodos y Cambios en Recursos con Production Check + +**Fecha:** 20 de noviembre de 2024 +**Categoría:** Cluster +**Etiquetas:** Producción, Escalado, Karpenter, Nodos, Taints, Disponibilidad + +## Descripción del Problema + +**Contexto:** El usuario habilitó la verificación "Production" en su cluster SleakOps y configuró un grupo de nodos on-demand, pero notó un aumento significativo en 
la cantidad de nodos y en el uso de recursos de infraestructura. + +**Síntomas Observados:** + +- Incremento significativo en el número de nodos tras habilitar Production check +- 7 nodos en total, con 4 asignados a infraestructura (cargas no relacionadas con aplicaciones) +- Todos los nodos ahora tienen taints aplicados (antes solo los nodos no contenedorizados tenían taints) +- Mayor consumo de recursos y costos de facturación +- Separación de cargas de trabajo de infraestructura y de aplicaciones + +**Configuración Relevante:** + +- Production check: Habilitado +- Tipo de grupo de nodos: solo instancias on-demand +- Versión de SleakOps: 1.7.0 (con cambios en taints) +- Karpenter: Habilitado con políticas de consolidación +- Configuración de alta disponibilidad: despliegue multi-AZ + +**Condiciones de Error:** + +- Aumento de costos operativos debido a más nodos +- Sobreaprovisionamiento de recursos en algunas cargas +- Confusión sobre asignación de nodos a infraestructura vs aplicaciones + +## Solución Detallada + + + +Al habilitar Production check en SleakOps, se aplican automáticamente los siguientes cambios para aumentar la disponibilidad del cluster: + +**Cambios en la Infraestructura de Nodos:** + +- Añade un nodo adicional al grupo de nodos principal en una zona de disponibilidad diferente +- Garantiza redundancia multi-AZ para componentes críticos del cluster +- Separa cargas de infraestructura de cargas de aplicación usando taints + +**Redundancia de Sistemas Críticos:** + +- Añade redundancia a sistemas críticos del cluster como Karpenter y ALB Controller +- Resulta en más pods ejecutándose para componentes de infraestructura +- Asegura alta disponibilidad para servicios de gestión del cluster + +**Protección de Aplicaciones:** + +- Añade Pod Disruption Budgets (PDBs) a los despliegues +- Evita que las políticas de consolidación de Karpenter afecten servicios en ejecución +- Protege contra rotación de nodos que impacte la disponibilidad de aplicaciones 
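Los Pod Disruption Budgets mencionados arriba limitan cuántos pods de un despliegue pueden interrumpirse a la vez durante la consolidación o rotación de nodos. Como referencia, un PDB mínimo escrito a mano se vería así (el nombre y el selector son ilustrativos; SleakOps gestiona los suyos automáticamente al habilitar Production check):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-app-pdb
  namespace: tu-namespace
spec:
  minAvailable: 1 # siempre debe quedar al menos 1 pod disponible
  selector:
    matchLabels:
      app: api-app # debe coincidir con las etiquetas del despliegue
```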
+ + + + + +Desde la versión 1.7.0 de SleakOps, la plataforma implementa una estrategia de separación de cargas basada en taints: + +**Cargas de Infraestructura:** + +- Se ejecutan en nodos dedicados con taints específicos +- Usan instancias Graviton (más rentables) +- Manejan componentes de gestión, monitoreo y logging del cluster + +**Cargas de Aplicación:** + +- Se ejecutan en nodos separados sin taints de infraestructura +- Usan tipos de instancia estándar optimizados para aplicaciones +- Aisladas de la competencia por recursos de infraestructura + +**Impacto en Costos:** + +- Nodos de infraestructura usan instancias Graviton más baratas +- Nodos de aplicación mantienen características de rendimiento +- El costo total puede ser similar a pesar de más nodos + +```yaml +# Ejemplo de configuración de taints +infrastructure_nodes: + taints: + - key: "sleakops.com/infrastructure" + value: "true" + effect: "NoSchedule" + instance_types: ["t4g.medium", "t4g.large"] # Instancias Graviton + +application_nodes: + taints: [] # Sin taints para cargas de aplicación + instance_types: ["t3.medium", "t3.large"] # Instancias estándar +``` + + + + + +Para optimizar los recursos del cluster tras habilitar Production check: + +**Optimización de Memoria:** + +- Revisar solicitudes de memoria vs uso real +- Ejemplo: trabajos worker usando 2GB pico pero solicitando 8GB totales +- Ajustar solicitudes para reflejar consumo real + +```yaml +# Antes de la optimización +resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "1000m" + +# Después de optimización basada en uso real +resources: + requests: + memory: "512Mi" # Reducido según uso real + cpu: "250m" + limits: + memory: "2Gi" # Límite más realista + cpu: "500m" +``` + +**Ajuste de Componentes de Infraestructura:** + +- Requerimientos de recursos de Grafana pueden optimizarse +- Almacenamiento y memoria de Loki pueden ajustarse +- Reglas de anti-affinity podrían necesitar corrección para casos 
específicos + + + + + +Karpenter optimiza automáticamente la selección de instancias y la consolidación: + +**Selección de Instancias:** + +- Siempre selecciona las instancias más baratas que cumplen con los requisitos +- Considera precios spot vs on-demand cuando spot está habilitado +- Equilibra requisitos de rendimiento con costo + +**Política de Consolidación:** + +- Consolida cargas automáticamente cuando es posible +- Respeta Pod Disruption Budgets (PDBs) para mantener disponibilidad +- Puede no resultar siempre en menos instancias, pero asegura eficiencia de costos + +**Compensación Disponibilidad vs Costo:** + +- Production check prioriza disponibilidad sobre costo +- Instancias on-demand aumentan facturación pero brindan estabilidad +- Despliegue multi-AZ asegura resiliencia pero requiere más recursos + +```yaml +# Configuración de NodePool de Karpenter para optimización de costos +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: cost-optimized +spec: + requirements: + - key: "karpenter.sh/capacity-type" + operator: In + values: ["spot", "on-demand"] # Preferir spot cuando esté disponible + - key: "node.kubernetes.io/instance-type" + operator: In + values: ["t3.medium", "t3.large", "t4g.medium", "t4g.large"] + disruption: + consolidationPolicy: WhenUnderutilized + consolidateAfter: 30s +``` + + + + + +Para comprender y optimizar mejor su cluster: + +**Monitoreo de Recursos:** + +- Usar dashboards de Grafana para monitorear recursos reales vs solicitados +- Seguir la utilización de nodos en diferentes grupos +- Monitorear tendencias de costos tras habilitar Production check + +**Acciones de Optimización:** + +1. Revisar y ajustar solicitudes de recursos para todas las cargas +2. Considerar habilitar instancias spot para cargas no críticas +3. Monitorizar configuraciones de PDB para asegurar que no sean muy restrictivas +4. 
Revisión periódica de métricas de consolidación de Karpenter + +**Resultados Esperados:** + +- Mayor disponibilidad y resiliencia +- Rendimiento más predecible +- Utilización optimizada de recursos con el tiempo +- Mejor separación de responsabilidades entre infraestructura y aplicaciones + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de noviembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-mode-spot-instances.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-mode-spot-instances.mdx new file mode 100644 index 000000000..bccc2e80e --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-production-mode-spot-instances.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "Modo de Producción del Clúster y Tipos de Instancias" +description: "Comprendiendo el modo de producción, instancias SPOT vs On-Demand y configuración de Karpenter" +date: "2024-12-19" +category: "cluster" +tags: ["producción", "instancias-spot", "on-demand", "karpenter", "eks"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Modo de Producción del Clúster y Tipos de Instancias + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Producción, Instancias SPOT, On-Demand, Karpenter, EKS + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender la diferencia entre los modos de clúster de producción y desarrollo en SleakOps, particularmente respecto a los tipos de instancias (SPOT vs On-Demand) y su impacto en la estabilidad y costos del clúster. 
+ +**Síntomas Observados:** + +- Confusión sobre cuándo usar el modo producción +- Incertidumbre sobre instancias SPOT vs On-Demand para nodepools +- Preguntas sobre si el modo producción cambia automáticamente la configuración del nodepool +- Problemas de programación de pods relacionados con taints en nodos en Karpenter + +**Configuración Relevante:** + +- Modo del clúster: Producción vs Desarrollo +- Tipos de instancias: SPOT vs On-Demand +- Configuración de nodepools en Karpenter +- Cargas críticas del sistema (ALB, external-dns, etc.) + +**Condiciones de Error:** + +- Pods que no pueden programarse debido a taints en nodos +- Inestabilidad del clúster con instancias SPOT +- Confusión sobre optimización de costos vs confiabilidad + +## Solución Detallada + + + +El modo producción en SleakOps afecta la **infraestructura crítica del clúster**, no las cargas de trabajo de tu aplicación: + +**Qué Cambia el Modo Producción:** + +- Usa instancias On-Demand para componentes críticos del sistema (Karpenter, Controlador ALB, External-DNS, etc.) 
+- Aplica configuraciones adicionales que mejoran la disponibilidad del clúster +- Asegura estabilidad del sistema para cargas de producción + +**Qué NO Cambia el Modo Producción:** + +- Tus nodepools de aplicación permanecen sin cambios +- Puedes seguir usando instancias SPOT para tus cargas +- No afecta las configuraciones de despliegue de la aplicación + + + + + +**Instancias SPOT:** + +- **Ventajas:** Ahorro de hasta 90% en costos, ideales para cargas tolerantes a fallos +- **Desventajas:** AWS puede terminarlas con aviso de 2 minutos +- **Mejor para:** Entornos de desarrollo, procesamiento por lotes, aplicaciones sin estado + +**Instancias On-Demand:** + +- **Ventajas:** Disponibilidad garantizada, costos predecibles +- **Desventajas:** Costo más alto (precio completo) +- **Mejor para:** Aplicaciones críticas, bases de datos, cargas de producción que requieren alta disponibilidad + +**Recomendación para Nodepools:** + +```yaml +# Para aplicaciones que pueden manejar rotación de nodos +instance_type: spot + +# Para aplicaciones críticas que requieren disponibilidad garantizada +instance_type: on-demand +``` + + + + + +Karpenter gestiona automáticamente diferentes tipos de nodos con taints específicos: + +**Taints Comunes en Karpenter:** + +- `karpenter.sh/nodepool: spot-amd64` - Instancias SPOT con arquitectura AMD64 +- `karpenter.sh/nodepool: spot-arm64` - Instancias SPOT con arquitectura ARM64 +- `karpenter.sh/nodepool: ondemand-amd64` - Instancias On-Demand con arquitectura AMD64 +- `karpenter.sh/nodepool: ondemand-arm64` - Instancias On-Demand con arquitectura ARM64 + +**Si los pods no pueden programarse debido a taints:** + +1. Verifica las tolerancias de tus pods +2. Revisa la configuración del nodepool +3. 
Asegúrate que Karpenter pueda aprovisionar nodos adecuados + +```yaml +# Ejemplo de tolerancia para instancias SPOT +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + + + + + +**Para Clústeres de Producción:** + +1. **Habilitar Modo Producción:** + + - Ir a Configuración del Clúster + - Activar "Modo Producción" + - Esto asegura componentes críticos del sistema + +2. **Estrategia de Nodepools:** + + - Usar **On-Demand** para aplicaciones críticas (bases de datos, APIs) + - Usar **SPOT** para cargas tolerantes a fallos (trabajadores, trabajos por lotes) + - Considerar nodepools mixtos para optimización de costos + +3. **Preparación de Aplicaciones para SPOT:** + - Implementar manejo de apagado ordenado + - Usar autoescalado horizontal de pods + - Diseñar para tolerar rotación de nodos + - Implementar chequeos de salud adecuados + +**Ejemplo de Configuración Mixta:** + +```yaml +# Nodepool para cargas críticas +critical_nodepool: + instance_type: on-demand + min_size: 2 + max_size: 10 + +# Nodepool para procesamiento por lotes +batch_nodepool: + instance_type: spot + min_size: 0 + max_size: 50 +``` + + + + + +Si los pods en el namespace `kube-system` fallan al programarse: + +**1. Verificar Disponibilidad de Nodos:** + +```bash +kubectl get nodes +kubectl describe nodes +``` + +**2. Verificar Estado de Pods:** + +```bash +kubectl get pods -n kube-system +kubectl describe pod -n kube-system +``` + +**3. Revisar Logs de Karpenter:** + +```bash +kubectl logs -n karpenter deployment/karpenter +``` + +**4. Verificar Configuración de Nodepool:** + +- Asegurar que los nodepools estén correctamente configurados +- Comprobar que Karpenter pueda aprovisionar nuevos nodos +- Verificar cuotas y permisos en AWS + +**5. 
Soluciones Comunes:** + +- Reiniciar despliegue de Karpenter +- Revisar cuotas EC2 en AWS +- Verificar capacidad de subred +- Revisar configuraciones de grupos de seguridad + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-prometheus-memory-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-prometheus-memory-issues.mdx new file mode 100644 index 000000000..d2a1e0da0 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-prometheus-memory-issues.mdx @@ -0,0 +1,204 @@ +--- +sidebar_position: 3 +title: "Problemas de Memoria de Prometheus que Causan Bloqueos en Pods" +description: "Solución para problemas de memoria de Prometheus que causan que los pods de la aplicación se bloqueen en producción" +date: "2024-12-23" +category: "cluster" +tags: ["prometheus", "memoria", "monitoreo", "solución de problemas", "pods"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Memoria de Prometheus que Causan Bloqueos en Pods + +**Fecha:** 23 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Prometheus, Memoria, Monitoreo, Solución de Problemas, Pods + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan bloqueos en pods de aplicaciones en entornos de producción, donde los pods parecen saludables en herramientas de monitoreo como Lens pero las solicitudes a la API se vuelven completamente no responsivas. El problema está típicamente relacionado con el consumo de memoria de Prometheus que afecta la estabilidad del nodo. 
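+
+Como referencia, un readiness probe sobre el endpoint de la API ayuda a que Kubernetes retire del balanceador a los pods que dejan de responder aunque sigan apareciendo "verdes". Boceto ilustrativo (la ruta `/healthz`, el puerto y los umbrales son hipotéticos y deben ajustarse a tu aplicación):
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /healthz # ruta hipotética de chequeo de salud
+    port: 8080
+  periodSeconds: 10
+  timeoutSeconds: 3 # si la API no responde a tiempo, el pod deja de recibir tráfico
+  failureThreshold: 3
+```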
+ +**Síntomas Observados:** + +- Los pods de la aplicación aparecen como "verdes" o saludables en las herramientas de monitoreo de Kubernetes +- Las solicitudes a la API se bloquean completamente sin devolver respuestas +- Solo algunos pods se bloquean mientras otros en el mismo despliegue continúan funcionando normalmente +- El balanceador de carga distribuye tráfico tanto a pods funcionando como a los bloqueados +- El problema ocurre principalmente en producción con alto tráfico, no en desarrollo +- Los pods no se caen ni reinician, simplemente dejan de responder + +**Configuración Relevante:** + +- Addon de Prometheus instalado en el clúster +- Múltiples réplicas de la aplicación (por ejemplo, 15 réplicas) +- Entorno de producción con carga de tráfico significativa +- Pods distribuidos en múltiples nodos + +**Condiciones de Error:** + +- Pod de Prometheus consumiendo memoria excesiva +- Agotamiento de recursos del nodo causando bloqueos en pods +- Fallas intermitentes afectando a un subconjunto de pods +- Mala experiencia de usuario debido a tiempos de espera en solicitudes + +## Solución Detallada + + + +Este problema ocurre típicamente cuando: + +1. **Agotamiento de memoria de Prometheus**: El pod de Prometheus comienza a consumir más RAM de la asignada +2. **Agotamiento de recursos del nodo**: Cuando Prometheus agota los recursos del nodo, afecta a todos los pods en ese nodo +3. **Degradación parcial del servicio**: Solo los pods en nodos afectados se bloquean, mientras otros continúan funcionando +4. **Latencia de servicios de terceros**: Las llamadas a APIs externas pueden contribuir al problema creando cuellos de botella + +El indicador clave es que los pods parecen saludables pero se vuelven no responsivos a las solicitudes. + + + + + +Para resolver el problema inmediato: + +1. 
**Acceder a la configuración del addon de Prometheus**: + + - Ve a la sección **Addons** de tu clúster + - Encuentra **Prometheus** en la lista de addons + - Haz clic para abrir la configuración + +2. **Incrementar la asignación mínima de RAM**: + + - Ubica la configuración "RAM mínima" + - Aumenta el valor a **2200 MB** (o más según tus necesidades) + - Guarda la configuración + +3. **Aplicar los cambios**: + - El sistema actualizará Prometheus con la nueva asignación de memoria + - Este proceso puede tardar hasta 20 minutos en completarse + +```yaml +# Ejemplo de configuración de Prometheus +prometheus: + resources: + requests: + memory: "2200Mi" + limits: + memory: "4000Mi" +``` + + + + + +Si los nodos ya están afectados: + +1. **Identificar el pod problemático de Prometheus**: + + - Busca el pod de Prometheus con contenedores mostrando estado naranja/advertencia + - Anota en qué nodo está alojado este pod + +2. **Eliminar el nodo afectado**: + + - Haz clic en el nombre del nodo para abrir detalles + - Selecciona "Eliminar" para remover el nodo + - **Importante**: Haz esto al mismo tiempo o justo antes de actualizar Prometheus + +3. **Verificar recuperación**: + - Espera a que el pod de Prometheus se reprograme en un nuevo nodo + - Comprueba que todos los contenedores muestren estado verde/saludable + - Monitorea los pods de la aplicación para confirmar que la funcionalidad se restauró + + + + + +Para prevenir problemas futuros y diagnosticar mejor: + +1. **Instalar Monitoreo de Rendimiento de Aplicaciones (APM)**: + + - **OpenTelemetry** (disponible como addon de clúster, actualmente en beta) + - **New Relic** (servicio externo) + - **Datadog** (servicio externo) + +2. **Configurar OpenTelemetry en SleakOps**: + + - Ve a **Clúster** → **Addons** + - Instala el addon **OpenTelemetry** + - Aplica a tus servicios de proyecto + +3. 
**Monitorear latencia de servicios de terceros**: + - Verifica tiempos de respuesta de APIs externas + - Identifica posibles cuellos de botella en dependencias de servicio + - Configura alertas para alta latencia o tiempos de espera + +```yaml +# Ejemplo de configuración de OpenTelemetry +opentelemetry: + enabled: true + exporters: + - jaeger + - prometheus + sampling_rate: 0.1 +``` + + + + + +**Gestión de Recursos:** + +- Establece solicitudes y límites adecuados de recursos para todas las aplicaciones +- Monitorea tendencias de uso de recursos a lo largo del tiempo +- Escala recursos de Prometheus proactivamente según el tamaño del clúster + +**Configuración de Monitoreo:** + +- Implementa un APM completo para rastrear el rendimiento de la aplicación +- Configura alertas para agotamiento de recursos +- Monitorea dependencias de servicios externos + +**Proceso de Solución de Problemas:** + +- Crea tickets de soporte cuando ocurran problemas con intervalos de tiempo específicos +- Documenta síntomas exactos y condiciones de error +- Incluye detalles relevantes de configuración + +**Pruebas de Carga:** + +- Prueba las aplicaciones bajo cargas de tráfico similares a producción +- Valida la asignación de recursos en entornos de staging +- Monitorea fugas de memoria y degradación de rendimiento + + + + + +Cuando se experimenten problemas similares: + +**Acciones Inmediatas:** + +1. Verifica el estado del pod de Prometheus y uso de recursos +2. Identifica qué nodos están afectados +3. Incrementa la asignación de memoria de Prometheus +4. Elimina nodos afectados si es necesario + +**Pasos de Investigación:** + +1. Anota los intervalos de tiempo exactos cuando ocurren los problemas +2. Revisa dashboards de Grafana para métricas de recursos +3. Revisa logs de la aplicación para errores o tiempos de espera +4. Monitorea tiempos de respuesta de servicios de terceros + +**Documentación:** + +1. Crea tickets de soporte con intervalos de tiempo específicos +2. 
Incluye nombres de pods e información de nodos +3. Describe síntomas exactos observados +4. Anota cualquier cambio reciente en configuración + + + +--- + +_Esta FAQ fue generada automáticamente el 23 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx new file mode 100644 index 000000000..e8bdc0300 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-understanding-ec2-instances-and-costs.mdx @@ -0,0 +1,202 @@ +--- +sidebar_position: 3 +title: "Comprendiendo las Instancias EC2 y los Costos en Clústeres EKS" +description: "Explicación de las instancias EC2 creadas por SleakOps y sus implicaciones de costo" +date: "2025-01-29" +category: "cluster" +tags: ["eks", "aws", "ec2", "costos", "karpenter", "vpn"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Comprendiendo las Instancias EC2 y los Costos en Clústeres EKS + +**Fecha:** 29 de enero de 2025 +**Categoría:** Clúster +**Etiquetas:** EKS, AWS, EC2, Costos, Karpenter, VPN + +## Descripción del Problema + +**Contexto:** Los usuarios notan múltiples instancias EC2 ejecutándose en su cuenta AWS junto con su clúster EKS y desean entender el propósito de cada instancia y los costos asociados. 
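+
+Como referencia para acotar el gasto, el recurso NodePool de Karpenter admite límites agregados de capacidad que frenan el aprovisionamiento de nuevas instancias. Boceto ilustrativo (los valores son hipotéticos):
+
+```yaml
+apiVersion: karpenter.sh/v1beta1
+kind: NodePool
+metadata:
+  name: default
+spec:
+  limits:
+    cpu: "100" # tope agregado de vCPU que Karpenter puede aprovisionar
+    memory: 400Gi # tope agregado de memoria
+```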
+
+**Síntomas Observados:**
+
+- Múltiples instancias EC2 apareciendo en la consola AWS (por ejemplo, t4g.medium, t3a.small, c7g.xlarge)
+- Incertidumbre sobre qué instancias son para VPN vs nodos del clúster
+- Preguntas sobre las implicaciones de costo a medida que las cargas de trabajo escalan
+- Necesidad de comprender la relación entre pods e instancias EC2
+
+**Configuración Relevante:**
+
+- Clúster EKS con gestión de nodos por Karpenter
+- Configuración de instancias Spot para pods
+- Instancia VPN para acceso seguro
+- Varios tipos de instancia (t4g.medium, t3a.small, c7g.xlarge)
+
+**Condiciones de Error:**
+
+- No hay errores técnicos, pero sí confusión sobre costos de infraestructura
+- Necesidad de comprensión para optimización de costos
+- Necesidad de aclarar el comportamiento de escalado
+
+## Solución Detallada
+
+
+
+**Instancia VPN:**
+
+- **Propósito**: Proporciona acceso VPN seguro a tu clúster y recursos
+- **Tamaño típico**: Usualmente t3a.small o una instancia pequeña similar
+- **Costo**: Costo fijo, funciona continuamente independientemente de la carga
+- **Escalado**: No escala con la carga de tu aplicación
+
+**Nodos Trabajadores del Clúster:**
+
+- **Propósito**: Ejecutar tus pods de aplicación y cargas de trabajo Kubernetes
+- **Gestionados por**: Karpenter (provisión automática de nodos)
+- **Tipos de instancia**: Varios tamaños (t4g.medium, c7g.xlarge, etc.)
+- **Costo**: Variable, escala con la demanda de tu carga de trabajo + + + + + +Karpenter gestiona automáticamente los nodos de tu clúster: + +```yaml +# Ejemplo de configuración de Nodepool +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: default +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.sh/capacity-type + operator: In + values: ["spot", "on-demand"] + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: default +``` + +**Comportamientos clave:** + +- **Escalado automático**: Crea instancias cuando los pods necesitan recursos +- **Optimización de costos**: Prefiere instancias spot cuando están disponibles +- **Dimensionamiento adecuado**: Selecciona tipos de instancia apropiados para los requisitos +- **Limpieza**: Termina instancias no usadas para reducir costos + + + + + +**Costos Fijos:** + +- Instancia VPN: ~10-20 USD/mes (dependiendo del tamaño de instancia) +- Plano de control EKS: 0.10 USD/hora (~73 USD/mes) + +**Costos Variables (escala con la carga):** + +- Instancias nodos trabajadores: Depende de: + - Número de pods en ejecución + - Requerimientos de recursos (CPU/memoria) + - Tipos de instancia seleccionados por Karpenter + - Precios Spot vs On-Demand + +**Consejos para optimización de costos:** + +```bash +# Monitorea la utilización de tus nodos +kubectl top nodes + +# Revisa las solicitudes de recursos de los pods +kubectl describe pods -A | grep -A 5 "Requests:" + +# Visualiza las decisiones de nodos de Karpenter +kubectl logs -n karpenter deployment/karpenter +``` + + + + + +**Proceso de Escalado:** + +1. **Creación de Pod:** Despliegas más pods en tu clúster +2. **Evaluación de Recursos:** Karpenter evalúa los requerimientos de recursos +3. 
**Provisión de Nodo:** Si los nodos existentes no pueden alojar nuevos pods: + - Karpenter provisiona nuevas instancias EC2 + - Selecciona el tipo de instancia óptimo según requerimientos + - Prefiere instancias spot para ahorro de costos +4. **Programación de Pods:** Kubernetes programa los pods en nodos disponibles +5. **Reducción de Escala:** Cuando se eliminan pods, Karpenter elimina nodos no usados + +**Ejemplo de escenario de escalado:** + +```yaml +# Si tienes pods con estos requerimientos: +resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi +``` + +Karpenter: + +- Calcula las necesidades totales de recursos +- Selecciona tipos de instancia costo-efectivos +- Proporciona el número mínimo de instancias necesarias + + + + + +**Herramientas para Monitoreo de Costos:** + +1. **AWS Cost Explorer**: Rastrea el gasto en EC2 por tipo de instancia +2. **Métricas de Karpenter**: Monitorea las decisiones de provisión de nodos +3. **Dashboard de SleakOps**: Visualiza la utilización de recursos del clúster + +**Estrategias para Controlar Costos:** + +```yaml +# Establece límites de recursos en tus despliegues +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + containers: + - name: app + resources: + requests: + cpu: 100m + memory: 128Mi + limits: + cpu: 500m + memory: 512Mi +``` + +**Controles de costo en Nodepool:** + +- Configura tamaños máximos de instancia +- Establece límites en el grupo de nodos +- Usa instancias spot cuando sea posible +- Configura solicitudes de recursos apropiadas + +Para configuración detallada de nodepool, consulta: [Documentación de Nodepool](https://docs.sleakops.com/cluster/nodepools) + + + +--- + +_Este FAQ fue generado automáticamente el 29 de enero de 2025 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx new file mode 100644 index 000000000..c70b8abe7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-maintenance-scheduling.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 3 +title: "Programación de Actualizaciones y Mantenimiento de Clúster" +description: "Mejores prácticas para programar actualizaciones de clúster y ventanas de mantenimiento" +date: "2024-12-19" +category: "cluster" +tags: + [ + "clúster", + "actualización", + "mantenimiento", + "programación", + "tiempo de inactividad", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Programación de Actualizaciones y Mantenimiento de Clúster + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Clúster, Actualización, Mantenimiento, Programación, Tiempo de inactividad + +## Descripción del Problema + +**Contexto:** Las organizaciones necesitan coordinar las actualizaciones del clúster y las actividades de mantenimiento para minimizar la interrupción del servicio y asegurar una comunicación adecuada entre el equipo de SleakOps y los clientes. 
+ +**Síntomas Observados:** + +- Interrupciones inesperadas del servicio durante las ventanas de mantenimiento +- Falta de comunicación clara sobre los horarios de actualización +- Actualizaciones incompletas del clúster que causan inconsistencias en la plataforma +- Servicios que se vuelven inaccesibles sin previo aviso +- Confusión sobre los horarios de inicio y fin del mantenimiento + +**Configuración Relevante:** + +- Procesos de actualización del clúster (actualizaciones de versión de Kubernetes) +- Actividades de migración de infraestructura +- Cambios en la configuración de dominio y DNS +- Actualizaciones de balanceadores de carga y entradas (ingress) + +**Condiciones de Error:** + +- Fallos en servicios durante mantenimiento no programado +- Actualizaciones parciales que dejan clústeres en estados inconsistentes +- Problemas de configuración de DNS e ingress después de la actualización +- Problemas de validación de certificados tras cambios en la infraestructura + +## Solución Detallada + + + +### Planificación Previa al Mantenimiento + +Antes de programar cualquier actualización o mantenimiento de clúster: + +1. **Evaluar Impacto:** Identificar todos los servicios y cargas de trabajo que se verán afectados +2. **Definir Alcance:** Delimitar claramente qué será actualizado o modificado +3. **Estimar Duración:** Proporcionar estimaciones realistas del tiempo para la ventana de mantenimiento +4. 
**Preparar Plan de Reversión:** Tener una estrategia clara para revertir cambios en caso de problemas + +### Protocolo de Comunicación + +- **Aviso Previo:** Notificar a los clientes al menos 48-72 horas antes del mantenimiento +- **Horario Detallado:** Proporcionar horas específicas de inicio y fin +- **Información de Contacto:** Asegurar que haya contactos de emergencia disponibles durante el mantenimiento +- **Actualizaciones de Estado:** Enviar actualizaciones regulares durante la ventana de mantenimiento + + + + + +### Mejores Prácticas para Actualización de Clúster + +```yaml +# Ejemplo de comunicación de horario de mantenimiento +Ventana de Mantenimiento: + Inicio: "2024-12-20 02:00 UTC" + Fin: "2024-12-20 06:00 UTC" + ZonaHoraria: "UTC-3 (Hora Argentina: 23:00 - 03:00)" + +Servicios Afectados: + - Actualización del clúster Kubernetes (1.28 → 1.29) + - Actualizaciones en el grupo de nodos + - Configuración del balanceador de carga + - Renovación de DNS y certificados + +Tiempo de Inactividad Esperado: + - Estimado: 15-30 minutos + - Máximo: 2 horas +``` + +### Pasos de Coordinación + +1. **Acuerdo del Horario:** Confirmar la ventana de mantenimiento con el cliente +2. **Lista de Verificación Pre-actualización:** Verificar que se cumplan todos los prerrequisitos +3. **Creación de Copias de Seguridad:** Asegurar respaldo de todos los datos críticos +4. **Enfoque por Etapas:** Actualizar primero los entornos no productivos +5. 
**Configuración de Monitoreo:** Tener monitoreo y alertas listas + + + + + +### Notificación Previa al Mantenimiento + +```markdown +Asunto: Mantenimiento Programado - [Nombre del Clúster] - [Fecha] + +Estimado [Cliente], + +Hemos programado mantenimiento para su clúster: + +**Detalles del Mantenimiento:** + +- Fecha: [Fecha] +- Hora de Inicio: [Hora] [Zona Horaria] +- Hora de Fin: [Hora] [Zona Horaria] +- Duración Estimada: [Duración] + +**Actividades a Realizar:** + +- [Lista de actividades] + +**Impacto Esperado:** + +- [Descripción de posibles interrupciones del servicio] + +**Preparación Requerida:** + +- [Acciones necesarias por parte del cliente] + +Enviaremos actualizaciones durante la ventana de mantenimiento. + +Saludos cordiales, +Equipo SleakOps +``` + +### Actualizaciones Durante el Mantenimiento + +```markdown +Asunto: Actualización de Mantenimiento - [Estado] - [Hora] + +**Estado:** [En Progreso/Completado/Problema] +**Hora:** [Hora Actual] +**Progreso:** [Descripción de actividades actuales] +**Próximos Pasos:** [Qué sucederá a continuación] +**Finalización Estimada:** [Hora actualizada si es diferente] +``` + + + + + +### Lista de Verificación Post-Mantenimiento + +1. **Verificación de Servicios** + + - Probar todos los servicios críticos + - Verificar resolución DNS + - Comprobar certificados SSL + - Validar reglas del balanceador de carga + +2. **Monitoreo de Rendimiento** + + - Monitorizar rendimiento del clúster durante 24-48 horas + - Revisar utilización de recursos + - Verificar funcionalidad de autoescalado + +3. **Comunicación con el Cliente** + - Enviar notificación de finalización + - Proporcionar resumen de los cambios realizados + - Incluir instrucciones post-mantenimiento + +### Plantilla de Notificación de Finalización + +```markdown +Asunto: Mantenimiento Completado - [Nombre del Clúster] + +Estimado [Cliente], + +El mantenimiento programado ha sido completado con éxito. 
+ +**Resumen:** + +- Hora de Inicio: [Hora real de inicio] +- Hora de Fin: [Hora real de fin] +- Duración: [Duración real] + +**Cambios Realizados:** + +- [Lista de actividades completadas] + +**Verificación:** + +- Todos los servicios están operativos +- Las métricas de rendimiento son normales +- No se detectaron problemas + +**Próximos Pasos:** + +- Monitorice sus aplicaciones durante las próximas 24 horas +- Contáctenos inmediatamente si detecta algún problema + +Gracias por su paciencia. + +Saludos cordiales, +Equipo SleakOps +``` + + + + + +### Si Ocurren Problemas Durante el Mantenimiento + +1. **Evaluación Inmediata** + + - Detener actividades actuales si es seguro hacerlo + - Evaluar la gravedad del problema + - Determinar si es necesario revertir cambios + +2. **Comunicación con el Cliente** + + - Notificar al cliente inmediatamente sobre cualquier problema + - Proporcionar un cronograma realista para la resolución + - Explicar las opciones disponibles (continuar, revertir, pausar) + +3. 
**Proceso de Escalamiento** + - Tener disponibles miembros senior del equipo + - Establecer rutas claras de escalamiento + - Mantener registros detallados de todas las actividades + +### Protocolo de Contacto de Emergencia + +```yaml +Contactos de Emergencia: + Primario: "[Contacto primario y teléfono]" + Secundario: "[Contacto secundario y teléfono]" + Escalamiento: "[Contacto de gerencia]" + +Canales de Comunicación: + - Email (inmediato) + - Teléfono/SMS (problemas críticos) + - Slack/Teams (actualizaciones en tiempo real) +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-zero-downtime.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-zero-downtime.mdx new file mode 100644 index 000000000..68efcabd9 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cluster-upgrade-zero-downtime.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 3 +title: "Actualización de Clúster Kubernetes sin Tiempo de Inactividad" +description: "Cómo realizar actualizaciones de clúster en SleakOps sin interrupción del servicio" +date: "2025-02-19" +category: "clúster" +tags: + [ + "actualización", + "kubernetes", + "tiempo de inactividad", + "mantenimiento", + "producción", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Actualización de Clúster Kubernetes sin Tiempo de Inactividad + +**Fecha:** 19 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** Actualización, Kubernetes, Tiempo de inactividad, Mantenimiento, Producción + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan actualizar sus clústeres Kubernetes en SleakOps pero están preocupados por el posible tiempo de inactividad durante horas críticas de negocio, especialmente en períodos de alto tráfico o en operaciones de fin 
de mes. + +**Síntomas Observados:** + +- Preocupación por la interrupción del servicio durante las actualizaciones del clúster +- Necesidad de programar actualizaciones en horas de bajo tráfico +- Incertidumbre sobre el proceso de actualización y su impacto +- Preguntas sobre capacidades de actualización de autoservicio + +**Configuración Relevante:** + +- Clústeres de producción con múltiples réplicas +- Servicios web y workers con requisitos de alta disponibilidad +- Operaciones críticas de negocio en ventanas horarias específicas +- Políticas de despliegue configuradas para actualizaciones continuas + +**Condiciones de Error:** + +- Riesgo de tiempo de inactividad durante horas críticas de negocio +- Potencial degradación del servicio durante el proceso de actualización +- Necesidad de soporte inmediato si surgen problemas durante la actualización + +## Solución Detallada + + + +SleakOps implementa actualizaciones de clúster sin tiempo de inactividad mediante: + +1. **Actualizaciones Graduales de Nodos**: Los nodos se rotan progresivamente, no todos a la vez +2. **Políticas de Despliegue**: El 33% de cada despliegue debe permanecer activo durante las actualizaciones +3. **Distribución de Pods**: Los servicios se distribuyen en múltiples nodos +4. **Verificaciones de Salud**: Monitorización continua asegura la disponibilidad del servicio + +**Requisitos para Cero Tiempo de Inactividad:** + +- Clúster en modo producción +- Al menos 2 réplicas por servicio web o worker +- Asignación adecuada de recursos y reglas de anti-afinidad de pods + + + + + +Para ejecutar la actualización tú mismo: + +1. **Accede a la Consola**: Ve a la página de configuración de tu clúster +2. **Navega a Configuración General**: Encuentra la sección de actualización +3. **Revisa Detalles de la Actualización**: Verifica la versión objetivo y los cambios +4. 
**Inicia la Actualización**: Haz clic en el botón de actualización cuando estés listo + +**Ruta en la Consola:** + +``` +Consola SleakOps → Clústeres → [Tu Clúster] → Configuración → Configuración General +``` + +**Buenas Prácticas:** + +- Programar durante horas de bajo tráfico (se recomienda temprano en la mañana) +- Notificar a tu equipo antes de comenzar +- Monitorear los logs de la aplicación durante el proceso +- Tener un plan de reversión listo si es necesario + + + + + +Durante el proceso de actualización, monitorea: + +1. **Salud de la Aplicación**: Revisa los endpoints de tu aplicación +2. **Estado de los Pods**: Asegúrate que los pods se estén recreando correctamente +3. **Uso de Recursos**: Monitorea CPU y memoria durante la transición +4. **Logs**: Observa cualquier mensaje de error en los logs de la aplicación + +**Indicadores Clave de Actualización Exitosa:** + +- Todos los pods muestran estado "Running" +- Los endpoints de la aplicación responden normalmente +- No hay picos de error en los paneles de monitoreo +- Las conexiones a base de datos permanecen estables + + + + + +**Horario Recomendado:** + +- Horas tempranas de la mañana (2-6 AM hora local) +- Fines de semana o ventanas de mantenimiento +- Evitar fin de mes o períodos de alto tráfico +- Considerar zonas horarias para aplicaciones globales + +**Lista de Verificación Pre-Actualización:** + +- [ ] Verificar que el clúster tenga múltiples réplicas +- [ ] Comprobar que los despliegues recientes de la aplicación estén estables +- [ ] Asegurar que el monitoreo esté activo +- [ ] Notificar a los miembros relevantes del equipo +- [ ] Planificar soporte inmediato si es necesario + + + + + +SleakOps proporciona: + +1. **Soporte Prioritario**: Prioridad el mismo día para problemas relacionados con la actualización +2. **Monitoreo Proactivo**: El equipo monitorea actualizaciones críticas +3. **Respuesta Inmediata**: Resolución rápida para cualquier problema +4. 
**Validación Post-Actualización**: Verificación de que todos los servicios estén saludables + +**Cuándo Contactar Soporte:** + +- Si la actualización tarda más de lo esperado +- Aparecen errores en la aplicación durante la actualización +- Los pods no se reinician correctamente +- Degradación del rendimiento después de la actualización + + + + + +Si ocurren problemas durante la actualización: + +1. **Acciones Inmediatas**: + + - Contactar soporte de SleakOps inmediatamente + - Documentar cualquier mensaje de error + - Evitar intervenciones manuales a menos que se indique + +2. **Proceso de Reversión**: + + - SleakOps puede iniciar la reversión si es necesario + - Las aplicaciones serán restauradas a versiones previas + - La integridad de los datos se mantiene en todo momento + +3. **Post-Incidencia**: + - Análisis de causa raíz proporcionado + - Recomendaciones para futuras actualizaciones + - Programar reintento con precauciones adicionales + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/connecting-to-cluster-with-lens.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/connecting-to-cluster-with-lens.mdx new file mode 100644 index 000000000..ea6a4f0e8 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/connecting-to-cluster-with-lens.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "Conectando al Cluster de Kubernetes con Lens IDE" +description: "Guía paso a paso para conectarte a tu cluster SleakOps usando Lens IDE" +date: "2024-12-19" +category: "cluster" +tags: ["lens", "kubernetes", "ide", "conexión", "kubectl"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Conectando al Cluster de Kubernetes con Lens IDE + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Cluster 
+**Etiquetas:** Lens, Kubernetes, IDE, Conexión, kubectl + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan conectarse a su cluster de Kubernetes SleakOps desde su entorno local de desarrollo para monitorear cargas de trabajo, depurar problemas y gestionar recursos. + +**Síntomas Observados:** + +- Necesidad de acceder a recursos del cluster desde la máquina local +- Deseo de usar una interfaz visual en lugar de solo línea de comandos +- Requiere capacidad para ver pods, registros y estado de cargas de trabajo +- Necesidad de solucionar problemas de despliegue + +**Configuración Relevante:** + +- Cluster SleakOps con datos de conexión disponibles +- Máquina local de desarrollo +- Preferencia de IDE Kubernetes (Lens recomendado) +- CLI kubectl como opción alternativa + +**Casos de Uso:** + +- Monitoreo del estado y registros de pods +- Depuración de problemas de despliegue +- Gestión visual de recursos Kubernetes +- Solución de problemas de chequeos de salud de aplicaciones + +## Solución Detallada + + + +Lens es un IDE para Kubernetes que proporciona una interfaz visual para gestionar clusters Kubernetes. SleakOps lo recomienda porque ofrece: + +- **Gestión visual del cluster**: Interfaz gráfica fácil de usar +- **Monitoreo en tiempo real**: Vista en vivo de pods, servicios y despliegues +- **Agregación de registros**: Visualización centralizada de logs de múltiples pods +- **Gestión de recursos**: Crear, editar y eliminar recursos Kubernetes +- **Soporte multi-cluster**: Conexión simultánea a múltiples clusters + +**Alternativas:** + +- CLI kubectl (interfaz de línea de comandos) +- Otros IDEs para Kubernetes (k9s, Octant, etc.) +- Consolas de proveedores cloud + + + + + +Antes de conectarte a tu cluster, asegúrate de tener instaladas estas 4 dependencias en tu máquina local: + +1. **Lens IDE**: Descarga desde [k8slens.dev](https://k8slens.dev) +2. **kubectl**: Herramienta de línea de comandos de Kubernetes +3. 
**AWS CLI**: Para autenticación en AWS (si usas AWS) +4. **Docker**: Runtime de contenedores (opcional pero recomendado) + +### Comandos de Instalación + +**macOS (usando Homebrew):** + +```bash +# Instalar kubectl +brew install kubectl + +# Instalar AWS CLI +brew install awscli + +# Instalar Docker Desktop +brew install --cask docker +``` + +**Linux (Ubuntu/Debian):** + +```bash +# Instalar kubectl +curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" +sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + +# Instalar AWS CLI +curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +unzip awscliv2.zip +sudo ./aws/install +``` + +**Windows:** + +- Descargar Lens desde el sitio oficial +- Instalar kubectl usando chocolatey: `choco install kubernetes-cli` +- Instalar AWS CLI desde la documentación oficial de AWS + + + + + +Para conectarte a tu cluster SleakOps, necesitas autenticarte y obtener las credenciales necesarias: + +1. **Accede al Panel de SleakOps**: Inicia sesión en tu cuenta SleakOps +2. **Navega al Cluster**: Ve a tu cluster específico +3. **Encuentra Datos de Conexión**: Busca "Datos de Conexión" en las opciones del cluster +4. **Obtén Credenciales AWS**: Sigue el enlace proporcionado para obtener: + - AWS Access Key ID + - AWS Secret Access Key + - Token de sesión (si usas credenciales temporales) + +### Configurar Credenciales AWS + +```bash +# Configura AWS CLI con tus credenciales +aws configure +# Ingresa tu Access Key ID, Secret Access Key y región + +# O establece variables de entorno +export AWS_ACCESS_KEY_ID="tu-access-key" +export AWS_SECRET_ACCESS_KEY="tu-secret-key" +export AWS_SESSION_TOKEN="tu-session-token" # si aplica +``` + + + + + +El archivo kubeconfig contiene la información de conexión al cluster: + +1. **Obtén el YAML kubeconfig**: Copia el YAML kubeconfig desde los datos de conexión de SleakOps +2. 
**Guárdalo en un archivo local**: Guarda el contenido como tu archivo kubeconfig
+
+### Método 1: Reemplazo directo de archivo
+
+```bash
+# Haz una copia de seguridad del kubeconfig existente (si la hay)
+cp ~/.kube/config ~/.kube/config.backup
+
+# Crea el directorio .kube si no existe
+mkdir -p ~/.kube
+
+# Guarda el nuevo kubeconfig
+# Pega el contenido YAML de SleakOps en ~/.kube/config
+```
+
+### Método 2: Fusionar con configuración existente
+
+```bash
+# Si tienes múltiples clusters, fusiona las configuraciones
+export KUBECONFIG=~/.kube/config:~/ruta/a/sleakops-config.yaml
+kubectl config view --flatten > ~/.kube/config.new
+mv ~/.kube/config.new ~/.kube/config
+```
+
+### Verificar conexión
+
+```bash
+# Prueba la conexión
+kubectl cluster-info
+kubectl get nodes
+```
+
+
+
+
+Una vez que tienes el kubeconfig configurado:
+
+1. **Abre Lens IDE**
+2. **Agregar Cluster**: Haz clic en "+" o "Agregar Cluster"
+3. **Pegar kubeconfig**: Pega el contenido YAML desde SleakOps
+4. **Conectar**: Lens detectará el cluster y se conectará automáticamente
+
+### Navegando en Lens
+
+- **Workloads → Pods**: Ver todos los pods de tu cluster
+- **Workloads → Deployments**: Gestionar tus despliegues
+- **Network → Services**: Ver y administrar servicios
+- **Storage**: Gestionar volúmenes y claims persistentes
+- **Namespaces**: Tu entorno de proyecto aparecerá como `nombre-proyecto-entorno`
+
+### Encontrando tu Aplicación
+
+Tus pods de aplicación estarán en un namespace con este patrón:
+
+```
+[nombre-proyecto]-[nombre-entorno]
+```
+
+Por ejemplo: `miapp-produccion` o `miapp-preproduccion` (los nombres de namespace siguen el formato DNS-1123: solo minúsculas, dígitos y guiones, sin acentos)
+
+---
+
+_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cors-configuration-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cors-configuration-troubleshooting.mdx
new file mode 100644 index
000000000..e147d6cda --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cors-configuration-troubleshooting.mdx @@ -0,0 +1,211 @@ +--- +sidebar_position: 3 +title: "Configuración y Solución de Problemas de CORS" +description: "Cómo diagnosticar y solucionar errores de CORS en aplicaciones desplegadas en SleakOps" +date: "2024-12-19" +category: "workload" +tags: ["cors", "frontend", "backend", "api", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración y Solución de Problemas de CORS + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** CORS, Frontend, Backend, API, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan errores de CORS (Cross-Origin Resource Sharing) cuando su aplicación frontend intenta comunicarse con su API backend, aunque CORS parece estar configurado correctamente. + +**Síntomas observados:** + +- Mensajes de error de CORS en la consola del navegador al hacer llamadas a la API +- Las solicitudes del frontend a la API backend son bloqueadas +- El error ocurre a pesar de tener la configuración CORS en su lugar +- La aplicación funcionaba previamente pero ahora muestra errores de CORS + +**Configuración relevante:** + +- Frontend y backend desplegados en SleakOps +- Configuración CORS presente en el código backend +- Endpoints de la API accesibles pero bloqueados por la política CORS del navegador + +**Condiciones del error:** + +- El error aparece cuando el frontend realiza solicitudes al backend +- Puede ocurrir en todas las solicitudes o en endpoints específicos +- El navegador bloquea la solicitud antes de que llegue al servidor +- El error persiste a pesar de una aparente configuración CORS correcta + +## Solución Detallada + + + +La infraestructura de SleakOps no interfiere con la configuración CORS. 
CORS se maneja completamente a nivel de aplicación, lo que significa:
+
+- **Nivel de infraestructura:** Los balanceadores de carga y controladores de ingreso de SleakOps pasan los encabezados CORS sin modificaciones
+- **Nivel de aplicación:** Su aplicación backend es responsable de establecer los encabezados CORS adecuados
+- **Aplicación por el navegador:** CORS es aplicado por el navegador, no por la infraestructura del servidor
+
+Esto significa que los problemas de CORS suelen estar relacionados con la configuración de su aplicación, no con la plataforma.
+
+
+
+
+Uno de los problemas más comunes de CORS está relacionado con los encabezados Content-Type:
+
+**Verifique que su configuración CORS permita el Content-Type que está enviando:**
+
+```javascript
+// Ejemplo: configuración CORS en Express.js
+// (suponiendo una app Express ya creada con `const app = express()`)
+const cors = require("cors");
+
+app.use(
+  cors({
+    origin: ["http://localhost:3000", "https://tu-dominio-frontend.com"],
+    methods: ["GET", "POST", "PUT", "DELETE", "OPTIONS"],
+    allowedHeaders: [
+      "Content-Type",
+      "Authorization",
+      "X-Requested-With",
+      "Accept",
+    ],
+    credentials: true,
+  })
+);
+```
+
+**Valores comunes de Content-Type que necesitan permiso explícito:**
+
+- `application/json`
+- `application/x-www-form-urlencoded`
+- `multipart/form-data`
+- `text/plain`
+
+
+
+
+**Verifique si cambios recientes afectaron la configuración CORS:**
+
+1. **Verifique que el middleware CORS esté configurado correctamente:**
+
+```python
+# Ejemplo: configuración Flask-CORS
+from flask import Flask
+from flask_cors import CORS
+
+app = Flask(__name__)
+CORS(app, origins=[
+    "http://localhost:3000",
+    "https://tu-dominio-frontend.com"
+], supports_credentials=True)
+```
+
+2.
**Asegúrese de que los encabezados CORS estén establecidos para todos los endpoints relevantes:** + +```javascript +// Encabezados CORS manuales (si no usa middleware) +app.use((req, res, next) => { + res.header("Access-Control-Allow-Origin", "https://tu-dominio-frontend.com"); + res.header("Access-Control-Allow-Methods", "GET,PUT,POST,DELETE,OPTIONS"); + res.header("Access-Control-Allow-Headers", "Content-Type, Authorization"); + res.header("Access-Control-Allow-Credentials", "true"); + + if (req.method === "OPTIONS") { + res.sendStatus(200); + } else { + next(); + } +}); +``` + +3. **Revise configuraciones específicas por entorno:** + +```yaml +# Ejemplo: variables de entorno para CORS +CORS_ALLOWED_ORIGINS: "https://tu-frontend.sleakops.app,http://localhost:3000" +CORS_ALLOW_CREDENTIALS: "true" +``` + + + + + +**Determine si el error afecta todas las solicitudes o solo algunas específicas:** + +1. **Pruebe diferentes endpoints:** + +```bash +# Prueba con curl para ver la respuesta del servidor +curl -H "Origin: https://tu-frontend.com" \ + -H "Access-Control-Request-Method: POST" \ + -H "Access-Control-Request-Headers: Content-Type" \ + -X OPTIONS \ + https://tu-api.sleakops.app/api/endpoint +``` + +2. **Revise las herramientas de desarrollo del navegador:** + + - Abra la pestaña Red (Network) + - Busque solicitudes OPTIONS de preflight + - Verifique los encabezados de respuesta CORS + - Confirme los encabezados reales que se envían en la solicitud + +3. **Patrones comunes:** + - Las **solicitudes simples** (GET, POST con content-types simples) pueden funcionar + - Las **solicitudes complejas** (encabezados personalizados, content-type JSON) pueden fallar + - Las **solicitudes preflight** (OPTIONS) pueden carecer de respuestas adecuadas + + + + + +**Verifique si las librerías frontend están enviando encabezados inesperados:** + +1. 
**Configuración de Axios:**
+
+```javascript
+// Asegúrese de que axios no agregue encabezados problemáticos
+const api = axios.create({
+  baseURL: "https://tu-api.sleakops.app",
+  withCredentials: true,
+  headers: {
+    "Content-Type": "application/json",
+    // Elimine cualquier encabezado personalizado que pueda disparar preflight
+  },
+});
+```
+
+2. **Configuración de Fetch API:**
+
+```javascript
+// Configuración correcta de fetch
+fetch("https://tu-api.sleakops.app/api/endpoint", {
+  method: "POST",
+  credentials: "include", // Solo si necesita cookies
+  headers: {
+    "Content-Type": "application/json",
+  },
+  body: JSON.stringify(data),
+});
+```
+
+3. **Verifique los encabezados de autenticación:**
+
+```javascript
+// Asegúrese de que el encabezado Authorization esté permitido en CORS
+// (suponiendo un objeto `headers` ya definido para la solicitud)
+const token = localStorage.getItem("token");
+if (token) {
+  headers["Authorization"] = `Bearer ${token}`;
+}
+```
+
+
+
+---
+
+_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._
+
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cost-analysis-and-resource-optimization.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cost-analysis-and-resource-optimization.mdx
new file mode 100644
index 000000000..b263e4101
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cost-analysis-and-resource-optimization.mdx
@@ -0,0 +1,179 @@
+---
+sidebar_position: 3
+title: "Análisis de Costos y Optimización de Recursos en SleakOps"
+description: "Guía para analizar costos y optimizar recursos de RAM para WebServices y bases de datos"
+date: "2024-12-19"
+category: "general"
+tags:
+  [
+    "analisis-de-costos",
+    "aws",
+    "grafana",
+    "optimizacion-de-recursos",
+    "webservice",
+    "base-de-datos",
+  ]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Análisis de Costos y Optimización de Recursos
en SleakOps + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** General +**Etiquetas:** Análisis de Costos, AWS, Grafana, Optimización de Recursos, WebService, Base de Datos + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan analizar costos mensuales y optimizar la asignación de recursos para WebServices y bases de datos en SleakOps, especialmente al considerar ajustes de RAM. + +**Síntomas Observados:** + +- Necesidad de análisis de costos mensual en lugar de desglose diario +- Incertidumbre sobre el impacto de la optimización de recursos de RAM en los costos +- Dificultad para determinar la asignación óptima de recursos para WebServices y bases de datos +- Necesidad de entender las implicaciones de costos antes de hacer cambios en recursos + +**Configuración Relevante:** + +- Plataforma: SleakOps con backend AWS +- Recursos: WebServices y bases de datos que requieren optimización de RAM +- Monitoreo: Dashboards de Grafana disponibles +- Facturación: Consola AWS con desglose detallado de costos + +**Condiciones de Error:** + +- Visibilidad insuficiente de costos para la toma de decisiones +- Posible sobreasignación o subasignación de recursos +- Dificultad para predecir impacto en costos de cambios en recursos + +## Solución Detallada + + + +Para el análisis de costos más detallado, use la Consola AWS: + +1. **Navegue a AWS Cost Explorer o al Panel de Facturación** +2. **Aplique el filtro crucial**: Establezca `Charge Type = Usage` + + - Este filtro es esencial porque los créditos de AWS pueden mostrar costos $0 incluso cuando se usan recursos + - El filtro Usage muestra el consumo real de recursos independientemente de los créditos aplicados + +3. 
**Visualice la tabla de desglose detallado de costos**: + - Agrupe por servicio, recurso o período de tiempo + - Analice tendencias en periodos mensuales + - Exporte datos para análisis adicional si es necesario + +**Importante**: Siempre use el filtro `Charge Type = Usage` para obtener una visibilidad precisa de los costos, especialmente si su cuenta tiene créditos AWS que podrían enmascarar los costos reales de uso. + + + + + +Para determinar si su WebService necesita más RAM: + +1. **Acceda al Dashboard de Grafana**: + + - Navegue a `Kubernetes / Compute Resources / Namespace (Pods)` + +2. **Analice el uso actual de RAM**: + + - Busque los pods de su WebService + - Verifique el porcentaje de utilización de RAM durante diferentes períodos de carga + - Preste atención tanto a patrones de uso en reposo como en picos + +3. **Guías de interpretación**: + - **Uso en reposo alrededor del 20%**: Línea base generalmente aceptable + - **Uso pico por encima del 80%**: Considere aumentar la RAM + - **Uso constante por debajo del 30%**: Potencial para reducción de RAM + +```yaml +# Ejemplo de configuración de recursos +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + + + + + +Al considerar ajustes de RAM: + +**Para aumentos pequeños (p. ej., 300MB por pod)**: + +- Si se ejecutan 3 pods: 300MB × 3 = ~1GB de aumento total +- El impacto en costos del clúster puede ser mínimo o nulo +- Los nodos existentes del clúster podrían tener capacidad de RAM no usada suficiente +- El aumento de costo depende de si se requieren nodos adicionales + +**Proceso de estimación de costos**: + +1. **Verifique la utilización actual de nodos** en Grafana +2. **Calcule el aumento total de RAM** en todos los pods +3. **Verifique si los nodos existentes pueden acomodar** el aumento +4. 
**Estime costos adicionales de nodos** si se requiere escalado + +**Buenas prácticas**: + +- Comience con cambios incrementales pequeños +- Monitoree el impacto antes de hacer ajustes mayores +- Considere la optimización conjunta de WebService y base de datos +- Pruebe los cambios primero en ambiente de staging + + + + + +Al optimizar recursos de base de datos: + +1. **Monitoree métricas de desempeño de la base de datos**: + + - Tiempos de ejecución de consultas + - Utilización del pool de conexiones + - Patrones de uso de memoria + +2. **Escenarios comunes de optimización**: + + - **RAM sobreasignada en la base de datos**: Puede reducirse si se aumenta la RAM del WebService + - **Caché de base de datos**: Más RAM en WebService puede reducir la carga en la base de datos + - **Pool de conexiones**: Optimice según conexiones concurrentes reales + +3. **Enfoque de optimización coordinada**: + - Aumente la RAM del WebService para mejor desempeño de la aplicación + - Monitoree reducción de carga en la base de datos + - Disminuya gradualmente la RAM de la base de datos si las métricas lo respaldan + - Valide el desempeño durante todo el proceso + + + + + +Para mejor visibilidad de costos mensuales: + +1. **AWS Cost Explorer**: + + - Establezca rango de tiempo en períodos mensuales + - Agrupe costos por servicio o etiquetas de recursos + - Cree reportes personalizados para recursos de SleakOps + +2. **Panel de facturación de SleakOps**: + + - Use selectores de rango de fechas para vistas mensuales + - Exporte datos para análisis de tendencias + - Configure alertas de costos para gestión de presupuestos + +3. 
**Rutina regular de monitoreo**: + - Revisiones semanales de utilización de recursos + - Análisis y optimización mensual de costos + - Evaluaciones trimestrales de asignación de recursos + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cronjob-timezone-deployment-behavior.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cronjob-timezone-deployment-behavior.mdx new file mode 100644 index 000000000..ee43fb4d0 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/cronjob-timezone-deployment-behavior.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 3 +title: "Zona horaria y comportamiento de implementación de CronJob" +description: "Comprendiendo la configuración de zona horaria de CronJob y el comportamiento de ejecución durante despliegues" +date: "2024-10-24" +category: "workload" +tags: ["cronjob", "zona horaria", "despliegue", "programación", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Zona horaria y comportamiento de implementación de CronJob + +**Fecha:** 24 de octubre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** CronJob, Zona horaria, Despliegue, Programación, Kubernetes + +## Descripción del problema + +**Contexto:** El usuario necesita configurar CronJobs para que se ejecuten a horas específicas (por ejemplo, 17:30 diariamente) y desea entender el manejo de la zona horaria y el comportamiento de ejecución durante los despliegues. 
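La conversión de hora local a UTC que está en el centro de esta consulta puede esbozarse en unas pocas líneas. El siguiente boceto en Python es puramente ilustrativo (la función `cron_utc_desde_hora_local` no forma parte de SleakOps y usa una fecha de referencia fija):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def cron_utc_desde_hora_local(hora: int, minuto: int, tz: str) -> str:
    """Devuelve la expresión cron en UTC para una hora local diaria.

    Usa una fecha de referencia fija; si la zona tiene horario de verano,
    la expresión debe recalcularse cuando cambie el desfase.
    """
    local = datetime(2024, 10, 24, hora, minuto, tzinfo=ZoneInfo(tz))
    utc = local.astimezone(ZoneInfo("UTC"))
    return f"{utc.minute} {utc.hour} * * *"

# 17:30 de Argentina (UTC-3) equivale a 20:30 UTC
print(cron_utc_desde_hora_local(17, 30, "America/Argentina/Buenos_Aires"))
```

Argentina no aplica actualmente horario de verano (UTC-3 todo el año), por lo que la expresión resultante `30 20 * * *` es estable; en zonas con horario de verano habría que recalcularla en cada cambio.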
+ +**Síntomas observados:** + +- El CronJob parece ejecutarse durante el despliegue/lanzamiento +- Incertidumbre sobre la configuración de la zona horaria del clúster +- Necesidad de una programación precisa en horas locales específicas +- Preocupación por ejecuciones no deseadas + +**Configuración relevante:** + +- Expresión cron: `30 17 * * *` (diariamente a las 17:30) +- Zona horaria objetivo: Argentina (UTC-3) +- Zona horaria del clúster: UTC +- Desencadenantes de despliegue: lanzamientos manuales + +**Condiciones de error:** + +- El CronJob se ejecuta durante el despliegue en lugar de en el horario programado +- Confusión de zona horaria entre UTC y hora local +- Múltiples ejecuciones cuando solo se desea una diaria + +## Solución detallada + + + +Los clústeres de Kubernetes en SleakOps están configurados con **zona horaria UTC** por defecto. Esto significa: + +- Todos los horarios de CronJob se interpretan en tiempo UTC +- Su expresión cron `30 17 * * *` se ejecuta a las 17:30 UTC +- Para ejecutar a las 17:30 hora de Argentina (UTC-3), necesita `30 20 * * *` + +**Ejemplo de conversión de hora:** + +- Hora Argentina: 17:30 (UTC-3) +- Equivalente UTC: 20:30 +- Expresión cron correcta: `30 20 * * *` + + + + + +Los CronJobs **NO** deberían ejecutarse durante los despliegues. Si esto sucede, indica un problema de configuración: + +**Comportamiento normal:** + +- Los CronJobs solo se ejecutan según su programación +- Los despliegues actualizan la definición del CronJob pero no desencadenan ejecuciones +- Los trabajos programados existentes continúan ejecutándose + +**Si el CronJob se ejecuta durante el despliegue, revise:** + +1. **RestartPolicy**: Debe ser `OnFailure` o `Never` +2. **Historial de trabajos**: Verifique `successfulJobsHistoryLimit` y `failedJobsHistoryLimit` +3. 
**Concurrencia**: Configure `concurrencyPolicy: Forbid` para evitar solapamientos + + + + + +Para un trabajo diario a las 17:30 hora de Argentina: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: daily-job +spec: + schedule: "30 20 * * *" # 20:30 UTC = 17:30 Argentina + concurrencyPolicy: Forbid + successfulJobsHistoryLimit: 3 + failedJobsHistoryLimit: 1 + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: job-container + image: your-image + command: ["your-command"] +``` + +**Configuraciones clave:** + +- `concurrencyPolicy: Forbid`: Previene múltiples instancias +- `successfulJobsHistoryLimit: 3`: Conserva los últimos 3 trabajos exitosos +- `restartPolicy: OnFailure`: Reinicia solo en caso de fallo + + + + + +Si necesita un control más preciso de la zona horaria: + +**Opción 1: Soporte de zona horaria en Kubernetes 1.25+** + +```yaml +spec: + schedule: "30 17 * * *" + timeZone: "America/Argentina/Buenos_Aires" +``` + +**Opción 2: Manejo de zona horaria a nivel de aplicación** + +```bash +# En su contenedor +export TZ=America/Argentina/Buenos_Aires +date # Mostrará la hora de Argentina +``` + +**Opción 3: Cálculo manual en UTC** + +- Siempre calcule el equivalente UTC de la hora local deseada +- Considere los cambios por horario de verano +- Actualice las expresiones cron cuando cambie el horario de verano + + + + + +**Verifique el estado del CronJob:** + +```bash +kubectl get cronjobs +kubectl describe cronjob your-cronjob-name +``` + +**Revise el historial de trabajos:** + +```bash +kubectl get jobs +kubectl logs job/your-job-name +``` + +**Problemas comunes:** + +1. **Cálculo incorrecto de zona horaria**: Verifique la conversión a UTC +2. **El despliegue desencadena ejecución**: Revise la configuración del trabajo +3. **Múltiples ejecuciones**: Configure `concurrencyPolicy: Forbid` +4. 
**Trabajos que no se ejecutan**: Verifique la sintaxis cron y la hora del clúster + +**Verifique la hora del clúster:** + +```bash +kubectl run temp-pod --image=busybox --rm -it -- date +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 24 de octubre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-credentials-access.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-credentials-access.mdx new file mode 100644 index 000000000..193d23233 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-credentials-access.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Acceso a Credenciales de Base de Datos en SleakOps" +description: "Cómo acceder a credenciales de base de datos almacenadas como secretos de Kubernetes" +date: "2024-12-19" +category: "dependency" +tags: ["base de datos", "credenciales", "secretos", "postgres", "vargroup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Acceso a Credenciales de Base de Datos en SleakOps + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Credenciales, Secretos, Postgres, VarGroup + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan acceder a las credenciales de la base de datos (específicamente la contraseña del usuario dbMaster) para tareas administrativas como deshabilitar triggers y restauración de datos, pero la contraseña no es visible directamente en la interfaz de SleakOps. 
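Como referencia general (no específico de SleakOps): los Secrets de Kubernetes guardan sus valores codificados en base64 (codificación, no cifrado), así que cualquier valor leído del clúster debe decodificarse antes de usarse. Un boceto mínimo con un valor puramente ilustrativo:

```python
import base64

# Valor de ejemplo tal como aparecería en el campo `data` de un Secret
# (puramente ilustrativo; no es una credencial real)
password_b64 = base64.b64encode(b"s3cr3t-pass").decode("ascii")

# Equivalente en Python de `base64 --decode` en la terminal
password = base64.b64decode(password_b64).decode("utf-8")
print(password)
```

Esto es lo mismo que hace `kubectl get secret ... -o jsonpath='{.data.password}' | base64 --decode` desde la terminal.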
+ +**Síntomas Observados:** + +- El nombre de usuario de la base de datos es visible pero la contraseña no se muestra +- Se requieren credenciales para operaciones administrativas en la base de datos +- La contraseña no está disponible en la interfaz de base de datos de SleakOps +- Necesario para tareas como gestión de triggers y restauración de datos + +**Configuración Relevante:** + +- Tipo de base de datos: PostgreSQL +- Usuario: dbMaster +- Almacenamiento: Secreto de Kubernetes en el clúster +- Patrón de nombre del secreto: `{dbName}-postgres` +- Ubicación: Namespace del proyecto + +**Condiciones de Error:** + +- No se pueden realizar tareas administrativas en la base de datos sin las credenciales +- La contraseña no se muestra en la interfaz estándar de SleakOps +- Se necesita acceso en múltiples entornos (desarrollo, staging, producción) + +## Solución Detallada + + + +SleakOps no almacena las contraseñas de base de datos en su propia base de datos por razones de seguridad. En cambio, las credenciales se almacenan como Secretos de Kubernetes dentro de tu clúster: + +- **Ubicación de almacenamiento**: Secreto de Kubernetes en el namespace del proyecto +- **Formato del nombre del secreto**: `{dbName}-postgres` +- **Seguridad**: Sigue las mejores prácticas de gestión de secretos de Kubernetes +- **Acceso**: Disponible a través de la interfaz de edición de VarGroup + + + + + +Para ver las credenciales de base de datos en SleakOps: + +1. **Navega a tu proyecto** +2. **Ve a la sección VarGroup** +3. **Haz clic en "Editar" en el VarGroup correspondiente** +4. 
**Visualiza las credenciales**: SleakOps obtiene el secreto del clúster y lo muestra

```yaml
# La estructura del secreto típicamente es:
apiVersion: v1
kind: Secret
metadata:
  name: {dbName}-postgres
  namespace: {project-namespace}
data:
  username: # valor codificado en base64
  password: # valor codificado en base64
```

Si tienes acceso al clúster, puedes obtener las credenciales directamente:

```bash
# Listar secretos en el namespace del proyecto
kubectl get secrets -n {project-namespace}

# Obtener el secreto específico de la base de datos
kubectl get secret {dbName}-postgres -n {project-namespace} -o yaml

# Decodificar la contraseña directamente
kubectl get secret {dbName}-postgres -n {project-namespace} -o jsonpath='{.data.password}' | base64 --decode
```

**Nota**: Este método requiere acceso kubectl a tu clúster.

Para acceder a las credenciales en diferentes entornos:

1. **Entorno de Desarrollo**:

   - Navega al proyecto de desarrollo
   - Sigue el proceso de edición de VarGroup

2. **Staging/Producción**:
   - Repite el mismo proceso para cada entorno
   - Cada entorno tiene su propio namespace y secretos
   - Los nombres de los secretos siguen el mismo patrón: `{dbName}-postgres`

```bash
# Ejemplo para diferentes entornos
# Desarrollo
kubectl get secret myapp-postgres -n myproject-dev

# Staging
kubectl get secret myapp-postgres -n myproject-staging

# Producción
kubectl get secret myapp-postgres -n myproject-prod
```

Una vez que tienes las credenciales, puedes usarlas para:

**Conexión a la Base de Datos**:

```bash
# Conectarse a PostgreSQL
psql -h {database-host} -U dbMaster -d {database-name}
```

**Tareas Administrativas Comunes** (SQL):

```sql
-- Deshabilitar triggers
ALTER TABLE {table_name} DISABLE TRIGGER ALL;

-- Habilitar triggers
ALTER TABLE {table_name} ENABLE TRIGGER ALL;
```

**Respaldo y Restauración** (desde la terminal, no desde SQL):

```bash
# Crear respaldo de base de datos
pg_dump -h {host} -U dbMaster {database_name} > backup.sql

# Restaurar base de datos
psql -h {host}
-U dbMaster {database_name} < backup.sql +``` + + + + + +**Notas Importantes de Seguridad**: + +- Las credenciales se almacenan de forma segura como Secretos de Kubernetes +- El acceso está controlado a través de permisos del proyecto en SleakOps +- Nunca almacenar credenciales en archivos de texto plano +- Usar credenciales específicas para cada entorno +- Rotar contraseñas regularmente + +**Control de Acceso**: + +- Solo usuarios con acceso al proyecto pueden ver las credenciales +- La edición de VarGroup requiere permisos adecuados +- El acceso al clúster está restringido a usuarios autorizados + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-maintenance-window-strategies.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-maintenance-window-strategies.mdx new file mode 100644 index 000000000..2bb364aec --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-maintenance-window-strategies.mdx @@ -0,0 +1,387 @@ +--- +sidebar_position: 15 +title: "Estrategias para la Ventana de Mantenimiento de Bases de Datos" +description: "Soluciones para realizar migraciones de bases de datos sin bloquear operaciones" +date: "2024-01-15" +category: "dependency" +tags: + [ + "base de datos", + "migraciones", + "mantenimiento", + "mysql", + "postgresql", + "réplicas de lectura", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Estrategias para la Ventana de Mantenimiento de Bases de Datos + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Migraciones, Mantenimiento, MySQL, PostgreSQL, Réplicas de lectura + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan realizar migraciones de bases de datos (como operaciones ALTER TABLE) en bases de datos de 
producción que tienen un alto volumen de conexiones provenientes de múltiples fuentes, incluyendo servicios web, cronjobs y funciones serverless. + +**Síntomas Observados:** + +- Las migraciones de bases de datos se bloquean debido a bloqueos de tablas +- Las operaciones ALTER no pueden completarse debido a conexiones concurrentes +- Alto número de conexiones activas desde varios servicios +- Tiempo de espera o fallos en migraciones durante picos de uso + +**Configuración Relevante:** + +- Base de datos: Base de datos de producción con alto volumen de conexiones +- Tipo de migración: Operaciones ALTER TABLE +- Fuentes de conexión: Servicios web, cronjobs, funciones Lambda +- Tamaño de tabla: Tablas pequeñas (2 registros) pero con alta frecuencia de consultas + +**Condiciones de Error:** + +- Las migraciones fallan debido a conflictos de bloqueo de tablas +- Las operaciones ALTER esperan indefinidamente por bloqueos de tabla +- Degradación del rendimiento de la base de datos durante intentos de migración + +## Solución Detallada + + + +Dado que SleakOps actualmente no proporciona una función integrada de ventana de mantenimiento, puedes implementarlo a nivel de aplicación: + +**Opción 1: Enfoque con Feature Flag** + +```javascript +// Variable de entorno o configuración +const MAINTENANCE_MODE = process.env.MAINTENANCE_MODE === "true"; + +// En las rutas de tu aplicación +app.use((req, res, next) => { + if (MAINTENANCE_MODE && req.path !== "/health") { + return res.status(503).json({ + message: "Sistema en mantenimiento. Por favor, inténtelo más tarde.", + retryAfter: 300, // segundos + }); + } + next(); +}); +``` + +**Opción 2: Control del Pool de Conexiones a la Base de Datos** + +```javascript +// Reducir temporalmente el tamaño del pool de conexiones +const pool = mysql.createPool({ + host: "localhost", + user: "user", + password: "password", + database: "mydb", + connectionLimit: MAINTENANCE_MODE ? 
1 : 10, +}); +``` + + + + + +Implementar una arquitectura maestro-esclavo puede reducir significativamente la carga en tu base de datos primaria: + +**Configuración de la Base de Datos:** + +```yaml +# En la configuración de base de datos de SleakOps +database: + type: mysql # o postgresql + master: + instance_class: db.t3.medium + allocated_storage: 100 + read_replicas: + - instance_class: db.t3.small + allocated_storage: 100 + region: same # o diferente para distribución geográfica +``` + +**Cambios en el Código de la Aplicación:** + +```javascript +// Pools de conexión separados +const masterPool = mysql.createPool({ + host: process.env.DB_MASTER_HOST, + // ... configuración del maestro +}); + +const replicaPool = mysql.createPool({ + host: process.env.DB_REPLICA_HOST, + // ... configuración de la réplica +}); + +// Usar réplica para operaciones de lectura +function getCountries() { + return replicaPool.query("SELECT * FROM countries"); +} + +// Usar maestro para operaciones de escritura +function updateCountry(id, data) { + return masterPool.query("UPDATE countries SET ? 
WHERE id = ?", [data, id]);
}
```

**Consideraciones Importantes:**

- Las réplicas de lectura tienen consistencia eventual (retraso de algunos segundos)
- Las lecturas críticas que requieran consistencia inmediata deben usar el maestro
- Enviar todas las escrituras a la base de datos maestra

Para conjuntos de datos pequeños y que cambian raramente, como la información de países, el almacenamiento en caché a nivel de aplicación es muy efectivo:

**Caché en Memoria:**

```javascript
class CountryCache {
  constructor() {
    this.cache = new Map();
    this.lastUpdate = null;
    this.TTL = 60 * 60 * 1000; // 1 hora
  }

  async getCountries() {
    const now = Date.now();

    if (!this.lastUpdate || now - this.lastUpdate > this.TTL) {
      const countries = await this.fetchFromDatabase();
      this.cache.set("countries", countries);
      this.lastUpdate = now;
      return countries;
    }

    return this.cache.get("countries");
  }

  async fetchFromDatabase() {
    // Usar réplica para esta lectura
    return replicaPool.query("SELECT * FROM countries");
  }
}

const countryCache = new CountryCache();
```

**Caché con Redis (Alternativa):**

```javascript
const redis = require("redis");

// node-redis v4: la conexión debe abrirse una vez al iniciar la aplicación
const client = redis.createClient({ url: process.env.REDIS_URL });
client.connect();

async function getCachedCountries() {
  const cached = await client.get("countries");

  if (cached) {
    return JSON.parse(cached);
  }

  const countries = await replicaPool.query("SELECT * FROM countries");
  await client.setEx("countries", 3600, JSON.stringify(countries)); // TTL de 1 hora

  return countries;
}
```

**Estrategia 1: Ejecución en Horas de Baja Actividad**

```bash
# Programar migraciones durante períodos de bajo tráfico
# Usar cron o tu pipeline de CI/CD
0 2 * * * /ruta/al/script-de-migracion.sh
```

**Estrategia 2: Cambios de Esquema en Línea (MySQL)**

```bash
# Usar pt-online-schema-change para tablas grandes
pt-online-schema-change \
  --alter "ADD COLUMN
new_field VARCHAR(255)" \
  --execute \
  D=database_name,t=table_name
```

**Estrategia 3: Despliegue Blue-Green de Base de Datos**

```bash
# 1. Crear nueva instancia de base de datos
# 2. Aplicar migraciones a la nueva instancia
# 3. Configurar replicación de la antigua a la nueva
# 4. Cambiar la aplicación a la nueva base de datos
# 5. Verificar y limpiar la instancia antigua
```

**Estrategia 4: Limitación de Conexiones**

```javascript
// Limitar temporalmente las conexiones antes de la migración
// Nota: se asume una implementación de semáforo, p. ej. la clase
// Semaphore de una librería como async-mutex
const connectionSemaphore = new Semaphore(2); // Permitir solo 2 conexiones concurrentes

async function executeQuery(query) {
  await connectionSemaphore.acquire();
  try {
    return await pool.query(query);
  } finally {
    connectionSemaphore.release();
  }
}
```

**Métricas Clave para Monitorear:**

```javascript
// Monitoreo de conexiones a la base de datos
// Nota: _allConnections, _freeConnections y _connectionQueue son
// propiedades internas del driver mysql y pueden cambiar entre versiones
const dbMetrics = {
  activeConnections: () => pool.pool._allConnections.length,
  idleConnections: () => pool.pool._freeConnections.length,
  queuedRequests: () => pool.pool._connectionQueue.length,
};

// Función para monitorear durante migraciones
function monitorDatabaseDuringMigration() {
  const interval = setInterval(() => {
    console.log(`Conexiones activas: ${dbMetrics.activeConnections()}`);
    console.log(`Conexiones inactivas: ${dbMetrics.idleConnections()}`);
    console.log(`Solicitudes en cola: ${dbMetrics.queuedRequests()}`);
  }, 5000);

  return interval;
}
```

**Configuración de Alertas:**

```javascript
// Configurar alertas para métricas críticas
function setupDatabaseAlerts() {
  setInterval(() => {
    const activeConnections = dbMetrics.activeConnections();

    if (activeConnections > 80) {
      console.warn(`ALERTA: Conexiones altas detectadas: ${activeConnections}`);
      // Enviar notificación al equipo
    }
  }, 30000);
}
```

**Herramientas de Monitoreo Recomendadas:**

- **AWS CloudWatch** (para RDS)
- **Prometheus + Grafana** (para
métricas personalizadas)
- **New Relic** o **DataDog** (para monitoreo integral)

**Plan de Rollback Automático:**

```bash
#!/bin/bash
# rollback-migration.sh
# Uso: ejecutar la migración y luego este script con su código de salida,
# p. ej.: ./script-de-migracion.sh; ./rollback-migration.sh $?

MIGRATION_EXIT_CODE=${1:-1}

# Verificar si la migración falló
if [ "$MIGRATION_EXIT_CODE" -ne 0 ]; then
  echo "Migración falló, iniciando rollback..."

  # Restaurar desde backup
  mysql -u $DB_USER -p$DB_PASS $DB_NAME < backup_pre_migration.sql

  # Reiniciar servicios
  kubectl rollout undo deployment/app-deployment

  # Notificar al equipo
  curl -X POST $SLACK_WEBHOOK -d '{"text":"Rollback de migración ejecutado"}'
fi
```

**Verificaciones Post-Migración:**

```javascript
// Script de verificación después de migración
async function verifyMigration() {
  try {
    // Verificar integridad de datos
    const countCheck = await pool.query('SELECT COUNT(*) as count FROM countries');
    console.log(`Registros en tabla countries: ${countCheck[0].count}`);

    // Verificar funcionamiento de la aplicación
    // Nota: usar la URL completa del servicio (p. ej. http://localhost:3000/health)
    const healthCheck = await fetch('/health');
    if (!healthCheck.ok) {
      throw new Error('Health check falló');
    }

    console.log('Migración verificada exitosamente');
  } catch (error) {
    console.error('Verificación de migración falló:', error);
    // Activar rollback automático
    process.exit(1);
  }
}
```

**1. Preparación Previa:**

- Siempre crear backups completos antes de migraciones
- Probar migraciones en entornos de staging idénticos
- Documentar todos los pasos y procedimientos de rollback
- Coordinar con el equipo sobre ventanas de mantenimiento

**2. Durante la Migración:**

- Monitorear métricas de rendimiento en tiempo real
- Mantener comunicación constante con el equipo
- Tener procedimientos de rollback listos para ejecutar
- Documentar cualquier problema o comportamiento inesperado

**3.
Después de la Migración:**

- Verificar integridad de datos y funcionamiento de la aplicación
- Monitorear rendimiento durante las siguientes 24 horas
- Mantener backups por al menos 7 días
- Documentar lecciones aprendidas para futuras migraciones

**4. Automatización Recomendada:**

```yaml
# Ejemplo de pipeline de migración automatizada
migration_pipeline:
  steps:
    - backup_database
    - run_migration
    - verify_migration
    - notify_team
  rollback_on_failure: true
  monitoring_enabled: true
```

---

_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-data-integrity-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-data-integrity-issues.mdx
new file mode 100644
index 000000000..98030a11a
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-data-integrity-issues.mdx
@@ -0,0 +1,266 @@
---
sidebar_position: 3
title: "Problemas de Integridad de Datos en la Migración de Bases de Datos"
description: "Solución de problemas por transferencia incompleta de datos durante migraciones de bases de datos"
date: "2024-12-19"
category: "dependency"
tags:
  ["base de datos", "migración", "integridad de datos", "solución de problemas"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Problemas de Integridad de Datos en la Migración de Bases de Datos

**Fecha:** 19 de diciembre de 2024
**Categoría:** Dependencia
**Etiquetas:** Base de datos, Migración, Integridad de datos, Solución de problemas

## Descripción del Problema

**Contexto:** Durante los procesos de migración de bases de datos en producción en SleakOps, los usuarios pueden experimentar problemas donde la migración finaliza sin errores pero la integridad de
los datos se ve comprometida.

**Síntomas observados:**

- El proceso de migración finaliza exitosamente sin mensajes de error
- Falta de datos recientes en la base de datos destino
- Algunas tablas aparecen incompletas después de la migración
- Inconsistencias de datos entre las bases de datos origen y destino

**Configuración relevante:**

- Tipo de migración: Migración de base de datos en producción
- Entorno: Transferencia de producción a producción
- Herramienta de migración: Utilidades de migración de base de datos de SleakOps
- Tipo de base de datos: No especificado (PostgreSQL/MySQL/etc.)

**Condiciones de error:**

- La migración parece exitosa pero falla la validación de datos
- No se transfieren los registros más recientes
- Ocurren transferencias incompletas de tablas de forma esporádica
- No hay mensajes de error explícitos durante el proceso de migración

## Solución Detallada

Antes de iniciar cualquier migración de base de datos, realice estos pasos de validación:

1. **Verificación del conteo de registros**:

   ```sql
   -- Verificar el total de registros en la base de datos origen
   SELECT table_name,
          (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
   FROM (
     SELECT table_name,
            query_to_xml(format('select count(*) as cnt from %I.%I',
                                table_schema, table_name), false, true, '') as xml_count
     FROM information_schema.tables
     WHERE table_schema = 'public'
   ) t;
   ```

2. **Identificar las marcas de tiempo más recientes**:

   ```sql
   -- Encontrar el registro más reciente de una tabla concreta
   SELECT 'your_table_name' AS table_name, MAX(created_at) AS latest_record
   FROM your_table_name;
   ```

3. **Crear lista de verificación para la migración**:
   - Documentar los conteos actuales de registros
   - Anotar las marcas de tiempo más recientes
   - Identificar tablas críticas
   - Planificar consultas de validación

Cuando las migraciones finalizan pero faltan datos:

1.
**Revisar los registros de migración**: + + ```bash + # Revisar los logs de migración de SleakOps + kubectl logs -n sleakops-system deployment/migration-controller + + # Buscar patrones específicos + grep -i "error\|warning\|timeout" migration.log + ``` + +2. **Verificar tiempos de espera de conexión**: + + - La conexión a la base de datos puede agotar el tiempo durante transferencias grandes + - Comprobar la estabilidad de la red entre origen y destino + - Revisar la configuración del pool de conexiones de la base de datos + +3. **Problemas con aislamiento de transacciones**: + ```sql + -- Verificar transacciones de larga duración + SELECT pid, now() - pg_stat_activity.query_start AS duration, query + FROM pg_stat_activity + WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'; + ``` + + + + + +Después de la migración, realice estos pasos de validación: + +1. **Comparación del conteo de registros**: + + ```bash + # Script para comparar conteos de registros + #!/bin/bash + + SOURCE_DB="source_connection_string" + TARGET_DB="target_connection_string" + + for table in $(psql $SOURCE_DB -t -c "SELECT tablename FROM pg_tables WHERE schemaname='public';"); do + source_count=$(psql $SOURCE_DB -t -c "SELECT COUNT(*) FROM $table;") + target_count=$(psql $TARGET_DB -t -c "SELECT COUNT(*) FROM $table;") + + if [ "$source_count" != "$target_count" ]; then + echo "DESCUADRE: $table - Origen: $source_count, Destino: $target_count" + fi + done + ``` + +2. **Verificación de frescura de datos**: + + ```sql + -- Comprobar si se migraron datos recientes + SELECT table_name, + MAX(created_at) as latest_migrated, + NOW() - MAX(created_at) as data_age + FROM ( + -- Unión de todas las tablas con columnas de marca de tiempo + SELECT 'users' as table_name, created_at FROM users + UNION ALL + SELECT 'orders' as table_name, created_at FROM orders + -- Agregar otras tablas según sea necesario + ) combined + GROUP BY table_name; + ``` + +3. 
**Verificación de integridad referencial**:

   ```sql
   -- Listar las claves foráneas para verificarlas tras la migración
   SELECT conname,
          conrelid::regclass AS tabla,
          confrelid::regclass AS tabla_referenciada
   FROM pg_constraint
   WHERE contype = 'f';

   -- Revalidar una clave foránea concreta si hay dudas:
   -- ALTER TABLE tabla VALIDATE CONSTRAINT nombre_constraint;
   ```

Si se detectan problemas de integridad de datos:

1. **Sincronización incremental de datos**:

   ```sql
   -- Sincronizar registros recientes faltantes
   INSERT INTO target_table
   SELECT * FROM source_table
   WHERE created_at > (SELECT MAX(created_at) FROM target_table)
   ON CONFLICT (id) DO UPDATE SET
     column1 = EXCLUDED.column1,
     updated_at = EXCLUDED.updated_at;
   ```

2. **Re-migración específica por tabla**:

   ```bash
   # Re-migrar tablas específicas
   pg_dump source_db -t table_name | psql target_db
   ```

3. **Configuración de recuperación punto en el tiempo**:

   - Habilitar archivado WAL antes de futuras migraciones
   - Crear snapshots de la base de datos antes de la migración
   - Implementar verificación automatizada de respaldos

4. **Procedimiento de reversión de migración**:

   ```bash
   # Restaurar desde respaldo previo a la migración
   pg_restore -d target_database backup_file.dump

   # Verificar la restauración
   psql target_database -c "SELECT COUNT(*) FROM critical_table;"
   ```

Para prevenir futuros problemas de integridad de datos en migraciones:

1. **Implementar pruebas de migración**:

   - Siempre probar migraciones primero en entorno de staging
   - Usar snapshots de datos de producción para pruebas
   - Validar integridad de datos en el entorno de pruebas

2.
**Configurar monitoreo**: + ```yaml + # Configuración de monitoreo de SleakOps + apiVersion: v1 + kind: ConfigMap + metadata: + name: migration-monitoring + data: + config.yaml: | + checks: + - name: validacion_conteo_registros + query: "SELECT COUNT(*) FROM critical_table" + expected_min: 1000 + - name: frescura_de_datos + query: "SELECT MAX(created_at) FROM critical_table" + max_age_hours: 24 + ``` + +3. **Automatizar validación post-migración**: + + ```bash + #!/bin/bash + # Script de validación automática post-migración + + echo "Iniciando validación post-migración..." + + # Verificar conteos de registros + ./validate_record_counts.sh + + # Verificar integridad referencial + psql $TARGET_DB -f validate_foreign_keys.sql + + # Verificar frescura de datos + ./check_data_freshness.sh + + echo "Validación completada." + ``` + +4. **Documentar procedimientos de migración**: + - Crear runbooks detallados para cada tipo de migración + - Documentar puntos de verificación críticos + - Mantener scripts de validación actualizados + - Establecer criterios claros de éxito/fallo + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-environment-variables.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-environment-variables.mdx new file mode 100644 index 000000000..ee0f2b30b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-environment-variables.mdx @@ -0,0 +1,930 @@ +--- +sidebar_position: 15 +title: "Configuración de Variables de Entorno para Migración de Base de Datos" +description: "Solución para problemas de migración de base de datos con variables de entorno en SleakOps" +date: "2024-12-11" +category: "proyecto" +tags: + ["base-de-datos", "migracion", "variables-de-entorno", "dotnet", 
"postgres"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Variables de Entorno para Migración de Base de Datos + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Base de datos, Migración, Variables de Entorno, .NET, PostgreSQL + +## Descripción del Problema + +**Contexto:** El usuario encuentra múltiples problemas al ejecutar migraciones de base de datos con .NET Entity Framework en SleakOps, específicamente con conexiones a bases de datos PostgreSQL y configuraciones de variables de entorno. + +**Síntomas Observados:** + +- El comando de migración de base de datos falla debido a una ruta de proyecto incorrecta +- Errores de conexión a la base de datos por falta de autenticación con contraseña +- Los hooks de migración no encuentran variables de entorno necesarias como `CORS_ALLOWED_ORIGINS` +- Variables de entorno no accesibles para los procesos de migración + +**Configuración Relevante:** + +- Framework: .NET Entity Framework +- Base de datos: PostgreSQL +- Entorno: plataforma SleakOps +- Grupos de Variables: Variables de entorno del proyecto Backend +- Comando de migración: `dotnet ef database update` + +**Condiciones de Error:** + +- Ruta de proyecto incorrecta en el comando de migración +- Nombres de variables de entorno desajustados (`POSTGRES_PASS` vs `POSTGRES_PASSWORD`) +- Grupos de Variables asignados a servicios específicos en lugar de alcance global +- Hooks de migración incapaces de acceder a las variables de entorno requeridas + +## Solución Detallada + + + +La sintaxis correcta para ejecutar migraciones de Entity Framework es: + +```bash +dotnet ef database update --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj +``` + +**Errores comunes:** + +- Errores tipográficos en la ruta del proyecto +- Referencias relativas faltantes o incorrectas +- Extensión o nombre del archivo de proyecto incorrectos + +**Pasos para verificar:** + +1. 
**Verificar que el archivo del proyecto existe en la ruta especificada** +2. **Comprobar que la ruta es relativa a su directorio de trabajo actual** +3. **Asegurarse de que el nombre del archivo `.csproj` coincida exactamente** + +### Comandos de verificación + +```bash +# Verificar estructura del proyecto +ls -la ../Netdo.Firev.WebApi/ +find . -name "*.csproj" -type f + +# Verificar desde el directorio correcto +pwd +dotnet ef --help + +# Comando completo con verbosidad para debugging +dotnet ef database update --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --verbose +``` + +### Comandos alternativos + +```bash +# Si está en el directorio raíz del proyecto +dotnet ef database update --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj + +# Especificar directorio de trabajo +dotnet ef database update --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --startup-project ./ + +# Con context específico +dotnet ef database update --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --context ApplicationDbContext +``` + + + + + +El nombre de las variables de entorno debe coincidir exactamente entre la configuración de su aplicación y los Grupos de Variables: + +**Problema:** La aplicación espera `POSTGRES_PASSWORD` pero el Grupo de Variables contiene `POSTGRES_PASS` + +**Solución:** + +1. **Revise los archivos de configuración de su aplicación** (appsettings.json, Program.cs, o Startup.cs) +2. **Identifique los nombres exactos de variables esperados** +3. 
**Actualice el Grupo de Variables para que coincida** + +### Verificación de variables en .NET + +```csharp +// En appsettings.json +{ + "ConnectionStrings": { + "DefaultConnection": "Host={POSTGRES_HOST};Database={POSTGRES_DB};Username={POSTGRES_USER};Password={POSTGRES_PASSWORD};Port={POSTGRES_PORT}" + } +} + +// En Program.cs o Startup.cs +var connectionString = Configuration.GetConnectionString("DefaultConnection"); + +// Verificar variables específicas +var postgresHost = Environment.GetEnvironmentVariable("POSTGRES_HOST"); +var postgresPassword = Environment.GetEnvironmentVariable("POSTGRES_PASSWORD"); +var corsOrigins = Environment.GetEnvironmentVariable("CORS_ALLOWED_ORIGINS"); +``` + +### Mapeo de variables comunes + +| Variable Esperada | Variable Incorrecta | Uso Típico | +| ---------------------- | ------------------- | ------------------ | +| `POSTGRES_PASSWORD` | `POSTGRES_PASS` | Contraseña DB | +| `POSTGRES_HOST` | `DB_HOST` | Host de DB | +| `POSTGRES_PORT` | `DB_PORT` | Puerto de DB | +| `POSTGRES_USER` | `DB_USER` | Usuario de DB | +| `POSTGRES_DB` | `DATABASE_NAME` | Nombre de DB | +| `CORS_ALLOWED_ORIGINS` | `CORS_ORIGINS` | Configuración CORS | + +### Script de verificación de variables + +```bash +#!/bin/bash +# verify-env-variables.sh + +echo "=== Verificación de Variables de Entorno ===" + +# Variables requeridas para PostgreSQL +required_vars=( + "POSTGRES_HOST" + "POSTGRES_PORT" + "POSTGRES_DB" + "POSTGRES_USER" + "POSTGRES_PASSWORD" + "CORS_ALLOWED_ORIGINS" +) + +# Verificar cada variable +for var in "${required_vars[@]}"; do + if [ -z "${!var}" ]; then + echo "❌ $var: NO DEFINIDA" + else + echo "✅ $var: ${!var:0:10}..." 
# Mostrar solo primeros 10 caracteres + fi +done + +# Test de conexión a PostgreSQL +echo "" +echo "=== Test de Conexión PostgreSQL ===" +psql "postgresql://$POSTGRES_USER:$POSTGRES_PASSWORD@$POSTGRES_HOST:$POSTGRES_PORT/$POSTGRES_DB" -c "SELECT version();" +``` + + + + + +Los Grupos de Variables deben estar disponibles para **todos los servicios** que necesiten ejecutar migraciones, no solo para servicios específicos. + +### Problema: Alcance limitado + +Si el Grupo de Variables está asignado solo al servicio "Backend", los hooks de migración y otros procesos no tendrán acceso. + +### Solución: Configurar alcance global + +1. **Navegar a Grupos de Variables** en el panel de SleakOps +2. **Seleccionar el grupo que contiene las variables de base de datos** +3. **Configurar el alcance como "Global" o "Todo el proyecto"** +4. **Verificar que incluye todos los entornos necesarios** + +### Estructura recomendada de Grupos de Variables + +```yaml +# Grupo: Database Configuration (Alcance: Global) +POSTGRES_HOST: "postgres-server.internal" +POSTGRES_PORT: "5432" +POSTGRES_DB: "netdo_firev_db" +POSTGRES_USER: "app_user" +POSTGRES_PASSWORD: "secure_password_123" + +# Grupo: Application Configuration (Alcance: Global) +CORS_ALLOWED_ORIGINS: "https://frontend.com,https://admin.frontend.com" +JWT_SECRET: "super_secret_jwt_key" +ASPNETCORE_ENVIRONMENT: "Production" + +# Grupo: Service-Specific (Alcance: Servicio específico) +SERVICE_NAME: "backend-api" +LOG_LEVEL: "Information" +``` + +### Verificar acceso a variables desde hooks + +```bash +# En un hook de migración +#!/bin/bash +echo "=== Variables disponibles en hook ===" +env | grep -E "(POSTGRES|CORS)" | sort + +# Test específico de variables críticas +if [ -z "$POSTGRES_PASSWORD" ]; then + echo "❌ POSTGRES_PASSWORD no disponible en hook" + exit 1 +fi + +if [ -z "$CORS_ALLOWED_ORIGINS" ]; then + echo "❌ CORS_ALLOWED_ORIGINS no disponible en hook" + exit 1 +fi + +echo "✅ Todas las variables críticas están disponibles" +``` 
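Como complemento a las verificaciones en bash anteriores, el error típico de nombres desajustados (`POSTGRES_PASS` frente a `POSTGRES_PASSWORD`) puede detectarse automáticamente. El siguiente boceto en Node.js es solo ilustrativo (no forma parte de SleakOps); la lista de variables requeridas y el umbral de similitud son supuestos:

```javascript
// Boceto ilustrativo: detecta variables requeridas ausentes y sugiere
// nombres "casi correctos" ya definidos, como POSTGRES_PASS cuando la
// aplicación espera POSTGRES_PASSWORD.
const REQUIRED = ["POSTGRES_HOST", "POSTGRES_PASSWORD", "CORS_ALLOWED_ORIGINS"];

// Distancia de Levenshtein para medir la similitud entre dos nombres
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
    }
  }
  return dp[a.length][b.length];
}

function checkEnv(env) {
  const problems = [];
  for (const name of REQUIRED) {
    if (env[name] !== undefined) continue;
    // Buscar una variable definida con nombre parecido (posible error tipográfico)
    const similar = Object.keys(env).find((k) => levenshtein(k, name) <= 4);
    problems.push(
      similar ? `${name} no definida; ¿quiso decir ${similar}?` : `${name} no definida`
    );
  }
  return problems;
}

module.exports = { checkEnv, levenshtein };
```

Ejecutado con `checkEnv(process.env)` en un hook, imprime qué variables esperadas faltan y qué nombre casi idéntico existe en el Grupo de Variables.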

### Configuración en SleakOps

```json
{
  "variableGroups": [
    {
      "name": "Database Configuration",
      "scope": "global",
      "environment": "all",
      "variables": {
        "POSTGRES_HOST": "{{dependency.postgres.host}}",
        "POSTGRES_PORT": "{{dependency.postgres.port}}",
        "POSTGRES_DB": "{{dependency.postgres.database}}",
        "POSTGRES_USER": "{{dependency.postgres.username}}",
        "POSTGRES_PASSWORD": "{{dependency.postgres.password}}"
      }
    },
    {
      "name": "Application Settings",
      "scope": "global",
      "environment": "production",
      "variables": {
        "CORS_ALLOWED_ORIGINS": "https://prod.example.com",
        "ASPNETCORE_ENVIRONMENT": "Production"
      }
    }
  ]
}
```

Los hooks de migración necesitan configuración especial para acceder a las variables de entorno correctamente.

### Configuración de Hook Pre-Deploy

```bash
#!/bin/bash
# hooks/pre-deploy.sh

set -e

echo "=== Pre-Deploy Migration Hook ==="

# Verificar variables críticas
if [ -z "$POSTGRES_HOST" ] || [ -z "$POSTGRES_PASSWORD" ]; then
  echo "❌ Variables de base de datos no disponibles"
  echo "Disponibles: $(env | grep POSTGRES | cut -d= -f1 | tr '\n' ' ')"
  exit 1
fi

# Configurar cadena de conexión
export ConnectionStrings__DefaultConnection="Host=$POSTGRES_HOST;Database=$POSTGRES_DB;Username=$POSTGRES_USER;Password=$POSTGRES_PASSWORD;Port=$POSTGRES_PORT"

echo "✅ Variables configuradas correctamente"

# Verificar conectividad antes de migración (prueba TCP con /dev/tcp de bash)
echo "=== Test de Conectividad ==="
timeout 10 bash -c "</dev/tcp/$POSTGRES_HOST/$POSTGRES_PORT" \
  && echo "✅ Base de datos accesible" \
  || { echo "❌ No se pudo conectar a la base de datos"; exit 1; }
```

La configuración de cadenas de conexión debe ser flexible para trabajar con variables de entorno en diferentes entornos.
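El patrón de sustitución de marcadores `{VARIABLE}` por valores del entorno que usa esta sección puede ilustrarse de forma genérica. Boceto ilustrativo en JavaScript (no es código de SleakOps ni de la aplicación; los nombres de variables son los mismos supuestos de los ejemplos):

```javascript
// Boceto ilustrativo: expande marcadores {NOMBRE} en una cadena de
// conexión usando un mapa de variables de entorno, con valores por
// defecto cuando la variable no está definida.
function expandPlaceholders(template, env, defaults = {}) {
  return template.replace(/\{([A-Z_]+)\}/g, (match, name) => {
    if (env[name] !== undefined) return env[name];
    if (defaults[name] !== undefined) return defaults[name];
    return match; // dejar el marcador intacto si no hay valor disponible
  });
}

const plantilla = "Host={POSTGRES_HOST};Database={POSTGRES_DB};Port={POSTGRES_PORT}";
const resultado = expandPlaceholders(
  plantilla,
  { POSTGRES_HOST: "db.internal", POSTGRES_DB: "netdo_firev_db" },
  { POSTGRES_PORT: "5432" }
);
// resultado: "Host=db.internal;Database=netdo_firev_db;Port=5432"

module.exports = { expandPlaceholders };
```

Dejar el marcador intacto (en lugar de sustituirlo por una cadena vacía) hace que una variable faltante sea visible de inmediato en los logs de conexión, que es exactamente el tipo de fallo silencioso descrito en este documento.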
+ +### Configuración en appsettings.json + +```json +{ + "ConnectionStrings": { + "DefaultConnection": "Host=localhost;Database=netdo_firev_dev;Username=postgres;Password=password;Port=5432" + }, + "Logging": { + "LogLevel": { + "Default": "Information", + "Microsoft.AspNetCore": "Warning", + "Microsoft.EntityFrameworkCore": "Information" + } + }, + "AllowedHosts": "*" +} +``` + +### Configuración en appsettings.Production.json + +```json +{ + "ConnectionStrings": { + "DefaultConnection": "Host={POSTGRES_HOST};Database={POSTGRES_DB};Username={POSTGRES_USER};Password={POSTGRES_PASSWORD};Port={POSTGRES_PORT};SSL Mode=Require;Trust Server Certificate=true" + }, + "Logging": { + "LogLevel": { + "Default": "Warning", + "Microsoft.EntityFrameworkCore": "Warning" + } + } +} +``` + +### Configuración en Program.cs + +```csharp +using Microsoft.EntityFrameworkCore; + +var builder = WebApplication.CreateBuilder(args); + +// Configurar cadena de conexión con variables de entorno +var connectionString = builder.Configuration.GetConnectionString("DefaultConnection"); + +// Reemplazar placeholders con variables de entorno +connectionString = connectionString? + .Replace("{POSTGRES_HOST}", Environment.GetEnvironmentVariable("POSTGRES_HOST") ?? "localhost") + .Replace("{POSTGRES_DB}", Environment.GetEnvironmentVariable("POSTGRES_DB") ?? "netdo_firev_dev") + .Replace("{POSTGRES_USER}", Environment.GetEnvironmentVariable("POSTGRES_USER") ?? "postgres") + .Replace("{POSTGRES_PASSWORD}", Environment.GetEnvironmentVariable("POSTGRES_PASSWORD") ?? "password") + .Replace("{POSTGRES_PORT}", Environment.GetEnvironmentVariable("POSTGRES_PORT") ?? 
"5432"); + +builder.Services.AddDbContext(options => +{ + options.UseNpgsql(connectionString); + if (builder.Environment.IsDevelopment()) + { + options.EnableSensitiveDataLogging(); + options.EnableDetailedErrors(); + } +}); + +// Configurar CORS con variable de entorno +var corsOrigins = Environment.GetEnvironmentVariable("CORS_ALLOWED_ORIGINS")?.Split(',') + ?? new[] { "http://localhost:3000" }; + +builder.Services.AddCors(options => +{ + options.AddDefaultPolicy(policy => + { + policy.WithOrigins(corsOrigins) + .AllowAnyMethod() + .AllowAnyHeader() + .AllowCredentials(); + }); +}); + +var app = builder.Build(); + +// Ejecutar migraciones automáticamente en producción +if (app.Environment.IsProduction()) +{ + using var scope = app.Services.CreateScope(); + var context = scope.ServiceProvider.GetRequiredService(); + await context.Database.MigrateAsync(); +} + +app.UseCors(); +app.Run(); +``` + +### Método alternativo con builder pattern + +```csharp +public static class DatabaseExtensions +{ + public static IServiceCollection AddDatabaseConfiguration( + this IServiceCollection services, + IConfiguration configuration, + IWebHostEnvironment environment) + { + var connectionStringBuilder = new NpgsqlConnectionStringBuilder + { + Host = Environment.GetEnvironmentVariable("POSTGRES_HOST") ?? "localhost", + Database = Environment.GetEnvironmentVariable("POSTGRES_DB") ?? "netdo_firev_dev", + Username = Environment.GetEnvironmentVariable("POSTGRES_USER") ?? "postgres", + Password = Environment.GetEnvironmentVariable("POSTGRES_PASSWORD") ?? "password", + Port = int.Parse(Environment.GetEnvironmentVariable("POSTGRES_PORT") ?? 
"5432") + }; + + if (environment.IsProduction()) + { + connectionStringBuilder.SslMode = SslMode.Require; + connectionStringBuilder.TrustServerCertificate = true; + connectionStringBuilder.CommandTimeout = 30; + connectionStringBuilder.Timeout = 15; + } + + services.AddDbContext(options => + { + options.UseNpgsql(connectionStringBuilder.ConnectionString); + + if (environment.IsDevelopment()) + { + options.EnableSensitiveDataLogging(); + options.EnableDetailedErrors(); + options.LogTo(Console.WriteLine, LogLevel.Information); + } + }); + + return services; + } +} + +// Uso en Program.cs +builder.Services.AddDatabaseConfiguration(builder.Configuration, builder.Environment); +``` + + + + + +### Error: "Cannot access a disposed object" + +**Causa:** Contexto de base de datos dispuesto antes de completar la migración + +**Solución:** + +```csharp +// Program.cs - Configuración correcta de scope +public static async Task Main(string[] args) +{ + var app = CreateHostBuilder(args).Build(); + + // Crear scope separado para migraciones + using (var scope = app.Services.CreateScope()) + { + var context = scope.ServiceProvider.GetRequiredService(); + + try + { + await context.Database.MigrateAsync(); + var logger = scope.ServiceProvider.GetRequiredService>(); + logger.LogInformation("Database migration completed successfully"); + } + catch (Exception ex) + { + var logger = scope.ServiceProvider.GetRequiredService>(); + logger.LogError(ex, "An error occurred while migrating the database"); + throw; + } + } + + await app.RunAsync(); +} +``` + +### Error: "Password authentication failed" + +**Causa:** Variables de entorno de contraseña incorrectas o no disponibles + +**Solución:** + +```bash +# Verificar variables específicas +echo "POSTGRES_PASSWORD value: '$POSTGRES_PASSWORD'" +echo "Length: ${#POSTGRES_PASSWORD}" + +# Test de conexión directo +psql "postgresql://$POSTGRES_USER:$POSTGRES_PASSWORD@$POSTGRES_HOST:$POSTGRES_PORT/$POSTGRES_DB" -c "\dt" + +# Debug de conexión con 
logging
+PGPASSWORD="$POSTGRES_PASSWORD" psql -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT version();"
+```
+
+### Error: "Network is unreachable"
+
+**Causa:** Problemas de conectividad de red con PostgreSQL
+
+**Solución:**
+
+```bash
+#!/bin/bash
+# network-debug.sh
+
+echo "=== Network Connectivity Debug ==="
+
+# Test de DNS
+echo "1. DNS Resolution:"
+nslookup "$POSTGRES_HOST"
+
+# Test de conectividad TCP
+echo "2. TCP Connectivity:"
+timeout 10 bash -c "</dev/tcp/$POSTGRES_HOST/$POSTGRES_PORT" && echo "TCP OK" || echo "TCP connection failed"
+
+# Test de ping
+echo "3. Ping test:"
+ping -c 3 "$POSTGRES_HOST" 2>/dev/null || echo "Ping no disponible"
+
+# Test con telnet
+echo "4. Telnet test:"
+timeout 5 telnet "$POSTGRES_HOST" "$POSTGRES_PORT" 2>/dev/null || echo "Telnet connection failed"
+
+# Verificar variables de red
+echo "5. Network environment:"
+echo "POSTGRES_HOST: $POSTGRES_HOST"
+echo "POSTGRES_PORT: $POSTGRES_PORT"
+ip route show
+```
+
+### Error: "Migration assembly not found"
+
+**Causa:** Problema con la ruta del proyecto o ensamblado de migraciones
+
+**Solución:**
+
+```bash
+# Verificar estructura del proyecto
+find . -name "*.csproj" -exec echo "Found project: {}" \;
+find . -name "*Migrations*" -type d -exec echo "Found migrations: {}" \;
+
+# Limpiar y rebuild
+dotnet clean
+dotnet restore
+dotnet build
+
+# Verificar herramientas EF
+dotnet tool list -g | grep dotnet-ef
+dotnet tool install --global dotnet-ef --version 7.0.0
+
+# Comando de migración con debugging
+dotnet ef database update \
+  --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj \
+  --startup-project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj \
+  --context ApplicationDbContext \
+  --verbose \
+  --no-build
+```
+
+### Script completo de debugging
+
+```bash
+#!/bin/bash
+# migration-debug.sh
+
+set -e
+
+echo "=== Migration Debug Script ==="
+
+# 1. Verificar variables de entorno
+echo "1. Environment Variables:"
+env | grep -E "(POSTGRES|CORS|ASPNET)" | sort
+
+# 2. Verificar estructura del proyecto
+echo "2. Project Structure:"
+find . 
-maxdepth 3 -name "*.csproj" -o -name "Migrations" -type d
+
+# 3. Verificar herramientas
+echo "3. .NET Tools:"
+dotnet --version
+dotnet tool list -g | grep -E "(dotnet-ef|entity)"
+
+# 4. Test de conectividad
+echo "4. Database Connectivity:"
+timeout 10 bash -c "</dev/tcp/$POSTGRES_HOST/$POSTGRES_PORT" && \
+  echo "✅ Database reachable" || \
+  echo "❌ Database unreachable"
+
+# 5. Listar migraciones
+echo "5. Migrations:"
+dotnet ef migrations list \
+  --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj 2>/dev/null || \
+  echo "❌ Cannot list migrations"
+
+# 6. Test de compilación
+echo "6. Build Test:"
+dotnet build ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --no-restore --verbosity quiet && \
+  echo "✅ Build successful" || \
+  echo "❌ Build failed"
+
+echo "=== Debug Complete ==="
+```
+
+
+
+### Estrategia de migraciones zero-downtime
+
+```csharp
+public static class DatabaseMigrationExtensions
+{
+    public static async Task<IHost> MigrateDatabaseAsync(this IHost host)
+    {
+        using var scope = host.Services.CreateScope();
+        var context = scope.ServiceProvider.GetRequiredService<ApplicationDbContext>();
+        var logger = scope.ServiceProvider.GetRequiredService<ILogger<ApplicationDbContext>>();
+
+        try
+        {
+            logger.LogInformation("Starting database migration...");
+
+            // Verificar conectividad primero
+            if (!await context.Database.CanConnectAsync())
+                throw new InvalidOperationException("Cannot connect to the database");
+            logger.LogInformation("Database connection verified");
+
+            // Obtener migraciones pendientes
+            var pendingMigrations = await context.Database.GetPendingMigrationsAsync();
+
+            if (pendingMigrations.Any())
+            {
+                logger.LogInformation($"Applying {pendingMigrations.Count()} pending migrations...");
+                foreach (var migration in pendingMigrations)
+                {
+                    logger.LogInformation($"Pending migration: {migration}");
+                }
+
+                // Aplicar migraciones con timeout extendido
+                // (MigrateAsync gestiona sus propias transacciones; no envolverlo en una transacción explícita)
+                context.Database.SetCommandTimeout(TimeSpan.FromMinutes(10));
+                await context.Database.MigrateAsync();
+
+                logger.LogInformation("Database migration completed successfully");
+            }
+            else
+            {
+                logger.LogInformation("No pending migrations found");
+            }
+        }
+        catch (Exception ex)
+        {
+            logger.LogError(ex, "Database migration failed");
+            throw;
+        }
+
+        
return host;
+    }
+}
+```
+
+### Configuración de salud de base de datos
+
+```csharp
+public static class HealthCheckExtensions
+{
+    public static IServiceCollection AddDatabaseHealthChecks(
+        this IServiceCollection services,
+        string connectionString)
+    {
+        services.AddHealthChecks()
+            .AddNpgSql(
+                connectionString,
+                name: "postgresql",
+                tags: new[] { "database", "postgresql" })
+            .AddDbContextCheck<ApplicationDbContext>(
+                name: "application-db-context",
+                tags: new[] { "database", "ef-core" });
+
+        return services;
+    }
+}
+
+// En Program.cs
+app.MapHealthChecks("/health", new HealthCheckOptions
+{
+    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse,
+    ResultStatusCodes =
+    {
+        [HealthStatus.Healthy] = StatusCodes.Status200OK,
+        [HealthStatus.Degraded] = StatusCodes.Status200OK,
+        [HealthStatus.Unhealthy] = StatusCodes.Status503ServiceUnavailable
+    }
+});
+
+app.MapHealthChecks("/health/ready", new HealthCheckOptions
+{
+    Predicate = check => check.Tags.Contains("database"),
+    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
+});
+
+app.MapHealthChecks("/health/live", new HealthCheckOptions
+{
+    Predicate = _ => false,
+    ResponseWriter = UIResponseWriter.WriteHealthCheckUIResponse
+});
+```
+
+### Monitoreo de migraciones
+
+```csharp
+public class MigrationMonitoringService : BackgroundService
+{
+    private readonly IServiceProvider _services;
+    private readonly ILogger<MigrationMonitoringService> _logger;
+
+    public MigrationMonitoringService(
+        IServiceProvider services,
+        ILogger<MigrationMonitoringService> logger)
+    {
+        _services = services;
+        _logger = logger;
+    }
+
+    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
+    {
+        while (!stoppingToken.IsCancellationRequested)
+        {
+            try
+            {
+                using var scope = _services.CreateScope();
+                var context = scope.ServiceProvider.GetRequiredService<ApplicationDbContext>();
+
+                // Verificar estado de migraciones
+                var appliedMigrations = await context.Database.GetAppliedMigrationsAsync();
+                var pendingMigrations = await context.Database.GetPendingMigrationsAsync();
+
+                if 
(pendingMigrations.Any())
+                {
+                    _logger.LogWarning($"Found {pendingMigrations.Count()} pending migrations");
+
+                    // Opcional: Aplicar automáticamente en entornos específicos
+                    if (Environment.GetEnvironmentVariable("AUTO_MIGRATE") == "true")
+                    {
+                        _logger.LogInformation("Auto-migration enabled, applying pending migrations...");
+                        await context.Database.MigrateAsync();
+                    }
+                }
+
+                // Verificar integridad de datos (consulta escalar vía ADO.NET)
+                var conn = context.Database.GetDbConnection();
+                if (conn.State != System.Data.ConnectionState.Open)
+                    await conn.OpenAsync(stoppingToken);
+                await using var cmd = conn.CreateCommand();
+                cmd.CommandText = "SELECT COUNT(*) FROM \"__EFMigrationsHistory\"";
+                var recordCount = await cmd.ExecuteScalarAsync(stoppingToken);
+
+                _logger.LogDebug($"Migration history contains {recordCount} entries");
+            }
+            catch (Exception ex)
+            {
+                _logger.LogError(ex, "Error monitoring database migrations");
+            }
+
+            await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken);
+        }
+    }
+}
+
+// Registrar el servicio
+builder.Services.AddHostedService<MigrationMonitoringService>();
+```
+
+### Backup y rollback automatizado
+
+```bash
+#!/bin/bash
+# migration-with-backup.sh
+
+set -e
+
+DB_BACKUP_DIR="/backups"
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+BACKUP_FILE="$DB_BACKUP_DIR/pre_migration_backup_$TIMESTAMP.sql"
+
+echo "=== Database Migration with Backup ==="
+
+# Crear directorio de backup
+mkdir -p "$DB_BACKUP_DIR"
+
+# 1. Backup pre-migración
+echo "Creating pre-migration backup..."
+pg_dump "postgresql://$POSTGRES_USER:$POSTGRES_PASSWORD@$POSTGRES_HOST:$POSTGRES_PORT/$POSTGRES_DB" \
+  --no-password \
+  --verbose \
+  --format=custom \
+  --compress=9 \
+  --file="$BACKUP_FILE"
+
+echo "✅ Backup created: $BACKUP_FILE"
+
+# 2. Verificar backup
+echo "Verifying backup integrity..."
+pg_restore --list "$BACKUP_FILE" > /dev/null && \
+  echo "✅ Backup verification successful" || \
+  { echo "❌ Backup verification failed"; exit 1; }
+
+# 3. Ejecutar migraciones
+echo "Executing migrations..."
+if dotnet ef database update --project ./Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj --verbose; then
+  echo "✅ Migrations completed successfully"
+
+  # 4. 
Verificar aplicación después de migración + echo "Verifying application health..." + sleep 10 + curl -f "http://localhost:8080/health" || { + echo "❌ Health check failed, considering rollback..." + + # Opcional: Rollback automático + if [ "$AUTO_ROLLBACK" = "true" ]; then + echo "Performing automatic rollback..." + pg_restore --clean --if-exists \ + "postgresql://$POSTGRES_USER:$POSTGRES_PASSWORD@$POSTGRES_HOST:$POSTGRES_PORT/$POSTGRES_DB" \ + "$BACKUP_FILE" + echo "✅ Rollback completed" + exit 1 + fi + } + + echo "✅ Migration process completed successfully" + + # 5. Limpiar backups antiguos (mantener últimos 5) + find "$DB_BACKUP_DIR" -name "pre_migration_backup_*.sql" -type f | \ + sort -r | tail -n +6 | xargs -r rm + +else + echo "❌ Migration failed" + echo "Backup available at: $BACKUP_FILE" + exit 1 +fi +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-heroku-to-aws-rds.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-heroku-to-aws-rds.mdx new file mode 100644 index 000000000..95ee48c08 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migration-heroku-to-aws-rds.mdx @@ -0,0 +1,560 @@ +--- +sidebar_position: 3 +title: "Migración de Base de Datos de Heroku a AWS RDS" +description: "Guía completa para migrar bases de datos PostgreSQL de Heroku a AWS RDS usando DMS y pg_restore" +date: "2024-12-19" +category: "dependency" +tags: + ["base de datos", "migración", "heroku", "aws", "rds", "postgresql", "dms"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Migración de Base de Datos de Heroku a AWS RDS + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Migración, Heroku, AWS, RDS, 
PostgreSQL, DMS + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan migrar grandes bases de datos PostgreSQL desde Heroku a AWS RDS a través de la plataforma SleakOps, enfrentando volcados incompletos y discrepancias de tamaño entre la base de datos de origen y la restaurada. + +**Síntomas Observados:** + +- La restauración del volcado de Heroku resulta en un tamaño de base de datos menor (94GB vs 385GB en producción) +- La descarga estándar del volcado mediante curl puede estar incompleta +- Necesidad de acceso externo a la base de datos para herramientas de migración +- Requerimiento de configuración de AWS DMS para completar la migración + +**Configuración Relevante:** + +- Origen: Base de datos PostgreSQL en Heroku (385GB) +- Destino: AWS RDS PostgreSQL +- Tamaño del volcado restaurado: 94GB (incompleto) +- Método de migración: pg_restore con --jobs=8 --no-owner --no-acl + +**Condiciones de Error:** + +- Transferencia de datos incompleta usando métodos estándar de volcado de Heroku +- Desajuste de tamaño que indica pérdida de datos durante la migración +- Necesidad de replicación continua para sincronizar datos faltantes + +## Solución Detallada + + + +El proceso estándar de volcado de Heroku puede no capturar todos los datos debido a: + +1. **Limitaciones de tiempo de espera:** Bases de datos grandes pueden agotarse durante la generación del volcado +2. **Transacciones activas:** Datos escritos durante la creación del volcado podrían perderse +3. **Problemas de compresión:** Algunos tipos de datos pueden no comprimirse/descomprimirse correctamente +4. 
**Límites de conexión:** Heroku puede limitar operaciones de volcado que duren mucho tiempo + +**Pasos de verificación:** + +```bash +# Verificar el conteo de filas en ambas bases de datos +psql -h heroku-host -c "SELECT schemaname,tablename,n_tup_ins-n_tup_del as rowcount FROM pg_stat_user_tables ORDER BY rowcount DESC;" +psql -h rds-host -c "SELECT schemaname,tablename,n_tup_ins-n_tup_del as rowcount FROM pg_stat_user_tables ORDER BY rowcount DESC;" + +# Comparar tamaños de base de datos +psql -h heroku-host -c "SELECT pg_size_pretty(pg_database_size(current_database()));" +psql -h rds-host -c "SELECT pg_size_pretty(pg_database_size(current_database()));" + +# Verificar integridad del volcado +file latest.dump +pg_restore --list latest.dump | head -20 +``` + +**Indicadores de volcado incompleto:** + +- Diferencias significativas en el conteo de filas entre tablas +- Tamaños de base de datos que no coinciden proporcionalmente +- Errores durante pg_restore relacionados con datos faltantes +- Índices o constraints no restaurados correctamente + + + + + +El servicio de migración de bases de datos de AWS (DMS) puede manejar migraciones de bases de datos grandes con tiempo de inactividad mínimo: + +**Requisitos previos:** + +1. Crear instancia de replicación DMS +2. Configurar endpoint de origen (PostgreSQL en Heroku) +3. Configurar endpoint de destino (AWS RDS) +4. 
Establecer acceso externo a la base de datos + +**Configuración de DMS:** + +```json +{ + "replication-instance-class": "dms.t3.large", + "replication-instance-id": "heroku-to-rds-migration", + "storage-encrypted": true, + "allocated-storage": 500, + "vpc-security-group-ids": ["sg-xxxxxxxxx"], + "replication-subnet-group-id": "dms-subnet-group" +} +``` + +**Configuración del endpoint de origen (Heroku):** + +```json +{ + "endpoint-id": "heroku-postgres-source", + "endpoint-type": "source", + "engine-name": "postgres", + "server-name": "heroku-postgres-host.compute-1.amazonaws.com", + "port": 5432, + "database-name": "database_name", + "username": "heroku_user", + "password": "heroku_password", + "ssl-mode": "require" +} +``` + +**Configuración del endpoint de destino (RDS):** + +```json +{ + "endpoint-id": "rds-postgres-target", + "endpoint-type": "target", + "engine-name": "postgres", + "server-name": "rds-instance.region.rds.amazonaws.com", + "port": 5432, + "database-name": "database_name", + "username": "rds_user", + "password": "rds_password" +} +``` + + + + + +Para usar herramientas como DMS o migración manual, necesita configurar acceso externo en SleakOps: + +**Pasos en SleakOps:** + +1. **Navegar a Dependencias** → Seleccionar su base de datos PostgreSQL +2. 
**Configurar Acceso Externo:** + - Habilitar "External Access" + - Configurar security groups apropiados + - Obtener endpoint y credenciales + +```yaml +# Configuración de security group para DMS +SecurityGroupRules: + - Type: Egress + Protocol: TCP + Port: 5432 + Destination: 0.0.0.0/0 + - Type: Ingress + Protocol: TCP + Port: 5432 + Source: dms-subnet-group-cidr +``` + +**Verificación de conectividad:** + +```bash +# Probar conexión desde instancia DMS +psql -h rds-endpoint.region.rds.amazonaws.com -p 5432 -U username -d database_name -c "SELECT version();" + +# Verificar que DMS puede acceder a Heroku +psql -h heroku-host.compute-1.amazonaws.com -p 5432 -U heroku_user -d database_name -c "SELECT version();" +``` + + + + + +**Configuración de tarea de migración completa:** + +```json +{ + "migration-type": "full-load-and-cdc", + "replication-task-id": "heroku-to-rds-full-migration", + "source-endpoint-arn": "arn:aws:dms:region:account:endpoint:heroku-postgres-source", + "target-endpoint-arn": "arn:aws:dms:region:account:endpoint:rds-postgres-target", + "replication-instance-arn": "arn:aws:dms:region:account:rep:heroku-to-rds-migration", + "table-mappings": { + "rules": [ + { + "rule-type": "selection", + "rule-id": "1", + "rule-name": "1", + "object-locator": { + "schema-name": "public", + "table-name": "%" + }, + "rule-action": "include" + } + ] + } +} +``` + +**Configuración avanzada de tarea:** + +```json +{ + "task-settings": { + "TargetMetadata": { + "SupportLobs": true, + "FullLobMode": false, + "LobChunkSize": 64, + "LimitedSizeLobMode": true, + "LobMaxSize": 32 + }, + "FullLoadSettings": { + "CommitRate": 10000, + "MaxFullLoadSubTasks": 8, + "TransactionConsistencyTimeout": 600, + "CreatePkAfterFullLoad": false + }, + "ChangeDataCaptureSettings": { + "StartFromTimestamp": "2024-12-19T00:00:00", + "MemoryLimitTotal": 1024, + "MemoryKeepTime": 60 + } + } +} +``` + + + + + +**Scripts de validación post-migración:** + +```sql +-- Comparar conteo de tablas 
+SELECT + 'Source' as database_type, + schemaname, + tablename, + n_tup_ins - n_tup_del as row_count +FROM pg_stat_user_tables +ORDER BY row_count DESC; + +-- Verificar integridad referencial +SELECT + conname as constraint_name, + conrelid::regclass as table_name, + confrelid::regclass as referenced_table +FROM pg_constraint +WHERE contype = 'f'; + +-- Validar secuencias +SELECT + sequencename, + last_value, + increment_by +FROM pg_sequences; +``` + +**Verificación de índices:** + +```sql +-- Comparar índices entre origen y destino +SELECT + schemaname, + tablename, + indexname, + indexdef +FROM pg_indexes +WHERE schemaname = 'public' +ORDER BY tablename, indexname; +``` + +**Script de validación automatizada:** + +```python +#!/usr/bin/env python3 +import psycopg2 +import sys +from datetime import datetime + +def connect_db(host, port, database, username, password): + """Establecer conexión a la base de datos""" + try: + conn = psycopg2.connect( + host=host, + port=port, + database=database, + user=username, + password=password + ) + return conn + except Exception as e: + print(f"Error conectando a {host}: {e}") + return None + +def compare_table_counts(source_conn, target_conn): + """Comparar conteo de filas entre origen y destino""" + query = """ + SELECT + schemaname, + tablename, + n_tup_ins - n_tup_del as row_count + FROM pg_stat_user_tables + WHERE schemaname = 'public' + ORDER BY tablename; + """ + + source_cur = source_conn.cursor() + target_cur = target_conn.cursor() + + source_cur.execute(query) + target_cur.execute(query) + + source_results = dict((row[1], row[2]) for row in source_cur.fetchall()) + target_results = dict((row[1], row[2]) for row in target_cur.fetchall()) + + discrepancies = [] + for table, source_count in source_results.items(): + target_count = target_results.get(table, 0) + if source_count != target_count: + discrepancies.append({ + 'table': table, + 'source_count': source_count, + 'target_count': target_count, + 'difference': 
source_count - target_count + }) + + return discrepancies + +def main(): + # Configuración de conexiones + heroku_config = { + 'host': 'heroku-host.compute-1.amazonaws.com', + 'port': 5432, + 'database': 'database_name', + 'username': 'heroku_user', + 'password': 'heroku_password' + } + + rds_config = { + 'host': 'rds-instance.region.rds.amazonaws.com', + 'port': 5432, + 'database': 'database_name', + 'username': 'rds_user', + 'password': 'rds_password' + } + + # Establecer conexiones + source_conn = connect_db(**heroku_config) + target_conn = connect_db(**rds_config) + + if not source_conn or not target_conn: + sys.exit(1) + + print(f"Iniciando validación de migración - {datetime.now()}") + + # Comparar conteos de tablas + discrepancies = compare_table_counts(source_conn, target_conn) + + if discrepancies: + print("Discrepancias encontradas:") + for disc in discrepancies: + print(f" Tabla: {disc['table']}") + print(f" Origen: {disc['source_count']} filas") + print(f" Destino: {disc['target_count']} filas") + print(f" Diferencia: {disc['difference']} filas") + else: + print("✅ Validación exitosa: Todos los conteos de tablas coinciden") + + source_conn.close() + target_conn.close() + +if __name__ == "__main__": + main() +``` + + + + + +**Configuración de parámetros para bases de datos grandes:** + +```sql +-- Configurar parámetros en la base de datos de destino para mejorar rendimiento +ALTER SYSTEM SET shared_buffers = '4GB'; +ALTER SYSTEM SET work_mem = '256MB'; +ALTER SYSTEM SET maintenance_work_mem = '1GB'; +ALTER SYSTEM SET checkpoint_completion_target = 0.9; +ALTER SYSTEM SET wal_buffers = '64MB'; +ALTER SYSTEM SET effective_cache_size = '12GB'; + +-- Recargar configuración +SELECT pg_reload_conf(); +``` + +**Estrategias de optimización DMS:** + +```json +{ + "performance-settings": { + "ParallelLoadThreads": 8, + "ParallelLoadBufferSize": 50, + "MaxFullLoadSubTasks": 8, + "ParallelApplyThreads": 4, + "ParallelApplyBufferSize": 100, + "CommitRate": 10000 + } 
+} +``` + +**Monitoreo del progreso de migración:** + +```bash +#!/bin/bash +# Script para monitorear progreso de DMS + +aws dms describe-replication-tasks \ + --filters Name=replication-task-id,Values=heroku-to-rds-full-migration \ + --query 'ReplicationTasks[0].ReplicationTaskStats' + +# Monitorear métricas CloudWatch +aws cloudwatch get-metric-statistics \ + --namespace AWS/DMS \ + --metric-name CDCLatencyTarget \ + --dimensions Name=ReplicationInstanceIdentifier,Value=heroku-to-rds-migration \ + --start-time 2024-12-19T00:00:00Z \ + --end-time 2024-12-19T23:59:59Z \ + --period 300 \ + --statistics Average +``` + + + + + +**Plan de rollback en caso de problemas:** + +```bash +#!/bin/bash +# Script de rollback de emergencia + +# 1. Detener la tarea de migración DMS +aws dms stop-replication-task \ + --replication-task-arn arn:aws:dms:region:account:task:heroku-to-rds-full-migration + +# 2. Cambiar conexiones de aplicación de vuelta a Heroku +kubectl patch configmap app-config \ + --patch '{"data":{"DATABASE_URL":"postgres://heroku-user:password@heroku-host:5432/db"}}' + +# 3. Reiniciar pods de aplicación +kubectl rollout restart deployment/app-deployment + +# 4. Verificar conectividad +kubectl exec -it deployment/app-deployment -- psql $DATABASE_URL -c "SELECT version();" +``` + +**Verificaciones pre-rollback:** + +```sql +-- Verificar que la aplicación puede conectar a Heroku +SELECT + pid, + usename, + application_name, + client_addr, + state +FROM pg_stat_activity +WHERE datname = 'production_db'; + +-- Verificar integridad de datos en Heroku +SELECT + tablename, + n_tup_ins - n_tup_del as current_rows +FROM pg_stat_user_tables +WHERE schemaname = 'public' +ORDER BY current_rows DESC; +``` + +**Procedimiento de emergencia:** + +1. **Notificación inmediata** al equipo de desarrollo +2. **Cambio de DNS** si es necesario para dirigir tráfico +3. **Monitoreo de métricas** de aplicación post-rollback +4. 
**Documentación de incidentes** para análisis posterior + + + + + +**Tareas de limpieza después de migración exitosa:** + +```sql +-- Actualizar estadísticas de la base de datos +ANALYZE; + +-- Recompilar índices para optimizar rendimiento +REINDEX DATABASE database_name; + +-- Verificar y limpiar logs de WAL antiguos +SELECT pg_current_wal_lsn(); +SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')); + +-- Configurar autovacuum para nueva carga de trabajo +ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1; +ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.05; +SELECT pg_reload_conf(); +``` + +**Limpieza de recursos AWS:** + +```bash +#!/bin/bash +# Limpiar recursos DMS después de migración exitosa + +# Eliminar tarea de replicación +aws dms delete-replication-task \ + --replication-task-arn arn:aws:dms:region:account:task:heroku-to-rds-full-migration + +# Eliminar endpoints (después de confirmar que no se necesitan) +aws dms delete-endpoint --endpoint-arn arn:aws:dms:region:account:endpoint:heroku-postgres-source +aws dms delete-endpoint --endpoint-arn arn:aws:dms:region:account:endpoint:rds-postgres-target + +# Eliminar instancia de replicación +aws dms delete-replication-instance \ + --replication-instance-arn arn:aws:dms:region:account:rep:heroku-to-rds-migration +``` + +**Configuración de monitoreo post-migración:** + +```yaml +# CloudWatch alarms para monitorear la nueva base de datos +DatabaseCPUAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: RDS-HighCPU + MetricName: CPUUtilization + Namespace: AWS/RDS + Statistic: Average + Period: 300 + EvaluationPeriods: 2 + Threshold: 80 + +DatabaseConnectionsAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: RDS-HighConnections + MetricName: DatabaseConnections + Namespace: AWS/RDS + Statistic: Average + Period: 300 + EvaluationPeriods: 2 + Threshold: 40 +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 
basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migrations-execution-hooks.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migrations-execution-hooks.mdx new file mode 100644 index 000000000..6cba22806 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-migrations-execution-hooks.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 15 +title: "Migraciones de Base de Datos con Execution Hooks" +description: "Cómo configurar y gestionar migraciones de base de datos usando SleakOps Execution Hooks" +date: "2024-12-19" +category: "workload" +tags: ["base de datos", "migraciones", "hooks", "ejecuciones", "despliegue"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Migraciones de Base de Datos con Execution Hooks + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Base de datos, Migraciones, Hooks, Ejecuciones, Despliegue + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan ejecutar migraciones de base de datos como parte de su proceso de despliegue en SleakOps, específicamente para aplicaciones .NET que usan migraciones de Entity Framework. 
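El comportamiento esperado (la migración se ejecuta antes de actualizar el código, y un fallo en la migración aborta el despliegue) puede modelarse con un esquema mínimo ilustrativo; no es código de SleakOps, solo una simulación del orden de los pasos:

```python
def deploy(steps):
    """Ejecuta los pasos del despliegue en orden.

    Cada paso es una tupla (nombre, acción); si una acción devuelve False,
    los pasos restantes no se ejecutan (el despliegue se aborta).
    """
    log = []
    for name, action in steps:
        ok = action()
        log.append((name, ok))
        if not ok:
            break  # la migración falló: no se actualiza la aplicación
    return log

# El hook pre-upgrade (p. ej. `dotnet ef database update`) corre primero
pasos = [
    ("db-migration (pre-upgrade)", lambda: True),
    ("upgrade-app", lambda: True),
]
```

Con esta ordenación, el código nuevo nunca llega al clúster con un esquema de base de datos desactualizado.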
+ +**Síntomas observados:** + +- Necesidad de ejecutar comandos `dotnet ef database update` durante los despliegues +- Incertidumbre sobre cuándo y cómo deben ejecutarse las migraciones en la canalización CI/CD +- Preguntas sobre ejecutar migraciones independientemente de los despliegues +- Necesidad de ejecución automática de migraciones antes de las actualizaciones de código + +**Configuración relevante:** + +- Entorno: Entornos de Desarrollo y Producción +- Framework: .NET con Entity Framework +- Comando de migración: `dotnet ef database update` +- Tipo de hook: `pre-upgrade` +- Tipo de ejecución: Hook y Job + +**Condiciones de error:** + +- Migraciones que no se ejecutan automáticamente durante los despliegues +- Necesidad de ejecutar migraciones sin activar un despliegue completo +- Integración de comandos de migración en el proceso de construcción + +## Solución Detallada + + + +SleakOps ofrece soporte integrado para migraciones de base de datos mediante Execution Hooks: + +**Cómo funciona:** + +1. **Hooks pre-upgrade**: Se crean automáticamente como hooks `db-migration` en tus entornos +2. **Ejecución automática**: Se ejecutan antes de cada despliegue cuando haces push en las ramas `develop` o `main` +3. **Ejecución de comando**: Ejecuta el comando `update-database` antes de actualizar el código de la aplicación + +**Configuración:** + +```yaml +# Configuración del hook (creado automáticamente) +name: db-migration +type: pre-upgrade +command: dotnet ef database update --project YourProject.csproj +``` + +**Flujo del proceso:** + +1. Hacer push de código a la rama `develop` o `main` +2. CI/CD dispara el proceso de Build +3. Inicia el despliegue +4. **Se ejecuta el hook pre-upgrade** → Se ejecuta la migración de base de datos +5. 
El código de la aplicación se actualiza en el clúster + +Esto significa que las migraciones se ejecutan automáticamente con cada despliegue, por lo que normalmente no es necesario ejecutar migraciones manualmente antes de las compilaciones. + + + + + +Para ejecutar migraciones independientemente de los despliegues: + +**Crear una ejecución de tipo Job:** + +1. Ve a tu proyecto en SleakOps +2. Navega a la sección **Ejecuciones** +3. Crea una nueva ejecución con tipo **Job** +4. Configura el comando de migración + +**Configuración del Job:** + +```yaml +name: manual-db-migration +type: job +command: dotnet ef database update --project YourProject.csproj +``` + +**Características:** + +- **Ejecución única:** Se ejecuta solo cuando se activa manualmente +- **Independiente:** No afecta despliegues ni compilaciones +- **Bajo demanda:** Ejecuta cuando necesites correr migraciones manualmente + +**Casos de uso:** + +- Correcciones de emergencia en la base de datos +- Pruebas de migraciones en staging +- Escenarios de reversión +- Configuración inicial de la base de datos + + + + + +Para ejecutar migraciones durante el proceso de construcción (no recomendado como enfoque principal): + +**Agregar comando de migración al Dockerfile:** + +```dockerfile +# Contenido existente de tu Dockerfile +WORKDIR /app +COPY . . 
+
+# Agregar comando de migración antes de CMD
+RUN dotnet ef database update --project ../Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj
+
+# Instrucción CMD existente
+CMD ["dotnet", "YourApp.dll"]
+```
+
+**Consideraciones importantes:**
+
+- **Conectividad a la base de datos:** Asegurar que la base de datos esté accesible durante la construcción
+- **Cadenas de conexión:** Deben estar disponibles en tiempo de construcción
+- **Entorno de construcción:** El servidor de base de datos debe ser accesible desde el entorno de build
+- **Seguridad:** Evitar exponer credenciales de producción en el proceso de construcción
+
+**Enfoque alternativo usando construcciones multi-stage:**
+
+```dockerfile
+# Etapa de build
+FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
+WORKDIR /src
+COPY . .
+RUN dotnet restore
+# Publicar a la carpeta que luego copia la etapa de runtime
+RUN dotnet publish -c Release -o /src/published
+
+# Etapa de migración (opcional; requiere la herramienta dotnet-ef instalada)
+FROM build AS migration
+RUN dotnet ef database update --project YourProject.csproj
+
+# Etapa de runtime
+FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS runtime
+WORKDIR /app
+COPY --from=build /src/published .
+CMD ["dotnet", "YourApp.dll"]
+```
+
+
+
+
+
+**Enfoque recomendado:**
+
+1. **Usar hooks pre-upgrade** (comportamiento por defecto de SleakOps) para migraciones automáticas
+2. **Crear ejecuciones de tipo Job** para migraciones manuales/emergencias
+3. 
**Evitar migraciones en Dockerfile** a menos que requisitos específicos lo demanden + +**Lista de verificación para estrategia de migración:** + +- ✅ Verificar que los hooks pre-upgrade estén configurados en todos los entornos +- ✅ Probar migraciones primero en entorno de desarrollo +- ✅ Crear ejecución tipo Job para capacidad de migración manual +- ✅ Asegurar que las cadenas de conexión a base de datos estén configuradas correctamente +- ✅ Monitorear logs de ejecución de migraciones durante despliegues + +**Solución de problemas comunes:** + +- **Hook no se ejecuta:** Verificar que el hook exista en la configuración del entorno +- **Fallos de conexión:** Verificar conectividad a base de datos desde el clúster +- **Errores de permisos:** Asegurar que la cuenta de servicio tenga acceso a la base de datos +- **Conflictos de migración:** Revisar el historial de migraciones de Entity Framework + +**Consideraciones específicas por entorno:** + +- **Desarrollo:** Migraciones frecuentes, usar hooks automáticos +- **Staging:** Probar migraciones antes de producción, usar hooks y jobs +- **Producción:** Planificación cuidadosa de migraciones, respaldo previo a migración + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-optimization.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-optimization.mdx new file mode 100644 index 000000000..388acfeb7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-optimization.mdx @@ -0,0 +1,207 @@ +--- +sidebar_position: 3 +title: "Optimización y Escalado del Rendimiento de la Base de Datos" +description: "Soluciones para problemas de rendimiento de base de datos, tiempos de espera en endpoints y estrategias de escalado" +date: "2024-01-15" +category: "dependency" +tags: ["base de 
datos", "rendimiento", "escalado", "timeout", "optimización"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Optimización y Escalado del Rendimiento de la Base de Datos + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Rendimiento, Escalado, Timeout, Optimización + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan tiempos de respuesta lentos en los endpoints y errores de timeout a pesar de aumentar los recursos de la aplicación. El problema afecta la experiencia del usuario y requiere estrategias de optimización de rendimiento. + +**Síntomas Observados:** + +- Errores de timeout en endpoints +- Tiempos de respuesta lentos en llamadas API +- Reportes de lentitud en la aplicación por parte de usuarios +- Problemas de rendimiento persisten tras escalar recursos de la aplicación + +**Configuración Relevante:** + +- Base de datos: instancia AWS RDS +- Recursos de la aplicación: CPU y RAM ya aumentados +- Carga de trabajo: múltiples pods en ejecución +- Plataforma: SleakOps en AWS + +**Condiciones de Error:** + +- Timeout ocurre en endpoints específicos +- Degradación del rendimiento afecta a múltiples usuarios +- Problemas persisten a pesar del escalado de recursos +- El problema parece estar relacionado con la base de datos + +## Solución Detallada + + + +Para identificar la causa raíz de los problemas de rendimiento, es necesario implementar un monitoreo y análisis adecuados: + +**Herramientas APM Recomendadas:** + +- **New Relic**: Monitoreo completo del rendimiento de aplicaciones +- **DataDog**: Monitoreo full-stack con insights de bases de datos +- **AWS X-Ray**: Trazado distribuido para aplicaciones AWS + +**Métricas Clave a Monitorear:** + +- Tiempos de ejecución de consultas a la base de datos +- Utilización del pool de conexiones +- Uso de CPU y memoria en la base de datos +- Latencia de red entre aplicación y base de datos +- Logs de 
consultas lentas + + + + + +**Escalado Vertical (Scale Up):** + +Puedes escalar tu instancia RDS directamente desde la Consola AWS: + +1. Ve a **Consola AWS RDS** +2. Selecciona tu instancia de base de datos +3. Haz clic en **Modificar** +4. Elige una clase de instancia más grande +5. Aplica los cambios inmediatamente o durante la ventana de mantenimiento + +**Nota:** El escalado puede causar 20-30 minutos de posible lentitud durante la transición. + +**Opciones de Escalado Horizontal:** + +- **Replicas de lectura**: Para cargas de trabajo con muchas lecturas +- **Sharding de base de datos**: Para aplicaciones con muchas escrituras +- **Pool de conexiones**: Optimizar la gestión de conexiones + +```yaml +# Ejemplo de configuración de base de datos +database: + instance_class: "db.t3.large" # Escalar desde db.t3.medium + allocated_storage: 100 + max_connections: 200 + connection_timeout: 30 +``` + + + + + +**Optimización de Consultas:** + +- Identificar y optimizar consultas lentas +- Añadir índices adecuados en la base de datos +- Implementar estrategias de caching de consultas +- Usar pool de conexiones + +**Mejoras a Nivel de Código:** + +- Implementar procesamiento asíncrono para operaciones pesadas +- Añadir capas de caché (Redis, Memcached) +- Optimizar manejo de conexiones a la base de datos +- Usar operaciones por lotes cuando sea posible + +**Escalado de Carga de Trabajo en SleakOps:** + +```yaml +# Escala tus cargas en SleakOps +workload: + replicas: 5 # Incrementar número de pods + resources: + requests: + cpu: "500m" + memory: "1Gi" + limits: + cpu: "1000m" + memory: "2Gi" +``` + + + + + +**Monitoreo de Base de Datos:** + +1. Habilita **AWS Performance Insights** para RDS +2. Configura métricas personalizadas en **CloudWatch** +3. Activa el registro de consultas lentas +4. Monitorea conteo de conexiones y eventos de espera + +**Monitoreo de Aplicación:** + +1. Integra herramienta APM (New Relic, DataDog) +2. 
Añade métricas personalizadas para lógica de negocio +3. Implementa trazado distribuido +4. Configura alertas para umbrales de rendimiento + +**Indicadores Clave de Rendimiento:** + +- Tiempo promedio de respuesta < 200ms +- Tiempo de consulta a base de datos < 100ms +- Tasa de error < 1% +- Utilización de CPU < 70% + + + + + +**Problemas Relacionados con la Base de Datos:** + +- Consultas ineficientes sin índices adecuados +- Demasiadas conexiones concurrentes +- Contención de bloqueos y consultas bloqueantes +- Recursos insuficientes en la base de datos + +**Problemas Relacionados con la Aplicación:** + +- Llamadas síncronas a APIs externas +- Fugas de memoria que causan presión en el recolector de basura +- Serialización de datos ineficiente +- Falta de estrategias adecuadas de caching + +**Problemas de Infraestructura:** + +- Latencia de red entre servicios +- Recursos insuficientes en la aplicación +- Configuración deficiente del balanceo de carga +- Pool de conexiones inadecuado + + + + + +**¿Es seguro escalar RDS directamente en AWS?** + +Sí, puedes modificar tu instancia RDS directamente desde la Consola AWS incluso cuando usas SleakOps: + +**Pasos:** + +1. Ve a **Consola AWS RDS** +2. Selecciona tu instancia de base de datos +3. Haz clic en **Modificar** +4. Cambia la clase de instancia o almacenamiento +5. Elige **Aplicar inmediatamente** o programa para ventana de mantenimiento + +**Consideraciones Importantes:** + +- **Tiempo de inactividad:** 20-30 minutos de posible lentitud +- **Impacto en conexiones:** Las conexiones existentes pueden ser desconectadas +- **Compatibilidad con SleakOps:** Los cambios no afectan la funcionalidad de SleakOps +- **Monitoreo:** Supervisa el rendimiento durante y después del cambio + +**Por qué esta opción no está en la Consola SleakOps:** +SleakOps se enfoca en despliegue y gestión de aplicaciones. 
Los cambios en infraestructura de base de datos suelen manejarse directamente en la consola del proveedor cloud para mayor control granular. + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-sql-query-optimization.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-sql-query-optimization.mdx new file mode 100644 index 000000000..90248cd53 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-performance-sql-query-optimization.mdx @@ -0,0 +1,209 @@ +--- +sidebar_position: 15 +title: "Problemas de Rendimiento de Base de Datos y Optimización de Consultas SQL" +description: "Solución de problemas de consultas lentas en base de datos que causan tiempos de espera en la aplicación" +date: "2024-12-19" +category: "dependency" +tags: + ["base de datos", "rendimiento", "sql", "tiempo de espera", "optimización"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Rendimiento de Base de Datos y Optimización de Consultas SQL + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Rendimiento, SQL, Tiempo de espera, Optimización + +## Descripción del Problema + +**Contexto:** Sitios web en producción que experimentan una degradación severa del rendimiento con errores generalizados de tiempo de espera, rastreados hasta consultas SQL problemáticas que causan alta carga en la base de datos. 
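
Antes de los pasos de diagnóstico descritos a continuación, una verificación rápida sobre `pg_stat_activity` suele confirmar si hay sesiones con consultas de larga duración (boceto para PostgreSQL; asume permisos de lectura sobre las vistas de estadísticas del servidor):

```sql
-- Sesiones activas ordenadas por antigüedad de la consulta (PostgreSQL)
SELECT pid,
       now() - query_start AS duracion,
       state,
       left(query, 80) AS consulta
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC;
```

Si las filas más antiguas llevan minutos en estado `active`, la base de datos es el cuello de botella probable y conviene continuar con el análisis siguiente.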
+
+**Síntomas Observados:**
+
+- Sitios de producción funcionando extremadamente lentos
+- Errores de tiempo de espera en múltiples endpoints
+- Alto consumo de CPU en instancias de base de datos
+- Tiempos de respuesta de la aplicación significativamente degradados
+- Usuarios incapaces de completar operaciones normales
+
+**Configuración Relevante:**
+
+- Entorno: Producción
+- Tipo de base de datos: Basada en SQL (PostgreSQL/MySQL)
+- Capa de aplicación: Múltiples servicios web
+- Monitoreo: Alertas de uso de CPU activadas
+
+**Condiciones de Error:**
+
+- Los tiempos de espera ocurren durante períodos de máxima carga
+- Los picos en la utilización de CPU de la base de datos se correlacionan con consultas lentas
+- El problema afecta múltiples componentes de la aplicación simultáneamente
+- La degradación del rendimiento impacta significativamente la experiencia del usuario
+
+## Solución Detallada
+
+
+
+Cuando se experimentan tiempos de espera generalizados en la aplicación, comience con el análisis del rendimiento de la base de datos:
+
+1. **Revise las métricas de la base de datos** en su panel de monitoreo
+2. **Identifique consultas lentas** usando los logs de base de datos o herramientas de monitoreo
+3. **Monitoree el uso de CPU y memoria** en las instancias de la base de datos
+4. **Revise despliegues recientes** que podrían haber introducido consultas problemáticas
+
+```sql
+-- Revisar consultas lentas en PostgreSQL (requiere la extensión
+-- pg_stat_statements; en PostgreSQL 13+ las columnas se llaman
+-- mean_exec_time y total_exec_time)
+SELECT query, mean_time, calls, total_time
+FROM pg_stat_statements
+ORDER BY mean_time DESC
+LIMIT 10;
+
+-- Revisar consultas lentas en MySQL
+SHOW PROCESSLIST;
+SELECT * FROM information_schema.processlist
+WHERE command != 'Sleep' ORDER BY time DESC;
+```
+
+
+
+
+
+Una vez identificadas las consultas problemáticas, aplique estas estrategias de optimización:
+
+**1. 
Análisis de Índices:**
+
+```sql
+-- Verificar índices faltantes
+EXPLAIN ANALYZE SELECT * FROM your_table WHERE problematic_column = 'value';
+
+-- Crear índices apropiados
+CREATE INDEX idx_column_name ON table_name(column_name);
+```
+
+**2. Reescritura de Consultas:**
+
+- Evitar sentencias SELECT \*
+- Usar cláusulas LIMIT para conjuntos de resultados grandes
+- Optimizar operaciones JOIN
+- Considerar el almacenamiento en caché de consultas
+
+**3. Configuración de Base de Datos:**
+
+```ini
+# Ejemplo de optimización en postgresql.conf
+shared_buffers = 256MB
+effective_cache_size = 1GB
+work_mem = 4MB
+maintenance_work_mem = 64MB
+```
+
+
+
+
+
+Implemente un monitoreo integral para prevenir problemas futuros:
+
+**1. Monitoreo de Rendimiento de Consultas:**
+
+- Habilitar registro de consultas lentas
+- Configurar alertas para consultas de larga duración
+- Monitorear planes de ejecución de consultas
+
+**2. Monitoreo de Recursos:**
+
+```yaml
+# Métricas Prometheus para monitoreo de base de datos
+postgresql_exporter:
+  enabled: true
+  metrics:
+    - pg_stat_database
+    - pg_stat_statements
+    - pg_stat_activity
+```
+
+**3. Monitoreo a Nivel de Aplicación:**
+
+- Métricas del pool de conexiones a base de datos
+- Distribución de tiempos de respuesta de consultas
+- Seguimiento de tasa de errores
+
+
+
+
+
+Implemente estas prácticas para evitar futuros problemas de rendimiento en la base de datos:
+
+**1. Proceso de Revisión de Código:**
+
+- Revisar todas las consultas a base de datos antes del despliegue
+- Usar herramientas de análisis de consultas en desarrollo
+- Probar con volúmenes de datos similares a producción
+
+**2. Mantenimiento de Base de Datos:**
+
+```sql
+-- Tareas regulares de mantenimiento
+-- PostgreSQL
+VACUUM ANALYZE;
+REINDEX DATABASE your_database;
+
+-- MySQL
+OPTIMIZE TABLE your_table;
+ANALYZE TABLE your_table;
+```
+
+**3. 
Pruebas de Rendimiento:**
+
+- Realizar pruebas de carga a las consultas antes de producción
+- Monitorear el rendimiento de consultas en ambientes de staging
+- Configurar pruebas automáticas de regresión de rendimiento
+
+**4. Gestión de Conexiones:**
+
+```yaml
+# Configuración del pool de conexiones de base de datos
+database:
+  pool_size: 20
+  max_connections: 100
+  connection_timeout: 30s
+  idle_timeout: 300s
+```
+
+
+
+
+
+Cuando se enfrentan problemas críticos de rendimiento en base de datos:
+
+**1. Acciones Inmediatas:**
+
+- Identificar y terminar consultas de larga duración si es necesario
+- Escalar recursos de base de datos temporalmente
+- Habilitar caché de consultas si está disponible
+- Redirigir tráfico a sistemas de respaldo si es posible
+
+**2. Comunicación:**
+
+- Notificar a las partes interesadas sobre el problema
+- Proveer actualizaciones regulares sobre el progreso de la resolución
+- Documentar el incidente para análisis post-mortem
+
+**3. Pasos de Recuperación:**
+
+```sql
+-- Terminar consultas problemáticas (PostgreSQL), excluyendo la sesión propia
+SELECT pg_terminate_backend(pid)
+FROM pg_stat_activity
+WHERE state = 'active'
+  AND pid <> pg_backend_pid()
+  AND query_start < NOW() - INTERVAL '5 minutes';
+
+-- Recargar configuración de base de datos
+SELECT pg_reload_conf();
+```
+
+
+
+---
+
+_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-postgresql-restore-errors.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-postgresql-restore-errors.mdx
new file mode 100644
index 000000000..c30e45051
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-postgresql-restore-errors.mdx
@@ -0,0 +1,162 @@
+---
+sidebar_position: 3
+title: "Errores de Restauración de Base de Datos PostgreSQL"
+description: "Soluciones para errores comunes en la restauración de bases de datos PostgreSQL incluyendo 
conflictos de restricciones e índices" +date: "2024-01-15" +category: "dependency" +tags: + ["postgresql", "base de datos", "restauración", "restricciones", "índices"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Errores de Restauración de Base de Datos PostgreSQL + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** PostgreSQL, Base de datos, Restauración, Restricciones, Índices + +## Descripción del Problema + +**Contexto:** Al intentar restaurar un volcado de base de datos PostgreSQL, los usuarios encuentran errores relacionados con dependencias de restricciones y conflictos de índices durante el proceso de restauración. + +**Síntomas Observados:** + +- Error: "no se puede eliminar el índice active_storage_blobs_pkey porque la restricción active_storage_blobs_pkey en la tabla active_storage_blobs lo requiere" +- Error: "no se puede eliminar la restricción users_pkey en la tabla public.users porque otros objetos dependen de ella" +- Error: "la columna 'id' no existe" al recrear índices +- El proceso de restauración falla durante la manipulación de índices y restricciones + +**Configuración Relevante:** + +- Base de datos: PostgreSQL (instancia RDS) +- Variables de entorno: `SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS`, `SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME`, `SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME` +- Herramienta de restauración: `pg_restore` +- La base de datos contiene tablas Rails ActiveStorage y relaciones complejas de claves foráneas + +**Condiciones del Error:** + +- Ocurre durante la restauración de la base de datos desde archivos de volcado +- Sucede al intentar eliminar restricciones de clave primaria que tienen claves foráneas dependientes +- Aparece cuando el script de restauración intenta eliminar y recrear índices en orden incorrecto +- El error persiste tras múltiples intentos de restauración + +## Solución Detallada + + + +Cuando se presentan errores de 
restricciones e índices, la solución más confiable es comenzar con una base de datos completamente limpia: + +```bash +# Conectarse a la instancia RDS y eliminar/crear la base de datos +psql -h "${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS}" \ +-U "${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME}" \ +-d postgres \ +-c "DROP DATABASE IF EXISTS ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME};" \ +-c "CREATE DATABASE ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME};" +``` + +Este enfoque elimina todas las restricciones e índices existentes que podrían entrar en conflicto con el proceso de restauración. + + + + + +Algunas aplicaciones requieren que existan esquemas específicos antes de la restauración. Cree los esquemas necesarios: + +```bash +# Conectarse a su RDS y crear los esquemas requeridos +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "CREATE SCHEMA IF NOT EXISTS heroku_ext;" +``` + +Esto es especialmente importante para aplicaciones migradas desde Heroku u otras plataformas que usan esquemas personalizados. + + + + + +Si necesita modificar el script de restauración, asegúrese de que las restricciones se eliminen en el orden correcto: + +1. **Eliminar primero las restricciones de clave foránea** +2. **Luego eliminar las restricciones de clave primaria** +3. **Finalmente eliminar los índices** + +El error ocurre porque el script intenta eliminar las claves primarias antes que las claves foráneas que dependen de ellas. + +```sql +-- Ejemplo de orden correcto: +-- 1. Eliminar restricciones de clave foránea +ALTER TABLE campaign_agency_users DROP CONSTRAINT IF EXISTS fk_rails_80e17a26a2; + +-- 2. Eliminar restricciones de clave primaria +ALTER TABLE users DROP CONSTRAINT IF EXISTS users_pkey; + +-- 3. 
Eliminar índices +DROP INDEX IF EXISTS active_storage_blobs_pkey CASCADE; +``` + + + + + +Si la restauración estándar sigue fallando, pruebe estas alternativas: + +**Método 1: Usar pg_restore con opciones específicas** + +```bash +pg_restore --verbose --clean --no-acl --no-owner \ +-h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +your_dump_file.dump +``` + +**Método 2: Restaurar solo datos (omitir esquema)** + +```bash +pg_restore --data-only --verbose \ +-h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +your_dump_file.dump +``` + +**Método 3: Creación manual del esquema** + +1. Crear el esquema de la base de datos manualmente usando migraciones Rails o scripts SQL +2. Luego restaurar solo los datos usando la opción `--data-only` + + + + + +Después de la restauración, verifique la integridad de la base de datos: + +```bash +# Verificar que existan todas las tablas +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "\dt" + +# Verificar restricciones +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "SELECT conname FROM pg_constraint WHERE contype = 'f';" + +# Verificar integridad de datos +psql -h ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_ADDRESS} \ +-U ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_USERNAME} \ +-d ${SUPRADBPRODPOSTGRESQL_POSTGRESQL_NAME} \ +-c "SELECT COUNT(*) FROM users;" +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-management.mdx 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-management.mdx new file mode 100644 index 000000000..79b737cd5 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-management.mdx @@ -0,0 +1,239 @@ +--- +sidebar_position: 3 +title: "Gestión de Pods de Restauración de Base de Datos" +description: "Gestión de pods de restauración de base de datos de larga duración y su impacto en los despliegues" +date: "2025-03-26" +category: "dependency" +tags: + [ + "base de datos", + "restauración", + "pod", + "despliegue", + "solución de problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gestión de Pods de Restauración de Base de Datos + +**Fecha:** 26 de marzo de 2025 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Restauración, Pod, Despliegue, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas en los despliegues cuando los pods de restauración de base de datos están en ejecución durante períodos prolongados, causando conflictos con los procesos de construcción y despliegue en SleakOps. 
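
Para detectar rápidamente pods de restauración que llevan días en ejecución, el filtrado por la columna AGE puede ilustrarse sobre una salida simulada de `kubectl get pods` (boceto autocontenido; en un clúster real se canalizaría la salida del comando en lugar del texto de ejemplo):

```shell
#!/usr/bin/env bash
# Filtra pods con "restore" en el nombre cuya AGE está expresada en días
# (p. ej. "3d"), sobre una salida simulada de `kubectl get pods`.
set -e

salida_simulada=$(cat <<'EOF'
NAME                  READY   STATUS      RESTARTS   AGE
web-7d9f8b6c4-abcde   1/1     Running     0          5h
restoredb-x1y2z       1/1     Running     0          3d
migrate-job-aaaaa     0/1     Completed   0          12h
EOF
)

# En un clúster real: kubectl get pods | grep restore | awk '$5 ~ /[0-9]+d$/'
pods_antiguos=$(echo "$salida_simulada" | grep restore | awk '$5 ~ /[0-9]+d$/ {print $1, $5}')
echo "$pods_antiguos"   # → restoredb-x1y2z 3d
```

Cualquier pod listado por este filtro es candidato a revisión, siguiendo los pasos de monitoreo y limpieza descritos a continuación.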
+ +**Síntomas Observados:** + +- Tiempos de espera en la construcción mientras el pod `restoredb` está en ejecución +- Fallos en el despliegue debido a operaciones de base de datos en conflicto +- Pods de restauración de larga duración (ejecutándose durante días) +- Procesos de construcción que no completan exitosamente +- Aparición de múltiples conjuntos de réplicas para servicios web + +**Configuración Relevante:** + +- Tipo de pod: `restoredb` (operación de restauración de base de datos) +- Duración: Ejecutándose durante varios días (3-4 días) +- Impacto: Afecta la canalización de construcción y despliegue +- Plataforma: Entorno Kubernetes de SleakOps + +**Condiciones de Error:** + +- Tiempos de espera en la construcción cuando el pod de restauración está activo +- Canalización de despliegue bloqueada por operaciones de restauración en curso +- Conflictos de recursos entre pods de restauración y aplicación +- Imposibilidad de reducir la escala de pods de restauración cuando no se necesitan + +## Solución Detallada + + + +Los pods de restauración de base de datos pueden interferir con las operaciones normales de despliegue porque: + +1. **Bloqueo de Recursos:** El proceso de restauración puede bloquear recursos de base de datos necesarios para la aplicación +2. **Conflictos de Red:** Las conexiones a la base de datos pueden estar monopolizadas por la operación de restauración +3. **Uso de Memoria/CPU:** Las operaciones de restauración de larga duración consumen recursos del clúster +4. **Conflictos de Estado:** Los pods de aplicación pueden fallar las comprobaciones de salud mientras la base de datos se está restaurando + +Por eso, las construcciones y despliegues a menudo fallan mientras un pod de restauración está en ejecución. + + + + + +Para monitorear el estado de tu pod de restauración: + +**Usando el Panel de SleakOps:** + +1. Ve a **Workloads** → **Jobs** +2. Busca trabajos de restauración como `restoredb` o similares +3. 
Revisa las columnas **Estado** y **Duración**
+
+**Usando Lens o kubectl:**
+
+```bash
+# Listar todos los pods con "restore" en el nombre
+kubectl get pods | grep restore
+
+# Ver logs específicos del pod de restauración
+kubectl logs -f <nombre-del-pod>
+
+# Ver detalles y estado del pod
+kubectl describe pod <nombre-del-pod>
+```
+
+
+
+
+
+**Cuando necesitas el pod de restauración más adelante:**
+
+Si planeas usar la funcionalidad de restauración pronto (por ejemplo, mañana por la mañana), puedes reducir temporalmente la escala:
+
+```bash
+# Reducir la escala del trabajo de restauración (si es un deployment)
+kubectl scale deployment restoredb --replicas=0
+
+# O eliminar el pod específico (si es un pod independiente)
+kubectl delete pod <nombre-del-pod>
+```
+
+**Cuando quieres detenerlo completamente:**
+
+```bash
+# Eliminar el trabajo por completo
+kubectl delete job <nombre-del-job>
+
+# O a través del panel de SleakOps
+# Ve a Workloads → Jobs → Elimina el trabajo de restauración
+```
+
+**Importante:** Siempre asegúrate de que la operación de restauración esté completa o pueda interrumpirse de forma segura antes de detenerla.
+
+
+
+
+
+Para evitar conflictos con las operaciones regulares:
+
+**1. Programa durante ventanas de mantenimiento:**
+
+- Planifica las restauraciones durante períodos de bajo tráfico
+- Coordina con los horarios de despliegue
+- Notifica a los miembros del equipo sobre las operaciones de restauración planificadas
+
+**2. Usa entornos separados:**
+
+- Realiza las restauraciones primero en staging
+- Prueba el proceso de restauración antes de producción
+- Mantén los despliegues de producción separados de las operaciones de restauración
+
+**3. 
Monitorea el uso de recursos:**
+
+```yaml
+# Ejemplo de límites de recursos para trabajos de restauración
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: database-restore
+spec:
+  template:
+    spec:
+      containers:
+        - name: restore
+          resources:
+            limits:
+              memory: "2Gi"
+              cpu: "1000m"
+            requests:
+              memory: "1Gi"
+              cpu: "500m"
+```
+
+
+
+
+
+**Si las construcciones están agotando el tiempo de espera:**
+
+1. **Verifica si la restauración aún es necesaria:**
+
+   - Confirma el estado de la operación de restauración
+   - Determina si puede detenerse de forma segura
+
+2. **Solución temporal:**
+
+   ```bash
+   # Detener temporalmente el pod de restauración
+   kubectl delete pod <nombre-del-pod>
+
+   # Reintenta tu construcción
+   # La restauración puede reiniciarse después si es necesario
+   ```
+
+3. **Solución a largo plazo:**
+   - Programa las restauraciones durante ventanas de mantenimiento
+   - Usa instancias de base de datos separadas para pruebas de restauración
+   - Implementa tiempos máximos para trabajos de restauración
+
+**Si ves múltiples conjuntos de réplicas:**
+
+Esto es normal durante despliegues pero puede indicar problemas:
+
+```bash
+# Ver estado de los conjuntos de réplicas
+kubectl get rs
+
+# Limpia conjuntos de réplicas antiguos si es necesario
+kubectl delete rs <nombre-del-replicaset-antiguo>
+```
+
+
+
+
+
+**1. Implementa una gestión adecuada de trabajos:**
+
+```yaml
+# Añade activeDeadlineSeconds para evitar ejecuciones infinitas
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: database-restore
+spec:
+  activeDeadlineSeconds: 3600 # Tiempo máximo de 1 hora
+  backoffLimit: 3
+  template:
+    # ... resto de la especificación del trabajo
+```
+
+**2. Usa cuotas de recursos:**
+
+```yaml
+apiVersion: v1
+kind: ResourceQuota
+metadata:
+  name: restore-quota
+spec:
+  hard:
+    requests.cpu: "2"
+    requests.memory: 4Gi
+    limits.cpu: "4"
+    limits.memory: 8Gi
+```
+
+**3. 
Configura alertas de monitoreo:** + +- Alerta cuando los trabajos de restauración duren más de lo esperado +- Monitorea el uso de recursos durante las operaciones de restauración +- Configura notificaciones para despliegues fallidos + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 26 de marzo de 2025 basándose en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-procedures.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-procedures.mdx new file mode 100644 index 000000000..1dfe5c5c1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-restore-pod-procedures.mdx @@ -0,0 +1,416 @@ +--- +sidebar_position: 15 +title: "Restauración de Base de Datos en Entorno Pod" +description: "Procedimientos para restaurar volcados de bases de datos en pods de Kubernetes con resiliencia en la conexión" +date: "2024-03-21" +category: "dependency" +tags: ["base de datos", "restauración", "volcado", "pod", "tmux", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Restauración de Base de Datos en Entorno Pod + +**Fecha:** 21 de marzo de 2024 +**Categoría:** Dependencia +**Etiquetas:** Base de datos, Restauración, Volcado, Pod, Tmux, Kubernetes + +## Descripción del Problema + +**Contexto:** Al realizar operaciones de restauración de bases de datos en pods de Kubernetes, los usuarios necesitan procedimientos confiables para manejar archivos de volcado grandes manteniendo la estabilidad de la conexión durante procesos largos de restauración. 
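
La limpieza por antigüedad de volcados que aplica el script de esta página puede probarse de forma aislada con un directorio temporal (boceto autocontenido; rutas y fechas simuladas, asume GNU coreutils por `touch -d`):

```shell
#!/usr/bin/env bash
# Demuestra la limpieza de volcados antiguos con `find -mtime`,
# creando archivos con fechas de modificación simuladas.
set -e

DUMP_DIR="$(mktemp -d)"
MAX_DUMP_AGE_DAYS=7

touch -d "10 days ago" "$DUMP_DIR/volcado_viejo.sql"
touch "$DUMP_DIR/volcado_reciente.sql"

# Elimina solo los volcados con más de MAX_DUMP_AGE_DAYS días
find "$DUMP_DIR" -name "*.sql" -type f -mtime +"$MAX_DUMP_AGE_DAYS" -delete

ls "$DUMP_DIR"   # → volcado_reciente.sql
```

Verificar este comportamiento en un entorno de prueba evita sorpresas antes de ejecutar la limpieza sobre los volúmenes reales del pod.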
+ +**Síntomas Observados:** + +- Caídas de conexión durante operaciones largas de restauración de base de datos +- Problemas de espacio en volumen debido a acumulación de archivos de volcado antiguos +- Necesidad de persistencia de sesión durante procesos de restauración +- Requisito de monitoreo del progreso de la restauración + +**Configuración Relevante:** + +- Entorno: Restauración de base de datos en producción +- Plataforma: Pods de Kubernetes +- Herramientas: Archivos de volcado de base de datos, tmux para gestión de sesiones +- Almacenamiento: Volúmenes de pod con espacio limitado + +**Condiciones de Error:** + +- Tiempo de espera de conexión durante operaciones de restauración +- Espacio insuficiente en disco para archivos de volcado +- Interrupción del proceso por problemas de red +- Pérdida del progreso de restauración cuando la conexión se cae + +## Solución Detallada + + + +El script mejorado de restauración incluye varias optimizaciones: + +```bash +#!/bin/bash +# Script mejorado para restauración de base de datos + +set -e + +# Configuración +DUMP_DIR="/data/dumps" +LOG_FILE="/data/logs/restore_$(date +%Y%m%d_%H%M%S).log" +MAX_DUMP_AGE_DAYS=7 + +# Función para limpiar volcados antiguos +clean_old_dumps() { + echo "Limpiando volcados con más de ${MAX_DUMP_AGE_DAYS} días..." | tee -a $LOG_FILE + find $DUMP_DIR -name "*.sql" -type f -mtime +$MAX_DUMP_AGE_DAYS -delete + find $DUMP_DIR -name "*.dump" -type f -mtime +$MAX_DUMP_AGE_DAYS -delete + echo "Volcados antiguos limpiados con éxito" | tee -a $LOG_FILE +} + +# Función para verificar espacio disponible +check_disk_space() { + AVAILABLE_SPACE=$(df $DUMP_DIR | awk 'NR==2 {print $4}') + REQUIRED_SPACE=1048576 # 1GB en KB + + if [ $AVAILABLE_SPACE -lt $REQUIRED_SPACE ]; then + echo "Advertencia: Espacio en disco bajo. 
Disponible: ${AVAILABLE_SPACE}KB" | tee -a $LOG_FILE
+        clean_old_dumps
+    fi
+}
+
+# Función principal de restauración
+restore_database() {
+    local dump_file=$1
+    local database_name=$2
+
+    echo "Iniciando restauración de base de datos: $dump_file -> $database_name" | tee -a $LOG_FILE
+    echo "Hora de inicio: $(date)" | tee -a $LOG_FILE
+
+    # Restaurar con monitoreo de progreso
+    # (DB_HOST y DB_USER deben estar definidos en el entorno)
+    pv $dump_file | psql -h $DB_HOST -U $DB_USER -d $database_name 2>&1 | tee -a $LOG_FILE
+
+    echo "Restauración completada a las: $(date)" | tee -a $LOG_FILE
+}
+
+# Verificaciones previas a la restauración
+check_disk_space
+clean_old_dumps
+
+# Ejecutar restauración
+restore_database "$1" "$2"
+```
+
+
+
+
+
+Para manejar caídas de conexión durante operaciones largas de restauración, use tmux:
+
+```bash
+# Iniciar una nueva sesión tmux para la restauración
+kubectl exec -it <nombre-del-pod> -- tmux new-session -d -s restore
+
+# Adjuntarse a la sesión
+kubectl exec -it <nombre-del-pod> -- tmux attach-session -t restore
+
+# Dentro de la sesión tmux, ejecutar la restauración
+./restore_script.sh /data/dumps/production_dump.sql production_db
+
+# Desconectarse de la sesión (Ctrl+b, luego d)
+# La sesión sigue ejecutándose incluso si la conexión se cae
+
+# Volver a conectarse después para verificar el progreso
+kubectl exec -it <nombre-del-pod> -- tmux attach-session -t restore
+
+# Listar todas las sesiones
+kubectl exec -it <nombre-del-pod> -- tmux list-sessions
+```
+
+**Beneficios de usar tmux:**
+
+- Persistencia de sesión ante caídas de conexión
+- Capacidad para monitorear el progreso remotamente
+- Múltiples ventanas para operaciones paralelas
+- Compartición de sesión entre miembros del equipo
+
+
+
+
+
+Para prevenir problemas de espacio en volumen durante las operaciones de restauración:
+
+```yaml
+# Configuración del pod con almacenamiento adecuado
+apiVersion: v1
+kind: Pod
+metadata:
+  name: db-restore-pod
+spec:
+  containers:
+    - name: restore-container
+      image: postgres:14
+      volumeMounts:
+        - name: dump-storage
+          mountPath: /data/dumps
+        - name: logs-storage
+          mountPath: /data/logs
+  volumes:
+    # El espacio disponible para los volcados (p. ej. 50Gi) se define en el
+    # PVC "dump-pvc"; "storage" no es un resource request válido del contenedor
+    - name: dump-storage
+      persistentVolumeClaim:
+        claimName: dump-pvc
+    - name: logs-storage
+      emptyDir: {}
+```
+
+**Comandos para gestión de espacio:**
+
+```bash
+# Verificar uso actual
+kubectl exec -it <nombre-del-pod> -- df -h /data/dumps
+
+# Limpiar volcados antiguos manualmente
+kubectl exec -it <nombre-del-pod> -- find /data/dumps -name "*.sql" -mtime +7 -delete
+
+# Monitorear espacio durante la restauración
+kubectl exec -it <nombre-del-pod> -- watch "df -h /data/dumps"
+```
+
+
+
+
+
+Para monitorear el proceso de restauración de manera efectiva:
+
+```bash
+# Usando pv (pipe viewer) para monitoreo de progreso, dentro del pod
+kubectl exec -it <nombre-del-pod> -- sh -c "pv /data/dumps/large_dump.sql | psql -h localhost -U postgres -d target_db"
+
+# Monitorear logs en tiempo real
+kubectl exec -it <nombre-del-pod> -- tail -f /data/logs/restore_*.log
+
+# Verificar crecimiento del tamaño de la base de datos
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "SELECT pg_size_pretty(pg_database_size('target_db'));"
+
+# Monitorear conexiones activas
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE datname='target_db';"
+```
+
+**Script para monitoreo de progreso:**
+
+```bash
+#!/bin/bash
+# progress_monitor.sh
+
+DB_NAME=$1
+while true; do
+    SIZE=$(psql -h localhost -U postgres -t -c "SELECT pg_size_pretty(pg_database_size('$DB_NAME'));")
+    echo "$(date): Tamaño de la base de datos: $SIZE"
+    sleep 30
+done
+```
+
+
+
+
+
+Después de completar la restauración, es crucial verificar la integridad de los datos:
+
+```bash
+# Verificar integridad de la base de datos
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -d target_db -c "SELECT count(*) FROM information_schema.tables;"
+
+# Comparar conteos de registros con la base de datos original
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -d target_db -c "
+SELECT
+    schemaname,
+    tablename,
+    n_tup_ins as inserts,
+    n_tup_upd as
updates,
+    n_tup_del as deletes
+FROM pg_stat_user_tables
+ORDER BY schemaname, tablename;"
+
+# Verificar índices y restricciones
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -d target_db -c "
+SELECT
+    indexname,
+    indexdef
+FROM pg_indexes
+WHERE schemaname = 'public';"
+```
+
+**Script de validación completo:**
+
+```bash
+#!/bin/bash
+# validate_restore.sh
+
+DB_NAME=$1
+LOG_FILE="/data/logs/validation_$(date +%Y%m%d_%H%M%S).log"
+
+echo "Iniciando validación de restauración para: $DB_NAME" | tee -a $LOG_FILE
+
+# Verificar conectividad
+if psql -h localhost -U postgres -d $DB_NAME -c "SELECT 1;" > /dev/null 2>&1; then
+    echo "✓ Conectividad a la base de datos: OK" | tee -a $LOG_FILE
+else
+    echo "✗ Error de conectividad a la base de datos" | tee -a $LOG_FILE
+    exit 1
+fi
+
+# Verificar tablas principales
+TABLE_COUNT=$(psql -h localhost -U postgres -d $DB_NAME -t -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
+echo "✓ Número de tablas: $TABLE_COUNT" | tee -a $LOG_FILE
+
+# Verificar datos de muestra
+SAMPLE_DATA=$(psql -h localhost -U postgres -d $DB_NAME -t -c "SELECT count(*) FROM users;" 2>/dev/null || echo "0")
+echo "✓ Datos de muestra (usuarios): $SAMPLE_DATA registros" | tee -a $LOG_FILE
+
+echo "Validación completada: $(date)" | tee -a $LOG_FILE
+```
+
+
+
+
+
+**Problemas frecuentes durante la restauración y sus soluciones:**
+
+1. **Error de permisos:**
+
+```bash
+# Verificar permisos del usuario
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "SELECT rolname, rolsuper, rolcreatedb FROM pg_roles WHERE rolname = 'tu_usuario';"
+
+# Otorgar permisos necesarios
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "ALTER USER tu_usuario CREATEDB;"
+```
+
+2.
**Problemas de codificación:**
+
+```bash
+# Verificar codificación de la base de datos
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "SELECT datname, encoding FROM pg_database WHERE datname = 'tu_db';"
+
+# Crear base de datos con codificación específica
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "CREATE DATABASE nueva_db WITH ENCODING 'UTF8';"
+```
+
+3. **Conflictos de versión:**
+
+```bash
+# Verificar versión de PostgreSQL
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "SELECT version();"
+
+# Actualizar estadísticas después de la restauración
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -d tu_db -c "ANALYZE;"
+```
+
+4. **Problemas de espacio:**
+
+```bash
+# Monitorear espacio durante la restauración
+kubectl exec -it <nombre-del-pod> -- watch "df -h /var/lib/postgresql/data"
+
+# Limpiar logs de transacciones si es necesario
+kubectl exec -it <nombre-del-pod> -- psql -h localhost -U postgres -c "CHECKPOINT;"
+```
+
+**Mejores prácticas para evitar problemas:**
+
+- Siempre hacer backup antes de restaurar
+- Verificar compatibilidad de versiones
+- Monitorear recursos durante el proceso
+- Usar transacciones para operaciones críticas
+- Documentar todos los pasos realizados
+
+
+
+
+
+**Script automatizado completo para restauración:**
+
+```bash
+#!/bin/bash
+# automated_restore.sh - Script completo de restauración automatizada
+# Requiere las funciones definidas en las secciones anteriores
+# (check_disk_space, restore_database, validate_restore, etc.)
+
+set -euo pipefail
+
+# Configuración
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+CONFIG_FILE="${SCRIPT_DIR}/restore.conf"
+LOG_DIR="/data/logs"
+DUMP_DIR="/data/dumps"
+
+# Cargar configuración
+source "$CONFIG_FILE"
+
+# Funciones
+log() {
+    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
+}
+
+cleanup() {
+    log "Limpiando recursos temporales..."
+ # Limpiar archivos temporales si existen + rm -f /tmp/restore_*.tmp +} + +# Trap para limpieza en caso de error +trap cleanup EXIT + +# Función principal +main() { + local dump_file="$1" + local target_db="$2" + + log "=== Iniciando proceso de restauración automatizada ===" + log "Archivo de volcado: $dump_file" + log "Base de datos destino: $target_db" + + # Verificaciones previas + check_prerequisites + check_disk_space + backup_current_db "$target_db" + + # Proceso de restauración + log "Iniciando restauración..." + restore_database "$dump_file" "$target_db" + + # Verificación post-restauración + validate_restore "$target_db" + + log "=== Restauración completada exitosamente ===" +} + +# Ejecutar función principal +main "$@" +``` + +**Archivo de configuración (restore.conf):** + +```bash +# restore.conf +DB_HOST="localhost" +DB_USER="postgres" +DB_PORT="5432" +LOG_FILE="/data/logs/restore_$(date +%Y%m%d_%H%M%S).log" +BACKUP_RETENTION_DAYS=7 +MAX_RESTORE_TIME_HOURS=6 +NOTIFICATION_EMAIL="admin@empresa.com" +``` + +**Mejores prácticas para automatización:** + +- Usar archivos de configuración separados +- Implementar logging detallado +- Agregar verificaciones de salud +- Configurar notificaciones automáticas +- Mantener backups de seguridad +- Documentar todos los procesos + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 21 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-typeorm-ssl-connection-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-typeorm-ssl-connection-error.mdx new file mode 100644 index 000000000..49dcbea55 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/database-typeorm-ssl-connection-error.mdx @@ -0,0 +1,234 @@ +--- +sidebar_position: 3 +title: "Error de Conexión SSL de TypeORM con RDS" +description: "Solución para errores de conexión SSL en 
pg_hba.conf al conectar TypeORM a bases de datos RDS" +date: "2024-01-15" +category: "dependency" +tags: ["typeorm", "rds", "ssl", "database", "postgresql"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Conexión SSL de TypeORM con RDS + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** TypeORM, RDS, SSL, Base de datos, PostgreSQL + +## Descripción del Problema + +**Contexto:** El usuario experimenta problemas de conexión SSL al intentar conectar TypeORM a una base de datos PostgreSQL en RDS en ambiente de producción, mientras que la misma configuración funciona en desarrollo. + +**Síntomas Observados:** + +- Error durante la ejecución de migraciones: `no pg_hba.conf entry for host "10.130.96.232", user "postgres", database "rattlesnake", no encryption` +- Las migraciones no se ejecutan en ambiente de producción +- La misma configuración funciona en ambientes de desarrollo/pruebas +- La aplicación no puede conectarse a la base de datos + +**Configuración Relevante:** + +- Base de datos: PostgreSQL en AWS RDS +- ORM: TypeORM +- Ambiente: Diferencia entre producción y desarrollo +- Aplicación de SSL: RDS requiere conexiones SSL + +**Condiciones del Error:** + +- El error ocurre durante la ejecución de migraciones con TypeORM +- Sucede cuando la aplicación intenta conectarse a la base de datos +- Es específico del ambiente de producción +- Relacionado con requisitos de SSL/encriptación + +## Solución Detallada + + + +El error `no pg_hba.conf entry for host... no encryption` indica que: + +1. **RDS requiere conexiones SSL** en producción +2. **TypeORM no está configurado** para usar SSL +3. **Los ambientes de desarrollo** pueden tener requisitos SSL diferentes +4. **pg_hba.conf** en RDS está configurado para rechazar conexiones no encriptadas + +Esta es una diferencia común de seguridad entre configuraciones de base de datos de desarrollo y producción. 
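En scripts de diagnóstico, esta condición se puede detectar a partir del propio mensaje de error. Un sketch hipotético (la función `needs_ssl` no forma parte de TypeORM ni de SleakOps):

```shell
# needs_ssl: devuelve 0 si el mensaje corresponde al rechazo de pg_hba.conf
# por conexión sin cifrar (nombre de función hipotético).
needs_ssl() {
  case "$1" in
    *"no pg_hba.conf entry"*"no encryption"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Si la condición se cumple, el paso siguiente es habilitar `ssl` en la configuración de TypeORM, como se muestra a continuación.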
+ + + + + +Agrega la configuración SSL a las opciones de conexión de TypeORM: + +```typescript +// Configuración en TypeScript +const connectionOptions: ConnectionOptions = { + type: "postgres", + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT || "5432"), + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + // Agregar configuración SSL + ssl: { + rejectUnauthorized: false, + }, + // ... otras opciones +}; +``` + +```javascript +// Configuración en JavaScript +module.exports = { + type: "postgres", + host: process.env.DB_HOST, + port: process.env.DB_PORT || 5432, + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + ssl: { + rejectUnauthorized: false, + }, +}; +``` + + + + + +Para manejar diferentes requisitos SSL según el ambiente: + +```typescript +const sslConfig = + process.env.NODE_ENV === "production" + ? { + ssl: { + rejectUnauthorized: false, + }, + } + : {}; + +const connectionOptions: ConnectionOptions = { + type: "postgres", + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT || "5432"), + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + ...sslConfig, + // ... otras opciones +}; +``` + +O usar variables de entorno: + +```typescript +const connectionOptions: ConnectionOptions = { + // ... otra configuración + ssl: + process.env.DB_SSL_ENABLED === "true" + ? 
{ + rejectUnauthorized: + process.env.DB_SSL_REJECT_UNAUTHORIZED !== "false", + } + : false, +}; +``` + + + + + +Al ejecutar migraciones, asegúrate que la configuración del CLI de TypeORM incluya SSL: + +```json +// ormconfig.json +{ + "type": "postgres", + "host": "tu-endpoint-rds", + "port": 5432, + "username": "postgres", + "password": "tu-contraseña", + "database": "tu-base-de-datos", + "ssl": { + "rejectUnauthorized": false + }, + "migrations": ["src/migrations/*.ts"], + "cli": { + "migrationsDir": "src/migrations" + } +} +``` + +Luego ejecuta las migraciones: + +```bash +# Usando npm/pnpm +pnpm typeorm migration:run + +# O con configuración explícita +pnpm typeorm migration:run -f ormconfig.json +``` + + + + + +**Seguridad de la solución actual:** + +- `rejectUnauthorized: false` deshabilita la validación del certificado +- Aún usa conexión encriptada (SSL/TLS) +- Aceptable para la mayoría de escenarios de producción con RDS + +**Alternativas más seguras:** + +1. **Usar certificado CA de RDS:** + +```typescript +ssl: { + ca: fs.readFileSync('rds-ca-2019-root.pem').toString(), + rejectUnauthorized: true +} +``` + +2. **Modo SSL basado en ambiente:** + +```typescript +ssl: { + mode: 'require', // o 'verify-full' + rejectUnauthorized: process.env.NODE_ENV === 'production' +} +``` + +3. **Usar autenticación de base de datos AWS IAM** (para mayor seguridad) + + + + + +Si la configuración SSL no resuelve el problema: + +1. **Verifica la configuración SSL de RDS:** + + - Comprueba si el parámetro `rds.force_ssl` está habilitado + - Verifica que el grupo de seguridad permita conexiones desde tu aplicación + +2. **Prueba la conexión manualmente:** + +```bash +psql "host=tu-endpoint-rds port=5432 dbname=tu-db user=postgres sslmode=require" +``` + +3. **Verifica la compatibilidad de la versión de TypeORM:** + + - Asegúrate de usar una versión reciente de TypeORM + - Algunas versiones antiguas tenían problemas con la configuración SSL + +4. 
**Verifica las variables de entorno:** + - Asegúrate que todas las variables de conexión a la base de datos estén correctamente configuradas en producción + - Comprueba que el endpoint de la base de datos sea accesible desde tu ambiente de producción + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx new file mode 100644 index 000000000..475a2a382 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependencies-monitoring-graphs-not-displaying.mdx @@ -0,0 +1,128 @@ +--- +sidebar_position: 3 +title: "Gráficos de Monitoreo de Dependencias No Se Muestran" +description: "Solución para gráficos de monitoreo de RDS y OpenSearch que aparecen vacíos o no cargan correctamente" +date: "2025-01-22" +category: "dependency" +tags: ["monitoreo", "rds", "opensearch", "gráficos", "dependencias"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gráficos de Monitoreo de Dependencias No Se Muestran + +**Fecha:** 22 de enero de 2025 +**Categoría:** Dependencia +**Etiquetas:** Monitoreo, RDS, OpenSearch, Gráficos, Dependencias + +## Descripción del Problema + +**Contexto:** Al acceder a la sección de monitoreo para dependencias (RDS, OpenSearch, etc.) en SleakOps, la página de monitoreo carga pero los gráficos de uso no muestran datos o permanecen vacíos. 
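Para distinguir si el problema está en la recolección de métricas o solo en la visualización, se puede consultar CloudWatch directamente y comprobar si la respuesta trae datapoints. Un helper mínimo (hipotético; usa `python3` para parsear el JSON que devuelve la AWS CLI):

```shell
#!/bin/bash
# has_datapoints: lee por stdin la respuesta JSON de
# `aws cloudwatch get-metric-statistics` y devuelve 0 si hay Datapoints.
has_datapoints() {
  python3 -c 'import json,sys; sys.exit(0 if json.load(sys.stdin).get("Datapoints") else 1)'
}
```

Si CloudWatch devuelve datapoints pero los gráficos siguen vacíos, el problema está en la visualización de la plataforma y no en la recolección de métricas.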
+ +**Síntomas Observados:** + +- La página de monitoreo carga correctamente +- Los gráficos aparecen vacíos o sin datos +- El problema afecta tanto a dependencias RDS como OpenSearch +- El problema ocurre específicamente con dependencias antiguas + +**Configuración Relevante:** + +- Tipo de dependencia: RDS, OpenSearch +- Monitoreo: Activado +- Dependencias creadas antes de una actualización específica de la plataforma +- Visualización de gráficos: Vacía/blanca + +**Condiciones de Error:** + +- Ocurre en dependencias creadas antes de que se implementara el almacenamiento de variables de monitoreo +- Afecta múltiples tipos de dependencia (RDS, OpenSearch) +- La recolección de datos de monitoreo funciona pero la visualización falla +- El problema es consistente en las dependencias afectadas + +## Solución Detallada + + + +Este problema ocurre debido a una variable faltante en la base de datos de la plataforma que es necesaria para la visualización de gráficos de monitoreo. Cuando se crearon ciertas dependencias, esta variable de monitoreo no se almacenó, causando que los gráficos no muestren datos aunque el sistema de monitoreo esté recopilando métricas. + +El problema afecta específicamente a: + +- Dependencias creadas antes de que se implementara el almacenamiento de la variable de monitoreo +- Dependencias tanto de RDS como de OpenSearch +- Cualquier dependencia que dependa de esta configuración específica de monitoreo + + + + + +Mientras se espera la corrección en la plataforma, puedes probar estas soluciones temporales: + +1. **Consultar directamente CloudWatch**: Accede a tu consola de AWS CloudWatch para ver las métricas de RDS y OpenSearch directamente +2. 
**Usar AWS CLI**: Consulta las métricas usando comandos AWS CLI: + +```bash +# Para métricas de RDS +aws cloudwatch get-metric-statistics \ + --namespace AWS/RDS \ + --metric-name CPUUtilization \ + --dimensions Name=DBInstanceIdentifier,Value=tu-instancia-db \ + --start-time 2025-01-22T00:00:00Z \ + --end-time 2025-01-22T23:59:59Z \ + --period 3600 \ + --statistics Average + +# Para métricas de OpenSearch +aws cloudwatch get-metric-statistics \ + --namespace AWS/ES \ + --metric-name CPUUtilization \ + --dimensions Name=DomainName,Value=tu-dominio,Name=ClientId,Value=tu-id-cuenta \ + --start-time 2025-01-22T00:00:00Z \ + --end-time 2025-01-22T23:59:59Z \ + --period 3600 \ + --statistics Average +``` + +3. **Configurar monitoreo temporal**: Crea dashboards personalizados en CloudWatch para las dependencias afectadas + + + + + +El equipo de SleakOps está trabajando en una solución integral que: + +1. **Identifique todas las dependencias afectadas**: Escanear la base de datos para dependencias que carecen de la variable de monitoreo +2. **Complete las variables faltantes**: Agregar la configuración de monitoreo requerida a las dependencias existentes +3. **Prevenga futuras ocurrencias**: Asegurar que todas las nuevas dependencias incluyan la variable de monitoreo desde su creación + +**Cronograma esperado**: La corrección se está implementando como una solución completa en lugar de parches individuales para asegurar que todos los usuarios afectados se beneficien simultáneamente. + + + + + +Para verificar si tus dependencias están afectadas por este problema: + +1. **Navega a Dependencias** en tu panel de SleakOps +2. **Selecciona tu dependencia RDS o OpenSearch** +3. **Ve a la pestaña de Monitoreo** +4. 
**Verifica si los gráficos muestran datos**: + - Si los gráficos están vacíos pero la página carga → Estás afectado por este problema + - Si los gráficos muestran datos → Tu dependencia funciona correctamente + - Si la página no carga → Podría ser un problema diferente + + + + + +Para nuevas dependencias creadas después de la corrección en la plataforma: + +1. **Verifica la configuración del monitoreo**: Después de crear una nueva dependencia, comprueba que los gráficos de monitoreo muestren datos en 5-10 minutos +2. **Prueba diferentes rangos de tiempo**: Asegúrate de que los gráficos funcionen para distintos períodos (1 hora, 24 horas, 7 días) +3. **Contacta soporte temprano**: Si las nuevas dependencias presentan este problema, repórtalo de inmediato ya que podría indicar que la corrección necesita ajustes + + + +--- + +_Este FAQ fue generado automáticamente el 22 de enero de 2025 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependency-connection-timeout-mysql-redis.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependency-connection-timeout-mysql-redis.mdx new file mode 100644 index 000000000..1682174fa --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dependency-connection-timeout-mysql-redis.mdx @@ -0,0 +1,237 @@ +--- +sidebar_position: 3 +title: "Problemas de Tiempo de Espera en la Conexión con MySQL y Redis" +description: "Solución de problemas de tiempos de espera en conexiones a dependencias MySQL y Redis en entornos de producción" +date: "2024-01-15" +category: "dependency" +tags: + ["mysql", "redis", "timeout", "connection", "troubleshooting", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Tiempo de Espera en la Conexión con MySQL y Redis + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** MySQL, Redis, Tiempo de 
Espera, Conexión, Solución de Problemas, Redes + +## Descripción del Problema + +**Contexto:** Servicio API en producción que experimenta tiempos de espera simultáneos en la conexión tanto a la base de datos MySQL como a la caché Redis, mientras que otros servicios en el mismo entorno funcionan correctamente y las conexiones externas (VPN) tienen éxito. + +**Síntomas Observados:** + +- Tiempo de espera en conexión a MySQL: `Error: connect ETIMEDOUT` +- Tiempo de espera en conexión a Redis: `ConnectionTimeoutError: Connection timeout` +- Solo afecta a un servicio API específico en producción +- Otros servicios pueden conectarse exitosamente a las mismas dependencias +- Las conexiones externas vía VPN funcionan normalmente +- Los servicios de base de datos y Redis están activos y accesibles + +**Configuración Relevante:** + +- Entorno: Servicio API en producción +- Dependencias afectadas: Base de datos MySQL y caché Redis +- Método de conexión: Red interna del clúster +- Secretos y credenciales: Presentes y cargados correctamente +- Acceso externo: Funciona vía VPN + +**Condiciones de Error:** + +- Los errores ocurren simultáneamente para MySQL y Redis +- Problema aislado a un servicio/pod específico +- Comportamiento intermitente - a veces funciona, otras falla +- No se han realizado cambios recientes en la configuración + +## Solución Detallada + + + +Cuando un solo servicio pierde conectividad con múltiples dependencias mientras otros funcionan bien, esto típicamente indica: + +1. **Problemas de red en el pod**: El pod específico puede tener problemas de conectividad de red +2. **Restricciones de recursos**: Límites de memoria/CPU causando agotamiento del pool de conexiones +3. **Problemas de resolución DNS**: Problemas en el descubrimiento de servicios dentro del clúster +4. **Cambios en grupos de seguridad/firewall**: Políticas de red que bloquean tráfico específico del pod + + + + + +**Paso 1: Reiniciar el servicio afectado** + +1. 
En el panel de SleakOps, vaya a su proyecto
+2. Encuentre el servicio API afectado
+3. Haga clic en **Reiniciar** para recrear los pods
+4. Monitoree los logs para verificar la recuperación de la conexión
+
+**Paso 2: Verificar uso de recursos del pod**
+
+```bash
+# Verificar consumo de recursos del pod
+kubectl top pods -n su-namespace
+
+# Revisar eventos del pod para problemas de recursos
+kubectl describe pod su-api-pod -n su-namespace
+```
+
+**Paso 3: Verificar conectividad de red desde el pod**
+
+```bash
+# Probar conectividad desde dentro del pod
+kubectl exec -it su-api-pod -n su-namespace -- sh
+
+# Probar conexión a MySQL
+telnet mysql-service 3306
+
+# Probar conexión a Redis
+telnet redis-service 6379
+
+# Verificar resolución DNS
+nslookup mysql-service
+nslookup redis-service
+```
+
+
+
+
+
+Los tiempos de espera en conexiones a menudo ocurren debido a pools de conexiones agotados:
+
+**Configuración del Pool de Conexiones MySQL:**
+
+```javascript
+// Configuración recomendada para conexión MySQL
+// Nota: opciones como acquireTimeout o reconnect pertenecen al paquete
+// "mysql" clásico y no son válidas en mysql2
+const mysql = require("mysql2/promise");
+
+const pool = mysql.createPool({
+  host: process.env.MYSQL_HOST,
+  user: process.env.MYSQL_USER,
+  password: process.env.MYSQL_PASSWORD,
+  database: process.env.MYSQL_DATABASE,
+  connectionLimit: 10,
+  waitForConnections: true,
+  connectTimeout: 60000,
+});
+```
+
+**Configuración de Conexión a Redis:**
+
+```javascript
+// Configuración recomendada para conexión Redis (usando ioredis)
+const Redis = require("ioredis");
+
+const client = new Redis({
+  host: process.env.REDIS_HOST,
+  port: Number(process.env.REDIS_PORT),
+  connectTimeout: 60000,
+  lazyConnect: true,
+  maxRetriesPerRequest: 3,
+});
+```
+
+
+
+
+
+Los recursos insuficientes pueden causar problemas de conexión:
+
+**En SleakOps:**
+
+1. Vaya a la **configuración del servicio API**
+2.
Verifique los **Límites de Recursos**: + - Memoria: Aumentar si está cerca del límite + - CPU: Asegurar una asignación adecuada + +**Mínimos recomendados para servicios API:** + +```yaml +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +**Monitorear uso de recursos:** + +```bash +# Verificar uso actual de recursos +kubectl top pod su-api-pod -n su-namespace + +# Revisar eventos de recursos +kubectl get events -n su-namespace --sort-by='.lastTimestamp' +``` + + + + + +Verifique si las políticas de red están bloqueando las conexiones: + +**Verificar políticas de red:** + +```bash +# Listar políticas de red +kubectl get networkpolicies -n su-namespace + +# Revisar detalles de una política específica +kubectl describe networkpolicy nombre-politica -n su-namespace +``` + +**Probar conectividad del servicio:** + +```bash +# Probar desde otro pod en el mismo namespace +kubectl run test-pod --image=busybox --rm -it --restart=Never -- sh + +# Dentro del pod de prueba: +telnet mysql-service.su-namespace.svc.cluster.local 3306 +telnet redis-service.su-namespace.svc.cluster.local 6379 +``` + + + + + +Para prevenir problemas futuros, implemente monitoreo: + +**Monitoreo de conexiones:** + +```javascript +// Añadir chequeos de salud de conexión +const healthCheck = async () => { + try { + // Probar MySQL + await pool.execute("SELECT 1"); + + // Probar Redis + await redis.ping(); + + console.log("Dependencias saludables"); + } catch (error) { + console.error("Fallo en chequeo de salud de dependencias:", error); + } +}; + +// Ejecutar chequeo cada 30 segundos +setInterval(healthCheck, 30000); +``` + +**Agregar a su aplicación:** + +1. Implementar lógica de reintento de conexión +2. Añadir patrones de circuito de corte +3. Monitorear métricas del pool de conexiones +4. 
Configurar alertas para fallos de conexión + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-automatic-updates-github.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-automatic-updates-github.mdx new file mode 100644 index 000000000..0544c1206 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-automatic-updates-github.mdx @@ -0,0 +1,201 @@ +--- +sidebar_position: 3 +title: "Actualizaciones Automáticas de Despliegue desde GitHub" +description: "Cómo configurar despliegues automáticos al hacer push de código al repositorio de GitHub" +date: "2024-08-02" +category: "proyecto" +tags: ["despliegue", "github", "ci-cd", "automatización", "actualizaciones"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Actualizaciones Automáticas de Despliegue desde GitHub + +**Fecha:** 2 de agosto de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, GitHub, CI/CD, Automatización, Actualizaciones + +## Descripción del Problema + +**Contexto:** El usuario desea entender cómo los cambios de código enviados a su repositorio de GitHub se despliegan automáticamente en su entorno de SleakOps durante las fases de prueba. 
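La lógica esencial del disparo por rama puede ilustrarse con una función mínima (sketch hipotético; no es el código real de SleakOps):

```shell
# should_deploy: compara la ref que GitHub envía en el push
# (p. ej. refs/heads/main) con la rama configurada para despliegue.
should_deploy() {
  local pushed_ref="$1" deploy_branch="$2"
  [ "${pushed_ref#refs/heads/}" = "$deploy_branch" ]
}
```

Solo los pushes a la rama conectada disparan una compilación; los pushes a otras ramas se ignoran.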
+ +**Síntomas Observados:** + +- Incertidumbre sobre si los cambios de código se despliegan automáticamente +- Necesidad de comprender el flujo de trabajo de despliegue +- Preguntas sobre requisitos de intervención manual +- Preocupaciones sobre despliegues en fase de prueba + +**Configuración Relevante:** + +- Fuente: Repositorio de GitHub +- Plataforma: SleakOps +- Entorno: Desarrollo/Pruebas +- Dominio: firev.com.ar + +**Condiciones de Error:** + +- Proceso de despliegue poco claro +- Posibles pasos manuales requeridos +- Incertidumbre en flujo de trabajo de pruebas + +## Solución Detallada + + + +SleakOps ofrece capacidades de despliegue automático cuando está configurado correctamente: + +**Comportamiento por Defecto:** + +- El código enviado a la rama conectada dispara compilaciones automáticas +- Las compilaciones exitosas se despliegan automáticamente al entorno objetivo +- No se requiere intervención manual en el panel de SleakOps + +**Requisitos:** + +- El repositorio de GitHub debe estar correctamente conectado a SleakOps +- La configuración del webhook debe estar activa +- La configuración de compilación debe ser válida + + + + + +Para asegurarte de que tu repositorio está correctamente conectado: + +1. **Verificar Conexión del Repositorio:** + + - Ve a tu Proyecto en SleakOps + - Navega a **Configuración** → **Repositorio** + - Verifica que la URL del repositorio de GitHub sea correcta + - Comprueba que el webhook esté activo (estado verde) + +2. **Verificar Configuración de Rama:** + + - Asegúrate de que la rama correcta esté seleccionada para despliegue + - Ramas comunes: `main`, `master`, `develop` + +3. 
**Probar la Conexión:** + - Realiza un pequeño cambio en tu repositorio + - Haz push a la rama configurada + - Revisa la pestaña **Ejecuciones** en SleakOps para una nueva compilación + + + + + +El flujo típico de despliegue en SleakOps: + +```mermaid +graph LR + A[Push a GitHub] --> B[Disparo del Webhook] + B --> C[Compilación en SleakOps] + C --> D[¿Compilación Exitosa?] + D -->|Sí| E[Despliegue Automático] + D -->|No| F[Compilación Fallida] + E --> G[Aplicación Actualizada] +``` + +**Pasos:** + +1. **Push de Código**: El desarrollador envía código a GitHub +2. **Disparo del Webhook**: GitHub notifica a SleakOps sobre los cambios +3. **Proceso de Compilación**: SleakOps compila la aplicación +4. **Despliegue Automático**: Si la compilación tiene éxito, el despliegue ocurre automáticamente +5. **Actualización en Vivo**: La aplicación se actualiza con el nuevo código + + + + + +Para fases de prueba, considera estas prácticas: + +**1. Usar Entorno de Desarrollo:** + +```yaml +# Configuración recomendada +Entornos: + - develop (para pruebas) + - production (para sitio en vivo) +``` + +**2. Estrategia de Ramas:** + +- Usar la rama `develop` para pruebas +- Usar `main`/`master` para producción +- Probar cambios en develop antes de fusionar a main + +**3. Monitoreo de Despliegues:** + +- Revisar la pestaña **Ejecuciones** después de cada push +- Monitorear logs de compilación para errores +- Verificar funcionalidad de la aplicación tras el despliegue + +**4. Estrategia de Reversión:** + +- Mantener versiones anteriores disponibles +- Probar procedimientos de rollback +- Documentar estados conocidos como estables + + + + + +Si los despliegues automáticos no funcionan: + +**1. Revisar Estado de Compilación:** + +- Ve a la pestaña **Ejecuciones** +- Busca compilaciones fallidas (estado rojo) +- Revisa logs de compilación para errores + +**2. 
Verificar Webhook:** + +- Revisa configuración del repositorio en GitHub +- Busca el webhook de SleakOps en **Configuración** → **Webhooks** +- Asegura que el webhook esté activo y recibiendo cargas + +**3. Configuración de Rama:** + +- Confirma que haces push a la rama correcta +- Verifica que el nombre de la rama coincida con la configuración en SleakOps + +**4. Configuración de Compilación:** + +- Revisa sintaxis del Dockerfile +- Verifica variables de entorno +- Asegura que todas las dependencias estén definidas correctamente + + + + + +Puede ser necesaria acción manual en estos casos: + +**1. Fallas en la Compilación:** + +- Corregir problemas de código y hacer push nuevamente +- Actualizar configuración de compilación si es necesario + +**2. Variables de Entorno:** + +- Actualizar variables en el panel de SleakOps +- Reiniciar ejecuciones si se modificaron variables + +**3. Cambios en Infraestructura:** + +- Requerimientos de escalado +- Ajustes en límites de recursos +- Nuevas dependencias + +**4. 
Configuración de Dominio:** + +- Cambios en DNS +- Actualizaciones de certificados SSL +- Configuración de dominio personalizado + + + +--- + +_Este FAQ fue generado automáticamente el 2 de agosto de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failed-production.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failed-production.mdx new file mode 100644 index 000000000..339e0f888 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failed-production.mdx @@ -0,0 +1,199 @@ +--- +sidebar_position: 3 +title: "Fallo en la Construcción de Producción con Problemas de Carga de Logs" +description: "Solución para fallos en la construcción de producción cuando los logs no se cargan en el panel de SleakOps" +date: "2024-01-15" +category: "proyecto" +tags: + ["construcción", "producción", "logs", "despliegue", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en la Construcción de Producción con Problemas de Carga de Logs + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Construcción, Producción, Logs, Despliegue, Solución de problemas + +## Descripción del Problema + +**Contexto:** El proceso de construcción en producción ha dejado de funcionar y al intentar ver los logs de error a través del panel de SleakOps, el indicador de carga aparece indefinidamente sin mostrar los logs reales. 
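Para distinguir si el problema descrito en el contexto está en el panel (frontend) o en el backend de logs, un patrón útil es sondear la fuente de logs con reintentos y retroceso exponencial. El siguiente boceto en JavaScript es puramente ilustrativo y no forma parte de SleakOps: la función `fetchLogs` es hipotética y se inyecta desde fuera, de modo que puede envolver cualquier endpoint o CLI disponible.

```javascript
// Boceto: sondear una fuente de logs con reintentos y retroceso exponencial.
// "fetchLogs" es una función hipotética inyectada por el llamador; no es
// una API real de SleakOps. Devuelve el texto de los logs o lanza un error.
async function pollLogs(fetchLogs, { retries = 4, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const logs = await fetchLogs();
      // Si llegaron logs no vacíos, el backend responde: el problema sería del panel.
      if (logs && logs.length > 0) return { ok: true, logs, attempt };
    } catch (err) {
      lastError = err; // timeout o error HTTP: reintentar con más espera
    }
    // Retroceso exponencial: 500 ms, 1 s, 2 s, 4 s...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  return { ok: false, error: lastError, attempt: retries };
}

// Ejemplo de uso con una fuente simulada que falla dos veces y luego responde:
let llamadas = 0;
const fuenteInestable = async () => {
  llamadas++;
  if (llamadas < 3) throw new Error("timeout");
  return "build failed: step 7";
};

pollLogs(fuenteInestable, { baseDelayMs: 1 }).then((r) => {
  console.log(r.ok, r.attempt); // se espera: true 2
});
```

Si el sondeo obtiene logs pero el panel sigue girando, el problema es del navegador o del panel; si todos los intentos fallan, conviene tratar el caso como un fallo del backend de logs.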
+ +**Síntomas Observados:** + +- El proceso de construcción en producción no termina satisfactoriamente +- El panel de SleakOps muestra carga perpetua al intentar acceder a los logs de construcción +- Imposibilidad de ver detalles de error para diagnosticar el fallo de construcción +- La construcción funcionaba previamente pero de repente dejó de funcionar + +**Configuración Relevante:** + +- Entorno: Producción +- Plataforma: SleakOps +- Tipo de problema: Fallo de construcción + Problema de acceso a logs +- Estado: Afecta tanto a la construcción como a la visualización de logs + +**Condiciones de Error:** + +- Fallo en la construcción durante el despliegue en producción +- La carga de logs se queda atascada en estado de carga +- El problema impide una solución adecuada +- El problema apareció de forma repentina tras construcciones que funcionaban + +## Solución Detallada + + + +Cuando se enfrentan tanto fallos en la construcción como problemas de carga de logs, siga este enfoque diagnóstico: + +1. **Verificar estado de la construcción**: Compruebe si la construcción está realmente en ejecución o si ha fallado completamente +2. **Refrescar el navegador**: Intente actualizar el panel de SleakOps +3. **Probar otro navegador**: Acceda a los logs desde una ventana de incógnito o un navegador diferente +4. 
**Conectividad de red**: Asegúrese de tener una conexión a internet estable + + + + +Si los logs del panel no se cargan, pruebe estas alternativas: + +**Vía CLI (si está disponible):** + +```bash +# Ver construcciones recientes +sleakops builds list --project nombre-de-tu-proyecto + +# Obtener logs de una construcción específica +sleakops builds logs --build-id <build-id> +``` + +**Vía API:** + +```bash +# Obtener información de construcciones +curl -H "Authorization: Bearer TU_TOKEN" \ + https://api.sleakops.com/v1/projects/ID_PROYECTO/builds +``` + +**Revisar notificaciones por correo:** + +- Revise cualquier correo de fallo de construcción que pueda contener detalles de error + + + + + +Razones típicas para fallos repentinos en construcciones de producción: + +**Problemas de Recursos:** + +- Memoria o CPU insuficiente durante la construcción +- Espacio en disco agotado +- Tiempo de construcción excedido + +**Cambios en la Configuración:** + +- Variables de entorno modificadas o faltantes +- Cambios en Dockerfile que rompieron la construcción +- Conflictos en versiones de dependencias + +**Dependencias Externas:** + +- Problemas con el registro de paquetes (npm, pip, etc.) +- Indisponibilidad de la imagen base +- Problemas de conectividad de red + +**Problemas de Código:** + +- Commits recientes que introdujeron errores que rompen la construcción +- Archivos faltantes o rutas incorrectas +- Errores de compilación + + + + + +Cuando los logs del panel de SleakOps no se cargan: + +**Soluciones relacionadas con el navegador:** + +1. Limpiar caché y cookies del navegador +2. Deshabilitar extensiones del navegador temporalmente +3. Probar modo incógnito/privado +4. Actualizar el navegador a la última versión + +**Soluciones específicas del panel:** + +1. Cerrar sesión y volver a iniciar sesión en SleakOps +2. Verificar si otras secciones del panel funcionan correctamente +3. Intentar acceder desde otro dispositivo +4. 
Esperar unos minutos e intentar de nuevo (problemas temporales del servidor) + +**Si el problema persiste:** + +- Contactar al soporte de SleakOps con el ID específico de la construcción +- Proporcionar errores de la consola del navegador (F12 → pestaña Consola) +- Incluir captura de pantalla del problema de carga + + + + + +Para que tus construcciones de producción vuelvan a funcionar: + +**1. Identificar la última construcción que funcionó:** + +```bash +# Encontrar construcciones exitosas recientes +sleakops builds list --status success --limit 5 +``` + +**2. Comparar configuraciones:** + +- Revisar qué cambió entre las construcciones que funcionaban y las que fallan +- Revisar commits recientes y actualizaciones de configuración +- Verificar que las variables de entorno no hayan cambiado + +**3. Revertir si es necesario:** + +```bash +# Desplegar desde la última construcción conocida como buena +sleakops deploy --build-id <build-id> +``` + +**4. Probar con cambios mínimos:** + +- Intentar construir con un cambio simple primero +- Añadir complejidad gradualmente para identificar el punto de ruptura + + + + + +Para evitar problemas similares: + +**Monitoreo:** + +- Configurar notificaciones de fallos en construcción +- Monitorear tendencias de duración de construcción +- Rastrear uso de recursos durante las construcciones + +**Buenas prácticas:** + +- Probar construcciones en staging antes de producción +- Usar fijación de versiones de dependencias +- Implementar estrategias de caché en construcción +- Realizar copias de seguridad regulares de configuraciones que funcionan + +**Documentación:** + +- Documentar dependencias y requerimientos de construcción +- Mantener registro de configuraciones de entorno que funcionan +- Mantener procedimientos de reversión + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failures-during-updates.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failures-during-updates.mdx new file mode 100644 index 000000000..1f7011ab6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-build-failures-during-updates.mdx @@ -0,0 +1,155 @@ +--- +sidebar_position: 3 +title: "Fallos de Construcción de Despliegue Durante Actualizaciones de la Plataforma" +description: "Solución para fallos de construcción que ocurren durante actualizaciones programadas de la plataforma" +date: "2024-10-14" +category: "proyecto" +tags: ["despliegue", "construcción", "actualizaciones", "solución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallos de Construcción de Despliegue Durante Actualizaciones de la Plataforma + +**Fecha:** 14 de octubre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Construcción, Actualizaciones, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan fallos en el despliegue al intentar hacer públicos sus proyectos durante actualizaciones programadas de la plataforma. El proceso de construcción falla temporalmente mientras la plataforma SleakOps está en mantenimiento o actualización. 
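La lógica de "esperar a que termine la ventana de mantenimiento y reintentar" puede automatizarse en la canalización de CI. El siguiente boceto en JavaScript es ilustrativo: las ventanas de mantenimiento son datos de ejemplo introducidos aquí (SleakOps no expone necesariamente esta información como API).

```javascript
// Boceto: decidir si desplegar ahora o esperar, dada una lista de ventanas
// de mantenimiento anunciadas (pares inicio/fin en ISO 8601, datos hipotéticos).
function planDeployment(now, maintenanceWindows) {
  const t = new Date(now).getTime();
  for (const w of maintenanceWindows) {
    const start = new Date(w.start).getTime();
    const end = new Date(w.end).getTime();
    if (t >= start && t < end) {
      // Dentro de la ventana: reintentar cuando termine, con 5 minutos de margen.
      const marginMs = 5 * 60 * 1000;
      return { deployNow: false, retryAt: new Date(end + marginMs).toISOString() };
    }
  }
  return { deployNow: true, retryAt: null };
}

// Ventana hipotética anunciada para el 14 de octubre:
const windows = [{ start: "2024-10-14T02:00:00Z", end: "2024-10-14T02:30:00Z" }];

console.log(planDeployment("2024-10-14T02:10:00Z", windows));
// dentro de la ventana: deployNow false, retryAt con margen tras el fin
console.log(planDeployment("2024-10-14T03:00:00Z", windows));
// fuera de la ventana: deployNow true
```

Un paso previo de este tipo en la canalización evita disparar compilaciones condenadas a fallar durante el mantenimiento y las reprograma automáticamente.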
+ +**Síntomas Observados:** + +- Errores de despliegue al cambiar la visibilidad del proyecto a público +- Fallos en la construcción durante ventanas de tiempo específicas +- El error ocurre sin tiempo de inactividad en los servicios en ejecución +- Incapacidad temporal para desplegar nuevos cambios + +**Configuración Relevante:** + +- Tipo de proyecto: Entorno de desarrollo +- Acción intentada: Cambiar la visibilidad del proyecto a público +- Estado de la plataforma: En mantenimiento/actualización programada +- Disponibilidad del servicio: Sin tiempo de inactividad para servicios en ejecución + +**Condiciones del Error:** + +- El error ocurre durante actualizaciones programadas de la plataforma +- El proceso de construcción falla temporalmente +- La canalización de despliegue se ve afectada +- Los servicios en ejecución permanecen sin afectación + +## Solución Detallada + + + +Los fallos de construcción durante las actualizaciones de la plataforma son problemas temporales que ocurren cuando: + +1. **Se están actualizando componentes de la plataforma**: La infraestructura central de construcción está en mantenimiento +2. **Las canalizaciones de construcción están temporalmente no disponibles**: Los sistemas CI/CD pueden estar reiniciándose +3. **Cambios en la asignación de recursos**: Los recursos de construcción se redistribuyen temporalmente +4. **Actualizaciones de configuración**: Se están modificando ajustes de la plataforma + +**Importante**: Estos fallos no afectan las aplicaciones en ejecución, solo los nuevos despliegues. + + + + + +Al enfrentar fallos de construcción durante actualizaciones: + +1. **Esperar a que la actualización finalice**: La mayoría de las actualizaciones de la plataforma se completan en 15-30 minutos +2. **Reintentar el despliegue**: Una vez finalizadas las actualizaciones, intente nuevamente la acción original +3. **Verificar el estado de la plataforma**: Monitorear la página de estado de SleakOps o las notificaciones +4. 
**Verificar el estado del proyecto**: Asegurarse de que la configuración del proyecto permanezca intacta + +```bash +# Ejemplo: Reintentar despliegue después de la actualización +# No se requieren comandos especiales - simplemente reintente a través de la interfaz de usuario +``` + + + + + +Para minimizar el impacto de las actualizaciones programadas: + +1. **Programar los despliegues adecuadamente**: + + - Evitar despliegues durante ventanas de mantenimiento anunciadas + - Planificar despliegues críticos fuera de los horarios de actualización + +2. **Monitorear las comunicaciones de la plataforma**: + + - Suscribirse a las actualizaciones de estado de SleakOps + - Revisar notificaciones por correo electrónico sobre mantenimientos programados + +3. **Implementar estrategias de despliegue**: + - Usar entornos de staging para pruebas + - Desplegar durante períodos de baja actividad + - Tener planes de reversión listos + + + + + +Si los fallos de construcción persisten después de que las actualizaciones de la plataforma hayan finalizado: + +1. **Verificar la configuración del proyecto**: + + ```yaml + # Verifique la configuración de su proyecto + visibility: public + environment: development + build_status: ready + ``` + +2. **Limpiar caché de construcción**: + + - Ir a Configuración del Proyecto + - Navegar a Configuración de Construcción + - Seleccionar "Limpiar Caché de Construcción" + - Reintentar el despliegue + +3. **Verificar cuotas de recursos**: + + - Verificar los límites de su cuenta + - Asegurarse de tener minutos de construcción disponibles + - Revisar cuotas de almacenamiento + +4. **Contactar soporte si es necesario**: + - Proporcionar mensajes de error específicos + - Incluir detalles del proyecto y tiempos + - Usar "Responder a todos" para una resolución más rápida + + + + + +Durante las actualizaciones de la plataforma, monitoree sus servicios: + +1. 
**Los servicios en ejecución permanecen disponibles**: + + - No se espera tiempo de inactividad para despliegues activos + - Las aplicaciones existentes continúan funcionando normalmente + - Solo se ven afectados los nuevos builds/despliegues + +2. **Verificación de chequeo de salud**: + + ```bash + # Verifique que sus servicios siguen respondiendo + curl -I https://your-app.sleakops.dev + # Debe devolver 200 OK + ``` + +3. **Monitoreo de logs**: + - Revisar logs de la aplicación por anomalías + - Monitorear uso de recursos durante actualizaciones + - Verificar que las conexiones a la base de datos permanezcan estables + + + +--- + +_Esta FAQ fue generada automáticamente el 14 de octubre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-environment-variables-migration-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-environment-variables-migration-issues.mdx new file mode 100644 index 000000000..8d9a7a0be --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-environment-variables-migration-issues.mdx @@ -0,0 +1,176 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno No Disponibles Después de la Migración" +description: "Solución para variables de entorno que dejan de estar disponibles durante migraciones de plataforma que afectan compilaciones con argumentos predeterminados en Dockerfile" +date: "2024-04-24" +category: "deployment" +tags: ["environment-variables", "migration", "dockerfile", "secrets", "build"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno No Disponibles Después de la Migración + +**Fecha:** 24 de abril de 2024 +**Categoría:** Despliegue +**Etiquetas:** Variables de Entorno, Migración, Dockerfile, Secretos, Compilación + +## Descripción del Problema + +**Contexto:** Durante migraciones de plataforma 
que afectan la forma en que los secretos se exponen en los entornos, las aplicaciones pueden perder acceso a las variables de entorno, afectando particularmente las compilaciones con argumentos predeterminados en Dockerfiles. + +**Síntomas Observados:** + +- Las variables de entorno dejan de estar disponibles repentinamente después del despliegue +- Las aplicaciones fallan al conectarse a servicios externos (por ejemplo, puntos de acceso de login) +- Variables que antes funcionaban dejan de recibirse +- El problema afecta múltiples entornos (desarrollo, staging) +- Ocurre después de fusionar PRs y disparar redeploys + +**Configuración Relevante:** + +- Plataforma: SleakOps en AWS +- Entornos afectados: Desarrollo y Staging +- Tipo de compilación: Aplicaciones frontend con compilaciones Dockerfile +- Tipo de variables: Variables de entorno usadas para endpoints de API + +**Condiciones de Error:** + +- Ocurre durante migraciones de plataforma que afectan la exposición de secretos +- Afecta compilaciones con argumentos predeterminados en Dockerfile +- El problema persiste a través de múltiples despliegues +- El problema aparece tras merges exitosos de PR y redeploys + +## Solución Detallada + + + +Durante migraciones de plataforma, SleakOps puede actualizar la forma en que los secretos y variables de entorno se exponen a las aplicaciones. Esto puede afectar temporalmente: + +- **Compilaciones Dockerfile con argumentos predeterminados**: Las variables pasadas como argumentos de build pueden no estar disponibles +- **Variables de entorno en tiempo de ejecución**: Variables necesarias durante la ejecución de la aplicación +- **Montaje de secretos**: Cómo los secretos se ponen a disposición de los contenedores + +El proceso de migración garantiza mayor seguridad y consistencia pero puede causar interrupciones temporales. + + + + + +Cuando las variables de entorno dejan de estar disponibles: + +1. 
**Verificar configuración de grupos de variables**: + + - Confirmar que los grupos de variables están configurados correctamente + - Asegurar que las variables están asignadas a los entornos correctos + +2. **Revisar argumentos de build en Dockerfile**: + + ```dockerfile + # Asegurar que las declaraciones ARG estén presentes + ARG API_ENDPOINT + ARG DATABASE_URL + + # Usar variables de entorno correctamente + ENV API_ENDPOINT=${API_ENDPOINT} + ENV DATABASE_URL=${DATABASE_URL} + ``` + +3. **Validar configuraciones específicas por entorno**: + - Comprobar que las variables están definidas para cada entorno (dev, staging, prod) + - Verificar que los nombres de variables coinciden exactamente entre configuración y código + + + + + +Después de completar la migración: + +1. **Actualizar grupos de variables**: + + - Revisar y actualizar asignaciones de grupos de variables + - Asegurar que todas las variables necesarias estén configuradas correctamente + - Probar la disponibilidad de variables en cada entorno + +2. **Redeplegar aplicaciones**: + + ```bash + # Disparar un despliegue fresco para captar la nueva configuración de variables + git commit --allow-empty -m "Disparar redeploy después de la migración" + git push origin develop + ``` + +3. **Verificar funcionalidad de la aplicación**: + - Probar todos los endpoints que dependen de variables de entorno + - Revisar logs de la aplicación para errores relacionados con variables + - Validar conectividad a servicios externos + + + + + +Para evitar problemas en futuras migraciones: + +```dockerfile +# 1. Declarar todos los argumentos de build requeridos +ARG NODE_ENV=production +ARG API_BASE_URL +ARG DATABASE_URL + +# 2. Establecer variables de entorno desde argumentos de build +ENV NODE_ENV=${NODE_ENV} +ENV API_BASE_URL=${API_BASE_URL} +ENV DATABASE_URL=${DATABASE_URL} + +# 3. Proveer valores por defecto cuando sea apropiado +ENV API_TIMEOUT=${API_TIMEOUT:-30000} + +# 4. 
Validar variables críticas +RUN test -n "$API_BASE_URL" || (echo "API_BASE_URL es requerido" && exit 1) +``` + + + + + +Para prevenir problemas similares: + +1. **Configurar validación de variables de entorno**: + + ```javascript + // Añadir validación en el inicio de la aplicación + const requiredVars = ["API_BASE_URL", "DATABASE_URL", "JWT_SECRET"]; + + requiredVars.forEach((varName) => { + if (!process.env[varName]) { + console.error(`Falta la variable de entorno requerida: ${varName}`); + process.exit(1); + } + }); + ``` + +2. **Implementar health checks**: + + ```javascript + // Endpoint de health check para verificar configuración + app.get("/health", (req, res) => { + const config = { + apiEndpoint: !!process.env.API_BASE_URL, + databaseConnected: !!process.env.DATABASE_URL, + // No exponer valores reales por seguridad + }; + + res.json({ status: "ok", config }); + }); + ``` + +3. **Monitorear notificaciones de despliegue**: + - Suscribirse a anuncios de migraciones de plataforma + - Probar aplicaciones inmediatamente después de actualizaciones de plataforma + - Mantener entornos staging que reflejen la configuración de producción + + + +--- + +_Esta FAQ fue generada automáticamente el 24 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx new file mode 100644 index 000000000..1a2508b8c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-fargate-vcpu-quota-limit.mdx @@ -0,0 +1,198 @@ +--- +sidebar_position: 3 +title: "Límite de cuota vCPU de Fargate durante el despliegue" +description: "Solución para fallos en despliegues debido a limitaciones de cuota vCPU en Fargate" +date: "2025-02-13" +category: "deployment" +tags: ["fargate", "aws", "cuota", "vcpu", "despliegue", "solución de problemas"] 
+--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Límite de cuota vCPU de Fargate durante el despliegue + +**Fecha:** 13 de febrero de 2025 +**Categoría:** Despliegue +**Etiquetas:** Fargate, AWS, Cuota, vCPU, Despliegue, Solución de problemas + +## Descripción del problema + +**Contexto:** Los usuarios experimentan fallos en el despliegue en SleakOps al usar AWS Fargate como modo de despliegue, específicamente debido a limitaciones de cuota de vCPU que impiden despliegues exitosos. + +**Síntomas observados:** + +- El despliegue falla durante la ejecución +- Mensajes de error relacionados con capacidad insuficiente en Fargate +- Imposibilidad de completar el proceso de despliegue +- Despliegues previos pueden haber funcionado pero empiezan a fallar repentinamente +- El problema afecta múltiples intentos de despliegue + +**Configuración relevante:** + +- Modo de despliegue: AWS Fargate +- Plataforma: SleakOps +- Servicio: Despliegues de contenedores +- Tipo de recurso: asignación de vCPU + +**Condiciones de error:** + +- El error ocurre durante la ejecución del despliegue +- Sucede cuando se alcanza la cuota de vCPU de Fargate +- Puede coincidir con intentos de eliminar grupos de variables u otros recursos +- Impide tanto nuevos despliegues como operaciones de limpieza + +## Solución detallada + + + +Los límites de cuota de vCPU de Fargate pueden causar fallos en el despliegue. Para identificar si este es tu problema: + +1. **Revisar la consola de cuotas de servicio de AWS:** + + - Navega a Consola AWS → Cuotas de servicio + - Busca "AWS Fargate" + - Busca "Cantidad de recursos vCPU bajo demanda de Fargate" + +2. **Revisar los registros de despliegue:** + + - Busca mensajes de error que mencionen "capacidad insuficiente" + - Verifica códigos de error específicos de Fargate + - Monitorea fallos en la asignación de recursos + +3. 
**Verificar uso actual:** + - Consola AWS → ECS → Clústeres + - Revisa las tareas Fargate en ejecución y su asignación de vCPU + + + + + +Para solicitar un aumento de cuota de vCPU en Fargate: + +1. **Accede a Cuotas de servicio:** + + ```bash + # A través de AWS CLI (opcional) + aws service-quotas get-service-quota \ + --service-code fargate \ + --quota-code L-3032A538 + ``` + +2. **Envía la solicitud de aumento de cuota:** + + - Ve a Consola AWS → Cuotas de servicio + - Busca "Cantidad de recursos vCPU bajo demanda de Fargate" + - Haz clic en "Solicitar aumento de cuota" + - Especifica el nuevo límite necesario + - Proporciona justificación comercial + +3. **Tiempo típico de procesamiento:** + - Solicitudes estándar: 24-48 horas + - Solicitudes urgentes: pueden acelerarse a través del Soporte AWS + + + + + +Mientras esperas la aprobación de la cuota, prueba estas soluciones temporales: + +1. **Optimiza la asignación de recursos:** + + ```yaml + # Reduce la asignación de vCPU en tu configuración de despliegue + resources: + limits: + cpu: "0.25" # En lugar de "0.5" o "1.0" + memory: "512Mi" + requests: + cpu: "0.1" + memory: "256Mi" + ``` + +2. **Limpia recursos no usados:** + + - Detén tareas Fargate innecesarias + - Elimina despliegues inactivos + - Borra servicios ECS no utilizados + +3. **Usa una estrategia de despliegue diferente:** + + - Despliega en lotes más pequeños + - Implementa despliegues continuos con menor concurrencia + - Considera usar temporalmente el tipo de lanzamiento EC2 + +4. **Alternativas regionales:** + - Despliega en una región AWS diferente con capacidad disponible + - Usa múltiples regiones para distribuir la carga + + + + + +Para prevenir futuros problemas con la cuota de vCPU de Fargate: + +1. 
**Configura monitoreo:** + + ```yaml + # Alarma de CloudWatch para uso de Fargate + FargateVCPUUsageAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: FargateVCPUUsageHigh + MetricName: CPUUtilization + Namespace: AWS/ECS + Statistic: Average + Threshold: 80 + ComparisonOperator: GreaterThanThreshold + ``` + +2. **Implementa gobernanza de recursos:** + + - Establece límites de recursos predeterminados para despliegues + - Implementa flujos de aprobación para despliegues con alto consumo + - Auditoría regular del uso de recursos + +3. **Planifica la capacidad:** + + - Monitorea tendencias de uso + - Solicita aumentos de cuota proactivamente + - Mantén capacidad de reserva para picos de uso + +4. **Documentación y alertas:** + - Documenta los límites actuales de cuota + - Configura alertas al 70% y 85% de uso + - Crea runbooks para gestión de cuotas + + + + + +Cuando los problemas de cuota impiden operaciones de limpieza (como eliminar grupos de variables): + +1. **Despliegue temporal para limpieza:** + + - Despliega recursos mínimos para habilitar operaciones de limpieza + - Usa la asignación de vCPU más baja posible + - Ejecuta tareas de limpieza inmediatamente después del despliegue + +2. **Limpieza manual vía consola AWS:** + + ```bash + # Lista los servicios ECS que podrían estar bloqueando la limpieza + aws ecs list-services --cluster nombre-de-tu-cluster + + # Detén servicios si es seguro hacerlo + aws ecs update-service --cluster nombre-de-tu-cluster \ + --service nombre-de-tu-servicio --desired-count 0 + ``` + +3. 
**Coordina con soporte de SleakOps:** + - Reporta el recurso específico que causa problemas + - Solicita intervención manual si es necesario + - Proporciona registros de despliegue y mensajes de error + + + +--- + +_Esta FAQ fue generada automáticamente el 13 de febrero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-helm-selector-immutable-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-helm-selector-immutable-error.mdx new file mode 100644 index 000000000..e3eebec02 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-helm-selector-immutable-error.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Error Inmutable en el Selector de Despliegue de Helm" +description: "Solución para el error de campo inmutable en el selector de despliegue de Kubernetes durante actualizaciones con Helm" +date: "2024-12-19" +category: "workload" +tags: ["helm", "despliegue", "kubernetes", "selector", "actualización"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error Inmutable en el Selector de Despliegue de Helm + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Helm, Despliegue, Kubernetes, Selector, Actualización + +## Descripción del Problema + +**Contexto:** El usuario encuentra un fallo en el despliegue al intentar desplegar un proyecto a través de la plataforma SleakOps usando charts de Helm. 
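La regla que provoca este fallo puede verificarse antes del upgrade comparando los `matchLabels` existentes con los del chart renderizado. El siguiente boceto en JavaScript reproduce la comparación de forma simplificada; no es parte de Helm ni de SleakOps, y los nombres de etiquetas del ejemplo son hipotéticos.

```javascript
// Boceto: reproduce, simplificada, la regla de Kubernetes según la cual
// spec.selector.matchLabels de un Deployment no puede cambiar tras crearse.
// Compara ambos selectores y devuelve las claves en conflicto.
function checkSelectorImmutability(existingLabels, desiredLabels) {
  const keys = new Set([
    ...Object.keys(existingLabels),
    ...Object.keys(desiredLabels),
  ]);
  const conflicts = [];
  for (const key of keys) {
    if (existingLabels[key] !== desiredLabels[key]) {
      conflicts.push(key);
    }
  }
  return { compatible: conflicts.length === 0, conflicts };
}

// Caso típico del error "field is immutable": cambió una etiqueta del selector
// (etiquetas de ejemplo, no tomadas del clúster real):
const existing = { "app.kubernetes.io/name": "crawler-scheduler", env: "production" };
const desired = { "app.kubernetes.io/name": "crawler-scheduler", env: "produce-production" };

console.log(checkSelectorImmutability(existing, desired));
// conflicts incluirá "env": el upgrade fallaría y habría que recrear el Deployment
```

Un chequeo así, ejecutado contra la salida de `helm template` antes del upgrade, permite detectar el conflicto sin esperar al "UPGRADE FAILED".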
+ +**Síntomas observados:** + +- El despliegue falla con error "UPGRADE FAILED" +- El mensaje de error indica "field is immutable" para spec.selector +- No se puede parchear el despliegue debido a cambios en las etiquetas del selector +- El proceso de actualización con Helm queda bloqueado + +**Configuración relevante:** + +- Plataforma: SleakOps con clúster Kubernetes +- Herramienta de despliegue: Helm +- Tipo de error: `spec.selector: Invalid value` con `field is immutable` +- Recurso afectado: Despliegue de Kubernetes + +**Condiciones del error:** + +- Ocurre durante el proceso de actualización con Helm +- Sucede cuando las etiquetas del selector del despliegue han sido modificadas +- Impide el despliegue exitoso de aplicaciones actualizadas +- Típicamente ocurre tras cambios manuales o conflictos + +## Solución Detallada + + + +El error ocurre porque el campo `spec.selector` del Despliegue de Kubernetes es inmutable después de su creación. Esto significa: + +1. **Las etiquetas del selector no pueden cambiarse** una vez creado el Despliegue +2. **Helm intenta actualizar** el selector durante la actualización +3. **Kubernetes rechaza** el cambio debido a las reglas de inmutabilidad +4. **Modificaciones manuales** o conflictos pueden desencadenar este problema + +El mensaje de error muestra que Helm está intentando parchear un despliegue con etiquetas de selector diferentes a las existentes. + + + + + +La solución más rápida es eliminar el despliegue existente y dejar que SleakOps lo recree: + +**Usando Lens (IDE de Kubernetes):** + +1. Abre Lens y conéctate a tu clúster +2. Navega a **Cargas de trabajo** → **Despliegues** +3. Busca el despliegue problemático (ejemplo: `velo-crawler-scheduler-produce-production-crawler-scheduler`) +4. Haz clic derecho y selecciona **Eliminar** +5. 
Confirma la eliminación + +**Usando kubectl:** + +```bash +kubectl delete deployment velo-crawler-scheduler-produce-production-crawler-scheduler -n <namespace> +``` + +**Después de la eliminación:** + +- Inicia un nuevo despliegue desde el panel de SleakOps +- O haz push de código para activar la pipeline CI/CD +- El despliegue se recreará con los selectores correctos + + + + + +Para evitar este problema en el futuro: + +**1. Evita modificaciones manuales:** + +- No edites manualmente despliegues a través de Lens o kubectl +- Usa siempre el panel de SleakOps o CI/CD para los cambios + +**2. Maneja correctamente los conflictos:** + +- Cuando ocurran conflictos en los charts de Helm, asegúrate de que las etiquetas del selector permanezcan consistentes +- Revisa los cambios en las plantillas de despliegue antes de hacer merge + +**3. Usa buenas prácticas con Helm:** + +```yaml +# En tu plantilla Helm, asegura etiquetado consistente +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "chart.fullname" . }} + labels: {{- include "chart.labels" . | nindent 4 }} +spec: + selector: + matchLabels: {{- include "chart.selectorLabels" . | nindent 6 }} + template: + metadata: + labels: {{- include "chart.selectorLabels" . | nindent 8 }} +``` + + + + + +Si no puedes eliminar el despliegue inmediatamente: + +**1. Escalar a cero réplicas:** + +```bash +kubectl scale deployment <nombre-del-deployment> --replicas=0 -n <namespace> +``` + +**2. Usar desinstalación y reinstalación con Helm:** + +```bash +# Desinstalar la release +helm uninstall <nombre-de-la-release> -n <namespace> + +# Reinstalar desde SleakOps +# Iniciar nuevo despliegue a través de la plataforma +``` + +**3. Corrección manual del selector (avanzado):** +Si necesitas preservar el despliegue, puedes: + +- Exportar el YAML actual del despliegue +- Eliminar el despliegue +- Modificar el YAML para que coincida con los selectores esperados +- Aplicar la versión corregida +- Luego continuar con el despliegue normal desde SleakOps + + + + + +Después de aplicar la solución: + +**1. 
Verificar el estado del despliegue:** + +```bash +kubectl get deployments -n <namespace> +kubectl describe deployment <nombre-del-deployment> -n <namespace> +``` + +**2. Verificar que los pods estén corriendo:** + +```bash +kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<nombre-de-la-release> +``` + +**3. Revisar logs de la aplicación:** + +```bash +kubectl logs -l app.kubernetes.io/instance=<nombre-de-la-release> -n <namespace> +``` + +**4. Probar funcionalidad de la aplicación:** + +- Verificar que la aplicación responda correctamente +- Comprobar endpoints de salud +- Confirmar comportamiento esperado + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-image-not-updating-develop-branch.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-image-not-updating-develop-branch.mdx new file mode 100644 index 000000000..8adab1ce7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-image-not-updating-develop-branch.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 3 +title: "La implementación no refleja los últimos cambios después de la fusión" +description: "Solución para cuando las implementaciones no se actualizan con los últimos cambios de código tras fusionar a la rama develop" +date: "2024-12-19" +category: "proyecto" +tags: ["implementación", "compilación", "docker", "ci-cd", "caché-de-imágenes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# La implementación no refleja los últimos cambios después de la fusión + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Implementación, Compilación, Docker, CI/CD, Caché de imágenes + +## Descripción del problema + +**Contexto:** El equipo de desarrollo fusiona cambios de código a la rama develop, lo que desencadena una implementación automática, pero la aplicación desplegada no refleja los últimos cambios. 
El problema parece estar relacionado con la construcción de la imagen Docker que no usa el código más reciente. + +**Síntomas observados:** + +- El proceso de implementación se ejecuta correctamente tras la fusión a la rama develop +- La aplicación muestra funcionalidades/contenido antiguo en lugar de los cambios más recientes +- El mismo comportamiento ocurre en varios repositorios (3 repositorios afectados) +- La imagen Docker parece estar usando contenido en caché o desactualizado + +**Configuración relevante:** + +- Rama: `develop` (implementación automática habilitada) +- Repositorios afectados: Múltiples (3 repositorios) +- Disparador de implementación: Fusión/push a develop +- Plataforma: Implementación automatizada SleakOps + +**Condiciones de error:** + +- El problema ocurre consistentemente tras fusionar a la rama develop +- Afecta a múltiples repositorios simultáneamente +- No se reportan errores de compilación, pero los cambios no se reflejan +- Sospecha de problema con la caché de imágenes Docker + +## Solución detallada + + + +La causa más común es que la caché de capas de Docker impide que la construcción use el código más reciente: + +1. **Revisar los registros de compilación** en la sección de implementación de SleakOps +2. Buscar mensajes como "Using cache" en los pasos de construcción de Docker +3. Verificar que la etiqueta/hash de la imagen sea diferente entre compilaciones +4. Comprobar si los comandos COPY del Dockerfile invalidan correctamente la caché + +```dockerfile +# Problemático - la caché puede no invalidarse +COPY . /app + +# Mejor - copiar primero los archivos de paquetes, luego el código fuente +COPY package*.json /app/ +RUN npm install +COPY . /app +``` + + + + + +Para forzar que SleakOps reconstruya sin usar caché: + +1. Ve a tu **Panel de Proyecto** +2. Navega a **Implementaciones** → **Configuración de compilación** +3. Habilita la opción **"Forzar reconstrucción"** +4. 
O agrega la bandera `--no-cache` en la configuración de compilación +5. Dispara una nueva implementación + +**Método alternativo:** + +- Haz un pequeño commit (como actualizar un comentario) +- Haz push a la rama develop para disparar una compilación fresca + + + + + +Asegúrate de que tu Dockerfile invalide correctamente la caché cuando cambia el código: + +```dockerfile +# Buenas prácticas para aplicaciones Node.js +FROM node:18-alpine +WORKDIR /app + +# Copiar primero los archivos de paquetes (capa de caché) +COPY package*.json ./ +RUN npm ci --only=production + +# Copiar código fuente (invalida caché cuando cambia el código) +COPY . . + +# Añadir timestamp de compilación para asegurar compilaciones frescas +ARG BUILD_DATE +ENV BUILD_DATE=${BUILD_DATE} + +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +Agrega argumentos de compilación que cambien con cada compilación: + +1. **En tu configuración CI/CD:** + +```yaml +build_args: + BUILD_DATE: "$(date +%Y%m%d-%H%M%S)" + GIT_COMMIT: "${GITHUB_SHA}" +``` + +2. **En tu Dockerfile:** + +```dockerfile +ARG BUILD_DATE +ARG GIT_COMMIT +ENV BUILD_INFO="${BUILD_DATE}-${GIT_COMMIT}" + +# Esto asegura que la capa se reconstruya cada vez +RUN echo "Build: ${BUILD_INFO}" > /app/build-info.txt +``` + + + + + +Para confirmar que el problema está resuelto: + +1. **Revisa las etiquetas de la imagen** en SleakOps: + + - Ve a **Workloads** → **Tu servicio** + - Verifica que la etiqueta de la imagen coincida con la última compilación + +2. **Agrega un endpoint de versión** a tu aplicación: + +```javascript +// Añadir a tu app +app.get("/version", (req, res) => { + res.json({ + version: process.env.BUILD_DATE || "desconocida", + commit: process.env.GIT_COMMIT || "desconocido", + timestamp: new Date().toISOString(), + }); +}); +``` + +3. **Prueba el endpoint** después de la implementación para confirmar los cambios + + + + + +Dado que esto afecta a 3 repositorios, aplica estos cambios sistemáticamente: + +1. 
**Crea un Dockerfile plantilla** con la invalidación correcta de caché +2. **Actualiza todos los repositorios afectados** con el mismo patrón +3. **Habilita la reconstrucción forzada** para todos los proyectos temporalmente +4. **Prueba cada repositorio** individualmente tras los cambios + +**Ejemplo de script para actualización masiva:** + +```bash +#!/bin/bash +REPOS=("repo1" "repo2" "repo3") + +for repo in "${REPOS[@]}"; do + echo "Actualizando $repo..." + cd $repo + # Copiar Dockerfile optimizado + cp ../templates/Dockerfile . + git add Dockerfile + git commit -m "Fix: Actualizar Dockerfile para correcta invalidación de caché" + git push origin develop + cd .. +done +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-migration-hooks-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-migration-hooks-troubleshooting.mdx new file mode 100644 index 000000000..c7bbf9b15 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-migration-hooks-troubleshooting.mdx @@ -0,0 +1,234 @@ +--- +sidebar_position: 3 +title: "Problemas de Despliegue de Hooks de Migración" +description: "Solución de problemas de tiempos de espera en despliegues y fallos de hooks de migración en SleakOps" +date: "2024-12-11" +category: "proyecto" +tags: + [ + "despliegue", + "migración", + "hooks", + "solución de problemas", + "tiempo de espera", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Despliegue de Hooks de Migración + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Migración, Hooks, Solución de problemas, Tiempo de espera + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan tiempos de espera en el despliegue cuando 
los hooks de migración no se completan con éxito, causando que los despliegues permanezcan en estado "STALLED" indefinidamente.

**Síntomas Observados:**

- El despliegue muestra error de tiempo de espera después de 30 minutos
- El pod del hook de migración nunca alcanza el estado "Succeeded"
- El despliegue permanece en estado "STALLED"
- Los errores de migración no son visibles en los logs del despliegue
- La ejecución manual de migraciones en el pod muestra resultados diferentes a la ejecución del hook

**Configuración Relevante:**

- Plataforma: sistema de despliegue SleakOps
- Tipo de hook: hooks de migración de base de datos
- Método de despliegue: Jobs de Kubernetes
- Visibilidad de logs: reporte limitado de errores en logs de despliegue

**Condiciones de Error:**

- El hook de migración falla pero no reporta error explícito
- La migración de base de datos encuentra violaciones de restricciones
- La ejecución del hook no termina con un código de salida correcto
- Los logs de los hooks de migración no aparecen en stdout

## Solución Detallada

Para verificar si tu hook de migración es la causa del tiempo de espera en el despliegue:

1. **Revisa el estado del pod del hook** en el clúster:

   ```bash
   kubectl get pods -l job-name=migration-hook
   kubectl describe pod <nombre-del-pod>
   ```

2. **Busca el estado de finalización del hook**:

   - El hook debería mostrar estado `Completed`
   - Si está en `Running`, la migración nunca terminó
   - Si muestra `Failed`, revisa los logs del pod

3. **Verifica los logs del hook**:
   ```bash
   kubectl logs <nombre-del-pod>
   ```

Cuando los logs del hook no muestran el error real, ejecuta las migraciones manualmente:

1. **Accede al pod de la aplicación**:

   ```bash
   kubectl exec -it <nombre-del-pod> -- /bin/bash
   ```

2.
**Ejecuta las migraciones manualmente**: + + ```bash + # Para aplicaciones Rails + bundle exec rails db:migrate + + # Para aplicaciones Django + python manage.py migrate + + # Para aplicaciones Node.js + npm run migrate + ``` + +3. **Revisa errores específicos**: + - Violaciones de restricciones de base de datos + - Columnas o tablas faltantes + - Conflictos de tipos de datos + - Violaciones de restricción NOT NULL + + + + + +**Violaciones de restricción NOT NULL:** + +```sql +-- Error: Intento de agregar columna NOT NULL a tabla con datos existentes +-- Solución: Agregar columna como nullable primero, luego actualizar valores +ALTER TABLE users ADD COLUMN email VARCHAR(255); +UPDATE users SET email = 'default@example.com' WHERE email IS NULL; +ALTER TABLE users ALTER COLUMN email SET NOT NULL; +``` + +**Conflictos de tipo de datos:** + +```sql +-- Error: No se puede cambiar el tipo de columna con datos existentes +-- Solución: Crear nueva columna, migrar datos, eliminar columna antigua +ALTER TABLE products ADD COLUMN price_new DECIMAL(10,2); +UPDATE products SET price_new = CAST(price AS DECIMAL(10,2)); +ALTER TABLE products DROP COLUMN price; +ALTER TABLE products RENAME COLUMN price_new TO price; +``` + + + + + +Asegúrate de que tus scripts de migración terminen correctamente: + +**Para scripts shell:** + +```bash +#!/bin/bash +set -e # Salir ante cualquier error + +# Ejecuta tus migraciones +echo "Ejecutando migraciones de base de datos..." 
+bundle exec rails db:migrate + +# Salida explícita de éxito +echo "Migraciones completadas con éxito" +exit 0 +``` + +**Para scripts Python:** + +```python +import sys +import subprocess + +try: + # Ejecutar comando de migración + result = subprocess.run(['python', 'manage.py', 'migrate'], + check=True, capture_output=True, text=True) + print("Migraciones completadas con éxito") + sys.exit(0) +except subprocess.CalledProcessError as e: + print(f"Migración fallida: {e.stderr}") + sys.exit(1) +``` + + + + + +Para obtener mejor visibilidad de fallos en hooks de migración: + +1. **Agrega logging detallado a tus scripts de migración**: + + ```bash + #!/bin/bash + echo "[$(date)] Iniciando migraciones de base de datos" + echo "[$(date)] Estado actual de la base de datos:" + # Añade aquí chequeos del estado de la base de datos + + echo "[$(date)] Ejecutando migraciones..." + bundle exec rails db:migrate 2>&1 | tee /tmp/migration.log + + if [ ${PIPESTATUS[0]} -eq 0 ]; then + echo "[$(date)] Migraciones completadas con éxito" + exit 0 + else + echo "[$(date)] Fallo en migraciones" + cat /tmp/migration.log + exit 1 + fi + ``` + +2. **Asegura que toda la salida vaya a stdout/stderr**: + - Redirige toda la salida de comandos a stdout + - Usa `2>&1` para capturar stderr + - Evita escribir logs solo en archivos + + + + + +Cuando un despliegue está atascado debido a hooks de migración fallidos: + +1. **Elimina el pod de migración fallido**: + + ```bash + kubectl delete pod + ``` + +2. **Soluciona el problema subyacente de la migración**: + + - Resuelve manualmente conflictos en la base de datos + - Actualiza los scripts de migración si es necesario + - Prueba las migraciones en un entorno de staging + +3. **Reintenta el despliegue**: + + - El hook se recreará automáticamente + - Monitorea el estado del nuevo pod del hook + - Revisa los logs para confirmar finalización exitosa + +4. 
**Si los problemas persisten**: + - Considera deshabilitar temporalmente el hook de migración + - Ejecuta las migraciones manualmente después del despliegue + - Actualiza la configuración del hook para hacerlo más resiliente + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx new file mode 100644 index 000000000..f55ed7cbf --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-missing-tolerations-after-nodepool-changes.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Despliegue sin Tolerancias Después de Cambios en el Nodepool" +description: "Solución para pods que no pueden programarse tras modificaciones en el nodepool" +date: "2025-03-05" +category: "cluster" +tags: ["nodepool", "tolerancias", "despliegue", "programación", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue sin Tolerancias Después de Cambios en el Nodepool + +**Fecha:** 5 de marzo de 2025 +**Categoría:** Clúster +**Etiquetas:** Nodepool, Tolerancias, Despliegue, Programación, Kubernetes + +## Descripción del Problema + +**Contexto:** Después de actualizaciones del clúster o modificaciones en el nodepool, los despliegues existentes pueden perder la capacidad de programar pods en nodos disponibles debido a la falta de tolerancias. 
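Como referencia, el mecanismo implicado es la coincidencia entre el *taint* del nodo y la tolerancia del pod. El siguiente fragmento es un esbozo ilustrativo (los valores `node-type`/`spot` siguen la convención de taints usada en este documento):

```yaml
# Taint en el nodo (impide programar pods que no lo toleren):
#   node-type=spot:NoSchedule
#
# Tolerancia equivalente en la especificación del pod:
tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
```

Si el despliegue pierde esta tolerancia (por ejemplo, tras recrear el nodepool), el scheduler dejará de programar sus pods en esos nodos.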
+ +**Síntomas Observados:** + +- Los pods permanecen en estado Pendiente a pesar de haber nodos disponibles +- Despliegues que funcionaban previamente no pueden programarse +- Se crean nuevos nodepools pero los pods no los utilizan +- Mensajes de error relacionados con fallos en la programación de pods + +**Configuración Relevante:** + +- Múltiples nodepools en el clúster (ejemplo: producción y desarrollo) +- Diferentes tipos de nodos: on-demand y spot +- Despliegues escalados a 0 y luego escalados nuevamente +- Recientes actualizaciones del clúster o cambios en nodepool + +**Condiciones de Error:** + +- Ocurre tras eliminar nodepools por defecto antiguos +- Sucede cuando los despliegues no han sido actualizados luego de cambios en nodepool +- Afecta despliegues que fueron escalados a cero durante transiciones de nodepool +- El problema persiste hasta que las tolerancias se actualizan manualmente + +## Solución Detallada + + + +Este problema típicamente ocurre cuando: + +1. **Eliminación del nodepool por defecto**: El nodepool por defecto original se elimina durante las actualizaciones del clúster +2. **Falta de tolerancias**: Los despliegues existentes no tienen las tolerancias correctas para los nuevos nodepools +3. **Configuraciones obsoletas**: Los despliegues conservan configuraciones de programación antiguas que ya no coinciden con los nodos disponibles + +Kubernetes requiere que los pods tengan tolerancias que coincidan con las taints de los nodos, las cuales se usan comúnmente para separar diferentes tipos de cargas de trabajo (producción vs desarrollo, on-demand vs spot). + + + + + +La forma más sencilla de resolver este problema es mediante la plataforma SleakOps: + +1. **Navega a tu servicio** en el panel de control de SleakOps +2. **Haz clic en "Editar"** sobre el despliegue afectado +3. **No modifiques ningún valor** - deja todo tal cual +4. 
**Haz clic en "Desplegar"** para activar un nuevo despliegue

Este proceso:

- Añadirá automáticamente las tolerancias correctas
- Actualizará la configuración del despliegue
- Permitirá que los pods se programen en los nodepools adecuados

```yaml
# El sistema añadirá automáticamente tolerancias como:
tolerations:
  - key: "node-type"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"
```

Para entender la configuración de tus nodepools:

1. **Consulta los nodepools disponibles**:

   ```bash
   kubectl get nodes --show-labels
   ```

2. **Verifica las taints de los nodos**:

   ```bash
   kubectl describe nodes | grep -A5 -B5 Taints
   ```

3. **Consulta el estado de programación de pods**:
   ```bash
   kubectl describe pod <nombre-del-pod> | grep -A10 Events
   ```

Configuraciones comunes de taints:

- Nodos de producción: `node-type=ondemand:NoSchedule`
- Nodos de desarrollo: `node-type=spot:NoSchedule`

Si tienes múltiples despliegues afectados:

1. **Identifica todos los servicios afectados** en el panel de SleakOps
2. **Repite el proceso de editar/desplegar** para cada servicio
3. **Prioriza primero los servicios críticos**
4. **Monitorea la programación de pods** después de cada actualización

Para servicios que fueron escalados a 0:

1. Primero aplica la corrección (editar sin cambios + desplegar)
2. Luego escala el despliegue según sea necesario

Para evitar este problema en el futuro:

1. **Planifica los cambios en nodepool**: Coordina con tu equipo de SleakOps antes de cambios mayores
2. **Actualiza los despliegues proactivamente**: Cuando cambien los nodepools, actualiza los despliegues afectados
3. **Usa estrategias consistentes de taints**: Mantén el etiquetado y las taints de los nodos de forma consistente
4.
**Monitorea tras actualizaciones**: Revisa el estado de los despliegues después de actualizaciones del clúster + +**Buenas prácticas:** + +- Mantén las configuraciones de despliegue actualizadas +- Prueba operaciones de escalado tras cambios en nodepool +- Documenta tu estrategia de taints para nodepools +- Configura monitoreo para fallos en la programación de pods + + + +--- + +_Esta FAQ fue generada automáticamente el 5 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx new file mode 100644 index 000000000..2e95dae8d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-newrelic-pkg-resources-warning.mdx @@ -0,0 +1,189 @@ +--- +sidebar_position: 3 +title: "Fallo en el Despliegue en Producción con Advertencia de pkg_resources de New Relic" +description: "Solución para fallos en el despliegue causados por la advertencia de deprecación de pkg_resources de New Relic" +date: "2024-12-11" +category: "proyecto" +tags: + [ + "despliegue", + "compilación", + "newrelic", + "python", + "pkg_resources", + "setuptools", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en el Despliegue en Producción con Advertencia de pkg_resources de New Relic + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Compilación, NewRelic, Python, pkg_resources, Setuptools + +## Descripción del Problema + +**Contexto:** El usuario intenta desplegar un proyecto Python en producción usando SleakOps CLI pero encuentra fallos en la compilación relacionados con el uso de la API pkg_resources obsoleta por parte de New Relic. 
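La advertencia aparece porque `pkg_resources` está obsoleto y se elimina de Setuptools a partir de la versión 81; de ahí la sugerencia de fijar `setuptools<81`. Como esbozo ilustrativo (no es código de SleakOps ni de New Relic), la comparación de versiones detrás de ese pin puede expresarse así:

```python
# Esbozo: comprobar que una versión de Setuptools respeta el pin "<81".
# Solo usa la biblioteca estándar; los nombres de funciones son ilustrativos.

def version_tuple(version: str) -> tuple:
    """Convierte '80.9.0' en (80, 9, 0) para comparar numéricamente."""
    return tuple(int(parte) for parte in version.split(".") if parte.isdigit())

def cumple_pin(version: str, maximo_excluido: str = "81") -> bool:
    """True si `version` es estrictamente menor que `maximo_excluido`."""
    return version_tuple(version) < version_tuple(maximo_excluido)

print(cumple_pin("80.9.0"))  # True: versión segura
print(cumple_pin("81.0.0"))  # False: dispararía la advertencia
```

Con esta lógica, cualquier versión `80.x` pasa el pin, y `81.0.0` en adelante no.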
+ +**Síntomas Observados:** + +- El proceso de compilación comienza normalmente pero falla con el mensaje "Algo salió mal" +- Aparece una advertencia UserWarning sobre la deprecación de pkg_resources en los diagnósticos +- La compilación queda atascada en un bucle "Construyendo proyecto..." antes de fallar +- El despliegue al entorno de producción falla completamente + +**Configuración Relevante:** + +- Proyecto: `api-mecubro` +- Rama: `master` +- Versión de Python: `3.9` +- Integración de New Relic habilitada +- Comando usado: `sleakops build -p api-mecubro -b master -w` + +**Condiciones de Error:** + +- El error ocurre durante el proceso de compilación para despliegue en producción +- La advertencia se origina en `/usr/local/lib/python3.9/site-packages/newrelic/config.py:4555` +- La advertencia de deprecación de pkg_resources sugiere fijar Setuptools<81 +- El fallo en la compilación impide un despliegue exitoso + +## Solución Detallada + + + +La solución más rápida es fijar la versión de Setuptools en los requisitos de tu proyecto: + +1. **Agregar a requirements.txt**: + +```txt +setuptools<81 +``` + +2. **O agregar a pyproject.toml** (si usas este archivo): + +```toml +[build-system] +requires = ["setuptools<81", "wheel"] +``` + +3. **Reconstruir el proyecto**: + +```bash +sleakops build -p api-mecubro -b master -w +``` + + + + + +Si tienes control sobre el Dockerfile, puedes fijar Setuptools durante el proceso de construcción: + +```dockerfile +# Antes de instalar otras dependencias +RUN pip install "setuptools<81" + +# Luego instala tus requerimientos +COPY requirements.txt . +RUN pip install -r requirements.txt +``` + +O instalarlo junto con otras dependencias: + +```dockerfile +RUN pip install "setuptools<81" -r requirements.txt +``` + + + + + +La mejor solución a largo plazo es actualizar el agente de New Relic para Python a una versión que no use pkg_resources: + +1. **Verificar versión actual de New Relic**: + +```bash +pip show newrelic +``` + +2. 
**Actualizar a la última versión**: + +```txt +# En requirements.txt +newrelic>=9.0.0 +``` + +3. **Verificar compatibilidad** con tu versión de Python y aplicación + +4. **Probar exhaustivamente** en un entorno de desarrollo antes de desplegar en producción + + + + + +Como solución temporal, puedes suprimir la advertencia específica: + +1. **Agregar variable de entorno en tu configuración de despliegue**: + +```bash +export PYTHONWARNINGS="ignore::UserWarning:pkg_resources" +``` + +2. **O agregar a las variables de entorno de tu aplicación**: + +```yaml +# En la configuración de tu proyecto SleakOps +environment: + PYTHONWARNINGS: "ignore::UserWarning:pkg_resources" +``` + +**Nota**: Esto solo suprime la advertencia pero no soluciona el problema subyacente. + + + + + +Después de implementar cualquier solución: + +1. **Probar la compilación localmente** (si es posible): + +```bash +sleakops build -p api-mecubro -b master -w +``` + +2. **Monitorear los registros de compilación** para el mensaje de advertencia + +3. **Verificar funcionalidad de New Relic** tras el despliegue: + + - Comprobar que se están recolectando métricas + - Verificar que los datos APM aparecen en el panel de New Relic + - Probar la funcionalidad de seguimiento de errores + +4. **Revisar el rendimiento de la aplicación** para asegurar que no hay regresiones + + + + + +Para prevenir este problema en nuevos proyectos: + +1. **Usar gestión moderna de dependencias**: + + - Usar `pyproject.toml` en lugar de `setup.py` cuando sea posible + - Fijar versiones críticas de dependencias incluyendo herramientas de construcción + +2. **Actualizaciones regulares de dependencias**: + + - Mantener actualizado el agente de New Relic + - Monitorear advertencias de deprecación en CI/CD + +3. 
**Consistencia del entorno de construcción**: + - Usar versiones específicas de Python y paquetes + - Probar compilaciones en entornos similares a producción + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-old-pods-not-terminating.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-old-pods-not-terminating.mdx new file mode 100644 index 000000000..dc3d98599 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-old-pods-not-terminating.mdx @@ -0,0 +1,224 @@ +--- +sidebar_position: 3 +title: "Problema de Despliegue - Pods Antiguos No Se Terminan" +description: "Solución para cuando los nuevos despliegues crean pods pero los pods antiguos permanecen activos y reciben tráfico" +date: "2024-06-10" +category: "workload" +tags: + [ + "despliegue", + "kubernetes", + "pods", + "enrutamiento-de-tráfico", + "resolución-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Despliegue - Pods Antiguos No Se Terminan + +**Fecha:** 10 de junio de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Despliegue, Kubernetes, Pods, Enrutamiento de Tráfico, Resolución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario experimenta problemas al desplegar nuevas versiones donde se crean pods nuevos con éxito pero los pods antiguos de despliegues previos (creados hace 2 días) permanecen activos y siguen recibiendo tráfico en lugar de los pods nuevos. 
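Para situar el problema: un Service de Kubernetes enruta tráfico a todos los pods cuyas etiquetas contienen los pares clave/valor de su selector. La lógica de coincidencia, muy simplificada, sería algo así (esbozo ilustrativo con etiquetas hipotéticas, no código real de Kubernetes):

```python
# Esbozo: un Service selecciona como endpoints los pods cuyas etiquetas
# contienen TODOS los pares clave/valor del selector.

def coincide(selector: dict, etiquetas_pod: dict) -> bool:
    return all(etiquetas_pod.get(clave) == valor for clave, valor in selector.items())

selector = {"app": "rattlesnake", "version": "v2"}

pod_antiguo = {"app": "rattlesnake", "version": "v1"}
pod_nuevo = {"app": "rattlesnake", "version": "v2", "pod-hash": "7edb746"}

print(coincide(selector, pod_antiguo))  # False: no recibe tráfico
print(coincide(selector, pod_nuevo))    # True: sí recibe tráfico
```

Por eso, si el selector sigue coincidiendo con las etiquetas de los pods antiguos (o no coincide con las de los nuevos), el tráfico no cambia de destino aunque el despliegue "funcione".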
+ +**Síntomas Observados:** + +- Se crea un pod nuevo con la última etiqueta de imagen +- Los pods antiguos (de más de 2 días) no se terminan +- Todo el tráfico continúa siendo dirigido a los pods antiguos en lugar de a los nuevos +- El pod nuevo funciona correctamente cuando se accede directamente vía port-forward +- Los chequeos de salud del pod nuevo están pasando +- Al intentar eliminar manualmente los pods antiguos, estos se reinician + +**Configuración Relevante:** + +- Proyecto: `rattlesnake` +- Entorno: `development` +- Etiqueta de Imagen: `7edb7463a22198fc4d79bac76cfcb2c0f94b3755` +- Plataforma: Kubernetes (visto a través de Lens) + +**Condiciones de Error:** + +- El problema ocurre durante el proceso de despliegue +- Los pods antiguos se reinician al eliminarlos manualmente +- El enrutamiento de tráfico no cambia a los pods nuevos +- El problema ha ocurrido anteriormente con otros proyectos + +## Solución Detallada + + + +Este problema típicamente ocurre debido a uno de estos problemas en el despliegue de Kubernetes: + +1. **Mala configuración de la estrategia de despliegue**: Las configuraciones de actualización continua pueden estar impidiendo que los pods antiguos se terminen +2. **Restricciones de recursos**: Recursos insuficientes que impiden que los pods nuevos estén listos +3. **Desajuste en el selector del servicio**: El servicio no apunta a los pods nuevos +4. **Fallos en las sondas de disponibilidad (readiness probes)**: Los pods nuevos pueden no estar pasando las comprobaciones de disponibilidad +5. 
**Múltiples controladores de despliegue**: Controladores en conflicto gestionando los mismos pods + + + + + +Primero, verifica el estado actual de tu despliegue: + +```bash +# Ver estado del despliegue +kubectl get deployments -n +kubectl describe deployment -n + +# Ver pods y sus edades +kubectl get pods -n --show-labels +kubectl get pods -n -o wide + +# Ver conjuntos de réplicas +kubectl get replicasets -n +``` + +Busca: + +- Múltiples conjuntos de réplicas con pods +- Estado de disponibilidad de los pods +- Estado del despliegue y su progreso + + + + + +Verifica si el servicio apunta correctamente a los pods nuevos: + +```bash +# Ver endpoints del servicio +kubectl get endpoints -n +kubectl describe service -n + +# Comparar selector del servicio con etiquetas de los pods +kubectl get service -n -o yaml +kubectl get pods -n --show-labels +``` + +El selector del servicio debe coincidir con las etiquetas de tus pods nuevos, no con las antiguas. + + + + + +Aunque port-forward funciona, verifica si las sondas de disponibilidad están configuradas correctamente: + +```bash +# Ver estado de disponibilidad del pod +kubectl describe pod -n + +# Buscar configuración de readiness probe +kubectl get pod -n -o yaml | grep -A 10 readinessProbe +``` + +Si las sondas de disponibilidad fallan, el servicio no enviará tráfico a los pods nuevos. + + + + +Si el despliegue está atascado, puedes forzar un nuevo rollout: + +```bash +# Forzar un nuevo rollout del despliegue +kubectl rollout restart deployment -n + +# Monitorear el progreso del rollout +kubectl rollout status deployment -n + +# Ver historial de rollouts +kubectl rollout history deployment -n +``` + +Esto creará un nuevo conjunto de réplicas y terminará gradualmente los pods antiguos. 
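El ritmo de esa terminación gradual lo gobiernan `maxSurge` y `maxUnavailable` de la estrategia `RollingUpdate`: al convertir porcentajes en número de pods, Kubernetes redondea `maxSurge` hacia arriba y `maxUnavailable` hacia abajo. Un cálculo ilustrativo (esbozo propio):

```python
import math

# Esbozo: cuántos pods extra puede crear (surge) y cuántos pueden estar
# no disponibles durante un rolling update, con porcentajes al estilo de
# Kubernetes (maxSurge redondea hacia arriba; maxUnavailable, hacia abajo).

def limites_rolling_update(replicas: int, max_surge_pct: int, max_unavailable_pct: int):
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return surge, unavailable

# Con 10 réplicas y los valores por defecto de 25 % / 25 %:
print(limites_rolling_update(10, 25, 25))  # (3, 2)
```

Es decir, con 10 réplicas el rollout puede tener hasta 13 pods en total y un mínimo de 8 disponibles en cada momento.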
+ + + + + +Verifica si hay problemas de recursos que impidan el despliegue correcto: + +```bash +# Ver uso de recursos del nodo +kubectl top nodes +kubectl describe nodes + +# Ver límites de recursos de los pods +kubectl describe pod -n + +# Ver eventos del namespace +kubectl get events -n --sort-by='.lastTimestamp' +``` + +Busca eventos relacionados con: +- Falta de CPU o memoria +- Problemas de programación de pods +- Fallos en el pull de imágenes + + + + + +Si los métodos anteriores no funcionan, puedes realizar una limpieza manual: + +```bash +# Escalar el despliegue a 0 réplicas +kubectl scale deployment --replicas=0 -n + +# Esperar a que todos los pods se terminen +kubectl get pods -n -w + +# Escalar de vuelta al número deseado de réplicas +kubectl scale deployment --replicas= -n +``` + +**Precaución:** Este método causará tiempo de inactividad del servicio. + + + +## Prevención + +Para evitar este problema en el futuro: + +1. **Configurar correctamente la estrategia de despliegue:** + ```yaml + spec: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% + ``` + +2. **Asegurar sondas de disponibilidad apropiadas:** + ```yaml + readinessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 10 + periodSeconds: 5 + ``` + +3. **Monitorear despliegues regularmente:** + ```bash + kubectl rollout status deployment -n --timeout=300s + ``` + +4. 
**Usar herramientas de CI/CD que verifiquen el estado del despliegue antes de completar** + +## Recursos Adicionales + +- [Documentación oficial de Kubernetes sobre Despliegues](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) +- [Estrategias de Despliegue en Kubernetes](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy) +- [Configuración de Sondas de Salud](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-pending-nodepool-changes.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-pending-nodepool-changes.mdx new file mode 100644 index 000000000..9211dcf1c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-pending-nodepool-changes.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 3 +title: "Despliegue Pendiente Después de Cambios en Nodepool" +description: "Solución para despliegues que permanecen pendientes tras cambios en la configuración del nodepool" +date: "2024-01-23" +category: "cluster" +tags: ["despliegue", "nodepool", "karpenter", "aws", "escalado"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue Pendiente Después de Cambios en Nodepool + +**Fecha:** 23 de enero de 2024 +**Categoría:** Clúster +**Etiquetas:** Despliegue, Nodepool, Karpenter, AWS, Escalado + +## Descripción del Problema + +**Contexto:** Tras realizar cambios en la configuración del nodepool en SleakOps, los despliegues permanecen en estado pendiente y no aplican automáticamente la nueva configuración. La aplicación deja de funcionar hasta que se requiere intervención manual. 
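En esencia, un despliegue está "pendiente" cuando sus réplicas listas son menos que las deseadas. Un detector mínimo de esa condición (esbozo con datos de ejemplo, no integrado con SleakOps) sería:

```python
# Esbozo: detectar despliegues "pendientes" comparando réplicas listas
# contra las deseadas; los datos imitan la salida de `kubectl get deployments`.

despliegues = [
    {"nombre": "api", "deseadas": 3, "listas": 3},
    {"nombre": "worker", "deseadas": 2, "listas": 0},
    {"nombre": "scheduler", "deseadas": 1, "listas": 1},
]

pendientes = [d["nombre"] for d in despliegues if d["listas"] < d["deseadas"]]
print(pendientes)  # ['worker']
```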
+ +**Síntomas Observados:** + +- Los cambios en el nodepool permanecen pendientes y no se aplican automáticamente +- La aplicación deja de funcionar después de modificaciones en el nodepool +- Los despliegues requieren activación manual para aplicar los cambios pendientes +- El problema puede pasar desapercibido durante períodos prolongados + +**Configuración Relevante:** + +- Plataforma: AWS EKS con Karpenter +- Autoscaling: Activado +- Límites de recursos: Posiblemente asignación insuficiente de RAM +- Cuotas de CPU en AWS EC2: Puede estar al límite + +**Condiciones de Error:** + +- Ocurre después de cambios en la configuración del nodepool +- Los despliegues permanecen pendientes hasta ser activados manualmente +- Puede estar relacionado con límites de cuotas de recursos de AWS +- El autoscaling dispara escalado innecesario debido a límites bajos de memoria + +## Solución Detallada + + + +Cuando los cambios en el nodepool están pendientes: + +1. **Accede al Panel de SleakOps** +2. **Navega a tu proyecto** +3. **Dirígete a la sección de Despliegues** +4. **Activa manualmente el despliegue pendiente** +5. **Verifica que la aplicación esté funcionando** + +Esto aplicará inmediatamente los cambios pendientes en el nodepool y restaurará la funcionalidad de la aplicación. + + + + + +Para evitar escalados innecesarios que puedan activar límites de cuota de AWS: + +**Configuración recomendada de memoria:** + +- **RAM mínima**: 512MB +- **RAM máxima**: 1024MB + +```yaml +# Ejemplo de configuración de recursos +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1024Mi" + cpu: "500m" +``` + +**Por qué ayuda esto:** + +- Evita que el autoscaling se active por presión de memoria +- Reduce la provisión innecesaria de instancias EC2 +- Evita alcanzar los límites de cuotas de CPU de AWS + + + + + +Si encuentras límites de cuota en AWS: + +1. 
**Identifica el límite de cuota**: + + - Ve a la Consola de AWS → Service Quotas + - Busca "Amazon Elastic Compute Cloud (Amazon EC2)" + - Revisa las cuotas de "Running On-Demand" + +2. **Solicita aumento de cuota**: + + - Haz clic en "Request quota increase" + - Especifica el nuevo límite requerido + - Proporciona justificación comercial + - Envía la solicitud + +3. **Monitorea la solicitud**: + - AWS suele responder en 24-48 horas + - Recibirás notificaciones por correo electrónico sobre el estado + +**Nota**: Realiza la solicitud desde tu cuenta AWS de producción para un procesamiento más rápido. + + + + + +Para evitar pasar por alto despliegues pendientes: + +1. **Configura notificaciones**: + + - Configura alertas para cambios en el estado de despliegue + - Monitorea regularmente la pipeline de despliegue + +2. **Revisiones regulares**: + + - Revisa los despliegues pendientes diariamente + - Verifica la funcionalidad de la aplicación después de cambios en el nodepool + +3. **Monitoreo automatizado**: + + ```bash + # Verificar despliegues pendientes + kubectl get deployments --all-namespaces | grep -v "READY" + + # Monitorear estado de pods + kubectl get pods --all-namespaces | grep Pending + ``` + + + + + +Razones comunes por las que los despliegues quedan pendientes tras cambios en el nodepool: + +1. **Restricciones de recursos**: + + - Cuotas insuficientes de CPU/memoria + - Limitaciones de capacidad del nodo + +2. **Retrasos en la provisión de Karpenter**: + + - Límites de tasa en la API de AWS + - Disponibilidad de tipos de instancia + +3. **Conflictos de configuración**: + + - Solicitudes de recursos incompatibles + - Restricciones de programación + +4. 
**Problemas de sincronización**: + - Cambios aplicados en períodos de alto tráfico + - Conflictos de despliegue concurrentes + +**Estrategias de prevención**: + +- Aplica cambios en el nodepool durante períodos de bajo tráfico +- Asegura cuotas adecuadas en AWS antes del escalado +- Monitorea tendencias de utilización de recursos +- Prueba los cambios primero en un entorno de staging + + + +--- + +_Esta FAQ fue generada automáticamente el 23 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-redis-communication-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-redis-communication-error.mdx new file mode 100644 index 000000000..40db58e7a --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-redis-communication-error.mdx @@ -0,0 +1,158 @@ +--- +sidebar_position: 3 +title: "Fallo de Despliegue Debido a Error de Comunicación con Redis" +description: "Solución para fallos en despliegues causados por problemas de conectividad con Redis" +date: "2024-01-15" +category: "deployment" +tags: ["deployment", "redis", "communication", "troubleshooting", "push"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo de Despliegue Debido a Error de Comunicación con Redis + +**Fecha:** 15 de enero de 2024 +**Categoría:** Despliegue +**Etiquetas:** Despliegue, Redis, Comunicación, Solución de problemas, Push + +## Descripción del Problema + +**Contexto:** El usuario experimenta fallos en el despliegue al hacer push de código al repositorio. El proceso automático de despliegue se queda bloqueado y no se completa correctamente. 
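Mientras se diagnostica la causa, puede ser útil sondear el estado del despliegue con un número acotado de reintentos en lugar de esperar indefinidamente. El siguiente esbozo usa una función `check_status` ficticia; en la práctica sería una consulta a `kubectl` o a la API de SleakOps:

```shell
# `check_status` es un stub hipotético; en un uso real consultaría
# kubectl o la API de SleakOps por el estado del despliegue.
check_status() { echo "pending"; }

max_retries=3
i=0
status=""
while [ "$i" -lt "$max_retries" ]; do
  status=$(check_status)
  if [ "$status" = "succeeded" ]; then
    break
  fi
  i=$((i + 1))
  # sleep 30   # en un uso real, esperar entre intentos
done
echo "intentos: $i, último estado: $status"
```

Si el estado sigue en `pending` tras agotar los reintentos, corresponde escalar al soporte en lugar de seguir haciendo push.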
+ +**Síntomas observados:** + +- El proceso de despliegue se queda bloqueado después del git push +- El despliegue automático no se activa correctamente +- La acción de despliegue permanece en estado pendiente +- Inicialmente no se muestran mensajes de error al usuario + +**Configuración relevante:** + +- Proyecto: simplee-web-dev +- Entorno: dev +- Método de despliegue: Despliegue automático vía git push +- Plataforma: SleakOps + +**Condiciones de error:** + +- Error ocurre durante el proceso automático de despliegue +- Se desencadena por git push al repositorio +- La tubería de despliegue no logra completarse +- El problema persiste tras múltiples intentos de push + +## Solución Detallada + + + +Este problema suele ser causado por errores de comunicación con Redis dentro de la infraestructura de SleakOps. Redis se utiliza para: + +- Gestionar las colas de despliegue +- Almacenar información del estado del despliegue +- Coordinar entre diferentes servicios +- Cachear configuraciones de despliegue + +Cuando falla la comunicación con Redis, los despliegues quedan atascados en la cola y no pueden avanzar. + + + + + +Si encuentras este problema: + +1. **Contacta al Soporte de SleakOps**: Generalmente es un problema de infraestructura que requiere intervención administrativa +2. **Espera confirmación**: El equipo de soporte resolverá el problema de comunicación con Redis +3. **Reintenta el despliegue**: Una vez confirmado el arreglo, vuelve a intentar el despliegue +4. **Limpia despliegues conflictivos**: El soporte puede necesitar limpiar despliegues bloqueados + +**Ejemplo de interacción con soporte:** + +``` +Asunto: Despliegue atascado - error de comunicación con Redis +Proyecto: [nombre-de-tu-proyecto] +Entorno: [nombre-del-entorno] +Problema: El despliegue se queda atascado tras git push +``` + + + + + +Una vez resuelto el problema con Redis: + +1. **Verifica la solución**: Confirma con soporte que el problema está resuelto +2. 
**Limpia caché local:** + ```bash + git fetch --all + git status + ``` +3. **Dispara un nuevo despliegue:** + ```bash + # Haz un pequeño cambio o usa un commit vacío + git commit --allow-empty -m "Reintentar despliegue tras arreglo de Redis" + git push origin main + ``` +4. **Monitorea el despliegue:** Observa el progreso del despliegue en el panel de SleakOps + + + + + +Para minimizar el impacto de problemas similares: + +1. **Monitorea el estado del despliegue:** Siempre verifica la finalización del despliegue en el panel de SleakOps +2. **Configura notificaciones:** Configura alertas sobre el estado del despliegue +3. **Ten un plan de reversión:** Mantén lista una versión previa funcional +4. **Revisiones regulares de salud:** Monitorea la salud de tu aplicación después de los despliegues + +**Lista de verificación para monitoreo de despliegue:** + +- [ ] Despliegue iniciado correctamente +- [ ] Proceso de construcción completado +- [ ] Contenedor desplegado con éxito +- [ ] Verificación de salud de la aplicación pasada +- [ ] Actualización de enrutamiento de tráfico realizada + + + + + +Si el problema persiste tras el arreglo de Redis: + +1. **Revisa los registros de despliegue:** + + - Ve al panel de SleakOps + - Navega a tu proyecto + - Consulta el historial y los logs de despliegue + +2. **Verifica la configuración del repositorio:** + + ```yaml + # Revisa .sleakops/config.yml + version: "1" + environments: + dev: + cluster: tu-cluster + namespace: tu-namespace + ``` + +3. **Prueba con un cambio mínimo:** + + ```bash + # Actualiza la marca de tiempo o versión + echo "# Actualizado $(date)" >> README.md + git add README.md + git commit -m "Prueba de despliegue" + git push + ``` + +4. 
**Contacta soporte con detalles:** + - Nombre del proyecto y entorno + - Hora exacta del intento de despliegue + - Mensajes de error de los logs + - Hash del commit de git que falló + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-service-name-length-limit.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-service-name-length-limit.mdx new file mode 100644 index 000000000..bab25b999 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-service-name-length-limit.mdx @@ -0,0 +1,174 @@ +--- +sidebar_position: 3 +title: "Error de Límite de Longitud en el Nombre del Servicio de Kubernetes" +description: "Solución para nombres de servicio que exceden el límite de 63 caracteres durante el despliegue" +date: "2025-02-13" +category: "proyecto" +tags: ["despliegue", "kubernetes", "servicio", "nomenclatura", "helm"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Límite de Longitud en el Nombre del Servicio de Kubernetes + +**Fecha:** 13 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Kubernetes, Servicio, Nomenclatura, Helm + +## Descripción del Problema + +**Contexto:** El usuario encuentra un fallo en el despliegue al intentar desplegar un proyecto con un nombre largo en SleakOps. El nombre del servicio Kubernetes generado automáticamente excede el límite de 63 caracteres impuesto por Kubernetes. 
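La validez del nombre puede comprobarse antes de desplegar. El siguiente esbozo reproduce el patrón de nombre que describe esta guía (`{nombre-del-proyecto}-{entorno}-{nombre-del-proyecto}-svc`) con valores de ejemplo y verifica el límite de 63 caracteres:

```shell
# Valores de ejemplo; reemplace project y env_name por los de su proyecto.
project="velo-contacto-email-sender"
env_name="production"

# Patrón de nombre descrito en esta guía:
svc_name="${project}-${env_name}-${project}-svc"
len=${#svc_name}

echo "$svc_name ($len caracteres)"
if [ "$len" -gt 63 ]; then
  echo "EXCEDE el límite de 63 caracteres de Kubernetes"
else
  echo "OK: dentro del límite"
fi
```

Con el nombre de ejemplo el resultado supera los 63 caracteres, que es exactamente el fallo de validación que reporta Helm.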
+ +**Síntomas Observados:** + +- El despliegue falla durante el proceso de actualización con Helm +- Mensaje de error: "UPGRADE FAILED: failed to create resource" +- Error de validación del nombre del servicio: "debe tener no más de 63 caracteres" +- Los nombres de servicio generados automáticamente son demasiado largos + +**Configuración Relevante:** + +- Nombre del proyecto: `velo-contacto-email-sender` +- Entorno: `production` +- Nombre del servicio generado: `velo-contact-email-sender-production-velo-contact-email-sender-svc` +- Recuento de caracteres: Más de 63 caracteres + +**Condiciones del Error:** + +- Ocurre durante el proceso de despliegue +- Sucede con proyectos que tienen nombres largos +- Afecta a los nombres de recursos Kubernetes generados automáticamente +- No puede ser controlado directamente por la configuración del usuario + +## Solución Detallada + + + +Kubernetes tiene convenciones estrictas para los nombres de recursos: + +- **Nombres de servicio**: Máximo 63 caracteres +- **Nombres de pods**: Máximo 63 caracteres +- **Nombres de ConfigMap/Secret**: Máximo 253 caracteres +- Los nombres deben ser alfanuméricos en minúsculas con guiones + +SleakOps genera automáticamente nombres de servicio usando el patrón: +`{nombre-del-proyecto}-{entorno}-{nombre-del-proyecto}-svc` + +Esto puede fácilmente exceder el límite de 63 caracteres con nombres de proyecto largos. + + + + + +La solución más rápida es renombrar tu proyecto a un nombre más corto: + +1. **Ve a Configuración del Proyecto** en el panel de SleakOps +2. **Renombra el proyecto** a algo más corto (por ejemplo, `velo-email-sender`) +3. 
**Despliega nuevamente** el proyecto + +**Ejemplo:** + +``` +Original: velo-contacto-email-sender (24 caracteres) +Más corto: velo-email-sender (17 caracteres) +``` + +Esto generaría: +`velo-email-sender-production-velo-email-sender-svc` (51 caracteres) + + + + + +Si tu proyecto soporta nombres de servicio personalizados, puedes sobrescribir el valor por defecto: + +1. **En la configuración de tu proyecto**, busca las opciones de servicio +2. **Agrega un nombre de servicio personalizado**: + +```yaml +# sleakops.yaml o configuración equivalente +service: + name: "velo-contact-svc" # Nombre corto personalizado + type: ClusterIP + port: 80 +``` + +3. **Despliega nuevamente** el proyecto + + + + + +Si tienes acceso para personalizar el chart de Helm: + +1. **Crea un archivo de sobreescritura de valores**: + +```yaml +# values-override.yaml +service: + name: "velo-contact-svc" + fullnameOverride: "velo-contact-svc" + nameOverride: "contact-svc" +``` + +2. **Aplica durante el despliegue**: + +```bash +helm upgrade --install my-release ./chart \ + -f values-override.yaml +``` + + + + + +Para evitar este problema en el futuro: + +**Patrones recomendados para nombres:** + +- Usa abreviaciones: `velo-contact-sender` en lugar de `velo-contacto-email-sender` +- Mantén los nombres de proyecto por debajo de 20 caracteres cuando sea posible +- Usa nombres significativos pero concisos +- Evita palabras redundantes + +**Ejemplos de buenos nombres de proyecto:** + +``` +✅ velo-email-svc +✅ contact-sender +✅ notification-api +✅ user-auth + +❌ velo-contacto-email-sender-service +❌ my-super-long-project-name-with-details +❌ company-department-team-project-service +``` + + + + + +**Antes de crear proyectos:** + +1. **Calcula la longitud final del nombre del servicio**: + + - Nombre del proyecto + entorno + "svc" + separadores + - Debe ser menor a 60 caracteres para mayor seguridad + +2. 
**Usa una calculadora de nombres**: + + ``` + {nombre-del-proyecto}-{entorno}-{nombre-del-proyecto}-svc + + Ejemplo de cálculo: + "my-project" + "-production-" + "my-project" + "-svc" + = 10 + 12 + 10 + 4 = 36 caracteres ✅ + ``` + +3. **Prueba primero en staging** con el mismo nombre de proyecto + + + +--- + +_Esta FAQ fue generada automáticamente el 13 de febrero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-creating-status.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-creating-status.mdx new file mode 100644 index 000000000..a953aa5f1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-creating-status.mdx @@ -0,0 +1,182 @@ +--- +sidebar_position: 3 +title: "Despliegue Atascado en Estado de Creación" +description: "Solución para despliegues que se quedan atascados durante la fase de despliegue después de una compilación exitosa" +date: "2025-01-27" +category: "proyecto" +tags: + ["despliegue", "compilación", "solución de problemas", "atascado", "creando"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue Atascado en Estado de Creación + +**Fecha:** 27 de enero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Compilación, Solución de problemas, Atascado, Creando + +## Descripción del Problema + +**Contexto:** El usuario envía código a una rama y desencadena un despliegue a través de SleakOps. El proceso de compilación finaliza con éxito, pero la fase de despliegue queda atascada en estado "creando" durante períodos prolongados. 
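Antes de asumir que el despliegue está realmente bloqueado, conviene confirmar si hay pods en estado `Pending`. Este esbozo cuenta pods pendientes sobre una salida de ejemplo; en un clúster real la entrada vendría de `kubectl get pods -n tu-namespace --no-headers`:

```shell
# Salida de ejemplo (hipotética); en un clúster real:
#   kubectl get pods -n tu-namespace --no-headers
sample_output="api-7d9f 0/1 Pending 0 14m
worker-5c2a 1/1 Running 0 2h
api-8e1b 0/1 Pending 0 14m"

pending=$(printf '%s\n' "$sample_output" | grep -c 'Pending')
echo "Pods en Pending: $pending"
```

Si el conteo es mayor que cero, los pods sí se están creando pero no logran programarse, lo que apunta a limitaciones de recursos más que a un fallo de la plataforma.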
+ +**Síntomas Observados:** + +- El proceso de compilación finaliza con éxito +- El despliegue permanece en estado "creando" por más de 14 minutos +- No hay progreso visible ni mensajes de error durante la fase de despliegue +- La aplicación puede o no desplegarse realmente a pesar del estado atascado +- El problema ocurre de forma intermitente en ramas específicas + +**Configuración Relevante:** + +- Tipo de despliegue: `deploy_build_dev` +- Plataforma: SleakOps +- Estado de compilación: Exitoso +- Estado de despliegue: Atascado en "creando" + +**Condiciones de Error:** + +- Ocurre después de la finalización exitosa de la compilación +- La fase de despliegue se cuelga sin tiempo de espera +- No hay mensajes de error claros en la interfaz +- Puede requerir intervención manual para resolver + +## Solución Detallada + + + +Cuando un despliegue se queda atascado en estado "creando": + +1. **Revisar los registros del despliegue**: Buscar mensajes de error en los registros del despliegue +2. **Verificar el estado del clúster**: Asegurarse de que el clúster destino esté saludable y respondiendo +3. **Verificar el estado de los pods**: Confirmar si los pods se están creando realmente a pesar de que la interfaz muestre "creando" +4. 
**Monitorear el uso de recursos**: Comprobar si existen limitaciones de recursos (CPU, memoria, almacenamiento) + +```bash +# Verificar estado de pods en el clúster +kubectl get pods -n tu-namespace + +# Verificar estado del despliegue +kubectl get deployments -n tu-namespace + +# Verificar eventos para detectar problemas +kubectl get events -n tu-namespace --sort-by='.lastTimestamp' +``` + + + + + +**Limitaciones de Recursos:** + +- CPU o memoria insuficiente en el clúster +- Cuota de almacenamiento excedida +- Problemas de capacidad en los nodos + +**Solución:** Escalar los recursos del clúster u optimizar las solicitudes de recursos de la aplicación + +**Problemas con la Descarga de Imágenes:** + +- Imagen de contenedor no encontrada o inaccesible +- Problemas de autenticación en el registro +- Problemas de conectividad de red + +**Solución:** Verificar que la imagen exista y que las credenciales del registro sean válidas + +**Problemas de Configuración:** + +- Variables de entorno inválidas +- Secretos o config maps faltantes +- Configuraciones incorrectas de servicios + +**Solución:** Revisar y validar todos los archivos de configuración + + + + + +Contactar al soporte de SleakOps si: + +1. **El despliegue está atascado por más de 15 minutos** sin ningún progreso +2. **El clúster parece saludable** pero los despliegues fallan consistentemente +3. **No hay mensajes de error claros** en los registros o la interfaz +4. 
**Múltiples despliegues afectados** en diferentes proyectos + +Al contactar al soporte, proporcionar: + +- Nombre del proyecto y rama +- Marca de tiempo del despliegue +- Cualquier mensaje de error o registro +- Cambios recientes en infraestructura o aplicación + + + + + +**Gestión de Recursos:** + +```yaml +# Establecer límites de recursos apropiados +resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "256Mi" + cpu: "200m" +``` + +**Verificaciones de Salud:** + +```yaml +# Configurar verificaciones de salud adecuadas +livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + +readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + +**Estrategia de Despliegue:** + +```yaml +# Usar actualizaciones continuas para despliegues más seguros +strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 1 + maxSurge: 1 +``` + + + + + +Si el despliegue permanece atascado: + +1. **Cancelar el despliegue actual** (si la opción está disponible en la interfaz) +2. **Esperar la intervención del equipo de la plataforma** para desbloquear despliegues atascados +3. **Reintentar el despliegue** tras confirmación del soporte +4. 
**Monitorear de cerca** para detectar problemas similares en despliegues posteriores + +**Acciones post-resolución:** + +- Verificar que la aplicación esté funcionando correctamente +- Comprobar que todos los servicios sean accesibles +- Monitorear los registros de la aplicación para detectar problemas +- Documentar cualquier cambio de configuración realizado + + + +--- + +_Esta FAQ fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-starting-state.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-starting-state.mdx new file mode 100644 index 000000000..624d979f1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-starting-state.mdx @@ -0,0 +1,199 @@ +--- +sidebar_position: 3 +title: "Despliegue Atascado en Estado de Inicio" +description: "Solución para despliegues atascados en estado de inicio e incapaces de cancelar o modificar la configuración" +date: "2025-01-27" +category: "proyecto" +tags: + [ + "despliegue", + "dockerfile", + "compilación", + "solución-de-problemas", + "estado-inicio", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue Atascado en Estado de Inicio + +**Fecha:** 27 de enero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Dockerfile, Compilación, Solución de Problemas, Estado de Inicio + +## Descripción del Problema + +**Contexto:** El usuario experimenta un despliegue que queda atascado en estado "iniciando" durante horas, impidiendo la cancelación y bloqueando cambios en la configuración del Dockerfile. 
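Antes de hacer push, una preverificación local de los argumentos de compilación evita enviar a la plataforma una configuración incompleta. Esbozo con valores hipotéticos:

```shell
# Valores hipotéticos: en CI vendrían del entorno o de un gestor de secretos.
GITHUB_USER="usuario-demo"
GITHUB_TOKEN="token-demo"

errors=0
[ -n "$GITHUB_USER" ]  || { echo "Falta GITHUB_USER";  errors=$((errors + 1)); }
[ -n "$GITHUB_TOKEN" ] || { echo "Falta GITHUB_TOKEN"; errors=$((errors + 1)); }
# Dentro del repositorio, comprobar además que el Dockerfile exista en la raíz:
#   [ -f Dockerfile ] || echo "Falta Dockerfile en la raíz"

echo "errores de preverificación: $errors"
```

Un resultado de cero errores no garantiza que la compilación remota funcione, pero descarta las causas más frecuentes de este estado atascado.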
+ +**Síntomas Observados:** + +- El despliegue muestra estado "iniciando" por períodos prolongados (horas en lugar de minutos) +- No se puede cancelar el despliegue atascado +- Incapacidad para modificar la configuración del Dockerfile mientras el despliegue está en estado de inicio +- El proceso de compilación parece colgarse o fallar silenciosamente +- La plataforma muestra "hace unos segundos" pero el tiempo real transcurrido es mucho mayor + +**Configuración Relevante:** + +- Dockerfile multi-etapa con compilación Gradle +- Dependencias de repositorio privado de GitHub que requieren autenticación +- Argumentos de compilación Docker: `GITHUB_USER` y `GITHUB_TOKEN` +- Aplicación Java con Spring Boot +- Script de inicio complejo con validación de variables de entorno + +**Condiciones de Error:** + +- Problemas de detección del Dockerfile que causan estado inconsistente +- Argumentos de compilación Docker faltantes o mal configurados +- Dockerfile no encontrado en la raíz del repositorio +- El proceso de compilación falla pero no reporta correctamente el estado de error + +## Solución Detallada + + + +Cuando un despliegue queda atascado en estado de inicio: + +1. **Contactar Soporte**: La plataforma puede requerir intervención manual para desbloquear el estado atascado +2. **Esperar Desbloqueo Manual**: El equipo de soporte puede desbloquear manualmente el despliegue desde el backend +3. **Verificar Estado**: Una vez desbloqueado, el despliegue debería completarse o mostrar el estado de error adecuado +4. **Revisar Salud de la Aplicación**: Asegurarse que la aplicación tenga configurados chequeos de salud apropiados + +**Nota**: Esto típicamente requiere intervención de un administrador de la plataforma y no puede ser resuelto directamente por los usuarios. + + + + + +Para aplicaciones que requieren acceso a repositorios privados de GitHub: + +```dockerfile +# Etapa de Compilación +FROM gradle:jdk21-alpine AS build + +WORKDIR /workspace +COPY . . 
+ +# Definir argumentos de compilación para autenticación GitHub +ARG GITHUB_USER +ARG GITHUB_TOKEN + +# Establecer variables de entorno para acceso a GitHub +ENV GITHUB_USER=$GITHUB_USER +ENV GITHUB_TOKEN=$GITHUB_TOKEN + +# Configurar propiedades de Gradle para acceso a repositorio privado +RUN echo "gpr.user=$GITHUB_USER" >> ~/.gradle/gradle.properties && \ + echo "gpr.key=$GITHUB_TOKEN" >> ~/.gradle/gradle.properties + +# Compilar la aplicación +RUN gradle clean assemble --no-daemon +``` + +**Importante**: Asegúrese que los argumentos de compilación Docker estén correctamente configurados en los ajustes del proyecto SleakOps. + + + + + +Para configurar los argumentos de compilación Docker: + +1. Ir a **Configuración del Proyecto** → **Configuración de Dockerfile** +2. Añadir los argumentos de compilación requeridos: + - `GITHUB_USER`: Su nombre de usuario de GitHub + - `GITHUB_TOKEN`: Token de acceso personal con permisos al repositorio +3. **Guardar Cambios**: Asegurarse que la configuración se haya guardado correctamente +4. **Verificar**: Confirmar que los argumentos aparecen en la configuración + +**Solución de Problemas**: + +- Si el botón de guardar está deshabilitado, contactar soporte para desbloquear la configuración +- Asegurar que los tokens tengan permisos adecuados para acceso a repositorios privados +- Usar tokens de acceso personal con permisos finos cuando sea posible + + + + + +Problemas comunes en la detección del Dockerfile: + +1. **Ubicación del Dockerfile**: Debe estar en la raíz del repositorio o en la ruta especificada +2. **Selección de Rama**: Asegurarse de desplegar desde la rama correcta +3. **Nombre del Archivo**: Usar el nombre exacto `Dockerfile` (sensible a mayúsculas/minúsculas) +4. 
**Configuración de Ruta**: Si el Dockerfile está en un subdirectorio, especificar la ruta del contexto de compilación + +**Pasos para Verificación**: + +```bash +# Verificar si Dockerfile existe en la raíz +ls -la Dockerfile + +# Verificar rama actual +git branch --show-current + +# Comprobar permisos del archivo +ls -la Dockerfile +``` + + + + + +Asegúrese que su aplicación tenga chequeos de salud adecuados: + +```dockerfile +# Añadir chequeo de salud al Dockerfile +HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \ + CMD curl -f http://localhost:${HEALTH_PORT}/health || exit 1 +``` + +**En su script de inicio**: + +```bash +# Validar variables de entorno requeridas +REQUIRED_VARS=( + "PORT" + "HEALTH_PORT" + "PRIVATE_API_KEY" + "MONGODB_URI" + "MONGODB_DATABASE" + # ... otras variables necesarias +) + +for var in "${REQUIRED_VARS[@]}"; do + if [[ -z "${!var:-}" ]]; then + echo "[ERROR] La variable de entorno $var no está configurada." >&2 + exit 1 + fi +done +``` + + + + + +Para evitar que los despliegues queden atascados: + +1. **Probar Dockerfile Localmente**: Siempre pruebe sus compilaciones Dockerfile localmente primero +2. **Validar Variables de Entorno**: Asegurarse que todas las variables de entorno requeridas estén configuradas +3. **Usar Chequeos de Salud Apropiados**: Implementar chequeos de salud completos +4. **Monitorear Logs de Compilación**: Revisar los logs de compilación para detección temprana de errores +5. **Despliegue Gradual**: Probar con configuración mínima primero, luego añadir complejidad + +**Pruebas Locales**: + +```bash +# Compilar localmente para probar +docker build --build-arg GITHUB_USER=tu_usuario --build-arg GITHUB_TOKEN=tu_token . 
+ +# Ejecutar prueba local + +docker run -p 8080:8080 tu-imagen +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-state-resolution.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-state-resolution.mdx new file mode 100644 index 000000000..edd7bd527 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-stuck-state-resolution.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Despliegue Atascado en Estado Pendiente" +description: "Solución para despliegues que se quedan atascados y bloquean despliegues posteriores" +date: "2024-12-28" +category: "proyecto" +tags: ["despliegue", "atascado", "solución de problemas", "compilación"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue Atascado en Estado Pendiente + +**Fecha:** 28 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Atascado, Solución de Problemas, Compilación + +## Descripción del Problema + +**Contexto:** El usuario experimenta un despliegue que queda atascado en estado pendiente, impidiendo que se creen nuevos despliegues. El pod de despliegue no se crea y las compilaciones permanecen en estado "iniciando" indefinidamente. 
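Esta guía recomienda escalar a soporte cuando las compilaciones llevan más de 30 minutos atascadas. Un cálculo simple del tiempo transcurrido (aquí con marcas de tiempo hipotéticas) puede automatizar esa decisión:

```shell
# Marcas de tiempo hipotéticas (epoch, en segundos). En la práctica:
#   started_epoch se obtiene del registro de la compilación
#   now_epoch con `date +%s`
started_epoch=1000
now_epoch=3400

elapsed_min=$(( (now_epoch - started_epoch) / 60 ))
if [ "$elapsed_min" -gt 30 ]; then
  echo "Escalar a soporte: $elapsed_min minutos en estado iniciando"
else
  echo "Seguir monitoreando: $elapsed_min minutos"
fi
```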
+ +**Síntomas Observados:** + +- Despliegue atascado en estado pendiente sin crear pods +- No se pueden crear nuevos despliegues para el mismo proyecto +- Múltiples compilaciones atascadas en estado "iniciando" +- No se muestran mensajes de error en la interfaz +- El despliegue parece estar bloqueando toda la canalización de despliegue + +**Configuración Relevante:** + +- Entorno: Producción +- Proyecto: Cualquier proyecto que experimente despliegues atascados +- Estado de compilación: "Iniciando" (indefinidamente) +- Estado de despliegue: Pendiente/Atascado + +**Condiciones de Error:** + +- Ocurre durante el proceso de creación del despliegue +- Bloquea intentos de despliegue posteriores +- Las compilaciones no avanzan más allá del estado "iniciando" +- No se activa ningún mecanismo automático de recuperación + +## Solución Detallada + + + +Si encuentras un despliegue atascado, contacta inmediatamente al equipo de soporte de SleakOps. La resolución normalmente implica: + +1. **Actualización del Estado de la Tarea:** El equipo de soporte identificará tareas con estados desactualizados +2. **Corrección Manual del Estado:** Actualizar los estados de las tareas para reflejar la realidad actual +3. 
**Desbloqueo de la Canalización:** Limpiar la cola de despliegue para permitir nuevos despliegues + +**Información de Contacto:** + +- Correo electrónico: support@sleakops.com +- Incluye el nombre de tu proyecto y detalles del despliegue + + + + + +Para minimizar la ocurrencia de despliegues atascados: + +**Monitorea tus Despliegues:** + +```bash +# Revisa el estado de despliegue regularmente +sleakops deployment list --project tu-proyecto + +# Monitorea el progreso de compilaciones +sleakops build list --status starting +``` + +**Establece Tiempos de Espera Razonables:** + +- Configura tiempos de espera adecuados para compilaciones en la configuración de tu proyecto +- Monitorea compilaciones de larga duración que puedan indicar problemas + +**Chequeos de Salud Regulares:** + +- Verifica la salud de la canalización de despliegue antes de lanzamientos críticos +- Prueba despliegues primero en entorno de staging + + + + + +Antes de contactar soporte, prueba estos pasos de diagnóstico: + +**1. Revisa los Logs del Despliegue:** + +```bash +sleakops deployment logs --deployment-id +``` + +**2. Verifica Disponibilidad de Recursos:** + +- Comprueba si tu clúster tiene recursos suficientes +- Verifica la capacidad de nodos y límites de pods + +**3. Revisa los Logs de Compilación:** + +```bash +sleakops build logs --build-id +``` + +**4. Verifica el Estado del Proyecto:** + +```bash +sleakops project status --project +``` + +**5. 
Verifica Dependencias:** + +- Asegúrate de que todos los servicios requeridos estén en ejecución +- Revisa la conectividad con la base de datos +- Verifica la disponibilidad de servicios externos + + + + + +Escala al soporte de SleakOps cuando: + +**Escalación Inmediata Requerida:** + +- Despliegues en producción están bloqueados +- Múltiples compilaciones atascadas por más de 30 minutos +- Operaciones críticas del negocio están afectadas + +**Información a Incluir:** + +- Nombre del proyecto y entorno +- ID y marca temporal del despliegue +- IDs de compilaciones que están atascadas +- Cualquier mensaje de error o logs +- Descripción del impacto en el negocio + +**Respuesta de Soporte:** + +- Problemas en producción: Dentro de 2 horas +- Problemas no producción: Dentro de 24 horas +- Fallas críticas: Respuesta inmediata + + + + + +Después de que soporte resuelva el problema: + +**1. Verifica la Canalización de Despliegue:** + +```bash +# Prueba un despliegue simple +sleakops deploy --project --environment staging +``` + +**2. Revisa el Sistema de Compilación:** + +```bash +# Dispara una nueva compilación +sleakops build create --project +``` + +**3. Monitorea para Reincidencias:** + +- Observa de cerca los despliegues posteriores +- Reporta cualquier problema similar inmediatamente +- Documenta cualquier patrón observado + +**4. 
Actualiza el Monitoreo:** + +- Configura alertas para despliegues atascados +- Configura notificaciones para compilaciones de larga duración + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 28 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-teams-instances-not-creating.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-teams-instances-not-creating.mdx new file mode 100644 index 000000000..e01003191 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-teams-instances-not-creating.mdx @@ -0,0 +1,196 @@ +--- +sidebar_position: 3 +title: "Despliegue y Creación de Instancias de Teams No Funcionan" +description: "Solución de problemas de fallos en despliegues y problemas en la creación de instancias de Teams" +date: "2024-01-15" +category: "proyecto" +tags: ["despliegue", "teams", "instancias", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue y Creación de Instancias de Teams No Funcionan + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Teams, Instancias, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario reporta que los despliegues no se están ejecutando y que las instancias de Teams no se están creando en la plataforma SleakOps. 
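Como paso previo a revisar DNS y balanceador de carga, un chequeo local del formato del dominio descarta errores tipográficos. Es solo una validación de forma (esbozo), no sustituye a `nslookup` ni `curl`:

```shell
# Dominio tomado de la configuración descrita en esta guía.
domain="teams.simplee.cl"

valid=1
case "$domain" in
  *[!a-z0-9.-]*) valid=0 ;;  # caracteres no válidos para un nombre DNS
  *.*) : ;;                  # contiene al menos un punto: parece un FQDN
  *)   valid=0 ;;            # sin punto: no es un dominio completo
esac
echo "dominio=$domain valid=$valid"
```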
+ +**Síntomas Observados:** + +- Los despliegues no se disparan ni completan +- No se crean instancias de Teams +- No hay progreso visible en la canalización de despliegue +- Los servicios pueden aparecer atascados en estado pendiente + +**Configuración Relevante:** + +- Dominio: teams.simplee.cl +- Plataforma: SleakOps +- Tipo de despliegue: instancias de Teams + +**Condiciones de Error:** + +- El proceso de despliegue no inicia o no se completa +- El proceso de creación de instancias no se ejecuta +- Puede afectar múltiples servicios o entornos + +## Solución Detallada + + + +Primero, verifica el estado actual de tus despliegues: + +1. **Navega a tu Panel de Proyecto** +2. **Revisa la sección de Despliegues** para detectar despliegues fallidos o pendientes +3. **Revisa los registros del despliegue** en busca de mensajes de error +4. **Verifica el estado de la compilación** en la canalización CI/CD + +Busca indicadores comunes: + +- Indicadores de estado en rojo +- Mensajes de error en los registros +- Estados "pendientes" atascados +- Problemas de asignación de recursos + + + + + +Revisa la configuración de tu instancia de Teams: + +1. **Ve a Workloads** → **Servicios de Teams** +2. **Verifica la configuración** para teams.simplee.cl +3. **Revisa la asignación de recursos**: + - Límites de CPU y memoria + - Requerimientos de almacenamiento + - Configuración de red + +```yaml +# Ejemplo de configuración de Teams +apiVersion: v1 +kind: Service +metadata: + name: teams-service +spec: + selector: + app: teams + ports: + - port: 80 + targetPort: 8080 + type: LoadBalancer +``` + + + + + +Verifica que tu clúster tenga recursos suficientes: + +1. **Revisa los Recursos de los Nodos**: + + - CPU y memoria disponibles + - Espacio en disco + - Capacidad de red + +2. **Verifica las Cuotas**: + + - Cuotas de recursos por espacio de nombres + - Límites a nivel de clúster + - Cuotas específicas del proveedor (AWS, Azure, GCP) + +3. 
**Revisa el Estado de los Pods**: + ```bash + kubectl get pods -n tu-namespace + kubectl describe pod <nombre-del-pod> -n tu-namespace + ``` + + + + + +Para las instancias de Teams, la configuración de red es crítica: + +1. **Revisa la Configuración DNS**: + + - Verifica los registros DNS de teams.simplee.cl + - Asegúrate de que los registros A/CNAME apunten correctamente al balanceador de carga + +2. **Verifica el Balanceador de Carga**: + + - Comprueba si el balanceador de carga está provisionado + - Verifica los grupos de seguridad y reglas de firewall + +3. **Prueba la Conectividad**: + ```bash + nslookup teams.simplee.cl + curl -I https://teams.simplee.cl + ``` + + + + + +**Reiniciar el Proceso de Despliegue**: + +1. Navega a tu proyecto +2. Ve a la pestaña **Despliegues** +3. Haz clic en **Re-desplegar** en el despliegue fallido + +**Eliminar Recursos Atascados**: + +1. Elimina manualmente pods atascados si es necesario +2. Reinicia los controladores de despliegue +3. Revisa bloqueos o conflictos de recursos + +**Actualizar Configuración**: + +1. Verifica todas las variables de entorno +2. Revisa secretos y configmaps +3. Asegúrate de que los repositorios de imágenes sean accesibles + +**Escalar Recursos**: + +1. Incrementa la capacidad de los nodos si es necesario +2. Ajusta las solicitudes y límites de recursos +3. Considera usar tipos de instancia diferentes + + + + + +Escala al soporte de SleakOps si: + +1. **Problemas de Infraestructura**: + + - Los nodos del clúster no responden + - Fallos en servicios a nivel de proveedor + - Fallas persistentes en asignación de recursos + +2. **Problemas de Plataforma**: + + - El panel de SleakOps no responde + - La canalización de despliegue está completamente rota + - Fallos de autenticación o autorización + +3. 
**Preocupaciones por Pérdida de Datos**: + - Volúmenes persistentes que no se montan + - Problemas de conectividad con la base de datos + - Problemas con respaldo y recuperación + +**Información a proporcionar al escalar**: + +- Nombre del proyecto y entorno +- Mensajes de error específicos de los registros +- Línea de tiempo de cuándo comenzó el problema +- Cambios recientes realizados en la configuración +- Capturas de pantalla de los estados de error + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-timeout-database-migrations.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-timeout-database-migrations.mdx new file mode 100644 index 000000000..7e16273c1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/deployment-timeout-database-migrations.mdx @@ -0,0 +1,405 @@ +--- +sidebar_position: 3 +title: "Tiempo de Espera en el Despliegue Durante Migraciones de Base de Datos" +description: "Solución para fallos en el despliegue causados por tiempos de espera en migraciones de base de datos en tareas previas al despliegue" +date: "2024-04-25" +category: "proyecto" +tags: + [ + "despliegue", + "base de datos", + "migraciones", + "tiempo de espera", + "pre-despliegue", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Tiempo de Espera en el Despliegue Durante Migraciones de Base de Datos + +**Fecha:** 25 de abril de 2024 +**Categoría:** Proyecto +**Etiquetas:** Despliegue, Base de Datos, Migraciones, Tiempo de Espera, Pre-despliegue + +## Descripción del Problema + +**Contexto:** El despliegue del servicio backend falla durante la fase previa al despliegue al ejecutar migraciones de base de datos, causando que todo el proceso de despliegue termine por tiempo de espera y falle. 
+ +**Síntomas Observados:** + +- El despliegue falla durante la ejecución de la tarea previa al despliegue +- Las migraciones de base de datos tardan demasiado en completarse +- El proceso termina por límites de tiempo de espera +- Múltiples intentos de despliegue muestran comportamiento similar de tiempo de espera +- Pueden aparecer errores secundarios en la aplicación (ImportError) tras fallos por tiempo de espera + +**Configuración Relevante:** + +- Tipo de servicio: Aplicación backend +- Tarea previa al despliegue: Migraciones de base de datos +- Infraestructura: Nodos gestionados por Karpenter (no se detectaron problemas de aprovisionamiento) +- Base de datos: Base de datos de producción con posibles grandes volúmenes de datos + +**Condiciones de Error:** + +- El error ocurre durante la fase previa al despliegue +- Las migraciones exceden los límites de tiempo configurados +- El problema aparece consistentemente en los intentos de despliegue +- Puede ir seguido de errores de importación en la aplicación en intentos posteriores + +## Solución Detallada + + + +Los tiempos de espera en migraciones de base de datos suelen ocurrir debido a: + +1. **Migraciones de grandes volúmenes de datos**: Operaciones en tablas con millones de registros +2. **Cambios en el esquema**: Añadir índices o columnas a tablas grandes +3. **Contención de bloqueos**: Migraciones que entran en conflicto con conexiones activas a la base de datos +4. **Restricciones de recursos**: CPU/memoria insuficiente en la base de datos durante la migración +5. 
**Latencia de red**: Conexión lenta entre la aplicación y la base de datos + +Para diagnosticar la causa específica: + +```bash +# Revisar los logs de migración +kubectl logs -f deployment/your-backend-service -c pre-deploy + +# Monitorizar el rendimiento de la base de datos durante la migración +# (Ejemplo AWS RDS) +aws rds describe-db-instances --db-instance-identifier your-db +``` + + + + + +En SleakOps, puedes configurar límites de tiempo de espera más largos para las tareas previas al despliegue: + +1. Ve a tus **Configuraciones del Proyecto** +2. Navega a **Configuración de Despliegue** +3. Busca **Configuración de Tareas Previas al Despliegue** +4. Incrementa el valor de **Timeout**: + +```yaml +# ejemplo sleakops.yaml +services: + backend: + pre_deploy: + timeout: 1800 # 30 minutos en lugar de los 10 minutos por defecto + command: "python manage.py migrate" +``` + +Valores recomendados para timeout: + +- Aplicaciones pequeñas: 600 segundos (10 minutos) +- Aplicaciones medianas: 1200 segundos (20 minutos) +- Aplicaciones grandes: 1800+ segundos (30+ minutos) + + + + + +Para hacer las migraciones más rápidas y confiables: + +**1. Dividir migraciones grandes:** + +```python +from django.db import migrations + +# En lugar de una migración grande +class Migration(migrations.Migration): + operations = [ + # 50 operaciones aquí + ] + +# Dividir en migraciones más pequeñas +class Migration001(migrations.Migration): + operations = [ + # 10 operaciones aquí + ] + +class Migration002(migrations.Migration): + operations = [ + # 10 operaciones más aquí + ] +``` + +**2. 
Usar optimizaciones específicas de la base de datos:** + +```python +# Ejemplo PostgreSQL - añadir índices concurrentemente +from django.contrib.postgres.operations import AddIndexConcurrently +from django.db import migrations, models + +class Migration(migrations.Migration): + atomic = False # Requerido para operaciones concurrentes + operations = [ + AddIndexConcurrently( + model_name='yourmodel', + index=models.Index(fields=['field_name'], name='idx_field_name'), + ), + ] +``` + +**3. Ejecutar migraciones pesadas fuera de línea:** + +```bash +# Para migraciones muy grandes, ejecútalas manualmente durante ventanas de mantenimiento +kubectl exec -it deployment/backend-service -- python manage.py migrate --plan +kubectl exec -it deployment/backend-service -- python manage.py migrate app_name migration_number +``` + + + + + +Si los tiempos de espera persisten, considera estas estrategias de despliegue: + +**1. Despliegue Blue-Green con migración manual:** + +```yaml +# sleakops.yaml +services: + backend: + deployment_strategy: blue_green + pre_deploy: + enabled: false # Deshabilitar migraciones automáticas + health_check: + path: /health + timeout: 30 +``` + +Luego ejecuta las migraciones manualmente: + +```bash +# Después de que el entorno azul esté listo +kubectl exec -it deployment/backend-service-blue -- python manage.py migrate +# Cambiar el tráfico después de completar la migración +``` + +**2. Despliegue Rolling con trabajos de migración:** + +```yaml +# Crear un job separado para migraciones +apiVersion: batch/v1 +kind: Job +metadata: + name: db-migration-job +spec: + template: + spec: + containers: + - name: migrate + image: your-backend-image + command: ["python", "manage.py", "migrate"] + restartPolicy: Never + backoffLimit: 3 +``` + +**3. 
Pooling de conexiones a la base de datos:** + +```python +# settings.py - Optimizar conexiones a base de datos +DATABASES = { + 'default': { + 'ENGINE': 'django.db.backends.postgresql', + 'CONN_MAX_AGE': 600, # Conexiones persistentes (reutilización, no un pool real) + # MAX_CONNS/MIN_CONNS requieren un backend con pool (p. ej. django-db-geventpool); + # el backend estándar de PostgreSQL no acepta estas opciones + 'OPTIONS': { + 'MAX_CONNS': 20, + 'MIN_CONNS': 5, + } + } +} +``` + + + + + +Para prevenir futuros problemas de tiempo de espera: + +**1. Monitorizar el rendimiento de las migraciones:** + +```python +# Añadir logging a las migraciones +import logging +import time + +from django.db import migrations + +logger = logging.getLogger(__name__) + +class Migration(migrations.Migration): + def apply(self, project_state, schema_editor, collect_sql=False): + logger.info(f"Iniciando migración {self.name}") + start_time = time.time() + result = super().apply(project_state, schema_editor, collect_sql) + duration = time.time() - start_time + logger.info(f"Migración {self.name} completada en {duration:.2f} segundos") + return result +``` + +**2. Configurar alertas:** + +```yaml +# Alerta cuando las migraciones tardan demasiado +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: migration-alerts +spec: + groups: + - name: migrations + rules: + - alert: MigrationTimeout + expr: increase(django_migration_duration_seconds[5m]) > 300 + labels: + severity: warning + annotations: + summary: "Migración de base de datos tardando demasiado" +``` + +**3. Probar migraciones en staging:** + +```bash +# Siempre probar con volúmenes de datos similares a producción +# Usar snapshots de base de datos para pruebas realistas + +# Crear snapshot de producción para testing +aws rds create-db-snapshot \ + --db-instance-identifier prod-db \ + --db-snapshot-identifier migration-test-$(date +%Y%m%d) + +# Restaurar snapshot en staging +aws rds restore-db-instance-from-db-snapshot \ + --db-instance-identifier staging-migration-test \ + --db-snapshot-identifier migration-test-$(date +%Y%m%d) +``` + +**4. 
Implementar métricas de migración:** + +```python +# middleware/migration_metrics.py +import time + +class MigrationMetrics: + @staticmethod + def measure_migration_time(migration_func): + def wrapper(*args, **kwargs): + start_time = time.time() + result = migration_func(*args, **kwargs) + end_time = time.time() + + # Log métricas + print(f"Migración completada en {end_time - start_time:.2f} segundos") + + # Enviar métricas a sistema de monitoreo + # send_metric('migration.duration', end_time - start_time) + + return result + return wrapper +``` + + + + + +**1. Configurar parámetros de base de datos para migraciones:** + +```sql +-- PostgreSQL: Optimizar para migraciones grandes (ajustes por sesión) +SET maintenance_work_mem = '2GB'; +SET max_parallel_workers = 8; +SET max_parallel_maintenance_workers = 4; + +-- MySQL: Configuraciones para migraciones (requieren privilegios para SET GLOBAL) +SET GLOBAL innodb_buffer_pool_size = 2147483648; -- 2GB +-- innodb_log_file_size no admite SET: ajústalo en my.cnf y reinicia el servidor +SET GLOBAL innodb_flush_log_at_trx_commit = 2; +``` + +**2. Usar transacciones más pequeñas:** + +```python +# Para migraciones de datos grandes +from django.db import migrations, transaction + +def migrate_data_in_batches(apps, schema_editor): + Model = apps.get_model('app', 'Model') + batch_size = 1000 + + total_count = Model.objects.count() + processed = 0 + + while processed < total_count: + with transaction.atomic(): + batch = Model.objects.all()[processed:processed + batch_size] + for obj in batch: + # Procesar objeto + obj.save() + processed += batch_size + print(f"Procesados {min(processed, total_count)}/{total_count} registros") + +class Migration(migrations.Migration): + atomic = False # Cada lote gestiona su propia transacción + operations = [ + migrations.RunPython(migrate_data_in_batches), + ] +``` + +**3. 
Índices y constraints:** + +```python +# Crear índices después de insertar datos +class Migration(migrations.Migration): + operations = [ + # Primero insertar datos + migrations.RunPython(insert_data), + # Luego crear índices + migrations.AddIndex( + model_name='model', + index=models.Index(fields=['field'], name='idx_field'), + ), + ] +``` + + + + + +**Antes del despliegue:** + +- [ ] Verificar el tamaño de las migraciones pendientes +- [ ] Probar migraciones en staging con datos de producción +- [ ] Configurar timeouts apropiados +- [ ] Preparar plan de rollback +- [ ] Verificar recursos de base de datos disponibles + +**Durante problemas de timeout:** + +- [ ] Revisar logs de migración para identificar operación lenta +- [ ] Verificar métricas de CPU/memoria de la base de datos +- [ ] Comprobar conexiones activas a la base de datos +- [ ] Evaluar si hay bloqueos de tabla +- [ ] Considerar ejecutar migración manualmente + +**Después de resolver:** + +- [ ] Documentar la causa raíz del problema +- [ ] Actualizar timeouts si es necesario +- [ ] Mejorar monitoreo para futuras migraciones +- [ ] Revisar estrategia de migración para casos similares + +**Comandos útiles para diagnóstico:** + +```bash +# Ver migraciones pendientes +kubectl exec -it deployment/backend -- python manage.py showmigrations + +# Ver plan de migración +kubectl exec -it deployment/backend -- python manage.py migrate --plan + +# Ejecutar migración específica +kubectl exec -it deployment/backend -- python manage.py migrate app_name migration_name + +# Ver estado de la base de datos +kubectl exec -it deployment/backend -- python manage.py dbshell +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 25 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-celery-appregistrynotready-error.mdx 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-celery-appregistrynotready-error.mdx new file mode 100644 index 000000000..504125d49 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-celery-appregistrynotready-error.mdx @@ -0,0 +1,213 @@ +--- +sidebar_position: 3 +title: "Error AppRegistryNotReady de Django Celery" +description: "Solución para el error AppRegistryNotReady de Django al ejecutar trabajos cron con Celery" +date: "2024-12-26" +category: "carga de trabajo" +tags: + ["django", "celery", "cron", "appregistrynotready", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error AppRegistryNotReady de Django Celery + +**Fecha:** 26 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Django, Celery, Cron, AppRegistryNotReady, Solución de problemas + +## Descripción del problema + +**Contexto:** El usuario experimenta errores al ejecutar trabajos cron con Celery en una aplicación Django desplegada en la plataforma SleakOps. 
+ +**Síntomas observados:** + +- Error `django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.` +- El error ocurre al usar el comando `celery -A (App) call function` +- Los trabajos cron no se ejecutan correctamente +- La aplicación parece no estar completamente inicializada cuando se ejecutan las tareas de Celery + +**Configuración relevante:** + +- Framework: Django con Celery +- Ejecución de tareas: trabajos cron +- Formato de comando: `celery -A (App) call function` +- Plataforma: SleakOps + +**Condiciones del error:** + +- El error ocurre durante la ejecución de tareas Celery +- Sucede específicamente con la programación de trabajos cron +- Las apps de Django no están correctamente cargadas cuando la tarea se ejecuta +- Puede estar relacionado con la configuración de Django + +## Solución detallada + + + +Asegúrate de que Django esté correctamente inicializado antes de que Celery inicie: + +```python +# celery.py +import os +from celery import Celery +from django.conf import settings + +# Establecer el módulo de configuración por defecto de Django +os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'tu_proyecto.settings') + +# Inicializar Django +import django +django.setup() + +app = Celery('tu_proyecto') +app.config_from_object('django.conf:settings', namespace='CELERY') +app.autodiscover_tasks() +``` + + + + + +Revisa tu archivo `settings.py` de Django para asegurarte de que todas las apps necesarias estén incluidas: + +```python +# settings.py +INSTALLED_APPS = [ + 'django.contrib.admin', + 'django.contrib.auth', + 'django.contrib.contenttypes', + 'django.contrib.sessions', + 'django.contrib.messages', + 'django.contrib.staticfiles', + + # Apps de terceros + 'celery', + 'django_celery_beat', # Si usas tareas periódicas + 'django_celery_results', # Si almacenas resultados en la BD de Django + + # Tus apps + 'tu_app_nombre', + # Agrega aquí cualquier app faltante +] +``` + + + + + +Asegúrate de que tus tareas Celery estén definidas correctamente: + 
+```python +# tasks.py +from celery import shared_task +from django.apps import apps + +@shared_task +def tu_tarea_cron(): + # Asegurar que Django esté listo + if not apps.ready: + import django + django.setup() + + # Lógica de tu tarea aquí + return "Tarea completada" +``` + +Para la ejecución manual de tareas, usa: + +```bash +# En lugar de: celery -A tu_proyecto call tu_app.tasks.tu_tarea +# Usa: +celery -A tu_proyecto worker --loglevel=info + +# O para ejecución puntual: +python manage.py shell -c "from tu_app.tasks import tu_tarea; tu_tarea.delay()" +``` + + + + + +Verifica que las configuraciones de Django se estén cargando correctamente: + +```python +# En tu tarea o celery.py +import os +print(f"DJANGO_SETTINGS_MODULE: {os.environ.get('DJANGO_SETTINGS_MODULE')}") + +from django.conf import settings +print(f"Configuración cargada: {settings.configured}") +print(f"Apps instaladas: {settings.INSTALLED_APPS}") +``` + +En SleakOps, asegúrate de que las variables de entorno estén definidas: + +```yaml +# En tu configuración de despliegue +environment: + DJANGO_SETTINGS_MODULE: "tu_proyecto.settings" + CELERY_BROKER_URL: "redis://redis:6379/0" + CELERY_RESULT_BACKEND: "redis://redis:6379/0" +``` + + + + + +Para despliegues en SleakOps, asegúrate de que la configuración de tus workers sea correcta: + +```yaml +# sleakops.yaml o configuración similar +workloads: + - name: celery-worker + type: worker + image: tu-app-django + command: ["celery", "-A", "tu_proyecto", "worker", "--loglevel=info"] + environment: + DJANGO_SETTINGS_MODULE: "tu_proyecto.settings" + + - name: celery-beat + type: worker + image: tu-app-django + command: ["celery", "-A", "tu_proyecto", "beat", "--loglevel=info"] + environment: + DJANGO_SETTINGS_MODULE: "tu_proyecto.settings" +``` + + + + + +Para depurar el problema: + +1. 
**Prueba localmente primero:** + + ```bash + # Establece la variable de entorno + export DJANGO_SETTINGS_MODULE=tu_proyecto.settings + + # Prueba la configuración de Django + python -c "import django; django.setup(); print('Configuración de Django exitosa')" + + # Prueba la tarea de Celery + python manage.py shell -c "from tu_app.tasks import tu_tarea; print(tu_tarea())" + ``` + +2. **Revisa los logs del worker de Celery:** + + ```bash + kubectl logs -f deployment/celery-worker + ``` + +3. **Verifica la conexión con Redis/Broker:** + ```python + from celery import current_app + print(current_app.control.inspect().stats()) + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 26 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-migration-conflicts-database-import.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-migration-conflicts-database-import.mdx new file mode 100644 index 000000000..f3eddaa60 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/django-migration-conflicts-database-import.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 3 +title: "Conflictos de Migraciones en Django Durante la Importación de Base de Datos" +description: "Solución para errores de migración en Django al importar volcados de base de datos con estados de migración diferentes" +date: "2024-12-20" +category: "proyecto" +tags: + [ + "django", + "migraciones", + "base de datos", + "importación", + "solución de problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Conflictos de Migraciones en Django Durante la Importación de Base de Datos + +**Fecha:** 20 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Django, Migraciones, Base de Datos, Importación, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Al importar un volcado de base 
de datos en una aplicación Django desplegada en SleakOps, el hook de pre-actualización encargado de ejecutar las migraciones de Django falla porque el sistema no reconoce que ciertas migraciones ya fueron aplicadas. + +**Síntomas Observados:** + +- El hook de pre-actualización falla durante el despliegue +- El sistema de migraciones de Django no reconoce migraciones previamente aplicadas +- El proceso de importación de la base de datos falla +- Conflictos de migración entre el estado del volcado y el código actual + +**Configuración Relevante:** + +- Framework: Django con Django REST Framework +- Base de datos: PostgreSQL (típico en despliegues SleakOps) +- Despliegue: Kubernetes con hooks de pre-actualización +- Aplicaciones de migración: cobranza, django_celery_results, y otras + +**Condiciones de Error:** + +- Error ocurre durante la ejecución del hook de pre-actualización +- Sucede cuando el volcado de base de datos contiene estados de migración diferentes al código actual +- Afecta múltiples aplicaciones Django con migraciones pendientes +- El problema persiste hasta que los estados de migración se sincronizan + +## Solución Detallada + + + +Este problema ocurre cuando: + +1. **Origen del volcado de base de datos**: El volcado fue creado desde una base de datos donde el proyecto corría en una rama diferente +2. **Desajuste del estado de migración**: El historial de migraciones en el volcado no coincide con las migraciones del código actual +3. **Seguimiento de migraciones en Django**: La tabla `django_migrations` contiene registros que no se alinean con los archivos de migración actuales + +El indicador clave es ver migraciones marcadas como `[ ]` (no aplicadas) al ejecutar `python manage.py showmigrations`, aunque la base de datos contenga los cambios reales de esquema. 
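La mecánica anterior puede ilustrarse con un esbozo mínimo en Python (no es el código real de Django; los nombres de app y migración son los del ejemplo): Django considera "pendiente" toda migración presente en el código que no tenga fila en `django_migrations`, y `--fake` se limita a insertar esa fila sin ejecutar el SQL.

```python
# Esbozo ilustrativo: cómo se decide qué migraciones están "pendientes"
# comparando los archivos del código con las filas de django_migrations.

def migraciones_pendientes(en_codigo, registradas):
    """Migraciones que Django intentaría aplicar: presentes en el código
    pero sin fila registrada en django_migrations."""
    return sorted(set(en_codigo) - set(registradas))

# Estado típico tras importar un volcado creado desde otra rama:
en_codigo = [("cobranza", "0001_initial"), ("cobranza", "0002_initial")]
registradas = [("cobranza", "0001_initial")]  # el volcado no registró 0002

print(migraciones_pendientes(en_codigo, registradas))
# [('cobranza', '0002_initial')] -> el esquema ya existe, pero Django la ve pendiente

# `migrate --fake` equivale a registrar la fila sin ejecutar el SQL:
registradas.append(("cobranza", "0002_initial"))
print(migraciones_pendientes(en_codigo, registradas))  # []
```

Esto explica por qué `--fake` es la corrección adecuada: el esquema ya está en la base de datos y solo falta sincronizar el registro de migraciones.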
+ + + + + +Para diagnosticar el problema, conéctate a tu pod de la aplicación y ejecuta: + +```bash +# Verificar el estado actual de las migraciones +python manage.py showmigrations + +# Buscar aplicaciones con estados mixtos como: +# cobranza +# [X] 0001_initial +# [ ] 0002_initial # <-- Esto indica un conflicto +``` + +También revisa directamente la base de datos: + +```sql +-- Conéctate a tu base de datos y revisa los registros de migración +SELECT app, name, applied FROM django_migrations +WHERE app IN ('cobranza', 'django_celery_results') +ORDER BY app, applied; +``` + + + + + +La solución más efectiva es marcar las migraciones en conflicto como aplicadas sin ejecutarlas realmente: + +```bash +# Conéctate a tu pod de la aplicación +kubectl exec -it <nombre-del-pod> -- bash + +# Marcar migraciones específicas como aplicadas (fake) +python manage.py migrate cobranza 0002_initial --fake +python manage.py migrate django_celery_results 0011_taskresult_periodic_task_name --fake + +# Verificar la corrección +python manage.py showmigrations +``` + +Para múltiples migraciones: + +```bash +# Marcar todas las migraciones pendientes como falsas para una app específica +python manage.py migrate cobranza --fake + +# O marcar todas las migraciones pendientes en todas las apps +python manage.py migrate --fake +``` + + + + + +Para evitar este problema en el futuro: + +1. **Volcados de ramas consistentes**: Siempre crea los volcados de base de datos desde la misma rama que vas a desplegar + +2. **Sincronización de migraciones**: Antes de crear volcados, asegúrate de que las migraciones estén actualizadas: + + ```bash + python manage.py migrate + python manage.py showmigrations # Verificar que todas estén aplicadas + ``` + +3. **Proceso limpio de importación**: Al importar volcados: + + ```bash + # Después de la importación, verifica inmediatamente el estado de migraciones + python manage.py showmigrations + # Corrige cualquier conflicto antes del despliegue + ``` + +4. 
**Consistencia de ambiente**: Usa la misma versión de Django y versiones de las apps entre el entorno de creación del volcado y el de importación + + + + + +Para despliegues SleakOps con hooks de pre-actualización: + +1. **Modificación temporal del hook**: Si necesitas desplegar inmediatamente, puedes modificar temporalmente el hook de pre-actualización para usar `--fake-initial`: + + ```yaml + # En tu configuración de despliegue + preUpgradeHook: + command: | + python manage.py migrate --fake-initial + ``` + +2. **Corrección post-despliegue**: Después del despliegue exitoso, conéctate al pod y ejecuta las migraciones falsas adecuadas como se describió arriba + +3. **Restaurar hook normal**: Una vez corregido, restaura el hook normal de migración: + ```yaml + preUpgradeHook: + command: | + python manage.py migrate + ``` + + + + + +Después de aplicar la corrección: + +1. **Verificar estado de migraciones**: + + ```bash + python manage.py showmigrations + # Todas las migraciones deberían mostrar [X] + ``` + +2. **Probar funcionalidad de la aplicación**: + + - Verificar que las operaciones con la base de datos funcionen correctamente + - Comprobar que todos los modelos sean accesibles + - Probar funcionalidades críticas de la aplicación + +3. 
**Despliegue exitoso**: + - El próximo despliegue debería completarse sin errores de migración + - Los hooks de pre-actualización deberían ejecutarse correctamente + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-cloudflare-route53-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-cloudflare-route53-configuration.mdx new file mode 100644 index 000000000..d06551309 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-cloudflare-route53-configuration.mdx @@ -0,0 +1,196 @@ +--- +sidebar_position: 3 +title: "Configuración DNS con Cloudflare y Route53" +description: "Cómo configurar registros DNS al usar Cloudflare con delegación AWS Route53" +date: "2024-12-19" +category: "proveedor" +tags: ["dns", "cloudflare", "route53", "aws", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración DNS con Cloudflare y Route53 + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proveedor +**Etiquetas:** DNS, Cloudflare, Route53, AWS, Configuración + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan configurar registros DNS al usar Cloudflare como su proveedor DNS mientras tienen servicios AWS que requieren delegación Route53 para subdominios. 
+ +**Síntomas Observados:** + +- Fallos en la resolución DNS para subdominios que apuntan a Load Balancers de AWS +- Servicios inaccesibles mediante los nombres de dominio configurados +- Problemas de propagación DNS entre Cloudflare y Route53 +- Conflictos en la configuración de registros CNAME/A + +**Configuración Relevante:** + +- Proveedor DNS: Cloudflare +- Servicio AWS: ELB (Elastic Load Balancer) +- Delegación de dominio: Route53 para subdominios específicos +- Tipos de registro: CNAME o registros A + +**Condiciones de Error:** + +- Tiempo de espera en resolución de dominio +- Servicios no accesibles vía dominios configurados +- Delegación DNS no funcionando correctamente +- Endpoints del balanceador no resolviendo + +## Solución Detallada + + + +Al usar Cloudflare como proveedor DNS principal pero necesitar Route53 para servicios AWS: + +1. **En Route53 (Consola AWS):** + + - Crear una zona alojada para su subdominio + - Anotar los registros NS (Name Server) proporcionados por Route53 + +2. **En Cloudflare:** + + - Crear registros NS que apunten su subdominio a los servidores de nombres de Route53 + - Ejemplo: `subdominio.sudominio.com` → registros NS de Route53 + +3. **Verificar la delegación:** + ```bash + dig NS subdominio.sudominio.com + ``` + + + + + +Para Load Balancers de AWS, debe crear los registros DNS apropiados: + +**En Cloudflare (para configuración directa):** + +``` +Tipo: CNAME +Nombre: corebackupgenerator +Valor: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +TTL: Auto +Estado proxy: Solo DNS (nube gris) +``` + +**En Route53 (si usa delegación):** + +``` +Tipo: A (Alias) +Nombre: corebackupgenerator.autolab.com.co +Destino Alias: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +Política de enrutamiento: Simple +``` + +**Importante:** Use registros A (Alias) en Route53 para mejor rendimiento y resolución automática de IP. 
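Como referencia, el registro A (Alias) descrito arriba también puede crearse vía API con la operación `ChangeResourceRecordSets` de Route53. Un esbozo del *change batch* que espera esa operación (el `HostedZoneId` del balanceador es un marcador de posición: cada tipo de balanceador y región tiene su propio Hosted Zone ID canónico, visible en la consola de EC2/ELB):

```python
import json

# Marcador de posición: sustitúyalo por el Hosted Zone ID canónico de su ELB
ZONA_CANONICA_DEL_ELB = "ZXXXXXXXXXXXXX"

change_batch = {
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "corebackupgenerator.autolab.com.co",
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": ZONA_CANONICA_DEL_ELB,
                    "DNSName": "internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com",
                    # Los registros alias no llevan TTL: Route53 resuelve la IP automáticamente
                    "EvaluateTargetHealth": False,
                },
            },
        }
    ]
}

# El mismo JSON sirve para la CLI:
#   aws route53 change-resource-record-sets --hosted-zone-id <zona> --change-batch file://cambio.json
print(json.dumps(change_batch, indent=2))
```

Observe que un alias no define `TTL` ni `ResourceRecords`: Route53 sigue el nombre DNS del balanceador y devuelve sus IPs actuales en cada resolución.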
+ + + + + +Al configurar registros DNS en Cloudflare para servicios AWS: + +1. **Deshabilitar Proxy de Cloudflare (Nube Gris):** + + - Haga clic en el icono de nube naranja para que se vuelva gris + - Esto asegura conexión directa con los servicios AWS + - Requerido para servicios no HTTP o puertos personalizados + +2. **Configuración SSL/TLS:** + + - Establecer el modo de cifrado SSL/TLS a "Completo" o "Completo (estricto)" + - Asegurar que los certificados coincidan entre Cloudflare y AWS + +3. **Reglas de Página (si es necesario):** + - Crear reglas para omitir Cloudflare en subdominios específicos + - Útil para endpoints API o conexiones WebSocket + + + + + +**Comandos de diagnóstico:** + +1. **Verificar propagación DNS:** + + ```bash + dig corebackupgenerator.autolab.com.co + nslookup corebackupgenerator.autolab.com.co + ``` + +2. **Probar desde diferentes servidores DNS:** + + ```bash + dig @8.8.8.8 corebackupgenerator.autolab.com.co + dig @1.1.1.1 corebackupgenerator.autolab.com.co + ``` + +3. **Verificar delegación:** + ```bash + dig NS autolab.com.co + dig NS corebackupgenerator.autolab.com.co + ``` + +**Problemas comunes:** + +- **TTL demasiado alto:** Reducir TTL a 300 segundos durante cambios +- **Proxy habilitado:** Deshabilitar proxy de Cloudflare para servicios AWS +- **Tipo de registro incorrecto:** Usar CNAME para dominios externos, A para direcciones IP +- **Delegación faltante:** Asegurar que los registros NS estén correctamente configurados + + + + + +**Para configuración Cloudflare + AWS:** + +1. **Usar delegación por subdominio:** + + - Delegar subdominios específicos a Route53 + - Mantener la gestión del dominio principal en Cloudflare + +2. **Selección de tipo de registro:** + + - Usar registros A (Alias) en Route53 para recursos AWS + - Usar registros CNAME en Cloudflare para dominios externos + +3. 
**Monitoreo y validación:** + + - Configurar monitoreo DNS + - Probar resolución regularmente desde diferentes ubicaciones + - Monitorear validez de certificados SSL + +4. **Documentación:** + - Documentar todas las delegaciones DNS + - Mantener registro de qué registros se gestionan en cada lugar + - Mantener información de contacto para emergencias + +**Ejemplo de configuración:** + +```yaml +# Registros Cloudflare +autolab.com.co: + - Tipo: A + Valor: 192.168.1.1 + - Tipo: NS (para subdominio k8s) + Valor: servidores NS de Route53 + +# Registros Route53 (zona k8s.autolab.com.co) +k8s.autolab.com.co: + - Tipo: A (Alias) + Destino: nombre DNS del ELB + +corebackupgenerator.autolab.com.co: + - Tipo: A (Alias) + Destino: internal-k8s-autolabproduction-fc0a036b93-1779329228.us-east-1.elb.amazonaws.com +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-route53-domain-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-route53-domain-configuration.mdx new file mode 100644 index 000000000..c1eae9b04 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-route53-domain-configuration.mdx @@ -0,0 +1,209 @@ +--- +sidebar_position: 3 +title: "Problemas de Delegación DNS con Route53 y Proveedores de Dominio" +description: "Solución para problemas de delegación DNS al configurar dominios personalizados con Route53 y registradores de dominio externos" +date: "2024-12-19" +category: "provider" +tags: + ["dns", "route53", "aws", "dominio", "delegación", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Delegación DNS con Route53 y Proveedores de Dominio + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proveedor +**Etiquetas:** DNS, 
Route53, AWS, Dominio, Delegación, Solución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario intenta configurar un dominio personalizado (ordenapp.com.ar) para apuntar a su despliegue de SleakOps, pero la delegación DNS no funciona correctamente a pesar de una configuración correcta en Route53. + +**Síntomas Observados:** + +- La delegación DNS parece estar configurada correctamente en Route53 +- El registrador del dominio (DonWeb) informa que la delegación está configurada pero "no se propaga porque los servidores DNS no están configurados en el proveedor de hosting" +- Los registros NS no se replican correctamente entre servidores DNS +- La validación del certificado SSL puede estar fallando +- El dominio no resuelve a la infraestructura AWS prevista + +**Configuración Relevante:** + +- Dominio: Dominio personalizado (.com.ar) +- Proveedor DNS: AWS Route53 +- Registrador de Dominio: DonWeb (o similar) +- SSL: AWS Certificate Manager (ACM) +- Destino: Despliegue de SleakOps con Load Balancer + +**Condiciones de Error:** + +- La propagación DNS falla a pesar de la correcta delegación NS +- Problemas de validación del certificado +- El dominio no resuelve a la infraestructura objetivo +- Los registros NS muestran resultados inconsistentes en diferentes verificadores DNS + +## Solución Detallada + + + +Primero, asegúrate de que tu zona alojada en Route53 esté configurada correctamente: + +1. **Verificar Creación de Zona Alojada:** + + - Ve a la Consola AWS → Route53 → Zonas Alojadas + - Verifica que exista la zona alojada para tu dominio + - Anota los 4 registros NS (Name Server) que proporciona AWS + +2. 
**Verificar Registros Requeridos:** + + ``` + Tipo: NS + Nombre: tu-dominio.com + Valor: + ns-123.awsdns-12.com + ns-456.awsdns-45.net + ns-789.awsdns-78.org + ns-012.awsdns-01.co.uk + + Tipo: A + Nombre: tu-dominio.com + Valor: [IP o Alias del Load Balancer] + + Tipo: CNAME (para validación SSL) + Nombre: _acme-challenge.tu-dominio.com + Valor: [valor de validación ACM] + ``` + + + + + +El problema suele ocurrir cuando la delegación DNS en el registrador del dominio está incompleta: + +1. **Actualizar Servidores de Nombre en el Registrador:** + + - Inicia sesión en tu registrador de dominio (DonWeb, GoDaddy, etc.) + - Ve a la sección de Gestión DNS o Servidores de Nombre + - Reemplaza los servidores de nombre por los 4 registros NS de AWS Route53 + - **Importante:** Usa los 4 servidores de nombre, no solo 2 + +2. **Eliminar Registros DNS Conflictivos:** + + - Borra cualquier registro A, CNAME u otros existentes en el registrador + - Mantén solo los registros de delegación NS + - Algunos registradores mantienen sus propios registros DNS incluso después de la delegación + +3. **Esperar la Propagación:** + - Los cambios DNS pueden tardar 24-48 horas en propagarse completamente + - Usa herramientas como `dig` o verificadores DNS en línea para monitorear el progreso + + + + + +Si la delegación aún no funciona, sigue estos pasos de solución de problemas: + +1. **Verificar Propagación DNS:** + + ```bash + # Verificar registros NS desde diferentes ubicaciones + dig NS tu-dominio.com @8.8.8.8 + dig NS tu-dominio.com @1.1.1.1 + + # Verificar si Route53 responde + dig A tu-dominio.com @ns-123.awsdns-12.com + ``` + +2. **Verificar con Herramientas en Línea:** + + - Usa https://dnschecker.org para revisar la propagación global + - Revisa específicamente los registros NS: `https://dnschecker.org/#NS/tu-dominio.com` + - Busca inconsistencias entre diferentes regiones + +3. 
**Problemas Comunes y Soluciones:** + - **Delegación parcial:** Asegúrate de que los 4 registros NS estén configurados + - **Conflictos de TTL:** Registros DNS antiguos pueden estar en caché (espera a que expire el TTL) + - **Interferencia del DNS del registrador:** Algunos registradores mantienen registros DNS paralelos + + + + + +Los problemas con el certificado SSL suelen acompañar a los problemas de delegación DNS: + +1. **Validación del Certificado ACM:** + + - Asegúrate de que el registro CNAME para la validación del dominio esté en Route53 + - El registro de validación debe ser: `_acme-challenge.tu-dominio.com` + - El valor debe coincidir exactamente con lo que proporciona ACM + +2. **Verificar Estado del Certificado:** + + ```bash + # Verificar estado de validación del certificado + aws acm describe-certificate --certificate-arn tu-cert-arn + ``` + +3. **Solicitar el Certificado Nuevamente si es Necesario:** + - Si la validación falla repetidamente, elimina y vuelve a solicitar el certificado + - Asegúrate de que la delegación DNS funcione antes de solicitar + + + + + +Para despliegues de SleakOps, asegúrate de una configuración correcta del dominio: + +1. **Configuración del Load Balancer:** + + - Verifica que el registro A apunte al Load Balancer correcto + - Usa registros Alias cuando sea posible en lugar de direcciones IP + - Comprueba que el Load Balancer esté en la misma región que Route53 + +2. **Configuración del Dominio en SleakOps:** + + - Actualiza la configuración de tu proyecto SleakOps para usar el dominio personalizado + - Asegura que el certificado SSL esté correctamente asociado + - Verifica que la configuración de ingress acepte el nuevo dominio + +3. 
**Probar Resolución del Dominio:** + + ```bash + # Probar resolución del dominio + curl -I https://tu-dominio.com + + # Verificar certificado SSL + openssl s_client -connect tu-dominio.com:443 -servername tu-dominio.com + ``` + + + + + +Si la delegación DNS sigue fallando después de seguir todos los pasos: + +1. **Contactar Soporte del Registrador con Información Específica:** + + - Proporciona los 4 servidores de nombre de AWS Route53 + - Explica que necesitas una delegación DNS completa (no solo reenvío) + - Pide que verifiquen que no existan registros DNS conflictivos + - Solicita que comprueben que sus servidores DNS no estén sobrescribiendo la delegación + +2. **Problemas Comunes con Registradores:** + + - Algunos registradores mantienen registros DNS de "parking" + - Delegación incompleta (solo 2 registros NS en lugar de 4) + - Reenvío DNS en lugar de verdadera delegación + - Registros DNS antiguos en caché en sus servidores + +3. **Documentación para Proporcionar:** + - Captura de pantalla de la zona alojada en Route53 + - Resultados de herramientas de verificación DNS + - Mensajes de error o síntomas específicos + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-ssl-certificate-validation.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-ssl-certificate-validation.mdx new file mode 100644 index 000000000..771b82cc2 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-delegation-ssl-certificate-validation.mdx @@ -0,0 +1,176 @@ +--- +sidebar_position: 3 +title: "Problemas de Delegación DNS y Validación de Certificados SSL" +description: "Solución de problemas de delegación DNS y retrasos en la validación de certificados SSL de AWS" +date: "2024-09-10" +category: "provider" +tags: ["dns", "ssl", "aws", 
"certificado", "delegación"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Delegación DNS y Validación de Certificados SSL + +**Fecha:** 10 de septiembre de 2024 +**Categoría:** Proveedor +**Etiquetas:** DNS, SSL, AWS, Certificado, Delegación + +## Descripción del Problema + +**Contexto:** Después de completar la configuración de la delegación DNS, los dominios aún aparecen como no delegados y los certificados SSL no están siendo validados por AWS, a pesar de que se espera que la propagación DNS haya finalizado. + +**Síntomas Observados:** + +- Los dominios se muestran como no delegados en el panel de SleakOps +- Los certificados SSL permanecen sin validar por AWS +- La propagación DNS parece incompleta a pesar de haber pasado tiempo suficiente +- El proceso de validación del certificado está atascado en estado pendiente + +**Configuración Relevante:** + +- Delegación DNS: Configurada recientemente +- Certificados SSL: Certificados gestionados por AWS +- Proveedor: AWS +- Método de validación de dominio: Validación DNS + +**Condiciones de Error:** + +- El problema persiste después del tiempo esperado para la propagación DNS +- Los certificados permanecen en estado de validación pendiente +- El estado de delegación del dominio no se actualiza en la plataforma + +## Solución Detallada + + + +La propagación DNS puede tardar hasta 48 horas en completarse a nivel global. Para verificar el estado actual: + +1. **Usar herramientas en línea de propagación DNS:** + + - Visite https://www.whatsmydns.net/ + - Ingrese su nombre de dominio + - Verifique si los registros NS están propagados globalmente + +2. **Verificación desde línea de comandos:** + + ```bash + # Verificar registros NS + dig NS su-dominio.com + + # Verificar desde diferentes servidores DNS + dig @8.8.8.8 NS su-dominio.com + dig @1.1.1.1 NS su-dominio.com + ``` + +3. 
**Resultados esperados:** + - Todas las consultas deben devolver los mismos registros NS + - Los registros deben apuntar a los servidores de nombres de AWS Route53 + + + + + +La validación del Certificate Manager (ACM) de AWS puede tomar tiempo adicional después de la propagación DNS: + +1. **Cronograma de validación:** + + - Propagación DNS: Hasta 48 horas + - Validación AWS: 24-72 horas adicionales después de la propagación + - Tiempo total: Hasta 5 días en algunos casos + +2. **Verificar estado del certificado en la Consola AWS:** + + ```bash + # Usando AWS CLI + aws acm list-certificates --region us-east-1 + aws acm describe-certificate --certificate-arn arn-del-certificado + ``` + +3. **Verificación de registros de validación:** + - Asegúrese que los registros CNAME de validación estén presentes + - Verifique que los registros coincidan exactamente con lo que AWS requiere + - Confirme que no existan registros DNS conflictivos + + + + + +Si la delegación parece incompleta después de 24-48 horas: + +1. **Verificar configuración de servidores de nombres:** + + ```bash + # Verificar servidores de nombres actuales + whois su-dominio.com | grep "Name Server" + ``` + +2. **Problemas comunes a revisar:** + + - Servidores de nombres no actualizados en el registrador del dominio + - Errores tipográficos en las entradas de servidores de nombres + - Registros DNS antiguos almacenados en caché localmente + - Registros A/CNAME conflictivos + +3. **Actualizar estado en la plataforma SleakOps:** + - Navegar a la configuración de su dominio + - Hacer clic en "Actualizar Estado" o "Verificar Delegación" + - Esperar a que la plataforma revalide el estado de delegación + + + + + +Considere regenerar los certificados si: + +1. **La validación ha estado pendiente por más de 5 días** +2. **Los registros DNS fueron modificados durante el proceso de validación** +3. **El certificado muestra errores de validación en la Consola AWS** + +**Pasos para regenerar:** + +1. 
En el panel de SleakOps: + + - Ir a la sección de Certificados SSL + - Seleccionar el certificado problemático + - Hacer clic en "Regenerar Certificado" + +2. **Regeneración manual vía AWS:** + + ```bash + # Eliminar certificado antiguo (si no está en uso) + aws acm delete-certificate --certificate-arn arn-cert-antiguo + + # Solicitar nuevo certificado + aws acm request-certificate \ + --domain-name su-dominio.com \ + --validation-method DNS + ``` + + + + + +**Configurar monitoreo para seguir el progreso:** + +1. **Eventos de AWS CloudWatch para cambios de estado del certificado** +2. **Revisiones regulares de propagación DNS** +3. **Notificaciones de la plataforma SleakOps** + +**Cronograma esperado:** + +- Hora 0-24: Inicio de propagación DNS +- Hora 24-48: Propagación DNS completada globalmente +- Hora 48-120: Validación del certificado AWS completa +- Hora 120+: Considerar regeneración si aún está pendiente + +**Indicadores de finalización exitosa:** + +- El dominio aparece como "Delegado" en SleakOps +- El estado del certificado cambia a "Emitido" en AWS +- El acceso HTTPS funciona para su dominio + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx new file mode 100644 index 000000000..e4100c612 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-domain-delegation-cloudflare-route53.mdx @@ -0,0 +1,239 @@ +--- +sidebar_position: 3 +title: "Problemas de Delegación de Dominio DNS entre CloudFlare y Route53" +description: "Resolución de problemas de delegación y validación de dominio al usar proveedores DNS externos con SleakOps" +date: "2024-12-21" +category: "provider" +tags: ["dns", "route53", "cloudflare", "domain-delegation", "aws"] 
+--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Delegación de Dominio DNS entre CloudFlare y Route53 + +**Fecha:** 21 de diciembre de 2024 +**Categoría:** Proveedor +**Etiquetas:** DNS, Route53, CloudFlare, Delegación de Dominio, AWS + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas de validación de dominio al intentar usar SleakOps con dominios gestionados por proveedores DNS externos como CloudFlare, mientras que SleakOps espera que los dominios estén delegados a AWS Route53. + +**Síntomas Observados:** + +- La validación de dominio falla para dominios principales gestionados en CloudFlare +- No se pueden validar alias de subdominios (por ejemplo, teams.simplee.cl) +- La creación de certificados SSL falla con errores de validación +- Comportamiento inconsistente donde algunos subdominios funcionan y otros no +- Desajuste en los registros NS entre lo que SleakOps espera y la delegación real + +**Configuración Relevante:** + +- Dominio principal: Gestionado en CloudFlare +- Proveedor DNS: Externo (CloudFlare) vs Esperado (Route53) +- Validación SSL: Requiere delegación correcta del dominio +- Validación SleakOps: Verifica la delegación solo una vez durante la configuración + +**Condiciones de Error:** + +- La validación del dominio falla cuando los registros NS no apuntan a Route53 +- Ocurren errores de validación del certificado SSL +- La creación de subdominios es bloqueada debido a problemas de delegación +- La gestión mixta de DNS causa comportamiento inconsistente + +## Solución Detallada + + + +**Cómo SleakOps gestiona el DNS:** + +1. **Gestión Centralizada:** SleakOps centraliza toda la gestión DNS en tus cuentas AWS a través de Route53 +2. **Gestión Automática de Servicios:** Los servicios desplegados con SleakOps gestionan automáticamente sus registros DNS +3. 
**Configuración Manual:** Reglas adicionales (validación de email, servicios externos) deben configurarse manualmente en Route53 +4. **Validación Única:** SleakOps valida la delegación del dominio solo una vez durante la configuración inicial + +**Dónde ver los registros DNS generados:** + +- Todos los registros DNS generados por SleakOps son visibles en la plataforma +- Los registros incluyen todos los entornos, ejecuciones y alias que crees +- Los registros son gestionados automáticamente para los servicios de SleakOps + + + + + +**Para un funcionamiento correcto, los dominios deben estar:** + +1. **Totalmente delegados a Route53:** Los registros NS del dominio principal deben apuntar a los servidores de nombres de Route53 +2. **Subdominios de dominios gestionados por SleakOps:** Deben crearse como subdominios de dominios ya delegados + +**Proceso de validación:** + +```bash +# Verificar registros NS actuales +dig NS simplee.cl + +# Comparar con servidores de nombres de Route53 +# Deben coincidir con los registros NS mostrados en la plataforma SleakOps +``` + +**Problemas comunes de delegación:** + +- Los registros NS en el registrador de dominio no coinciden con los servidores de nombres de Route53 +- Delegación parcial (algunos subdominios funcionan, otros no) +- Proxy de CloudFlare interfiriendo con la validación + + + + + +**Pasos completos para la migración:** + +1. **Exportar registros DNS existentes desde CloudFlare** + + - Descargar todos los registros DNS existentes + - Documentar cualquier configuración especial + +2. **Crear zona alojada en Route53** + + - SleakOps crea esta automáticamente cuando agregas un dominio + - Anotar los servidores de nombres proporcionados + +3. **Actualizar registrador de dominio** + + ``` + # Cambiar registros NS en el registrador de dominio a: + ns-xxx.awsdns-xx.com + ns-xxx.awsdns-xx.co.uk + ns-xxx.awsdns-xx.net + ns-xxx.awsdns-xx.org + ``` + +4. 
**Recrear registros necesarios en Route53** + + - Agregar manualmente cualquier registro DNS personalizado + - Configurar registros de validación de email + - Configurar registros de servicios externos + +5. **Esperar la propagación** + - Los cambios DNS pueden tardar hasta 48 horas en propagarse + - Verificar con `dig` o verificadores DNS en línea + + + + + +**Requisitos para la validación del certificado SSL:** + +1. **Eliminar proxy de CloudFlare:** Deshabilitar la nube naranja (proxy) en CloudFlare para los registros de validación +2. **Delegación correcta:** Asegurar que el dominio esté correctamente delegado a Route53 +3. **Propagación DNS:** Esperar a que los cambios DNS se propaguen globalmente + +**Solución de problemas en la validación SSL:** + +```bash +# Verificar si el dominio resuelve a los servidores de nombres correctos +dig NS your-domain.com + +# Verificar registros TXT para validación SSL +dig TXT _acme-challenge.your-domain.com + +# Probar resolución del dominio +nslookup your-domain.com +``` + +**Soluciones comunes:** + +- Deshabilitar proxy de CloudFlare durante la validación del certificado +- Asegurar que todos los registros NS apunten a Route53 +- Esperar la propagación DNS (hasta 48 horas) + + + + + +**Si necesitas mantener parte del DNS en CloudFlare:** + +1. **Delegación de subdominios:** Delegar solo subdominios específicos a Route53 + + ``` + # En CloudFlare, crear registros NS para subdominios: + app.yourdomain.com NS ns-xxx.awsdns-xx.com + app.yourdomain.com NS ns-xxx.awsdns-xx.co.uk + app.yourdomain.com NS ns-xxx.awsdns-xx.net + app.yourdomain.com NS ns-xxx.awsdns-xx.org + ``` + +2. 
**Usar subdominios de SleakOps:** Crear todos los servicios SleakOps bajo un subdominio dedicado + - Ejemplo: `*.apps.yourdomain.com` gestionado por Route53 + - El dominio principal permanece en CloudFlare + +**Problemas potenciales con gestión mixta:** + +- Pueden ocurrir errores de validación con cambios frecuentes de dominio +- Comportamiento inconsistente entre diferentes subdominios +- Complicaciones en la validación de certificados SSL + + + + + +**Problemas comunes de validación:** + +1. **Desajuste en registros NS** + +```bash +# Verificar qué registros NS están realmente configurados +dig +short NS yourdomain.com + +# Comparar con los registros NS esperados por SleakOps +# (visibles en la plataforma SleakOps) +``` + +2. **Retrasos en la propagación** + +- Usar múltiples verificadores DNS: whatsmydns.net +- Verificar TTL de registros DNS existentes +- Limpiar caché DNS local si es necesario + +3. **Problemas de certificado SSL** + +```bash +# Verificar estado del certificado +openssl s_client -connect yourdomain.com:443 -servername yourdomain.com + +# Verificar registros de validación ACME +dig TXT _acme-challenge.yourdomain.com +``` + +**Pasos de resolución:** + +1. Verificar que el dominio esté correctamente delegado a Route53 +2. Confirmar que todos los registros NS coincidan +3. Esperar la propagación DNS completa +4. Reintentar la validación del dominio en SleakOps +5. Contactar soporte si los problemas persisten después de 48 horas + + + +## Prevención + +**Mejores prácticas para evitar problemas de delegación:** + +1. **Planificar la migración:** Realizar la migración DNS durante ventanas de mantenimiento +2. **Documentar registros:** Mantener un inventario completo de todos los registros DNS +3. **Probar en subdominios:** Probar la delegación primero con subdominios no críticos +4. **Monitorear propagación:** Usar herramientas de monitoreo DNS para verificar la propagación +5. 
**Backup de configuración:** Mantener respaldos de la configuración DNS de Cloudflare
+
+## Recursos Adicionales
+
+- [Documentación de Route53](https://docs.aws.amazon.com/route53/)
+- [Guía de migración DNS de Cloudflare](https://developers.cloudflare.com/dns/)
+- [Verificador de propagación DNS](https://whatsmydns.net/)
+
+---
+
+_Esta sección de preguntas frecuentes fue generada automáticamente el 21 de diciembre de 2024 basada en una consulta real de usuario._
+
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-migration-donweb-to-aws.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-migration-donweb-to-aws.mdx
new file mode 100644
index 000000000..50c304e94
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-migration-donweb-to-aws.mdx
@@ -0,0 +1,407 @@
+---
+sidebar_position: 3
+title: "Migración DNS de DonWeb a AWS"
+description: "Guía completa para migrar registros DNS de DonWeb a AWS Route 53"
+date: "2024-01-15"
+category: "provider"
+tags: ["aws", "dns", "route53", "migración", "donweb"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Migración DNS de DonWeb a AWS
+
+**Fecha:** 15 de enero de 2024
+**Categoría:** Proveedor
+**Etiquetas:** AWS, DNS, Route 53, Migración, DonWeb
+
+## Descripción del Problema
+
+**Contexto:** El usuario necesita migrar los registros DNS del proveedor de hosting DonWeb a AWS Route 53, incluyendo la página de aterrizaje en WordPress y los servicios de correo corporativo.
+ +**Síntomas Observados:** + +- DNS actual gestionado por el proveedor DonWeb +- Página de aterrizaje WordPress alojada en DonWeb +- Servicios de correo corporativo alojados en DonWeb +- Necesidad de consolidar todos los servicios en AWS +- Incertidumbre sobre dónde configurar los registros DNS en AWS + +**Configuración Relevante:** + +- Proveedor actual: DonWeb +- Servicios: sitio web WordPress + correo corporativo +- Plataforma destino: AWS +- Servicio DNS requerido: AWS Route 53 + +**Condiciones de Error:** + +- Riesgo de interrupción del servicio durante la migración +- Necesidad de mantener la continuidad del servicio de correo +- El sitio WordPress debe permanecer accesible + +## Solución Detallada + + + +AWS Route 53 es el servicio DNS donde gestionarás los registros de tu dominio: + +1. **Accede a la Consola de Route 53**: + + - Ve a la Consola AWS → Route 53 + - Haz clic en "Hosted zones" + - Haz clic en "Create hosted zone" + +2. **Crear Hosted Zone**: + + - Ingresa el nombre de tu dominio (ejemplo: `tuempresa.com`) + - Selecciona "Public hosted zone" + - Haz clic en "Create hosted zone" + +3. **Anota los Servidores de Nombre**: + - AWS proporcionará 4 servidores de nombre + - Guarda estos para usarlos luego con tu registrador de dominio + + + + + +Antes de migrar, necesitas exportar tu configuración DNS actual: + +1. **Accede al Panel de Control de DonWeb**: + + - Inicia sesión en tu cuenta DonWeb + - Navega a la sección de gestión DNS + - Busca "Zona DNS" o "Registros DNS" + +2. **Documenta los Registros Actuales**: + Crea una lista de todos los registros DNS actuales: + + ``` + Registros A: + - @ (dominio raíz) → DIRECCIÓN_IP + - www → DIRECCIÓN_IP + + Registros MX (Correo): + - @ → mail.donweb.com (prioridad 10) + + Registros CNAME: + - Cualquier subdominio apuntando a otros servicios + + Registros TXT: + - Registros SPF para correo + - Cualquier registro de verificación + ``` + +3. 
**Opciones de Exportación**: + - Busca opción "Exportar" o "Descargar" + - Algunos proveedores ofrecen exportación de archivo de zona + - Si no está disponible, documenta manualmente todos los registros + + + + + +Para la migración de WordPress a AWS, tienes varias opciones: + +**Opción 1: AWS Lightsail (Recomendado para sitios simples)** + +1. Crea una instancia AWS Lightsail con WordPress +2. Migra tus archivos y base de datos de WordPress +3. Actualiza el registro A del DNS para apuntar a la nueva IP + +**Opción 2: EC2 con RDS** + +1. Configura una instancia EC2 para el servidor web +2. Configura RDS para la base de datos MySQL +3. Migra archivos y base de datos de WordPress +4. Configura grupos de seguridad y balanceador de carga + +**Opción 3: AWS App Runner o ECS** + +1. Conteneriza tu aplicación WordPress +2. Despliega usando App Runner o ECS +3. Utiliza RDS para la base de datos + +```bash +# Ejemplo: Crear instancia Lightsail para WordPress +aws lightsail create-instances \ + --instance-names "wordpress-site" \ + --availability-zone "us-east-1a" \ + --blueprint-id "wordpress" \ + --bundle-id "nano_2_0" +``` + + + + + +Para la migración del correo corporativo, considera estas opciones en AWS: + +**Opción 1: Amazon WorkMail** + +1. Configura la organización Amazon WorkMail +2. Crea cuentas de usuario +3. Configura registros MX para apuntar a WorkMail +4. Migra los correos existentes (si es necesario) + +**Opción 2: Correo de terceros con DNS en AWS** + +1. Elige proveedor de correo (Google Workspace, Microsoft 365) +2. Configura registros MX en Route 53 +3. 
Actualiza registros SPF/DKIM + +**Ejemplo de configuración de registro MX:** + +``` +Tipo: MX +Nombre: @ (o dejar en blanco) +Valor: 10 inbound-smtp.us-east-1.amazonaws.com (para WorkMail) +TTL: 300 +``` + +**Ejemplo de registro SPF:** + +``` +Tipo: TXT +Nombre: @ (o dejar en blanco) +Valor: "v=spf1 include:amazonses.com ~all" +TTL: 300 +``` + + + + + +Una vez que tus servicios estén listos en AWS, configura los registros DNS: + +1. **Registros A para el sitio web**: + + ``` + Tipo: A + Nombre: @ (dominio raíz) + Valor: TU_DIRECCIÓN_IP_AWS + TTL: 300 + + Tipo: A + Nombre: www + Valor: TU_DIRECCIÓN_IP_AWS + TTL: 300 + ``` + +2. **Registros MX para correo**: + + ``` + Tipo: MX + Nombre: @ + Valor: 10 tu-servidor-correo.amazonaws.com + TTL: 300 + ``` + +3. **Registros CNAME** (si es necesario): + + ``` + Tipo: CNAME + Nombre: subdominio + Valor: objetivo.dominio.com + TTL: 300 + ``` + +4. **Registros TXT para autenticación de correo**: + ``` + Tipo: TXT + Nombre: @ + Valor: "v=spf1 include:amazonses.com ~all" + TTL: 300 + ``` + + + + + +**Fase 1: Preparación** + +1. Exporta todos los registros DNS desde DonWeb +2. Configura los servicios AWS (Lightsail/EC2 para WordPress, WorkMail para correo) +3. Crea la zona hospedada en Route 53 +4. Prepara los nuevos registros DNS + +**Fase 2: Migración de servicios** + +1. Migra WordPress a AWS +2. Configura el servicio de correo en AWS +3. Prueba ambos servicios con IPs temporales + +**Fase 3: Cambio de DNS** + +1. Reduce TTL de registros DNS actuales a 300 segundos (5 minutos) +2. Espera el tiempo del TTL anterior para que se propague +3. Actualiza los servidores de nombre en tu registrador de dominio +4. Configura los nuevos registros DNS en Route 53 + +**Fase 4: Verificación** + +1. Verifica que el sitio web sea accesible +2. Prueba el envío y recepción de correos +3. Monitorea por 24-48 horas para asegurar estabilidad + + + + + +El paso más crítico es actualizar los servidores de nombre en tu registrador de dominio: + +1. 
**Accede a tu registrador de dominio** (donde compraste el dominio) +2. **Busca la sección DNS o Servidores de Nombre** +3. **Reemplaza los servidores de nombre actuales** con los proporcionados por AWS Route 53: + + ``` + Servidores de nombre de AWS Route 53 (ejemplo): + ns-1234.awsdns-12.com + ns-5678.awsdns-34.net + ns-9012.awsdns-56.org + ns-3456.awsdns-78.co.uk + ``` + +4. **Guarda los cambios** +5. **Espera la propagación** (puede tomar de 24 a 48 horas) + +**Verificar la propagación:** + +```bash +# Verificar desde diferentes ubicaciones +dig NS tudominio.com @8.8.8.8 +dig NS tudominio.com @1.1.1.1 + +# Usar herramientas online +# whatsmydns.net +# dnschecker.org +``` + + + + + +Siempre ten un plan de reversión preparado: + +**Plan de reversión rápida:** + +1. **Mantén acceso a DonWeb** durante al menos 72 horas después de la migración +2. **Documenta la configuración DNS original** antes de hacer cambios +3. **Ten los servidores de nombre originales** listos para restaurar + +**Pasos de reversión:** + +```bash +# Si algo sale mal, puedes revertir rápidamente: + +1. Accede a tu registrador de dominio +2. Cambia los servidores de nombre de vuelta a los de DonWeb +3. Espera 5-15 minutos para que se propague (si habías reducido el TTL) +4. Verifica que el sitio y correo funcionen normalmente +``` + +**Indicadores de que necesitas hacer rollback:** + +- El sitio web no es accesible después de 2 horas +- Los correos no se envían o reciben +- Errores SSL/TLS persistentes +- Pérdida significativa de tráfico web + + + + + +Después de completar la migración, es crucial monitorear: + +**Herramientas de monitoreo recomendadas:** + +1. **AWS CloudWatch** para monitorear servicios AWS +2. **Route 53 Health Checks** para monitorear disponibilidad +3. **Google Search Console** para verificar indexación +4. 
**Herramientas de correo** para verificar entregabilidad + +**Configurar alertas:** + +```bash +# Configurar health check en Route 53 +aws route53 create-health-check \ + --caller-reference "website-health-check-$(date +%s)" \ + --health-check-config Type=HTTPS,ResourcePath=/,FullyQualifiedDomainName=tudominio.com,Port=443 +``` + +**Métricas a monitorear:** + +- Tiempo de respuesta del sitio web +- Disponibilidad del servicio (uptime) +- Entregabilidad de correos +- Errores SSL/TLS +- Tráfico web y conversiones + +**Checklist post-migración (primeras 48 horas):** + +- [ ] Sitio web accesible desde múltiples ubicaciones +- [ ] Correos se envían y reciben correctamente +- [ ] Certificados SSL funcionan correctamente +- [ ] Formularios de contacto funcionan +- [ ] Analytics y tracking funcionan +- [ ] SEO no se ve afectado negativamente + + + + + +**Problemas frecuentes y sus soluciones:** + +1. **Sitio web no accesible:** + + ```bash + # Verificar propagación DNS + dig A tudominio.com + dig A www.tudominio.com + + # Verificar desde diferentes DNS + nslookup tudominio.com 8.8.8.8 + nslookup tudominio.com 1.1.1.1 + ``` + +2. **Correos no funcionan:** + + ```bash + # Verificar registros MX + dig MX tudominio.com + + # Probar envío de correo + telnet tu-servidor-correo.com 25 + ``` + +3. **Certificado SSL no válido:** + + - Verifica que el dominio apunte a la IP correcta + - Regenera el certificado SSL en AWS + - Verifica configuración de HTTPS en el servidor + +4. 
**Propagación DNS lenta:** + + - Verifica que los TTL estén configurados correctamente + - Usa herramientas de verificación DNS global + - Espera hasta 48 horas para propagación completa + +**Comandos útiles para diagnóstico:** + +```bash +# Verificar todos los registros DNS +dig ANY tudominio.com + +# Verificar ruta de red +traceroute tudominio.com + +# Verificar certificado SSL +openssl s_client -connect tudominio.com:443 -servername tudominio.com +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-propagation-public-deployment.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-propagation-public-deployment.mdx new file mode 100644 index 000000000..b8fcbe97f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-propagation-public-deployment.mdx @@ -0,0 +1,202 @@ +--- +sidebar_position: 3 +title: "Problemas de Resolución DNS para Despliegues Públicos" +description: "Solución de problemas de propagación y resolución DNS para despliegues públicos en SleakOps" +date: "2024-12-10" +category: "project" +tags: ["dns", "despliegue", "público", "dominio", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Resolución DNS para Despliegues Públicos + +**Fecha:** 10 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** DNS, Despliegue, Público, Dominio, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario crea un despliegue público en SleakOps pero el dominio asignado no se resuelve correctamente, impidiendo el acceso a la aplicación desplegada. 
+ +**Síntomas observados:** + +- Despliegue público creado exitosamente en SleakOps +- URL del dominio generada (por ejemplo, https://site.develop.velo.la) +- La resolución DNS falla al acceder a la URL +- La aplicación no es accesible mediante el dominio público + +**Configuración relevante:** + +- Tipo de despliegue: Despliegue público +- Formato de dominio: `https://[app].[environment].[domain].la` +- Proveedor DNS: Gestionado por SleakOps +- SSL/TLS: HTTPS habilitado + +**Condiciones de error:** + +- Las consultas DNS no devuelven resultados o direcciones IP incorrectas +- El navegador muestra "No se puede acceder a este sitio" o errores similares +- La propagación DNS puede estar todavía en curso +- El problema ocurre inmediatamente después de la creación del despliegue + +## Solución Detallada + + + +Los cambios DNS pueden tardar en propagarse por internet: + +- **Caché DNS local**: 5-15 minutos +- **Servidores DNS del ISP**: 30 minutos a 2 horas +- **Propagación global**: Hasta 24-48 horas (raro) +- **Tiempo típico de resolución**: 15-30 minutos + +Este comportamiento es normal y no es un problema de la plataforma. 
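Mientras esperas, la comprobación puede automatizarse en lugar de reintentar a mano. Un boceto en shell (hipotético: el comando de verificación se pasa como parámetro para poder probarlo sin red; en la práctica sería algo como `nslookup site.develop.velo.la`):

```shell
# Reintenta un comando de verificación DNS hasta que tenga éxito
# o se agoten los intentos.
wait_for_dns() {
  check="$1"     # comando de verificación (p. ej., "nslookup site.develop.velo.la")
  attempts="$2"  # número máximo de intentos
  delay="$3"     # segundos entre intentos
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $check > /dev/null 2>&1; then
      echo "resuelto tras $i intento(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "sin resolver tras $attempts intentos"
  return 1
}

# Simulación sin red: "true" representa una resolución exitosa
wait_for_dns "true" 5 60
# → resuelto tras 1 intento(s)
```

El mismo patrón sirve en scripts de CI que deben esperar a que un dominio recién creado resuelva antes de continuar con el despliegue.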
+ + + + + +Usa estas herramientas para verificar la propagación DNS: + +**Verificadores DNS en línea:** + +```bash +# Verificar desde múltiples ubicaciones +https://dnschecker.org/ +https://www.whatsmydns.net/ +``` + +**Herramientas de línea de comando:** + +```bash +# Verificar resolución DNS +nslookup site.develop.velo.la + +# Verificar desde diferentes servidores DNS +nslookup site.develop.velo.la 8.8.8.8 +nslookup site.develop.velo.la 1.1.1.1 +``` + +**Resultado esperado:** + +``` +Name: site.develop.velo.la +Address: [DIRECCIÓN_IP] +``` + + + + + +Si el DNS ya se ha propagado pero aún no puedes acceder al sitio: + +**Windows:** + +```cmd +ipconfig /flushdns +``` + +**macOS:** + +```bash +sudo dscacheutil -flushcache +sudo killall -HUP mDNSResponder +``` + +**Linux:** + +```bash +sudo systemctl restart systemd-resolved +# o +sudo service nscd restart +``` + +**Caché del navegador:** + +- Chrome: Configuración → Privacidad → Borrar datos de navegación → Imágenes y archivos en caché +- Firefox: Configuración → Privacidad y seguridad → Borrar datos + + + + + +Asegúrate de que tu despliegue esté realmente en ejecución: + +1. **Verifica el estado del despliegue en el panel de SleakOps:** + + - Ve a tu proyecto + - Confirma que el despliegue aparece como "En ejecución" + - Revisa si hay mensajes de error + +2. **Verifica los logs de la aplicación:** + + ```bash + # Verifica si la aplicación está iniciando correctamente + kubectl logs -f deployment/[nombre-de-tu-app] + ``` + +3. **Revisa la configuración del servicio:** + - Asegúrate de que el servicio esté expuesto correctamente + - Verifica que la configuración del puerto coincida con tu aplicación + + + + + +Mientras esperas la propagación DNS: + +**1. Usa acceso directo por IP:** + +```bash +# Obtén la IP del balanceador de carga +kubectl get services +# Accede vía IP: http://[IP-EXTERNA] +``` + +**2. 
Modifica el archivo hosts local:** + +```bash +# Añade en /etc/hosts (Linux/Mac) o C:\Windows\System32\drivers\etc\hosts (Windows) +[IP-EXTERNA] site.develop.velo.la +``` + +**3. Usa kubectl port-forward:** + +```bash +kubectl port-forward service/[nombre-del-servicio] 8080:80 +# Accede vía http://localhost:8080 +``` + + + + + +Si los problemas DNS persisten después de 2 horas: + +**1. Verifica la configuración del dominio:** + +- Confirma que el dominio está correctamente configurado en SleakOps +- Asegúrate de que no haya errores tipográficos en el nombre del dominio +- Verifica que la configuración del dominio personalizado sea correcta + +**2. Verifica el certificado SSL:** + +```bash +# Verifica el estado del certificado SSL +openssl s_client -connect site.develop.velo.la:443 -servername site.develop.velo.la +``` + +**3. Contacta soporte:** +Si el problema persiste más allá de los tiempos normales de propagación, contacta al soporte de SleakOps con: + +- Nombre del despliegue y proyecto +- URL del dominio esperado +- Resultados de los chequeos DNS +- Mensajes de error o capturas de pantalla + + + +--- + +_Esta FAQ fue generada automáticamente el 10 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-resolution-failure-mysql-redis.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-resolution-failure-mysql-redis.mdx new file mode 100644 index 000000000..1856a8944 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dns-resolution-failure-mysql-redis.mdx @@ -0,0 +1,548 @@ +--- +sidebar_position: 3 +title: "Fallo en la resolución DNS para conexiones MySQL y Redis" +description: "Solución para fallos en la resolución DNS que causan errores de conexión en MySQL y Redis" +date: "2024-12-19" +category: "dependency" +tags: ["dns", "mysql", "redis", "conexión", "solución-de-problemas"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallo en la resolución DNS para conexiones MySQL y Redis + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** DNS, MySQL, Redis, Conexión, Solución de problemas + +## Descripción del problema + +**Contexto:** Aplicación que experimenta un rendimiento lento debido a fallos en la resolución DNS al intentar conectar con los servicios MySQL y Redis en un entorno Kubernetes. + +**Síntomas observados:** + +- Aplicación funcionando muy lentamente +- Fallos repetidos en la conexión a MySQL +- Fallos en la conexión a Redis +- Errores "Fallo temporal en la resolución de nombre" +- Mensajes "No se encontraron nodos activos en su clúster" +- Tiempos de espera en la conexión a la instancia Redis de AWS ElastiCache + +**Configuración relevante:** + +- Conexión MySQL: Uso de resolución por nombre de host +- Conexión Redis: `redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com:6379` +- Entorno: clúster de producción en AWS +- Patrón de error: `php_network_getaddresses: getaddrinfo failed` + +**Condiciones de error:** + +- La resolución DNS falla de forma intermitente +- Los errores ocurren durante períodos de alto tráfico +- Tanto MySQL como Redis se ven afectados simultáneamente +- La aplicación se vuelve no responsiva debido a los tiempos de espera en las conexiones + +## Solución detallada + + + +El error "Fallo temporal en la resolución de nombre" indica problemas con la resolución DNS. Esto puede suceder debido a: + +1. **Sobrecarga del servidor DNS**: demasiadas consultas DNS concurrentes +2. **Problemas de conectividad de red**: dificultades para alcanzar los servidores DNS +3. **Problemas con la caché DNS**: caché DNS obsoleta o corrupta +4. 
**Problemas con CoreDNS**: fallos en el servicio DNS de Kubernetes + +Para diagnosticar: + +```bash +# Verificar resolución DNS desde dentro de un pod +kubectl exec -it -- nslookup mysql-hostname +kubectl exec -it -- nslookup redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com + +# Verificar logs de CoreDNS +kubectl logs -n kube-system -l k8s-app=kube-dns +``` + + + + + +Incrementar réplicas de CoreDNS para manejar más consultas DNS: + +```bash +# Verificar despliegue actual de CoreDNS +kubectl get deployment coredns -n kube-system + +# Escalar réplicas de CoreDNS +kubectl scale deployment coredns --replicas=3 -n kube-system + +# Verificar escalado +kubectl get pods -n kube-system -l k8s-app=kube-dns +``` + +Para aplicaciones con alto tráfico, se recomiendan de 3 a 5 réplicas de CoreDNS. + + + + + +Implementar caché DNS a nivel de aplicación para reducir las consultas DNS: + +**Para aplicaciones PHP:** + +```php +// Añadir a la configuración de la base de datos +'mysql' => [ + 'host' => env('DB_HOST', 'localhost'), + 'options' => [ + PDO::ATTR_PERSISTENT => true, + PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true, + ], + // Habilitar agrupación de conexiones + 'pool' => [ + 'min_connections' => 5, + 'max_connections' => 20, + ] +], + +// Para conexiones Redis +'redis' => [ + 'client' => 'predis', + 'options' => [ + 'cluster' => env('REDIS_CLUSTER', 'redis'), + 'prefix' => env('REDIS_PREFIX', Str::slug(env('APP_NAME', 'laravel'), '_').'_database_'), + // Añadir agrupación de conexiones + 'persistent' => true, + ], +] +``` + + + + + +Configurar ajustes DNS en tu despliegue: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: tu-app +spec: + template: + spec: + # Configurar política DNS + dnsPolicy: ClusterFirst + dnsConfig: + options: + # Reducir tiempo de espera DNS + - name: timeout + value: "1" + # Incrementar intentos + - name: attempts + value: "3" + # Habilitar caché DNS + - name: use-vc + - name: ndots + value: "2" + containers: + - 
name: app + image: tu-app:latest + # Añadir variables de entorno relacionadas con DNS + env: + - name: DB_HOST + value: "mysql-service.default.svc.cluster.local" + - name: REDIS_HOST + value: "redis-service.default.svc.cluster.local" +``` + + + + + +Crear servicios Kubernetes para evitar la resolución DNS externa: + +**Para MySQL:** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: mysql-external +spec: + type: ExternalName + externalName: your-mysql-hostname.amazonaws.com + ports: + - port: 3306 + targetPort: 3306 +``` + +**Para Redis:** + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: redis-external +spec: + type: ExternalName + externalName: redis-aws-production-bfdbf3f.pdvyst.0001.use2.cache.amazonaws.com + ports: + - port: 6379 + targetPort: 6379 +``` + +Luego actualiza la configuración de tu aplicación: + +```bash +# Usar nombres de servicios en lugar de nombres de host externos +DB_HOST=mysql-external.default.svc.cluster.local +REDIS_HOST=redis-external.default.svc.cluster.local +``` + + + + + +Agregar monitoreo para detectar problemas DNS temprano: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: dns-monitor +data: + monitor.sh: | + #!/bin/bash + while true; do + # Probar resolución DNS + if ! nslookup mysql-service.default.svc.cluster.local > /dev/null 2>&1; then + echo "$(date): Fallo en la resolución DNS para MySQL" + fi + if ! nslookup redis-service.default.svc.cluster.local > /dev/null 2>&1; then + echo "$(date): Fallo en la resolución DNS para Redis" + fi + sleep 30 + done +``` + +Desplegar como contenedor sidecar o pod de monitoreo separado. 
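Para la opción de sidecar, un boceto de cómo montar ese ConfigMap `dns-monitor` junto a la aplicación (los nombres de la aplicación y la imagen son supuestos):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tu-app
spec:
  selector:
    matchLabels:
      app: tu-app
  template:
    metadata:
      labels:
        app: tu-app
    spec:
      containers:
        - name: app
          image: tu-app:latest
        # Sidecar que ejecuta el script del ConfigMap dns-monitor
        - name: dns-monitor
          image: busybox:1.36
          command: ["/bin/sh", "/scripts/monitor.sh"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: dns-monitor
```

Como el script escribe en la salida estándar, los fallos de resolución quedan visibles con `kubectl logs -c dns-monitor`.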
+ + + + + +La agrupación de conexiones puede reducir significativamente las consultas DNS: + +**Para aplicaciones PHP/Laravel:** + +```php +// config/database.php +'mysql' => [ + 'driver' => 'mysql', + 'host' => env('DB_HOST', '127.0.0.1'), + 'port' => env('DB_PORT', '3306'), + 'database' => env('DB_DATABASE', 'forge'), + 'username' => env('DB_USERNAME', 'forge'), + 'password' => env('DB_PASSWORD', ''), + 'charset' => 'utf8mb4', + 'collation' => 'utf8mb4_unicode_ci', + 'prefix' => '', + 'prefix_indexes' => true, + 'strict' => true, + 'engine' => null, + 'options' => [ + PDO::ATTR_PERSISTENT => true, // Conexiones persistentes + PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true, + ], +], + +// Para Redis +'redis' => [ + 'client' => env('REDIS_CLIENT', 'predis'), + 'options' => [ + 'cluster' => env('REDIS_CLUSTER', 'redis'), + 'prefix' => env('REDIS_PREFIX', Str::slug(env('APP_NAME', 'laravel'), '_').'_database_'), + 'persistent' => true, // Conexiones persistentes + ], + 'default' => [ + 'url' => env('REDIS_URL'), + 'host' => env('REDIS_HOST', '127.0.0.1'), + 'password' => env('REDIS_PASSWORD', null), + 'port' => env('REDIS_PORT', '6379'), + 'database' => env('REDIS_DB', '0'), + 'read_write_timeout' => 60, + 'persistent' => true, + ], +], +``` + +**Para aplicaciones Node.js:** + +```javascript +// database.js +const mysql = require('mysql2'); + +// Crear pool de conexiones +const pool = mysql.createPool({ + host: process.env.DB_HOST, + user: process.env.DB_USER, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + waitForConnections: true, + connectionLimit: 10, + queueLimit: 0, + acquireTimeout: 60000, + timeout: 60000, + reconnect: true +}); + +module.exports = pool.promise(); +``` + + + + + +Configurar caché DNS más agresivo para reducir consultas: + +**Configuración de systemd-resolved (Ubuntu/Debian):** + +```bash +# Editar configuración de systemd-resolved +sudo nano /etc/systemd/resolved.conf + +# Agregar configuraciones optimizadas +[Resolve] 
+DNS=8.8.8.8 1.1.1.1 +FallbackDNS=8.8.4.4 1.0.0.1 +Cache=yes +CacheFromLocalhost=yes +DNSStubListener=yes +ReadEtcHosts=yes +``` + +**Configuración en pods de Kubernetes:** + +```yaml +apiVersion: v1 +kind: Pod +spec: + dnsConfig: + options: + # Aumentar tiempo de caché DNS + - name: timeout + value: "2" + - name: attempts + value: "5" + # Reducir ndots para consultas más eficientes + - name: ndots + value: "1" + # Habilitar caché local + - name: use-vc + - name: rotate + containers: + - name: app + image: tu-app:latest + env: + # Usar IPs directas como fallback + - name: DB_HOST_IP + value: "10.0.1.100" # IP directa de MySQL + - name: REDIS_HOST_IP + value: "10.0.1.200" # IP directa de Redis +``` + + + + + +Implementar patrones de resiliencia para manejar fallos DNS: + +**Circuit Breaker para conexiones de base de datos:** + +```php +// CircuitBreaker.php +class DatabaseCircuitBreaker +{ + private $failureCount = 0; + private $lastFailureTime = null; + private $timeout = 30; // 30 segundos + private $threshold = 5; // 5 fallos consecutivos + + public function call(callable $operation) + { + if ($this->isOpen()) { + throw new Exception('Circuit breaker is open'); + } + + try { + $result = $operation(); + $this->onSuccess(); + return $result; + } catch (Exception $e) { + $this->onFailure(); + throw $e; + } + } + + private function isOpen() + { + if ($this->failureCount >= $this->threshold) { + if (time() - $this->lastFailureTime > $this->timeout) { + $this->reset(); + return false; + } + return true; + } + return false; + } + + private function onSuccess() + { + $this->reset(); + } + + private function onFailure() + { + $this->failureCount++; + $this->lastFailureTime = time(); + } + + private function reset() + { + $this->failureCount = 0; + $this->lastFailureTime = null; + } +} +``` + +**Health checks para servicios:** + +```yaml +apiVersion: v1 +kind: Pod +spec: + containers: + - name: app + image: tu-app:latest + livenessProbe: + httpGet: + path: /health + port: 
8080 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 3 +``` + + + + + +Configurar monitoreo proactivo para detectar problemas DNS: + +**Script de monitoreo DNS:** + +```bash +#!/bin/bash +# dns-monitor.sh + +MYSQL_HOST="mysql-service.default.svc.cluster.local" +REDIS_HOST="redis-service.default.svc.cluster.local" +LOG_FILE="/var/log/dns-monitor.log" + +check_dns_resolution() { + local hostname=$1 + local service_name=$2 + + if nslookup "$hostname" > /dev/null 2>&1; then + echo "$(date): ✓ $service_name DNS resolution OK" >> $LOG_FILE + return 0 + else + echo "$(date): ✗ $service_name DNS resolution FAILED" >> $LOG_FILE + # Enviar alerta + curl -X POST "$WEBHOOK_URL" \ + -H "Content-Type: application/json" \ + -d "{\"text\":\"DNS resolution failed for $service_name ($hostname)\"}" + return 1 + fi +} + +# Monitoreo continuo +while true; do + check_dns_resolution "$MYSQL_HOST" "MySQL" + check_dns_resolution "$REDIS_HOST" "Redis" + sleep 30 +done +``` + +**Configuración de alertas en Prometheus:** + +```yaml +# dns-alerts.yml +groups: + - name: dns-resolution + rules: + - alert: DNSResolutionFailure + expr: increase(dns_lookup_failures_total[5m]) > 5 + for: 2m + labels: + severity: critical + annotations: + summary: "DNS resolution failures detected" + description: "DNS resolution has failed {{ $value }} times in the last 5 minutes" + + - alert: HighDNSLatency + expr: dns_lookup_duration_seconds > 2 + for: 1m + labels: + severity: warning + annotations: + summary: "High DNS lookup latency" + description: "DNS lookups are taking {{ $value }} seconds" +``` + +**Dashboard de Grafana para monitoreo DNS:** + +```json +{ + "dashboard": { + "title": "DNS Monitoring", + "panels": [ + { + "title": "DNS Resolution Success Rate", + "type": "stat", + "targets": [ + { + "expr": 
"rate(dns_lookups_total[5m]) - rate(dns_lookup_failures_total[5m])" + } + ] + }, + { + "title": "DNS Lookup Duration", + "type": "graph", + "targets": [ + { + "expr": "dns_lookup_duration_seconds" + } + ] + } + ] + } +} +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-issues.mdx new file mode 100644 index 000000000..ce124e0c3 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-issues.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Problemas con la Caché de Construcción de Docker en Producción" +description: "Solución para la caché de construcción de Docker que impide que los cambios de código se desplieguen" +date: "2024-01-15" +category: "proyecto" +tags: ["docker", "caché", "despliegue", "construcción", "ecr"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con la Caché de Construcción de Docker en Producción + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Docker, Caché, Despliegue, Construcción, ECR + +## Descripción del Problema + +**Contexto:** El usuario realizó cambios en el código que funcionan correctamente en el entorno de desarrollo local, pero después de desplegar en producción, el comportamiento antiguo persiste a pesar de que el código está correctamente actualizado en el repositorio. 
+ +**Síntomas Observados:** + +- Los cambios en el código funcionan correctamente en desarrollo local +- Después del despliegue, producción sigue mostrando el comportamiento/errores antiguos +- Los archivos del código fuente están correctamente actualizados en producción +- El problema parece estar relacionado con la caché de construcción de Docker + +**Configuración Relevante:** + +- Plataforma: SleakOps con construcciones Docker +- Entorno: Despliegue en producción +- Sistema de construcción: Docker con caché de capas +- Registro: AWS ECR para almacenamiento de imágenes + +**Condiciones de Error:** + +- El problema ocurre después del despliegue del código +- El entorno local funciona correctamente +- El despliegue en producción no refleja los cambios en el código +- El problema persiste a través de múltiples intentos de despliegue + +## Solución Detallada + + + +La causa más común es que la caché de construcción de Docker no detecta cambios en el código de tu aplicación. Para forzar la invalidación de la caché, agrega esta línea a tu Dockerfile antes de copiar los archivos de tu aplicación: + +```dockerfile +# Añade esto antes de los comandos COPY +# Invalidador de caché +RUN echo "Frontend cache bust: v2" > /dev/null + +# Luego tus comandos COPY normales +COPY ./ClientApp /app/ClientApp +``` + +Esto obliga a Docker a reconstruir todas las capas posteriores, asegurando que tus cambios de código se incluyan. + + + + + +Si la invalidación de caché no funciona, puede que necesites limpiar las imágenes Docker almacenadas en AWS ECR: + +1. **Accede a la Consola AWS** + + - Cambia a tu cuenta AWS de producción + - Navega al servicio **Amazon ECR** + +2. **Encuentra tu repositorio** + + - Localiza el repositorio que contiene las imágenes Docker de tu proyecto + - Por lo general estará nombrado según tu proyecto + +3. **Elimina las imágenes en caché** + + - Selecciona todas las imágenes en el repositorio + - Elimínalas para forzar una reconstrucción completa + +4. 
**Despliega con un nuevo commit** + - Haz un nuevo commit (puedes eliminar la línea de invalidación de caché si lo deseas) + - Despliega los cambios + + + + + +Para prevenir este problema en el futuro, estructura tu Dockerfile para maximizar la eficiencia de la caché: + +```dockerfile +# Buenas prácticas: copiar primero archivos de dependencias +COPY package.json package-lock.json ./ +RUN npm install + +# Copiar el código de la aplicación al final (cambia con más frecuencia) +COPY ./ClientApp ./ClientApp +COPY ./ServerApp ./ServerApp + +# Construir la aplicación +RUN npm run build +``` + +De esta manera, la instalación de dependencias se cachea separadamente del código de la aplicación. + + + + + +Si el problema persiste, prueba estos pasos adicionales: + +1. **Forzar reconstrucción sin caché** + + ```bash + # Si usas Docker directamente + docker build --no-cache -t tu-imagen . + ``` + +2. **Revisar los registros de construcción** + + - Revisa los logs de despliegue en SleakOps + - Busca mensajes "Using cache" que puedan indicar capas obsoletas + +3. **Verificar las marcas de tiempo de los archivos** + + - Asegúrate que tus cambios de código tengan marcas de tiempo recientes + - Comprueba si el proceso de construcción está tomando los archivos correctos + +4. **Probar con cambios mínimos** + - Haz un cambio pequeño y visible (como agregar un console.log) + - Despliega y verifica que el cambio aparezca en producción + + + + + +Para evitar este problema en el futuro: + +1. **Usa correctamente .dockerignore** + + ``` + node_modules + .git + .env.local + *.log + ``` + +2. **Implementa una correcta invalidación de caché** + + - Usa argumentos de construcción con marcas de tiempo + - Incluye números de versión en tus builds + +3. **Monitorea los procesos de construcción** + + - Revisa regularmente los logs de despliegue + - Verifica que las construcciones realmente reconstruyan las capas cambiadas + +4. 
**Usa construcciones multi-etapa** + + ```dockerfile + FROM node:16 AS builder + COPY package*.json ./ + RUN npm install + COPY . . + RUN npm run build + + FROM nginx:alpine + COPY --from=builder /app/dist /usr/share/nginx/html + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-no-cache-option.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-no-cache-option.mdx new file mode 100644 index 000000000..080ad9780 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-cache-no-cache-option.mdx @@ -0,0 +1,168 @@ +--- +sidebar_position: 3 +title: "Problemas con la Caché de Construcción de Docker y Argumentos de Construcción" +description: "Solución para que la caché de construcción de Docker no se actualice al cambiar argumentos de construcción" +date: "2025-02-10" +category: "proyecto" +tags: ["docker", "build", "cache", "arguments", "no-cache"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con la Caché de Construcción de Docker y Argumentos de Construcción + +**Fecha:** 10 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Docker, Construcción, Caché, Argumentos, No-cache + +## Descripción del Problema + +**Contexto:** Al cambiar los argumentos de construcción de Docker en proyectos de SleakOps, el proceso de construcción puede usar capas en caché y no reflejar los nuevos valores de los argumentos en la imagen final, aunque se inicie una nueva construcción. 
+ +**Síntomas Observados:** + +- Se cambian los argumentos de construcción en la configuración del proyecto +- Se inicia una nueva construcción con éxito +- El despliegue usa la misma etiqueta de imagen +- El contenedor aún contiene archivos/configuraciones antiguas +- La caché de construcción impide que los cambios en los argumentos tengan efecto + +**Configuración Relevante:** + +- Argumentos de construcción Docker (p. ej., `MANIFEST_FILE_URL`) +- Proceso de construcción usando caché de capas Docker +- Sistema de construcción de SleakOps con etiquetado personalizado + +**Condiciones de Error:** + +- Ocurre al modificar argumentos de construcción sin cambios en el código +- La caché de Docker reutiliza capas de construcciones previas +- Los nuevos valores de argumentos no se propagan a la imagen final +- El problema persiste hasta que se modifica el código del repositorio + +## Solución Detallada + + + +La solución más directa es usar la bandera `--no-cache` al construir. Esto fuerza a Docker a reconstruir todas las capas sin usar caché. 
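Mientras la bandera no esté disponible en el frontend, un envoltorio local puede aplicar la misma lógica. Boceto hipotético (la función solo arma el comando, lo que permite probarla sin un demonio Docker; el nombre de la imagen es un supuesto):

```shell
# Construye el comando de build, añadiendo --no-cache solo cuando
# cambiaron los argumentos de construcción.
docker_build_cmd() {
  imagen="$1"
  args_cambiados="$2"   # "si" cuando se modificó algún build arg
  if [ "$args_cambiados" = "si" ]; then
    echo "docker build --no-cache -t $imagen ."
  else
    echo "docker build -t $imagen ."
  fi
}

docker_build_cmd "mi-app:latest" "si"
# → docker build --no-cache -t mi-app:latest .
```

La salida de la función puede ejecutarse directamente en el paso de build del pipeline, de modo que el costo de reconstruir sin caché solo se pague cuando realmente cambian los argumentos.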
+ +**Estado Actual en SleakOps:** + +- La plataforma se está preparando para soportar flags de construcción como `--no-cache` +- Esto estará disponible como opción manual en el frontend +- Los usuarios podrán decidir cuándo forzar una reconstrucción completa + +**Cuándo usar:** + +- Después de cambiar argumentos de construcción +- Cuando los recursos externos referenciados por los argumentos han cambiado +- Al solucionar problemas relacionados con la caché + + + + + +Modifica tu Dockerfile para reducir interferencias con la caché: + +```dockerfile +# En lugar de copiar recursos externos durante la construcción +COPY external-resource.json /app/ + +# Mover el proceso de descarga a tiempo de ejecución +RUN echo "#!/bin/sh" > /app/download.sh && \ + echo "curl -o /app/resource.json \$RESOURCE_URL" >> /app/download.sh && \ + chmod +x /app/download.sh + +# Usar variables de entorno en tiempo de ejecución +ENV RESOURCE_URL="" +CMD ["/app/download.sh", "&&", "your-app"] +``` + +**Beneficios:** + +- Los argumentos de construcción no afectan la caché de capas de Docker +- Los recursos externos se obtienen en tiempo de ejecución +- Los cambios en URLs no requieren reconstrucción de la imagen + + + + + +Para construcciones automatizadas, puedes configurar tu CI/CD para usar siempre `--no-cache` cuando cambien los argumentos de construcción: + +```yaml +# Ejemplo de configuración CI/CD +build: + script: + - | + if [ "$BUILD_ARGS_CHANGED" = "true" ]; then + docker build --no-cache -t $IMAGE_TAG . + else + docker build -t $IMAGE_TAG . + fi +``` + +**Consideraciones:** + +- Tiempos de construcción más largos al usar `--no-cache` +- Mayor consumo de recursos +- Recomendado para despliegues en producción con cambios en argumentos + + + + + +**1. Usar variables de entorno en lugar de argumentos de construcción:** + +```dockerfile +# En lugar de ARG +# ARG MANIFEST_FILE_URL +# Usar ENV en tiempo de ejecución +ENV MANIFEST_FILE_URL="" +``` + +**2. 
Incluir valores de argumentos en capas que rompen caché:** + +```dockerfile +ARG MANIFEST_FILE_URL +# Añadir una capa que cambie cuando cambia el argumento +RUN echo "Cache bust: $MANIFEST_FILE_URL" > /tmp/cache-bust +RUN curl -o /app/manifest.json "$MANIFEST_FILE_URL" +``` + +**3. Usar construcciones multi-etapa:** + +```dockerfile +FROM alpine as downloader +ARG MANIFEST_FILE_URL +RUN curl -o /tmp/manifest.json "$MANIFEST_FILE_URL" + +FROM your-base-image +COPY --from=downloader /tmp/manifest.json /app/ +``` + + + + + +SleakOps está desarrollando características mejoradas para el control de construcción: + +**Características Planeadas:** + +- Opción de bandera `--no-cache` en el frontend +- Soporte para `--flush-cache` en escenarios específicos +- Detección automática de invalidación de caché +- Detección de cambios en argumentos de construcción + +**Cronograma:** + +- Infraestructura backend lista +- Integración frontend en desarrollo +- Control manual disponible primero +- Funciones de detección automática planificadas para versiones posteriores + + + +--- + +_Esta FAQ fue generada automáticamente el 10 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-environment-variables.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-environment-variables.mdx new file mode 100644 index 000000000..814b5baa4 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-build-environment-variables.mdx @@ -0,0 +1,194 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno No Disponibles Durante la Construcción en Docker" +description: "Solución para variables de entorno que no están accesibles durante el proceso de construcción en Docker" +date: "2025-03-10" +category: "project" +tags: ["docker", "dockerfile", "build", "environment-variables", "rails"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno No Disponibles Durante la Construcción en Docker + +**Fecha:** 10 de marzo de 2025 +**Categoría:** Proyecto +**Etiquetas:** Docker, Dockerfile, Construcción, Variables de Entorno, Rails + +## Descripción del Problema + +**Contexto:** El usuario experimenta una falla en la construcción de Docker donde las variables de entorno definidas en la configuración Docker de SleakOps no son accesibles durante el proceso de construcción, específicamente para la descifrado de la clave maestra de Rails. + +**Síntomas Observados:** + +- La construcción falla con el error "Falta la clave de cifrado para descifrar el archivo" +- Rails no puede encontrar la variable de entorno RAILS_MASTER_KEY +- Las variables de entorno están definidas en la configuración Docker de SleakOps +- El proceso de construcción no puede acceder a la clave de cifrado requerida + +**Configuración Relevante:** + +- Framework: Ruby on Rails +- Error: Falta RAILS_MASTER_KEY para la descifrado con SOPS +- Variables de entorno definidas en la configuración Docker de SleakOps +- Contexto de construcción: proceso de construcción en contenedor Docker + +**Condiciones de Error:** + +- El error ocurre durante la fase de construcción en Docker +- Las variables de entorno no se pasan al contexto de construcción +- La descifrado de credenciales de Rails falla +- El proceso de construcción termina con error de clave de cifrado + +## Solución Detallada + + + +Las variables de entorno configuradas en SleakOps solo están disponibles en tiempo de ejecución, no durante la construcción. 
Para usarlas durante la construcción, debe definirlas como ARG en su Dockerfile: + +```dockerfile +# Definir el argumento que recibirá la variable de entorno +ARG RAILS_MASTER_KEY + +# Usar el argumento en su proceso de construcción +RUN echo "Clave maestra: $RAILS_MASTER_KEY" + +# Opcional: establecerlo como variable de entorno para tiempo de ejecución +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +``` + + + + + +En SleakOps, debe configurar los argumentos de construcción por separado de las variables de entorno: + +1. Vaya a sus **Configuraciones del Proyecto** +2. Navegue a **Configuración de Docker** +3. En la sección **Argumentos de Construcción** (no Variables de Entorno) +4. Añada su argumento de construcción: + ``` + RAILS_MASTER_KEY=${RAILS_MASTER_KEY} + ``` + +Esto pasa el valor de la variable de entorno como argumento de construcción. + + + + + +Para manejar secretos durante la construcción: + +```dockerfile +# Método 1: Usando ARG (recomendado para datos no sensibles) +ARG RAILS_MASTER_KEY +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY + +# Método 2: Usando secretos de Docker (para datos sensibles) +# RUN --mount=type=secret,id=rails_master_key \ +# RAILS_MASTER_KEY=$(cat /run/secrets/rails_master_key) && \ +# # sus comandos de construcción aquí + +# Método 3: Construcción multi-etapa para evitar exponer secretos +FROM ruby:3.0 as builder +ARG RAILS_MASTER_KEY +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +# Construir y descifrar aquí + +FROM ruby:3.0 as runtime +# Copiar solo archivos necesarios, no los secretos +COPY --from=builder /app /app +``` + + + + + +Para aplicaciones Rails con credenciales cifradas: + +```dockerfile +# Definir la clave maestra como argumento de construcción +ARG RAILS_MASTER_KEY +ARG RAILS_ENV=production + +# Establecer variables de entorno +ENV RAILS_MASTER_KEY=$RAILS_MASTER_KEY +ENV RAILS_ENV=$RAILS_ENV + +# Instalar dependencias +RUN bundle install + +# Precompilar assets (este paso necesita la clave maestra) +RUN 
RAILS_MASTER_KEY=$RAILS_MASTER_KEY rails assets:precompile + +# Alternativa: Crear el archivo de clave +# RUN mkdir -p config && echo "$RAILS_MASTER_KEY" > config/master.key +``` + +Asegúrese de que su archivo `config/credentials.yml.enc` esté incluido en el contexto de Docker. + + + + + +Si el problema persiste: + +1. **Verifique que el ARG esté definido antes de usarse:** + + ```dockerfile + ARG RAILS_MASTER_KEY + RUN echo "Longitud de la clave: ${#RAILS_MASTER_KEY}" # No debe ser 0 + ``` + +2. **Revise los logs de construcción para el argumento:** + + ```bash + docker build --build-arg RAILS_MASTER_KEY=su_clave_aqui . + ``` + +3. **Verifique en los logs de construcción de SleakOps:** + + - Busque "Step X: ARG RAILS_MASTER_KEY" + - Compruebe si el argumento de construcción está siendo pasado + +4. **Pruebe localmente:** + ```bash + # Pruebe la construcción con el mismo argumento + docker build --build-arg RAILS_MASTER_KEY="$(cat config/master.key)" . + ``` + + + + + +**Notas importantes de seguridad:** + +- Los argumentos de construcción son visibles en el historial de Docker (`docker history`) +- Para producción, considere usar secretos de Docker o construcciones multi-etapa +- Nunca comprometa claves maestras en el control de versiones +- Use claves diferentes para distintos entornos + +**Enfoque recomendado para producción:** + +```dockerfile +# Usar construcción multi-etapa +FROM ruby:3.0 as builder +ARG RAILS_MASTER_KEY +WORKDIR /app +COPY . . 
+RUN bundle install +RUN RAILS_MASTER_KEY=$RAILS_MASTER_KEY rails assets:precompile + +# Etapa de producción sin secretos +FROM ruby:3.0 as production +WORKDIR /app +COPY --from=builder /app/public/assets ./public/assets +COPY --from=builder /app/vendor/bundle ./vendor/bundle +# No copiar la clave maestra a la imagen final +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 10 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-daphne-logging-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-daphne-logging-configuration.mdx new file mode 100644 index 000000000..ecd528118 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-daphne-logging-configuration.mdx @@ -0,0 +1,531 @@ +--- +sidebar_position: 3 +title: "Configuración de Registro en Docker para Aplicaciones Daphne" +description: "Cómo configurar contenedores Docker para capturar y mostrar los registros de aplicaciones Daphne" +date: "2025-01-21" +category: "workload" +tags: ["docker", "daphne", "logging", "django", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Registro en Docker para Aplicaciones Daphne + +**Fecha:** 21 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Docker, Daphne, Registro, Django, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario necesita configurar un contenedor Docker que ejecuta una aplicación Django con el servidor ASGI Daphne para capturar y mostrar correctamente los registros de la aplicación a través del sistema de registro de Docker. 
+ +**Síntomas observados:** + +- Los registros de la aplicación no son visibles en los registros del contenedor Docker +- Los registros del servidor Daphne no están siendo capturados por Docker +- Los parámetros estándar de registro de Docker (`stdin_open: true`, `tty: true`) no son suficientes +- La aplicación tiene configurado el registro pero la salida no llega al stdout de Docker + +**Configuración relevante:** + +- Aplicación: Django con servidor ASGI Daphne +- Orquestación de contenedores: Docker Compose +- Parámetros actuales: `stdin_open: true` y `tty: true` +- Logger configurado para Daphne en la aplicación + +**Condiciones de error:** + +- Se generan registros en la aplicación pero no son visibles en los logs de Docker +- La configuración estándar de registro de Docker es insuficiente para Daphne +- Es necesario redirigir los registros de la aplicación a los flujos stdout/stderr de Docker + +## Solución Detallada + + + +La solución principal es configurar Daphne para que envíe los registros directamente a `/dev/stdout`, que Docker puede capturar: + +**Método 1: Configuración por línea de comandos** + +```dockerfile +# En tu Dockerfile o comando de docker-compose.yml +CMD ["daphne", "-b", "0.0.0.0", "-p", "8000", "--access-log", "/dev/stdout", "--proxy-headers", "myproject.asgi:application"] +``` + +**Método 2: Variable de entorno** + +```yaml +# docker-compose.yml +services: + web: + environment: + - DAPHNE_ACCESS_LOG=/dev/stdout + - DAPHNE_ERROR_LOG=/dev/stderr +``` + + + + + +Actualiza tu `settings.py` de Django para asegurar que los registros se dirigen a stdout: + +```python +# settings.py +LOGGING = { + 'version': 1, + 'disable_existing_loggers': False, + 'formatters': { + 'verbose': { + 'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}', + 'style': '{', + }, + 'simple': { + 'format': '{levelname} {message}', + 'style': '{', + }, + }, + 'handlers': { + 'console': { + 'class': 'logging.StreamHandler', + 'stream': 
'ext://sys.stdout', + 'formatter': 'verbose', + }, + 'daphne': { + 'class': 'logging.StreamHandler', + 'stream': 'ext://sys.stdout', + 'formatter': 'verbose', + }, + }, + 'root': { + 'handlers': ['console'], + 'level': 'INFO', + }, + 'loggers': { + 'daphne': { + 'handlers': ['daphne'], + 'level': 'INFO', + 'propagate': False, + }, + 'django': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + }, +} +``` + + + + + +Aquí tienes un ejemplo completo de cómo configurar tu `docker-compose.yml`: + +```yaml +version: "3.8" + +services: + web: + build: . + ports: + - "8000:8000" + environment: + - DEBUG=1 + - PYTHONUNBUFFERED=1 # Importante para salida de logs en tiempo real + command: > + daphne + --bind 0.0.0.0 + --port 8000 + --access-log /dev/stdout + --proxy-headers + myproject.asgi:application + logging: + driver: "json-file" + options: + max-size: "10m" + max-file: "3" + # Elimina estos si no son necesarios para tu caso de uso + # stdin_open: true + # tty: true +``` + + + + + +Optimiza tu Dockerfile para un mejor registro: + +```dockerfile +FROM python:3.11-slim + +# Establecer variables de entorno para mejor logging +ENV PYTHONUNBUFFERED=1 +ENV PYTHONDONTWRITEBYTECODE=1 + +WORKDIR /app + +# Copiar requirements e instalar dependencias +COPY requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt + +# Copiar código de la aplicación +COPY . . + +# Crear un usuario no root +RUN useradd --create-home --shell /bin/bash app +USER app + +# Exponer puerto +EXPOSE 8000 + +# Iniciar Daphne con logging adecuado +CMD ["daphne", "-b", "0.0.0.0", "-p", "8000", "--access-log", "/dev/stdout", "--proxy-headers", "myproject.asgi:application"] +``` + + + + + +**Verifica que los logs funcionen:** + +```bash +# Ver registros del contenedor +docker-compose logs -f web + +# O para un contenedor específico +docker logs -f +``` + +**Problemas comunes y soluciones:** + +1. 
**Los logs aún no aparecen:** Asegúrate de que `PYTHONUNBUFFERED=1` esté configurado
+2. **Logs parciales:** Verifica si tu aplicación usa print en lugar de logging adecuado
+3. **Problemas de rendimiento:** Considera usar logging estructurado (formato JSON)
+
+**Prueba la configuración de logging:**
+
+```python
+# Agrega esto a tus vistas de Django o comandos de gestión
+import logging
+
+from django.http import HttpResponse
+
+logger = logging.getLogger(__name__)
+
+def test_view(request):
+    logger.info("Mensaje de prueba de log desde vista Django")
+    return HttpResponse("Revisa los logs de Docker para el mensaje")
+```
+
+**Otros problemas frecuentes:**
+
+1. **Los logs de Daphne no se muestran:**
+
+   - Verifica que `--access-log /dev/stdout` esté en el comando
+   - Revisa la configuración de logging en Django
+
+2. **Logs duplicados:**
+
+   - Puede ocurrir si tanto Django como Daphne están enviando logs al mismo stream
+   - Ajusta la configuración de logging para evitar duplicación
+
+
+
+Para un control más granular sobre los logs, configura diferentes niveles de logging:
+
+```python
+# settings.py - Configuración avanzada
+import os
+
+LOGGING = {
+    'version': 1,
+    'disable_existing_loggers': False,
+    'formatters': {
+        'verbose': {
+            'format': '[{asctime}] {levelname} {name} {module}.{funcName}:{lineno} {message}',
+            'style': '{',
+        },
+        'simple': {
+            'format': '[{asctime}] {levelname} {message}',
+            'style': '{',
+        },
+        'json': {
+            # Las llaves literales van dobladas para no chocar con el formato de estilo '{'
+            'format': '{{"time": "{asctime}", "level": "{levelname}", "logger": "{name}", "message": "{message}"}}',
+            'style': '{',
+        },
+    },
+    'handlers': {
+        'console': {
+            'class': 'logging.StreamHandler',
+            'stream': 'ext://sys.stdout',
+            'formatter': 'verbose' if os.getenv('DEBUG') else 'json',
+        },
+        'error_console': {
+            'class': 'logging.StreamHandler',
+            'stream': 'ext://sys.stderr',
+            'formatter': 'verbose',
'level': 'ERROR', + }, + }, + 'root': { + 'handlers': ['console'], + 'level': os.getenv('LOG_LEVEL', 'INFO'), + }, + 'loggers': { + 'django': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + 'daphne': { + 'handlers': ['console'], + 'level': 'INFO', + 'propagate': False, + }, + 'django.request': { + 'handlers': ['error_console'], + 'level': 'ERROR', + 'propagate': False, + }, + 'myapp': { # Reemplaza con el nombre de tu aplicación + 'handlers': ['console'], + 'level': 'DEBUG' if os.getenv('DEBUG') else 'INFO', + 'propagate': False, + }, + }, +} +``` + +**Variables de entorno para docker-compose.yml:** + +```yaml +services: + web: + environment: + - DEBUG=0 + - LOG_LEVEL=INFO + - PYTHONUNBUFFERED=1 +``` + + + + + +Para aplicaciones en producción, considera usar logging estructurado: + +```python +# requirements.txt +django-structlog==4.0.0 +structlog==23.1.0 + +# settings.py +import structlog + +# Configuración de structlog +structlog.configure( + processors=[ + structlog.stdlib.filter_by_level, + structlog.stdlib.add_logger_name, + structlog.stdlib.add_log_level, + structlog.stdlib.PositionalArgumentsFormatter(), + structlog.processors.TimeStamper(fmt="iso"), + structlog.processors.StackInfoRenderer(), + structlog.processors.format_exc_info, + structlog.processors.UnicodeDecoder(), + structlog.processors.JSONRenderer() + ], + context_class=dict, + logger_factory=structlog.stdlib.LoggerFactory(), + wrapper_class=structlog.stdlib.BoundLogger, + cache_logger_on_first_use=True, +) + +# En tus vistas o modelos +import structlog +logger = structlog.get_logger(__name__) + +def my_view(request): + logger.info("Processing request", + user_id=request.user.id, + path=request.path, + method=request.method) +``` + +**Configuración de Docker para logging estructurado:** + +```yaml +# docker-compose.yml +services: + web: + environment: + - STRUCTURED_LOGGING=true + - LOG_FORMAT=json + logging: + driver: "json-file" + options: + max-size: "10m" + 
max-file: "5" + labels: "service,environment" + labels: + - "service=django-app" + - "environment=production" +``` + + + + + +Configura la integración con sistemas de monitoreo como ELK Stack o Grafana: + +**Para ELK Stack (Elasticsearch, Logstash, Kibana):** + +```yaml +# docker-compose.yml +version: "3.8" + +services: + web: + build: . + environment: + - PYTHONUNBUFFERED=1 + - LOG_FORMAT=json + logging: + driver: "gelf" + options: + gelf-address: "udp://logstash:12201" + tag: "django-app" + + logstash: + image: docker.elastic.co/logstash/logstash:8.8.0 + volumes: + - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf + ports: + - "12201:12201/udp" +``` + +**Configuración de Logstash (logstash.conf):** + +```ruby +input { + gelf { + port => 12201 + } +} + +filter { + if [tag] == "django-app" { + json { + source => "message" + } + + date { + match => [ "time", "ISO8601" ] + } + } +} + +output { + elasticsearch { + hosts => ["elasticsearch:9200"] + index => "django-logs-%{+YYYY.MM.dd}" + } +} +``` + +**Para Prometheus y Grafana:** + +```python +# requirements.txt +django-prometheus==2.3.1 + +# settings.py +INSTALLED_APPS = [ + 'django_prometheus', + # ... otras apps +] + +MIDDLEWARE = [ + 'django_prometheus.middleware.PrometheusBeforeMiddleware', + # ... otros middlewares + 'django_prometheus.middleware.PrometheusAfterMiddleware', +] + +# urls.py +from django.urls import path, include + +urlpatterns = [ + path('', include('django_prometheus.urls')), + # ... 
otras URLs +] +``` + + + + + +Implementa estrategias de rotación de logs para evitar problemas de espacio: + +**Configuración de Docker con rotación automática:** + +```yaml +# docker-compose.yml +services: + web: + logging: + driver: "json-file" + options: + max-size: "50m" + max-file: "10" + compress: "true" +``` + +**Script de limpieza de logs:** + +```bash +#!/bin/bash +# cleanup-logs.sh + +# Limpiar logs de Docker más antiguos de 7 días +docker system prune -f --filter "until=168h" + +# Limpiar logs específicos del contenedor +CONTAINER_NAME="myapp_web_1" +docker logs --since 7d $CONTAINER_NAME > /dev/null 2>&1 + +# Comprimir logs antiguos +find /var/lib/docker/containers -name "*.log" -mtime +7 -exec gzip {} \; + +echo "Log cleanup completed at $(date)" +``` + +**Configuración de cron para limpieza automática:** + +```bash +# Agregar al crontab +0 2 * * 0 /path/to/cleanup-logs.sh >> /var/log/log-cleanup.log 2>&1 +``` + +**Configuración de logrotate para logs de aplicación:** + +```bash +# /etc/logrotate.d/django-app +/var/log/django/*.log { + daily + missingok + rotate 30 + compress + delaycompress + notifempty + create 644 www-data www-data + postrotate + docker-compose restart web + endscript +} +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 21 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-exec-format-error-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-exec-format-error-troubleshooting.mdx new file mode 100644 index 000000000..9da04fcab --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/docker-exec-format-error-troubleshooting.mdx @@ -0,0 +1,428 @@ +--- +sidebar_position: 3 +title: "Error de Formato Exec de Docker en Jobs" +description: "Solución para error de formato exec al ejecutar contenedores Docker en jobs de SleakOps" +date: "2025-01-27" 
+category: "carga-de-trabajo" +tags: ["docker", "jobs", "contenedores", "error-formato-exec", "arquitectura"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Formato Exec de Docker en Jobs + +**Fecha:** 27 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Docker, Jobs, Contenedores, Error de Formato Exec, Arquitectura + +## Descripción del Problema + +**Contexto:** El usuario intenta crear un job en SleakOps usando una imagen Docker de Node.js con TypeScript pero encuentra problemas de compatibilidad de arquitectura. + +**Síntomas Observados:** + +- Mensaje de error: `exec /bin/sleep: exec format error` +- Error `ImagePullBackOff` cuando no se especifica etiqueta +- El job no inicia correctamente +- El problema ocurre específicamente con la imagen `efrecon/ts-node:9.1.1` + +**Configuración Relevante:** + +- Imagen Docker: `efrecon/ts-node` +- Etiqueta: `9.1.1` +- Plataforma: Jobs de SleakOps +- Destino: Ejecución de Node.js con TypeScript + +**Condiciones de Error:** + +- El error ocurre durante el inicio del contenedor +- Sucede al usar etiquetas específicas de la imagen Docker +- Relacionado con incompatibilidad de arquitectura entre la imagen y los nodos del clúster +- Impide la ejecución del job + +## Solución Detallada + + + +El error `exec /bin/sleep: exec format error` típicamente indica una incompatibilidad de arquitectura: + +1. **Incompatibilidad de arquitectura**: La imagen Docker fue construida para una arquitectura CPU diferente (por ejemplo, ARM vs x86_64) +2. **Desajuste de plataforma**: La imagen no coincide con la arquitectura de los nodos de tu clúster +3. 
**Soporte multi-arquitectura**: La etiqueta específica puede no soportar la arquitectura de tu clúster + +**Escenarios comunes:** + +- Imagen construida para ARM64 corriendo en nodos x86_64 +- Imagen construida para x86_64 corriendo en nodos ARM64 (menos común) +- Falta de soporte multi-arquitectura en la etiqueta específica + + + + + +Antes de usar una imagen Docker, verifica su compatibilidad arquitectónica: + +1. **Revisa Docker Hub**: Visita la página de la imagen en Docker Hub +2. **Busca etiquetas de arquitectura**: Verifica si hay builds multi-arquitectura disponibles +3. **Usa docker manifest** (si tienes acceso a Docker CLI): + +```bash +docker manifest inspect efrecon/ts-node:9.1.1 +``` + +4. **Revisa plataformas soportadas**: Busca `linux/amd64`, `linux/arm64`, etc. + + + + + +En lugar de `efrecon/ts-node`, considera estas alternativas bien mantenidas: + +**Opción 1: Imagen oficial de Node.js con ts-node** + +```yaml +image: node:18-alpine +command: ["/bin/sh", "-c"] +args: ["npm install -g ts-node typescript && ts-node your-script.ts"] +``` + +**Opción 2: Enfoque con Dockerfile personalizado** + +```dockerfile +FROM node:18-alpine +RUN npm install -g ts-node typescript +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +CMD ["ts-node", "src/index.ts"] +``` + +**Opción 3: Build multi-etapa** + +```dockerfile +FROM node:18-alpine as builder +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +RUN npm run build + +FROM node:18-alpine +WORKDIR /app +COPY --from=builder /app/dist ./dist +COPY package*.json ./ +RUN npm install --only=production +CMD ["node", "dist/index.js"] +``` + + + + + +Al configurar tu job en SleakOps: + +1. **Usa imágenes base confiables**: + + - `node:18-alpine` (ligera) + - `node:18` (base completa Ubuntu) + - `node:18-slim` (Ubuntu mínima) + +2. 
**Ejemplo de configuración de job**: + +```yaml +name: typescript-job +image: node:18-alpine +tag: latest +command: ["/bin/sh"] +args: ["-c", "npm install -g ts-node typescript && ts-node /app/script.ts"] +resources: + requests: + memory: "256Mi" + cpu: "100m" + limits: + memory: "512Mi" + cpu: "500m" +``` + +3. **Variables de entorno** (si es necesario): + +```yaml +env: + - name: NODE_ENV + value: "production" + - name: TS_NODE_PROJECT + value: "/app/tsconfig.json" +``` + + + + + +**Paso 1: Verifica la arquitectura del clúster** + +- Revisa la configuración de tu clúster SleakOps +- La mayoría de los clústeres SleakOps corren en arquitectura x86_64 (AMD64) + +**Paso 2: Prueba primero con imágenes oficiales** + +```yaml +image: node:18-alpine +command: ["node"] +args: ["-v"] +``` + +**Paso 3: Añade soporte para TypeScript gradualmente** + +```yaml +image: node:18-alpine +command: ["/bin/sh"] +args: ["-c", "npm install -g typescript && tsc --version"] +``` + +**Paso 4: Ejecución completa de TypeScript** + +```yaml +image: node:18-alpine +command: ["/bin/sh"] +args: + [ + "-c", + 'npm install -g ts-node typescript && echo ''console.log("Hello TypeScript")'' > test.ts && ts-node test.ts', + ] +``` + +**Paso 5: Monta tu código real** + +- Usa ConfigMaps o Secrets para scripts pequeños +- Usa volúmenes persistentes para bases de código grandes +- Considera construir imágenes personalizadas para aplicaciones complejas + + + + + +**Selección de Imagen:** + +- Usa imágenes oficiales de Node.js cuando sea posible +- Prefiere variantes Alpine para menor tamaño +- Siempre especifica versiones exactas (evita `latest`) + +**Manejo de TypeScript:** + +- Precompila TypeScript para jobs de producción +- Usa ts-node solo para desarrollo/pruebas +- Considera usar esbuild para compilación más rápida + +**Gestión de Recursos:** + +```yaml +resources: + requests: + memory: "128Mi" # Mínimo para Node.js + cpu: "50m" + limits: + memory: "1Gi" # Ajusta según tu app + cpu: "1000m" +``` + 
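+
+Las pautas anteriores sobre precompilar TypeScript y usar esbuild pueden combinarse en un boceto de Dockerfile. Es solo una referencia: los nombres `src/index.ts` y `dist/index.js`, y el uso de `npx esbuild`, son supuestos ilustrativos y no forman parte de la configuración original del usuario.
+
+```dockerfile
+# Etapa de compilación: empaqueta el TypeScript con esbuild (más rápido que tsc)
+FROM node:18-alpine AS build
+WORKDIR /app
+COPY package*.json ./
+RUN npm ci
+COPY . .
+# Un solo bundle autocontenido; no hace falta ts-node en producción
+RUN npx esbuild src/index.ts --bundle --platform=node --target=node18 --outfile=dist/index.js
+
+# Etapa final: solo el runtime de Node.js y el bundle compilado
+FROM node:18-alpine
+WORKDIR /app
+COPY --from=build /app/dist/index.js ./index.js
+CMD ["node", "index.js"]
+```
+
+El job resultante arranca con `node index.js` directamente, sin instalar compiladores en tiempo de ejecución, lo que reduce el tamaño de la imagen y el tiempo de inicio.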
+**Manejo de Errores:** + +- Siempre incluye códigos de salida adecuados en tus scripts +- Usa cheques de salud cuando sea apropiado +- Registra errores en stdout/stderr para monitoreo en SleakOps + +**Ejemplo de script TypeScript robusto:** + +```typescript +// job-script.ts +import { exit } from 'process'; + +async function main() { + try { + console.log('Iniciando job...'); + + // Tu lógica aquí + await performTask(); + + console.log('Job completado exitosamente'); + exit(0); + } catch (error) { + console.error('Error en el job:', error); + exit(1); + } +} + +async function performTask() { + // Implementa tu lógica de negocio aquí + console.log('Ejecutando tarea...'); +} + +main(); +``` + + + + + +**1. Revisar logs del job:** + +```bash +# Ver logs del job en SleakOps +kubectl logs job/your-job-name + +# Ver logs de pods específicos +kubectl logs pod/your-pod-name + +# Seguir logs en tiempo real +kubectl logs -f job/your-job-name +``` + +**2. Inspeccionar el estado del job:** + +```bash +# Ver detalles del job +kubectl describe job your-job-name + +# Ver eventos relacionados +kubectl get events --field-selector involvedObject.name=your-job-name +``` + +**3. Probar la imagen localmente:** + +```bash +# Probar la imagen en tu máquina local +docker run --rm efrecon/ts-node:9.1.1 /bin/echo "test" + +# Verificar arquitectura +docker run --rm efrecon/ts-node:9.1.1 uname -m +``` + +**4. Usar modo interactivo para depuración:** + +```yaml +# Job temporal para depuración +image: node:18-alpine +command: ["/bin/sh"] +args: ["-c", "sleep 3600"] # Mantiene el contenedor vivo +``` + + + + + +**1. Validación de imágenes antes del despliegue:** + +```bash +# Script para validar compatibilidad de imagen +#!/bin/bash +IMAGE=$1 +echo "Verificando imagen: $IMAGE" + +# Verificar si la imagen existe +docker manifest inspect $IMAGE > /dev/null 2>&1 +if [ $? 
-ne 0 ]; then + echo "Error: La imagen no existe o no es accesible" + exit 1 +fi + +# Verificar arquitecturas soportadas +echo "Arquitecturas soportadas:" +docker manifest inspect $IMAGE | jq -r '.manifests[].platform | "\(.architecture)/\(.os)"' +``` + +**2. Usar imágenes con soporte multi-arquitectura:** + +- Prefiere imágenes oficiales que soportan múltiples arquitecturas +- Verifica que la imagen tenga builds para `linux/amd64` y `linux/arm64` +- Usa registros de imágenes que proporcionen manifiestos multi-arquitectura + +**3. Crear imágenes personalizadas:** + +```dockerfile +# Dockerfile multi-arquitectura +FROM --platform=$BUILDPLATFORM node:18-alpine AS base +WORKDIR /app + +# Instalar dependencias +COPY package*.json ./ +RUN npm ci --only=production + +# Instalar herramientas de desarrollo +FROM base AS dev +RUN npm install -g ts-node typescript + +# Imagen final +FROM base AS production +COPY --from=dev /usr/local/lib/node_modules /usr/local/lib/node_modules +COPY --from=dev /usr/local/bin /usr/local/bin +COPY . . +CMD ["ts-node", "src/index.ts"] +``` + +**4. 
Documentar requisitos de arquitectura:** + +```yaml +# En tu configuración de job, documenta los requisitos +metadata: + annotations: + sleakops.io/architecture: "amd64" + sleakops.io/image-verified: "true" + sleakops.io/last-tested: "2025-01-27" +``` + + + + + +**Para desarrollo con TypeScript:** + +```yaml +# Opción 1: Imagen base con instalación en tiempo de ejecución +image: node:18-alpine +command: ["/bin/sh"] +args: ["-c", "npm install -g ts-node typescript @types/node && ts-node --version && exec ts-node /app/script.ts"] + +# Opción 2: Usar imagen con TypeScript preinstalado +image: node:18 +command: ["/bin/sh"] +args: ["-c", "npm install -g typescript ts-node && ts-node /app/script.ts"] +``` + +**Para aplicaciones de producción:** + +```yaml +# Compilar TypeScript a JavaScript +image: node:18-alpine +command: ["/bin/sh"] +args: ["-c", "npm install -g typescript && tsc /app/src/index.ts --outDir /app/dist && node /app/dist/index.js"] +``` + +**Para scripts simples:** + +```yaml +# Usar JavaScript directamente +image: node:18-alpine +command: ["node"] +args: ["-e", "console.log('Hello from Node.js job!')"] +``` + +**Para casos complejos:** + +```yaml +# Usar init container para preparación +initContainers: +- name: setup + image: node:18-alpine + command: ["/bin/sh"] + args: ["-c", "npm install -g ts-node typescript && echo 'Setup complete'"] + volumeMounts: + - name: shared-tools + mountPath: /shared +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dockerfile-dotnet-build-errors.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dockerfile-dotnet-build-errors.mdx new file mode 100644 index 000000000..10c2d0333 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/dockerfile-dotnet-build-errors.mdx @@ -0,0 +1,548 @@ +--- +sidebar_position: 
3 +title: "Errores de compilación en Dockerfile con aplicaciones .NET" +description: "Solución de problemas de fallos en la compilación de Docker y problemas de inicio de aplicaciones en proyectos .NET" +date: "2024-12-19" +category: "proyecto" +tags: ["dockerfile", ".net", "compilación", "solución de problemas", "docker"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Errores de compilación en Dockerfile con aplicaciones .NET + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Dockerfile, .NET, Compilación, Solución de problemas, Docker + +## Descripción del problema + +**Contexto:** Los usuarios experimentan fallos en la compilación al desplegar aplicaciones .NET a través de SleakOps, particularmente durante el proceso de compilación de Docker o cuando la aplicación no logra iniciarse correctamente después del despliegue. + +**Síntomas observados:** + +- La compilación de Docker falla durante el proceso de despliegue +- La aplicación no se inicia y muestra errores en tiempo de ejecución de .NET +- Los registros de compilación muestran errores relacionados con la ejecución del comando `dotnet` +- Los despliegues de pull request fallan inesperadamente + +**Configuración relevante:** + +- Tipo de aplicación: API Web .NET +- Proceso de compilación: despliegue basado en Docker +- Comando: `dotnet Netdo.Firev.WebApi.dll` +- Disparador de despliegue: Pull request de la rama develop a main + +**Condiciones de error:** + +- Error ocurre durante el proceso de compilación de Docker +- Puede estar relacionado con la configuración del Dockerfile +- Podría involucrar problemas a nivel de aplicación (problemas con la clase Main) +- Aparece durante flujos de trabajo automatizados de despliegue + +## Solución detallada + + + +Para diagnosticar el problema, primero prueba tu aplicación localmente usando Docker: + +1. 
**Ejecuta el contenedor localmente:** + + ```bash + docker compose run --rm --name api-shell api + ``` + +2. **Dentro del contenedor, prueba el comando de la aplicación:** + + ```bash + dotnet Netdo.Firev.WebApi.dll + ``` + +3. **Verifica si hay errores en tiempo de ejecución o dependencias faltantes** + +Esto ayudará a identificar si el problema está en el código de la aplicación o en la configuración de Docker. + + + + + +Problemas comunes en Dockerfile con aplicaciones .NET: + +```dockerfile +# Asegúrate de usar la imagen base correcta +FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base +WORKDIR /app +EXPOSE 80 +EXPOSE 443 + +# Etapa de compilación +FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build +WORKDIR /src +COPY ["YourProject.csproj", "./"] +RUN dotnet restore "YourProject.csproj" +COPY . . +RUN dotnet build "YourProject.csproj" -c Release -o /app/build + +# Etapa de publicación +FROM build AS publish +RUN dotnet publish "YourProject.csproj" -c Release -o /app/publish + +# Etapa final +FROM base AS final +WORKDIR /app +COPY --from=publish /app/publish . +ENTRYPOINT ["dotnet", "YourProject.dll"] +``` + +**Puntos clave a verificar:** + +- Versión correcta del runtime de .NET +- Copia y pasos de compilación adecuados +- Configuración correcta del punto de entrada +- Inclusión de todas las dependencias necesarias + + + + + +Si el error está relacionado con la clase Main o el inicio de la aplicación: + +1. **Revisa tu archivo Program.cs:** + + ```csharp + // Para .NET 6+ (modelo de hosting minimalista) + var builder = WebApplication.CreateBuilder(args); + + // Agrega servicios + builder.Services.AddControllers(); + + var app = builder.Build(); + + // Configura el pipeline + app.UseRouting(); + app.MapControllers(); + + app.Run(); + ``` + +2. 
**Para versiones anteriores de .NET, asegura un método Main correcto:**
+
+   ```csharp
+   public class Program
+   {
+       public static void Main(string[] args)
+       {
+           CreateHostBuilder(args).Build().Run();
+       }
+
+       public static IHostBuilder CreateHostBuilder(string[] args) =>
+           Host.CreateDefaultBuilder(args)
+               .ConfigureWebHostDefaults(webBuilder =>
+               {
+                   webBuilder.UseStartup<Startup>();
+               });
+   }
+   ```
+
+3. **Verifica tu archivo de proyecto (.csproj):**
+
+   ```xml
+   <Project Sdk="Microsoft.NET.Sdk.Web">
+     <PropertyGroup>
+       <TargetFramework>net6.0</TargetFramework>
+       <OutputType>Exe</OutputType>
+     </PropertyGroup>
+   </Project>
+   ```
+
+
+
+Para obtener información detallada sobre la falla en la compilación:
+
+1. **Accede al panel de control de SleakOps**
+2. **Navega a los registros de despliegue de tu proyecto**
+3. **Busca mensajes de error específicos en la fase de compilación**
+4. **Revisa si hay:**
+   - Dependencias faltantes
+   - Errores de compilación
+   - Problemas de configuración en tiempo de ejecución
+   - Problemas de permisos en archivos
+
+**Patrones comunes de error:**
+
+- `No se encontró un ejecutable coincidente`
+- `Ensamblado no encontrado`
+- `Errores de configuración`
+- `Problemas de enlace de puerto`
+
+
+
+Asegúrate de que tu aplicación esté correctamente configurada para el entorno de despliegue:
+
+1. **Revisa appsettings.json:**
+
+   ```json
+   {
+     "Logging": {
+       "LogLevel": {
+         "Default": "Information"
+       }
+     },
+     "AllowedHosts": "*",
+     "Kestrel": {
+       "Endpoints": {
+         "Http": {
+           "Url": "http://0.0.0.0:80"
+         }
+       }
+     }
+   }
+   ```
+
+2. **Configura variables de entorno en docker-compose.yml:**
+
+   ```yaml
+   services:
+     api:
+       environment:
+         - ASPNETCORE_ENVIRONMENT=Production
+         - ASPNETCORE_URLS=http://+:80
+   ```
+
+
+
+Problemas comunes con dependencias en aplicaciones .NET:
+
+1. **Verifica el archivo .csproj:**
+
+   ```xml
+   <Project Sdk="Microsoft.NET.Sdk.Web">
+     <PropertyGroup>
+       <TargetFramework>net6.0</TargetFramework>
+       <Nullable>enable</Nullable>
+       <ImplicitUsings>enable</ImplicitUsings>
+     </PropertyGroup>
+
+     <ItemGroup>
+       <!-- Declara aquí tus PackageReference con versiones explícitas -->
+     </ItemGroup>
+   </Project>
+   ```
+
+2. **Limpia y restaura paquetes:**
+
+   ```bash
+   # Localmente
+   dotnet clean
+   dotnet restore
+   dotnet build
+
+   # En el Dockerfile
+   RUN dotnet clean
+   RUN dotnet restore --no-cache
+   ```
+
+3.
**Verifica compatibilidad de versiones:** + - Asegúrate de que todas las dependencias sean compatibles con tu versión de .NET + - Revisa conflictos de versiones en el archivo de proyecto + + + + + +Un Dockerfile optimizado para aplicaciones .NET: + +```dockerfile +# Etapa 1: Imagen base para runtime +FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base +WORKDIR /app +EXPOSE 80 +EXPOSE 443 + +# Etapa 2: Imagen SDK para compilación +FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build +WORKDIR /src + +# Copiar archivos de proyecto y restaurar dependencias +COPY ["Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj", "Netdo.Firev.WebApi/"] +COPY ["Netdo.Firev.Core/Netdo.Firev.Core.csproj", "Netdo.Firev.Core/"] +COPY ["Netdo.Firev.Infrastructure/Netdo.Firev.Infrastructure.csproj", "Netdo.Firev.Infrastructure/"] + +RUN dotnet restore "Netdo.Firev.WebApi/Netdo.Firev.WebApi.csproj" + +# Copiar todo el código fuente +COPY . . + +# Compilar la aplicación +WORKDIR "/src/Netdo.Firev.WebApi" +RUN dotnet build "Netdo.Firev.WebApi.csproj" -c Release -o /app/build + +# Etapa 3: Publicación +FROM build AS publish +RUN dotnet publish "Netdo.Firev.WebApi.csproj" -c Release -o /app/publish /p:UseAppHost=false + +# Etapa 4: Imagen final +FROM base AS final +WORKDIR /app +COPY --from=publish /app/publish . + +# Crear usuario no-root para seguridad +RUN adduser --disabled-password --gecos '' appuser && chown -R appuser /app +USER appuser + +ENTRYPOINT ["dotnet", "Netdo.Firev.WebApi.dll"] +``` + +**Puntos clave del Dockerfile optimizado:** + +- Usa imágenes específicas para runtime y SDK +- Copia archivos de proyecto primero para aprovechar el cache de Docker +- Incluye usuario no-root para seguridad +- Usa parámetros de publicación optimizados + + + + + +Estrategias para diagnosticar problemas de compilación: + +1. **Habilita logging detallado:** + ```dockerfile + # En el Dockerfile, agrega verbosidad + RUN dotnet build "YourProject.csproj" -c Release -o /app/build --verbosity detailed + ``` + +2. 
**Ejecuta compilación paso a paso:**

   ```bash
   # Prueba cada etapa individualmente
   docker build --target build -t myapp:build .
   docker run --rm -it myapp:build /bin/bash

   # Dentro del contenedor
   dotnet --info
   dotnet restore --verbosity detailed
   dotnet build --verbosity detailed
   ```

3. **Verifica el entorno de compilación:**

   ```bash
   # Revisa la versión de .NET en el contenedor
   docker run --rm mcr.microsoft.com/dotnet/sdk:6.0 dotnet --version

   # Verifica las variables de entorno
   docker run --rm mcr.microsoft.com/dotnet/sdk:6.0 env
   ```

4. **Analiza logs de SleakOps:**
   - Busca mensajes específicos de error en los logs de compilación
   - Identifica en qué etapa del Dockerfile falla la compilación
   - Revisa errores de dependencias o configuración



**Error: "No se encontró un ejecutable coincidente"**

```bash
# Solución: Verifica el nombre del archivo DLL
ls /app/publish/
# Asegúrate de que el nombre en ENTRYPOINT coincida con el archivo generado
```

**Error: "Ensamblado no encontrado"**

```dockerfile
# Solución: Incluye todas las dependencias en la publicación
RUN dotnet publish "YourProject.csproj" -c Release -o /app/publish --self-contained false --runtime linux-x64
```

**Error: "Puerto ya en uso"**

```yaml
# docker-compose.yml
services:
  api:
    ports:
      - "8080:80" # Cambia el puerto externo si hay conflicto
    environment:
      - ASPNETCORE_URLS=http://+:80
```

**Error: "Problemas de permisos"**

```dockerfile
# Solución: Asigna la propiedad de /app al usuario de la aplicación.
# No hace falta `chmod +x` sobre la DLL: quien se ejecuta es `dotnet`.
RUN chown -R appuser:appuser /app
```

**Error: "Dependencias de runtime faltantes"**

```dockerfile
# Solución: Usa la imagen runtime de ASP.NET Core
FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
# O, para publicaciones autocontenidas (self-contained), la imagen mínima de dependencias
FROM mcr.microsoft.com/dotnet/runtime-deps:6.0 AS base
```



**Optimizaciones del Dockerfile:**

```dockerfile
# Usa cache de NuGet para
acelerar compilaciones +FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build +WORKDIR /src + +# Configura cache de NuGet +ENV NUGET_PACKAGES=/root/.nuget/packages +COPY ["*.csproj", "./"] +RUN dotnet restore --packages $NUGET_PACKAGES + +# Optimiza la imagen final +FROM mcr.microsoft.com/dotnet/aspnet:6.0-alpine AS final +WORKDIR /app + +# Instala dependencias del sistema si es necesario +RUN apk add --no-cache icu-libs + +# Configura variables de entorno para rendimiento +ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false +ENV DOTNET_EnableDiagnostics=0 +``` + +**Configuración de producción:** + +```json +// appsettings.Production.json +{ + "Logging": { + "LogLevel": { + "Default": "Warning", + "Microsoft.AspNetCore": "Warning" + } + }, + "Kestrel": { + "Limits": { + "MaxConcurrentConnections": 100, + "MaxRequestBodySize": 10485760 + } + } +} +``` + +**docker-compose.yml optimizado:** + +```yaml +version: "3.8" + +services: + api: + build: + context: . + dockerfile: Dockerfile + args: + - BUILDKIT_INLINE_CACHE=1 + environment: + - ASPNETCORE_ENVIRONMENT=Production + - DOTNET_EnableDiagnostics=0 + deploy: + resources: + limits: + memory: 512M + reservations: + memory: 256M + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:80/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 40s +``` + + + + + +**Configuración para GitHub Actions:** + +```yaml +name: .NET CI/CD + +on: + push: + branches: [ main, develop ] + pull_request: + branches: [ main ] + +jobs: + build-and-test: + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v3 + + - name: Setup .NET + uses: actions/setup-dotnet@v3 + with: + dotnet-version: 6.0.x + + - name: Restore dependencies + run: dotnet restore + + - name: Build + run: dotnet build --no-restore --configuration Release + + - name: Test + run: dotnet test --no-build --configuration Release --verbosity normal + + - name: Deploy to SleakOps + if: github.ref == 'refs/heads/main' + run: | + pip install sleakops + sleakops 
build -p ${{ secrets.SLEAKOPS_PROJECT }} -b main -w
          sleakops deploy -p ${{ secrets.SLEAKOPS_PROJECT }} -e production -w
        env:
          SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }}
```

**Configuración de health checks:**

```csharp
// Program.cs
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks();

var app = builder.Build();

app.MapHealthChecks("/health");
app.MapHealthChecks("/ready");

app.Run();
```

**Monitoreo y alertas:**

```yaml
# docker-compose.yml
services:
  api:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=Host(`api.yourdomain.com`)"
      - "traefik.http.services.api.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.api.loadbalancer.healthcheck.interval=30s"
```



---

_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._

diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-certificate-delegation-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-certificate-delegation-error.mdx
new file mode 100644
index 000000000..7f369b9b7
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-certificate-delegation-error.mdx
@@ -0,0 +1,165 @@
---
sidebar_position: 3
title: "Error de Delegación de Certificado de Dominio Durante el Despliegue"
description: "Solución para el error ACMModule DoesNotExist al desplegar con certificados de dominio"
date: "2024-12-19"
category: "proyecto"
tags: ["despliegue", "dominio", "certificado", "acm", "resolución de problemas"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Error de Delegación de Certificado de Dominio Durante el Despliegue

**Fecha:** 19 de diciembre de 2024
**Categoría:** Proyecto
**Etiquetas:** Despliegue, Dominio, Certificado, ACM, Resolución de problemas

## 
Descripción del Problema + +**Contexto:** El usuario experimenta una falla en el despliegue al intentar registrar un dominio de producción en SleakOps. El error ocurre durante el proceso de delegación del certificado para la gestión del dominio. + +**Síntomas Observados:** + +- El despliegue falla con error `Modules.DoesNotExist` +- El error menciona específicamente que `ACMModule` no fue encontrado +- El problema impide el registro del dominio para el entorno de producción +- El error ocurre durante la generación de valores de ingress + +**Configuración Relevante:** + +- Dominio: `hub.supra.social` (dominio de producción) +- Gestión de certificados: integración con AWS ACM +- Proceso de despliegue: generación de valores para el Helm chart +- Plataforma: sistema de gestión de dominios SleakOps + +**Condiciones del Error:** + +- El error ocurre durante la ejecución de `create_values_config_map` +- Falla específicamente en `domain.acm_certificate_info(raise_exceptions=True)` +- La consulta al ACMModule no devuelve resultados +- Impide la finalización del proceso de despliegue + +## Solución Detallada + + + +Al enfrentar problemas de delegación de certificados, siga estos pasos: + +1. **Acceder a la Consola SleakOps** + + - Navegue a su panel de proyecto + - Vaya a la sección **Dominios** + +2. **Re-delegar Certificados** + + - Encuentre el dominio afectado (`hub.supra.social`) + - Haga clic en **Gestión de Certificados** + - Seleccione **Re-delegar Certificado** + - Siga las indicaciones de la consola para completar la delegación + +3. **Verificar Estado de Delegación** + - Compruebe que el estado del certificado aparezca como "Activo" + - Asegúrese de que el módulo ACM esté correctamente asociado con el dominio + + + + + +Si la delegación del certificado falla, intente relanzar la tarea HZ (Hosted Zone): + +1. **Acceder a Gestión de Tareas** + + - Vaya a **Infraestructura** → **Tareas** + - Busque tareas relacionadas con HZ + +2. 
**Relanzar Tarea HZ** + + - Seleccione la tarea HZ más reciente + - Haga clic en **Relanzar** o **Reintentar** + - Espere a que la tarea finalice + +3. **Proceder con el Despliegue** + - Tras la finalización exitosa de la tarea HZ + - Intente desplegar nuevamente + - La delegación del certificado debería funcionar correctamente + + + + + +Para evitar conflictos con dominios no usados: + +1. **Identificar Dominios No Usados** + + - Revise la lista de dominios en la consola SleakOps + - Identifique dominios que nunca fueron registrados correctamente + +2. **Eliminar Entradas No Usadas** + + - Seleccione las entradas de dominios no usados (por ejemplo, dominios Disker) + - Haga clic en **Eliminar** o **Quitar** + - Confirme la eliminación + +3. **Verificar Estado Limpio** + - Asegúrese de que solo queden dominios activos y configurados correctamente + - Verifique que cada dominio tenga la delegación de certificado adecuada + + + + + +Antes de intentar el despliegue, verifique la configuración DNS: + +1. **Revisar Registros DNS** + + - Verifique que los registros DNS para `supra.social` sean correctos + - Asegúrese de que no se hayan hecho cambios externos fuera de SleakOps + +2. **Validar Configuración de Subdominios** + + - Compruebe que `hub.supra.social` esté configurado correctamente + - Verifique que la delegación del subdominio funcione + +3. **Probar Resolución** + + ```bash + # Probar resolución DNS + nslookup hub.supra.social + + # Verificar estado del certificado + openssl s_client -connect hub.supra.social:443 -servername hub.supra.social + ``` + + + + + +Si el problema persiste tras la re-delegación de certificados: + +1. **Verificar Estado del Módulo** + + - Confirme que ACMModule esté habilitado para su proyecto + - Contacte soporte si el módulo parece faltar + +2. **Revisar Cambios Recientes** + + - Verifique si se hicieron cambios DNS externos + - Asegúrese de que no existan solicitudes de certificado en conflicto + +3. 
**Monitorear Logs de Despliegue** + + - Revise los logs del despliegue para más detalles del error + - Busque fallos en la validación del certificado + +4. **Contactar Soporte** + - Si el problema persiste, proporcione: + - Nombre del dominio afectado + - Traza completa del error + - Cambios recientes realizados en DNS o certificados + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-delegation-production-environments.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-delegation-production-environments.mdx new file mode 100644 index 000000000..c3d5ba095 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-delegation-production-environments.mdx @@ -0,0 +1,206 @@ +--- +sidebar_position: 3 +title: "Problemas de Delegación de Dominio en Entornos de Producción" +description: "Solución para problemas de delegación de dominio en entornos de producción" +date: "2024-12-20" +category: "proyecto" +tags: ["dominio", "dns", "delegación", "producción", "entorno"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Delegación de Dominio en Entornos de Producción + +**Fecha:** 20 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Dominio, DNS, Delegación, Producción, Entorno + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas con la delegación de dominio al configurar entornos de producción en SleakOps, donde los dominios principales no se están delegando correctamente a pesar de una configuración adecuada. 
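Como referencia rápida para los pasos de verificación de esta guía, el siguiente boceto en bash compara los NS delegados de un dominio con los NS esperados. Los nombres `ns1.sleakops.com`/`ns2.sleakops.com` son hipotéticos, y la lista "actual" está simulada; en un caso real se obtendría con `dig +short NS su-dominio.com`:

```shell
#!/usr/bin/env bash
# Compara dos listas de servidores de nombre (una por línea),
# normalizando mayúsculas, puntos finales y orden antes de compararlas.
ns_delegacion_ok() {
  local esperados actuales
  esperados=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | sort)
  actuales=$(printf '%s\n' "$2" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | sort)
  [ "$esperados" = "$actuales" ]
}

# Simulación: salida típica de `dig +short NS` (orden y formato variables)
esperados="ns1.sleakops.com
ns2.sleakops.com"
actuales="NS2.sleakops.com.
ns1.sleakops.com."

if ns_delegacion_ok "$esperados" "$actuales"; then
  echo "delegacion correcta"
else
  echo "delegacion incompleta"
fi
```

La normalización es importante porque `dig` suele devolver los NS con punto final y en orden arbitrario, lo que produce falsos negativos si se comparan las cadenas tal cual.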
+ +**Síntomas Observados:** + +- Los dominios principales no se delegan en entornos de producción +- La delegación de dominio parece estar configurada pero no surte efecto +- La propagación de DNS puede estar retrasada o incompleta +- La gestión estándar de registros DNS puede no ser suficiente para una delegación completa del dominio + +**Configuración Relevante:** + +- Tipo de entorno: Producción +- Tipo de dominio: Dominio principal/raíz (no subdominio) +- Proveedor DNS: Varios registradores de dominio +- Método de configuración: Gestión estándar de registros DNS + +**Condiciones de Error:** + +- La delegación de dominio falla a pesar de una configuración DNS correcta +- El problema ocurre específicamente con dominios principales en producción +- El problema puede estar relacionado con procesos de delegación específicos del registrador de dominio +- Los retrasos en la propagación DNS pueden enmascarar el problema real + +## Solución Detallada + + + +Primero, verifique si el problema está relacionado con retrasos en la propagación DNS: + +1. Use verificadores de propagación DNS en línea: + + - whatsmydns.net + - dnschecker.org + - dns-lookup.com + +2. Revise desde diferentes ubicaciones y servidores DNS +3. La propagación DNS puede tardar hasta 48 horas para una propagación global completa +4. Use los comandos `dig` o `nslookup` para verificar la delegación: + +```bash +# Verificar registros NS para su dominio +dig NS su-dominio.com + +# Verificar desde servidores DNS específicos +dig @8.8.8.8 NS su-dominio.com +dig @1.1.1.1 NS su-dominio.com +``` + + + + + +Muchos registradores de dominio requieren que la delegación de dominio se configure mediante un proceso separado: + +**Para una delegación completa del dominio:** + +1. **Inicie sesión en el panel de control de su registrador de dominio** +2. 
**Busque opciones de delegación de dominio:**

   - "Delegación de Dominio"
   - "Cambiar Servidores de Nombre"
   - "Gestión DNS" → "Delegar Dominio"
   - "Configuraciones Avanzadas de DNS"

3. **Ubicaciones comunes específicas de registradores:**

   - **GoDaddy**: Configuración de Dominio → Servidores de Nombre → Cambiar
   - **Namecheap**: Lista de Dominios → Administrar → Servidores de Nombre
   - **Route53**: Zonas Alojadas (Hosted Zones) → Conjunto de Registros NS
   - **Cloudflare**: DNS → Servidores de Nombre

4. **Configure los servidores de nombre proporcionados por SleakOps**



Para delegar su dominio a SleakOps:

1. **En el Panel de Control de SleakOps:**

   - Vaya a la sección **Configuración del Proyecto**
   - Navegue a la sección **Dominios**
   - Encuentre la información de **Servidores de Nombre**

2. **Formato típico de servidores de nombre de SleakOps:**

   ```
   ns1.sleakops.com
   ns2.sleakops.com
   ns3.sleakops.com
   ns4.sleakops.com
   ```

3. **Copie estos servidores de nombre exactamente como aparecen**
4. **Configúrelos en su registrador de dominio**



Después de configurar la delegación de dominio:

1. **Espere la propagación** (hasta 48 horas)

2. **Verifique la delegación con dig:**

   ```bash
   # Debe devolver los servidores de nombre de SleakOps
   dig NS su-dominio.com
   ```

3. **Verifique la resolución del dominio:**

   ```bash
   # Debe resolver a su entorno SleakOps
   dig A su-dominio.com
   ```

4. 
**Pruebe en el navegador:** + - Visite su dominio directamente + - Verifique que el certificado SSL sea válido + - Compruebe que apunte a su entorno SleakOps + + + + + +**Problema 1: Gestión Mixta de DNS** + +- Problema: Algunos registros DNS se gestionan en el registrador, otros en SleakOps +- Solución: Asegurar delegación completa - todo el DNS gestionado por SleakOps + +**Problema 2: Formato Incorrecto de Servidores de Nombre** + +- Problema: Servidores de nombre ingresados con puntos finales o formato incorrecto +- Solución: Copiar los servidores de nombre exactamente como los proporciona SleakOps + +**Problema 3: Delegación Parcial** + +- Problema: Solo algunos servidores de nombre configurados +- Solución: Configurar todos los servidores de nombre proporcionados (normalmente 2-4) + +**Problema 4: Caché del Registrador** + +- Problema: El registrador almacena en caché configuraciones DNS antiguas +- Solución: Contactar al soporte del registrador para limpiar la caché DNS + +**Problema 5: Estado de Bloqueo del Dominio** + +- Problema: El dominio está bloqueado y previene cambios de delegación +- Solución: Desbloquear el dominio en la configuración del registrador antes de cambiar servidores de nombre + + + + + +Si la delegación aún no funciona: + +1. **Contacte al soporte del registrador de dominio:** + + - Explique que necesita delegar el dominio completo + - Proporcione los servidores de nombre de SleakOps + - Pregunte sobre procedimientos específicos de delegación de dominio + +2. **Verifique el estado del dominio:** + + ```bash + whois su-dominio.com | grep -i status + ``` + +3. **Verifique que el dominio no esté expirado o bloqueado** + +4. **Revise conflictos con DNSSEC:** + + - Desactive DNSSEC en el registrador si está habilitado + - SleakOps gestionará DNSSEC después de la delegación + +5. 
**Pruebe primero con un subdominio:** + - Cree un subdominio de prueba (test.su-dominio.com) + - Verifique que SleakOps pueda gestionar subdominios correctamente + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-migration-custom-domains.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-migration-custom-domains.mdx new file mode 100644 index 000000000..02641fb46 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-migration-custom-domains.mdx @@ -0,0 +1,228 @@ +--- +sidebar_position: 3 +title: "Migración de Dominio y Configuración de Dominio Personalizado" +description: "Cómo migrar aplicaciones a nuevos dominios y configurar dominios personalizados en SleakOps" +date: "2024-01-15" +category: "proyecto" +tags: ["dominio", "migración", "dns", "alias", "dominio-personalizado"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Migración de Dominio y Configuración de Dominio Personalizado + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Dominio, Migración, DNS, Alias, Dominio Personalizado + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan migrar sus aplicaciones de un dominio a otro o configurar dominios personalizados para sus despliegues en SleakOps. 
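A modo de ilustración de la configuración manual de registros que describe esta guía, el siguiente boceto en bash genera las líneas que habría que cargar en el registrador para un alias de dominio. El destino `frontend.sleakops.io` y la IP `203.0.113.10` son valores hipotéticos de ejemplo; use los que entregue la consola de SleakOps:

```shell
#!/usr/bin/env bash
# Genera los registros DNS de un alias: CNAME para www y A para la raíz.
generar_registros_alias() {
  local dominio="$1" destino="$2" ip="$3"
  printf 'CNAME www.%s -> %s\n' "$dominio" "$destino"
  printf 'A     %s -> %s\n' "$dominio" "$ip"
}

# Valores hipotéticos de ejemplo
generar_registros_alias "ordenapp.com.ar" "frontend.sleakops.io" "203.0.113.10"
```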
+ +**Síntomas Observados:** + +- Necesidad de cambiar las URLs de las aplicaciones del dominio antiguo al nuevo +- Deseo de usar un dominio recién adquirido para aplicaciones existentes +- Requieren orientación sobre la configuración DNS para dominios personalizados +- Necesitan entender las opciones de migración y sus implicaciones + +**Configuración Relevante:** + +- Dominio actual: `byroncode.com` +- Nuevo dominio: `ordenapp.com.ar` +- Aplicación: Frontend/página de aterrizaje +- Plataforma: Entorno de ejecución de SleakOps + +**Condiciones de Error:** + +- Aplicaciones accesibles actualmente solo a través del dominio antiguo +- Necesidad de transición sin interrupción del servicio +- Registros DNS requieren configuración manual + +## Solución Detallada + + + +Para cambios simples de URL sin una migración completa, puedes agregar un alias de dominio a tu ejecución existente: + +**Pasos:** + +1. Navega a tu ejecución en el panel de SleakOps +2. Ve a **Configuración** → **Configuración de Dominio** +3. Haz clic en **Agregar Alias** +4. Ingresa tu nuevo dominio (ejemplo: `ordenapp.com.ar`) +5. Guarda la configuración + +**Configuración DNS:** +Después de crear el alias, recibirás registros DNS que deben ser añadidos manualmente en tu registrador de dominios: + +``` +Tipo: CNAME +Nombre: www +Valor: [proporcionado-por-sleakops].sleakops.io + +Tipo: A +Nombre: @ +Valor: [dirección-IP-proporcionada] +``` + +**Beneficios:** + +- Implementación rápida +- Sin interrupción del servicio +- Mantiene la configuración existente +- Ambos dominios funcionan simultáneamente + + + + + +Para la migración completa de todos los servicios del dominio antiguo al nuevo: + +**Requisitos Previos:** + +- Coordinar con el equipo de soporte de SleakOps +- Programar ventana de mantenimiento +- Respaldar la configuración actual + +**Proceso de Migración:** + +1. 
**Planificación previa a la migración:** + + - Documentar todos los servicios y configuraciones actuales + - Identificar dependencias + - Planificar estrategia de reversión + +2. **Configuración del dominio:** + + - Actualizar dominio principal en la configuración del proyecto + - Reconfigurar certificados SSL + - Actualizar variables de entorno + +3. **Actualizaciones DNS:** + + - Apuntar el nuevo dominio a la infraestructura de SleakOps + - Actualizar todos los subdominios + - Configurar valores TTL adecuados + +4. **Pruebas y validación:** + - Verificar que todos los servicios sean accesibles + - Probar validez del certificado SSL + - Confirmar que todas las integraciones funcionen + +**Nota:** Esta opción requiere coordinación con el equipo de SleakOps y puede implicar tiempo de inactividad del servicio. + + + + + +Con la última versión de SleakOps, puedes crear nuevos entornos con URLs personalizadas: + +**Cuándo usar:** + +- Para entornos de desarrollo +- Cuando deseas un nuevo comienzo +- Para probar la configuración de un nuevo dominio + +**Pasos:** + +1. Crea un nuevo entorno en SleakOps +2. Durante la configuración, especifica tu dominio personalizado +3. Configura los registros DNS según lo proporcionado +4. 
Despliega tu aplicación en el nuevo entorno + +**Consideraciones:** + +- Las dependencias deben ser recreadas +- Puede requerirse migración de bases de datos +- Variables de entorno necesitan reconfiguración +- Adecuado principalmente para entornos de desarrollo + +```yaml +# Ejemplo de configuración de entorno +environment: + name: "producción-nuevo-dominio" + domain: "ordenapp.com.ar" + ssl: true + auto_deploy: true +``` + + + + + +**Registros DNS Comunes para SleakOps:** + +``` +# Dominio raíz +Tipo: A +Nombre: @ +Valor: [IP proporcionada por SleakOps] +TTL: 300 + +# Subdominio WWW +Tipo: CNAME +Nombre: www +Valor: [hostname].sleakops.io +TTL: 300 + +# Subdominio API (si aplica) +Tipo: CNAME +Nombre: api +Valor: [api-hostname].sleakops.io +TTL: 300 +``` + +**Propagación DNS:** + +- Los cambios pueden tardar 24-48 horas en propagarse completamente +- Usa herramientas como `dig` o verificadores DNS en línea para comprobar +- Valores TTL bajos aceleran los cambios pero aumentan consultas DNS + +**Comandos de Verificación:** + +```bash +# Verificar registro A +dig ordenapp.com.ar A + +# Verificar registro CNAME +dig www.ordenapp.com.ar CNAME + +# Verificar desde servidor DNS específico +dig @8.8.8.8 ordenapp.com.ar +``` + + + + + +**Para Entornos de Producción:** + +1. **Usar alias de dominio primero** - Prueba el nuevo dominio junto al antiguo +2. **Planificar ventanas de mantenimiento** - Para migraciones completas, programar en periodos de bajo tráfico +3. **Monitorear después de los cambios** - Vigilar problemas con certificados SSL y propagación DNS +4. 
**Mantener activo el dominio antiguo** - Mantener redirecciones para SEO y experiencia de usuario + +**Consideraciones sobre Certificados SSL:** + +- SleakOps provee automáticamente certificados SSL para dominios personalizados +- La generación del certificado puede tardar entre 5 y 10 minutos +- Asegúrate que los registros DNS estén correctos antes del aprovisionamiento SSL + +**Estrategia de Reversión:** + +- Mantén documentados los registros DNS originales +- Ten un plan para revertir cambios rápidamente +- Prueba el procedimiento de reversión primero en desarrollo + +**Comunicación:** + +- Notifica a los usuarios sobre cambios de dominio con anticipación +- Actualiza documentación y enlaces +- Configura redirecciones adecuadas del dominio antiguo al nuevo + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-ns-delegation-replication-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-ns-delegation-replication-issues.mdx new file mode 100644 index 000000000..10a7e6c74 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-ns-delegation-replication-issues.mdx @@ -0,0 +1,210 @@ +--- +sidebar_position: 3 +title: "Problemas de Replicación en la Delegación NS de Dominio" +description: "Solución de problemas de delegación y replicación de registros NS con proveedores de dominio" +date: "2024-01-15" +category: "general" +tags: ["dominio", "dns", "registros-ns", "delegación", "solución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Replicación en la Delegación NS de Dominio + +**Fecha:** 15 de enero de 2024 +**Categoría:** General +**Etiquetas:** Dominio, DNS, Registros NS, Delegación, Solución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario intenta 
configurar un dominio personalizado (ordenapp.com.ar) en SleakOps pero encuentra fallas en la replicación de la delegación NS (Name Server) con su proveedor de dominio (DonWeb). + +**Síntomas Observados:** + +- Configuración de delegación NS fallida a nivel del proveedor de dominio +- Problemas de replicación con los servidores de nombres delegados +- El dominio no resuelve correctamente hacia la infraestructura de SleakOps +- El proveedor (DonWeb) reconoce el problema técnico + +**Configuración Relevante:** + +- Dominio: ordenapp.com.ar (dominio argentino) +- Proveedor de Dominio: DonWeb +- Plataforma Destino: SleakOps +- Configuración DNS: configuración de delegación NS + +**Condiciones de Error:** + +- La replicación del registro NS falla a nivel del proveedor +- La delegación del dominio no se propaga correctamente +- El problema ocurre durante la configuración inicial del dominio +- El proveedor confirma problema técnico en su lado + +## Solución Detallada + + + +La delegación NS (Name Server) es el proceso mediante el cual tu proveedor de dominio apunta tu dominio a servidores de nombres externos (en este caso, los servidores DNS de SleakOps). + +El proceso implica: + +1. **SleakOps proporciona registros NS** (por ejemplo, ns1.sleakops.com, ns2.sleakops.com) +2. **Tú configuras estos en tu proveedor de dominio** (DonWeb) +3. **El proveedor replica los cambios** a los servidores DNS raíz +4. **Ocurre la propagación DNS a nivel global** (24-48 horas) + + + + + +Mientras esperas la resolución por parte del proveedor: + +1. **Verifica los registros NS en SleakOps:** + + - Ingresa a la configuración de tu proyecto + - Revisa la sección "Dominio Personalizado" + - Anota los registros NS proporcionados + +2. **Revisa el estado actual del DNS:** + + ```bash + # Verificar registros NS actuales + dig NS ordenapp.com.ar + + # Verificar si el dominio resuelve + nslookup ordenapp.com.ar + ``` + +3. 
**Verifica la configuración en el proveedor:** + - Ingresa al panel de control de DonWeb + - Confirma que los registros NS estén correctamente ingresados + - Revisa si hay mensajes de error o advertencias + + + + + +Al tratar con problemas del proveedor: + +1. **Documenta el problema claramente:** + + - Proporciona los registros NS exactos de SleakOps + - Incluye capturas de pantalla de la configuración + - Anota cualquier mensaje de error + +2. **Solicita detalles técnicos específicos:** + + - Pide los logs de replicación + - Solicita un cronograma para la resolución + - Obtén confirmación de la aceptación de los registros NS + +3. **Escala si es necesario:** + - Solicita la escalación a soporte técnico especializado + - Pide hablar con especialistas en DNS/dominios + - Considera soluciones temporales + + + + + +Si los problemas con el proveedor persisten: + +1. **Delegación de subdominio:** + + ``` + # En lugar de ordenapp.com.ar, usa: + app.ordenapp.com.ar + ``` + + Configura solo el subdominio con los registros NS de SleakOps. + +2. **Enfoque CNAME (si está soportado):** + + ``` + # Crea un registro CNAME apuntando a SleakOps + www.ordenapp.com.ar CNAME tu-app.sleakops.io + ``` + +3. **Transferir dominio a otro proveedor:** + + - Considera proveedores con mejor gestión DNS + - Opciones populares: Cloudflare, Route53, Namecheap + +4. **Usar subdominio de SleakOps temporalmente:** + ``` + # Usa un subdominio provisto mientras se resuelve + tu-app.sleakops.io + ``` + + + + + +Una vez que el proveedor resuelva el problema de replicación: + +1. **Verifica la propagación DNS:** + + ```bash + # Verifica registros NS globalmente + dig +trace NS ordenapp.com.ar + + # Verifica desde diferentes ubicaciones + nslookup ordenapp.com.ar 8.8.8.8 + nslookup ordenapp.com.ar 1.1.1.1 + ``` + +2. 
**Prueba la resolución del dominio:** + + ```bash + # Prueba resolución HTTP + curl -I http://ordenapp.com.ar + + # Prueba HTTPS si está configurado + curl -I https://ordenapp.com.ar + ``` + +3. **Monitorea la propagación:** + + - Usa herramientas en línea como whatsmydns.net + - Verifica desde múltiples ubicaciones globales + - Permite 24-48 horas para propagación completa + +4. **Actualiza configuración en SleakOps:** + - Confirma el estado del dominio en el panel de SleakOps + - Prueba la generación de certificado SSL + - Verifica accesibilidad de la aplicación + + + + + +Para evitar problemas similares: + +1. **Elige proveedores DNS confiables:** + + - Investiga la fiabilidad del DNS del proveedor + - Revisa calidad de soporte y tiempos de respuesta + - Considera servicios DNS administrados + +2. **Documenta configuraciones DNS:** + + - Guarda registros de todas las configuraciones NS + - Captura pantallas de configuraciones importantes + - Mantén información de contacto para soporte técnico + +3. **Prueba primero en entorno de pruebas:** + + - Usa subdominios de prueba antes de producción + - Verifica cambios DNS en entornos no críticos + - Ten planes de reversión listos + +4. 
**Monitorea la salud del DNS:** + - Configura monitoreo para resolución de dominio + - Usa herramientas como UptimeRobot o Pingdom + - Configura alertas para fallas DNS + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-setup-alert-explanation.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-setup-alert-explanation.mdx new file mode 100644 index 000000000..7239709bf --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/domain-setup-alert-explanation.mdx @@ -0,0 +1,157 @@ +--- +sidebar_position: 3 +title: "Explicación de la alerta de configuración de dominio" +description: "Comprendiendo la alerta 'Configurar dominio' y cuándo se requiere acción" +date: "2024-01-15" +category: "proyecto" +tags: ["dominio", "alerta", "configuración", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Explicación de la alerta de configuración de dominio + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Dominio, Alerta, Configuración, Solución de problemas + +## Descripción del problema + +**Contexto:** Los usuarios reciben alertas de "Configurar dominio" para dominios que parecen funcionar correctamente, causando confusión sobre si se requiere alguna acción. 
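Las causas posibles de esta alerta que se detallan más abajo pueden revisarse en un orden fijo. El siguiente boceto en bash ilustra ese triaje; los estados (`si`/`no`) se introducirían a mano tras revisar la consola y el DNS, y los mensajes son ilustrativos:

```shell
#!/usr/bin/env bash
# Triaje de la alerta "Configurar dominio".
# Argumentos: dns_ok verificado ssl_ok (cada uno "si" o "no").
diagnostico_alerta() {
  local dns_ok="$1" verificado="$2" ssl_ok="$3"
  if [ "$dns_ok" != "si" ]; then
    echo "revisar registros DNS en el registrador"
  elif [ "$verificado" != "si" ]; then
    echo "completar la verificacion de propiedad del dominio"
  elif [ "$ssl_ok" != "si" ]; then
    echo "esperar la emision del certificado SSL"
  else
    echo "posible desajuste de configuracion: contactar soporte"
  fi
}

# Ejemplo: DNS correcto y dominio verificado, pero SSL aún pendiente
diagnostico_alerta si si no
```

El orden refleja las dependencias reales: sin DNS correcto no hay verificación posible, y sin verificación no se emite el certificado.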
+ +**Síntomas observados:** + +- Múltiples dominios muestran la alerta "Configurar dominio" +- Los dominios son accesibles y funcionan normalmente (por ejemplo, https://www.mocona.com.ar/) +- La alerta aparece a pesar de la funcionalidad adecuada del dominio +- Incertidumbre sobre las acciones requeridas + +**Configuración relevante:** + +- Estado del dominio: Funcional y accesible +- Tipo de alerta: "Configurar dominio" +- Ejemplos de dominios: Dominios externos como mocona.com.ar +- Plataforma: Gestión de dominios de SleakOps + +**Condiciones de error:** + +- La alerta aparece para dominios que funcionan +- No hay problemas funcionales aparentes con los dominios +- La alerta persiste a pesar del funcionamiento normal + +## Solución detallada + + + +La alerta "Configurar dominio" en SleakOps indica una de las siguientes condiciones: + +1. **Configuración DNS incompleta**: Los registros DNS del dominio no están configurados correctamente para apuntar a tus servicios de SleakOps +2. **Certificado SSL pendiente**: El certificado SSL para el dominio aún está en proceso de provisión o validación +3. **Verificación de dominio pendiente**: La plataforma está esperando la verificación de propiedad del dominio +4. **Desajuste en la configuración**: Hay una discrepancia entre la configuración esperada y la configuración real del dominio + +Esta alerta puede aparecer incluso cuando el dominio es accesible porque puede estar apuntando a un servidor o servicio diferente. 
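Como comprobación rápida de la primera condición, este esquema mínimo en Python contrasta a dónde resuelve el dominio frente a las IPs esperadas (las IPs y el dominio son ilustrativos, no direcciones reales de SleakOps):

```python
import socket

# IPs ilustrativas: sustitúyelas por las direcciones indicadas en tu proyecto SleakOps
IPS_ESPERADAS = {"1.2.3.4", "5.6.7.8"}


def apunta_a_sleakops(ip_resuelta, ips_esperadas=IPS_ESPERADAS):
    """True si la IP resuelta coincide con alguna de las IPs esperadas."""
    return ip_resuelta in ips_esperadas


def resolver(dominio):
    """Resuelve un dominio a su dirección IP (requiere acceso a la red)."""
    return socket.gethostbyname(dominio)


# Uso (requiere red):
#   ip = resolver("www.mocona.com.ar")
#   print(apunta_a_sleakops(ip))
```

Si el dominio resuelve a otro destino y eso es intencional, es un indicio de que la alerta corresponde al cuarto caso (desajuste de configuración) y no a un fallo real.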
+ + + + + +Debes tomar acción si: + +- **Tu aplicación no es accesible** a través del dominio +- Aparecen **errores de certificado SSL** al acceder al dominio +- **Quieres que el dominio apunte a tu despliegue en SleakOps** en lugar de su destino actual +- Las **notificaciones por correo** indican fallos en despliegues o servicios + +Puedes **ignorar la alerta** si: + +- El dominio funciona como se espera +- Apunta al destino correcto (aunque no sea SleakOps) +- Estás usando el dominio para servicios externos + + + + + +Para determinar si necesitas tomar acción: + +1. **Verifica a dónde apunta tu dominio**: + + ```bash + nslookup www.mocona.com.ar + # o + dig www.mocona.com.ar + ``` + +2. **Verifica el certificado SSL**: + + - Visita tu dominio en un navegador + - Comprueba si hay advertencias SSL + - Verifica el emisor del certificado + +3. **Revisa el estado del despliegue en SleakOps**: + + - Ve al panel de tu proyecto + - Verifica si los servicios están en ejecución + - Revisa los registros de despliegue para errores + +4. **Prueba la funcionalidad de la aplicación**: + - Accede a todas las páginas críticas + - Prueba formularios y elementos interactivos + - Verifica los endpoints de API si aplica + + + + + +Si determinas que es necesario actuar: + +**Para configuración DNS:** + +1. Accede a la gestión DNS de tu registrador de dominios +2. Actualiza los registros A para que apunten a las direcciones IP de SleakOps +3. Actualiza los registros CNAME según lo especificado en tu proyecto SleakOps + +**Para problemas con el certificado SSL:** + +1. En el panel de SleakOps, ve a **Configuración de dominio** +2. Haz clic en **Regenerar certificado SSL** +3. Espera la validación (puede tardar hasta 24 horas) + +**Para verificación de dominio:** + +1. Revisa tu correo para solicitudes de verificación +2. Sigue el enlace de verificación o agrega los registros DNS TXT requeridos +3. 
Contacta soporte si la verificación falla + +**Ejemplo de configuración:** + +```dns +# Registros DNS para SleakOps +www.tudominio.com A 1.2.3.4 +tudominio.com A 1.2.3.4 +*.tudominio.com CNAME tu-proyecto.sleakops.com +``` + + + + + +Si el dominio funciona correctamente y no quieres que sea gestionado por SleakOps: + +1. Ve a **Configuración del proyecto** → **Dominios** +2. Encuentra el dominio con la alerta +3. Haz clic en **Eliminar dominio** o **Desactivar monitoreo** +4. Confirma la acción + +Alternativamente, puedes: + +- **Silenciar notificaciones** para dominios específicos +- **Configurar umbrales de alerta** para reducir falsos positivos +- **Contactar soporte** para incluir dominios externos en la lista blanca + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/doppler-integration-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/doppler-integration-troubleshooting.mdx new file mode 100644 index 000000000..49561ec46 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/doppler-integration-troubleshooting.mdx @@ -0,0 +1,274 @@ +--- +sidebar_position: 3 +title: "Problemas de Integración de Doppler en SleakOps" +description: "Solución de problemas de configuración de Doppler y sincronización de variables de entorno" +date: "2024-12-19" +category: "dependency" +tags: + ["doppler", "environment-variables", "secrets", "configuration", "deployment"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Integración de Doppler en SleakOps + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Doppler, Variables de Entorno, Secretos, Configuración, Despliegue + +## Descripción del Problema + +**Contexto:** Los usuarios que integran Doppler como un servicio externo para la 
gestión de variables de entorno en los despliegues de SleakOps experimentan problemas donde las variables no se actualizan durante los despliegues, a pesar de una configuración exitosa. + +**Síntomas Observados:** + +- Variables de entorno que no se actualizan durante despliegues recientes +- Servicio Doppler configurado pero las variables permanecen obsoletas +- Los despliegues se completan con éxito pero usan configuración desactualizada +- La estrategia de actualización continua muestra un aumento temporal en el conteo de réplicas + +**Configuración Relevante:** + +- Servicio externo: Doppler para la gestión de variables de entorno +- Estrategia de despliegue: Actualización Continua (Rolling Update) +- Argumentos de Docker: `DOPPLER_CONFIG = staging` +- Secretos del proyecto y referencias a Doppler configuradas en SleakOps + +**Condiciones de Error:** + +- Variables que no se refrescan en nuevos despliegues +- Ocurre específicamente con variables de entorno gestionadas por Doppler +- El problema persiste tras múltiples intentos de despliegue +- Los secretos locales pueden funcionar mientras la integración con Doppler falla + +## Solución Detallada + + + +Primero, verifica que tu configuración de Doppler esté correctamente establecida en SleakOps: + +1. **Revisa los Argumentos de Docker:** + + ```yaml + # En la configuración de tu proyecto SleakOps + docker_args: + DOPPLER_CONFIG: "staging" # o el nombre de tu entorno + DOPPLER_TOKEN: "${DOPPLER_TOKEN}" # debe referenciar un secreto + ``` + +2. **Verifica el Secreto del Token de Doppler:** + + - Ve a **Configuración del Proyecto** → **Secretos** + - Asegúrate de que `DOPPLER_TOKEN` esté correctamente configurado + - El token debe tener acceso de lectura a la configuración especificada + +3. 
**Revisa la Referencia de la Configuración de Doppler:** + - Verifica que el nombre de la configuración coincida exactamente en el panel de Doppler + - Configuraciones comunes: `dev`, `staging`, `production` + + + + + +Asegúrate de que tu token de Doppler tenga los permisos correctos: + +1. **Prueba el Token Localmente:** + + ```bash + # Verifica si el token puede acceder a la configuración + curl -H "Authorization: Bearer TU_TOKEN_DOPPLER" \ + "https://api.doppler.com/v3/configs/config/secrets" \ + -G -d project=TU_PROYECTO -d config=staging + ``` + +2. **Verifica el Alcance del Token:** + + - El token debe tener acceso de `read` a la configuración específica + - Verifica que el token no haya expirado + - Asegúrate de que sea un **Token de Servicio**, no un **Token Personal** + +3. **Regenera el Token si es Necesario:** + - Ve al Panel de Doppler → **Acceso** → **Tokens de Servicio** + - Crea un nuevo token con el alcance adecuado + - Actualiza el secreto en SleakOps + + + + + +Para asegurar que las variables de entorno se actualicen durante el despliegue: + +1. **Forzar un Redepliegue Completo:** + + ```bash + # Forzar reinicio de todos los pods para cargar nuevas variables + kubectl rollout restart deployment/tu-nombre-de-app + ``` + +2. **Verificar Variables de Entorno en el Pod:** + + ```bash + # Verifica que las variables se carguen correctamente + kubectl exec -it pod/tu-nombre-de-pod -- env | grep TU_VARIABLE + ``` + +3. **Usar Anotaciones en el Despliegue:** + Añade una anotación propia con marca temporal en la plantilla del pod para forzar su recreación (no uses `deployment.kubernetes.io/revision`, que es gestionada por Kubernetes, y ten en cuenta que `$(date +%s)` no se expande dentro de un manifiesto YAML estático): + ```yaml + # Una anotación personalizada en la plantilla fuerza a Kubernetes a recrear los pods + spec: + template: + metadata: + annotations: + kubectl.kubernetes.io/restartedAt: "2024-12-19T10:00:00Z" + ``` + + + + + +Si las variables aún no se actualizan: + +1. **Revisa los Logs de Doppler:** + + ```bash + # Verifica si la CLI de Doppler funciona en tu contenedor + kubectl logs deployment/tu-app -c tu-contenedor | grep -i doppler + ``` + +2.
**Verifica la Instalación de la CLI de Doppler:** + + ```dockerfile + # Asegúrate de que la CLI de Doppler esté instalada en tu imagen Docker + RUN curl -Ls https://cli.doppler.com/install.sh | sh + + # Usa Doppler para ejecutar tu aplicación + CMD ["doppler", "run", "--", "tu-comando-app"] + ``` + +3. **Prueba la Sincronización Manual:** + + ```bash + # Dentro de tu contenedor, prueba la sincronización manual + doppler secrets download --no-file --format env + ``` + +4. **Verifica la Conectividad de Red:** + - Asegúrate de que tu clúster pueda acceder a `api.doppler.com` + - Verifica que no haya reglas de firewall que bloqueen la conexión + - Prueba la resolución DNS: `nslookup api.doppler.com` + + + + + +Si la integración con Doppler sigue fallando: + +1. **Usar Secretos de Kubernetes como Respaldo:** + + ```yaml + # Crea un secreto de Kubernetes con variables críticas + apiVersion: v1 + kind: Secret + metadata: + name: app-secrets + data: + DATABASE_URL: + ``` + +2. **Implementar Webhooks de Doppler:** + + - Configura webhooks de Doppler para disparar redepliegues + - Actualiza automáticamente los secretos cuando la configuración de Doppler cambie + +3. **Usar Patrón de Init Container:** + + ```yaml + # Obtén los secretos antes de que inicie el contenedor principal + initContainers: + - name: doppler-sync + image: dopplerhq/cli:latest + command: + ["doppler", "secrets", "download", "--format", "env", "--no-file"] + volumeMounts: + - name: secrets-volume + mountPath: /secrets + ``` + +4. **Enfoque Híbrido:** + - Usa secretos de SleakOps para variables críticas + - Usa Doppler para configuración no crítica + - Implementa mecanismos de respaldo en tu aplicación + + + + + +El aumento temporal en el conteo de réplicas es normal durante los despliegues: + +1. 
**Proceso de Actualización Continua:** + + - Kubernetes crea nuevos pods con configuración actualizada + - Mantiene los pods antiguos en ejecución hasta que los nuevos estén listos + - Gradualmente dirige el tráfico a los nuevos pods + - Termina los pods antiguos una vez que los nuevos están saludables + +2. **Comportamiento Esperado:** + + - Conteo temporal de réplicas: `deseado + maxSurge` + - Para 2 réplicas con configuración por defecto: hasta 3 pods temporalmente + - Vuelve al conteo deseado (2) tras completar el despliegue + +3. **Monitorear el Progreso del Despliegue:** + ```bash + kubectl rollout status deployment/tu-nombre-de-app + ``` + + + +## Mejores Prácticas + +### Configuración de Seguridad + +- **Nunca hardcodees tokens de Doppler** en tu código o archivos de configuración +- **Usa secretos de SleakOps** para almacenar tokens de Doppler de forma segura +- **Rota los tokens regularmente** para mantener la seguridad +- **Limita el alcance de los tokens** solo a las configuraciones necesarias + +### Configuración de Desarrollo + +- **Usa configuraciones separadas** para cada entorno (dev, staging, prod) +- **Implementa validación de variables** en tu aplicación para detectar configuración faltante +- **Documenta todas las variables requeridas** para facilitar la configuración del equipo +- **Usa valores por defecto sensatos** cuando sea posible + +### Configuración de Operaciones + +- **Monitorea la sincronización de Doppler** con alertas automatizadas +- **Implementa health checks** que verifiquen la disponibilidad de variables críticas +- **Mantén respaldos** de configuraciones críticas fuera de Doppler +- **Documenta procedimientos de recuperación** para fallos de Doppler + +## Lista de Verificación para Resolución + +### Verificación Inicial +- [ ] Token de Doppler configurado correctamente en SleakOps +- [ ] Nombre de configuración coincide exactamente con Doppler +- [ ] CLI de Doppler instalada en la imagen Docker +- [ ] Conectividad de red a 
api.doppler.com verificada + +### Diagnóstico Avanzado +- [ ] Logs de contenedor revisados para errores de Doppler +- [ ] Variables de entorno verificadas dentro del pod +- [ ] Permisos del token validados con API de Doppler +- [ ] Sincronización manual probada exitosamente + +### Soluciones Implementadas +- [ ] Redepliegue forzado para actualizar variables +- [ ] Anotaciones de despliegue añadidas si es necesario +- [ ] Enfoque alternativo implementado si Doppler falla +- [ ] Monitoreo y alertas configuradas + +--- + +*Este documento fue generado automáticamente el 19 de diciembre de 2024 para proporcionar soluciones completas a problemas comunes de integración con Doppler en SleakOps.* diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/eks-pod-scheduling-tolerations.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/eks-pod-scheduling-tolerations.mdx new file mode 100644 index 000000000..b7c67e1af --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/eks-pod-scheduling-tolerations.mdx @@ -0,0 +1,229 @@ +--- +sidebar_position: 3 +title: "Problemas de programación de pods en EKS con instancias spot" +description: "Solución para pods que no se programan en nodepools de instancias spot en EKS" +date: "2025-02-21" +category: "cluster" +tags: ["eks", "programación", "tolerancias", "instancias-spot", "karpenter"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de programación de pods en EKS con instancias spot + +**Fecha:** 21 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** EKS, Programación, Tolerancias, Instancias Spot, Karpenter + +## Descripción del problema + +**Contexto:** Después de actualizar a una versión más reciente de EKS en SleakOps, los pods (como Elasticsearch) no se programan correctamente en nodepools de instancias spot debido a la falta de configuración de tolerancias. 
+ +**Síntomas observados:** + +- Pods que no arrancan o permanecen en estado Pendiente +- Errores de programación relacionados con taints en nodos +- Aplicaciones que no se despliegan en nodepools de instancias spot +- Fallos en la programación de pods tras actualizaciones de versión de EKS + +**Configuración relevante:** + +- Clúster EKS con nodepools de instancias spot +- Nombre del nodepool: `spot-amd64` +- Provisión de nodos gestionada por Karpenter +- Clave del taint: `karpenter.sh/nodepool` + +**Condiciones de error:** + +- Ocurre después de actualizaciones de versión de EKS +- Afecta despliegues sin tolerancias adecuadas +- Impacta despliegues directos con kubectl y aplicaciones gestionadas con Helm + +## Solución detallada + + + +Con versiones más recientes de EKS y Karpenter, los nodepools de instancias spot se taintan automáticamente para evitar que cargas de trabajo regulares se programen en ellos a menos que se configuren explícitamente. Esto asegura mejor gestión de recursos y optimización de costos. + +El taint aplicado es: + +```yaml +key: karpenter.sh/nodepool +value: spot-amd64 +effect: NoSchedule +``` + +Los pods necesitan tolerancias que coincidan para poder programarse en estos nodos. + + + + + +Si desplegaste tu aplicación directamente usando `kubectl apply`, añade las siguientes tolerancias a tu YAML de despliegue: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: elasticsearch +spec: + template: + spec: + tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule + containers: + - name: elasticsearch + image: elasticsearch:7.17.0 + # ... 
resto de la configuración de tu contenedor +``` + +Aplica la configuración actualizada: + +```bash +kubectl apply -f tu-despliegue.yaml +``` + + + + + +Si tu aplicación fue desplegada usando Helm, debes actualizar el archivo values o pasar las tolerancias como parámetros: + +**Opción 1: Actualizar values.yaml** + +```yaml +# values.yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + +**Opción 2: Pasar tolerancias durante helm install/upgrade** + +```bash +helm upgrade elasticsearch elastic/elasticsearch \ + --set tolerations[0].key=karpenter.sh/nodepool \ + --set tolerations[0].operator=Equal \ + --set tolerations[0].value=spot-amd64 \ + --set tolerations[0].effect=NoSchedule +``` + +**Opción 3: Crear un archivo de valores personalizado** + +```yaml +# custom-tolerations.yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule +``` + +Luego aplica: + +```bash +helm upgrade elasticsearch elastic/elasticsearch -f custom-tolerations.yaml +``` + + + + + +Después de aplicar las tolerancias, verifica que tus pods se estén programando correctamente: + +1. **Revisar estado de los pods:** + +```bash +kubectl get pods -l app=elasticsearch +``` + +2. **Verificar programación del pod:** + +```bash +kubectl describe pod +``` + +3. **Comprobar en qué nodo está corriendo el pod:** + +```bash +kubectl get pods -o wide +``` + +4. 
**Verificar que el nodo pertenece al nodepool spot:** + +```bash +kubectl describe node | grep -i taint +``` + + + + + +Si tienes múltiples nodepools spot o quieres permitir programación tanto en instancias spot como on-demand, puedes añadir múltiples tolerancias: + +```yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Equal + value: spot-amd64 + effect: NoSchedule + - key: karpenter.sh/nodepool + operator: Equal + value: spot-arm64 + effect: NoSchedule + - key: karpenter.sh/nodepool + operator: Equal + value: on-demand + effect: NoSchedule +``` + +Alternativamente, usa el operador `Exists` para tolerar cualquier nodepool: + +```yaml +tolerations: + - key: karpenter.sh/nodepool + operator: Exists + effect: NoSchedule +``` + + + + + +**Mejores prácticas:** + +1. **Incluye siempre tolerancias en tus plantillas de despliegue** cuando uses instancias spot +2. **Utiliza charts de Helm con tolerancias configurables** para una gestión más sencilla +3. **Prueba los despliegues después de actualizaciones de EKS** para asegurar compatibilidad +4. **Documenta la configuración de tus nodepools** para referencia del equipo + +**Plantilla para futuros despliegues:** + +```yaml +# deployment-template.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.name }} +spec: + template: + spec: + tolerations: + {{- if .Values.tolerations }} + {{- toYaml .Values.tolerations | nindent 6 }} + {{- end }} + containers: + - name: {{ .Values.name }} + # ... 
configuración del contenedor +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 21 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-specific-configuration-files.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-specific-configuration-files.mdx new file mode 100644 index 000000000..04d62fb4c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-specific-configuration-files.mdx @@ -0,0 +1,441 @@ +--- +sidebar_position: 3 +title: "Archivos de Configuración Específicos para Entornos" +description: "Cómo gestionar archivos de configuración que cambian entre entornos (prod/dev/qa)" +date: "2024-01-15" +category: "proyecto" +tags: ["configuración", "entorno", "archivos", "despliegue"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Archivos de Configuración Específicos para Entornos + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Configuración, Entorno, Archivos, Despliegue + +## Descripción del Problema + +**Contexto:** El usuario necesita desplegar archivos de configuración que varían entre diferentes entornos (producción, desarrollo, QA) en la plataforma SleakOps. 
+ +**Síntomas Observados:** + +- Necesidad de subir diferentes archivos de configuración por entorno +- Las variables de entorno están disponibles pero son insuficientes para configuraciones basadas en archivos +- Incertidumbre sobre cómo manejar las variaciones de archivos entre entornos +- Los archivos de configuración contienen ajustes específicos para cada entorno + +**Configuración Relevante:** + +- Múltiples entornos: prod, dev, qa +- Configuración almacenada en archivos (no solo variables de entorno) +- Variables de entorno ya disponibles +- El contenido del archivo cambia según el destino del despliegue + +**Condiciones de Error:** + +- Imposibilidad de desplegar diferentes archivos de configuración por entorno +- Archivos de configuración contienen valores codificados para entornos específicos +- Necesidad de contenido dinámico en archivos según el contexto de despliegue + +## Solución Detallada + + + +El enfoque recomendado es usar ConfigMaps de Kubernetes con configuraciones específicas para cada entorno: + +1. **Crear ConfigMaps separados** para cada entorno +2. **Usar variables de entorno** para referenciar el ConfigMap correcto +3. **Montar los ConfigMaps como archivos** en tus contenedores + +```yaml +# config-dev.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config-dev +data: + config.json: | + { + "database_url": "dev-db.example.com", + "api_endpoint": "https://api-dev.miapp.com", + "log_level": "debug", + "feature_flags": { + "new_feature": true, + "experimental": true + } + } + +--- +# config-prod.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config-prod +data: + config.json: | + { + "database_url": "prod-db.example.com", + "api_endpoint": "https://api.miapp.com", + "log_level": "info", + "feature_flags": { + "new_feature": true, + "experimental": false + } + } +``` + + + + + +Para situaciones más complejas, puedes usar un enfoque de plantillas: + +1. **Crear archivos de plantilla** con marcadores de posición +2. 
**Usar un script de inicialización** para generar la configuración final +3. **Combinar variables de entorno con plantillas** + +Plantilla `config-template.json` (el nombre del archivo va fuera de la plantilla para que el JSON generado sea válido y pase la validación con `jq`): + +```json +{ + "database_url": "${DATABASE_URL}", + "api_endpoint": "${API_ENDPOINT}", + "log_level": "${LOG_LEVEL}", + "cache_ttl": ${CACHE_TTL}, + "feature_flags": { + "new_feature": ${NEW_FEATURE_ENABLED}, + "experimental": ${EXPERIMENTAL_ENABLED} + } +} +``` + +**Script de inicialización:** + +```bash +#!/bin/bash +# init-config.sh + +# Sustituir variables de entorno en la plantilla +envsubst < /templates/config-template.json > /app/config.json + +# Validar el JSON generado +if ! jq empty /app/config.json; then + echo "Error: Archivo de configuración JSON inválido" + exit 1 +fi + +echo "Configuración generada exitosamente para entorno: $ENVIRONMENT" +``` + + + + + +SleakOps permite gestionar configuraciones específicas por entorno usando VarGroups: + +1. **Crear VarGroups por entorno:** + + - `app-config-dev` para desarrollo + - `app-config-staging` para staging + - `app-config-prod` para producción + +2. **Definir variables de configuración:** + +```bash +# VarGroup: app-config-dev +DATABASE_URL=postgresql://dev-db.example.com:5432/myapp_dev +API_ENDPOINT=https://api-dev.miapp.com +LOG_LEVEL=debug +CACHE_TTL=300 +NEW_FEATURE_ENABLED=true +EXPERIMENTAL_ENABLED=true + +# VarGroup: app-config-prod +DATABASE_URL=postgresql://prod-db.example.com:5432/myapp +API_ENDPOINT=https://api.miapp.com +LOG_LEVEL=info +CACHE_TTL=3600 +NEW_FEATURE_ENABLED=true +EXPERIMENTAL_ENABLED=false +``` + +3. **Generar archivos durante el despliegue:** + +```dockerfile +# En tu Dockerfile +COPY config-template.json /templates/ +COPY init-config.sh /scripts/ + +# Durante el inicio del contenedor (se necesita la forma shell de CMD para encadenar comandos con &&) +CMD ["/bin/sh", "-c", "/scripts/init-config.sh && /app/start.sh"] +``` + + + + + +Para configuraciones que contienen datos sensibles: + +1. **Usar Kubernetes Secrets** en lugar de ConfigMaps +2. **Separar configuraciones públicas y privadas** +3.
**Implementar encriptación en reposo** + +```yaml +# secret-config.yaml +apiVersion: v1 +kind: Secret +metadata: + name: app-secret-config +type: Opaque +stringData: + database-credentials.json: | + { + "username": "app_user", + "password": "secure_password", + "ssl_cert": "/etc/ssl/certs/db-cert.pem" + } + api-keys.json: | + { + "stripe_secret": "sk_live_...", + "sendgrid_api_key": "SG...", + "jwt_secret": "your-jwt-secret" + } +``` + +**Montaje en el contenedor:** + +```yaml +# deployment.yaml +spec: + template: + spec: + containers: + - name: app + volumeMounts: + - name: public-config + mountPath: /app/config + - name: secret-config + mountPath: /app/secrets + readOnly: true + volumes: + - name: public-config + configMap: + name: app-config-${ENVIRONMENT} + - name: secret-config + secret: + secretName: app-secret-config +``` + + + + + +Implementa validación para asegurar que las configuraciones son correctas: + +1. **Validación de esquema:** + +```javascript +// config-validator.js +const Ajv = require("ajv"); +const fs = require("fs"); + +const configSchema = { + type: "object", + properties: { + database_url: { type: "string", format: "uri" }, + api_endpoint: { type: "string", format: "uri" }, + log_level: { type: "string", enum: ["debug", "info", "warn", "error"] }, + cache_ttl: { type: "number", minimum: 60 }, + }, + required: ["database_url", "api_endpoint", "log_level"], +}; + +function validateConfig(configPath) { + const ajv = new Ajv(); + const validate = ajv.compile(configSchema); + + const config = JSON.parse(fs.readFileSync(configPath, "utf8")); + const valid = validate(config); + + if (!valid) { + console.error("Errores de validación:", validate.errors); + process.exit(1); + } + + console.log("Configuración válida"); +} + +validateConfig("/app/config.json"); +``` + +2. **Pruebas de configuración:** + +```bash +#!/bin/bash +# test-config.sh + +echo "Probando configuración para entorno: $ENVIRONMENT" + +# Verificar que el archivo existe +if [ ! 
-f "/app/config.json" ]; then + echo "Error: Archivo de configuración no encontrado" + exit 1 +fi + +# Verificar conectividad a base de datos +node -e " + const config = require('/app/config.json'); + const { Client } = require('pg'); + const client = new Client({ connectionString: config.database_url }); + + client.connect() + .then(() => { + console.log('Conexión a base de datos exitosa'); + client.end(); + }) + .catch(err => { + console.error('Error conectando a base de datos:', err.message); + process.exit(1); + }); +" + +echo "Todas las pruebas de configuración pasaron" +``` + + + + + +1. **Organización de archivos:** + +``` +project/ +├── config/ +│ ├── base.json # Configuración común +│ ├── development.json # Configuración específica dev +│ ├── staging.json # Configuración específica staging +│ └── production.json # Configuración específica prod +├── templates/ +│ ├── config-template.json # Plantilla con variables +│ └── nginx-template.conf # Plantillas de servicios +└── scripts/ + ├── init-config.sh # Script de inicialización + └── validate-config.js # Validación de configuración +``` + +2. **Estrategias de versionado:** + +```yaml +# Usar etiquetas para versionar configuraciones +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config-v1-2-3 + labels: + app: miapp + version: "1.2.3" + environment: production +data: + config.json: | + { + "version": "1.2.3", + "environment": "production" + } +``` + +3. **Monitoreo de cambios:** + +```bash +# Script para detectar cambios en configuración +#!/bin/bash +CONFIG_HASH=$(sha256sum /app/config.json | cut -d' ' -f1) +echo "Hash de configuración actual: $CONFIG_HASH" + +# Almacenar en variable de entorno para debugging +export CONFIG_HASH +``` + +4. 
**Rollback de configuraciones:** + +```bash +# Mantener respaldos de configuraciones +kubectl get configmap app-config-prod -o yaml > config-backup-$(date +%Y%m%d-%H%M%S).yaml + +# Script de rollback +kubectl apply -f config-backup-20240115-143000.yaml +kubectl rollout restart deployment/miapp +``` + + + + + +**Problema 1: Configuración no se actualiza** + +```bash +# Verificar si el ConfigMap fue actualizado +kubectl describe configmap app-config-prod + +# Forzar recarga del pod +kubectl rollout restart deployment/miapp + +# Verificar logs de inicialización +kubectl logs deployment/miapp --container=init-config +``` + +**Problema 2: Archivos de configuración corruptos** + +```bash +# Validar JSON +cat /app/config.json | jq . + +# Verificar variables de entorno +env | grep -E "(DATABASE_URL|API_ENDPOINT)" + +# Regenerar configuración +/scripts/init-config.sh +``` + +**Problema 3: Configuración por defecto no funciona** + +```yaml +# Usar initContainers para configuración predeterminada +initContainers: + - name: config-init + image: busybox + command: + - sh + - -c + - | + if [ ! 
-f /app/config.json ]; then + echo "Creando configuración por defecto" + cp /templates/config-default.json /app/config.json + fi + volumeMounts: + - name: config-volume + mountPath: /app + - name: templates + mountPath: /templates +``` + +**Problema 4: Permisos de archivos** + +```yaml +# Configurar permisos correctos en volúmenes +volumes: + - name: config-volume + configMap: + name: app-config + defaultMode: 0644 + - name: secret-volume + secret: + secretName: app-secrets + defaultMode: 0600 +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx new file mode 100644 index 000000000..6f0be22f4 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-accessible-nuxtjs.mdx @@ -0,0 +1,225 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno No Accesibles en Aplicación NuxtJS" +description: "Solución para variables de entorno que no se reciben en aplicaciones NuxtJS desplegadas en SleakOps" +date: "2024-01-15" +category: "proyecto" +tags: ["nuxtjs", "variables-de-entorno", "despliegue", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno No Accesibles en Aplicación NuxtJS + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** NuxtJS, Variables de Entorno, Despliegue, Configuración + +## Descripción del Problema + +**Contexto:** El usuario ha desplegado una aplicación NuxtJS en SleakOps y configurado variables de entorno a través de la plataforma, pero la aplicación no puede acceder a estas variables en tiempo de ejecución. 
+ +**Síntomas Observados:** + +- Variables de entorno configuradas en la plataforma SleakOps +- Variables no accesibles dentro de la aplicación NuxtJS +- Comportamiento de la aplicación sugiere configuración de entorno ausente +- Variables pueden aparecer como indefinidas o vacías + +**Configuración Relevante:** + +- Framework: NuxtJS +- Plataforma: SleakOps +- Variables de entorno: Configuradas mediante la interfaz de la plataforma +- Tipo de despliegue: Despliegue basado en contenedores + +**Condiciones de Error:** + +- Variables no disponibles en tiempo de ejecución de la aplicación +- El problema ocurre después de un despliegue exitoso +- Persiste tras reinicios de la aplicación +- Puede afectar tanto el acceso en servidor como en cliente + +## Solución Detallada + + + +NuxtJS tiene requerimientos específicos para las variables de entorno: + +1. **Variables del lado servidor**: Disponibles en `process.env` +2. **Variables del lado cliente**: Deben exponerse explícitamente usando `publicRuntimeConfig` o `privateRuntimeConfig` +3. **Variables en tiempo de compilación**: Necesitan estar disponibles durante el proceso de build + +```javascript +// nuxt.config.js +export default { + // Solo servidor + privateRuntimeConfig: { + apiSecret: process.env.API_SECRET, + }, + // Expuestas al cliente + publicRuntimeConfig: { + apiUrl: process.env.API_URL || "https://api.example.com", + }, +}; +``` + + + + + +Para configurar correctamente las variables de entorno en SleakOps: + +1. **Ve a la Configuración de tu Proyecto** +2. **Navega a la sección de Variables de Entorno** +3. **Agrega variables con nombres adecuados**: + + - Usa el prefijo `NUXT_` para reconocimiento automático por Nuxt + - Ejemplo: `NUXT_API_URL`, `NUXT_DATABASE_URL` + +4. 
**Asegura que las variables estén disponibles en tiempo de compilación**: + - Marca variables críticas como "Build Time" si se necesitan durante la compilación + - Marca variables de ejecución como "Runtime" para uso en la aplicación + + + + + +Asegúrate de que tu Dockerfile maneje correctamente las variables de entorno: + +```dockerfile +# Ejemplo Dockerfile para NuxtJS +FROM node:18-alpine + +WORKDIR /app + +# Copiar archivos de paquetes +COPY package*.json ./ +# Instalar todas las dependencias: las devDependencies son necesarias para `npm run build` +RUN npm ci + +# Copiar código fuente +COPY . . + +# Construir la aplicación (variables de entorno necesarias aquí) +ARG NUXT_API_URL +ARG NUXT_APP_NAME +ENV NUXT_API_URL=$NUXT_API_URL +ENV NUXT_APP_NAME=$NUXT_APP_NAME + +RUN npm run build + +# Exponer puerto +EXPOSE 3000 + +# Iniciar la aplicación +CMD ["npm", "start"] +``` + + + + + +Para aplicaciones Nuxt 3, usa la nueva configuración de tiempo de ejecución: + +```typescript +// nuxt.config.ts +export default defineNuxtConfig({ + runtimeConfig: { + // Claves privadas (solo disponibles en servidor) + apiSecret: process.env.API_SECRET, + // Claves públicas (expuestas al cliente) + public: { + apiUrl: process.env.NUXT_PUBLIC_API_URL || "https://api.example.com", + appName: process.env.NUXT_PUBLIC_APP_NAME || "Mi App", + }, + }, +}); +``` + +Accede a las variables en tu aplicación: + +```vue +<script setup> +// useRuntimeConfig() expone runtimeConfig en componentes de Nuxt 3 +const config = useRuntimeConfig(); +</script> + +<template> + <p>API: {{ config.public.apiUrl }}</p> + <p>App: {{ config.public.appName }}</p> +</template> +``` + + + + + +1. **Verifica nombres de variables en SleakOps**: + + - Revisa errores tipográficos en los nombres + - Asegura consistencia entre plataforma y código + +2. **Revisa los logs de compilación**: + + - Comprueba disponibilidad de variables durante el build + - Verifica que no haya errores relacionados con variables faltantes + +3. **Prueba localmente**: + + ```bash + # Crear archivo .env para pruebas locales + NUXT_PUBLIC_API_URL=https://api.example.com + NUXT_PUBLIC_APP_NAME=App de Prueba + API_SECRET=tu-clave-secreta + + # Ejecutar localmente + npm run dev + ``` + +4. 
**Depura en producción**: + ```javascript + // Añade logs temporales + console.log("Chequeo de entorno:", { + nodeEnv: process.env.NODE_ENV, + apiUrl: process.env.NUXT_PUBLIC_API_URL, + tieneSecreto: !!process.env.API_SECRET, + }); + ``` + + + + + +**Solución 1: Actualizar nombres de variables** + +- Prefija variables públicas con `NUXT_PUBLIC_` +- Usa mayúsculas consistentes (MAYÚSCULAS para vars de entorno) + +**Solución 2: Reconstruir la aplicación** + +- Tras cambiar variables de entorno, dispara un nuevo despliegue +- Asegura que el proceso de build tome las nuevas variables + +**Solución 3: Verificar ámbito de variables** + +- Confirma si las variables deben ser de build-time o runtime +- Configura adecuadamente en la plataforma SleakOps + +**Solución 4: Actualizar nuxt.config** + +- Asegura que todas las variables requeridas estén configuradas correctamente +- Prueba acceso tanto en servidor como en cliente + + + +--- + +_Este FAQ fue generado automáticamente el 15 de enero de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-working-during-migration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-working-during-migration.mdx new file mode 100644 index 000000000..eb6af340f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/environment-variables-not-working-during-migration.mdx @@ -0,0 +1,184 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno No Funcionan Durante la Migración de Plataforma" +description: "Solución para variables de entorno que no llegan a las aplicaciones durante migraciones de mantenimiento en SleakOps" +date: "2024-04-24" +category: "proyecto" +tags: + [ + "variables-de-entorno", + "secretos", + "migración", + "despliegue", + "solución-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# 
Variables de Entorno No Funcionan Durante la Migración de Plataforma + +**Fecha:** 24 de abril de 2024 +**Categoría:** Proyecto +**Etiquetas:** Variables de Entorno, Secretos, Migración, Despliegue, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Durante las migraciones de mantenimiento de la plataforma SleakOps, las variables de entorno y secretos pueden volverse temporalmente inaccesibles para las aplicaciones, causando problemas en el despliegue y en tiempo de ejecución. + +**Síntomas Observados:** + +- Variables de entorno no enviadas a las aplicaciones tras compilación/despliegue +- Variables que aparecen como indefinidas o vacías en los registros de la aplicación +- Error al intentar editar variables en la plataforma SleakOps +- Aplicaciones que fallan al conectar con APIs o servicios externos +- Variables que funcionaban previamente dejan de funcionar de repente + +**Configuración Relevante:** + +- Plataforma: SleakOps +- Afectados: Variables de entorno y secretos +- Momento: Durante migraciones de mantenimiento +- Impacto: Aplicaciones no pueden acceder a la configuración + +**Condiciones de Error:** + +- Ocurre durante ventanas de mantenimiento de la plataforma +- Variables inaccesibles tras nuevas compilaciones/despliegues +- Funcionalidad de edición deshabilitada temporalmente +- Aplicaciones pueden apuntar a entornos incorrectos (QA en lugar de producción) + +## Solución Detallada + + + +Durante las migraciones de mantenimiento de SleakOps: + +1. **Proceso de respaldo de secretos**: La plataforma crea copias de seguridad de los secretos en tu cuenta AWS +2. **Edición temporalmente deshabilitada**: La edición de variables se desactiva durante la migración +3. **Secretos existentes permanecen**: Las variables siguen existiendo en el clúster pero pueden no propagarse a nuevos despliegues +4. 
**Impacto en compilación/despliegue**: Las nuevas compilaciones pueden no recibir variables de entorno actualizadas + +**Importante**: Evita iniciar nuevas compilaciones/despliegues durante migraciones activas. + + + + + +Si necesitas funcionalidad inmediata durante la migración: + +1. **Configuración a nivel de aplicación**: Codifica temporalmente valores críticos directamente en tu aplicación +2. **Cambio de entorno**: Apunta desarrollo/pruebas a APIs de producción como medida temporal +3. **Evita nuevos despliegues**: No inicies compilaciones/despliegues hasta que la migración finalice +4. **Usa despliegues existentes en caché**: Utiliza instancias que estén corriendo y tengan las variables + +```javascript +// Ejemplo: Valor alternativo codificado temporalmente +const API_URL = process.env.API_URL || "https://api.production.example.com"; +const API_KEY = process.env.API_KEY || "clave-alternativa-para-emergencia"; +``` + + + + + +Una vez finalizada la migración: + +1. **Verifica disponibilidad de variables**: Confirma que todas las variables de entorno son accesibles +2. **Prueba en proceso de compilación**: Inicia una compilación de prueba para confirmar que las variables se inyectan +3. **Valida el inicio de la aplicación**: Asegúrate de que las aplicaciones reciban toda la configuración necesaria +4. **Monitorea los registros**: Revisa los logs de la aplicación en busca de variables faltantes + +```javascript +// Ejemplo: Depurar variables de entorno en tu aplicación +console.log('Verificación de variables de entorno:'); +console.log('API_URL:', process.env.API_URL); +console.log('DATABASE_URL:', process.env.DATABASE_URL ? 'CONFIGURADA' : 'FALTA'); +console.log('SECRET_KEY:', process.env.SECRET_KEY ? 'CONFIGURADA' : 'FALTA'); +``` + + + + + +Después de completar la migración: + +1. **Acceso a edición restaurado**: La funcionalidad para editar variables será reactivada +2. **Actualiza si es necesario**: Realiza los cambios necesarios en las variables de entorno +3. 
**Despliega con nuevas variables**: Inicia un nuevo despliegue para aplicar las actualizaciones +4. **Verifica propagación**: Confirma que las variables llegan a todas las instancias de la aplicación + +**Pasos para verificar la funcionalidad de edición**: + +1. Ve a la sección de variables de entorno de tu proyecto +2. Intenta editar una variable no crítica +3. Guarda los cambios y despliega +4. Verifica que el cambio se refleje en tu aplicación + + + + + +Para minimizar el impacto durante futuras migraciones: + +1. **Monitorea anuncios de mantenimiento**: Mantente informado sobre migraciones planificadas +2. **Implementa degradaciones elegantes**: Diseña aplicaciones para manejar variables faltantes +3. **Usa gestión de configuración**: Implementa patrones adecuados de gestión de configuración +4. **Programa despliegues**: Evita despliegues durante ventanas conocidas de mantenimiento +5. **Prueba dependencias de variables**: Testea regularmente qué ocurre cuando faltan variables + +```javascript +// Ejemplo: Manejo robusto de configuración +class Config { + constructor() { + this.apiUrl = this.getRequiredEnv("API_URL"); + this.dbUrl = this.getRequiredEnv("DATABASE_URL"); + this.secretKey = this.getRequiredEnv("SECRET_KEY"); + } + + getRequiredEnv(key) { + const value = process.env[key]; + if (!value) { + console.error(`Variable de entorno requerida ausente: ${key}`); + // Implementar fallback o degradación elegante + return this.getFallbackValue(key); + } + return value; + } + + getFallbackValue(key) { + // Implementar lógica de fallback apropiada + const fallbacks = { + API_URL: "https://api.fallback.example.com", + // Añadir otros fallbacks según sea necesario + }; + return fallbacks[key] || null; + } +} +``` + + + + + +Cuando experimentes problemas con variables durante migraciones: + +1. **Contacta soporte inmediatamente**: Reporta el problema con detalles específicos +2. 
**Proporciona contexto**: Incluye qué variables están afectadas y cuándo comenzó el problema +3. **Evita múltiples despliegues**: No intentes desplegar repetidamente hasta que el problema se resuelva +4. **Documenta soluciones temporales**: Mantén registro de cualquier arreglo provisional aplicado +5. **Espera confirmación**: Obtén confirmación de que la migración terminó antes de reanudar operaciones normales + +**Información para incluir en solicitudes de soporte**: + +- Nombre del proyecto y entorno afectado +- Variables específicas que no funcionan +- Momento en que se detectó el problema +- Mensajes de error de la plataforma +- Capturas de pantalla de la configuración de variables si es útil + + + +--- + +_Esta FAQ fue generada automáticamente el 24 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/extending-charts-custom-ingress.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/extending-charts-custom-ingress.mdx new file mode 100644 index 000000000..bbfe58ede --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/extending-charts-custom-ingress.mdx @@ -0,0 +1,815 @@ +--- +sidebar_position: 3 +title: "Añadiendo Configuración Personalizada de Ingress Usando Gráficos Extendidos" +description: "Cómo configurar recursos personalizados persistentes de Ingress usando la función de Gráficos Extendidos de SleakOps" +date: "2024-12-26" +category: "proyecto" +tags: + ["ingress", "helm", "gráficos-extendidos", "aws-load-balancer", "kubernetes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Añadiendo Configuración Personalizada de Ingress Usando Gráficos Extendidos + +**Fecha:** 26 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Ingress, Helm, Gráficos Extendidos, AWS Load Balancer, Kubernetes + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan añadir 
configuraciones personalizadas de Ingress a sus proyectos SleakOps que persistan a través de despliegues y actualizaciones. La configuración estándar del proyecto puede no cubrir todos los requisitos específicos de Ingress como anotaciones personalizadas, múltiples hosts o configuraciones específicas del Controlador de Balanceador de Carga AWS. + +**Síntomas Observados:** + +- Necesidad de configurar manualmente los recursos de Ingress después de cada despliegue +- Las configuraciones personalizadas de Ingress se pierden durante las actualizaciones del proyecto +- Requisito de anotaciones y configuraciones específicas del AWS ALB +- Necesidad de manejar múltiples dominios y certificados SSL + +**Configuración Relevante:** + +- Plataforma: SleakOps con Kubernetes +- Balanceador de Carga: AWS Application Load Balancer (ALB) +- Controlador de Ingress: AWS Load Balancer Controller +- SSL/TLS: Integración con AWS Certificate Manager + +**Condiciones de Error:** + +- Las configuraciones de Ingress no persisten a través de los despliegues +- Las configuraciones manuales son sobrescritas durante las actualizaciones +- Necesidad de reglas complejas de enrutamiento y redireccionamientos + +## Solución Detallada + + + +Para añadir configuraciones personalizadas persistentes de Ingress: + +1. **Navegar al proyecto** en la consola de SleakOps +2. **Ir a la sección "Gráficos Extendidos"** en el menú lateral +3. **Hacer clic en "Plantillas"** para añadir recursos personalizados de Kubernetes +4. 
**Crear una nueva plantilla** que contenga la configuración de Ingress + +**Ubicación en la interfaz:** + +``` +Proyecto → Configuración → Gráficos Extendidos → Plantillas → Añadir Plantilla +``` + +**Ventajas de usar Gráficos Extendidos:** + +- Persistencia automática a través de despliegues +- Integración nativa con Helm +- Versionado junto con el código de la aplicación +- Capacidad de usar valores dinámicos de Helm + + + + + +La plantilla de Ingress debe seguir la estructura estándar de Kubernetes con soporte para valores de Helm: + +```yaml +# templates/custom-ingress.yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: {{ .Values.ingress.name | default "custom-ingress" }} + namespace: {{ .Release.Namespace }} + labels: + app.kubernetes.io/name: {{ include "chart.name" . }} + app.kubernetes.io/instance: {{ .Release.Name }} + app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} + annotations: + {{- toYaml .Values.ingress.annotations | nindent 4 }} +spec: + ingressClassName: {{ .Values.ingress.className | default "alb" }} + {{- if .Values.ingress.tls }} + tls: + {{- range .Values.ingress.tls }} + - hosts: + {{- range .hosts }} + - {{ . 
| quote }} + {{- end }} + secretName: {{ .secretName }} + {{- end }} + {{- end }} + rules: + {{- range .Values.ingress.hosts }} + - host: {{ .host | quote }} + http: + paths: + {{- range .paths }} + - path: {{ .path }} + pathType: {{ .pathType | default "Prefix" }} + backend: + service: + name: {{ .service.name }} + port: + number: {{ .service.port }} + {{- end }} + {{- end }} +``` + +**Componentes clave:** + +- **Metadatos:** Incluyen etiquetas estándar de Helm +- **Anotaciones:** Configuración específica del proveedor +- **Clase de Ingress:** Especifica el controlador a usar +- **TLS:** Configuración de certificados SSL +- **Reglas:** Definición de rutas y servicios backend + + + + + +Para proyectos que usan AWS Application Load Balancer, estas son las anotaciones más comunes: + +```yaml +# Anotaciones específicas de AWS ALB +annotations: + # Configuración básica del ALB + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/load-balancer-name: custom-alb-name + + # Configuración SSL/TLS + alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account:certificate/cert-id + alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01 + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]' + + # Redirecciones HTTP a HTTPS + alb.ingress.kubernetes.io/ssl-redirect: "443" + alb.ingress.kubernetes.io/actions.ssl-redirect: | + { + "Type": "redirect", + "RedirectConfig": { + "Protocol": "HTTPS", + "Port": "443", + "StatusCode": "HTTP_301" + } + } + + # Configuración de Health Checks + alb.ingress.kubernetes.io/healthcheck-path: /health + alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30" + alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5" + alb.ingress.kubernetes.io/healthy-threshold-count: "2" + alb.ingress.kubernetes.io/unhealthy-threshold-count: "3" + + # Tags para gestión de recursos + 
alb.ingress.kubernetes.io/tags: | + Environment={{ .Values.environment }}, + Project={{ .Values.project.name }}, + ManagedBy=SleakOps + + # Configuración de subnets + alb.ingress.kubernetes.io/subnets: subnet-xxxxx,subnet-yyyyy + + # Grupos de seguridad + alb.ingress.kubernetes.io/security-groups: sg-xxxxx,sg-yyyyy +``` + +**Ejemplo de redirección personalizada:** + +```yaml +annotations: + alb.ingress.kubernetes.io/actions.redirect-to-domain: | + { + "Type": "redirect", + "RedirectConfig": { + "Host": "newdomain.com", + "Path": "/#{path}", + "Query": "#{query}", + "StatusCode": "HTTP_301" + } + } +``` + + + + + +**Configuración de valores para múltiples dominios:** + +```yaml +# values.yaml +ingress: + name: multi-domain-ingress + className: alb + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/certificate-arn: | + arn:aws:acm:region:account:certificate/cert-id-1, + arn:aws:acm:region:account:certificate/cert-id-2 + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]' + alb.ingress.kubernetes.io/ssl-redirect: "443" + + hosts: + - host: api.example.com + paths: + - path: / + pathType: Prefix + service: + name: api-service + port: 8080 + - path: /v2 + pathType: Prefix + service: + name: api-v2-service + port: 8080 + + - host: admin.example.com + paths: + - path: / + pathType: Prefix + service: + name: admin-service + port: 3000 + + - host: www.example.com + paths: + - path: / + pathType: Prefix + service: + name: web-service + port: 80 + + tls: + - hosts: + - api.example.com + - admin.example.com + secretName: api-admin-tls + - hosts: + - www.example.com + secretName: www-tls +``` + +**Plantilla correspondiente:** + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: {{ .Values.ingress.name }} + namespace: {{ .Release.Namespace }} + annotations: + {{- range $key, $value := .Values.ingress.annotations 
}} + {{ $key }}: {{ $value | quote }} + {{- end }} +spec: + ingressClassName: {{ .Values.ingress.className }} + tls: + {{- range .Values.ingress.tls }} + - hosts: + {{- range .hosts }} + - {{ . }} + {{- end }} + secretName: {{ .secretName }} + {{- end }} + rules: + {{- range .Values.ingress.hosts }} + - host: {{ .host }} + http: + paths: + {{- range .paths }} + - path: {{ .path }} + pathType: {{ .pathType }} + backend: + service: + name: {{ .service.name }} + port: + number: {{ .service.port }} + {{- end }} + {{- end }} +``` + + + + + +**Configuración de acciones personalizadas:** + +```yaml +annotations: + # Acción de redirección + alb.ingress.kubernetes.io/actions.redirect-to-https: | + { + "Type": "redirect", + "RedirectConfig": { + "Protocol": "HTTPS", + "Port": "443", + "StatusCode": "HTTP_301" + } + } + + # Acción de respuesta fija + alb.ingress.kubernetes.io/actions.response-503: | + { + "Type": "fixed-response", + "FixedResponseConfig": { + "ContentType": "text/plain", + "StatusCode": "503", + "MessageBody": "Service temporarily unavailable" + } + } + + # Acción de forward con pesos (para A/B testing) + alb.ingress.kubernetes.io/actions.weighted-routing: | + { + "Type": "forward", + "ForwardConfig": { + "TargetGroups": [ + { + "ServiceName": "app-v1", + "ServicePort": "80", + "Weight": 80 + }, + { + "ServiceName": "app-v2", + "ServicePort": "80", + "Weight": 20 + } + ] + } + } +``` + +**Ejemplo de Ingress con múltiples acciones:** + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: advanced-routing-ingress + annotations: + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/actions.ssl-redirect: | + { + "Type": "redirect", + "RedirectConfig": { + "Protocol": "HTTPS", + "Port": "443", + "StatusCode": "HTTP_301" + } + } + alb.ingress.kubernetes.io/actions.maintenance-mode: | + { + "Type": "fixed-response", + "FixedResponseConfig": { + "ContentType": "text/html", + 
"StatusCode": "503", + "MessageBody": "

Maintenance Mode

Service will be back soon

" + } + } +spec: + ingressClassName: alb + rules: + - host: example.com + http: + paths: + # Redirección HTTP a HTTPS + - path: / + pathType: Prefix + backend: + service: + name: ssl-redirect + port: + name: use-annotation + + # Ruta normal para HTTPS + - path: /app + pathType: Prefix + backend: + service: + name: app-service + port: + number: 80 + + # Modo de mantenimiento para rutas específicas + - path: /admin + pathType: Prefix + backend: + service: + name: maintenance-mode + port: + name: use-annotation +``` + +
+ + + +**Integración con AWS Cognito:** + +```yaml +annotations: + # Configuración de autenticación con Cognito + alb.ingress.kubernetes.io/auth-type: cognito + alb.ingress.kubernetes.io/auth-idp-cognito: | + { + "UserPoolArn": "arn:aws:cognito-idp:region:account:userpool/us-west-2_xxxxxx", + "UserPoolClientId": "xxxxxxxxxxxxxxxxxxxxxxxxxx", + "UserPoolDomain": "your-domain.auth.region.amazoncognito.com" + } + alb.ingress.kubernetes.io/auth-scope: "openid profile email" + alb.ingress.kubernetes.io/auth-session-cookie: "AWSELBAuthSessionCookie" + alb.ingress.kubernetes.io/auth-session-timeout: "86400" + alb.ingress.kubernetes.io/auth-on-unauthenticated-request: authenticate +``` + +**Integración con OIDC externo:** + +```yaml +annotations: + # Configuración de autenticación OIDC + alb.ingress.kubernetes.io/auth-type: oidc + alb.ingress.kubernetes.io/auth-idp-oidc: | + { + "Issuer": "https://your-oidc-provider.com", + "AuthorizationEndpoint": "https://your-oidc-provider.com/auth", + "TokenEndpoint": "https://your-oidc-provider.com/token", + "UserInfoEndpoint": "https://your-oidc-provider.com/userinfo", + "SecretName": "oidc-client-secret" + } + alb.ingress.kubernetes.io/auth-scope: "openid profile email" +``` + +**Configuración de rutas protegidas y públicas:** + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: auth-protected-ingress + annotations: + alb.ingress.kubernetes.io/auth-type: cognito + alb.ingress.kubernetes.io/auth-idp-cognito: | + { + "UserPoolArn": "arn:aws:cognito-idp:region:account:userpool/us-west-2_xxxxxx", + "UserPoolClientId": "client-id", + "UserPoolDomain": "domain.auth.region.amazoncognito.com" + } +spec: + rules: + - host: app.example.com + http: + paths: + # Rutas públicas (sin autenticación) + - path: /public + pathType: Prefix + backend: + service: + name: public-service + port: + number: 80 + + # Rutas protegidas (requieren autenticación) + - path: /admin + pathType: Prefix + backend: + service: + name: 
admin-service + port: + number: 80 + + - path: /dashboard + pathType: Prefix + backend: + service: + name: dashboard-service + port: + number: 80 +``` + + + + + +**Verificación de configuración aplicada:** + +```bash +#!/bin/bash +# Script para verificar la configuración de Ingress + +# Obtener información del Ingress +kubectl get ingress -n namespace-name -o yaml + +# Verificar eventos relacionados +kubectl get events -n namespace-name --sort-by=.metadata.creationTimestamp + +# Verificar el estado del ALB +kubectl describe ingress ingress-name -n namespace-name + +# Obtener logs del controlador ALB +kubectl logs -n kube-system deployment/aws-load-balancer-controller +``` + +**Comandos de diagnóstico:** + +```bash +# Verificar que el controlador ALB está funcionando +kubectl get deployment -n kube-system aws-load-balancer-controller + +# Verificar los servicios backend +kubectl get svc -n namespace-name + +# Verificar los endpoints +kubectl get endpoints -n namespace-name + +# Verificar certificados TLS +kubectl get secrets -n namespace-name | grep tls + +# Obtener detalles del Load Balancer creado +aws elbv2 describe-load-balancers --names custom-alb-name +aws elbv2 describe-target-groups --load-balancer-arn arn:aws:elasticloadbalancing:... 
+``` + +**Métricas de monitoreo:** + +```yaml +# ServiceMonitor para Prometheus (si está disponible) +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: ingress-nginx-metrics +spec: + selector: + matchLabels: + app.kubernetes.io/name: ingress-nginx + endpoints: + - port: prometheus + interval: 30s + path: /metrics +``` + +**Alertas recomendadas:** + +```yaml +# PrometheusRule para alertas de Ingress +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: ingress-alerts +spec: + groups: + - name: ingress.rules + rules: + - alert: IngressDown + expr: up{job="ingress-nginx"} == 0 + for: 5m + labels: + severity: critical + annotations: + summary: "Ingress controller is down" + + - alert: HighLatency + expr: histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) > 1 + for: 5m + labels: + severity: warning + annotations: + summary: "High latency on ingress" +``` + + + + + +**Problema: Ingress no crea Load Balancer** + +```bash +# Verificar logs del controlador +kubectl logs -n kube-system deployment/aws-load-balancer-controller -f + +# Verificar anotaciones requeridas +kubectl get ingress ingress-name -o yaml | grep -A 10 annotations + +# Verificar permisos IAM del controlador +aws sts get-caller-identity +aws iam list-attached-role-policies --role-name AWSLoadBalancerControllerRole +``` + +**Problema: Certificados SSL no funcionan** + +```bash +# Verificar certificados en AWS ACM +aws acm list-certificates --region your-region + +# Verificar validación de dominio +aws acm describe-certificate --certificate-arn arn:aws:acm:... + +# Verificar anotaciones de certificado en Ingress +kubectl get ingress ingress-name -o jsonpath='{.metadata.annotations.alb\.ingress\.kubernetes\.io/certificate-arn}' +``` + +**Problema: Health checks fallan** + +```bash +# Verificar configuración de health check +aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:... 
+ +# Verificar que el servicio responde en el puerto correcto +kubectl port-forward svc/service-name 8080:80 +curl http://localhost:8080/health + +# Ajustar configuración de health check en Ingress +kubectl annotate ingress ingress-name alb.ingress.kubernetes.io/healthcheck-path=/custom-health +``` + +**Problema: Redirecciones no funcionan** + +```yaml +# Verificar formato correcto de anotaciones de redirección +annotations: + alb.ingress.kubernetes.io/actions.ssl-redirect: | + { + "Type": "redirect", + "RedirectConfig": { + "Protocol": "HTTPS", + "Port": "443", + "StatusCode": "HTTP_301" + } + } +``` + +**Soluciones de script automatizado:** + +```bash +#!/bin/bash +# Script de diagnóstico completo + +echo "=== Verificación de Ingress ===" +NAMESPACE=${1:-default} +INGRESS_NAME=${2:-""} + +if [ -z "$INGRESS_NAME" ]; then + echo "Listando todos los Ingress en namespace $NAMESPACE:" + kubectl get ingress -n $NAMESPACE + exit 1 +fi + +echo "1. Estado del Ingress:" +kubectl get ingress $INGRESS_NAME -n $NAMESPACE + +echo -e "\n2. Descripción detallada:" +kubectl describe ingress $INGRESS_NAME -n $NAMESPACE + +echo -e "\n3. Eventos recientes:" +kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$INGRESS_NAME + +echo -e "\n4. Servicios backend:" +kubectl get svc -n $NAMESPACE + +echo -e "\n5. Estado del controlador ALB:" +kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller + +echo -e "\n6. Logs del controlador (últimas 20 líneas):" +kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=20 + +echo -e "\n7. Verificación de Load Balancer en AWS:" +ALB_NAME=$(kubectl get ingress $INGRESS_NAME -n $NAMESPACE -o jsonpath='{.metadata.annotations.alb\.ingress\.kubernetes\.io/load-balancer-name}') +if [ ! 
-z "$ALB_NAME" ]; then + aws elbv2 describe-load-balancers --names $ALB_NAME 2>/dev/null || echo "Load Balancer no encontrado en AWS" +fi +``` + + + + + +**Organización de templates:** + +``` +project/ +├── charts/ +│ └── templates/ +│ ├── ingress-main.yaml # Ingress principal +│ ├── ingress-api.yaml # Ingress para API +│ ├── ingress-admin.yaml # Ingress para admin +│ └── configmap-nginx.yaml # Configuraciones adicionales +└── values/ + ├── values.yaml # Valores por defecto + ├── values-dev.yaml # Valores desarrollo + ├── values-staging.yaml # Valores staging + └── values-prod.yaml # Valores producción +``` + +**Configuración por entornos:** + +```yaml +# values-prod.yaml +ingress: + enabled: true + className: alb + annotations: + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account:certificate/prod-cert + alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS-1-2-2017-01 + alb.ingress.kubernetes.io/tags: | + Environment=production, + Project={{ .Values.project.name }}, + CostCenter=engineering + + hosts: + - host: api.production.com + paths: + - path: / + service: + name: api-service + port: 80 + +# values-dev.yaml +ingress: + enabled: true + className: alb + annotations: + alb.ingress.kubernetes.io/scheme: internal + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/tags: | + Environment=development, + Project={{ .Values.project.name }} + + hosts: + - host: api.dev.internal.com + paths: + - path: / + service: + name: api-service + port: 80 +``` + +**Seguridad y validación:** + +```yaml +# Validaciones en templates +{{- if and .Values.ingress.enabled .Values.ingress.tls }} +{{- range .Values.ingress.tls }} +{{- if not .secretName }} +{{- fail "secretName is required for TLS configuration" }} +{{- end }} +{{- end }} +{{- end }} + +# Configuración de seguridad +annotations: + # Restricción de IPs (para entornos internos) + 
alb.ingress.kubernetes.io/inbound-cidrs: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 + + # Headers de seguridad + alb.ingress.kubernetes.io/response-headers: | + X-Frame-Options=SAMEORIGIN, + X-Content-Type-Options=nosniff, + X-XSS-Protection=1; mode=block, + Strict-Transport-Security=max-age=31536000; includeSubDomains +``` + +**Versionado y rollback:** + +```bash +#!/bin/bash +# Script para deployment con rollback automático + +NAMESPACE=$1 +RELEASE_NAME=$2 +CHART_PATH=$3 + +echo "Desplegando nueva versión..." +helm upgrade --install $RELEASE_NAME $CHART_PATH \ + --namespace $NAMESPACE \ + --timeout 10m \ + --wait \ + --atomic + +if [ $? -ne 0 ]; then + echo "Deployment falló, ejecutando rollback..." + helm rollback $RELEASE_NAME -n $NAMESPACE + exit 1 +fi + +# Verificar que el Ingress está funcionando +sleep 30 +INGRESS_ENDPOINT=$(kubectl get ingress -n $NAMESPACE -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}') + +if [ ! -z "$INGRESS_ENDPOINT" ]; then + curl -f https://$INGRESS_ENDPOINT/health || { + echo "Health check falló, ejecutando rollback..." + helm rollback $RELEASE_NAME -n $NAMESPACE + exit 1 + } +fi + +echo "Deployment exitoso!" 
+``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 26 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx new file mode 100644 index 000000000..1e3cc3064 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/fargate-pods-not-terminating-cost-increase.mdx @@ -0,0 +1,439 @@ +--- +sidebar_position: 3 +title: "Pods de AWS Fargate que no terminan - Problema de aumento de costos" +description: "Solución para pods de AWS Fargate que no terminan correctamente causando aumentos inesperados de costos" +date: "2024-03-14" +category: "cluster" +tags: + ["fargate", "aws", "costos", "pods", "terminación", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pods de AWS Fargate que no terminan - Problema de aumento de costos + +**Fecha:** 14 de marzo de 2024 +**Categoría:** Clúster +**Etiquetas:** Fargate, AWS, Costos, Pods, Terminación, Solución de problemas + +## Descripción del problema + +**Contexto:** Los usuarios experimentan aumentos inesperados en los costos de su clúster AWS EKS cuando los pods de Fargate que ejecutan despliegues no terminan correctamente después de completar los trabajos. 
+ +**Síntomas observados:** + +- Aumento dramático en los costos de AWS durante varios días +- Múltiples réplicas de Fargate visibles en la lista de Jobs +- Acumulación de pods sin limpieza adecuada +- Pronóstico de costos mostrando picos significativos +- Costos individuales de pods bajos, pero que se acumulan con el tiempo + +**Configuración relevante:** + +- Tipo de clúster: AWS EKS con Fargate +- Tipo de carga de trabajo: Despliegues ejecutándose en nodos Fargate +- Monitoreo: Prometheus con asignación insuficiente de RAM (1250MB) +- Monitoreo de costos: Habilitado con pronóstico + +**Condiciones de error:** + +- Los pods de Fargate no terminan después de la finalización del despliegue +- Problemas de memoria en Prometheus causando problemas de asignación de nodos +- Nodos no usados que permanecen activos generando costos +- El problema aparece de forma intermitente sin un disparador claro + +## Solución detallada + + + +Este problema de aumento de costos típicamente se debe a dos problemas principales: + +1. **Problema en el ciclo de vida de los pods Fargate**: Los pods de Fargate que ejecutan despliegues no siempre terminan correctamente, causando su acumulación con el tiempo +2. **Restricciones de recursos en Prometheus**: La asignación insuficiente de RAM (1250MB) hace que Prometheus asigne nodos que se vuelven inutilizables pero permanecen activos + +Ambos problemas resultan en recursos activos por más tiempo del necesario, generando costos inesperados. + + + + + +Para resolver el problema de memoria de Prometheus: + +1. **Accede a la configuración del addon Prometheus** en tu clúster +2. **Incrementa la asignación mínima de RAM** de 1250MB a 2GB (2048MB) +3. **Aplica los cambios de configuración** + +```yaml +# Configuración del addon Prometheus +resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" +``` + +Esto evita que Prometheus entre en nodos con recursos insuficientes y cree nodos inutilizables pero activos. 
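
Como comprobación rápida del cambio de tamaño, el siguiente fragmento (ilustrativo: la función `to_mib` es hipotética y no forma parte de SleakOps ni de kubectl) normaliza cantidades de memoria de Kubernetes a MiB para comparar la asignación anterior con la nueva:

```bash
#!/bin/bash
# Normaliza una cantidad de memoria de Kubernetes (sufijo Mi o Gi) a MiB.
to_mib() {
  local qty=$1
  case "$qty" in
    *Gi) echo $(( ${qty%Gi} * 1024 )) ;;
    *Mi) echo "${qty%Mi}" ;;
    *)   echo "$qty" ;;  # sin sufijo: se asume que ya está en MiB
  esac
}

anterior=$(to_mib 1250Mi)  # asignación original del addon
nueva=$(to_mib 2Gi)        # asignación recomendada
echo "anterior: ${anterior} MiB, nueva: ${nueva} MiB"
```

Con 2Gi (2048 MiB), Prometheus dispone de un margen adicional de casi 800 MiB respecto a la asignación original de 1250 MiB.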
+ + + + + +Para abordar la acumulación de pods de Fargate: + +1. **Verifica los pods actuales de Fargate**: + +```bash +kubectl get pods --all-namespaces -o wide | grep fargate +``` + +2. **Identifica pods atascados**: + +```bash +kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded +kubectl get pods --all-namespaces --field-selector=status.phase=Failed +``` + +3. **Limpia los pods completados**: + +```bash +# Eliminar pods completados +kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded +kubectl delete pods --all-namespaces --field-selector=status.phase=Failed +``` + + + + + +Para prevenir acumulaciones futuras, implementa limpieza automatizada: + +1. **Crea un CronJob de limpieza**: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: pod-cleanup + namespace: kube-system +spec: + schedule: "0 2 * * *" # Ejecutar diariamente a las 2 AM + jobTemplate: + spec: + template: + spec: + serviceAccountName: pod-cleanup + containers: + - name: kubectl + image: bitnami/kubectl:latest + command: + - /bin/sh + - -c + - | + kubectl delete pods --all-namespaces --field-selector=status.phase=Succeeded + kubectl delete pods --all-namespaces --field-selector=status.phase=Failed + restartPolicy: OnFailure +``` + +2. **Crea el RBAC necesario**: + +```yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: pod-cleanup + namespace: kube-system +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: pod-cleanup +rules: + - apiGroups: [""] + resources: ["pods"] + verbs: ["list", "delete"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: pod-cleanup +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: pod-cleanup +subjects: + - kind: ServiceAccount + name: pod-cleanup + namespace: kube-system +``` + + + + + +Para evitar sorpresas futuras en costos: + +1. 
**Configura alertas de costos** en AWS: + + - Ve a AWS Billing → Budgets + - Crea un presupuesto para tu clúster EKS + - Establece alertas al 80% y 100% de los costos esperados + +2. **Monitorea el uso de Fargate**: + +```bash +# Verificar cantidad de pods Fargate +kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.nodeName | contains("fargate")) | .metadata.name' | wc -l +``` + +3. **Revisiones regulares de costos**: + - Revisa AWS Cost Explorer semanalmente + - Monitorea costos de EKS por servicio + - Revisa patrones de uso de Fargate + + + + + +**Configuración de despliegue:** + +- Establece `activeDeadlineSeconds` apropiado para los jobs +- Usa `ttlSecondsAfterFinished` para limpieza automática +- Configura límites adecuados de recursos + +**Monitoreo:** + +- Configura alertas en Prometheus para acumulación de pods +- Monitorea uso de recursos del clúster regularmente +- Implementa paneles para seguimiento de costos + +**Mantenimiento:** + +- Programa chequeos regulares de salud del clúster +- Implementa procesos automatizados de limpieza +- Revisa y actualiza asignaciones de recursos periódicamente + +```yaml +# Ejemplo de job con limpieza automática +apiVersion: batch/v1 +kind: Job +metadata: + name: example-job +spec: + ttlSecondsAfterFinished: 300 # Limpia 5 minutos después de la finalización + activeDeadlineSeconds: 3600 # Finaliza el job después de 1 hora + template: + spec: + restartPolicy: Never + containers: + - name: job-container + image: your-image + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + + + + + +Para optimizar el uso de Fargate y reducir costos: + +1. **Configuración de Recursos Eficiente:** + +```yaml +# Usa combinaciones de CPU/memoria válidas para Fargate +resources: + requests: + cpu: "0.25" # 0.25, 0.5, 1, 2, 4 vCPU + memory: "0.5Gi" # Debe coincidir con las combinaciones válidas + limits: + cpu: "0.25" + memory: "0.5Gi" +``` + +2. 
**Configuraciones Válidas de CPU/Memoria para Fargate:** + +- 0.25 vCPU: 0.5GB, 1GB, 2GB +- 0.5 vCPU: 1GB, 2GB, 3GB, 4GB +- 1 vCPU: 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, 8GB +- 2 vCPU: 4GB a 16GB (incrementos de 1GB) +- 4 vCPU: 8GB a 30GB (incrementos de 1GB) + +3. **Configuración de Tolerancias para Fargate:** + +```yaml +tolerations: + - key: eks.amazonaws.com/compute-type + operator: Equal + value: fargate + effect: NoSchedule +``` + + + + + +Implementa estas estrategias para reducir costos de Fargate: + +1. **Uso de Spot Instances cuando sea posible:** + +```yaml +# Para cargas de trabajo tolerantes a interrupciones +nodeSelector: + eks.amazonaws.com/capacityType: SPOT +``` + +2. **Configuración de Escalado Automático:** + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + minReplicas: 1 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + +3. **Configuración de Escalado Vertical:** + +```yaml +apiVersion: autoscaling.k8s.io/v1 +kind: VerticalPodAutoscaler +metadata: + name: app-vpa +spec: + targetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + updatePolicy: + updateMode: "Auto" +``` + + + + + +Configura monitoreo completo para prevenir problemas de costos: + +1. 
**Alertas de Prometheus para Pods Atascados:** + +```yaml +groups: + - name: fargate-cost-alerts + rules: + - alert: FargatePodStuck + expr: kube_pod_info{node=~"fargate-.*"} and on(pod) kube_pod_status_phase{phase="Succeeded"} > 0 + for: 10m + labels: + severity: warning + annotations: + summary: "Pod de Fargate atascado detectado" + description: "El pod {{ $labels.pod }} en el namespace {{ $labels.namespace }} ha estado en estado Succeeded por más de 10 minutos" + + - alert: FargateHighCost + expr: increase(aws_billing_estimated_charges_usd[1h]) > 10 + for: 5m + labels: + severity: critical + annotations: + summary: "Aumento significativo en costos de AWS" + description: "Los costos de AWS han aumentado más de $10 en la última hora" +``` + +2. **Dashboard de Grafana para Monitoreo de Costos:** + +```json +{ + "dashboard": { + "title": "Fargate Cost Monitoring", + "panels": [ + { + "title": "Pods de Fargate Activos", + "type": "stat", + "targets": [ + { + "expr": "count(kube_pod_info{node=~\"fargate-.*\"})" + } + ] + }, + { + "title": "Pods Completados sin Limpiar", + "type": "stat", + "targets": [ + { + "expr": "count(kube_pod_status_phase{phase=\"Succeeded\", node=~\"fargate-.*\"})" + } + ] + } + ] + } +} +``` + + + +## Mejores Prácticas + +### Configuración de Recursos + +- **Usa las combinaciones exactas de CPU/memoria** válidas para Fargate +- **Configura límites de recursos apropiados** para evitar sobre-aprovisionamiento +- **Implementa requests y limits iguales** para garantizar recursos predecibles +- **Revisa y ajusta recursos regularmente** basado en métricas de uso + +### Gestión del Ciclo de Vida + +- **Configura ttlSecondsAfterFinished** para todos los Jobs +- **Usa activeDeadlineSeconds** para evitar jobs que corren indefinidamente +- **Implementa health checks apropiados** para detección temprana de problemas +- **Configura graceful shutdown** para terminación limpia de pods + +### Monitoreo y Alertas + +- **Configura alertas de costos** en AWS 
Billing +- **Monitorea métricas de Fargate** con Prometheus +- **Implementa dashboards** para visibilidad de costos en tiempo real +- **Revisa costos semanalmente** para detectar tendencias + +## Lista de Verificación para Resolución + +### Verificación Inmediata +- [ ] Memoria de Prometheus aumentada a 2GB +- [ ] Pods completados eliminados manualmente +- [ ] Alertas de costos configuradas en AWS +- [ ] Uso actual de Fargate verificado + +### Implementación de Prevención +- [ ] CronJob de limpieza automática desplegado +- [ ] RBAC para limpieza configurado +- [ ] Configuraciones de ttl añadidas a Jobs +- [ ] Monitoreo de Prometheus configurado + +### Optimización Continua +- [ ] Revisión semanal de costos programada +- [ ] Métricas de uso de recursos monitoreadas +- [ ] Configuraciones de escalado implementadas +- [ ] Documentación de procedimientos actualizada + +--- + +*Este documento fue generado automáticamente el 14 de marzo de 2024 para proporcionar soluciones completas a problemas de costos con pods de Fargate en AWS EKS.* diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/frontend-environment-variables-docker-build.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/frontend-environment-variables-docker-build.mdx new file mode 100644 index 000000000..22f0db5f6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/frontend-environment-variables-docker-build.mdx @@ -0,0 +1,179 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno en Frontend que No Funcionan Durante la Construcción" +description: "Solución para proyectos frontend que no reciben variables de entorno durante el proceso de construcción con Docker" +date: "2024-12-23" +category: "proyecto" +tags: + [ + "frontend", + "variables-de-entorno", + "docker", + "construcción", + "configuración", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno en 
Frontend que No Funcionan Durante la Construcción + +**Fecha:** 23 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Frontend, Variables de Entorno, Docker, Construcción, Configuración + +## Descripción del Problema + +**Contexto:** Los proyectos frontend en SleakOps no están recibiendo variables de entorno durante el proceso de construcción, afectando múltiples aplicaciones en diferentes frameworks frontend. + +**Síntomas Observados:** + +- Las variables de entorno no están disponibles en las aplicaciones frontend +- Variables que funcionan en otros tipos de proyectos fallan en proyectos frontend +- El proceso de construcción finaliza con éxito pero las variables son indefinidas en tiempo de ejecución +- El problema afecta consistentemente a múltiples aplicaciones frontend + +**Configuración Relevante:** + +- Tipo de proyecto: Aplicaciones frontend +- Proceso de construcción: Construcciones basadas en Docker +- Variables de entorno: Configuradas en los ajustes del proyecto SleakOps +- Acceso a variables: Requerido durante el tiempo de construcción, no solo en tiempo de ejecución + +**Condiciones de Error:** + +- Variables indefinidas o nulas en la aplicación construida +- Las variables de entorno funcionan en servicios backend pero no en frontend +- El problema ocurre en diferentes frameworks frontend (React, Vue, Angular, etc.) +- El problema persiste independientemente de las convenciones de nombres de variables + +## Solución Detallada + + + +Los proyectos frontend requieren variables de entorno durante el **proceso de construcción**, no solo en tiempo de ejecución. A diferencia de los servicios backend que pueden acceder a variables de entorno cuando el contenedor inicia, las aplicaciones frontend necesitan que estas variables estén "incrustadas" durante el paso de construcción porque: + +1. El código frontend se ejecuta en el navegador, no en el servidor +2. 
Las variables de entorno deben incorporarse en los archivos estáticos durante la construcción +3. El proceso de construcción necesita acceso a las variables para generar el paquete final + + + + + +Para solucionar este problema, debes configurar los Docker Args en los ajustes de tu proyecto: + +1. Ve a la lista de **Proyectos** en SleakOps +2. Encuentra tu proyecto frontend y haz clic en **Editar** +3. Navega a la sección de **Docker Args** +4. Añade tus variables de entorno como argumentos de construcción + +**Ejemplo de configuración:** + +```bash +# En la sección Docker Args +REACT_APP_API_URL=${API_URL} +REACT_APP_ENV=${ENVIRONMENT} +VUE_APP_BASE_URL=${BASE_URL} +NEXT_PUBLIC_API_KEY=${API_KEY} +``` + + + + + +Tu Dockerfile debe estar configurado para aceptar y usar estos argumentos de construcción: + +```dockerfile +# Aceptar argumentos de construcción +ARG REACT_APP_API_URL +ARG REACT_APP_ENV +ARG VUE_APP_BASE_URL +ARG NEXT_PUBLIC_API_KEY + +# Establecerlos como variables de entorno durante la construcción +ENV REACT_APP_API_URL=$REACT_APP_API_URL +ENV REACT_APP_ENV=$REACT_APP_ENV +ENV VUE_APP_BASE_URL=$VUE_APP_BASE_URL +ENV NEXT_PUBLIC_API_KEY=$NEXT_PUBLIC_API_KEY + +# Ejecutar tu comando de construcción +RUN npm run build +``` + + + + + +Diferentes frameworks frontend tienen convenciones específicas para nombrar las variables de entorno: + +**React:** + +- Deben comenzar con `REACT_APP_` +- Ejemplo: `REACT_APP_API_URL` + +**Vue.js:** + +- Deben comenzar con `VUE_APP_` +- Ejemplo: `VUE_APP_BASE_URL` + +**Next.js:** + +- Deben comenzar con `NEXT_PUBLIC_` para acceso en cliente +- Ejemplo: `NEXT_PUBLIC_API_KEY` + +**Angular:** + +- No requiere prefijo específico, pero se usa configuración personalizada +- Acceso a través de archivos `environment.ts` + +**Proyectos basados en Vite:** + +- Deben comenzar con `VITE_` +- Ejemplo: `VITE_API_ENDPOINT` + + + + + +Para verificar que tus variables de entorno funcionan: + +1. 
**Revisa los logs de construcción:** Busca que las variables se establezcan durante la construcción +2. **Consola del navegador:** Usa `console.log(process.env.REACT_APP_API_URL)` en tu código +3. **Pestaña de red:** Verifica que las llamadas API usan las URLs correctas +4. **Salida de construcción:** Revisa si las variables aparecen en los archivos empaquetados (ten cuidado con datos sensibles) + +**Ejemplo de código para prueba:** + +```javascript +// Añade esto temporalmente para verificar +console.log("Variables de entorno:", { + apiUrl: process.env.REACT_APP_API_URL, + environment: process.env.REACT_APP_ENV, +}); +``` + + + + + +Si las variables aún no funcionan después de la configuración: + +1. **Revisa los nombres de las variables:** Asegúrate de que siguen las convenciones del framework +2. **Verifica la sintaxis de Docker Args:** Asegúrate de que la sintaxis sea correcta en SleakOps +3. **Reconstruye el proyecto:** Los cambios en Docker Args requieren una reconstrucción completa +4. **Revisa el Dockerfile:** Asegúrate de que las declaraciones ARG y ENV estén presentes +5. 
**Valida los valores de las variables:** Asegúrate de que las variables de entorno origen tengan valores + +**Errores comunes:** + +- Olvidar prefijos específicos del framework +- No reconstruir después de cambios en la configuración +- Faltar declaraciones ARG en el Dockerfile +- Sintaxis incorrecta en la sustitución de variables + + + +--- + +_Este FAQ fue generado automáticamente el 23 de diciembre de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-automatic-deployment.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-automatic-deployment.mdx new file mode 100644 index 000000000..fe11cc047 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-automatic-deployment.mdx @@ -0,0 +1,199 @@ +--- +sidebar_position: 3 +title: "Configuración de Despliegue Automático con GitHub Actions" +description: "Cómo configurar despliegues automáticos con GitHub Actions en SleakOps" +date: "2024-12-17" +category: "proyecto" +tags: ["github-actions", "ci-cd", "despliegue", "automatización"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Despliegue Automático con GitHub Actions + +**Fecha:** 17 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** GitHub Actions, CI/CD, Despliegue, Automatización + +## Descripción del Problema + +**Contexto:** El usuario desea configurar despliegues automáticos que se activen al hacer push de código a un repositorio de GitHub usando la plataforma SleakOps. 
+ +**Síntomas Observados:** + +- Necesidad de configurar una pipeline CI/CD para despliegues automáticos +- Deseo de que los despliegues se activen con eventos de push en Git +- Busca integración entre GitHub y SleakOps +- Requiere orientación sobre la configuración de GitHub Actions + +**Configuración Relevante:** + +- Plataforma: SleakOps +- Control de Versiones: GitHub +- Herramienta CI/CD: GitHub Actions +- Destino de Despliegue: Infraestructura gestionada por SleakOps + +**Condiciones de Error:** + +- Actualmente se usa un proceso de despliegue manual +- No hay pipeline automatizada configurada +- Necesidad de establecer conexión entre GitHub y SleakOps + +## Solución Detallada + + + +SleakOps proporciona una herramienta CLI diseñada específicamente para integración CI/CD. Puedes encontrar la documentación completa en: https://docs.sleakops.com/cli + +El CLI te permite: + +- Desplegar aplicaciones automáticamente +- Gestionar despliegues desde pipelines CI/CD +- Integrar con varias plataformas CI/CD incluyendo GitHub Actions + + + + + +Para generar el archivo de workflow de GitHub Actions: + +1. **Accede a la Consola de SleakOps** + + - Inicia sesión en tu panel de SleakOps + - Navega a tu proyecto + +2. **Genera la Configuración CI/CD** + + - Busca la sección CI/CD o GitHub Actions + - La consola te proporcionará un archivo de workflow preconfigurado + - Este archivo incluye todos los pasos necesarios para el despliegue + +3. **Descarga la Configuración** + - Copia el archivo de workflow generado + - Toma nota de las claves secretas requeridas que se mostrarán + + + + + +Una vez que tengas el archivo workflow de la consola de SleakOps: + +1. **Crea el Directorio de Workflow** + + ```bash + mkdir -p .github/workflows + ``` + +2. **Agrega el Archivo Workflow** + + - Crea un archivo nuevo: `.github/workflows/deploy.yml` + - Pega la configuración proporcionada por la consola de SleakOps + +3. 
**Estructura Básica del Workflow** + + ```yaml + name: Deploy to SleakOps + + on: + push: + branches: [main, master] + + jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Deploy to SleakOps + run: | + # Aquí irán los comandos CLI de SleakOps + env: + SLEAKOPS_TOKEN: ${{ secrets.SLEAKOPS_TOKEN }} + SLEAKOPS_PROJECT_ID: ${{ secrets.SLEAKOPS_PROJECT_ID }} + ``` + + + + + +Para configurar los secretos requeridos en GitHub: + +1. **Accede a la Configuración del Repositorio** + + - Ve a tu repositorio en GitHub + - Haz clic en la pestaña **Settings** + - Navega a **Secrets and variables** → **Actions** + +2. **Agrega los Secrets Requeridos** + La consola de SleakOps te mostrará los secretos exactos necesarios, típicamente: + + - `SLEAKOPS_TOKEN`: Tu token API de SleakOps + - `SLEAKOPS_PROJECT_ID`: El identificador de tu proyecto + - Cualquier otro secreto específico del entorno + +3. **Crear Nuevo Secret** + - Haz clic en **New repository secret** + - Introduce el nombre del secreto exactamente como se muestra en la consola de SleakOps + - Pega el valor correspondiente + - Haz clic en **Add secret** + + + + + +Para verificar que tu configuración funciona correctamente: + +1. **Haz un Commit de Prueba** + + ```bash + git add . + git commit -m "Prueba de despliegue automático" + git push origin main + ``` + +2. **Monitorea GitHub Actions** + + - Ve a la pestaña **Actions** de tu repositorio + - Observa la ejecución del workflow + - Revisa si hay errores en los logs + +3. **Verifica el Despliegue** + - Consulta la consola de SleakOps para el estado del despliegue + - Verifica que tu aplicación esté actualizada + - Prueba la aplicación desplegada + + + + + +Si el despliegue falla: + +1. **Verifica los Secrets** + + - Asegúrate de que todos los secretos requeridos estén configurados + - Confirma que los nombres de los secretos coincidan exactamente + - Revisa que los tokens no hayan expirado + +2. 
**Revisa los Logs del Workflow** + + - Examina los logs de GitHub Actions para errores específicos + - Busca problemas de autenticación o permisos + - Verifica que los comandos CLI se ejecuten correctamente + +3. **Valida la Configuración de SleakOps** + + - Confirma que tu proyecto SleakOps esté correctamente configurado + - Asegúrate de que el CLI tenga los permisos necesarios + - Verifica que el ID del proyecto sea correcto + +4. **Prueba el CLI Localmente** + ```bash + # Instala localmente el CLI de SleakOps para pruebas + npm install -g @sleakops/cli + sleakops deploy --help + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 17 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-multi-project-deployment.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-multi-project-deployment.mdx new file mode 100644 index 000000000..dad1f335a --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-multi-project-deployment.mdx @@ -0,0 +1,753 @@ +--- +sidebar_position: 3 +title: "Despliegue Multi-Proyecto con GitHub Actions" +description: "Configura GitHub Actions para desplegar automáticamente múltiples proyectos SleakOps" +date: "2025-01-15" +category: "proyecto" +tags: + ["github-actions", "ci-cd", "despliegue", "automatización", "sleakops-cli"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegue Multi-Proyecto con GitHub Actions + +**Fecha:** 15 de enero de 2025 +**Categoría:** Proyecto +**Etiquetas:** GitHub Actions, CI/CD, Despliegue, Automatización, SleakOps CLI + +## Descripción del Problema + +**Contexto:** El usuario necesita configurar GitHub Actions para construir y desplegar automáticamente múltiples proyectos SleakOps al hacer push en la rama de producción, específicamente desea desplegar simultáneamente los 
proyectos `simplee-drf-aws-cl` y `simplee-drf-aws-mx`. + +**Síntomas Observados:** + +- La pipeline actual de CI/CD solo despliega un proyecto +- Necesidad de activar despliegues de múltiples proyectos desde un solo push +- Requisito de duplicar comandos de build y deploy para proyectos adicionales +- Confusión con el parámetro de entorno (usando `-e main` en lugar de `-e dev`) + +**Configuración Relevante:** + +- Plataforma: GitHub Actions +- Comandos SleakOps CLI: `sleakops build` y `sleakops deploy` +- Proyectos: `simplee-drf-aws-cl`, `simplee-drf-aws-mx` +- Rama disparadora: rama `prod` +- Entorno: `dev` (no `main`) + +**Condiciones de Error:** + +- Parámetro de entorno incorrecto en el comando de despliegue +- Despliegue de un solo proyecto en vez de multi-proyecto +- Necesidad de optimizar el workflow para múltiples proyectos + +## Solución Detallada + + + +Así es como configurar GitHub Actions para desplegar múltiples proyectos SleakOps: + +```yaml +name: Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Run SleakOps build for ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p ${{ matrix.project }} -b prod -w + + deploy: + needs: [build] + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Run SleakOps deploy for ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p ${{ matrix.project }} -e dev -w +``` + + + + + +Si prefieres un despliegue secuencial (un proyecto tras otro), usa esta configuración: + +```yaml +name: 
Sequential Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build-and-deploy-cl: + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build simplee-drf-aws-cl + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p simplee-drf-aws-cl -b prod -w + + - name: Deploy simplee-drf-aws-cl + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p simplee-drf-aws-cl -e dev -w + + build-and-deploy-mx: + runs-on: ubuntu-latest + needs: [build-and-deploy-cl] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build simplee-drf-aws-mx + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops build -p simplee-drf-aws-mx -b prod -w + + - name: Deploy simplee-drf-aws-mx + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: sleakops deploy -p simplee-drf-aws-mx -e dev -w +``` + + + + + +**Importante:** El parámetro de entorno en el comando deploy debe coincidir con el nombre del entorno en SleakOps, no con la rama git. + +Errores comunes: + +- ❌ `sleakops deploy -p nombre-del-proyecto -e main -w` +- ✅ `sleakops deploy -p nombre-del-proyecto -e dev -w` +- ✅ `sleakops deploy -p nombre-del-proyecto -e prod -w` + +Para verificar el nombre de tu entorno: + +1. Ve a tu consola de SleakOps +2. Navega a tu proyecto +3. Revisa la sección de entornos +4. 
Usa el nombre exacto del entorno en el parámetro `-e` + + + + + +Para entornos de producción, considera agregar manejo de errores y notificaciones: + +```yaml +name: Production Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + build: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + fail-fast: false + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + set -e + echo "Building ${{ matrix.project }}..." + sleakops build -p ${{ matrix.project }} -b prod -w + echo "Build completed for ${{ matrix.project }}" + continue-on-error: false + + - name: Deploy ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + set -e + echo "Deploying ${{ matrix.project }}..." + sleakops deploy -p ${{ matrix.project }} -e dev -w + echo "Deployment completed for ${{ matrix.project }}" + continue-on-error: false + + - name: Notify on failure + if: failure() + run: | + echo "❌ Failed to build/deploy ${{ matrix.project }}" + # Aquí puedes agregar notificaciones a Slack, Discord, etc. 
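          # Ejemplo hipotético de notificación a un webhook de Slack desde este
          # mismo paso (el formato del payload es un supuesto; el secreto
          # SLACK_WEBHOOK_URL debe existir en el repositorio):
          # curl -s -X POST -H 'Content-Type: application/json' \
          #   --data '{"text":"Fallo al desplegar ${{ matrix.project }}"}' \
          #   "${{ secrets.SLACK_WEBHOOK_URL }}"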
+``` + + + + + +Para optimizar el proceso, puedes configurar despliegues condicionales basados en los archivos modificados: + +```yaml +name: Smart Multi-Project CI/CD + +on: + push: + branches: + - prod + +jobs: + detect-changes: + runs-on: ubuntu-latest + outputs: + cl-changed: ${{ steps.changes.outputs.cl }} + mx-changed: ${{ steps.changes.outputs.mx }} + steps: + - name: Checkout repository + uses: actions/checkout@v4 + with: + fetch-depth: 2 + + - name: Detect changes + id: changes + run: | + # Detectar cambios en proyecto CL + if git diff --name-only HEAD~1 HEAD | grep -E "(cl/|shared/)" > /dev/null; then + echo "cl=true" >> $GITHUB_OUTPUT + else + echo "cl=false" >> $GITHUB_OUTPUT + fi + + # Detectar cambios en proyecto MX + if git diff --name-only HEAD~1 HEAD | grep -E "(mx/|shared/)" > /dev/null; then + echo "mx=true" >> $GITHUB_OUTPUT + else + echo "mx=false" >> $GITHUB_OUTPUT + fi + + deploy-cl: + needs: detect-changes + if: needs.detect-changes.outputs.cl-changed == 'true' + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Deploy CL project + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + sleakops build -p simplee-drf-aws-cl -b prod -w + sleakops deploy -p simplee-drf-aws-cl -e dev -w + + deploy-mx: + needs: detect-changes + if: needs.detect-changes.outputs.mx-changed == 'true' + runs-on: ubuntu-latest + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Deploy MX project + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + sleakops build -p simplee-drf-aws-mx -b prod -w + sleakops deploy -p simplee-drf-aws-mx -e dev -w +``` + + + + + +Configura despliegues para diferentes entornos basados en la rama: + +```yaml +name: Multi-Environment Multi-Project CI/CD + +on: + push: + branches: + - main # Producción + - staging # Staging + - 
develop # Desarrollo + +jobs: + setup: + runs-on: ubuntu-latest + outputs: + environment: ${{ steps.env.outputs.environment }} + projects: ${{ steps.projects.outputs.list }} + steps: + - name: Determine environment + id: env + run: | + case ${{ github.ref_name }} in + main) + echo "environment=prod" >> $GITHUB_OUTPUT + ;; + staging) + echo "environment=staging" >> $GITHUB_OUTPUT + ;; + develop) + echo "environment=dev" >> $GITHUB_OUTPUT + ;; + *) + echo "environment=dev" >> $GITHUB_OUTPUT + ;; + esac + + - name: Set project list + id: projects + run: | + echo 'list=["simplee-drf-aws-cl", "simplee-drf-aws-mx"]' >> $GITHUB_OUTPUT + + deploy: + needs: setup + runs-on: ubuntu-latest + strategy: + matrix: + project: ${{ fromJson(needs.setup.outputs.projects) }} + fail-fast: false + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Build and Deploy ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + ENVIRONMENT: ${{ needs.setup.outputs.environment }} + run: | + echo "Deploying ${{ matrix.project }} to $ENVIRONMENT environment" + sleakops build -p ${{ matrix.project }} -b ${{ github.ref_name }} -w + sleakops deploy -p ${{ matrix.project }} -e $ENVIRONMENT -w +``` + + + + + +Agrega notificaciones para mantener al equipo informado sobre los despliegues: + +```yaml +name: Multi-Project CI/CD with Notifications + +on: + push: + branches: + - prod + +jobs: + notify-start: + runs-on: ubuntu-latest + steps: + - name: Notify deployment start + uses: 8398a7/action-slack@v3 + with: + status: custom + custom_payload: | + { + text: "🚀 Starting multi-project deployment", + attachments: [{ + color: "warning", + fields: [{ + title: "Branch", + value: "${{ github.ref_name }}", + short: true + }, { + title: "Commit", + value: "${{ github.sha }}", + short: true + }] + }] + } + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} + + deploy: + runs-on: ubuntu-latest + 
strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Deploy ${{ matrix.project }} + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + sleakops build -p ${{ matrix.project }} -b prod -w + sleakops deploy -p ${{ matrix.project }} -e dev -w + + - name: Notify success + if: success() + uses: 8398a7/action-slack@v3 + with: + status: success + text: "✅ Successfully deployed ${{ matrix.project }}" + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} + + - name: Notify failure + if: failure() + uses: 8398a7/action-slack@v3 + with: + status: failure + text: "❌ Failed to deploy ${{ matrix.project }}" + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} + + notify-complete: + needs: deploy + runs-on: ubuntu-latest + if: always() + steps: + - name: Notify deployment complete + uses: 8398a7/action-slack@v3 + with: + status: custom + custom_payload: | + { + text: "🏁 Multi-project deployment completed", + attachments: [{ + color: "${{ needs.deploy.result == 'success' && 'good' || 'danger' }}", + fields: [{ + title: "Status", + value: "${{ needs.deploy.result }}", + short: true + }, { + title: "Duration", + value: "${{ github.event.head_commit.timestamp }}", + short: true + }] + }] + } + env: + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }} +``` + + + + + +Implementa una estrategia de rollback en caso de fallos: + +```yaml +name: Multi-Project CI/CD with Rollback + +on: + push: + branches: + - prod + +jobs: + deploy: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + outputs: + deployment-status: ${{ steps.deploy.outcome }} + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Store current deployment info + id: current-deployment + env: + SLEAKOPS_KEY: 
${{ secrets.SLEAKOPS_KEY }} + run: | + # Obtener información del despliegue actual + CURRENT_VERSION=$(sleakops status -p ${{ matrix.project }} -e dev --format json | jq -r '.version') + echo "current-version=$CURRENT_VERSION" >> $GITHUB_OUTPUT + echo "Current version for ${{ matrix.project }}: $CURRENT_VERSION" + + - name: Deploy new version + id: deploy + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + sleakops build -p ${{ matrix.project }} -b prod -w + sleakops deploy -p ${{ matrix.project }} -e dev -w + + - name: Health check + id: health-check + run: | + # Esperar un momento para que el despliegue se estabilice + sleep 30 + + # Realizar health check (ajusta la URL según tu configuración) + if curl -f "https://${{ matrix.project }}.yourdomain.com/health" > /dev/null 2>&1; then + echo "Health check passed for ${{ matrix.project }}" + echo "status=healthy" >> $GITHUB_OUTPUT + else + echo "Health check failed for ${{ matrix.project }}" + echo "status=unhealthy" >> $GITHUB_OUTPUT + exit 1 + fi + + - name: Rollback on failure + if: failure() + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + echo "Deployment failed, initiating rollback for ${{ matrix.project }}" + # Rollback al despliegue anterior + sleakops rollback -p ${{ matrix.project }} -e dev -v ${{ steps.current-deployment.outputs.current-version }} + + # Verificar que el rollback fue exitoso + sleep 30 + if curl -f "https://${{ matrix.project }}.yourdomain.com/health" > /dev/null 2>&1; then + echo "Rollback successful for ${{ matrix.project }}" + else + echo "Rollback failed for ${{ matrix.project }} - manual intervention required" + exit 1 + fi +``` + + + + + +Implementa monitoreo para tus despliegues multi-proyecto: + +```yaml +name: Multi-Project CI/CD with Monitoring + +on: + push: + branches: + - prod + +jobs: + deploy: + runs-on: ubuntu-latest + strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + steps: + - name: Checkout repository + uses: 
actions/checkout@v4 + + - name: Install SleakOps CLI + run: pip install sleakops + + - name: Create deployment record + id: deployment-record + run: | + DEPLOYMENT_ID="deploy-$(date +%s)-${{ matrix.project }}" + echo "deployment-id=$DEPLOYMENT_ID" >> $GITHUB_OUTPUT + + # Registrar inicio del despliegue en tu sistema de monitoreo + curl -X POST "${{ secrets.MONITORING_WEBHOOK }}" \ + -H "Content-Type: application/json" \ + -d '{ + "deployment_id": "'$DEPLOYMENT_ID'", + "project": "${{ matrix.project }}", + "status": "started", + "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'", + "commit": "${{ github.sha }}", + "branch": "${{ github.ref_name }}" + }' + + - name: Deploy project + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + sleakops build -p ${{ matrix.project }} -b prod -w + sleakops deploy -p ${{ matrix.project }} -e dev -w + + - name: Post-deployment monitoring + run: | + # Configurar alertas específicas para este despliegue + curl -X POST "${{ secrets.MONITORING_WEBHOOK }}/alerts" \ + -H "Content-Type: application/json" \ + -d '{ + "deployment_id": "${{ steps.deployment-record.outputs.deployment-id }}", + "project": "${{ matrix.project }}", + "alerts": [ + { + "type": "error_rate", + "threshold": 5, + "duration": "5m" + }, + { + "type": "response_time", + "threshold": 2000, + "duration": "5m" + } + ] + }' + + - name: Update deployment status + if: always() + run: | + STATUS="${{ job.status }}" + curl -X PUT "${{ secrets.MONITORING_WEBHOOK }}" \ + -H "Content-Type: application/json" \ + -d '{ + "deployment_id": "${{ steps.deployment-record.outputs.deployment-id }}", + "status": "'$STATUS'", + "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'" + }' + + generate-deployment-report: + needs: deploy + runs-on: ubuntu-latest + if: always() + steps: + - name: Generate deployment report + run: | + cat << EOF > deployment-report.md + # Deployment Report + + **Date:** $(date) + **Branch:** ${{ github.ref_name }} + **Commit:** ${{ github.sha }} + **Status:** ${{ 
needs.deploy.result }} + + ## Projects Deployed + - simplee-drf-aws-cl + - simplee-drf-aws-mx + + ## Links + - [CL Environment](https://simplee-drf-aws-cl.yourdomain.com) + - [MX Environment](https://simplee-drf-aws-mx.yourdomain.com) + - [Monitoring Dashboard](https://monitoring.yourdomain.com) + EOF + + - name: Upload deployment report + uses: actions/upload-artifact@v3 + with: + name: deployment-report + path: deployment-report.md +``` + + + + + +**Problema: Fallos intermitentes en la matriz de despliegue** + +```yaml +# Solución: Agregar reintentos automáticos +strategy: + matrix: + project: ["simplee-drf-aws-cl", "simplee-drf-aws-mx"] + fail-fast: false + max-parallel: 2 # Limitar despliegues paralelos + +steps: + - name: Deploy with retry + uses: nick-invision/retry@v2 + with: + timeout_minutes: 10 + max_attempts: 3 + retry_on: error + command: | + sleakops build -p ${{ matrix.project }} -b prod -w + sleakops deploy -p ${{ matrix.project }} -e dev -w +``` + +**Problema: Conflictos de recursos entre proyectos** + +```yaml +# Solución: Despliegue secuencial con dependencias +jobs: + deploy-cl: + runs-on: ubuntu-latest + steps: + # ... pasos de despliegue para CL + + deploy-mx: + needs: deploy-cl + runs-on: ubuntu-latest + steps: + # ... pasos de despliegue para MX +``` + +**Problema: Timeouts en despliegues largos** + +```yaml +# Solución: Aumentar timeouts y agregar progreso +- name: Deploy with extended timeout + timeout-minutes: 30 + env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + run: | + echo "Starting deployment of ${{ matrix.project }}..." + sleakops build -p ${{ matrix.project }} -b prod -w --timeout 1800 + echo "Build completed, starting deployment..." 
+ sleakops deploy -p ${{ matrix.project }} -e dev -w --timeout 1800 + echo "Deployment completed successfully" +``` + +**Problema: Gestión de secretos para múltiples proyectos** + +```yaml +# Solución: Usar secretos específicos por proyecto +env: + SLEAKOPS_KEY: ${{ secrets.SLEAKOPS_KEY }} + PROJECT_SPECIFIC_SECRET: ${{ secrets[format('SECRET_{0}', matrix.project)] }} +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-quota-management.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-quota-management.mdx new file mode 100644 index 000000000..ef54a3f13 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-quota-management.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 15 +title: "Gestión de Cuotas de GitHub Actions y Optimización de CI/CD" +description: "Gestión de cuotas de GitHub Actions y optimización de la eficiencia de la canalización CI/CD en SleakOps" +date: "2024-12-11" +category: "proyecto" +tags: ["github-actions", "ci-cd", "cuota", "optimización", "despliegue"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gestión de Cuotas de GitHub Actions y Optimización de CI/CD + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** GitHub Actions, CI/CD, Cuota, Optimización, Despliegue + +## Descripción del Problema + +**Contexto:** Las organizaciones que usan SleakOps con GitHub Actions para canalizaciones CI/CD pueden encontrar limitaciones de cuota, especialmente al ejecutar múltiples despliegues en entornos de desarrollo y producción. 
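Antes de revisar la configuración, puede ser útil estimar cuántos minutos consume tu patrón actual de compilaciones. El siguiente cálculo es solo un esbozo con valores hipotéticos (100 builds/mes, 14 minutos por build y la cuota de 2,000 minutos del nivel gratuito); reemplázalos por los de tu propio historial:

```shell
#!/usr/bin/env sh
# Estimación orientativa del consumo mensual de minutos de GitHub Actions
# (todos los valores son supuestos, ajústalos a tu historial real)
BUILDS_POR_MES=100        # compilaciones mensuales históricas
MINUTOS_POR_BUILD=14      # duración media de cada despliegue
CUOTA_MENSUAL=2000        # minutos incluidos (repos privados, nivel gratuito)

CONSUMO=$((BUILDS_POR_MES * MINUTOS_POR_BUILD))
echo "Consumo estimado: ${CONSUMO}/${CUOTA_MENSUAL} minutos"

# Si la duración media sube a 50 minutos, las mismas builds agotan la cuota
CONSUMO_LENTO=$((BUILDS_POR_MES * 50))
echo "Con builds de 50 min: ${CONSUMO_LENTO}/${CUOTA_MENSUAL} minutos"
```

Con la duración histórica el consumo queda dentro de la cuota, pero si cada build pasa a tardar 50 minutos las mismas 100 compilaciones superan con creces el límite, lo que explica el agotamiento anticipado a mitad de mes.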
+ +**Síntomas Observados:** + +- Cuota de GitHub Actions consumida más rápido de lo esperado +- Despliegues que tardan significativamente más de lo usual (más de 50 minutos) +- Número reducido de compilaciones exitosas comparado con meses anteriores +- Agotamiento inesperado de cuota a mitad de mes + +**Configuración Relevante:** + +- Múltiples entornos: Desarrollo y Producción +- Rendimiento histórico: ~100 compilaciones por mes anteriormente +- Problema actual: agotamiento de cuota con menos compilaciones +- Supervisión de despliegues: monitoreo manual requerido + +**Condiciones de Error:** + +- El agotamiento de cuota ocurre más temprano en el mes +- Tiempos de despliegue más largos consumiendo más minutos +- Despliegues sin supervisión que llevan a un uso inesperado de cuota + +## Solución Detallada + + + +GitHub ha realizado varios cambios en sus cuotas del nivel gratuito a lo largo del tiempo: + +**Límites actuales del Nivel Gratuito de GitHub (2024):** + +- **Repositorios públicos**: Minutos ilimitados +- **Repositorios privados**: 2,000 minutos/mes para cuentas gratuitas +- **Cuentas de organización**: 2,000 minutos/mes (nivel gratuito) + +**Cambios Recientes:** + +- GitHub no ha reducido significativamente las cuotas gratuitas recientemente +- Sin embargo, ha mejorado la granularidad de facturación y monitoreo +- Los patrones de uso pueden haber cambiado debido a tiempos de ejecución más largos de trabajos + +**Pasos para Verificación:** + +1. Ir a GitHub → Configuración → Facturación y planes +2. Revisar "Uso este mes" bajo Actions +3. Revisar patrones históricos de uso + + + + + +Para reducir el consumo de minutos de GitHub Actions: + +**1. 
Optimizar Procesos de Construcción:** + +```yaml +# .github/workflows/deploy.yml +name: Despliegue Optimizado +on: + push: + branches: [main, develop] + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + # Usar caché para reducir tiempo de construcción + - name: Cachear dependencias + uses: actions/cache@v3 + with: + path: ~/.npm + key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }} + + # Trabajos paralelos para diferentes entornos + - name: Construir y probar + run: | + npm ci + npm run build + npm test +``` + +**2. Despliegues Específicos por Entorno:** + +```yaml +# Flujo de trabajo separado para dev/prod +strategy: + matrix: + environment: [development, production] +steps: + - name: Desplegar a ${{ matrix.environment }} + if: | + (matrix.environment == 'development' && github.ref == 'refs/heads/develop') || + (matrix.environment == 'production' && github.ref == 'refs/heads/main') +``` + + + + + +**Configuración de Monitoreo Automatizado:** + +1. **Notificaciones en Slack/Teams:** + +```yaml +- name: Notificar inicio de despliegue + uses: 8398a7/action-slack@v3 + with: + status: custom + custom_payload: | + { + text: "🚀 Despliegue iniciado para ${{ github.repository }}", + blocks: [ + { + type: "section", + text: { + type: "mrkdwn", + text: "Entorno: ${{ matrix.environment }}\nRama: ${{ github.ref }}" + } + } + ] + } +``` + +2. **Configuración de Tiempo Máximo:** + +```yaml +jobs: + deploy: + timeout-minutes: 15 # Prevenir trabajos fuera de control + steps: + - name: Desplegar con límite de tiempo + timeout-minutes: 10 + run: | + # Tus comandos de despliegue +``` + +3. **Integración con el Panel de SleakOps:** + +- Habilitar notificaciones de despliegue en el panel de SleakOps +- Configurar alertas para despliegues prolongados +- Configurar rollback automático en caso de timeout + + + + + +**1. 
Monitoreo de Uso:**
+
+```bash
+# Verificar uso actual vía CLI de GitHub (el endpoint de facturación es por usuario)
+gh api /users/$(gh api /user --jq .login)/settings/billing/actions
+# (para una organización: gh api /orgs/<org>/settings/billing/actions)
+
+# Monitorear ejecuciones de workflow
+gh run list --limit 50 --json status,conclusion,createdAt,name
+```
+
+**2. Optimización de Costos:**
+
+- Usar runners autohospedados para compilaciones intensivas
+- Implementar despliegues condicionales
+- Cachear dependencias y artefactos de construcción
+- Usar estrategias de matriz eficientemente
+
+**3. Soluciones Alternativas:**
+
+```yaml
+# Despliegue condicional basado en cambios
+name: Despliegue Inteligente
+on:
+  push:
+    paths:
+      - "src/**"
+      - "package.json"
+      - "Dockerfile"
+
+jobs:
+  check-changes:
+    runs-on: ubuntu-latest
+    outputs:
+      should-deploy: ${{ steps.changes.outputs.src }}
+    steps:
+      # En eventos push, paths-filter necesita el checkout del repositorio
+      - uses: actions/checkout@v4
+
+      - uses: dorny/paths-filter@v2
+        id: changes
+        with:
+          filters: |
+            src:
+              - 'src/**'
+              - 'package.json'
+
+  deploy:
+    needs: check-changes
+    if: needs.check-changes.outputs.should-deploy == 'true'
+    runs-on: ubuntu-latest
+    # ... pasos de despliegue
+```
+
+
+
+
+
+**1. Configuraciones Específicas por Clúster:**
+
+- Configurar flujos de trabajo separados para QA y Producción
+- Usar variables de entorno de SleakOps para lógica condicional
+- Implementar estrategias de despliegue progresivo
+
+**2. Gestión de Recursos:**
+
+```yaml
+# Flujo de trabajo optimizado para SleakOps
+env:
+  SLEAKOPS_ENV: ${{ github.ref == 'refs/heads/main' && 'production' || 'development' }}
+
+jobs:
+  deploy:
+    steps:
+      - name: Desplegar en SleakOps
+        run: |
+          # Usar CLI de SleakOps con configuraciones optimizadas
+          sleakops deploy \
+            --environment $SLEAKOPS_ENV \
+            --timeout 300 \
+            --max-retries 2
+```
+
+**3. 
Integración de Monitoreo:** + +- Habilitar webhooks de despliegue en SleakOps +- Configurar políticas de rollback automático +- Configurar límites de recursos para pods de construcción + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-timeout-limits.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-timeout-limits.mdx new file mode 100644 index 000000000..5b5cf3657 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-actions-timeout-limits.mdx @@ -0,0 +1,207 @@ +--- +sidebar_position: 3 +title: "Límites de Tiempo de Espera y Mensuales de GitHub Actions" +description: "Solución para despliegues con GitHub Actions que se agotan en tiempo y superan límites mensuales" +date: "2024-01-15" +category: "proyecto" +tags: ["github-actions", "despliegue", "timeout", "ci-cd", "límites"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Límites de Tiempo de Espera y Mensuales de GitHub Actions + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** GitHub Actions, Despliegue, Timeout, CI/CD, Límites + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan fallos en despliegues al usar GitHub Actions para CI/CD, con construcciones que toman mucho más tiempo del esperado y que potencialmente superan los límites mensuales de uso de GitHub. 
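Una forma sencilla de acotar el daño de un build descontrolado es derivar el valor de `timeout-minutes` a partir de la duración normal del despliegue. Esbozo con valores supuestos (despliegue típico de 14 minutos y factor de seguridad 2; usa los números de tu propio pipeline):

```shell
#!/usr/bin/env sh
# Deriva un timeout razonable a partir de la duración normal del despliegue
# (valores hipotéticos)
DURACION_NORMAL_MIN=14    # despliegue típico: 10-14 minutos
FACTOR_SEGURIDAD=2        # margen antes de considerar el build "descontrolado"

TIMEOUT_MIN=$((DURACION_NORMAL_MIN * FACTOR_SEGURIDAD))
echo "timeout-minutes sugerido: ${TIMEOUT_MIN}"
```

Con ese límite en el campo `timeout-minutes` del job, un build de más de 45 minutos se cancelaría mucho antes en lugar de seguir consumiendo cuota.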
+
+**Síntomas Observados:**
+
+- Despliegues que normalmente tardan 10-14 minutos están ejecutándose por más de 45 minutos
+- Trabajos de GitHub Actions fallando después de 8 segundos tras un push
+- Múltiples despliegues quedándose en estado "en ejecución"
+- Fallos en pipeline CI/CD sin mensajes de error claros
+
+**Configuración Relevante:**
+
+- Plataforma: Integración de GitHub Actions con SleakOps
+- Tiempo normal de despliegue: 10-14 minutos
+- Tiempo observado de despliegue: más de 45 minutos
+- Tipo de proyecto: despliegue de aplicación web
+
+**Condiciones de Error:**
+
+- Error ocurre inmediatamente después del git push (en menos de 8 segundos)
+- Múltiples despliegues ejecutándose simultáneamente
+- Límites mensuales de GitHub Actions potencialmente superados
+- Configuración de CI.yml parece correcta
+
+## Solución Detallada
+
+
+
+Para verificar si has superado los límites de GitHub Actions:
+
+1. Ve a la **Configuración de tu Organización en GitHub**
+2. Navega a **Configuración** → **Facturación**
+3. Revisa la sección **Actions & Packages**
+4. Consulta:
+   - Minutos usados este mes
+   - Minutos disponibles restantes
+   - Límites del plan actual
+
+**Límites del Plan Gratuito:**
+
+- 2,000 minutos/mes para repositorios privados
+- Ilimitado para repositorios públicos
+
+**Límites del Plan de Pago:**
+
+- Varía según el plan (Pro: 3,000 min/mes, Team: 3,000 min/mes, Enterprise: 50,000 min/mes)
+
+
+
+
+
+En lugar de depender de GitHub Actions, puedes usar el sistema nativo de compilación y despliegue de SleakOps:
+
+1. **Accede al panel de SleakOps**
+2. Navega a tu proyecto
+3. Ve a la sección **Build & Deploy**
+4. 
Configura el **Despliegue Directo** desde SleakOps:
+
+```yaml
+# Ejemplo de configuración de despliegue en SleakOps
+build:
+  strategy: "sleakops-native"
+  timeout: "15m"
+
+deploy:
+  auto_deploy: true
+  branch: "dev"
+  environment: "development"
+```
+
+**Beneficios:**
+
+- No consume minutos de GitHub Actions
+- Tiempos de compilación más confiables
+- Mejor integración con funciones de SleakOps
+- Registros detallados de compilación y monitoreo
+
+
+
+
+
+Si prefieres continuar usando GitHub Actions, optimiza tu flujo de trabajo:
+
+```yaml
+# .github/workflows/ci.yml
+name: Pipeline CI/CD
+
+on:
+  push:
+    branches: [dev, main]
+
+jobs:
+  build-and-deploy:
+    runs-on: ubuntu-latest
+    timeout-minutes: 20 # Establecer tiempo de espera explícito
+
+    steps:
+      - uses: actions/checkout@v3
+
+      # Usar caché para reducir tiempo de compilación
+      - name: Cachear dependencias
+        uses: actions/cache@v3
+        with:
+          path: ~/.npm
+          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
+
+      # Optimizar compilaciones Docker
+      - name: Configurar Docker Buildx
+        uses: docker/setup-buildx-action@v2
+
+      - name: Compilar y subir
+        uses: docker/build-push-action@v3
+        with:
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
+```
+
+
+
+
+
+Para prevenir problemas futuros, implementa monitoreo:
+
+**1. Monitoreo de GitHub Actions:**
+
+```yaml
+# Añadir a tu flujo de trabajo: el contexto de expresiones de Actions no
+# expone la duración del trabajo, así que fija timeout-minutes en el job y
+# notifica cuando falle (incluidos los trabajos cancelados por exceder el límite)
+- name: Notificar compilación fallida o demasiado larga
+  if: failure()
+  uses: 8398a7/action-slack@v3
+  with:
+    status: custom
+    custom_payload: |
+      {
+        text: "Compilación fallida o más larga de lo esperado en ${{ github.repository }}"
+      }
+  env:
+    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+```
+
+**2. Monitoreo en SleakOps:**
+
+- Habilita alertas de tiempo de compilación en el panel de SleakOps
+- Configura notificaciones para compilaciones que excedan 15 minutos
+- Monitorea tendencias en uso de recursos
+
+**3. 
Seguimiento de uso:** + +- Revisión semanal del uso de GitHub Actions +- Configura alertas al acercarte al 80% del límite mensual +- Considera actualizar tu plan de GitHub si alcanzas consistentemente los límites + + + + + +Para resolución inmediata: + +**1. Cancelar acciones en ejecución:** + +```bash +# Usando GitHub CLI +gh run list --status in_progress +gh run cancel [RUN_ID] +``` + +**2. Revisar estado del flujo de trabajo:** + +- Ve a la pestaña **Actions** en tu repositorio +- Cancela cualquier flujo atascado +- Revisa los registros para mensajes de error específicos + +**3. Solución temporal:** + +- Deshabilita temporalmente GitHub Actions +- Usa despliegue manual con SleakOps +- Reactiva cuando se reinicien los límites (mensualmente) + +**4. Despliegue de emergencia:** +Si necesitas un despliegue inmediato: + +1. Accede al panel de SleakOps +2. Ve a proyecto → **Despliegue Manual** +3. Selecciona rama y entorno +4. Despliega directamente desde SleakOps + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-automatic-deployments-failing.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-automatic-deployments-failing.mdx new file mode 100644 index 000000000..405712877 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/github-automatic-deployments-failing.mdx @@ -0,0 +1,177 @@ +--- +sidebar_position: 3 +title: "Despliegues Automáticos de GitHub No Funcionan" +description: "Solución para despliegues automáticos desde GitHub que han dejado de funcionar" +date: "2024-01-15" +category: "proyecto" +tags: + ["github", "despliegue", "automatización", "ci-cd", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Despliegues Automáticos de GitHub No Funcionan + +**Fecha:** 15 de enero de 2024 
+**Categoría:** Proyecto +**Etiquetas:** GitHub, Despliegue, Automatización, CI/CD, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los despliegues automáticos desde repositorios de GitHub han dejado de funcionar en la plataforma SleakOps, impidiendo que los cambios de código se desplieguen automáticamente en el entorno objetivo. + +**Síntomas Observados:** + +- Los despliegues automáticos desde GitHub no se activan +- Aparecen mensajes de error en los registros de despliegue +- Los cambios de código enviados al repositorio no se reflejan en la aplicación desplegada +- El problema ha ocurrido durante varios días + +**Configuración Relevante:** + +- Plataforma: SleakOps +- Origen: Repositorio de GitHub +- Tipo de despliegue: Despliegues automáticos +- Duración: Varios días de fallo + +**Condiciones de Error:** + +- Los despliegues fallan consistentemente durante varios días +- El error ocurre después de enviar código a GitHub +- La canalización de despliegue automático no se ejecuta +- Puede requerirse intervención manual + +## Solución Detallada + + + +La causa más común de fallos en despliegues automáticos son problemas con la configuración del webhook: + +1. **Verificar estado del webhook en GitHub:** + + - Ve a tu repositorio → Configuración → Webhooks + - Comprueba si el webhook de SleakOps está activo + - Revisa si hay fallos recientes en las entregas + +2. **Comprobar URL del webhook:** + + - Asegúrate que la URL del webhook apunta al endpoint correcto de SleakOps + - Verifica el formato de la URL: `https://api.sleakops.com/webhooks/github/{project-id}` + +3. **Validar el secreto del webhook:** + - Confirma que el secreto del webhook coincide con la configuración de tu proyecto en SleakOps + - Si tienes dudas, regenera el webhook en la configuración de SleakOps + + + + + +Verifica si los tokens de acceso o permisos de GitHub han expirado: + +1. 
**Permisos de la App de GitHub:** + + - Ve a SleakOps → Configuración del Proyecto → Integración con GitHub + - Verifica que la App de GitHub sigue conectada + - Comprueba que los permisos incluyen acceso al repositorio + +2. **Token de Acceso Personal (si se usa):** + + - Verifica que el token no haya expirado + - Asegúrate que el token tenga los alcances necesarios: `repo`, `workflow` + - Regenera el token si es necesario + +3. **Acceso al repositorio:** + - Confirma que SleakOps aún tiene acceso al repositorio + - Revisa si el repositorio fue movido o renombrado + + + + + +Examina los registros de despliegue para mensajes de error específicos: + +1. **Acceder a los registros de despliegue:** + + - Ve al Panel de SleakOps → Tu Proyecto → Despliegues + - Haz clic en los intentos de despliegue fallidos + - Revisa mensajes de error y trazas de pila + +2. **Patrones comunes de error:** + + - `Authentication failed`: Problemas con token o webhook + - `Repository not found`: Problemas de acceso o nombres + - `Build failed`: Problemas de compilación o dependencias + - `Timeout`: Problemas de recursos o conectividad de red + +3. **Configuración de la canalización:** + - Verifica que la configuración de la canalización de despliegue sea correcta + - Revisa si se hicieron cambios recientes en la configuración de despliegue + + + + + +Intenta activar un despliegue manual para aislar el problema: + +1. **Prueba de despliegue manual:** + + - Ve a SleakOps → Tu Proyecto → Despliegues + - Haz clic en "Desplegar Ahora" o "Despliegue Manual" + - Selecciona la rama que deseas desplegar + +2. **Si el despliegue manual funciona:** + + - El problema está específicamente en los disparadores automáticos + - Enfócate en la solución de problemas del webhook y la integración con GitHub + +3. 
**Si el despliegue manual falla:** + - El problema está en el proceso de despliegue mismo + - Revisa configuración de compilación, dependencias y asignación de recursos + + + + + +Si otras soluciones no funcionan, intenta restablecer la integración con GitHub: + +1. **Desconectar y reconectar:** + + - Ve a SleakOps → Configuración del Proyecto → Integraciones + - Desconecta la integración con GitHub + - Reconéctala y vuelve a autorizar el acceso + +2. **Reconfigurar webhook:** + + - Elimina el webhook existente del repositorio de GitHub + - Deja que SleakOps recree el webhook automáticamente + - Prueba con un nuevo commit + +3. **Verificar configuración:** + - Asegúrate que las configuraciones de ramas sean correctas + - Confirma que los disparadores de despliegue estén habilitados + - Revisa si hay reglas o condiciones de despliegue que bloqueen la ejecución + + + + + +Contacta al soporte de SleakOps si: + +- Se han intentado todos los pasos de solución de problemas +- Los despliegues manuales también fallan +- Los mensajes de error son poco claros o relacionados con el sistema +- El problema afecta a múltiples proyectos + +**Información para proporcionar:** + +- Nombre e ID del proyecto +- URL del repositorio de GitHub +- Capturas de pantalla de los mensajes de error +- Cronología de cuándo comenzó el problema +- Cambios recientes realizados en el proyecto o repositorio + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gitlab-self-hosted-redirect-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gitlab-self-hosted-redirect-issue.mdx new file mode 100644 index 000000000..4403dda8c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gitlab-self-hosted-redirect-issue.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "Problema de Redirección en GitLab Self-Hosted" 
+description: "Solución para problemas de conexión con repositorios GitLab self-hosted con bucles de redirección" +date: "2025-02-25" +category: "proyecto" +tags: ["gitlab", "self-hosted", "git", "conexión", "redirección"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Redirección en GitLab Self-Hosted + +**Fecha:** 25 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** GitLab, Self-hosted, Git, Conexión, Redirección + +## Descripción del Problema + +**Contexto:** El usuario intenta conectar un repositorio GitLab self-hosted a SleakOps pero encuentra redirecciones continuas a la página de GitLab en lugar de una autenticación exitosa. + +**Síntomas Observados:** + +- No puede conectar con el repositorio GitLab self-hosted +- Redirección continua a la página de GitLab durante la autenticación +- El proceso de conexión no se completa +- Imposibilidad de acceder al repositorio a través de la plataforma SleakOps + +**Configuración Relevante:** + +- Tipo de repositorio: GitLab self-hosted +- Método de conexión: Configuración → Autorizaciones +- Plataforma: SleakOps +- Flujo de autenticación: OAuth / basado en redirección + +**Condiciones del Error:** + +- Ocurre un bucle de redirección durante la autenticación +- El problema persiste tras múltiples intentos de conexión +- El problema es específico de instancias GitLab self-hosted +- Las conexiones estándar a GitLab.com pueden funcionar normalmente + +## Solución Detallada + + + +Antes de solucionar la conexión, asegúrate de que tu instancia de GitLab esté configurada correctamente: + +1. **Revisar configuración de la aplicación OAuth en GitLab**: + + - Ve a tu instancia de GitLab: `Área de Administración → Aplicaciones` + - Verifica que exista la aplicación OAuth para SleakOps + - Confirma que la URI de redirección coincida con la URL de callback de SleakOps + +2. 
**Verificar accesibilidad de GitLab**: + - Asegúrate de que tu instancia GitLab sea accesible desde redes externas + - Comprueba que HTTPS esté configurado correctamente + - Verifica que los certificados SSL sean válidos + + + + + +Sigue estos pasos para conectar tu GitLab self-hosted: + +1. **En la plataforma SleakOps**: + + - Navega a **Configuración → Autorizaciones** + - Haz clic en **Agregar proveedor Git** + - Selecciona **GitLab Self-Hosted** + +2. **Configura la instancia GitLab**: + + ``` + URL de GitLab: https://tu-instancia-gitlab.com + ID de la Aplicación: [desde la app OAuth de GitLab] + Secreto de la Aplicación: [desde la app OAuth de GitLab] + ``` + +3. **Completa el flujo OAuth**: + - Haz clic en **Conectar** + - Serás redirigido a tu instancia GitLab + - Autoriza la aplicación SleakOps + - Deberías ser redirigido de vuelta a SleakOps + + + + + +Si experimentas bucles de redirección, revisa estos problemas comunes: + +**1. URI de redirección incorrecta**: + +- En la configuración de la app OAuth de GitLab, asegúrate que la URI de redirección sea: + ``` + https://app.sleakops.com/auth/gitlab/callback + ``` + +**2. Conectividad de red**: + +- Verifica que SleakOps pueda alcanzar tu instancia GitLab +- Revisa reglas de firewall y políticas de red +- Asegúrate que ningún proxy interfiera en la conexión + +**3. Configuración de la instancia GitLab**: + +- Verifica el `external_url` en la configuración de GitLab +- Revisa si GitLab está detrás de un proxy inverso +- Asegura una configuración HTTPS adecuada + + + + + +Si la aplicación OAuth no existe, créala: + +1. **En el Área de Administración de GitLab**: + + - Ve a **Área de Administración → Aplicaciones** + - Haz clic en **Nueva Aplicación** + +2. **Configuración de la aplicación**: + + ``` + Nombre: SleakOps + URI de redirección: https://app.sleakops.com/auth/gitlab/callback + Alcances: + ✓ api + ✓ read_user + ✓ read_repository + ✓ write_repository + ``` + +3. 
**Guardar y anotar credenciales**: + - Copia el ID de la Aplicación + - Copia el Secreto + - Usa estos datos en la configuración de SleakOps + + + + + +Si OAuth sigue fallando, prueba estas alternativas: + +**1. Token de Acceso Personal**: + +- Crea un Token de Acceso Personal en GitLab +- Usa autenticación basada en token en lugar de OAuth +- Configura en SleakOps con las credenciales del token + +**2. Autenticación con clave SSH**: + +- Genera un par de claves SSH en SleakOps +- Añade la clave pública al usuario/proyecto en GitLab +- Usa URLs de repositorio basadas en SSH + +**3. Contactar Soporte**: +Si los problemas persisten, contacta al soporte de SleakOps con: + +- URL de la instancia GitLab +- Detalles de configuración de red +- Registros de error de SleakOps y GitLab + + + + + +Tras la configuración, verifica que la conexión funcione: + +1. **Probar acceso al repositorio**: + + - Intenta navegar los repositorios en SleakOps + - Intenta crear un nuevo proyecto desde el repositorio GitLab + - Verifica que los webhooks estén configurados correctamente + +2. 
**Revisar permisos**: + - Asegura que SleakOps pueda leer el contenido del repositorio + - Verifica permisos de escritura para operaciones CI/CD + - Prueba la entrega de webhooks + + + +--- + +_Este FAQ fue generado automáticamente el 25 de febrero de 2025 basado en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/global-variables-update-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/global-variables-update-error.mdx new file mode 100644 index 000000000..e676c0b53 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/global-variables-update-error.mdx @@ -0,0 +1,753 @@ +--- +sidebar_position: 3 +title: "Error de Actualización de Variables Globales en SleakOps" +description: "Solución para errores al actualizar variables globales en la plataforma SleakOps" +date: "2024-03-27" +category: "proyecto" +tags: ["variables", "configuración", "error", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Actualización de Variables Globales en SleakOps + +**Fecha:** 27 de marzo de 2024 +**Categoría:** Proyecto +**Etiquetas:** Variables, Configuración, Error, Solución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario encuentra errores al intentar actualizar variables globales a través de la interfaz de la plataforma SleakOps, impidiendo que los cambios de configuración se apliquen. 
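Antes de tocar nada a mano, conviene identificar exactamente qué variables difieren entre lo que corre en el clúster y lo que se intentó guardar. Un boceto mínimo en shell con listas hipotéticas; en un caso real, `actuales` vendría de `kubectl exec <pod> -n <namespace> -- env | sort`:

```shell
#!/bin/sh
# Boceto: comparar las variables actuales del pod con las deseadas
# antes de editar un Secret a mano. Las dos listas son hipotéticas.
actuales='APP_ENV=production
LOG_LEVEL=info'
deseadas='APP_ENV=production
LOG_LEVEL=debug'

printf '%s\n' "$actuales" | sort > /tmp/vars_actuales.txt
printf '%s\n' "$deseadas" | sort > /tmp/vars_deseadas.txt

# diff sale con 0 si no hay cambios; con 1 si hay diferencias
if diff /tmp/vars_actuales.txt /tmp/vars_deseadas.txt > /tmp/vars_diff.txt; then
  echo "sin-cambios"
else
  echo "hay-diferencias:"
  cat /tmp/vars_diff.txt
fi
```

Así se aísla la variable concreta (aquí `LOG_LEVEL`) cuya actualización la interfaz no llegó a aplicar.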
+ +**Síntomas Observados:** + +- Aparece un mensaje de error al intentar modificar variables globales +- No se pueden guardar los cambios de configuración a través de la interfaz +- El proceso de actualización de variables falla sin detalles específicos del error +- Incapacidad para modificar configuraciones existentes + +**Configuración Relevante:** + +- Plataforma: Interfaz web de SleakOps +- Funcionalidad: Gestión de Variables Globales +- Acción: Actualización de valores de variables existentes +- Tipo de error: Error genérico durante la operación de guardado + +**Condiciones del Error:** + +- El error ocurre al guardar los cambios de variables +- Afecta modificaciones de variables globales +- Impide que se apliquen las actualizaciones de configuración +- El problema persiste tras múltiples intentos + +## Solución Detallada + + + +Mientras se investiga el problema en la plataforma, puedes actualizar las variables directamente en el clúster: + +**Usando Lens (IDE de Kubernetes):** + +1. **Conéctate a tu clúster** mediante Lens +2. **Navega a Secrets** en la barra lateral izquierda +3. **Encuentra el secreto relevante** que contiene tus variables +4. **Edita el secreto** haciendo clic en el botón de editar +5. **Actualiza los valores de las variables** en la sección de datos +6. **Guarda los cambios** +7. 
**Reinicia el despliegue** para aplicar las nuevas variables + +```bash +# Alternativa: Usando kubectl +kubectl edit secret <nombre-secreto> -n <namespace> + +# Después de editar, reinicia el despliegue +kubectl rollout restart deployment <nombre-deployment> -n <namespace> +``` + +**Ventajas de usar Lens:** + +- Interfaz gráfica intuitiva +- Validación automática de YAML +- Historial de cambios visible +- Capacidad de ver el estado en tiempo real +- Menos propenso a errores de sintaxis + + + + + +Si prefieres usar kubectl directamente: + +**Paso 1: Identificar el secreto** + +```bash +# Lista los secretos en tu namespace +kubectl get secrets -n <namespace> + +# Busca secretos con el nombre de tu aplicación o variables de entorno +kubectl describe secret <nombre-secreto> -n <namespace> + +# Ver el valor de una clave del secreto (decodificado) +kubectl get secret <nombre-secreto> -n <namespace> -o jsonpath='{.data.DATABASE_URL}' | base64 -d +``` + +**Paso 2: Actualizar el secreto** + +```bash +# Edita el secreto directamente +kubectl edit secret <nombre-secreto> -n <namespace> + +# O aplica un parche a valores específicos +kubectl patch secret <nombre-secreto> -n <namespace> --type='json' -p='[ + { + "op": "replace", + "path": "/data/DATABASE_URL", + "value": "'$(echo -n "nuevo-valor" | base64)'" + } +]' + +# Crear un nuevo secreto desde línea de comandos +kubectl create secret generic <nombre-secreto> \ + --from-literal=API_KEY=tu-api-key \ + --from-literal=DATABASE_URL=postgres://user:pass@host:5432/db \ + -n <namespace> \ + --dry-run=client -o yaml | kubectl apply -f - +``` + +**Paso 3: Verificar la aplicación de cambios** + +```bash +# Verificar que los cambios se aplicaron +kubectl get secret <nombre-secreto> -n <namespace> -o yaml + +# Reiniciar el deployment para aplicar las nuevas variables +kubectl rollout restart deployment <nombre-deployment> -n <namespace> + +# Verificar el estado del rollout +kubectl rollout status deployment <nombre-deployment> -n <namespace> + +# Verificar las variables de entorno en el pod +kubectl exec -it <nombre-pod> -n <namespace> -- env | grep <VARIABLE> +``` + + + + + +**1. 
Problemas de codificación base64:** + +Las variables en Kubernetes secrets deben estar codificadas en base64: + +```bash +# Codificar correctamente +echo -n "mi-valor-secreto" | base64 +# Output: bWktdmFsb3Itc2VjcmV0bw== + +# Decodificar para verificar +echo "bWktdmFsb3Itc2VjcmV0bw==" | base64 -d +# Output: mi-valor-secreto +``` + +**2. Caracteres especiales en valores:** + +```bash +# Para valores con caracteres especiales, usar comillas simples +kubectl create secret generic mi-secreto \ + --from-literal='PASSWORD=password$with&special#chars' \ + --dry-run=client -o yaml + +# O escapar apropiadamente +kubectl create secret generic mi-secreto \ + --from-literal="PASSWORD=password\$with\&special\#chars" \ + --dry-run=client -o yaml +``` + +**3. Problemas de permisos RBAC:** + +```yaml +# Verificar permisos del usuario +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: secret-manager + namespace: mi-namespace +rules: +- apiGroups: [""] + resources: ["secrets"] + verbs: ["get", "list", "create", "update", "patch", "delete"] + +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: secret-manager-binding + namespace: mi-namespace +subjects: +- kind: User + name: mi-usuario + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: Role + name: secret-manager + apiGroup: rbac.authorization.k8s.io +``` + +**4. Formato incorrecto de YAML:** + +```yaml +# ❌ Incorrecto +apiVersion: v1 +kind: Secret +metadata: + name: mi-secreto +data: + DATABASE_URL: postgres://user:pass@host:5432/db # No está en base64 + +# ✅ Correcto +apiVersion: v1 +kind: Secret +metadata: + name: mi-secreto +data: + DATABASE_URL: cG9zdGdyZXM6Ly91c2VyOnBhc3NAaG9zdDo1NDMyL2Ri # En base64 +``` + + + + + +**1. 
Usar ConfigMaps para configuración no sensible:** + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config + namespace: mi-namespace +data: + # Configuración no sensible (sin codificar) + APP_ENV: "production" + LOG_LEVEL: "info" + API_VERSION: "v1" + DATABASE_POOL_SIZE: "10" + CACHE_TTL: "3600" +``` + +```bash +# Crear ConfigMap desde línea de comandos +kubectl create configmap app-config \ + --from-literal=APP_ENV=production \ + --from-literal=LOG_LEVEL=info \ + -n mi-namespace + +# Actualizar ConfigMap +kubectl patch configmap app-config -n mi-namespace --type merge -p '{ + "data": { + "LOG_LEVEL": "debug", + "NEW_CONFIG": "new-value" + } +}' +``` + +**2. Usar archivos de configuración:** + +```bash +# Crear ConfigMap desde archivo +echo "database: + host: db.example.com + port: 5432 + name: myapp +redis: + host: redis.example.com + port: 6379" > config.yaml + +kubectl create configmap app-config --from-file=config.yaml -n mi-namespace +``` + +**3. Combinando Secrets y ConfigMaps en deployments:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: mi-app +spec: + template: + spec: + containers: + - name: app + image: mi-app:latest + env: + # Variables desde ConfigMap + - name: APP_ENV + valueFrom: + configMapKeyRef: + name: app-config + key: APP_ENV + # Variables desde Secret + - name: DATABASE_PASSWORD + valueFrom: + secretKeyRef: + name: app-secrets + key: DATABASE_PASSWORD + # Cargar todas las variables de un ConfigMap + envFrom: + - configMapRef: + name: app-config + # Cargar todas las variables de un Secret + - secretRef: + name: app-secrets + # Montar archivos de configuración + volumeMounts: + - name: config-volume + mountPath: /etc/config + volumes: + - name: config-volume + configMap: + name: app-config +``` + + + + + +**1. 
Verificar que los pods recojan las nuevas variables:** + +```bash +# Verificar el estado del deployment +kubectl get deployment mi-app -n mi-namespace -o wide + +# Verificar que se inició un nuevo rollout +kubectl rollout status deployment mi-app -n mi-namespace + +# Verificar los pods en ejecución +kubectl get pods -n mi-namespace -l app=mi-app + +# Verificar las variables de entorno en un pod específico +kubectl exec mi-app-pod-xxx -n mi-namespace -- env | grep DATABASE_URL + +# Verificar logs del pod para errores de configuración +kubectl logs mi-app-pod-xxx -n mi-namespace +``` + +**2. Forzar una actualización del deployment:** + +```bash +# Método 1: Reiniciar el deployment +kubectl rollout restart deployment mi-app -n mi-namespace + +# Método 2: Añadir una anotación para forzar update +kubectl patch deployment mi-app -n mi-namespace -p \ + '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"'$(date +%Y-%m-%dT%H:%M:%S%z)'"}}}}}' + +# Método 3: Cambiar la imagen (si es necesario) +kubectl set image deployment/mi-app app=mi-app:v1.1 -n mi-namespace +``` + +**3. Verificar el historial de rollout:** + +```bash +# Ver historial de rollouts +kubectl rollout history deployment mi-app -n mi-namespace + +# Ver detalles de una revisión específica +kubectl rollout history deployment mi-app -n mi-namespace --revision=2 + +# Rollback a una versión anterior si hay problemas +kubectl rollout undo deployment mi-app -n mi-namespace --to-revision=1 +``` + +**4. 
Debugging avanzado:** + +```bash +# Verificar eventos del namespace +kubectl get events -n mi-namespace --sort-by=.metadata.creationTimestamp + +# Describir el deployment para ver errores +kubectl describe deployment mi-app -n mi-namespace + +# Describir un pod específico +kubectl describe pod mi-app-pod-xxx -n mi-namespace + +# Verificar el estado de los ReplicaSets +kubectl get rs -n mi-namespace -l app=mi-app + +# Verificar recursos y limits +kubectl top pods -n mi-namespace +``` + + + + + +**1. Script de validación de variables:** + +```bash +#!/bin/bash +# Script para validar que las variables están configuradas correctamente + +NAMESPACE="mi-namespace" +APP_NAME="mi-app" +SECRET_NAME="app-secrets" +CONFIGMAP_NAME="app-config" + +echo "=== Validación de Configuración ===" + +# Verificar que el Secret existe +echo "1. Verificando Secret..." +if kubectl get secret $SECRET_NAME -n $NAMESPACE >/dev/null 2>&1; then + echo "✅ Secret $SECRET_NAME existe" + echo "Variables en Secret:" + kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.data}' | jq -r 'keys[]' | sed 's/^/ - /' +else + echo "❌ Secret $SECRET_NAME no encontrado" +fi + +# Verificar que el ConfigMap existe +echo -e "\n2. Verificando ConfigMap..." +if kubectl get configmap $CONFIGMAP_NAME -n $NAMESPACE >/dev/null 2>&1; then + echo "✅ ConfigMap $CONFIGMAP_NAME existe" + echo "Variables en ConfigMap:" + kubectl get configmap $CONFIGMAP_NAME -n $NAMESPACE -o jsonpath='{.data}' | jq -r 'keys[]' | sed 's/^/ - /' +else + echo "❌ ConfigMap $CONFIGMAP_NAME no encontrado" +fi + +# Verificar que el deployment está actualizado +echo -e "\n3. Verificando Deployment..." +DEPLOYMENT_STATUS=$(kubectl get deployment $APP_NAME -n $NAMESPACE -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}') +if [ "$DEPLOYMENT_STATUS" = "True" ]; then + echo "✅ Deployment $APP_NAME está actualizado" +else + echo "❌ Deployment $APP_NAME tiene problemas" +fi + +# Verificar variables en pods +echo -e "\n4. 
Verificando variables en pods..." +POD_NAME=$(kubectl get pods -n $NAMESPACE -l app=$APP_NAME -o jsonpath='{.items[0].metadata.name}') +if [ ! -z "$POD_NAME" ]; then + echo "Verificando pod: $POD_NAME" + + # Verificar algunas variables clave + for var in DATABASE_URL API_KEY APP_ENV; do + if kubectl exec $POD_NAME -n $NAMESPACE -- env | grep -q "^$var="; then + echo " ✅ $var está configurada" + else + echo " ❌ $var no encontrada" + fi + done +else + echo "❌ No se encontraron pods ejecutándose" +fi + +echo -e "\n=== Validación Completada ===" +``` + +**2. Pruebas de conectividad y configuración:** + +```bash +#!/bin/bash +# Script para probar conectividad con las nuevas configuraciones + +NAMESPACE="mi-namespace" +POD_NAME=$(kubectl get pods -n $NAMESPACE -l app=mi-app -o jsonpath='{.items[0].metadata.name}') + +echo "=== Pruebas de Conectividad ===" + +# Probar conexión a base de datos +echo "1. Probando conexión a base de datos..." +kubectl exec $POD_NAME -n $NAMESPACE -- bash -c ' + if command -v psql &> /dev/null; then + psql $DATABASE_URL -c "SELECT 1;" >/dev/null 2>&1 + if [ $? -eq 0 ]; then + echo "✅ Conexión a base de datos exitosa" + else + echo "❌ Error de conexión a base de datos" + fi + else + echo "⚠️ psql no disponible en el contenedor" + fi +' + +# Probar conexión a Redis (si aplica) +echo -e "\n2. Probando conexión a Redis..." +kubectl exec $POD_NAME -n $NAMESPACE -- bash -c ' + if command -v redis-cli &> /dev/null && [ ! -z "$REDIS_URL" ]; then + redis-cli -u $REDIS_URL ping >/dev/null 2>&1 + if [ $? -eq 0 ]; then + echo "✅ Conexión a Redis exitosa" + else + echo "❌ Error de conexión a Redis" + fi + else + echo "⚠️ Redis CLI no disponible o REDIS_URL no configurada" + fi +' + +# Probar endpoint de health check +echo -e "\n3. Probando health check..." +kubectl exec $POD_NAME -n $NAMESPACE -- bash -c ' + if command -v curl &> /dev/null; then + curl -f http://localhost:8080/health >/dev/null 2>&1 + if [ $? 
-eq 0 ]; then + echo "✅ Health check exitoso" + else + echo "❌ Health check falló" + fi + else + echo "⚠️ curl no disponible en el contenedor" + fi +' + +echo -e "\n=== Pruebas Completadas ===" +``` + + + + + +**1. Versionado de configuración:** + +```bash +# Usar labels para versionar configuraciones +kubectl label secret app-secrets version=v1.2.0 -n mi-namespace +kubectl label configmap app-config version=v1.2.0 -n mi-namespace + +# Crear backup antes de modificar +kubectl get secret app-secrets -n mi-namespace -o yaml > secret-backup-$(date +%Y%m%d-%H%M%S).yaml +kubectl get configmap app-config -n mi-namespace -o yaml > configmap-backup-$(date +%Y%m%d-%H%M%S).yaml +``` + +**2. Validación de configuración con dry-run:** + +```bash +# Validar antes de aplicar cambios +kubectl apply --dry-run=client -f new-secret.yaml +kubectl apply --dry-run=server -f new-secret.yaml + +# Usar kubeval para validación adicional +kubeval new-secret.yaml +``` + +**3. Automatización con scripts:** + +```bash +#!/bin/bash +# Script para actualización segura de variables + +NAMESPACE=$1 +SECRET_NAME=$2 +VARIABLE_NAME=$3 +NEW_VALUE=$4 + +if [ -z "$4" ]; then + echo "Uso: $0 " + exit 1 +fi + +echo "=== Actualización Segura de Variable ===" +echo "Namespace: $NAMESPACE" +echo "Secret: $SECRET_NAME" +echo "Variable: $VARIABLE_NAME" + +# Crear backup +echo "1. Creando backup..." +kubectl get secret $SECRET_NAME -n $NAMESPACE -o yaml > "backup-$SECRET_NAME-$(date +%Y%m%d-%H%M%S).yaml" + +# Verificar que el secret existe +if ! kubectl get secret $SECRET_NAME -n $NAMESPACE >/dev/null 2>&1; then + echo "❌ Secret $SECRET_NAME no existe en namespace $NAMESPACE" + exit 1 +fi + +# Actualizar la variable +echo "2. Actualizando variable..." +kubectl patch secret $SECRET_NAME -n $NAMESPACE --type='json' -p="[ + { + \"op\": \"replace\", + \"path\": \"/data/$VARIABLE_NAME\", + \"value\": \"$(echo -n "$NEW_VALUE" | base64 -w 0)\" + } +]" + +if [ $? 
-eq 0 ]; then + echo "✅ Variable actualizada exitosamente" + + # Reiniciar deployments que usan este secret + echo "3. Buscando deployments que usan este secret..." + DEPLOYMENTS=$(kubectl get deployments -n $NAMESPACE -o json | jq -r ".items[] | select(.spec.template.spec.containers[]?.envFrom[]?.secretRef?.name == \"$SECRET_NAME\") | .metadata.name") + + for deployment in $DEPLOYMENTS; do + echo "Reiniciando deployment: $deployment" + kubectl rollout restart deployment $deployment -n $NAMESPACE + done + + echo "✅ Actualización completada" +else + echo "❌ Error al actualizar la variable" + exit 1 +fi +``` + +**4. Monitoreo de cambios:** + +```yaml +# ServiceMonitor para monitorear cambios de configuración +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: config-changes-monitor +spec: + selector: + matchLabels: + app: config-monitor + endpoints: + - port: metrics + interval: 30s + path: /metrics + +--- +# Script de monitoreo +apiVersion: v1 +kind: ConfigMap +metadata: + name: config-monitor-script +data: + monitor.sh: | + #!/bin/bash + # Monitor para cambios de configuración + + while true; do + # Verificar última modificación de secrets + LAST_SECRET_CHANGE=$(kubectl get secrets -n $NAMESPACE -o json | jq -r '.items[] | .metadata.resourceVersion' | sort -n | tail -1) + + # Verificar última modificación de configmaps + LAST_CONFIGMAP_CHANGE=$(kubectl get configmaps -n $NAMESPACE -o json | jq -r '.items[] | .metadata.resourceVersion' | sort -n | tail -1) + + # Registrar cambios si es diferente al último check + if [ "$LAST_SECRET_CHANGE" != "$PREVIOUS_SECRET" ] || [ "$LAST_CONFIGMAP_CHANGE" != "$PREVIOUS_CONFIGMAP" ]; then + echo "$(date): Detectado cambio en configuración" + echo "Secret version: $LAST_SECRET_CHANGE" + echo "ConfigMap version: $LAST_CONFIGMAP_CHANGE" + fi + + PREVIOUS_SECRET=$LAST_SECRET_CHANGE + PREVIOUS_CONFIGMAP=$LAST_CONFIGMAP_CHANGE + + sleep 60 + done +``` + + + + + +**1. 
Rollback rápido de configuración:** + +```bash +#!/bin/bash +# Script de rollback de emergencia + +NAMESPACE=$1 +BACKUP_FILE=$2 + +if [ -z "$2" ]; then + echo "Uso: $0 " + echo "Backups disponibles:" + ls -la backup-*.yaml 2>/dev/null || echo "No se encontraron backups" + exit 1 +fi + +echo "=== ROLLBACK DE EMERGENCIA ===" +echo "Namespace: $NAMESPACE" +echo "Backup file: $BACKUP_FILE" + +# Verificar que el archivo de backup existe +if [ ! -f "$BACKUP_FILE" ]; then + echo "❌ Archivo de backup no encontrado: $BACKUP_FILE" + exit 1 +fi + +# Aplicar el backup +echo "Aplicando configuración de backup..." +kubectl apply -f "$BACKUP_FILE" -n $NAMESPACE + +if [ $? -eq 0 ]; then + echo "✅ Configuración restaurada desde backup" + + # Reiniciar todos los deployments en el namespace + echo "Reiniciando deployments..." + kubectl get deployments -n $NAMESPACE -o name | xargs -I {} kubectl rollout restart {} -n $NAMESPACE + + echo "✅ Rollback completado" +else + echo "❌ Error al aplicar el backup" + exit 1 +fi +``` + +**2. Verificación post-rollback:** + +```bash +#!/bin/bash +# Verificación después del rollback + +NAMESPACE=$1 + +echo "=== VERIFICACIÓN POST-ROLLBACK ===" + +# Verificar que todos los deployments están saludables +echo "1. Verificando estado de deployments..." +kubectl get deployments -n $NAMESPACE -o wide + +echo -e "\n2. Verificando pods..." +kubectl get pods -n $NAMESPACE + +echo -e "\n3. Verificando eventos recientes..." +kubectl get events -n $NAMESPACE --sort-by=.metadata.creationTimestamp | tail -10 + +echo -e "\n4. Verificando logs de aplicación..." +for pod in $(kubectl get pods -n $NAMESPACE -o name); do + echo "Logs de $pod:" + kubectl logs $pod -n $NAMESPACE --tail=5 2>/dev/null || echo " No se pudieron obtener logs" +done + +echo -e "\n=== VERIFICACIÓN COMPLETADA ===" +``` + +**3. 
Plan de contingencia:** + +```markdown +# Plan de Contingencia - Variables Globales + +## Escenario 1: Error en la plataforma SleakOps +**Síntomas:** No se pueden guardar variables en la interfaz +**Acción:** Usar kubectl o Lens como alternativa inmediata +**Tiempo estimado:** 5-10 minutos + +## Escenario 2: Variables incorrectas aplicadas +**Síntomas:** Aplicación no funciona después de cambios +**Acción:** Rollback usando backup más reciente +**Tiempo estimado:** 2-5 minutos + +## Escenario 3: Pods no recogen nuevas variables +**Síntomas:** Variables actualizadas pero pods usan valores antiguos +**Acción:** Reiniciar deployment manualmente +**Tiempo estimado:** 1-3 minutos + +## Escenario 4: Corrupción de configuración +**Síntomas:** Secrets o ConfigMaps corruptos +**Acción:** Restaurar desde backup y recrear recursos +**Tiempo estimado:** 10-15 minutos + +## Contactos de Emergencia: +- Equipo SleakOps: support@sleakops.com +- Administrador Kubernetes: admin@company.com +- Equipo DevOps: devops@company.com +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 27 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-404-error-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-404-error-troubleshooting.mdx new file mode 100644 index 000000000..61455d763 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-404-error-troubleshooting.mdx @@ -0,0 +1,206 @@ +--- +sidebar_position: 3 +title: "Error 404 de Grafana Tras la Instalación" +description: "Solución de problemas de errores 404 en Grafana al acceder a la URL del panel" +date: "2024-01-15" +category: "dependencia" +tags: ["grafana", "monitoreo", "404", "dns", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error 404 de Grafana Tras la 
Instalación + +**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Monitoreo, 404, DNS, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario instala con éxito OpenTelemetry y Grafana en su clúster de desarrollo a través de SleakOps, pero encuentra un error 404 al intentar acceder a la URL del panel de Grafana. + +**Síntomas Observados:** + +- La URL de Grafana muestra "No se puede encontrar esta página grafana.develop.takenos.com" (error 404) +- Todos los pods parecen estar saludables en Kubernetes Lens +- El pod de Grafana muestra registros normales sin errores +- No hay alertas ni indicadores evidentes de error en el clúster +- La resolución DNS funciona (diferente de errores "no se pudo encontrar la dirección IP del servidor") + +**Configuración Relevante:** + +- Entorno: Clúster de desarrollo +- URL: `https://grafana.develop.takenos.com/` +- Componentes: OpenTelemetry + Grafana +- Herramientas de monitoreo: Kubernetes Lens + +**Condiciones del Error:** + +- El error ocurre al acceder a la URL del panel de Grafana +- Los pods están en ejecución y saludables +- La resolución DNS es correcta (sin errores de resolución DNS) +- El problema persiste tras esperar la propagación DNS + +## Solución Detallada + + + +La diferencia clave entre los tipos de error ayuda a diagnosticar el problema: + +- **Error 404**: "No se puede encontrar esta página grafana.develop.takenos.com" - El DNS resuelve pero el servicio/ingress no está configurado correctamente +- **Error DNS**: "no se pudo encontrar la dirección IP del servidor" - La resolución DNS falla + +Como obtienes un 404, el DNS está funcionando pero hay un problema de configuración con el ingress o el servicio. 
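Esta distinción puede automatizarse. Boceto en shell: una función pura que clasifica el fallo a partir del código de salida de `curl` (6 significa que el DNS no resolvió) y del código HTTP obtenido con `-w '%{http_code}'`:

```shell
#!/bin/sh
# Clasifica el tipo de fallo: DNS vs. ingress/servicio (boceto).
clasificar() {
  curl_exit="$1"   # código de salida de curl
  http_code="$2"   # código HTTP devuelto (000 si no hubo respuesta)
  if [ "$curl_exit" -eq 6 ]; then
    echo "fallo-dns"                  # "no se pudo encontrar la dirección IP del servidor"
  elif [ "$http_code" = "404" ]; then
    echo "ingress-o-servicio-mal-configurado"
  else
    echo "otro:$http_code"
  fi
}

# En la práctica se alimentaría así:
#   http_code=$(curl -s -o /dev/null -w '%{http_code}' https://grafana.develop.takenos.com/)
#   clasificar "$?" "$http_code"
clasificar 0 404
```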
+ + + + +Revisa si el ingress de Grafana está configurado correctamente: + +```bash +# Verificar recursos ingress +kubectl get ingress -A + +# Buscar específicamente el ingress de grafana +kubectl get ingress -A | grep grafana + +# Describir el ingress de grafana +kubectl describe ingress grafana-ingress -n <namespace> +``` + +Verifica que: + +- El ingress exista y tenga el host correcto (`grafana.develop.takenos.com`) +- El ingress tenga un servicio backend válido +- El controlador de ingress esté en ejecución + + + + + +Asegúrate de que el servicio de Grafana esté configurado correctamente: + +```bash +# Listar todos los servicios +kubectl get svc -A + +# Verificar el servicio de grafana específicamente +kubectl get svc -A | grep grafana + +# Describir el servicio de grafana +kubectl describe svc grafana -n <namespace> + +# Probar conectividad al servicio +kubectl port-forward svc/grafana 3000:3000 -n <namespace> +``` + +Luego prueba localmente: `http://localhost:3000` + + + + + +Aunque los pods parecen saludables, verifica que estén correctamente conectados: + +```bash +# Verificar estado y disponibilidad de los pods +kubectl get pods -A | grep grafana + +# Revisar logs del pod para problemas de inicio +kubectl logs <nombre-pod-grafana> -n <namespace> + +# Verificar endpoints del servicio +kubectl get endpoints -A | grep grafana +``` + +Asegúrate de que los endpoints del servicio muestren las IPs de los pods. + + + + + +Verifica que el controlador ingress esté funcionando: + +```bash +# Verificar pods del controlador ingress +kubectl get pods -n ingress-nginx +# o +kubectl get pods -n kube-system | grep ingress + +# Revisar logs del controlador ingress +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller +``` + +Busca cualquier error relacionado con tu ingress de grafana. + + + + + +En la plataforma SleakOps: + +1. 
**Revisar el Estado del Add-on de Monitoreo**: + + - Accede a la configuración de tu clúster + - Verifica que el add-on de monitoreo esté completamente desplegado + - Revisa si hay advertencias o errores en el despliegue + +2. **Verificar la Configuración DNS**: + + - Asegúrate de que tu dominio esté configurado correctamente en SleakOps + - Verifica que los certificados SSL estén emitidos correctamente + +3. **Revisar Logs de Despliegue**: + - Consulta el historial de despliegue para detectar pasos fallidos + - Busca fallas en la creación del ingress o servicio + + + + + +**Solución 1: Recrear el Ingress** + +```bash +kubectl delete ingress grafana-ingress -n <namespace> +# Espera a que SleakOps lo recree o aplica manualmente la configuración correcta +``` + +**Solución 2: Verificar Configuración de Grafana** + +```bash +# Verificar si Grafana está configurado con la URL raíz correcta +kubectl get configmap grafana-config -n <namespace> -o yaml +``` + +**Solución 3: Reiniciar el Pod de Grafana** + +```bash +kubectl delete pod <nombre-pod-grafana> -n <namespace> +``` + +**Solución 4: Verificar el Enrutamiento Basado en Ruta** +Algunas configuraciones usan enrutamiento basado en rutas. Intenta acceder a: + +- `https://grafana.develop.takenos.com/grafana/` +- `https://develop.takenos.com/grafana/` + + + + + +Después de aplicar las soluciones: + +1. **Espera la propagación** (2-5 minutos) +2. **Limpia la caché del navegador** o prueba en modo incógnito +3. **Prueba la URL**: `https://grafana.develop.takenos.com/` +4. **Verifica el estado del ingress**: + ```bash + kubectl get ingress grafana-ingress -n <namespace> + ``` +5. 
**Verifica el certificado SSL** si usas HTTPS + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-datasource-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-datasource-configuration.mdx new file mode 100644 index 000000000..d0a8bafcc --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-datasource-configuration.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "Problema de Configuración de la Fuente de Datos de Grafana Loki" +description: "Solución para que la fuente de datos Loki no se agregue automáticamente a Grafana" +date: "2024-11-15" +category: "dependencia" +tags: ["grafana", "loki", "fuente de datos", "monitoreo", "logs"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Configuración de la Fuente de Datos de Grafana Loki + +**Fecha:** 15 de noviembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Loki, Fuente de datos, Monitoreo, Logs + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas con Grafana que no muestra los logs correctamente debido a que la fuente de datos Loki no se configura automáticamente durante el despliegue de SleakOps. 
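Además de revisarlo en la interfaz, la presencia de la fuente de datos puede comprobarse contra la API HTTP de Grafana (`/api/datasources`). Boceto en shell: la respuesta está simulada para que sea autocontenido, y las credenciales/host del comentario son hipotéticos:

```shell
#!/bin/sh
# Boceto: detectar si existe una fuente de datos de tipo "loki"
# en la respuesta de la API de Grafana. En un entorno real:
#   respuesta=$(curl -s -u admin:$GRAFANA_PASSWORD http://localhost:3000/api/datasources)
respuesta='[{"name":"Prometheus","type":"prometheus"},{"name":"Loki","type":"loki"}]'

if printf '%s' "$respuesta" | grep -q '"type":"loki"'; then
  echo "loki-presente"
else
  echo "loki-ausente"   # habría que agregarla manualmente
fi
```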
+ +**Síntomas Observados:** + +- El panel de Grafana carga pero no muestra datos de logs +- La fuente de datos Loki falta en la configuración de Grafana +- Las consultas de logs devuelven resultados vacíos +- El filtro de stream predeterminado es 'stderr' pero no muestra datos + +**Configuración Relevante:** + +- Stack de monitoreo: Grafana + Loki +- Filtro de stream predeterminado: `stderr` +- Componentes de Loki: pod `loki-read` +- Autenticación: se requiere usuario administrador de Grafana + +**Condiciones de Error:** + +- Ocurre durante el despliegue inicial de SleakOps +- La adición de la fuente de datos Loki falla silenciosamente +- Requiere intervención manual para resolver +- Puede necesitar reinicios de pods para funcionar correctamente + +## Solución Detallada + + + +Para comprobar si la fuente de datos Loki está configurada correctamente en Grafana: + +1. **Accede al panel de Grafana** + + - Usa las credenciales de administrador proporcionadas por SleakOps + - Navega a **Configuración** → **Fuentes de datos** + +2. **Verifica la fuente de datos Loki** + + - Busca una fuente de datos llamada "Loki" o similar + - Verifica que la URL apunte a tu servicio Loki + - Prueba la conexión usando el botón "Guardar y probar" + +3. **Formato esperado de la URL de Loki:** + ``` + http://loki-gateway.monitoring.svc.cluster.local:80 + ``` + o + ``` + http://loki-read.monitoring.svc.cluster.local:3100 + ``` + + + + + +Si la fuente de datos Loki falta, agrégala manualmente: + +1. **En Grafana, ve a Configuración → Fuentes de datos** +2. **Haz clic en "Agregar fuente de datos"** +3. **Selecciona "Loki" de la lista** +4. **Configura la fuente de datos:** + + - **Nombre:** `Loki` + - **URL:** `http://loki-gateway.monitoring.svc.cluster.local:80` + - **Acceso:** `Servidor (predeterminado)` + +5. 
**Guarda y prueba la conexión** + +```yaml +# Ejemplo de configuración de fuente de datos +apiVersion: 1 +datasources: + - name: Loki + type: loki + access: proxy + url: http://loki-gateway.monitoring.svc.cluster.local:80 + isDefault: false +``` + + + + + +Si la fuente de datos Loki existe pero no funciona correctamente: + +1. **Reinicia el pod loki-read:** + + ```bash + kubectl delete pod -l app=loki-read -n monitoring + ``` + +2. **Verifica el estado del pod:** + + ```bash + kubectl get pods -n monitoring | grep loki + ``` + +3. **Revisa los logs de Loki:** + + ```bash + kubectl logs -l app=loki-read -n monitoring + ``` + +4. **Prueba la API de Loki directamente:** + ```bash + kubectl port-forward svc/loki-gateway 3100:80 -n monitoring + curl http://localhost:3100/ready + ``` + + + + + +Una vez que la fuente de datos Loki funcione: + +1. **Entender los filtros de stream:** + + - El filtro predeterminado busca streams `stderr` + - Cambia a `stdout` para ver los logs de la aplicación + - Usa `{stream="stderr"}` o `{stream="stdout"}` en las consultas + +2. **Consultas comunes en LogQL:** + + ```logql + # Todos los logs stderr + {stream="stderr"} + + # Todos los logs stdout + {stream="stdout"} + + # Logs de un namespace específico + {namespace="default"} + + # Logs de un pod específico + {pod="my-app-12345"} + + # Filtros combinados + {namespace="default", stream="stderr"} + ``` + +3. **Ajustar el rango de tiempo:** + - Usa el selector de tiempo de Grafana + - Empieza con "Última 1 hora" para pruebas + - Amplía el rango si no aparecen logs + + + + + +**Si los logs aún no aparecen:** + +1. **Verifica si las aplicaciones están generando logs:** + + ```bash + kubectl logs -n + ``` + +2. **Confirma que Loki está recibiendo logs:** + + ```bash + # Revisa los logs del ingestor de Loki + kubectl logs -l app=loki-write -n monitoring + ``` + +3. **Reinicia Grafana si es necesario:** + + ```bash + kubectl delete pod -l app.kubernetes.io/name=grafana -n monitoring + ``` + +4. 
**Revisa los logs de Grafana:** + ```bash + kubectl logs -l app.kubernetes.io/name=grafana -n monitoring + ``` + +**Prevención para futuros despliegues:** + +- Este problema se está abordando en próximas versiones de SleakOps +- Se implementará la configuración automática de la fuente de datos +- Los pasos de verificación manual se documentarán en las guías de despliegue + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de noviembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-installation-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-installation-troubleshooting.mdx new file mode 100644 index 000000000..10bc8e205 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-installation-troubleshooting.mdx @@ -0,0 +1,193 @@ +--- +sidebar_position: 3 +title: "Problemas de Instalación de Grafana y Loki" +description: "Solución de problemas de instalación atascada de Loki y problemas de tiempo de espera en Grafana" +date: "2024-11-12" +category: "dependencia" +tags: ["grafana", "loki", "monitoreo", "complementos", "tiempo de espera"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Instalación de Grafana y Loki + +**Fecha:** 12 de noviembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Loki, Monitoreo, Complementos, Tiempo de espera + +## Descripción del Problema + +**Contexto:** El usuario instaló los complementos Grafana y Loki en su clúster de producción a través de la plataforma SleakOps pero encontró problemas de instalación y acceso. 
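El patrón general para detectar una instalación atascada es "esperar con límite de tiempo". Boceto genérico en shell; la condición real sería un chequeo con `kubectl` (el comando comentado es hipotético), y aquí se usa un comando trivial para que el ejemplo sea autocontenido:

```shell
#!/bin/sh
# esperar_hasta N PAUSA CMD...: reintenta CMD hasta N veces con PAUSA
# segundos entre intentos; imprime "listo" o "tiempo-agotado".
esperar_hasta() {
  intentos="$1"; pausa="$2"; shift 2
  i=0
  while [ "$i" -lt "$intentos" ]; do
    if "$@" > /dev/null 2>&1; then
      echo "listo"
      return 0
    fi
    i=$((i + 1))
    sleep "$pausa"
  done
  echo "tiempo-agotado"
  return 1
}

# Uso real (hipotético): declarar la instalación atascada tras ~30 minutos
#   esperar_hasta 60 30 kubectl wait --for=condition=Ready pod -l app=loki-read -n loki-system --timeout=1s
esperar_hasta 3 0 true
```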
+ +**Síntomas observados:** + +- Grafana aparece como "instalado" pero retorna tiempo de espera al acceder a la interfaz web +- La instalación de Loki parece atascada y nunca se completa +- Grafana se vuelve accesible después de reanudar la instalación atascada de Loki +- El acceso funciona correctamente cuando se conecta vía VPN + +**Configuración relevante:** + +- Entorno: Clúster de producción +- Complementos: Grafana y Loki +- Método de acceso: Se requiere conexión VPN +- Estado de instalación: Grafana marcado como instalado, Loki atascado durante la instalación + +**Condiciones de error:** + +- Errores de tiempo de espera al acceder a la interfaz web de Grafana +- El proceso de instalación de Loki se queda colgado indefinidamente +- Los problemas se resuelven tras reanudar manualmente la instalación atascada + +## Solución Detallada + + + +Para verificar el estado actual de tus complementos en SleakOps: + +1. Navega a tu **Panel del Clúster** +2. Ve a la sección **Complementos** +3. Revisa el estado de cada complemento: + - **Instalando**: Aún en progreso + - **Instalado**: Desplegado con éxito + - **Fallido**: La instalación encontró errores + - **Atascado**: La instalación parece congelada + +Si un complemento muestra "Instalando" por más de 30 minutos, puede estar atascado. + + + + + +Cuando la instalación de Loki se atasca: + +1. Ve a **Panel del Clúster** → **Complementos** +2. Encuentra el complemento Loki con estado "Instalando" +3. Haz clic en el **menú de tres puntos** junto a Loki +4. Selecciona **"Reanudar instalación"** o **"Reintentar"** +5. Monitorea el progreso de la instalación +6. 
Espera a que el estado cambie a "Instalado" + +**Método alternativo vía kubectl:** + +```bash +# Verificar estado de los pods +kubectl get pods -n loki-system + +# Revisar pods atascados +kubectl describe pod <nombre-del-pod> -n loki-system + +# Reiniciar pods atascados si es necesario +kubectl delete pod <nombre-del-pod> -n loki-system +``` + + + + + +Los problemas de tiempo de espera en Grafana suelen estar relacionados con: + +1. **Instalación incompleta de Loki**: Grafana puede depender de que Loki esté completamente instalado +2. **Conectividad de red**: Asegúrate de estar conectado vía VPN +3. **Preparación de los pods**: Los pods de Grafana pueden no estar completamente listos + +**Pasos para resolver:** + +1. **Asegúrate de que Loki esté completamente instalado** (como se describió arriba) +2. **Verifica el estado del pod de Grafana:** + ```bash + kubectl get pods -n grafana-system + kubectl logs -f <nombre-del-pod-de-grafana> -n grafana-system + ``` +3. **Verifica que la conexión VPN esté activa** +4. **Espera 5-10 minutos** después de que la instalación de Loki finalice +5. **Intenta acceder a Grafana nuevamente** + + + + + +Grafana y otras herramientas de monitoreo en SleakOps generalmente requieren acceso VPN por seguridad: + +**Requisitos:** + +- Conexión VPN activa a la red de tu clúster +- Resolución DNS adecuada a través de la VPN +- Reglas de firewall correctas que permitan el acceso + +**Pasos para verificar:** + +1. Confirma que la conexión VPN está activa +2. Prueba la resolución DNS: `nslookup grafana.tu-cluster.local` +3. Verifica si puedes acceder a otros servicios del clúster +4. Asegúrate de que tu usuario tenga los permisos adecuados para acceder a Grafana + + + + + +Para evitar problemas similares en el futuro: + +**Orden de instalación:** + +1. Instala primero Loki (backend de registro) +2. Espera a que Loki esté completamente listo +3. 
Luego instala Grafana (frontend de visualización) + +**Monitoreo de la instalación:** + +- Revisa el estado de los complementos cada 10-15 minutos durante la instalación +- No instales múltiples complementos pesados simultáneamente +- Asegúrate de que el clúster tenga recursos suficientes antes de instalar + +**Requisitos de recursos:** + +```yaml +# Recursos mínimos recomendados +Loki: + memory: 2Gi + cpu: 500m +Grafana: + memory: 1Gi + cpu: 250m +``` + + + + + +**Verificar estado general del clúster:** + +```bash +kubectl get nodes +kubectl top nodes +kubectl get pods --all-namespaces | grep -E "(loki|grafana)" +``` + +**Revisar logs específicos de complementos:** + +```bash +# Logs de Loki +kubectl logs -f deployment/loki -n loki-system + +# Logs de Grafana +kubectl logs -f deployment/grafana -n grafana-system +``` + +**Verificar endpoints de servicios:** + +```bash +kubectl get svc -n grafana-system +kubectl get svc -n loki-system +``` + +**Verificar ingress/rutas:** + +```bash +kubectl get ingress --all-namespaces +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 12 de noviembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-ingestion-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-ingestion-issues.mdx new file mode 100644 index 000000000..9cfea31cd --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-ingestion-issues.mdx @@ -0,0 +1,444 @@ +--- +sidebar_position: 3 +title: "Problemas de Ingestión de Logs en Grafana Loki" +description: "Solución de problemas de logs faltantes y fallos del pod loki-write en Grafana" +date: "2024-12-19" +category: "dependencia" +tags: ["grafana", "loki", "registro", "monitoreo", "solucion-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Ingestión de 
Logs en Grafana Loki + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Loki, Registro, Monitoreo, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas con la visualización de logs en Grafana donde el pod loki-write parece incapaz de escribir y almacenar logs correctamente, resultando en entradas de logs faltantes de los servicios de la aplicación. + +**Síntomas Observados:** + +- Logs iniciales faltantes de los servicios en el panel de Grafana +- Las entradas de logs aparecen horas después de la hora real de inicio del servicio +- El pod loki-write presenta fallos de escritura/almacenamiento +- Línea de tiempo de logs incompleta con huecos en el historial de logs +- Servicios monitoreados a través de Lens muestran logs que no aparecen en Grafana + +**Configuración Relevante:** + +- Componente: Grafana con backend Loki +- Pod afectado: `loki-write` +- Ingestión de logs: Logs de aplicaciones en tiempo real +- Configuración de monitoreo: Pila Grafana integrada con SleakOps + +**Condiciones de Error:** + +- Logs faltantes desde el período de inicio del servicio +- Aparición retardada de logs (horas después de los eventos reales) +- Ingestión inconsistente de logs entre diferentes servicios +- El pod loki-write no puede persistir los datos de logs + +## Solución Detallada + + + +Para diagnosticar problemas del pod loki-write: + +1. **Verificar estado y logs del pod:** + +```bash +kubectl get pods -n monitoring | grep loki-write +kubectl logs -n monitoring loki-write-0 --tail=100 +``` + +2. **Verificar configuración de almacenamiento:** + +```bash +kubectl describe pvc -n monitoring | grep loki +``` + +3. 
**Revisar límites de recursos:** + +```bash +kubectl describe pod -n monitoring loki-write-0 +``` + +Problemas comunes incluyen: + +- Espacio de almacenamiento insuficiente +- Restricciones de memoria/CPU +- Problemas de montaje de PVC +- Problemas de conectividad de red + + + + + +Si el almacenamiento es la causa raíz: + +1. **Verificar almacenamiento disponible:** + +```bash +kubectl exec -n monitoring loki-write-0 -- df -h +``` + +2. **Verificar estado del PVC:** + +```bash +kubectl get pvc -n monitoring +kubectl describe pvc loki-storage -n monitoring +``` + +3. **Aumentar almacenamiento si es necesario:** + +```yaml +# En tu configuración de Loki +persistence: + enabled: true + size: 50Gi # Aumentar desde el valor predeterminado + storageClass: gp3 +``` + +4. **Limpiar logs antiguos si el almacenamiento está lleno:** + +```bash +# Acceder al pod loki-write +kubectl exec -it -n monitoring loki-write-0 -- /bin/sh +# Revisar y limpiar chunks antiguos +ls -la /loki/chunks/ +``` + + + + + +Para prevenir problemas de ingestión de logs relacionados con recursos: + +1. **Aumentar límites de memoria:** + +```yaml +# Recursos del componente de escritura de Loki +write: + resources: + requests: + memory: 512Mi + cpu: 100m + limits: + memory: 2Gi + cpu: 500m +``` + +2. **Configurar retención adecuada:** + +```yaml +limits_config: + retention_period: 168h # 7 días + ingestion_rate_mb: 10 + ingestion_burst_size_mb: 20 +``` + +3. **Optimizar configuración de chunks:** + +```yaml +chunk_store_config: + max_look_back_period: 168h +schema_config: + configs: + - from: 2023-01-01 + store: boltdb-shipper + object_store: s3 + schema: v11 + index: + prefix: loki_index_ + period: 24h +``` + + + + + +Para asegurar que los logs se estén ingiriendo correctamente: + +1. **Verificar configuración de Promtail:** + +```bash +kubectl logs -n monitoring promtail-daemonset-xxx +``` + +2. 
**Verificar envío de logs:** + +```bash +# Comprobar si los logs se envían a Loki +kubectl exec -n monitoring promtail-xxx -- wget -qO- http://localhost:3101/metrics | grep promtail_sent +``` + +3. **Probar API de Loki directamente:** + +```bash +# Consultar Loki por los logs de la última hora (start/end en nanosegundos) +kubectl port-forward -n monitoring svc/loki 3100:3100 +curl -G -s "http://localhost:3100/loki/api/v1/query_range" --data-urlencode 'query={job="your-service"}' --data-urlencode "start=$(($(date +%s) - 3600))000000000" +``` + +4. **Verificar descubrimiento de servicios:** + +```bash +# Confirmar que Promtail descubre tus pods +kubectl exec -n monitoring promtail-xxx -- wget -qO- http://localhost:3101/targets +``` + + + + + +Asegúrate de que Grafana esté configurado correctamente para consultar Loki: + +1. **Verificar datasource de Loki:** + + - Ir a Grafana → Configuración → Fuentes de Datos + - Comprobar URL de Loki: `http://loki:3100` + - Probar conexión + +2. **Configurar rangos de tiempo adecuados:** + + - En los paneles de Grafana, asegurar que el rango de tiempo cubra el período esperado + - Verificar configuración de zona horaria + +3. **Optimizar el rendimiento de las consultas:** + +```logql +# Usar consultas LogQL eficientes +{namespace="your-namespace", pod=~"your-service-.*"} |= "your-search-term" +``` + +4. **Establecer intervalos de actualización apropiados:** + - Para monitoreo en tiempo real: 5-10 segundos + - Para análisis histórico: 1-5 minutos + + + + + +Para prevenir futuros problemas de ingestión de logs: + +1. **Monitorear métricas de Loki:** + +```yaml +# Añadir alertas para la salud de Loki +- alert: LokiWriteErrors + expr: increase(loki_ingester_chunks_flushed_total{status="failed"}[5m]) > 0 + for: 2m + labels: + severity: warning + annotations: + summary: "Errores de escritura en Loki detectados" +``` + +2. 
**Configurar monitoreo de almacenamiento:** + +```yaml +- alert: LokiStorageFull + expr: (kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*loki.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*loki.*"}) < 0.1 + for: 5m + labels: + severity: critical + annotations: + summary: "Almacenamiento de Loki casi lleno" +``` + +3. **Configurar dashboards de monitoreo:** + +```json +{ + "dashboard": { + "title": "Loki Health Monitoring", + "panels": [ + { + "title": "Log Ingestion Rate", + "targets": [ + { + "expr": "rate(loki_distributor_lines_received_total[5m])" + } + ] + }, + { + "title": "Write Errors", + "targets": [ + { + "expr": "rate(loki_ingester_chunks_flushed_total{status=\"failed\"}[5m])" + } + ] + } + ] + } +} +``` + + + + + +Para diagnosticar problemas de red que afectan la ingestión de logs: + +1. **Verificar conectividad entre componentes:** + +```bash +# Desde un pod de Promtail, probar conectividad a Loki +kubectl exec -n monitoring promtail-xxx -- nslookup loki +kubectl exec -n monitoring promtail-xxx -- telnet loki 3100 +``` + +2. **Verificar políticas de red:** + +```bash +kubectl get networkpolicies -n monitoring +kubectl describe networkpolicy -n monitoring +``` + +3. **Probar conectividad desde Grafana a Loki:** + +```bash +kubectl exec -n monitoring grafana-xxx -- wget -qO- http://loki:3100/ready +``` + +4. **Verificar configuración de DNS:** + +```bash +kubectl get svc -n monitoring | grep loki +kubectl describe svc loki -n monitoring +``` + + + + + +Para mejorar el rendimiento de ingestión de logs: + +1. **Configurar paralelismo de ingestión:** + +```yaml +ingester: + lifecycler: + ring: + replication_factor: 3 + chunk_idle_period: 5m + chunk_retain_period: 30s + max_transfer_retries: 0 + wal: + enabled: true + dir: /loki/wal +``` + +2. **Optimizar configuración de distribuidor:** + +```yaml +distributor: + ring: + kvstore: + store: memberlist +``` + +3. 
**Configurar compactación eficiente:** + +```yaml +compactor: + working_directory: /loki/compactor + shared_store: s3 + compaction_interval: 10m + retention_enabled: true + retention_delete_delay: 2h + retention_delete_worker_count: 150 +``` + +4. **Ajustar límites de consulta:** + +```yaml +query_range: + results_cache: + cache: + memcached_client: + consistent_hash: true + host: memcached + service: http +``` + + + + + +Para proteger contra pérdida de logs: + +1. **Configurar respaldo automático:** + +```yaml +# Configuración de respaldo para chunks de Loki +apiVersion: batch/v1 +kind: CronJob +metadata: + name: loki-backup +spec: + schedule: "0 2 * * *" + jobTemplate: + spec: + template: + spec: + containers: + - name: backup + image: amazon/aws-cli + command: + - /bin/sh + - -c + - aws s3 sync /loki/chunks s3://your-backup-bucket/loki-chunks/ +``` + +2. **Implementar retención de logs:** + +```yaml +table_manager: + retention_deletes_enabled: true + retention_period: 168h +``` + +3. **Configurar alertas de respaldo:** + +```yaml +- alert: LokiBackupFailed + expr: increase(cronjob_status_failed{job="loki-backup"}[1h]) > 0 + labels: + severity: warning + annotations: + summary: "Respaldo de Loki falló" +``` + + + +## Mejores Prácticas + +### Configuración de Producción + +1. **Usar almacenamiento persistente adecuado** +2. **Configurar retención de logs apropiada** +3. **Implementar monitoreo proactivo** +4. **Establecer alertas para problemas de ingestión** +5. **Realizar respaldos regulares** + +### Optimización de Consultas + +1. **Usar etiquetas eficientemente** +2. **Limitar rangos de tiempo en consultas** +3. **Implementar caché de consultas** +4. **Optimizar expresiones LogQL** + +### Mantenimiento Regular + +1. **Monitorear uso de almacenamiento** +2. **Revisar logs de errores regularmente** +3. **Actualizar configuraciones según carga** +4. 
**Probar procedimientos de recuperación** + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-viewing-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-viewing-issues.mdx new file mode 100644 index 000000000..0236cb1a2 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-log-viewing-issues.mdx @@ -0,0 +1,211 @@ +--- +sidebar_position: 3 +title: "Problemas para visualizar logs en Grafana Loki" +description: "Soluciones para visualización incompleta de logs y problemas de zona horaria en el panel de Grafana Loki" +date: "2025-01-15" +category: "general" +tags: ["grafana", "loki", "logs", "zona horaria", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas para visualizar logs en Grafana Loki + +**Fecha:** 15 de enero de 2025 +**Categoría:** General +**Etiquetas:** Grafana, Loki, Logs, Zona horaria, Solución de problemas + +## Descripción del problema + +**Contexto:** Los usuarios experimentan problemas al visualizar logs en el panel de Grafana Loki, particularmente con pods y servicios de CronJob donde la salida completa del log no es visible, dificultando determinar si los trabajos se completaron con éxito. 
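Una causa frecuente de este comportamiento es que el volumen de loki-write se quede sin espacio. La condición que usa la alerta de almacenamiento incluida en esta guía (menos del 10 % de espacio disponible) puede esbozarse así; es un ejemplo ilustrativo en Python con valores ficticios:

```python
def almacenamiento_casi_lleno(bytes_disponibles, bytes_capacidad, umbral=0.1):
    """Devuelve True si la fracción de espacio disponible cae por debajo del umbral,
    igual que la condición disponible/capacidad < 0.1 de la alerta LokiStorageFull."""
    return bytes_disponibles / bytes_capacidad < umbral

# 3 GiB libres sobre 50 GiB de capacidad -> 6 % disponible: condición de alerta
print(almacenamiento_casi_lleno(3 * 2**30, 50 * 2**30))  # True
```

Los valores reales de disponible y capacidad se obtienen con `df -h` dentro del pod o con las métricas `kubelet_volume_stats_*`, como se muestra más abajo en esta guía.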
+ +**Síntomas observados:** + +- Visualización incompleta de logs - no se muestran las líneas finales +- No se puede ver el estado de finalización del trabajo (éxito/fallo) +- El panel de Grafana se congela o muestra carga infinita +- Errores de timeout al acceder a los logs de Loki +- Mensajes finales "OK" o de finalización ausentes en los logs +- Los problemas ocurren tanto en el panel de SleakOps como en Lens + +**Configuración relevante:** + +- Plataforma: SleakOps con integración Grafana/Loki +- Tipo de carga de trabajo: CronJobs y Pods regulares +- Visualización de logs: A través del panel de Grafana y Lens +- Zona horaria: Desajustes entre UTC y zona horaria local + +**Condiciones de error:** + +- El error ocurre al consultar logs fuera de rangos de tiempo específicos +- El panel se rompe cuando hay huecos en los datos de logs +- La configuración de zona horaria causa problemas en la visualización +- La consulta a Loki falla cuando el rango de tiempo incluye periodos sin logs + +## Solución detallada + + + +La causa principal de los problemas para visualizar logs es la discrepancia de zona horaria entre los paneles: + +**Solución:** + +1. **Configurar ambos paneles en la misma zona horaria**: + + - Cambiar la zona horaria del panel principal a **UTC 0** + - O configurar ambos paneles a **UTC-3** (zona horaria local) + +2. **Cómo cambiar la zona horaria en Grafana**: + + - Ir a la configuración del panel (icono de engranaje) + - Navegar a **General** → **Opciones de tiempo** + - Establecer **Zona horaria** al valor deseado + - Guardar el panel + +3. **Verificar que los rangos de tiempo coincidan**: + - Asegurarse que tanto el explorador de logs como los paneles de métricas usen el mismo rango de tiempo + - Usar rangos de tiempo absolutos cuando sea posible + + + + + +Para evitar fallos en las consultas, filtrar los logs dentro de rangos de tiempo válidos: + +**Pasos:** + +1. 
**Identificar periodos válidos de logs**: + + - Primero revisar el panel de métricas para ver cuándo su servicio realmente estuvo activo + - Anotar el rango de tiempo exacto con actividad de logs + +2. **Aplicar filtros de tiempo conservadores**: + + - Usar el selector de rango de tiempo en Grafana + - Establecer horas de **Desde** y **Hasta** que cubran solo periodos con actividad de logs conocida + - Evitar extender más allá del rango donde existen logs + +3. **Ejemplo de filtrado de tiempo**: + ``` + Desde: 2025-01-15 14:00:00 + Hasta: 2025-01-15 16:00:00 + ``` + +**Nota:** Extender el rango de tiempo más allá de periodos con logs reales causará que la consulta a Loki falle. + + + + + +Cuando los logs aparecen incompletos (faltan líneas finales): + +**Causas posibles:** + +1. **Tiempo de terminación del pod**: Los logs pueden cortarse si el pod termina antes de que todos los logs se hayan vaciado +2. **Retraso en la ingestión de Loki**: Puede haber un retraso entre la generación del log y su disponibilidad en Loki +3. **Problemas de buffer**: Los buffers de logs pueden no vaciarse completamente antes de la terminación del pod + +**Pasos para diagnóstico:** + +1. **Verificar estado del pod**: + + ```bash + kubectl get pods -n <namespace> + kubectl describe pod <nombre-del-pod> -n <namespace> + ``` + +2. **Verificar finalización del trabajo**: + + ```bash + kubectl get jobs -n <namespace> + kubectl describe job <nombre-del-job> -n <namespace> + ``` + +3. **Revisar fuentes alternativas de logs**: + - Usar `kubectl logs` directamente + - Revisar logs de pods en Lens + - Verificar salidas en archivos si el servicio escribe a archivos + + + + + +Para CronJobs, el estado del pod indica éxito o fallo del trabajo: + +**Significados del estado del pod:** + +- **Completed**: Trabajo finalizado con éxito (código de salida 0) +- **Failed**: Trabajo fallido (código de salida 1 u otro distinto de cero) +- **Running**: Trabajo aún en ejecución + +**Buenas prácticas para logs de CronJob:** + +1. 
**Incluir siempre mensajes explícitos de finalización**: + + ```bash + echo "Trabajo iniciado a $(date)" + # Lógica de su trabajo aquí + echo "Trabajo completado con éxito a $(date)" + exit 0 + ``` + +2. **Usar códigos de salida correctos**: + + ```bash + # Para éxito + exit 0 + + # Para fallo + echo "Error: Algo salió mal" + exit 1 + ``` + +3. **Revisar historial de trabajos**: + ```bash + kubectl get jobs -n <namespace> --show-labels + kubectl describe cronjob <nombre-del-cronjob> -n <namespace> + ``` + + + + + +Mientras se desarrollan correcciones permanentes: + +**Soluciones inmediatas:** + +1. **Usar kubectl para logs completos**: + + ```bash + kubectl logs <nombre-del-pod> -n <namespace> --tail=-1 + ``` + +2. **Revisar múltiples fuentes de logs**: + + - Panel de SleakOps + - Aplicación Lens + - Comandos directos kubectl + - Salidas en archivos (si aplica) + +3. **Verificar finalización de trabajos mediante estado de pods**: + + ```bash + kubectl get pods -n <namespace> --field-selector=status.phase=Succeeded + kubectl get pods -n <namespace> --field-selector=status.phase=Failed + ``` + +4. 
**Usar rangos de tiempo conservadores**: + - Consultar solo periodos donde se sabe que existen logs + - Evitar rangos amplios que puedan incluir huecos + +**Soluciones a largo plazo en desarrollo:** + +- Corrección para manejo de consultas Loki cuando existen huecos en logs +- Mejor manejo de zonas horarias en paneles +- Mejor vaciado de buffers de logs para pods en terminación + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-not-loading-logs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-not-loading-logs.mdx new file mode 100644 index 000000000..67a8f331f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-loki-not-loading-logs.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Grafana Loki No Carga Logs ni Opciones" +description: "Solución para Grafana Loki cuando los dashboards y la vista de exploración no cargan opciones de logs" +date: "2024-10-21" +category: "dependency" +tags: ["grafana", "loki", "logs", "monitoring", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Grafana Loki No Carga Logs ni Opciones + +**Fecha:** 21 de octubre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Loki, Logs, Monitoreo, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas con Grafana Loki donde los dashboards de logs y la vista de exploración dejan de funcionar, impidiendo el acceso a datos y opciones de logs. 
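La solución rápida que detalla esta guía consiste en reiniciar el pod `loki-read`, para lo cual primero hay que localizar su nombre exacto en la salida de `kubectl get pods`. Ese filtrado puede esbozarse así; es un ejemplo ilustrativo en Python sobre una salida simulada de kubectl:

```python
def pods_por_prefijo(salida_kubectl, prefijo):
    """Extrae los nombres de pod que empiezan con `prefijo`
    de la salida tabular de `kubectl get pods`."""
    nombres = []
    for linea in salida_kubectl.strip().splitlines()[1:]:  # omite la fila de cabecera
        nombre = linea.split()[0]  # la primera columna es NAME
        if nombre.startswith(prefijo):
            nombres.append(nombre)
    return nombres

salida = """NAME                READY   STATUS    RESTARTS   AGE
loki-read-0         1/1     Running   0          3d
loki-write-0        1/1     Running   0          3d
grafana-7f9c-abcde  1/1     Running   0          3d"""

print(pods_por_prefijo(salida, "loki-read"))  # ['loki-read-0']
```

En la práctica basta con `kubectl get pods -n monitoring | grep loki-read`, como se muestra más abajo; el boceto solo ilustra qué columna contiene el nombre que necesita `kubectl delete pod`.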
+ +**Síntomas Observados:** + +- Grafana Loki deja de funcionar repentinamente +- Los dashboards de logs no cargan ninguna opción +- La vista de exploración con la fuente de datos Loki no muestra opciones disponibles +- El estado del servicio Loki parece normal +- El problema afecta toda la funcionalidad relacionada con logs en Grafana + +**Configuración Relevante:** + +- Servicio: Grafana con fuente de datos Loki +- Plataforma: Entorno gestionado por SleakOps +- Componentes afectados: Dashboards de logs y vista de exploración +- Estado externo: El estado del servicio Loki muestra que está operativo + +**Condiciones de Error:** + +- El error ocurre de forma intermitente sin un disparador claro +- Afecta a todos los usuarios que acceden a la funcionalidad de logs +- El problema persiste hasta una intervención manual +- No hay cambios evidentes en la configuración antes del problema + +## Solución Detallada + + + +La forma más rápida de resolver este problema es reiniciar el pod `loki-read`: + +**Usando kubectl:** + +```bash +# Encontrar el pod loki-read +kubectl get pods -n monitoring | grep loki-read + +# Reiniciar el pod eliminándolo (se recreará automáticamente) +kubectl delete pod <nombre-del-pod-loki-read> -n monitoring + +# Verificar que el nuevo pod esté corriendo +kubectl get pods -n monitoring | grep loki-read +``` + +**Usando el Panel de SleakOps:** + +1. Navega a las cargas de trabajo de tu clúster +2. Encuentra el deployment `loki-read` en el namespace monitoring +3. Reinicia el deployment o elimina el pod +4. Espera a que el nuevo pod esté listo + + + + + +Después de reiniciar el pod loki-read: + +1. **Espera 2-3 minutos** para que el pod se inicialice completamente +2. **Accede a Grafana** y ve a la vista de Exploración +3. **Selecciona Loki** como fuente de datos +4. **Verifica si las etiquetas de logs** y las opciones ahora cargan +5. **Prueba una consulta simple** como `{job="tu-nombre-de-app"}` +6. 
**Confirma que los dashboards** muestran datos de logs nuevamente + + + + + +Este problema generalmente ocurre debido a: + +- **Presión de memoria** en el componente loki-read +- **Timeouts de conexión** entre Grafana y Loki +- **Corrupción de índices** en el almacenamiento temporal de Loki +- **Agotamiento de recursos** durante períodos de alto volumen de logs + +El equipo de SleakOps está trabajando en una solución permanente para evitar que este problema se repita. + + + + + +Para minimizar la ocurrencia de este problema: + +**Monitorea el uso de recursos:** + +```bash +# Revisa el uso de recursos del pod loki-read +kubectl top pod -n monitoring | grep loki-read + +# Revisa los logs del pod para errores +kubectl logs <nombre-del-pod-loki-read> -n monitoring --tail=100 +``` + +**Configura alertas:** + +- Monitorea los tiempos de respuesta de consultas a Loki +- Alerta sobre alto uso de memoria en los pods loki-read +- Configura chequeos de salud para la conectividad de la fuente de datos Grafana + +**Buenas prácticas:** + +- Revisa regularmente las políticas de retención de logs +- Monitorea las tasas de ingestión de logs +- Considera el muestreo de logs para aplicaciones de alto volumen + + + + + +Contacta al soporte de SleakOps si: + +- Reiniciar el pod loki-read no resuelve el problema +- El problema se repite con frecuencia (más de una vez por semana) +- Ves errores persistentes en los logs del pod loki-read +- Los datos de logs parecen estar faltando o corruptos +- La degradación del rendimiento afecta a otros componentes de monitoreo + +Incluye la siguiente información: + +- Marca de tiempo cuando comenzó el problema +- Cualquier cambio reciente en el volumen de logs o configuración +- Capturas de pantalla de la interfaz de Grafana mostrando el problema +- Salida de `kubectl logs <nombre-del-pod-loki-read> -n monitoring` + + + +--- + +_Esta FAQ fue generada automáticamente el 21 de octubre de 2024 basada en una consulta real de usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-prometheus-datasource-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-prometheus-datasource-configuration.mdx new file mode 100644 index 000000000..794ccd82d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/grafana-prometheus-datasource-configuration.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Configuración de la Fuente de Datos Prometheus en Grafana" +description: "Cómo configurar la fuente de datos Prometheus en Grafana dentro de la plataforma SleakOps" +date: "2024-12-11" +category: "dependency" +tags: ["grafana", "prometheus", "monitorización", "fuente de datos", "thanos"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de la Fuente de Datos Prometheus en Grafana + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Grafana, Prometheus, Monitorización, Fuente de Datos, Thanos + +## Descripción del Problema + +**Contexto:** Los usuarios que intentan configurar Prometheus como fuente de datos en Grafana dentro de la plataforma SleakOps encuentran problemas de conexión y desconocen las fuentes de datos preconfiguradas. 
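Como referencia para las verificaciones de esta guía: la fuente de datos por defecto apunta a Thanos, que expone la misma API HTTP de consulta que Prometheus, así que la conectividad puede comprobarse a mano construyendo la URL del endpoint de consulta instantánea. Un boceto ilustrativo en Python (la URL `http://thanos-query:9090` es la de la configuración de ejemplo de esta guía y `up` es una métrica estándar de Prometheus):

```python
from urllib.parse import urlencode

def url_consulta_instantanea(base, consulta):
    """Construye la URL del endpoint /api/v1/query de Prometheus/Thanos,
    codificando la expresión PromQL como parámetro `query`."""
    return f"{base}/api/v1/query?" + urlencode({"query": consulta})

print(url_consulta_instantanea("http://thanos-query:9090", "up"))
# http://thanos-query:9090/api/v1/query?query=up
```

Esa URL puede probarse con `curl` desde un pod del clúster (o tras un `kubectl port-forward`); si responde JSON con `"status":"success"`, el backend de la fuente de datos está accesible.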
+ +**Síntomas observados:** + +- Incapacidad para conectar con Prometheus desde la interfaz de Grafana +- Intentos fallidos de configuración manual de Prometheus +- Los usuarios crean fuentes de datos duplicadas innecesariamente +- Tiempo de espera agotado al probar la fuente de datos Prometheus + +**Configuración relevante:** + +- Grafana es accesible vía conexión VPN +- Prometheus está instalado dentro del clúster de Kubernetes +- Las fuentes de datos por defecto están preconfiguradas por SleakOps +- Thanos se utiliza como proxy para almacenamiento a largo plazo en S3 + +**Condiciones de error:** + +- Fallo en la prueba de conexión de la fuente de datos +- Punto de acceso Prometheus no accesible desde Grafana +- Los usuarios intentan configuración manual en lugar de usar los valores por defecto + +## Solución Detallada + + + +SleakOps configura automáticamente una fuente de datos Prometheus en Grafana. **No necesitas crear una nueva manualmente.** + +**Pasos para usar la fuente de datos por defecto:** + +1. Accede a Grafana a través de tu conexión VPN de SleakOps +2. Ve a **Configuración** → **Fuentes de datos** +3. Busca la fuente de datos existente llamada **"Prometheus"** +4. 
Esta fuente de datos ya está configurada y lista para usar + +**Beneficios clave de la configuración por defecto:** + +- Preconfigurada con los endpoints correctos +- Integrada con Thanos para almacenamiento de métricas a largo plazo +- Funciona con todos los dashboards por defecto de SleakOps +- No requiere configuración manual + + + + + +La fuente de datos Prometheus por defecto usa **Thanos** como destino en lugar de conectarse directamente a Prometheus: + +**¿Qué es Thanos?** + +- Una configuración altamente disponible de Prometheus con capacidades de almacenamiento a largo plazo +- Almacena métricas en AWS S3 para retención extendida +- Proporciona la misma interfaz de consulta que Prometheus +- Permite análisis histórico de datos más allá de la retención local de Prometheus + +**Detalles de configuración:** + +```yaml +# Configuración de la fuente de datos por defecto (gestionada por SleakOps) +name: Prometheus +type: prometheus +url: http://thanos-query:9090 +access: proxy +isDefault: true +``` + +**¿Por qué usar Thanos en lugar de Prometheus directo?** + +- Retención extendida de métricas (almacenadas en S3) +- Mejor rendimiento para consultas históricas +- Configuración de alta disponibilidad +- Integración perfecta con el stack de monitorización de SleakOps + + + + + +Si estás experimentando problemas de conexión con Grafana: + +**1. Verifica la conexión VPN:** + +```bash +# Comprueba si puedes acceder a Grafana +curl -I https://grafana.tu-proyecto.sleakops.com +``` + +**2. Revisa el estado de la fuente de datos:** + +- Ve a **Configuración** → **Fuentes de datos** +- Haz clic en la fuente de datos **"Prometheus"** +- Haz clic en **"Guardar y probar"** para verificar la conectividad + +**3. 
Problemas comunes de conexión:** + +- **VPN no conectada:** Asegúrate de estar conectado a la VPN de SleakOps +- **Ingress no listo:** Espera unos minutos después de la creación del clúster +- **Resolución DNS:** Limpia la caché del navegador y vuelve a intentar + +**4. Métodos alternativos de acceso:** + +```bash +# Reenvío de puerto para acceder directamente a Grafana (si persisten problemas con la VPN) +kubectl port-forward svc/grafana 3000:80 -n monitoring +``` + + + + + +**Solo deberías crear fuentes de datos personalizadas si:** + +1. **Instancias externas de Prometheus:** Conectar a Prometheus fuera de tu clúster +2. **Políticas de retención diferentes:** Necesitas diferentes tiempos de espera o intervalos de consulta +3. **Integraciones personalizadas:** Herramientas de monitorización específicas o servicios externos +4. **Desarrollo/pruebas:** Fuentes de datos separadas para diferentes entornos + +**Si creas fuentes de datos personalizadas:** + +```yaml +# Ejemplo de configuración personalizada de Prometheus +name: prometheus-custom +type: prometheus +url: http://prometheus-server.monitoring.svc.cluster.local:80 +access: proxy +timeout: 60s +``` + +**Buenas prácticas:** + +- Usa nombres descriptivos (p. 
ej., "prometheus-dev", "prometheus-external") +- Prueba la conectividad antes de guardar +- Documenta el propósito de las fuentes de datos personalizadas +- Evita duplicar la configuración por defecto + + + + + +**Dashboards por defecto de SleakOps:** + +- Están configurados para usar la fuente de datos "Prometheus" por defecto +- Funcionan automáticamente sin modificaciones +- Incluyen métricas de clúster, métricas de aplicaciones y monitorización de infraestructura + +**Si usas fuentes de datos personalizadas:** + +- Puede que necesites modificar las consultas en los dashboards +- Actualiza las referencias a la fuente de datos en el JSON del dashboard +- Prueba todos los paneles para asegurar que los datos se muestran correctamente + +**Enfoque recomendado:** + +1. Comienza con la fuente de datos "Prometheus" por defecto +2. Explora los dashboards y métricas disponibles +3. Solo crea fuentes de datos personalizadas cuando sea estrictamente necesario +4. Mantén la fuente de datos por defecto como tu principal fuente de monitorización + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx new file mode 100644 index 000000000..4f0f7f6e3 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/gunicorn-readiness-probe-timeout-configuration.mdx @@ -0,0 +1,205 @@ +--- +sidebar_position: 3 +title: "Configuración de Timeout de Gunicorn con Probes de Readiness de Kubernetes" +description: "Cómo configurar los timeouts de Gunicorn sin causar fallos en las probes de readiness en SleakOps" +date: "2025-01-15" +category: "workload" +tags: ["gunicorn", "timeout", "readiness-probe", "python", "healthcheck"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Timeout de Gunicorn con Probes de Readiness de Kubernetes + +**Fecha:** 15 de enero de 2025 +**Categoría:** Workload +**Etiquetas:** Gunicorn, Timeout, Readiness Probe, Python, Healthcheck + +## Descripción del Problema + +**Contexto:** Al configurar Gunicorn con un timeout personalizado en SleakOps, los pods no inician correctamente debido a fallos en las probes de readiness, aunque la aplicación funciona bien sin la configuración de timeout. + +**Síntomas Observados:** + +- Los pods aparecen como "No Listos" cuando se configura el timeout de Gunicorn +- Los fallos en la probe de readiness impiden que los pods reciban tráfico +- La aplicación funciona correctamente cuando se elimina el parámetro de timeout +- Las solicitudes API tardan más de lo esperado (20-30 segundos) + +**Configuración Relevante:** + +- Comando Gunicorn con timeout: `gunicorn --bind 0.0.0.0:8000 --timeout 10 backend:app` +- Framework: Python con servidor WSGI Gunicorn +- Despliegue: Pods de Kubernetes en la plataforma SleakOps +- Timeout deseado: 5-10 segundos para solicitudes API + +**Condiciones de Error:** + +- Los pods fallan las comprobaciones de readiness cuando se añade el parámetro `--timeout` a Gunicorn +- El problema ocurre durante el inicio del pod y las comprobaciones de salud +- El problema persiste hasta que se elimina la configuración de timeout + +## Solución Detallada + + + +El problema ocurre porque: + +1. **El timeout de Gunicorn** termina los procesos worker que tardan más del tiempo especificado +2. **La probe de readiness de Kubernetes** espera respuestas consistentes del endpoint de salud +3. Cuando Gunicorn termina workers por timeout, las comprobaciones de salud pueden fallar intermitentemente +4. Las probes de readiness fallidas impiden que el pod sea marcado como "Listo" + + + + + +En SleakOps, ajusta la configuración de la probe de readiness: + +1. 
Ve a la configuración de tu **Workload** +2. Edita los ajustes de **Healthcheck** +3. Haz clic en **Opciones Avanzadas** +4. Configura los siguientes parámetros: + +```yaml +readinessProbe: + httpGet: + path: /health # Tu endpoint de salud + port: 8000 + initialDelaySeconds: 30 # Esperar antes de la primera comprobación + periodSeconds: 10 # Comprobar cada 10 segundos + timeoutSeconds: 5 # Timeout para cada comprobación + successThreshold: 1 # Éxitos consecutivos necesarios + failureThreshold: 5 # Fallos consecutivos antes de marcar como fallido +``` + +**Ajustes clave:** + +- Incrementa `failureThreshold` para permitir más tolerancia +- Incrementa `timeoutSeconds` para igualar o superar el timeout de Gunicorn +- Ajusta `periodSeconds` para reducir la frecuencia de comprobaciones + + + + + +Para una mejor compatibilidad con Kubernetes, usa esta configuración de Gunicorn: + +```bash +# Comando recomendado para Gunicorn +newrelic-admin run-program gunicorn \ + --bind 0.0.0.0:8000 \ + --limit-request-line 0 \ + --max-requests 3000 \ + --max-requests-jitter 200 \ + --timeout 30 \ + --keep-alive 5 \ + --worker-class sync \ + --workers 4 \ + backend:app +``` + +**Parámetros importantes:** + +- `--timeout 30`: Establece un valor mayor que el tiempo esperado de la solicitud +- `--keep-alive 5`: Mantiene conexiones para las comprobaciones de salud +- `--workers 4`: Múltiples workers para redundancia +- `--worker-class sync`: Usa workers síncronos para un comportamiento predecible + + + + + +Configura el timeout del balanceador de carga para manejar solicitudes largas: + +1. En tu configuración de **Ingress** +2. 
Añade la siguiente anotación: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + annotations: + alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=60 + alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30 +spec: + # Tu configuración de ingress +``` + +**Jerarquía de timeouts:** + +1. Timeout de Gunicorn: 30 segundos (timeout de worker) +2. Timeout del balanceador de carga: 60 segundos (timeout de conexión) +3. Timeout de la probe de readiness: 5 segundos (timeout de comprobación de salud) + + + + + +Para depurar problemas de timeout: + +1. **Revisar logs del pod:** + + ```bash + kubectl logs -f deployment/your-app-name + ``` + +2. **Monitorear la probe de readiness:** + + ```bash + kubectl describe pod your-pod-name + ``` + +3. **Probar el endpoint de salud directamente:** + + ```bash + kubectl port-forward pod/your-pod-name 8000:8000 + curl -v http://localhost:8000/health + ``` + +4. **Deshabilitar temporalmente la probe de readiness:** + - Edita el deployment usando Lens o kubectl + - Elimina temporalmente la sección `readinessProbe` + - Prueba si la aplicación responde correctamente + + + + + +Para abordar la causa raíz de las solicitudes lentas: + +1. **Identificar endpoints lentos:** + + - Usa herramientas APM (New Relic, como en tu configuración) + - Añade logging para medir duración de solicitudes + - Perfila consultas a la base de datos + +2. **Optimización de base de datos:** + + ```python + # Añadir pool de conexiones + from sqlalchemy import create_engine + engine = create_engine( + 'postgresql://...', + pool_size=10, + max_overflow=20, + pool_timeout=30 + ) + ``` + +3. **Procesamiento asíncrono:** + + - Mover tareas de larga duración a jobs en segundo plano + - Usar Celery o colas similares + - Retornar respuesta inmediata con ID de job + +4. 
**Caching:** + - Implementar caching con Redis para consultas frecuentes + - Usar cabeceras HTTP de cache + - Cachear cálculos costosos + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/hpa-scaling-issues-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/hpa-scaling-issues-troubleshooting.mdx new file mode 100644 index 000000000..a19a3c50e --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/hpa-scaling-issues-troubleshooting.mdx @@ -0,0 +1,603 @@ +--- +sidebar_position: 3 +title: "Problemas de Escalado HPA - Pods que No Se Reducen" +description: "Solución de problemas de escalado HPA cuando los pods se acumulan con el tiempo" +date: "2024-12-19" +category: "cluster" +tags: + [ + "hpa", + "escalado", + "kubernetes", + "solución-de-problemas", + "fuga-de-memoria", + "cpu", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Escalado HPA - Pods que No Se Reducen + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** HPA, Escalado, Kubernetes, Solución de Problemas, Fuga de Memoria, CPU + +## Descripción del Problema + +**Contexto:** Entorno de producción experimentando un comportamiento anómalo de escalado de pods donde el Autoescalador Horizontal de Pods (HPA) crea muchos más pods de lo habitual y no logra reducir correctamente con el tiempo. 
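La fórmula con la que el HPA calcula las réplicas deseadas (documentada por Kubernetes) ayuda a entender este comportamiento: `desiredReplicas = ceil(currentReplicas × métricaActual / métricaObjetivo)`. El siguiente sketch mínimo en Python, con cifras hipotéticas y omitiendo la tolerancia (~10 %) que aplica el controlador, muestra por qué un consumo que nunca baja del objetivo impide la reducción:

```python
import math

def replicas_deseadas(replicas_actuales: int, uso_actual: float, uso_objetivo: float) -> int:
    """Fórmula del HPA: desiredReplicas = ceil(current * metricaActual / metricaObjetivo)."""
    return math.ceil(replicas_actuales * uso_actual / uso_objetivo)

# Escalado hacia arriba: uso de CPU (90%) muy por encima del objetivo (70%)
print(replicas_deseadas(4, 90, 70))  # 6 réplicas

# Reducción normal: el uso cae cuando baja el tráfico
print(replicas_deseadas(6, 30, 70))  # 3 réplicas

# Con una fuga de memoria, el uso por pod nunca baja del objetivo:
# el cálculo nunca da menos réplicas y los pods se acumulan
print(replicas_deseadas(6, 69, 70))  # 6 réplicas: no hay reducción
```

Mientras la métrica por pod se mantenga en el objetivo o por encima de él, el resultado nunca es menor que el número actual de réplicas, que es exactamente el efecto acumulativo descrito en esta guía.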
+ +**Síntomas Observados:** + +- Significativamente más pods en ejecución de lo normal en producción +- Los pods se normalizan temporalmente después del despliegue pero luego se acumulan nuevamente +- El HPA parece escalar hacia arriba pero no reduce eficazmente +- El problema ocurre repetidamente con el tiempo, creando un efecto acumulativo de escalado + +**Configuración Relevante:** + +- Entorno: clúster Kubernetes de producción +- Autoescalado: HPA habilitado +- Monitoreo: Lens usado para observar conteo de pods +- Ciclo de despliegue: Normalización temporal tras despliegues + +**Condiciones de Error:** + +- El HPA no reduce pods cuando el uso de recursos debería disminuir +- Crecimiento acumulativo de pods con el tiempo +- El consumo de recursos permanece alto impidiendo el escalado hacia abajo normal +- El problema persiste a través de múltiples ciclos de despliegue + +## Solución Detallada + + + +La causa más probable es que su aplicación mantiene procesos abiertos o tiene un consumo constante de CPU/memoria que impide que el HPA reduzca normalmente. Con el tiempo, esto crea un escalado acumulativo donde los pods se siguen añadiendo pero nunca se eliminan. + +Causas comunes incluyen: + +- **Fugas de memoria**: La aplicación no libera memoria adecuadamente +- **Procesos de larga duración**: Tareas en segundo plano que mantienen alto el uso de CPU/memoria +- **Conexiones abiertas**: Conexiones a bases de datos o servicios externos que no se cierran +- **Gestión de sesiones**: Sesiones de usuario de larga duración que consumen recursos +- **Código ineficiente**: Cuellos de botella que causan uso constante de recursos + + + + + +Para identificar el problema específico, use herramientas de monitoreo del rendimiento de aplicaciones (APM): + +### Herramientas Recomendadas + +1. **Atatus** + + - Monitoreo de aplicaciones en tiempo real + - Detección de fugas de memoria + - Identificación de cuellos de botella en rendimiento + +2. 
**New Relic** + + - Solución APM integral + - Perfilado de CPU y memoria + - Análisis de consultas a bases de datos + +3. **Blackfire.io** + - Perfilado PHP (si aplica) + - Información para optimización de rendimiento + - Monitoreo en tiempo real + +Estas herramientas ayudan a identificar problemas como fugas de memoria, procesos lentos, sesiones largas o cuellos de botella en tiempo real sin necesidad de reproducir condiciones de tráfico. + + + + + +Al analizar las métricas de su aplicación, céntrese en estas preguntas clave: + +### Patrones de Uso de CPU + +**Pregunta**: ¿El consumo de CPU comienza alto (alrededor del 70%) justo después del despliegue? + +- **Si SÍ**: Probablemente un problema de configuración de CPU +- **Solución**: Revise y ajuste las solicitudes/límites de CPU en su despliegue + +```yaml +resources: + requests: + cpu: "100m" # Ajustar según necesidades reales + memory: "128Mi" + limits: + cpu: "500m" # Establecer límites apropiados + memory: "512Mi" +``` + +### Patrones de Uso de Memoria + +**Pregunta**: ¿El consumo de memoria aumenta con el tiempo y nunca disminuye? + +- **Si SÍ**: Probablemente una fuga de memoria o procesos no liberados +- **Solución**: Perfile su aplicación para encontrar fugas de memoria + +### Patrones Temporales + +**Pregunta**: ¿El problema ocurre en momentos específicos? 
+ +- **Si SÍ**: Revise logs y métricas durante esos períodos +- **Solución**: Correlacione con lógica de negocio, tareas programadas o integraciones externas + + + + + +Revise la configuración de su HPA para asegurarse de que sea correcta: + +```bash +# Ver estado actual del HPA +kubectl get hpa + +# Obtener información detallada del HPA +kubectl describe hpa <nombre-del-hpa> + +# Revisar eventos del HPA +kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler +``` + +Asegúrese de que su HPA tenga políticas de escalado adecuadas: + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: su-app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: su-app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 + - type: Resource + resource: + name: memory + target: + type: Utilization + averageUtilization: 80 + behavior: + scaleDown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodSeconds: 60 + scaleUp: + stabilizationWindowSeconds: 60 + policies: + - type: Percent + value: 50 + periodSeconds: 60 +``` + + + + + +### Paso 1: Verificar uso actual de recursos + +```bash +# Ver uso de recursos por pod +kubectl top pods -n <namespace> + +# Ver uso de recursos por nodo +kubectl top nodes +``` + +### Paso 2: Analizar logs de la aplicación + +```bash +# Buscar errores relacionados con memoria +kubectl logs <nombre-del-pod> | grep -i "memory\|oom\|killed" + +# Buscar fugas de conexión/recursos +kubectl logs <nombre-del-pod> | grep -i "connection\|timeout\|leak" +``` + +### Paso 3: Monitorear comportamiento del HPA + +```bash +# Observar decisiones de escalado del HPA en tiempo real +kubectl get hpa -w + +# Revisar eventos de escalado del HPA +kubectl describe hpa <nombre-del-hpa> +``` + +### Paso 4: Mitigación temporal + +Si el problema es crítico, puede reducir manualmente los pods: + +```bash +# Reducir manualmente el número de réplicas 
+kubectl scale deployment <nombre-del-deployment> --replicas=<número-de-réplicas> + +# O deshabilitar temporalmente el HPA +kubectl delete hpa <nombre-del-hpa> +``` + + + + + +Para identificar y resolver fugas de memoria: + +1. **Monitorear patrones de memoria:** + +```bash +# Monitorear uso de memoria en tiempo real +kubectl top pods --containers -n <namespace> | grep <nombre-del-pod> + +# Usar herramientas de profiling +kubectl exec -it <nombre-del-pod> -- /bin/sh +# Dentro del pod, usar herramientas como htop, ps, o herramientas específicas del lenguaje +``` + +2. **Configurar alertas de memoria:** + +```yaml +# Alerta para uso alto de memoria +- alert: HighMemoryUsage + expr: (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.8 + for: 5m + labels: + severity: warning + annotations: + summary: "Alto uso de memoria en {{ $labels.pod }}" +``` + +3. **Implementar límites de memoria estrictos:** + +```yaml +resources: + requests: + memory: "256Mi" + limits: + memory: "512Mi" # Límite estricto para forzar reinicio si hay fuga +``` + +4. **Configurar políticas de reinicio:** + +```yaml +spec: + template: + spec: + restartPolicy: Always + containers: + - name: app + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + failureThreshold: 3 +``` + + + + + +Para resolver problemas de CPU que impiden el escalado hacia abajo: + +1. **Analizar patrones de CPU:** + +```bash +# Monitorear CPU en tiempo real +kubectl top pods --containers -n <namespace> + +# Revisar métricas históricas +kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods | jq '.items[] | {name: .metadata.name, cpu: .containers[].usage.cpu}' +``` + +2. **Optimizar configuración de recursos:** + +```yaml +# Configuración optimizada de recursos +resources: + requests: + cpu: "50m" # Solicitud baja para permitir escalado + memory: "128Mi" + limits: + cpu: "200m" # Límite razonable + memory: "256Mi" +``` + +3. 
**Implementar perfilado de CPU:** + +```bash +# Para aplicaciones Java +kubectl exec -it <nombre-del-pod> -- jstack <pid> + +# Para aplicaciones Node.js +kubectl exec -it <nombre-del-pod> -- node --prof app.js + +# Para aplicaciones Python +kubectl exec -it <nombre-del-pod> -- python -m cProfile app.py +``` + +4. **Configurar políticas de escalado más agresivas:** + +```yaml +behavior: + scaleDown: + stabilizationWindowSeconds: 60 # Reducir tiempo de estabilización + policies: + - type: Percent + value: 25 # Permitir reducción más agresiva + periodSeconds: 60 +``` + + + + + +Para resolver problemas de conexiones abiertas que impiden el escalado: + +1. **Monitorear conexiones activas:** + +```bash +# Verificar conexiones de red +kubectl exec -it <nombre-del-pod> -- netstat -an | grep ESTABLISHED + +# Verificar conexiones a base de datos +kubectl exec -it <nombre-del-pod> -- ss -tnp +``` + +2. **Implementar timeouts apropiados:** + +```yaml +# Configuración de timeouts en aplicación +apiVersion: v1 +kind: ConfigMap +metadata: + name: app-config +data: + DB_TIMEOUT: "30" + HTTP_TIMEOUT: "10" + IDLE_TIMEOUT: "300" +``` + +3. **Configurar connection pooling:** + +```javascript +// Ejemplo para Node.js (paquete mysql) +const mysql = require("mysql"); + +const pool = mysql.createPool({ + connectionLimit: 10, + host: process.env.DB_HOST, + user: process.env.DB_USER, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + acquireTimeout: 60000, + connectTimeout: 60000 +}); +``` + +4. **Implementar graceful shutdown:** + +```javascript +// Manejo de señales para cierre graceful +process.on('SIGTERM', () => { + console.log('SIGTERM received, closing server...'); + server.close(() => { + console.log('Server closed'); + process.exit(0); + }); +}); +``` + + + + + +Para configuraciones más sofisticadas del HPA: + +1. 
**HPA con múltiples métricas:** + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: advanced-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: su-app + minReplicas: 2 + maxReplicas: 20 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 60 + - type: Resource + resource: + name: memory + target: + type: Utilization + averageUtilization: 70 + - type: Pods + pods: + metric: + name: http_requests_per_second + target: + type: AverageValue + averageValue: "100" + behavior: + scaleDown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodSeconds: 60 + - type: Pods + value: 2 + periodSeconds: 60 + selectPolicy: Min + scaleUp: + stabilizationWindowSeconds: 60 + policies: + - type: Percent + value: 50 + periodSeconds: 60 + - type: Pods + value: 4 + periodSeconds: 60 + selectPolicy: Max +``` + +2. **Usar métricas personalizadas:** + +```yaml +# Métrica personalizada basada en cola de mensajes +- type: External + external: + metric: + name: queue_length + selector: + matchLabels: + queue: "processing-queue" + target: + type: AverageValue + averageValue: "10" +``` + +3. **Configurar VPA junto con HPA:** + +```yaml +apiVersion: autoscaling.k8s.io/v1 +kind: VerticalPodAutoscaler +metadata: + name: su-app-vpa +spec: + targetRef: + apiVersion: apps/v1 + kind: Deployment + name: su-app + updatePolicy: + updateMode: "Auto" + resourcePolicy: + containerPolicies: + - containerName: app + maxAllowed: + cpu: 1 + memory: 2Gi + minAllowed: + cpu: 100m + memory: 128Mi +``` + + + + + +Para implementar monitoreo proactivo: + +1. 
**Alertas de escalado anómalo:** + +```yaml +# Alerta cuando HPA escala demasiado frecuentemente +- alert: HPAScalingTooFrequent + expr: increase(kube_hpa_status_current_replicas[10m]) > 5 + for: 5m + labels: + severity: warning + annotations: + summary: "HPA {{ $labels.hpa }} escalando muy frecuentemente" + +# Alerta cuando pods no se reducen +- alert: HPANotScalingDown + expr: kube_hpa_status_current_replicas > kube_hpa_spec_min_replicas and kube_hpa_status_current_replicas == kube_hpa_status_current_replicas offset 30m + for: 30m + labels: + severity: warning + annotations: + summary: "HPA {{ $labels.hpa }} no está reduciendo pods" +``` + +2. **Dashboard de monitoreo:** + +```json +{ + "dashboard": { + "title": "HPA Monitoring", + "panels": [ + { + "title": "Current vs Desired Replicas", + "targets": [ + { + "expr": "kube_hpa_status_current_replicas" + }, + { + "expr": "kube_hpa_status_desired_replicas" + } + ] + }, + { + "title": "CPU Utilization", + "targets": [ + { + "expr": "rate(container_cpu_usage_seconds_total[5m]) * 100" + } + ] + }, + { + "title": "Memory Utilization", + "targets": [ + { + "expr": "(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100" + } + ] + } + ] + } +} +``` + + + +## Mejores Prácticas + +### Configuración de Recursos + +1. **Establecer solicitudes y límites apropiados** +2. **Usar perfilado de aplicaciones regularmente** +3. **Implementar health checks robustos** +4. **Configurar graceful shutdown** + +### Monitoreo Proactivo + +1. **Implementar alertas de escalado anómalo** +2. **Monitorear patrones de uso de recursos** +3. **Revisar logs de aplicación regularmente** +4. **Usar herramientas APM en producción** + +### Optimización Continua + +1. **Revisar configuración de HPA periódicamente** +2. **Analizar patrones de tráfico** +3. **Optimizar código para eficiencia de recursos** +4. 
**Probar configuraciones en entornos de staging** + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/index.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/index.mdx new file mode 100644 index 000000000..2f99ee8bb --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/index.mdx @@ -0,0 +1,55 @@ +--- +title: Problemas de Usuario +description: Problemas comunes y soluciones que los usuarios encuentran al usar la plataforma. +--- + +import DocCardList from "@theme/DocCardList"; + +# Problemas de Usuario - Guía de Solución de Problemas + +Esta sección contiene soluciones a los problemas más comunes que los usuarios encuentran al usar la plataforma. Aquí encontrarás guías detalladas organizadas por categorías para ayudarte a resolver rápidamente cualquier problema. + +## 🚀 Problemas Más Comunes + +Si es la primera vez que experimentas un problema, estos son los problemas más frecuentes: + +- **[Problemas de Conexión con el Cluster](./troubleshooting/cluster-connection-troubleshooting)** - Problemas de conectividad del cluster +- **[Despliegue Estancado](./troubleshooting/deployment-stuck-state-resolution)** - Despliegues que no progresan +- **[Fallos de Compilación](./troubleshooting/deployment-build-failed-production)** - Errores en el proceso de compilación +- **[Conexión VPN](./troubleshooting/vpn-connection-disconnection-issues)** - Problemas de conexión VPN +- **[Conexiones de Base de Datos](./troubleshooting/database-credentials-access)** - Problemas de acceso a base de datos + +## 📋 Categorías de Problemas + +### 🏗️ Infraestructura y Clusters +Problemas relacionados con clusters EKS, grupos de nodos, escalado y recursos de infraestructura. + +### 🚀 Despliegues y CI/CD +Problemas con despliegues, compilaciones, pipelines de CI/CD y GitHub Actions. 
+ +### 🗄️ Base de Datos y Almacenamiento +Problemas con bases de datos PostgreSQL, migraciones, conexiones y almacenamiento. + +### 🌐 Redes y DNS +Configuración de dominios, SSL, DNS, balanceadores de carga y enrutamiento. + +### 📊 Monitoreo y Registro +Problemas con Grafana, Prometheus, Loki y herramientas de monitoreo. + +### 🔐 Seguridad y Acceso +Problemas de VPN, autenticación, permisos y control de acceso. + +### ⚡ Rendimiento y Recursos +Optimización de memoria, CPU, rendimiento y gestión de recursos. + +## 💡 Consejos Generales de Solución de Problemas + +1. **Revisar los logs** - Siempre comienza revisando los logs de tu aplicación y pods +2. **Verificar estado** - Usa `kubectl get pods` para verificar el estado de tus recursos +3. **Revisar eventos** - Los eventos de Kubernetes suelen proporcionar pistas importantes +4. **Verificar métricas** - Grafana y Prometheus pueden mostrar problemas de recursos + +--- + +## Todos los Artículos de Solución de Problemas + diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/infrastructure-architecture-diagrams.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/infrastructure-architecture-diagrams.mdx new file mode 100644 index 000000000..71cab0587 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/infrastructure-architecture-diagrams.mdx @@ -0,0 +1,527 @@ +--- +sidebar_position: 15 +title: "Diagramas de Arquitectura de Infraestructura" +description: "Comprendiendo la arquitectura de infraestructura de SleakOps y el flujo de solicitudes" +date: "2024-12-19" +category: "general" +tags: ["arquitectura", "infraestructura", "diagramas", "redes", "vpc"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Diagramas de Arquitectura de Infraestructura + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** General +**Etiquetas:** Arquitectura, Infraestructura, Diagramas, Redes, VPC + +## Descripción del 
Problema + +**Contexto:** Los usuarios necesitan entender la arquitectura de infraestructura creada por SleakOps al desplegar aplicaciones, incluyendo cómo fluye la solicitud desde los usuarios hasta las aplicaciones y la relación entre los diferentes componentes. + +**Síntomas Observados:** + +- Falta de diagramas detallados de la arquitectura de infraestructura +- Dificultad para entender el flujo de solicitudes desde el navegador hasta la aplicación +- Relaciones poco claras entre componentes dentro de la VPC +- Diagramas genéricos que no muestran detalles reales de los componentes + +**Configuración Relevante:** + +- Aplicaciones desplegadas con SleakOps +- Balanceadores de carga y configuración DNS +- VPC y componentes de red +- Múltiples despliegues de aplicaciones + +**Condiciones de Error:** + +- Falta de documentación arquitectónica detallada +- Representación visual inadecuada de los componentes de infraestructura +- Límites poco claros entre componentes internos y externos + +## Solución Detallada + + + +SleakOps crea una arquitectura de infraestructura completa al desplegar aplicaciones. 
Aquí está la arquitectura típica: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Internet │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ DNS (Route 53) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ app1.tudominio.com → ALB │ │ +│ │ app2.tudominio.com → ALB │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Balanceador de Carga de Aplicación (ALB) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Terminación SSL │ │ +│ │ Enrutamiento basado en ruta │ │ +│ │ Comprobaciones de estado │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Clúster EKS │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ Grupo de Nodos 1│ │ Grupo de Nodos 2│ │ │ +│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ │ +│ │ │ │ Pod 1 │ │ │ │ Pod 3 │ │ │ │ +│ │ │ │ Pod 2 │ │ │ │ Pod 4 │ │ │ │ +│ │ │ └───────────┘ │ │ └───────────┘ │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + + + + + +Así es como fluye una solicitud de usuario a través de la infraestructura de SleakOps: + +**1. Resolución DNS** + +- El usuario ingresa `app.tudominio.com` en el navegador +- DNS (Route 53) resuelve a la IP del Balanceador de Carga de Aplicación + +**2. 
Procesamiento del Balanceador de Carga** + +- La solicitud llega al ALB (fuera de la VPC) +- La terminación SSL ocurre en el ALB +- El ALB realiza comprobaciones de estado en los objetivos +- Enruta la solicitud basándose en reglas de ruta/host + +**3. Entrada a la VPC** + +- La solicitud entra a la VPC a través de los grupos de destino del ALB +- El tráfico fluye hacia los nodos del clúster EKS + +**4. Procesamiento en Kubernetes** + +- La solicitud llega al Servicio de Kubernetes +- El Servicio balancea la carga hacia Pods saludables +- La aplicación procesa la solicitud + +**5. Ruta de Respuesta** + +- La aplicación envía la respuesta de vuelta por la misma ruta +- El ALB maneja la encriptación SSL para la respuesta +- La respuesta llega al navegador del usuario + +```mermaid +sequenceDiagram + participant Usuario + participant DNS + participant ALB + participant EKS + participant App + + Usuario->>DNS: app.dominio.com + DNS->>Usuario: IP ALB + Usuario->>ALB: Solicitud HTTPS + ALB->>EKS: Solicitud HTTP (VPC) + EKS->>App: Reenvío al Pod + App->>EKS: Respuesta + EKS->>ALB: Respuesta + ALB->>Usuario: Respuesta HTTPS +``` + + + + + +**Fuera de la VPC:** + +- **DNS Route 53**: Resolución de nombres de dominio +- **Balanceador de Carga de Aplicación**: Terminación SSL, enrutamiento, comprobaciones de estado +- **Puerta de enlace a Internet**: Acceso a Internet de la VPC + +**Dentro de la VPC:** + +- **Clúster EKS**: Plano de control de Kubernetes gestionado +- **Grupos de Nodos**: Instancias EC2 ejecutando nodos Kubernetes +- **Pods**: Contenedores de aplicaciones +- **Servicios**: Balanceo de carga de Kubernetes +- **Controladores Ingress**: Dirigen el tráfico externo a los servicios + +**Componentes de Red:** + +- **Subredes Públicas**: ALB y puertas de enlace NAT +- **Subredes Privadas**: Nodos EKS y Pods +- **Grupos de Seguridad**: Reglas de firewall +- **NACLs**: Control de acceso a nivel de subred + +**Almacenamiento y Datos:** + +- **Volúmenes EBS**: 
Almacenamiento persistente para Pods +- **RDS/Base de Datos**: Si está configurado +- **Buckets S3**: Almacenamiento de objetos + + + + + +Al desplegar múltiples aplicaciones con SleakOps, la arquitectura se expande para manejar múltiples servicios: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Internet │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ DNS (Route 53) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ app1.tudominio.com → ALB-1 │ │ +│ │ app2.tudominio.com → ALB-2 │ │ +│ │ api.tudominio.com → ALB-1 (path-based routing) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Múltiples Balanceadores de Carga │ +│ ┌─────────────────┐ ┌─────────────────┐ │ +│ │ ALB-1 │ │ ALB-2 │ │ +│ │ ┌───────────┐ │ │ ┌───────────┐ │ │ +│ │ │ SSL/TLS │ │ │ │ SSL/TLS │ │ │ +│ │ │ Routing │ │ │ │ Routing │ │ │ +│ │ └───────────┘ │ │ └───────────┘ │ │ +│ └─────────────────┘ └─────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Clúster EKS │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ Namespace 1 │ │ Namespace 2 │ │ │ +│ │ │ ┌───────────┐ │ │ ┌───────────┐ │ │ │ +│ │ │ │ App 1 │ │ │ │ App 2 │ │ │ │ +│ │ │ │ Pods │ │ │ │ Pods │ │ │ │ +│ │ │ └───────────┘ │ │ └───────────┘ │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ │ ┌─────────────────────────────────────────────────────┐ │ +│ │ │ Servicios Compartidos │ │ +│ │ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │ +│ │ │ │ RDS │ │ Redis │ │ S3 │ │ │ +│ │ │ └───────────┘ └───────────┘ └───────────┘ │ │ +│ │ 
└─────────────────────────────────────────────────────┘ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Características de la arquitectura multi-aplicación:** + +- **Aislamiento por Namespace**: Cada aplicación en su propio namespace +- **Balanceadores de carga dedicados**: ALBs separados para diferentes aplicaciones +- **Enrutamiento inteligente**: Basado en host y path +- **Servicios compartidos**: Base de datos y caché compartidos entre aplicaciones + + + + + +SleakOps implementa múltiples capas de seguridad en la arquitectura de red: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Capa de Seguridad │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ WAF (Web Application Firewall) │ │ +│ │ ┌───────────────────────────────────────────────────┐ │ │ +│ │ │ • Protección DDoS │ │ │ +│ │ │ • Filtrado de SQL Injection │ │ │ +│ │ │ • Bloqueo de IPs maliciosas │ │ │ +│ │ └───────────────────────────────────────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Balanceador de Carga (ALB) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ • Certificados SSL/TLS automáticos │ │ +│ │ • Terminación SSL │ │ +│ │ • Redirección HTTP → HTTPS │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ VPC │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Grupos de Seguridad │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ ALB → EKS │ │ EKS → RDS │ │ │ +│ │ │ Puerto 80/443 │ │ Puerto 5432 │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ 
└─────────────────────────────────────────────────────────┘ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Subredes │ │ +│ │ ┌─────────────────┐ ┌─────────────────┐ │ │ +│ │ │ Públicas │ │ Privadas │ │ │ +│ │ │ • ALB │ │ • EKS Nodes │ │ │ +│ │ │ • NAT Gateway │ │ • RDS │ │ │ +│ │ └─────────────────┘ └─────────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Network ACLs │ │ +│ │ • Control de acceso a nivel de subred │ │ +│ │ • Reglas de entrada y salida │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + + + + + +Comprende cómo fluyen los datos a través de la infraestructura de SleakOps: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Flujo de Datos │ +│ │ +│ Usuario → ALB → EKS → Aplicación → Base de Datos │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Capa de Aplicación │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ Frontend │ │ Backend │ │ │ +│ │ │ (React) │ │ (Django) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ┌─────────────────────────┴───────────────────────────────┐ │ +│ │ Capa de Datos │ │ +│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │ │ +│ │ │ PostgreSQL │ │ Redis │ │ S3 │ │ │ +│ │ │ (Principal) │ │ (Caché) │ │ (Archivos) │ │ │ +│ │ └───────────────┘ └───────────────┘ └─────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Capa de Monitoreo │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ CloudWatch │ │ Prometheus │ │ │ +│ │ │ (Métricas) │ │ (Métricas) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ 
+└─────────────────────────────────────────────────────────────────┘ +``` + +**Patrones de flujo de datos:** + +1. **Lectura**: Usuario → ALB → EKS → App → Caché/DB → Respuesta +2. **Escritura**: Usuario → ALB → EKS → App → DB → Confirmación +3. **Archivos**: Usuario → ALB → EKS → App → S3 → URL firmada +4. **Logs**: App → CloudWatch/Prometheus → Dashboard + + + + + +SleakOps implementa escalabilidad automática en múltiples niveles: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Escalabilidad Horizontal │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Nivel de Aplicación │ │ +│ │ │ │ +│ │ Baja carga: [Pod] [Pod] │ │ +│ │ Media carga: [Pod] [Pod] [Pod] [Pod] │ │ +│ │ Alta carga: [Pod] [Pod] [Pod] [Pod] [Pod] [Pod] │ │ +│ │ │ │ +│ │ • HPA (Horizontal Pod Autoscaler) │ │ +│ │ • Basado en CPU, memoria, métricas personalizadas │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Nivel de Infraestructura │ │ +│ │ │ │ +│ │ Baja carga: [Node1] [Node2] │ │ +│ │ Media carga: [Node1] [Node2] [Node3] │ │ +│ │ Alta carga: [Node1] [Node2] [Node3] [Node4] │ │ +│ │ │ │ +│ │ • Cluster Autoscaler │ │ +│ │ • Escalado automático de nodos EC2 │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Nivel de Base de Datos │ │ +│ │ │ │ +│ │ • RDS con réplicas de lectura │ │ +│ │ • Auto Scaling de almacenamiento │ │ +│ │ • Connection pooling │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + + + + + +SleakOps incluye capacidades de recuperación ante desastres: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Región Principal (us-east-1) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ VPC Principal │ │ +│ │ 
┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ EKS Cluster │ │ RDS Primary │ │ │ +│ │ │ (Activo) │ │ (Activo) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ Replicación + │ Automática +┌─────────────────────┴───────────────────────────────────────────┐ +│ Región Secundaria (us-west-2) │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ VPC Secundaria │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ EKS Cluster │ │ RDS Replica │ │ │ +│ │ │ (Standby) │ │ (Standby) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────┐ +│ Componentes de Backup │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ • Snapshots automáticos de EBS │ │ +│ │ • Backups automáticos de RDS │ │ +│ │ • Replicación cross-region de S3 │ │ +│ │ • Backup de configuraciones de Kubernetes │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Características de DR:** + +- **RTO (Recovery Time Objective)**: < 15 minutos +- **RPO (Recovery Point Objective)**: < 5 minutos +- **Failover automático**: Configurado para servicios críticos +- **Backups incrementales**: Cada 6 horas + + + + + +Sistema completo de monitoreo implementado por SleakOps: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Capa de Recolección │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Métricas de Aplicación │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ Prometheus │ │ CloudWatch │ │ │ +│ │ │ (Custom) │ │ (AWS) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ 
└─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Logs │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ Fluentd │ │ CloudWatch │ │ │ +│ │ │ (Collector) │ │ Logs │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Trazas │ │ +│ │ ┌───────────────┐ ┌───────────────┐ │ │ +│ │ │ Jaeger │ │ X-Ray │ │ │ +│ │ │ (Tracing) │ │ (AWS) │ │ │ +│ │ └───────────────┘ └───────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Capa de Visualización │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │ │ +│ │ │ Grafana │ │ CloudWatch │ │ SleakOps │ │ │ +│ │ │ (Dashboards) │ │ (Dashboards) │ │ (Console) │ │ │ +│ │ └───────────────┘ └───────────────┘ └─────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────┬───────────────────────────────────────────┘ + │ +┌─────────────────────┴───────────────────────────────────────────┐ +│ Capa de Alertas │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ ┌───────────────┐ ┌───────────────┐ ┌─────────────┐ │ │ +│ │ │ AlertManager │ │ SNS │ │ Slack │ │ │ +│ │ │ (Prometheus) │ │ (AWS) │ │ (Webhook) │ │ │ +│ │ └───────────────┘ └───────────────┘ └─────────────┘ │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + + + + + +SleakOps implementa estrategias de optimización de costos: + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Optimización de Compute │ +│ 
┌─────────────────────────────────────────────────────────┐ │ +│ │ Instancias Spot │ │ +│ │ • 70% de descuento en instancias EC2 │ │ +│ │ • Manejo automático de interrupciones │ │ +│ │ • Mix de On-Demand y Spot │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Auto Scaling │ │ +│ │ • Escalado basado en demanda │ │ +│ │ • Apagado automático en horarios no laborales │ │ +│ │ • Escalado predictivo │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────┐ +│ Optimización de Almacenamiento │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Tiering Inteligente │ │ +│ │ • S3 Intelligent Tiering │ │ +│ │ • Lifecycle policies automáticas │ │ +│ │ • Compresión de logs │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────┐ +│ Optimización de Red │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ CDN y Caché │ │ +│ │ • CloudFront para contenido estático │ │ +│ │ • Redis para caché de aplicación │ │ +│ │ • Compresión gzip automática │ │ +│ └─────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +**Ahorros típicos:** + +- **Instancias Spot**: 60-70% de ahorro en compute +- **Auto Scaling**: 30-40% de ahorro en recursos no utilizados +- **S3 Intelligent Tiering**: 20-30% de ahorro en almacenamiento +- **CDN**: 40-50% de ahorro en transferencia de datos + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-multiple-domains-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-multiple-domains-configuration.mdx new file mode 100644 index 000000000..e7273a961 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-multiple-domains-configuration.mdx @@ -0,0 +1,212 @@ +--- +sidebar_position: 3 +title: "Configuración de Múltiples Dominios en Kubernetes Ingress" +description: "Configuración manual de Kubernetes Ingress para múltiples dominios cuando la configuración raíz del entorno SleakOps presenta problemas" +date: "2024-12-23" +category: "workload" +tags: ["ingress", "kubernetes", "dominios", "lens", "resolución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Múltiples Dominios en Kubernetes Ingress + +**Fecha:** 23 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Ingress, Kubernetes, Dominios, Lens, Resolución de problemas + +## Descripción del Problema + +**Contexto:** Cuando SleakOps presenta problemas con múltiples configuraciones raíz de entorno en producción, se vuelve necesaria la configuración manual de Ingress para enrutar correctamente múltiples dominios al mismo servicio. 
+ +**Síntomas observados:** + +- Configuración del dominio del entorno mostrando dominio incorrecto (por ejemplo, mostrando simplee.cl en lugar de simplee.com.mx) +- Imposibilidad de crear nuevos servicios debido a un enrutamiento de dominio incorrecto +- Múltiples dominios deben apuntar al mismo servicio backend +- Entorno de producción afectado por un error en la configuración raíz del entorno + +**Configuración relevante:** + +- Múltiples dominios: web.simplee.com.mx, simplee.com.mx, www.simplee.com.mx +- Servicio backend: mx-simplee-web-prod-mx-2-mx-simplee-web-svc +- Puerto del servicio: 7500 +- Herramienta usada: Lens para gestión de Kubernetes + +**Condiciones de error:** + +- Error en SleakOps con 2 raíces de entorno en producción +- Enrutamiento de dominio apuntando al entorno incorrecto +- Creación de servicio bloqueada debido a configuración de dominio incorrecta + +## Solución Detallada + + + +Cuando SleakOps presenta problemas en la configuración raíz del entorno, puedes configurar manualmente el recurso Ingress usando Lens: + +1. **Abre Lens** y conéctate a tu clúster +2. **Navega a Red** → **Ingresses** +3. **Encuentra el Ingress objetivo** en el namespace adecuado +4. **Edita el recurso Ingress** haciendo clic en el botón de editar +5. 
**Actualiza la configuración** con las reglas correctas y ajustes TLS + + + + + +Aquí está la configuración completa para múltiples dominios apuntando al mismo servicio: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: your-ingress-name + namespace: your-namespace +spec: + tls: + - hosts: + - web.simplee.com.mx + - simplee.com.mx + - www.simplee.com.mx + secretName: your-tls-secret + rules: + - host: web.simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 + - host: simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 + - host: www.simplee.com.mx + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: mx-simplee-web-prod-mx-2-mx-simplee-web-svc + port: + number: 7500 +``` + + + + + +La sección TLS asegura que todos los dominios estén cubiertos por certificados SSL: + +```yaml +tls: + - hosts: + - web.simplee.com.mx + - simplee.com.mx + - www.simplee.com.mx + secretName: your-tls-secret-name +``` + +**Consideraciones importantes:** + +- Todos los dominios deben estar listados en la sección hosts de TLS +- El secreto TLS debe contener certificados para todos los dominios listados +- El certificado debe ser válido para todos los subdominios y el dominio raíz + + + + + +Antes de aplicar la configuración de Ingress, verifica tu servicio backend: + +1. **Verifica que el servicio exista:** + + ```bash + kubectl get svc mx-simplee-web-prod-mx-2-mx-simplee-web-svc -n your-namespace + ``` + +2. **Verifica el puerto del servicio:** + + ```bash + kubectl describe svc mx-simplee-web-prod-mx-2-mx-simplee-web-svc -n your-namespace + ``` + +3. 
**Prueba la conectividad del servicio:** + ```bash + kubectl port-forward svc/mx-simplee-web-prod-mx-2-mx-simplee-web-svc 8080:7500 -n your-namespace + ``` + + + + + +**Si los dominios no resuelven:** + +1. **Verifica la configuración DNS:** + + ```bash + nslookup simplee.com.mx + nslookup web.simplee.com.mx + nslookup www.simplee.com.mx + ``` + +2. **Verifica el controlador Ingress:** + + ```bash + kubectl get pods -n ingress-nginx + kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + ``` + +3. **Revisa el estado del Ingress:** + ```bash + kubectl describe ingress your-ingress-name -n your-namespace + ``` + +**Si los certificados SSL no funcionan:** + +1. **Verifica el secreto del certificado:** + + ```bash + kubectl get secret your-tls-secret -n your-namespace -o yaml + ``` + +2. **Verifica la validez del certificado:** + ```bash + openssl x509 -in <(kubectl get secret your-tls-secret -n your-namespace -o jsonpath='{.data.tls\.crt}' | base64 -d) -text -noout + ``` + + + + + +Esta configuración manual sirve como solución temporal mientras el equipo de SleakOps corrige el error en la configuración raíz del entorno: + +1. **Monitorea actualizaciones de la plataforma** que resuelvan el problema de múltiples raíces de entorno +2. **Documenta los cambios manuales** realizados para referencia futura +3. **Prueba todos los puntos finales de dominio** después de aplicar la configuración +4. 
**Configura monitoreo** para asegurar que todos los dominios permanezcan accesibles + +**Lista de verificación para pruebas:** + +- [ ] https://web.simplee.com.mx carga correctamente +- [ ] https://simplee.com.mx carga correctamente +- [ ] https://www.simplee.com.mx carga correctamente +- [ ] Los certificados SSL son válidos para todos los dominios +- [ ] Todos los dominios enrutan al servicio correcto + + + +--- + +_Este FAQ fue generado automáticamente el 23 de diciembre de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-route53-records-not-created.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-route53-records-not-created.mdx new file mode 100644 index 000000000..0035c6789 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ingress-route53-records-not-created.mdx @@ -0,0 +1,192 @@ +--- +sidebar_position: 3 +title: "Registros de Ingress Route53 No Se Están Creando" +description: "Solución para registros DNS faltantes en Route53 para servicios web públicos con configuración de ingress" +date: "2024-05-23" +category: "workload" +tags: ["ingress", "route53", "dns", "webservice", "networking"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Registros de Ingress Route53 No Se Están Creando + +**Fecha:** 23 de mayo de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Ingress, Route53, DNS, Servicio web, Redes + +## Descripción del Problema + +**Contexto:** Los usuarios han configurado servicios web con ingress público en el entorno de producción de SleakOps, pero los registros DNS correspondientes no se están creando automáticamente en Route53. 
+ +**Síntomas Observados:** + +- Servicio web configurado con ingress público (por ejemplo, organizations-api) +- No se crea el registro DNS correspondiente en Route53 +- El servicio aparece como "interno" en lugar de "público" en algunos casos +- Múltiples servicios potencialmente afectados por el mismo problema + +**Configuración Relevante:** + +- Entorno: Producción +- Tipo de servicio: Servicio web con ingress público +- Proveedor DNS: AWS Route53 +- Controlador de ingress: Kubernetes ingress + +**Condiciones de Error:** + +- Registros DNS faltantes para servicios web públicos +- Servicios no accesibles mediante los nombres de dominio configurados +- La configuración de ingress parece correcta pero la resolución DNS falla + +## Solución Detallada + + + +Primero, verifica si el ingress de tu servicio web está configurado correctamente: + +1. Ve a tu **Proyecto** → **Cargas de trabajo** → **Servicios web** +2. Selecciona el servicio afectado (por ejemplo, organizations-api) +3. Revisa la sección **Redes** +4. Asegúrate que el **Tipo de Ingress** esté configurado como **Público** +5. Verifica que la configuración del **Dominio** sea correcta + +```yaml +# Configuración esperada +ingress: + enabled: true + type: public + domain: organizations-api.tudominio.com + tls: true +``` + + + + + +SleakOps usa External-DNS para crear automáticamente registros en Route53. Verifica su estado: + +1. Accede a tu clúster vía **kubectl** +2. Consulta el estado de los pods de External-DNS: + +```bash +kubectl get pods -n kube-system | grep external-dns +kubectl logs -n kube-system deployment/external-dns +``` + +3. Busca errores relacionados con: + - Permisos AWS + - Acceso a Route53 + - Procesamiento de anotaciones de ingress + + + + + +Asegúrate de que tu clúster tenga los permisos adecuados para gestionar registros de Route53: + +1. Verifica si la cuenta de servicio de External-DNS tiene el rol IAM correcto +2. 
Los permisos requeridos incluyen: + - `route53:ChangeResourceRecordSets` + - `route53:ListHostedZones` + - `route53:ListResourceRecordSets` + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": ["route53:ChangeResourceRecordSets"], + "Resource": "arn:aws:route53:::hostedzone/*" + }, + { + "Effect": "Allow", + "Action": ["route53:ListHostedZones", "route53:ListResourceRecordSets"], + "Resource": "*" + } + ] +} +``` + + + + + +Para verificar manualmente y potencialmente corregir los registros DNS: + +1. **Revisa los registros actuales en Route53:** + + - Ve a la Consola AWS → Route53 → Zonas alojadas + - Selecciona la zona alojada de tu dominio + - Busca registros A/CNAME faltantes + +2. **Obtén el endpoint del Load Balancer:** + + ```bash + kubectl get ingress -n tu-namespace + kubectl describe ingress tu-ingress-servicio -n tu-namespace + ``` + +3. **Crea manualmente el registro si es necesario:** + - Tipo de registro: A (Alias) o CNAME + - Nombre: organizations-api.tudominio.com + - Valor: nombre DNS del Load Balancer + + + + + +Si tu servicio aparece como "interno" en lugar de "público": + +1. Ve al **Panel de SleakOps** → **Tu Proyecto** +2. Navega a **Cargas de trabajo** → **Servicios web** +3. Selecciona el servicio (organizations-api) +4. En la sección **Redes**: + + - Cambia el **Tipo de Ingress** de "Interno" a "Público" + - Asegúrate que el **Dominio** esté configurado correctamente + - Guarda los cambios + +5. Espera de 5 a 10 minutos para la propagación DNS +6. Verifica que el registro en Route53 se haya creado + + + + + +Si el problema persiste: + +1. **Revisa las anotaciones del ingress:** + + ```bash + kubectl get ingress tu-servicio -o yaml + ``` + + Busca anotaciones de External-DNS como: + + ```yaml + annotations: + external-dns.alpha.kubernetes.io/hostname: organizations-api.tudominio.com + ``` + +2. 
**Reinicia External-DNS:** + + ```bash + kubectl rollout restart deployment/external-dns -n kube-system + ``` + +3. **Verifica registros conflictivos:** + + - Asegúrate que no existan registros manuales en Route53 que entren en conflicto con los automáticos + - Elimina entradas DNS duplicadas o conflictivas + +4. **Verifica la zona alojada:** + - Confirma que la zona alojada correcta exista en Route53 + - Revisa que la delegación del dominio esté configurada correctamente + + + +--- + +_Esta FAQ fue generada automáticamente el 23 de mayo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-management-cleanup-logs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-management-cleanup-logs.mdx new file mode 100644 index 000000000..b312df80d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-management-cleanup-logs.mdx @@ -0,0 +1,148 @@ +--- +sidebar_position: 15 +title: "Gestión de Trabajos y Visualización de Registros" +description: "Cómo gestionar la ejecución de trabajos, limpiar el historial de trabajos y ver registros de trabajos completados" +date: "2025-02-04" +category: "workload" +tags: ["trabajos", "registros", "limpieza", "gestión", "ejecución"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gestión de Trabajos y Visualización de Registros + +**Fecha:** 4 de febrero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Trabajos, Registros, Limpieza, Gestión, Ejecución + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan gestionar la ejecución de trabajos en SleakOps, incluyendo la limpieza de ejecuciones antiguas y la visualización de registros tanto para trabajos exitosos como fallidos. 
+ +**Síntomas observados:** + +- El historial de ejecuciones de trabajos se acumula con el tiempo sin opción de limpieza +- Los registros son visibles para trabajos fallidos pero no para los exitosos +- Los usuarios no pueden eliminar manualmente ejecuciones antiguas desde la interfaz +- El historial de trabajos se vuelve desordenado con ejecuciones antiguas que ya no son relevantes + +**Configuración relevante:** + +- Plataforma: Interfaz de gestión de trabajos de SleakOps +- Tipos de trabajos: Todos los tipos (programados, manuales, etc.) +- Visibilidad de registros: Actualmente limitada a ejecuciones fallidas + +**Condiciones de error:** + +- No existe un mecanismo integrado de limpieza para el historial de trabajos +- Visualización inconsistente de registros entre trabajos exitosos y fallidos +- La limpieza manual requiere intervención del equipo de soporte + +## Solución Detallada + + + +Actualmente, SleakOps presenta las siguientes limitaciones en la gestión de trabajos: + +1. **Sin limpieza autoservicio**: Los usuarios no pueden eliminar ejecuciones antiguas a través de la interfaz +2. **Acceso limitado a registros**: Los registros se muestran automáticamente para trabajos fallidos pero no para los exitosos +3. **Intervención manual requerida**: La limpieza requiere contactar al soporte para la eliminación manual + +Estas son limitaciones conocidas que se están abordando en futuras actualizaciones de la plataforma. + + + + + +Para limpiar ejecuciones antiguas de trabajos: + +1. **Contactar Soporte**: Enviar un correo electrónico a support@sleakops.com +2. **Especificar trabajos a eliminar**: Proporcionar detalles sobre qué ejecuciones de trabajos desea eliminar +3. **Incluir contexto del proyecto**: Mencionar el nombre de su proyecto y los nombres de los trabajos +4. 
**Esperar confirmación**: El equipo de soporte eliminará manualmente las ejecuciones especificadas + +**Plantilla de correo:** + +``` +Asunto: Solicitud de limpieza de ejecuciones de trabajos + +Hola equipo de SleakOps, + +Quisiera solicitar la limpieza de ejecuciones antiguas de los siguientes trabajos: +- Proyecto: [Nombre de su proyecto] +- Trabajos: [Nombres específicos de trabajos o "todas las ejecuciones antiguas"] +- Rango de tiempo: [Opcional: ejecuciones con más de X días] + +¡Gracias! +``` + + + + + +Actualmente, los registros de trabajos exitosos no se muestran automáticamente en la interfaz. Para acceder a ellos: + +1. **Volver a ejecutar el trabajo**: Ejecute el mismo trabajo con la configuración idéntica +2. **Monitorear durante la ejecución**: Observe los registros mientras el trabajo se está ejecutando +3. **Contactar soporte**: Solicite acceso específico a los registros de trabajos exitosos ya completados + +**Solución alternativa para retención de registros:** + +- Considere agregar registro a sistemas externos dentro de sus scripts de trabajo +- Use archivos de salida de trabajo que persistan más allá de la ejecución +- Implemente mecanismos personalizados de registro en el código de sus trabajos + + + + + +Para volver a ejecutar un trabajo con la misma configuración: + +1. **Navegue a la sección de Trabajos** en su proyecto SleakOps +2. **Encuentre el trabajo** que desea reejecutar +3. **Haga clic en el nombre del trabajo** para ver detalles +4. **Use el botón "Ejecutar de nuevo" o "Ejecutar"** (si está disponible) +5. **Verifique que la configuración** coincida con su ejecución previa +6. **Monitoree los registros** durante la ejecución para capturar la salida + +Si la opción de reejecución no está disponible en la interfaz, puede que necesite: + +- Recrear manualmente la configuración del trabajo +- Usar los mismos parámetros y ajustes que la ejecución anterior + + + + + +**Para una mejor gestión de trabajos:** + +1. 
**Limpieza regular**: Solicite limpieza mensual o trimestral +2. **Nombres significativos para trabajos**: Use nombres descriptivos para identificar fácilmente los trabajos +3. **Registro externo**: Implemente registro en sistemas externos para retención a largo plazo +4. **Documentación**: Mantenga un registro externo de configuraciones importantes de trabajos + +**Para la gestión de registros:** + +1. **Captura durante la ejecución**: Monitoree los trabajos mientras se ejecutan para ver la salida +2. **Exporte registros importantes**: Guarde información crítica de registros externamente +3. **Use salidas de trabajos**: Diseñe trabajos para que escriban resultados importantes en archivos +4. **Implemente notificaciones**: Configure alertas para el estado de finalización de trabajos + + + + + +SleakOps está trabajando en implementar las siguientes funcionalidades de gestión de trabajos: + +1. **Limpieza autoservicio**: Los usuarios podrán eliminar ejecuciones antiguas +2. **Visualización mejorada de registros**: Los registros serán accesibles tanto para trabajos exitosos como fallidos +3. **Gestión del historial de trabajos**: Mejor filtrado y organización de ejecuciones +4. **Políticas de retención de registros**: Configuraciones para retención de registros + +Estas mejoras están planificadas para futuras versiones para mejorar la experiencia del usuario. 
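Mientras estas mejoras llegan, a nivel de Kubernetes el campo `ttlSecondsAfterFinished` (estable en `batch/v1`) permite que los Jobs terminados se eliminen solos. Si gestionas manifiestos directamente con kubectl, un esbozo mínimo (el nombre `limpieza-ejemplo` es ilustrativo; ten en cuenta que los recursos gestionados por SleakOps podrían sobrescribir cambios manuales):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: limpieza-ejemplo # nombre ilustrativo
spec:
  # El Job y sus Pods se eliminan automáticamente
  # 24 horas después de terminar, con éxito o con fallo.
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      containers:
        - name: tarea
          image: busybox
          command: ["sh", "-c", "echo hecho"]
      restartPolicy: Never
```

Con TTL activo, los registros del Pod también desaparecen al eliminarse el Job, así que conviene combinarlo con el registro externo mencionado arriba.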
+ + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 4 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-reexecution-kubernetes.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-reexecution-kubernetes.mdx new file mode 100644 index 000000000..8873dba3c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/job-reexecution-kubernetes.mdx @@ -0,0 +1,191 @@ +--- +sidebar_position: 3 +title: "Re-ejecución de Jobs en Kubernetes" +description: "Cómo volver a ejecutar Jobs de Kubernetes con los mismos parámetros sin crear nuevos" +date: "2024-01-15" +category: "workload" +tags: ["kubernetes", "jobs", "re-ejecucion", "batch"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Re-ejecución de Jobs en Kubernetes + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Kubernetes, Jobs, Re-ejecución, Batch + +## Descripción del Problema + +**Contexto:** El usuario necesita re-ejecutar un Job de Kubernetes con los mismos parámetros después de que haya finalizado, pero la plataforma no ofrece una opción directa de "volver a ejecutar", lo que obliga a crear nuevos Jobs cada vez. 
+ +**Síntomas observados:** + +- No hay botón de "re-ejecutar" o "volver a ejecutar" disponible para Jobs completados +- Es necesario crear un nuevo Job cada vez con los mismos parámetros +- Proceso manual requerido para duplicar la configuración del Job +- Pérdida del historial de ejecución y la relación entre ejecuciones de Jobs + +**Configuración relevante:** + +- Tipo de Job: Job Batch de Kubernetes +- Estado del Job: Completado +- Plataforma: Entorno Kubernetes de SleakOps +- Comportamiento deseado: Volver a ejecutar el Job existente con los mismos parámetros + +**Condiciones de error:** + +- Ocurre después de la finalización del Job +- No hay mecanismo incorporado de re-ejecución disponible +- Se requiere recreación manual para cada ejecución + +## Solución Detallada + + + +Los Jobs de Kubernetes están diseñados para ejecutarse una vez y completarse. Una vez que un Job termina exitosamente, no puede "reiniciarse" en el sentido tradicional. Esto es por diseño en Kubernetes: + +- Los Jobs son inmutables una vez creados +- Los Jobs completados permanecen para historial y acceso a logs +- La re-ejecución requiere crear un nuevo objeto Job + +Este comportamiento asegura consistencia y previene re-ejecuciones accidentales de procesos batch críticos. + + + + + +En SleakOps, puedes crear plantillas reutilizables de Jobs para simplificar la re-ejecución: + +1. **Guardar Job como plantilla**: + + - Ve a tu Job completado + - Haz clic en **"Guardar como plantilla"** + - Asígnale un nombre descriptivo + +2. **Crear desde plantilla**: + + - Ve a **Jobs** → **Crear nuevo** + - Selecciona **"Desde plantilla"** + - Elige tu plantilla guardada + - Modifica parámetros si es necesario + +3. 
**Beneficios de las plantillas**: + - Preserva toda la configuración + - Permite modificar parámetros + - Mantiene el historial de ejecución + - Más rápido que la recreación manual + + + + + +Si necesitas recrear manualmente un Job usando kubectl: + +```bash +# Obtén la configuración original del Job +kubectl get job my-job -o yaml > job-backup.yaml + +# Edita el archivo para eliminar status, metadata.uid y spec.selector +# Cambia metadata.name para evitar conflictos +sed -i 's/name: my-job/name: my-job-rerun/' job-backup.yaml + +# Una vez editado el archivo, crea el nuevo Job +kubectl create -f job-backup.yaml +``` + +**Importante**: Elimina estos campos del YAML: + +- `metadata.uid` +- `metadata.resourceVersion` +- `metadata.creationTimestamp` +- sección `status` +- `spec.selector` y la etiqueta `controller-uid` de `spec.template.metadata.labels` (el API server rechaza el Job clonado si conserva el selector generado por el controlador) + + + + + +Si necesitas ejecutar el mismo Job múltiples veces, considera usar un CronJob: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: my-repeated-job +spec: + schedule: "@daily" # o disparo manual + jobTemplate: + spec: + template: + spec: + containers: + - name: my-container + image: my-image + command: ["my-command"] + restartPolicy: OnFailure +``` + +**Beneficios del CronJob**: + +- Puede ser disparado manualmente +- Mantiene historial de jobs +- Soporta ejecución programada +- Mejor para tareas repetidas + + + + + +Para una re-ejecución óptima de Jobs en SleakOps: + +1. **Durante la creación del Job**: + + - Usa nombres descriptivos con versión/fecha + - Documenta parámetros en la descripción del Job + - Guarda como plantilla inmediatamente después de crear + +2. **Para la re-ejecución**: + + - Usa la opción **"Crear desde plantilla"** + - Actualiza el nombre con nueva marca de tiempo/versión + - Verifica parámetros antes de ejecutar + - Conserva el Job original como referencia + +3.
**Buenas prácticas**: + - Nombra los Jobs con timestamps: `procesamiento-datos-2024-01-15` + - Usa etiquetas para agrupar ejecuciones relacionadas + - Documenta cambios de parámetros entre ejecuciones + - Limpia Jobs antiguos periódicamente + + + + + +Problemas comunes al recrear Jobs: + +**Conflictos de nombre**: + +```bash +# Error: Job ya existe +# Solución: Usa un nombre diferente +metadata: + name: my-job-v2 # o my-job-20240115 +``` + +**Conflictos de recursos**: + +- Asegúrate que los pods del Job anterior estén eliminados +- Revisa reclamaciones de volúmenes persistentes +- Verifica permisos de la cuenta de servicio + +**Problemas con parámetros**: + +- Revisa variables de entorno +- Verifica referencias a secretos y configmaps +- Asegura que las versiones de imagen sean correctas + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/jobs-timeout-display-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/jobs-timeout-display-issue.mdx new file mode 100644 index 000000000..0ecb88ea1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/jobs-timeout-display-issue.mdx @@ -0,0 +1,195 @@ +--- +sidebar_position: 3 +title: "Error de Tiempo de Espera en la Visualización de Trabajos a Pesar de la Ejecución Exitosa" +description: "Solución para trabajos que muestran error de tiempo de espera en la interfaz mientras se ejecutan correctamente" +date: "2025-02-10" +category: "workload" +tags: + [ + "trabajos", + "tiempo de espera", + "ui", + "visualización", + "solución de problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Tiempo de Espera en la Visualización de Trabajos a Pesar de la Ejecución Exitosa + +**Fecha:** 10 de febrero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Trabajos, Tiempo 
de espera, UI, Visualización, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los trabajos de larga duración en SleakOps que tardan varias horas en completarse muestran errores de tiempo de espera en la interfaz, aunque se estén ejecutando correctamente en segundo plano. + +**Síntomas observados:** + +- Los trabajos muestran estado de error "Tiempo agotado" en el panel de SleakOps +- Los trabajos están realmente ejecutándose y completándose con éxito +- El error de tiempo de espera parece ser un problema de visualización, no una falla real en la ejecución +- Los trabajos de larga duración (varias horas) tienen más probabilidades de experimentar este problema + +**Configuración relevante:** + +- Tipo de trabajo: Trabajos por lotes de larga duración +- Tiempo de ejecución: Varias horas +- Plataforma: Sistema de gestión de trabajos SleakOps +- Visualización de estado: Muestra incorrectamente error de tiempo de espera + +**Condiciones de error:** + +- El error aparece en la UI tras un tiempo prolongado de ejecución +- Los trabajos continúan ejecutándose a pesar de la visualización de tiempo agotado +- El problema afecta la visibilidad y monitoreo del estado del trabajo +- Ocurre con trabajos que superan ciertos umbrales de tiempo + +## Solución Detallada + + + +Este es un problema conocido de visualización en la UI donde: + +1. **Los trabajos continúan ejecutándose normalmente** en segundo plano +2. **El tiempo de espera en la UI es solo cosmético** - no afecta la ejecución real del trabajo +3. **El estado de finalización del trabajo** puede no actualizarse correctamente en tiempo real +4. **La infraestructura subyacente del trabajo** continúa funcionando como se espera + +Este es un problema a nivel de plataforma que requiere una corrección por parte del equipo de desarrollo de SleakOps. + + + + + +Para confirmar que su trabajo se está ejecutando correctamente a pesar de la visualización del tiempo de espera: + +1. 
**Revise los registros del trabajo directamente:** + + ```bash + # Acceda a los registros del trabajo mediante kubectl si está disponible + kubectl logs -f job/your-job-name + ``` + +2. **Monitoree el uso de recursos:** + + - Verifique el uso de CPU/memoria en el clúster + - Confirme si los pods de su trabajo siguen activos + +3. **Revise la salida del trabajo:** + + - Monitoree cualquier archivo de salida o base de datos a la que su trabajo escriba + - Verifique que se estén generando resultados intermedios + +4. **Use kubectl para revisar el estado del trabajo:** + ```bash + kubectl get jobs + kubectl describe job your-job-name + ``` + + + + + +Mientras espera la corrección de la plataforma, puede: + +1. **Configurar monitoreo externo:** + + ```yaml + # Agregue puntos de verificación de salud a su trabajo + apiVersion: batch/v1 + kind: Job + metadata: + name: long-running-job + spec: + template: + spec: + containers: + - name: worker + image: your-image + # Agregue actualizaciones periódicas de estado + command: ["/bin/sh"] + args: + [ + "-c", + "your-job-command && echo 'Trabajo completado con éxito' > /tmp/status", + ] + ``` + +2. **Implemente registro de progreso:** + + - Añada actualizaciones regulares de progreso en el código de su trabajo + - Use registro estructurado para rastrear fases del trabajo + - Considere usar puntos de estado externos + +3. **Use notificaciones de finalización de trabajo:** + - Configure webhooks o notificaciones para la finalización del trabajo + - Configure alertas para fallas reales del trabajo frente a tiempos de espera en la UI + + + + + +Para minimizar problemas con trabajos de larga duración: + +1. 
**Implemente puntos de control:** + + ```python + # Ejemplo: Guardar progreso periódicamente + import json + + def save_checkpoint(progress_data): + with open('/tmp/checkpoint.json', 'w') as f: + json.dump(progress_data, f) + + def load_checkpoint(): + try: + with open('/tmp/checkpoint.json', 'r') as f: + return json.load(f) + except FileNotFoundError: + return None + ``` + +2. **Divida trabajos grandes:** + + - Considere partir trabajos muy largos en partes más pequeñas + - Use dependencias de trabajos para encadenar trabajos más pequeños + - Implemente manejo adecuado de errores y lógica de reintento + +3. **Agregue verificaciones de salud:** + ```yaml + # Agregue probes de liveness y readiness + livenessProbe: + exec: + command: + - /bin/sh + - -c + - "test -f /tmp/job-alive" + initialDelaySeconds: 30 + periodSeconds: 60 + ``` + + + + + +El equipo de desarrollo de SleakOps está al tanto de este problema y está trabajando en una corrección. El cronograma incluye: + +1. **Corto plazo (unos días):** Despliegue de la corrección en la plataforma +2. **Mediano plazo:** Mejoras en monitoreo y visualización del estado del trabajo +3. **Largo plazo:** Mejor manejo de cargas de trabajo de larga duración + +**Lo que la corrección abordará:** + +- Manejo correcto del tiempo de espera para trabajos de larga duración +- Mejoras en la actualización del estado en la UI +- Mejor monitoreo en tiempo real de los trabajos +- Gestión mejorada del ciclo de vida de los trabajos + +**No se requiere acción por parte de los usuarios** - la corrección se aplicará automáticamente en la plataforma.
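Mientras se despliega la corrección, una comprobación rápida permite confirmar que un Job sigue vivo aunque la UI muestre «Tiempo agotado» (el nombre `tu-job` es hipotético):

```bash
# Réplicas activas y completadas según el estado real del Job
kubectl get job tu-job -o jsonpath='{.status.active} activos / {.status.succeeded} completados{"\n"}'

# Los pods del Job llevan la etiqueta job-name que asigna el controlador
kubectl get pods -l job-name=tu-job
```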
+ + + +--- + +_Esta FAQ fue generada automáticamente el 10 de febrero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/karpenter-pod-scheduling-warnings.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/karpenter-pod-scheduling-warnings.mdx new file mode 100644 index 000000000..a5d762a67 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/karpenter-pod-scheduling-warnings.mdx @@ -0,0 +1,238 @@ +--- +sidebar_position: 3 +title: "Advertencias de Programación de Pods en Karpenter y Aprovisionamiento de Nodos" +description: "Entendiendo las advertencias de Karpenter cuando los pods no pueden programarse y cómo funciona el aprovisionamiento de nodos" +date: "2024-04-17" +category: "cluster" +tags: + [ + "karpenter", + "programacion-pods", + "aprovisionamiento-nodos", + "advertencias", + "resolucion-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Advertencias de Programación de Pods en Karpenter y Aprovisionamiento de Nodos + +**Fecha:** 17 de abril de 2024 +**Categoría:** Clúster +**Etiquetas:** Karpenter, Programación de Pods, Aprovisionamiento de Nodos, Advertencias, Resolución de Problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan advertencias de programación de pods en entornos de producción al usar Karpenter para el autoescalado de nodos, lo que puede causar errores 502 e interrupciones del servicio.
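Antes de entrar en detalle, una forma rápida de distinguir una advertencia transitoria de aprovisionamiento de un fallo real es mirar los pods pendientes y los eventos de programación (comandos estándar de kubectl; los nombres de recursos dependen de cada clúster, y `nodeclaims` asume Karpenter v1beta1 o posterior):

```bash
# Pods que todavía no consiguen nodo
kubectl get pods -A --field-selector=status.phase=Pending

# Últimos eventos FailedScheduling: el mensaje indica qué recurso falta
kubectl get events -A --field-selector=reason=FailedScheduling --sort-by=.lastTimestamp | tail -n 5

# NodeClaims que Karpenter está aprovisionando como respuesta
kubectl get nodeclaims
```

Si los eventos de `FailedScheduling` desaparecen en pocos minutos y aparecen NodeClaims nuevos, se trata del flujo normal de aprovisionamiento.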
+ +**Síntomas Observados:** + +- Los pods muestran estado "running" pero presentan advertencias de programación +- Errores 502 intermitentes que afectan a los usuarios finales +- Mensajes de advertencia aparecen cuando no hay nodos disponibles para colocar pods +- Interrupciones del servicio durante los períodos de aprovisionamiento de nodos + +**Configuración Relevante:** + +- Entorno: Producción +- Autoescalador: Karpenter +- Estado de pods: En ejecución con advertencias +- Aprovisionamiento de nodos: Automático + +**Condiciones de Error:** + +- Las advertencias aparecen cuando el clúster no tiene nodos disponibles para nuevos pods +- Ocurre durante picos de tráfico o eventos de escalado +- Puede correlacionarse con errores 502 visibles para usuarios +- Condición temporal durante el proceso de aprovisionamiento de nodos + +## Solución Detallada + + + +Cuando ves advertencias de programación de pods con Karpenter, esto es generalmente un comportamiento normal: + +1. **Disparo de advertencia**: La alerta aparece cuando ningún nodo existente tiene recursos suficientes para nuevos pods +2. **Respuesta automática**: Karpenter detecta esta condición y comienza a aprovisionar nuevos nodos +3. **Estado temporal**: Este es un período de transición, no un error permanente +4. **Tiempo de resolución**: El proceso usualmente tarda entre 2 y 3 minutos en completarse + +La advertencia indica que Karpenter está funcionando correctamente, no que haya un problema. + + + + + +El aprovisionamiento de nodos en Karpenter sigue esta secuencia: + +1. **Detección**: Karpenter identifica pods no programables +2. **Selección de instancia**: Elige el tipo de instancia EC2 apropiado según los requisitos +3. **Lanzamiento de instancia**: Solicita una nueva instancia EC2 a AWS +4. **Inicialización del nodo**: La instancia arranca e instala los componentes necesarios +5. **Registro en el clúster**: El nuevo nodo se une al clúster de Kubernetes +6. 
**Programación de pods**: Los pods pendientes se programan en el nuevo nodo + +**Cronología**: Este proceso completo suele tardar entre 2 y 3 minutos. + + + + + +Cuando ves advertencias de programación, puedes monitorear el proceso de aprovisionamiento de nodos: + +**En el Panel de SleakOps:** + +1. Navega a **Clúster** → **Nodos** +2. Busca nodos con estado "Creating" o "Pending" +3. Monitorea la lista de nodos para nuevas incorporaciones + +**Usando kubectl:** + +```bash +# Observar nodos en creación +kubectl get nodes -w + +# Revisar logs de Karpenter +kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter + +# Ver pods pendientes +kubectl get pods --field-selector=status.phase=Pending +``` + +**Comportamiento esperado:** Deberías ver nuevos nodos aparecer dentro de 2 a 3 minutos después de la advertencia. + + + + + +La correlación entre advertencias de programación de pods y errores 502 puede ocurrir cuando: + +1. **Agotamiento de recursos**: Los pods existentes consumen todos los recursos disponibles +2. **Retraso en el escalado**: Nuevas solicitudes llegan durante la ventana de aprovisionamiento de 2 a 3 minutos +3. **Comportamiento del balanceador de carga**: Algunas solicitudes pueden fallar mientras la nueva capacidad se activa + +**Estrategias de mitigación:** + +- Configurar solicitudes y límites de recursos adecuados +- Implementar chequeos de salud y probes de readiness apropiados +- Considerar el preescalado para patrones de tráfico predecibles +- Usar Horizontal Pod Autoscaler (HPA) junto con Karpenter + + + + + +Para minimizar las advertencias de programación y mejorar los tiempos de respuesta: + +**1. 
Configura correctamente el NodePool de Karpenter:** + +```yaml +apiVersion: karpenter.sh/v1beta1 +kind: NodePool +metadata: + name: default +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.sh/capacity-type + operator: In + values: ["spot", "on-demand"] + nodeClassRef: + apiVersion: karpenter.k8s.aws/v1beta1 + kind: EC2NodeClass + name: default + disruption: + consolidationPolicy: WhenEmpty + consolidateAfter: 30s +``` + +**2. Establece solicitudes de recursos adecuadas:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +spec: + template: + spec: + containers: + - name: app + resources: + requests: + memory: "128Mi" + cpu: "100m" + limits: + memory: "256Mi" + cpu: "200m" +``` + +**3. Usa Horizontal Pod Autoscaler:** + +```yaml +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + + + +Si las advertencias de programación persisten o continúan los errores 502: + +**1. Verifica la configuración de Karpenter:** + +```bash +# Verificar que Karpenter esté corriendo +kubectl get pods -n karpenter + +# Revisar eventos de Karpenter +kubectl get events -n karpenter --sort-by='.lastTimestamp' +``` + +**2. Verifica permisos en AWS:** + +- Asegúrate que Karpenter tenga permisos IAM adecuados +- Revisa cuotas y límites del servicio EC2 +- Verifica configuraciones de subredes y grupos de seguridad + +**3. Monitorea la utilización de recursos:** + +```bash +# Ver uso de recursos en nodos +kubectl top nodes + +# Ver uso de recursos en pods +kubectl top pods --all-namespaces +``` + +**4. 
Revisa la configuración de la aplicación:** + +- Verifica que las solicitudes de recursos coincidan con el uso real +- Revisa probes de readiness y liveness +- Asegura un manejo correcto del apagado ordenado + + + +--- + +_Esta FAQ fue generada automáticamente el 17 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/keda-installation-guide.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/keda-installation-guide.mdx new file mode 100644 index 000000000..b6b41f4a9 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/keda-installation-guide.mdx @@ -0,0 +1,406 @@ +--- +sidebar_position: 15 +title: "Instalando KEDA en Clústeres de Kubernetes de SleakOps" +description: "Guía completa para instalar y configurar KEDA para el autoescalado de cargas de trabajo en clústeres SleakOps" +date: "2025-02-13" +category: "cluster" +tags: ["keda", "autoescalado", "kubernetes", "carga de trabajo", "programador"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Instalando KEDA en Clústeres de Kubernetes de SleakOps + +**Fecha:** 13 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** KEDA, Autoescalado, Kubernetes, Carga de trabajo, Programador + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan implementar capacidades avanzadas de autoescalado en sus clústeres de Kubernetes de SleakOps, incluyendo la capacidad de escalar cargas de trabajo a cero y programar el escalado de cargas según tiempo o eventos. KEDA (Kubernetes Event-driven Autoscaling) proporciona estas capacidades pero aún no está soportado nativamente en SleakOps.
+ +**Síntomas Observados:** + +- Necesidad de iniciar/detener manualmente cargas de trabajo para ahorrar recursos +- Requisito de escalado programado (por ejemplo, servicios API que solo funcionan durante horas laborales) +- Falta de capacidades de autoescalado basado en eventos +- Necesidad de optimización de costos mediante la programación de cargas de trabajo + +**Configuración Relevante:** + +- Plataforma: Clústeres de Kubernetes de SleakOps +- Versión de KEDA: Estable más reciente (2.11+) +- Versión de Kubernetes: Compatible con clústeres SleakOps +- Tipos de carga de trabajo: Servicios Web y Trabajadores + +**Condiciones de Error:** + +- La gestión manual de recursos consume mucho tiempo +- Recursos funcionando 24/7 cuando solo se necesitan en horas específicas +- No hay capacidades nativas de programación en SleakOps para el escalado de cargas de trabajo + +## Solución Detallada + + + +KEDA (Kubernetes Event-driven Autoscaling) es un componente ligero y de propósito único que se puede agregar a cualquier clúster de Kubernetes. Proporciona: + +- **Escalado a Cero**: Escala despliegues a cero réplicas cuando no se necesitan +- **Escalado basado en Eventos**: Escala basado en diversas métricas y eventos +- **Escalado basado en Cron**: Programa operaciones de escalado basadas en tiempo +- **Múltiples Escaladores**: Soporte para varias fuentes de datos (colas, bases de datos, HTTP, etc.) 
+ +Esto es particularmente útil para: + +- Optimización de costos deteniendo servicios no usados +- Cargas de trabajo programadas (trabajos por lotes, APIs con patrones específicos de uso) +- Microservicios basados en eventos + + + + + +La forma recomendada para instalar KEDA en tu clúster SleakOps es usando Helm: + +```bash +# Añadir el repositorio Helm de KEDA +helm repo add kedacore https://kedacore.github.io/charts +helm repo update + +# Instalar KEDA en el namespace keda +helm install keda kedacore/keda --namespace keda --create-namespace + +# Verificar la instalación +kubectl get pods -n keda +``` + +Salida esperada: + +``` +NAME READY STATUS RESTARTS AGE +keda-admission-webhooks-xxx 1/1 Running 0 2m +keda-operator-xxx 1/1 Running 0 2m +keda-operator-metrics-apiserver-xxx 1/1 Running 0 2m +``` + + + + + +Alternativamente, puedes instalar KEDA usando kubectl: + +```bash +# Instalar KEDA +kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.11.2/keda-2.11.2.yaml + +# Verificar la instalación +kubectl get pods -n keda +``` + +**Nota**: Reemplaza `v2.11.2` con la versión más reciente disponible. 
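Con cualquiera de los dos métodos de instalación, puede confirmarse que los CRDs de KEDA quedaron registrados antes de crear el primer `ScaledObject`:

```bash
# CRDs principales que instala KEDA
kubectl get crd scaledobjects.keda.sh scaledjobs.keda.sh triggerauthentications.keda.sh

# El API group keda.sh debe listar sus recursos
kubectl api-resources --api-group=keda.sh
```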
+ + + + + +Para implementar escalado programado (iniciar/detener cargas en horarios específicos), usa el escalador Cron: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: api-service-scheduler + namespace: tu-namespace +spec: + scaleTargetRef: + name: tu-despliegue-api # Reemplaza con el nombre de tu despliegue + minReplicaCount: 0 # Escalar a cero cuando no se necesite + maxReplicaCount: 3 # Réplicas máximas durante horas activas + triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires # Ajusta a tu zona horaria + start: "0 8 * * 1-5" # Iniciar a las 8 AM, lunes a viernes + end: "0 18 * * 1-5" # Detener a las 6 PM, lunes a viernes + desiredReplicas: "2" # Número de réplicas durante horas activas +``` + +Aplica la configuración: + +```bash +kubectl apply -f scaled-object.yaml +``` + + + + + +Para control manual sobre el escalado de cargas de trabajo, puedes crear ScaledObjects que puedas activar/desactivar según sea necesario: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: manual-scaler + namespace: tu-namespace + annotations: + autoscaling.keda.sh/paused: "true" # Pausar el escalado inicialmente +spec: + scaleTargetRef: + name: tu-despliegue + minReplicaCount: 0 + maxReplicaCount: 5 + triggers: + - type: external-push + metadata: + scalerAddress: localhost:9090 +``` + +Para controlar manualmente: + +```bash +# Pausar el escalado (escalar a cero) +kubectl annotate scaledobject manual-scaler autoscaling.keda.sh/paused=true + +# Reanudar el escalado +kubectl annotate scaledobject manual-scaler autoscaling.keda.sh/paused- + +# Verificar estado +kubectl get scaledobject manual-scaler -o yaml +``` + + + + + +Para escalar basado en el tráfico HTTP entrante, puedes usar el escalador HTTP de KEDA: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: http-scaler + namespace: tu-namespace +spec: + scaleTargetRef: + name: tu-api-service + minReplicaCount: 0 + 
maxReplicaCount: 10 + triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus:9090 + metricName: http_requests_per_second + threshold: '10' + query: sum(rate(http_requests_total{job="tu-servicio"}[1m])) +``` + +Esto escalará tu servicio basado en las solicitudes HTTP por segundo. + + + + + +Para escalar basado en el tamaño de colas en tu base de datos: + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: queue-processor + namespace: tu-namespace +spec: + scaleTargetRef: + name: worker-deployment + minReplicaCount: 0 + maxReplicaCount: 20 + triggers: + - type: postgresql + metadata: + connectionFromEnv: DATABASE_URL + query: "SELECT COUNT(*) FROM job_queue WHERE status = 'pending'" + targetQueryValue: '5' +``` + +Esto escalará workers basado en trabajos pendientes en tu cola de base de datos. + + + + + +Para monitorear el comportamiento de KEDA: + +```bash +# Ver todos los ScaledObjects +kubectl get scaledobjects -A + +# Ver detalles de un ScaledObject específico +kubectl describe scaledobject tu-scaler -n tu-namespace + +# Ver logs del operador KEDA +kubectl logs -n keda deployment/keda-operator + +# Ver métricas de KEDA +kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 +``` + +**Solución de problemas comunes:** + +1. **ScaledObject no escala**: Verifica que las métricas estén disponibles +2. **Escalado lento**: Ajusta `pollingInterval` en el `spec` del ScaledObject +3. **Escalado demasiado agresivo**: Configura `cooldownPeriod` + +```yaml +spec: + # pollingInterval y cooldownPeriod se definen al nivel de spec, no dentro del trigger + pollingInterval: 30 # Verificar cada 30 segundos + cooldownPeriod: 300 # Esperar 5 minutos antes de escalar hacia abajo + triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires + start: "0 8 * * 1-5" + end: "0 18 * * 1-5" + desiredReplicas: "2" +``` + + + + + +KEDA puede ayudar significativamente a reducir costos: + +**1.
Escalado a cero para servicios no críticos:** + +```yaml +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: non-critical-service +spec: + scaleTargetRef: + name: analytics-service + minReplicaCount: 0 # Importante: permite escalado a cero + maxReplicaCount: 3 + triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires + start: "0 9 * * 1-5" # 9 AM días laborales + end: "0 17 * * 1-5" # 5 PM días laborales + desiredReplicas: "1" +``` + +**2. Escalado basado en demanda real:** + +```yaml +triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus:9090 + metricName: active_users + threshold: '1' + query: sum(active_users{service="tu-app"}) +``` + +**3. Escalado por lotes durante horas de menor costo:** + +```yaml +triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires + start: "0 2 * * *" # 2 AM (horas de menor costo) + end: "0 6 * * *" # 6 AM + desiredReplicas: "5" # Más réplicas para procesamiento por lotes +``` + +**Estimación de ahorro de costos:** + +- Servicios 24/7 → Servicios programados: **60-70% de ahorro** +- Escalado manual → Escalado automático: **30-40% de ahorro** +- Escalado basado en demanda real: **40-50% de ahorro** + + + + + +**1. Configuración de recursos para KEDA:** + +```yaml +# Configuración recomendada para el operador KEDA +resources: + limits: + cpu: 1000m + memory: 1000Mi + requests: + cpu: 100m + memory: 100Mi +``` + +**2. Configuración de múltiples triggers:** + +```yaml +triggers: + - type: cron + metadata: + timezone: America/Argentina/Buenos_Aires + start: "0 8 * * 1-5" + end: "0 18 * * 1-5" + desiredReplicas: "2" + - type: prometheus + metadata: + serverAddress: http://prometheus:9090 + metricName: cpu_usage + threshold: '70' + query: avg(cpu_usage{service="tu-app"}) +``` + +**3. 
Configuración de comportamiento de escalado:** + +```yaml +behavior: + scaleDown: + stabilizationWindowSeconds: 300 + policies: + - type: Percent + value: 10 + periodSeconds: 60 + scaleUp: + stabilizationWindowSeconds: 0 + policies: + - type: Percent + value: 100 + periodSeconds: 15 +``` + +**4. Etiquetas y anotaciones recomendadas:** + +```yaml +metadata: + labels: + app: tu-aplicacion + environment: produccion + managed-by: keda + annotations: + keda.sh/transfer-hpa-ownership: "true" + keda.sh/paused: "false" +``` + +**5. Configuración de seguridad:** + +```yaml +# Usar secretos para credenciales sensibles +triggers: + - type: postgresql + authenticationRef: + name: postgresql-secret + metadata: + query: "SELECT COUNT(*) FROM jobs WHERE status = 'pending'" + targetQueryValue: '5' +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 13 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx new file mode 100644 index 000000000..8e62fe04f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-cronjob-duplicate-execution.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 3 +title: "Problema de Ejecución Duplicada en Kubernetes CronJob" +description: "Solución para CronJobs que se ejecutan múltiples veces debido a pods fallidos y configuración de reintentos" +date: "2025-01-03" +category: "workload" +tags: ["cronjob", "kubernetes", "retry", "backofflimit", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Ejecución Duplicada en Kubernetes CronJob + +**Fecha:** 3 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** CronJob, Kubernetes, Reintento, BackOffLimit, Solución de 
problemas + +## Descripción del Problema + +**Contexto:** Los CronJobs en un clúster de Kubernetes se están ejecutando múltiples veces en lugar de una sola vez en el horario programado. Este problema parece afectar a CronJobs específicos que han sido ejecutados manualmente a través de Lens o que han experimentado fallos. + +**Síntomas observados:** + +- Los CronJobs se ejecutan dos veces en lugar de una +- Se crean dos pods por cada ejecución del CronJob +- El primer pod parece fallar, lo que desencadena una segunda ejecución +- No hay errores visibles en los registros de los pods fallidos +- El problema persiste a través de despliegues y versiones +- El problema afecta a CronJobs específicos que fueron ejecutados manualmente previamente + +**Configuración relevante:** + +- `backoffLimit` de los Jobs afectados: 1 (un reintento; el valor por defecto de Kubernetes es 6) +- CronJobs afectados después de ejecución manual vía Lens +- Los jobs muestran estado fallido a pesar de no haber errores aparentes + +**Condiciones de error:** + +- Ocurre consistentemente para los CronJobs afectados +- Sucede independientemente del éxito o fallo real del job +- Persiste después de desplegar nuevas versiones +- Afecta trabajos programados en producción + +## Solución Detallada + + + +El problema de ejecución duplicada es causado por el mecanismo de reintento de Kubernetes para Jobs fallidos: + +1. **BackOffLimit**: Los Jobs de los CronJobs afectados tienen `backoffLimit: 1` (el valor por defecto de Kubernetes es 6), lo que significa que Kubernetes reintenta una vez si el Job falla +2. **Detección de fallo del Job**: Aunque el código de tu aplicación se ejecute correctamente, el Job puede ser marcado como fallido debido a: + + - Problemas con el código de salida + - Restricciones de recursos + - Configuraciones de timeout + - Problemas con el manejo de señales + +3.
**Comportamiento de reintento**: Cuando un Job falla, Kubernetes crea automáticamente un nuevo pod para reintentar la ejecución + + + + + +Para detener inmediatamente las ejecuciones duplicadas, configura el `backoffLimit` a 0: + +**Usando Lens (método GUI):** + +1. Abre Lens y conéctate a tu clúster +2. Navega a **Workloads** → **CronJobs** en la barra lateral +3. Busca el CronJob afectado y haz clic en él +4. Haz clic en **Editar** o ve a la vista YAML +5. Localiza el campo `backoffLimit` (usualmente está en 1) +6. Cámbialo a `0`: + +```yaml +spec: + jobTemplate: + spec: + backoffLimit: 0 # Cambiado de 1 a 0 + template: + spec: + # ... resto de la especificación del job +``` + +7. Guarda los cambios + + + + + +Aunque deshabilitar los reintentos soluciona el síntoma, la solución correcta es asegurar que tus jobs finalicen exitosamente: + +**1. Verifica los códigos de salida de tu aplicación:** + +```bash +# En el contenedor de tu job, asegura una salida adecuada +exit 0 # Éxito +# o +exit 1 # Fallo (activará reintento si backoffLimit > 0) +``` + +**2. Revisa los logs del job para errores ocultos:** + +```bash +kubectl logs <nombre-del-pod> --previous +``` + +**3. Problemas comunes que causan fallos "silenciosos":** + +- Timeout en conexión a base de datos +- Variables de entorno faltantes +- Problemas de permisos +- Límites de memoria/CPU excedidos +- Manejo inadecuado de señales + + + + + +En SleakOps puedes configurar el comportamiento de reintentos: + +**Solución temporal:** +Usa Lens como se describió arriba hasta que SleakOps agregue soporte nativo para configurar BackOffLimit. + +**Configuración futura (cuando esté disponible):** +El equipo de SleakOps está trabajando en añadir esta opción de configuración directamente en la interfaz de la plataforma.
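Mientras esa opción no exista en la interfaz, el mismo cambio puede aplicarse desde la línea de comandos sin editar el YAML a mano (el nombre `tu-cronjob` es hipotético):

```bash
# Fijar backoffLimit a 0 en la plantilla de Jobs del CronJob
kubectl patch cronjob tu-cronjob --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"backoffLimit":0}}}}'

# Verificar el valor aplicado
kubectl get cronjob tu-cronjob -o jsonpath='{.spec.jobTemplate.spec.backoffLimit}{"\n"}'
```

El cambio solo afecta a los Jobs creados a partir de ese momento; los Jobs ya existentes conservan su `backoffLimit` original.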
+ +**Buenas prácticas para CronJobs en SleakOps:** + +```yaml +# Configuración recomendada para CronJob +apiVersion: batch/v1 +kind: CronJob +metadata: + name: tu-cronjob +spec: + schedule: "0 5 * * *" # Diario a las 5 AM + jobTemplate: + spec: + backoffLimit: 0 # Sin reintentos + activeDeadlineSeconds: 300 # Timeout de 5 minutos + template: + spec: + restartPolicy: Never + containers: + - name: tu-job + image: tu-imagen + command: ["/bin/sh"] + args: ["-c", "tu-comando && exit 0"] +``` + + + + + +**Monitorea tus CronJobs:** + +1. **Revisa el estado de los jobs regularmente:** + +```bash +kubectl get cronjobs +kubectl get jobs +``` + +2. **Configura alertas para jobs fallidos:** + +```yaml +# Ejemplo de alerta en Prometheus +- alert: CronJobFailed + expr: kube_job_status_failed > 0 + for: 0m + labels: + severity: warning + annotations: + summary: "CronJob {{ $labels.job_name }} falló" +``` + +**Estrategias de prevención:** + +- Siempre prueba los CronJobs primero en desarrollo +- Usa códigos de salida adecuados en tus scripts +- Implementa manejo correcto de errores +- Establece límites razonables de recursos +- Usa chequeos de salud cuando sea posible + + + + + +Si aún experimentas ejecuciones duplicadas: + +**1. Verifica la configuración de BackOffLimit:** + +```bash +kubectl get cronjob -o yaml | grep backoffLimit +``` + +**2. Revisa el historial reciente de jobs:** + +```bash +kubectl get jobs --sort-by=.metadata.creationTimestamp +``` + +**3. Examina detalles de jobs fallidos:** + +```bash +kubectl describe job +``` + +**4. Revisa eventos de pods:** + +```bash +kubectl get events --sort-by=.metadata.creationTimestamp +``` + +**5. 
Verifica restricciones de recursos:** + +```bash +kubectl top pods +kubectl describe nodes +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 3 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-dns-label-length-limit.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-dns-label-length-limit.mdx new file mode 100644 index 000000000..22f269b6c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-dns-label-length-limit.mdx @@ -0,0 +1,209 @@ +--- +sidebar_position: 3 +title: "Error de Límite de Longitud de Etiqueta DNS en Kubernetes" +description: "Solución para errores de resolución DNS debido a etiquetas que exceden los 63 caracteres" +date: "2025-02-21" +category: "cluster" +tags: ["kubernetes", "dns", "service", "naming", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Límite de Longitud de Etiqueta DNS en Kubernetes + +**Fecha:** 21 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** Kubernetes, DNS, Servicio, Nomenclatura, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Al intentar resolver nombres DNS de servicios internos de Kubernetes, los usuarios encuentran fallos en la resolución DNS debido a restricciones en la longitud de las etiquetas de dominio. 
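El límite que origina el error proviene de las RFC 1035/1123: cada etiqueta DNS admite como máximo 63 caracteres y el FQDN completo, 253. Un boceto en Python (ilustrativo) para comprobar un nombre antes de depender de él en la resolución interna:

```python
import re

# Etiqueta RFC 1123: minúsculas, dígitos y guiones; no empieza ni termina con guion
ETIQUETA_DNS = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")

def etiqueta_valida(etiqueta: str) -> bool:
    """True si la etiqueta respeta el máximo de 63 caracteres y el formato RFC 1123."""
    return len(etiqueta) <= 63 and ETIQUETA_DNS.match(etiqueta) is not None

def fqdn_valido(fqdn: str) -> bool:
    """Valida la longitud total (<= 253) y cada etiqueta de un FQDN."""
    return len(fqdn) <= 253 and all(etiqueta_valida(e) for e in fqdn.split("."))

# El nombre de servicio del error original supera el límite por etiqueta:
servicio = "velo-contact-email-sender-production-velo-contact-email-sender-svc"
print(len(servicio), etiqueta_valida(servicio))  # 66 False
```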
+
+**Síntomas Observados:**
+
+- La resolución DNS falla con el error "not a legal IDNA2008 name"
+- El mensaje de error indica "etiqueta de dominio más larga que 63 caracteres"
+- Los servicios no pueden comunicarse internamente usando los nombres DNS generados
+- El comando `dig` no logra resolver el FQDN del servicio
+
+**Configuración Relevante:**
+
+- Nombre del servicio: Generado automáticamente por SleakOps
+- Namespace: Nombres descriptivos largos (ejemplo: `velo-contact-email-sender-production`)
+- Formato DNS: `<servicio>.<namespace>.svc.cluster.local`
+- El error ocurre cuando la longitud total de la etiqueta excede los 63 caracteres
+
+**Condiciones del Error:**
+
+- Ocurre cuando los nombres de servicio son autogenerados con prefijos largos
+- Sucede con nombres descriptivos en el namespace
+- La resolución DNS falla completamente
+- Los servicios se vuelven inalcanzables vía DNS interno
+
+## Solución Detallada
+
+
+
+Los nombres DNS en Kubernetes deben cumplir con las normas RFC:
+
+- **Longitud máxima de etiqueta:** 63 caracteres por etiqueta DNS
+- **Longitud total del FQDN:** máximo 253 caracteres
+- **Formato de etiqueta:** Solo letras minúsculas, números y guiones
+- **Formato FQDN del servicio:** `<servicio>.<namespace>.svc.cluster.local`
+
+En el error de ejemplo:
+
+```
+velo-contact-email-sender-production-velo-contact-email-sender-svc.velo-contact-email-sender-production.svc.cluster.local
+```
+
+La parte del nombre del servicio `velo-contact-email-sender-production-velo-contact-email-sender-svc` excede los 63 caracteres.
+
+
+
+
+
+Para encontrar el nombre real del servicio en tu clúster:
+
+1. **Usando kubectl:**
+
+```bash
+kubectl get services -n velo-contact-email-sender-production
+```
+
+2. **Usando Lens (como sugirió soporte):**
+
+   - Navega al namespace específico
+   - Ve a la sección **Services**
+   - Encuentra tu servicio y anota el nombre real
+
+3. 
**Revisar el manifiesto del servicio:** + Mira el campo `metadata.name` en la definición del servicio: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: contact-email-sen-svc # Este es el nombre real del servicio + namespace: velo-contact-email-sender-production +``` + + + + + +Basado en el manifiesto del servicio proporcionado, el nombre DNS correcto debería ser: + +``` +contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local +``` + +**Desglose del formato:** + +- **Nombre del servicio:** `contact-email-sen-svc` (de `metadata.name`) +- **Namespace:** `velo-contact-email-sender-production` +- **Sufijo de dominio:** `svc.cluster.local` + +**Probando la resolución:** + +```bash +# Desde dentro de un pod en el clúster +dig contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local + +# O usar nslookup +nslookup contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local + +# Prueba simple de conectividad +telnet contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local 5001 +``` + + + + + +Para servicios dentro del mismo namespace, puedes usar formas más cortas: + +1. **Mismo namespace** (recomendado): + +``` +contact-email-sen-svc:5001 +``` + +2. **Cross-namespace pero mismo clúster:** + +``` +contact-email-sen-svc.velo-contact-email-sender-production:5001 +``` + +3. **FQDN completo** (cuando sea necesario): + +``` +contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local:5001 +``` + + + + + +Para evitar este problema en futuras implementaciones: + +1. **Usar nombres de servicio más cortos** en la configuración de SleakOps +2. **Evitar prefijos redundantes** en los nombres de servicio +3. **Considerar la longitud del namespace** al nombrar proyectos +4. 
**Probar la resolución DNS** después del despliegue + +**Patrón de nomenclatura recomendado:** + +- Proyecto: `velo-email-sender` +- Servicio: `email-svc` +- Resultado: `email-svc.velo-email-sender.svc.cluster.local` + +**Ejemplo de configuración:** + +```yaml +# En la configuración de carga de trabajo de SleakOps +name: email-sender # Mantenerlo corto y descriptivo +type: webservice +internal: true +port: 5001 +``` + + + + + +Si los problemas DNS persisten, sigue estos pasos: + +1. **Verificar que el servicio existe:** + +```bash +kubectl get svc -n velo-contact-email-sender-production +``` + +2. **Comprobar DNS desde dentro del clúster:** + +```bash +# Obtener shell en cualquier pod +kubectl exec -it -n -- /bin/bash + +# Probar resolución DNS +nslookup contact-email-sen-svc.velo-contact-email-sender-production.svc.cluster.local +``` + +3. **Verificar que CoreDNS está funcionando:** + +```bash +kubectl get pods -n kube-system | grep coredns +``` + +4. **Revisar endpoints del servicio:** + +```bash +kubectl get endpoints contact-email-sen-svc -n velo-contact-email-sender-production +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 21 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx new file mode 100644 index 000000000..e8cba1d92 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-memory-limits-pod-restarts.mdx @@ -0,0 +1,211 @@ +--- +sidebar_position: 3 +title: "Reinicios de Pods en Kubernetes Debido a Límites de Memoria" +description: "Solución para pods que se reinician debido a una configuración insuficiente de memoria" +date: "2024-12-19" +category: "workload" +tags: + [ + "kubernetes", + "memoria", + "reinicios-pod", + "recursos", + "solución-de-problemas", + ] +--- + +import 
TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Reinicios de Pods en Kubernetes Debido a Límites de Memoria + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Kubernetes, Memoria, Reinicios de Pods, Recursos, Solución de problemas + +## Descripción del Problema + +**Contexto:** Un servicio en Kubernetes deja de funcionar de repente sin ningún cambio por parte del usuario. El pod se reinicia continuamente mostrando el mensaje "Back-off restarting failed container", y el servicio se reinicia cada vez que se accede a él. + +**Síntomas Observados:** + +- El pod muestra el mensaje "Back-off restarting failed container" +- El servicio se reinicia automáticamente al acceder +- No se realizaron cambios en el código de la aplicación +- El problema aparece de forma repentina y sin causa aparente + +**Configuración Relevante:** + +- Los valores de MemoryMin y MemoryMax parecen estar configurados demasiado bajos +- Kubernetes está matando el pod al exceder los límites de memoria +- El servicio es accesible pero inestable debido a reinicios constantes + +**Condiciones de Error:** + +- El error ocurre cuando el pod supera los límites de memoria definidos +- Kubernetes termina el pod asumiendo que el uso excesivo de memoria es incorrecto +- El problema se manifiesta durante el acceso al servicio o bajo alta carga + +## Solución Detallada + + + +Kubernetes gestiona los recursos de los pods mediante solicitudes y límites: + +- **Solicitud de Memoria (MemoryMin)**: Asignación garantizada de memoria +- **Límite de Memoria (MemoryMax)**: Memoria máxima que el pod puede usar + +Cuando un pod supera su límite de memoria, Kubernetes lo termina con un estado OOMKilled (terminado por falta de memoria) y lo reinicia automáticamente. + + + + + +Para analizar el uso actual de memoria: + +1. Accede a tu panel de Grafana +2. Navega a **'Kubernetes / Recursos de Cómputo / Namespace (Pods)'** +3. 
Selecciona tu namespace y pod +4. Revisa los patrones de uso de memoria: + - Consumo actual de memoria + - Picos de memoria durante la operación + - Comparación con los límites configurados + +```yaml +# Ejemplo de lo que debes buscar en las métricas +Uso de Memoria: 512Mi +Límite de Memoria: 256Mi # Esto causaría OOMKilled +Solicitud de Memoria: 128Mi +``` + + + + + +Para resolver el problema, aumenta la configuración de memoria: + +**En el Panel de SleakOps:** + +1. Ve a la configuración de tu servicio +2. Navega a **Configuración de Recursos** +3. Incrementa el **Límite de Memoria** (MemoryMax) +4. Opcionalmente, incrementa la **Solicitud de Memoria** (MemoryMin) +5. Despliega los cambios + +**Ejemplo de Configuración:** + +```yaml +resources: + requests: + memory: "512Mi" # MemoryMin + cpu: "250m" + limits: + memory: "1Gi" # MemoryMax + cpu: "500m" +``` + +**Valores Recomendados Iniciales:** + +- Para aplicaciones pequeñas: 512Mi - 1Gi +- Para aplicaciones medianas: 1Gi - 2Gi +- Para aplicaciones grandes: 2Gi - 4Gi + + + + + +**Paso 1: Verificar el Estado Actual del Pod** + +```bash +kubectl get pods -n tu-namespace +kubectl describe pod tu-nombre-pod -n tu-namespace +``` + +**Paso 2: Revisar Eventos del Pod** + +```bash +kubectl get events -n tu-namespace --sort-by='.lastTimestamp' +``` + +**Paso 3: Revisar Logs del Pod** + +```bash +kubectl logs tu-nombre-pod -n tu-namespace --previous +``` + +**Paso 4: Monitorear Uso de Recursos** + +```bash +kubectl top pods -n tu-namespace +``` + +**Paso 5: Actualizar Límites de Recursos** + +- Incrementa los límites de memoria según el uso observado +- Añade un margen del 20-50% sobre el pico de uso +- Prueba con aumentos graduales + + + + + +**Guías para Dimensionar la Memoria:** + +1. **Comienza de forma conservadora:** Establece límites razonables y monitorea +2. **Monitorea regularmente:** Usa paneles de Grafana para seguir patrones de uso +3. 
**Establece proporciones adecuadas:** + - Solicitud de Memoria: 70-80% del uso típico + - Límite de Memoria: 150-200% del uso típico + +**Ejemplo de Configuración:** + +```yaml +# Para una aplicación Node.js +resources: + requests: + memory: "256Mi" # Asignación garantizada + cpu: "100m" + limits: + memory: "512Mi" # Máximo permitido + cpu: "200m" + +# Para una aplicación Java +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" +``` + +**Alertas de Monitoreo:** + +- Configura alertas cuando el uso de memoria supere el 80% del límite +- Monitorea reinicios frecuentes +- Sigue las tendencias de uso de memoria a lo largo del tiempo + + + + + +Contacta al soporte de SleakOps si: + +- Los problemas de memoria persisten tras aumentar los límites +- Observas patrones inusuales de consumo de memoria +- La aplicación funcionaba antes sin cambios en la configuración +- Necesitas ayuda para interpretar métricas de Grafana +- Los aumentos de recursos no resuelven el problema de reinicios + +Proporciona esta información: + +- Nombre del servicio y namespace +- Configuración actual de recursos +- Capturas de pantalla de Grafana mostrando uso de memoria +- Logs y eventos del pod +- Línea de tiempo de cuándo comenzó el problema + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx new file mode 100644 index 000000000..76b6ff8b1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-pod-scheduling-insufficient-resources.mdx @@ -0,0 +1,171 @@ +--- +sidebar_position: 3 +title: "Fallos en la Programación de Pods en Kubernetes - Recursos Insuficientes" +description: "Solución para fallos en la 
programación de pods debido a recursos insuficientes de CPU y memoria en clústeres de Kubernetes" +date: "2024-03-12" +category: "cluster" +tags: ["kubernetes", "programación", "recursos", "nodepool", "karpenter"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallos en la Programación de Pods en Kubernetes - Recursos Insuficientes + +**Fecha:** 12 de marzo de 2024 +**Categoría:** Clúster +**Etiquetas:** Kubernetes, Programación, Recursos, Nodepool, Karpenter + +## Descripción del Problema + +**Contexto:** Un despliegue que funcionaba previamente de repente falla al programar pods en un clúster de Kubernetes gestionado por SleakOps, mostrando errores de insuficiencia de recursos. + +**Síntomas Observados:** + +- Los pods no se programan con el error "0/X nodos disponibles" +- Mensajes de "CPU insuficiente" y "Memoria insuficiente" +- El error menciona requisitos específicos de recursos que no se cumplen +- Despliegues que funcionaban anteriormente dejan de funcionar de repente + +**Configuración Relevante:** + +- Requisitos de recursos del pod: CPU 665m, Memoria 2000Mi +- Nodepool: spot-arm64 con arquitectura ARM64 +- Clúster usando Karpenter para aprovisionamiento de nodos +- Requisitos mínimos de nodo: 2GB RAM, 2 núcleos CPU + +**Condiciones de Error:** + +- El error ocurre durante la fase de programación de pods +- Aparece cuando el nodepool alcanza los límites de recursos +- Afecta despliegues que previamente fueron exitosos +- Varios nodepools muestran problemas de incompatibilidad + +## Solución Detallada + + + +El mensaje de error indica varios problemas: + +1. **Recursos Insuficientes**: Los nodos no tienen suficiente CPU (se requieren 665m) o memoria (se requieren 2000Mi) +2. **Límites del Nodepool**: El nodepool ha alcanzado sus límites de recursos configurados +3. **Incompatibilidad de Taints**: Algunos nodos tienen taints que impiden la programación de pods +4. 
**Restricciones de Tipo de Instancia**: No hay tipos de instancia disponibles que cumplan con los requisitos de recursos y restricciones + +``` +0/8 nodos disponibles: +- 2 CPU insuficiente +- 4 Memoria insuficiente +- Taints que impiden la programación en ciertos nodos +``` + + + + + +Para resolver este problema de inmediato: + +1. **Acceder a la Configuración del Nodepool**: + + - Ir a la Consola de SleakOps + - Navegar a Clusters → Tu Clúster → Configuración → Node Pools + - Seleccionar el nodepool afectado (ejemplo: "spot-arm64") + +2. **Incrementar los Límites de Recursos**: + + - **Límite de CPU**: Aumentar desde el valor actual para acomodar más pods + - **Límite de Memoria**: Incrementar la asignación total de memoria (ejemplo: de 32GB a 64GB) + - **Cantidad de Nodos**: Asegurarse que el máximo de nodos permita el escalado + +3. **Guardar y Aplicar Cambios**: + - El clúster aprovisionará automáticamente nuevos nodos si es necesario + - Los pods existentes deberían reprogramarse en unos minutos + + + + + +**Sección 1 - Límites de Recursos (Seguridad/Tope de Costos)**: + +- Actúa como un límite de seguridad para evitar costos inesperados +- No afecta los costos a menos que aumentes el número de pods o habilites el escalado automático +- Previene consumo descontrolado de recursos + +**Sección 2 - Configuración del Nodo**: + +- **Almacenamiento**: Configura el espacio en disco para cada nodo +- **Configuración Avanzada**: Establece CPU y memoria mínimas por nodo + - Por defecto: 2GB RAM, 2 núcleos CPU mínimo + - Evita la programación en instancias muy pequeñas (t4g.nano, t4g.micro) + - Asegura que los daemonsets puedan ejecutarse correctamente + +```yaml +# Ejemplo de configuración de nodepool +resource_limits: + max_cpu: "32" + max_memory: "64Gi" + max_nodes: 10 + +node_requirements: + min_cpu: "2" + min_memory: "2Gi" + storage: "20Gi" +``` + + + + + +**Importante**: Incrementar los límites de recursos NO incrementa los costos inmediatamente. 
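A modo de referencia, en clústeres gestionados con Karpenter ese tope se declara en el campo `spec.limits` del NodePool. Boceto ilustrativo (los nombres y valores son supuestos; la versión de la API puede variar según tu versión de Karpenter):

```yaml
# Boceto de NodePool de Karpenter: spec.limits actúa como tope de seguridad.
# Al alcanzarlo, Karpenter deja de aprovisionar nodos nuevos; no reserva recursos.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-arm64
spec:
  limits:
    cpu: "32"     # vCPU totales que el nodepool puede aprovisionar
    memory: 64Gi  # memoria total aprovisionable
```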
+ +**Lo que afecta los costos**: + +- **Uso real de recursos**: Solo pagas por los nodos que realmente se crean +- **Escalado de pods**: Más pods = más nodos = costos más altos +- **Escalado automático**: Si está habilitado, escala automáticamente según la demanda + +**Lo que no afecta los costos**: + +- **Aumentar límites**: Solo eleva el techo, no crea recursos +- **Capacidades máximas de memoria/CPU**: Solo afecta el uso máximo posible + +**Buenas prácticas**: + +- Establecer límites razonables basados en la carga máxima esperada +- Monitorizar el uso real vs. límites regularmente +- Usar instancias spot para optimizar costos cuando sea posible + + + + + +**Monitorear uso de recursos**: + +1. **Panel de SleakOps**: Revisar la utilización de recursos del clúster +2. **Configurar Alertas**: Establecer notificaciones para uso alto de recursos +3. **Revisiones periódicas**: Revisar y ajustar límites regularmente + +**Estrategias de prevención**: + +- **Escalado gradual**: Aumentar límites gradualmente según necesidades reales +- **Solicitudes de recursos**: Asegurar que los pods tengan solicitudes de recursos adecuadas +- **Múltiples nodepools**: Usar diferentes nodepools para distintos tipos de carga +- **Spot vs On-Demand**: Balancear costo y necesidades de confiabilidad + +**Comandos para solución de problemas**: + +```bash +# Revisar recursos de nodos +kubectl describe nodes + +# Revisar solicitudes de recursos de un pod +kubectl describe pod -n + +# Ver estado de nodepools +kubectl get nodepools +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 12 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx new file mode 100644 index 000000000..4e5d3bb40 --- /dev/null +++ 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-secrets-ssl-private-keys.mdx @@ -0,0 +1,204 @@ +--- +sidebar_position: 3 +title: "Secretos de Kubernetes para Claves Privadas SSL" +description: "Cómo almacenar y montar de forma segura claves privadas SSL usando Secretos de Kubernetes en SleakOps" +date: "2024-12-19" +category: "proyecto" +tags: ["kubernetes", "secretos", "ssl", "seguridad", "volúmenes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Secretos de Kubernetes para Claves Privadas SSL + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Kubernetes, Secretos, SSL, Seguridad, Volúmenes + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan almacenar de forma segura claves privadas SSL en su entorno Kubernetes y montarlas como archivos dentro de los contenedores con las restricciones de acceso adecuadas. + +**Síntomas Observados:** + +- Necesidad de almacenar claves privadas SSL de forma segura en Kubernetes +- Requisito de montar secretos como archivos (no variables de entorno) +- Necesidad de restringir el acceso solo a pods autorizados +- Requisito de permisos de archivo adecuados en los secretos montados + +**Configuración Relevante:** + +- Plataforma: entorno Kubernetes de SleakOps +- Tipo de secreto: claves privadas SSL +- Requisito de montaje: basado en archivos (no variables de entorno) +- Nivel de seguridad: alta prioridad con acceso restringido + +**Condiciones de Error:** + +- VariableGroup por defecto crea variables de entorno, no archivos +- Necesidad de configuración específica para montar secretos como archivos +- Requiere buenas prácticas de seguridad para la gestión de claves SSL + +## Solución Detallada + + + +El enfoque estándar en SleakOps es usar VariableGroups, que exponen secretos como variables de entorno: + +1. 
**Crear un VariableGroup:** + + - Ve a tu proyecto en SleakOps + - Navega a **VariableGroups** + - Crea un nuevo VariableGroup + - Añade tu clave privada SSL como una variable + +2. **Asignar a la Ejecución:** + - Asigna el VariableGroup a ejecuciones específicas + - O déjalo "global" para exponerlo a todas las ejecuciones del proyecto + +**Nota:** Este método expone el secreto como variable de entorno, no como archivo. + + + + + +Para el montaje de secretos basado en archivos (recomendado para claves SSL), usa el enfoque de volúmenes: + +1. **Navega a Detalles del Proyecto:** + + - Ve a **Proyectos** → **Detalles** + - Encuentra la sección **Volúmenes** + +2. **Crear un Volumen de Secreto:** + + - Crea un nuevo volumen de tipo "Secreto" + - Sube o pega el contenido de tu clave privada SSL + - Configura la ruta de montaje donde debe aparecer el archivo + +3. **Ejemplo de Configuración:** + +```yaml +# Configuración del volumen +name: ssl-private-key +type: secret +mountPath: /etc/ssl/private/ +fileName: private.key +permissions: 0600 +``` + + + + + +Al trabajar con claves privadas SSL en Kubernetes: + +1. **Permisos de Archivo:** + + - Establece permisos restrictivos (0600 o 0400) + - Asegura que solo el usuario de la aplicación pueda leer la clave + +2. **Control de Acceso:** + + - Usa RBAC de Kubernetes para limitar el acceso a pods + - Monta secretos solo en los pods que los necesiten + - Evita exponer claves como variables de entorno + +3. **Seguridad de Almacenamiento:** + + - Usa cifrado nativo de secretos en reposo de Kubernetes + - Rota las claves regularmente + - Monitorea el acceso a recursos secretos + +4. 
**Ejemplo de Configuración Segura:** + +```yaml +# Montaje seguro de volumen +volumes: + - name: ssl-key-volume + secret: + secretName: ssl-private-key + defaultMode: 0400 + items: + - key: private.key + path: private.key + mode: 0400 +``` + + + + + +**Paso 1: Prepara tu Clave SSL** + +- Asegúrate de que tu clave privada esté en formato PEM +- Elimina espacios o formatos extra +- Prueba la validez de la clave antes de subirla + +**Paso 2: Crea el Volumen Secreto en SleakOps** + +1. Navega a **Proyectos** → **[Tu Proyecto]** → **Detalles** +2. Desplázate a la sección **Volúmenes** +3. Haz clic en **Agregar Volumen** +4. Selecciona tipo **Secreto** +5. Configura: + - **Nombre:** `ssl-private-key` + - **Ruta de Montaje:** `/etc/ssl/private/` + - **Nombre del Archivo:** `private.key` + - **Contenido:** Pega tu clave privada + - **Permisos:** `0600` + +**Paso 3: Referencia en tu Aplicación** + +```dockerfile +# En tu Dockerfile o aplicación +# La clave estará disponible en /etc/ssl/private/private.key +COPY --from=secrets /etc/ssl/private/private.key /app/ssl/ +``` + +**Paso 4: Verifica el Acceso** + +- Despliega tu aplicación +- Comprueba que el archivo exista en la ruta especificada +- Verifica que los permisos de archivo sean correctos +- Prueba la funcionalidad SSL + + + + + +**Problema 1: Archivo No Encontrado** + +- Verifica que la ruta de montaje sea correcta +- Comprueba que el volumen esté correctamente adjuntado al pod +- Asegúrate de que el secreto se haya creado con éxito + +**Problema 2: Permiso Denegado** + +- Revisa los permisos del archivo (deben ser 0600 o 0400) +- Verifica que el usuario de la aplicación tenga acceso de lectura +- Asegura que el directorio de la ruta de montaje exista + +**Problema 3: Formato de Clave Inválido** + +- Verifica que la clave esté en formato PEM correcto +- Revisa que no haya caracteres o formatos extra +- Prueba la clave fuera de Kubernetes primero + +**Comandos para Depurar:** + +```bash +# Verifica si el secreto existe 
+kubectl get secrets
+
+# Verifica el montaje del volumen
+kubectl describe pod <nombre-del-pod>
+
+# Revisa permisos de archivo dentro del pod
+kubectl exec <nombre-del-pod> -- ls -la /etc/ssl/private/
+```
+
+
+
+---
+
+_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx
new file mode 100644
index 000000000..65f688f9a
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/kubernetes-shared-volumes-across-namespaces.mdx
@@ -0,0 +1,563 @@
+---
+sidebar_position: 3
+title: "Volúmenes Compartidos en Kubernetes entre Namespaces"
+description: "Solución para compartir volúmenes entre pods en diferentes namespaces y enfoques alternativos"
+date: "2025-01-30"
+category: "cluster"
+tags: ["kubernetes", "volúmenes", "namespaces", "almacenamiento", "s3"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Volúmenes Compartidos en Kubernetes entre Namespaces
+
+**Fecha:** 30 de enero de 2025
+**Categoría:** Clúster
+**Etiquetas:** Kubernetes, Volúmenes, Namespaces, Almacenamiento, S3
+
+## Descripción del Problema
+
+**Contexto:** El usuario necesita compartir datos entre un cronjob que genera archivos y un servicio nginx que los sirve, pero están ejecutándose en diferentes namespaces dentro de un clúster de Kubernetes. 
+ +**Síntomas Observados:** + +- No se puede montar el mismo volumen en diferentes namespaces +- Necesidad de compartir archivos generados entre el cronjob y el servidor web +- La configuración actual funciona en EC2 con montaje de volumen compartido +- Se busca el equivalente en Kubernetes para acceso compartido a volúmenes + +**Configuración Relevante:** + +- Plataforma: SleakOps en AWS EKS +- Caso de uso: Cronjob genera archivos, Nginx los sirve +- Configuración actual: EC2 con volumen compartido entre contenedores +- Objetivo: Pods de Kubernetes en diferentes namespaces + +**Condiciones de Error:** + +- Limitación de Kubernetes: no se puede usar el mismo volumen en diferentes namespaces +- Necesidad de solución alternativa para compartir archivos +- Requisitos de rendimiento para generación de archivos grandes (más de 5GB) + +## Solución Detallada + + + +Kubernetes tiene una limitación fundamental: **el mismo PersistentVolume no puede ser montado por pods en diferentes namespaces**. Esto es por diseño para propósitos de seguridad e aislamiento. + +Esto significa que el enfoque actual en EC2 de compartir un volumen entre contenedores no funcionará directamente en Kubernetes cuando los pods estén en diferentes namespaces. + + + + + +El enfoque recomendado es usar **Amazon S3** como almacenamiento intermedio: + +### Arquitectura: + +1. **Cronjob**: Genera archivos localmente → Los sube a S3 +2. 
**Servicio Nginx**: Descarga archivos desde S3 → Los sirve + +### Beneficios: + +- Funciona a través de namespaces +- Escalable y confiable +- Rentable para archivos grandes +- Autenticación integrada vía cuentas de servicio de SleakOps + +### Ejemplo de configuración: + +```yaml +# Configuración del Cronjob +apiVersion: batch/v1 +kind: CronJob +metadata: + name: generador-de-archivos + namespace: jobs +spec: + schedule: "0 2 * * *" + jobTemplate: + spec: + template: + spec: + containers: + - name: generador + image: tu-app:latest + env: + - name: S3_BUCKET + value: "tu-nombre-de-bucket" + - name: AWS_REGION + value: "us-east-1" + volumeMounts: + - name: almacenamiento-temporal + mountPath: /tmp/files + volumes: + - name: almacenamiento-temporal + emptyDir: + sizeLimit: 60Gi +``` + + + + + +Para la generación de archivos grandes, configura tu nodepool con suficiente almacenamiento EBS: + +### En SleakOps: + +1. Ve a **Configuración de Clúster** +2. Selecciona tu **Nodepool** +3. Modifica la **Configuración del Nodo**: + - **Tamaño del volumen EBS**: 50-60 GB + - **Tipo de volumen**: gp3 (más rápido y económico) + +### Ejemplo de configuración: + +```yaml +nodepool_config: + instance_type: "t3.medium" + disk_size: 60 # GB + disk_type: "gp3" + min_nodes: 1 + max_nodes: 5 +``` + + + + + +SleakOps configura automáticamente la autenticación para S3 mediante **cuentas de servicio**. No necesitas gestionar credenciales de AWS manualmente. 
+ +### Ejemplo en Java para acceso a S3: + +```java +import software.amazon.awssdk.services.s3.S3Client; +import software.amazon.awssdk.regions.Region; +import software.amazon.awssdk.services.s3.model.*; + +public class S3FileUploader { + public static void main(String[] args) { + // SleakOps maneja la autenticación automáticamente + S3Client s3 = S3Client.builder() + .region(Region.US_EAST_1) + .build(); + + // Subir archivo a S3 + PutObjectRequest putRequest = PutObjectRequest.builder() + .bucket("tu-nombre-de-bucket") + .key("generated-files/data.zip") + .build(); + + s3.putObject(putRequest, + RequestBody.fromFile(new File("/tmp/files/data.zip"))); + } +} +``` + +### Ejemplo en Python: + +```python +import boto3 +import os + +# SleakOps maneja la autenticación vía cuenta de servicio +s3_client = boto3.client('s3') +bucket_name = os.environ['S3_BUCKET'] + +# Subir archivo generado +s3_client.upload_file( + '/tmp/files/generated_data.zip', + bucket_name, + 'generated-files/generated_data.zip' +) +``` + + + + + +Para servir archivos desde S3 a través de Nginx, tienes varias opciones: + +### Opción 1: Nginx con proxy a S3 + +```nginx +server { + listen 80; + server_name tu-dominio.com; + + location /files/ { + proxy_pass https://tu-bucket.s3.amazonaws.com/; + proxy_set_header Host tu-bucket.s3.amazonaws.com; + proxy_hide_header x-amz-id-2; + proxy_hide_header x-amz-request-id; + } +} +``` + +### Opción 2: Descargar y servir localmente + +```bash +#!/bin/bash +# Script de inicio para el contenedor nginx +aws s3 sync s3://tu-bucket/generated-files/ /usr/share/nginx/html/files/ +nginx -g "daemon off;" +``` + +### Opción 3: Usar hosting estático de sitio web en S3 + +Habilita el hosting estático en tu bucket S3 y apunta tu dominio directamente a S3. + + + + + +Si debes mantener todo dentro de Kubernetes: + +### Opción 1: Mismo namespace + +Mueve tanto el cronjob como nginx al mismo namespace para compartir volúmenes. 
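La idea anterior puede bosquejarse así (nombres ilustrativos). Ten en cuenta que un volumen EBS es `ReadWriteOnce`, por lo que el CronJob y nginx deben poder programarse en el mismo nodo; si eso no es viable, usa EFS con `ReadWriteMany` como en la opción siguiente:

```yaml
# PVC compartido dentro de un mismo namespace (nombres ilustrativos)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: archivos-compartidos
  namespace: web
spec:
  accessModes:
    - ReadWriteOnce   # EBS: montable por pods de un solo nodo a la vez
  resources:
    requests:
      storage: 20Gi
```

Tanto el CronJob como el Deployment de nginx montarían este mismo `claimName: archivos-compartidos` en sus `volumes`.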
+ +### Opción 2: NFS o EFS + +Usa Amazon EFS (Elastic File System), que puede ser montado a través de namespaces: + +```yaml +apiVersion: v1 +kind: PersistentVolume +metadata: + name: efs-pv +spec: + capacity: + storage: 100Gi + accessModes: + - ReadWriteMany + persistentVolumeReclaimPolicy: Retain + storageClassName: efs-sc + csi: + driver: efs.csi.aws.com + volumeHandle: fs-12345678 # Tu EFS ID +``` + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: efs-claim + namespace: jobs # Namespace del cronjob +spec: + accessModes: + - ReadWriteMany + storageClassName: efs-sc + resources: + requests: + storage: 100Gi +``` + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: efs-claim + namespace: web # Namespace de nginx +spec: + accessModes: + - ReadWriteMany + storageClassName: efs-sc + resources: + requests: + storage: 100Gi +``` + +### Opción 3: Usar un servicio intermedio + +Crea un servicio API que maneje la transferencia de archivos: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: file-service + namespace: shared +spec: + replicas: 1 + selector: + matchLabels: + app: file-service + template: + metadata: + labels: + app: file-service + spec: + containers: + - name: file-service + image: file-service:latest + ports: + - containerPort: 8080 + volumeMounts: + - name: shared-storage + mountPath: /shared + volumes: + - name: shared-storage + persistentVolumeClaim: + claimName: shared-pvc +``` + + + + + +Para manejar archivos de más de 5GB eficientemente: + +### 1. 
Configuración de recursos del pod:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: generador-archivos-grandes
spec:
  schedule: "0 2 * * *" # requerido; ajusta la programación a tu caso
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: generador
              image: tu-app:latest
              resources:
                requests:
                  memory: "4Gi"
                  cpu: "2"
                  ephemeral-storage: "60Gi"
                limits:
                  memory: "8Gi"
                  cpu: "4"
                  ephemeral-storage: "80Gi"
              volumeMounts:
                - name: temp-storage
                  mountPath: /tmp/large-files
          volumes:
            - name: temp-storage
              emptyDir:
                sizeLimit: 70Gi
```

### 2. Subida multipart a S3:

```python
import boto3
from boto3.s3.transfer import TransferConfig

def upload_large_file(file_path, bucket, key):
    # Configuración para archivos grandes
    # (los umbrales se expresan en bytes)
    config = TransferConfig(
        multipart_threshold=1024 * 1024 * 25,  # 25MB
        max_concurrency=10,
        multipart_chunksize=1024 * 1024 * 25,  # 25MB
        use_threads=True
    )

    s3_client = boto3.client('s3')
    s3_client.upload_file(
        file_path, bucket, key,
        Config=config
    )
```

### 3. Compresión antes de subir:

```bash
#!/bin/bash
# Script para comprimir y subir archivos grandes
cd /tmp/files

# Comprimir archivos
tar -czf data_$(date +%Y%m%d_%H%M%S).tar.gz *.data

# Subir a S3
aws s3 cp data_*.tar.gz s3://$S3_BUCKET/compressed-files/

# Limpiar archivos temporales
rm -f *.data *.tar.gz
```


### Monitoreo de uso de almacenamiento:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-script
data:
  monitor.sh: |
    #!/bin/bash
    while true; do
      echo "=== Uso de almacenamiento $(date) ==="
      df -h /tmp/files
      echo "=== Archivos en directorio ==="
      ls -lah /tmp/files/
      echo "=== Memoria del pod ==="
      free -h
      sleep 300 # Cada 5 minutos
    done
```

### Verificación de conectividad S3:

```bash
#!/bin/bash
# Script de verificación
echo "Verificando conectividad con S3..."
+ +# Verificar credenciales +aws sts get-caller-identity + +# Verificar acceso al bucket +aws s3 ls s3://$S3_BUCKET/ + +# Probar subida de archivo de prueba +echo "test" > /tmp/test.txt +aws s3 cp /tmp/test.txt s3://$S3_BUCKET/test/ +aws s3 rm s3://$S3_BUCKET/test/test.txt +rm /tmp/test.txt + +echo "Verificación completada" +``` + +### Logs y debugging: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: debug-pod +spec: + containers: + - name: debug + image: amazon/aws-cli:latest + command: ["/bin/bash"] + args: ["-c", "while true; do sleep 3600; done"] + env: + - name: S3_BUCKET + value: "tu-bucket" + volumeMounts: + - name: debug-storage + mountPath: /debug + volumes: + - name: debug-storage + emptyDir: {} +``` + + + + + +### 1. Gestión del ciclo de vida de archivos: + +```yaml +# Política de ciclo de vida en S3 +{ + "Rules": [ + { + "ID": "ArchiveOldFiles", + "Status": "Enabled", + "Filter": { + "Prefix": "generated-files/" + }, + "Transitions": [ + { + "Days": 30, + "StorageClass": "STANDARD_IA" + }, + { + "Days": 90, + "StorageClass": "GLACIER" + } + ] + } + ] +} +``` + +### 2. Seguridad y acceso: + +```yaml +# ServiceAccount con permisos específicos para S3 +apiVersion: v1 +kind: ServiceAccount +metadata: + name: s3-access-sa + namespace: jobs + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/S3AccessRole +``` + +### 3. Configuración de retry y timeout: + +```python +import boto3 +from botocore.config import Config + +# Configuración robusta para S3 +config = Config( + region_name='us-east-1', + retries={ + 'max_attempts': 10, + 'mode': 'adaptive' + }, + max_pool_connections=50 +) + +s3_client = boto3.client('s3', config=config) +``` + +### 4. 
Notificaciones y alertas:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: notification-config
data:
  notify.py: |
    import boto3
    import json
    from datetime import datetime

    def send_notification(message):
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:file-processing',
            Message=json.dumps({
                'status': 'completed',
                'timestamp': str(datetime.now()),
                'message': message
            })
        )
```

### 5. Backup y recuperación:

```bash
#!/bin/bash
# Script de backup automático
BACKUP_BUCKET="tu-bucket-backup"
SOURCE_BUCKET="tu-bucket"

# Sincronizar buckets
aws s3 sync s3://$SOURCE_BUCKET/ s3://$BACKUP_BUCKET/ \
    --delete \
    --storage-class STANDARD_IA

# Verificar integridad
aws s3api head-object --bucket $BACKUP_BUCKET --key generated-files/latest.zip
```


---

_Esta sección de preguntas frecuentes fue generada automáticamente el 30 de enero de 2025 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-cluster-connection-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-cluster-connection-troubleshooting.mdx
new file mode 100644
index 000000000..e1db9b482
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-cluster-connection-troubleshooting.mdx
@@ -0,0 +1,215 @@
---
sidebar_position: 3
title: "Problemas de Conexión con el Clúster en Lens"
description: "Guía para solucionar los problemas de conexión de Lens IDE con clústeres de Kubernetes"
date: "2025-01-27"
category: "clúster"
tags: ["lens", "conexión", "solución de problemas", "kubernetes", "redes"]
---

import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";

# Problemas de Conexión con el Clúster en Lens

**Fecha:** 27 de enero de 2025
**Categoría:** Clúster
**Etiquetas:** Lens, Conexión, Solución de Problemas, Kubernetes, Redes

## Descripción del Problema 
+ +**Contexto:** Los usuarios experimentan problemas de conectividad al intentar acceder a su clúster de Kubernetes a través de Lens IDE, a pesar de que los servicios del clúster funcionan normalmente y son accesibles desde fuentes externas. + +**Síntomas Observados:** + +- Lens IDE no puede conectarse al clúster de Kubernetes +- Los servicios del clúster continúan funcionando normalmente +- El acceso externo a los servicios funciona correctamente +- Los problemas de conexión aparecen repentinamente sin cambios en la configuración +- Lens muestra errores de tiempo de espera de conexión o de autenticación + +**Configuración Relevante:** + +- Herramienta: Lens IDE (IDE para Kubernetes) +- Clúster: Clúster de Kubernetes gestionado por SleakOps +- Red: Entorno local de desarrollo +- Método de acceso: configuración de kubectl + +**Condiciones de Error:** + +- La conexión falla específicamente a través de Lens IDE +- El problema ocurre de forma intermitente o tras un período de inactividad +- Otros comandos kubectl también pueden verse afectados +- Podrían estar involucrados problemas de VPN o conectividad de red + +## Solución Detallada + + + +La solución más común es reiniciar tu conexión de red: + +**Para Windows:** + +1. Abre el Símbolo del sistema como Administrador +2. Ejecuta los siguientes comandos: + ```cmd + ipconfig /release + ipconfig /renew + ipconfig /flushdns + ``` +3. Reinicia el adaptador de red desde Configuración de Red + +**Para macOS:** + +1. Apaga el Wi-Fi desde la barra de menú +2. Espera 10 segundos +3. Enciende nuevamente el Wi-Fi +4. O usa Terminal: + ```bash + sudo dscacheutil -flushcache + sudo killall -HUP mDNSResponder + ``` + +**Para Linux:** + +```bash +sudo systemctl restart NetworkManager +# o +sudo service network-manager restart +``` + + + + + +1. **Actualizar la conexión del clúster en Lens:** + + - Haz clic derecho sobre tu clúster en Lens + - Selecciona "Actualizar" + - Espera a que se restablezca la conexión + +2. 
**Limpiar la caché de Lens:** + + - Cierra Lens completamente + - Borra la caché de la aplicación: + - **Windows:** `%APPDATA%\Lens` + - **macOS:** `~/Library/Application Support/Lens` + - **Linux:** `~/.config/Lens` + - Reinicia Lens + +3. **Volver a agregar la configuración del clúster:** + - Elimina el clúster de Lens + - Reimporta tu archivo kubeconfig + - Prueba la conexión + + + + + +Antes de solucionar problemas en Lens, verifica que kubectl funcione: + +```bash +# Prueba la conectividad básica +kubectl cluster-info + +# Comprueba si puedes listar los nodos +kubectl get nodes + +# Verifica la autenticación +kubectl auth can-i get pods +``` + +Si kubectl no funciona, el problema está en la configuración de tu clúster, no específicamente en Lens. + + + + + +Si usas una VPN para acceder a tu clúster: + +1. **Verifica la conexión VPN:** + + - Revisa el estado del cliente VPN + - Asegúrate de estar conectado al servidor VPN correcto + - Intenta desconectar y reconectar + +2. **Prueba la conectividad de red:** + + ```bash + # Haz ping al servidor API de tu clúster + ping your-cluster-api-server.com + + # Prueba la conectividad al puerto + telnet your-cluster-api-server.com 443 + ``` + +3. 
**Problemas de resolución DNS:** + + ```bash + # Verifica la resolución DNS + nslookup your-cluster-api-server.com + + # Prueba usar diferentes servidores DNS + # DNS de Google: 8.8.8.8, 8.8.4.4 + # DNS de Cloudflare: 1.1.1.1, 1.0.0.1 + ``` + + + + + +Si tu clúster usa credenciales temporales (como AWS EKS), estas pueden haber expirado: + +**Para clústeres EKS:** + +```bash +# Actualiza kubeconfig +aws eks update-kubeconfig --region your-region --name your-cluster-name + +# Verifica la actualización +kubectl get nodes +``` + +**Para otros proveedores de nube:** + +- **GKE:** `gcloud container clusters get-credentials` +- **AKS:** `az aks get-credentials` + +**Verifica la expiración del token:** + +```bash +# Muestra el contexto actual +kubectl config current-context + +# Muestra la configuración detallada +kubectl config view --minify +``` + + + + + +1. **Configuración del firewall:** + + - Asegúrate de que Lens esté permitido en tu firewall + - Verifica que los puertos 443 y 6443 estén abiertos + - Desactiva temporalmente el firewall para probar + +2. **Proxy corporativo:** + + - Configura los ajustes de proxy en las preferencias de Lens + - Establece variables de entorno: + ```bash + export HTTP_PROXY=http://proxy.company.com:8080 + export HTTPS_PROXY=http://proxy.company.com:8080 + export NO_PROXY=localhost,127.0.0.1 + ``` + +3. 
**Problemas con certificados:** + - Verifica si tu organización usa certificados personalizados + - Importa los certificados en el almacén de confianza de tu sistema + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-vpn-dns-resolution-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-vpn-dns-resolution-issue.mdx new file mode 100644 index 000000000..682c28785 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-vpn-dns-resolution-issue.mdx @@ -0,0 +1,156 @@ +--- +sidebar_position: 3 +title: "Problema de Resolución DNS en VPN de Lens" +description: "Solución para problemas de resolución DNS al usar Lens con VPN Pritunl" +date: "2025-03-05" +category: "usuario" +tags: ["lens", "vpn", "pritunl", "dns", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Resolución DNS en VPN de Lens + +**Fecha:** 5 de marzo de 2025 +**Categoría:** Usuario +**Etiquetas:** Lens, VPN, Pritunl, DNS, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas de resolución DNS al intentar conectarse a clústeres de Kubernetes a través de Lens mientras están conectados a la VPN de Pritunl. El sistema resuelve DNS públicos en lugar de DNS internos, impidiendo el acceso correcto al clúster. 
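Para diagnosticarlo rápidamente puede servir este esbozo en Python (puramente ilustrativo, no forma parte de SleakOps ni de Pritunl; el nombre de host comentado es un supuesto): comprueba si un endpoint está resolviendo a una IP privada, que es lo esperado con la VPN activa, o a una pública.

```python
import ipaddress
import socket

def resuelve_a_ip_privada(host: str) -> bool:
    """True si `host` resuelve a una IP privada o de loopback."""
    ip = socket.gethostbyname(host)
    return ipaddress.ip_address(ip).is_private

# Con la VPN activa, el endpoint interno del clúster (nombre supuesto)
# debería devolver True; si devuelve False, se está usando DNS público:
# resuelve_a_ip_privada("api.cluster.interno.ejemplo")
print(resuelve_a_ip_privada("localhost"))  # loopback también cuenta como privada
```

Si el endpoint del clúster devuelve `False` mientras la VPN está conectada, el sistema está ignorando el DNS interno y aplica la solución descrita a continuación.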
+ +**Síntomas Observados:** + +- Lens no puede conectarse al clúster de Kubernetes a pesar de estar conectado a la VPN +- Errores de resolución DNS al acceder a recursos del clúster +- Se resuelve DNS público en lugar de DNS interno/privado +- La conexión funciona de manera intermitente o no funciona en absoluto + +**Configuración Relevante:** + +- Herramienta: Lens (IDE de Kubernetes) +- VPN: Pritunl +- Conexión: VPN activa y conectada +- Kubeconfig: Configurado e importado correctamente + +**Condiciones de Error:** + +- El error ocurre cuando Lens intenta resolver los endpoints del clúster +- El problema persiste incluso con la conexión VPN activa +- La resolución DNS por defecto es pública en lugar de privada +- El problema ocurre comúnmente después de reinicios del sistema o cambios en la red + +## Solución Detallada + + + +Este es un problema común que ocurre cuando el sistema resuelve DNS públicos en lugar del DNS interno de la VPN. Sigue estos pasos para resolverlo: + +1. **Cerrar Lens completamente** + + - Asegúrate de que Lens esté totalmente cerrado (revisa la bandeja del sistema) + - Finaliza cualquier proceso de Lens restante si es necesario + +2. **Reconectar a la VPN Pritunl** + + - Desconéctate de la conexión VPN actual + - Espera unos segundos + - Vuelve a conectarte a la VPN + +3. **Restablecer el servicio DNS en Pritunl** + + - Abre el cliente Pritunl + - Ve a **Opciones** o **Configuración** + - Busca la opción **"Restablecer servicio DNS"** + - Haz clic para restablecer la configuración DNS + +4. **Reabrir Lens** + - Inicia Lens nuevamente + - Intenta conectarte a tu clúster + + + + + +Para confirmar que la resolución DNS funciona correctamente: + +1. **Verificar estado de la VPN** + + ```bash + # En Windows + nslookup your-cluster-endpoint + + # En macOS/Linux + dig your-cluster-endpoint + ``` + +2. **Probar conectividad al clúster** + + ```bash + kubectl cluster-info + kubectl get nodes + ``` + +3. 
**Verificar en Lens**
   - Abre Lens
   - Conéctate a tu clúster
   - Revisa si los recursos cargan correctamente


Si el restablecimiento del DNS no funciona, prueba estas alternativas:

**Opción 1: Reiniciar servicios de red**

```bash
# Windows (ejecutar como administrador)
ipconfig /flushdns
ipconfig /release
ipconfig /renew

# macOS
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder

# Linux
sudo systemctl restart systemd-resolved
```

**Opción 2: Configuración manual de DNS**

- Configura tu sistema para usar los servidores DNS de la VPN
- Añade los servidores DNS internos a tu configuración de red
- Asegúrate de que el DNS de la VPN tenga prioridad sobre el DNS del sistema

**Opción 3: Usar kubectl directamente**

- Si Lens continúa presentando problemas, usa kubectl desde la terminal
- Esto evita la resolución DNS de Lens
- Configura correctamente tu kubeconfig para acceso directo


Para minimizar problemas de resolución DNS:

1. **Siempre conecta la VPN antes de abrir Lens**
2. **Usa la función de restablecer DNS de Pritunl regularmente**
3. **Mantén el cliente Pritunl actualizado**
4. **Configura DNS estático si es necesario**
5. **Monitorea cambios en la red que puedan afectar el DNS**

**Consejo profesional:** Crea un script o rutina:

1. Conectar a la VPN
2. Restablecer servicio DNS
3. Esperar 10 segundos
4. 
Abrir Lens + + + +--- + +_Esta FAQ fue generada automáticamente el 5 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-wsl-aws-cli-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-wsl-aws-cli-configuration.mdx new file mode 100644 index 000000000..389c0af9d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/lens-wsl-aws-cli-configuration.mdx @@ -0,0 +1,236 @@ +--- +sidebar_position: 3 +title: "Lens con WSL y Configuración de AWS CLI" +description: "Solución para problemas de autenticación en Lens al usar WSL con AWS CLI" +date: "2024-12-19" +category: "usuario" +tags: ["lens", "wsl", "aws-cli", "autenticación", "windows"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Lens con WSL y Configuración de AWS CLI + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Usuario +**Etiquetas:** Lens, WSL, AWS CLI, Autenticación, Windows + +## Descripción del Problema + +**Contexto:** Un miembro del equipo que usa Windows con WSL intenta acceder a un clúster de Kubernetes a través de Lens pero encuentra errores de autenticación debido a que AWS CLI no se encuentra en el entorno de Windows. 
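El plugin de credenciales de Lens no está escrito en Python; este esbozo solo reproduce, a modo ilustrativo, la búsqueda del ejecutable en el `PATH` que origina el error "executable aws not found":

```python
import shutil

def localizar_aws() -> str:
    """Busca el binario `aws` en el PATH del entorno actual."""
    ruta = shutil.which("aws")
    if ruta is None:
        # Mismo escenario que reporta Lens cuando AWS CLI está
        # instalado en WSL pero el proceso corre en Windows
        return "executable aws not found"
    return f"aws encontrado en: {ruta}"

print(localizar_aws())
```

Ejecutado dentro de WSL (con AWS CLI instalado) imprime la ruta del binario; ejecutado desde un Windows sin AWS CLI reproduce exactamente el mensaje de error, lo que confirma que el problema es el entorno desde el que Lens lanza el plugin.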

**Síntomas Observados:**

- Lens no puede autenticarse con el clúster de Kubernetes
- Mensaje de error: "executable aws not found"
- El proxy de autenticación no inicia correctamente
- La conexión al clúster falla con Error Interno del Servidor (500)

**Configuración Relevante:**

- Sistema Operativo: Windows con WSL
- Herramienta: Lens (IDE de Kubernetes)
- Autenticación: Plugin de credenciales AWS CLI
- Entorno: Configuración mixta Windows/WSL

**Condiciones de Error:**

- Lens abre terminal de Windows en lugar de terminal WSL
- AWS CLI está instalado en WSL pero no en Windows
- El plugin de credenciales no puede encontrar el ejecutable AWS
- La autenticación falla durante la conexión al clúster

## Solución Detallada


La mejor opción es instalar Lens directamente dentro del entorno WSL:

### Requisitos Previos

- WSL2 con Ubuntu o distribución Linux similar
- Reenvío X11 o WSLg para aplicaciones GUI

### Pasos de Instalación

1. **Habilitar soporte GUI en WSL:**

   ```bash
   # Para WSL2 con WSLg (Windows 11)
   # WSLg está incluido por defecto, no se requiere configuración adicional

   # Para versiones anteriores de Windows, instalar servidor X11
   # Instalar VcXsrv o servidor X11 similar en Windows
   export DISPLAY=:0
   ```

2. **Instalar Lens en WSL:**

   ```bash
   # Descargar Lens AppImage
   wget https://api.k8slens.dev/binaries/Lens-2023.12.151757-latest.x86_64.AppImage

   # Hacerlo ejecutable
   chmod +x Lens-2023.12.151757-latest.x86_64.AppImage

   # Ejecutar Lens
   ./Lens-2023.12.151757-latest.x86_64.AppImage
   ```

3. **Verificar acceso a AWS CLI:**
   ```bash
   # Asegurarse de que AWS CLI esté correctamente configurado
   aws --version
   aws sts get-caller-identity
   ```


Asegúrate de que tu kubeconfig esté configurado correctamente dentro de WSL:

1. **Actualizar kubeconfig:**

   ```bash
   # Actualizar kubeconfig para tu clúster EKS
   aws eks update-kubeconfig --region <region> --name <cluster-name>
   ```

2. 
**Verificar acceso al clúster:**

   ```bash
   # Probar conectividad con el clúster
   kubectl get nodes
   kubectl cluster-info
   ```

3. **Comprobar ubicación de kubeconfig:**
   ```bash
   # Asegurarse de que kubeconfig esté en la ubicación esperada
   ls -la ~/.kube/config
   ```


Si prefieres mantener Lens en Windows, instala AWS CLI en Windows:

### Pasos de Instalación

1. **Descargar AWS CLI para Windows:**

   - Visitar: https://aws.amazon.com/cli/
   - Descargar el instalador para Windows
   - Ejecutar el instalador con privilegios de administrador

2. **Configurar AWS CLI:**

   ```powershell
   # Abrir PowerShell (en el Símbolo del sistema omite los comentarios con #)
   aws configure
   # Introducir tu AWS Access Key ID, Secret Access Key, región y formato de salida
   ```

3. **Copiar credenciales de WSL (si es necesario):**

   ```powershell
   # Copiar credenciales de WSL a Windows
   # Desde WSL, copiar el contenido del directorio ~/.aws/ al %USERPROFILE%\.aws\ de Windows
   ```

4. **Actualizar kubeconfig en Windows:**
   ```powershell
   aws eks update-kubeconfig --region <region> --name <cluster-name>
   ```


### Problemas comunes y soluciones

**Problema 1: Reenvío X11 no funciona**

```bash
# Instalar paquetes requeridos
sudo apt update
sudo apt install x11-apps

# Probar reenvío X11
xeyes
```

**Problema 2: Credenciales AWS no encontradas**

```bash
# Verificar credenciales AWS
aws configure list
cat ~/.aws/credentials
cat ~/.aws/config
```

**Problema 3: Problemas con la ruta de kubeconfig**

```bash
# Establecer variable de entorno KUBECONFIG
export KUBECONFIG=~/.kube/config

# Añadir a ~/.bashrc para persistencia
echo 'export KUBECONFIG=~/.kube/config' >> ~/.bashrc
```

**Problema 4: Errores de permiso denegado**

```bash
# Corregir permisos de kubeconfig
chmod 600 ~/.kube/config
chown $USER:$USER ~/.kube/config
```


### Configuración recomendada

1. 
**Usar WSL2 para todas las herramientas de Kubernetes:** + + - Instalar kubectl, helm, aws-cli en WSL + - Mantener todas las configuraciones en el entorno WSL + - Usar WSL para todas las interacciones con el clúster + +2. **Consistencia del entorno:** + + ```bash + # Añadir a ~/.bashrc + export KUBECONFIG=~/.kube/config + export AWS_PROFILE=default + export AWS_REGION=us-west-2 + ``` + +3. **Script de instalación de herramientas:** + + ```bash + #!/bin/bash + # install-k8s-tools.sh + + # Instalar kubectl + curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" + sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl + + # Instalar AWS CLI + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + + # Instalar Helm + curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/load-balancer-host-header-routing.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/load-balancer-host-header-routing.mdx new file mode 100644 index 000000000..ced57fbc9 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/load-balancer-host-header-routing.mdx @@ -0,0 +1,229 @@ +--- +sidebar_position: 3 +title: "Problema de enrutamiento del encabezado Host en el balanceador de carga" +description: "Solución para errores HTTP 404 al usar dominios personalizados con balanceadores de carga" +date: "2024-01-15" +category: "workload" +tags: + [ + "balanceador-de-carga", + "nginx", + "encabezado-host", + "enrutamiento", + "cloudflare", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de enrutamiento del encabezado Host en el 
balanceador de carga + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Balanceador de carga, Nginx, Encabezado Host, Enrutamiento, Cloudflare + +## Descripción del problema + +**Contexto:** El usuario ha desplegado un servicio Nginx escuchando en el puerto 80, configurado para servir contenido para dominios específicos ("velo.la" y "www.velo.la"). El servicio es accesible a través del dominio proporcionado por SleakOps pero devuelve HTTP 404 cuando se accede mediante dominios personalizados a través de CNAME de Cloudflare o manipulación directa del encabezado Host. + +**Síntomas observados:** + +- El servicio funciona correctamente cuando se accede a través del dominio de SleakOps: `main-proxy.develop.us-east-2.velo.la` +- Error HTTP 404 al acceder mediante redirección CNAME de Cloudflare desde `velo.la` al dominio de SleakOps +- Error HTTP 404 al usar `/etc/hosts` para apuntar `velo.la` a la IP del balanceador de carga +- `curl -H "Host: velo.la"` devuelve HTTP 404, mientras que curl directo al dominio de SleakOps devuelve 200 OK +- Los registros de Nginx muestran que las solicitudes llegan para acceso directo al dominio pero no para encabezados Host personalizados + +**Configuración relevante:** + +- Servicio: Nginx escuchando en el puerto 80 +- Dominios esperados: `velo.la`, `www.velo.la` +- Dominio SleakOps: `main-proxy.develop.us-east-2.velo.la` +- DNS: CNAME de Cloudflare apuntando dominios personalizados al dominio de SleakOps + +**Condiciones de error:** + +- HTTP 404 ocurre cuando el encabezado Host difiere del dominio proporcionado por SleakOps +- El balanceador de carga parece filtrar solicitudes basándose en el encabezado Host +- Las solicitudes con encabezados Host personalizados no llegan al contenedor Nginx + +## Solución detallada + + + +El problema ocurre porque los balanceadores de carga de SleakOps realizan **enrutamiento basado en host** por defecto. 
Cuando accedes al servicio usando un dominio personalizado (mediante encabezado Host o CNAME), el balanceador de carga no reconoce el dominio personalizado y devuelve un 404 antes de que la solicitud llegue a tu contenedor Nginx. + +Por eso: + +- `curl https://main-proxy.develop.us-east-2.velo.la` funciona (dominio reconocido) +- `curl -H "Host: velo.la" https://main-proxy.develop.us-east-2.velo.la` falla (dominio no reconocido) + + + + + +Para solucionar este problema, debes configurar tus dominios personalizados en la configuración del servicio SleakOps: + +1. **Accede a la configuración de tu servicio** en el panel de SleakOps +2. **Navega a la sección de Redes (Networking)** +3. **Agrega los dominios personalizados** a la lista de hosts permitidos: + +```yaml +# Ejemplo de configuración del servicio +networking: + public: true + domains: + - "main-proxy.develop.us-east-2.velo.la" # Dominio por defecto de SleakOps + - "velo.la" # Tu dominio personalizado + - "www.velo.la" # Tu subdominio www + ports: + - port: 80 + protocol: HTTP +``` + +4. **Aplica la configuración** y espera a que se actualice el despliegue + + + + + +Asegúrate de que la configuración de Nginx maneje correctamente los dominios personalizados: + +```nginx +server { + listen 80; + server_name velo.la www.velo.la main-proxy.develop.us-east-2.velo.la; + + location / { + # Configuración de tu aplicación + root /usr/share/nginx/html; + index index.html index.htm; + } +} +``` + +Si usas un archivo de configuración Nginx personalizado, asegúrate de incluir todos los dominios que deseas servir. + + + + + +Para la configuración en Cloudflare, asegúrate de usar la configuración correcta: + +1. **Registros CNAME:** + + - `velo.la` → `main-proxy.develop.us-east-2.velo.la` + - `www.velo.la` → `main-proxy.develop.us-east-2.velo.la` + +2. 
**Configuración SSL/TLS:** + + - Establece el modo de cifrado SSL/TLS a **"Full"** o **"Full (strict)"** + - Asegúrate de que **"Always Use HTTPS"** esté habilitado si es necesario + +3. **Estado del proxy:** + - Puedes mantener la nube naranja (proxy activado) habilitada + - O desactivarla (nube gris) para resolución DNS directa + + + + + +Después de configurar los dominios personalizados, prueba la configuración: + +1. **Prueba acceso directo:** + +```bash +curl -v https://velo.la +curl -v https://www.velo.la +``` + +2. **Prueba con encabezado Host explícito:** + +```bash +curl -v -H "Host: velo.la" https://main-proxy.develop.us-east-2.velo.la +``` + +3. **Revisa los registros del servicio:** + +```bash +# En el panel de SleakOps, revisa los registros de tu servicio +# Deberías ver solicitudes llegando para todos los dominios configurados +``` + +4. **Verifica resolución DNS:** + +```bash +nslookup velo.la +nslookup www.velo.la +``` + + + + + +Si no puedes configurar dominios personalizados en SleakOps, considera estas alternativas: + +1. **Usar un proxy inverso:** + Despliega un servicio proxy inverso separado que maneje los dominios personalizados y reenvíe las solicitudes a tu servicio principal. + +2. **Configurar Nginx para aceptar cualquier host:** + +```nginx +server { + listen 80 default_server; + server_name _; + + location / { + # Configuración de tu aplicación + } +} +``` + +3. **Usar Cloudflare Workers:** + Crea un Worker de Cloudflare para modificar el encabezado Host antes de reenviar las solicitudes. + + + + + +**Si el problema persiste:** + +1. **Verifica el estado del servicio SleakOps** - Asegúrate de que el servicio esté activo y saludable +2. **Verifica la configuración del balanceador de carga** - Contacta soporte de SleakOps para revisar la configuración del balanceador +3. 
**Prueba con diferentes herramientas:** + +```bash +# Prueba con wget +wget --server-response --header="Host: velo.la" https://main-proxy.develop.us-east-2.velo.la + +# Prueba con diferente User-Agent +curl -v -H "Host: velo.la" -H "User-Agent: Mozilla/5.0" https://main-proxy.develop.us-east-2.velo.la +``` + +4. **Revisa los logs del balanceador de carga:** + - Contacta el soporte de SleakOps para obtener acceso a los logs del balanceador + - Busca patrones de solicitudes rechazadas o filtradas + +5. **Considera usar herramientas de depuración de red:** + +```bash +# Usar tcpdump para capturar tráfico de red +sudo tcpdump -i any -s 0 -w capture.pcap host main-proxy.develop.us-east-2.velo.la + +# Analizar headers HTTP con curl verbose +curl -v -H "Host: velo.la" -H "X-Debug: true" https://main-proxy.develop.us-east-2.velo.la +``` + +**Mejores prácticas:** + +- Siempre configura todos los dominios que planeas usar en la configuración del servicio +- Mantén sincronizados los dominios en SleakOps y la configuración de Nginx +- Usa HTTPS siempre que sea posible para mayor seguridad +- Implementa health checks para monitorear la disponibilidad del servicio +- Documenta todos los dominios configurados para futuras referencias + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-grafana-connection-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-grafana-connection-issues.mdx new file mode 100644 index 000000000..4d92d4717 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-grafana-connection-issues.mdx @@ -0,0 +1,159 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión entre Loki y Grafana" +description: "Solución de problemas de configuración del pod Loki que afectan la conectividad con Grafana" +date: "2024-11-22" +category: 
"dependencia" +tags: ["loki", "grafana", "monitorización", "solución de problemas", "pods"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Conexión entre Loki y Grafana + +**Fecha:** 22 de noviembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Loki, Grafana, Monitorización, Solución de problemas, Pods + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas de conectividad entre Grafana y el servicio de registro Loki en los clústeres SleakOps, lo que impide la visualización adecuada de logs y la funcionalidad de monitorización. + +**Síntomas Observados:** + +- Grafana no puede conectarse a la fuente de datos de Loki +- El servicio Loki parece no responder o ser inaccesible +- Las consultas de logs fallan o agotan el tiempo en la interfaz de Grafana +- Los paneles de monitorización no muestran datos de logs + +**Configuración Relevante:** + +- Servicio: pila de registro Loki +- Componente afectado: pod `loki-read` +- Interfaz: panel de Grafana +- Herramienta de gestión: IDE Kubernetes Lens + +**Condiciones de Error:** + +- El error ocurre tras actualizaciones del clúster o reinicios del pod +- El problema persiste hasta que se recrea manualmente el pod +- Afecta la agregación y monitorización de logs +- Puede impactar a múltiples usuarios que acceden a los paneles de Grafana + +## Solución Detallada + + + +Para resolver inmediatamente el problema de conexión: + +1. **Abre el IDE Kubernetes Lens** +2. **Navega a tu clúster SleakOps** +3. **Ve a Workloads → Pods** +4. **Busca el pod `loki-read`** +5. **Haz clic derecho sobre el pod y selecciona "Eliminar"** +6. **Espera a que el pod se recree automáticamente** +7. **Prueba la conectividad en Grafana** + +El pod será recreado automáticamente por el controlador de despliegue, lo que debería resolver el error de configuración. 


Después de eliminar el pod, verifica que se haya recreado correctamente:

```bash
# Verificar estado del pod
kubectl get pods -n monitoring | grep loki-read

# Verificar que el pod esté en ejecución y listo
kubectl describe pod <nombre-del-pod> -n monitoring

# Revisar logs del pod para detectar errores
kubectl logs <nombre-del-pod> -n monitoring
```

El pod debería mostrar estado `Running` con todos los contenedores listos (por ejemplo, `1/1`).


Después de la recreación del pod:

1. **Accede al panel de Grafana**
2. **Ve a Configuración → Fuentes de datos**
3. **Encuentra la fuente de datos Loki**
4. **Haz clic en "Probar" para verificar la conectividad**
5. **Intenta ejecutar una consulta simple de logs**:
   ```
   {namespace="default"}
   ```

Si es exitoso, deberías ver entradas de logs apareciendo en los resultados de la consulta.


Si prefieres usar kubectl en lugar de Lens:

```bash
# Listar pods de loki
kubectl get pods -n monitoring | grep loki

# Eliminar el pod loki-read
kubectl delete pod <nombre-del-pod> -n monitoring

# Observar la recreación del pod
kubectl get pods -n monitoring -w | grep loki-read
```

Reemplaza `<nombre-del-pod>` con el nombre real del pod obtenido en el primer comando.


Para monitorear este problema en el futuro:

1. **Configura alertas para reinicios del pod Loki**
2. **Monitorea la salud de la fuente de datos en Grafana**
3. 
**Revisa regularmente los logs del pod para detectar errores de configuración** + +```yaml +# Ejemplo de regla de alerta para problemas con el pod Loki +groups: + - name: loki-alerts + rules: + - alert: LokiPodDown + expr: up{job="loki-read"} == 0 + for: 2m + labels: + severity: warning + annotations: + summary: "El pod de lectura de Loki está caído" +``` + + + + + +Escala al soporte de SleakOps si: + +- La recreación del pod no resuelve el problema +- El problema se repite frecuentemente (más de una vez al día) +- Se afectan múltiples componentes de Loki +- Grafana muestra errores persistentes de conexión tras reiniciar el pod + +Incluye en tu solicitud de soporte: + +- Logs del pod antes y después de la recreación +- Mensajes de error de Grafana +- Información del clúster y namespace + + + +--- + +_Esta FAQ fue generada automáticamente el 22 de noviembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx new file mode 100644 index 000000000..2d59b86a4 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-log-explorer-dashboard-loading-issue.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "Problema de carga del panel de explorador de registros de Loki" +description: "Solución para el panel de Grafana Log Explorer atascado en estado de carga" +date: "2025-02-12" +category: "dependency" +tags: ["loki", "grafana", "logs", "dashboard", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de carga del panel de explorador de registros de Loki + +**Fecha:** 12 de febrero de 2025 +**Categoría:** Dependencia +**Etiquetas:** Loki, Grafana, Registros, Panel, Solución de problemas + +## Descripción del problema + +**Contexto:** 
Los usuarios experimentan problemas con el panel Grafana Log Explorer en la plataforma SleakOps donde los registros no se cargan y la interfaz permanece atascada en un estado de carga indefinidamente. + +**Síntomas observados:** + +- El panel Log Explorer muestra un spinner de carga continuo +- Los registros nunca aparecen en la interfaz de Grafana +- El panel permanece sin respuesta +- Otros paneles de Grafana pueden funcionar normalmente + +**Configuración relevante:** + +- Componente: sistema de registro Loki +- Interfaz: panel Grafana Log Explorer +- Pods afectados: `loki-backend` y `loki-read` +- Plataforma: entorno Kubernetes de SleakOps + +**Condiciones de error:** + +- Ocurre específicamente con paneles relacionados con registros +- Sucede cuando los pods `loki-backend` se reinician sin que los pods `loki-read` se reinicien +- Resulta en una ruptura de comunicación entre los componentes de Loki + +## Solución detallada + + + +Este es un problema conocido con Loki donde los pods `loki-read` pierden su capacidad de comunicarse con los pods `loki-backend` después de que el backend se reinicia. 
Esto crea un estado donde: + +- Los pods `loki-backend` están ejecutándose con nuevas configuraciones +- Los pods `loki-read` aún intentan usar parámetros de conexión antiguos +- El canal de comunicación entre los componentes está roto +- Las consultas de registros no pueden procesarse, resultando en una carga infinita + +Este problema está siendo rastreado en el repositorio oficial de Loki en GitHub: + +- [Issue #14384](https://github.com/grafana/loki/issues/14384#issuecomment-2612675359) +- [Issue #15191](https://github.com/grafana/loki/issues/15191) + + + + + +La solución inmediata es reiniciar los pods `loki-read` para restablecer la comunicación con el backend: + +**Usando kubectl:** + +```bash +# Encontrar los pods loki-read +kubectl get pods -n <namespace> | grep loki-read + +# Reiniciar un pod loki-read concreto +kubectl delete pod <nombre-del-pod> -n <namespace> + +# O reiniciar todos los pods loki-read a la vez +kubectl delete pods -l app=loki-read -n <namespace> +``` + +**Usando la interfaz de SleakOps:** + +1. Navega a la gestión de tu clúster +2. Ve a **Workloads** → **Pods** +3. Filtra por `loki-read` +4. Selecciona los pods y elige **Reiniciar** + +Los pods se reiniciarán automáticamente y restablecerán la conexión con el backend. + + + + + +Después de reiniciar los pods `loki-read`: + +1. **Espera 2-3 minutos** para que los pods se reinicien completamente +2. **Verifica el estado de los pods:** + + ```bash + kubectl get pods -n <namespace> | grep loki + ``` + + Todos los pods deberían mostrar estado `Running` + +3. **Prueba el Log Explorer:** + + - Abre el panel de Grafana + - Navega a Log Explorer + - Intenta consultar registros recientes + - Verifica que los registros se carguen correctamente + +4.
**Revisa los logs de los pods si persisten los problemas:** + ```bash + kubectl logs -f <nombre-del-pod> -n <namespace> + ``` + + + + + +Aunque este es un problema conocido de Loki que se está abordando en el proyecto original, puedes: + +**Monitorear el problema:** + +- Configurar alertas para cuando Log Explorer deje de responder +- Monitorear eventos de reinicio de pods de Loki +- Crear chequeos de salud para la ingesta de registros + +**Automatización temporal como solución:** + +```yaml +# Ejemplo de script de monitoreo +apiVersion: batch/v1 +kind: CronJob +metadata: + name: loki-health-check +spec: + schedule: "*/10 * * * *" # Cada 10 minutos + jobTemplate: + spec: + template: + spec: + containers: + - name: health-check + image: curlimages/curl + command: + - /bin/sh + - -c + - | + # Verificar si Loki responde + if ! curl -f http://loki-gateway/ready; then + echo "Loki no responde, puede necesitar intervención" + fi + restartPolicy: OnFailure +``` + +**Mantente actualizado:** + +- Monitorea los issues de GitHub mencionados para soluciones permanentes +- Actualiza Loki cuando se publiquen parches +- Considera usar el modo distribuido de Loki para mejor resiliencia + + + + + +Si reiniciar los pods `loki-read` no soluciona el problema: + +**1. Reiniciar todos los componentes de Loki:** + +```bash +# Reiniciar todos los pods de Loki +kubectl delete pods -l app.kubernetes.io/name=loki -n <namespace> +``` + +**2. Revisar la configuración de Loki:** + +```bash +# Ver el configmap de configuración de Loki +kubectl get configmap loki-config -n <namespace> -o yaml +``` + +**3. Verificar conectividad de red:** + +```bash +# Probar conectividad entre pods +kubectl exec -it <nombre-del-pod> -n <namespace> -- nslookup loki-backend +``` + +**4.
Revisar limitaciones de recursos:** + +```bash +# Ver si los pods están limitados en recursos +kubectl top pods -n <namespace> | grep loki +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 12 de febrero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-read-pods-connection-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-read-pods-connection-issue.mdx new file mode 100644 index 000000000..0a91ba7c6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/loki-read-pods-connection-issue.mdx @@ -0,0 +1,208 @@ +--- +sidebar_position: 15 +title: "Problema de Conexión de los Pods de Lectura de Loki" +description: "Solución para que los pods de lectura de Loki no se reconecten al backend después de un reinicio" +date: "2024-04-21" +category: "dependency" +tags: ["loki", "grafana", "monitoring", "pods", "connection", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Conexión de los Pods de Lectura de Loki + +**Fecha:** 21 de abril de 2024 +**Categoría:** Dependencia +**Etiquetas:** Loki, Grafana, Monitorización, Pods, Conexión, Solución de problemas + +## Descripción del Problema + +**Contexto:** La pila de monitorización Loki experimenta problemas de conectividad donde los pods de lectura no logran reconectarse al servicio backend después de que el pod backend se reinicia o queda inaccesible.
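Una forma rápida de detectar esta situación es comparar el instante de arranque del backend con el del pod de lectura: si el backend es más reciente, el pod de lectura conserva una conexión obsoleta. El siguiente boceto de shell ilustra la comparación; los instantes se pasan como epoch en segundos (obtenibles, por ejemplo, con `kubectl get pod <pod> -o jsonpath='{.status.startTime}'` y `date -d ... +%s`), y los nombres de función y mensajes son supuestos de este ejemplo:

```bash
# necesita_reinicio BACKEND_EPOCH READ_EPOCH
# Si el backend arrancó después que el pod de lectura, este último
# quedó con parámetros de conexión antiguos y conviene reiniciarlo.
necesita_reinicio() {
  if [ "$1" -gt "$2" ]; then
    echo "REINICIAR: loki-read arranco antes que el backend actual"
  else
    echo "OK: loki-read arranco despues del backend"
  fi
}
```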
+ +**Síntomas Observados:** + +- Los pods de lectura de Loki (`loki-read`) no pueden conectarse al backend de Loki +- Los errores de conexión persisten incluso después de restaurar los pods backend +- Los paneles de Grafana muestran problemas al recuperar datos +- Las consultas de logs fallan o retornan resultados incompletos + +**Configuración Relevante:** + +- Componente: pods de lectura de Loki (`loki-read`) +- Servicio backend: `loki-backend` +- Plataforma: despliegue de Loki basado en Kubernetes +- Comportamiento de conexión: conexión inicial solamente, sin reconexión automática + +**Condiciones de Error:** + +- Ocurre cuando los pods `loki-backend` se reinician o fallan +- Los pods de lectura sólo intentan conexión al iniciar +- No existe mecanismo automático de reconexión en la versión actual de Loki +- Requiere intervención manual para restaurar la conectividad + +## Solución Detallada + + + +Para resolver el problema de conexión de forma inmediata: + +1. **Identificar los pods afectados:** + + ```bash + kubectl get pods -l app=loki-read -n monitoring + ``` + +2. **Reiniciar los pods de lectura de Loki:** + + ```bash + kubectl delete pods -l app=loki-read -n monitoring + ``` + +3. **Verificar que los pods estén en ejecución:** + + ```bash + kubectl get pods -l app=loki-read -n monitoring -w + ``` + +4. **Comprobar la conectividad:** + ```bash + kubectl logs -l app=loki-read -n monitoring --tail=50 + ``` + +Los pods se reconectarán automáticamente al backend al reiniciarse. + + + + + +El problema ocurre porque: + +1. **Intento único de conexión**: Los pods de lectura de Loki sólo intentan conectarse al backend durante su fase de inicialización +2. **Sin lógica de reconexión**: Las versiones actuales de Loki carecen de mecanismos automáticos de reconexión +3. **Dependencia del backend**: Cuando `loki-backend` se reinicia, se pierden las conexiones existentes +4. 
**Conexión estática**: Los pods de lectura mantienen una conexión estática sin chequeos de salud + +Esta es una limitación conocida en versiones antiguas de Loki que ha sido corregida en versiones más recientes. + + + + + +La solución permanente implica actualizar a una versión de Loki que incluya el [PR #17063](https://github.com/grafana/loki/pull/17063): + +**Lo que aborda el PR:** + +- Añade capacidades de chequeo de salud a los pods de lectura de Loki +- Implementa lógica automática de reconexión +- Mejora la resiliencia de la conexión + +**Pasos para la implementación:** + +1. **Verificar la versión actual de Loki:** + + ```bash + kubectl get deployment loki-read -n monitoring -o jsonpath='{.spec.template.spec.containers[0].image}' + ``` + +2. **Actualizar valores de Helm para incluir chequeos de salud:** + + ```yaml + loki: + read: + livenessProbe: + enabled: true + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + enabled: true + httpGet: + path: /ready + port: 3100 + initialDelaySeconds: 15 + periodSeconds: 5 + ``` + +3. 
**Actualizar usando Helm:** + ```bash + helm upgrade loki grafana/loki -n monitoring -f values.yaml + ``` + + + + + +Para prevenir y detectar rápidamente este problema: + +**Configurar alertas de monitoreo:** + +```yaml +# Regla de alerta de Prometheus +groups: + - name: loki-connectivity + rules: + - alert: LokiReadPodsDisconnected + expr: up{job="loki-read"} == 0 + for: 2m + labels: + severity: warning + annotations: + summary: "Los pods de lectura de Loki están desconectados" + description: "Los pods de lectura de Loki han estado desconectados por más de 2 minutos" +``` + +**Script de chequeo de salud:** + +```bash +#!/bin/bash +# Comprobar conectividad de los pods de lectura de Loki +READ_PODS=$(kubectl get pods -l app=loki-read -n monitoring --no-headers | wc -l) +READY_PODS=$(kubectl get pods -l app=loki-read -n monitoring --no-headers | grep Running | wc -l) + +if [ "$READ_PODS" -ne "$READY_PODS" ]; then + echo "Advertencia: No todos los pods de lectura de Loki están listos" + kubectl get pods -l app=loki-read -n monitoring +fi +``` + + + + + +Si el problema persiste después de reiniciar los pods: + +1. **Verificar estado del pod backend:** + + ```bash + kubectl get pods -l app=loki-backend -n monitoring + kubectl logs -l app=loki-backend -n monitoring --tail=100 + ``` + +2. **Verificar conectividad del servicio:** + + ```bash + kubectl get svc loki-backend -n monitoring + kubectl describe svc loki-backend -n monitoring + ``` + +3. **Probar conectividad interna:** + + ```bash + kubectl run test-pod --rm -i --tty --image=busybox -- /bin/sh + # Dentro del pod: + nslookup loki-backend.monitoring.svc.cluster.local + wget -qO- http://loki-backend.monitoring.svc.cluster.local:3100/ready + ``` + +4. 
**Verificar políticas de red:** + ```bash + kubectl get networkpolicies -n monitoring + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 21 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-addons-pricing-guide.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-addons-pricing-guide.mdx new file mode 100644 index 000000000..3afb9c009 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-addons-pricing-guide.mdx @@ -0,0 +1,175 @@ +--- +sidebar_position: 3 +title: "Guía de Precios para Complementos de Monitoreo" +description: "Desglose completo de precios para los complementos de monitoreo Grafana, Loki, KubeCost y OpenTelemetry" +date: "2024-12-19" +category: "general" +tags: + [ + "precios", + "monitoreo", + "grafana", + "loki", + "kubecost", + "opentelemetry", + "complementos", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Guía de Precios para Complementos de Monitoreo + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** General +**Etiquetas:** Precios, Monitoreo, Grafana, Loki, KubeCost, OpenTelemetry, Complementos + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender la estructura de precios para los complementos de monitoreo y observabilidad en SleakOps, incluyendo costos individuales de componentes y dependencias. 
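Como referencia rápida, el costo mensual aproximado de la pila completa puede esbozarse sumando los precios orientativos que se detallan en esta guía (valores aproximados que pueden variar según el uso):

```bash
# Precios aproximados (USD/mes) de cada complemento según esta guía
grafana=20
loki=10
kubecost=10
otel=10
total=$((grafana + loki + kubecost + otel))
echo "Pila completa de monitoreo: ~${total} USD/mes"
```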
+ +**Síntomas Observados:** + +- Necesidad de estimar costos antes de implementar la pila de monitoreo +- Incertidumbre sobre el precio individual de cada complemento +- Preguntas sobre las dependencias entre componentes de monitoreo +- Confusión sobre qué impulsa los costos reales + +**Configuración Relevante:** + +- Pila de monitoreo: Grafana + Loki + KubeCost + OpenTelemetry +- Retención de logs: 14 días (configurable) +- Infraestructura: instancias spot dentro del clúster existente +- Base de datos: RDS requerida para Grafana + +**Condiciones de Error:** + +- Planificación presupuestaria sin desglose claro de costos +- Posible sobreaprovisionamiento debido a falta de transparencia en precios +- Dificultad para justificar inversiones en monitoreo + +## Solución Detallada + + + +**Precios Actuales de Complementos de Monitoreo (Aproximados):** + +- **Grafana**: 20 USD/mes +- **Loki**: 10 USD/mes +- **KubeCost**: 10 USD/mes +- **OpenTelemetry**: 10 USD/mes (complemento próximo) + +**Notas Importantes:** + +- Loki requiere que Grafana esté instalado primero +- Costo total para la pila completa de monitoreo: aproximadamente 40-50 USD/mes +- Los precios son aproximados y pueden variar según el uso + + + + + +**Costos Fijos:** + +- **Base de datos RDS**: Requerida para Grafana (componente principal de costo fijo) + +**Costos Variables:** + +- **Recursos de cómputo**: Ejecutados en instancias spot dentro de su clúster existente +- **Almacenamiento**: Buckets S3 para retención de logs y métricas +- **Red**: Costos de transferencia de datos (mínimos) + +**Optimización de Costos:** + +- Los complementos de monitoreo comparten instancias con complementos Esenciales +- En muchos casos, no se necesitan nuevas instancias +- Las instancias spot reducen significativamente los costos de cómputo + + + + + +**Dependencias Requeridas:** + +1. **Grafana** (independiente) + + - Requiere: base de datos RDS + - Costo: 20 USD/mes + +2. 
**Loki** (agregación de logs) + + - Requiere: Grafana debe estar instalado primero + - Costo adicional: 10 USD/mes + - Combinado con Grafana: 30 USD/mes + +3. **KubeCost** (monitoreo de costos) + + - Puede instalarse independientemente + - Costo: 10 USD/mes + +4. **OpenTelemetry** (próximo) + - Parte de la pila de observabilidad + - Costo estimado: 10 USD/mes + + + + + +**Beneficios de Retención en SleakOps:** + +- **Retención estándar**: 14 días (incluido en el precio base) +- **Retención extendida**: Hasta 3 meses con aumento mínimo de costo +- **Ubicación de almacenamiento**: Buckets S3 (rentable) +- **Sin impacto en rendimiento**: Los datos históricos no afectan el rendimiento del clúster + +**Comparación con otras soluciones:** + +```yaml +# Monitoreo tradicional (costoso) +retention_days: 14 +storage_type: "cluster-local" +cost_increase: "lineal con la retención" + +# Enfoque SleakOps (rentable) +retention_days: 90 +storage_type: "s3-bucket" +cost_increase: "mínimo" +``` + + + + + +**Para la Planificación Presupuestaria:** + +1. **Configuración mínima de monitoreo**: + + - Solo Grafana: 20 USD/mes + - Bueno para visualización básica de métricas + +2. **Configuración estándar de monitoreo**: + + - Grafana + Loki: 30 USD/mes + - Incluye agregación y análisis de logs + +3. **Configuración completa de monitoreo**: + + - Grafana + Loki + KubeCost: 40 USD/mes + - Observabilidad completa con seguimiento de costos + +4. 
**Configuración preparada para el futuro**: + - Todos los complementos + OpenTelemetry: 50 USD/mes + - Observabilidad completa y trazabilidad + +**Consejos para Optimización de Costos:** + +- Comience con complementos esenciales para compartir recursos de cómputo +- Monitoree el uso real antes de agregar más componentes +- Aproveche el almacenamiento en S3 para retención a largo plazo +- Considere patrones de uso estacionales para la planificación de costos + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx new file mode 100644 index 000000000..352dc0aee --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-alternatives-datadog-vs-sleakops-addons.mdx @@ -0,0 +1,574 @@ +--- +sidebar_position: 3 +title: "Alternativas de Monitoreo: DataDog vs Complementos Nativos de SleakOps" +description: "Comparación entre DataDog y los complementos nativos de monitoreo de SleakOps para métricas y telemetría de aplicaciones" +date: "2024-03-25" +category: "dependencia" +tags: + ["monitoreo", "datadog", "grafana", "loki", "otel", "telemetría", "métricas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Alternativas de Monitoreo: DataDog vs Complementos de SleakOps + +**Fecha:** 25 de marzo de 2024 +**Categoría:** Dependencia +**Etiquetas:** Monitoreo, DataDog, Grafana, Loki, OTEL, Telemetría, Métricas + +## Descripción del Problema + +**Contexto:** Los usuarios que migran desde soluciones externas de monitoreo como DataDog necesitan comprender las opciones de monitoreo disponibles en SleakOps y cómo implementar la recopilación de métricas a 
nivel de aplicación para KPIs de negocio y monitoreo de rendimiento. + +**Síntomas Observados:** + +- Pérdida de capacidades de monitoreo de DataDog después de la migración a SleakOps +- Necesidad de recolección de métricas a nivel de aplicación para KPIs de negocio +- Incertidumbre sobre las alternativas de monitoreo disponibles en SleakOps +- Preguntas sobre las implicaciones de costos de diferentes soluciones de monitoreo + +**Configuración Relevante:** + +- La aplicación usa OpenTelemetry para exportar métricas +- Configuración previa de DataDog con claves API y recolectores +- Aplicación Spring Boot con configuración de gestión de métricas +- Necesidad de etiquetas personalizadas para la aplicación y métricas específicas por entorno + +**Condiciones de Error:** + +- Infraestructura de monitoreo ausente tras la migración de plataforma +- Incapacidad para rastrear métricas de rendimiento del negocio +- Falta de visibilidad en el rendimiento de la aplicación + +## Solución Detallada + + + +SleakOps proporciona un stack de monitoreo completo a través de complementos nativos: + +**Loki (Gestión de Logs):** + +- Persiste logs de todos los componentes del clúster +- Incluye tanto logs de aplicaciones como de controladores +- Proporciona agregación centralizada y búsqueda de logs + +**Grafana (Métricas y Visualización):** + +- Recopila y persiste métricas de CPU, Memoria, Red y E/S +- Monitorea todos los componentes del clúster y aplicaciones +- Proporciona paneles personalizables y alertas + +**OpenTelemetry (APM - Beta):** + +- Monitoreo de Rendimiento de Aplicaciones usando estándares abiertos +- Actualmente en beta con capacidades de métricas en expansión +- Compatible con la instrumentación existente de OpenTelemetry + + + + + +**Estructura de Costos de DataDog:** + +- Cargos por instancia en el clúster +- Costos variables según el tamaño del clúster +- Difícil predecir y controlar gastos + +**Alternativa: New Relic** + +- Precios basados en usuarios y consumo de 
datos +- Nivel gratuito incluye 100GB de datos y 1 usuario +- Estructura de costos más predecible +- Puede usarse gratis para equipos pequeños + +**Comparación de Costos:** + +``` +DataDog: $15-23/host/mes (variable con tamaño del clúster) +New Relic: $0-99/usuario/mes (predecible, nivel gratuito disponible) +Complementos de SleakOps: Incluidos en el costo de la plataforma +``` + + + + + +Si decides continuar con DataDog, puedes agregarlo como una dependencia personalizada: + +**1. Añadir el Agente de DataDog a la Imagen Docker:** + +```dockerfile +# Añadir a tu Dockerfile de aplicación +FROM your-base-image + +# Instalar agente de DataDog +RUN curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh | bash + +# Copiar tu aplicación +COPY . /app + +# Configurar variables de entorno de DataDog +ENV DD_API_KEY=${DATADOG_API_KEY} +ENV DD_SITE="datadoghq.com" +ENV DD_LOGS_ENABLED=true +ENV DD_APM_ENABLED=true +``` + +**2. Configurar Propiedades de la Aplicación:** + +```yaml +management: + metrics: + export: + datadog: + api-key: ${DATADOG_API_KEY} + application-key: ${DATADOG_APPLICATION_KEY} + uri: https://api.datadoghq.com + enabled: ${DATADOG_TOGGLE} + api-host: ${DATADOG_COLLECTOR_HOST} + port: ${DATADOG_COLLECTOR_PORT} + tags: + appId: ${spring.application.name} + env: ${DATADOG_ENVIRONMENT} + host: ${spring.cloud.client.hostname} + security: + enabled: false +``` + +**3. Establecer Variables de Entorno en SleakOps:** + +- `DATADOG_API_KEY`: Tu clave API de DataDog +- `DATADOG_APPLICATION_KEY`: Tu clave de aplicación de DataDog +- `DATADOG_TOGGLE`: Activar/desactivar DataDog (true/false) +- `DATADOG_ENVIRONMENT`: Nombre del entorno (prod, staging, dev) +- `DATADOG_COLLECTOR_HOST`: Endpoint del colector de DataDog +- `DATADOG_COLLECTOR_PORT`: Puerto del colector (usualmente 8125) + + + + + +Dado que tu aplicación ya usa OpenTelemetry, puedes integrarte fácilmente con el complemento OTEL de SleakOps: + +**1. 
Habilitar el Complemento OTEL en SleakOps:** + +- Navega a Configuración del Clúster → Complementos +- Habilita "OpenTelemetry (Beta)" +- Configura el endpoint del colector OTEL + +**2. Actualizar Configuración de la Aplicación:** + +```yaml +management: + metrics: + export: + otlp: + endpoint: http://otel-collector:4317 + protocol: grpc + headers: + authorization: Bearer ${OTEL_TOKEN} + tracing: + enabled: true + sampling: + probability: 1.0 +``` + +**3. Configuración de Métricas Personalizadas:** + +```java +@Component +public class BusinessMetrics { + private final MeterRegistry meterRegistry; + private final Counter orderCounter; + private final Timer processTimer; + + public BusinessMetrics(MeterRegistry meterRegistry) { + this.meterRegistry = meterRegistry; + this.orderCounter = Counter.builder("business.orders.total") + .description("Total de pedidos procesados") + .tag("env", "${ENVIRONMENT}") + .register(meterRegistry); + this.processTimer = Timer.builder("business.process.duration") + .description("Duración del proceso de negocio") + .register(meterRegistry); + } +} +``` + + + + + +| Funcionalidad | DataDog | Complementos SleakOps | New Relic | +| -------------------------------- | ----------------- | --------------------- | ------------------- | +| **Modelo de Costo** | Por instancia | Incluido | Por usuario + datos | +| **Nivel Gratuito** | Prueba de 14 días | Incluido | 100GB + 1 usuario | +| **APM** | Completo | Beta (en expansión) | Completo | +| **Gestión de Logs** | Sí | Sí (Loki) | Sí | +| **Monitoreo de Infraestructura** | Sí | Sí (Grafana) | Sí | +| **Métricas Personalizadas** | Sí | Sí (OTEL) | Sí | +| **Alertas** | Avanzadas | Básicas | Avanzadas | +| **Dashboards** | Predefinidos | Personalizables | Predefinidos | +| **Integración con K8s** | Excelente | Nativa | Buena | +| **Curva de Aprendizaje** | Media | Baja | Media | + + + + + +Para replicar funcionalidades de DataDog usando Grafana: + +**1. 
Crear Dashboard de Métricas de Aplicación:** + +```json +{ + "dashboard": { + "title": "Application Metrics Dashboard", + "panels": [ + { + "title": "Request Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total[5m])", + "legendFormat": "{{method}} {{status}}" + } + ] + }, + { + "title": "Response Time", + "type": "graph", + "targets": [ + { + "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))", + "legendFormat": "95th percentile" + } + ] + }, + { + "title": "Error Rate", + "type": "singlestat", + "targets": [ + { + "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])", + "legendFormat": "Error Rate" + } + ] + } + ] + } +} +``` + +**2. Configurar Alertas en Grafana:** + +```yaml +# Alerta para alta tasa de errores +- alert: HighErrorRate + expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05 + for: 5m + labels: + severity: warning + annotations: + summary: "Alta tasa de errores en {{ $labels.instance }}" + description: "La tasa de errores es {{ $value | humanizePercentage }}" + +# Alerta para tiempo de respuesta alto +- alert: HighResponseTime + expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2 + for: 5m + labels: + severity: warning + annotations: + summary: "Tiempo de respuesta alto en {{ $labels.instance }}" +``` + +**3. Configurar Variables de Dashboard:** + +```json +{ + "templating": { + "list": [ + { + "name": "environment", + "type": "query", + "query": "label_values(environment)", + "refresh": 1 + }, + { + "name": "service", + "type": "query", + "query": "label_values(service)", + "refresh": 1 + } + ] + } +} +``` + + + + + +Para usar New Relic como alternativa a DataDog: + +**1. 
Configurar New Relic Agent:** + +```yaml +# application.yml +newrelic: + app_name: ${spring.application.name} + license_key: ${NEW_RELIC_LICENSE_KEY} + agent: + enabled: ${NEW_RELIC_ENABLED:true} + distributed_tracing: + enabled: true + application_logging: + enabled: true + forwarding: + enabled: true + local_decorating: + enabled: true +``` + +**2. Añadir Dependencias Maven:** + +```xml +<dependency> + <groupId>com.newrelic.agent.java</groupId> + <artifactId>newrelic-agent</artifactId> + <version>8.7.0</version> + <scope>provided</scope> +</dependency> +<dependency> + <groupId>com.newrelic.agent.java</groupId> + <artifactId>newrelic-api</artifactId> + <version>8.7.0</version> +</dependency> +``` + +**3. Configurar Métricas Personalizadas:** + +```java +@Component +public class NewRelicMetrics { + + @EventListener + public void handleOrderCreated(OrderCreatedEvent event) { + NewRelic.recordMetric("Custom/Orders/Created", 1); + NewRelic.addCustomAttribute("orderId", event.getOrderId()); + NewRelic.addCustomAttribute("amount", event.getAmount()); + } + + @Trace(dispatcher = true) + public void processBusinessLogic() { + // Tu lógica de negocio aquí + NewRelic.setTransactionName("Custom", "ProcessBusinessLogic"); + } +} +``` + +**4. Variables de Entorno en SleakOps:** + +```bash +NEW_RELIC_LICENSE_KEY=your_license_key +NEW_RELIC_APP_NAME=your_app_name +NEW_RELIC_ENABLED=true +NEW_RELIC_ENVIRONMENT=production +``` + + + + + +Para obtener lo mejor de ambos mundos: + +**1. Usar Complementos de SleakOps para Infraestructura:** + +- Grafana para métricas de infraestructura y aplicación básicas +- Loki para gestión centralizada de logs +- Alertas básicas para problemas de infraestructura + +**2. Integrar Herramienta APM Externa para Aplicaciones:** + +- New Relic o DataDog para APM detallado +- Métricas de negocio personalizadas +- Análisis avanzado de rendimiento + +**3.
Configuración de Ejemplo:** + +```yaml +# Configuración híbrida en application.yml +management: + metrics: + export: + # Para métricas básicas a Grafana + prometheus: + enabled: true + # Para APM detallado a New Relic + newrelic: + enabled: ${NEW_RELIC_ENABLED:false} + api-key: ${NEW_RELIC_LICENSE_KEY} + account-id: ${NEW_RELIC_ACCOUNT_ID} + step: 1m +``` + +**4. Estrategia de Costos:** + +- Usar nivel gratuito de New Relic para desarrollo/staging +- Pagar solo por producción con datos críticos +- Mantener monitoreo básico gratuito con SleakOps + + + + + +Para migrar gradualmente desde DataDog: + +**1. Fase 1: Configurar Monitoreo Paralelo** + +```bash +# Mantener DataDog mientras configuras alternativas +DATADOG_ENABLED=true +NEW_RELIC_ENABLED=true +GRAFANA_METRICS_ENABLED=true +``` + +**2. Fase 2: Comparar Métricas** + +- Ejecutar ambos sistemas durante 2-4 semanas +- Comparar precisión y cobertura de métricas +- Identificar gaps en funcionalidad + +**3. Fase 3: Migración Gradual** + +```yaml +# Configuración de transición +monitoring: + primary: "newrelic" # o "grafana" + fallback: "datadog" + migration_mode: true +``` + +**4. Fase 4: Desactivar DataDog** + +```bash +# Una vez validada la nueva solución +DATADOG_ENABLED=false +# Cancelar suscripción de DataDog +``` + + + + + +Para replicar métricas de negocio específicas: + +**1. 
Métricas de Negocio con Micrometer:** + +```java +@Service +public class BusinessMetricsService { + private final MeterRegistry meterRegistry; + + public BusinessMetricsService(MeterRegistry meterRegistry) { + this.meterRegistry = meterRegistry; + + // Gauge.builder(nombre, objeto, función) lee el valor en cada scrape + Gauge.builder("business.users.active", this, BusinessMetricsService::getActiveUserCount) + .description("Currently active users") + .register(meterRegistry); + } + + public void recordOrder(String type, double amount) { + // Counter.increment() no acepta tags dinámicos: el contador se obtiene + // (o se crea) desde el registry con sus tags y luego se incrementa. + // Un mismo nombre debe usar siempre el mismo conjunto de claves de tags. + meterRegistry.counter( + "business.orders.total", + Tags.of( + "type", type, + "amount_range", getAmountRange(amount), + "environment", System.getenv().getOrDefault("ENVIRONMENT", "dev") + ) + ).increment(); + } + + public void recordPayment(Duration duration, boolean success) { + // Registra la duración recibida en un timer etiquetado por resultado + meterRegistry.timer("business.payment.duration", "success", String.valueOf(success)) + .record(duration); + } + + private double getActiveUserCount() { + // Implementación de ejemplo: sustituir por la fuente real de usuarios activos + return 0; + } + + private String getAmountRange(double amount) { + // Rango ilustrativo para limitar la cardinalidad de la tag + return amount < 100 ? "low" : "high"; + } +} +``` + +**2.
Métricas Personalizadas con OpenTelemetry:** + +```java +@Component +public class OtelBusinessMetrics { + private final Meter meter; + private final LongCounter orderCounter; + private final DoubleHistogram paymentAmount; + + public OtelBusinessMetrics() { + this.meter = GlobalOpenTelemetry.getMeter("business-metrics"); + this.orderCounter = meter.counterBuilder("business_orders_total") + .setDescription("Total business orders") + .build(); + this.paymentAmount = meter.histogramBuilder("business_payment_amount") + .setDescription("Payment amounts") + .build(); + } + + public void recordOrder(String category) { + orderCounter.add(1, Attributes.of( + AttributeKey.stringKey("category"), category, + AttributeKey.stringKey("environment"), System.getenv("ENVIRONMENT") + )); + } +} +``` + + + +## Recomendaciones + +### Para Equipos Pequeños (1-5 desarrolladores) + +1. **Usar complementos nativos de SleakOps** para monitoreo básico +2. **Agregar New Relic nivel gratuito** para APM detallado +3. **Configurar alertas básicas** en Grafana + +### Para Equipos Medianos (5-20 desarrolladores) + +1. **Enfoque híbrido** con SleakOps + New Relic pagado +2. **Dashboards personalizados** en Grafana +3. **Métricas de negocio** implementadas con OpenTelemetry + +### Para Equipos Grandes (20+ desarrolladores) + +1. **Evaluar DataDog vs New Relic** basado en costos +2. **Implementar monitoreo multicapa** con diferentes herramientas +3. **Establecer centro de excelencia** para monitoreo + +### Consideraciones de Costo + +- **SleakOps nativo**: $0 adicional +- **New Relic**: $0-99/usuario/mes +- **DataDog**: $15-23/host/mes + +### Migración Recomendada + +1. Comenzar con complementos nativos de SleakOps +2. Agregar New Relic para APM si es necesario +3. 
Evaluar DataDog solo si se requieren funcionalidades específicas + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 25 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-metrics-persistence-node-failures.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-metrics-persistence-node-failures.mdx new file mode 100644 index 000000000..0bd8333f3 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/monitoring-metrics-persistence-node-failures.mdx @@ -0,0 +1,216 @@ +--- +sidebar_position: 3 +title: "Pérdida de Métricas Durante Fallos de Nodos en Clústeres de Kubernetes" +description: "Comprender y prevenir la pérdida de métricas cuando los nodos fallan o son reemplazados en clústeres de Kubernetes" +date: "2024-06-26" +category: "cluster" +tags: ["monitoring", "metrics", "node-failure", "prometheus", "persistence"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pérdida de Métricas Durante Fallos de Nodos en Clústeres de Kubernetes + +**Fecha:** 26 de junio de 2024 +**Categoría:** Clúster +**Etiquetas:** Monitoreo, Métricas, Fallo de Nodo, Prometheus, Persistencia + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan pérdida de métricas de monitoreo cuando los nodos de Kubernetes fallan o son reemplazados durante despliegues, afectando la disponibilidad de datos históricos para análisis y solución de problemas. 
+ +**Síntomas Observados:** + +- Las métricas desaparecen cuando los nodos se caen o se terminan +- Los datos históricos de monitoreo dejan de estar disponibles +- Huecos en la continuidad de métricas durante el reemplazo de nodos +- Pérdida de datos de rendimiento durante despliegues + +**Configuración Relevante:** + +- Retención de métricas: almacenamiento local por 2 horas antes de persistir en S3 +- Tipos de instancia: instancias bajo demanda (menos propensas a fallos) +- Backend de almacenamiento: S3 para almacenamiento a largo plazo de métricas +- Stack de monitoreo: colección de métricas basada en Prometheus + +**Condiciones de Error:** + +- Fallos de nodo durante la ventana de retención local de 2 horas +- Despliegues que reemplazan nodos antes de que las métricas se persistan +- Terminaciones inesperadas de nodos por problemas de infraestructura +- Problemas de red que impiden la persistencia de métricas a S3 + +## Solución Detallada + + + +La arquitectura actual de métricas funciona de la siguiente manera: + +1. **Fase de Almacenamiento Local**: Las métricas se almacenan localmente en cada nodo durante 2 horas +2. **Fase de Persistencia**: Después de 2 horas, las métricas se persisten automáticamente en S3 +3. **Ventana de Riesgo**: Si un nodo falla dentro de esa ventana de 2 horas, las métricas se pierden + +Este diseño optimiza para: + +- Reducir costos de tráfico de red +- Seguir recomendaciones oficiales de herramientas +- Balancear rendimiento con costos de almacenamiento + +```yaml +# Ejemplo de configuración actual +prometheus: + retention: + local: "2h" + remote_write: + interval: "2h" + destination: "s3://metrics-bucket" +``` + + + + + +Durante los despliegues, la pérdida de métricas puede ocurrir cuando: + +1. **Actualizaciones Continuas**: Los nodos antiguos se terminan antes de que las métricas se persistan +2. **Reemplazo de Nodos**: Los nodos nuevos reemplazan a los antiguos dentro de la ventana de 2 horas +3. 
**Operaciones de Escalado**: Nodos son removidos durante operaciones de reducción de escala + +**Escenarios de despliegue vs fallo de nodo:** + +- **Despliegues planificados**: Las métricas pueden perderse si el despliegue ocurre dentro de la ventana de 2 horas +- **Fallas no planificadas**: Menos comunes con instancias bajo demanda pero aún posibles +- **Problemas de infraestructura**: Problemas de red, interrupciones en servicios AWS + + + + + +Mientras se esperan mejoras en la plataforma, considere estos enfoques: + +**1. Programar despliegues** + +```bash +# Verificar la última persistencia de métricas +kubectl get configmap prometheus-config -o yaml | grep last_persist + +# Esperar al siguiente ciclo de persistencia antes de desplegar +echo "Esperando la persistencia de métricas..." +sleep 7200 # 2 horas +``` + +**2. Respaldo manual de métricas** + +```bash +# Exportar métricas actuales antes del despliegue +kubectl port-forward svc/prometheus 9090:9090 +curl -G 'http://localhost:9090/api/v1/query_range' \ + --data-urlencode 'query=up' \ + --data-urlencode 'start=2024-06-26T00:00:00Z' \ + --data-urlencode 'end=2024-06-26T23:59:59Z' \ + --data-urlencode 'step=60s' > metrics_backup.json +``` + +**3. Monitorear impacto del despliegue** + +```bash +# Monitorear reemplazo de nodos durante el despliegue +kubectl get events --field-selector reason=NodeReady -w +``` + + + + + +El equipo de SleakOps está trabajando activamente en soluciones para prevenir la pérdida de métricas: + +**Mejoras planeadas:** + +1. **Claims de Volumen Persistente**: Almacenar métricas en almacenamiento persistente +2. **Persistencia más rápida**: Reducir la ventana de 2 horas para minimizar riesgos +3. **Apagado ordenado de nodos**: Asegurar que las métricas se guarden antes de la terminación del nodo +4. 
**Almacenamiento redundante**: Múltiples copias de métricas en diferentes nodos + +**Cronograma:** + +- Estas mejoras están en la hoja de ruta del producto +- Se comunicarán actualizaciones conforme estén disponibles +- No se ha proporcionado una fecha estimada específica aún + + + + + +**1. Programación de despliegues** + +- Programar despliegues después de los ciclos de persistencia de métricas +- Monitorear el estado de persistencia de métricas antes de desplegar +- Usar ventanas de mantenimiento para despliegues críticos + +**2. Configuración de monitoreo** + +```yaml +# Añadir alertas para problemas de persistencia de métricas +groups: + - name: metrics.rules + rules: + - alert: MetricsPersistenceDelay + expr: time() - prometheus_tsdb_last_successful_snapshot_timestamp > 7200 + for: 5m + annotations: + summary: "La persistencia de métricas está retrasada" +``` + +**3. Documentación** + +- Documentar procedimientos de despliegue que consideren las métricas +- Capacitar a los miembros del equipo sobre ventanas de persistencia de métricas +- Crear runbooks para procedimientos de recuperación de métricas + + + + + +Para entornos críticos que requieran cero pérdida de métricas: + +**1. Monitoreo externo** + +- Usar servicios externos de monitoreo (DataDog, New Relic) +- Implementar métricas push hacia sistemas externos +- Configurar infraestructura de monitoreo redundante + +**2. Persistencia personalizada** + +```yaml +# Sidecar personalizado para persistencia inmediata +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: metrics-backup +spec: + selector: + matchLabels: + app: metrics-backup + template: + metadata: + labels: + app: metrics-backup + spec: + containers: + - name: backup + image: prom/prometheus:latest + command: + - /bin/sh + - -c + - | + while true; do + # promtool requiere la URL del servidor Prometheus como primer argumento + promtool query instant http://localhost:9090 'up' | aws s3 cp - s3://backup-metrics/$(date +%s).json + sleep 300 + done +``` + +**3. 
Configuración de alta disponibilidad** + +- Desplegar Prometheus en modo HA +- Usar Thanos para almacenamiento a largo plazo +- Implementar replicación de métricas entre regiones + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-memory-limit-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-memory-limit-troubleshooting.mdx new file mode 100644 index 000000000..267868594 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-memory-limit-troubleshooting.mdx @@ -0,0 +1,211 @@ +--- +sidebar_position: 3 +title: "Problemas con el Límite de Memoria de Nodepool" +description: "Diagnóstico y resolución de problemas de capacidad de memoria en nodepools" +date: "2025-02-27" +category: "cluster" +tags: ["nodepool", "memoria", "capacidad", "escalado", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con el Límite de Memoria de Nodepool + +**Fecha:** 27 de febrero de 2025 +**Categoría:** Clúster +**Etiquetas:** Nodepool, Memoria, Capacidad, Escalado, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los pods de producción no logran programarse porque se alcanzan los límites de memoria del nodepool tras añadir nuevas cargas de trabajo durante la migración. 
+ +**Síntomas Observados:** + +- Pods atascados en estado "Pending" +- Fallos en la programación de pods por memoria insuficiente +- Nodepool en capacidad máxima de memoria provisionada +- Cargas críticas de producción afectadas (trabajos cron, aplicaciones) + +**Configuración Relevante:** + +- Tipo de nodepool: `spot-amd64` +- Límite de memoria original: 120 GB +- Incremento temporal: 160 GB +- Nuevas cargas: despliegues de Redash y WordPress + +**Condiciones de Error:** + +- Ocurre cuando las solicitudes totales de memoria de los pods exceden la capacidad del nodepool +- Se desencadena tras añadir nuevos despliegues al nodepool existente +- Afecta la programación de pods y la disponibilidad de aplicaciones +- El problema se agrava durante picos de uso o reinicios de pods + +## Solución Detallada + + + +Para situaciones urgentes, puede aumentar temporalmente el límite de memoria del nodepool: + +1. **Acceda a la Consola SleakOps** +2. Navegue a **Gestión de Clúster** → **Nodepools** +3. Seleccione el nodepool afectado (por ejemplo, `spot-amd64`) +4. Vaya a **Configuración** → **Recursos** +5. Aumente el **Límite de Memoria** (por ejemplo, de 120GB a 200GB) +6. Haga clic en **Aplicar Cambios** + +**Nota:** Esto proporciona un alivio inmediato pero debe seguirse con una planificación adecuada de capacidad. + + + + + +Para entender su utilización actual de memoria: + +```bash +# Ver uso de memoria de nodos +kubectl top nodes + +# Ver solicitudes y límites de memoria de pods +kubectl describe nodes | grep -A 5 "Allocated resources" + +# Listar pods con solicitudes de memoria +kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,MEMORY_REQUEST:.spec.containers[*].resources.requests.memory +``` + +Esto ayuda a identificar qué cargas consumen más memoria. + + + + + +Para una mejor gestión de recursos, cree nodepools separados para diferentes tipos de cargas: + +1. 
**En la Consola SleakOps:** + - Vaya a **Gestión de Clúster** → **Nodepools** + - Haga clic en **Crear Nuevo Nodepool** + - Configure las especificaciones: + +```yaml +# Ejemplo: Nodepool dedicado para cargas de datos +name: "data-workloads" +instance_type: "m5.xlarge" +min_size: 1 +max_size: 5 +desired_size: 2 +memory_limit: "64GB" +labels: + workload-type: "data" +taints: + - key: "workload-type" + value: "data" + effect: "NoSchedule" +``` + +2. **Actualice los despliegues para usar el nuevo nodepool:** + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: redash-deployment +spec: + template: + spec: + nodeSelector: + workload-type: "data" + tolerations: + - key: "workload-type" + operator: "Equal" + value: "data" + effect: "NoSchedule" +``` + + + + + +Implemente monitoreo para prevenir futuros problemas de capacidad: + +1. **Habilite el monitoreo del clúster** en SleakOps +2. **Configure alertas** para uso de memoria del nodepool: + + - Advertencia al 70% de capacidad + - Crítico al 85% de capacidad + +3. **Cree un panel** para rastrear: + - Utilización de memoria por nodepool + - Fallos en la programación de pods + - Solicitudes vs. límites de recursos + +```yaml +# Ejemplo de configuración de alerta +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: nodepool-memory-alerts +spec: + groups: + - name: nodepool.rules + rules: + - alert: NodepoolMemoryHigh + expr: (sum(kube_pod_container_resource_requests{resource="memory"}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node)) > 0.85 + for: 5m + annotations: + summary: "El uso de memoria del nodepool es alto" +``` + + + + + +**Prácticas recomendadas para la gestión de capacidad de nodepool:** + +1. **Reserve un buffer del 20-30%** para picos inesperados de carga +2. **Separe cargas críticas y no críticas** en nodepools diferentes +3. 
**Use solicitudes y límites de recursos** adecuadamente: + +```yaml +resources: + requests: + memory: "512Mi" # Lo que el pod necesita + cpu: "250m" + limits: + memory: "1Gi" # Máximo que el pod puede usar + cpu: "500m" +``` + +4. **Revisiones regulares de capacidad** - evaluación mensual de patrones de uso +5. **Implemente autoescalado horizontal de pods** para cargas variables +6. **Use instancias spot con sabiduría** - asegure opciones de respaldo para cargas críticas + + + + + +Para despliegues creados fuera de la plataforma SleakOps: + +1. **Documente los despliegues externos:** + + ```bash + # Listar todos los despliegues no gestionados por SleakOps + kubectl get deployments --all-namespaces -o yaml | grep -v "sleakops.com" + ``` + +2. **Importe a SleakOps** (si es posible): + + - Use la funcionalidad de importación de SleakOps + - Recree despliegues a través de la interfaz de SleakOps + +3. **Cree un nodepool dedicado** para cargas externas: + + - Etiquete adecuadamente para identificación + - Establezca límites de recursos apropiados + - Monitoree separadamente de cargas gestionadas por la plataforma + +4. 
**Establezca gobernanza** para futuros despliegues y así prevenir problemas similares + + + +--- + +_Esta FAQ fue generada automáticamente el 27 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx new file mode 100644 index 000000000..f8e2401bb --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/nodepool-ondemand-autoscaling-pending-pods.mdx @@ -0,0 +1,218 @@ +--- +sidebar_position: 3 +title: "Configuración de Nodepool OnDemand que Causa Problemas de Escalado de Pods" +description: "Pods atascados en estado pendiente después de cambiar el nodepool de la configuración predeterminada a OnDemand" +date: "2024-12-19" +category: "cluster" +tags: ["nodepool", "ondemand", "autoscaling", "pending", "scaling"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Nodepool OnDemand que Causa Problemas de Escalado de Pods + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Nodepool, OnDemand, Autoscaling, Pendiente, Escalado + +## Descripción del Problema + +**Contexto:** El usuario cambió la configuración del nodepool de la predeterminada (sin valor) a OnDemand a solicitud del equipo de SleakOps. Después de este cambio, la funcionalidad de autoscaling no está funcionando correctamente. 
+ +**Síntomas Observados:** + +- Los pods creados por autoscaling permanecen en estado pendiente +- Solo los pods mínimos requeridos están activos +- Los nuevos pods no pueden escalar y muestran errores +- El autoscaling funcionaba antes del cambio de configuración del nodepool + +**Configuración Relevante:** + +- Tipo de nodepool: Cambiado de predeterminado (sin valor) a OnDemand +- Autoscaling: Activado pero no funciona correctamente +- Pods mínimos: Funcionando correctamente +- Pods en escalado: Atascados en estado pendiente + +**Condiciones de Error:** + +- El error ocurre cuando el autoscaler intenta crear nuevos pods +- Los pods permanecen en estado pendiente indefinidamente +- El problema comenzó después del cambio de configuración del nodepool +- Solo afecta a los pods escalados, no a los pods mínimos requeridos + +## Solución Detallada + + + +Al cambiar de la configuración predeterminada del nodepool a OnDemand, varios aspectos pueden afectar la programación de pods: + +1. **Provisión de instancias**: Las instancias OnDemand tienen características de provisión diferentes +2. **Asignación de recursos**: La configuración OnDemand puede tener límites de recursos distintos +3. **Restricciones de programación**: La nueva configuración podría introducir restricciones de programación +4. 
**Planificación de capacidad**: Las instancias OnDemand pueden tener patrones de disponibilidad diferentes + + + + + +Para identificar por qué los pods están atascados en estado pendiente: + +```bash +# Verificar estado y eventos de pods +kubectl get pods -o wide +kubectl describe pod <pod-name> + +# Verificar capacidad y recursos de nodos +kubectl get nodes -o wide +kubectl describe nodes + +# Revisar logs del autoscaler del clúster +kubectl logs -n kube-system deployment/cluster-autoscaler +``` + +Razones comunes para pods pendientes tras cambios en el nodepool: + +- Capacidad insuficiente de nodos +- Restricciones de recursos (CPU/Memoria) +- Desajustes en selectores de nodo +- Problemas con taints y tolerancias + + + + + +Verifique la configuración actual de su nodepool en SleakOps: + +1. **Navegue a Gestión de Clústeres** +2. **Seleccione su clúster** +3. **Vaya a la sección de Nodepools** +4. **Verifique la configuración OnDemand**: + - Los tipos de instancia son apropiados + - Los límites mínimos/máximos de escalado son correctos + - Las zonas de disponibilidad están configuradas adecuadamente + - Las asignaciones de recursos coinciden con las necesidades de la carga de trabajo + +```yaml +# Ejemplo de configuración correcta de nodepool OnDemand +nodepool: + name: "ondemand-nodepool" + instance_type: ["t3.medium", "t3.large"] + capacity_type: "ON_DEMAND" + min_size: 2 + max_size: 10 + desired_size: 3 + availability_zones: ["us-east-1a", "us-east-1b", "us-east-1c"] +``` + + + + + +Verifique que el autoscaler del clúster esté configurado correctamente para instancias OnDemand: + +```bash +# Verificar configuración del autoscaler +kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml + +# Verificar despliegue del autoscaler +kubectl get deployment cluster-autoscaler -n kube-system -o yaml +``` + +Asegúrese de que el autoscaler tenga: + +- Permisos adecuados para la gestión de instancias OnDemand +- Configuración correcta para descubrir el nodepool +- 
Políticas de escalado apropiadas + + + + + +Verifique si sus pods tienen solicitudes de recursos que coincidan con la capacidad del nodepool OnDemand: + +```yaml +# Ejemplo de pod con solicitudes de recursos adecuadas +apiVersion: v1 +kind: Pod +metadata: + name: app-ejemplo +spec: + containers: + - name: app + image: nginx + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" +``` + +Problemas comunes: + +- Solicitudes de recursos demasiado altas para la capacidad disponible del nodo +- Solicitudes de recursos faltantes que causan problemas de programación +- Límites establecidos demasiado restrictivos + + + + + +1. **Escalar manualmente hacia abajo y hacia arriba**: + + ```bash + kubectl scale deployment <deployment-name> --replicas=1 + kubectl scale deployment <deployment-name> --replicas=<replicas-deseadas> + ``` + +2. **Forzar escalado de nodos**: + + - Aumente temporalmente la capacidad deseada del nodepool en SleakOps + - Espere a que los nodos se aprovisionen + - Verifique si los pods pueden programarse en los nodos nuevos + +3. **Verificar taints en nodos**: + + ```bash + kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints + ``` + +4. **Verificar tolerancias de pods** si los nodos tienen taints + + + + + +Para un rendimiento óptimo del nodepool OnDemand: + +1. **Tipos de instancia mixtos**: Use múltiples tipos de instancia para mejor disponibilidad +2. **Planificación adecuada de recursos**: Asegure que las solicitudes de pods se alineen con la capacidad del nodo +3. **Escalado gradual**: Configure políticas de escalado apropiadas +4. 
**Monitoreo**: Configure alertas para pods pendientes y eventos de escalado + +```yaml +# Configuración recomendada de HPA +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-iam-roles-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-iam-roles-configuration.mdx new file mode 100644 index 000000000..bcbe23ced --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-iam-roles-configuration.mdx @@ -0,0 +1,885 @@ +--- +sidebar_position: 3 +title: "Configuración de Roles IAM para OpenSearch" +description: "Cómo configurar roles IAM para OpenSearch en SleakOps" +date: "2025-02-05" +category: "dependencia" +tags: ["opensearch", "iam", "aws", "roles", "permisos"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Roles IAM para OpenSearch + +**Fecha:** 5 de febrero de 2025 +**Categoría:** Dependencia +**Etiquetas:** OpenSearch, IAM, AWS, Roles, Permisos + +## Descripción del Problema + +**Contexto:** El usuario está configurando un servicio OpenSearch en SleakOps y tiene dudas sobre si se requieren roles IAM para la configuración correcta del acceso. 
+ +**Síntomas Observados:** + +- Incertidumbre sobre los requisitos de roles IAM para OpenSearch +- Preguntas sobre la configuración de acceso después de la creación del servicio +- Necesidad de entender la estructura de permisos para OpenSearch en AWS + +**Configuración Relevante:** + +- Tipo de servicio: OpenSearch +- Tipo de instancia: t3.medium.search +- Plataforma: AWS +- Método de acceso: Por determinar + +**Condiciones de Error:** + +- Posibles problemas de acceso si los roles IAM no están configurados correctamente +- El servicio puede crearse pero ser inaccesible sin los permisos adecuados +- Las aplicaciones pueden fallar al conectar con OpenSearch sin una configuración IAM correcta + +## Solución Detallada + + + +Sí, OpenSearch en AWS normalmente requiere roles IAM para un acceso seguro. Los requisitos específicos dependen de tu método de acceso: + +**Para acceso basado en VPC (recomendado):** + +- Se requieren roles IAM para que las aplicaciones accedan al clúster de OpenSearch +- Se puede configurar un control de acceso detallado + +**Para acceso público:** + +- Se pueden usar políticas IAM junto con restricciones basadas en IP +- Menos seguro pero más sencillo para entornos de desarrollo + +### Tipos de Acceso a OpenSearch + +| Método de Acceso | Requisitos IAM | Nivel de Seguridad | Uso Recomendado | +|------------------|----------------|-------------------|-----------------| +| VPC + IAM | ✅ Obligatorio | 🔒 Alto | Producción | +| Público + IAM | ✅ Recomendado | 🔒 Medio | Desarrollo | +| Público + IP | ❌ Opcional | ⚠️ Bajo | Testing local | + + + + + +SleakOps configura automáticamente roles IAM básicos para OpenSearch: + +1. **Rol de Servicio**: Creado automáticamente para el propio servicio OpenSearch +2. **Rol de Acceso**: Generado para aplicaciones dentro de tu clúster para acceder a OpenSearch +3. 
**Usuario Maestro**: Configurado con permisos administrativos + +**Lo que se configura automáticamente:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::ACCOUNT-ID:role/sleakops-opensearch-access-role", + "arn:aws:iam::ACCOUNT-ID:root" + ] + }, + "Action": "es:*", + "Resource": "arn:aws:es:REGION:ACCOUNT-ID:domain/DOMAIN-NAME/*" + } + ] +} +``` + +### Roles Automáticos Creados por SleakOps + +```bash +# Verificar roles creados automáticamente +aws iam list-roles --query 'Roles[?contains(RoleName, `sleakops`) && contains(RoleName, `opensearch`)]' + +# Ejemplos de roles típicos: +# - sleakops-opensearch-service-role +# - sleakops-opensearch-access-role-{environment} +# - sleakops-pods-opensearch-role +``` + +### Variables de Entorno Configuradas + +```bash +# Variables disponibles en pods después de la configuración +echo $OPENSEARCH_ENDPOINT # https://your-domain.region.es.amazonaws.com +echo $OPENSEARCH_DOMAIN_NAME # your-opensearch-domain +echo $OPENSEARCH_REGION # us-east-1 +echo $AWS_ROLE_ARN # Role ARN para IRSA +``` + + + + + +Si necesitas configurar roles IAM manualmente o personalizar la configuración automática: + +### 1. Rol de Servicio OpenSearch + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "es.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] +} +``` + +**Política del Rol de Servicio:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "ec2:CreateNetworkInterface", + "ec2:DeleteNetworkInterface", + "ec2:DescribeNetworkInterfaces", + "ec2:ModifyNetworkInterfaceAttribute", + "ec2:DescribeSecurityGroups", + "ec2:DescribeSubnets", + "ec2:DescribeVpcs" + ], + "Resource": "*" + } + ] +} +``` + +### 2. 
Rol de Acceso para Aplicaciones + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Federated": "arn:aws:iam::ACCOUNT-ID:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDC-ID" + }, + "Action": "sts:AssumeRoleWithWebIdentity", + "Condition": { + "StringEquals": { + "oidc.eks.REGION.amazonaws.com/id/OIDC-ID:sub": "system:serviceaccount:NAMESPACE:SERVICE-ACCOUNT-NAME", + "oidc.eks.REGION.amazonaws.com/id/OIDC-ID:aud": "sts.amazonaws.com" + } + } + } + ] +} +``` + +### 3. Política de Permisos OpenSearch + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "es:ESHttpPost", + "es:ESHttpPut", + "es:ESHttpGet", + "es:ESHttpDelete", + "es:ESHttpHead" + ], + "Resource": [ + "arn:aws:es:REGION:ACCOUNT-ID:domain/DOMAIN-NAME/*", + "arn:aws:es:REGION:ACCOUNT-ID:domain/DOMAIN-NAME" + ] + }, + { + "Effect": "Allow", + "Action": [ + "es:DescribeDomain", + "es:DescribeDomains", + "es:DescribeDomainConfig", + "es:ListDomainNames", + "es:ListTags" + ], + "Resource": "*" + } + ] +} +``` + +### Script de Creación de Roles + +```bash +#!/bin/bash +# create-opensearch-iam-roles.sh + +DOMAIN_NAME="your-opensearch-domain" +ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +REGION="us-east-1" +CLUSTER_NAME="your-eks-cluster" + +echo "Creating IAM roles for OpenSearch domain: $DOMAIN_NAME" + +# 1. Crear rol de servicio OpenSearch +cat > opensearch-service-role-trust-policy.json << EOF +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "es.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] +} +EOF + +aws iam create-role \ + --role-name OpenSearchServiceRole-$DOMAIN_NAME \ + --assume-role-policy-document file://opensearch-service-role-trust-policy.json + +# 2. 
Adjuntar política de servicio +aws iam attach-role-policy \ + --role-name OpenSearchServiceRole-$DOMAIN_NAME \ + --policy-arn arn:aws:iam::aws:policy/AmazonOpenSearchServiceRolePolicy + +# 3. Crear rol de acceso para pods +OIDC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5) + +cat > opensearch-access-role-trust-policy.json << EOF +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID" + }, + "Action": "sts:AssumeRoleWithWebIdentity", + "Condition": { + "StringEquals": { + "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:default:opensearch-service-account", + "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com" + } + } + } + ] +} +EOF + +aws iam create-role \ + --role-name OpenSearchAccessRole-$DOMAIN_NAME \ + --assume-role-policy-document file://opensearch-access-role-trust-policy.json + +echo "✓ IAM roles created successfully" +``` + + + + + +### Política de Acceso del Dominio OpenSearch + +La política de acceso del dominio controla quién puede acceder a su clúster OpenSearch: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::ACCOUNT-ID:role/OpenSearchAccessRole-your-domain", + "arn:aws:iam::ACCOUNT-ID:user/opensearch-admin" + ] + }, + "Action": "es:*", + "Resource": "arn:aws:es:REGION:ACCOUNT-ID:domain/your-domain/*" + }, + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT-ID:role/ReadOnlyRole" + }, + "Action": [ + "es:ESHttpGet", + "es:ESHttpHead", + "es:DescribeDomain" + ], + "Resource": "arn:aws:es:REGION:ACCOUNT-ID:domain/your-domain/*" + } + ] +} +``` + +### Política Restrictiva por IP + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "*" + }, + "Action": 
"es:*", + "Resource": "arn:aws:es:REGION:ACCOUNT-ID:domain/your-domain/*", + "Condition": { + "IpAddress": { + "aws:SourceIp": [ + "203.0.113.0/24", + "198.51.100.0/24" + ] + } + } + } + ] +} +``` + +### Aplicar Política al Dominio + +```bash +#!/bin/bash +# apply-domain-policy.sh + +DOMAIN_NAME="your-opensearch-domain" +POLICY_FILE="domain-access-policy.json" + +# Aplicar política de acceso +aws opensearch update-domain-config \ + --domain-name $DOMAIN_NAME \ + --access-policies file://$POLICY_FILE + +# Verificar aplicación +aws opensearch describe-domain \ + --domain-name $DOMAIN_NAME \ + --query 'DomainStatus.AccessPolicies' +``` + +### Configuración de VPC y Grupos de Seguridad + +```bash +# Para dominios en VPC, configurar grupos de seguridad +aws ec2 create-security-group \ + --group-name opensearch-access-sg \ + --description "Security group for OpenSearch access" + +# Permitir acceso HTTPS desde pods de Kubernetes +aws ec2 authorize-security-group-ingress \ + --group-id sg-12345678 \ + --protocol tcp \ + --port 443 \ + --source-group sg-87654321 # Security group de los nodos EKS +``` + + + + + +### Crear Service Account con Anotaciones IRSA + +```yaml +# opensearch-service-account.yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: opensearch-service-account + namespace: default + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT-ID:role/OpenSearchAccessRole-your-domain +automountServiceAccountToken: true +``` + +### Configuración del Pod con Service Account + +```yaml +# opensearch-client-pod.yaml +apiVersion: v1 +kind: Pod +metadata: + name: opensearch-client + namespace: default +spec: + serviceAccountName: opensearch-service-account + containers: + - name: opensearch-client + image: python:3.9-slim + command: ["/bin/sh"] + args: ["-c", "while true; do sleep 30; done"] + env: + - name: AWS_REGION + value: "us-east-1" + - name: OPENSEARCH_ENDPOINT + value: "https://your-domain.region.es.amazonaws.com" + # AWS SDK automáticamente 
usa el token del service account para la autenticación +``` + +### Verificar Configuración IRSA + +```bash +#!/bin/bash +# verify-irsa-setup.sh + +NAMESPACE="default" +SERVICE_ACCOUNT="opensearch-service-account" +CLUSTER_NAME="your-eks-cluster" + +echo "=== Verifying IRSA Setup ===" + +# 1. Verificar anotaciones del service account +echo "1. Service Account annotations:" +kubectl get serviceaccount $SERVICE_ACCOUNT -n $NAMESPACE -o jsonpath='{.metadata.annotations}' | jq + +# 2. Verificar el token de identidad web que IRSA monta en el pod +echo "2. Pod token mount:" +kubectl exec -n $NAMESPACE -it opensearch-client -- cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token | head -c 50 + +# 3. Verificar identidad AWS desde el pod +echo "3. AWS identity from pod:" +kubectl exec -n $NAMESPACE -it opensearch-client -- aws sts get-caller-identity + +# 4. Test de acceso a OpenSearch +echo "4. OpenSearch access test:" +kubectl exec -n $NAMESPACE -it opensearch-client -- \ + curl -s --aws-sigv4 "aws:amz:us-east-1:es" \ + "$OPENSEARCH_ENDPOINT/_cluster/health" +``` + +### Deployment con Service Account + +```yaml +# opensearch-app-deployment.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: opensearch-app + namespace: default +spec: + replicas: 2 + selector: + matchLabels: + app: opensearch-app + template: + metadata: + labels: + app: opensearch-app + spec: + serviceAccountName: opensearch-service-account + containers: + - name: app + image: your-app:latest + env: + - name: AWS_REGION + value: "us-east-1" + - name: OPENSEARCH_ENDPOINT + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: endpoint + - name: OPENSEARCH_DOMAIN + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: domain + ports: + - containerPort: 8080 + livenessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /ready + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + + + + + +### Script de Verificación 
Completa + +```bash +#!/bin/bash +# opensearch-iam-verification.sh + +set -e + +DOMAIN_NAME="your-opensearch-domain" +REGION="us-east-1" +NAMESPACE="default" +SERVICE_ACCOUNT="opensearch-service-account" + +echo "=== OpenSearch IAM Configuration Verification ===" + +# 1. Verificar dominio OpenSearch +echo "1. Verifying OpenSearch domain..." +DOMAIN_STATUS=$(aws opensearch describe-domain --domain-name $DOMAIN_NAME --region $REGION --query 'DomainStatus.Processing' --output text) +if [ "$DOMAIN_STATUS" = "False" ]; then + echo "✓ Domain is active and ready" +else + echo "⚠ Domain is still processing" +fi + +# 2. Verificar políticas de acceso +echo "2. Checking domain access policies..." +aws opensearch describe-domain --domain-name $DOMAIN_NAME --region $REGION \ + --query 'DomainStatus.AccessPolicies' --output json | jq '.' + +# 3. Verificar roles IAM +echo "3. Verifying IAM roles..." +ROLES=$(aws iam list-roles --query 'Roles[?contains(RoleName, `OpenSearch`) || contains(RoleName, `opensearch`)].RoleName' --output text) +for role in $ROLES; do + echo "✓ Found role: $role" + aws iam list-attached-role-policies --role-name $role --query 'AttachedPolicies[].PolicyArn' --output text +done + +# 4. Verificar service account +echo "4. Checking Kubernetes service account..." +if kubectl get serviceaccount $SERVICE_ACCOUNT -n $NAMESPACE >/dev/null 2>&1; then + echo "✓ Service account exists" + ROLE_ARN=$(kubectl get serviceaccount $SERVICE_ACCOUNT -n $NAMESPACE -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}') + echo " Associated role: $ROLE_ARN" +else + echo "✗ Service account not found" +fi + +# 5. Test de conectividad desde pod +echo "5. Testing connectivity from pod..." +if kubectl get pod opensearch-client -n $NAMESPACE >/dev/null 2>&1; then + echo "Testing AWS credentials..." + kubectl exec -n $NAMESPACE opensearch-client -- aws sts get-caller-identity + + echo "Testing OpenSearch access..." 
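# Nota: curl --aws-sigv4 necesita credenciales explícitas vía
#   --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY"
# y, con credenciales temporales de IRSA/STS, además el token de sesión:
#   -H "x-amz-security-token: $AWS_SESSION_TOKEN"
# $OPENSEARCH_ENDPOINT debe estar definido en el entorno donde corre el curl.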
+ kubectl exec -n $NAMESPACE opensearch-client -- \ + curl -s --aws-sigv4 "aws:amz:$REGION:es" \ + "https://$DOMAIN_NAME.$REGION.es.amazonaws.com/_cluster/health" | jq '.' +else + echo "⚠ No test pod found. Creating one..." + kubectl run opensearch-client --image=amazon/aws-cli:latest --restart=Never \ + --serviceaccount=$SERVICE_ACCOUNT --namespace=$NAMESPACE \ + -- /bin/sh -c "while true; do sleep 30; done" + + echo "Wait for pod to be ready and re-run this script" +fi + +echo "=== Verification Complete ===" +``` + +### Pruebas de Rendimiento y Acceso + +```python +# opensearch-access-test.py +import boto3 +import json +import time +from datetime import datetime +from opensearchpy import OpenSearch, RequestsHttpConnection +from aws_requests_auth.aws_auth import AWSRequestsAuth + +def test_opensearch_access(): + """Test comprehensive OpenSearch access with IAM authentication""" + + print("=== OpenSearch Access Test ===") + + # 1. Setup AWS session + session = boto3.Session() + credentials = session.get_credentials() + region = session.region_name or 'us-east-1' + + print(f"AWS Region: {region}") + print(f"AWS Identity: {session.client('sts').get_caller_identity()['Arn']}") + + # 2. Setup OpenSearch client + endpoint = 'your-domain.region.es.amazonaws.com' + awsauth = AWSRequestsAuth( + aws_access_key=credentials.access_key, + aws_secret_access_key=credentials.secret_key, + aws_token=credentials.token, + aws_host=endpoint, + aws_region=region, + aws_service='es' + ) + + client = OpenSearch( + hosts=[{'host': endpoint, 'port': 443}], + http_auth=awsauth, + use_ssl=True, + verify_certs=True, + connection_class=RequestsHttpConnection, + timeout=30 + ) + + try: + # 3. Test cluster info + print("\n1. Testing cluster information...") + info = client.info() + print(f"✓ OpenSearch version: {info['version']['number']}") + + # 4. Test cluster health + print("\n2. 
Testing cluster health...") + health = client.cluster.health() + print(f"✓ Cluster status: {health['status']}") + print(f"✓ Number of nodes: {health['number_of_nodes']}") + + # 5. Test index operations + print("\n3. Testing index operations...") + index_name = f"test-index-{int(time.time())}" + + # Create index + client.indices.create(index=index_name, body={ + "mappings": { + "properties": { + "timestamp": {"type": "date"}, + "message": {"type": "text"}, + "level": {"type": "keyword"} + } + } + }) + print(f"✓ Created index: {index_name}") + + # Index document + doc = { + "timestamp": datetime.now().isoformat(), + "message": "Test document for IAM verification", + "level": "INFO" + } + result = client.index(index=index_name, body=doc) + print(f"✓ Indexed document: {result['_id']}") + + # Search document + time.sleep(1) # Wait for indexing + search_result = client.search(index=index_name, body={ + "query": {"match": {"message": "Test"}} + }) + print(f"✓ Search results: {search_result['hits']['total']['value']} documents") + + # Cleanup + client.indices.delete(index=index_name) + print(f"✓ Cleaned up index: {index_name}") + + print("\n✅ All OpenSearch IAM tests passed!") + return True + + except Exception as e: + print(f"\n❌ OpenSearch access test failed: {e}") + return False + +if __name__ == "__main__": + test_opensearch_access() +``` + +### Monitoreo de Acceso IAM + +```bash +#!/bin/bash +# monitor-opensearch-access.sh + +DOMAIN_NAME="your-opensearch-domain" +REGION="us-east-1" + +echo "=== OpenSearch Access Monitoring ===" + +# Métricas de CloudWatch para monitorear acceso +aws logs create-log-group --log-group-name "/aws/opensearch/domains/$DOMAIN_NAME/application-logs" 2>/dev/null || true + +# Consultar logs de acceso recientes +echo "Recent access logs:" +aws logs filter-log-events \ + --log-group-name "/aws/opensearch/domains/$DOMAIN_NAME/application-logs" \ + --start-time $(date -d '1 hour ago' +%s)000 \ + --filter-pattern "ERROR" \ + --query 
'events[].message' \ + --output text + +# Métricas de performance +echo "Performance metrics:" +aws cloudwatch get-metric-statistics \ + --namespace AWS/ES \ + --metric-name IndexingRate \ + --dimensions Name=DomainName,Value=$DOMAIN_NAME Name=ClientId,Value=$(aws sts get-caller-identity --query Account --output text) \ + --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ + --period 300 \ + --statistics Average \ + --region $REGION \ + --query 'Datapoints[0].Average' +``` + + + + + +### Error: "User is not authorized to perform: es:ESHttpPost" + +**Causa:** El rol IAM no tiene permisos suficientes + +**Solución:** + +```bash +# 1. Verificar políticas actuales del rol +ROLE_NAME="your-opensearch-role" +aws iam list-attached-role-policies --role-name $ROLE_NAME +aws iam list-role-policies --role-name $ROLE_NAME + +# 2. Adjuntar política con permisos completos +cat > opensearch-full-access-policy.json << EOF +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": "es:*", + "Resource": "arn:aws:es:*:*:domain/your-domain/*" + } + ] +} +EOF + +aws iam put-role-policy \ + --role-name $ROLE_NAME \ + --policy-name OpenSearchFullAccess \ + --policy-document file://opensearch-full-access-policy.json +``` + +### Error: "No credentials found" + +**Causa:** IRSA no está configurado correctamente + +**Solución:** + +```bash +# 1. Verificar OIDC provider del clúster +CLUSTER_NAME="your-eks-cluster" +aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" + +# 2. Verificar anotaciones del service account +kubectl describe serviceaccount opensearch-service-account + +# 3. 
Recrear service account con anotaciones correctas
+kubectl delete serviceaccount opensearch-service-account
+kubectl apply -f - << EOF
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: opensearch-service-account
+  namespace: default
+  annotations:
+    eks.amazonaws.com/role-arn: arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/OpenSearchAccessRole
+EOF
+```
+
+### Error: "Domain endpoint not accessible"
+
+**Causa:** Problemas de conectividad de red
+
+**Solución:**
+
+```bash
+# 1. Verificar configuración de VPC
+aws opensearch describe-domain --domain-name your-domain \
+  --query 'DomainStatus.VPCOptions'
+
+# 2. Verificar grupos de seguridad
+SG_ID="sg-12345678"
+aws ec2 describe-security-groups --group-ids $SG_ID \
+  --query 'SecurityGroups[0].IpPermissions[?FromPort==`443`]'
+
+# 3. Test de conectividad desde pod
+kubectl exec -it opensearch-client -- nslookup your-domain.region.es.amazonaws.com
+kubectl exec -it opensearch-client -- curl -I https://your-domain.region.es.amazonaws.com
+```
+
+### Error: "Certificate verification failed"
+
+**Causa:** Problemas con certificados SSL
+
+**Solución:**
+
+```python
+# Cliente Python con verificación SSL relajada (solo para debugging)
+from opensearchpy import OpenSearch, RequestsHttpConnection
+
+# 'awsauth' se construye como en los ejemplos anteriores (AWSRequestsAuth)
+client = OpenSearch(
+    hosts=[{'host': 'your-domain.region.es.amazonaws.com', 'port': 443}],
+    http_auth=awsauth,
+    use_ssl=True,
+    verify_certs=False,  # Solo para debugging; nunca en producción
+    connection_class=RequestsHttpConnection
+)
+```
+
+### Debug Completo de Conectividad
+
+```bash
+#!/bin/bash
+# debug-opensearch-connectivity.sh
+
+DOMAIN_NAME="your-opensearch-domain"
+REGION="us-east-1"
+POD_NAME="opensearch-client"
+NAMESPACE="default"
+
+echo "=== OpenSearch Connectivity Debug ==="
+
+# 1. Verificar estado del dominio
+echo "1. 
Domain status:"
+aws opensearch describe-domain --domain-name $DOMAIN_NAME --region $REGION \
+  --query 'DomainStatus.{Processing: Processing, Endpoint: Endpoint, Created: Created}'
+
+# 2. Test de DNS
+echo "2. DNS resolution:"
+kubectl exec -n $NAMESPACE $POD_NAME -- nslookup $DOMAIN_NAME.$REGION.es.amazonaws.com
+
+# 3. Test de conectividad TCP
+echo "3. TCP connectivity:"
+kubectl exec -n $NAMESPACE $POD_NAME -- timeout 5 bash -c "echo > /dev/tcp/$DOMAIN_NAME.$REGION.es.amazonaws.com/443" \
+  && echo "✓ TCP 443 accesible" || echo "✗ TCP 443 bloqueado"
+```

---

_Esta sección de preguntas frecuentes fue generada automáticamente el 5 de febrero de 2025 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-pod-authentication-permissions.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-pod-authentication-permissions.mdx
new file mode 100644
index 000000000..45287b8f2
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opensearch-pod-authentication-permissions.mdx
@@ -0,0 +1,774 @@
+---
+sidebar_position: 3
+title: "Problemas de Autenticación y Permisos en Pods de OpenSearch"
+description: "Solución de problemas de conectividad y errores de autorización de OpenSearch desde pods de Kubernetes"
+date: "2025-02-19"
+category: "dependency"
+tags: ["opensearch", "autenticación", "permisos", "aws", "iam", "pod"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Problemas de Autenticación y Permisos en Pods de OpenSearch
+
+**Fecha:** 19 de febrero de 2025
+**Categoría:** Dependencia
+**Etiquetas:** OpenSearch, Autenticación, Permisos, AWS, IAM, Pod
+
+## Descripción del Problema
+
+**Contexto:** El usuario intenta conectarse a un clúster de OpenSearch desde un pod de Kubernetes pero encuentra errores de autorización a pesar de esperar que los permisos se configuren automáticamente.
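
Como contexto para la solución: todas las solicitudes a un dominio de OpenSearch protegido por IAM deben ir firmadas con AWS Signature V4; si la firma falta o la identidad no tiene permisos, el servicio devuelve 403. El siguiente esquema, solo ilustrativo (biblioteca estándar y credenciales ficticias), muestra cómo SigV4 deriva la clave de firma que herramientas como `curl --aws-sigv4` o los SDK de AWS calculan por usted:

```python
import hashlib
import hmac

def derivar_clave_firma(secret_key: str, fecha: str, region: str, servicio: str) -> bytes:
    """Deriva la clave de firma SigV4 encadenando HMAC-SHA256 (fecha, región, servicio)."""
    def _hmac(clave: bytes, mensaje: str) -> bytes:
        return hmac.new(clave, mensaje.encode("utf-8"), hashlib.sha256).digest()

    k_fecha = _hmac(("AWS4" + secret_key).encode("utf-8"), fecha)
    k_region = _hmac(k_fecha, region)
    k_servicio = _hmac(k_region, servicio)
    return _hmac(k_servicio, "aws4_request")

# Credenciales y cadena a firmar de ejemplo (ficticias)
clave = derivar_clave_firma("SECRETO-DE-EJEMPLO", "20250219", "us-east-1", "es")
firma = hmac.new(clave, b"cadena-a-firmar", hashlib.sha256).hexdigest()
print(len(firma))  # 64
```

En la práctica esta derivación la hace el SDK o curl; el esquema solo ilustra por qué una región o un nombre de servicio (`es`) incorrectos producen errores de autorización aunque las credenciales sean válidas.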
+ +**Síntomas Observados:** + +- Error de autorización al realizar solicitudes curl a la URL de OpenSearch desde dentro de un pod +- La conexión falla a pesar de que la dependencia de OpenSearch está configurada en SleakOps +- Similar a problemas de permisos en S3: se encuentran credenciales pero carecen de permisos adecuados + +**Configuración Relevante:** + +- Dependencia de OpenSearch configurada en SleakOps +- Pod ejecutándose en un clúster de Kubernetes +- Infraestructura alojada en AWS +- Se espera que los roles y políticas IAM se configuren automáticamente + +**Condiciones de Error:** + +- Error ocurre al hacer solicitudes HTTP a OpenSearch desde el pod +- Fallo de autorización a pesar de la configuración de la dependencia +- El problema persiste en diferentes intentos de conexión +- Patrón similar a problemas de permisos en S3 + +## Solución Detallada + + + +Primero, determine cómo su pod se está autenticando actualmente con AWS: + +1. **Instale AWS CLI en su pod**: + + ```bash + # Si no está instalado + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + ``` + +2. 
**Verifique la identidad actual**:
+   ```bash
+   aws sts get-caller-identity
+   ```
+
+Esto le mostrará:
+
+- Si está usando identidad del pod o credenciales de entorno
+- El rol IAM que se está asumiendo
+- Información de cuenta y usuario
+
+### Verificar variables de entorno de AWS
+
+```bash
+# Verificar credenciales configuradas
+echo $AWS_ACCESS_KEY_ID
+echo $AWS_SECRET_ACCESS_KEY
+echo $AWS_SESSION_TOKEN
+echo $AWS_REGION
+
+# Verificar configuración del perfil
+aws configure list
+aws configure list-profiles
+```
+
+### Verificar Service Account de Kubernetes
+
+```bash
+# Desde dentro del pod
+cat /var/run/secrets/kubernetes.io/serviceaccount/token
+
+# Desde su estación de trabajo (reemplace <namespace> por el namespace del pod)
+kubectl describe serviceaccount default -n <namespace>
+```
+
+
+
+
+Para acceso completo a OpenSearch, su rol IAM necesita estas políticas:
+
+### Política IAM Básica para OpenSearch
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "es:ESHttpPost",
+        "es:ESHttpPut",
+        "es:ESHttpGet",
+        "es:ESHttpDelete",
+        "es:ESHttpHead"
+      ],
+      "Resource": "arn:aws:es:*:*:domain/your-opensearch-domain/*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "es:DescribeDomain",
+        "es:DescribeDomains",
+        "es:DescribeDomainConfig",
+        "es:ListDomainNames",
+        "es:ListTags"
+      ],
+      "Resource": "arn:aws:es:*:*:domain/your-opensearch-domain"
+    }
+  ]
+}
+```
+
+### Permisos de Acceso Granular
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "es:ESHttpPost",
+        "es:ESHttpPut"
+      ],
+      "Resource": "arn:aws:es:*:*:domain/your-domain/logs-*/_doc",
+      "Condition": {
+        "StringEquals": {
+          "aws:RequestedRegion": "us-east-1"
+        }
+      }
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "es:ESHttpGet"
+      ],
+      "Resource": "arn:aws:es:*:*:domain/your-domain/_search"
+    }
+  ]
+}
+```
+
+### Verificar permisos actuales
+
+```bash
+# Verificar qué políticas están adjuntas a su rol
+aws iam list-attached-role-policies --role-name <nombre-del-rol>
+aws iam list-role-policies --role-name <nombre-del-rol>
+
+# Obtener 
detalles de una política específica +aws iam get-policy-version --policy-arn --version-id v1 +``` + + + + + +### Configuración del Dominio OpenSearch + +1. **Política de Acceso del Dominio**: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::123456789012:role/your-pod-role", + "arn:aws:iam::123456789012:root" + ] + }, + "Action": "es:*", + "Resource": "arn:aws:es:region:123456789012:domain/your-domain/*" + } + ] +} +``` + +2. **Configuración de VPC (si aplica)**: + +```yaml +# Para dominios dentro de VPC +opensearch_config: + vpc_options: + security_group_ids: + - sg-12345678 + subnet_ids: + - subnet-12345678 + - subnet-87654321 + domain_endpoint_options: + enforce_https: true + tls_security_policy: "Policy-Min-TLS-1-2-2019-07" +``` + +### Configuración Avanzada de Acceso + +```bash +# Verificar configuración de VPC si está habilitada +aws opensearch describe-domain --domain-name your-domain-name + +# Verificar grupos de seguridad +aws ec2 describe-security-groups --group-ids sg-12345678 +``` + +### Script de Verificación de Conectividad + +```bash +#!/bin/bash +# opensearch-connectivity-test.sh + +OPENSEARCH_ENDPOINT="https://your-domain.region.es.amazonaws.com" +AWS_REGION="us-east-1" + +echo "Testing OpenSearch connectivity..." + +# Test 1: Basic connectivity +echo "1. Testing basic connectivity..." +curl -s --connect-timeout 5 "$OPENSEARCH_ENDPOINT" && echo "✓ Basic connectivity OK" || echo "✗ Cannot connect" + +# Test 2: AWS signature test +echo "2. Testing with AWS signature..." +aws opensearch describe-domain --domain-name your-domain --region $AWS_REGION + +# Test 3: HTTP operations test +echo "3. Testing HTTP operations..." 
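# Nota: con credenciales temporales (IRSA/STS) agregue también el token
# de sesión a la solicitud firmada:
#   -H "x-amz-security-token: $AWS_SESSION_TOKEN"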
+curl -X GET \ + --aws-sigv4 "aws:amz:$AWS_REGION:es" \ + --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \ + "$OPENSEARCH_ENDPOINT/_cluster/health" +``` + + + + + +### Error: "Access Denied" o 403 Forbidden + +```bash +# Verificar configuración del dominio +aws opensearch describe-domain-config --domain-name your-domain + +# Verificar política de acceso +aws opensearch describe-domain --domain-name your-domain --query 'DomainStatus.AccessPolicies' +``` + +**Posibles causas:** + +1. **IP restringida**: El dominio puede estar restringido a IPs específicas +2. **VPC mal configurada**: Problemas de conectividad de red +3. **Política IAM insuficiente**: Falta de permisos en el rol +4. **Configuración de cognito**: Autenticación adicional requerida + +### Error: "Signature mismatch" + +```bash +# Verificar tiempo del sistema +date +timedatectl status + +# Sincronizar tiempo si es necesario +sudo ntpdate -s time.nist.gov +``` + +### Error: "Invalid credentials" + +```python +# Verificar credenciales en Python +import boto3 +from botocore.exceptions import ClientError + +try: + client = boto3.client('opensearch', region_name='us-east-1') + response = client.describe_domain(DomainName='your-domain') + print("✓ Credentials are valid") + print(f"Domain status: {response['DomainStatus']['Processing']}") +except ClientError as e: + print(f"✗ Credential error: {e}") +``` + +### Debugging de Conectividad de Red + +```bash +# Test de conectividad de red +nslookup your-domain.region.es.amazonaws.com +ping your-domain.region.es.amazonaws.com + +# Verificar reglas de grupo de seguridad +aws ec2 describe-security-groups --group-ids sg-12345678 \ + --query 'SecurityGroups[0].IpPermissions[?FromPort==`443`]' + +# Test de puerto específico +telnet your-domain.region.es.amazonaws.com 443 +``` + + + + + +### Verificar Configuración de la Dependencia + +1. 
**Revisar la configuración de la dependencia** en el panel de SleakOps: + +```yaml +# Ejemplo de configuración de dependencia OpenSearch +dependencies: + opensearch: + type: opensearch-aws + configuration: + instance_type: "t3.small.search" + instance_count: 1 + master_instance_type: "t3.small.search" + master_instance_count: 1 + volume_size: 20 + volume_type: "gp3" + access_policies: | + { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "AWS": "*" + }, + "Action": "es:*", + "Resource": "arn:aws:es:*:*:domain/*" + } + ] + } +``` + +2. **Variables de salida esperadas**: + +```bash +# Variables que deberían estar disponibles en su pod +echo $OPENSEARCH_ENDPOINT +echo $OPENSEARCH_DOMAIN_NAME +echo $OPENSEARCH_REGION +echo $OPENSEARCH_PORT +``` + +### Configuración de Service Account + +```yaml +# service-account.yaml +apiVersion: v1 +kind: ServiceAccount +metadata: + name: opensearch-service-account + namespace: your-namespace + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/opensearch-role +``` + +### Configuración del Pod con IRSA + +```yaml +# pod-with-irsa.yaml +apiVersion: v1 +kind: Pod +metadata: + name: opensearch-client + namespace: your-namespace +spec: + serviceAccountName: opensearch-service-account + containers: + - name: app + image: your-app:latest + env: + - name: AWS_REGION + value: "us-east-1" + - name: OPENSEARCH_ENDPOINT + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: endpoint +``` + +### Script de Verificación de SleakOps + +```bash +#!/bin/bash +# sleakops-opensearch-verify.sh + +echo "=== SleakOps OpenSearch Verification ===" + +# Verificar variables de entorno de la dependencia +echo "1. Environment variables:" +env | grep -i opensearch | sort + +# Verificar conectividad desde el pod +echo "2. 
Connectivity test:" +if [ -n "$OPENSEARCH_ENDPOINT" ]; then + curl -s --connect-timeout 5 "$OPENSEARCH_ENDPOINT/_cluster/health" \ + --aws-sigv4 "aws:amz:${AWS_REGION:-us-east-1}:es" && \ + echo "✓ OpenSearch accessible" || \ + echo "✗ OpenSearch not accessible" +else + echo "✗ OPENSEARCH_ENDPOINT not set" +fi + +# Verificar permisos IAM +echo "3. IAM permissions:" +aws sts get-caller-identity +aws opensearch list-domain-names --region ${AWS_REGION:-us-east-1} +``` + + + + + +### Cliente Python con boto3 + +```python +import boto3 +import json +from opensearchpy import OpenSearch, RequestsHttpConnection +from aws_requests_auth.aws_auth import AWSRequestsAuth + +def create_opensearch_client(): + # Configuración de autenticación AWS + session = boto3.Session() + credentials = session.get_credentials() + region = session.region_name or 'us-east-1' + + awsauth = AWSRequestsAuth( + aws_access_key=credentials.access_key, + aws_secret_access_key=credentials.secret_key, + aws_token=credentials.token, + aws_host='your-domain.region.es.amazonaws.com', + aws_region=region, + aws_service='es' + ) + + # Cliente OpenSearch + client = OpenSearch( + hosts=[{'host': 'your-domain.region.es.amazonaws.com', 'port': 443}], + http_auth=awsauth, + use_ssl=True, + verify_certs=True, + connection_class=RequestsHttpConnection + ) + + return client + +# Uso del cliente +try: + client = create_opensearch_client() + + # Test de conectividad + info = client.info() + print(f"✓ Connected to OpenSearch: {info['version']['number']}") + + # Ejemplo de búsqueda + response = client.search( + index="logs-*", + body={ + "query": {"match_all": {}}, + "size": 10 + } + ) + print(f"Found {response['hits']['total']['value']} documents") + +except Exception as e: + print(f"✗ Error connecting to OpenSearch: {e}") +``` + +### Cliente Node.js + +```javascript +// opensearch-client.js +const { Client } = require('@opensearch-project/opensearch'); +const AWS = require('aws-sdk'); + +async function 
createOpenSearchClient() {
+  // El cliente oficial firma SigV4 mediante el helper AwsSigv4Signer
+  const { AwsSigv4Signer } = require('@opensearch-project/opensearch/aws');
+
+  // Resolver credenciales con el SDK v2 (API de callback envuelta en promesa)
+  await new Promise((resolve, reject) =>
+    AWS.config.credentials.get(err => (err ? reject(err) : resolve()))
+  );
+  const { accessKeyId, secretAccessKey, sessionToken } = AWS.config.credentials;
+
+  const client = new Client({
+    ...AwsSigv4Signer({
+      region: process.env.AWS_REGION || 'us-east-1',
+      service: 'es',
+      getCredentials: async () => ({ accessKeyId, secretAccessKey, sessionToken })
+    }),
+    node: process.env.OPENSEARCH_ENDPOINT || 'https://your-domain.region.es.amazonaws.com'
+  });
+
+  return client;
+}
+
+// Uso
+(async () => {
+  try {
+    const client = await createOpenSearchClient();
+
+    // Test de conectividad
+    const info = await client.info();
+    console.log('✓ Connected to OpenSearch:', info.body.version.number);
+
+    // Búsqueda de ejemplo
+    const response = await client.search({
+      index: 'logs-*',
+      body: {
+        query: { match_all: {} },
+        size: 10
+      }
+    });
+
+    console.log(`Found ${response.body.hits.total.value} documents`);
+
+  } catch (error) {
+    console.error('✗ Error connecting to OpenSearch:', error);
+  }
+})();
+```
+
+### Cliente con curl y AWS CLI
+
+```bash
+#!/bin/bash
+# opensearch-curl-client.sh
+
+ENDPOINT=${OPENSEARCH_ENDPOINT:-"https://your-domain.region.es.amazonaws.com"}
+REGION=${AWS_REGION:-"us-east-1"}
+
+# Función para hacer solicitudes firmadas con SigV4
+opensearch_request() {
+    local method=$1
+    local path=$2
+    local data=$3
+
+    if [ -n "$data" ]; then
+        curl -X "$method" \
+            --aws-sigv4 "aws:amz:$REGION:es" \
+            --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
+            -H "Content-Type: application/json" \
+            --data "$data" \
+            "$ENDPOINT/$path"
+    else
+        curl -X "$method" \
+            --aws-sigv4 "aws:amz:$REGION:es" \
+            --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
+            "$ENDPOINT/$path"
+    fi
+}
+
+# Ejemplos de uso
+echo "=== OpenSearch API Examples ==="
+
+# Obtener información del clúster
+echo "1. Cluster info:"
+opensearch_request GET "_cluster/health"
+
+# Listar índices
+echo "2. List indices:"
+opensearch_request GET "_cat/indices?v"
+
+# Búsqueda simple
+echo "3. 
Simple search:" +opensearch_request GET "logs-*/_search" '{"query":{"match_all":{}},"size":5}' + +# Crear un documento +echo "4. Create document:" +opensearch_request POST "logs-$(date +%Y.%m.%d)/_doc" '{ + "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ)'", + "level": "INFO", + "message": "Test log entry", + "service": "opensearch-test" +}' +``` + + + + + +### Configuración de Logging para Debugging + +```python +# opensearch-debug-client.py +import logging +import boto3 +from opensearchpy import OpenSearch, RequestsHttpConnection +from aws_requests_auth.aws_auth import AWSRequestsAuth + +# Habilitar logging detallado +logging.basicConfig(level=logging.DEBUG) +opensearch_logger = logging.getLogger('opensearch') +opensearch_logger.setLevel(logging.DEBUG) + +# Cliente con logging extendido +def create_debug_client(): + session = boto3.Session() + credentials = session.get_credentials() + + # Log de credenciales (sin mostrar secretos) + print(f"AWS Region: {session.region_name}") + print(f"Access Key ID: {credentials.access_key[:8]}...") + + awsauth = AWSRequestsAuth( + aws_access_key=credentials.access_key, + aws_secret_access_key=credentials.secret_key, + aws_token=credentials.token, + aws_host='your-domain.region.es.amazonaws.com', + aws_region=session.region_name, + aws_service='es' + ) + + client = OpenSearch( + hosts=[{'host': 'your-domain.region.es.amazonaws.com', 'port': 443}], + http_auth=awsauth, + use_ssl=True, + verify_certs=True, + connection_class=RequestsHttpConnection, + # Habilitar logging de requests + enable_log=True + ) + + return client +``` + +### Monitoreo de Métricas de OpenSearch + +```bash +#!/bin/bash +# opensearch-metrics.sh + +DOMAIN_NAME="your-domain" +REGION=${AWS_REGION:-"us-east-1"} + +echo "=== OpenSearch Domain Metrics ===" + +# Métricas del dominio +aws cloudwatch get-metric-statistics \ + --namespace AWS/ES \ + --metric-name CPUUtilization \ + --dimensions Name=DomainName,Value=$DOMAIN_NAME Name=ClientId,Value=$(aws sts 
get-caller-identity --query Account --output text) \ + --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \ + --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ + --period 300 \ + --statistics Average \ + --region $REGION + +# Estado del clúster +aws opensearch describe-domain \ + --domain-name $DOMAIN_NAME \ + --region $REGION \ + --query 'DomainStatus.{ + Processing: Processing, + Created: Created, + Deleted: Deleted, + Endpoint: Endpoint, + EngineVersion: EngineVersion + }' + +# Configuración de acceso +aws opensearch describe-domain-config \ + --domain-name $DOMAIN_NAME \ + --region $REGION \ + --query 'DomainConfig.AccessPolicies.Options' +``` + +### Herramientas de Debugging + +```yaml +# opensearch-debug-pod.yaml +apiVersion: v1 +kind: Pod +metadata: + name: opensearch-debug + namespace: your-namespace +spec: + serviceAccountName: opensearch-service-account + containers: + - name: debug + image: amazon/aws-cli:latest + command: ["/bin/sh"] + args: ["-c", "while true; do sleep 30; done"] + env: + - name: AWS_REGION + value: "us-east-1" + - name: OPENSEARCH_ENDPOINT + valueFrom: + secretKeyRef: + name: opensearch-credentials + key: endpoint + volumeMounts: + - name: debug-scripts + mountPath: /scripts + volumes: + - name: debug-scripts + configMap: + name: opensearch-debug-scripts + defaultMode: 0755 +``` + +### Scripts de Validación Completa + +```bash +#!/bin/bash +# opensearch-full-validation.sh + +set -e + +echo "=== Complete OpenSearch Validation ===" + +# 1. Verificar variables de entorno +echo "1. Environment verification:" +required_vars=("OPENSEARCH_ENDPOINT" "AWS_REGION") +for var in "${required_vars[@]}"; do + if [ -z "${!var}" ]; then + echo "✗ Missing required variable: $var" + exit 1 + else + echo "✓ $var is set" + fi +done + +# 2. Verificar credenciales AWS +echo "2. AWS credentials verification:" +aws sts get-caller-identity || { + echo "✗ AWS credentials not configured" + exit 1 +} + +# 3. Test de conectividad de red +echo "3. 
Network connectivity:"
+HOST=${OPENSEARCH_ENDPOINT#https://}
+timeout 5 bash -c "echo > /dev/tcp/$HOST/443" \
+  && echo "✓ TCP 443 accesible" || echo "✗ TCP 443 bloqueado"
+```

---

_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de febrero de 2025 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opentelemetry-django-database-detection.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opentelemetry-django-database-detection.mdx
new file mode 100644
index 000000000..430863cff
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/opentelemetry-django-database-detection.mdx
@@ -0,0 +1,279 @@
+---
+sidebar_position: 3
+title: "Problema de Detección de Base de Datos en OpenTelemetry con Django"
+description: "Solución para OpenTelemetry que no detecta bases de datos configuradas en aplicaciones Django"
+date: "2024-12-19"
+category: "workload"
+tags:
+  ["opentelemetry", "django", "base de datos", "monitoreo", "instrumentación"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Problema de Detección de Base de Datos en OpenTelemetry con Django
+
+**Fecha:** 19 de diciembre de 2024
+**Categoría:** Carga de trabajo
+**Etiquetas:** OpenTelemetry, Django, Base de datos, Monitoreo, Instrumentación
+
+## Descripción del Problema
+
+**Contexto:** El usuario tiene una aplicación Django desplegada en SleakOps con auto-instrumentación de OpenTelemetry activada, pero OpenTelemetry no detecta las conexiones a bases de datos configuradas a pesar de que la base de datos está correctamente configurada en los ajustes de Django.
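
Como contexto: los instrumentadores de base de datos de OpenTelemetry funcionan envolviendo (monkey-patching) funciones del driver, por ejemplo `psycopg2.connect` o el `execute` del cursor; si la instrumentación se registra después de que Django ya creó sus conexiones, el envoltorio no llega a aplicarse y no aparecen trazas de base de datos. Un esquema puramente ilustrativo del patrón, sin dependencias reales de OpenTelemetry (los nombres `CursorFicticio` y `spans` son hipotéticos):

```python
# Esquema del patrón de envoltura que usan los instrumentadores de BD.
spans = []  # un instrumentador real exportaría spans vía OTLP

class CursorFicticio:
    def execute(self, sql: str) -> str:
        return f"resultado de: {sql}"

def instrumentar_cursor(cls) -> None:
    """Envuelve cls.execute para registrar cada consulta, de forma análoga
    a como opentelemetry-instrumentation-psycopg2 envuelve el cursor real."""
    original = cls.execute

    def execute_instrumentado(self, sql: str):
        spans.append({"nombre": "db.query", "sentencia": sql})
        return original(self, sql)

    cls.execute = execute_instrumentado

instrumentar_cursor(CursorFicticio)  # debe ocurrir ANTES de crear/usar cursores
cursor = CursorFicticio()
cursor.execute("SELECT 1")
print(spans[0]["sentencia"])  # SELECT 1
```

De ahí que la clave del problema suela ser el orden de inicialización: la instrumentación debe ejecutarse antes de que Django importe y use el driver de base de datos.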
+ +**Síntomas Observados:** + +- La auto-instrumentación de OpenTelemetry no detecta conexiones a bases de datos +- La base de datos está correctamente configurada y funciona en la aplicación Django +- Añadir la variable de entorno `DJANGO_SETTINGS_MODULE` no resuelve el problema +- La aplicación funciona normalmente pero carece de datos de telemetría de base de datos + +**Configuración Relevante:** + +- Framework: Django (Django REST Framework) +- Variable de entorno: `DJANGO_SETTINGS_MODULE="simplee_drf.settings"` +- Plataforma: SleakOps con auto-instrumentación de OpenTelemetry +- Problema previo con la librería boto fue resuelto actualizando la versión + +**Condiciones de Error:** + +- OpenTelemetry no detecta la base de datos durante el inicio de la aplicación +- Faltan trazas y métricas de base de datos en los datos de observabilidad +- El problema persiste tras configurar la variable de entorno del módulo de ajustes de Django + +## Solución Detallada + + + +Primero, asegúrate de que los ajustes de Django estén correctamente configurados para la detección de la base de datos: + +1. **Verifica que DJANGO_SETTINGS_MODULE esté correctamente establecido:** + +```yaml +# En la configuración de despliegue de SleakOps +environment: + DJANGO_SETTINGS_MODULE: "tu_proyecto.settings" + # Reemplaza 'tu_proyecto' con el nombre real de tu proyecto +``` + +2. **Revisa que el archivo de ajustes de Django contenga la configuración de la base de datos:** + +```python +# settings.py +DATABASES = { + 'default': { + 'ENGINE': 'django.db.backends.postgresql', # o el motor de base de datos que uses + 'NAME': os.environ.get('DB_NAME'), + 'USER': os.environ.get('DB_USER'), + 'PASSWORD': os.environ.get('DB_PASSWORD'), + 'HOST': os.environ.get('DB_HOST'), + 'PORT': os.environ.get('DB_PORT', '5432'), + } +} +``` + + + + + +Asegúrate de que la instrumentación de OpenTelemetry para Django esté configurada correctamente: + +1. 
**Agrega los paquetes necesarios de OpenTelemetry en requirements.txt:** + +```txt +opentelemetry-api +opentelemetry-sdk +opentelemetry-auto-instrumentation +opentelemetry-instrumentation-django +opentelemetry-instrumentation-psycopg2 # para PostgreSQL +# o opentelemetry-instrumentation-mysqlclient # para MySQL +``` + +2. **Configura la instrumentación en los ajustes de Django:** + +```python +# settings.py +from opentelemetry.instrumentation.django import DjangoInstrumentor +from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor + +# Inicializa los instrumentadores +DjangoInstrumentor().instrument() +Psycopg2Instrumentor().instrument() +``` + + + + + +Configura las variables de entorno necesarias en tu despliegue de SleakOps: + +```yaml +# Configuración de despliegue en SleakOps +environment: + # Configuración de Django + DJANGO_SETTINGS_MODULE: "tu_proyecto.settings" + + # Configuración de base de datos + DB_NAME: "nombre_de_tu_base_de_datos" + DB_USER: "usuario_de_tu_base_de_datos" + DB_PASSWORD: "contraseña_de_tu_base_de_datos" + DB_HOST: "host_de_tu_base_de_datos" + DB_PORT: "5432" + + # Configuración de OpenTelemetry + OTEL_SERVICE_NAME: "tu-app-django" + OTEL_RESOURCE_ATTRIBUTES: "service.name=tu-app-django,service.version=1.0.0" + OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED: "true" +``` + + + + + +Si la auto-instrumentación sigue fallando, configura la instrumentación manual: + +1. 
**Crea un módulo de instrumentación:** + +```python +# instrumentation.py +from opentelemetry import trace +from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter +from opentelemetry.sdk.trace import TracerProvider +from opentelemetry.sdk.trace.export import BatchSpanProcessor +from opentelemetry.instrumentation.django import DjangoInstrumentor +from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor + +def setup_telemetry(): + # Configura el proveedor de trazas + trace.set_tracer_provider(TracerProvider()) + tracer = trace.get_tracer_provider() + + # Configura el exportador OTLP + otlp_exporter = OTLPSpanExporter() + span_processor = BatchSpanProcessor(otlp_exporter) + tracer.add_span_processor(span_processor) + + # Instrumenta Django y la base de datos + DjangoInstrumentor().instrument() + Psycopg2Instrumentor().instrument() +``` + +2. **Inicializa en el método ready() de Django:** + +```python +# apps.py +from django.apps import AppConfig + +class YourAppConfig(AppConfig): + default_auto_field = 'django.db.models.BigAutoField' + name = 'tu_app' + + def ready(self): + from .instrumentation import setup_telemetry + setup_telemetry() +``` + + + + + +**Verifica que la instrumentación esté funcionando:** + +1. **Revisa los logs de la aplicación para la inicialización de OpenTelemetry:** + +```bash +# Busca estos mensajes en los logs de SleakOps +kubectl logs -f deployment/tu-app | grep -i "opentelemetry\|instrumentation" +``` + +2. **Prueba la conexión a la base de datos manualmente:** + +```python +# En la consola de Django o en una vista de prueba +from django.db import connection + +def test_db_connection(): + with connection.cursor() as cursor: + cursor.execute("SELECT 1") + result = cursor.fetchone() + return result +``` + +3. 
**Verifica que se estén enviando datos de telemetría:**
+
+```python
+# Añade a una vista Django para pruebas
+from django.http import JsonResponse
+from opentelemetry import trace
+
+def test_view(request):
+    tracer = trace.get_tracer(__name__)
+    with tracer.start_as_current_span("test-span"):
+        # Tu lógica de vista aquí
+        return JsonResponse({"status": "ok"})
+```
+
+**Problemas comunes y soluciones:**
+
+- **Problema**: La instrumentación no se inicializa
+  - **Solución**: Asegúrate de que la instrumentación se ejecute antes de importar Django
+- **Problema**: Variables de entorno no están configuradas
+  - **Solución**: Verifica que todas las variables de entorno estén establecidas en SleakOps
+- **Problema**: Versiones incompatibles de paquetes
+  - **Solución**: Actualiza todos los paquetes de OpenTelemetry a la misma versión
+
+
+
+
+
+**Configuración avanzada para casos complejos:**
+
+1. **Configuración personalizada del exportador:**
+
+```python
+# settings.py
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+
+# Configuración personalizada del exportador OTLP
+otlp_exporter = OTLPSpanExporter(
+    endpoint="http://tu-collector:4317",
+    headers={
+        "api-key": "tu-api-key"
+    }
+)
+```
+
+2. **Instrumentación selectiva:**
+
+```python
+# Solo instrumentar componentes específicos
+from opentelemetry.instrumentation.django import DjangoInstrumentor
+from opentelemetry.instrumentation.requests import RequestsInstrumentor
+
+# Instrumentar solo Django y requests
+DjangoInstrumentor().instrument()
+RequestsInstrumentor().instrument()
+```
+
+3. 
**Configuración de muestreo:**
+
+```python
+from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
+
+# Configurar muestreo al 10%
+sampler = TraceIdRatioBased(rate=0.1)
+
+# Pásalo al crear el proveedor: TracerProvider(sampler=sampler)
+```
+
+**Mejores prácticas:**
+
+- Usa instrumentación automática cuando sea posible
+- Configura muestreo apropiado para entornos de producción
+- Monitorea el rendimiento de la instrumentación
+- Mantén actualizadas las librerías de OpenTelemetry
+- Documenta la configuración específica de tu aplicación
+
+
+
+---
+
+_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._
+
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx
new file mode 100644
index 000000000..4ec486727
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/pod-readiness-probe-failed-connection-refused.mdx
@@ -0,0 +1,216 @@
+---
+sidebar_position: 3
+title: "Fallo en la sonda de disponibilidad del Pod - Conexión rechazada"
+description: "Solución para pods que fallan en las sondas de disponibilidad con errores de conexión rechazada"
+date: "2024-12-19"
+category: "workload"
+tags: ["pod", "readiness-probe", "connection-refused", "troubleshooting"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Fallo en la sonda de disponibilidad del Pod - Conexión rechazada
+
+**Fecha:** 19 de diciembre de 2024
+**Categoría:** Carga de trabajo
+**Etiquetas:** Pod, Sonda de disponibilidad, Conexión rechazada, Solución de problemas
+
+## Descripción del problema
+
+**Contexto:** Después de una compilación y despliegue exitosos, el nuevo pod no inicia correctamente y entra en un ciclo de reinicios. 
La sonda de disponibilidad no puede conectarse al endpoint de verificación de salud de la aplicación. + +**Síntomas observados:** + +- El pod muestra el estado "Back-off restarting failed container" +- La sonda de disponibilidad falla con error "conexión rechazada" +- La imagen del contenedor está presente en la máquina +- Los procesos de compilación y despliegue se completaron con éxito +- La aplicación parece no responder a las solicitudes HTTP + +**Configuración relevante:** + +- Endpoint de verificación de salud: `/users/sign_in` +- Puerto de la aplicación: `3000` +- IP interna del pod: `10.130.33.83` +- Error: `dial tcp 10.130.33.83:3000: connect: connection refused` + +**Condiciones de error:** + +- Ocurre después de una compilación y despliegue exitosos +- La sonda de disponibilidad falla constantemente +- El pod no puede alcanzar el estado listo +- El servicio de la aplicación no es accesible + +## Solución detallada + + + +Un error de "conexión rechazada" durante la sonda de disponibilidad indica que: + +1. **La aplicación no ha iniciado**: El servidor HTTP dentro del contenedor no ha arrancado +2. **Puerto incorrecto**: La aplicación está escuchando en un puerto diferente al esperado +3. **Fallo de la aplicación**: La aplicación arrancó pero se cayó antes de la sonda +4. **Problemas de enlace**: La aplicación sólo está enlazando a localhost en lugar de 0.0.0.0 + +El hecho de que la imagen del contenedor esté presente sugiere que el despliegue fue exitoso, pero la aplicación interna no está funcionando correctamente. 
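Como comprobación conceptual, la lógica de la sonda puede reproducirse con un boceto en Python (el nombre `sondear` y los puertos son ilustrativos): kubelet hace esencialmente una conexión TCP/HTTP, y el tipo de fallo distingue entre las causas anteriores.

```python
import socket

def sondear(host, puerto, timeout=2.0):
    """Imita la comprobación TCP de la sonda: intenta abrir una conexión."""
    try:
        with socket.create_connection((host, puerto), timeout=timeout):
            return "listo"  # algo acepta conexiones en ese puerto
    except ConnectionRefusedError:
        return "connection refused"  # nada escucha: la app no arrancó o se cayó
    except OSError:
        return "sin respuesta"  # timeout o problema de red/firewall
```

Un «connection refused» contra la IP del pod (p. ej. `10.130.33.83:3000`) mientras la aplicación "funciona" dentro del contenedor suele indicar que el proceso escucha sólo en `127.0.0.1`.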
+
+
+
+
+
+Primero, examine los logs del pod para identificar por qué la aplicación no está iniciando:
+
+```bash
+# Obtener logs del pod
+kubectl logs <nombre-del-pod> -n <namespace>
+
+# Para la instancia previa del contenedor
+kubectl logs <nombre-del-pod> -n <namespace> --previous
+
+# Seguir logs en tiempo real
+kubectl logs <nombre-del-pod> -n <namespace> -f
+```
+
+Busque:
+
+- Errores de inicio de la aplicación
+- Fallos en la conexión a la base de datos
+- Variables de entorno faltantes
+- Problemas de enlace de puerto
+- Fallos de dependencias
+
+
+
+
+
+Asegúrese de que su aplicación esté configurada correctamente:
+
+1. **Verifique el enlace de la aplicación**:
+
+   ```javascript
+   // Incorrecto - sólo enlaza a localhost
+   app.listen(3000, "localhost");
+
+   // Correcto - enlaza a todas las interfaces
+   app.listen(3000, "0.0.0.0");
+   ```
+
+2. **Verifique el EXPOSE en el Dockerfile**:
+
+   ```dockerfile
+   EXPOSE 3000
+   ```
+
+3. **Revise la configuración del servicio de Kubernetes**:
+   ```yaml
+   spec:
+     ports:
+       - port: 3000
+         targetPort: 3000
+   ```
+
+
+
+
+
+Si la aplicación tarda en iniciar, ajuste la sonda de disponibilidad:
+
+```yaml
+spec:
+  containers:
+    - name: app
+      readinessProbe:
+        httpGet:
+          path: /users/sign_in
+          port: 3000
+        initialDelaySeconds: 30 # Esperar 30 segundos antes de la primera sonda
+        periodSeconds: 10 # Revisar cada 10 segundos
+        timeoutSeconds: 5 # Tiempo de espera de 5 segundos
+        failureThreshold: 3 # Fallar después de 3 fallos consecutivos
+        successThreshold: 1 # Éxito tras 1 sonda exitosa
+```
+
+
+
+
+
+Si `/users/sign_in` requiere autenticación o lógica compleja, cree un endpoint de salud más simple:
+
+```javascript
+// Agregar un endpoint simple de verificación de salud
+app.get("/health", (req, res) => {
+  res.status(200).json({ status: "ok" });
+});
+```
+
+Luego actualice su sonda de disponibilidad:
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 3000
+```
+
+
+
+
+
+1. **Verifique si el contenedor está en ejecución**:
+
+   ```bash
+   kubectl describe pod <nombre-del-pod> -n <namespace>
+   ```
+
+2. 
**Ejecute dentro del contenedor** (si permanece en ejecución): + + ```bash + kubectl exec -it -n -- /bin/bash + # Verifique si el puerto está escuchando + netstat -tlnp | grep 3000 + ``` + +3. **Pruebe el endpoint manualmente**: + + ```bash + kubectl exec -it -n -- curl http://localhost:3000/users/sign_in + ``` + +4. **Revise los límites de recursos**: + ```bash + kubectl top pod -n + ``` + + + + + +**Para aplicaciones Ruby on Rails:** + +```ruby +# Asegúrese de que Rails se enlaza a todas las interfaces +# config/puma.rb +bind "tcp://0.0.0.0:#{ENV.fetch('PORT', 3000)}" +``` + +**Para aplicaciones Node.js:** + +```javascript +// Asegúrese de que Express se enlaza a todas las interfaces +const port = process.env.PORT || 3000; +app.listen(port, "0.0.0.0", () => { + console.log(`Servidor corriendo en el puerto ${port}`); +}); +``` + +**Para variables de entorno:** + +- Verifique que todas las variables de entorno necesarias estén definidas +- Revise las cadenas de conexión a la base de datos +- Asegúrese de que los secretos estén correctamente montados + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx new file mode 100644 index 000000000..0927f023a --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgres-pgvector-extension-alpine-debian.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 3 +title: "Problemas con la extensión pgvector de PostgreSQL en Alpine vs Debian" +description: "Solución para problemas de compatibilidad de la extensión pgvector entre imágenes de PostgreSQL Alpine y Debian" +date: "2024-04-24" +category: "dependencia" +tags: + ["postgresql", "pgvector", "alpine", "debian", "extensiones", "base de datos"] +--- + +import 
TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con la extensión pgvector de PostgreSQL en Alpine vs Debian + +**Fecha:** 24 de abril de 2024 +**Categoría:** Dependencia +**Etiquetas:** PostgreSQL, pgvector, Alpine, Debian, Extensiones, Base de datos + +## Descripción del problema + +**Contexto:** El usuario está implementando la extensión pgvector para PostgreSQL en su entorno SleakOps. La extensión funciona correctamente en los entornos de producción y staging, pero falla en el entorno local de desarrollo. + +**Síntomas observados:** + +- La extensión pgvector se instala correctamente en producción y staging +- El entorno local de desarrollo no puede instalar ni usar la extensión pgvector +- Comportamiento inconsistente entre entornos +- Errores relacionados con la extensión en la instancia local de PostgreSQL + +**Configuración relevante:** + +- Entorno local: `postgres:14-alpine` +- Producción/Staging: `postgres:14` (basado en Debian) +- Extensión: pgvector para búsqueda de similitud vectorial +- Plataforma: SleakOps con dependencia de PostgreSQL + +**Condiciones de error:** + +- El error ocurre al intentar instalar la extensión pgvector localmente +- El problema aparece solo en imágenes de PostgreSQL basadas en Alpine +- Funciona correctamente en imágenes basadas en Debian +- Inconsistencia de entorno entre local y producción + +## Solución detallada + + + +La diferencia clave entre las imágenes de PostgreSQL: + +- **postgres:14-alpine**: Basada en Alpine Linux, imagen mínima con musl libc +- **postgres:14**: Basada en Debian, incluye glibc y más herramientas de desarrollo + +Las imágenes Alpine son más pequeñas pero pueden carecer de ciertas librerías y herramientas de compilación necesarias para extensiones de PostgreSQL como pgvector. 
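Si el tamaño de la imagen es importante y se prefiere mantener una imagen propia, otra alternativa es compilar pgvector sobre la imagen Debian oficial. Un boceto orientativo de Dockerfile (la etiqueta `v0.8.0` y la versión 14 de PostgreSQL son supuestos a ajustar):

```dockerfile
# Imagen Debian oficial: incluye glibc y los paquetes de desarrollo necesarios
FROM postgres:14

# Compila pgvector desde el código fuente y elimina las herramientas de compilación
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential git postgresql-server-dev-14 \
    && git clone --branch v0.8.0 https://github.com/pgvector/pgvector.git /tmp/pgvector \
    && make -C /tmp/pgvector \
    && make -C /tmp/pgvector install \
    && apt-get purge -y build-essential git \
    && rm -rf /tmp/pgvector /var/lib/apt/lists/*
```

Con una imagen construida así, `CREATE EXTENSION vector;` queda disponible sin depender de un paquete precompilado.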
+ + + + + +Para mantener la consistencia del entorno, actualice su configuración local de PostgreSQL: + +**En docker-compose.yml o equivalente:** + +```yaml +services: + postgres: + image: postgres:14 # Cambiado de postgres:14-alpine + environment: + POSTGRES_DB: tu_base_de_datos + POSTGRES_USER: tu_usuario + POSTGRES_PASSWORD: tu_contraseña + ports: + - "5432:5432" + volumes: + - postgres_data:/var/lib/postgresql/data +``` + +**En la configuración de dependencias de SleakOps:** + +```yaml +dependencies: + postgres: + image: postgres:14 + version: "14" + extensions: + - pgvector +``` + + + + + +Después de cambiar a la imagen Debian, instala pgvector: + +**Método 1: Usando comandos SQL** + +```sql +-- Conéctate a tu base de datos +\c tu_base_de_datos + +-- Crea la extensión +CREATE EXTENSION IF NOT EXISTS vector; + +-- Verifica la instalación +\dx vector +``` + +**Método 2: Usando un script de inicialización** + +Crea un script de inicialización: + +```sql +-- init-pgvector.sql +CREATE EXTENSION IF NOT EXISTS vector; +``` + +Móntalo en tu contenedor: + +```yaml +volumes: + - ./init-pgvector.sql:/docker-entrypoint-initdb.d/init-pgvector.sql +``` + + + + + +Para confirmar que pgvector funciona correctamente: + +```sql +-- Verifica si la extensión está instalada +SELECT * FROM pg_extension WHERE extname = 'vector'; + +-- Prueba la funcionalidad vectorial +CREATE TABLE test_vectors ( + id SERIAL PRIMARY KEY, + embedding vector(3) +); + +-- Inserta datos de prueba +INSERT INTO test_vectors (embedding) VALUES ('[1,2,3]'); +INSERT INTO test_vectors (embedding) VALUES ('[4,5,6]'); + +-- Prueba la búsqueda por similitud +SELECT id, embedding, embedding <-> '[1,2,3]' AS distancia +FROM test_vectors +ORDER BY distancia; + +-- Limpia la prueba +DROP TABLE test_vectors; +``` + + + + + +**Si pgvector aún no funciona después de cambiar a Debian:** + +1. 
**Verifica la compatibilidad de la versión de PostgreSQL:** + + ```bash + docker exec -it tu_contenedor_postgres psql -U tu_usuario -c "SELECT version();" + ``` + +2. **Instala pgvector manualmente si es necesario:** + + ```bash + # Conéctate al contenedor + docker exec -it tu_contenedor_postgres bash + + # Instala dependencias + apt-get update + apt-get install -y postgresql-14-pgvector + ``` + +3. **Reinicia el servicio de PostgreSQL:** + + ```bash + docker restart tu_contenedor_postgres + ``` + +4. **Verifica la disponibilidad de la extensión:** + ```sql + SELECT * FROM pg_available_extensions WHERE name = 'vector'; + ``` + + + + + +Para prevenir futuras inconsistencias en el entorno: + +**1. Usa imágenes base idénticas:** + +```yaml +# Usa la misma imagen en todos los entornos +postgres: + image: postgres:14 # No postgres:14-alpine +``` + +**2. Documenta las dependencias:** + +```yaml +# Crea un archivo dependencies.yml +postgres: + version: "14" + image: postgres:14 + extensions: + - pgvector + - uuid-ossp + required_packages: + - postgresql-14-pgvector +``` + +**3. 
Usa scripts de inicialización:**
+
+```sql
+-- init-extensions.sql (ejecutado desde docker-entrypoint-initdb.d)
+CREATE EXTENSION IF NOT EXISTS vector;
+CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
+```
+
+
+
+---
+
+_Esta FAQ fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de un usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgresql-backup-restore-version-compatibility.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgresql-backup-restore-version-compatibility.mdx
new file mode 100644
index 000000000..1ef829642
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/postgresql-backup-restore-version-compatibility.mdx
@@ -0,0 +1,207 @@
+---
+sidebar_position: 3
+title: "Problema de Compatibilidad de Versiones en la Restauración de Respaldos de PostgreSQL"
+description: "Solución para errores de compatibilidad de versión de pg_restore al restaurar respaldos entre diferentes versiones de PostgreSQL"
+date: "2024-08-20"
+category: "dependency"
+tags:
+  ["postgresql", "respaldo", "restauracion", "rds", "compatibilidad-version"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Problema de Compatibilidad de Versiones en la Restauración de Respaldos de PostgreSQL
+
+**Fecha:** 20 de agosto de 2024
+**Categoría:** Dependencia
+**Etiquetas:** PostgreSQL, Respaldo, Restauración, RDS, Compatibilidad de Versiones
+
+## Descripción del Problema
+
+**Contexto:** El usuario intenta restaurar respaldos de PostgreSQL creados con versiones más nuevas de pg_dump en instancias RDS de PostgreSQL más antiguas, encontrando problemas de compatibilidad de versiones. 
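Para confirmar el diagnóstico sin intentar la restauración completa, se puede inspeccionar el encabezado del archivo: los respaldos en formato personalizado de pg_dump comienzan con la firma `PGDMP` seguida de los bytes de versión del formato. Un boceto en Python (el nombre `version_formato` es ilustrativo):

```python
def version_formato(ruta):
    """Lee la versión del formato de archivo de un respaldo de pg_dump -Fc.

    El encabezado del formato personalizado empieza con la firma b"PGDMP"
    seguida de tres bytes: versión mayor, menor y revisión del formato.
    """
    with open(ruta, "rb") as f:
        cabecera = f.read(8)
    if cabecera[:5] != b"PGDMP":
        raise ValueError("no es un respaldo en formato personalizado de pg_dump")
    return f"{cabecera[5]}.{cabecera[6]}"
```

Un resultado `1.15` explica el error observado: el pg_restore de PostgreSQL 14 sólo entiende versiones de formato anteriores a esa.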
+ +**Síntomas Observados:** + +- Error: `pg_restore: error: unsupported version (1.15) in file header` +- Respaldo creado con PostgreSQL 16.1 y pg_dump 16.3 +- Instancia RDS destino ejecutando PostgreSQL 14.11 +- La operación de restauración falla debido a incompatibilidad de formato + +**Configuración Relevante:** + +- Versión PostgreSQL origen: 16.1 +- Versión pg_dump: 16.3 +- Versión PostgreSQL RDS destino: 14.11 +- Formato de respaldo: Formato personalizado (probablemente) + +**Condiciones de Error:** + +- El error ocurre al intentar restaurar archivos de respaldo +- Sucede cuando la versión de pg_dump es más nueva que la de pg_restore +- Afecta respaldos en formato personalizado creados con versiones más nuevas de PostgreSQL + +## Solución Detallada + + + +La compatibilidad de respaldos en PostgreSQL sigue estas reglas: + +- **pg_dump**: Puede crear respaldos desde bases de datos de versiones iguales o anteriores +- **pg_restore**: Puede restaurar respaldos creados por versiones iguales o anteriores de pg_dump +- **Versiones de formato**: Cada versión de PostgreSQL puede introducir nuevas versiones de formato de respaldo + +En tu caso: + +- pg_dump 16.3 creó un respaldo con versión de formato 1.15 +- pg_restore de PostgreSQL 14.11 no soporta la versión de formato 1.15 + + + + + +### Opción 1: Usar formato de texto plano + +Crea un nuevo respaldo usando formato de texto plano que es más compatible: + +```bash +# Crear respaldo en texto plano (compatible entre versiones) +pg_dump -h host-origen -U usuario -d nombre_base -f respaldo.sql + +# Restaurar usando psql (funciona con cualquier versión de PostgreSQL) +psql -h endpoint-rds -U usuario -d base_destino -f respaldo.sql +``` + +### Opción 2: Usar versión compatible de pg_dump + +Usa pg_dump de PostgreSQL 14.x para crear el respaldo: + +```bash +# Instalar herramientas cliente de PostgreSQL 14 +sudo apt-get install postgresql-client-14 + +# Crear respaldo con versión compatible 
+/usr/lib/postgresql/14/bin/pg_dump -h host-origen -U usuario -d nombre_base -Fc -f respaldo_v14.dump + +# Restaurar con pg_restore +pg_restore -h endpoint-rds -U usuario -d base_destino respaldo_v14.dump +``` + + + + + +### Planificación de la actualización de RDS a PostgreSQL 16 + +**Requisitos previos:** + +1. Revisar la compatibilidad de PostgreSQL 16 con tus aplicaciones +2. Probar la actualización en un entorno de pruebas +3. Planificar tiempo de inactividad (las actualizaciones mayores requieren reinicio) + +**Pasos para la actualización:** + +1. **Crear snapshot de RDS:** + +```bash +aws rds create-db-snapshot \ + --db-instance-identifier tu-instancia-rds \ + --db-snapshot-identifier snapshot-pre-actualizacion-$(date +%Y%m%d) +``` + +2. **Modificar instancia RDS:** + +```bash +aws rds modify-db-instance \ + --db-instance-identifier tu-instancia-rds \ + --engine-version 16.1 \ + --apply-immediately +``` + +3. **Monitorear progreso de la actualización:** + +```bash +aws rds describe-db-instances \ + --db-instance-identifier tu-instancia-rds \ + --query 'DBInstances[0].DBInstanceStatus' +``` + +**Consideraciones:** + +- Ruta de actualización: 14.11 → 15.x → 16.1 (puede requerir versiones intermedias) +- Tiempo de inactividad: 10-30 minutos según tamaño de la base +- Costo: No hay costo adicional por actualizaciones mayores + + + + + +### Actualizar versión de PostgreSQL en SleakOps + +Si usas PostgreSQL gestionado por SleakOps: + +1. **Acceder a configuración de la base de datos:** + + - Ve al panel de tu proyecto + - Navega a **Dependencias** → **Bases de datos** + - Selecciona tu instancia PostgreSQL + +2. **Actualizar versión:** + +```yaml +# En tu sleakops.yaml o configuración de base de datos +dependencies: + databases: + - name: main-db + type: postgresql + version: "16.1" # Actualizar desde 14.11 + instance_class: db.t3.micro + storage: 20 +``` + +3. 
**Aplicar cambios:**
+
+```bash
+sleakops deploy --environment production
+```
+
+**Nota:** Esto disparará una actualización de base de datos con tiempo de inactividad asociado.
+
+
+
+
+
+### Buenas prácticas para la estrategia de respaldos
+
+1. **Respaldos conscientes de versión:**
+
+```bash
+# Siempre especificar formato y compatibilidad de versión
+pg_dump --version # Verifica la versión de pg_dump
+pg_dump -Fc --no-owner --no-privileges -f respaldo_$(date +%Y%m%d).dump nombre_base
+```
+
+2. **Respaldos en múltiples formatos:**
+
+```bash
+# Crear respaldos en formato personalizado y texto plano
+pg_dump -Fc -f respaldo_custom.dump nombre_base
+pg_dump -f respaldo_plano.sql nombre_base
+```
+
+3. **Pruebas regulares de compatibilidad:**
+
+- Probar procedimientos de restauración en entornos de pruebas
+- Mantener versiones compatibles de pg_dump para diferentes entornos destino
+- Documentar procedimientos de respaldo y restauración
+
+4. **Gestión de versiones:**
+
+- Mantener sincronizadas las versiones de PostgreSQL entre entornos cuando sea posible
+- Planificar ciclos regulares de actualización para evitar grandes saltos de versión
+- Usar herramientas pg_dump en contenedores para versiones consistentes
+
+
+
+---
+
+_Este FAQ fue generado automáticamente el 20 de agosto de 2024 basado en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup-guide.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup-guide.mdx
new file mode 100644
index 000000000..ca5c359b3
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup-guide.mdx
@@ -0,0 +1,687 @@
+---
+sidebar_position: 3
+title: "Guía de Configuración del Entorno de Producción"
+description: "Guía paso a paso para crear y configurar entornos de producción en SleakOps"
+date: "2024-12-19"
+category: "proyecto"
+tags: 
["producción", "entorno", "despliegue", "configuración", "ajustes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Guía de Configuración del Entorno de Producción + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Producción, Entorno, Despliegue, Configuración, Ajustes + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan crear un entorno de producción en SleakOps para desplegar aplicaciones desde la rama principal, separado de los entornos de desarrollo que típicamente usan ramas develop. + +**Síntomas Observados:** + +- Necesidad de crear un nuevo entorno de producción desde cero +- Requisito de separar los despliegues de producción de los de desarrollo +- Necesidad de migrar o replicar configuraciones existentes de desarrollo +- Requisitos de migración de base de datos para datos de producción + +**Configuración Relevante:** + +- Tipo de entorno: Producción (entorno raíz) +- Estrategia de ramas: rama main para producción vs develop para staging +- Base de datos: PostgreSQL con posibles necesidades de migración de datos +- Configuración de dominio: dominios específicos para producción + +**Condiciones de Error:** + +- Falta de un proceso claro paso a paso para la configuración de producción +- Incertidumbre sobre la replicación de recursos desde desarrollo +- Complejidad en la migración de base de datos + +## Solución Detallada + + + +### Creando el Entorno de Producción + +1. **Navegar a Entornos** + + - Ve a tu panel de control de SleakOps + - Haz clic en **Entornos** en la barra lateral + +2. **Crear Nuevo Entorno** + + - Haz clic en **"Crear Entorno"** + - Pon el nombre como `prod` o `producción` + - **Importante**: Marca este entorno como **"Entorno Raíz"** + +3. 
**Configurar Ajustes de Dominio** + - Configura tus dominios de producción + - Configura certificados SSL si es necesario + - Asegúrate de que las configuraciones DNS apunten a tu infraestructura de producción + +### Ejemplo de Configuración del Entorno + +```yaml +name: prod +type: root +domain_config: + primary_domain: "myapp.com" + ssl_enabled: true + certificate_type: "letsencrypt" +branch_strategy: + default_branch: "main" + auto_deploy: true +``` + + + + + +### Proceso de Replicación de Recursos + +Necesitas recrear todos los recursos de tu entorno de desarrollo: + +#### Proyectos + +1. **Crear Nuevo Proyecto** + - Usa la misma configuración que en desarrollo + - Cambia la rama de `develop` a `main` + - Actualiza las variables de entorno para producción + +#### Dependencias + +1. **Dependencias de Base de Datos** + + - Crea instancias de PostgreSQL para producción + - Configura con un tamaño apropiado para producción + - Configura respaldo y monitoreo + +2. **Dependencias de Caché** + - Redis u otras soluciones de caché + - Configura con especificaciones para producción + +#### Ejecuciones (Workloads) + +1. **Servicios Web** + + - Replica las configuraciones de servicios web + - Ajusta los límites de recursos para la carga de producción + - Configura chequeos de salud y monitoreo + +2. **Servicios Worker** + - Procesadores de trabajos en segundo plano + - Tareas programadas y cron jobs + +#### Grupos de Variables + +1. 
**Variables de Entorno** + - Crea grupos de variables específicos para producción + - Actualiza claves API, conexiones de base de datos + - Establece flags de características para producción + +### Lista de Verificación para Configuración de Producción + +- [ ] Entorno creado y marcado como raíz +- [ ] Proyectos configurados con rama main +- [ ] Dependencias de base de datos creadas +- [ ] Dependencias de caché configuradas +- [ ] Servicios web replicados +- [ ] Servicios worker configurados +- [ ] Grupos de variables para producción creados +- [ ] Dominio y SSL configurados +- [ ] Monitoreo y alertas habilitados + + + + + +### Proceso de Importación de Volcado de Base de Datos + +Si necesitas migrar datos desde desarrollo a producción: + +#### Requisitos Previos + +1. **Crear Volcado de Base de Datos** + + ```bash + pg_dump -h dev-db-host -U username -d database_name > production_dump.sql + ``` + +2. **Preparar Base de Datos de Producción** + - Asegúrate de que la instancia PostgreSQL de producción esté en ejecución + - Verifica las credenciales de conexión + - Asegura espacio de almacenamiento suficiente + +#### Proceso de Importación + +1. **Acceder a la Función de Importación de Base de Datos** + + - Ve a tu dependencia PostgreSQL en el entorno de producción + - Busca la opción "Importar Base de Datos" o "Restaurar" + +2. 
**Subir e Importar** + - Sube tu archivo de volcado + - Ejecuta el proceso de importación + - Monitorea el progreso de la importación + +#### Verificación Post-Importación + +```sql +-- Verificar integridad de datos +SELECT COUNT(*) FROM your_main_table; + +-- Revisar cuentas de usuario +SELECT COUNT(*) FROM users; + +-- Verificar datos específicos de la aplicación +SELECT * FROM configuration_table LIMIT 5; +``` + +### Consideraciones Importantes + +- **Saneamiento de Datos**: Elimina o anonimiza datos sensibles de desarrollo +- **Secretos de Producción**: Actualiza todas las claves API y contraseñas +- **Flags de Características**: Desactiva características solo para desarrollo +- **Configuración de Email**: Asegura la configuración de correo para producción + + + + + +### Lista de Verificación Pre-Producción + +#### Verificación Técnica + +- [ ] Todos los servicios están saludables y en ejecución +- [ ] Las conexiones a base de datos funcionan +- [ ] Integraciones con APIs externas configuradas +- [ ] Certificados SSL válidos +- [ ] Enrutamiento de dominio correcto +- [ ] Monitoreo y registro activos + +#### Lista de Seguridad + +- [ ] Secretos de producción actualizados +- [ ] Modos debug de desarrollo desactivados +- [ ] Controles de acceso configurados correctamente +- [ ] Procedimientos de respaldo establecidos +- [ ] Escaneo de seguridad completado + +### Proceso de Despliegue + +1. **Despliegue Inicial** + + ```bash + # Asegúrate que la rama main está lista + git checkout main + git pull origin main + + # Despliega a través de SleakOps + # Esto se activará automáticamente según tu configuración + ``` + +2. **Pruebas Iniciales (Smoke Testing)** + + - Prueba los flujos críticos de usuario + - Verifica la conectividad con la base de datos + - Revisa integraciones con servicios externos + - Valida monitoreo y alertas + +3. 
**Preparación para Puesta en Producción** + - Programa ventana de mantenimiento si es necesario + - Prepara procedimientos de reversión + - Configura monitoreo en tiempo real + - Notifica a los interesados sobre el cronograma de puesta en producción + +### Monitoreo en Producción + +```yaml +# Ejemplo de configuración de monitoreo +monitoring: + health_checks: + - endpoint: "/health" + interval: "30s" + timeout: "5s" + alerts: + - type: "response_time" + threshold: "2s" + - type: "error_rate" + threshold: "5%" + logging: + level: "info" + retention: "30d" +``` + + + + + +### Guías para Dimensionamiento de Recursos + +#### Base de Datos + +- **Desarrollo**: t3.micro o t3.small +- **Producción**: t3.medium o mayor según la carga +- Habilitar respaldos automáticos +- Configurar réplicas de lectura si es necesario + +#### Servicios de Aplicación + +- **CPU**: Comenzar con 2x los recursos de desarrollo +- **Memoria**: Mínimo 2GB para aplicaciones web +- **Réplicas**: Mínimo 2 para alta disponibilidad +- **Autoescalado**: Configurar HPA para manejo de carga + +#### Configuración de Ejemplo + +```yaml +resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" +replicas: + min: 2 + max: 10 +autoscaling: + enabled: true + targetCPU: 70 + targetMemory: 80 +``` + +### Configuración de Seguridad + +#### Variables de Entorno Seguras + +```bash +# Usar secretos seguros para producción +DATABASE_URL=postgresql://prod_user:secure_password@prod-db:5432/prod_db +API_SECRET_KEY=production_secret_key_here +JWT_SECRET=production_jwt_secret +ENCRYPTION_KEY=production_encryption_key + +# Configuraciones específicas de producción +NODE_ENV=production +DEBUG=false +LOG_LEVEL=info +``` + +#### Configuración de Red + +- Usar HTTPS únicamente +- Configurar CORS apropiadamente +- Implementar rate limiting +- Configurar firewall y políticas de red + +### Procedimientos de Respaldo + +#### Respaldo de Base de Datos + +```bash +# Configurar respaldos 
automáticos diarios +pg_dump -h prod-db-host -U username -d database_name | gzip > backup_$(date +%Y%m%d).sql.gz + +# Configurar retención de respaldos +find /backups -name "backup_*.sql.gz" -mtime +30 -delete +``` + +#### Respaldo de Configuración + +- Versionar todas las configuraciones en Git +- Documentar cambios de configuración +- Mantener inventario de recursos + + + + + +### Métricas Clave a Monitorear + +#### Métricas de Aplicación + +```yaml +# Configuración de métricas en Grafana +metrics: + - name: "response_time" + query: "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" + threshold: 2.0 + - name: "error_rate" + query: "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])" + threshold: 0.05 + - name: "throughput" + query: "rate(http_requests_total[5m])" + threshold: 100 +``` + +#### Métricas de Infraestructura + +```yaml +infrastructure_alerts: + - name: "high_cpu_usage" + query: "cpu_usage_percent > 80" + duration: "5m" + - name: "high_memory_usage" + query: "memory_usage_percent > 85" + duration: "5m" + - name: "disk_space_low" + query: "disk_usage_percent > 90" + duration: "2m" +``` + +### Configuración de Alertas + +#### Alertas Críticas + +```yaml +critical_alerts: + - name: "service_down" + condition: "up == 0" + notification: "immediate" + channels: ["slack", "email", "sms"] + - name: "database_connection_failed" + condition: "db_connections_failed > 0" + notification: "immediate" + channels: ["slack", "email"] +``` + +#### Alertas de Advertencia + +```yaml +warning_alerts: + - name: "high_response_time" + condition: "response_time > 1.5" + duration: "10m" + channels: ["slack"] + - name: "increased_error_rate" + condition: "error_rate > 0.02" + duration: "5m" + channels: ["slack"] +``` + +### Dashboard de Producción + +```json +{ + "dashboard": { + "title": "Production Environment Overview", + "panels": [ + { + "title": "Service Health", + "type": "stat", + "targets": [ + { + "expr": 
"up{job='production-app'}", + "legendFormat": "{{instance}}" + } + ] + }, + { + "title": "Request Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total[5m])", + "legendFormat": "Requests/sec" + } + ] + }, + { + "title": "Error Rate", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m]) * 100", + "legendFormat": "Error %" + } + ] + }, + { + "title": "Database Connections", + "type": "graph", + "targets": [ + { + "expr": "pg_stat_database_numbackends", + "legendFormat": "Active Connections" + } + ] + } + ] + } +} +``` + + + + + +### Pipeline de CI/CD para Producción + +#### Configuración de GitHub Actions + +```yaml +name: Production Deployment +on: + push: + branches: [main] + workflow_dispatch: + +jobs: + deploy-production: + runs-on: ubuntu-latest + environment: production + steps: + - uses: actions/checkout@v3 + + - name: Run Tests + run: | + npm test + npm run test:integration + + - name: Security Scan + run: | + npm audit + npm run security:scan + + - name: Build Application + run: | + npm run build:production + + - name: Deploy to SleakOps + env: + SLEAKOPS_API_KEY: ${{ secrets.SLEAKOPS_API_KEY }} + run: | + # Trigger deployment via SleakOps API + curl -X POST "https://api.sleakops.com/deploy" \ + -H "Authorization: Bearer $SLEAKOPS_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"environment": "production", "branch": "main"}' +``` + +#### Estrategia de Despliegue Blue-Green + +```yaml +deployment_strategy: + type: "blue-green" + health_check: + endpoint: "/health" + timeout: "30s" + retries: 3 + rollback: + automatic: true + threshold: "error_rate > 5%" + traffic_shifting: + initial: 10 + increment: 25 + interval: "5m" +``` + +### Procedimientos de Rollback + +#### Rollback Automático + +```bash +#!/bin/bash +# Script de rollback automático + +CURRENT_VERSION=$(kubectl get deployment production-app -o jsonpath='{.metadata.labels.version}') 
+PREVIOUS_VERSION=$(kubectl rollout history deployment/production-app | tail -n 2 | head -n 1 | awk '{print $1}')
+
+echo "Rolling back from version $CURRENT_VERSION to $PREVIOUS_VERSION"
+
+kubectl rollout undo deployment/production-app
+kubectl rollout status deployment/production-app --timeout=300s
+
+if [ $? -eq 0 ]; then
+  echo "Rollback successful"
+  # Notificar al equipo
+  curl -X POST $SLACK_WEBHOOK -d '{"text":"Production rollback completed successfully"}'
+else
+  echo "Rollback failed"
+  exit 1
+fi
+```
+
+#### Rollback Manual
+
+```bash
+# Listar versiones disponibles
+kubectl rollout history deployment/production-app
+
+# Rollback a versión específica
+kubectl rollout undo deployment/production-app --to-revision=2
+
+# Verificar estado
+kubectl rollout status deployment/production-app
+```
+
+
+
+
+
+### Configuración de Cache
+
+#### Redis para Cache de Aplicación
+
+```yaml
+redis_config:
+  host: "production-redis"
+  port: 6379
+  password: "${REDIS_PASSWORD}"
+  db: 0
+  max_connections: 100
+  timeout: 5000
+  retry_attempts: 3
+```
+
+#### Cache de Base de Datos
+
+```sql
+-- Configurar parámetros de PostgreSQL para producción
+ALTER SYSTEM SET shared_buffers = '256MB';
+ALTER SYSTEM SET effective_cache_size = '1GB';
+ALTER SYSTEM SET maintenance_work_mem = '64MB';
+ALTER SYSTEM SET checkpoint_completion_target = 0.9;
+ALTER SYSTEM SET wal_buffers = '16MB';
+ALTER SYSTEM SET default_statistics_target = 100;
+
+-- Nota: pg_reload_conf() solo aplica los parámetros recargables;
+-- shared_buffers y wal_buffers requieren reiniciar el servidor.
+SELECT pg_reload_conf();
+```
+
+### Optimización de Consultas
+
+#### Índices de Base de Datos
+
+```sql
+-- Crear índices para consultas frecuentes
+CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
+CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at);
+CREATE INDEX CONCURRENTLY idx_products_category_id ON products(category_id);
+
+-- Índices compuestos para consultas complejas
+CREATE INDEX CONCURRENTLY idx_orders_user_status ON orders(user_id, status);
+```
+
+#### Configuración de Connection Pool
+
+```javascript
+// 
Configuración de pool de conexiones para producción
+const { Pool } = require("pg"); // requiere el paquete pg (node-postgres)
+
+const pool = new Pool({
+  host: process.env.DB_HOST,
+  port: process.env.DB_PORT,
+  database: process.env.DB_NAME,
+  user: process.env.DB_USER,
+  password: process.env.DB_PASSWORD,
+  max: 20, // máximo número de conexiones
+  idleTimeoutMillis: 30000,
+  connectionTimeoutMillis: 2000,
+});
+```
+
+### CDN y Optimización de Assets
+
+```yaml
+# Configuración de CDN
+cdn_config:
+  provider: "cloudflare"
+  cache_ttl: "1h"
+  compression: true
+  minification: true
+  image_optimization: true
+```
+
+
+
+## Lista de Verificación Final
+
+### Pre-Lanzamiento
+
+- [ ] Entorno de producción creado y configurado
+- [ ] Todos los recursos replicados desde desarrollo
+- [ ] Base de datos migrada y verificada
+- [ ] Variables de entorno de producción configuradas
+- [ ] Certificados SSL instalados y verificados
+- [ ] Monitoreo y alertas configurados
+- [ ] Respaldos automáticos configurados
+- [ ] Pruebas de carga realizadas
+- [ ] Procedimientos de rollback probados
+- [ ] Documentación actualizada
+
+### Post-Lanzamiento
+
+- [ ] Monitoreo activo durante las primeras 24 horas
+- [ ] Verificación de métricas de rendimiento
+- [ ] Confirmación de funcionamiento de alertas
+- [ ] Revisión de logs para errores
+- [ ] Validación de respaldos
+- [ ] Comunicación con stakeholders
+- [ ] Documentación de lecciones aprendidas
+
+## Mejores Prácticas
+
+### Seguridad
+
+1. **Principio de menor privilegio** para accesos
+2. **Rotación regular** de secretos y contraseñas
+3. **Auditoría** de accesos y cambios
+4. **Encriptación** de datos en tránsito y reposo
+
+### Operaciones
+
+1. **Monitoreo proactivo** con alertas configuradas
+2. **Respaldos regulares** y pruebas de restauración
+3. **Documentación actualizada** de procedimientos
+4. **Revisiones regulares** de rendimiento y costos
+
+### Desarrollo
+
+1. **Feature flags** para control de funcionalidades
+2. **Despliegues graduales** con validación
+3. 
**Pruebas automatizadas** en pipeline CI/CD +4. **Revisión de código** obligatoria para cambios + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup.mdx new file mode 100644 index 000000000..84d295d25 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-environment-setup.mdx @@ -0,0 +1,254 @@ +--- +sidebar_position: 3 +title: "Guía de Configuración del Entorno de Producción" +description: "Guía completa para configurar entornos de producción con migración de bases de datos y dominios personalizados" +date: "2024-01-15" +category: "proyecto" +tags: + [ + "producción", + "despliegue", + "base de datos", + "migración", + "dominio", + "entorno", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Guía de Configuración del Entorno de Producción + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Producción, Despliegue, Base de datos, Migración, Dominio, Entorno + +## Descripción del Problema + +**Contexto:** El usuario necesita configurar un entorno de producción replicando la configuración de staging, incluyendo dependencias y grupos de variables, mientras gestiona la migración de base de datos y la configuración de dominio personalizado. 
+ +**Síntomas Observados:** + +- Necesidad de replicar la configuración del entorno staging a producción +- Requisito de modificar la variable RAILS_ENV de staging a producción +- Necesidad de procedimientos de volcado/restauración de base de datos para despliegues fuera de horario +- Configuración de dominio personalizado necesaria para acceso en producción + +**Configuración Relevante:** + +- Entorno origen: Staging +- Entorno destino: Producción +- Framework: Aplicación Rails +- Dominio personalizado: hub.supra.social +- Base de datos: Requiere capacidad de volcado/restauración + +**Condiciones de Error:** + +- Configuración manual del entorno requerida +- Migración de base de datos necesaria durante horas no laborables +- Configuración de dominio para acceso en producción + +## Solución Detallada + + + +Para replicar la configuración de tu entorno staging: + +1. **Accede al Panel de SleakOps** + + - Navega a tu proyecto + - Ve a la sección **Entornos** + +2. **Crear Entorno de Producción** + + ```bash + # Usando la CLI de SleakOps + sleakops env create production --from-template staging + ``` + +3. **Copiar Dependencias** + + - En el panel, ve a la pestaña **Dependencias** + - Selecciona todas las dependencias de staging + - Haz clic en **Clonar al Entorno** → Selecciona **Producción** + +4. 
**Copiar Grupos de Variables**
+   - Navega a **Variables** → **Grupos**
+   - Selecciona el grupo de variables de staging
+   - Haz clic en **Duplicar** → Nómbralo para producción
+   - **Importante**: Cambia `RAILS_ENV` de `staging` a `production`
+
+
+
+
+
+### Creando Volcado de Base de Datos
+
+```bash
+# Conectarse al pod de base de datos de staging
+kubectl exec -it <pod-db-staging> -- bash
+
+# Crear volcado (ejemplo PostgreSQL)
+pg_dump -U <usuario> -h localhost <base-de-datos> > /tmp/production_dump.sql
+
+# Copiar volcado a máquina local
+kubectl cp <pod-db-staging>:/tmp/production_dump.sql ./production_dump.sql
+```
+
+### Restaurando en Base de Datos de Producción
+
+```bash
+# Copiar volcado al pod de base de datos de producción
+kubectl cp ./production_dump.sql <pod-db-produccion>:/tmp/production_dump.sql
+
+# Conectarse al pod de base de datos de producción
+kubectl exec -it <pod-db-produccion> -- bash
+
+# Restaurar base de datos
+psql -U <usuario> -h localhost <base-de-datos> < /tmp/production_dump.sql
+```
+
+### Script Automatizado para Despliegue Fuera de Horario
+
+```bash
+#!/bin/bash
+# production-deploy.sh
+
+set -e
+
+echo "Iniciando migración de base de datos en producción a las $(date)"
+
+# Paso 1: Crear respaldo de la base de datos actual en producción
+# (sin -it, para no corromper la salida del volcado con caracteres de terminal)
+echo "Creando respaldo de producción..."
+kubectl exec <pod-db-produccion> -- pg_dump -U <usuario> <base-de-datos> > prod_backup_$(date +%Y%m%d_%H%M%S).sql
+
+# Paso 2: Aplicar nuevo volcado
+echo "Aplicando nuevo volcado de base de datos..."
+kubectl cp ./production_dump.sql <pod-db-produccion>:/tmp/
+kubectl exec -i <pod-db-produccion> -- psql -U <usuario> <base-de-datos> < /tmp/production_dump.sql
+
+# Paso 3: Reiniciar pods de la aplicación
+echo "Reiniciando aplicación..."
+kubectl rollout restart deployment/<nombre-app> -n <namespace>
+
+echo "Migración completada a las $(date)"
+```
+
+
+
+
+
+### Configurar Dominio Personalizado en SleakOps
+
+1. **En el Panel de SleakOps:**
+
+   - Ve al **Entorno de Producción**
+   - Navega a **Red** → **Dominios**
+   - Haz clic en **Agregar Dominio Personalizado**
+   - Ingresa: `hub.supra.social`
+
+2. **Configuración DNS:**
+
+   ```dns
+   # Agrega un registro CNAME en tu proveedor DNS
+   hub.supra.social. 
CNAME <dns-del-load-balancer>
+   ```
+
+3. **Certificado SSL:**
+   - SleakOps aprovisionará automáticamente el certificado SSL
+   - Espera la validación del certificado (usualmente 5-10 minutos)
+
+### Verificar Configuración del Dominio
+
+```bash
+# Verificar resolución DNS
+nslookup hub.supra.social
+
+# Probar acceso HTTPS
+curl -I https://hub.supra.social
+
+# Verificar certificado
+openssl s_client -connect hub.supra.social:443 -servername hub.supra.social
+```
+
+
+
+
+
+### Pre-Despliegue (Antes de las 7 PM)
+
+- [ ] Verificar que el entorno staging esté estable
+- [ ] Crear volcado de base de datos desde staging
+- [ ] Probar restauración del volcado en desarrollo
+- [ ] Preparar script de despliegue
+- [ ] Notificar a interesados sobre ventana de mantenimiento
+
+### Durante el Despliegue (7 PM - Horas No Laborables)
+
+- [ ] Crear respaldo de base de datos en producción
+- [ ] Aplicar nuevo volcado de base de datos
+- [ ] Actualizar variables de entorno (RAILS_ENV=production)
+- [ ] Desplegar aplicación con nueva configuración
+- [ ] Verificar accesibilidad del dominio
+- [ ] Ejecutar pruebas básicas
+
+### Post-Despliegue
+
+- [ ] Monitorear logs de la aplicación
+- [ ] Verificar funcionamiento de todas las funcionalidades
+- [ ] Revisar métricas de rendimiento
+- [ ] Notificar al equipo sobre despliegue exitoso
+
+### Plan de Reversión (Si ocurren problemas)
+
+```bash
+# Restaurar respaldo previo de base de datos
+kubectl cp prod_backup_<marca-de-tiempo>.sql <pod-db-produccion>:/tmp/
+kubectl exec -i <pod-db-produccion> -- psql -U <usuario> <base-de-datos> < /tmp/prod_backup_<marca-de-tiempo>.sql
+
+# Revertir a versión anterior de la aplicación
+kubectl rollout undo deployment/<nombre-app> -n <namespace>
+```
+
+
+
+
+
+### Problemas de Conexión a la Base de Datos
+
+```bash
+# Verificar estado del pod de base de datos
+kubectl get pods -l app=database
+
+# Revisar logs de la base de datos
+kubectl logs <pod-base-de-datos>
+
+# Probar conectividad a base de datos
+kubectl exec -it <pod-aplicacion> -- rails db:migrate:status
+```
+
+### Dominio No Accesible
+
+1. 
**Verificar propagación DNS:**
+
+   ```bash
+   dig hub.supra.social
+   ```
+
+2. **Verificar configuración de ingress:**
+
+   ```bash
+   kubectl get ingress -n <namespace>
+   kubectl describe ingress <nombre-ingress>
+   ```
+
+3. **Comprobar estado del certificado:**
+   ```bash
+   kubectl get certificates -n <namespace>
+   ```
+
+
+
+---
+
+_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._
diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-pods-not-starting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-pods-not-starting.mdx
new file mode 100644
index 000000000..6948dde39
--- /dev/null
+++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-pods-not-starting.mdx
@@ -0,0 +1,245 @@
+---
+sidebar_position: 1
+title: "Sitio de Producción Caído - Pods No Inician"
+description: "Guía de solución de problemas para interrupciones en producción cuando los pods no logran iniciar"
+date: "2024-01-15"
+category: "workload"
+tags:
+  ["producción", "pods", "interrupción", "solución de problemas", "kubernetes"]
+---
+
+import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem";
+
+# Sitio de Producción Caído - Pods No Inician
+
+**Fecha:** 15 de enero de 2024
+**Categoría:** Carga de trabajo
+**Etiquetas:** Producción, Pods, Interrupción, Solución de problemas, Kubernetes
+
+## Descripción del Problema
+
+**Contexto:** El sitio web de producción está experimentando una interrupción total con pods que no logran iniciar o reiniciar correctamente en el clúster de Kubernetes. 
+
+**Síntomas Observados:**
+
+- El sitio de producción está completamente caído
+- Los pods no inician ni se ponen en línea
+- Los servicios de la aplicación no están disponibles
+- Los usuarios no pueden acceder al entorno de producción
+
+**Configuración Relevante:**
+
+- Entorno: Producción
+- Plataforma: Clúster Kubernetes
+- Alcance del problema: Interrupción total del sitio
+- Estado de los pods: Fallo al iniciar
+
+**Condiciones de Error:**
+
+- Ocurre en entorno de producción
+- Afecta a todos o la mayoría de los pods de la aplicación
+- Resulta en indisponibilidad total del servicio
+- Puede indicar problemas a nivel de clúster
+
+## Solución Detallada
+
+
+
+**Paso 1: Verificar Estado de los Pods**
+
+```bash
+# Ver todos los pods en el namespace
+kubectl get pods -n <namespace>
+
+# Obtener información detallada de los pods
+kubectl describe pods -n <namespace>
+
+# Revisar logs de los pods
+kubectl logs <nombre-pod> -n <namespace> --previous
+```
+
+**Paso 2: Verificar Estado de los Nodos**
+
+```bash
+# Verificar salud de los nodos
+kubectl get nodes
+
+# Revisar recursos de los nodos
+kubectl top nodes
+
+# Describir nodos problemáticos
+kubectl describe node <nombre-nodo>
+```
+
+
+
+
+
+**Agotamiento de Recursos:**
+
+- Verificar si los nodos tienen suficiente CPU/Memoria
+- Comprobar capacidad de almacenamiento
+- Revisar si se están excediendo cuotas de recursos
+
+**Problemas al Descargar Imágenes:**
+
+```bash
+# Verificar si las imágenes pueden descargarse
+kubectl describe pod <nombre-pod> | grep -i "image"
+
+# Verificar conectividad con el registro de imágenes
+kubectl get events --sort-by=.metadata.creationTimestamp
+```
+
+**Problemas de Configuración:**
+
+- Revisar ConfigMaps y Secrets
+- Verificar variables de entorno
+- Validar permisos de cuentas de servicio
+
+**Problemas de Red:**
+
+- Probar resolución DNS del clúster
+- Verificar conectividad de servicios
+- Comprobar estado del controlador de ingreso
+
+
+
+
+
+**1. 
Revisar Eventos del Clúster**
+
+```bash
+# Obtener eventos recientes del clúster
+kubectl get events --sort-by=.metadata.creationTimestamp -A
+
+# Filtrar eventos de error
+kubectl get events --field-selector type=Warning -A
+```
+
+**2. Verificar Pods Críticos del Sistema**
+
+```bash
+# Ver pods en kube-system
+kubectl get pods -n kube-system
+
+# Ver controlador de ingreso
+kubectl get pods -n ingress-nginx
+
+# Ver pila de monitoreo
+kubectl get pods -n monitoring
+```
+
+**3. Verificar Disponibilidad de Recursos**
+
+```bash
+# Revisar capacidad de nodos
+kubectl describe nodes | grep -A 5 "Allocated resources"
+
+# Ver volúmenes persistentes
+kubectl get pv,pvc -A
+```
+
+
+
+
+
+**Pasos Inmediatos de Recuperación:**
+
+1. **Reiniciar Despliegues**
+
+```bash
+# Reiniciar despliegue específico
+kubectl rollout restart deployment/<nombre-despliegue> -n <namespace>
+
+# Reiniciar todos los despliegues en el namespace
+kubectl get deployments -n <namespace> -o name | xargs -I {} kubectl rollout restart {} -n <namespace>
+```
+
+2. **Escalar Recursos si es Necesario**
+
+```bash
+# Escalar despliegue
+kubectl scale deployment/<nombre-despliegue> --replicas=3 -n <namespace>
+
+# Añadir más nodos si se usa autoscaler de clúster
+kubectl get nodes --show-labels
+```
+
+3. **Eliminar Pods Fallidos**
+
+```bash
+# Borrar pods fallidos para forzar recreación
+kubectl delete pods --field-selector=status.phase=Failed -n <namespace>
+
+# Borrar pods atascados en estado pendiente
+kubectl delete pods --field-selector=status.phase=Pending -n <namespace>
+```
+
+
+
+
+
+**Configurar Alertas de Monitoreo:**
+
+1. **Monitoreo de Salud de Pods**
+
+```yaml
+# Ejemplo de regla de alerta Prometheus
+- alert: PodsNotReady
+  expr: kube_pod_status_ready{condition="false"} > 0
+  for: 5m
+  labels:
+    severity: critical
+  annotations:
+    summary: "Pods no listos en {{ $labels.namespace }}"
+```
+
+2. 
**Monitoreo de Recursos** + +```yaml +- alert: NodeResourceExhaustion + expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1 + for: 2m + labels: + severity: warning +``` + +**Mejores Prácticas:** + +- Implementar chequeos de salud y probes de readiness +- Establecer solicitudes y límites adecuados de recursos +- Usar autoscaling horizontal de pods +- Mantener entorno de staging para pruebas +- Realizar respaldos regulares de configuraciones críticas + + + + + +**Escalar inmediatamente si:** + +- Múltiples nodos están caídos +- El plano de control del clúster no responde +- Se sospecha corrupción de datos +- Se detecta una brecha de seguridad + +**Antes de escalar, recopilar:** + +- Estado del clúster +- Historial reciente de despliegues +- Logs de error y eventos +- Métricas de uso de recursos +- Línea de tiempo de inicio del problema + +**Contactos de Emergencia:** + +- Equipo de plataforma para problemas a nivel clúster +- Equipo de infraestructura para problemas de nodo/red +- Equipo de aplicaciones para problemas específicos de servicios + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-troubleshooting.mdx new file mode 100644 index 000000000..cdd672bcf --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/production-site-down-troubleshooting.mdx @@ -0,0 +1,239 @@ +--- +sidebar_position: 1 +title: "Sitio de Producción Caído - Resolución de Emergencia" +description: "Procedimientos de emergencia para manejar interrupciones en el sitio de producción y problemas de despliegue" +date: "2025-01-27" +category: "general" +tags: ["producción", "interrupción", "despliegue", "emergencia", "resolución"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Sitio de Producción Caído - Resolución de Emergencia + +**Fecha:** 27 de enero de 2025 +**Categoría:** General +**Etiquetas:** Producción, Interrupción, Despliegue, Emergencia, Resolución + +## Descripción del Problema + +**Contexto:** El sitio de producción está experimentando tiempo de inactividad que requiere atención inmediata y procedimientos de despliegue de emergencia a través de la plataforma SleakOps. + +**Síntomas Observados:** + +- Caída completa del sitio ("Se nos cayó el sitio") +- Estado de sitio en construcción +- Problemas urgentes en el entorno de producción +- Necesidad de coordinación para despliegue de emergencia +- Varios miembros del equipo involucrados para la resolución + +**Configuración Relevante:** + +- Entorno: Producción +- Plataforma: SleakOps con integración Bitbucket +- Método de despliegue: pipelines de Bitbucket +- Coordinación de equipo: requerida para cambios en producción + +**Condiciones de Error:** + +- Sitio de producción completamente caído +- Despliegue de emergencia necesario +- Múltiples interesados involucrados en la resolución +- Restauración sensible al tiempo requerida + +## Solución Detallada + + + +Al enfrentar una interrupción en producción: + +1. **Verificar el alcance de la interrupción**: + + - Comprobar si es una falla total del sitio o funcionalidad parcial + - Verificar resolución DNS y conectividad básica + - Revisar paneles de monitoreo para alertas + +2. **Recopilar información inicial**: + + - Despliegues o cambios recientes + - Registros de errores de las aplicaciones + - Estado de infraestructura (pods, servicios, ingress) + +3. **Establecer comunicación**: + - Crear canal de comunicación de emergencia + - Notificar a todos los interesados relevantes + - Documentar la línea de tiempo de eventos + + + + + +Para despliegues de emergencia en SleakOps: + +1. 
**Coordinar con el equipo**: + + ```bash + # Antes de cualquier despliegue en producción, asegurar coordinación del equipo + # Usar canales de comunicación para anunciar el despliegue + ``` + +2. **Desplegar a través de SleakOps**: + + - Acceder al panel de SleakOps + - Navegar al proyecto afectado + - Seleccionar el entorno de producción + - Elegir "Deploy" con la última versión conocida estable + +3. **Despliegue alternativo vía Bitbucket**: + + ```yaml + # Si se despliega mediante pipelines de Bitbucket + # Asegurar que el pipeline apunte al entorno correcto + pipelines: + branches: + main: + - step: + name: Despliegue de Emergencia a Producción + script: + - echo "Despliegue de emergencia iniciado" + - # Tus comandos de despliegue aquí + ``` + +4. **Monitorear progreso del despliegue**: + - Observar logs de despliegue en tiempo real + - Monitorear chequeos de salud de la aplicación + - Verificar restauración del servicio + + + + + +Si el despliegue de emergencia no resuelve el problema: + +1. **Reversión inmediata**: + + - En SleakOps: usar la función "Rollback" a la versión estable anterior + - Documentar la decisión y el momento de la reversión + +2. **Verificar éxito de la reversión**: + + ```bash + # Verificar estado de la aplicación + kubectl get pods -n production + kubectl get services -n production + + # Verificar salud de la aplicación + curl -I https://tu-sitio-produccion.com + ``` + +3. **Acciones post-reversión**: + - Confirmar la funcionalidad del sitio + - Notificar a los interesados sobre la resolución temporal + - Iniciar análisis de causa raíz + + + + + +Coordinación efectiva durante incidentes en producción: + +1. **Establecer un comandante de incidente**: + + - Designar a una persona para coordinar esfuerzos + - Todas las decisiones de despliegue deben pasar por esta persona + - Mantener canales de comunicación claros + +2. 
**Asignación de roles**: + + - **Comandante de Incidente**: Coordinación general + - **Líder Técnico**: Resolución técnica directa + - **Comunicación**: Actualizaciones a interesados + - **Documentación**: Línea de tiempo y acciones tomadas + +3. **Protocolos de comunicación**: + + - Usar videollamadas para coordinación en tiempo real + - Mantener actualizaciones escritas en canales compartidos + - Establecer intervalos regulares de actualización (cada 15-30 minutos) + +4. **Toma de decisiones**: + - Consenso rápido sobre acciones de despliegue + - Decisiones claras de continuar o detener + - Documentar todas las decisiones importantes con marca de tiempo + + + + + +Después de resolver la interrupción en producción: + +1. **Verificación inmediata**: + + - Pruebas integrales de funcionalidad + - Monitoreo de desempeño durante 1-2 horas + - Verificación de aceptación por usuarios + +2. **Documentación**: + + - Línea de tiempo completa del incidente + - Análisis de causa raíz + - Acciones tomadas y sus resultados + - Lecciones aprendidas + +3. **Mejoras posteriores**: + + - Revisar procedimientos de despliegue + - Mejorar monitoreo y alertas + - Actualizar procedimientos de respuesta a emergencias + - Programar reunión post-mortem + +4. **Comunicación a interesados**: + - Enviar notificación de resolución + - Proporcionar resumen breve de causa y solución + - Compartir cronograma para post-mortem detallado + + + + + +Para minimizar futuras interrupciones en producción: + +1. **Mejores prácticas de despliegue**: + + - Siempre usar entorno de staging primero + - Implementar despliegues blue-green + - Mantener procedimientos de reversión + - Usar feature flags para cambios riesgosos + +2. **Monitoreo y alertas**: + + ```yaml + # Ejemplo de configuración de monitoreo + alerts: + - name: sitio-caido + condition: http_status != 200 + duration: 1m + severity: critical + notifications: + - slack + - email + ``` + +3. 
**Preparación para emergencias**: + + - Mantener lista de contactos de emergencia actualizada + - Realizar simulacros regulares de recuperación ante desastres + - Canales de comunicación predefinidos + - Documentación de procedimientos de despliegue de emergencia + +4. **Preparación del equipo**: + - Capacitación cruzada en procedimientos de despliegue + - Permisos de acceso para miembros clave + - Procedimientos de escalamiento de emergencia + - Revisión regular de planes de respuesta a incidentes + + + +--- + +_Esta FAQ fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-branch-name-validation-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-branch-name-validation-error.mdx new file mode 100644 index 000000000..444890109 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-branch-name-validation-error.mdx @@ -0,0 +1,155 @@ +--- +sidebar_position: 3 +title: "Error de Creación de Proyecto con Nombres de Ramas que Contienen Caracteres Especiales" +description: "Solución para errores de validación de nombres de ramas al crear proyectos desde repositorios" +date: "2025-02-19" +category: "proyecto" +tags: ["proyecto", "repositorio", "rama", "validación", "git"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Creación de Proyecto con Nombres de Ramas que Contienen Caracteres Especiales + +**Fecha:** 19 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Proyecto, Repositorio, Rama, Validación, Git + +## Descripción del Problema + +**Contexto:** El usuario intenta crear un nuevo proyecto en SleakOps basado en un repositorio Git pero encuentra errores de validación cuando el nombre de la rama contiene caracteres especiales o patrones de nombres específicos. 
+ +**Síntomas Observados:** + +- Error en la interfaz al intentar crear un proyecto desde un repositorio +- El error ocurre específicamente con nombres de ramas que contienen barras diagonales (/) +- Formato del nombre de rama: `feature/SITE-1` genera error de validación +- El proceso de creación del proyecto falla en el paso de selección de rama + +**Configuración Relevante:** + +- Nombre de rama: `feature/SITE-1` +- Acción: Creación de nuevo proyecto desde repositorio +- Plataforma: Interfaz web de SleakOps +- Tipo de repositorio: Repositorio Git + +**Condiciones de Error:** + +- El error aparece durante el asistente de creación de proyecto +- Ocurre cuando los nombres de ramas contienen barras diagonales (`/`) +- El error de validación impide completar la creación del proyecto +- El problema afecta la selección de rama en la interfaz + +## Solución Detallada + + + +Mientras el equipo de SleakOps trabaja en corregir el problema de validación, puedes usar esta solución temporal: + +1. **Cambia temporalmente el nombre de la rama** en tu repositorio: + + - Renombra `feature/SITE-1` a algo como `feature-SITE-1` o `featureSITE1` + - Usa solo caracteres alfanuméricos y guiones/guiones bajos + +2. **Crea el proyecto** con la rama renombrada + +3. **Usa especificación manual de compilación** cuando sea necesario: + - Durante el proceso de compilación, puedes especificar manualmente el nombre original de la rama + - Esto te permite trabajar con la estructura original de tus ramas + + + + + +Cuando necesites compilar desde tu rama original: + +1. Ve a la sección **Build** de tu proyecto +2. Selecciona la opción **Manual Build** +3. En la configuración de compilación, especifica manualmente el nombre de la rama: + ``` + Branch: feature/SITE-1 + ``` +4. Procede con el proceso de compilación + +**Nota:** Esta solución funciona para compilaciones manuales pero no funcionará para integración continua hasta que se implemente la corrección en la plataforma. 
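Como referencia, la solución temporal de renombrado descrita arriba puede aplicarse con comandos estándar de Git. El siguiente ejemplo usa un repositorio desechable y nombres de rama puramente ilustrativos; en tu proyecto trabajarías sobre el repositorio real:

```shell
# Repositorio desechable solo para ilustrar el renombrado
rm -rf /tmp/demo-rename
git init -q /tmp/demo-rename
cd /tmp/demo-rename
git config user.email "demo@example.com"
git config user.name "Demo"
git commit -q --allow-empty -m "commit inicial"

# Rama original con barra diagonal (la que provoca el error de validación)
git checkout -q -b feature/SITE-1

# Renombrarla a un formato compatible con SleakOps
git branch -m feature/SITE-1 feature-SITE-1

# Confirmar la rama activa
git branch --show-current   # imprime: feature-SITE-1
```

En un repositorio con remoto, después del renombrado publicarías la rama nueva (`git push -u origin feature-SITE-1`) y, si corresponde, eliminarías la antigua del remoto (`git push origin --delete feature/SITE-1`).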
+ + + + + +Para evitar problemas similares en el futuro, considera estas convenciones para nombrar ramas: + +**Formatos recomendados:** + +- `feature-SITE-1` +- `feature_SITE_1` +- `featureSITE1` +- `site1-feature` + +**Evita estos caracteres en los nombres de ramas:** + +- Barras diagonales (`/`) +- Caracteres especiales que puedan causar problemas de codificación URL +- Espacios u otros caracteres de espacio en blanco + +**Nombres de ramas Git que funcionan bien con SleakOps:** + +```bash +# Buenas prácticas +git checkout -b feature-user-authentication +git checkout -b bugfix-login-error +git checkout -b hotfix-security-patch + +# Evitar estos patrones +git checkout -b feature/user-authentication +git checkout -b bug/fix-login +``` + + + + + +**Limitaciones actuales:** + +- La integración continua no funcionará con la solución temporal +- Las compilaciones automatizadas fallarán hasta que se despliegue la corrección en la plataforma +- Se requiere intervención manual para cada compilación + +**Cuando la corrección esté disponible:** + +- Se soportarán nombres de ramas con barras diagonales +- La integración continua retomará su funcionamiento normal +- No serán necesarios cambios en la configuración existente del proyecto + +**Cronograma:** + +- El equipo de SleakOps ha añadido este problema a su backlog +- La prioridad de la corrección depende del calendario de despliegue y las necesidades de los usuarios +- Contacta al soporte si necesitas que esto se resuelva antes del despliegue en producción + + + + + +Si usas frecuentemente ramas de características con barras diagonales, considera este flujo de trabajo: + +1. **Mantén tus ramas de desarrollo** con el nombre original +2. **Crea ramas de despliegue** con nombres compatibles con SleakOps: + + ```bash + # Tu rama de desarrollo + git checkout feature/SITE-1 + + # Crear rama de despliegue + git checkout -b deploy-SITE-1 + git push origin deploy-SITE-1 + ``` + +3. 
**Usa las ramas de despliegue** para proyectos en SleakOps +4. **Fusiona los cambios** de las ramas de características a las ramas de despliegue según sea necesario + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-build-branch-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-build-branch-configuration.mdx new file mode 100644 index 000000000..3647e1ced --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-build-branch-configuration.mdx @@ -0,0 +1,160 @@ +--- +sidebar_position: 3 +title: "Configuración de la Rama de Construcción en Proyectos" +description: "Comprendiendo cómo SleakOps maneja la selección de ramas durante los procesos de construcción" +date: "2024-07-23" +category: "proyecto" +tags: ["construcción", "rama", "git", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de la Rama de Construcción en Proyectos + +**Fecha:** 23 de julio de 2024 +**Categoría:** Proyecto +**Etiquetas:** Construcción, Rama, Git, Configuración + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender cómo SleakOps determina qué rama de Git usar durante el proceso de construcción cuando no se especifica una rama concreta en la configuración de construcción. 
+ +**Síntomas Observados:** + +- Incertidumbre sobre qué rama se usa cuando no se especifica explícitamente +- Las construcciones pueden usar ramas inesperadas +- Necesidad de aclarar el comportamiento de la rama por defecto + +**Configuración Relevante:** + +- Configuración de rama por defecto del proyecto +- Configuración de construcción sin especificación explícita de rama +- Configuración del repositorio Git + +**Condiciones de Error:** + +- Construcciones usando una rama incorrecta cuando las expectativas difieren +- Confusión acerca de la lógica de selección de ramas + +## Solución Detallada + + + +Cuando no se define una rama específica en la configuración de construcción, SleakOps sigue este comportamiento: + +**Selección de Rama por Defecto:** + +- El proceso de construcción usa automáticamente la **rama por defecto definida en la configuración del proyecto** +- Esto asegura un comportamiento consistente en todas las construcciones dentro del proyecto +- La rama por defecto del proyecto tiene prioridad sobre las ramas por defecto del repositorio + + + + + +Para configurar o verificar la rama por defecto de tu proyecto: + +1. Navega a la **Configuración del Proyecto** +2. Ve a la sección de **Código Fuente** o **Repositorio** +3. Busca la configuración de **Rama por Defecto** +4. Establécela en la rama que prefieras (por ejemplo, `main`, `master`, `develop`) + +```yaml +# Ejemplo de configuración del proyecto +project: + name: "mi-aplicacion" + repository: + url: "https://github.com/company/my-app" + default_branch: "main" # Esta rama se usará cuando no se especifique otra +``` + + + + + +Puedes sobrescribir la rama por defecto para construcciones específicas: + +**Construcciones Manuales:** + +1. Ve a la sección de **Construcción** en tu proyecto +2. Haz clic en **Nueva Construcción** o **Disparar Construcción** +3. 
Especifica la rama deseada en el campo de rama + +**Construcciones Automatizadas:** + +```yaml +# En tu configuración de construcción +build: + trigger: + branch: "feature/nueva-funcionalidad" # Sobrescribe la rama por defecto + steps: + - name: "build" + command: "npm run build" +``` + + + + + +**Prácticas Recomendadas:** + +1. **Establecer ramas por defecto claras**: Usa `main` o `master` como rama por defecto de tu proyecto +2. **Documentar la estrategia de ramas**: Asegúrate de que los miembros del equipo entiendan el modelo de ramas +3. **Usar ramas específicas para cada entorno**: + - `main` para construcciones de producción + - `develop` para construcciones de preproducción + - Ramas de características para pruebas + +**Ejemplo de Configuración de Entornos:** + +```yaml +environments: + production: + branch: "main" + auto_deploy: true + staging: + branch: "develop" + auto_deploy: true + development: + branch: "*" # Permite cualquier rama + auto_deploy: false +``` + + + + + +**Problemas Comunes y Soluciones:** + +1. **Se está construyendo la rama equivocada:** + + - Verifica la configuración de la rama por defecto del proyecto + - Asegúrate de que la configuración de construcción no tenga especificaciones de rama conflictivas + +2. **La construcción falla porque no se encuentra la rama:** + + - Asegúrate de que la rama especificada exista en el repositorio + - Verifica la ortografía y sensibilidad a mayúsculas/minúsculas del nombre de la rama + +3. 
**Comportamiento inconsistente en la construcción:** + - Revisa todas las configuraciones de construcción para ajustes explícitos de rama + - Estandariza los nombres de las ramas en todos los entornos + +**Pasos de Verificación:** + +```bash +# Ver configuración actual del proyecto +sleakops project show + +# Listar ramas disponibles +sleakops project branches + +# Ver historial de construcciones con información de ramas +sleakops builds list --project --show-branches +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx new file mode 100644 index 000000000..92e33db6b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-ci-pipeline-deployment-not-triggering.mdx @@ -0,0 +1,227 @@ +--- +sidebar_position: 15 +title: "El pipeline de CI no se activa al hacer push en la rama" +description: "Solución para despliegues que no se ejecutan al hacer push en la rama configurada" +date: "2024-03-18" +category: "proyecto" +tags: ["ci", "pipeline", "despliegue", "github", "staging"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# El pipeline de CI no se activa al hacer push en la rama + +**Fecha:** 18 de marzo de 2024 +**Categoría:** Proyecto +**Etiquetas:** CI, Pipeline, Despliegue, GitHub, Staging + +## Descripción del problema + +**Contexto:** El usuario ha configurado un proyecto con una rama específica (staging) para despliegues, pero al hacer push de código en esa rama, el pipeline de CI/CD no se ejecuta y los despliegues no se activan. 
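Una causa frecuente de este comportamiento es que la rama no figure entre los triggers del workflow. Esta comprobación rápida ilustra la idea (boceto sobre el formato habitual de lista de GitHub Actions; no sustituye a un parser YAML real):

```bash
# Boceto: comprueba si una rama aparece como elemento de lista ("- rama")
# en un archivo de workflow. Supone el formato típico bajo "branches:".
rama_en_triggers() {
  local rama="$1" archivo="$2"
  if grep -Eq "^[[:space:]]*-[[:space:]]*${rama}[[:space:]]*\$" "$archivo"; then
    echo "si"
  else
    echo "no"
  fi
}
```

Si la función devuelve `no` para la rama a la que haces push, el pipeline nunca se activará, sin importar lo bien configurado que esté el resto del proyecto.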
+ +**Síntomas observados:** + +- El push a la rama staging no dispara despliegues +- No hay ejecución visible del pipeline en GitHub Actions +- El proyecto parece estar configurado correctamente con la rama deseada +- El proceso de despliegue automático esperado no funciona + +**Configuración relevante:** + +- Rama objetivo: `staging` +- Plataforma: GitHub con GitHub Actions +- Ubicación de archivos CI/CD: `.github/workflows/` +- Configuración del proyecto en SleakOps + +**Condiciones de error:** + +- El pipeline no se activa tras el push a la rama staging +- No se muestran mensajes de error inicialmente +- La automatización del despliegue no funciona como se espera + +## Solución detallada + + + +Primero, asegúrate de que tu proyecto está configurado con la rama correcta: + +1. Ve a tu **Panel de Proyecto de SleakOps** +2. Navega a **Configuración del Proyecto** +3. Revisa la sección **Configuración del Pipeline** +4. Verifica que la **Rama Objetivo** esté establecida en `staging` +5. Guarda los cambios si realizaste alguna modificación + +La configuración debe coincidir con la rama a la que haces push. + + + + + +Verifica que los archivos del workflow de GitHub Actions estén correctamente estructurados: + +1. Comprueba que los archivos existan en el directorio `.github/workflows/` +2. Asegúrate de que los archivos tengan extensión `.yml` o `.yaml` +3. 
Verifica que el workflow se active en la rama correcta + +Ejemplo de configuración del workflow: + +```yaml +name: Despliegue SleakOps + +on: + push: + branches: + - staging # Asegúrate que coincida con tu rama configurada + pull_request: + branches: + - staging + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - name: Checkout del código + uses: actions/checkout@v3 + + - name: Desplegar en SleakOps + uses: sleakops/deploy-action@v1 + with: + api-key: ${{ secrets.SLEAKOPS_API_KEY }} + project-id: ${{ secrets.SLEAKOPS_PROJECT_ID }} +``` + + + + + +Aquí tienes un ejemplo completo de archivo workflow que puedes usar tal cual para SleakOps: + +```yaml +# .github/workflows/sleakops-deploy.yml +name: Pipeline CI/CD de SleakOps + +on: + push: + branches: + - staging + - main + pull_request: + branches: + - staging + - main + +env: + SLEAKOPS_API_URL: https://api.sleakops.com + +jobs: + build-and-deploy: + runs-on: ubuntu-latest + + steps: + - name: Checkout del repositorio + uses: actions/checkout@v3 + + - name: Configurar Node.js + uses: actions/setup-node@v3 + with: + node-version: "18" + cache: "npm" + + - name: Instalar dependencias + run: npm ci + + - name: Ejecutar pruebas + run: npm test + + - name: Construir aplicación + run: npm run build + + - name: Desplegar en SleakOps + uses: sleakops/deploy-action@v1 + with: + api-key: ${{ secrets.SLEAKOPS_API_KEY }} + project-id: ${{ secrets.SLEAKOPS_PROJECT_ID }} + environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }} + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} +``` + +Guarda este archivo como `.github/workflows/sleakops-deploy.yml` en tu repositorio. + + + + + +Después de fusionar un PR a staging, verifica la ejecución de la acción: + +1. Ve a tu **repositorio en GitHub** +2. Haz clic en la pestaña **Actions** +3. Busca ejecuciones del workflow después de tu push/merge reciente +4. Haz clic en cualquier ejecución fallida para ver los registros detallados de error +5. 
Revisa la sección **Jobs** para fallos específicos en pasos + +Problemas comunes a revisar: + +- Secretos faltantes (SLEAKOPS_API_KEY, SLEAKOPS_PROJECT_ID) +- Nombres de ramas incorrectos en los triggers del workflow +- Errores de sintaxis en archivos YAML +- Problemas de permisos con tokens de GitHub + + + + + +Si el pipeline aún no se activa: + +1. **Verifica permisos del repositorio:** + + - Asegúrate que GitHub Actions esté habilitado para tu repositorio + - Revisa permisos del workflow en la configuración del repositorio + +2. **Valida sintaxis YAML:** + + ```bash + # Usa un validador YAML o el verificador integrado de GitHub + yamllint .github/workflows/sleakops-deploy.yml + ``` + +3. **Prueba con un workflow simple:** + + ```yaml + name: Workflow de prueba + on: + push: + branches: [staging] + jobs: + test: + runs-on: ubuntu-latest + steps: + - run: echo "Pipeline activado correctamente" + ``` + +4. **Verifica integración con SleakOps:** + - Confirma que las claves API estén correctamente configuradas en los Secrets de GitHub + - Asegúrate que el ID del proyecto coincida con tu proyecto en SleakOps + - Revisa el panel de SleakOps para detectar problemas de integración + + + + + +Asegúrate de que los siguientes secretos estén configurados en tu repositorio de GitHub: + +1. Ve a **Configuración del repositorio** → **Secrets and variables** → **Actions** +2. Añade los siguientes secretos del repositorio: + +| Nombre del secreto | Descripción | Dónde encontrar | +| --------------------- | ---------------------------- | ---------------------------------------------- | +| `SLEAKOPS_API_KEY` | Tu clave API de SleakOps | Panel de SleakOps → Configuración → Claves API | +| `SLEAKOPS_PROJECT_ID` | Identificador de tu proyecto | Proyecto SleakOps → Configuración → General | + +3. 
Verifica que los secretos sean accesibles en tu workflow revisando los logs de Actions + + + +--- + +_Esta FAQ fue generada automáticamente el 18 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-process.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-process.mdx new file mode 100644 index 000000000..7360e7610 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-process.mdx @@ -0,0 +1,195 @@ +--- +sidebar_position: 15 +title: "Cómo Eliminar un Proyecto en SleakOps" +description: "Guía completa para eliminar proyectos de forma segura y entender qué se elimina" +date: "2024-03-13" +category: "proyecto" +tags: ["proyecto", "eliminación", "gestión", "limpieza"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Cómo Eliminar un Proyecto en SleakOps + +**Fecha:** 13 de marzo de 2024 +**Categoría:** Proyecto +**Etiquetas:** Proyecto, Eliminación, Gestión, Limpieza + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender cómo eliminar correctamente proyectos en SleakOps y qué componentes se ven afectados cuando un proyecto es eliminado. 
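Un punto crítico antes de eliminar es identificar los recursos compartidos. Este boceto ilustra la idea de cruzar los grupos de variables de dos proyectos (lógica puramente ilustrativa; SleakOps no expone esta función como comando):

```bash
# Boceto: devuelve los vargroups presentes en ambas listas (separadas por espacios).
# Un vargroup compartido NO debería eliminarse junto con el proyecto.
vargroups_compartidos() {
  local resultado=""
  for vg in $1; do
    case " $2 " in
      *" $vg "*) resultado="$resultado $vg" ;;
    esac
  done
  echo "${resultado# }"
}

vargroups_compartidos "db-creds cache-cfg app-cfg" "db-creds smtp-cfg"   # → db-creds
```

Los nombres de vargroups del ejemplo son hipotéticos; la comprobación real se hace revisando la sección de grupos de variables de cada proyecto en la interfaz.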
+ +**Síntomas Observados:** + +- No se encuentra la opción de eliminación de proyecto en la interfaz +- Incertidumbre sobre qué se elimina junto con el proyecto +- Necesidad de limpiar proyectos con problemas de nombres o deuda técnica +- Preguntas sobre el impacto en recursos asociados + +**Configuración Relevante:** + +- El proyecto contiene: argumentos Docker, cargas de trabajo, dependencias, grupos de variables (vargroups) +- Recursos asociados: configuraciones específicas del entorno +- Dependencias: conexiones a bases de datos, servicios externos +- Grupos de variables: compartidos entre proyectos o específicos de cargas de trabajo + +**Condiciones de Error:** + +- Opción de eliminación no visible en los lugares esperados +- Incertidumbre sobre eliminaciones en cascada +- Riesgo de perder datos importantes de configuración + +## Solución Detallada + + + +Antes de eliminar un proyecto, asegúrate de documentar los siguientes componentes: + +**1. Argumentos Docker** + +- Anota todos los argumentos docker personalizados configurados para el proyecto +- Documenta cualquier configuración especial de compilación + +**2. Cargas de Trabajo** + +- Lista todas las cargas de trabajo dentro del proyecto +- Documenta sus configuraciones y ajustes + +**3. Dependencias** + +- Registra todas las dependencias del proyecto (bases de datos, cachés, etc.) +- Anota las cadenas de conexión y configuraciones + +**4. Grupos de Variables (VarGroups)** + +- Documenta los vargroups asociados al proyecto +- Anota los vargroups asociados a cargas de trabajo específicas +- Verifica si algún vargroup está compartido con otros proyectos + + + + + +Para eliminar un proyecto en SleakOps: + +1. Navega a tu **Panel del Proyecto** +2. Ve a **Configuración del Proyecto** +3. Haz clic en **Configuración General** +4. 
Desplázate hacia abajo hasta encontrar el botón **Eliminar Proyecto** + +**Nota:** El botón de eliminación generalmente se encuentra al final de la página de Configuración General y puede estar estilizado con un color rojo o de advertencia. + + + + + +Cuando eliminas un proyecto, los siguientes componentes son **eliminados automáticamente**: + +**✅ Eliminados con el Proyecto:** + +- El proyecto en sí +- Todas las cargas de trabajo dentro del proyecto +- Dependencias específicas del proyecto +- Grupos de variables asociados exclusivamente al proyecto +- Grupos de variables asociados a cargas de trabajo en el proyecto +- Configuraciones docker específicas del proyecto +- Configuraciones de entorno para este proyecto + +**⚠️ Potencialmente Afectados:** + +- Grupos de variables compartidos (si son usados por otros proyectos) +- Dependencias externas (bases de datos, servicios) - estos pueden requerir limpieza manual + +**❌ No Eliminados:** + +- Infraestructura del clúster +- Otros proyectos en el mismo clúster +- Recursos compartidos usados por múltiples proyectos + + + + + +**Paso 1: Respaldar Datos Importantes** + +```bash +# Exportar configuración del proyecto (si está disponible vía CLI) +sleakops project export --project-name tu-nombre-de-proyecto +``` + +**Paso 2: Documentar Dependencias** + +- Toma capturas de pantalla de las configuraciones de dependencias +- Anota las cadenas de conexión a bases de datos +- Documenta cualquier variable de entorno personalizada + +**Paso 3: Verificar Recursos Compartidos** + +- Verifica qué vargroups están compartidos con otros proyectos +- Asegúrate de que ningún otro proyecto dependa de los recursos de este proyecto + +**Paso 4: Realizar la Eliminación** + +1. Ve a **Configuración del Proyecto > Configuración General** +2. Desplázate hasta el final de la página +3. Haz clic en **Eliminar Proyecto** +4. Confirma la eliminación cuando se te solicite +5. 
Espera a que el proceso de eliminación se complete + +**Paso 5: Verificar Limpieza** + +- Comprueba que el proyecto ya no aparece en tu lista de proyectos +- Verifica que las cargas de trabajo asociadas hayan sido eliminadas +- Confirma que los vargroups exclusivos han sido eliminados + + + + + +Si estás eliminando un proyecto para recrearlo (por ejemplo, para corregir problemas de nombres): + +**Convenciones de Nombres:** + +- Evita sufijos redundantes (por ejemplo, no uses nombre del proyecto y ambiente en el nombre) +- Usa nombres claros y descriptivos: `admin` en lugar de `sostengo-admin-prod-prod` +- Recuerda que los sufijos de ambiente se agregan automáticamente + +**Gestión de Configuración:** + +- Usa las configuraciones documentadas del proyecto eliminado +- Aplica las lecciones aprendidas para evitar deuda técnica +- Considera usar enfoques de Infraestructura como Código para futuros proyectos + +**Consideraciones de Tiempo:** + +- Planifica la eliminación y recreación durante ventanas de mantenimiento +- Asegura que todos los miembros del equipo estén informados de los cambios +- Ten planes de reversión en caso de problemas + + + + + +**Si no encuentras la opción de eliminación:** + +- Verifica tus permisos de usuario - puede que necesites acceso de administrador +- Asegúrate de estar en el contexto correcto del proyecto +- Intenta refrescar la página o limpiar la caché del navegador + +**Si la eliminación falla:** + +- Revisa si hay despliegues activos que deban detenerse primero +- Verifica que no haya cargas de trabajo en ejecución +- Contacta soporte si el proyecto parece estar atascado en estado de eliminación + +**Después de la eliminación:** + +- Si los recursos parecen seguir existiendo, espera unos minutos para la propagación +- Revisa los registros de actividad para el estado de eliminación +- Contacta soporte si la limpieza parece incompleta + + + +--- + +_Esta FAQ fue generada automáticamente el 13 de marzo de 2024 basada en una consulta real 
de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-s3-bucket-cleanup.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-s3-bucket-cleanup.mdx new file mode 100644 index 000000000..7745cf027 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-s3-bucket-cleanup.mdx @@ -0,0 +1,127 @@ +--- +sidebar_position: 3 +title: "Problema de Eliminación de Proyecto Atascado - Problema de Limpieza de Bucket S3" +description: "Solución para proyectos atascados en estado 'pendiente de eliminación' debido a problemas con la limpieza de buckets S3" +date: "2024-10-31" +category: "proyecto" +tags: ["s3", "eliminación", "bucket", "limpieza", "aws"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Eliminación de Proyecto Atascado - Problema de Limpieza de Bucket S3 + +**Fecha:** 31 de octubre de 2024 +**Categoría:** Proyecto +**Etiquetas:** S3, Eliminación, Bucket, Limpieza, AWS + +## Descripción del Problema + +**Contexto:** Al intentar eliminar un proyecto en SleakOps, el proyecto queda atascado en estado "pendiente de eliminación" debido a problemas con los procesos de limpieza del bucket S3. 
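La raíz del problema es el volumen: S3 exige vaciar el bucket antes de poder eliminarlo, y la API `DeleteObjects` acepta como máximo 1000 claves por solicitud, por lo que las eliminaciones masivas se procesan por lotes. Un cálculo sencillo del número de solicitudes necesarias:

```bash
# Número de solicitudes por lotes necesarias para vaciar un bucket,
# dado el total de objetos y el tamaño máximo de lote (1000 en S3 DeleteObjects).
lotes_necesarios() {
  local total="$1" tam_lote="$2"
  echo $(( (total + tam_lote - 1) / tam_lote ))
}

lotes_necesarios 2500000 1000   # → 2500 solicitudes para 2,5 millones de objetos
```

Con millones de objetos, miles de solicitudes encadenadas explican fácilmente los tiempos de espera y los límites de tasa descritos más abajo.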
+ +**Síntomas Observados:** + +- El proyecto permanece indefinidamente en estado "pendiente de eliminación" +- El proceso de eliminación parece colgarse o agotar el tiempo +- Los buckets S3 asociados al proyecto no se limpian correctamente +- El error ocurre específicamente con buckets que contienen un gran número de objetos + +**Configuración Relevante:** + +- Tipo de proyecto: Cualquier proyecto con almacenamiento S3 asociado +- Buckets AWS S3 con un conteo significativo de objetos +- Procesos automatizados de limpieza de SleakOps + +**Condiciones de Error:** + +- Ocurre durante el proceso de eliminación del proyecto +- Sucede cuando los buckets S3 contienen muchos objetos +- El proceso de limpieza no maneja grandes volúmenes de objetos +- La eliminación se queda atascada antes de completarse + +## Solución Detallada + + + +El problema ocurre porque AWS S3 requiere que todos los objetos sean eliminados de un bucket antes de que el bucket mismo pueda ser eliminado. Cuando un bucket contiene un gran número de objetos, el proceso de limpieza puede: + +1. **Agotar el tiempo**: El proceso de eliminación puede exceder los límites de tiempo +2. **Límite de tasa**: Se pueden alcanzar los límites de tasa de la API de AWS durante eliminaciones masivas +3. **Problemas de memoria**: Procesar demasiados objetos a la vez puede causar problemas de memoria +4. **Limpieza incompleta**: Algunos objetos pueden quedar, impidiendo la eliminación del bucket + + + + + +Para comprobar si esto está afectando a tu proyecto: + +1. **Accede a la Consola AWS** +2. **Navega al servicio S3** +3. **Busca buckets** relacionados con el nombre de tu proyecto +4. **Revisa el conteo de objetos** en cada bucket +5. 
**Busca buckets** que deberían haber sido eliminados pero que aún existen + +```bash +# Usando AWS CLI para revisar el contenido del bucket +aws s3 ls s3://nombre-de-tu-bucket-del-proyecto --recursive --summarize +``` + + + + + +El equipo de SleakOps ha implementado correcciones para este problema: + +1. **Procesamiento por lotes mejorado**: Los objetos ahora se eliminan en lotes más pequeños y manejables +2. **Manejo de errores mejorado**: Mejores mecanismos de reintento para eliminaciones fallidas +3. **Gestión de tiempos de espera**: Tiempos de espera extendidos para la limpieza de buckets grandes +4. **Seguimiento del progreso**: Mejor monitoreo del progreso de la limpieza + +Si tu proyecto está actualmente atascado: + +- **Contacta soporte**: Reporta el proyecto atascado mediante un ticket de soporte +- **Proporciona el nombre del proyecto**: Incluye el nombre exacto del proyecto que muestra "pendiente de eliminación" +- **Espera la resolución**: El equipo completará manualmente el proceso de limpieza + + + + + +Para evitar este problema en futuras eliminaciones de proyectos: + +1. **Limpieza regular**: Limpia periódicamente archivos innecesarios en tu proyecto +2. **Políticas de ciclo de vida**: Implementa políticas de ciclo de vida de S3 para eliminar automáticamente objetos antiguos +3. **Monitoreo de almacenamiento**: Controla el conteo de objetos en los buckets S3 de tu proyecto +4. **Eliminación escalonada**: Para proyectos con grandes cantidades de datos, considera limpieza manual antes de la eliminación + +```yaml +# Ejemplo de política de ciclo de vida de S3 +LifecycleConfiguration: + Rules: + - Id: DeleteOldObjects + Status: Enabled + Filter: + Prefix: temp/ + Expiration: + Days: 30 +``` + + + + + +Mientras esperas que se complete una eliminación atascada: + +1. **Revisa el estado del proyecto** regularmente en el panel de SleakOps +2. **Monitorea AWS CloudTrail** para eventos de eliminación en S3 (si tienes acceso) +3. 
**Atento a notificaciones por correo** del equipo de SleakOps +4. **Evita reintentar** el proceso de eliminación mientras se está resolviendo + +**Nota**: El proceso de resolución puede tomar algo de tiempo dependiendo del número de objetos que deben ser limpiados. + + + +--- + +_Este FAQ fue generado automáticamente el 1 de noviembre de 2024 basado en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-stuck-deleting-state.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-stuck-deleting-state.mdx new file mode 100644 index 000000000..4826028d2 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-deletion-stuck-deleting-state.mdx @@ -0,0 +1,117 @@ +--- +sidebar_position: 3 +title: "Proyecto Atrapado en Estado de Eliminación" +description: "Solución para proyectos que permanecen en estado 'eliminando' después de la eliminación" +date: "2024-12-19" +category: "proyecto" +tags: ["proyecto", "eliminación", "solución de problemas", "ui"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Proyecto Atrapado en Estado de Eliminación + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Proyecto, Eliminación, Solución de problemas, UI + +## Descripción del Problema + +**Contexto:** El usuario intenta eliminar un proyecto a través de la interfaz de la plataforma SleakOps, pero el proyecto permanece visible en la vista de Proyectos con un estado "eliminando" indefinidamente. 
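Para distinguir un retraso normal de un estado realmente atascado, es útil sondear con un límite de intentos en lugar de refrescar sin criterio. Boceto ilustrativo (la función de consulta es un marcador de posición; en la práctica sería una llamada a la API o una revisión del panel):

```bash
# Boceto: reintenta la consulta de estado hasta "max_intentos" veces.
# En un uso real se añadiría una pausa (p. ej., "sleep 60") entre intentos.
esperar_eliminacion() {
  local max_intentos="$1"; shift
  local i=0
  while [ "$i" -lt "$max_intentos" ]; do
    if [ "$("$@")" = "eliminado" ]; then
      echo "ok"
      return 0
    fi
    i=$((i + 1))
  done
  echo "timeout"
}
```

Si tras un número razonable de intentos (unos 30 minutos en total) el resultado sigue siendo `timeout`, corresponde pasar a los pasos de verificación y contacto con soporte.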
+ +**Síntomas observados:** + +- El proyecto parece eliminarse con éxito +- El proyecto sigue visible en la lista de Proyectos +- El estado del proyecto muestra "eliminando" de forma permanente +- La interfaz no refleja la finalización real de la eliminación + +**Configuración relevante:** + +- Nombre del proyecto: Puede afectar a cualquier proyecto +- Plataforma: Interfaz web de SleakOps +- Acción: Eliminación de proyecto a través de la UI + +**Condiciones de error:** + +- Ocurre después de iniciar la eliminación del proyecto +- El estado permanece en "eliminando" indefinidamente +- Los recursos del proyecto pueden estar realmente eliminados pero el estado de la UI persiste +- Refrescar no resuelve el estado + +## Solución Detallada + + + +Cuando eliminas un proyecto en SleakOps, el sistema realiza varias operaciones de limpieza: + +1. **Limpieza de recursos**: Elimina todos los recursos en la nube asociados (clústeres, almacenamiento, etc.) +2. **Limpieza de base de datos**: Elimina los registros del proyecto de la base de datos de la plataforma +3. **Actualización del estado de la UI**: Actualiza la interfaz para reflejar la eliminación + +A veces, la actualización del estado en la UI puede retrasarse respecto a la limpieza real de recursos, causando que el estado "eliminando" persista. + + + + + +Si tu proyecto está atrapado en estado "eliminando": + +1. **Espera a que termine**: Proyectos grandes pueden tardar entre 10 y 15 minutos en eliminarse completamente +2. **Refresca la página**: Usa Ctrl+F5 (o Cmd+Shift+R en Mac) para un refresco completo +3. **Limpia la caché del navegador**: Borra la caché y las cookies para el dominio de SleakOps +4. **Verifica en modo incógnito/privado**: Abre SleakOps en una ventana incógnito para verificar el estado + + + + + +Para confirmar si tu proyecto fue realmente eliminado: + +1. 
**Revisa la consola del proveedor de la nube**: + + - AWS: Verifica que no queden recursos en tu cuenta + - Azure: Comprueba que los grupos de recursos estén eliminados + - GCP: Confirma que los recursos del proyecto estén limpiados + +2. **Intenta crear un nuevo proyecto** con el mismo nombre: + + - Si tiene éxito, el proyecto antiguo fue realmente eliminado + - Si falla por conflicto de nombre, el proyecto puede seguir existiendo + +3. **Contacta soporte** si el problema persiste después de 30 minutos + + + + + +Para evitar este problema en el futuro: + +1. **Asegura una conexión estable**: Mantén una conexión a internet estable durante la eliminación +2. **No cierres el navegador**: Mantén la pestaña del navegador abierta hasta que la eliminación finalice +3. **Elimina en horas de baja carga**: Realiza eliminaciones cuando la carga del sistema sea menor +4. **Monitorea la limpieza de recursos**: Revisa la consola de tu proveedor en la nube para confirmar la limpieza + + + + + +Contacta al soporte de SleakOps si: + +- El proyecto permanece en estado "eliminando" por más de 30 minutos +- Los recursos en la nube no se están limpiando +- No puedes crear un nuevo proyecto con el mismo nombre +- El proyecto reaparece después de parecer eliminado + +Proporciona esta información al contactar soporte: + +- Nombre del proyecto +- Hora en que se inició la eliminación +- Capturas de pantalla del estado actual +- Cualquier mensaje de error recibido + + + +--- + +_Este FAQ fue generado automáticamente el 19 de diciembre de 2024 basado en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dependency-deletion-behavior.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dependency-deletion-behavior.mdx new file mode 100644 index 000000000..5e42b3b9f --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dependency-deletion-behavior.mdx @@ -0,0 +1,164 @@ 
+--- +sidebar_position: 3 +title: "Comportamiento de la Eliminación de Proyectos y Dependencias" +description: "Entendiendo qué sucede con las dependencias al eliminar un proyecto en SleakOps" +date: "2025-01-28" +category: "proyecto" +tags: ["proyecto", "dependencias", "redis", "eliminación", "limpieza"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Comportamiento de la Eliminación de Proyectos y Dependencias + +**Fecha:** 28 de enero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Proyecto, Dependencias, Redis, Eliminación, Limpieza + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender el comportamiento de las dependencias del proyecto (como Redis, bases de datos, cachés) cuando un proyecto es eliminado de la plataforma SleakOps. + +**Síntomas Observados:** + +- Incertidumbre sobre la limpieza de dependencias al eliminar proyectos +- Necesidad de entender si las dependencias se eliminan automáticamente +- Preocupación por pérdida de datos o recursos huérfanos +- Preguntas sobre los procedimientos de limpieza + +**Configuración Relevante:** + +- Proyecto con dependencias adjuntas (Redis, PostgreSQL, MySQL, etc.) +- Interfaz de gestión de proyectos de SleakOps +- Gestión del ciclo de vida de dependencias + +**Condiciones de Error:** + +- Riesgo de recursos huérfanos si las dependencias no se limpian correctamente +- Posible pérdida de datos si las dependencias se eliminan inesperadamente +- Preocupaciones de facturación por recursos no usados + +## Solución Detallada + + + +**Sí, las dependencias se eliminan automáticamente cuando borras un proyecto en SleakOps.** + +Cuando eliminas un proyecto que contiene dependencias como Redis, PostgreSQL, MySQL u otros servicios, ocurre lo siguiente: + +1. **Todas las dependencias del proyecto se eliminan** junto con el proyecto +2. **Los datos almacenados en estas dependencias se borran permanentemente** +3. 
**Los recursos en la nube asociados se limpian** para evitar cargos continuos +4. **La eliminación es irreversible** - no podrás recuperar los datos después + + + + + +El proceso de eliminación sigue esta secuencia: + +1. **El usuario inicia la eliminación del proyecto** +2. **Se detienen las cargas de trabajo** (servicios web, workers, trabajos cron) +3. **Se identifican las dependencias** y se marcan para eliminación +4. **Se muestra advertencia de respaldo de datos** (si aplica) +5. **Se eliminan las dependencias** en orden inverso de dependencia +6. **Se limpian los recursos del proyecto** +7. **Confirmación de eliminación completa** + +```bash +# Ejemplo de lo que se elimina: +- Instancia de Redis y todos los datos en caché +- Base de datos PostgreSQL y todos los datos almacenados +- Base de datos MySQL y todos los datos almacenados +- Volúmenes y almacenamiento asociados +- Configuraciones de red +- Balanceadores de carga y reglas de ingreso +``` + + + + + +**Importante: Siempre haz respaldo de los datos críticos antes de eliminar un proyecto.** + +### Para Redis: + +```bash +# Conéctate a tu instancia Redis y crea un respaldo +redis-cli --rdb /ruta/al/respaldo.rdb + +# O exporta claves específicas +redis-cli --scan --pattern "*" | xargs redis-cli MGET +``` + +### Para PostgreSQL: + +```bash +# Crea un volcado de la base de datos +pg_dump -h tu-host -U tu-usuario -d tu-base-de-datos > respaldo.sql + +# O usa el CLI de SleakOps si está disponible +sleakops project backup --project-id ID_PROYECTO --service postgres +``` + +### Para MySQL: + +```bash +# Crea un volcado de la base de datos +mysqldump -h tu-host -u tu-usuario -p tu-base-de-datos > respaldo.sql +``` + + + + + +### Antes de eliminar un proyecto: + +1. **Revisa todas las dependencias**: + + - Ve a Configuración del Proyecto → Dependencias + - Lista todos los servicios adjuntos (Redis, bases de datos, etc.) + - Identifica datos críticos que necesiten respaldo + +2. 
**Crea respaldos**: + + - Exporta datos importantes de Redis + - Realiza volcados de bases de datos + - Guarda archivos de configuración + +3. **Considera alternativas**: + - **Pausar el proyecto** en lugar de eliminar (si está disponible) + - **Migrar dependencias** a otro proyecto + - **Desvincular dependencias** antes de eliminar (si se soporta) + +### Lista de verificación de seguridad: + +- [ ] Todos los datos críticos han sido respaldados +- [ ] Los miembros del equipo están informados de la eliminación +- [ ] No hay usuarios activos que dependan de los servicios +- [ ] Hay soluciones alternativas disponibles si es necesario + + + + + +**Desafortunadamente, una vez que un proyecto y sus dependencias son eliminados, la recuperación generalmente no es posible.** + +Sin embargo, revisa estas opciones: + +1. **Respaldos del proveedor en la nube**: Algunos proveedores mantienen respaldos automáticos +2. **Soporte de SleakOps**: Contacta soporte inmediatamente si la eliminación fue accidental +3. 
**Respaldos a nivel de aplicación**: Verifica si tu aplicación creó sus propios respaldos + +### Prevención para el futuro: + +- Configura respaldos automáticos regulares +- Usa entornos de staging para probar eliminaciones +- Implementa estrategias de respaldo en el código de tu aplicación +- Considera usar servicios externos de respaldo + + + +--- + +_Esta FAQ fue generada automáticamente el 28 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-arguments-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-arguments-configuration.mdx new file mode 100644 index 000000000..f4b282677 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-arguments-configuration.mdx @@ -0,0 +1,184 @@ +--- +sidebar_position: 3 +title: "Error de Configuración de Argumentos en Dockerfile" +description: "Solución para la configuración de argumentos en Dockerfile en proyectos SleakOps" +date: "2025-03-11" +category: "proyecto" +tags: ["dockerfile", "argumentos", "configuración", "despliegue"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Configuración de Argumentos en Dockerfile + +**Fecha:** 11 de marzo de 2025 +**Categoría:** Proyecto +**Etiquetas:** Dockerfile, Argumentos, Configuración, Despliegue + +## Descripción del Problema + +**Contexto:** El usuario encuentra problemas al crear un proyecto de producción en SleakOps, específicamente relacionados con la configuración de argumentos en Dockerfile y el flujo de creación del proyecto. 
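
Como referencia, los argumentos "obligatorios" de un Dockerfile son los `ARG` declarados sin valor por defecto; un chequeo rápido en shell (con un Dockerfile de ejemplo hipotético) permite listarlos antes de crear el proyecto:

```bash
# Crea un Dockerfile de ejemplo (hipotético) solo para ilustrar el chequeo
cat > /tmp/Dockerfile.ejemplo <<'EOF'
FROM node:16-alpine
ARG NODE_ENV=production
ARG API_URL
ARG DATABASE_URL
EOF

# Lista los ARG declarados sin valor por defecto: son los que habrá
# que configurar en la plataforma antes del build
grep -E '^ARG [A-Za-z_]+$' /tmp/Dockerfile.ejemplo | awk '{print $2}'
# → API_URL y DATABASE_URL (NODE_ENV ya tiene valor por defecto)
```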
+ +**Síntomas Observados:** + +- No es posible añadir argumentos de Dockerfile durante la creación del proyecto +- La creación del proyecto falla en el entorno de producción +- La interfaz impide configurar argumentos antes de que el proyecto exista +- El error persiste a pesar de que el Dockerfile está presente en la rama correcta + +**Configuración Relevante:** + +- Rama: `sleakops-master` (rama de producción) +- Entorno: Producción +- Dockerfile: Presente en el repositorio +- Argumentos: Requeridos por Dockerfile pero no se pueden configurar + +**Condiciones del Error:** + +- El error ocurre durante la creación del proyecto hacia el entorno de producción +- No se pueden declarar argumentos antes de crear el proyecto +- El problema es resuelto mediante intervención del equipo de SleakOps +- Se identifica como un problema del lado de la plataforma + +## Solución Detallada + + + +En SleakOps, los argumentos de Dockerfile (ARG) deben configurarse en la sección de **Configuración del Proyecto**, no en los Grupos de Variables: + +1. Ve a tu **Configuración del Proyecto** +2. Navega a **Configuración de Build** +3. Busca la sección **Argumentos de Dockerfile** +4. Añade tus argumentos allí antes del despliegue + +**Importante:** Los argumentos de Dockerfile son diferentes de las variables de entorno en tiempo de ejecución: + +- **Argumentos de Dockerfile (ARG):** Usados durante el proceso de construcción de la imagen +- **Grupos de Variables:** Usados para variables de entorno en tiempo de ejecución + + + + + +Para crear correctamente un proyecto con argumentos de Dockerfile: + +1. **Asegúrate de que el Dockerfile esté en la rama correcta** + + - Verifica que el Dockerfile exista en tu rama de producción (`sleakops-master`) + - Confirma que todos los cambios estén fusionados desde las ramas de desarrollo + +2. **Crea el proyecto primero** + + - Crea el proyecto inicialmente sin argumentos + - Esto permite acceder a las opciones de configuración + +3. 
**Configura los argumentos después de la creación** + + - Ve a Configuración del Proyecto → Configuración de Build + - Añade los argumentos de Dockerfile requeridos + - Guarda la configuración + +4. **Despliega nuevamente con los argumentos** + - Inicia un nuevo despliegue + - Los argumentos serán pasados durante el proceso de construcción + + + + + +Antes de crear un proyecto de producción, verifica: + +```bash +# Verifica si el Dockerfile existe en la rama de producción +git checkout sleakops-master +ls -la | grep Dockerfile + +# Verifica que la rama esté actualizada +git pull origin sleakops-master + +# Revisa si todos los cambios necesarios están fusionados +git log --oneline -10 +``` + +En la interfaz de SleakOps: + +1. Ve a **Configuración del Repositorio** +2. Verifica que la **Rama de Producción** esté configurada como `sleakops-master` +3. Confirma que la **Ruta del Dockerfile** sea correcta (usualmente `./Dockerfile`) + + + + + +Si encuentras problemas similares: + +1. **Revisa los logs de build** + + - Ve a **Despliegues** → **Logs de Build** + - Busca mensajes de error específicos + - Verifica si el Dockerfile es encontrado + +2. **Verifica acceso al repositorio** + + - Asegúrate de que SleakOps tiene acceso al repositorio + - Comprueba que los permisos de la rama sean correctos + +3. **Contacta soporte si es necesario** + - Problemas del lado de la plataforma pueden requerir intervención del equipo + - Proporciona mensajes de error específicos y capturas de pantalla + - Incluye información de la rama y el repositorio + +**Ejemplo de Dockerfile con argumentos:** + +```dockerfile +FROM node:16-alpine + +# Declarar argumentos de build +ARG NODE_ENV=production +ARG API_URL +ARG DATABASE_URL + +# Usar argumentos durante el build +ENV NODE_ENV=$NODE_ENV +ENV API_URL=$API_URL +ENV DATABASE_URL=$DATABASE_URL + +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . 
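# Nota (boceto): los valores de los ARG anteriores se pasan al construir
# la imagen, por ejemplo manualmente o desde la configuración de build:
#   docker build -t mi-app \
#     --build-arg NODE_ENV=production \
#     --build-arg API_URL=https://api.ejemplo.com .
# Si un ARG requerido no se define, la variable correspondiente queda vacía.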
+ +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +Para evitar problemas similares en el futuro: + +1. **Siempre prueba primero en staging** + + - Crea y prueba proyectos en entorno staging + - Verifica que todas las configuraciones funcionen antes de producción + +2. **Documenta tus argumentos** + + - Mantén una lista de los argumentos requeridos para Dockerfile + - Documenta su propósito y valores esperados + +3. **Usa nombres de ramas consistentes** + + - Mantén convenciones claras para nombres de ramas + - Asegura que la rama de producción esté siempre actualizada + +4. **Sincronización regular de ramas** + - Fusiona cambios regularmente a la rama de producción + - Mantén sincronizadas las ramas de desarrollo y producción + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-path-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-path-configuration.mdx new file mode 100644 index 000000000..246622c01 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-dockerfile-path-configuration.mdx @@ -0,0 +1,195 @@ +--- +sidebar_position: 3 +title: "Error de Configuración de la Ruta del Dockerfile en Proyectos" +description: "Solución para problemas con la ruta del Dockerfile y excepciones al editar la configuración del proyecto" +date: "2025-02-20" +category: "proyecto" +tags: ["dockerfile", "proyecto", "configuración", "ruta", "repositorio"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Configuración de la Ruta del Dockerfile en Proyectos + +**Fecha:** 20 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Dockerfile, Proyecto, Configuración, Ruta, Repositorio + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas al 
configurar o editar la ruta del Dockerfile en la configuración del proyecto SleakOps, lo que resulta en que la ruta se borre o se lancen excepciones. + +**Síntomas Observados:** + +- La ruta del Dockerfile se borra/elimina automáticamente al editar +- Se lanza una excepción al intentar modificar la ruta del Dockerfile +- La configuración no persiste después de guardar +- La construcción del proyecto falla por falta de referencia al Dockerfile + +**Configuración Relevante:** + +- Formato de la ruta del Dockerfile: Ruta relativa desde la raíz del repositorio +- Rama del repositorio: Debe coincidir con la rama donde existe el Dockerfile +- Estructura de la ruta: Debe incluir la ruta completa incluyendo el nombre del archivo + +**Condiciones de Error:** + +- El error ocurre cuando no se encuentra el Dockerfile en el repositorio y rama especificados +- La especificación de la ruta es incompleta o incorrecta +- El Dockerfile no existe en la ubicación especificada + +## Solución Detallada + + + +La ruta del Dockerfile en SleakOps debe incluir la ruta completa desde la raíz del repositorio, incluyendo el nombre del archivo: + +**Formato correcto:** + +``` +./docker/base/Dockerfile +./docker/Dockerfile +./Dockerfile +``` + +**Formato incorrecto:** + +``` +./docker/base # Falta el nombre del archivo +docker/base # Falta el prefijo ./ +base # Ruta incompleta +``` + +**Ejemplo:** +Si la estructura de tu repositorio es: + +``` +my-repo/ +├── docker/ +│ └── base/ +│ └── Dockerfile +└── src/ +``` + +La ruta correcta sería: `./docker/base/Dockerfile` + + + + + +Asegúrate de que: + +1. **El repositorio esté correctamente vinculado** a tu proyecto +2. **La especificación de la rama** coincida con donde existe tu Dockerfile +3. **El Dockerfile exista** en la rama especificada + +**Pasos para verificar:** + +1. Ve a los **Ajustes del Proyecto** +2. Verifica que el campo **Repositorio** apunte al repositorio correcto +3. 
Comprueba que el campo **Rama** coincida con la rama de tu Dockerfile
+4. Confirma que el Dockerfile exista en la ruta especificada dentro de esa rama
+
+**Problemas comunes:**
+
+- El Dockerfile existe en la rama `main` pero el proyecto está configurado para `develop`
+- La URL del repositorio es incorrecta o está desactualizada
+- El Dockerfile fue movido o renombrado después de la configuración inicial
+
+
+
+
+
+Si la ruta del Dockerfile sigue borrándose:
+
+1. **Verifica la existencia del archivo:**
+
+   ```bash
+   # Clona tu repositorio y verifica
+   git clone <url-del-repositorio>
+   cd <directorio-del-repositorio>
+   git checkout <rama>
+   ls -la ./docker/base/Dockerfile # Reemplaza con tu ruta
+   ```
+
+2. **Revisa los permisos del archivo:**
+
+   - Asegúrate de que el Dockerfile sea legible
+   - Verifica los permisos de acceso al repositorio
+
+3. **Actualiza la configuración paso a paso:**
+
+   - Primero, asegúrate de que el repositorio y la rama sean correctos
+   - Luego, añade la ruta completa del Dockerfile
+   - Guarda y verifica que la configuración persista
+
+4. **Prueba con una ruta sencilla:**
+   - Intenta colocar el Dockerfile en la raíz del repositorio: `./Dockerfile`
+   - Si esto funciona, muévete gradualmente a subdirectorios
+
+
+
+
+
+**Prácticas recomendadas:**
+
+1. **Usa nombres consistentes:**
+
+   ```
+   ./Dockerfile              # Para un solo servicio
+   ./docker/app/Dockerfile   # Para aplicaciones multi-servicio
+   ./services/api/Dockerfile # Para microservicios
+   ```
+
+2. **Organiza por entorno:**
+
+   ```
+   ./docker/production/Dockerfile
+   ./docker/development/Dockerfile
+   ./docker/staging/Dockerfile
+   ```
+
+3. **Mantén los Dockerfiles en control de versiones:**
+
+   - Siempre haz commit de los cambios en Dockerfile
+   - Usa mensajes de commit significativos
+   - Etiqueta las versiones que incluyen cambios en Dockerfile
+
+4. **Prueba localmente antes de configurar en SleakOps:**
+   ```bash
+   docker build -f ./docker/base/Dockerfile .
+   ```
+
+
+
+
+
+Si el problema persiste:
+
+1.
**Usa el Dockerfile en la raíz del repositorio:** + + - Mueve tu Dockerfile a la raíz del repositorio + - Usa la ruta: `./Dockerfile` + +2. **Crea un enlace simbólico:** + + ```bash + ln -s ./docker/base/Dockerfile ./Dockerfile + ``` + +3. **Usa Docker Compose:** + + - Configura un archivo docker-compose.yml en la raíz del repositorio + - Referencia el Dockerfile desde el archivo compose + +4. **Contacta al soporte con detalles:** + - URL del repositorio + - Nombre de la rama + - Ruta exacta del Dockerfile + - Capturas de pantalla del error + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-multi-service-setup-guide.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-multi-service-setup-guide.mdx new file mode 100644 index 000000000..d5475c2de --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-multi-service-setup-guide.mdx @@ -0,0 +1,259 @@ +--- +sidebar_position: 15 +title: "Guía de Configuración de Proyecto Multi-Servicio" +description: "Guía completa para configurar conexiones de Kafka, MySQL, MongoDB y proxy inverso Nginx en SleakOps" +date: "2024-01-15" +category: "proyecto" +tags: + [ + "kafka", + "mysql", + "mongodb", + "nginx", + "reverse-proxy", + "base-de-datos", + "configuración", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Guía de Configuración de Proyecto Multi-Servicio + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** Kafka, MySQL, MongoDB, Nginx, Proxy Inverso, Base de Datos, Configuración + +## Descripción del Problema + +**Contexto:** Configurar un entorno completo de proyecto que requiere múltiples servicios incluyendo cola de mensajes (Kafka), base de datos relacional (MySQL), conexión a base de datos NoSQL (MongoDB) y configuración de proxy 
inverso (Nginx). + +**Síntomas Observados:** + +- Necesidad de desplegar servicio Kafka para procesamiento de mensajes +- Requiere base de datos MySQL con capacidad de carga de datos +- Necesidad de establecer conexión con instancia de producción MongoDB +- Requiere configuración de Nginx para proxy inverso + +**Configuración Relevante:** + +- Servicios necesarios: Kafka, MySQL, Nginx +- Conexión externa: MongoDB de producción +- Configuración de proxy: Proxy inverso +- Operaciones de base de datos: Carga de datos y conectividad + +**Condiciones de Error:** + +- Los servicios deben estar correctamente interconectados +- Las conexiones a base de datos deben ser seguras y confiables +- El proxy inverso debe manejar el tráfico correctamente +- Todos los servicios deben ser accesibles dentro del entorno del proyecto + +## Solución Detallada + + + +Para añadir Kafka a tu proyecto SleakOps: + +1. **Navega al panel de tu proyecto** +2. **Ve a la sección de Dependencias** +3. **Agrega la dependencia Kafka:** + - Haz clic en "Agregar Dependencia" + - Selecciona "Kafka" de la lista + - Configura los siguientes ajustes: + +```yaml +# Ejemplo de configuración Kafka +kafka: + version: "3.5" + replicas: 3 + storage: "10Gi" + resources: + requests: + memory: "1Gi" + cpu: "500m" + limits: + memory: "2Gi" + cpu: "1000m" +``` + +4. **Configura los topics** (si es necesario): + - Añade variables de entorno para la configuración de topics + - Establece políticas de retención + - Configura particiones y factor de replicación + + + + + +Para configurar MySQL y cargar tu base de datos: + +1. **Agrega la dependencia MySQL:** + - Ve a Dependencias → Agregar Dependencia → MySQL + - Configura la versión y recursos: + +```yaml +# Configuración MySQL +mysql: + version: "8.0" + database: "nombre_de_tu_base_de_datos" + username: "tu_usuario" + storage: "20Gi" + resources: + requests: + memory: "1Gi" + cpu: "500m" +``` + +2. 
**Carga los datos de la base de datos:** + + - Usa la función **Scripts de Inicialización** en la configuración de MySQL + - Sube tus archivos de volcado SQL + - O usa variables de entorno para ejecutar comandos de inicialización + +3. **Accede a las credenciales:** + - Las credenciales de la base de datos se generan automáticamente + - Accede a ellas mediante variables de entorno en tus aplicaciones + - Formato de cadena de conexión: `mysql://usuario:contraseña@mysql-service:3306/base_de_datos` + + + + + +Para conectar a una instancia externa de MongoDB en producción: + +1. **Agrega los detalles de conexión como secretos:** + - Ve a Configuración del Proyecto → Variables de Entorno + - Añade las variables de conexión MongoDB: + +```bash +MONGO_PROD_URI=mongodb://usuario:contraseña@host-mongo-prod:27017/base_de_datos +MONGO_PROD_DATABASE=tu_base_de_datos_de_producción +MONGO_PROD_USERNAME=tu_usuario +MONGO_PROD_PASSWORD=tu_contraseña +``` + +2. **Consideraciones de seguridad de red:** + + - Asegúrate que tu clúster SleakOps pueda acceder al MongoDB de producción + - Configura reglas de firewall si es necesario + - Usa VPN o red privada si está disponible + +3. **Prueba la conectividad:** + - Crea una carga de trabajo simple para verificar la conexión + - Usa herramientas cliente de MongoDB para probar desde dentro del clúster + + + + + +Para configurar Nginx como proxy inverso: + +1. **Agrega la carga de trabajo Nginx:** + + - Ve a Cargas de Trabajo → Agregar Carga de Trabajo → Servicio Web + - Selecciona Nginx como imagen base + +2. 
**Configura el proxy inverso:** + - Crea una configuración personalizada de Nginx: + +```nginx +# Ejemplo nginx.conf +server { + listen 80; + server_name tu-dominio.com; + + # Proxy a tu aplicación principal + location / { + proxy_pass http://tu-servicio-app:8080; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + } + + # Proxy a la UI de gestión de Kafka (si es necesario) + location /kafka { + proxy_pass http://kafka-ui-service:8080; + proxy_set_header Host $host; + } + + # Endpoint de chequeo de salud + location /health { + return 200 'OK'; + add_header Content-Type text/plain; + } +} +``` + +3. **Monta la configuración:** + + - Sube tu archivo nginx.conf como un archivo de configuración + - Móntalo en `/etc/nginx/conf.d/default.conf` + +4. **Expón el servicio:** + - Configura ingreso o balanceador de carga + - Configura SSL/TLS si es requerido + + + + + +Para asegurar que todos los servicios funcionen juntos: + +1. **Descubrimiento de servicios:** + + - Los servicios pueden comunicarse usando sus nombres de servicio + - Ejemplo: `kafka-service:9092`, `mysql-service:3306` + +2. **Variables de entorno para integración:** + +```bash +# Variables de entorno de la aplicación +KAFKA_BROKERS=kafka-service:9092 +MYSQL_HOST=mysql-service +MYSQL_PORT=3306 +MYSQL_DATABASE=tu_base_de_datos +MONGO_PROD_URI=mongodb://host-mongo-prod:27017/base_de_datos +NGINX_UPSTREAM=tu-servicio-app:8080 +``` + +3. **Chequeos de salud:** + + - Configura chequeos de salud para cada servicio + - Monitorea la conectividad entre servicios + - Configura alertas para fallos de servicio + +4. 
**Prueba de la configuración completa:** + - Prueba la producción/consumo de mensajes Kafka + - Verifica consultas a la base de datos MySQL + - Confirma la conectividad con MongoDB de producción + - Prueba el ruteo del proxy inverso + + + + + +**Problemas de conectividad de servicios:** + +- Verifica nombres de servicios y puertos +- Revisa políticas de red +- Revisa configuraciones de firewall + +**Problemas de conexión a base de datos:** + +- Verifica credenciales y cadenas de conexión +- Revisa estado del servicio de base de datos +- Revisa conectividad de red + +**Problemas con proxy Nginx:** + +- Verifica sintaxis de configuración Nginx +- Revisa logs de Nginx para errores +- Confirma que los servicios upstream estén disponibles + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx new file mode 100644 index 000000000..74aadf372 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-stuck-analyzing-dockerfile-missing.mdx @@ -0,0 +1,151 @@ +--- +sidebar_position: 3 +title: "Proyecto Atascado en Estado de Análisis - Dockerfile Ausente" +description: "Solución para proyectos que permanecen permanentemente atascados en estado de análisis debido a la ausencia del Dockerfile" +date: "2024-12-11" +category: "proyecto" +tags: ["dockerfile", "analizando", "despliegue", "solución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Proyecto Atascado en Estado de Análisis - Dockerfile Ausente + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Dockerfile, Análisis, Despliegue, Solución de problemas + +## Descripción 
del Problema
+
+**Contexto:** Un proyecto nuevo en SleakOps queda permanentemente atascado en el estado de "análisis" y no puede continuar con el proceso de despliegue.
+
+**Síntomas Observados:**
+
+- El proyecto permanece indefinidamente en estado de "análisis"
+- No hay avance en la pipeline de despliegue
+- El proceso de construcción no avanza a las siguientes etapas
+- No se muestran mensajes claros de error en la interfaz
+
+**Configuración Relevante:**
+
+- Tipo de proyecto: Proyecto nuevo
+- Estado: Atascado en fase de análisis
+- Repositorio: Conectado al repositorio de código fuente
+- Dockerfile: Ausente o no detectado
+
+**Condiciones de Error:**
+
+- El error ocurre durante el análisis inicial del proyecto
+- Sucede cuando SleakOps no puede encontrar un Dockerfile en el repositorio
+- El problema persiste hasta que el Dockerfile esté configurado correctamente
+- Puede ocurrir si el Dockerfile fue agregado después del análisis inicial
+
+## Solución Detallada
+
+
+
+Primero, verifica si tu repositorio contiene un Dockerfile:
+
+1. **Revisa la raíz del repositorio**: Asegúrate de que haya un archivo llamado `Dockerfile` (sensible a mayúsculas) en el directorio raíz
+2. **Verifica el contenido del archivo**: El Dockerfile debe contener instrucciones válidas de Docker
+3. **Revisa los permisos del archivo**: Asegúrate de que el archivo sea legible y de que se haya hecho commit de él correctamente en el repositorio
+
+```dockerfile
+# Ejemplo de una estructura básica de Dockerfile
+FROM node:18-alpine
+WORKDIR /app
+COPY package*.json ./
+RUN npm install
+COPY . .
+EXPOSE 3000
+CMD ["npm", "start"]
+```
+
+
+
+
+
+Si tu repositorio no tiene un Dockerfile, necesitas crear uno:
+
+1. **Crear el archivo**: Añade un archivo llamado `Dockerfile` en la raíz de tu repositorio
+2. **Elegir la imagen base adecuada**: Selecciona según la tecnología de tu aplicación
+3. **Definir los pasos de construcción**: Incluye todos los comandos necesarios para construir tu aplicación
+4.
**Commit y push**: Asegúrate de hacer commit del Dockerfile y subirlo a tu repositorio
+
+**Plantillas comunes de Dockerfile:**
+
+```dockerfile
+# Aplicación Node.js
+FROM node:18-alpine
+WORKDIR /app
+COPY package*.json ./
+RUN npm ci --only=production
+COPY . .
+EXPOSE 3000
+CMD ["node", "server.js"]
+```
+
+```dockerfile
+# Aplicación Python
+FROM python:3.9-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 8000
+CMD ["python", "app.py"]
+```
+
+
+
+
+
+Después de agregar o corregir el Dockerfile:
+
+1. **Commit de los cambios**: Asegúrate de que se haya hecho commit del Dockerfile en tu repositorio
+2. **Forzar reanálisis**: En el panel de SleakOps:
+   - Ve a tu proyecto
+   - Busca el botón "Re-analizar" o "Actualizar"
+   - Haz clic para iniciar un nuevo análisis
+3. **Monitorea el progreso**: Observa cómo el estado del proyecto cambia de "analizando" a la siguiente fase
+
+
+
+
+
+Antes de hacer commit, valida tu Dockerfile:
+
+```bash
+# Probar Dockerfile localmente
+docker build -t test-image .
+
+# Revisar errores de sintaxis
+docker run --rm -i hadolint/hadolint < Dockerfile
+```
+
+**Problemas comunes en Dockerfile:**
+
+- Falta la instrucción `FROM`
+- Rutas incorrectas en comandos `COPY`
+- Permisos ejecutables faltantes
+- Configuración incorrecta del directorio de trabajo
+
+
+
+
+
+Si el proyecto sigue atascado después de agregar un Dockerfile:
+
+1. **Verifica permisos del repositorio**: Asegúrate de que SleakOps tenga acceso para leer el repositorio
+2. **Verifica la rama**: Confirma que estás trabajando en la rama correcta
+3. **Limpiar caché**: Intenta desconectar y reconectar el repositorio
+4.
**Contacta soporte**: Si el problema persiste, proporciona: + - URL del repositorio + - Contenido del Dockerfile + - Detalles de configuración del proyecto + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-vargroups-environment-variables.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-vargroups-environment-variables.mdx new file mode 100644 index 000000000..dbe1a84bf --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/project-vargroups-environment-variables.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "Variables de Entorno de Vargroups No Disponibles en la Aplicación" +description: "Solución para aplicaciones que no reciben todas las variables de entorno esperadas de Vargroups" +date: "2024-12-30" +category: "proyecto" +tags: ["vargroups", "variables-de-entorno", "proyecto", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno de Vargroups No Disponibles en la Aplicación + +**Fecha:** 30 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Vargroups, Variables de Entorno, Proyecto, Configuración + +## Descripción del Problema + +**Contexto:** El usuario ha creado múltiples Vargroups pero su aplicación solo está recibiendo variables de entorno de un Vargroup específico ('simplee-web'), mientras que otras variables de entorno esperadas están ausentes. 
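
Como ilustración del síntoma (los nombres de variables son hipotéticos), desde una shell puede comprobarse qué variables llegan realmente al entorno del proceso y cuáles faltan:

```bash
# Simula el entorno que inyectaría el Vargroup 'simplee-web' (valores hipotéticos)
export SIMPLEE_WEB_API_URL="https://api.simplee.cl"
export SIMPLEE_WEB_ENV="production"

# Cuenta las variables presentes con el prefijo esperado
presentes=$(env | grep -c '^SIMPLEE_WEB_')
echo "Variables del Vargroup visibles: $presentes"

# Una variable esperada de otro Vargroup (no asignado a este Proyecto) estará ausente
if [ -z "${SIMPLEE_WORKER_QUEUE_URL:-}" ]; then
  echo "Falta SIMPLEE_WORKER_QUEUE_URL: su Vargroup no está aplicado a este Proyecto"
fi
```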
+ +**Síntomas Observados:** + +- La aplicación solo muestra variables de entorno de un Vargroup +- Faltan variables de entorno que deberían estar disponibles +- Otros Vargroups parecen no afectar la aplicación +- La aplicación puede no estar leyendo correctamente las variables de entorno desde el Pod + +**Configuración Relevante:** + +- Aplicación: Aplicación web (simplee.cl) +- Vargroup disponible: 'simplee-web' +- Plataforma: Entorno del proyecto SleakOps +- Faltante: Vargroups adicionales con variables de entorno requeridas + +**Condiciones de Error:** + +- Las variables de entorno de otros Vargroups no llegan a la aplicación +- Solo se aplican las variables de un Vargroup +- La funcionalidad de la aplicación puede verse afectada por variables faltantes + +## Solución Detallada + + + +Los Vargroups en SleakOps son **con alcance de proyecto**, lo que significa: + +- Cada Vargroup solo afecta el Proyecto específico donde fue creado +- Los Vargroups de otros Proyectos no son accesibles +- Debes crear todos los Vargroups requeridos dentro del mismo Proyecto + +**Para verificar tus Vargroups actuales:** + +1. Navega a tu Proyecto en SleakOps +2. Ve a **Configuración** → **Vargroups** +3. Revisa qué Vargroups existen en este Proyecto específico + + + + + +Si te faltan Vargroups en tu Proyecto actual: + +1. **Navega a Configuración del Proyecto:** + + - Ve al panel de tu Proyecto + - Selecciona **Configuración** → **Vargroups** + +2. **Crear Nuevo Vargroup:** + + - Haz clic en **Agregar Vargroup** + - Ingresa un nombre descriptivo + - Añade todas las variables de entorno requeridas + +3. **Configura las Variables:** + + ```bash + # Ejemplo de configuración de Vargroup + DATABASE_URL=postgresql://usuario:contraseña@host:5432/db + API_KEY=tu-api-key-aqui + ENVIRONMENT=producción + ``` + +4. 
**Aplica los Cambios:**
+   - Guarda el Vargroup
+   - Reimplementa tu aplicación para aplicar las nuevas variables
+
+
+
+
+
+Asegúrate de que el código de tu aplicación esté leyendo correctamente las variables de entorno:
+
+**Para aplicaciones Node.js:**
+
+```javascript
+// Verifica si las variables están disponibles
+console.log("Variables de entorno disponibles:", Object.keys(process.env));
+
+// Accede a variables específicas
+const dbUrl = process.env.DATABASE_URL;
+const apiKey = process.env.API_KEY;
+```
+
+**Para aplicaciones Python:**
+
+```python
+import os
+
+# Verifica variables disponibles
+print('Variables de entorno disponibles:', list(os.environ.keys()))
+
+# Accede a variables específicas
+db_url = os.environ.get('DATABASE_URL')
+api_key = os.environ.get('API_KEY')
+```
+
+**Depuración en el Pod:**
+
+```bash
+# Conéctate a tu pod y verifica el entorno
+kubectl exec -it <nombre-del-pod> -- env | grep -i <nombre-de-variable>
+```
+
+
+
+
+
+**Paso 1: Verifica la Asignación de Vargroups**
+
+- Asegúrate de que todos los Vargroups requeridos estén creados en el Proyecto correcto
+- Comprueba que los Vargroups estén correctamente asignados a tu aplicación
+
+**Paso 2: Revisa el Estado del Despliegue**
+
+- Después de crear nuevos Vargroups, inicia un nuevo despliegue
+- Las variables de entorno se aplican en el momento del despliegue
+
+**Paso 3: Valida los Nombres de Variables**
+
+- Asegúrate de que los nombres de variables coincidan exactamente (sensible a mayúsculas/minúsculas)
+- Revisa posibles errores tipográficos en los nombres
+
+**Paso 4: Revisión del Código Aplicativo**
+
+- Verifica que tu aplicación lea desde `process.env` o su equivalente
+- Comprueba si las variables se usan en el ámbito correcto
+
+**Paso 5: Verificación a Nivel de Pod**
+
+```bash
+# Verifica variables de entorno en el pod en ejecución
+kubectl get pods -n <namespace>
+kubectl exec -it <nombre-del-pod> -n <namespace> -- printenv
+```
+
+
+
+
+
+**Problema 1: Variables No Actualizadas Tras Cambios**
+
+- **Solución:** Reimplementa la aplicación después de
cambios en Vargroups +- Las variables de entorno se inyectan al iniciar el contenedor + +**Problema 2: Sensibilidad a Mayúsculas y Minúsculas** + +- **Solución:** Asegura coincidencia exacta entre Vargroup y código de aplicación +- Los contenedores Linux son sensibles a mayúsculas/minúsculas + +**Problema 3: Sobrescritura de Variables** + +- **Solución:** Verifica si múltiples Vargroups definen la misma variable +- Los Vargroups posteriores pueden sobrescribir a los anteriores + +**Problema 4: Aplicación No Lee el Entorno** + +- **Solución:** Verifica que el framework de la aplicación cargue correctamente las variables de entorno +- Algunos frameworks requieren configuración explícita + + + +--- + +_Esta FAQ fue generada automáticamente el 30 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx new file mode 100644 index 000000000..e668785ea --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-grafana-no-data.mdx @@ -0,0 +1,220 @@ +--- +sidebar_position: 3 +title: "Problemas de Memoria en Prometheus que Causan Pérdida de Datos en Grafana" +description: "Solución para fallos en el pod backend de Prometheus debido a RAM insuficiente que causa que Grafana no muestre datos" +date: "2024-12-11" +category: "dependency" +tags: ["prometheus", "grafana", "memoria", "monitoreo", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Memoria en Prometheus que Causan Pérdida de Datos en Grafana + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Dependencia +**Etiquetas:** Prometheus, Grafana, Memoria, Monitoreo, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los paneles de 
Grafana muestran datos vacíos o sin métricas durante varios días debido a fallos en el pod backend de Prometheus causados por asignación insuficiente de memoria.
+
+**Síntomas Observados:**
+
+- Los paneles de Grafana no muestran datos ni métricas
+- El pod backend de Prometheus falla o permanece en estado fallido
+- Falta de recolección de métricas durante períodos prolongados (días)
+- El filtro de espacio de nombres predeterminado muestra resultados vacíos en Grafana
+- El panel de Loki no muestra información de logs
+
+**Configuración Relevante:**
+
+- Límite de memoria del pod backend de Prometheus: Por debajo del umbral requerido
+- Filtro de espacio de nombres predeterminado en Grafana: 'default' (espacio de nombres vacío)
+- Periodo afectado: Múltiples días con datos faltantes
+- Requisito de memoria: Mínimo 1250Mi de RAM necesario
+
+**Condiciones de Error:**
+
+- El pod de Prometheus falla debido a OOMKilled (Falta de Memoria)
+- La recolección de métricas se detiene completamente
+- Grafana no puede recuperar datos históricos del período fallido
+- El problema persiste hasta intervención manual
+
+## Solución Detallada
+
+
+
+**Solución Inmediata:**
+
+1. **Identificar el pod de Prometheus que falló:**
+
+   ```bash
+   kubectl get pods -n monitoring | grep prometheus
+   kubectl describe pod <nombre-del-pod> -n monitoring
+   ```
+
+2. **Verificar uso y límites de memoria:**
+
+   ```bash
+   kubectl top pod <nombre-del-pod> -n monitoring
+   kubectl get pod <nombre-del-pod> -n monitoring -o yaml | grep -A 5 -B 5 resources
+   ```
+
+3. **Incrementar manualmente la asignación de memoria:**
+
+   ```yaml
+   # Editar el deployment de Prometheus
+   kubectl edit deployment prometheus-server -n monitoring
+
+   # Añadir o modificar la sección de recursos:
+   resources:
+     requests:
+       memory: "1250Mi"
+     limits:
+       memory: "2Gi"
+   ```
+
+4.
**Reiniciar el deployment:** + ```bash + kubectl rollout restart deployment prometheus-server -n monitoring + ``` + + + + + +**Problema:** Grafana se abre con el filtro de espacio de nombres 'default' que normalmente no contiene aplicaciones desplegadas. + +**Solución:** + +1. **Acceder al panel de Grafana** +2. **Cambiar el filtro de espacio de nombres:** + + - Buscar el desplegable de espacio de nombres (usualmente en la parte superior) + - Seleccionar un espacio de nombres que contenga tus aplicaciones + - Espacios de nombres comunes: `kube-system`, `monitoring`, `default` o los específicos de tus aplicaciones + +3. **Establecer un valor predeterminado significativo:** + - Elegir un espacio de nombres con cargas de trabajo activas + - Guardar el panel con el espacio de nombres correcto seleccionado + +**Paneles Disponibles:** + +- Vista general del clúster Kubernetes +- Métricas de nodos +- Métricas de pods +- Paneles específicos de aplicaciones +- Monitoreo de red +- Métricas de almacenamiento + + + + + +**Problema:** El panel de Loki no muestra información de logs debido a un fallo en el componente de lectura. + +**Solución:** + +1. **Identificar el pod de lectura de Loki:** + + ```bash + kubectl get pods -n monitoring | grep loki-read + ``` + +2. **Eliminar el pod problemático:** + + ```bash + kubectl delete pod <nombre-del-pod-loki-read> -n monitoring + ``` + +3. **Verificar la recreación automática:** + + ```bash + kubectl get pods -n monitoring | grep loki-read + kubectl logs <nombre-del-pod-loki-read> -n monitoring + ``` + +4. **Probar la recolección de logs:** + - Esperar 2-3 minutos para que el pod inicie completamente + - Revisar el panel de Grafana Loki para nuevas entradas de logs + - Verificar que los logs se estén recolectando de tus aplicaciones + + + + + +**Configuración de Monitoreo:** + +1. 
**Configurar alertas para uso de memoria en Prometheus:** + + ```yaml + # Ejemplo de regla de alerta + - alert: PrometheusHighMemoryUsage + expr: (container_memory_usage_bytes{pod=~"prometheus.*"} / container_spec_memory_limit_bytes{pod=~"prometheus.*"}) > 0.8 + for: 5m + labels: + severity: warning + annotations: + summary: "Prometheus está usando mucha memoria" + ``` + +2. **Monitoreo regular de memoria:** + + ```bash + # Ver uso actual de memoria + kubectl top pods -n monitoring + + # Monitorear en tiempo real + watch kubectl top pods -n monitoring + ``` + +3. **Consideraciones de escalado:** + - A medida que el clúster crece, aumentan los requerimientos de memoria de Prometheus + - Monitorear el período de retención de métricas + - Considerar federación de Prometheus para clústeres grandes + - Ajustar límites de memoria según tamaño del clúster y políticas de retención + +**Mejoras Futuras en la Plataforma:** + +- Los límites de memoria serán ajustables desde la interfaz de SleakOps +- Escalado automático basado en tamaño del clúster +- Monitoreo proactivo y alertas para restricciones de recursos + + + + + +**Notas Importantes:** + +- **Las métricas perdidas no se pueden recuperar:** Los datos del período en que Prometheus estuvo caído se pierden permanentemente +- **Planificar redundancia:** Considerar configurar federación de Prometheus o almacenamiento externo para métricas críticas +- **Estrategias de respaldo:** Implementar respaldos regulares de datos de Prometheus para entornos críticos + +**Mitigación para Producción:** + +1. **Configuración de Alta Disponibilidad:** + + ```yaml + # Ejemplo de configuración HA para Prometheus + prometheus: + prometheusSpec: + replicas: 2 + retention: 30d + resources: + requests: + memory: 2Gi + limits: + memory: 4Gi + ``` + +2. 
**Almacenamiento externo:** + - Configurar escritura remota a TSDB externo + - Usar Thanos para almacenamiento a largo plazo + - Implementar estrategias de respaldo entre regiones + + + +--- + +_Este FAQ fue generado automáticamente el 11 de diciembre de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-node-allocation.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-node-allocation.mdx new file mode 100644 index 000000000..16a443793 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/prometheus-memory-issues-node-allocation.mdx @@ -0,0 +1,218 @@ +--- +sidebar_position: 3 +title: "Problemas de Memoria en Prometheus y Asignación de Nodos" +description: "Solución para fallos del pod de Prometheus debido a limitaciones de memoria y asignación dinámica de nodos" +date: "2024-12-19" +category: "cluster" +tags: ["prometheus", "memoria", "monitorización", "grafana", "asignacion-nodos"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Memoria en Prometheus y Asignación de Nodos + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Clúster +**Etiquetas:** Prometheus, Memoria, Monitorización, Grafana, Asignación de Nodos + +## Descripción del Problema + +**Contexto:** Los pods de Prometheus en los clústeres SleakOps experimentan fallos debido a limitaciones de memoria cuando se asignan a nodos con recursos insuficientes, causando que los paneles de monitorización y la recopilación de métricas no funcionen. 
+ +**Síntomas Observados:** + +- Fallos del pod de Prometheus por agotamiento de memoria +- Los paneles de Grafana no muestran datos o se vuelven inaccesibles +- No se recopilan ni almacenan métricas +- El contenedor de Prometheus muestra estado amarillo (estado de advertencia) +- La funcionalidad de monitorización está completamente interrumpida + +**Configuración Relevante:** + +- Prometheus tiene requisitos dinámicos de memoria +- La asignación de nodos es dinámica y puede colocar Prometheus en nodos subdimensionados +- Grafana depende de Prometheus para los datos de métricas +- Loki puede verse afectado por las mismas limitaciones de recursos del nodo + +**Condiciones de Error:** + +- Ocurre cuando Prometheus se programa en nodos con memoria insuficiente +- El problema es intermitente debido a la asignación dinámica de nodos +- Afecta todas las funciones de monitorización y observabilidad +- Puede reaparecer conforme cambia la disponibilidad de nodos por escalado del clúster + +## Solución Detallada + + + +La solución inmediata consiste en configurar Prometheus para que siempre se programe en nodos con recursos suficientes: + +1. **Accede a la configuración de tu clúster** +2. **Modifica el despliegue de Prometheus** para incluir reglas de afinidad de nodo +3. **Asegura que Prometheus apunte a nodos más grandes** con memoria adecuada + +```yaml +# Ejemplo de configuración de afinidad de nodo para Prometheus +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus +spec: + template: + spec: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: node.kubernetes.io/instance-type + operator: In + values: + - "m5.large" + - "m5.xlarge" + - "c5.large" + - "c5.xlarge" +``` + +Esto evita que Prometheus se programe en nodos más pequeños que no pueden manejar sus requisitos de memoria. 
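Como alternativa más simple a la afinidad por tipo de instancia, también puede etiquetarse un grupo de nodos grandes y fijar Prometheus a ellos con `nodeSelector` (boceto ilustrativo; la etiqueta `workload=monitoring` es un nombre supuesto, no algo que genere SleakOps):

```yaml
# Etiquetar primero el nodo: kubectl label nodes <nombre-del-nodo> workload=monitoring
spec:
  template:
    spec:
      nodeSelector:
        workload: monitoring # etiqueta supuesta para los nodos de monitorización
```

A diferencia de `requiredDuringSchedulingIgnoredDuringExecution`, `nodeSelector` no admite operadores ni listas de valores, pero es suficiente cuando un único grupo de nodos dedicado cumple los requisitos de memoria.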
+ + + + + +Después de aplicar la solución, verifica que la monitorización funcione correctamente: + +1. **Revisa el estado del pod de Prometheus**: + + ```bash + kubectl get pods -n monitoring | grep prometheus + ``` + + El pod debe mostrar estado `Running` con todos los contenedores en verde. + +2. **Verifica los paneles de Grafana**: + + - **Logs del contenedor (Loki)**: Comprueba si hay datos de logs disponibles + - **Computación y RAM (Prometheus)**: Verifica que las métricas se estén recopilando + - **Tráfico de red (Prometheus)**: Confirma que las métricas de red se actualizan + +3. **Prueba la funcionalidad del panel**: + - Accede a la interfaz de Grafana + - Navega por diferentes paneles + - Confirma que los datos se muestran con marcas de tiempo recientes + + + + + +Para identificar cuándo Prometheus está experimentando problemas: + +1. **Indicadores visuales en el panel de SleakOps**: + + - Busca contenedores amarillos (estado de advertencia) + - Revisa contenedores rojos (estado fallido) + - Monitorea gráficos de uso de recursos + +2. **Monitorización por línea de comandos**: + + ```bash + # Revisa uso de recursos del pod Prometheus + kubectl top pod -n monitoring | grep prometheus + + # Revisa eventos del pod por problemas de memoria + kubectl describe pod -n monitoring + + # Monitorea logs del pod para errores de memoria + kubectl logs -n monitoring + ``` + +3. **Configura alertas** para reinicios del pod de Prometheus o picos en uso de memoria. 
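El punto 3 puede concretarse con una regla de alerta de Prometheus sobre reinicios del pod (boceto ilustrativo; asume que `kube-state-metrics` está desplegado y expone la métrica `kube_pod_container_status_restarts_total`):

```yaml
# Alerta si el pod de Prometheus se reinició recientemente (por ejemplo, por OOMKilled)
- alert: PrometheusPodRestarting
  expr: increase(kube_pod_container_status_restarts_total{namespace="monitoring", pod=~"prometheus.*"}[15m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "El pod de Prometheus se ha reiniciado en los últimos 15 minutos"
```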
+ + + + + +Establece solicitudes y límites apropiados de recursos para Prometheus: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prometheus +spec: + template: + spec: + containers: + - name: prometheus + resources: + requests: + memory: "2Gi" + cpu: "500m" + limits: + memory: "4Gi" + cpu: "1000m" +``` + +**Directrices para dimensionar recursos**: + +- **Clústeres pequeños** (< 50 pods): solicitud de 2Gi de memoria, límite de 4Gi +- **Clústeres medianos** (50-200 pods): solicitud de 4Gi de memoria, límite de 8Gi +- **Clústeres grandes** (> 200 pods): solicitud de 8Gi de memoria, límite de 16Gi + + + + + +Para evitar que este problema se repita: + +1. **Implementa taints y tolerations en los nodos**: + + ```yaml + # Aplica un taint a los nodos dedicados a monitorización: + # kubectl taint nodes <nombre-del-nodo> monitoring=true:NoSchedule + + # Añade esta toleration al despliegue de Prometheus: + tolerations: + - key: "monitoring" + operator: "Equal" + value: "true" + effect: "NoSchedule" + ``` + +2. **Usa pools de nodos dedicados** para componentes de monitorización +3. **Implementa autoscaling del clúster** con requisitos mínimos de nodos +4. **Configura alertas de monitorización** para agotamiento de recursos +5. **Planificación regular de capacidad** basada en el crecimiento del clúster + + + + + +Si el problema persiste después de aplicar la solución inicial: + +1. **Revisa la capacidad de los nodos del clúster**: + + ```bash + kubectl describe nodes | grep -A 5 "Allocated resources" + ``` + +2. **Verifica la configuración de Prometheus**: + + - Revisa intervalos de scrape y políticas de retención + - Revisa configuración de descubrimiento de objetivos + - Valida configuración de almacenamiento + +3. **Examina eventos del clúster**: + + ```bash + kubectl get events --sort-by=.metadata.creationTimestamp + ``` + +4. **Considera federación de Prometheus** para clústeres muy grandes +5. 
**Implementa fragmentación (sharding) de Prometheus** si una sola instancia no puede manejar la carga + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/rails-console-large-script-execution.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/rails-console-large-script-execution.mdx new file mode 100644 index 000000000..627fbbea4 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/rails-console-large-script-execution.mdx @@ -0,0 +1,172 @@ +--- +sidebar_position: 15 +title: "Problema de Ejecución de Scripts Grandes en la Consola de Rails" +description: "Solución para ejecutar scripts grandes en la consola de Rails a través de Lens" +date: "2024-01-15" +category: "workload" +tags: ["rails", "consola", "lens", "kubectl", "ejecucion-de-scripts"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Ejecución de Scripts Grandes en la Consola de Rails + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Rails, Consola, Lens, kubectl, Ejecución de Scripts + +## Descripción del Problema + +**Contexto:** El usuario experimenta problemas al intentar pegar o ejecutar scripts grandes (aproximadamente 300 líneas) en la consola de Rails a través de la interfaz de Lens. 
+ +**Síntomas Observados:** + +- `IOError: ungetbyte failed` al pegar scripts grandes +- El error ocurre en `/usr/local/lib/ruby/3.1.0/reline/ansi.rb:259` +- La consola se vuelve no responsiva con bloques grandes de código +- El problema afecta específicamente a la consola IRB de Rails accedida mediante Lens + +**Configuración Relevante:** + +- Versión de Ruby: 3.1.0 +- Consola de Rails accedida a través de Lens +- Tamaño del script: ~300 líneas +- Ubicación del error: módulo Reline ANSI + +**Condiciones del Error:** + +- El error ocurre al pegar grandes cantidades de código +- Sucede específicamente en la consola de Rails (IRB) +- El problema aparece al usar la interfaz de consola de Lens +- No ocurre con fragmentos de código pequeños + +## Solución Detallada + + + +La solución más confiable es copiar tu script como archivo al pod y luego ejecutarlo: + +```bash +# Copiar script desde la máquina local al pod +kubectl cp /home/user/local/path/script.rb namespace/pod_name:/tmp/script.rb + +# Luego en la consola de Rails, cargar y ejecutar el script +load '/tmp/script.rb' +``` + +**Proceso paso a paso:** + +1. Guarda tu script en un archivo local (ejemplo, `my_script.rb`) +2. Usa `kubectl cp` para copiarlo al pod +3. Accede a la consola de Rails a través de Lens +4. Usa `load` o `require` para ejecutar el script + + + + + +También puedes ejecutar scripts de Rails directamente desde el terminal: + +```bash +# Acceder al terminal del pod a través de Lens +# Luego ejecutar Rails runner con tu script +rails runner /tmp/script.rb + +# O ejecutar código Ruby directamente +ruby -e "$(cat /tmp/script.rb)" +``` + +Esto evita completamente las limitaciones de la consola IRB. 
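El patrón de «copiar y cargar» puede probarse en Ruby puro, sin Rails ni Kubernetes (boceto ilustrativo: el método `resultado_del_script` es un nombre inventado para el ejemplo):

```ruby
require "tmpdir"

# Simula el script que se copiaría al pod con kubectl cp
script_path = File.join(Dir.tmpdir, "my_script.rb")
File.write(script_path, <<~RUBY)
  def resultado_del_script
    (1..10).sum
  end
RUBY

# Equivalente a ejecutar `load '/tmp/my_script.rb'` en la consola de Rails
load script_path
puts resultado_del_script # imprime 55
```

Al cargar el archivo con `load`, sus definiciones quedan disponibles en la sesión, exactamente igual que al pegarlo, pero sin pasar por el búfer de entrada de Reline que provoca el `IOError`.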
+ + + + + +**Recuerda que los sistemas de archivos de los pods son efímeros:** + +- Los archivos copiados a los pods se perderán cuando los pods se reinicien +- Usa el directorio `/tmp` para scripts temporales +- Para scripts persistentes, considera usar ConfigMaps o volúmenes montados + +**Creando un ConfigMap para scripts reutilizables:** + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: rails-scripts +data: + my-script.rb: | + # Contenido de tu script Ruby aquí + puts "Hola desde el script en ConfigMap" +``` + +Luego móntalo en tu despliegue y accede en `/scripts/my-script.rb`. + + + + + +**Método 1: Divide scripts grandes en partes más pequeñas** + +```ruby +# En la consola de Rails, pega secciones más pequeñas a la vez +# Esto evita el problema de desbordamiento del búfer +``` + +**Método 2: Usa sintaxis heredoc** + +```ruby +# En la consola de Rails +script_content = <<~RUBY + # Contenido de tu script aquí + # Esto puede manejar bloques grandes mejor +RUBY + +eval(script_content) +``` + +**Método 3: Ejecutar desde la aplicación Rails** + +```ruby +# Crea una tarea rake o un script para Rails runner +# rails runner 'ruta/a/tu/script.rb' +``` + + + + + +Si `kubectl cp` no funciona: + +**Verifica el pod y el namespace:** + +```bash +# Listar pods en el namespace +kubectl get pods -n tu-namespace + +# Verificar que el pod esté corriendo +kubectl describe pod nombre-del-pod -n tu-namespace +``` + +**Sintaxis correcta:** + +```bash +# De local a pod +kubectl cp ./archivo-local.rb namespace/nombre-pod:/tmp/archivo-remoto.rb + +# De pod a local +kubectl cp namespace/nombre-pod:/tmp/archivo-remoto.rb ./archivo-local.rb +``` + +**Problemas comunes:** + +- Asegúrate que el directorio destino exista en el pod +- Revisa permisos de archivo +- Verifica que el contexto de kubectl sea el correcto + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/react-environment-variables-runtime.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/react-environment-variables-runtime.mdx new file mode 100644 index 000000000..a9cf9c5b1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/react-environment-variables-runtime.mdx @@ -0,0 +1,201 @@ +--- +sidebar_position: 15 +title: "Variables de Entorno de React No Disponibles en Tiempo de Ejecución" +description: "Solución para aplicaciones React donde las variables de entorno de vargroups no son accesibles en tiempo de ejecución" +date: "2024-12-19" +category: "proyecto" +tags: ["react", "variables-de-entorno", "dockerfile", "build", "runtime"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Variables de Entorno de React No Disponibles en Tiempo de Ejecución + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** React, Variables de Entorno, Dockerfile, Build, Runtime + +## Descripción del Problema + +**Contexto:** El usuario tiene una aplicación React desplegada en SleakOps donde las variables de entorno definidas en vargroups no son accesibles en tiempo de ejecución a través de `process.env`, aunque funcionan localmente con archivos `.env.local`. 
+ +**Síntomas Observados:** + +- Las variables de entorno de vargroups no aparecen en los logs de `process.env` +- Las variables funcionan correctamente en desarrollo local con archivos `.env.local` +- La aplicación se construye exitosamente pero las variables están indefinidas en tiempo de ejecución +- El proceso de build estático no incluye variables de entorno en tiempo de ejecución + +**Configuración Relevante:** + +- Tipo de aplicación: React SPA (Aplicación de Página Única) +- Proceso de build: Build Docker multi-etapa con Node.js y Nginx +- Despliegue: Archivos estáticos servidos por Nginx +- Entorno: Plataforma SleakOps con vargroups + +**Condiciones de Error:** + +- Las variables están indefinidas al acceder vía `process.env` en el navegador +- El problema ocurre solo en el entorno desplegado, no localmente +- El problema persiste incluso cuando los vargroups están correctamente configurados + +## Solución Detallada + + + +El problema ocurre porque las variables de entorno de React se resuelven durante el **proceso de build**, no en tiempo de ejecución. Esto es lo que sucede: + +1. **Etapa de Build**: Las variables de entorno se incrustan en los archivos JavaScript estáticos durante la compilación +2. **Etapa de Runtime**: La aplicación se ejecuta como archivos estáticos servidos por Nginx, sin acceso a variables de entorno del servidor +3. **Ejecución en Navegador**: Las referencias a `process.env` son reemplazadas por valores literales durante el build + +Por eso las variables funcionan localmente (disponibles durante `yarn build`) pero no en producción (no disponibles durante el build de Docker). + + + + + +Para hacer que las variables de entorno estén disponibles durante el proceso de build de Docker: + +1. 
**Agregar declaraciones ARG** a tu Dockerfile: + +```dockerfile +FROM node:20-alpine AS build + +# Declarar argumentos de build para variables de entorno +ARG PUBLIC_URL +ARG NODE_ENV +ARG REACT_APP_API_URL +ARG REACT_APP_API_KEY +# Añade todas tus variables de entorno como ARG + +# Establecer directorio de trabajo +WORKDIR /workspace/app + +# Copiar el código al contenedor +COPY . . + +# Instalar dependencias +RUN yarn install + +# Construir la app (las variables ARG estarán disponibles como ENV durante el build) +RUN yarn run build + +# Etapa de producción +FROM nginx:latest +WORKDIR /usr/share/nginx/html +COPY ./deploy/nginx/default.conf /etc/nginx/conf.d/default.conf +COPY --from=build /workspace/app/build . +EXPOSE 3000 +CMD ["nginx", "-g", "daemon off;"] +``` + +2. **Configurar variables en SleakOps**: + - Ve al vargroup **Docker Args** de tu proyecto + - Añade todas las variables de entorno que tu app React necesite + - Estas serán pasadas como `--build-arg` durante el build de Docker + + + + + +En SleakOps, debes configurar las variables en el vargroup correcto: + +1. **Navegar a Vargroups**: Ve a tu proyecto → Configuración → Vargroups +2. **Usar el vargroup Docker Args**: Las variables de entorno para el build de React deben estar en el vargroup "Docker Args", no en el vargroup de runtime +3. **Agregar variables**: + ``` + PUBLIC_URL=https://tu-dominio.com + REACT_APP_API_URL=https://api.tu-dominio.com + REACT_APP_ENVIRONMENT=production + ``` + +**Importante**: React solo incluye en el build las variables de entorno que comienzan con `REACT_APP_`. 
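Para ilustrar por qué las variables deben existir en el momento del build: el bundler sustituye cada referencia `process.env.REACT_APP_*` por un literal en el JavaScript generado. Un boceto muy simplificado de esa sustitución (la función `inlineEnv` y los valores son inventados para el ejemplo):

```javascript
// Sustitución en tiempo de build (simplificada): las referencias a
// process.env.REACT_APP_* se reemplazan por literales en el código generado.
const env = { REACT_APP_API_URL: "https://api.example.com" };

function inlineEnv(source, env) {
  return source.replace(
    /process\.env\.(REACT_APP_[A-Z0-9_]+)/g,
    (_, name) => JSON.stringify(env[name]) // sin valor durante el build => "undefined"
  );
}

const src = 'fetch(process.env.REACT_APP_API_URL + "/users");';
console.log(inlineEnv(src, env));
// Imprime: fetch("https://api.example.com" + "/users");
```

Por eso una variable ausente durante el build de Docker termina como `undefined` en el navegador, aunque el vargroup de runtime sí la defina.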
+ + + + + +Para tener variables de entorno verdaderamente en tiempo de ejecución, considera migrar a un framework con capacidades SSR: + +**Beneficios de migrar a Next.js:** + +- Variables de entorno disponibles en tiempo de ejecución +- Capacidades de renderizado del lado servidor +- Mejor SEO y rendimiento +- Configuración en runtime sin necesidad de reconstruir + +**Configuración básica de Next.js:** + +```javascript +// next.config.js +module.exports = { + env: { + CUSTOM_KEY: process.env.CUSTOM_KEY, + }, + // O usar configuración en runtime + publicRuntimeConfig: { + apiUrl: process.env.API_URL, + }, +}; +``` + +**Dockerfile para Next.js:** + +```dockerfile +FROM node:20-alpine +WORKDIR /app +COPY package*.json ./ +RUN npm install +COPY . . +RUN npm run build +EXPOSE 3000 +CMD ["npm", "start"] +``` + + + + + +Para verificar que tu solución funciona: + +1. **Revisar logs de build**: Asegúrate de que las variables ARG estén disponibles durante el build +2. **Inspeccionar archivos construidos**: Busca tus variables en el JavaScript generado +3. **Probar en el navegador**: Usa herramientas de desarrollo para verificar los valores de `process.env` +4. 
**Logs en consola**: Añade logs temporales para verificar la disponibilidad de variables + +```javascript +// Añade esto temporalmente para verificar variables +console.log("Variables de entorno:", { + apiUrl: process.env.REACT_APP_API_URL, + environment: process.env.REACT_APP_ENVIRONMENT, + nodeEnv: process.env.NODE_ENV, +}); +``` + + + + + +**Convenciones de Nomenclatura:** + +- Siempre prefija con `REACT_APP_` para variables del lado cliente +- Usa nombres descriptivos: `REACT_APP_API_BASE_URL` en lugar de `REACT_APP_URL` + +**Consideraciones de Seguridad:** + +- Nunca incluyas datos sensibles (claves API, secretos) en variables de entorno del lado cliente +- Todas las variables `REACT_APP_` son accesibles públicamente en el navegador +- Usa un proxy backend para llamadas API sensibles + +**Desarrollo vs Producción:** + +- Usa diferentes vargroups para distintos entornos +- Prueba con variables de entorno similares a producción durante el desarrollo +- Documenta todas las variables de entorno requeridas en tu README + + + +--- + +_Este FAQ fue generado automáticamente el 19 de diciembre de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/redis-connection-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/redis-connection-configuration.mdx new file mode 100644 index 000000000..3a6281890 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/redis-connection-configuration.mdx @@ -0,0 +1,232 @@ +--- +sidebar_position: 3 +title: "Problemas de Configuración de Conexión Redis" +description: "Solución para problemas de conexión de dependencia Redis en proyectos SleakOps" +date: "2024-01-15" +category: "dependency" +tags: ["redis", "conexión", "aws", "elasticache", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Configuración de Conexión Redis + 
+**Fecha:** 15 de enero de 2024 +**Categoría:** Dependencia +**Etiquetas:** Redis, Conexión, AWS, ElastiCache, Configuración + +## Descripción del Problema + +**Contexto:** El usuario creó una dependencia Redis para su proyecto SleakOps que generó el grupo de variables correspondiente, pero la aplicación Spring Boot no puede conectarse a la instancia Redis a pesar de tener la configuración correcta de URL y puerto. + +**Síntomas Observados:** + +- Dependencia Redis creada exitosamente en SleakOps +- Grupo de variables generado con la variable `CACHE_REDIS_URL` +- La aplicación Spring Boot falla al conectarse con `RedisConnectionException` +- El error muestra "Unable to connect to cache-xxx.amazonaws.com:6379" +- Los intentos de conexión terminan en tiempo de espera o son interrumpidos + +**Configuración Relevante:** + +- Instancia Redis: AWS ElastiCache +- Versión Spring Data Redis: 3.3.2 +- Versión cliente Lettuce: 6.3.2.RELEASE +- Formato actual de URL: `redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379` +- Puerto: 6379 (puerto estándar de Redis) + +**Condiciones de Error:** + +- La conexión falla durante el inicio de la aplicación +- El error ocurre cuando Spring intenta establecer una conexión reactiva a Redis +- Se probaron múltiples formatos de URL sin éxito +- El problema persiste con diferentes configuraciones de URL + +## Solución Detallada + + + +Para instancias Redis de AWS ElastiCache en SleakOps, el formato correcto de la URL debe ser: + +``` +redis://[hostname]:[puerto]/[número_de_base_de_datos] +``` + +**Configuración estándar:** + +``` +CACHE_REDIS_URL=redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +``` + +**Puntos clave:** + +- Siempre incluir el número de base de datos (usualmente `/0` para la base de datos por defecto) +- El puerto `6379` es el puerto estándar de Redis +- Usar el prefijo de protocolo `redis://` para conexiones sin SSL +- Usar `rediss://` para conexiones SSL si tu ElastiCache tiene cifrado en 
tránsito habilitado + + + + + +El problema de conexión podría estar relacionado con la configuración de red: + +**1. Revisar Grupos de Seguridad:** + +- Asegúrate que el grupo de seguridad de tu clúster EKS permita tráfico saliente en el puerto 6379 +- Verifica que el grupo de seguridad de ElastiCache permita tráfico entrante desde tu clúster EKS + +**2. Configuración de Subredes:** + +- ElastiCache y el clúster EKS deben estar en la misma VPC +- Las subredes deben tener una configuración de enrutamiento adecuada + +**3. Probar conectividad desde un pod:** + +```bash +# Crear un pod de prueba +kubectl run redis-test --image=redis:alpine --rm -it -- sh + +# Dentro del pod, probar conexión +redis-cli -h cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com -p 6379 ping +``` + + + + + +Asegúrate que tu aplicación Spring Boot esté configurada correctamente: + +**application.yml:** + +```yaml +spring: + data: + redis: + url: ${CACHE_REDIS_URL} + timeout: 10s + lettuce: + pool: + max-active: 8 + max-idle: 8 + min-idle: 0 +``` + +**O usando propiedades individuales:** + +```yaml +spring: + data: + redis: + host: cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com + port: 6379 + database: 0 + timeout: 10s +``` + + + + + +Si tu instancia ElastiCache tiene cifrado en tránsito habilitado: + +**1. Usar protocolo `rediss://`:** + +``` +CACHE_REDIS_URL=rediss://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +``` + +**2. Configurar SSL en Spring Boot:** + +```yaml +spring: + data: + redis: + url: ${CACHE_REDIS_URL} + ssl: + enabled: true +``` + +**3. Verificar configuración de ElastiCache:** + +- Ir a Consola AWS → ElastiCache → Clústeres Redis +- Verificar si "Encryption in transit" está habilitado +- Si está habilitado, debes usar conexión SSL + + + + + +En SleakOps, asegúrate que tus variables Redis estén configuradas correctamente: + +**1. 
Revisar grupo de variables:** + +- Ir a tu proyecto → Dependencias → Redis +- Verificar que estén presentes todas las variables necesarias: + - `CACHE_REDIS_URL` + - `CACHE_REDIS_HOST` (si usas propiedades individuales) + - `CACHE_REDIS_PORT` (si usas propiedades individuales) + +**2. Formato de variables:** + +``` +CACHE_REDIS_URL=redis://cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com:6379/0 +CACHE_REDIS_HOST=cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com +CACHE_REDIS_PORT=6379 +``` + +**3. Aplicar cambios:** + +- Después de modificar variables, redepliega tu aplicación +- Las variables se inyectan durante el despliegue + + + + + +**1. Habilitar registro de conexión Redis:** + +```yaml +logging: + level: + org.springframework.data.redis: DEBUG + io.lettuce.core: DEBUG +``` + +**2. Revisar logs de la aplicación para errores detallados:** + +```bash +kubectl logs -f deployment/your-app-name +``` + +**3. Verificar estado de ElastiCache:** + +- Consola AWS → ElastiCache → Clústeres Redis +- Asegurarse que el estado sea "Available" +- Revisar si hay ventanas de mantenimiento + +**4. Probar con Redis CLI desde máquina local:** + +```bash +# Si tienes acceso VPN a la VPC +redis-cli -h cache-9ca54715.ojs75q.0001.use2.cache.amazonaws.com -p 6379 ping +``` + +**5. 
Formatos comunes de URL para probar:** + +``` +# Conexión estándar +redis://hostname:6379/0 + +# Con autenticación (si AUTH está habilitado) +redis://:password@hostname:6379/0 + +# Conexión SSL +rediss://hostname:6379/0 +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/retool-rds-private-connection.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/retool-rds-private-connection.mdx new file mode 100644 index 000000000..346dbefd5 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/retool-rds-private-connection.mdx @@ -0,0 +1,208 @@ +--- +sidebar_position: 3 +title: "Conectando Retool a Base de Datos RDS Privada" +description: "Cómo conectar el servicio externo Retool a una base de datos RDS privada en SleakOps" +date: "2024-08-30" +category: "dependencia" +tags: ["retool", "rds", "base de datos", "nlb", "privado", "conexión"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Conectando Retool a Base de Datos RDS Privada + +**Fecha:** 30 de agosto de 2024 +**Categoría:** Dependencia +**Etiquetas:** Retool, RDS, Base de datos, NLB, Privado, Conexión + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan conectar los paneles externos de Retool a bases de datos RDS desplegadas a través de SleakOps. Anteriormente, podían modificar los grupos de seguridad para permitir acceso directo, pero SleakOps despliega instancias RDS en subredes privadas por seguridad. 
+ +**Síntomas Observados:** + +- No se puede conectar Retool directamente a la base de datos RDS +- RDS está desplegado en subredes privadas sin acceso a internet +- Las modificaciones en grupos de seguridad no proporcionan conectividad externa +- Retool requiere una lista de IPs permitidas específica para la conexión + +**Configuración Relevante:** + +- Base de datos RDS: PostgreSQL en subredes privadas +- IPs de Retool para permitir: `35.90.103.132/30`, `44.208.168.68/30` +- VPC de SleakOps con arquitectura de subredes privadas/públicas +- Grupos de seguridad gestionados por SleakOps + +**Condiciones de Error:** + +- Tiempo de espera agotado al intentar conectar RDS desde Retool +- Resolución DNS puede fallar para endpoints privados de RDS +- Problemas de conectividad en el puerto (típicamente puerto 5432 para PostgreSQL) + +## Solución Detallada + + + +SleakOps despliega bases de datos RDS en subredes privadas siguiendo las mejores prácticas de seguridad. Esto significa: + +- **Subredes privadas**: Sin acceso directo a internet +- **Aislamiento de seguridad**: Protegido de amenazas externas +- **Acceso solo dentro de la VPC**: Solo recursos dentro de la VPC pueden conectar + +Para conectar servicios externos como Retool, es necesario crear un puente entre la base de datos privada y el internet. + + + + + +El enfoque recomendado es usar un Network Load Balancer (NLB) como se describe en la [documentación de Retool](https://docs.retool.com/center-of-excellence/patterns/AWS/Connect/privateresource). + +**Pasos para implementar:** + +1. **Crear Network Load Balancer** + + - Desplegar en subredes públicas + - Asignar direcciones IP elásticas + - Configurar para tráfico TCP en puerto 5432 + +2. **Configurar Grupo de Destino** + + - Tipo: Direcciones IP + - Protocolo: TCP + - Puerto: 5432 + - Destino: IP privada de la instancia RDS + +3. 
**Actualizar Grupos de Seguridad** + + - Grupo de seguridad del NLB: Permitir entrada desde IPs de Retool + - Grupo de seguridad de RDS: Permitir entrada desde NLB + +4. **Configurar Conexión en Retool** + - Usar la IP elástica del NLB como host de base de datos + - Puerto estándar 5432 + - Mismas credenciales de base de datos + +**Consideración de costo:** Aproximadamente 20 USD al mes más costos de transferencia de datos. + + + + + +Para el enfoque con NLB, configure los grupos de seguridad como sigue: + +**Grupo de Seguridad del NLB:** + +``` +Reglas de Entrada: +- Tipo: TCP personalizado +- Puerto: 5432 +- Origen: 35.90.103.132/30 (rango IP Retool 1) +- Origen: 44.208.168.68/30 (rango IP Retool 2) + +Reglas de Salida: +- Tipo: TCP personalizado +- Puerto: 5432 +- Destino: ID del grupo de seguridad de RDS +``` + +**Grupo de Seguridad de RDS (agregar regla):** + +``` +Reglas de Entrada: +- Tipo: PostgreSQL +- Puerto: 5432 +- Origen: ID del grupo de seguridad del NLB +``` + + + + + +Si NLB no es adecuado, considere estas alternativas: + +**1. Retool autoalojado en el clúster** + +- Desplegar Retool directamente en su clúster Kubernetes +- Requiere recursos significativos (16GB RAM, 8 CPUs) +- Costos de infraestructura más altos pero mejor seguridad +- Acceso directo a recursos privados + +**2. Réplica de lectura RDS en subredes públicas** + +- Crear réplica de solo lectura en subredes públicas +- Usar solo para informes/paneles +- Mantiene seguridad de la base de datos primaria +- Costo adicional por instancia réplica + +**3. 
Conexión VPN (si está soportada)** + +- Algunos planes de Retool soportan conexiones VPN +- Consultar con soporte de Retool para disponibilidad +- Opción más segura si está disponible +- Puede requerir plan empresarial de Retool + + + + + +Para probar la conectividad antes de configurar Retool: + +**Desde su máquina local con VPN:** + +```bash +# Probar conectividad de puerto +nmap -Pn -p 5432 your-rds-endpoint.amazonaws.com + +# Probar conexión a base de datos +psql -h your-rds-endpoint.amazonaws.com -p 5432 -U username -d database_name +``` + +**Desde dentro del clúster:** + +```bash +# Crear pod de prueba +kubectl run postgres-client --rm -it --image=postgres:13 -- bash + +# Dentro del pod +psql -h your-rds-endpoint.amazonaws.com -p 5432 -U username -d database_name +``` + +**Después de configurar el NLB:** + +```bash +# Probar a través del NLB (desde internet) +nmap -Pn -p 5432 your-nlb-elastic-ip +``` + + + + + +**Tiempos de espera en conexión:** + +- Verificar que las reglas de grupos de seguridad estén configuradas correctamente +- Revisar el estado de salud del grupo de destino del NLB +- Asegurar que RDS esté en estado activo + +**Problemas de resolución DNS:** + +- Usar direcciones IP en lugar de nombres de host para pruebas +- Verificar configuración DNS en su red + +**Fallos de autenticación:** + +- Confirmar que las credenciales de base de datos sean correctas +- Revisar si el usuario de base de datos tiene permisos necesarios +- Verificar que el nombre de la base de datos sea correcto + +**Fallos en chequeos de salud del NLB:** + +- Asegurar que el grupo de destino apunta a la IP privada correcta de RDS +- Verificar que el grupo de seguridad de RDS permita tráfico desde el NLB +- Revisar tablas de ruteo de subredes de RDS + + + +--- + +_Esta FAQ fue generada automáticamente basada en una consulta real de usuario sobre cómo conectar Retool a bases de datos RDS privadas en SleakOps._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/s3-bucket-access-authentication.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/s3-bucket-access-authentication.mdx new file mode 100644 index 000000000..968fbb773 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/s3-bucket-access-authentication.mdx @@ -0,0 +1,550 @@ +--- +sidebar_position: 3 +title: "Problemas de Acceso y Autenticación en Buckets S3" +description: "Resolución de problemas de acceso a buckets AWS S3 con roles IAM y autenticación entre proyectos" +date: "2025-02-11" +category: "dependency" +tags: ["s3", "aws", "autenticación", "boto3", "iam", "bucket"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Acceso y Autenticación en Buckets S3 + +**Fecha:** 11 de febrero de 2025 +**Categoría:** Dependencia +**Etiquetas:** S3, AWS, Autenticación, Boto3, IAM, Bucket + +## Descripción del Problema + +**Contexto:** El usuario creó un bucket S3 privado a través de SleakOps y está experimentando problemas de autenticación al acceder a él desde aplicaciones Python usando boto3. El bucket funciona con credenciales explícitas de AWS pero falla con autenticación basada en roles IAM dentro del clúster. 
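Los dos modos de autenticación descritos pueden aislarse con un helper mínimo (sketch ilustrativo, no es código de SleakOps): si hay credenciales explícitas en el entorno se usan, y si no, boto3 recae en la cadena de credenciales por defecto (el rol IAM del pod).

```python
import os

def kwargs_cliente_s3(region="us-east-1"):
    """Construye los argumentos para boto3.client('s3').

    Con AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY definidos se usan
    credenciales explícitas (el caso que funciona fuera del clúster);
    sin ellas, boto3 usa la cadena por defecto (rol IAM del pod).
    """
    kwargs = {"region_name": region}
    access = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if access and secret:
        kwargs["aws_access_key_id"] = access
        kwargs["aws_secret_access_key"] = secret
    return kwargs

# Uso: s3 = boto3.client("s3", **kwargs_cliente_s3())
```

Si la variante sin credenciales devuelve 403 dentro del pod, el problema está en los permisos del rol IAM, no en el código de la aplicación.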
+ +**Síntomas Observados:** + +- `botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden` +- La autenticación funciona fuera del clúster con credenciales explícitas +- La autenticación falla dentro del clúster sin credenciales explícitas +- Necesidad de acceder al bucket S3 desde diferentes proyectos/servicios + +**Configuración Relevante:** + +- Bucket: Bucket S3 privado creado a través de SleakOps +- Biblioteca: boto3 (Python) +- Entorno: Clúster EKS con roles IAM +- Patrón de acceso: Se requiere acceso tanto dentro del mismo proyecto como entre proyectos + +**Condiciones de Error:** + +- El error ocurre al usar autenticación con rol IAM dentro de pods +- El problema aparece durante operaciones HeadObject y otras de S3 +- Funciona con AWS_ACCESS_KEY_ID y AWS_SECRET_ACCESS_KEY explícitos +- Falla al confiar en identidad del pod o roles de cuenta de servicio + +## Solución Detallada + + + +SleakOps provee autenticación automática basada en roles IAM para buckets S3 creados dentro de proyectos. Esto significa: + +1. **Acceso dentro del mismo proyecto**: No se necesitan credenciales explícitas +2. **Acceso entre proyectos**: Requiere configuración adicional +3. **Acceso externo**: Requiere credenciales explícitas o URLs prefirmadas + +```python +# Dentro del mismo proyecto - no se necesitan credenciales +import boto3 + +s3_client = boto3.client('s3', region_name='us-east-1') +# Esto debería funcionar automáticamente +``` + + + + + +Para depurar problemas de autenticación: + +1. **Entrar a un pod en tu proyecto**: + +```bash +kubectl exec -it <nombre-del-pod> -- /bin/bash +``` + +2. **Instalar AWS CLI** (si no está presente): + +```bash +apt-get update && apt-get install -y awscli +# o +pip install awscli +``` + +3. **Limpiar credenciales existentes**: + +```bash +unset AWS_ACCESS_KEY_ID +unset AWS_SECRET_ACCESS_KEY +unset AWS_SESSION_TOKEN +``` + +4. 
**Probar autenticación**: + +```bash +aws s3 ls +# Debería listar los buckets si la autenticación funciona +``` + +5. **Probar acceso a bucket específico**: + +```bash +aws s3 ls s3://nombre-de-tu-bucket +``` + + + + + +Para acceso S3 dentro del mismo proyecto: + +```python +import boto3 +from botocore.exceptions import ClientError + +# Inicializar cliente S3 sin credenciales explícitas +# El rol IAM se usará automáticamente +s3_client = boto3.client( + 's3', + region_name='us-east-1' # Especifica tu región +) + +try: + # Probar acceso al bucket + response = s3_client.head_bucket(Bucket='nombre-de-tu-bucket') + print("Acceso al bucket exitoso") +except ClientError as e: + print(f"Error al acceder al bucket: {e}") +``` + +Para listar objetos: + +```python +try: + response = s3_client.list_objects_v2(Bucket='nombre-de-tu-bucket') + for obj in response.get('Contents', []): + print(f"Objeto: {obj['Key']}") +except ClientError as e: + print(f"Error al listar objetos: {e}") +``` + + + + + +Para acceder a un bucket S3 desde un proyecto diferente: + +1. **En el panel de SleakOps**: + + - Ve al **proyecto origen** (donde se creó el bucket) + - Navega a **Configuración del Proyecto** → **Configuración de Acceso** + - Añade el proyecto destino o cuenta de servicio + +2. **Conceder permisos específicos**: + + - Selecciona el recurso bucket S3 + - Elige los permisos adecuados (lectura, escritura, eliminación) + - Guarda la configuración + +3. 
**Usar el mismo método de autenticación**: + +```python +# No se necesitan cambios en el código - los roles IAM manejan el acceso entre proyectos +s3_client = boto3.client('s3', region_name='us-east-1') +``` + + + + + +Para acceso HTTP desde servicios externos o navegadores: + +```python +import boto3 +from botocore.exceptions import ClientError + +def generar_url_prefirmada(nombre_bucket, clave_objeto, expiracion=3600): + """Genera una URL prefirmada para acceso a objeto S3""" + s3_client = boto3.client('s3', region_name='us-east-1') + + try: + response = s3_client.generate_presigned_url( + 'get_object', + Params={'Bucket': nombre_bucket, 'Key': clave_objeto}, + ExpiresIn=expiracion + ) + return response + except ClientError as e: + print(f"Error al generar URL prefirmada: {e}") + return None + +# Uso +url = generar_url_prefirmada('tu-bucket', 'ruta/al/archivo.txt') +if url: + print(f"URL de descarga: {url}") +``` + +Para URLs de subida: + +```python +def generar_url_subida_prefirmada(nombre_bucket, clave_objeto, expiracion=3600): + """Genera una URL prefirmada para subir archivos""" + s3_client = boto3.client('s3', region_name='us-east-1') + + try: + response = s3_client.generate_presigned_url( + 'put_object', + Params={'Bucket': nombre_bucket, 'Key': clave_objeto}, + ExpiresIn=expiracion + ) + return response + except ClientError as e: + print(f"Error al generar URL de subida: {e}") + return None +``` + + + + + +Si aún recibes errores 403: + +1. **Revisar permisos del rol IAM**: + + - Verifica que la cuenta de servicio del pod tenga el rol IAM correcto + - Asegúrate que el rol tenga permisos S3 para tu bucket + +2. **Verificar la política del bucket**: + - Comprueba si el bucket tiene políticas restrictivas + - Asegúrate que tu rol IAM esté incluido en las políticas del bucket + +3. 
**Verificar configuración de región**: + +```python +import boto3 +from botocore.exceptions import ClientError + +# Asegúrate de usar la región correcta +s3_client = boto3.client('s3', region_name='us-east-1') # Cambia según tu región + +# Verificar región del bucket +try: +    response = s3_client.get_bucket_location(Bucket='tu-bucket') +    region = response['LocationConstraint'] or 'us-east-1' +    print(f"Región del bucket: {region}") +except ClientError as e: +    print(f"Error al obtener región: {e}") +``` + +4. **Verificar identidad actual**: + +```python +import boto3 +from botocore.exceptions import ClientError + +# Verificar qué identidad está siendo usada +sts_client = boto3.client('sts') +try: +    identity = sts_client.get_caller_identity() +    print(f"Usuario/Rol actual: {identity['Arn']}") +    print(f"Account ID: {identity['Account']}") +except ClientError as e: +    print(f"Error al obtener identidad: {e}") +``` + + + + + +**1. Configuración con múltiples buckets:** + +```python +import time + +import boto3 +from botocore.config import Config +from botocore.exceptions import ClientError + +# Configuración optimizada para múltiples buckets +config = Config( +    region_name='us-east-1', +    retries={ +        'max_attempts': 3, +        'mode': 'adaptive' +    }, +    max_pool_connections=50 +) + +s3_client = boto3.client('s3', config=config) + +# Función helper para operaciones S3 +def s3_operation_with_retry(operation, **kwargs): +    max_retries = 3 +    for attempt in range(max_retries): +        try: +            return operation(**kwargs) +        except ClientError as e: +            if attempt == max_retries - 1: +                raise +            print(f"Intento {attempt + 1} falló: {e}") +            time.sleep(2 ** attempt) # Backoff exponencial +``` + +**2. 
Configuración para aplicaciones web:** + +```python +# Para aplicaciones Django/Flask +import os +from django.conf import settings + +# Configuración en settings.py +AWS_STORAGE_BUCKET_NAME = os.environ.get('S3_BUCKET_NAME', 'tu-bucket-default') +AWS_S3_REGION_NAME = os.environ.get('AWS_REGION', 'us-east-1') +AWS_S3_CUSTOM_DOMAIN = f'{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com' + +# En tu aplicación +def get_s3_client(): + return boto3.client( + 's3', + region_name=settings.AWS_S3_REGION_NAME + ) +``` + +**3. Configuración para trabajos batch:** + +```python +# Para jobs que procesan muchos archivos +import concurrent.futures +import boto3 + +def process_s3_objects_parallel(bucket_name, prefix=''): + s3_client = boto3.client('s3') + + # Listar objetos + paginator = s3_client.get_paginator('list_objects_v2') + pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix) + + objects = [] + for page in pages: + objects.extend(page.get('Contents', [])) + + # Procesar en paralelo + with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor: + futures = [ + executor.submit(process_single_object, bucket_name, obj['Key']) + for obj in objects + ] + + for future in concurrent.futures.as_completed(futures): + try: + result = future.result() + print(f"Procesado: {result}") + except Exception as e: + print(f"Error procesando objeto: {e}") + +def process_single_object(bucket_name, object_key): + s3_client = boto3.client('s3') + # Tu lógica de procesamiento aquí + return f"Procesado {object_key}" +``` + + + + + +**1. 
Configurar logging detallado:** + +```python +import logging +import boto3 +from botocore.exceptions import ClientError + +# Configurar logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +# Habilitar logging de boto3 +boto3.set_stream_logger('boto3', logging.DEBUG) +boto3.set_stream_logger('botocore', logging.DEBUG) + +# Cliente con logging +s3_client = boto3.client('s3', region_name='us-east-1') + +def s3_operation_with_logging(operation_name, **kwargs): +    logger.info(f"Iniciando operación S3: {operation_name}") +    try: +        result = getattr(s3_client, operation_name)(**kwargs) +        logger.info(f"Operación {operation_name} exitosa") +        return result +    except ClientError as e: +        logger.error(f"Error en {operation_name}: {e}") +        raise +``` + +**2. Métricas personalizadas:** + +```python +import time +from functools import wraps + +import boto3 + +def measure_s3_operation(func): +    @wraps(func) +    def wrapper(*args, **kwargs): +        start_time = time.time() +        try: +            result = func(*args, **kwargs) +            duration = time.time() - start_time +            print(f"Operación {func.__name__} completada en {duration:.2f}s") +            return result +        except Exception as e: +            duration = time.time() - start_time +            print(f"Operación {func.__name__} falló después de {duration:.2f}s: {e}") +            raise +    return wrapper + +@measure_s3_operation +def upload_file_to_s3(bucket_name, file_path, object_key): +    s3_client = boto3.client('s3') +    s3_client.upload_file(file_path, bucket_name, object_key) +``` + +**3. 
Health checks para S3:** + +```python +import time + +import boto3 +from botocore.exceptions import ClientError + +def s3_health_check(bucket_name): +    """Verifica la conectividad y permisos básicos de S3""" +    s3_client = boto3.client('s3') + +    checks = { +        'bucket_exists': False, +        'can_list': False, +        'can_read': False, +        'can_write': False +    } + +    try: +        # Verificar si el bucket existe +        s3_client.head_bucket(Bucket=bucket_name) +        checks['bucket_exists'] = True + +        # Verificar permisos de listado +        s3_client.list_objects_v2(Bucket=bucket_name, MaxKeys=1) +        checks['can_list'] = True + +        # Verificar permisos de escritura +        test_key = f"health-check-{int(time.time())}.txt" +        s3_client.put_object( +            Bucket=bucket_name, +            Key=test_key, +            Body=b"health check" +        ) +        checks['can_write'] = True + +        # Verificar permisos de lectura +        s3_client.get_object(Bucket=bucket_name, Key=test_key) +        checks['can_read'] = True + +        # Limpiar archivo de prueba +        s3_client.delete_object(Bucket=bucket_name, Key=test_key) + +    except ClientError as e: +        print(f"Error en health check: {e}") + +    return checks +``` + + + + + +**1. Gestión de credenciales:** + +- Nunca hardcodees credenciales AWS en tu código +- Usa variables de entorno solo para desarrollo local +- Confía en los roles IAM de SleakOps para producción +- Rota credenciales regularmente si usas acceso externo + +**2. Optimización de rendimiento:** + +```python +# Configuración optimizada para alto rendimiento +import boto3 +from boto3.s3.transfer import TransferConfig +from botocore.config import Config + +config = Config( +    region_name='us-east-1', +    retries={'max_attempts': 3, 'mode': 'adaptive'}, +    max_pool_connections=50 +) + +s3_client = boto3.client('s3', config=config) + +# Los parámetros de transferencia (multipart, ancho de banda) no van en +# botocore Config: se pasan como TransferConfig a upload_file/download_file +transfer_config = TransferConfig( +    multipart_threshold=64 * 1024 * 1024, # 64 MB +    multipart_chunksize=16 * 1024 * 1024, # 16 MB +    max_concurrency=10, +    max_bandwidth=100 * 1024 * 1024 # 100 MB/s +) + +# Uso: s3_client.upload_file(ruta, bucket, clave, Config=transfer_config) +``` + +**3. 
Manejo de errores robusto:** + +```python +import random +import time +from botocore.exceptions import ClientError, NoCredentialsError + +def robust_s3_operation(operation, max_retries=3, **kwargs): +    for attempt in range(max_retries): +        try: +            return operation(**kwargs) +        except NoCredentialsError: +            raise Exception("No se encontraron credenciales AWS") +        except ClientError as e: +            error_code = e.response['Error']['Code'] + +            if error_code in ['NoSuchBucket', 'NoSuchKey']: +                raise # No reintentar para errores permanentes + +            if attempt == max_retries - 1: +                raise + +            wait_time = (2 ** attempt) + random.uniform(0, 1) +            time.sleep(wait_time) +``` + +**4. Seguridad:** + +- Usa HTTPS siempre para operaciones S3 +- Implementa validación de integridad para archivos críticos +- Usa cifrado en tránsito y en reposo +- Audita accesos regularmente + +```python +# Ejemplo con validación de integridad +import hashlib + +import boto3 + +def upload_with_integrity_check(bucket_name, file_path, object_key): +    # Calcular hash del archivo +    with open(file_path, 'rb') as f: +        file_hash = hashlib.md5(f.read()).hexdigest() + +    # Subir archivo +    s3_client = boto3.client('s3') +    s3_client.upload_file( +        file_path, +        bucket_name, +        object_key, +        ExtraArgs={'Metadata': {'md5': file_hash}} +    ) + +    # Verificar integridad +    response = s3_client.head_object(Bucket=bucket_name, Key=object_key) +    stored_hash = response['Metadata'].get('md5') + +    if stored_hash != file_hash: +        raise Exception("Error de integridad: hash no coincide") +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 11 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/security-ddos-protection-aws-waf.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/security-ddos-protection-aws-waf.mdx new file mode 100644 index 000000000..129d4e02c --- /dev/null +++ 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/security-ddos-protection-aws-waf.mdx @@ -0,0 +1,214 @@ +--- +sidebar_position: 3 +title: "Protección contra DDoS y Prevención de Ataques de Bots para Servicios Web" +description: "Configura AWS WAF y CloudFront para protección contra DDoS y prevención de ataques de bots en los servicios web de SleakOps" +date: "2024-12-19" +category: "workload" +tags: + ["seguridad", "ddos", "waf", "cloudfront", "aws", "servicioweb", "protección"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Protección contra DDoS y Prevención de Ataques de Bots para Servicios Web + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Workload +**Etiquetas:** Seguridad, DDoS, WAF, CloudFront, AWS, ServicioWeb, Protección + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan entender qué protección contra DDoS y ataques de bots está disponible para los servicios web desplegados en SleakOps, y si se requieren medidas adicionales de CDN o seguridad. 
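Una de las protecciones centrales de esta guía es la regla de limitación de tasa de WAF. Conceptualmente, una regla *rate-based* cuenta las solicitudes por IP dentro de una ventana deslizante (típicamente 5 minutos) y bloquea a quien supere el límite. Un sketch ilustrativo de esa lógica (los valores por defecto son de ejemplo, no los de AWS):

```python
import time
from collections import defaultdict, deque

class LimitadorDeTasa:
    """Ilustra la lógica de una regla rate-based de WAF: bloquea una IP
    que supere `limite` solicitudes dentro de `ventana` segundos."""

    def __init__(self, limite=2000, ventana=300.0):
        self.limite = limite
        self.ventana = ventana
        self._historial = defaultdict(deque)

    def permitir(self, ip, ahora=None):
        ahora = time.monotonic() if ahora is None else ahora
        cola = self._historial[ip]
        # Descartar solicitudes que salieron de la ventana deslizante
        while cola and ahora - cola[0] > self.ventana:
            cola.popleft()
        if len(cola) >= self.limite:
            return False  # WAF respondería con un 403
        cola.append(ahora)
        return True
```

En AWS esto se expresa como una regla de tipo `RateBasedStatement` dentro del Web ACL; el sketch solo ilustra el comportamiento, no la implementación del servicio.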
+ +**Síntomas Observados:** + +- Incertidumbre sobre la protección DDoS existente para servicios web +- Preguntas sobre si CloudFront ya está implementado +- Necesidad de entender si se requiere configuración adicional de CDN +- Preocupaciones sobre la prevención de ataques de bots + +**Configuración Relevante:** + +- Plataforma: Despliegue de SleakOps basado en AWS +- Tipo de servicio: Servicios web con balanceadores de carga públicos +- Requisito de seguridad: Protección contra DDoS y ataques de bots +- Servicios AWS: WAF, CloudFront, Balanceador de carga + +**Condiciones de Error:** + +- Los servicios web pueden ser vulnerables a ataques DDoS sin la protección adecuada +- Los ataques de bots podrían afectar la disponibilidad del servicio +- Falta de claridad sobre las medidas de seguridad existentes + +## Solución Detallada + + + +Los servicios web de SleakOps no están protegidos automáticamente contra ataques DDoS por defecto. Es necesario configurar manualmente AWS WAF (Firewall de Aplicaciones Web) para proteger tus aplicaciones. + +**Puntos Clave:** + +- Los servicios web usan balanceadores de carga públicos que están expuestos a internet +- AWS provee protección contra DDoS a través de AWS WAF +- La protección debe configurarse explícitamente y asociarse a tu balanceador de carga + + + + + +Para configurar AWS WAF para tus servicios web de SleakOps: + +1. **Accede a la Consola de AWS** + + - Ingresa a la Consola de AWS + - Busca "WAF" en la barra de búsqueda + - Selecciona "AWS WAF & Shield" + +2. **Crear Web ACL** + + - Haz clic en "Crear web ACL" + - Elige "Recursos regionales (Application Load Balancer, API Gateway)" + - Selecciona la región donde está desplegado el clúster + +3. **Asociar con Balanceador de Carga** + + - En la sección "Recursos AWS asociados" + - Añade el balanceador de carga público de tu clúster + - El balanceador debería aparecer en la lista desplegable + +4. 
**Configurar Reglas** + ```yaml + # Ejemplo de configuración WAF + Rules: + - Reglas Administradas de AWS - Conjunto de Reglas Básicas + - Reglas Administradas de AWS - Entradas Maliciosas Conocidas + - Reglas Administradas de AWS - Base de Datos SQL + - Regla de Limitación de Tasa (personalizada) + ``` + + + + + +**Reglas Administradas de AWS (Recomendadas):** + +- Preconfiguradas por expertos en seguridad de AWS +- Actualizadas automáticamente para nuevas amenazas +- Más económicas que las reglas personalizadas +- Cubren patrones comunes de ataque: + - Inyección SQL + - Cross-site scripting (XSS) + - Direcciones IP maliciosas conocidas + - Protección contra bots + +**Reglas Personalizadas:** + +- Más costosas de mantener +- Requieren experiencia en seguridad para configurar +- Útiles para proteger lógica de negocio específica +- Pueden combinarse con reglas administradas + +**Configuración Recomendada:** + +```yaml +Grupos de Reglas Administradas: + - AWSManagedRulesCommonRuleSet + - AWSManagedRulesKnownBadInputsRuleSet + - AWSManagedRulesBotControlRuleSet + - AWSManagedRulesAmazonIpReputationList +``` + + + + + +**Beneficios de CloudFront:** + +- Sirve archivos estáticos de forma más eficiente +- Proporciona protección adicional contra DDoS en ubicaciones edge +- Reduce la carga en tus servicios web +- Mejora el rendimiento global + +**Cuándo Usar CloudFront:** + +- Tu aplicación sirve contenido estático (imágenes, CSS, JS) +- Tienes usuarios en múltiples regiones geográficas +- Deseas protección adicional más allá de WAF +- Necesitas reducir costos de ancho de banda + +**Configuración CloudFront + WAF:** + +```yaml +# Configuración de distribución CloudFront +Origin: tu-balanceador-sleakops.region.elb.amazonaws.com +Caching: + - Archivos estáticos: Cache por 24 horas + - Contenido dinámico: Sin cache o TTL corto +WAF: Asociar el mismo Web ACL de WAF con CloudFront +``` + + + + + +**Costos de AWS WAF:** + +- Web ACL: $1.00 por mes +- Reglas: $0.60 por millón de 
solicitudes +- Grupos de reglas administradas: $1.00-$10.00 por mes cada uno + +**Costos de CloudFront:** + +- Transferencia de datos: $0.085 por GB (varía según región) +- Solicitudes: $0.0075 por cada 10,000 solicitudes +- Costos adicionales de WAF si se aplica a CloudFront + +**Enfoque Rentable:** + +1. Comienza con AWS WAF solo en el balanceador de carga +2. Usa reglas administradas de AWS (más económicas que personalizadas) +3. Añade CloudFront si tienes contenido estático significativo +4. Monitorea costos y ajusta reglas según sea necesario + + + + + +**Fase 1: Protección Básica con WAF** + +1. Identifica el ARN del balanceador de carga de tu clúster +2. Crea Web ACL de WAF en la Consola de AWS +3. Añade grupos de reglas administradas: + - Conjunto de Reglas Básicas + - Entradas Maliciosas Conocidas + - Lista de Reputación IP +4. Asócialo con el balanceador de carga +5. Prueba y monitorea + +**Fase 2: Protección Mejorada (Opcional)** + +1. Añade reglas administradas de Control de Bots +2. Configura reglas de limitación de tasa +3. Configura monitoreo con CloudWatch +4. Crea reglas personalizadas si es necesario + +**Fase 3: Integración con CloudFront (Si es Necesario)** + +1. Crea distribución CloudFront +2. Apunta el origen a tu balanceador de carga +3. Configura políticas de caché +4. Asocia WAF con CloudFront +5. 
Actualiza DNS para apuntar a CloudFront + +**Monitoreo y Mantenimiento:** + +```bash +# Monitorear métricas de WAF (sustituya los valores entre <>) +aws wafv2 get-sampled-requests --web-acl-arn <arn-del-web-acl> --rule-metric-name <nombre-de-la-metrica> --scope REGIONAL --time-window StartTime=<inicio>,EndTime=<fin> --max-items 100 +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-management-multiple-domains.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-management-multiple-domains.mdx new file mode 100644 index 000000000..148c9b76b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-management-multiple-domains.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Gestión de Certificados SSL para Múltiples Dominios" +description: "Gestión de certificados SSL a través de regiones y múltiples subdominios en SleakOps" +date: "2024-12-19" +category: "proyecto" +tags: ["ssl", "certificados", "dominios", "aws", "cloudfront", "acm"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gestión de Certificados SSL para Múltiples Dominios + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** SSL, Certificados, Dominios, AWS, CloudFront, ACM + +## Descripción del Problema + +**Contexto:** Los usuarios que gestionan múltiples subdominios y certificados SSL en SleakOps pueden encontrar problemas con la validación de certificados, la ubicación regional y la reutilización de certificados en diferentes alias. 
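Gran parte de la confusión con subdominios viene de la regla de cobertura de los comodines: `*` cubre exactamente una etiqueta DNS, de modo que `*.develop.domain.com` cubre `app.develop.domain.com` pero **no** `media.app.develop.domain.com`. Un sketch de esa regla de coincidencia (ilustrativo, siguiendo el criterio de RFC 6125):

```python
def comodin_cubre(patron, host):
    """Indica si un nombre de certificado (posiblemente con comodín)
    cubre un host: '*' vale por una sola etiqueta DNS."""
    etiquetas_patron = patron.lower().split(".")
    etiquetas_host = host.lower().split(".")
    if len(etiquetas_patron) != len(etiquetas_host):
        return False
    return all(
        p == "*" or p == h
        for p, h in zip(etiquetas_patron, etiquetas_host)
    )
```

Esto explica el síntoma típico: un comodín de un nivel valida `static.app.develop.domain.com` solo si el certificado lo incluye explícitamente o si el comodín está al nivel correcto.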
+ +**Síntomas Observados:** + +- Certificado SSL no valida para subdominios específicos (por ejemplo, media.app.develop.domain.com) +- El certificado funciona para algunos subdominios pero no para otros (por ejemplo, static.app.develop.domain.com funciona) +- Certificados aparecen en regiones AWS incorrectas (us-east-1 en lugar de la región prevista) +- Confusión sobre cuándo crear certificados individuales vs certificados comodín + +**Configuración Relevante:** + +- Múltiples subdominios bajo el mismo dominio principal +- Certificados de AWS Certificate Manager (ACM) +- Requisitos de distribución en CloudFront +- Requisitos de ubicación regional de certificados + +**Condiciones de Error:** + +- Fallos en la validación de certificados para subdominios específicos +- Certificados creados en us-east-1 cuando se esperan en otras regiones +- Incertidumbre sobre la estrategia adecuada de gestión de certificados + +## Solución Detallada + + + +SleakOps tiene una función automática de reutilización de certificados que: + +- **Reutiliza certificados existentes** al agregar nuevos alias que pertenecen a un dominio ya gestionado +- **Detecta automáticamente** si un subdominio pertenece a un dominio con certificado existente +- **Evita certificados duplicados** para la misma jerarquía de dominio + +Esta función fue implementada para optimizar la gestión de certificados y reducir los límites de AWS ACM. 
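La lógica de reutilización puede esbozarse así (código hipotético, solo para ilustrar la idea; no es la implementación real de SleakOps): un alias nuevo se asocia al certificado comodín de su dominio padre si este ya existe, y solo se emite un certificado nuevo cuando no hay cobertura.

```python
def dominio_padre(alias):
    """Quita la primera etiqueta: 'media.develop.domain.com' -> 'develop.domain.com'."""
    return alias.split(".", 1)[1]

def certificado_existente(alias, certificados):
    """Busca un certificado comodín ya emitido que cubra el alias.
    `certificados` mapea nombre de certificado -> ARN (estructura hipotética)."""
    comodin = "*." + dominio_padre(alias)
    return certificados.get(comodin)  # None => hay que emitir uno nuevo
```

Por ejemplo, con un certificado `*.develop.domain.com` ya emitido, el alias `media.develop.domain.com` lo reutiliza, mientras que `media.app.develop.domain.com` (un nivel más profundo) requeriría uno nuevo.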
+ + + + + +**Requisito de Certificado para CloudFront:** + +Los certificados en `us-east-1` son específicamente para distribuciones de CloudFront: + +- **URLs de contenido estático** (como `static.app.develop.domain.com`) usan CloudFront +- **CloudFront requiere** que los certificados SSL estén en la región `us-east-1` +- **Esto es un requisito de AWS**, no un problema de configuración de SleakOps + +**Distribución Regional:** + +- Certificados de aplicación: desplegados en la región elegida (por ejemplo, Ohio) +- Certificados para CloudFront: siempre en `us-east-1` + + + + + +Para solucionar problemas de validación de certificados: + +**Paso 1: Eliminar certificados existentes** + +1. Ve al panel de tu proyecto en SleakOps +2. Navega a **Dominios y SSL** +3. Elimina los certificados problemáticos +4. **Importante:** Esto no romperá los servicios existentes inmediatamente + +**Paso 2: Recrear certificados** + +1. Agrega tus dominios nuevamente en SleakOps +2. La plataforma creará automáticamente certificados optimizados +3. SleakOps reutilizará certificados cuando sea apropiado + +**Paso 3: Verificar validación** + +1. Comprueba que todos los subdominios validen correctamente +2. Monitorea el estado del certificado en AWS ACM +3. 
Prueba todas las URLs afectadas + + + + + +**Enfoque Recomendado:** + +Para dominios como `develop.domain.com` con múltiples subdominios: + +``` +# Configuración óptima de certificados +Certificado 1: *.develop.domain.com, develop.domain.com +- Cubre: app.develop.domain.com +- Cubre: api.develop.domain.com +- Cubre: media.develop.domain.com +- Cubre: develop.domain.com (apex) + +Certificado 2: static.develop.domain.com (solo us-east-1) +- Para distribución CloudFront +- Debe estar en la región us-east-1 +``` + +**Beneficios:** + +- Reduce la cantidad de certificados +- Simplifica la gestión +- Cubre todos los subdominios actuales y futuros +- Cumple con los requisitos regionales de AWS + + + + + +**Problemas comunes de validación y soluciones:** + +**Problema 1: La validación DNS no se completa** + +- Verifica que los registros DNS estén configurados correctamente +- Confirma que la propiedad del dominio esté validada en AWS +- Asegúrate que la propagación DNS haya finalizado (puede tardar hasta 24 horas) + +**Problema 2: El certificado no se aplica al subdominio** + +- Confirma que el certificado incluya el subdominio específico +- Revisa que el certificado comodín cubra el patrón del subdominio +- Verifica que el certificado esté en la región correcta para el servicio + +**Problema 3: Regiones mixtas de certificados** + +- Servicios de aplicación: usar certificados en la región de despliegue +- Servicios CloudFront: deben usar certificados en us-east-1 +- Esto es un comportamiento normal y esperado + +**Comandos de verificación:** + +```bash +# Ver detalles del certificado +aws acm describe-certificate --certificate-arn arn:aws:acm:region:account:certificate/cert-id + +# Probar conexión SSL +openssl s_client -connect your-domain.com:443 -servername your-domain.com +``` + + + + + +**Planificación de tu estrategia de certificados:** + +1. **Agrupa dominios relacionados**: usa certificados comodín para múltiples subdominios +2. 
**Entiende los requisitos regionales**: acepta que los certificados CloudFront estarán en us-east-1 +3. **Usa la automatización de SleakOps**: permite que la plataforma maneje la reutilización y optimización de certificados +4. **Monitorea expiraciones**: configura alertas para la renovación de certificados +5. **Documenta tu configuración**: lleva un registro de qué certificados sirven a qué dominios + +**Cuándo recrear certificados:** + +- Problemas de validación que persisten tras la propagación DNS +- Necesidad de agregar muchos subdominios nuevos a la configuración existente +- Consolidar múltiples certificados individuales en comodines +- Migrar entre diferentes estructuras de dominio + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-subdomain-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-subdomain-issues.mdx new file mode 100644 index 000000000..db2d529ac --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-subdomain-issues.mdx @@ -0,0 +1,247 @@ +--- +sidebar_position: 3 +title: "Problemas con el Certificado SSL en Subdominios de API" +description: "Solución de problemas con certificados SSL para endpoints de API y subdominios" +date: "2024-12-19" +category: "cluster" +tags: ["ssl", "certificados", "https", "api", "seguridad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas con el Certificado SSL en Subdominios de API + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Cluster +**Etiquetas:** SSL, Certificados, HTTPS, API, Seguridad + +## Descripción del Problema + +**Contexto:** Los usuarios reportan advertencias de certificado SSL al acceder a endpoints específicos de la API, mientras que el dominio principal aparece seguro. 
Esto ocurre comúnmente cuando las aplicaciones tienen múltiples subdominios o rutas de API que requieren una cobertura adecuada del certificado SSL. + +**Síntomas Observados:** + +- El navegador muestra advertencia de "No Seguro" para endpoints específicos de la API +- El dominio principal (por ejemplo, `https://api.dominio.com`) muestra certificado válido +- Rutas específicas de la API (por ejemplo, `https://api.dominio.com/api/v3/...`) generan advertencias de seguridad +- Los navegadores móviles pueden ser más sensibles a problemas de certificado +- Las descargas de archivos o llamadas a la API pueden fallar debido a la validación SSL + +**Configuración Relevante:** + +- Dominio: Subdominio API con múltiples endpoints +- Tipo de certificado: Probablemente certificado para un solo dominio o cobertura wildcard insuficiente +- Plataforma: Clúster Kubernetes con controlador ingress +- Acceso cliente: Navegadores móviles y web + +**Condiciones de Error:** + +- El error aparece en endpoints específicos de la API pero no en el dominio raíz +- El problema ocurre en diferentes navegadores y dispositivos +- Afecta descargas de archivos y respuestas de la API +- La validación del certificado falla para ciertas rutas + +## Solución Detallada + + + +Primero, revise la configuración actual de su certificado SSL: + +```bash +# Ver detalles del certificado para su dominio +openssl s_client -connect api.sudominio.com:443 -servername api.sudominio.com + +# O use herramientas en línea como SSL Labs +# https://www.ssllabs.com/ssltest/ +``` + +Revise: + +- Fechas de validez del certificado +- Nombres Alternativos del Sujeto (SAN) +- Integridad de la cadena de certificados +- Compatibilidad del conjunto de cifrado + + + + + +Asegúrese de que su ingress de Kubernetes esté configurado correctamente para SSL: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: api-ingress + annotations: + kubernetes.io/ingress.class: "nginx" + cert-manager.io/cluster-issuer: 
"letsencrypt-prod"
+    nginx.ingress.kubernetes.io/ssl-redirect: "true"
+    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
+spec:
+  tls:
+    - hosts:
+        - api.sudominio.com
+      secretName: api-tls-secret
+  rules:
+    - host: api.sudominio.com
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: api-service
+                port:
+                  number: 80
+```
+
+
+
+
+
+Si usa cert-manager, asegure la configuración adecuada:
+
+```yaml
+apiVersion: cert-manager.io/v1
+kind: Certificate
+metadata:
+  name: api-certificate
+  namespace: default
+spec:
+  secretName: api-tls-secret
+  issuerRef:
+    name: letsencrypt-prod
+    kind: ClusterIssuer
+  dnsNames:
+    - api.sudominio.com
+    - "*.api.sudominio.com" # Comodín para subdominios adicionales si es necesario
+```
+
+Verifique que cert-manager esté funcionando:
+
+```bash
+# Ver estado de certificados
+kubectl get certificates
+
+# Ver detalles del certificado
+kubectl describe certificate api-certificate
+
+# Ver logs de cert-manager
+kubectl logs -n cert-manager deployment/cert-manager
+```
+
+
+
+
+
+1. **Verificar cobertura del certificado:**
+
+   ```bash
+   # Probar endpoints específicos
+   curl -I https://api.sudominio.com/api/v3/endpoint
+
+   # Revisar cadena de certificados
+   openssl s_client -connect api.sudominio.com:443 -showcerts
+   ```
+
+2. **Verificar resolución DNS:**
+
+   ```bash
+   nslookup api.sudominio.com
+   dig api.sudominio.com
+   ```
+
+3. **Probar desde diferentes ubicaciones:**
+
+   - Usar verificadores SSL en línea
+   - Probar desde distintas redes
+   - Comparar navegadores móviles y de escritorio
+
+4. **Revisar logs del controlador ingress:**
+   ```bash
+   kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
+   ```
+
+
+
+
+
+**Para clústeres gestionados por SleakOps:**
+
+1. **Actualizar anotaciones del ingress:**
+
+   ```yaml
+   annotations:
+     nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
+     nginx.ingress.kubernetes.io/ssl-redirect: "true"
+     nginx.ingress.kubernetes.io/proxy-body-size: "50m"
+   ```
+
+2. 
**Asegurar configuración correcta del servicio:** + + ```yaml + apiVersion: v1 + kind: Service + metadata: + name: api-service + spec: + ports: + - port: 80 + targetPort: 8080 + protocol: TCP + selector: + app: api-app + ``` + +3. **Verificar renovación del certificado:** + ```bash + # Forzar renovación del certificado + kubectl delete certificate api-certificate + kubectl apply -f certificate.yaml + ``` + +**Para dominios personalizados:** + +- Asegurar que el DNS apunte al balanceador de carga correcto +- Verificar que el certificado incluya todos los dominios necesarios +- Confirmar que la cadena de certificados esté completa + + + + + +1. **Usar certificados wildcard** para múltiples subdominios: + + ```yaml + dnsNames: + - sudominio.com + - "*.sudominio.com" + ``` + +2. **Monitorear expiración de certificados:** + + ```bash + # Configurar alertas de monitoreo + kubectl get certificates -o wide + ``` + +3. **Probar configuración SSL regularmente:** + + ```bash + # Pruebas automatizadas de SSL + curl -f https://api.sudominio.com/health + ``` + +4. 
**Usar encabezados HSTS** para mayor seguridad: + ```yaml + annotations: + nginx.ingress.kubernetes.io/server-snippet: | + add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always; + ``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-validation-aws-acm.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-validation-aws-acm.mdx new file mode 100644 index 000000000..a03465533 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/ssl-certificate-validation-aws-acm.mdx @@ -0,0 +1,150 @@ +--- +sidebar_position: 3 +title: "Error de Validación de Certificado SSL en AWS ACM" +description: "Cómo resolver problemas de validación de certificados SSL cuando los certificados ACM no se validan dentro de 72 horas" +date: "2024-11-07" +category: "provider" +tags: ["aws", "acm", "ssl", "certificado", "dns", "validacion"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Validación de Certificado SSL en AWS ACM + +**Fecha:** 7 de noviembre de 2024 +**Categoría:** Proveedor +**Etiquetas:** AWS, ACM, SSL, Certificado, DNS, Validación + +## Descripción del Problema + +**Contexto:** Al desplegar aplicaciones con dominios personalizados en SleakOps, los certificados de AWS Certificate Manager (ACM) pueden no validarse si la validación DNS no se completa dentro de las 72 horas posteriores a la creación del certificado. 
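
Como referencia rápida, la ventana de 72 horas puede razonarse con una cuenta sencilla a partir de la fecha de creación del certificado. El siguiente boceto en shell usa una fecha de creación supuesta, no un dato real (en un caso concreto se tomaría el campo `CreatedAt` que devuelve `aws acm describe-certificate`):

```shell
# Boceto ilustrativo: horas restantes de la ventana de validación de 72 h de ACM.
# La fecha de creación es un valor de ejemplo, no un dato real.
created="2024-11-07T10:00:00Z"
created_s=$(date -u -d "$created" +%s)
now_s=$(date -u +%s)
remaining=$(( 72 - (now_s - created_s) / 3600 ))

if [ "$remaining" -gt 0 ]; then
  echo "Quedan $remaining horas para completar la validación DNS"
else
  echo "Ventana de validación expirada: hay que regenerar el certificado"
fi
```

Si la cuenta da un valor negativo, AWS ya marcó el certificado como fallido y la única salida es regenerarlo.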
+ +**Síntomas observados:** + +- El certificado ACM muestra estado "Fallido" o "Tiempo de validación agotado" +- No se pueden configurar correctamente los alias de dominio +- Las conexiones SSL/TLS fallan para el dominio personalizado +- El certificado aparece como "erróneo" en la consola de AWS +- Las aplicaciones pueden ser inaccesibles vía HTTPS en dominios personalizados + +**Configuración relevante:** + +- Tipo de certificado: AWS Certificate Manager (ACM) +- Método de validación: Validación DNS +- Dominio: Dominio personalizado con gestión DNS externa +- Límite de tiempo: 72 horas desde la creación del certificado + +**Condiciones de error:** + +- Validación del certificado no completada dentro de la ventana de 72 horas +- Registros DNS TXT no configurados correctamente +- Propiedad del dominio no verificada mediante DNS +- Se requiere regeneración del certificado + +## Solución Detallada + + + +Cuando un certificado ACM falla en la validación, debes regenerarlo: + +1. **Accede a tu panel de ejecución de SleakOps** +2. **Navega a la ejecución afectada** (por ejemplo, tu landing page o aplicación web) +3. **Ve a la sección SSL/Certificado** +4. **Haz clic en "Regenerar Certificado"** o una opción similar +5. **Espera a que se cree el nuevo certificado** + +SleakOps creará automáticamente un nuevo certificado ACM con registros de validación actualizados. + + + + + +Después de regenerar el certificado, debes agregar los registros de validación DNS: + +1. **Copia la información de validación** desde los detalles de tu ejecución en SleakOps: + + - **Nombre**: `_[validation-hash].tudominio.com` + - **Valor**: `_[validation-value].acm-validations.aws.` + - **Tipo**: `TXT` + +2. **Agrega el registro TXT en tu proveedor DNS**: + + ``` + Nombre: _595d6ebeeb98358afc0357d079d068f4.tudominio.com + Tipo: TXT + Valor: _139389a2dc765df9b1c6bc66a1367077.djqtsrsxkq.acm-validations.aws. + TTL: 300 (o el mínimo de tu proveedor) + ``` + +3. 
**Guarda el registro DNS** y espera la propagación (usualmente 5-15 minutos) + + + + + +Después de agregar los registros DNS: + +1. **Espera la propagación DNS** (usualmente 5-15 minutos) +2. **Usa el botón "Verificar Certificado"** en SleakOps para activar la validación manualmente +3. **Monitorea el estado del certificado** en tu panel de ejecución +4. **Verifica la propagación del registro DNS** con herramientas como: + ```bash + dig TXT _595d6ebeeb98358afc0357d079d068f4.tudominio.com + ``` + o + ```bash + nslookup -type=TXT _595d6ebeeb98358afc0357d079d068f4.tudominio.com + ``` + +**Resultado esperado**: El estado del certificado debe cambiar a "Emitido" o "Válido" + + + + + +**Si la validación continúa fallando, revisa estos problemas comunes:** + +1. **Formato incorrecto del registro DNS**: + + - Asegúrate de que el nombre del registro TXT incluya el subdominio completo de validación + - No agregues comillas adicionales alrededor del valor + - Usa exactamente los valores proporcionados por AWS/SleakOps + +2. **Retrasos en la propagación DNS**: + + - Algunos proveedores DNS tardan más en propagar cambios + - Espera hasta 1 hora antes de considerar que falló + - Verifica con múltiples herramientas de consulta DNS + +3. **Múltiples registros de validación**: + + - Elimina registros antiguos o duplicados de validación + - Mantén solo el registro de validación actual + +4. **Limitaciones del proveedor DNS**: + - Algunos proveedores no soportan nombres largos en registros TXT + - Contacta a tu proveedor DNS si los registros no se guardan correctamente + + + + + +**Para evitar que se agote el tiempo para la validación del certificado:** + +1. **Configura los registros DNS inmediatamente** después de crear el certificado +2. **Monitorea el estado del certificado** regularmente en el panel de SleakOps +3. **Usa proveedores DNS automatizados** cuando sea posible (Route53, API de Cloudflare) +4. 
**Establece recordatorios en el calendario** para renovaciones de certificados +5. **Prueba los cambios DNS** antes de aplicarlos en dominios de producción + +**Buenas prácticas:** + +- Completa la validación DNS dentro de las 24 horas posteriores a la creación del certificado +- Mantén las credenciales de gestión DNS accesibles para tu equipo +- Documenta tu proceso de validación DNS para futuras referencias + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 7 de noviembre de 2024 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/superset-bitnami-helm-deployment.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/superset-bitnami-helm-deployment.mdx new file mode 100644 index 000000000..17f088922 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/superset-bitnami-helm-deployment.mdx @@ -0,0 +1,1292 @@ +--- +sidebar_position: 3 +title: "Desplegando Apache Superset con el Chart Helm de Bitnami" +description: "Guía completa para desplegar Apache Superset usando el chart Helm de Bitnami con configuraciones personalizadas y tolerancias" +date: "2024-12-20" +category: "carga-de-trabajo" +tags: ["superset", "bitnami", "helm", "kubernetes", "despliegue", "tolerancias"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Desplegando Apache Superset con el Chart Helm de Bitnami + +**Fecha:** 20 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Superset, Bitnami, Helm, Kubernetes, Despliegue, Tolerancias + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan desplegar Apache Superset usando el chart Helm de Bitnami en un clúster de Kubernetes con configuraciones específicas que incluyen tolerancias para la inicialización de la base de datos y valores personalizados. 
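
La pieza central de esta guía es el mecanismo de *post-renderer* de Helm: cualquier ejecutable que recibe los manifiestos ya renderizados por stdin y devuelve por stdout la versión modificada, que es lo que Helm aplica finalmente. Un ejemplo mínimo del contrato (el nombre y la ruta del script son ilustrativos, no forman parte del chart):

```shell
# Post-renderer de juguete: solo marca los recursos de tipo Job con un comentario.
# Un post-renderer real inyectaría aquí las tolerancias.
cat > /tmp/demo-post-renderer.sh <<'EOF'
#!/bin/sh
# Contrato: manifiestos por stdin -> manifiestos modificados por stdout
sed 's/^kind: Job$/kind: Job  # modificado por el post-renderer/'
EOF
chmod +x /tmp/demo-post-renderer.sh

# Helm lo invocaría con: helm upgrade --install ... --post-renderer /tmp/demo-post-renderer.sh
# Simulación local del flujo:
printf 'kind: Job\nmetadata:\n  name: demo\n' | /tmp/demo-post-renderer.sh
```

El script `add-tolerations.sh` de los pasos siguientes implementa exactamente este mismo contrato.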
+
+**Síntomas Observados:**
+
+- Necesidad de añadir tolerancias a los pods de inicialización de la base de datos
+- El chart de Bitnami no soporta tolerancias para contenedores init vía valores
+- Requiere un script post-render para modificar las plantillas del chart
+- Gestión compleja de la configuración para el despliegue de Superset
+
+**Configuración Relevante:**
+
+- Chart: `bitnami/superset`
+- Namespace: `superset`
+- Archivo de valores personalizados: `values-def.yaml`
+- Script post-render: `add-tolerations.sh`
+
+**Condiciones de Error:**
+
+- Los pods de inicialización de la base de datos pueden fallar al programarse sin las tolerancias adecuadas
+- Los valores estándar de Helm no proporcionan opciones de personalización suficientes
+- Se requiere modificación manual del chart para casos de uso específicos
+
+## Solución Detallada
+
+
+
+Primero, agregue y actualice el repositorio Helm de Bitnami:
+
+```bash
+# Agregar repositorio Bitnami
+helm repo add bitnami https://charts.bitnami.com/bitnami
+
+# Actualizar repositorios para obtener los charts más recientes
+helm repo update
+
+# Verificar que el repositorio fue agregado
+helm repo list
+
+# Buscar la versión más reciente del chart
+helm search repo bitnami/superset --versions
+
+# Obtener información del chart
+helm show chart bitnami/superset
+helm show readme bitnami/superset
+```
+
+### Verificar prerrequisitos
+
+```bash
+# Verificar que kubectl está configurado
+kubectl cluster-info
+
+# Verificar que Helm está instalado
+helm version
+
+# Verificar permisos en el namespace
+kubectl auth can-i create deployments --namespace superset
+
+# Crear namespace si no existe
+kubectl create namespace superset --dry-run=client -o yaml | kubectl apply -f -
+```
+
+
+
+
+
+Cree un script post-render para añadir tolerancias a los pods de inicialización de la base de datos:
+
+```bash
+#!/bin/bash
+# Archivo: setup-post-renderer.sh
+# Genera el post-renderer add-tolerations.sh y lo deja ejecutable
+
+set -e
+
+# 
Script post-render para añadir tolerancias
+cat <<'EOF' > add-tolerations.sh
+#!/bin/bash
+
+# Función para log con timestamp
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2
+}
+
+log "Aplicando post-render: añadiendo tolerancias..."
+
+# Leer manifiestos desde stdin
+manifests=$(cat)
+
+# Usar yq para modificar los manifiestos
+echo "$manifests" | yq eval '
+# Añadir tolerancias a Jobs (init containers para DB)
+(select(.kind == "Job") | .spec.template.spec.tolerations) = [
+  {
+    "key": "node-role.kubernetes.io/control-plane",
+    "operator": "Exists",
+    "effect": "NoSchedule"
+  },
+  {
+    "key": "CriticalAddonsOnly",
+    "operator": "Exists"
+  },
+  {
+    "key": "node.kubernetes.io/not-ready",
+    "operator": "Exists",
+    "effect": "NoExecute",
+    "tolerationSeconds": 300
+  },
+  {
+    "key": "node.kubernetes.io/unreachable",
+    "operator": "Exists",
+    "effect": "NoExecute",
+    "tolerationSeconds": 300
+  }
+] |
+# Añadir tolerancias a Deployments (aplicación principal)
+(select(.kind == "Deployment") | .spec.template.spec.tolerations) = [
+  {
+    "key": "node-role.kubernetes.io/control-plane",
+    "operator": "Exists",
+    "effect": "NoSchedule"
+  },
+  {
+    "key": "CriticalAddonsOnly",
+    "operator": "Exists"
+  }
+] |
+# Añadir tolerancias a StatefulSets (si hay alguno)
+(select(.kind == "StatefulSet") | .spec.template.spec.tolerations) = [
+  {
+    "key": "node-role.kubernetes.io/control-plane",
+    "operator": "Exists",
+    "effect": "NoSchedule"
+  },
+  {
+    "key": "CriticalAddonsOnly",
+    "operator": "Exists"
+  }
+]'
+
+log "Post-render completado exitosamente"
+EOF
+
+# Hacer el script ejecutable
+chmod +x add-tolerations.sh
+
+# Verificar que yq está instalado
+if ! command -v yq &> /dev/null; then
+  echo "Instalando yq..."
+  # Para Linux
+  sudo wget -qO /usr/local/bin/yq https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
+  sudo chmod +x /usr/local/bin/yq
+
+  # Para macOS (alternativa)
+  # brew install yq
+fi
+```
+
+### Script alternativo con sed (sin dependencias externas)
+
+```bash
+#!/bin/bash
+# add-tolerations-sed.sh - Versión alternativa usando sed
+
+cat <<'EOF' > add-tolerations-sed.sh
+#!/bin/bash
+
+set -e
+
+log() {
+  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >&2
+}
+
+log "Aplicando tolerancias con sed..."
+
+# Leer todo el input
+input=$(cat)
+
+# Función para añadir tolerancias después de 'spec:'
+# (lee los manifiestos por stdin y escribe el resultado por stdout)
+add_tolerations() {
+  local kind=$1
+  local tolerations='
+      tolerations:
+        - key: "node-role.kubernetes.io/control-plane"
+          operator: "Exists"
+          effect: "NoSchedule"
+        - key: "CriticalAddonsOnly"
+          operator: "Exists"
+        - key: "node.kubernetes.io/not-ready"
+          operator: "Exists"
+          effect: "NoExecute"
+          tolerationSeconds: 300
+        - key: "node.kubernetes.io/unreachable"
+          operator: "Exists"
+          effect: "NoExecute"
+          tolerationSeconds: 300'
+
+  sed "/^kind: $kind$/,/^spec:/{
+    /^spec:/{
+      a\\
+$tolerations
+    }
+  }"
+}
+
+# Aplicar a diferentes tipos de recursos, encadenando el resultado
+result="$input"
+for kind in Job Deployment StatefulSet; do
+  result=$(echo "$result" | add_tolerations "$kind")
+done
+
+echo "$result"
+
+log "Tolerancias añadidas exitosamente"
+EOF
+
+chmod +x add-tolerations-sed.sh
+```
+
+
+
+
+
+Cree un archivo de valores personalizado para Superset:
+
+```yaml
+# values-def.yaml - Configuración completa de Superset
+
+# Configuración global
+global:
+  imageRegistry: ""
+  imagePullSecrets: []
+  storageClass: ""
+
+# Configuración de la imagen
+image:
+  registry: docker.io
+  repository: bitnami/superset
+  tag: ""
+  pullPolicy: IfNotPresent
+
+# Configuración de autenticación
+auth:
+  adminUser: admin
+  adminPassword: "SuperSecretPassword123!"
+ adminEmail: admin@example.com + secretKey: "THIS_IS_MY_SECRET_KEY_CHANGE_ME_IN_PRODUCTION" + +# Configuración de base de datos +postgresql: + enabled: true + auth: + username: superset + password: "PostgresPassword123!" + database: superset + primary: + persistence: + enabled: true + size: 8Gi + storageClass: "" + resources: + requests: + memory: 256Mi + cpu: 250m + limits: + memory: 512Mi + cpu: 500m + +# Configuración de Redis (para caché y celery) +redis: + enabled: true + auth: + enabled: true + password: "RedisPassword123!" + master: + persistence: + enabled: true + size: 2Gi + resources: + requests: + memory: 128Mi + cpu: 100m + limits: + memory: 256Mi + cpu: 200m + +# Configuración de la aplicación principal +superset: + # Configuración de recursos + resources: + requests: + memory: 512Mi + cpu: 250m + limits: + memory: 1Gi + cpu: 500m + + # Número de réplicas + replicaCount: 2 + + # Configuración de autoescalado + autoscaling: + enabled: true + minReplicas: 2 + maxReplicas: 10 + targetCPUUtilizationPercentage: 70 + targetMemoryUtilizationPercentage: 80 + + # Configuración de sondas de salud + livenessProbe: + enabled: true + initialDelaySeconds: 60 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 3 + successThreshold: 1 + + readinessProbe: + enabled: true + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3 + successThreshold: 1 + + # Variables de entorno adicionales + extraEnvVars: + - name: SUPERSET_LOAD_EXAMPLES + value: "yes" + - name: SUPERSET_SECRET_KEY + valueFrom: + secretKeyRef: + name: superset-secret + key: secret-key + - name: PYTHONPATH + value: "/app/pythonpath:/opt/bitnami/superset/lib/python3.9/site-packages" + +# Configuración de Celery Worker +worker: + enabled: true + replicaCount: 2 + resources: + requests: + memory: 256Mi + cpu: 200m + limits: + memory: 512Mi + cpu: 400m + + # Configuración específica de Celery + celery: + broker: redis + backend: redis + concurrency: 4 + +# Configuración 
de Celery Beat (scheduler) +beat: + enabled: true + resources: + requests: + memory: 128Mi + cpu: 100m + limits: + memory: 256Mi + cpu: 200m + +# Configuración de Flower (monitoreo de Celery) +flower: + enabled: true + resources: + requests: + memory: 128Mi + cpu: 100m + limits: + memory: 256Mi + cpu: 200m + +# Configuración del servicio +service: + type: ClusterIP + port: 8088 + targetPort: 8088 + annotations: {} + +# Configuración de Ingress +ingress: + enabled: true + className: "nginx" + annotations: + nginx.ingress.kubernetes.io/proxy-body-size: "50m" + nginx.ingress.kubernetes.io/proxy-read-timeout: "300" + nginx.ingress.kubernetes.io/proxy-send-timeout: "300" + cert-manager.io/cluster-issuer: "letsencrypt-prod" + hosts: + - host: superset.example.com + paths: + - path: / + pathType: Prefix + tls: + - secretName: superset-tls + hosts: + - superset.example.com + +# Configuración de persistencia para logs +persistence: + enabled: true + size: 5Gi + storageClass: "" + accessModes: + - ReadWriteOnce + +# Configuración de seguridad +securityContext: + enabled: true + fsGroup: 1001 + runAsUser: 1001 + +podSecurityContext: + enabled: true + fsGroup: 1001 + +containerSecurityContext: + enabled: true + runAsUser: 1001 + runAsNonRoot: true + readOnlyRootFilesystem: false + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + +# Configuración de métricas +metrics: + enabled: true + serviceMonitor: + enabled: false + namespace: "" + annotations: {} + labels: {} + +# Configuración de init containers +initContainers: + # Container para inicializar la base de datos + - name: wait-for-db + image: postgres:13 + command: + - sh + - -c + - | + until pg_isready -h $POSTGRES_HOST -p $POSTGRES_PORT -U $POSTGRES_USER; do + echo "Waiting for PostgreSQL..." + sleep 2 + done + echo "PostgreSQL is ready!" 
+ env: + - name: POSTGRES_HOST + value: "superset-postgresql" + - name: POSTGRES_PORT + value: "5432" + - name: POSTGRES_USER + value: "superset" + +# Configuración de NetworkPolicy +networkPolicy: + enabled: false + ingress: + enabled: true + namespaceSelector: {} + podSelector: {} + egress: + enabled: true + namespaceSelector: {} + podSelector: {} + +# Tolerancias (serán sobrescritas por el post-renderer) +tolerations: [] + +# Afinidad de nodos +nodeAffinity: {} + +# Anti-afinidad de pods +podAntiAffinity: {} + +# Selector de nodos +nodeSelector: {} +``` + +### Archivo de secrets separado + +```yaml +# superset-secrets.yaml +apiVersion: v1 +kind: Secret +metadata: + name: superset-secret + namespace: superset +type: Opaque +stringData: + secret-key: "THIS_IS_MY_SECRET_KEY_CHANGE_ME_IN_PRODUCTION_VERY_LONG_AND_SECURE" + admin-password: "SuperSecretPassword123!" + postgres-password: "PostgresPassword123!" + redis-password: "RedisPassword123!" +``` + + + + + +### Instalación inicial de Superset + +```bash +#!/bin/bash +# deploy-superset.sh + +set -e + +NAMESPACE="superset" +RELEASE_NAME="superset" +CHART="bitnami/superset" +VALUES_FILE="values-def.yaml" +POST_RENDERER="./add-tolerations.sh" + +echo "=== Desplegando Apache Superset ===" + +# 1. Crear namespace +kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f - + +# 2. Aplicar secrets +kubectl apply -f superset-secrets.yaml -n $NAMESPACE + +# 3. Verificar que el post-renderer existe y es ejecutable +if [[ ! -x "$POST_RENDERER" ]]; then + echo "❌ Post-renderer script no encontrado o no es ejecutable: $POST_RENDERER" + exit 1 +fi + +# 4. Dry-run para verificar la configuración +echo "Ejecutando dry-run..." +helm upgrade --install $RELEASE_NAME $CHART \ + --namespace $NAMESPACE \ + --values $VALUES_FILE \ + --post-renderer $POST_RENDERER \ + --dry-run --debug + +# 5. Confirmar despliegue +read -p "¿Continuar con el despliegue? 
(y/N): " -n 1 -r +echo +if [[ $REPLY =~ ^[Yy]$ ]]; then + echo "Desplegando Superset..." + + helm upgrade --install $RELEASE_NAME $CHART \ + --namespace $NAMESPACE \ + --values $VALUES_FILE \ + --post-renderer $POST_RENDERER \ + --wait \ + --timeout 10m + + echo "✅ Superset desplegado exitosamente" +else + echo "Despliegue cancelado" + exit 0 +fi + +# 6. Verificar estado del despliegue +echo "=== Estado del Despliegue ===" +kubectl get all -n $NAMESPACE +kubectl get pvc -n $NAMESPACE +kubectl get secrets -n $NAMESPACE + +# 7. Obtener información de acceso +echo "=== Información de Acceso ===" +echo "Namespace: $NAMESPACE" +echo "Release: $RELEASE_NAME" + +# Port-forward para acceso local +echo "Para acceder localmente, ejecute:" +echo "kubectl port-forward --namespace $NAMESPACE svc/$RELEASE_NAME 8088:8088" +echo "Luego visite: http://localhost:8088" + +# Credenciales +echo "" +echo "Credenciales de acceso:" +echo "Usuario: admin" +echo "Contraseña: $(kubectl get secret superset-secret -n $NAMESPACE -o jsonpath='{.data.admin-password}' | base64 -d)" +``` + +### Actualización del despliegue + +```bash +#!/bin/bash +# update-superset.sh + +NAMESPACE="superset" +RELEASE_NAME="superset" +CHART="bitnami/superset" +VALUES_FILE="values-def.yaml" +POST_RENDERER="./add-tolerations.sh" + +echo "=== Actualizando Apache Superset ===" + +# Actualizar repositorio Helm +helm repo update + +# Verificar diferencias +helm diff upgrade $RELEASE_NAME $CHART \ + --namespace $NAMESPACE \ + --values $VALUES_FILE \ + --post-renderer $POST_RENDERER || echo "helm-diff plugin no instalado" + +# Ejecutar actualización +helm upgrade $RELEASE_NAME $CHART \ + --namespace $NAMESPACE \ + --values $VALUES_FILE \ + --post-renderer $POST_RENDERER \ + --wait \ + --timeout 15m + +echo "✅ Actualización completada" + +# Verificar estado +kubectl rollout status deployment/$RELEASE_NAME -n $NAMESPACE +``` + +### Rollback en caso de problemas + +```bash +#!/bin/bash +# rollback-superset.sh + 
+NAMESPACE="superset"
+RELEASE_NAME="superset"
+
+echo "=== Rollback de Apache Superset ==="
+
+# Listar historial de releases
+echo "Historial de releases:"
+helm history $RELEASE_NAME -n $NAMESPACE
+
+# Obtener la revisión anterior (con --max 2, la primera entrada es la penúltima)
+LAST_REVISION=$(helm history $RELEASE_NAME -n $NAMESPACE --max 2 -o json | jq -r '.[0].revision')
+
+read -p "¿Hacer rollback a la revisión $LAST_REVISION? (y/N): " -n 1 -r
+echo
+if [[ $REPLY =~ ^[Yy]$ ]]; then
+  helm rollback $RELEASE_NAME $LAST_REVISION -n $NAMESPACE --wait
+  echo "✅ Rollback completado"
+else
+  echo "Rollback cancelado"
+fi
+```
+
+
+
+
+
+### Verificar conectividad de la base de datos
+
+```bash
+#!/bin/bash
+# verify-db-connection.sh
+
+NAMESPACE="superset"
+DB_HOST="superset-postgresql"
+DB_PORT="5432"
+DB_NAME="superset"
+DB_USER="superset"
+
+echo "=== Verificación de Base de Datos ==="
+
+# Obtener contraseña de la base de datos
+DB_PASSWORD=$(kubectl get secret superset-secret -n $NAMESPACE -o jsonpath='{.data.postgres-password}' | base64 -d)
+
+# Test de conectividad usando un pod temporal
+kubectl run postgres-client \
+  --rm -i --tty \
+  --namespace $NAMESPACE \
+  --image postgres:13 \
+  --restart=Never \
+  --command -- psql \
+  "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" \
+  -c "SELECT version();"
+
+echo "✅ Conectividad de base de datos verificada"
+```
+
+### Inicialización manual de la base de datos
+
+```bash
+#!/bin/bash
+# init-superset-db.sh
+
+NAMESPACE="superset"
+SUPERSET_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=superset -o jsonpath='{.items[0].metadata.name}')
+
+echo "=== Inicialización de Base de Datos Superset ==="
+echo "Pod seleccionado: $SUPERSET_POD"
+
+# 1. Inicializar la base de datos
+echo "1. Inicializando base de datos..."
+kubectl exec -n $NAMESPACE $SUPERSET_POD -- superset db upgrade
+
+# 2. Crear usuario administrador
+echo "2. Creando usuario administrador..."
+kubectl exec -n $NAMESPACE $SUPERSET_POD -- superset fab create-admin \ + --username admin \ + --firstname Superset \ + --lastname Admin \ + --email admin@example.com \ + --password "$(kubectl get secret superset-secret -n $NAMESPACE -o jsonpath='{.data.admin-password}' | base64 -d)" + +# 3. Inicializar roles y permisos +echo "3. Inicializando roles..." +kubectl exec -n $NAMESPACE $SUPERSET_POD -- superset init + +# 4. Cargar ejemplos (opcional) +echo "4. Cargando datos de ejemplo..." +kubectl exec -n $NAMESPACE $SUPERSET_POD -- superset load_examples + +echo "✅ Inicialización de base de datos completada" +``` + +### Configuración de conexiones de base de datos + +```sql +-- Ejemplo de conexiones que se pueden configurar en Superset +-- PostgreSQL +postgresql://user:password@host:5432/database + +-- MySQL +mysql://user:password@host:3306/database + +-- SQLite +sqlite:///path/to/database.db + +-- BigQuery +bigquery://project_id/dataset_id + +-- Snowflake +snowflake://user:password@account/database/schema + +-- Redshift +redshift+psycopg2://user:password@host:5439/database +``` + +### Script de backup de base de datos + +```bash +#!/bin/bash +# backup-superset-db.sh + +NAMESPACE="superset" +BACKUP_DIR="/tmp/superset-backups" +TIMESTAMP=$(date +%Y%m%d_%H%M%S) +DB_HOST="superset-postgresql" +DB_PORT="5432" +DB_NAME="superset" +DB_USER="superset" + +echo "=== Backup de Base de Datos Superset ===" + +# Crear directorio de backup +mkdir -p $BACKUP_DIR + +# Obtener contraseña +DB_PASSWORD=$(kubectl get secret superset-secret -n $NAMESPACE -o jsonpath='{.data.postgres-password}' | base64 -d) + +# Crear backup usando pod temporal +kubectl run postgres-backup \ + --rm -i \ + --namespace $NAMESPACE \ + --image postgres:13 \ + --restart=Never \ + --command -- pg_dump \ + "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME" \ + --verbose \ + --format=custom \ + --compress=9 > "$BACKUP_DIR/superset_backup_$TIMESTAMP.dump" + +echo "✅ Backup creado: 
$BACKUP_DIR/superset_backup_$TIMESTAMP.dump" + +# Verificar backup +if [[ -f "$BACKUP_DIR/superset_backup_$TIMESTAMP.dump" ]]; then + size=$(du -h "$BACKUP_DIR/superset_backup_$TIMESTAMP.dump" | cut -f1) + echo "Tamaño del backup: $size" +else + echo "❌ Error: Backup no creado" + exit 1 +fi +``` + + + + + +### Configuración de Prometheus metrics + +```yaml +# superset-servicemonitor.yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: superset-metrics + namespace: superset + labels: + app.kubernetes.io/name: superset + app.kubernetes.io/instance: superset +spec: + selector: + matchLabels: + app.kubernetes.io/name: superset + app.kubernetes.io/component: superset + endpoints: + - port: http + path: /health + interval: 30s + scrapeTimeout: 10s + - port: http + path: /metrics + interval: 30s + scrapeTimeout: 10s +``` + +### Dashboard de Grafana para Superset + +```json +{ + "dashboard": { + "title": "Apache Superset Metrics", + "panels": [ + { + "title": "Request Rate", + "type": "stat", + "targets": [ + { + "expr": "rate(superset_request_total[5m])", + "legendFormat": "Requests/sec" + } + ] + }, + { + "title": "Response Time", + "type": "stat", + "targets": [ + { + "expr": "histogram_quantile(0.95, rate(superset_request_duration_seconds_bucket[5m]))", + "legendFormat": "95th percentile" + } + ] + }, + { + "title": "Database Connections", + "type": "stat", + "targets": [ + { + "expr": "superset_database_connections_active", + "legendFormat": "Active connections" + } + ] + }, + { + "title": "Error Rate", + "type": "stat", + "targets": [ + { + "expr": "rate(superset_request_total{status=~\"4.*|5.*\"}[5m])", + "legendFormat": "Errors/sec" + } + ] + } + ] + } +} +``` + +### Configuración de logging estructurado + +```yaml +# superset-logging-config.yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: superset-logging-config + namespace: superset +data: + logging.conf: | + [loggers] + keys=root,superset + + [handlers] + keys=console,file 
+ + [formatters] + keys=generic,json + + [logger_root] + level=INFO + handlers=console + + [logger_superset] + level=INFO + handlers=console,file + qualname=superset + propagate=0 + + [handler_console] + class=StreamHandler + formatter=json + args=(sys.stdout,) + + [handler_file] + class=FileHandler + formatter=json + args=('/var/log/superset/superset.log',) + + [formatter_generic] + format=%(asctime)s [%(levelname)s] %(name)s: %(message)s + + [formatter_json] + class=pythonjsonlogger.jsonlogger.JsonFormatter + format=%(asctime)s %(name)s %(levelname)s %(message)s +``` + +### Script de monitoreo de salud + +```bash +#!/bin/bash +# monitor-superset-health.sh + +NAMESPACE="superset" +RELEASE_NAME="superset" + +echo "=== Monitoreo de Salud Superset ===" + +# Función para verificar estado de pods +check_pods() { + echo "Estado de pods:" + kubectl get pods -n $NAMESPACE -l app.kubernetes.io/instance=$RELEASE_NAME + + # Verificar pods no saludables + unhealthy_pods=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/instance=$RELEASE_NAME --field-selector=status.phase!=Running -o name 2>/dev/null) + + if [[ -n $unhealthy_pods ]]; then + echo "⚠️ Pods no saludables encontrados:" + echo "$unhealthy_pods" + return 1 + else + echo "✅ Todos los pods están saludables" + return 0 + fi +} + +# Función para verificar servicios +check_services() { + echo "Estado de servicios:" + kubectl get svc -n $NAMESPACE -l app.kubernetes.io/instance=$RELEASE_NAME +} + +# Función para test de conectividad HTTP +check_http_health() { + echo "Verificando health endpoint..." + + # Port-forward en background + kubectl port-forward -n $NAMESPACE svc/$RELEASE_NAME 8088:8088 & + PF_PID=$! 
+ + # Esperar que el port-forward esté listo + sleep 5 + + # Test HTTP + if curl -f -s http://localhost:8088/health > /dev/null; then + echo "✅ Health endpoint respondiendo" + result=0 + else + echo "❌ Health endpoint no responde" + result=1 + fi + + # Terminar port-forward + kill $PF_PID 2>/dev/null + wait $PF_PID 2>/dev/null + + return $result +} + +# Función para verificar base de datos +check_database() { + echo "Verificando conectividad de base de datos..." + + DB_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=postgresql -o jsonpath='{.items[0].metadata.name}') + + if [[ -n $DB_POD ]]; then + if kubectl exec -n $NAMESPACE $DB_POD -- pg_isready > /dev/null 2>&1; then + echo "✅ Base de datos PostgreSQL saludable" + return 0 + else + echo "❌ Base de datos PostgreSQL no responde" + return 1 + fi + else + echo "⚠️ Pod de PostgreSQL no encontrado" + return 1 + fi +} + +# Función para verificar Redis +check_redis() { + echo "Verificando Redis..." + + REDIS_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=master -o jsonpath='{.items[0].metadata.name}') + + if [[ -n $REDIS_POD ]]; then + if kubectl exec -n $NAMESPACE $REDIS_POD -- redis-cli ping | grep -q PONG; then + echo "✅ Redis saludable" + return 0 + else + echo "❌ Redis no responde" + return 1 + fi + else + echo "⚠️ Pod de Redis no encontrado" + return 1 + fi +} + +# Ejecutar todas las verificaciones +checks=(check_pods check_services check_database check_redis check_http_health) +failed_checks=0 + +for check in "${checks[@]}"; do + echo "" + if ! 
$check; then + ((failed_checks++)) + fi +done + +echo "" +echo "=== Resumen ===" +if [[ $failed_checks -eq 0 ]]; then + echo "✅ Todas las verificaciones pasaron" + exit 0 +else + echo "❌ $failed_checks verificación(es) fallaron" + exit 1 +fi +``` + + + + + +### Problema: Pods de inicialización fallan por tolerancias + +**Síntoma:** Los jobs de inicialización no se programan en nodos con taints + +**Solución:** + +```bash +# Verificar que el post-renderer funciona +helm template superset bitnami/superset \ + --values values-def.yaml \ + --post-renderer ./add-tolerations.sh | \ + grep -A 20 "tolerations:" + +# Verificar taints en nodos +kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints + +# Verificar tolerancias en pods desplegados +kubectl get pods -n superset -o yaml | grep -A 10 tolerations +``` + +### Problema: Error de conexión a base de datos + +**Síntoma:** Superset no puede conectarse a PostgreSQL + +**Solución:** + +```bash +#!/bin/bash +# debug-db-connection.sh + +NAMESPACE="superset" + +echo "=== Debug de Conexión Base de Datos ===" + +# 1. Verificar estado del pod PostgreSQL +echo "1. Estado de PostgreSQL:" +kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=postgresql + +# 2. Verificar logs de PostgreSQL +echo "2. Logs de PostgreSQL (últimas 50 líneas):" +POSTGRES_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=postgresql -o jsonpath='{.items[0].metadata.name}') +kubectl logs -n $NAMESPACE $POSTGRES_POD --tail=50 + +# 3. Verificar variables de entorno en pod Superset +echo "3. Variables de entorno en Superset:" +SUPERSET_POD=$(kubectl get pods -n $NAMESPACE -l app.kubernetes.io/component=superset -o jsonpath='{.items[0].metadata.name}') +kubectl exec -n $NAMESPACE $SUPERSET_POD -- env | grep -E "(DATABASE|POSTGRES|DB_)" + +# 4. Test de conectividad de red +echo "4. Test de conectividad de red:" +kubectl exec -n $NAMESPACE $SUPERSET_POD -- nslookup superset-postgresql + +# 5. 
Test de puerto
+echo "5. Test de puerto PostgreSQL:"
+kubectl exec -n $NAMESPACE $SUPERSET_POD -- timeout 5 bash -c "</dev/tcp/superset-postgresql/5432" \
+  && echo "✅ Puerto 5432 accesible" \
+  || echo "❌ Puerto 5432 no accesible"
+```
+
+
+
+
+
+### Script de recopilación de diagnósticos
+
+```bash
+#!/bin/bash
+# collect-diagnostics.sh
+
+NAMESPACE="superset"
+OUTPUT_DIR="/tmp/superset-diagnostics-$(date +%Y%m%d_%H%M%S)"
+mkdir -p $OUTPUT_DIR
+
+echo "=== Recopilación de Diagnósticos Superset ==="
+
+# 1. Información del cluster
+echo "1. Recopilando información del cluster..."
+kubectl cluster-info > $OUTPUT_DIR/cluster-info.txt
+kubectl get nodes -o wide > $OUTPUT_DIR/nodes.txt
+kubectl get nodes -o yaml > $OUTPUT_DIR/nodes-full.yaml
+
+# 2. Estado de recursos
+echo "2. Recopilando estado de recursos..."
+kubectl get all -n $NAMESPACE > $OUTPUT_DIR/resources.txt
+kubectl get pvc -n $NAMESPACE > $OUTPUT_DIR/pvc.txt
+kubectl get secrets -n $NAMESPACE > $OUTPUT_DIR/secrets.txt
+kubectl get configmaps -n $NAMESPACE > $OUTPUT_DIR/configmaps.txt
+
+# 3. Manifiestos completos
+echo "3. Exportando manifiestos..."
+kubectl get deployment,statefulset,job -n $NAMESPACE -o yaml > $OUTPUT_DIR/manifests.yaml
+
+# 4. Logs de aplicación
+echo "4. Recopilando logs..."
+for pod in $(kubectl get pods -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'); do
+  echo "Logs de $pod..."
+  kubectl logs -n $NAMESPACE $pod --previous > $OUTPUT_DIR/logs-$pod-previous.log 2>/dev/null || true
+  kubectl logs -n $NAMESPACE $pod > $OUTPUT_DIR/logs-$pod.log 2>/dev/null || true
+done
+
+# 5. Descripción de pods
+echo "5. Describiendo pods..."
+for pod in $(kubectl get pods -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'); do
+  kubectl describe pod -n $NAMESPACE $pod > $OUTPUT_DIR/describe-$pod.txt
+done
+
+# 6. Eventos del namespace
+echo "6. Recopilando eventos..."
+kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' > $OUTPUT_DIR/events.txt
+
+# 7. Información de Helm
+echo "7. Información de Helm..."
+helm list -n $NAMESPACE > $OUTPUT_DIR/helm-releases.txt
+helm status superset -n $NAMESPACE > $OUTPUT_DIR/helm-status.txt
+helm history superset -n $NAMESPACE > $OUTPUT_DIR/helm-history.txt
+
+# 8. Test de conectividad
+echo "8. Tests de conectividad..."
+{ + echo "=== DNS Tests ===" + kubectl run dns-test --rm -i --tty --image=busybox --restart=Never -- nslookup superset-postgresql.$NAMESPACE.svc.cluster.local || true + + echo -e "\n=== HTTP Tests ===" + kubectl run http-test --rm -i --tty --image=curlimages/curl --restart=Never -- curl -I http://superset.$NAMESPACE.svc.cluster.local:8088/health || true +} > $OUTPUT_DIR/connectivity-tests.txt 2>&1 + +echo "✅ Diagnósticos completados en: $OUTPUT_DIR" +echo "Para compartir, crear un archivo tar: tar -czf superset-diagnostics.tar.gz -C $(dirname $OUTPUT_DIR) $(basename $OUTPUT_DIR)" +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 20 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-502-errors-pod-logs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-502-errors-pod-logs.mdx new file mode 100644 index 000000000..4673abc32 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-502-errors-pod-logs.mdx @@ -0,0 +1,552 @@ +--- +sidebar_position: 3 +title: "Error 502 y Registros de Pod No Cargan" +description: "Solución de problemas de errores 502 cuando no se pueden acceder a los registros de pods a través de la plataforma" +date: "2025-01-27" +category: "workload" +tags: + ["error-502", "registros-pod", "solucion-de-problemas", "despliegue", "redes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error 502 y Registros de Pod No Cargan + +**Fecha:** 27 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Error 502, Registros de Pod, Solución de problemas, Despliegue, Redes + +## Descripción del Problema + +**Contexto:** El usuario experimenta un error 502 al acceder a su aplicación desplegada en SleakOps, combinado con la imposibilidad de ver los registros de pod a través de la 
interfaz de la plataforma. + +**Síntomas observados:** + +- Error 502 Bad Gateway al acceder a la URL de la aplicación +- El botón de registros de pod en la interfaz de SleakOps no abre ni carga los registros +- El reenvío de puertos mediante Lens muestra pantalla en blanco con errores de red +- El contenedor Docker funciona correctamente al probarlo localmente +- El rollback a una compilación anterior funcional no resuelve el problema +- Los registros de otros proyectos y despliegues son accesibles + +**Configuración relevante:** + +- Entorno: Desarrollo +- Aplicación: Aplicación basada en monorepositorio +- Plataforma: Despliegue Kubernetes en SleakOps +- Pruebas locales: El contenedor Docker funciona correctamente +- Estado previo: La aplicación funcionaba con compilaciones anteriores + +**Condiciones del error:** + +- El error ocurre al acceder a la URL de la aplicación +- Los registros de pod son específicamente inaccesibles para este despliegue +- El reenvío de puertos falla con errores de red +- El problema persiste tras intentos de rollback +- El problema está aislado a pods/despliegue específicos + +## Solución Detallada + + + +Cuando encuentras un error 502 combinado con registros de pod inaccesibles, esto típicamente indica: + +1. **Problemas en el arranque del pod**: El pod puede estar fallando al iniciar correctamente o cayendo durante la inicialización +2. **Restricciones de recursos**: Memoria o CPU insuficientes causando terminación del pod +3. **Fallos en chequeos de salud**: Las sondas de readiness o liveness fallan +4. **Problemas de conectividad de red**: Problemas con la comunicación servicio-pod +5. **Problemas con el runtime del contenedor**: Problemas específicos del entorno Kubernetes frente a Docker local + +El hecho de que los registros sean inaccesibles sugiere que los pods pueden estar en un ciclo de reinicio o en estado de fallo. 
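
Como referencia para interpretar lo anterior, el código de salida del contenedor (visible en `kubectl describe pod`, campo `Last State → Exit Code`) suele apuntar a la causa raíz. El siguiente boceto es solo ilustrativo: la función `interpretar_exit_code` es hipotética y no forma parte de la plataforma; los códigos listados corresponden a convenciones habituales de Linux/Kubernetes (128 + número de señal):

```bash
#!/bin/bash
# Boceto ilustrativo: mapear códigos de salida de contenedores a causas probables.
# La función es hipotética; los códigos 128+N corresponden a la señal N.

interpretar_exit_code() {
  case "$1" in
    0)   echo "Terminación normal" ;;
    1)   echo "Error de la aplicación: revisar los logs con --previous" ;;
    137) echo "SIGKILL (128+9): probable OOMKill, revisar límites de memoria" ;;
    139) echo "SIGSEGV (128+11): fallo de segmentación del proceso" ;;
    143) echo "SIGTERM (128+15): terminación solicitada (rollout o eviction)" ;;
    *)   echo "Código $1: revisar 'kubectl describe pod' y los eventos" ;;
  esac
}

interpretar_exit_code 137
```

Por ejemplo, un pod en `CrashLoopBackOff` cuyo último estado muestra el código 137 sugiere aumentar los límites de memoria antes de seguir depurando.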
+
+
+
+
+
+Cuando la interfaz de SleakOps no puede mostrar registros, usa kubectl directamente:
+
+```bash
+# Obtener estado y eventos de pods
+kubectl get pods -n <namespace>
+kubectl describe pod <nombre-del-pod> -n <namespace>
+
+# Obtener registros de pods caídos o reiniciándose
+kubectl logs <nombre-del-pod> -n <namespace> --previous
+
+# Obtener registros en tiempo real
+kubectl logs -f <nombre-del-pod> -n <namespace>
+
+# Revisar eventos del namespace
+kubectl get events -n <namespace> --sort-by='.lastTimestamp'
+```
+
+Busca:
+
+- Cantidad de reinicios del pod
+- Códigos de salida en la descripción del pod
+- Eventos recientes mostrando errores
+- Mensajes de límite de recursos excedido
+
+
+
+
+
+Los problemas de recursos son comunes cuando Docker local funciona pero el despliegue en Kubernetes falla:
+
+```bash
+# Revisar uso de recursos
+kubectl top pods -n <namespace>
+
+# Revisar límites de recursos en el despliegue
+kubectl get deployment <nombre-del-deployment> -n <namespace> -o yaml | grep -A 10 resources
+
+# Revisar recursos del nodo
+kubectl describe nodes
+```
+
+**Soluciones comunes:**
+
+1. **Incrementar límites de memoria**:
+
+```yaml
+resources:
+  limits:
+    memory: "1Gi" # Aumentar desde el valor por defecto
+    cpu: "500m"
+  requests:
+    memory: "512Mi"
+    cpu: "250m"
+```
+
+2. **Buscar fugas de memoria** en tu aplicación
+3. 
**Optimizar el arranque del contenedor** para reducir picos de recursos
+
+
+
+
+
+Chequeos de salud mal configurados pueden causar errores 502:
+
+```yaml
+# Ejemplo de configuración correcta de chequeos de salud
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8080
+  initialDelaySeconds: 30
+  periodSeconds: 10
+  timeoutSeconds: 5
+  failureThreshold: 3
+
+readinessProbe:
+  httpGet:
+    path: /ready
+    port: 8080
+  initialDelaySeconds: 5
+  periodSeconds: 5
+  timeoutSeconds: 3
+  failureThreshold: 3
+```
+
+**Consideraciones clave:**
+
+- Asegurar que los endpoints de chequeo de salud existan en tu aplicación
+- Configurar un `initialDelaySeconds` adecuado para el tiempo de arranque de la app
+- Usar endpoints diferentes para liveness y readiness si es posible
+- Considerar deshabilitar temporalmente los chequeos para depuración
+
+
+
+
+
+Cuando Docker funciona localmente pero Kubernetes falla, revisa:
+
+**1. Variables de entorno:**
+
+```bash
+# Comparar variables de entorno
+kubectl exec <nombre-del-pod> -n <namespace> -- env
+```
+
+**2. Permisos en el sistema de archivos:**
+
+- Kubernetes corre con contextos de usuario diferentes
+- Verifica si tu app escribe en directorios específicos
+- Asegura permisos adecuados en el Dockerfile
+
+**3. Configuración de red:**
+
+- La red en Kubernetes difiere de Docker
+- Verifica que tu app se enlace a `0.0.0.0` y no a `localhost`
+- Confirma que las configuraciones de puertos coinciden con las definiciones de servicio
+
+**4. Dependencias y servicios externos:**
+
+- Las conexiones a bases de datos pueden variar
+- Endpoints de APIs externas podrían ser inaccesibles
+- Diferencias en resolución DNS
+
+
+
+
+
+Para problemas específicos de SleakOps:
+
+**1. Revisar registros de compilación:**
+
+- Analiza el proceso de compilación en el panel de SleakOps
+- Busca advertencias o errores durante la creación de la imagen
+- Verifica que todos los pasos de compilación se completaron exitosamente
+
+**2. 
Configuración del despliegue:**
+
+- Verifica si la configuración del despliegue cambió
+- Confirma que las variables de entorno estén correctamente definidas
+- Asegura que secretos y configmaps sean accesibles
+
+**3. Configuración del servicio:**
+
+- Verifica que el servicio enrute correctamente a los pods
+- Revisa configuración de ingress si aplica
+- Asegura que el balanceador de carga esté saludable
+
+**4. Recursos de la plataforma:**
+
+- Comprueba que el clúster tenga recursos suficientes
+- Verifica que no haya problemas a nivel plataforma
+- Contacta soporte de SleakOps si la interfaz de la plataforma no responde
+
+
+
+
+
+Si el problema es urgente y requiere resolución inmediata:
+
+**1. Forzar reinicio del pod:**
+
+```bash
+kubectl delete pod <nombre-del-pod> -n <namespace>
+```
+
+**2. Escalar hacia abajo y luego hacia arriba:**
+
+```bash
+kubectl scale deployment <nombre-del-deployment> --replicas=0 -n <namespace>
+kubectl scale deployment <nombre-del-deployment> --replicas=1 -n <namespace>
+```
+
+**3. Incremento temporal de recursos:**
+
+- Aumentar temporalmente límites de recursos
+- Reducir escala de otros servicios no críticos
+- Usar nodos con más recursos si están disponibles
+
+**4. Rollback a imagen conocida como funcional:**
+
+```bash
+# Rollback usando kubectl
+kubectl rollout undo deployment/<nombre-del-deployment> -n <namespace>
+
+# Verificar estado del rollback
+kubectl rollout status deployment/<nombre-del-deployment> -n <namespace>
+```
+
+**5. Despliegue de emergencia con configuración mínima:**
+
+```yaml
+# deployment-emergency.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: emergency-deployment
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: emergency-app
+  template:
+    metadata:
+      labels:
+        app: emergency-app
+    spec:
+      containers:
+        - name: app
+          image: <imagen-conocida-funcional>
+          ports:
+            - containerPort: 8080
+          resources:
+            limits:
+              memory: "2Gi"
+              cpu: "1000m"
+            requests:
+              memory: "1Gi"
+              cpu: "500m"
+          # Deshabilitar chequeos de salud temporalmente
+          # livenessProbe: {}
+          # readinessProbe: {}
+```
+
+
+
+
+
+**1. 
Implementar chequeos de salud robustos:** + +```javascript +// Ejemplo para aplicación Node.js +const express = require('express'); +const app = express(); + +// Endpoint de health check básico +app.get('/health', (req, res) => { + res.status(200).json({ status: 'healthy', timestamp: new Date().toISOString() }); +}); + +// Endpoint de readiness más complejo +app.get('/ready', async (req, res) => { + try { + // Verificar conexiones a base de datos + await checkDatabaseConnection(); + + // Verificar servicios externos críticos + await checkExternalServices(); + + res.status(200).json({ + status: 'ready', + timestamp: new Date().toISOString(), + checks: { + database: 'ok', + external_services: 'ok' + } + }); + } catch (error) { + res.status(503).json({ + status: 'not ready', + error: error.message, + timestamp: new Date().toISOString() + }); + } +}); +``` + +**2. Configurar monitoreo proactivo:** + +```yaml +# Configuración de alertas +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: app-alerts +spec: + groups: + - name: app.rules + rules: + - alert: PodCrashLooping + expr: rate(kube_pod_container_status_restarts_total[15m]) > 0 + for: 5m + labels: + severity: critical + annotations: + summary: "Pod {{ $labels.pod }} is crash looping" + + - alert: HighMemoryUsage + expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9 + for: 2m + labels: + severity: warning + annotations: + summary: "High memory usage on {{ $labels.pod }}" +``` + +**3. Implementar despliegues graduales:** + +```yaml +# Estrategia de rolling update conservadora +spec: + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: 1 + minReadySeconds: 30 + progressDeadlineSeconds: 600 +``` + +**4. 
Usar staging que replique producción:**
+
+- Configurar entorno de staging con recursos similares a producción
+- Probar con volúmenes de datos realistas
+- Validar chequeos de salud en staging antes de producción
+- Implementar pruebas de carga automatizadas
+
+
+
+
+
+**1. Depuración con contenedores sidecar:**
+
+```yaml
+# Agregar contenedor de depuración
+spec:
+  containers:
+    - name: app
+      image: tu-app:latest
+    - name: debug
+      image: busybox
+      command: ['sleep', '3600']
+      volumeMounts:
+        - name: shared-logs
+          mountPath: /logs
+```
+
+**2. Usar herramientas de profiling:**
+
+```bash
+# Instalar herramientas de depuración en el pod
+kubectl exec -it <nombre-del-pod> -- /bin/bash
+
+# Dentro del pod
+apt-get update && apt-get install -y htop strace tcpdump
+
+# Monitorear procesos
+htop
+
+# Rastrear llamadas del sistema
+strace -p <pid>
+
+# Capturar tráfico de red
+tcpdump -i any port 8080
+```
+
+**3. Análisis de dumps de memoria:**
+
+```bash
+# Para aplicaciones Java
+kubectl exec <nombre-del-pod> -- jstack <pid>
+kubectl exec <nombre-del-pod> -- jmap -dump:format=b,file=/tmp/heap.hprof <pid>
+
+# Para aplicaciones Node.js
+kubectl exec <nombre-del-pod> -- kill -USR2 <pid> # Genera heap dump
+```
+
+**4. Depuración de red:**
+
+```bash
+# Verificar conectividad desde el pod
+kubectl exec <nombre-del-pod> -- nslookup <nombre-del-servicio>
+kubectl exec <nombre-del-pod> -- curl -v http://<servicio>:<puerto>/health
+
+# Verificar configuración de iptables
+kubectl exec <nombre-del-pod> -- iptables -L
+
+# Verificar rutas de red
+kubectl exec <nombre-del-pod> -- route -n
+```
+
+
+
+
+
+**1. Crear runbook de respuesta a incidentes:**
+
+```markdown
+# Runbook: Error 502 con Logs Inaccesibles
+
+## Pasos inmediatos (5 minutos)
+1. Verificar estado de pods: `kubectl get pods -n <namespace>`
+2. Revisar eventos recientes: `kubectl get events --sort-by='.lastTimestamp'`
+3. Intentar acceso a logs: `kubectl logs <nombre-del-pod> --previous`
+
+## Investigación (15 minutos)
+1. Verificar recursos: `kubectl top pods`
+2. Revisar chequeos de salud
+3. 
Comparar con configuración funcional anterior + +## Escalación +- Si no se resuelve en 20 minutos, contactar al equipo de plataforma +- Si es crítico para producción, implementar rollback inmediato +``` + +**2. Configurar dashboards de monitoreo:** + +```yaml +# Dashboard de Grafana para monitoreo de aplicación +{ + "dashboard": { + "title": "Application Health Dashboard", + "panels": [ + { + "title": "Pod Status", + "type": "stat", + "targets": [ + { + "expr": "kube_pod_status_phase{namespace=\"your-namespace\"}" + } + ] + }, + { + "title": "Memory Usage", + "type": "graph", + "targets": [ + { + "expr": "container_memory_usage_bytes{namespace=\"your-namespace\"}" + } + ] + }, + { + "title": "HTTP Response Codes", + "type": "graph", + "targets": [ + { + "expr": "rate(http_requests_total[5m])" + } + ] + } + ] + } +} +``` + +**3. Automatizar chequeos de salud:** + +```bash +#!/bin/bash +# health-check-script.sh + +NAMESPACE="your-namespace" +DEPLOYMENT="your-deployment" + +echo "Verificando salud del despliegue $DEPLOYMENT en namespace $NAMESPACE" + +# Verificar estado de pods +PODS=$(kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT --no-headers) +echo "Estado de pods:" +echo "$PODS" + +# Verificar si hay pods en estado no saludable +UNHEALTHY=$(echo "$PODS" | grep -v "Running\|Completed" | wc -l) + +if [ $UNHEALTHY -gt 0 ]; then + echo "ALERTA: $UNHEALTHY pods no están en estado saludable" + + # Obtener logs de pods problemáticos + kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT --no-headers | \ + grep -v "Running\|Completed" | \ + awk '{print $1}' | \ + while read pod; do + echo "Logs del pod $pod:" + kubectl logs $pod -n $NAMESPACE --tail=50 + done + + exit 1 +else + echo "Todos los pods están saludables" + exit 0 +fi +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx new file mode 100644 index 000000000..7796d1c6d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-application-crashes-memory-issues.mdx @@ -0,0 +1,256 @@ +--- +sidebar_position: 3 +title: "Fallos de Aplicación y Problemas de Memoria en Producción" +description: "Resolución de fallos de aplicación, problemas de memoria y errores 502 en entornos SleakOps" +date: "2024-06-09" +category: "workload" +tags: + ["memoria", "fallos", "error-502", "producción", "resolución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Fallos de Aplicación y Problemas de Memoria en Producción + +**Fecha:** 9 de junio de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Memoria, Fallos, Error 502, Producción, Resolución de Problemas + +## Descripción del Problema + +**Contexto:** Aplicaciones desplegadas en SleakOps que experimentan fallos y problemas de conectividad tanto en entornos de desarrollo como de producción, con errores específicos relacionados con la memoria que causan interrupciones del servicio. 
+
+**Síntomas Observados:**
+
+- El entorno de desarrollo no inicia correctamente
+- La aplicación en producción falla después del despliegue
+- Errores 502 Bad Gateway al acceder a la aplicación
+- Errores relacionados con memoria en los registros de la aplicación
+- La funcionalidad de reenvío de puertos no funciona correctamente
+- Problemas de resolución DNS para algunos subdominios
+
+**Configuración Relevante:**
+
+- Entorno: Desarrollo y producción
+- Aplicación: Aplicación web basada en monorepositorio
+- Ubicación del error: Línea 805 en los logs (problema de memoria)
+- Host: app.takenos.com
+- Servicio: web-app con endpoints tRPC
+
+**Condiciones de Error:**
+
+- Fallos ocurren durante el despliegue a producción
+- Agotamiento de memoria en ubicaciones específicas del código
+- Problemas intermitentes de conectividad
+- La aplicación parece saludable en Kubernetes pero devuelve errores
+
+## Solución Detallada
+
+
+
+Cuando tu aplicación falla debido a problemas de memoria, sigue estos pasos:
+
+1. **Verifica los límites actuales de memoria:**
+
+   ```bash
+   kubectl describe pod <nombre-del-pod> -n <namespace>
+   ```
+
+2. **Monitorea el uso de memoria:**
+
+   ```bash
+   kubectl top pod <nombre-del-pod> -n <namespace>
+   ```
+
+3. **Revisa los registros de la aplicación en busca de errores de memoria:**
+   ```bash
+   kubectl logs <nombre-del-pod> -n <namespace> --tail=100
+   ```
+
+Busca errores como:
+
+- "Out of memory"
+- "JavaScript heap out of memory"
+- "Cannot allocate memory"
+
+
+
+
+
+Para aumentar los límites de memoria de tu aplicación en SleakOps:
+
+1. **Accede a la configuración de tu proyecto**
+2. **Navega hasta la carga de trabajo afectada (servicio web)**
+3. **Ve a Configuración Avanzada → Recursos**
+4. 
**Incrementa los límites de memoria:**
+
+```yaml
+resources:
+  requests:
+    memory: "512Mi"
+    cpu: "250m"
+  limits:
+    memory: "2Gi" # Incrementa este valor
+    cpu: "1000m"
+```
+
+**Incrementos recomendados de memoria:**
+
+- Para aplicaciones Node.js: Comienza con 2Gi, aumenta a 4Gi si es necesario
+- Para aplicaciones Java: Comienza con 4Gi, aumenta a 8Gi si es necesario
+- Para aplicaciones Python: Comienza con 1Gi, aumenta a 2Gi si es necesario
+
+
+
+
+
+Cuando experimentes errores 502 a pesar de que los pods parecen saludables:
+
+1. **Verifica la salud del pod:**
+
+   ```bash
+   kubectl get pods -n <namespace>
+   kubectl describe pod <nombre-del-pod> -n <namespace>
+   ```
+
+2. **Revisa los endpoints del servicio:**
+
+   ```bash
+   kubectl get endpoints <nombre-del-servicio> -n <namespace>
+   ```
+
+3. **Prueba la conectividad directa al pod:**
+
+   ```bash
+   kubectl port-forward pod/<nombre-del-pod> 8080:8080 -n <namespace>
+   curl http://localhost:8080/health
+   ```
+
+4. **Verifica la configuración del ingress:**
+   ```bash
+   kubectl describe ingress <nombre-del-ingress> -n <namespace>
+   ```
+
+**Causas comunes:**
+
+- La aplicación no escucha en el puerto correcto
+- Fallos en los endpoints de chequeo de salud
+- Errores internos de la aplicación no reflejados en el estado del pod
+- Configuración incorrecta del ingress
+
+
+
+
+
+Para verificar la configuración de DNS e ingress:
+
+1. **Comprueba la resolución DNS:**
+
+   ```bash
+   dig A tu-dominio.com
+   nslookup tu-dominio.com
+   ```
+
+2. **Verifica el estado del balanceador de carga:**
+
+   ```bash
+   kubectl get svc -n <namespace>
+   kubectl describe svc <nombre-del-servicio> -n <namespace>
+   ```
+
+3. **Revisa los logs del controlador ingress:**
+
+   ```bash
+   kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
+   ```
+
+4. **Prueba el acceso directo al balanceador de carga:**
+   ```bash
+   curl -H "Host: tu-dominio.com" http://<ip-del-balanceador>
+   ```
+
+
+
+
+
+Para prevenir fallos durante los despliegues en producción:
+
+1. 
**Implementa chequeos de salud adecuados:**
+
+   ```yaml
+   livenessProbe:
+     httpGet:
+       path: /health
+       port: 8080
+     initialDelaySeconds: 30
+     periodSeconds: 10
+
+   readinessProbe:
+     httpGet:
+       path: /ready
+       port: 8080
+     initialDelaySeconds: 5
+     periodSeconds: 5
+   ```
+
+2. **Usa estrategia de despliegue rolling update:**
+
+   ```yaml
+   strategy:
+     type: RollingUpdate
+     rollingUpdate:
+       maxUnavailable: 1
+       maxSurge: 1
+   ```
+
+3. **Establece límites de recursos apropiados:**
+
+   - Siempre configura tanto requests como limits
+   - Monitorea el uso real y ajusta según sea necesario
+   - Deja margen para picos de tráfico
+
+4. **Prueba primero en entorno staging:**
+   - Despliega en desarrollo/staging antes de producción
+   - Realiza pruebas de carga para verificar uso de memoria
+   - Monitorea posibles fugas de memoria con el tiempo
+
+
+
+
+
+Si tu aplicación en producción está caída:
+
+1. **Rollback inmediato:**
+
+   ```bash
+   kubectl rollout undo deployment/<nombre-del-deployment> -n <namespace>
+   ```
+
+2. **Escala réplicas temporalmente:**
+
+   ```bash
+   kubectl scale deployment <nombre-del-deployment> --replicas=3 -n <namespace>
+   ```
+
+3. **Revisa cambios recientes:**
+
+   ```bash
+   kubectl rollout history deployment/<nombre-del-deployment> -n <namespace>
+   ```
+
+4. **Monitorea la recuperación:**
+
+   ```bash
+   kubectl get pods -n <namespace> -w
+   ```
+
+5. 
**Verifica que la aplicación responde:** + ```bash + curl -I https://tu-dominio.com + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 9 de junio de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-build-node-disk-space.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-build-node-disk-space.mdx new file mode 100644 index 000000000..a73f09401 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-build-node-disk-space.mdx @@ -0,0 +1,405 @@ +--- +sidebar_position: 3 +title: "Problemas de Espacio en Disco en Nodos de Construcción" +description: "Resolución de problemas y solución de problemas de espacio en disco en nodos de construcción" +date: "2024-09-10" +category: "cluster" +tags: + [ + "construcción", + "espacio-en-disco", + "nodos", + "resolución-de-problemas", + "almacenamiento", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Espacio en Disco en Nodos de Construcción + +**Fecha:** 10 de septiembre de 2024 +**Categoría:** Cluster +**Etiquetas:** Construcción, Espacio en Disco, Nodos, Resolución de Problemas, Almacenamiento + +## Descripción del Problema + +**Contexto:** Procesos de construcción que fallan intermitentemente debido a espacio insuficiente en disco en nodos Kubernetes, incluso después de aumentar el almacenamiento del nodo de 20GB a 40GB. 
+ +**Síntomas Observados:** + +- Los procesos de construcción se bloquean intermitentemente por problemas de espacio en disco +- Algunas construcciones fallan mientras otras tienen éxito inmediatamente después +- El problema persiste incluso tras aumentar el tamaño del disco del nodo a 40GB +- El uso de memoria del clúster muestra aproximadamente un 60% de utilización +- El problema ocurre de manera esporádica y no constante + +**Configuración Relevante:** + +- Tamaño del disco del nodo: Anteriormente 20GB, aumentado a 40GB +- Uso de memoria del clúster: ~60% +- Sistema de construcción: Construcciones basadas en Docker en nodos Kubernetes +- Plataforma: Clúster Kubernetes gestionado por SleakOps + +**Condiciones de Error:** + +- Las construcciones fallan cuando los nodos se quedan sin espacio en disco +- El problema ocurre durante el proceso de construcción de imágenes Docker +- Fallos intermitentes sugieren agotamiento temporal del espacio en disco +- El problema afecta a múltiples construcciones pero no de forma consistente + +## Solución Detallada + + + +Para identificar qué está consumiendo espacio en disco en tus nodos de construcción: + +1. **Accede por SSH al nodo afectado** (si tienes acceso) +2. **Usa el comando ncdu** para analizar el uso de disco: + +```bash +# Instalar ncdu si no está disponible +sudo apt-get update && sudo apt-get install ncdu + +# Analizar uso de disco desde la raíz +sudo ncdu / + +# Enfócate en áreas problemáticas comunes +sudo ncdu /var/lib/docker +sudo ncdu /var/log +sudo ncdu /tmp +``` + +3. **Revisa el uso específico de Docker**: + +```bash +# Ver uso del sistema Docker +docker system df + +# Listar todos los contenedores y sus tamaños +docker ps -a --size + +# Listar todas las imágenes y sus tamaños +docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}" +``` + + + + + +Causas típicas del agotamiento de espacio en disco en nodos de construcción: + +1. 
**Acumulación de capas Docker**: + + - Imágenes Docker no usadas consumiendo espacio + - Capas intermedias de construcción no limpiadas + - Caché de construcción Docker creciendo con el tiempo + +2. **Acumulación de archivos de registro**: + + - Logs de contenedores en `/var/log/containers/` + - Logs del sistema en `/var/log/` + - Logs de construcción que no se rotan + +3. **Archivos temporales**: + + - Artefactos de construcción en `/tmp` + - Cachés de gestores de paquetes + - Archivos temporales de aplicaciones + +4. **Imágenes Docker grandes**: + - Imágenes base innecesariamente grandes + - Construcciones en múltiples etapas no optimizadas + + + + + +Para liberar espacio en disco inmediatamente: + +```bash +# Limpiar sistema Docker (elimina contenedores, redes e imágenes no usadas) +docker system prune -af + +# Eliminar volúmenes no usados +docker volume prune -f + +# Limpiar caché de construcción +docker builder prune -af + +# Limpiar logs (ten cuidado con este comando) +sudo journalctl --vacuum-time=7d +sudo find /var/log -name "*.log" -type f -mtime +7 -delete + +# Limpiar archivos temporales +sudo rm -rf /tmp/* +sudo apt-get clean +``` + +**Nota**: Siempre realiza copias de seguridad de datos importantes antes de ejecutar comandos de limpieza. + + + + + +Para prevenir futuros problemas de espacio en disco, optimiza tus construcciones Docker: + +1. **Usa construcciones multi-etapa**: + +```dockerfile +# Ejemplo de construcción multi-etapa +FROM node:16 AS builder +WORKDIR /app +COPY package*.json ./ +RUN npm ci --only=production + +FROM node:16-alpine AS runtime +WORKDIR /app +COPY --from=builder /app/node_modules ./node_modules +COPY . . +CMD ["npm", "start"] +``` + +2. **Usa archivo .dockerignore**: + +```dockerignore +node_modules +.git +.gitignore +README.md +.env +.nyc_output +coverage +.cache +``` + +3. 
**Limpieza en el mismo comando RUN**: + +```dockerfile +RUN apt-get update && \ + apt-get install -y package-name && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* +``` + + + + + +Configura monitoreo para prevenir problemas futuros: + +1. **Monitorea el uso de disco**: + +```bash +# Ver uso actual de disco +df -h + +# Monitorear uso de disco con intervalos +watch -n 5 'df -h' + +# Configura alertas para uso de disco > 80% +``` + +2. **Implementa limpieza automática**: + +```bash +# Crear script de limpieza +#!/bin/bash +# cleanup-docker.sh + +echo "Iniciando limpieza de Docker..." +docker system prune -f +docker volume prune -f +echo "Limpieza de Docker completada" + +# Agregar a crontab para ejecutar diariamente +# 0 2 * * * /ruta/a/cleanup-docker.sh +``` + +3. **Configura rotación de logs**: + +```json +{ + "log-driver": "json-file", + "log-opts": { + "max-size": "10m", + "max-file": "3" + } +} +``` + + + + + +**Configurar límites de recursos para construcciones:** + +1. **Configurar límites de recursos en pods de construcción:** + +```yaml +apiVersion: v1 +kind: Pod +spec: + containers: + - name: build-container + image: docker:latest + resources: + limits: + memory: "2Gi" + cpu: "1000m" + ephemeral-storage: "10Gi" + requests: + memory: "1Gi" + cpu: "500m" + ephemeral-storage: "5Gi" +``` + +2. **Configurar políticas de limpieza automática:** + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: docker-cleanup +spec: + schedule: "0 2 * * *" # Ejecutar diariamente a las 2 AM + jobTemplate: + spec: + template: + spec: + containers: + - name: cleanup + image: docker:latest + command: + - /bin/sh + - -c + - | + docker system prune -af + docker volume prune -f + restartPolicy: OnFailure +``` + +3. 
**Monitorear uso de almacenamiento efímero:**

```bash
# Ver uso de almacenamiento efímero por pod
kubectl top pods --containers

# Describir uso de recursos de un pod específico
kubectl describe pod <nombre-del-pod>
```


**Configuraciones específicas para la plataforma SleakOps:**

1. **Configurar limpieza automática en SleakOps:**

   - Accede al panel de SleakOps
   - Ve a **Configuración del Clúster** → **Políticas de Limpieza**
   - Habilita **Limpieza Automática de Docker**
   - Configura frecuencia: **Diaria**

2. **Aumentar tamaño de disco de nodos:**

```bash
# Usando la CLI de SleakOps
sleakops cluster node-pool update --disk-size 60GB

# O desde el panel web:
# Clúster → Pools de Nodos → Editar → Aumentar Tamaño de Disco
```

3. **Configurar alertas de espacio en disco:**

```yaml
# Configuración de alerta en SleakOps
alerts:
  - name: "Espacio en Disco Bajo"
    condition: "disk_usage > 80%"
    action: "notify_team"
    channels: ["slack", "email"]
```

4. **Optimizar configuración de construcción:**

```yaml
# En la configuración de construcción de SleakOps
build:
  cache:
    enabled: true
    size: "5Gi"
  cleanup:
    after_build: true
    keep_artifacts: 3
```


**Estrategias para prevenir problemas futuros:**

1. **Implementar construcciones distribuidas:**

```yaml
# Configurar múltiples nodos de construcción
apiVersion: v1
kind: ConfigMap
metadata:
  name: build-config
data:
  strategy: "distributed"
  max_concurrent_builds: "3"
  node_selector: "build-node=true"
```

2. **Usar registro de contenedores externo:**

```dockerfile
# Usar registro externo para reducir almacenamiento local
FROM registry.external.com/base-image:latest
# En lugar de construir imágenes base localmente
```

3. 
**Implementar caché de construcción compartido:** + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: build-cache +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 50Gi + storageClassName: shared-storage +``` + +4. **Configurar métricas y alertas avanzadas:** + +```bash +# Instalar Prometheus para monitoreo +kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml + +# Configurar alertas específicas para espacio en disco +``` + +**Mejores prácticas a largo plazo:** + +- Revisar y optimizar Dockerfiles regularmente +- Implementar construcciones en múltiples etapas +- Usar imágenes base más pequeñas (Alpine Linux) +- Configurar limpieza automática en todos los nodos +- Monitorear tendencias de uso de espacio en disco +- Planificar capacidad basada en patrones de uso + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 10 de septiembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-high-cpu-usage.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-high-cpu-usage.mdx new file mode 100644 index 000000000..fc9f0af99 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-high-cpu-usage.mdx @@ -0,0 +1,228 @@ +--- +sidebar_position: 3 +title: "Investigación y Solución de Problemas de Alto Uso de CPU" +description: "Guía para investigar y resolver picos de alto uso de CPU en aplicaciones contenerizadas" +date: "2025-01-27" +category: "workload" +tags: + ["cpu", "rendimiento", "monitoreo", "solución de problemas", "contenedores"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Investigación y Solución de Problemas de Alto Uso de CPU + +**Fecha:** 27 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** CPU, 
Rendimiento, Monitoreo, Solución de problemas, Contenedores

## Descripción del Problema

**Contexto:** Aplicaciones contenerizadas en producción que experimentan picos inesperados en el uso de CPU que pueden afectar el rendimiento y la experiencia del usuario.

**Síntomas Observados:**

- Incremento súbito en la utilización de CPU
- Degradación del rendimiento de la aplicación
- Posible lentitud o tiempos de espera en el servicio
- Alertas de consumo de recursos activadas

**Configuración Relevante:**

- Runtime de contenedores: Docker/containerd
- Orquestación: Kubernetes
- Monitoreo: Disponible a través del panel de SleakOps
- Tipo de aplicación: Carga de trabajo en producción

**Condiciones de Error:**

- Picos en el uso de CPU por encima de la línea base normal
- Impacto en la capacidad de respuesta de la aplicación
- Escenarios potenciales de agotamiento de recursos

## Solución Detallada

Primero, identifique el alcance y la severidad del pico de CPU:

1. **Verifique las métricas actuales de CPU** en el panel de SleakOps
2. **Identifique los contenedores/pods afectados**
3. **Determine la línea de tiempo** de cuándo comenzó el pico
4. **Compare con las líneas base históricas**

```bash
# Ver uso actual de CPU para un contenedor específico
kubectl top pods -n <namespace>

# Obtener uso detallado de recursos
kubectl describe pod <nombre-del-pod> -n <namespace>
```


**1. Investigación a Nivel de Aplicación:**

- Revise los logs de la aplicación en busca de errores o actividad inusual
- Revise despliegues recientes o cambios de configuración
- Identifique nuevas funcionalidades o cambios en el código

```bash
# Revisar logs de la aplicación
kubectl logs <nombre-del-pod> -n <namespace> --tail=100

# Revisar eventos recientes
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

**2. 
Análisis a Nivel de Sistema:**

- Monitoree el uso de memoria (memoria alta puede causar picos de CPU)
- Verifique patrones de I/O de disco
- Revise patrones de tráfico de red

**3. Factores Externos:**

- Incremento en el tráfico o carga de usuarios
- Problemas de rendimiento en la base de datos
- Dependencias de servicios de terceros


**Acceso a Métricas de Rendimiento:**

1. Navegue al **Panel de SleakOps**
2. Vaya a **Monitoreo** → **Cargas de Trabajo**
3. Seleccione su contenedor/servicio específico
4. Revise los gráficos de **Uso de CPU** en diferentes periodos de tiempo

**Métricas Clave a Monitorear:**

- Porcentaje de utilización de CPU
- Patrones de uso de memoria
- Tiempos de solicitud/respuesta
- Tasas de error
- I/O de red

**Configurar Alertas:**

Configure alertas para incidentes futuros:

- Uso de CPU > 80% por más de 5 minutos
- Uso de memoria > 85%
- Tiempo de respuesta > umbral aceptable


**1. Escalar Recursos (Solución Rápida):**

```yaml
# Aumentar límites de CPU temporalmente
resources:
  limits:
    cpu: "2000m" # Incrementar desde el límite actual
    memory: "4Gi"
  requests:
    cpu: "1000m"
    memory: "2Gi"
```

**2. Escalado Horizontal:**

```bash
# Escalar réplicas temporalmente
kubectl scale deployment <nombre-del-deployment> --replicas=5 -n <namespace>
```

**3. Balanceo de Carga:**

- Asegurar que el tráfico se distribuya equitativamente entre las instancias
- Verificar si alguna instancia está recibiendo una carga desproporcionada


**1. Optimización de Código:**

- Perfilar el código de la aplicación para identificar operaciones intensivas en CPU
- Optimizar consultas a la base de datos
- Implementar estrategias de caché
- Revisar la eficiencia de los algoritmos

**2. Gestión de Recursos:**

```yaml
# Implementar solicitudes y límites adecuados de recursos
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"
```

**3. 
Configuración de Autoescalado:** + +```yaml +# Configurar Horizontal Pod Autoscaler +apiVersion: autoscaling/v2 +kind: HorizontalPodAutoscaler +metadata: + name: app-hpa +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: your-app + minReplicas: 2 + maxReplicas: 10 + metrics: + - type: Resource + resource: + name: cpu + target: + type: Utilization + averageUtilization: 70 +``` + + + + + +**1. Configuración de Monitoreo:** + +- Implementar monitoreo integral para todas las métricas críticas +- Configurar alertas proactivas antes de que los problemas sean críticos +- Revisiones regulares de líneas base de rendimiento + +**2. Pruebas de Carga:** + +- Realizar pruebas de carga periódicas para identificar cuellos de botella +- Probar con patrones de tráfico realistas +- Validar comportamiento del autoescalado + +**3. Planificación de Recursos:** + +- Dimensionar correctamente los contenedores según patrones reales de uso +- Planificar crecimiento de tráfico y variaciones estacionales +- Revisiones regulares de planificación de capacidad + +**4. 
Prácticas de Revisión de Código:** + +- Incluir consideraciones de rendimiento en las revisiones de código +- Monitorear el impacto en rendimiento de nuevos despliegues +- Implementar despliegues graduales para cambios mayores + + + +--- + +_Esta FAQ fue generada automáticamente el 27 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-web-service-dns-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-web-service-dns-issues.mdx new file mode 100644 index 000000000..e6603c1c8 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/troubleshooting-web-service-dns-issues.mdx @@ -0,0 +1,199 @@ +--- +sidebar_position: 3 +title: "Problemas de Configuración DNS en Servicios Web" +description: "Resolución de problemas de delegación DNS y conectividad de servicios web" +date: "2024-01-15" +category: "project" +tags: + [ + "dns", + "servicio-web", + "despliegue", + "resolucion-de-problemas", + "delegacion-dominio", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Configuración DNS en Servicios Web + +**Fecha:** 15 de enero de 2024 +**Categoría:** Proyecto +**Etiquetas:** DNS, Servicio Web, Despliegue, Resolución de Problemas, Delegación de Dominio + +## Descripción del Problema + +**Contexto:** El usuario crea un servicio web en SleakOps con compilación y despliegue exitosos, pero el servicio no es accesible mediante su nombre de dominio debido a problemas de configuración DNS. 
+ +**Síntomas observados:** + +- El servicio web devuelve error 404 al acceder mediante el dominio +- Las herramientas de propagación DNS no muestran registros DNS para el subdominio +- Los procesos de compilación y despliegue se completan con éxito +- El servicio aparece saludable en el clúster de Kubernetes + +**Configuración relevante:** + +- Tipo de servicio: Servicio web +- Dominio: Subdominio personalizado (por ejemplo, `subdominio.dominio.com`) +- Estado del despliegue: Exitoso +- Estado de compilación: Exitoso + +**Condiciones de error:** + +- Registros DNS no propagados globalmente +- Delegación DNS faltante para el subdominio +- Servicio accesible mediante port-forward pero no vía dominio +- Errores 404 al acceder mediante URL pública + +## Solución Detallada + + + +Antes de resolver problemas DNS, confirme que su servicio esté funcionando correctamente: + +1. **Verifique el estado de los pods en Lens o kubectl:** + + - Los pods deben aparecer en verde/en ejecución + - Los chequeos de salud del contenedor deben pasar + - Si los pods están amarillos, revise los logs para errores + +2. **Pruebe la conectividad local:** + + ```bash + # Reenvío de puerto para probar servicio localmente + kubectl port-forward service/nombre-de-su-servicio 8080:80 + # Luego pruebe: curl http://localhost:8080 + ``` + +3. **Verifique la configuración del chequeo de salud:** + - Asegúrese que la ruta del chequeo de salud sea correcta + - Pruebe endpoints HTTP y HTTPS + - Verifique requisitos de barra diagonal final + + + + + +Use herramientas DNS para diagnosticar problemas de delegación: + +1. **Verifique la delegación DNS con dig:** + + ```bash + # Compruebe si el subdominio está delegado correctamente + dig NS subdominio.sudominio.com + + # Debe devolver servidores de nombres de AWS Route53 como: + # subdominio.sudominio.com. 300 IN NS ns-xxx.awsdns-xx.com. + # subdominio.sudominio.com. 300 IN NS ns-xxx.awsdns-xx.co.uk. + ``` + +2. 
**Verifique resolución del registro A:** + + ```bash + # Verifique que el registro A apunte a la IP correcta + dig A subdominio.sudominio.com + + # Debe devolver la IP del balanceador de carga + ``` + +3. **Use herramientas en línea de propagación DNS:** + - Consulte https://www.whatsmydns.net/ + - Verifique la propagación global de DNS + - Busque inconsistencias entre regiones + + + + + +Si la delegación DNS falta o es incorrecta: + +1. **Identifique los registros DNS requeridos:** + + - Obtenga los servidores de nombres de la zona hospedada Route53 desde la consola AWS + - O revise el panel de SleakOps para configuración DNS + +2. **Agregue registros NS al dominio padre:** + + ``` + # Añada estos registros NS al DNS de su dominio principal: + subdominio.sudominio.com. IN NS ns-xxx.awsdns-xx.com. + subdominio.sudominio.com. IN NS ns-xxx.awsdns-xx.co.uk. + subdominio.sudominio.com. IN NS ns-xxx.awsdns-xx.net. + subdominio.sudominio.com. IN NS ns-xxx.awsdns-xx.org. + ``` + +3. **Espere la propagación:** + - Los cambios DNS pueden tardar 24-48 horas en propagarse completamente + - Use `dig` para monitorear el progreso de propagación + + + + + +**Pruebe diferentes variaciones de URL:** + +```bash +# Pruebe con y sin barra diagonal final +curl http://subdominio.sudominio.com/ +curl http://subdominio.sudominio.com + +# Pruebe tanto HTTP como HTTPS +curl https://subdominio.sudominio.com/ +curl http://subdominio.sudominio.com/ +``` + +**Verifique configuración de ingress:** + +```bash +# Confirme que el ingress esté configurado correctamente +kubectl get ingress -n su-namespace +kubectl describe ingress nombre-del-ingress -n su-namespace +``` + +**Monitoree propagación DNS:** + +```bash +# Consulte múltiples servidores DNS +nslookup subdominio.sudominio.com 8.8.8.8 +nslookup subdominio.sudominio.com 1.1.1.1 +``` + + + + + +**Antes de crear servicios web:** + +1. 
**Verifique la delegación del dominio:** + + - Asegúrese que el dominio padre esté configurado correctamente + - Pruebe la resolución DNS de subdominios existentes + +2. **Planifique la estructura de subdominios:** + + - Use convenciones de nombres consistentes + - Considere delegación wildcard para múltiples servicios + +3. **Monitoree el proceso de despliegue:** + - Verifique tanto la compilación COMO la configuración DNS + - Confirme que el controlador ingress esté en ejecución + +**Después del despliegue:** + +1. **Espere la propagación DNS:** + + - Permita de 15 a 30 minutos para propagación inicial + - Use múltiples herramientas de verificación DNS + +2. **Pruebe sistemáticamente:** + - Comience con pruebas por port-forward + - Luego pruebe vía IP del balanceador de carga + - Finalmente pruebe mediante el nombre de dominio + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/understanding-aws-eks-costs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/understanding-aws-eks-costs.mdx new file mode 100644 index 000000000..2d6c702e6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/understanding-aws-eks-costs.mdx @@ -0,0 +1,166 @@ +--- +sidebar_position: 3 +title: "Entendiendo los Costos de AWS EKS en SleakOps" +description: "Desglose de los costos del clúster AWS EKS incluyendo el plano de control y recursos de cómputo" +date: "2025-01-15" +category: "cluster" +tags: ["eks", "aws", "costos", "facturación", "nodegroup"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Entendiendo los Costos de AWS EKS en SleakOps + +**Fecha:** 15 de enero de 2025 +**Categoría:** Clúster +**Etiquetas:** EKS, AWS, Costos, Facturación, NodeGroup + +## Descripción del Problema + +**Contexto:** Los usuarios a menudo tienen preguntas sobre 
los costos del clúster AWS EKS al usar SleakOps, particularmente para entender las diferentes categorías de costos y qué genera cargos incluso durante las fases de desarrollo o prueba.

**Síntomas observados:**

- Cargos inesperados por clústeres EKS durante el desarrollo
- Confusión sobre las diferentes categorías de costos de AWS
- Preguntas sobre por qué ocurren costos incluso con uso mínimo
- Dificultad para entender la relación entre los componentes del clúster y la facturación

**Configuración relevante:**

- Clúster AWS EKS con plano de control
- NodeGroups para nodos trabajadores
- Karpenter para autoescalado
- Varios complementos y aplicaciones desplegadas

**Condiciones de error:**

- Usuarios ven cargos que no comprenden
- Entornos de desarrollo generan costos inesperados
- Dificultad para correlacionar uso con facturación

## Solución Detallada

El cargo por **Amazon Elastic Container Service para Kubernetes** es el costo base de operar un clúster EKS:

- **Costo fijo**: $73 USD por mes por clúster
- **Qué cubre**: El plano de control de EKS (nodos maestros)
- **Siempre cobrado**: Este costo aplica independientemente del uso de cargas de trabajo
- **Prorrateado**: Si tu clúster existe solo una parte del mes, pagas proporcionalmente

**Ejemplo**: Si tu clúster funciona del 10 al 31 de diciembre (22 días), pagas aproximadamente $52 USD (22/31 × $73).

Para información detallada de precios, visita la [página oficial de precios de AWS EKS](https://aws.amazon.com/eks/pricing/). 
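El prorrateo descrito arriba se puede verificar con un cálculo rápido. Un esbozo mínimo en bash (la tarifa y el conteo de días son ilustrativos; del 10 al 31 de diciembre, inclusive, son 22 días):

```shell
#!/bin/sh
# Prorrateo del plano de control EKS: tarifa mensual * (días activos / días del mes)
TARIFA_MENSUAL=73   # USD por mes por clúster (plano de control EKS)
DIAS_ACTIVOS=22     # del 10 al 31 de diciembre, inclusive
DIAS_DEL_MES=31

COSTO=$(awk -v t="$TARIFA_MENSUAL" -v a="$DIAS_ACTIVOS" -v m="$DIAS_DEL_MES" \
  'BEGIN { printf "%.2f", t * a / m }')

echo "Costo prorrateado: \$${COSTO} USD"  # → Costo prorrateado: $51.81 USD
```

El mismo cálculo sirve para estimar cualquier recurso facturado por hora o por día antes de crear el clúster.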
+ + + + + +Los cargos por **Elastic Compute Cloud (EC2)** cubren los nodos trabajadores que ejecutan tus aplicaciones: + +**NodeGroups**: + +- Instancias EC2 que forman los nodos trabajadores de tu clúster +- Cobradas según tipo de instancia y tiempo de ejecución +- Ejemplos: t3.medium, t3.large, m5.xlarge + +**Instancias gestionadas por Karpenter**: + +- Instancias aprovisionadas automáticamente según la demanda de cargas +- Escala hacia arriba cuando las aplicaciones necesitan recursos +- Escala hacia abajo cuando los recursos ya no son necesarios + +**Factores de costo**: + +- Tipo de instancia (CPU, memoria, rendimiento de red) +- Número de instancias +- Tiempo de ejecución +- Transferencia de datos + + + + + +Incluso los entornos de desarrollo generan costos porque: + +1. **El plano de control siempre está activo**: Aplica el cargo de $73/mes de EKS +2. **Nodos trabajadores mínimos**: Usualmente se necesitan al menos 1-2 nodos para funcionalidad básica +3. **Los complementos consumen recursos**: Monitoreo, registro y otras herramientas requieren potencia de cómputo +4. **Procesos en segundo plano**: Pods y servicios del sistema corren continuamente + +**Estrategias para optimizar costos**: + +```yaml +# Configuración de desarrollo optimizada para costos +nodegroup_config: + instance_types: ["t3.small", "t3.medium"] + min_size: 1 + max_size: 3 + desired_size: 1 + +karpenter_config: + enabled: true + spot_instances: true # Usar instancias spot para ahorro + +scheduling: + auto_shutdown: "18:00" # Apagar cargas no esenciales + auto_startup: "09:00" # Iniciar cargas en horas laborales +``` + + + + + +**En la Consola AWS**: + +1. Ve a **Cost Explorer** +2. Filtra por servicio: **Amazon Elastic Kubernetes Service** +3. Agrupa por: **Servicio** y **Tipo de Uso** +4. 
Configura **Alertas de Presupuesto** para aumentos inesperados de costos + +**En SleakOps**: + +- Monitorea el uso de recursos del clúster en el panel +- Revisa las aplicaciones desplegadas y sus solicitudes de recursos +- Usa la configuración de escalado del clúster para controlar recursos máximos + +**Buenas prácticas**: + +- Usar instancias spot para cargas de desarrollo +- Implementar autoescalado del clúster +- Revisar y limpiar recursos no usados regularmente +- Considerar hibernación del clúster para entornos no productivos + + + + + +Para un clúster pequeño de desarrollo funcionando un mes completo: + +**Costos fijos**: + +- Plano de Control EKS: $73.00 + +**Costos variables**: + +- 2 × nodos t3.medium (24/7): ~ $60.00 +- Instancias adicionales Karpenter (ocasionales): ~ $15.00 +- Transferencia de datos y almacenamiento: ~ $5.00 + +**Costo mensual total estimado**: ~ $153.00 + +**Para uso parcial de mes** (como del 10 al 31 de diciembre): + +- Plano de Control EKS: $51.00 (prorrateado) +- Recursos de cómputo: $34.00 (prorrateado) +- **Total**: ~ $85.00 + +Estos son costos típicos para un entorno de desarrollo con uso mínimo pero continuo. 
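Los totales estimados de arriba se pueden verificar sumando las partidas; un esbozo en bash con los montos del ejemplo (todos los valores son ilustrativos):

```shell
#!/bin/sh
# Suma de los costos mensuales estimados del ejemplo (montos ilustrativos)
awk 'BEGIN {
  eks       = 73.00  # plano de control EKS
  nodos     = 60.00  # 2 x t3.medium, 24/7
  karpenter = 15.00  # instancias ocasionales de Karpenter
  extras    =  5.00  # transferencia de datos y almacenamiento
  printf "Total mensual estimado: $%.2f USD\n", eks + nodos + karpenter + extras
}'
# → Total mensual estimado: $153.00 USD
```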
+ + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-access-control-production-environments.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-access-control-production-environments.mdx new file mode 100644 index 000000000..6c8686492 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-access-control-production-environments.mdx @@ -0,0 +1,890 @@ +--- +sidebar_position: 3 +title: "Control de Acceso de Usuarios en Entornos de Producción" +description: "Cómo gestionar permisos de usuarios y control de acceso para recursos de producción en SleakOps" +date: "2024-02-17" +category: "usuario" +tags: ["control-de-acceso", "iam", "permisos", "produccion", "seguridad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Control de Acceso de Usuarios en Entornos de Producción + +**Fecha:** 17 de febrero de 2024 +**Categoría:** Usuario +**Etiquetas:** Control de Acceso, IAM, Permisos, Producción, Seguridad + +## Descripción del Problema + +**Contexto:** Las organizaciones necesitan implementar un control de acceso granular cuando usan un solo clúster para múltiples entornos (desarrollo, staging, producción) mientras restringen el acceso a producción a miembros específicos del equipo. 
+ +**Síntomas Observados:** + +- Necesidad de limitar el acceso al entorno de producción a solo 2 miembros del equipo +- Los desarrolladores requieren acceso únicamente a recursos de desarrollo y staging +- Incertidumbre sobre a qué recursos pueden acceder los diferentes roles de usuario +- Preguntas sobre permisos para manipular buckets S3 y bases de datos + +**Configuración Relevante:** + +- Un solo clúster que aloja múltiples entornos +- Entornos de producción, desarrollo y staging en la misma cuenta de AWS +- Necesidad de control de acceso basado en roles (Viewer, Editor, Admin) +- Integración de AWS IAM con SleakOps + +**Condiciones de Error:** + +- Riesgo de acceso no autorizado a recursos de producción +- Posible manipulación de datos por usuarios con permisos excesivos +- Falta de segregación adecuada entre entornos + +## Solución Detallada + + + +El rol **Viewer** en SleakOps tiene tres requisitos principales de acceso: + +1. **Acceso a la Cuenta AWS**: Los usuarios reciben la política administrada de AWS [ReadOnlyAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/ReadOnlyAccess.html) +2. **Acceso VPN**: Requerido para acceder a recursos internos +3. 
**Acceso a la Plataforma SleakOps**: Acceso a la interfaz de SleakOps + +**Lo que el rol Viewer PUEDE hacer:** + +- Ver VariableGroups en SleakOps +- Acceder a RDS y otras dependencias a través de VPN +- Ver recursos y configuraciones del clúster +- Acceder a detalles de conexión de base de datos desde VariableGroups o secretos del clúster +- Ver logs de aplicación a través de Grafana/Loki (si está configurado) +- Monitorear métricas de rendimiento de aplicaciones + +**Lo que el rol Viewer NO PUEDE hacer:** + +- Editar objetos en S3 mediante la consola de AWS (requiere rol Editor) +- Modificar VariableGroups (requiere rol Editor) +- Realizar cambios en configuraciones del clúster +- Ejecutar comandos kubectl directamente en producción +- Acceder a secretos sensibles de producción + +**Políticas IAM aplicadas al rol Viewer:** + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "ec2:Describe*", + "rds:Describe*", + "s3:ListBucket", + "s3:GetObject", + "eks:DescribeCluster", + "eks:ListClusters", + "logs:DescribeLogGroups", + "logs:DescribeLogStreams", + "cloudwatch:GetMetricStatistics", + "cloudwatch:ListMetrics" + ], + "Resource": "*" + }, + { + "Effect": "Deny", + "Action": [ + "s3:PutObject", + "s3:DeleteObject", + "rds:ModifyDBInstance", + "ec2:TerminateInstances", + "eks:UpdateClusterConfig" + ], + "Resource": "*" + } + ] +} +``` + + + + + +### Estrategia 1: Acceso Solo por VPN + +Para desarrolladores que solo necesitan acceso a bases de datos: + +```yaml +Nivel de Acceso: Solo VPN +Permisos: + - Conectar a bases de datos de desarrollo/staging + - Sin acceso a consola AWS + - Sin acceso a plataforma SleakOps +Implementación: Compartir credenciales de base de datos directamente +Ventajas: + - Acceso mínimo y controlado + - Sin posibilidad de manipular recursos AWS + - Ideal para desarrolladores que solo necesitan datos +``` + +### Estrategia 2: Rol IAM Personalizado "Developer" + +Crear un rol personalizado 
"Developer" en la cuenta de producción: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "ec2:DescribeInstances", + "ec2:DescribeSecurityGroups", + "rds:DescribeDBInstances", + "rds:Connect", + "s3:ListBucket", + "s3:GetObject" + ], + "Resource": "*", + "Condition": { + "StringEquals": { + "aws:RequestedRegion": ["us-west-2", "us-east-1"] + } + } + }, + { + "Effect": "Allow", + "Action": ["s3:PutObject", "s3:DeleteObject"], + "Resource": ["arn:aws:s3:::dev-bucket/*", "arn:aws:s3:::staging-bucket/*"] + }, + { + "Effect": "Deny", + "Action": ["s3:PutObject", "s3:DeleteObject"], + "Resource": ["arn:aws:s3:::prod-bucket/*"] + } + ] +} +``` + +### Estrategia 3: Segregación por Namespace Kubernetes + +Usar RBAC de Kubernetes para limitar acceso por namespace: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + namespace: development + name: developer-role +rules: + - apiGroups: [""] + resources: ["pods", "services", "configmaps", "secrets"] + verbs: ["get", "list", "create", "update", "patch", "delete"] + - apiGroups: ["apps"] + resources: ["deployments", "replicasets"] + verbs: ["get", "list", "create", "update", "patch", "delete"] + +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: developer-binding + namespace: development +subjects: + - kind: User + name: developer-user + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: Role + name: developer-role + apiGroup: rbac.authorization.k8s.io +``` + + + + + +**Configuración para acceso limitado a producción (solo 2 miembros del equipo):** + +### Paso 1: Crear Grupo IAM de Producción + +```json +{ + "GroupName": "ProductionAccess", + "Path": "/", + "PolicyDocument": { + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "eks:DescribeCluster", + "eks:ListClusters", + "eks:DescribeNodegroup", + "eks:ListNodegroups" + ], + "Resource": "*" + }, + { + "Effect": "Allow", + "Action": 
["s3:*"], + "Resource": ["arn:aws:s3:::prod-bucket", "arn:aws:s3:::prod-bucket/*"] + }, + { + "Effect": "Allow", + "Action": ["rds:*"], + "Resource": "arn:aws:rds:*:*:db:prod-*" + } + ] + } +} +``` + +### Paso 2: Configurar RBAC Kubernetes para Producción + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: production-admin +rules: + - apiGroups: [""] + resources: ["*"] + verbs: ["*"] + resourceNames: [] + - apiGroups: ["apps", "extensions", "networking.k8s.io"] + resources: ["*"] + verbs: ["*"] + +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: production-admin-binding +subjects: + - kind: User + name: prod-admin-1@company.com + apiGroup: rbac.authorization.k8s.io + - kind: User + name: prod-admin-2@company.com + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: ClusterRole + name: production-admin + apiGroup: rbac.authorization.k8s.io +``` + +### Paso 3: Configurar Acceso a SleakOps + +```yaml +# Configuración en SleakOps para usuarios de producción +Users: + - email: prod-admin-1@company.com + role: Admin + environments: ["production"] + permissions: + - can_deploy: true + - can_modify_variables: true + - can_access_secrets: true + - can_scale_resources: true + + - email: prod-admin-2@company.com + role: Admin + environments: ["production"] + permissions: + - can_deploy: true + - can_modify_variables: true + - can_access_secrets: true + - can_scale_resources: true + + - email: developer@company.com + role: Editor + environments: ["development", "staging"] + permissions: + - can_deploy: true + - can_modify_variables: true + - can_access_secrets: false + - can_scale_resources: false +``` + + + + + +**Implementación de segregación usando namespaces y node selectors:** + +### Configuración de Namespaces + +```yaml +# Namespace para desarrollo +apiVersion: v1 +kind: Namespace +metadata: + name: development + labels: + environment: dev + access-level: open + annotations: + 
scheduler.alpha.kubernetes.io/node-selector: environment=dev + +--- +# Namespace para staging +apiVersion: v1 +kind: Namespace +metadata: + name: staging + labels: + environment: staging + access-level: restricted + annotations: + scheduler.alpha.kubernetes.io/node-selector: environment=staging + +--- +# Namespace para producción +apiVersion: v1 +kind: Namespace +metadata: + name: production + labels: + environment: prod + access-level: highly-restricted + annotations: + scheduler.alpha.kubernetes.io/node-selector: environment=prod +``` + +### Network Policies para Aislamiento + +```yaml +# Política de red para aislar producción +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: production-isolation + namespace: production +spec: + podSelector: {} + policyTypes: + - Ingress + - Egress + ingress: + - from: + - namespaceSelector: + matchLabels: + name: production + - namespaceSelector: + matchLabels: + name: monitoring + egress: + - to: + - namespaceSelector: + matchLabels: + name: production + - to: [] + ports: + - protocol: TCP + port: 443 + - protocol: UDP + port: 53 + +--- +# Política de red para desarrollo (más permisiva) +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: development-policy + namespace: development +spec: + podSelector: {} + policyTypes: + - Ingress + ingress: + - from: + - namespaceSelector: + matchLabels: + environment: dev + - namespaceSelector: + matchLabels: + environment: staging +``` + +### Taints y Tolerations para Nodos + +```yaml +# Configuración para nodos de producción +apiVersion: v1 +kind: Node +metadata: + name: prod-node-1 +spec: + taints: + - key: environment + value: production + effect: NoSchedule + - key: access-level + value: restricted + effect: NoExecute + +--- +# Tolerations para pods de producción +apiVersion: apps/v1 +kind: Deployment +metadata: + name: prod-app + namespace: production +spec: + template: + spec: + tolerations: + - key: environment + operator: Equal + value: 
production + effect: NoSchedule + - key: access-level + operator: Equal + value: restricted + effect: NoExecute + nodeSelector: + environment: production +``` + + + + + +**Configuración de acceso granular a bases de datos por entorno:** + +### Usuarios de Base de Datos por Entorno + +```sql +-- Usuario para desarrollo (permisos completos en DB dev) +CREATE USER 'dev_user'@'%' IDENTIFIED BY 'dev_password'; +GRANT ALL PRIVILEGES ON dev_database.* TO 'dev_user'@'%'; + +-- Usuario para staging (permisos limitados) +CREATE USER 'staging_user'@'%' IDENTIFIED BY 'staging_password'; +GRANT SELECT, INSERT, UPDATE, DELETE ON staging_database.* TO 'staging_user'@'%'; + +-- Usuario para producción (solo lectura para desarrolladores) +CREATE USER 'prod_readonly'@'%' IDENTIFIED BY 'readonly_password'; +GRANT SELECT ON prod_database.* TO 'prod_readonly'@'%'; + +-- Usuario administrador de producción +CREATE USER 'prod_admin'@'%' IDENTIFIED BY 'admin_password'; +GRANT ALL PRIVILEGES ON prod_database.* TO 'prod_admin'@'%'; +``` + +### Configuración de Security Groups + +```json +{ + "SecurityGroups": [ + { + "GroupName": "rds-dev-access", + "Description": "Access to development database", + "Rules": [ + { + "Type": "Ingress", + "Protocol": "TCP", + "Port": 5432, + "Source": "sg-dev-workers" + }, + { + "Type": "Ingress", + "Protocol": "TCP", + "Port": 5432, + "Source": "vpn-subnet" + } + ] + }, + { + "GroupName": "rds-prod-access", + "Description": "Access to production database", + "Rules": [ + { + "Type": "Ingress", + "Protocol": "TCP", + "Port": 5432, + "Source": "sg-prod-workers", + "Description": "Production workloads only" + }, + { + "Type": "Ingress", + "Protocol": "TCP", + "Port": 5432, + "Source": "vpn-admin-subnet", + "Description": "Admin VPN access only" + } + ] + } + ] +} +``` + +### Secretos Kubernetes por Entorno + +```yaml +# Secreto para desarrollo +apiVersion: v1 +kind: Secret +metadata: + name: db-credentials + namespace: development +type: Opaque +data: + 
username: ZGV2X3VzZXI= # dev_user + password: ZGV2X3Bhc3N3b3Jk # dev_password + host: ZGV2LWRhdGFiYXNlLmludGVybmFs # dev-database.internal + +--- +# Secreto para producción (acceso limitado) +apiVersion: v1 +kind: Secret +metadata: + name: db-credentials + namespace: production +type: Opaque +data: + username: cHJvZF9hZG1pbg== # prod_admin + password: YWRtaW5fcGFzc3dvcmQ= # admin_password + host: cHJvZC1kYXRhYmFzZS5pbnRlcm5hbA== # prod-database.internal +``` + + + + + +**Configuración de políticas granulares para buckets S3:** + +### Política de Bucket para Desarrollo + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "DeveloperAccess", + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::ACCOUNT:role/DeveloperRole", + "arn:aws:iam::ACCOUNT:user/dev-user-1", + "arn:aws:iam::ACCOUNT:user/dev-user-2" + ] + }, + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:DeleteObject", + "s3:ListBucket" + ], + "Resource": ["arn:aws:s3:::dev-bucket", "arn:aws:s3:::dev-bucket/*"] + } + ] +} +``` + +### Política de Bucket para Producción + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "ProductionAdminAccess", + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::ACCOUNT:user/prod-admin-1", + "arn:aws:iam::ACCOUNT:user/prod-admin-2" + ] + }, + "Action": ["s3:*"], + "Resource": ["arn:aws:s3:::prod-bucket", "arn:aws:s3:::prod-bucket/*"] + }, + { + "Sid": "DeveloperReadOnlyAccess", + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT:role/DeveloperRole" + }, + "Action": ["s3:GetObject", "s3:ListBucket"], + "Resource": ["arn:aws:s3:::prod-bucket", "arn:aws:s3:::prod-bucket/*"] + }, + { + "Sid": "DenyDeveloperWriteAccess", + "Effect": "Deny", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT:role/DeveloperRole" + }, + "Action": ["s3:PutObject", "s3:DeleteObject", "s3:PutObjectAcl"], + "Resource": ["arn:aws:s3:::prod-bucket/*"] + } + ] +} +``` + +### Configuración de CORS para Acceso Web + +```json +{ + 
"CORSRules": [ + { + "AllowedHeaders": ["*"], + "AllowedMethods": ["GET", "HEAD"], + "AllowedOrigins": [ + "https://dev-app.company.com", + "https://staging-app.company.com" + ], + "ExposeHeaders": ["ETag"], + "MaxAgeSeconds": 3000 + }, + { + "AllowedHeaders": ["*"], + "AllowedMethods": ["GET", "HEAD"], + "AllowedOrigins": ["https://app.company.com"], + "ExposeHeaders": ["ETag"], + "MaxAgeSeconds": 3000 + } + ] +} +``` + + + + + +**Configuración de logging y monitoreo para control de acceso:** + +### CloudTrail para Auditoría de AWS + +```json +{ + "TrailName": "access-audit-trail", + "S3BucketName": "audit-logs-bucket", + "IncludeGlobalServiceEvents": true, + "IsMultiRegionTrail": true, + "EnableLogFileValidation": true, + "EventSelectors": [ + { + "ReadWriteType": "All", + "IncludeManagementEvents": true, + "DataResources": [ + { + "Type": "AWS::S3::Object", + "Values": [ + "arn:aws:s3:::prod-bucket/*", + "arn:aws:s3:::staging-bucket/*" + ] + }, + { + "Type": "AWS::RDS::DBInstance", + "Values": ["*"] + } + ] + } + ] +} +``` + +### Alertas CloudWatch para Accesos Sospechosos + +```json +{ + "AlarmName": "UnauthorizedProductionAccess", + "AlarmDescription": "Alert on unauthorized access to production resources", + "MetricName": "ErrorCount", + "Namespace": "AWS/CloudTrail", + "Statistic": "Sum", + "Period": 300, + "EvaluationPeriods": 1, + "Threshold": 1, + "ComparisonOperator": "GreaterThanOrEqualToThreshold", + "AlarmActions": ["arn:aws:sns:region:account:security-alerts"], + "MetricFilters": [ + { + "FilterName": "ProductionAccessFilter", + "FilterPattern": "{ ($.eventName = \"AssumeRole\") && ($.responseElements.assumedRoleUser.arn = \"*prod*\") && ($.sourceIPAddress != \"10.0.0.0/8\") }" + } + ] +} +``` + +### Kubernetes Audit Logging + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: audit-policy + namespace: kube-system +data: + audit-policy.yaml: | + apiVersion: audit.k8s.io/v1 + kind: Policy + rules: + # Log production namespace access at 
highest level + - level: Metadata + namespaces: ["production"] + verbs: ["get", "list", "create", "update", "patch", "delete"] + resources: + - group: "" + resources: ["secrets", "configmaps"] + + # Log admin role usage + - level: Request + users: ["prod-admin-1@company.com", "prod-admin-2@company.com"] + verbs: ["create", "update", "patch", "delete"] + + # Log access denials + - level: Metadata + omitStages: + - RequestReceived + resources: + - group: "" + resources: ["*"] + namespaces: ["production"] + verbs: ["*"] + users: ["system:anonymous"] +``` + +### Script de Monitoreo de Accesos + +```bash +#!/bin/bash +# Script para monitorear accesos a recursos de producción + +LOG_FILE="/var/log/access-monitor.log" +ALERT_EMAIL="security@company.com" + +# Función para verificar accesos no autorizados a S3 +check_s3_access() { + echo "$(date): Verificando accesos a S3..." >> $LOG_FILE + + # Buscar accesos desde IPs fuera del rango interno 10.0.* + aws logs filter-log-events \ + --log-group-name CloudTrail/S3DataEvents \ + --start-time $(date -d '1 hour ago' +%s)000 \ + --filter-pattern '{ $.sourceIPAddress != 10.0.* && $.eventName = "GetObject" && $.requestParameters.bucketName = "prod-bucket" }' \ + --query 'events[].message' --output text > /tmp/s3_external_access.json + + if [ -s /tmp/s3_external_access.json ]; then + echo "ALERTA: Acceso externo detectado a bucket de producción" | mail -s "Security Alert: S3 Access" $ALERT_EMAIL + fi +} + +# Función para verificar accesos kubectl no autorizados +check_k8s_access() { + echo "$(date): Verificando accesos kubectl..." 
>> $LOG_FILE + + kubectl get events --all-namespaces --field-selector type=Warning \ + --output json | jq '.items[] | select(.reason == "Forbidden" and .involvedObject.namespace == "production")' > /tmp/k8s_forbidden.json + + if [ -s /tmp/k8s_forbidden.json ]; then + echo "ALERTA: Intento de acceso denegado a namespace de producción" | mail -s "Security Alert: K8s Access Denied" $ALERT_EMAIL + fi +} + +# Función para verificar conexiones de base de datos +check_db_connections() { + echo "$(date): Verificando conexiones de base de datos..." >> $LOG_FILE + + # Verificar conexiones activas en RDS de producción + aws rds describe-db-log-files --db-instance-identifier prod-db-instance \ + --query 'DescribeDBLogFiles[?contains(LogFileName, `error`)]' \ + --output table +} + +# Ejecutar verificaciones +check_s3_access +check_k8s_access +check_db_connections + +echo "$(date): Verificación de accesos completada" >> $LOG_FILE +``` + + + + + +**Pasos para implementar control de acceso granular:** + +### Fase 1: Preparación (Semana 1) + +```checklist +□ Inventariar todos los recursos por entorno +□ Identificar usuarios y sus necesidades de acceso +□ Documentar flujos de trabajo actuales +□ Crear matriz de permisos requeridos +□ Planificar migración gradual +``` + +### Fase 2: Configuración IAM (Semana 2) + +```checklist +□ Crear grupos IAM por rol (Developer, ProductionAdmin, Viewer) +□ Configurar políticas IAM granulares +□ Implementar políticas de bucket S3 +□ Configurar acceso VPN por grupos +□ Probar accesos en entorno de staging +``` + +### Fase 3: Configuración Kubernetes (Semana 3) + +```checklist +□ Implementar RBAC por namespace +□ Configurar Network Policies +□ Aplicar taints y tolerations a nodos +□ Configurar audit logging +□ Validar segregación de entornos +``` + +### Fase 4: Monitoreo y Auditoría (Semana 4) + +```checklist +□ Configurar CloudTrail +□ Implementar alertas CloudWatch +□ Configurar logging de auditoría Kubernetes +□ Crear dashboards de monitoreo +□ 
Documentar procedimientos de respuesta a incidentes +``` + +### Validación Final + +```bash +#!/bin/bash +# Script de validación de control de acceso + +echo "=== Validación de Control de Acceso ===" + +# Validar que desarrolladores no pueden acceder a producción +echo "1. Validando acceso de desarrolladores..." +kubectl auth can-i create pods --namespace=production --as=developer@company.com +if [ $? -eq 0 ]; then + echo "❌ ERROR: Desarrollador tiene acceso a producción" +else + echo "✅ OK: Desarrollador no tiene acceso a producción" +fi + +# Validar que solo admins pueden acceder a secretos de producción +echo "2. Validando acceso a secretos..." +kubectl auth can-i get secrets --namespace=production --as=prod-admin-1@company.com +if [ $? -eq 0 ]; then + echo "✅ OK: Admin puede acceder a secretos de producción" +else + echo "❌ ERROR: Admin no puede acceder a secretos" +fi + +# Validar políticas S3 +echo "3. Validando acceso S3..." +aws s3api head-object --bucket prod-bucket --key test-object 2>/dev/null +if [ $? 
-eq 0 ]; then + echo "✅ OK: Acceso S3 configurado correctamente" +else + echo "❌ ERROR: Problema con acceso S3" +fi + +echo "=== Validación Completada ===" +``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 17 de febrero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-deletion-aws-account-binding-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-deletion-aws-account-binding-error.mdx new file mode 100644 index 000000000..b8a35a1e9 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-deletion-aws-account-binding-error.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "Error de Eliminación de Usuario Debido a la Vinculación con Cuenta AWS" +description: "Solución para problemas de eliminación de usuarios cuando están vinculados a cuentas AWS" +date: "2024-01-15" +category: "usuario" +tags: + [ + "gestion-de-usuarios", + "aws", + "eliminacion", + "vinculacion-cuenta", + "resolucion-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Eliminación de Usuario Debido a la Vinculación con Cuenta AWS + +**Fecha:** 15 de enero de 2024 +**Categoría:** Usuario +**Etiquetas:** Gestión de Usuarios, AWS, Eliminación, Vinculación de Cuenta, Resolución de Problemas + +## Descripción del Problema + +**Contexto:** Los usuarios que intentan eliminar una cuenta de usuario en la plataforma SleakOps encuentran errores cuando el usuario está vinculado a una cuenta AWS. El proceso de eliminación falla, dejando al usuario en un estado pendiente que impide su recreación. 
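Como ayuda para el diagnóstico: un rastro habitual de este estado son roles IAM residuales con el prefijo `sleakops-user-` (véase la limpieza manual más abajo). Un boceto para filtrarlos, con nombres de rol de ejemplo e hipotéticos (en la práctica, el listado vendría de `aws iam list-roles`):

```bash
# Nombres de rol de ejemplo (hipotéticos); en la práctica, la lista
# vendría de `aws iam list-roles`.
roles="sleakops-user-8f3a-project-x
eks-node-group-role
sleakops-user-8f3a-ci"

# Conservar solo los roles con el prefijo por-usuario de SleakOps
orphans=$(printf '%s\n' "$roles" | grep '^sleakops-user-')
printf '%s\n' "$orphans"
```

Los identificadores (`8f3a`, nombres de proyecto) son ilustrativos; el formato real del prefijo es `sleakops-user-{user-id}`.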
+ +**Síntomas Observados:** + +- El usuario aparece con estado de error en la interfaz +- El proceso de eliminación no se completa +- El usuario permanece en estado pendiente tras el intento de eliminación +- No se puede recrear el usuario con las mismas credenciales +- El error está relacionado con la vinculación a la cuenta AWS + +**Configuración Relevante:** + +- El usuario tiene vinculación activa con cuenta AWS +- Se intenta eliminar el usuario a través de la interfaz de SleakOps +- El usuario muestra estado de error antes del intento de eliminación +- Plataforma: sistema de gestión de usuarios de SleakOps + +**Condiciones de Error:** + +- Ocurre error durante el proceso de eliminación del usuario +- El usuario tiene asociaciones existentes con cuentas AWS +- La eliminación deja al usuario en un estado inconsistente +- Impide la recreación del usuario con la misma identidad + +## Solución Detallada + + + +Cuando un usuario está vinculado a una cuenta AWS en SleakOps, pueden estar asociados varios recursos: + +- Roles y políticas IAM +- Cuentas de servicio en clusters EKS +- Configuraciones de VPC y redes +- Permisos de acceso a recursos + +Estas vinculaciones deben limpiarse correctamente antes de que la eliminación del usuario pueda completarse con éxito. + + + + + +Antes de intentar eliminar un usuario con vinculación a cuenta AWS: + +1. **Verificar Recursos Activos**: + + - Navegar a Gestión de Usuarios → Detalles del Usuario + - Revisar la pestaña "Recursos AWS" + - Documentar cualquier cluster o proyecto activo + +2. **Eliminar Asociaciones AWS**: + + ```bash + # Eliminar usuario de todos los proyectos primero + sleakops project remove-user --user-email usuario@ejemplo.com --all-projects + + # Desvincular cuenta AWS + sleakops user unbind-aws --user-email usuario@ejemplo.com + ``` + +3. 
**Verificar Estado Limpio**: + - Asegurarse de que no haya clusters activos asignados al usuario + - Confirmar que no existan operaciones pendientes en AWS + - Verificar que los roles IAM hayan sido limpiados correctamente + + + + + +Para eliminar de forma segura un usuario vinculado a cuenta AWS: + +1. **Fase de Preparación**: + + ```bash + # Listar recursos activos del usuario + sleakops user list-resources --user-email usuario@ejemplo.com + + # Eliminar de todos los proyectos + sleakops project remove-user --user-email usuario@ejemplo.com --all-projects + ``` + +2. **Fase de Limpieza AWS**: + + ```bash + # Desvincular cuenta AWS + sleakops user unbind-aws --user-email usuario@ejemplo.com --force-cleanup + + # Esperar a que la limpieza finalice + sleakops user check-cleanup-status --user-email usuario@ejemplo.com + ``` + +3. **Eliminación Final**: + ```bash + # Eliminar usuario tras limpieza + sleakops user delete --user-email usuario@ejemplo.com --confirm + ``` + + + + + +Si un usuario queda atascado en estado pendiente de eliminación: + +1. **Verificar Estado de Eliminación**: + + ```bash + sleakops user status --user-email usuario@ejemplo.com + ``` + +2. **Forzar Limpieza** (solo administrador): + + ```bash + # Comando admin para forzar limpieza + sleakops admin user force-cleanup --user-email usuario@ejemplo.com + + # Completar la eliminación + sleakops admin user complete-deletion --user-email usuario@ejemplo.com + ``` + +3. **Limpieza Manual en AWS** (si es necesario): + - Acceder a la Consola AWS IAM + - Eliminar cualquier rol restante con prefijo `sleakops-user-{user-id}` + - Limpiar cuentas de servicio huérfanas en clusters EKS + + + + + +Una vez completada la limpieza, para recrear el usuario: + +1. **Verificar Estado Limpio**: + + ```bash + sleakops user check --email usuario@ejemplo.com + # Debe devolver "Usuario no encontrado" + ``` + +2. 
**Crear Nuevo Usuario**: + + ```bash + sleakops user create --email usuario@ejemplo.com --name "Nombre Usuario" --role member + ``` + +3. **Re-vincular Cuenta AWS** (si es necesario): + ```bash + sleakops user bind-aws --user-email usuario@ejemplo.com --aws-account-id 123456789012 + ``` + + + + + +Para evitar problemas futuros al eliminar usuarios: + +1. **Limpieza Regular**: + + - Eliminar usuarios de proyectos antes de la eliminación + - Desvincular cuentas AWS cuando ya no se necesiten + - Monitorear regularmente el uso de recursos por usuario + +2. **Flujo de Trabajo Adecuado**: + + ```bash + # Flujo recomendado para eliminación + sleakops user prepare-deletion --user-email usuario@ejemplo.com + sleakops user confirm-deletion --user-email usuario@ejemplo.com + ``` + +3. **Monitoreo**: + - Configurar alertas para operaciones fallidas de usuarios + - Auditorías regulares de vinculaciones usuario-AWS + - Documentar asociaciones de recursos de usuario + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-mfa-recovery-after-device-loss.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-mfa-recovery-after-device-loss.mdx new file mode 100644 index 000000000..ca49fba1c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-mfa-recovery-after-device-loss.mdx @@ -0,0 +1,185 @@ +--- +sidebar_position: 3 +title: "Recuperación de MFA Después de la Pérdida del Dispositivo" +description: "Cómo recuperar el acceso cuando se pierde o roba el dispositivo MFA" +date: "2024-10-10" +category: "usuario" +tags: ["mfa", "2fa", "autenticación", "recuperación", "seguridad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Recuperación de MFA Después de la Pérdida del Dispositivo + +**Fecha:** 10 de octubre de 2024 +**Categoría:** 
Usuario +**Etiquetas:** MFA, 2FA, Autenticación, Recuperación, Seguridad + +## Descripción del Problema + +**Contexto:** El usuario ha perdido el acceso a su dispositivo MFA (Autenticación Multifactor) debido a robo, pérdida o daño, impidiéndole iniciar sesión en su cuenta de SleakOps. + +**Síntomas Observados:** + +- No puede completar la autenticación 2FA durante el inicio de sesión +- Los códigos MFA ya no están accesibles +- El usuario está completamente bloqueado de su cuenta +- El proceso estándar de inicio de sesión falla en el paso de verificación MFA + +**Configuración Relevante:** + +- La cuenta tiene MFA/2FA habilitado +- El dispositivo de autenticación principal ya no está disponible +- La cuenta de correo electrónico del usuario sigue accesible +- La recuperación de la cuenta puede requerir intervención del administrador + +**Condiciones de Error:** + +- La verificación MFA falla durante el inicio de sesión +- No hay métodos de autenticación de respaldo disponibles +- El usuario no puede generar códigos de autenticación válidos +- El bloqueo de la cuenta persiste hasta que se reinicie el MFA + +## Solución Detallada + + + +Si has perdido tu dispositivo MFA: + +1. **No entres en pánico** - La recuperación de la cuenta es posible +2. **Contacta al soporte inmediatamente** vía correo electrónico: support@sleakops.com +3. **Proporciona detalles de verificación**: + - Tu dirección de correo electrónico registrada + - Nombre de usuario de la cuenta + - Fecha aproximada del último inicio de sesión exitoso + - Motivo de la pérdida del dispositivo MFA (robo, daño, etc.) + +**Importante**: No intentes múltiples intentos fallidos de inicio de sesión ya que esto puede activar medidas de seguridad adicionales. 
+ + + + + +Al contactar al soporte para el reinicio de MFA: + +**Plantilla de correo electrónico:** + +``` +Asunto: Solicitud de reinicio de MFA - [Tu correo electrónico] + +Hola Soporte de SleakOps, + +Necesito solicitar un reinicio de MFA para mi cuenta debido a [pérdida/robo/daño del dispositivo]. + +Detalles de la cuenta: +- Correo electrónico: [tu-correo@empresa.com] +- Nombre de usuario: [si es diferente al correo] +- Último inicio de sesión exitoso: [fecha aproximada] +- Motivo: [breve explicación] + +Puedo verificar mi identidad mediante [método alternativo si está disponible]. + +Gracias por su ayuda. +``` + +**Información requerida:** + +- Dirección de correo electrónico registrada +- Detalles para verificar la cuenta +- Motivo de la indisponibilidad del dispositivo MFA +- Método de contacto alternativo si es necesario + + + + + +El equipo de soporte verificará tu identidad a través de: + +1. **Verificación por correo electrónico** - Respondiendo desde tu correo registrado +2. **Detalles de la cuenta** - Confirmando información específica de la cuenta +3. **Preguntas de seguridad** - Si fueron configuradas previamente +4. **Verificación alternativa** - A través del administrador del equipo si aplica + +**Plazo:** Los reinicios de MFA se procesan típicamente en 24-48 horas durante días hábiles. + +**Nota de seguridad:** Este proceso es intencionalmente riguroso para proteger la seguridad de tu cuenta. + + + + + +Una vez que tu MFA sea reiniciado: + +1. **Inicia sesión inmediatamente** usando solo tu contraseña +2. **Configura un nuevo MFA** lo antes posible +3. 
**Configura métodos de respaldo**: + - Guarda códigos de respaldo en un lugar seguro + - Considera múltiples aplicaciones autenticadoras + - Establece métodos alternativos de autenticación + +**Configuración recomendada de MFA:** + +``` +Principal: Aplicación autenticadora (Google Authenticator, Authy) +Respaldo: SMS a número de teléfono verificado +Emergencia: Códigos de respaldo almacenados de forma segura +``` + +4. **Actualiza las prácticas de seguridad**: + - Guarda los códigos de respaldo separados de tu dispositivo + - Considera usar autenticadores basados en la nube (Authy) + - Informa a tu equipo sobre el incidente de seguridad + + + + + +**Mejores prácticas:** + +1. **Múltiples métodos de autenticación:** + + - Configura al menos 2 métodos MFA diferentes + - Usa tanto aplicaciones como respaldo por SMS + - Guarda códigos de respaldo en un gestor de contraseñas + +2. **Almacenamiento seguro de respaldo:** + + - Guarda códigos de respaldo en gestor de contraseñas cifrado + - Mantén copias físicas en lugar seguro + - No almacenes códigos en el mismo dispositivo que tu autenticador + +3. **Revisión periódica:** + + - Prueba métodos de respaldo periódicamente + - Actualiza números telefónicos cuando cambien + - Revisa y renueva códigos de respaldo trimestralmente + +4. 
**Coordinación con el equipo:** + - Informa a los administradores del equipo sobre tu configuración MFA + - Asegura que varios miembros tengan acceso administrativo + - Documenta procedimientos de recuperación para tu organización + + + + + +**Para necesidades urgentes de recuperación MFA:** + +- **Correo electrónico**: support@sleakops.com +- **Asunto:** "URGENTE: Reinicio de MFA requerido - [Tu correo electrónico]" +- **Tiempo de respuesta:** 24-48 horas (días hábiles) + +**Incluye en solicitudes urgentes:** + +- Explicación clara de la urgencia +- Impacto en el negocio si aplica +- Métodos alternativos de verificación disponibles +- Mejor método de contacto para seguimiento + +**Nota**: Aunque el soporte busca ayudar rápidamente, la verificación de seguridad no puede ser omitida y puede requerir tiempo adicional. + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de noviembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-password-reset-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-password-reset-error.mdx new file mode 100644 index 000000000..93d4b9b6c --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/user-password-reset-error.mdx @@ -0,0 +1,115 @@ +--- +sidebar_position: 3 +title: "Error al Restablecer Contraseña de AWS para Miembros del Equipo" +description: "Solución para el 'error al restablecer la contraseña' al restablecer las credenciales de AWS para miembros del equipo" +date: "2025-02-03" +category: "usuario" +tags: ["aws", "contraseña", "restablecer", "miembro", "autenticación"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error al Restablecer Contraseña de AWS para Miembros del Equipo + +**Fecha:** 3 de febrero de 2025 +**Categoría:** Usuario +**Etiquetas:** AWS, Contraseña, Restablecer, Miembro, Autenticación + +## Descripción del 
Problema + +**Contexto:** Al intentar restablecer la contraseña de AWS para un miembro del equipo a través de la plataforma SleakOps, la operación falla y deja al usuario en un estado inconsistente. + +**Síntomas Observados:** + +- Mensaje de error: "error al restablecer la contraseña" aparece durante el proceso de restablecimiento +- La cuenta de usuario queda bloqueada con la etiqueta/estado "actualizando" +- La operación de restablecimiento de contraseña no se completa con éxito +- El usuario sigue sin poder acceder a los recursos de AWS + +**Configuración Relevante:** + +- Tipo de usuario: Miembro del equipo (no administrador) +- Operación: Restablecimiento de contraseña AWS +- Plataforma: Interfaz de gestión de usuarios de SleakOps +- Estado del usuario: Bloqueado en estado "actualizando" + +**Condiciones de Error:** + +- El error ocurre durante la iniciación del restablecimiento de contraseña +- Afecta específicamente a miembros del equipo +- La cuenta de usuario queda bloqueada en estado de actualización +- Los intentos posteriores de restablecimiento también pueden fallar + +## Solución Detallada + + + +Si estás experimentando este error, sigue estos pasos: + +1. **Espera la recuperación automática**: El sistema puede resolver automáticamente el estado "actualizando" en 5-10 minutos +2. **Actualiza la página de gestión de usuarios**: A veces la actualización del estado se retrasa en la interfaz +3. **Contacta soporte**: Si el problema persiste, proporciona el correo electrónico del usuario específico y la marca de tiempo del error + + + + + +Este error generalmente ocurre debido a: + +1. **Conflictos de permisos en AWS IAM**: La cuenta de servicio puede no tener permisos suficientes para restablecer contraseñas de usuarios +2. **Operaciones concurrentes**: Múltiples intentos de restablecimiento de contraseña simultáneos +3. **Indisponibilidad temporal del servicio AWS**: Problemas transitorios con el servicio AWS IAM +4. 
**Inconsistencia en el estado del usuario**: La cuenta de usuario puede estar en un estado inválido en AWS IAM + + + + + +Para evitar este problema en el futuro: + +1. **Intento único de restablecimiento**: Solo intenta un restablecimiento de contraseña a la vez por usuario +2. **Espera entre intentos**: Si un restablecimiento falla, espera al menos 5 minutos antes de intentar nuevamente +3. **Verifica el estado del usuario**: Asegúrate de que el usuario esté en estado "activo" antes de intentar restablecer +4. **Verifica permisos**: Asegúrate de que tu cuenta de administrador tenga los permisos adecuados para gestión de usuarios + +Para comprobar, desde la interfaz de SleakOps, que el usuario está en el estado adecuado antes del restablecimiento: + +1. Navega a Gestión de Equipo +2. Localiza al usuario +3. Verifica que el estado muestre "Activo" (no "Actualizando" ni "Pendiente") +4. Procede con el restablecimiento de contraseña + + + + + +Si eres administrador y necesitas resolver esto manualmente: + +1. **Accede al panel de administración de SleakOps** +2. **Navega a Gestión de Usuarios** +3. **Encuentra al usuario afectado** +4. **Verifica el estado actual del usuario** +5. **Si está bloqueado en 'actualizando', intenta estas acciones:** + - Actualiza la página y espera 2-3 minutos + - Intenta "cancelar" la operación actual si está disponible + - Contacta al soporte de SleakOps con el correo del usuario y detalles del error + + + + + +Si el restablecimiento estándar continúa fallando: + +1. **Solución temporal**: Crea una cuenta de usuario temporal nueva para acceso inmediato +2. **Acceso directo a AWS**: Si tienes acceso a la consola AWS, puedes restablecer la contraseña directamente en IAM +3. 
**Eliminar y volver a agregar usuario**: Como último recurso, elimina al usuario del equipo y vuelve a invitarlo + +**Nota**: Siempre coordina con tu equipo antes de eliminar usuarios, ya que esto puede afectar su acceso a proyectos y recursos. + + + +--- + +_Esta FAQ fue generada automáticamente el 3 de febrero de 2025 basándose en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-data-loss-recovery.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-data-loss-recovery.mdx new file mode 100644 index 000000000..d56315ec7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-data-loss-recovery.mdx @@ -0,0 +1,194 @@ +--- +sidebar_position: 15 +title: "Pérdida de Datos y Recuperación de VarGroup" +description: "Cómo manejar la pérdida de datos de VarGroup y procedimientos de recuperación en SleakOps" +date: "2025-02-26" +category: "proyecto" +tags: + [ + "vargroup", + "pérdida-de-datos", + "recuperación", + "respaldo", + "solución-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Pérdida de Datos y Recuperación de VarGroup + +**Fecha:** 26 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** VarGroup, Pérdida de Datos, Recuperación, Respaldo, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los usuarios pueden experimentar pérdida de datos en VarGroup debido a errores del sistema, eliminaciones accidentales o fallos en el despliegue. Esto puede resultar en que las variables de entorno se reviertan a versiones anteriores o se pierdan completamente. 
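Una comprobación rápida para confirmar una regresión de este tipo es comparar un respaldo exportado del VarGroup con el estado actual. Boceto con archivos y valores puramente ilustrativos:

```bash
# Respaldo exportado previamente (valores ilustrativos)
cat > /tmp/vargroup_backup.env <<'EOF'
APP_DEBUG=false
DB_HOST=db.internal
FEATURE_X=on
EOF

# Estado actual del VarGroup (tras la pérdida de datos)
cat > /tmp/vargroup_actual.env <<'EOF'
APP_DEBUG=false
DB_HOST=db.internal
EOF

sort /tmp/vargroup_backup.env > /tmp/backup.sorted
sort /tmp/vargroup_actual.env > /tmp/actual.sorted

# Líneas presentes solo en el respaldo = variables perdidas
missing=$(comm -23 /tmp/backup.sorted /tmp/actual.sorted)
printf '%s\n' "$missing"
```

El mismo patrón sirve para detectar variables que volvieron a un valor anterior: esas líneas aparecerán en ambas columnas de `comm` con valores distintos.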
+ +**Síntomas Observados:** + +- VarGroup muestra una versión antigua en lugar de la configuración más reciente +- Variables de entorno faltantes en entornos específicos +- Variables previamente configuradas ya no están disponibles +- Fallos en despliegues debido a variables requeridas ausentes +- Aparición de VarGroups duplicados en el sistema + +**Configuración Relevante:** + +- Entorno: Desarrollo, staging o producción +- Tipo de VarGroup: Global o específico de entorno +- Última marca temporal conocida que funcionaba +- Permisos de usuario y niveles de acceso + +**Condiciones de Error:** + +- Pérdida de datos tras errores del sistema o mantenimiento +- Variables que se revierten inesperadamente a versiones previas +- Fallos en despliegues que provocan corrupción en VarGroup +- VarGroups duplicados que causan conflictos de configuración + +## Solución Detallada + + + +Cuando se detecte pérdida de datos en VarGroup: + +1. **Documentar la línea de tiempo**: Anotar cuándo se observó el problema por primera vez +2. **Identificar entornos afectados**: Verificar qué entornos están impactados +3. **Revisar cambios recientes**: Analizar historial de despliegues y actividades de usuarios +4. 
**Verificar historial de VarGroup**: Revisar la línea de tiempo de actualizaciones en el panel de administración + +```bash +# Ejemplo para revisar despliegues recientes +kubectl get deployments -n tu-namespace --sort-by=.metadata.creationTimestamp +``` + + + + + +**Opción 1: Recuperación por Plataforma (Si está disponible)** + +- Contactar soporte de SleakOps inmediatamente +- Proporcionar la marca temporal exacta de la última configuración conocida buena +- Incluir el nombre del VarGroup y el entorno afectado + +**Opción 2: Recreación Manual** + +- Recrear el VarGroup desde cero +- Usar documentación o conocimiento del equipo para restaurar variables +- Probar exhaustivamente en desarrollo antes de aplicar en producción + +**Opción 3: Recuperación desde Control de Versiones** + +- Si mantienes configuraciones de VarGroup en control de versiones +- Restaurar desde tu último respaldo o commit +- Aplicar la configuración mediante la interfaz de SleakOps + + + + + +**Estrategias de Respaldo:** + +1. **Exportar VarGroups regularmente**: + + ```bash + # Exportar configuración actual de VarGroup + sleakops vargroup export --name global --environment develop > respaldo-vargroup.json + ``` + +2. **Integración con Control de Versiones:** + + - Almacenar configuraciones de VarGroup en repositorios Git + - Usar enfoques de Infraestructura como Código (IaC) + - Implementar respaldos automatizados + +3. **Documentación:** + - Mantener documentación actualizada de todas las variables de entorno + - Documentar el propósito y valores esperados de cada variable + - Llevar un registro de cambios de modificaciones en VarGroup + +**Monitoreo y Alertas:** + +- Configurar alertas para cambios en VarGroup +- Monitorear fallos en despliegues que puedan indicar variables faltantes +- Auditorías regulares de configuraciones de VarGroup + + + + + +Cuando los despliegues fallen debido a problemas con VarGroup: + +1. 
**Revisar logs de despliegue**: + + ```bash + kubectl logs deployment/tu-app -n tu-namespace + ``` + +2. **Verificar contenido de VarGroup**: + + - Ingresar al panel de administración de SleakOps + - Navegar a la sección de VarGroups + - Comparar configuración actual con valores esperados + +3. **Probar disponibilidad de variables**: + + ```bash + # Probar si las variables se inyectan correctamente + kubectl exec -it pod/tu-pod -- env | grep TU_VARIABLE + ``` + +4. **Revertir si es necesario**: + - Usar el despliegue previo que funcionaba + - Restaurar VarGroup al último estado conocido bueno + - Realizar redeployment tras verificación + + + + + +**Contactar soporte de SleakOps inmediatamente si:** + +- La pérdida de datos afecta entornos de producción +- Se ven afectados múltiples VarGroups +- El problema parece afectar toda la plataforma +- No puedes recrear la configuración perdida + +**Información para proporcionar:** + +- Referencia del ticket (si está disponible) +- Marca temporal exacta cuando se descubrió el problema +- Nombres de VarGroup y entornos afectados +- Historial reciente de despliegues +- Mensajes de error o logs +- Línea de tiempo de cambios o actividades recientes + +**Contacto de Soporte:** + +- Correo: support@sleakops.com +- Incluir "Pérdida de Datos VarGroup" en el asunto +- Usar "Responder a todos" al responder tickets de soporte + + + + + +Después de recuperar los datos de VarGroup: + +- [ ] Verificar que todas las variables de entorno estén presentes +- [ ] Probar funcionalidad de la aplicación en desarrollo +- [ ] Ejecutar pruebas de despliegue +- [ ] Actualizar documentación con cualquier cambio +- [ ] Implementar medidas adicionales de respaldo +- [ ] Revisar y mejorar monitoreo +- [ ] Capacitar al equipo en estrategias de prevención +- [ ] Documentar lecciones aprendidas + + + +--- + +_Esta FAQ fue generada automáticamente el 26 de febrero de 2025 basada en una consulta real de usuario._ diff --git 
a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-deployment-error-state.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-deployment-error-state.mdx new file mode 100644 index 000000000..1429df60d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-deployment-error-state.mdx @@ -0,0 +1,124 @@ +--- +sidebar_position: 3 +title: "Estado de Error en VarGroup que Impide el Despliegue" +description: "Solución para VarGroup atascado en estado de error que bloquea los despliegues" +date: "2024-03-17" +category: "proyecto" +tags: ["vargroup", "despliegue", "variables", "estado-error"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Estado de Error en VarGroup que Impide el Despliegue + +**Fecha:** 17 de marzo de 2024 +**Categoría:** Proyecto +**Etiquetas:** VarGroup, Despliegue, Variables, Estado de Error + +## Descripción del Problema + +**Contexto:** El usuario intenta modificar variables de entorno a través de VarGroup pero encuentra un estado de error que impide tanto la actualización de variables como los despliegues. 
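El patrón de recuperación que aplica la plataforma ante este tipo de timeout transitorio es un reintento con espera. Como referencia, un boceto mínimo en shell (no es código de SleakOps; la función `operacion_inestable` solo simula una edición de VarGroup que falla la primera vez por un timeout de Kubernetes):

```shell
#!/bin/sh
# Boceto ilustrativo de reintento con espera ante fallos transitorios.
reintentar() {
  max=3
  intento=1
  while ! "$@"; do
    if [ "$intento" -ge "$max" ]; then
      echo "fallo definitivo tras $max intentos" >&2
      return 1
    fi
    intento=$((intento + 1))
    sleep 1  # breve espera antes de reintentar
  done
  echo "exito en el intento $intento"
}

ESTADO_TMP="$(mktemp)"
operacion_inestable() {
  # Falla mientras el archivo de estado esté vacío (primer intento).
  if [ -s "$ESTADO_TMP" ]; then return 0; fi
  echo fallo > "$ESTADO_TMP"
  return 1
}

reintentar operacion_inestable   # imprime: exito en el intento 2
rm -f "$ESTADO_TMP"
```

La misma lógica explica por qué, en la plataforma, volver a guardar el VarGroup suele sacar el estado de "Error": el segundo intento llega cuando el timeout ya pasó.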
+ +**Síntomas Observados:** + +- VarGroup entra en estado "Error" tras intentar modificar variables +- El proceso de despliegue falla con mensaje de "cambios no publicados" +- No se pueden publicar los cambios pendientes debido al estado de error de VarGroup +- Las modificaciones de variables (como APP_DEBUG de false a true) no se aplican +- El sistema indica que hay cambios para publicar pero la publicación falla + +**Configuración Relevante:** + +- Plataforma: SleakOps +- Componente: VarGroup (Grupos de Variables) +- Ejemplo de variable: APP_DEBUG (valor booleano) +- El error ocurre durante operaciones con Kubernetes + +**Condiciones de Error:** + +- Error aparece al modificar variables de VarGroup +- Timeout de Kubernetes causa el fallo inicial +- VarGroup permanece en estado de error impidiendo operaciones posteriores +- Despliegue bloqueado hasta que se resuelva el estado de VarGroup + +## Solución Detallada + + + +La solución más directa es reintentar la modificación del VarGroup: + +1. **Navega a la sección VarGroup** en tu proyecto +2. **Localiza el VarGroup en estado "Error"** +3. **Haz clic en "Editar" en el VarGroup fallido** +4. **Vuelve a ingresar los mismos valores de variables** que querías cambiar +5. **Guarda los cambios** - esto disparará una nueva ejecución +6. **Espera a que el estado** cambie de "Error" a "Creado" + +Este mecanismo de reintento ayuda a resolver problemas temporales de timeout en Kubernetes. + + + + + +Antes de reintentar, verifica tus cambios previstos: + +1. **Documenta los valores actuales**: Anota los valores actuales de las variables +2. **Confirma los cambios deseados**: Verifica qué valores quieres establecer +3. **Aplica los cambios con cuidado**: Asegúrate de estar configurando los valores correctos +4. 
**Valida después del éxito**: Una vez que VarGroup muestre "Creado", verifica que las variables estén correctamente establecidas + +**Ejemplo de verificación:** + +``` +Antes: APP_DEBUG = false +Previsto: APP_DEBUG = true +Después del reintento: Confirmar APP_DEBUG = true +``` + + + + + +El error ocurre porque: + +1. **Retrasos en la API de Kubernetes**: A veces Kubernetes tarda más en responder +2. **Contención de recursos**: Alta carga en el clúster puede causar timeouts +3. **Problemas de red**: Problemas temporales de conectividad +4. **Sincronización de estado**: El estado de VarGroup se queda atascado durante el timeout + +**Esto suele ser un problema temporal** que se resuelve con un reintento. + + + + + +Una vez arreglado el VarGroup: + +1. **Verifica el estado del VarGroup**: Asegúrate que muestre "Creado" en lugar de "Error" +2. **Revisa cambios pendientes**: El mensaje de "cambios no publicados" debería desaparecer +3. **Intenta el despliegue**: Prueba nuevamente tu proceso de despliegue +4. **Monitorea el despliegue**: Observa que se complete exitosamente + +Si el despliegue sigue fallando, revisa si hay otros cambios pendientes en: + +- Otros VarGroups +- Configuración de la aplicación +- Configuraciones de infraestructura + + + + + +Para minimizar futuros errores en VarGroup: + +1. **Haz cambios en períodos de baja actividad**: Reduce la posibilidad de timeouts en Kubernetes +2. **Modifica un VarGroup a la vez**: Evita modificaciones concurrentes +3. **Espera a la finalización**: No hagas cambios adicionales mientras uno está en proceso +4. **Monitorea la salud del clúster**: Verifica si tu clúster está bajo alta carga +5. 
**Usa lotes más pequeños**: Si cambias muchas variables, hazlo en grupos más pequeños + + + +--- + +_Esta FAQ fue generada automáticamente el 17 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-editing-redis-connection-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-editing-redis-connection-error.mdx new file mode 100644 index 000000000..abed9d2a7 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-editing-redis-connection-error.mdx @@ -0,0 +1,121 @@ +--- +sidebar_position: 3 +title: "Error al editar Vargroup - Problema de conexión con Redis" +description: "Solución para el error 'algo salió mal' al editar variables de vargroup debido a problemas de conectividad con Redis" +date: "2024-11-20" +category: "proyecto" +tags: ["vargroup", "redis", "variables", "edición", "conexión"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error al editar Vargroup - Problema de conexión con Redis + +**Fecha:** 20 de noviembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Vargroup, Redis, Variables, Edición, Conexión + +## Descripción del problema + +**Contexto:** El usuario intenta editar variables en vargroups a través de la plataforma SleakOps pero encuentra errores de conexión debido a interrupciones en el servicio de Redis. 
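Cuando la causa es una dependencia de backend (Redis en este caso), conviene distinguir "servicio caído" de "error propio" antes de reintentar la edición. Un boceto genérico en shell; el comando de chequeo concreto es una suposición ilustrativa, no un endpoint confirmado de SleakOps:

```shell
#!/bin/sh
# Boceto: decidir si reintentar según el resultado de un chequeo de
# disponibilidad. El comando de chequeo se pasa como argumento; en la
# práctica podría ser, por ejemplo:
#   estado_servicio curl -fsS --max-time 5 https://status.sleakops.com
# (URL ilustrativa; usa la página de estado real de la plataforma).
estado_servicio() {
  if "$@" > /dev/null 2>&1; then
    echo "servicio operativo: reintentar la edición del vargroup"
  else
    echo "servicio caído: esperar a la recuperación antes de reintentar"
  fi
}

estado_servicio true   # imprime: servicio operativo: reintentar la edición del vargroup
```

Separar el chequeo del reintento evita reportar como fallo propio lo que en realidad es una interrupción temporal del servicio.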
+ +**Síntomas observados:** + +- Mensaje de error "algo salió mal" al intentar editar variables del vargroup +- El error ocurre en múltiples entornos (desarrollo y producción) +- No es posible acceder a la interfaz de edición de vargroup +- El error persiste en diferentes vargroups dentro del mismo proyecto + +**Configuración relevante:** + +- Vargroups afectados: web-mecubro-develop, web-mecubro-com (producción) +- Componente de la plataforma: sistema de gestión de variables +- Dependencia backend: servicio Redis +- Interfaz de usuario: editor de vargroup + +**Condiciones del error:** + +- El error ocurre cuando el servicio Redis no está disponible o presenta problemas de conectividad +- Afecta todas las operaciones de edición de vargroup +- Puede impactar a múltiples usuarios simultáneamente +- Interrupción temporal del servicio + +## Solución detallada + + + +Una vez que el servicio Redis haya sido restaurado: + +1. **Esperar confirmación del servicio**: Asegurarse de que el servicio Redis esté completamente operativo +2. **Limpiar caché del navegador**: Refrescar la página o limpiar la caché del navegador +3. **Reintentar la operación**: Intentar editar nuevamente las variables del vargroup +4. **Verificar múltiples entornos**: Comprobar que tanto los entornos de desarrollo como producción estén accesibles + +El error debería resolverse automáticamente una vez se restablezca la conexión con Redis. + + + + + +Para comprobar si los problemas relacionados con Redis están afectando la plataforma: + +1. **Revisar la página de estado de la plataforma**: Buscar cualquier interrupción del servicio en curso +2. **Probar otras funcionalidades de la plataforma**: Verificar si otras funciones relacionadas con variables funcionan correctamente +3. 
**Contactar soporte**: Si el problema persiste tras la recuperación de Redis, reportar el problema + +**Indicadores de que Redis está funcionando de nuevo:** + +- Otros usuarios pueden editar vargroups con éxito +- El estado de la plataforma muestra todos los servicios operativos +- No aparecen mensajes de error al acceder a la gestión de variables + + + + + +Para minimizar el impacto durante cortes temporales del servicio: + +1. **Guardar trabajo frecuentemente**: Al editar grandes conjuntos de variables, guardar los cambios de forma incremental +2. **Usar control de versiones**: Llevar registro de los cambios en las variables en su documentación +3. **Planificar cambios críticos**: Evitar realizar cambios críticos en variables durante horas pico +4. **Tener un plan de reversión**: Mantener documentadas las configuraciones previas de variables + +**Enfoque de respaldo para cambios críticos:** + +```bash +# Exportar variables actuales antes de hacer cambios +sleakops vargroup export --project web-mecubro --env develop > backup-variables.json + +# Aplicar cambios cuando el servicio esté estable +sleakops vargroup import --project web-mecubro --env develop < new-variables.json +``` + + + + + +Si continúa experimentando problemas después de la restauración del servicio Redis: + +1. **Limpiar todos los datos del navegador**: + + - Borrar cookies y almacenamiento local para el dominio de SleakOps + - Intentar usar una ventana de navegador en modo incógnito/privado + - Probar desde un navegador diferente + +2. **Verificar conectividad de red**: + + - Confirmar que la conexión a internet sea estable + - Intentar acceder desde una red diferente + - Comprobar si el firewall corporativo está bloqueando las solicitudes + +3. 
**Reportar problemas persistentes**: + - Incluir mensajes de error específicos + - Mencionar qué vargroups están afectados + - Proporcionar información del navegador y red + - Referenciar la interrupción original de Redis para contexto + + + +--- + +_Esta FAQ fue generada automáticamente el 20 de noviembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-publish-error-without-deploy.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-publish-error-without-deploy.mdx new file mode 100644 index 000000000..e7ec1ed41 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vargroup-publish-error-without-deploy.mdx @@ -0,0 +1,148 @@ +--- +sidebar_position: 3 +title: "Error al Publicar Grupo de Variables Sin Desplegar" +description: "Error al publicar grupos de variables sin desplegar primero" +date: "2025-02-10" +category: "proyecto" +tags: ["vargroup", "variables", "publicar", "desplegar", "error"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error al Publicar Grupo de Variables Sin Desplegar + +**Fecha:** 10 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Vargroup, Variables, Publicar, Desplegar, Error + +## Descripción del Problema + +**Contexto:** Al actualizar los grupos de variables (vargroups) en SleakOps e intentar publicar cambios sin desplegar primero, el sistema lanza un error en el primer intento de publicación. 
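La regla de fondo — no publicar cambios que aún no se han desplegado — puede representarse como una pequeña máquina de estados. Boceto ilustrativo en shell (los nombres de funciones son hipotéticos, no comandos reales de SleakOps):

```shell
#!/bin/sh
# Boceto de la secuencia Actualizar → Desplegar → Publicar, con la
# validación que evita el error descrito (publicar sin desplegar).
ESTADO="actualizado"

desplegar() {
  ESTADO="desplegado"
  echo "despliegue completado"
}

publicar() {
  if [ "$ESTADO" != "desplegado" ]; then
    echo "error: hay cambios sin desplegar; ejecuta el despliegue primero" >&2
    return 1
  fi
  ESTADO="publicado"
  echo "publicación completada"
}

# Secuencia correcta: no produce error.
desplegar
publicar
```

Visto así, el fallo del primer intento de publicación es una validación contra un estado desplegado desactualizado, y seguir la secuencia completa lo evita.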
+ +**Síntomas Observados:** + +- El error ocurre al publicar un vargroup después de actualizarlo sin desplegar +- El primer intento de publicación falla con un error +- El segundo intento de publicación tiene éxito sin problemas +- El error es reproducible siguiendo los mismos pasos + +**Configuración Relevante:** + +- Componente: Grupos de Variables (vargroups) +- Secuencia de acción: Actualizar → Publicar (sin Desplegar) +- Plataforma: SleakOps +- El error ocurre consistentemente en el primer intento + +**Condiciones del Error:** + +- El error aparece específicamente al saltar el paso de desplegar +- Ocurre durante la operación de publicación +- El primer intento siempre falla +- Los intentos posteriores funcionan correctamente +- Comportamiento reproducible + +## Solución Detallada + + + +En SleakOps, los grupos de variables siguen un flujo de trabajo específico: + +1. **Actualizar**: Modificar valores de variables o agregar/eliminar variables +2. **Desplegar**: Aplicar cambios al entorno +3. **Publicar**: Hacer los cambios disponibles para los servicios dependientes + +Saltar el paso de desplegar puede causar problemas de sincronización entre el estado de configuración y el estado publicado. + + + + + +Si encuentras este error, puedes resolverlo inmediatamente haciendo lo siguiente: + +1. **Primer intento falla**: Observa el error pero no entres en pánico +2. **Reintenta la publicación**: Haz clic en publicar nuevamente inmediatamente +3. **Segundo intento tiene éxito**: La publicación debería funcionar en el segundo intento + +Esta solución temporal resolverá el problema inmediato, pero se recomienda seguir el flujo de trabajo adecuado para evitar el error completamente. + + + + + +Para prevenir este error, sigue esta secuencia: + +```bash +# Secuencia recomendada +1. Actualizar vargroup → 2. Desplegar → 3. Publicar +``` + +**Proceso paso a paso:** + +1. 
**Actualiza tu grupo de variables**: + + - Navega a los grupos de variables de tu proyecto + - Realiza los cambios necesarios en las variables + - Guarda los cambios + +2. **Despliega los cambios**: + + - Haz clic en el botón "Desplegar" + - Espera a que el despliegue se complete exitosamente + - Verifica el estado del despliegue + +3. **Publica los cambios**: + - Haz clic en el botón "Publicar" + - La publicación debería completarse sin errores + + + + + +Este error sucede debido a un problema de sincronización en la plataforma SleakOps: + +- **Desajuste de Estado**: Al saltar el despliegue, hay un desajuste entre el estado de configuración y el estado en tiempo de ejecución +- **Fallo en la Validación**: El proceso de publicación valida contra el estado desplegado, que no ha sido actualizado +- **Problemas de Caché**: El sistema puede tener estados en caché que se vuelven inconsistentes + +El segundo intento funciona porque el primer intento actualiza parcialmente el estado interno, permitiendo que el segundo intento tenga éxito. + + + + + +Si necesitas publicar sin desplegar por razones específicas: + +1. **Usa el método de reintento**: Acepta que el primer intento fallará y reintenta +2. **Agrupa tus cambios**: Realiza todas las actualizaciones de variables de una vez, luego despliega y publica +3. **Considera el impacto**: Evalúa si saltar el despliegue es necesario para tu caso de uso + +**Cuando saltar el despliegue puede ser aceptable:** + +- Probar cambios de configuración +- Preparar variables para futuros despliegues +- Trabajar en entornos de desarrollo + +**Cuándo siempre debes desplegar primero:** + +- Entornos de producción +- Cuando otros servicios dependen de las variables +- Cuando la consistencia es crítica + + + + + +Si continúas experimentando este problema: + +1. **Documenta los pasos**: Registra la secuencia exacta que reproduce el error +2. **Anota el mensaje de error**: Copia el texto exacto del error +3. 
**Detalles del entorno**: Incluye nombre del proyecto, entorno y nombre del grupo de variables +4. **Contacta soporte**: Reporta esto como un fallo para que el equipo de desarrollo lo investigue + +Este parece ser un problema conocido en el que el equipo de SleakOps está trabajando para resolver. + + + +--- + +_Esta FAQ fue generada automáticamente el 10 de febrero de 2025 basada en una consulta real de un usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-groups-editing-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-groups-editing-error.mdx new file mode 100644 index 000000000..c7cd59fe8 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-groups-editing-error.mdx @@ -0,0 +1,167 @@ +--- +sidebar_position: 3 +title: "Error al Editar Grupos de Variables" +description: "Solución para problemas al editar grupos de variables y solución alternativa con secretos del clúster" +date: "2024-04-21" +category: "proyecto" +tags: + [ + "grupos-de-variables", + "secretos", + "clúster", + "configuración", + "solución-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error al Editar Grupos de Variables + +**Fecha:** 21 de abril de 2024 +**Categoría:** Proyecto +**Etiquetas:** Grupos de Variables, Secretos, Clúster, Configuración, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan errores al intentar editar grupos de variables en SleakOps, lo que les impide actualizar valores de configuración como parámetros de conexión a bases de datos. 
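Mientras la edición por interfaz falla, una vía habitual de inspección es leer la configuración directamente de los secretos de Kubernetes, cuyos valores bajo `data` van codificados en base64. Boceto con nombre de secreto y valor puramente ilustrativos (los comandos `kubectl` se muestran comentados):

```shell
#!/bin/sh
# Lectura de un valor de un secreto de Kubernetes (ejemplo ilustrativo):
#   kubectl get secret mecubrov4develop-mysql \
#     -o jsonpath='{.data.DB_HOST}' | base64 -d
# Edición directa del secreto:
#   kubectl edit secret mecubrov4develop-mysql
#
# Parte verificable del boceto: ida y vuelta de la codificación base64
# que usan los campos "data" de un Secret.
DB_HOST_CODIFICADO="$(printf 'mysql.ejemplo.interno' | base64)"
printf '%s' "$DB_HOST_CODIFICADO" | base64 -d   # imprime: mysql.ejemplo.interno
```

Recordar la codificación evita un error frecuente al editar secretos a mano: pegar el valor en texto plano donde se espera base64.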
+ +**Síntomas Observados:** + +- Incapacidad para editar grupos de variables mediante la interfaz estándar +- Error al intentar actualizar valores del grupo de variables +- Los cambios en los grupos de variables no se guardan ni aplican +- Problemas específicos con actualizaciones de configuración de bases de datos + +**Configuración Relevante:** + +- Grupos de variables que contienen cadenas de conexión a bases de datos +- Configuraciones de entornos de desarrollo +- Parámetros de bases de datos MySQL +- Permisos específicos de cuenta + +**Condiciones de Error:** + +- El error aparece al acceder a la interfaz de edición de grupos de variables +- El problema es específico de la cuenta +- Afecta la capacidad para actualizar conexiones a bases de datos +- Impide cambios de configuración para entornos de desarrollo + +## Solución Detallada + + + +Mientras la interfaz de grupos de variables presenta problemas, puedes editar los valores directamente mediante los secretos del clúster: + +1. **Accede a la Configuración del Clúster** + + - Navega a tu clúster en el panel de SleakOps + - Ve a la sección **Secretos** + +2. **Localiza los Secretos del Grupo de Variables** + + - Encuentra el secreto correspondiente a tu grupo de variables + - Busca secretos con nombres que coincidan con el patrón de tu grupo de variables + +3. **Edita los Valores del Secreto** + + - Haz clic en el secreto para editarlo + - Actualiza los valores directamente en la configuración del secreto + - Guarda los cambios + +4. **Reinicia los Servicios Afectados** + - Reinicia cualquier servicio que dependa de estas variables + - Esto asegura que los nuevos valores sean aplicados + + + + + +**Para editar la configuración de la base de datos a través de secretos del clúster:** + +1. **Navega a los Secretos del Clúster** + + ``` + Panel → Clústeres → [Tu Clúster] → Secretos + ``` + +2. 
**Encuentra el Grupo de Variables de la Base de Datos** + + - Busca secretos con nombres similares a tu grupo de variables + - Ejemplo: `mecubrov4develop-mysql-secrets` + +3. **Edita los Parámetros de la Base de Datos** + + - Actualiza las cadenas de conexión + - Modifica host, puerto, nombre de base de datos según sea necesario + - Actualiza credenciales si se requiere + +4. **Aplica los Cambios** + - Guarda la configuración del secreto + - Los cambios se sincronizarán automáticamente + +**Ejemplo de estructura del secreto:** + +```yaml +apiVersion: v1 +kind: Secret +metadata: + name: mecubrov4develop-mysql +type: Opaque +data: + DB_HOST: + DB_PORT: + DB_NAME: + DB_USER: + DB_PASSWORD: +``` + + + + + +**Actualización de la Plataforma:** + +Se tiene programada una nueva versión de SleakOps que resolverá este problema de edición de grupos de variables: + +- **Cronograma de lanzamiento:** A mitad de semana (miércoles/jueves) +- **Alcance de la corrección:** Resolución completa de errores al editar grupos de variables +- **Impacto:** Se restaurará la funcionalidad normal para editar grupos de variables + +**Después de la Actualización:** + +- Los grupos de variables podrán editarse directamente mediante la interfaz estándar +- No será necesario usar la solución alternativa con secretos del clúster +- Se resolverán todos los problemas específicos de edición por cuenta + + + + + +**Para evitar problemas similares en el futuro:** + +1. **Copias de Seguridad Regulares** + + - Exporta las configuraciones de grupos de variables regularmente + - Mantén documentación de los valores críticos de variables + +2. **Usa Grupos Específicos por Entorno** + + - Separa variables para desarrollo, staging y producción + - Usa convenciones de nombres claras + +3. **Control de Versiones** + + - Rastrea cambios en las configuraciones de variables + - Documenta las razones de las actualizaciones de configuración + +4. 
**Proceso de Pruebas** + - Prueba los cambios de configuración primero en desarrollo + - Verifica las conexiones a bases de datos después de las actualizaciones + - Monitorea los registros de la aplicación tras los cambios + + + +--- + +_Este FAQ fue generado automáticamente el 21 de abril de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-management-synchronization-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-management-synchronization-issues.mdx new file mode 100644 index 000000000..6333a72a9 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/variable-management-synchronization-issues.mdx @@ -0,0 +1,188 @@ +--- +sidebar_position: 3 +title: "Gestión de Variables y Problemas de Sincronización" +description: "Solución para problemas de sincronización de grupos de variables y cambios inesperados en configuraciones de entornos" +date: "2024-12-19" +category: "proyecto" +tags: ["variables", "entorno", "secretos", "sincronización", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Gestión de Variables y Problemas de Sincronización + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Proyecto +**Etiquetas:** Variables, Entorno, Secretos, Sincronización, Configuración + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan cambios inesperados en variables de entorno y secretos de configuración entre diferentes entornos (desarrollo, producción) después de actualizaciones de la plataforma o mantenimiento del sistema. 
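Uno de los síntomas característicos de este problema es la aparición de valores "dummy" o de marcador de posición. Un filtro sencillo en shell permite localizarlos en un volcado con formato CLAVE=VALOR; el origen del volcado es una suposición (con AWS CLI podría obtenerse, por ejemplo, con `aws ssm get-parameters-by-path --path /app --with-decryption`):

```shell
#!/bin/sh
# Boceto: detectar variables con valores de marcador de posición o vacías
# en un volcado con formato CLAVE=VALOR.
detectar_dummies() {
  grep -E '=(dummy|placeholder)?$'
}

printf 'DB_HOST=db.prod.local\nDB_PASS=dummy\nAPP_KEY=\n' | detectar_dummies
# imprime:
#   DB_PASS=dummy
#   APP_KEY=
```

Un chequeo así, ejecutado tras cada actualización de la plataforma, detecta a tiempo las variables que la sincronización haya reemplazado por marcadores de posición.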
+ +**Síntomas Observados:** + +- Variables de entorno modificadas sin autorización del usuario +- Grupos de variables mostrando valores "dummy" o de marcador de posición +- Inconsistencias entre entornos de desarrollo y producción +- Secretos faltantes o vacíos en entornos específicos +- Variables que no se actualizan correctamente desde la interfaz del clúster + +**Configuración Relevante:** + +- Plataforma: sistema de gestión de variables SleakOps +- Entornos afectados: Desarrollo y Producción +- Almacenamiento: AWS Parameter Store con cifrado +- Configuración de grupos de variables (vargroups) + +**Condiciones de Error:** + +- Variables aparecen modificadas tras actualizaciones de la plataforma +- Fallos de sincronización entre entornos +- Grupos de variables faltantes en entorno de desarrollo +- Valores de marcador de posición reemplazando configuración real + +## Solución Detallada + + + +El problema de sincronización de variables ocurre típicamente cuando: + +1. **Actualizaciones de la plataforma** modifican el sistema de gestión de variables +2. **Grupos de variables faltantes** en entornos específicos causan inconsistencias +3. **Políticas de seguridad** impiden la restauración automática de valores sensibles +4. **Gestión centralizada** entra en conflicto con configuraciones locales + +El sistema crea secretos vacíos con valores de marcador de posición cuando no puede encontrar grupos de variables correspondientes para mantener los estándares de seguridad. + + + + + +Para recuperarse de problemas de sincronización de variables: + +1. **Identificar entornos afectados**: + + ```bash + # Verificar estado de grupos de variables + sleakops env list --show-variables + sleakops secrets list --environment development + ``` + +2. **Restaurar desde copias de seguridad**: + + - Acceder a la consola de SleakOps + - Navegar a **Configuración de Entorno** → **Variables** + - Buscar la sección **Backup/Historial** + - Restaurar configuración previa que funcione + +3. 
**Restauración manual de variables**: + - Reemplazar valores "dummy" por la configuración real + - Actualizar cada grupo de variables individualmente + - Verificar sincronización entre entornos + + + + + +Para prevenir futuros problemas de sincronización de variables: + +1. **Habilitar copias de seguridad automáticas**: + + ```yaml + # En tu archivo sleakops.yml + environments: + development: + variables: + backup_enabled: true + backup_frequency: "daily" + encryption: true + ``` + +2. **Implementar validación de variables**: + + - Configurar chequeos de integridad antes de aplicar cambios + - Usar plantillas de variables para consistencia + - Implementar flujos de aprobación para cambios en producción + +3. **Aislamiento de entornos**: + - Mantener variables de desarrollo y producción separadas + - Usar diferentes grupos de variables para cada entorno + - Implementar controles de acceso adecuados + + + + + +SleakOps centraliza la gestión de variables con estas características: + +1. **Integración con Parameter Store**: + + - Todas las variables almacenadas en AWS Parameter Store + - Cifrado automático para datos sensibles + - Historial de versiones y capacidades de reversión + +2. **Sincronización automática**: + + - Variables sincronizadas en todos los entornos + - Actualizaciones en tiempo real para aplicaciones en ejecución + - Mecanismos de resolución de conflictos + +3. **Características de seguridad**: + - No almacenamiento de credenciales en texto plano + - Valores de marcador de posición para configuraciones faltantes + - Registro de auditoría para todos los cambios + + + + + +Al experimentar problemas con variables: + +1. **Verificar estado de grupos de variables**: + + ```bash + sleakops vargroups status --environment development + sleakops vargroups validate --all + ``` + +2. **Verificar sincronización**: + + ```bash + sleakops sync status + sleakops sync force --environment development + ``` + +3. 
**Revisar historial de cambios**: + + - Acceder a la consola de SleakOps + - Ir a **Registros de Auditoría** → **Cambios en Variables** + - Revisar modificaciones recientes y sus fuentes + +4. **Sincronización manual**: + ```bash + # Forzar sincronización de grupo de variables específico + sleakops vargroups sync --name database-config --environment development + ``` + + + + + +Sigue esta lista para recuperarte completamente de problemas con variables: + +- [ ] **Respaldar estado actual** antes de hacer cambios +- [ ] **Identificar todos los entornos afectados** y grupos de variables +- [ ] **Restaurar variables de producción** desde la copia de seguridad más reciente +- [ ] **Actualizar variables de desarrollo** con valores correctos +- [ ] **Verificar funcionalidad de la aplicación** en cada entorno +- [ ] **Probar sincronización de variables** entre entornos +- [ ] **Habilitar monitoreo** para futuros cambios en variables +- [ ] **Documentar el incidente** y lecciones aprendidas +- [ ] **Programar copias de seguridad regulares** si no están configuradas +- [ ] **Revisar permisos de acceso** para la gestión de variables + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/volume-unique-path-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/volume-unique-path-error.mdx new file mode 100644 index 000000000..dbdb3c8c6 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/volume-unique-path-error.mdx @@ -0,0 +1,139 @@ +--- +sidebar_position: 3 +title: "Error de Ruta Única de Volumen Después de la Eliminación" +description: "Solución para el error 'Ruta única' al recrear volúmenes en el mismo punto de montaje" +date: "2025-02-11" +category: "proyecto" +tags: ["volumen", "almacenamiento", "error", "punto-de-montaje", "eliminación"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Ruta Única de Volumen Después de la Eliminación + +**Fecha:** 11 de febrero de 2025 +**Categoría:** Proyecto +**Etiquetas:** Volumen, Almacenamiento, Error, Punto de Montaje, Eliminación + +## Descripción del Problema + +**Contexto:** El usuario intenta recrear un volumen en el mismo punto de montaje tras eliminar un volumen previo, pero encuentra un error de "Ruta única" en la plataforma SleakOps. + +**Síntomas Observados:** + +- Error de "Ruta única" al intentar crear un nuevo volumen +- El error ocurre en el mismo punto de montaje donde se eliminó un volumen previamente +- El volumen parece estar correctamente eliminado desde la interfaz +- No es posible reutilizar la misma ruta de montaje para nuevos volúmenes + +**Configuración Relevante:** + +- Proyecto: "velo-opensactions" +- Ruta de montaje: "/opensanctions/data" +- Volumen creado previamente y luego eliminado +- Intento de recrear volumen en el punto de montaje idéntico + +**Condiciones del Error:** + +- El error aparece durante el proceso de creación del volumen +- Ocurre específicamente al reutilizar rutas de montaje +- Persiste incluso después de que la eliminación del volumen parece exitosa +- Puede indicar una limpieza incompleta de los recursos del volumen + +## Solución Detallada + + + +El error de "Ruta única" típicamente ocurre cuando: + +1. **Limpieza incompleta del volumen**: El proceso de eliminación del volumen no eliminó completamente todos los recursos asociados +2. **Reserva del punto de montaje**: El sistema aún considera la ruta de montaje como "en uso" +3. **Retraso en la sincronización de recursos**: Existe un retraso entre la eliminación del volumen y la disponibilidad de la ruta +4. **Referencias huérfanas**: Entradas en la base de datos o metadatos aún hacen referencia al volumen eliminado + +Este es un problema conocido que ha sido identificado y corregido en actualizaciones recientes de la plataforma. 
+ + + + + +Mientras se espera que se aplique la corrección, puede usar estas soluciones temporales: + +**Opción 1: Usar una ruta de montaje diferente** + +```bash +# En lugar de /opensanctions/data +# Pruebe /opensanctions/data-v2 o /opensanctions/storage +``` + +**Opción 2: Esperar y reintentar** + +- Espere 10-15 minutos después de la eliminación del volumen +- El sistema puede eventualmente liberar la reserva de la ruta +- Reintente la creación del volumen con la misma ruta + +**Opción 3: Contactar soporte para limpieza manual** + +- Si la ruta debe reutilizarse inmediatamente +- El equipo de soporte puede limpiar manualmente las referencias huérfanas + + + + + +Para evitar este problema en el futuro: + +1. **Verificar eliminación completa**: + + - Verifique que ninguna carga de trabajo esté usando aún el volumen + - Asegúrese de que todos los pods que usan el volumen estén detenidos + +2. **Esperar antes de recrear**: + + - Permita de 5 a 10 minutos entre la eliminación y la recreación + - Esto garantiza que todos los procesos de limpieza se completen + +3. **Usar rutas de montaje únicas**: + - Considere usar marcas de tiempo o números de versión en las rutas + - Ejemplo: `/data/app-v1`, `/data/app-v2` + + + + + +Para confirmar que un volumen está completamente eliminado: + +1. **Verifique la sección de Volúmenes**: + + - Vaya al panel de su proyecto + - Navegue a Almacenamiento → Volúmenes + - Verifique que el volumen no esté listado + +2. **Revise las configuraciones de cargas de trabajo**: + + - Revise todas las cargas de trabajo en el proyecto + - Asegúrese de que ninguna haga referencia al volumen eliminado + - Busque errores de "Volumen no encontrado" + +3. 
**Monitoree recursos huérfanos**: + - Verifique si quedan reclamaciones de volúmenes persistentes + - Verifique que ninguna clase de almacenamiento siga haciendo referencia al volumen + + + + + +Este problema ha sido identificado y resuelto: + +- **Estado**: Corregido en una versión reciente de la plataforma +- **Solución**: Proceso mejorado de limpieza de volúmenes +- **Impacto**: Previene reservas huérfanas de rutas de montaje +- **Despliegue**: Aplicado automáticamente a todos los proyectos + +Si continúa experimentando este problema después del despliegue de la corrección, por favor contacte soporte ya que podría indicar un problema subyacente diferente. + + + +--- + +_Este FAQ fue generado automáticamente el 11 de febrero de 2025 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-addon-access-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-addon-access-troubleshooting.mdx new file mode 100644 index 000000000..fcd5ed240 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-addon-access-troubleshooting.mdx @@ -0,0 +1,205 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión VPN con Addons de Clúster" +description: "Solución de problemas de conectividad VPN que impiden el acceso a addons de clúster y entornos de desarrollo" +date: "2024-12-19" +category: "usuario" +tags: ["vpn", "conexión", "addons", "solución de problemas", "lens"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Conexión VPN con Addons de Clúster + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Conexión, Addons, Solución de problemas, Lens + +## Descripción del Problema + +**Contexto:** El usuario experimenta problemas de conectividad con addons del clúster de desarrollo y Lens después de trabajar en un nuevo proyecto, mientras que los servicios 
de producción siguen accesibles. + +**Síntomas Observados:** + +- No puede acceder a ningún addon del clúster de desarrollo +- Incapacidad para conectar al clúster de desarrollo vía Lens +- Pérdida de acceso al nuevo proyecto (n8n) +- Los servicios de producción (Grafana, Loki) permanecen accesibles +- Los problemas comenzaron en la mañana (zona horaria de Francia) + +**Configuración Relevante:** + +- Entorno: Clúster de desarrollo vs Clúster de producción +- Herramientas afectadas: Addons de clúster, Lens, cargas de trabajo del proyecto +- Ubicación geográfica: Francia +- VPN: Conexión VPN estándar de SleakOps + +**Condiciones de Error:** + +- Pérdida selectiva de conectividad (dev accesible, prod no) +- Coincide con el trabajo en nuevo proyecto +- Persiste a pesar de la reconexión VPN y limpieza de DNS +- Afecta múltiples servicios simultáneamente + +## Solución Detallada + + + +Cuando puedes acceder a los servicios de producción pero no a los addons del clúster de desarrollo, esto típicamente indica: + +1. **Segmentación de red**: Los entornos de desarrollo y producción usan rutas de red diferentes +2. **Tablas de enrutamiento VPN**: Tu cliente VPN puede tener información de enrutamiento desactualizada o incompleta +3. **Resolución DNS**: Las entradas DNS del clúster de desarrollo pueden no estar resolviéndose correctamente +4. **Tokens de autenticación**: Tu kubeconfig o tokens de acceso para desarrollo pueden haber expirado + + + + + +Más allá de la reconexión básica VPN, prueba estos pasos: + +1. **Reinicio completo de la VPN**: + + ```bash + # Desconectar la VPN completamente + # Limpiar caché/configuración del cliente VPN + # Reiniciar la aplicación cliente VPN + # Reconectar con configuración nueva + ``` + +2. **Verificar tablas de enrutamiento**: + + ```bash + # En Windows + route print + + # En macOS/Linux + netstat -rn + ``` + +3. 
**Probar endpoints específicos**: + + ```bash + # Probar API del clúster de desarrollo + curl -k https://tu-endpoint-api-dev-cluster/version + + # Probar endpoints de addons + nslookup tu-dominio-addon.sleakops.com + ``` + + + + + +Para problemas de conexión en Lens: + +1. **Actualizar configuración del clúster**: + + - Abrir Lens + - Ir a configuración del clúster + - Hacer clic en "Actualizar" o "Reconectar" + - Verificar que la ruta del kubeconfig sea correcta + +2. **Actualizar kubeconfig**: + + ```bash + # Descargar kubeconfig fresco desde SleakOps + sleakops cluster kubeconfig --cluster development + + # O actualizar configuración existente + kubectl config use-context development + kubectl cluster-info + ``` + +3. **Limpiar caché de Lens**: + - Cerrar Lens completamente + - Limpiar caché de la aplicación (la ubicación varía según el SO) + - Reiniciar Lens + + + + + +Si perdiste acceso a un proyecto recién creado: + +1. **Verificar permisos del proyecto**: + + - Comprobar si el despliegue del proyecto se completó exitosamente + - Confirmar que tu usuario tiene los permisos adecuados + - Verificar que el proyecto esté en estado "Running" en el panel de SleakOps + +2. **Revisar red del proyecto**: + + ```bash + # Verificar que los endpoints del proyecto sean accesibles + nslookup tu-url-proyecto.sleakops.com + + # Probar conectividad directa + curl -I https://tu-url-proyecto.sleakops.com + ``` + +3. **Re-sincronizar configuración del proyecto**: + - Ir al panel de SleakOps + - Navegar a tu proyecto + - Hacer clic en "Sincronizar" o "Actualizar configuración" + + + + + +Cuando producción funciona pero desarrollo no: + +1. **Verificar estado del entorno**: + + - Confirmar que el clúster de desarrollo esté saludable en el panel de SleakOps + - Buscar notificaciones de mantenimiento + - Revisar si hay despliegues en curso + +2. 
**Diferencias en configuración de red**: + + - Los clústeres de desarrollo pueden usar rangos IP diferentes + - El túnel dividido de la VPN puede afectar las rutas de desarrollo + - Las reglas de firewall pueden variar entre entornos + +3. **Ámbito de autenticación**: + - Tus tokens pueden tener diferentes ámbitos para dev vs prod + - El acceso a desarrollo puede requerir permisos adicionales + - Verificar si tu rol de usuario incluye acceso al clúster de desarrollo + + + + + +Si estos pasos no resuelven el problema, escala con esta información: + +1. **Diagnósticos de red**: + + ```bash + # Recopilar información de enrutamiento + route print > info_enrutamiento.txt + + # Pruebas de resolución DNS + nslookup grafana.prod.sleakops.com + nslookup tu-addon.dev.sleakops.com + + # Pruebas de conectividad + traceroute tu-endpoint-dev-cluster + ``` + +2. **Información de línea de tiempo**: + + - Hora exacta cuando comenzaron los problemas + - Última conexión exitosa + - Cambios recientes o creación de nuevos proyectos + +3. 
**Detalles del entorno**: + - Sistema operativo y versión + - Versión del cliente VPN + - Versión de Lens + - Ubicación geográfica/zona horaria + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-cluster-access-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-cluster-access-troubleshooting.mdx new file mode 100644 index 000000000..235f0926d --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-cluster-access-troubleshooting.mdx @@ -0,0 +1,229 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión VPN y Acceso al Clúster" +description: "Solución de problemas de conectividad VPN y acceso al clúster Kubernetes" +date: "2024-12-11" +category: "usuario" +tags: ["vpn", "clúster", "acceso", "solución de problemas", "conectividad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Conexión VPN y Acceso al Clúster + +**Fecha:** 11 de diciembre de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Clúster, Acceso, Solución de Problemas, Conectividad + +## Descripción del Problema + +**Contexto:** El usuario intenta acceder a un clúster Kubernetes de producción a través de SleakOps pero encuentra problemas de conectividad a pesar de tener una conexión VPN activa. 
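Antes de seguir los pasos de diagnóstico, conviene comprobar que la línea `server:` del kubeconfig apunta a un endpoint con la forma típica de EKS. Un chequeo mínimo y solo orientativo (la expresión regular es una aproximación y el endpoint de ejemplo es ficticio):

```bash
#!/usr/bin/env bash
# Chequeo aproximado del formato de un endpoint de API de EKS.
es_endpoint_eks() {
  local re='^https://[A-Za-z0-9]+(\.[a-z0-9-]+)+\.eks\.amazonaws\.com$'
  [[ "$1" =~ $re ]]
}

# Endpoint de ejemplo (ficticio); tome el real de su kubeconfig
endpoint="https://9CFEED5AD69EF4F87D19D6FF9FBF7AD9.gr7.us-east-1.eks.amazonaws.com"
if es_endpoint_eks "$endpoint"; then
  echo "formato esperado"
else
  echo "revise la línea server: del kubeconfig"
fi
```

Si el formato no coincide, el kubeconfig probablemente esté desactualizado y convenga regenerarlo antes de depurar la VPN.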
+ +**Síntomas Observados:** + +- No se puede acceder al clúster de producción +- La VPN parece estar conectada pero el acceso al clúster falla +- Las pruebas de ping a direcciones de red internas pueden fallar +- Las URLs internas de cargas de trabajo no son accesibles +- Los comandos kubectl se agotan o no logran conectarse + +**Configuración Relevante:** + +- Conexión VPN: Estado Activo/Conectado +- Entorno objetivo: Clúster de producción +- Rango de red interna: 10.130.0.0/16 (típico) +- Punto de enlace del clúster: Clúster EKS con punto de enlace privado + +**Condiciones de Error:** + +- Error al intentar acceder a recursos del clúster +- El problema persiste a pesar de que la VPN muestra estado activo +- URLs y servicios internos son inalcanzables +- Los comandos kubectl fallan con errores de tiempo de espera + +## Solución Detallada + + + +Cuando se experimentan problemas de acceso al clúster con VPN, siga estos pasos de diagnóstico: + +1. **Verifique el estado de la conexión VPN** en su cliente VPN +2. **Pruebe la conectividad básica de red** hacia los rangos internos +3. **Revise la resolución DNS** para servicios internos +4. **Valide la accesibilidad del punto de enlace del clúster** + +Comience con una prueba simple de ping para verificar la conectividad básica: + +```bash +# Probar conectividad a red interna +ping 10.130.0.2 + +# Probar resolución DNS +nslookup your-cluster-endpoint.us-east-1.eks.amazonaws.com +``` + + + + + +Si la VPN aparece como conectada pero no puede acceder a recursos internos: + +1. **Desconecte y reconecte** el cliente VPN +2. **Revise la tabla de rutas** para asegurar que el tráfico se enruta a través de la VPN: + +```bash +# En Windows +route print + +# En macOS/Linux +route -n get 10.130.0.0 +netstat -rn | grep 10.130 +``` + +3. **Verifique que la configuración DNS** apunte a servidores DNS internos +4. **Pruebe con diferentes protocolos VPN** si están disponibles (OpenVPN vs IKEv2) +5. 
**Revise la configuración del firewall** que podría bloquear el tráfico VPN + + + + + +Para revisar la configuración de su clúster: + +1. **Revise el archivo kubeconfig** generado por SleakOps: + +```bash +# Ver kubeconfig actual +kubectl config view + +# Buscar el endpoint del servidor del clúster +kubectl config view -o jsonpath='{.clusters[0].cluster.server}' +``` + +2. **Verifique el formato del punto de enlace del clúster**: + + - Debe ser similar a: `https://XXXXXXXXXX.us-east-1.eks.amazonaws.com` + - Debe ser accesible a través de la VPN + +3. **Pruebe la conectividad directa** al punto de enlace: + +```bash +# Probar conectividad HTTPS +curl -k https://your-cluster-endpoint.us-east-1.eks.amazonaws.com/version + +# O usar telnet para probar el puerto 443 +telnet your-cluster-endpoint.us-east-1.eks.amazonaws.com 443 +``` + + + + + +Para verificar la conectividad a cargas de trabajo internas: + +1. **Acceda al panel de SleakOps** y navegue a su proyecto de producción +2. **Vaya a la sección de Cargas de Trabajo** para ver las URLs internas +3. **Pruebe las URLs internas** mientras está conectado a la VPN: + +```bash +# Ejemplo de prueba de URL interna +curl -I http://your-internal-service.internal.domain + +# Prueba con salida detallada para depuración +curl -v http://your-internal-service.internal.domain +``` + +4. **Verifique el descubrimiento de servicios**: + +```bash +# Listar servicios en su clúster +kubectl get services --all-namespaces + +# Probar conectividad al servicio +kubectl port-forward service/your-service 8080:80 +``` + + + + + +**Solución 1: Actualizar configuración VPN** + +1. Descargue una configuración VPN nueva desde SleakOps +2. Elimine perfiles VPN antiguos de su cliente +3. Importe la nueva configuración +4. 
Reconecte la VPN + +**Solución 2: Configuración DNS** + +```bash +# Vaciar caché DNS (Windows) +ipconfig /flushdns + +# Vaciar caché DNS (macOS) +sudo dscacheutil -flushcache + +# Vaciar caché DNS (Linux) +sudo systemctl restart systemd-resolved +``` + +**Solución 3: Métodos alternativos de conexión** + +Si la VPN sigue fallando: + +1. **Use kubectl proxy** para acceso temporal: + +```bash +kubectl proxy --port=8080 +# Acceda al clúster mediante http://localhost:8080 +``` + +2. **Habilite el punto de enlace público** temporalmente (si la seguridad lo permite) +3. **Use un host bastión** para túneles SSH + +**Solución 4: Solución de problemas de red** + +```bash +# Ver interfaces de red activas +ifconfig -a # macOS/Linux +ipconfig /all # Windows + +# Verifique que la interfaz VPN esté activa y tenga IP correcta +# Busque tun0, utun o interfaz VPN similar +``` + + + + + +**Mantenimiento regular:** + +1. **Mantenga el cliente VPN actualizado** a la última versión +2. **Renueve regularmente los certificados VPN** antes de expirar +3. **Pruebe la conectividad** después de cualquier cambio en la red +4. **Documente configuraciones funcionales** para recuperación rápida + +**Configuración de monitoreo:** + +```bash +# Crear un script simple de prueba de conectividad +#!/bin/bash +echo "Probando conectividad VPN..." +ping -c 3 10.130.0.2 +echo "Probando punto de enlace del clúster..." +kubectl cluster-info +echo "Probando servicios internos..." +curl -I http://your-internal-service.domain +``` + +**Acceso de emergencia:** + +1. **Configure métodos alternativos de acceso** (host bastión) +2. **Mantenga información de contacto de soporte de emergencia** +3. 
**Documente pasos de solución de problemas** para su equipo + + + +--- + +_Esta FAQ fue generada automáticamente el 11 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-connection-disconnection-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-connection-disconnection-issues.mdx new file mode 100644 index 000000000..7232754a5 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-connection-disconnection-issues.mdx @@ -0,0 +1,170 @@ +--- +sidebar_position: 3 +title: "Problemas de Desconexión de la Conexión VPN" +description: "Solución de problemas de desconexiones VPN y actualización de credenciales en SleakOps" +date: "2024-08-01" +category: "usuario" +tags: ["vpn", "conexión", "credenciales", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Desconexión de la Conexión VPN + +**Fecha:** 1 de agosto de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Conexión, Credenciales, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan desconexiones intermitentes de la VPN al acceder a los servicios de SleakOps, requiriendo la actualización de credenciales y reconexión para restaurar el acceso a recursos internos como los paneles de Grafana. + +**Síntomas Observados:** + +- La conexión VPN se cae inesperadamente +- Incapacidad para acceder a servicios internos (p. ej., ingreso a Grafana) +- Problemas de conexión resueltos tras actualización de credenciales +- Los servicios funcionan correctamente cuando la VPN está conectada adecuadamente + +**Configuración Relevante:** + +- Cliente VPN: configuración proporcionada por SleakOps +- Servicios objetivo: puntos de ingreso internos (p. 
ej., Grafana) +- Autenticación: credenciales de usuario para acceso VPN + +**Condiciones de Error:** + +- La VPN se desconecta sin acción del usuario +- Los servicios internos se vuelven inaccesibles +- La reconexión puede fallar con credenciales antiguas +- El problema se resuelve tras refrescar las credenciales + +## Solución Detallada + + + +Cuando experimente desconexión de la VPN: + +1. **Desconéctese completamente** de la VPN +2. **Espere 10-15 segundos** antes de intentar reconectar +3. **Reconéctese a la VPN** usando su cliente +4. **Pruebe el acceso** a los servicios internos + +Si la reconexión falla, proceda a los pasos de actualización de credenciales. + + + + + +Para actualizar sus credenciales VPN: + +1. **Acceda al Panel de SleakOps** + + - Inicie sesión en su cuenta de SleakOps + - Navegue a **Configuración de Usuario** o **Acceso VPN** + +2. **Genere nuevas credenciales** + + - Haga clic en **"Regenerar Credenciales VPN"** + - Descargue el nuevo archivo de configuración + +3. **Actualice su cliente VPN** + - Elimine el perfil VPN antiguo + - Importe la nueva configuración + - Conéctese usando las credenciales actualizadas + +```bash +# Para clientes OpenVPN +sudo openvpn --config /ruta/a/nueva-config.ovpn +``` + + + + + +Después de reconectarse, verifique su acceso: + +1. **Revise el estado de la VPN** + + ```bash + # Verifique su dirección IP + curl ifconfig.me + + # Debe mostrar rango IP de la VPN SleakOps + ``` + +2. **Pruebe el acceso a servicios internos** + + ```bash + # Prueba de acceso a Grafana (ejemplo) + curl -I https://grafana.prod.su-dominio.com/login + + # Debe devolver HTTP 200 o redirección + ``` + +3. **Verifique la resolución DNS** + ```bash + # Prueba de resolución DNS interna + nslookup grafana.prod.su-dominio.com + ``` + + + + + +Para minimizar las desconexiones de VPN: + +1. **Habilite la reconexión automática** en la configuración de su cliente VPN +2. **Use opciones keep-alive** si están disponibles +3. 
**Verifique la estabilidad de la red** en su conexión local +4. **Actualice el cliente VPN** a la última versión + +**Para clientes OpenVPN:** + +```conf +# Añada a su archivo .ovpn +keepalive 10 120 +ping-timer-rem +persist-tun +persist-key +``` + +**Para administradores de red:** + +- Configure el firewall para permitir tráfico VPN +- Asegure que el puerto UDP 1194 (o el puerto configurado) esté abierto +- Verifique que el equipo de red no corte conexiones prolongadas + + + + + +**Si los problemas persisten:** + +1. **Revise los registros del sistema** + + ```bash + # En Linux/macOS + tail -f /var/log/openvpn.log + + # En Windows + # Revise el Visor de Eventos > Registros de Aplicaciones y Servicios > OpenVPN + ``` + +2. **Pruebe diferentes métodos de conexión** + + - Intente TCP en lugar de UDP + - Use diferentes puntos de acceso VPN si están disponibles + - Pruebe desde distintas ubicaciones de red + +3. **Contacte soporte con detalles** + - Versión del cliente VPN + - Sistema operativo + - Mensajes de error de los registros + - Hora en que ocurrió la desconexión + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-configuration-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-configuration-troubleshooting.mdx new file mode 100644 index 000000000..59f1ecb98 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-configuration-troubleshooting.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "Problemas de Configuración DNS de VPN con Acceso a Clúster Lens" +description: "Solución de problemas de configuración DNS al acceder a clústeres Kubernetes a través de VPN" +date: "2024-12-19" +category: "usuario" +tags: ["vpn", "dns", "lens", "conectividad", "solución de problemas"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Configuración DNS de VPN con Acceso a Clúster Lens + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, DNS, Lens, Conectividad, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas de tiempo de espera al intentar acceder a clústeres Kubernetes mediante Lens después de la conexión VPN, a pesar de seguir las guías de configuración DNS. + +**Síntomas Observados:** + +- Errores persistentes de tiempo de espera al conectarse al clúster vía Lens +- La conexión VPN parece estar funcionando +- La configuración DNS en `/etc/resolv.conf` deja de funcionar después de haber funcionado previamente +- Los problemas de conexión aparecen de forma intermitente o tras cambios en el sistema + +**Configuración Relevante:** + +- Sistema Operativo: Ubuntu 24.04.2 LTS (o distribuciones Linux similares) +- Cliente VPN: Pritunl +- Cliente Kubernetes: Lens +- Configuración DNS: `/etc/resolv.conf` + +**Condiciones de Error:** + +- Ocurre tiempo de espera al acceder al clúster mediante Lens +- El problema persiste incluso tras reconectar la VPN +- Los cambios en la configuración DNS no surten efecto +- Configuración que funcionaba deja de funcionar repentinamente + +## Solución Detallada + + + +El problema más común es una configuración DNS incorrecta en `/etc/resolv.conf`. En lugar de agregar servidores DNS, se debe reemplazar todo el contenido: + +**Enfoque incorrecto (agregar):** + +```bash +# No hacer esto - agregar al contenido existente + echo "nameserver 10.0.0.2" >> /etc/resolv.conf +``` + +**Enfoque correcto (reemplazar):** + +```bash +# Reemplazar todo el contenido +sudo tee /etc/resolv.conf > /dev/null < + + + +Para una solución más permanente, configura el DNS mediante el administrador de red de tu sistema: + +**Para Ubuntu/GNOME:** + +1. Ve a **Configuración** → **Red** +2. 
Haz clic en tu conexión **WiFi** o **Ethernet** +3. Ve a la pestaña **IPv4** o **IPv6** +4. Establece **DNS** en **Manual** +5. Añade los servidores DNS: + - Primario: `10.0.0.2` (servidor DNS de la VPC) + - Secundario: `8.8.8.8` (DNS de Google) + - Terciario: `8.8.4.4` (DNS de Google como respaldo) +6. Haz clic en **Aplicar** + +**Por línea de comandos:** + +```bash +# Usando nmcli +nmcli con mod "Your-Connection-Name" ipv4.dns "10.0.0.2,8.8.8.8" +nmcli con down "Your-Connection-Name" +nmcli con up "Your-Connection-Name" +``` + + + + + +Asegúrate de que tu cliente Pritunl esté configurado para manejar DNS correctamente: + +1. Abre el **Cliente Pritunl** +2. Haz clic en el **icono de engranaje** junto a tu perfil +3. Revisa **Configuraciones Avanzadas**: + - Activa **Sufijo DNS** si está disponible + - Activa **Forzar Configuración DNS** (esto obliga al cliente a sobreescribir el DNS del sistema) + - Desactiva **Bloquear DNS Externo** si está habilitado +4. Reconéctate a la VPN + +**Configuración alternativa para Pritunl:** + +```bash +# Si usas Pritunl desde línea de comandos +pritunl-client enable [profile-id] +pritunl-client start [profile-id] --dns-force +``` + + + + + +Después de configurar el DNS, verifica que la resolución funcione correctamente: + +```bash +# Verificar configuración DNS actual +cat /etc/resolv.conf + +# Probar resolución DNS para tu clúster +nslookup your-cluster-endpoint.amazonaws.com + +# Probar con dig para más detalles +dig your-cluster-endpoint.amazonaws.com + +# Probar conectividad al clúster +kubectl cluster-info + +# Probar desde Lens - verificar estado de conexión +``` + +**Salida esperada:** + +- `/etc/resolv.conf` debe mostrar primero el servidor DNS de tu VPC +- `nslookup` debe resolver a direcciones IP internas +- `kubectl cluster-info` debe conectarse con éxito + + + + + +Si la configuración DNS se sobrescribe constantemente: + +**Haz que resolv.conf sea inmutable:** + +```bash +# Después de establecer la configuración DNS 
correcta +sudo chattr +i /etc/resolv.conf + +# Para volver a hacerlo mutable más tarde +sudo chattr -i /etc/resolv.conf +``` + +**Usar systemd-resolved (Ubuntu 18.04+):** + +```bash +# Configurar systemd-resolved +sudo systemctl enable systemd-resolved +sudo systemctl start systemd-resolved + +# Editar configuración de resolved +sudo nano /etc/systemd/resolved.conf + +# Añadir tus servidores DNS +[Resolve] +DNS=10.0.0.2 8.8.8.8 +Domains=~. + +# Reiniciar el servicio +sudo systemctl restart systemd-resolved + +# Enlazar resolv.conf a systemd-resolved +sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf +``` + + + + + +Si el DNS funciona pero Lens sigue presentando problemas: + +1. **Limpiar caché de Lens:** + + ```bash + # Eliminar configuración y caché de Lens + rm -rf ~/.config/Lens + rm -rf ~/.cache/Lens + ``` + +2. **Verificar kubeconfig:** + + ```bash + # Verificar que kubeconfig sea accesible + kubectl config current-context + kubectl config view + ``` + +3. **Probar conexión directa:** + + ```bash + # Probar si puedes alcanzar el servidor API + curl -k https://your-cluster-endpoint.amazonaws.com/version + ``` + +4. 
**Configuración de proxy de Lens:** + - Abre Lens + - Ve a **Archivo** → **Preferencias** → **Proxy** + - Asegúrate de que la configuración del proxy no interfiera con la VPN + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-resolution-kubeconfig.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-resolution-kubeconfig.mdx new file mode 100644 index 000000000..6a662e502 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-dns-resolution-kubeconfig.mdx @@ -0,0 +1,197 @@ +--- +sidebar_position: 3 +title: "Problemas de Resolución DNS en VPN con Kubeconfig" +description: "Solución para problemas de resolución DNS al conectar con clústeres EKS a través de VPN" +date: "2024-01-15" +category: "usuario" +tags: ["vpn", "dns", "kubeconfig", "pritunl", "ubuntu", "lens"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Resolución DNS en VPN con Kubeconfig + +**Fecha:** 15 de enero de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, DNS, Kubeconfig, Pritunl, Ubuntu, Lens + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas de conectividad al intentar acceder a clústeres EKS mediante conexión VPN usando herramientas como Lens o kubectl. El problema está relacionado con conflictos en la resolución DNS entre el cliente VPN (Pritunl) y ciertas distribuciones de Linux como Ubuntu. 
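Un síntoma habitual de este conflicto es que el primer `nameserver` de `/etc/resolv.conf` no es el DNS interno de la VPN, por lo que los endpoints EKS se resuelven mal o no se resuelven. Un chequeo mínimo (el contenido de muestra es ilustrativo; en un sistema real se leería `cat /etc/resolv.conf`):

```bash
#!/usr/bin/env bash
# Extrae el primer "nameserver" de un contenido con formato resolv.conf.
primer_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# Contenido de ejemplo (ilustrativo)
muestra='# generado por el cliente VPN
nameserver 10.0.0.2
nameserver 8.8.8.8'

printf '%s\n' "$muestra" | primer_nameserver
```

Si el primer `nameserver` listado no es el DNS de la VPC (aquí `10.0.0.2` como valor de ejemplo), la resolución del endpoint interno del clúster fallará aunque la VPN esté conectada.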
+ +**Síntomas Observados:** + +- No se puede conectar al clúster EKS a través de Lens a pesar de estar conectado a la VPN +- La resolución DNS falla para los endpoints del clúster EKS +- La conexión funciona con direcciones IP directas pero no con nombres DNS +- El problema ocurre específicamente en Ubuntu y distribuciones Linux similares + +**Configuración Relevante:** + +- Cliente VPN: Pritunl +- Sistema Operativo: Ubuntu (y distribuciones Linux similares) +- Acceso al Clúster: Clústeres EKS en AWS +- Herramientas: Lens, kubectl +- Método de Conexión: VPN + kubeconfig + +**Condiciones de Error:** + +- El error ocurre cuando la VPN está conectada +- La resolución DNS falla para endpoints EKS +- El problema persiste en diferentes clústeres (Dev/Prod) +- El problema es específico de la distribución (sistemas basados en Ubuntu) + +## Solución Detallada + + + +Este problema ocurre porque ciertas distribuciones Linux (particularmente Ubuntu) tienen conflictos en la configuración DNS al usar la VPN Pritunl. El sistema no puede resolver correctamente los nombres DNS del clúster EKS a través del túnel VPN, causando fallos en la conexión. + +El problema afecta a: + +- Ubuntu y distribuciones basadas en Ubuntu +- Sistemas con systemd-resolved +- Configuraciones donde las DNS de la VPN entran en conflicto con las DNS locales + + + + + +La solución más efectiva es reemplazar el endpoint DNS del clúster EKS por su dirección IP directa en el archivo kubeconfig. + +**Pasos:** + +1. **Ubica tu archivo kubeconfig** (usualmente en `~/.kube/config`) +2. **Encuentra la línea del servidor** que se ve así: + ```yaml + server: https://9CFEED5AD69EF4F87D19D6FF9FBF7AD9.gr7.us-east-1.eks.amazonaws.com + ``` +3. **Reemplázala con la dirección IP:** + + ```yaml + # Para clúster de Producción + server: https://10.130.96.192 + + # Para clúster de Desarrollo + server: https://10.110.98.134 + ``` + +4. 
**Guarda el archivo** y prueba la conexión + +**Ejemplo de modificación en kubeconfig:** + +```yaml +apiVersion: v1 +clusters: + - cluster: + certificate-authority-data: LS0tLS1CRUdJTi... + server: https://10.130.96.192 # Cambiado de DNS a IP + name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +contexts: + - context: + cluster: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + user: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +current-context: arn:aws:eks:us-east-1:123456789:cluster/my-cluster +kind: Config +preferences: {} +users: + - name: arn:aws:eks:us-east-1:123456789:cluster/my-cluster + user: + exec: + apiVersion: client.authentication.k8s.io/v1beta1 + command: aws + args: + - eks + - get-token + - --cluster-name + - my-cluster +``` + + + + + +Si el reemplazo por IP no funciona de forma permanente, puedes intentar restablecer la configuración DNS y la conexión de red en Pritunl: + +**Pasos:** + +1. **Abre el cliente Pritunl** +2. **Busca opciones de red** (usualmente en ajustes o opciones de conexión) +3. **Encuentra los botones para restablecer DNS** - debería haber opciones para: + - Restablecer configuración DNS + - Restablecer conexión de red +4. **Haz clic en ambos botones de restablecer** +5. **Reconéctate a la VPN** +6. **Prueba la conectividad al clúster** + +**Nota:** Esta solución puede ser temporal y podría ser necesario repetirla periódicamente. + + + + + +Al usar Lens con el kubeconfig modificado: + +**Pasos:** + +1. **Asegúrate de que la VPN esté conectada** antes de abrir Lens +2. **Importa el kubeconfig modificado** con direcciones IP en lugar de nombres DNS +3. **Agrega el clúster en Lens** usando la configuración actualizada +4. 
**Prueba la conexión** - ahora deberías poder ver logs y recursos del clúster + +**Solución de problemas en la conexión de Lens:** + +- Verifica que la conexión VPN esté activa +- Comprueba que la dirección IP en kubeconfig coincida con tu entorno (Dev/Prod) +- Asegúrate que las credenciales AWS estén configuradas correctamente +- Intenta refrescar la conexión al clúster en Lens + + + + + +**Importante:** Diferentes entornos usan diferentes direcciones IP. Asegúrate de usar la IP correcta para cada entorno: + +**Entorno de Desarrollo:** + +```yaml +server: https://10.110.98.134 +``` + +**Entorno de Producción:** + +```yaml +server: https://10.130.96.192 +``` + +**Cómo obtener la IP correcta:** +Si necesitas encontrar la dirección IP para tu clúster específico: + +1. Contacta al soporte de SleakOps para las direcciones IP actuales +2. O usa `nslookup` cuando no estés conectado a la VPN: + ```bash + nslookup your-cluster-endpoint.gr7.us-east-1.eks.amazonaws.com + ``` + + + + + +**Estado Actual:** El equipo de SleakOps está trabajando en una solución permanente para este problema de resolución DNS. + +**Solución Temporal:** El método de reemplazo por dirección IP descrito arriba es la solución recomendada actualmente. + +**Actualizaciones Futuras:** Cuando se implemente una solución permanente, los usuarios serán notificados y podrán volver a usar nombres DNS en sus archivos kubeconfig. 
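Mientras llega la solución permanente, el reemplazo manual del endpoint puede automatizarse con una sustitución sobre la línea `server:` del kubeconfig. Boceto mínimo (la línea de entrada y la IP son ilustrativas; use los valores de su propio clúster y haga una copia de seguridad del archivo antes de modificarlo):

```bash
#!/usr/bin/env bash
# Sustituye el endpoint DNS por una IP directa en una línea "server:".
linea='    server: https://9CFEED5AD69EF4F87D19D6FF9FBF7AD9.gr7.us-east-1.eks.amazonaws.com'
ip_privada='10.130.96.192'

# sed -E conserva el prefijo "server: https://" y reemplaza el resto por la IP
nueva=$(printf '%s\n' "$linea" | sed -E "s#(server: https://).*#\\1${ip_privada}#")
printf '%s\n' "$nueva"
```

Aplicado sobre `~/.kube/config` (por ejemplo con `sed -i.bak`), produce el mismo resultado que la edición manual descrita arriba y deja un respaldo `.bak`.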
+ +**Sistemas Afectados:** + +- Ubuntu y distribuciones basadas en Ubuntu +- Sistemas que usan systemd-resolved +- Ciertas configuraciones de red que entran en conflicto con la VPN Pritunl + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx new file mode 100644 index 000000000..718730828 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-kubernetes-cluster-access-troubleshooting.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión VPN con el Acceso al Cluster Kubernetes" +description: "Solución de problemas de conectividad VPN al acceder a clusters Kubernetes a través de Lens" +date: "2025-01-28" +category: "usuario" +tags: ["vpn", "kubernetes", "lens", "conectividad", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Conexión VPN con el Acceso al Cluster Kubernetes + +**Fecha:** 28 de enero de 2025 +**Categoría:** Usuario +**Etiquetas:** VPN, Kubernetes, Lens, Conectividad, Solución de Problemas + +## Descripción del Problema + +**Contexto:** El usuario se conecta correctamente a la VPN usando Pritunl pero no puede acceder al cluster Kubernetes a través del IDE Lens. Los intentos de conexión resultan en errores de tiempo de espera, y hay dudas sobre la resolución DNS y la configuración del segmento de red. 
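Como el síntoma central es que el clúster resuelve a una IP pública en lugar de la privada, puede automatizarse esa comprobación clasificando la IP según los rangos privados RFC 1918 (10.x.x.x, 172.16-31.x.x, 192.168.x.x). Boceto mínimo; en un caso real la IP saldría de `dig +short <endpoint-del-cluster> | head -n1` y el valor usado aquí es ilustrativo:

```bash
#!/usr/bin/env bash
# Clasifica una IPv4 como privada (RFC 1918) o pública.
es_ip_privada() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

ip='10.130.96.192'  # valor de ejemplo
if es_ip_privada "$ip"; then
  echo "IP privada: la resolución DNS pasa por la VPN"
else
  echo "IP pública: la resolución DNS NO pasa por la VPN"
fi
```

Si el resultado es una IP pública con la VPN conectada, el problema está en la resolución DNS (no en Lens), y aplican los pasos de verificación de la sección siguiente.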
+ +**Síntomas Observados:** + +- La conexión VPN con Pritunl parece exitosa +- Lens muestra errores de tiempo de espera al intentar conectarse al cluster Kubernetes +- El cluster resuelve a una IP pública en lugar de una IP interna +- La VPN asigna un segmento de red diferente al esperado para el cluster Kubernetes +- La consola de AWS muestra notificaciones de actualización del cluster + +**Configuración Relevante:** + +- Cliente VPN: Pritunl +- IDE Kubernetes: Lens +- Plataforma: AWS EKS +- Método de conexión: importación de kubeconfig +- Resolución DNS: intento de configuración DNS forzada + +**Condiciones de Error:** + +- Ocurren errores de tiempo de espera cuando Lens intenta conectar +- El problema persiste a pesar de la conexión VPN exitosa +- La resolución DNS apunta a IP pública en lugar del endpoint privado del cluster +- El problema ocurre incluso con configuraciones DNS forzadas + +## Solución Detallada + + + +Primero, verifica que tu VPN esté resolviendo correctamente el endpoint interno del cluster: + +1. **Revisa tu archivo kubeconfig** para encontrar el endpoint del cluster: + + ```bash + cat ~/.kube/config | grep server + ``` + +2. **Prueba la resolución DNS** con la VPN conectada: + + ```bash + nslookup your-cluster-endpoint.amazonaws.com + ping your-cluster-endpoint.amazonaws.com + ``` + +3. **Prueba la conectividad HTTP**: + ```bash + curl -k https://your-cluster-endpoint.amazonaws.com + ``` + +El cluster debería resolverse a una dirección IP privada (rango 10.x.x.x o 172.x.x.x) cuando la VPN está activa. + + + + + +Múltiples conexiones VPN pueden interferir con la resolución DNS: + +1. **Desconecta todas las demás conexiones VPN**: + + - Revisa la bandeja del sistema para clientes VPN activos + - Deshabilita cualquier conexión VPN corporativa + - Cierra otras aplicaciones VPN + +2. **Verifica que Pritunl sea la única VPN activa**: + + ```bash + # En Windows + ipconfig /all + + # En macOS/Linux + ifconfig + ``` + +3. 
**Revisa la tabla de rutas**: + + ```bash + # Windows + route print + + # macOS/Linux + route -n + ``` + +Asegúrate de que la ruta VPN tenga prioridad sobre otras interfaces de red. + + + + + +Lens puede almacenar en caché la resolución del endpoint del cluster: + +1. **Cierra Lens completamente** +2. **Asegúrate de que la VPN esté conectada y estable** +3. **Borra la caché de Lens** (opcional): + + - Windows: `%APPDATA%\Lens` + - macOS: `~/Library/Application Support/Lens` + - Linux: `~/.config/Lens` + +4. **Reinicia Lens y vuelve a importar el kubeconfig**: + + - Archivo → Añadir Cluster + - Importa nuevamente tu archivo kubeconfig + - Prueba la conexión + +5. **Verifica el contexto del cluster**: + ```bash + kubectl config current-context + kubectl cluster-info + ``` + + + + + +Si la solución básica no funciona: + +1. **Verifica el rango IP asignado por la VPN**: + + ```bash + # Encuentra tu interfaz VPN + ip addr show | grep tun + # o + ifconfig | grep -A 5 tun + ``` + +2. **Verifica que puedas alcanzar la red del cluster**: + + ```bash + # Intenta alcanzar la red interna del cluster + ping 10.0.0.1 # Reemplaza con la red de tu cluster + ``` + +3. **Prueba kubectl directamente**: + + ```bash + kubectl get nodes + kubectl get pods --all-namespaces + ``` + +4. **Revisa la configuración DNS**: + + ```bash + # Windows + nslookup + server + + # macOS/Linux + cat /etc/resolv.conf + ``` + + + + + +Respecto a la notificación de actualización del cluster en la consola AWS: + +1. **Esto es normal** - AWS notifica regularmente sobre actualizaciones disponibles +2. **No actualices durante la solución de problemas** - Prioriza la conectividad primero +3. **Las actualizaciones deben planificarse** - Coordina con tu equipo de SleakOps + +**Para verificar la versión del cluster**: + +```bash +kubectl version --short +``` + +**Nota**: Las actualizaciones del cluster normalmente no causan problemas de conectividad VPN. + + + + + +Si Lens continúa presentando problemas: + +1. 
**Usa kubectl directamente**: + + ```bash + kubectl get nodes + kubectl get pods + ``` + +2. **Prueba otros IDEs para Kubernetes**: + + - K9s (basado en terminal) + - Octant (basado en web) + - VS Code con extensión de Kubernetes + +3. **Acceso basado en web** (si está disponible): + + - Kubernetes Dashboard + - Interfaz web de SleakOps + +4. **Verifica con soporte de SleakOps**: + - Confirma la configuración VPN + - Revisa la accesibilidad del cluster + - Verifica permisos de usuario + + + +--- + +_Esta FAQ fue generada automáticamente el 28 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-mobile-access-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-mobile-access-configuration.mdx new file mode 100644 index 000000000..516e62053 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-mobile-access-configuration.mdx @@ -0,0 +1,181 @@ +--- +sidebar_position: 3 +title: "Configuración de Acceso VPN Móvil" +description: "Cómo configurar el acceso VPN desde dispositivos móviles usando Pritunl Server" +date: "2024-03-10" +category: "usuario" +tags: ["vpn", "móvil", "pritunl", "openvpn", "aws", "acceso"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Acceso VPN Móvil + +**Fecha:** 10 de marzo de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Móvil, Pritunl, OpenVPN, AWS, Acceso + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan acceder a la VPN de SleakOps desde dispositivos móviles para probar aplicaciones sin desplegarlas en entornos de producción. 
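Los pasos para recuperar las credenciales desde AWS Secrets Manager pueden automatizarse desde la terminal. Es un boceto bajo supuestos: el nombre del secreto (`pritunl/server`) es hipotético, y el extractor de campos con `sed` solo sirve para valores JSON planos sin comillas escapadas:

```shell
#!/bin/sh
# Recuperar el SecretString del secreto de Pritunl (nombre hipotético):
#   aws secretsmanager get-secret-value \
#     --secret-id pritunl/server --query SecretString --output text
#
# El SecretString es un JSON plano; esta función extrae un campo sin
# depender de jq (solo para valores simples entre comillas).
campo_json() {  # uso: campo_json <json> <clave>
  printf '%s' "$1" | sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Uso típico:
# SECRETO=$(aws secretsmanager get-secret-value --secret-id pritunl/server \
#   --query SecretString --output text)
# echo "URL del panel: https://$(campo_json "$SECRETO" host)/"
```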
+ +**Síntomas Observados:** + +- Imposible instalar el cliente Pritunl en el dispositivo móvil +- Necesidad de realizar pruebas móviles a través de conexión VPN +- Requisito de probar aplicaciones desde móvil sin despliegue en producción + +**Configuración Relevante:** + +- Servidor VPN: Pritunl Server ejecutándose en instancia EC2 +- Protocolo: Compatible con OpenVPN +- Método de acceso: AWS Secrets Manager para credenciales +- Entornos objetivo: Desarrollo y Producción + +**Condiciones de Error:** + +- Cliente Pritunl no disponible para instalación móvil +- Necesidad de método alternativo de conexión para dispositivos móviles +- Requisitos de prueba desde dispositivos móviles + +## Solución Detallada + + + +La infraestructura VPN de SleakOps utiliza: + +- **Pritunl Server**: Ejecutándose en instancia EC2 en AWS +- **Protocolo OpenVPN**: Compatible con clientes estándar OpenVPN +- **Soporte Móvil Nativo**: La mayoría de dispositivos móviles soportan OpenVPN de forma nativa +- **Acceso Multi-ambiente**: Configuraciones separadas para desarrollo y producción + + + + + +Para obtener tu configuración VPN: + +1. **Inicia sesión en la Consola AWS** con tus credenciales de usuario SleakOps +2. **Cambia a la cuenta objetivo** (desarrollo o producción) +3. **Navega a AWS Systems Manager**: + - Busca "Systems Manager" en los servicios AWS + - O ve al servicio "Secrets Manager" +4. **Encuentra el Secreto Pritunl**: + - Busca secretos relacionados con Pritunl + - Haz clic en el secreto para ver detalles +5. **Revela los Valores del Secreto**: + - Haz clic en el botón "Retrieve secret value" (Obtener valor secreto) + - Anota: dirección IP, usuario y contraseña + + + + + +Una vez que tengas las credenciales: + +1. **Abre un navegador web** en cualquier dispositivo +2. **Navega a la IP de Pritunl** (desde el secreto AWS) +3. **Inicia sesión con las credenciales**: + - Usuario: (desde el secreto AWS) + - Contraseña: (desde el secreto AWS) +4. 
**Accede al Panel**: Verás la interfaz de gestión de Pritunl + + + + + +En el panel de Pritunl: + +1. **Encuentra tu cuenta de usuario** en la lista de usuarios +2. **Descarga el perfil de conexión** en múltiples formatos: + - **Archivo ZIP**: Contiene todos los archivos de configuración + - **Archivo OVPN**: Configuración estándar OpenVPN + - **Configuración URL**: Para configuración rápida en móvil +3. **Elige formato compatible con móvil**: Se recomienda OVPN para móviles + + + + + +### Dispositivos iOS: + +1. **Descarga OpenVPN Connect** desde App Store +2. **Importa archivo OVPN**: + - Envíate el archivo OVPN por correo electrónico + - Ábrelo desde el correo y elige "Abrir en OpenVPN" + - O usa la configuración URL para importación directa +3. **Conéctate**: Pulsa conectar en la app OpenVPN + +### Dispositivos Android: + +1. **Descarga OpenVPN for Android** desde Play Store +2. **Importa configuración**: + - Transfiere el archivo OVPN al dispositivo + - Ábrelo con la app OpenVPN + - O usa la configuración VPN nativa con el archivo OVPN +3. **Alternativa**: Usa la configuración VPN nativa de Android + +### VPN Móvil Nativo (Alternativa): + +La mayoría de smartphones modernos soportan OpenVPN nativamente: + +- Ve a **Ajustes** → **VPN** +- Añade nueva configuración VPN +- Importa archivo OVPN o ingresa datos manualmente + + + + + +Una vez configurada la VPN: + +1. **Conéctate a la VPN** desde el dispositivo móvil +2. **Accede al entorno de desarrollo**: + - Tus aplicaciones serán accesibles mediante URLs internas + - No es necesario desplegar en producción para pruebas +3. **Realiza pruebas móviles**: + - Prueba diseño responsivo + - Verifica funcionalidades específicas de móvil + - Comprueba rendimiento en redes móviles +4. 
**Desconecta la VPN** al finalizar las pruebas + +**Beneficios:** + +- Prueba en dispositivos móviles reales +- Acceso seguro al entorno de desarrollo +- No se requiere despliegue en producción +- Mantiene separación desarrollo/producción + + + + + +**Problemas de Conexión:** + +- Verifica que las credenciales sean correctas +- Revisa conectividad de datos móviles/WiFi +- Asegúrate que el servidor VPN esté activo (revisa instancia EC2 en AWS) + +**Problemas al Importar Perfil:** + +- Prueba diferentes métodos de importación (correo, almacenamiento en la nube, URL) +- Verifica que el archivo OVPN no esté corrupto +- Usa formato ZIP si falla OVPN + +**Problemas de Rendimiento:** + +- Las redes móviles pueden tener mayor latencia +- Considera probar tanto en WiFi como en celular +- La VPN añade sobrecarga por cifrado + +**Soluciones Alternativas:** + +- Usa herramientas de prueba basadas en navegador +- Configura reenvío de puertos para servicios específicos +- Crea endpoints públicos temporales para pruebas + + + +--- + +_Este FAQ fue generado automáticamente el 10 de marzo de 2024 basado en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx new file mode 100644 index 000000000..a9ea6c931 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-connection-troubleshooting.mdx @@ -0,0 +1,163 @@ +--- +sidebar_position: 3 +title: "Problemas de Conexión VPN Pritunl" +description: "Guía de solución de problemas para problemas de conexión VPN Pritunl en SleakOps" +date: "2024-12-19" +category: "usuario" +tags: ["vpn", "pritunl", "conexión", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Conexión VPN Pritunl + +**Fecha:** 19 de diciembre 
de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Pritunl, Conexión, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Usuarios experimentando problemas de conectividad con la VPN Pritunl en el entorno de producción de SleakOps, donde el cliente VPN muestra el estado "conectando" pero nunca establece una conexión exitosa. + +**Síntomas Observados:** + +- El cliente Pritunl permanece en estado "conectando" indefinidamente +- No se muestran mensajes de error en el cliente +- Incapacidad para acceder a recursos internos a través de la VPN +- El servidor VPN parece estar funcionando pero las conexiones fallan + +**Configuración Relevante:** + +- Tipo de VPN: Pritunl OpenVPN +- Entorno: Producción +- Cliente: Aplicación de escritorio Pritunl +- Formato de perfil: archivo de configuración .ovpn + +**Condiciones de Error:** + +- Los intentos de conexión agotan el tiempo sin establecer túnel +- El problema puede ser intermitente o persistente +- Puede ocurrir tras actualizaciones del perfil o cambios en la red +- La configuración de red local puede interferir con la conexión + +## Solución Detallada + + + +Antes de solucionar problemas del perfil VPN, verifique la conectividad básica al servidor VPN: + +1. **Pruebe la conectividad HTTPS** a la IP del servidor VPN: + ```bash + # Reemplace con la IP real de su servidor VPN + curl -k https://3.82.69.46/ + ``` +2. **Comportamiento esperado**: Debería ver una respuesta HTTPS (incluso con advertencias de certificado) + +3. **Si no hay respuesta**: Probablemente el problema sea de red (firewall, bloqueo del ISP, etc.) + + + + + +La solución más común es regenerar el perfil VPN: + +1. **Acceda al Panel de SleakOps** +2. Navegue a **Configuración VPN** o **Perfil de Usuario** +3. **Genere una nueva URL de Pritunl**: + - Busque la opción "Generar Perfil VPN" o similar + - Haga clic para crear un nuevo enlace temporal de descarga +4. 
**Descargue el perfil nuevo**: + - Las URLs generadas tienen validez limitada (unas pocas horas) + - Descargue el nuevo perfil .ovpn inmediatamente + + + + + +Proceso completo de reinstalación del perfil: + +1. **Eliminar perfil existente**: + + - Abra el cliente Pritunl + - Haga clic derecho en el perfil problemático + - Seleccione "Eliminar" o "Quitar" + +2. **Borrar cualquier dato en caché**: + + - Cierre completamente el cliente Pritunl + - Reinicie la aplicación + +3. **Instalar nuevo perfil**: + - Importe el archivo .ovpn recién descargado + - Verifique que la configuración del perfil sea correcta + - Intente la conexión + + + + + +Si regenerar el perfil no funciona, pruebe la solución de problemas a nivel de red: + +1. **Pruebe diferentes redes**: + + - Intente conectar desde un hotspot móvil + - Pruebe desde otra red WiFi + - Esto ayuda a identificar problemas con el ISP o la red local + +2. **Verifique la configuración del firewall**: + + - Asegúrese de que los puertos de OpenVPN no estén bloqueados + - Puertos comunes: 1194 (UDP), 443 (TCP) + - Desactive temporalmente el firewall local para pruebas + +3. **Resolución DNS**: + - Verifique que el nombre de host del servidor VPN se resuelva correctamente + - Intente usar la dirección IP en lugar del nombre de host en el perfil + + + + + +Para administradores con acceso al backend de SleakOps: + +1. **Acceda al Gestor de Secretos**: + + - Encuentre las credenciales del servidor Pritunl + - Acceda a la consola administrativa de Pritunl + +2. **Revise los registros del servidor**: + + - Analice los intentos de conexión en los logs de Pritunl + - Busque fallos de autenticación o errores de red + +3. **Verifique el estado del servidor**: + - Asegúrese de que el servicio Pritunl esté en ejecución + - Revise el uso de recursos del servidor + - Verifique la conectividad de red desde el servidor + + + + + +Si la solución estándar no resuelve el problema: + +1. 
**Pruebe diferentes protocolos VPN**: + + - Cambie entre UDP y TCP si hay opciones disponibles + - Pruebe diferentes puertos si es configurable + +2. **Actualice el cliente Pritunl**: + + - Asegúrese de usar la última versión + - Verifique problemas de compatibilidad + +3. **Contacte soporte con detalles**: + - Proporcione la IP del servidor VPN y registros de conexión + - Incluya detalles del entorno de red + - Especifique cuándo comenzó a ocurrir el problema + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx new file mode 100644 index 000000000..48185939b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/vpn-pritunl-dns-resolution-issues.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "Problemas de Resolución DNS en Pritunl VPN" +description: "Solución de problemas de resolución DNS con conexiones VPN Pritunl" +date: "2024-01-15" +category: "usuario" +tags: ["vpn", "pritunl", "dns", "redes", "solución-de-problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Resolución DNS en Pritunl VPN + +**Fecha:** 15 de enero de 2024 +**Categoría:** Usuario +**Etiquetas:** VPN, Pritunl, DNS, Redes, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios experimentan problemas intermitentes de resolución DNS al conectarse a la infraestructura de SleakOps mediante VPN Pritunl desde diferentes ubicaciones y sistemas operativos. 
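Antes de forzar el servidor DNS de producción (10.130.0.2), puede comprobar si ya figura entre los resolvers configurados. Boceto mínimo en shell, que asume un archivo con formato tipo `resolv.conf`:

```shell
#!/bin/sh
# Comprueba si una IP aparece como "nameserver" en un archivo
# con formato tipo resolv.conf.
tiene_nameserver() {  # uso: tiene_nameserver <archivo> <ip>
  grep -q "^nameserver[[:space:]][[:space:]]*$2\$" "$1"
}

# Uso típico en Linux/macOS:
# if tiene_nameserver /etc/resolv.conf 10.130.0.2; then
#   echo "DNS de producción ya configurado"
# else
#   echo "Falta 10.130.0.2: aplique los pasos de configuración manual"
# fi
```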
+ +**Síntomas Observados:** + +- La conexión VPN Pritunl funciona de forma intermitente +- La resolución DNS falla en algunos sistemas operativos +- Los problemas varían según la ubicación geográfica +- El problema aparece de forma inconsistente en distintos sistemas cliente + +**Configuración Relevante:** + +- Cliente VPN: Pritunl +- Ubicaciones de conexión: Múltiples IPs en Argentina +- Servidor DNS de producción: 10.130.0.2 +- Afecta múltiples sistemas operativos + +**Condiciones de Error:** + +- Fallos esporádicos en la resolución DNS +- Problema varía según sistema operativo +- Problemas al conectar desde distintas ubicaciones geográficas +- Comportamiento inconsistente en intentos de conexión + +## Solución Detallada + + + +El primer paso para solucionar es restablecer las configuraciones DNS y de red de Pritunl: + +**Pasos para restablecer:** + +1. Abre tu cliente Pritunl +2. Ve a **Configuración** o **Preferencias** +3. Busca **Opciones Avanzadas** o **Configuración de Red** +4. Haz clic en **Restablecer DNS** +5. Haz clic en **Restablecer Red** +6. Reinicia el cliente Pritunl +7. Reconéctate a tu perfil VPN + +**Qué hace esto:** + +- Limpia las entradas DNS en caché +- Restablece las configuraciones de interfaces de red +- Fuerza una configuración nueva del resolvedor DNS + + + + + +Si el restablecimiento no resuelve el problema, configura manualmente el servidor DNS de producción: + +**Agregar servidor DNS 10.130.0.2:** + +**En Windows:** + +1. Ve a **Centro de redes y recursos compartidos** +2. Haz clic en tu conexión de red activa +3. Haz clic en **Propiedades** +4. Selecciona **Protocolo de Internet versión 4 (TCP/IPv4)** +5. Haz clic en **Propiedades** +6. Selecciona **Usar las siguientes direcciones de servidor DNS** +7. Añade `10.130.0.2` como DNS primario +8. Haz clic en **Aceptar** + +**En macOS:** + +1. Ve a **Preferencias del Sistema** → **Red** +2. Selecciona tu conexión de red +3. Haz clic en **Avanzado** +4. Ve a la pestaña **DNS** +5. 
Haz clic en **+** y agrega `10.130.0.2` +6. Haz clic en **Aceptar** + +**En Linux:** + +```bash +# Editar resolv.conf +sudo nano /etc/resolv.conf + +# Añadir esta línea al principio +nameserver 10.130.0.2 + +# O usar systemd-resolved +sudo systemctl edit systemd-resolved + +# Añadir: +[Resolve] +DNS=10.130.0.2 +``` + + + + + +También puedes configurar el DNS directamente en tu perfil de conexión Pritunl: + +1. Abre el cliente Pritunl +2. Busca tu perfil de conexión +3. Haz clic en el **icono de engranaje** o **Editar** +4. Busca **Configuración DNS** o **Avanzado** +5. Añade `10.130.0.2` a la lista de servidores DNS +6. Guarda el perfil +7. Reconéctate + +**Método alternativo:** + +```bash +# Si usas Pritunl desde línea de comandos +pritunl-client add [archivo-perfil] +pritunl-client start [id-perfil] --dns 10.130.0.2 +``` + + + + + +**Windows:** + +- Deshabilitar IPv6 si no es necesario: `netsh interface ipv6 set global randomizeidentifiers=disabled` +- Vaciar caché DNS: `ipconfig /flushdns` +- Restablecer Winsock: `netsh winsock reset` + +**macOS:** + +- Vaciar caché DNS: `sudo dscacheutil -flushcache` +- Restablecer configuración de red: `sudo ifconfig en0 down && sudo ifconfig en0 up` + +**Linux:** + +- Reiniciar NetworkManager: `sudo systemctl restart NetworkManager` +- Vaciar caché DNS: `sudo systemd-resolve --flush-caches` +- Verificar resolución DNS: `nslookup [dominio] 10.130.0.2` + + + + + +Para usuarios que se conectan desde distintas ubicaciones en Argentina: + +**Optimización de Conexión:** + +1. Elige la ubicación del servidor Pritunl más cercana +2. Prueba diferentes protocolos de conexión (UDP vs TCP) +3. 
Ajusta el tamaño MTU si es necesario: + + ```bash + # Probar MTU óptimo + ping -f -l 1472 [ip-servidor] + + # Configurar MTU en Pritunl + # Añadir al perfil: tun-mtu 1500 + ``` + +**Pruebas de Calidad de Red:** + +```bash +# Probar velocidad de resolución DNS +dig @10.130.0.2 [tu-dominio] + +# Probar calidad de conexión +ping -c 10 10.130.0.2 + +# Rastrear ruta para identificar problemas +traceroute 10.130.0.2 +``` + + + + + +Después de aplicar las soluciones, verifica que la resolución DNS funcione correctamente: + +**Comandos de prueba:** + +```bash +# Probar resolución DNS +nslookup google.com 10.130.0.2 +dig @10.130.0.2 tu-dominio-interno.com + +# Probar servicios internos +ping servicio-interno.sleakops.local +curl -I https://tu-api.sleakops.com + +# Verificar servidores DNS actuales +# En Linux/macOS: +cat /etc/resolv.conf + +# En Windows: +ipconfig /all | findstr "DNS Servers" +``` + +**Resultados esperados:** + +- Las consultas DNS deben resolverse rápidamente +- Los dominios internos deben ser accesibles +- No debe haber errores de tiempo de espera +- Resolución consistente en múltiples intentos + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-dns-404-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-dns-404-error.mdx new file mode 100644 index 000000000..2240f3104 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-dns-404-error.mdx @@ -0,0 +1,206 @@ +--- +sidebar_position: 3 +title: "Error 404 de DNS en Servicio Web" +description: "Solución para servicio web que devuelve error 404 a pesar de que el pod funciona correctamente" +date: "2024-01-15" +category: "workload" +tags: ["servicioweb", "dns", "404", "resolucióndeproblemas", "redes"] +--- + +import TroubleshootingItem from 
"@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error 404 de DNS en Servicio Web + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Servicio Web, DNS, 404, Resolución de problemas, Redes + +## Descripción del Problema + +**Contexto:** El usuario crea un nuevo servicio web en entorno de producción con registro DNS. El pod funciona correctamente y responde con reenvío de puertos, pero al acceder a la URL devuelve un error 404. + +**Síntomas observados:** + +- El pod está funcionando correctamente en Kubernetes (visible en Lens) +- El reenvío de puertos al pod funciona adecuadamente +- La URL DNS devuelve error 404: "No se puede encontrar esta página [nombre-del-servicio].[dominio].com" +- El error indica que la página web no fue encontrada, no que la IP no pudo resolverse +- El navegador muestra: "No se encontró ninguna página web para esta dirección" + +**Configuración relevante:** + +- Entorno: Producción +- Tipo de servicio: Servicio web +- Registro DNS: Configurado +- Estado del pod: En ejecución y responde +- Balanceador de carga: Recientemente tuvo problemas (mencionados como resueltos) + +**Condiciones de error:** + +- El error ocurre al acceder a la URL pública +- El pod responde correctamente al reenvío directo de puertos +- La resolución DNS parece funcionar (sin error de resolución de IP) +- El error 404 sugiere problema en el enrutamiento o configuración del ingress + +## Solución Detallada + + + +Cuando un pod funciona con reenvío de puertos pero devuelve 404 vía DNS, el problema suele estar en la capa de ingress/enrutamiento: + +1. **Controlador Ingress**: El ingress puede no estar configurado correctamente +2. **Configuración del Servicio**: El servicio de Kubernetes podría no estar exponiendo correctamente el pod +3. **Balanceador de Carga**: Problemas recientes en el balanceador podrían haber afectado las reglas de enrutamiento +4. 
**Enrutamiento de Ruta**: El ingress podría estar esperando una ruta o configuración de host diferente + + + + + +Verifique si el Servicio de Kubernetes está configurado correctamente: + +```bash +# Verificar si el servicio existe y tiene endpoints +kubectl get svc -n [namespace] +kubectl describe svc [nombre-del-servicio] -n [namespace] + +# Verificar que los endpoints estén poblados +kubectl get endpoints [nombre-del-servicio] -n [namespace] +``` + +El servicio debe: + +- Tener el selector correcto que coincida con las etiquetas de su pod +- Mostrar endpoints apuntando a la IP del pod +- Usar la configuración de puerto correcta + + + + + +Revise la configuración del ingress en SleakOps: + +1. **En el Panel de SleakOps**: + + - Vaya a su proyecto → Servicios Web + - Verifique la configuración DNS + - Revise los ajustes de ruta y host + +2. **En Kubernetes**: + +```bash +# Revisar recursos ingress +kubectl get ingress -n [namespace] +kubectl describe ingress [nombre-del-ingress] -n [namespace] +``` + +3. **Problemas comunes**: + - Configuración incorrecta del host + - Reglas de ruta faltantes o erróneas + - Desajuste en el nombre del servicio backend + + + + + +Dado que hubo problemas recientes con el balanceador de carga: + +1. **Verificar estado del Balanceador**: + +```bash +# Comprobar si el balanceador recibe tráfico +kubectl get svc -n ingress-nginx +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller +``` + +2. **Verificar propagación DNS**: + +```bash +# Comprobar si DNS resuelve a la IP correcta +nslookup [su-dominio].com +dig [su-dominio].com +``` + +3. **Probar balanceador directamente**: + - Obtener la IP del balanceador + - Probar con curl usando la cabecera Host: + +```bash +curl -H "Host: [su-dominio].com" http://[ip-del-balanceador]/ +``` + + + + + +**Solución 1: Recrear el registro DNS** + +1. En SleakOps, elimine la configuración DNS +2. Espere 2-3 minutos +3. 
Vuelva a agregar la configuración DNS + +**Solución 2: Verificar configuración de ruta en la aplicación** + +```yaml +# Asegúrese que su aplicación sirva en la ruta correcta +# Si su app sirve en /app, configure el ingress en consecuencia +path: / +pathType: Prefix +``` + +**Solución 3: Verificar que la aplicación escuche en el puerto correcto** + +```bash +# Reenvíe puertos y verifique qué puerto usa la app realmente +kubectl port-forward pod/[nombre-del-pod] 8080:8080 +# Pruebe con otros puertos si 8080 no funciona +``` + +**Solución 4: Revisar cambios recientes** +Si esto funcionaba antes, revise: + +- Despliegues recientes que puedan haber modificado la aplicación +- Cambios en la configuración del balanceador +- Modificaciones en reglas DNS o ingress + + + + + +1. **Verificar que el pod esté saludable**: + +```bash +kubectl get pods -n [namespace] +kubectl logs [nombre-del-pod] -n [namespace] +``` + +2. **Probar conectividad del servicio**: + +```bash +# Desde dentro del clúster +kubectl run test-pod --image=curlimages/curl -it --rm -- sh +# Dentro del pod: +curl http://[nombre-del-servicio].[namespace].svc.cluster.local:[puerto] +``` + +3. **Revisar logs del ingress**: + +```bash +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller | grep [su-dominio] +``` + +4. **Validar reglas del ingress**: + +```bash +kubectl get ingress [nombre-del-ingress] -o yaml +``` + +5. 
**Probar con diferentes rutas**: + - Intente acceder a `https://[dominio].com/health` u otros endpoints conocidos + - Verifique si la aplicación requiere una ruta base específica + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-configuration.mdx new file mode 100644 index 000000000..08278623e --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-configuration.mdx @@ -0,0 +1,147 @@ +--- +sidebar_position: 3 +title: "Problema de Configuración de Dominio en Servicio Web" +description: "Cómo corregir una configuración incorrecta de dominio en servicios web" +date: "2024-03-10" +category: "workload" +tags: ["servicio-web", "dominio", "configuración", "staging"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema de Configuración de Dominio en Servicio Web + +**Fecha:** 10 de marzo de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Servicio Web, Dominio, Configuración, Staging + +## Descripción del Problema + +**Contexto:** Al configurar un servicio web en SleakOps, la configuración del dominio resultó en una estructura incorrecta de subdominio anidado en lugar del subdominio limpio deseado. 
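La estructura "frankenstein" descrita puede detectarse antes de guardar: el nombre de la app no debería aparecer dos veces en el dominio. Boceto mínimo (los nombres son los del ejemplo):

```shell
#!/bin/sh
# Detecta un dominio anidado: el nombre de la app aparece dos veces
# (p. ej. supra.staging.supra.social en lugar de staging.supra.social).
dominio_anidado() {  # uso: dominio_anidado <dominio> <app>
  case "$1" in
    *"$2"*"$2"*) return 0 ;;  # la app aparece al menos dos veces
    *) return 1 ;;
  esac
}

# Uso típico antes de guardar la configuración:
# dominio_anidado "supra.staging.supra.social" supra && echo "dominio mal anidado"
```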
+ +**Síntomas Observados:** + +- El dominio aparece como `supra.staging.supra.social` (estructura anidada) +- El dominio esperado debería ser `staging.supra.social` (subdominio limpio) +- La configuración del dominio creó una estructura "frankenstein" +- El servicio web es accesible pero con URL incorrecta + +**Configuración Relevante:** + +- Dominio actual: `supra.staging.supra.social` +- Dominio deseado: `staging.supra.social` +- Tipo de servicio: Servicio web +- Entorno: Staging + +**Condiciones de Error:** + +- Configuración incorrecta del dominio durante la configuración del servicio web +- Problema de anidamiento de dominio creando subdominio redundante +- Necesidad de reconfigurar el dominio del servicio web existente + +## Solución Detallada + + + +Para corregir la configuración del dominio en tu servicio web: + +1. **Navega a tu proyecto** en el panel de SleakOps +2. **Ve a la sección de Cargas de trabajo (Workloads)** +3. **Encuentra tu servicio web** en la lista +4. **Haz clic en el servicio web** para abrir su configuración +5. **Haz clic en Editar** o en el ícono de configuración + + + + + +En la configuración del servicio web: + +1. **Ubica la sección de Configuración de Dominio** +2. **Limpia el campo del dominio actual** si muestra el dominio incorrecto +3. **Ingresa el dominio correcto**: `staging.supra.social` +4. **Guarda la configuración** + +```yaml +# Ejemplo de configuración +domain: staging.supra.social +# NO: supra.staging.supra.social +``` + +**Importante:** Asegúrate de ingresar solo el subdominio deseado sin duplicar el dominio base. + + + + + +Después de actualizar la configuración del dominio: + +1. **Espera de 5 a 10 minutos** para que los cambios se propaguen +2. **Verifica el estado del despliegue** en el panel de cargas de trabajo +3. **Confirma que el nuevo dominio** funcione correctamente +4. 
**Prueba el acceso** a `staging.supra.social` + +Si el dominio antiguo sigue en caché, puede ser necesario: + +- Limpiar la caché del navegador +- Esperar la propagación DNS (hasta 24 horas en algunos casos) +- Usar modo incógnito/navegación privada para probar + + + + + +Si aún experimentas problemas después del cambio de configuración: + +**Revisa la Configuración DNS:** + +```bash +# Prueba la resolución DNS +nslookup staging.supra.social + +# Verifica si el dominio apunta a la IP correcta +dig staging.supra.social +``` + +**Verifica el Certificado SSL:** + +- El certificado SSL debería actualizarse automáticamente para el nuevo dominio +- Si ves advertencias SSL, espera unos minutos más para la provisión del certificado + +**Problemas Comunes:** + +- **Dominio no resuelve:** Verifica que los registros DNS estén configurados correctamente +- **Errores de certificado SSL:** Espera la provisión automática del certificado +- **El dominio antiguo sigue funcionando:** Es normal durante el período de transición + + + + + +Para prevenir problemas similares en el futuro: + +**Buenas Prácticas:** + +1. **Planifica la estructura de tu dominio** antes de crear el servicio web +2. **Usa convenciones claras de nombres:** `[entorno].[app].[dominio]` +3. **Revisa dos veces las entradas del dominio** antes de guardar la configuración +4. 
**Prueba la configuración del dominio** primero en un entorno de desarrollo + +**Ejemplos de Estructura de Dominio:** + +``` +# Buenos ejemplos +staging.miapp.com +api.miapp.com +dev.miapp.com + +# Evita duplicaciones anidadas +# Malo: miapp.staging.miapp.com +# Malo: api.miapp.api.miapp.com +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 10 de marzo de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-reset-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-reset-issue.mdx new file mode 100644 index 000000000..dd28c26ae --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/web-service-domain-reset-issue.mdx @@ -0,0 +1,114 @@ +--- +sidebar_position: 3 +title: "Restablecimiento del Dominio del Servicio Web Durante la Actualización de Réplicas" +description: "Problema donde el nombre de dominio se restablece al nombre del proyecto al editar réplicas del servicio web" +date: "2025-02-20" +category: "workload" +tags: ["webservice", "domain", "replicas", "frontend", "configuration"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Restablecimiento del Dominio del Servicio Web Durante la Actualización de Réplicas + +**Fecha:** 20 de febrero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** ServicioWeb, Dominio, Réplicas, Frontend, Configuración + +## Descripción del Problema + +**Contexto:** Al editar la configuración de un Servicio Web para modificar el número de réplicas, el nombre de dominio cambia inesperadamente del dominio de producción actual al nombre predeterminado del proyecto, causando posible tiempo de inactividad. 
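Una forma sencilla de protegerse contra este comportamiento es anotar el dominio antes de editar y compararlo después de guardar. Boceto mínimo en shell (los valores del ejemplo son ilustrativos):

```shell
#!/bin/sh
# Devuelve éxito si el dominio cambió y el nuevo valor coincide con el
# nombre del proyecto, es decir, el patrón exacto de este bug.
dominio_restablecido() {  # uso: dominio_restablecido <antes> <despues> <proyecto>
  [ "$1" != "$2" ] && [ "$2" = "$3" ]
}

# Uso típico tras guardar un cambio de réplicas:
# ANTES="app.ejemplo.com"   # anotado antes de editar
# DESPUES="mi-proyecto"     # leído del panel tras guardar
# if dominio_restablecido "$ANTES" "$DESPUES" "mi-proyecto"; then
#   echo "ALERTA: el dominio se restableció al nombre del proyecto"
# fi
```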
+ +**Síntomas Observados:** + +- El nombre de dominio se restablece al nombre del proyecto al editar el Servicio Web +- Ocurre al hacer cambios simples como actualizaciones del conteo de réplicas +- Causa tiempo de inactividad temporal del sitio web si no se detecta inmediatamente +- El frontend muestra el nombre del proyecto en lugar del dominio de producción actual + +**Configuración Relevante:** + +- Componente: Configuración del Servicio Web +- Acción: Edición del conteo de réplicas +- Comportamiento esperado: El dominio debe permanecer sin cambios +- Comportamiento real: El dominio se restablece al nombre del proyecto + +**Condiciones de Error:** + +- Ocurre durante el proceso de edición del Servicio Web +- Sucede al modificar cualquier parámetro del Servicio Web +- Resulta en cambios no deseados en el dominio +- Puede causar interrupción del servicio + +## Solución Detallada + + + +Para evitar cambios accidentales en el dominio al editar Servicios Web: + +1. **Verifique siempre el campo de dominio** antes de guardar los cambios +2. **Compruebe que el dominio coincida con su URL de producción actual** +3. **Si el dominio se ha restablecido, corríjalo manualmente** a su dominio de producción +4. **Guarde la configuración** solo después de verificar todos los campos + +**Importante:** Siempre revise el campo de dominio incluso cuando realice cambios no relacionados como ajustes en el conteo de réplicas. + + + + + +Actualmente, el formulario de edición del Servicio Web puede mostrar el nombre del proyecto como dominio predeterminado en lugar de preservar el dominio de producción existente. Este es un problema conocido del frontend que puede causar: + +- Cambios inesperados en el dominio +- Interrupciones del servicio +- Necesidad de corrección manual del dominio +- Posible tiempo de inactividad si no se detecta inmediatamente + + + + + +Para minimizar riesgos al editar Servicios Web: + +1. 
**Revise todos los campos** antes de guardar, no solo los que planeaba cambiar +2. **Anote su dominio actual** antes de comenzar el proceso de edición +3. **Realice cambios durante ventanas de mantenimiento** cuando sea posible +4. **Tenga monitoreo activo** para detectar rápidamente cambios en el dominio +5. **Considere hacer cambios en lotes pequeños** para limitar el impacto + + + + + +Para detectar rápidamente si su dominio ha sido cambiado accidentalmente: + +1. **Configure monitoreo externo** para sus URLs de producción +2. **Configure alertas** para cambios en la respuesta HTTP +3. **Monitoree la resolución DNS** de sus dominios +4. **Use chequeos de salud** que verifiquen que el servicio correcto está respondiendo + +```bash +# Ejemplo de script de monitoreo +curl -f https://su-dominio-de-produccion.com/health || echo "Problema detectado en el dominio" +``` + + + + + +Si ha cambiado el dominio accidentalmente: + +1. **Edite el Servicio Web nuevamente de inmediato** +2. **Corrija el campo de dominio** a su dominio de producción +3. **Guarde la configuración** +4. **Espere a que se complete el despliegue** +5. **Verifique que el servicio sea accesible** en la URL correcta +6. **Confirme que todo el tráfico se enrute correctamente** + +El proceso de recuperación típicamente toma unos minutos para que los cambios se propaguen. 
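Como referencia, los pasos de verificación anteriores pueden automatizarse con un pequeño script. Es un boceto mínimo en shell bajo supuestos de ejemplo: el dominio `su-dominio-de-produccion.com` y la ruta `/health` son hipotéticos, y el código HTTP se simula en una variable (en un caso real provendría de `curl`):

```shell
#!/bin/sh
# Boceto: comprobar que el dominio corregido vuelve a responder.
DOMINIO="su-dominio-de-produccion.com"   # dominio hipotético de ejemplo

# En un caso real, el código HTTP se obtendría con:
#   CODIGO=$(curl -s -o /dev/null -w "%{http_code}" "https://$DOMINIO/health")
# Aquí se simula para ilustrar solo la lógica de decisión:
CODIGO=200

# Un código 2xx/3xx indica que el dominio enruta de nuevo al servicio correcto
if [ "$CODIGO" -ge 200 ] && [ "$CODIGO" -lt 400 ]; then
  echo "OK: $DOMINIO responde (HTTP $CODIGO)"
else
  echo "FALLO: $DOMINIO devolvió HTTP $CODIGO; revise el campo de dominio"
fi
```

Cualquier valor fuera del rango 200–399 sugiere repetir los pasos de corrección y esperar a que el despliegue termine de propagarse.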
+ + + +--- + +_Esta FAQ fue generada automáticamente el 20 de febrero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/webservice-alias-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/webservice-alias-configuration.mdx new file mode 100644 index 000000000..052e8ebd1 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/webservice-alias-configuration.mdx @@ -0,0 +1,165 @@ +--- +sidebar_position: 3 +title: "Configuración de Alias de Servicios Web" +description: "Cómo crear y administrar alias para servicios web en SleakOps" +date: "2024-12-19" +category: "workload" +tags: ["servicioweb", "alias", "configuración", "redes"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Alias de Servicios Web + +**Fecha:** 19 de diciembre de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Servicio Web, Alias, Configuración, Redes + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan crear alias de dominio personalizados para sus servicios web en SleakOps para proporcionar URLs alternativas o nombres de dominio personalizados para acceder a sus aplicaciones. 
+ +**Síntomas Observados:** + +- Necesidad de nombres de dominio personalizados para servicios web +- Requisito de acceder a servicios a través de múltiples URLs +- Deseo de usar nombres de dominio con marca o amigables para el usuario +- Necesidad de balanceo de carga entre múltiples puntos finales + +**Configuración Relevante:** + +- Despliegue de servicio web en SleakOps +- Nombres de dominio personalizados o subdominios +- Requisitos de configuración DNS +- Gestión de certificados SSL/TLS + +**Condiciones de Error:** + +- Las URLs por defecto del servicio pueden no cumplir con los requisitos de marca +- Se necesitan múltiples puntos de acceso para el mismo servicio +- Requisitos de enrutamiento de dominio personalizado + +## Solución Detallada + + + +Para crear alias para tus servicios web en SleakOps: + +1. **Navega a los Detalles del Servicio Web**: + + - Ve al panel de tu proyecto + - Selecciona el servicio web que deseas configurar + - Haz clic en el nombre del servicio web para acceder a sus detalles + +2. **Accede a la Configuración de Alias**: + + - En la página de detalles del servicio web, busca la sección "Alias" o "Redes" + - Haz clic en "Agregar Alias" o botón similar + +3. **Configura el Alias**: + - Ingresa tu nombre de dominio personalizado (ej., `api.miempresa.com`) + - Selecciona el protocolo adecuado (HTTP/HTTPS) + - Configura reglas adicionales de enrutamiento si es necesario + + + + + +Antes de crear un alias, asegúrate de que tu DNS esté configurado correctamente: + +1. **Registro CNAME**: Crea un registro CNAME apuntando tu dominio personalizado al endpoint del servicio SleakOps +2. **Registro A**: Alternativamente, usa un registro A apuntando a la dirección IP del servicio +3. **Verificación**: Asegúrate de que la propagación DNS esté completa antes de probar + +```bash +# Ejemplo de configuración DNS +# Registro CNAME +api.miempresa.com. IN CNAME tu-servicio.sleakops.io. + +# O registro A +api.miempresa.com. 
IN A 192.168.1.100 +``` + + + + + +Para alias HTTPS, SleakOps puede gestionar automáticamente los certificados SSL: + +1. **Certificados Automáticos**: SleakOps puede aprovisionar automáticamente certificados Let's Encrypt +2. **Certificados Personalizados**: Sube tus propios certificados SSL si es necesario +3. **Renovación de Certificados**: La renovación automática la maneja la plataforma + +**Pasos para habilitar HTTPS**: + +- Habilita SSL/TLS en la configuración del alias +- Elige entre certificado automático o personalizado +- Verifica la instalación del certificado después de la creación + + + + + +Puedes crear múltiples alias para el mismo servicio web: + +1. **Dominios Diferentes**: Apunta múltiples dominios al mismo servicio +2. **Enrutamiento por Subdominios**: Usa diferentes subdominios para distintos propósitos +3. **Específico por Ambiente**: Crea alias para diferentes ambientes + +**Ejemplos de casos de uso**: + +- `api.miempresa.com` - API de producción +- `api-staging.miempresa.com` - Ambiente de staging +- `v1.api.miempresa.com` - Punto final específico de versión + + + + + +Si tu alias no funciona correctamente: + +1. **Propagación DNS**: Espera a que los cambios DNS se propaguen (hasta 48 horas) +2. **Problemas con Certificados**: Revisa el estado y validez del certificado SSL +3. **Reglas de Firewall**: Asegúrate de que no haya reglas de firewall bloqueando el tráfico +4. 
**Estado del Servicio**: Verifica que el servicio web subyacente esté funcionando correctamente + +**Comandos de verificación**: + +```bash +# Verificar resolución DNS +nslookup api.miempresa.com + +# Probar conectividad HTTP +curl -I http://api.miempresa.com + +# Probar conectividad HTTPS +curl -I https://api.miempresa.com +``` + + + + + +**Convenciones de Nombres**: + +- Usa nombres de subdominios descriptivos +- Sigue patrones de nombres consistentes +- Considera prefijos de ambiente (prod-, staging-, dev-) + +**Consideraciones de Seguridad**: + +- Usa siempre HTTPS para servicios en producción +- Implementa una gestión adecuada de certificados +- Considera usar certificados comodín para múltiples subdominios + +**Optimización de Rendimiento**: + +- Usa integración con CDN cuando esté disponible +- Configura encabezados de caché apropiados +- Monitorea el rendimiento y uso del alias + + + +--- + +_Esta FAQ fue generada automáticamente el 19 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/websocket-mixed-content-error.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/websocket-mixed-content-error.mdx new file mode 100644 index 000000000..024d36d33 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/websocket-mixed-content-error.mdx @@ -0,0 +1,203 @@ +--- +sidebar_position: 3 +title: "Error de Contenido Mixto en WebSocket - Se Requiere Protocolo WSS" +description: "Solución para el error de contenido mixto en WebSocket al conectar desde HTTPS a un endpoint WS" +date: "2024-12-18" +category: "general" +tags: ["websocket", "https", "ssl", "contenido-mixto", "seguridad"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error de Contenido Mixto en WebSocket - Se Requiere Protocolo WSS + +**Fecha:** 18 de diciembre de 2024 +**Categoría:** General +**Etiquetas:** WebSocket, 
HTTPS, SSL, Contenido Mixto, Seguridad + +## Descripción del Problema + +**Contexto:** Al intentar establecer una conexión WebSocket desde una página HTTPS a un endpoint WebSocket usando el protocolo `ws://`, los navegadores bloquean la conexión debido a políticas de seguridad de contenido mixto. + +**Síntomas Observados:** + +- Error de Contenido Mixto en la consola del navegador +- La conexión WebSocket no se establece +- Mensaje de error: "Esta solicitud ha sido bloqueada; este endpoint debe estar disponible a través de WSS" +- Intento de conexión desde página HTTPS a endpoint `ws://` es rechazado + +**Configuración Relevante:** + +- Frontend servido sobre: HTTPS +- Protocolo del endpoint WebSocket: `ws://` (inseguro) +- Política de seguridad del navegador: bloqueo de contenido mixto activado +- Formato de URL WebSocket: `ws://domain/ws/path/?token=...` + +**Condiciones de Error:** + +- El error ocurre cuando una página HTTPS intenta conectarse a un endpoint `ws://` +- Navegadores modernos aplican políticas de contenido mixto +- La conexión es bloqueada antes de establecerse +- El problema afecta a todos los contextos seguros (páginas HTTPS) + +## Solución Detallada + + + +La solución principal es usar el protocolo seguro de WebSocket (`wss://`) en lugar del protocolo inseguro (`ws://`): + +**Antes (causa error):** + +```javascript +const websocket = new WebSocket("ws://apiqa.simplee.cl/ws/lead/?token=..."); +``` + +**Después (correcto):** + +```javascript +const websocket = new WebSocket("wss://apiqa.simplee.cl/ws/lead/?token=..."); +``` + +Este cambio asegura que: + +- La conexión use cifrado SSL/TLS +- Se cumplan las políticas de contenido mixto del navegador +- La comunicación sea segura de extremo a extremo + + + + + +Verifica que tu servidor WebSocket esté configurado para manejar conexiones seguras: + +**Para aplicaciones Node.js:** + +```javascript +const https = require("https"); +const WebSocket = require("ws"); +const fs = require("fs"); + +const 
server = https.createServer({ + cert: fs.readFileSync("ruta/al/cert.pem"), + key: fs.readFileSync("ruta/al/key.pem"), +}); + +const wss = new WebSocket.Server({ server }); +``` + +**Para proxy inverso (Nginx):** + +```nginx +server { + listen 443 ssl; + server_name apiqa.simplee.cl; + + ssl_certificate /ruta/al/cert.pem; + ssl_certificate_key /ruta/al/key.pem; + + location /ws/ { + proxy_pass http://backend; + proxy_http_version 1.1; + proxy_set_header Upgrade $http_upgrade; + proxy_set_header Connection "upgrade"; + proxy_set_header Host $host; + } +} +``` + + + + + +Para aplicaciones que necesitan funcionar tanto en entornos HTTP como HTTPS, usa selección dinámica del protocolo: + +```javascript +function getWebSocketURL(path) { + const protocol = window.location.protocol === "https:" ? "wss:" : "ws:"; + const host = window.location.host; + return `${protocol}//${host}${path}`; +} + +// Uso +const wsUrl = getWebSocketURL("/ws/lead/?token=..."); +const websocket = new WebSocket(wsUrl); +``` + +O usa URLs relativas que heredan automáticamente el protocolo de la página: + +```javascript +// Esto usa automáticamente wss:// en páginas HTTPS y ws:// en páginas HTTP +const websocket = new WebSocket( + `${window.location.protocol === "https:" ? "wss:" : "ws:"}//${ + window.location.host + }/ws/lead/?token=...` +); +``` + + + + + +Si aún experimentas problemas después de cambiar a `wss://`: + +1. **Verifica el certificado SSL:** + + ```bash + openssl s_client -connect apiqa.simplee.cl:443 -servername apiqa.simplee.cl + ``` + +2. **Prueba el endpoint WebSocket:** + + ```bash + # Usando la herramienta websocat + websocat wss://apiqa.simplee.cl/ws/lead/ + ``` + +3. **Revisa las herramientas de desarrollo del navegador:** + + - Abre la pestaña Red (Network) + - Busca conexiones WebSocket + - Verifica errores SSL/TLS + +4. 
**Problemas comunes:** + - Certificados autofirmados (usa certificado SSL válido) + - Bloqueo de puertos (asegura que el puerto 443 esté abierto) + - Reglas de firewall (permite tráfico WSS) + - Configuración de balanceador de carga (asegura soporte para WebSocket) + + + + + +Al implementar conexiones WSS: + +1. **Usa siempre WSS en producción:** + + - Nunca uses `ws://` en entornos de producción + - Encripta toda la comunicación WebSocket + +2. **Valida certificados SSL:** + + - Usa certificados de autoridades certificadoras confiables + - Evita certificados autofirmados en producción + +3. **Implementa autenticación adecuada:** + + ```javascript + const token = getAuthToken(); // Tu mecanismo de autenticación + const websocket = new WebSocket(`wss://api.dominio.com/ws/?token=${token}`); + ``` + +4. **Maneja errores de conexión de forma adecuada:** + ```javascript + websocket.onerror = function (error) { + console.error("Error en WebSocket:", error); + // Implementa lógica de reconexión + }; + ``` + + + +--- + +_Esta FAQ fue generada automáticamente el 18 de diciembre de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-502-bad-gateway-nestjs.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-502-bad-gateway-nestjs.mdx new file mode 100644 index 000000000..fabfd4409 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-502-bad-gateway-nestjs.mdx @@ -0,0 +1,608 @@ +--- +sidebar_position: 15 +title: "Error 502 Bad Gateway con Aplicación NestJS" +description: "Solución para errores 502 Bad Gateway cuando los pods de NestJS están en ejecución pero los endpoints API no son accesibles" +date: "2025-01-15" +category: "workload" +tags: ["502", "bad-gateway", "nestjs", "api", "troubleshooting"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Error 502 Bad Gateway con 
Aplicación NestJS + +**Fecha:** 15 de enero de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** 502, Bad Gateway, NestJS, API, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario tiene una aplicación NestJS desplegada en Kubernetes que muestra logs normales de inicio y parece estar ejecutándose correctamente, pero los endpoints API devuelven errores 502 Bad Gateway. + +**Síntomas observados:** + +- Los pods están en ejecución y muestran logs normales de inicio de NestJS +- Los módulos de la aplicación se inicializan correctamente (TypeORM, Config, Logger, etc.) +- Las rutas están mapeadas correctamente (`/health`, `/session`) +- Las solicitudes API devuelven error `502 Bad Gateway` +- Fallan tanto las solicitudes GET como POST + +**Configuración relevante:** + +- Aplicación: NestJS con TypeORM +- Nombre del servicio: `rattlesnake-develop` +- Número de pods: 2 pods en ejecución +- Rutas: `/health` (GET), `/session` (POST) + +**Condiciones del error:** + +- El error ocurre tras un inicio exitoso de la aplicación +- Afecta a todos los endpoints API +- Sucede a pesar de que los pods aparecen como saludables en Kubernetes +- El problema se resuelve generando un nuevo despliegue + +## Solución detallada + + + +Un error 502 Bad Gateway en Kubernetes típicamente indica que el controlador de ingreso o el servicio puede alcanzar el pod, pero el pod no responde correctamente a las solicitudes HTTP. Las causas comunes incluyen: + +1. **Desajuste de puerto**: La aplicación escucha en un puerto diferente al que espera el servicio +2. **Fallos en las comprobaciones de salud**: Las sondas de readiness/liveness fallan +3. **Aplicación no completamente lista**: La app parece iniciada pero el servidor HTTP no está escuchando +4. 
**Problemas con el selector del servicio**: El servicio no enruta a los pods correctos + + + + + +Verifica si tu aplicación NestJS está escuchando en el puerto correcto: + +```typescript +// En tu archivo main.ts +import { NestFactory } from "@nestjs/core"; +import { AppModule } from "./app.module"; + +async function bootstrap() { + const app = await NestFactory.create(AppModule); + const port = process.env.PORT || 3000; + await app.listen(port, "0.0.0.0"); // Importante: enlazar a 0.0.0.0 + console.log(`La aplicación está ejecutándose en: ${await app.getUrl()}`); +} +bootstrap(); +``` + +Asegúrate de que el servicio de Kubernetes coincida con este puerto: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: rattlesnake-develop +spec: + selector: + app: rattlesnake-develop + ports: + - port: 80 + targetPort: 3000 # Debe coincidir con el puerto de tu app + protocol: TCP +``` + + + + + +Agrega endpoints de comprobación de salud y configura las sondas de Kubernetes: + +```typescript +// Añadir a tu controlador de la app +@Get('/health') +getHealth() { + return { + status: 'ok', + timestamp: new Date().toISOString(), + uptime: process.uptime() + }; +} + +@Get('/ready') +getReadiness() { + // Añade cualquier chequeo de readiness (conexión a base de datos, etc.) 
+ return { status: 'ready' }; +} +``` + +Configura las sondas en Kubernetes: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: + template: + spec: + containers: + - name: app + image: your-image + ports: + - containerPort: 3000 + livenessProbe: + httpGet: + path: /health + port: 3000 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /ready + port: 3000 + initialDelaySeconds: 5 + periodSeconds: 5 +``` + + + + + +Usa estos comandos para depurar la conexión: + +```bash +# Verificar si los pods están listos +kubectl get pods -l app=rattlesnake-develop + +# Verificar endpoints del servicio +kubectl get endpoints rattlesnake-develop + +# Probar conectividad directa al pod +kubectl port-forward pod/rattlesnake-develop-xxx 3000:3000 +# Luego probar: curl http://localhost:3000/health + +# Verificar conectividad al servicio +kubectl port-forward service/rattlesnake-develop 8080:80 +# Luego probar: curl http://localhost:8080/health + +# Revisar logs del pod para arranque del servidor HTTP +kubectl logs -f deployment/rattlesnake-develop +``` + + + + + +Asegúrate de que tu aplicación NestJS esté configurada correctamente para entornos contenerizados: + +```typescript +import { NestFactory } from "@nestjs/core"; +import { AppModule } from "./app.module"; + +// Habilitar apagado ordenado +async function bootstrap() { + const app = await NestFactory.create(AppModule); + + // Habilitar hooks para apagado + app.enableShutdownHooks(); + + // Configurar CORS si es necesario + app.enableCors(); + + // Prefijo global (opcional) + app.setGlobalPrefix('api'); + + // Enlazar a todas las interfaces + await app.listen(process.env.PORT || 3000, '0.0.0.0'); +} +bootstrap(); +``` + +Verifica la configuración de tu base de datos para entornos contenerizados: + +```typescript +import { Module } from "@nestjs/common"; +import { TypeOrmModule } from "@nestjs/typeorm"; + +// Configuración de TypeORM +@Module({ + imports: [ + TypeOrmModule.forRootAsync({ + useFactory: () => ({ + type: "postgres", + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT) || 5432, + username: process.env.DB_USERNAME, 
password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + synchronize: false, // Nunca true en producción + retryAttempts: 3, + retryDelay: 3000, + }), + }), + ], +}) +export class AppModule {} +``` + + + + + +Si necesitas una solución inmediata, puedes forzar un nuevo despliegue: + +```bash +# Opción 1: Reiniciar el despliegue +kubectl rollout restart deployment/rattlesnake-develop + +# Opción 2: Escalar a 0 y luego de vuelta +kubectl scale deployment rattlesnake-develop --replicas=0 +kubectl scale deployment rattlesnake-develop --replicas=2 + +# Opción 3: Eliminar pods para forzar recreación +kubectl delete pods -l app=rattlesnake-develop +``` + +Sin embargo, es importante identificar la causa raíz para evitar que el problema se repita. + + + + + +Revisa la configuración del Ingress para asegurar que esté enrutando correctamente: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: rattlesnake-develop-ingress + annotations: + nginx.ingress.kubernetes.io/rewrite-target: / + nginx.ingress.kubernetes.io/proxy-connect-timeout: "30" + nginx.ingress.kubernetes.io/proxy-send-timeout: "30" + nginx.ingress.kubernetes.io/proxy-read-timeout: "30" +spec: + rules: + - host: your-domain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: rattlesnake-develop + port: + number: 80 +``` + +Verifica el estado del Ingress: + +```bash +# Verificar configuración del Ingress +kubectl describe ingress rattlesnake-develop-ingress + +# Verificar logs del controlador de Ingress +kubectl logs -n ingress-nginx deployment/ingress-nginx-controller + +# Probar conectividad desde el controlador de Ingress +kubectl exec -n ingress-nginx deployment/ingress-nginx-controller -- curl -I http://rattlesnake-develop.default.svc.cluster.local/health +``` + + + + + +Asegúrate de que todas las variables de entorno necesarias estén configuradas: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: 
+ template: + spec: + containers: + - name: app + image: your-nestjs-app:latest + env: + - name: NODE_ENV + value: "production" + - name: PORT + value: "3000" + - name: DB_HOST + value: "your-db-host" + - name: DB_PORT + value: "5432" + - name: DB_USERNAME + valueFrom: + secretKeyRef: + name: db-secret + key: username + - name: DB_PASSWORD + valueFrom: + secretKeyRef: + name: db-secret + key: password + - name: DB_NAME + value: "your-database" +``` + +Verifica que las variables estén disponibles en el pod: + +```bash +# Verificar variables de entorno en el pod +kubectl exec -it deployment/rattlesnake-develop -- env | grep -E "(PORT|DB_|NODE_ENV)" + +# Verificar conectividad a la base de datos +kubectl exec -it deployment/rattlesnake-develop -- nc -zv your-db-host 5432 +``` + + + + + +Implementa logging detallado para diagnosticar problemas futuros: + +```typescript +// Configurar logger personalizado +import { Logger } from '@nestjs/common'; + +@Injectable() +export class AppService { + private readonly logger = new Logger(AppService.name); + + onModuleInit() { + this.logger.log('AppService initialized'); + this.logger.log(`Server starting on port ${process.env.PORT || 3000}`); + } + + @Get('/health') + getHealth() { + this.logger.log('Health check requested'); + return { + status: 'ok', + timestamp: new Date().toISOString(), + uptime: process.uptime(), + memory: process.memoryUsage(), + env: process.env.NODE_ENV + }; + } +} +``` + +Configurar middleware de logging para requests: + +```typescript +// Middleware de logging +import { Injectable, NestMiddleware, Logger } from '@nestjs/common'; +import { Request, Response, NextFunction } from 'express'; + +@Injectable() +export class LoggerMiddleware implements NestMiddleware { + private logger = new Logger('HTTP'); + + use(req: Request, res: Response, next: NextFunction): void { + const { ip, method, originalUrl } = req; + const userAgent = req.get('User-Agent') || ''; + + res.on('close', () => { + const { statusCode 
} = res; + const contentLength = res.get('Content-Length'); + + this.logger.log( + `${method} ${originalUrl} ${statusCode} ${contentLength} - ${userAgent} ${ip}` + ); + }); + + next(); + } +} +``` + + + + + +Configura recursos apropiados para evitar problemas de rendimiento: + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: rattlesnake-develop +spec: + template: + spec: + containers: + - name: app + image: your-nestjs-app:latest + resources: + requests: + memory: "256Mi" + cpu: "250m" + limits: + memory: "512Mi" + cpu: "500m" + # Configurar startup probe para apps que tardan en iniciar + startupProbe: + httpGet: + path: /health + port: 3000 + failureThreshold: 30 + periodSeconds: 10 +``` + +Optimizar la aplicación NestJS: + +```typescript +// Configurar timeouts y límites +async function bootstrap() { + const app = await NestFactory.create(AppModule, { + logger: ['error', 'warn', 'log'], + }); + + // Configurar timeouts + app.use((req, res, next) => { + res.setTimeout(30000, () => { + res.status(408).send('Request Timeout'); + }); + next(); + }); + + // Configurar límites de payload + app.use(express.json({ limit: '10mb' })); + app.use(express.urlencoded({ extended: true, limit: '10mb' })); + + await app.listen(process.env.PORT || 3000, '0.0.0.0'); +} +``` + + + + + +Diagnosticar y resolver problemas de conexión a la base de datos: + +```typescript +// Configuración robusta de TypeORM +@Module({ + imports: [ + TypeOrmModule.forRootAsync({ + useFactory: () => ({ + type: 'postgres', + host: process.env.DB_HOST, + port: parseInt(process.env.DB_PORT) || 5432, + username: process.env.DB_USERNAME, + password: process.env.DB_PASSWORD, + database: process.env.DB_NAME, + synchronize: false, + logging: process.env.NODE_ENV === 'development', + retryAttempts: 5, + retryDelay: 3000, + autoLoadEntities: true, + keepConnectionAlive: true, + extra: { + connectionLimit: 10, + acquireTimeout: 60000, + timeout: 60000, + }, + }), + }), + ], +}) +export class 
DatabaseModule {} +``` + +Implementar health check para la base de datos: + +```typescript +import { Controller, Get, ServiceUnavailableException } from "@nestjs/common"; +import { InjectConnection } from "@nestjs/typeorm"; +import { Connection } from "typeorm"; + +// Las rutas deben declararse en un controlador, no en un servicio @Injectable +@Controller() +export class HealthController { + constructor( + @InjectConnection() private connection: Connection, + ) {} + + @Get('/health/db') + async checkDatabase() { + try { + await this.connection.query('SELECT 1'); + return { status: 'ok', database: 'connected' }; + } catch (error) { + throw new ServiceUnavailableException('Database connection failed'); + } + } +} +``` + + + + + +Configurar seguridad apropiada para evitar bloqueos: + +```typescript +import helmet from 'helmet'; +import rateLimit from 'express-rate-limit'; + +// Configuración de CORS y seguridad +async function bootstrap() { + const app = await NestFactory.create(AppModule); + + // Configurar CORS + app.enableCors({ + origin: process.env.ALLOWED_ORIGINS?.split(',') || '*', + methods: ['GET', 'POST', 'PUT', 'DELETE', 'PATCH'], + allowedHeaders: ['Content-Type', 'Authorization'], + credentials: true, + }); + + // Configurar helmet para seguridad + app.use(helmet({ + contentSecurityPolicy: false, // Ajustar según necesidades + })); + + // Rate limiting + app.use(rateLimit({ + windowMs: 15 * 60 * 1000, // 15 minutos + max: 100, // límite de requests por ventana + })); + + await app.listen(process.env.PORT || 3000, '0.0.0.0'); +} +``` + +Configurar headers de seguridad en Kubernetes: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: rattlesnake-develop-ingress + annotations: + nginx.ingress.kubernetes.io/configuration-snippet: | + add_header X-Frame-Options "SAMEORIGIN" always; + add_header X-Content-Type-Options "nosniff" always; + add_header X-XSS-Protection "1; mode=block" always; +``` + + + +## Lista de Verificación para Resolución + +### Verificaciones Inmediatas + +- [ ] Verificar que los pods estén en estado `Running` y `Ready` +- [ ] Confirmar que el puerto de la aplicación coincida con el `targetPort` del servicio +- [ ] Probar conectividad directa al pod usando `port-forward` +- [ ] Verificar logs de la aplicación para errores de inicio +- [ ] 
Confirmar que las variables de entorno estén configuradas correctamente + +### Verificaciones de Configuración + +- [ ] Validar configuración del servicio Kubernetes +- [ ] Revisar configuración del Ingress +- [ ] Verificar sondas de salud (liveness/readiness) +- [ ] Confirmar configuración de CORS si es necesario +- [ ] Validar conectividad a la base de datos + +### Verificaciones Avanzadas + +- [ ] Revisar logs del controlador de Ingress +- [ ] Verificar métricas de recursos (CPU/memoria) +- [ ] Confirmar configuración de timeouts +- [ ] Validar configuración de seguridad +- [ ] Revisar configuración de red y DNS + +## Mejores Prácticas + +### Desarrollo + +1. **Siempre enlazar a `0.0.0.0`** en aplicaciones contenerizadas +2. **Implementar health checks robustos** que verifiquen dependencias +3. **Configurar logging detallado** para facilitar debugging +4. **Usar variables de entorno** para toda la configuración + +### Despliegue + +1. **Configurar sondas apropiadas** con timeouts realistas +2. **Establecer límites de recursos** para evitar problemas de rendimiento +3. **Implementar monitoreo** de métricas clave +4. **Documentar configuración** específica del entorno + +### Operaciones + +1. **Monitorear logs regularmente** para detectar patrones +2. **Configurar alertas** para errores 502 +3. **Mantener runbooks** para resolución rápida +4. 
**Realizar pruebas de conectividad** periódicas + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de enero de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-cron-job-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-cron-job-configuration.mdx new file mode 100644 index 000000000..f8fcf0385 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-cron-job-configuration.mdx @@ -0,0 +1,174 @@ +--- +sidebar_position: 15 +title: "Configuración de Trabajos Cron en SleakOps" +description: "Cómo configurar expresiones cron para cargas de trabajo programadas en SleakOps" +date: "2024-08-15" +category: "workload" +tags: ["cron", "programación", "carga de trabajo", "configuración"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Trabajos Cron en SleakOps + +**Fecha:** 15 de agosto de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Cron, Programación, Carga de trabajo, Configuración + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan configurar expresiones cron para cargas de trabajo programadas en SleakOps, pero pueden no tener acceso directo a un campo de texto para ingresar la expresión cron en la versión actual de la plataforma. 
+ +**Síntomas Observados:** + +- Necesidad de configurar trabajos programados con requisitos de tiempo específicos +- Opciones limitadas en la interfaz para la configuración de expresiones cron +- Requisito de patrones de programación personalizados más allá de los preajustes básicos + +**Configuración Relevante:** + +- Tipo de carga de trabajo: trabajos programados/trabajos cron +- Plataforma: gestión de cargas de trabajo en SleakOps +- Requisitos de programación: expresiones cron personalizadas + +**Condiciones de Error:** + +- Dificultad para configurar patrones de programación complejos +- Opciones limitadas de programación en la interfaz actual +- Necesidad de control preciso del tiempo + +## Solución Detallada + + + +Actualmente, puedes configurar trabajos cron en SleakOps a través de la interfaz de configuración de cargas de trabajo: + +1. **Navega a Cargas de Trabajo** en tu proyecto SleakOps +2. **Crea o Edita** una carga de trabajo +3. **Selecciona Tipo de Carga de Trabajo**: Elige "Trabajo Cron" o "Tarea Programada" +4. 
**Configura el Horario**: Usa las opciones de programación disponibles + +**Métodos de programación disponibles:** + +- Intervalos predefinidos (cada hora, diario, semanal) +- Selección de tiempo personalizada mediante componentes de la interfaz +- Configuración avanzada mediante editor YAML + + + + + +Para expresiones cron complejas, puedes usar el editor YAML: + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: + name: mi-trabajo-programado +spec: + schedule: "0 2 * * 1-5" # Ejecutar a las 2 AM, de lunes a viernes + jobTemplate: + spec: + template: + spec: + containers: + - name: mi-contenedor + image: mi-app:latest + command: ["/bin/sh"] + args: ["-c", "echo 'Ejecutando tarea programada'"] + restartPolicy: OnFailure +``` + +**Patrones comunes de expresiones cron:** + +- `0 0 * * *` - Diario a medianoche +- `0 */6 * * *` - Cada 6 horas +- `0 9 * * 1-5` - Días laborables a las 9 AM +- `*/15 * * * *` - Cada 15 minutos + + + + + +Las expresiones cron en Kubernetes siguen este formato: + +``` +┌───────────── minuto (0 - 59) +│ ┌───────────── hora (0 - 23) +│ │ ┌───────────── día del mes (1 - 31) +│ │ │ ┌───────────── mes (1 - 12) +│ │ │ │ ┌───────────── día de la semana (0 - 6) (domingo a sábado) +│ │ │ │ │ +│ │ │ │ │ +* * * * * +``` + +**Caracteres especiales:** + +- `*` - Cualquier valor +- `,` - Separador de lista de valores +- `-` - Rango de valores +- `/` - Valores por paso +- `?` - Sin valor específico (solo día del mes/semana) + +**Ejemplos:** + +- `0 0 1 * *` - Primer día de cada mes a medianoche +- `0 */2 * * *` - Cada 2 horas +- `0 9-17 * * 1-5` - Cada hora de 9 AM a 5 PM, de lunes a viernes + + + + + +**Configuración Mejorada de Cron (Próximamente):** + +El equipo de SleakOps está trabajando en funciones mejoradas para la configuración de trabajos cron: + +1. **Entrada Directa de Expresión Cron**: Un campo de texto dedicado para ingresar expresiones cron directamente +2. 
**Constructor Visual de Cron**: Interfaz interactiva para construir expresiones cron +3. **Validación de Expresiones**: Validación en tiempo real de la sintaxis cron +4. **Vista Previa de Programación**: Representación visual de cuándo se ejecutarán los trabajos + +**Soluciones Temporales Actuales:** + +- Usa el editor YAML para expresiones complejas +- Consulta generadores de expresiones cron en línea +- Prueba las expresiones primero en entornos de desarrollo + + + + + +**Problemas comunes y soluciones:** + +1. **El trabajo no se ejecuta en los horarios esperados:** + + - Verifica la configuración de zona horaria en tu clúster + - Revisa la sintaxis de la expresión cron + - Consulta el historial de trabajos en el panel de SleakOps + +2. **Los trabajos no se completan:** + + - Revisa límites y solicitudes de recursos + - Consulta los registros del contenedor + - Verifica la disponibilidad y permisos de la imagen + +3. **Monitoreo de trabajos cron:** + + ```bash + # Ver estado de trabajos cron + kubectl get cronjobs + + # Ver historial de ejecución de trabajos + kubectl get jobs + + # Ver registros de un trabajo específico + kubectl logs job/mi-trabajo-programado-1234567890 + ``` + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 15 de agosto de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-internal-configuration-issue.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-internal-configuration-issue.mdx new file mode 100644 index 000000000..ef983bddf --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-internal-configuration-issue.mdx @@ -0,0 +1,332 @@ +--- +sidebar_position: 3 +title: "Problema Interno de Configuración de Cargas de Trabajo" +description: "Imposible modificar las configuraciones internas de cargas de trabajo en la plataforma SleakOps" +date: "2025-03-28" +category: 
"workload" +tags: + [ + "carga de trabajo", + "interna", + "configuración", + "interfaz", + "solución de problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problema Interno de Configuración de Cargas de Trabajo + +**Fecha:** 28 de marzo de 2025 +**Categoría:** Carga de trabajo +**Etiquetas:** Carga de trabajo, Interna, Configuración, Interfaz, Solución de problemas + +## Descripción del Problema + +**Contexto:** Los usuarios que intentan modificar las configuraciones internas de cargas de trabajo a través de la interfaz de la plataforma SleakOps encuentran problemas de navegación en la interfaz que les impiden completar el proceso de configuración. + +**Síntomas Observados:** + +- El botón "Siguiente" en el paso de Esquema de Servicio se vuelve inoperante +- Incapacidad para avanzar en el asistente de configuración para cargas de trabajo internas +- El problema afecta consistentemente a múltiples cargas de trabajo internas +- Las cargas de trabajo públicas pueden ser modificadas sin problemas +- No se pueden guardar ni aplicar los cambios de configuración + +**Configuración Relevante:** + +- Tipo de carga de trabajo: Cargas de trabajo internas +- Paso afectado: Paso de configuración de Esquema de Servicio +- Componente de interfaz: Funcionalidad del botón Siguiente +- Alternativa funcional: Las cargas de trabajo públicas funcionan normalmente + +**Condiciones de Error:** + +- El error ocurre específicamente con tipos de cargas de trabajo internas +- El problema aparece durante el paso de Esquema de Servicio del asistente de configuración +- Afecta uniformemente a todas las cargas de trabajo internas +- No afecta configuraciones de cargas de trabajo públicas + +## Solución Detallada + + + +Mientras se resuelve el problema de la interfaz de la plataforma, puedes modificar las configuraciones internas de cargas de trabajo directamente usando Lens: + +1. **Accede a tu clúster a través de Lens** +2. 
**Navega a Cargas de Trabajo** → **Despliegues** +3. **Encuentra tu carga de trabajo interna** +4. **Edita el despliegue directamente** + +Para modificaciones comunes: + +```yaml +# Para cambiar el número de réplicas +spec: + replicas: 3 # Cambia este valor + +# Para modificar límites de recursos +spec: + template: + spec: + containers: + - name: tu-contenedor + resources: + limits: + cpu: "500m" + memory: "512Mi" + requests: + cpu: "250m" + memory: "256Mi" +``` + + + + + +También puedes usar kubectl para modificar cargas de trabajo internas: + +```bash +# Obtener la configuración actual del despliegue +kubectl get deployment <nombre-del-deployment> -o yaml > respaldo-carga-trabajo.yaml + +# Editar el despliegue +kubectl edit deployment <nombre-del-deployment> + +# O aplicar cambios desde un archivo +kubectl apply -f carga-trabajo-modificada.yaml + +# Verificar los cambios +kubectl get deployment <nombre-del-deployment> +kubectl describe deployment <nombre-del-deployment> +``` + + + + + +**Escalar réplicas:** + +```bash +kubectl scale deployment <nombre-del-deployment> --replicas=5 +``` + +**Actualizar límites de recursos:** + +```bash +kubectl patch deployment <nombre-del-deployment> -p '{ + "spec": { + "template": { + "spec": { + "containers": [{ + "name": "<nombre-del-contenedor>", + "resources": { + "limits": { + "cpu": "1000m", + "memory": "1Gi" + }, + "requests": { + "cpu": "500m", + "memory": "512Mi" + } + } + }] + } + } + } +}' +``` + +**Actualizar variables de entorno:** + +```bash +kubectl patch deployment <nombre-del-deployment> -p '{ + "spec": { + "template": { + "spec": { + "containers": [{ + "name": "<nombre-del-contenedor>", + "env": [ + {"name": "NUEVA_VARIABLE", "value": "nuevo-valor"}, + {"name": "OTRA_VARIABLE", "value": "otro-valor"} + ] + }] + } + } + } +}' +``` + + + + + +Para ayudar a diagnosticar el problema de la interfaz: + +**1. 
Verificar la consola del navegador:** + +```javascript +// Abrir herramientas de desarrollador (F12) +// Buscar errores en la consola durante el proceso de configuración +console.log("Verificando errores de JavaScript..."); + +// Verificar si hay errores de red +// Ir a la pestaña Network y buscar requests fallidos +``` + +**2. Verificar el estado del botón:** + +```javascript +// En la consola del navegador, verificar el estado del botón +// (':contains' no es un selector CSS válido, por lo que se busca por texto) +const nextButton = document.querySelector('[data-testid="next-button"]') || + Array.from(document.querySelectorAll("button")).find((b) => b.textContent.includes("Siguiente")); +console.log("Estado del botón:", nextButton); +console.log("Deshabilitado:", nextButton?.disabled); +console.log("Clases CSS:", nextButton?.className); +``` + +**3. Información útil para reportar:** + +- Versión del navegador +- Errores específicos en la consola +- Pasos exactos para reproducir el problema +- Capturas de pantalla del estado del botón + + + + + +**Respaldo antes de modificaciones:** + +```bash +# Crear respaldo completo del namespace +kubectl get all -n <namespace> -o yaml > respaldo-completo.yaml + +# Respaldar configuraciones específicas +kubectl get deployment <nombre-del-deployment> -o yaml > respaldo-deployment.yaml +kubectl get service <nombre-del-servicio> -o yaml > respaldo-service.yaml +kubectl get configmap <nombre-del-configmap> -o yaml > respaldo-configmap.yaml +``` + +**Validación de cambios:** + +```bash +# Verificar que los pods se reinicien correctamente +kubectl rollout status deployment/<nombre-del-deployment> + +# Verificar logs de los nuevos pods +kubectl logs -f deployment/<nombre-del-deployment> + +# Verificar que el servicio responda +kubectl port-forward deployment/<nombre-del-deployment> 8080:8080 +curl http://localhost:8080/health +``` + +**Monitoreo post-cambios:** + +```bash +# Verificar métricas de recursos +kubectl top pods -l app=<etiqueta-app> + +# Verificar eventos del deployment +kubectl describe deployment <nombre-del-deployment> + +# Verificar estado de los pods +kubectl get pods -l app=<etiqueta-app> -w +``` + + + + + +**Análisis de configuración YAML:** + +```bash +# Exportar configuración actual para análisis +kubectl get deployment <nombre-del-deployment> -o yaml 
> current-config.yaml + +# Validar sintaxis YAML (requiere PyYAML) +python -c "import yaml; yaml.safe_load(open('current-config.yaml'))" + +# Comparar con una configuración funcional conocida +diff working-config.yaml current-config.yaml +``` + +**Depuración de problemas de red:** + +```bash +# Verificar conectividad desde el pod +kubectl exec -it <nombre-del-pod> -- nslookup <nombre-del-servicio> +kubectl exec -it <nombre-del-pod> -- curl -v http://<nombre-del-servicio>:<puerto>/health + +# Verificar configuración de red del servicio +kubectl get endpoints <nombre-del-servicio> +kubectl describe service <nombre-del-servicio> +``` + +**Análisis de logs detallado:** + +```bash +# Logs con timestamps +kubectl logs <nombre-del-pod> --timestamps=true + +# Logs de contenedores específicos +kubectl logs <nombre-del-pod> -c <nombre-del-contenedor> + +# Logs de eventos del sistema +kubectl get events --sort-by=.metadata.creationTimestamp +``` + + + + + +**Rollback rápido:** + +```bash +# Ver historial de rollouts +kubectl rollout history deployment/<nombre-del-deployment> + +# Rollback a la versión anterior +kubectl rollout undo deployment/<nombre-del-deployment> + +# Rollback a una versión específica +kubectl rollout undo deployment/<nombre-del-deployment> --to-revision=2 +``` + +**Recuperación de configuración:** + +```bash +# Restaurar desde respaldo +kubectl apply -f respaldo-deployment.yaml + +# Forzar recreación de pods +kubectl delete pods -l app=<etiqueta-app> + +# Escalar a cero y luego restaurar +kubectl scale deployment <nombre-del-deployment> --replicas=0 +kubectl scale deployment <nombre-del-deployment> --replicas=3 +``` + +**Contacto con soporte:** + +Si el problema persiste: + +1. **Recopilar información de diagnóstico:** + - Logs de la consola del navegador + - Capturas de pantalla del problema + - Configuración actual de la carga de trabajo + +2. **Crear ticket de soporte** con: + - Descripción detallada del problema + - Pasos para reproducir + - Información del entorno (navegador, versión de SleakOps) + +3. 
**Mientras tanto, usar las alternativas** descritas anteriormente + + + +--- + +_Esta sección de preguntas frecuentes fue generada automáticamente el 28 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-job-resource-limits-missing.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-job-resource-limits-missing.mdx new file mode 100644 index 000000000..e9268c540 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-job-resource-limits-missing.mdx @@ -0,0 +1,158 @@ +--- +sidebar_position: 3 +title: "Límites de Recursos de Job No Aplicados en Pod de Kubernetes" +description: "Solución para cuando los límites de CPU y memoria definidos en trabajos de SleakOps no se aplican a pods de Kubernetes" +date: "2024-04-22" +category: "workload" +tags: ["job", "kubernetes", "recursos", "límites", "memoria", "cpu"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Límites de Recursos de Job No Aplicados en Pod de Kubernetes + +**Fecha:** 22 de abril de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Job, Kubernetes, Recursos, Límites, Memoria, CPU + +## Descripción del Problema + +**Contexto:** El usuario define límites de CPU y memoria para un trabajo en la plataforma SleakOps, pero el pod resultante de Kubernetes muestra `resources: {}` en lugar de las restricciones de recursos especificadas. + +**Síntomas Observados:** + +- Trabajo configurado con CPU (1000 min, 1500 max) y Memoria (2500 min, 3000 max) +- Pod generado de Kubernetes muestra `resources: {}` en la especificación +- El pod experimenta errores por falta de memoria debido a la ausencia de límites de recursos +- Mensaje de error: "El nodo tenía pocos recursos: memoria. 
El contenedor estaba usando 4068000Ki, la solicitud es 0" + +**Configuración Relevante:** + +- CPU Mínimo: 1000m, Máximo: 1500m +- Memoria Mínima: 2500Mi, Máxima: 3000Mi +- Tipo de trabajo: Kubernetes Job +- Plataforma: SleakOps en AWS EKS + +**Condiciones de Error:** + +- Los límites de recursos definidos en la interfaz de SleakOps no se traducen a la especificación del pod de Kubernetes +- El pod se ejecuta sin restricciones de recursos, lo que puede agotar recursos del nodo +- El contenedor puede consumir recursos ilimitados causando expulsión + +## Solución Detallada + + + +Este es un error confirmado en la plataforma SleakOps donde los límites de recursos definidos en la configuración del trabajo no se aplican correctamente a los pods generados de Kubernetes. + +**Estado:** Corregido en la próxima versión (programada para la semana actual) + +El equipo de desarrollo ha identificado y resuelto el problema donde los límites de CPU y memoria especificados en la configuración del trabajo de SleakOps no se traducían en la especificación del pod de Kubernetes. + + + + + +Si necesita ejecutar el trabajo con urgencia antes de que se publique la corrección, puede agregar manualmente los límites de recursos al trabajo de Kubernetes: + +1. **Exporte la configuración actual del trabajo** +2. **Edite manualmente el YAML del trabajo** para incluir las especificaciones de recursos +3. **Aplique la configuración modificada** + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: nombre-de-su-job +spec: + template: + spec: + containers: + - name: nombre-de-su-contenedor + image: su-imagen + resources: + requests: + cpu: "1000m" # Su valor mínimo de CPU + memory: "2500Mi" # Su valor mínimo de memoria + limits: + cpu: "1500m" # Su valor máximo de CPU + memory: "3000Mi" # Su valor máximo de memoria + # ... 
resto de la especificación del contenedor +``` + + + + + +Al agregar recursos manualmente, use el formato correcto de recursos de Kubernetes: + +**Recursos de CPU:** + +- Use milicores: `1000m` = 1 núcleo de CPU +- O decimal: `1.5` = 1.5 núcleos de CPU + +**Recursos de Memoria:** + +- Use unidades estándar: `Mi` (Mebibytes), `Gi` (Gibibytes) +- Ejemplos: `2500Mi`, `3Gi` + +**Ejemplo Completo:** + +```yaml +resources: + requests: # Recursos mínimos garantizados + cpu: "1000m" + memory: "2500Mi" + limits: # Recursos máximos permitidos + cpu: "1500m" + memory: "3000Mi" +``` + + + + + +Después de aplicar la corrección manual o cuando se publique la actualización de la plataforma: + +1. **Verifique la especificación del pod:** + +```bash +kubectl get pod <nombre-del-pod> -o yaml | grep -A 10 resources: +``` + +2. **Verifique la asignación de recursos:** + +```bash +kubectl describe pod <nombre-del-pod> +``` + +3. **Monitoree el uso de recursos:** + +```bash +kubectl top pod <nombre-del-pod> +``` + +La salida debe mostrar los límites de CPU y memoria especificados en lugar de recursos vacíos. + + + + + +**Mejores Prácticas para la Gestión de Recursos:** + +1. **Siempre establezca límites de recursos** para evitar el agotamiento de recursos del nodo +2. **Configure solicitudes apropiadas** para asegurar una correcta programación +3. **Monitoree el uso de recursos** para ajustar límites basados en el consumo real +4. 
**Use cuotas de recursos** a nivel de namespace para protección adicional + +**Estrategia Recomendada de Recursos:** + +- Establezca solicitudes al 70-80% del uso esperado +- Establezca límites al 120-150% del uso esperado +- Monitoree y ajuste según métricas reales + + + +--- + +_Esta FAQ fue generada automáticamente el 22 de abril de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-memory-limits-and-debug-pods.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-memory-limits-and-debug-pods.mdx new file mode 100644 index 000000000..497750c16 --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-memory-limits-and-debug-pods.mdx @@ -0,0 +1,222 @@ +--- +sidebar_position: 3 +title: "Incrementar Límites de Memoria y Crear Pods de Depuración" +description: "Cómo aumentar los límites de memoria para servicios web y crear pods de depuración para ejecutar scripts" +date: "2024-01-15" +category: "workload" +tags: + [ + "memoria", + "límites", + "depuración", + "pods", + "servicio-web", + "solución-de-problemas", + ] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Incrementar Límites de Memoria y Crear Pods de Depuración + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Memoria, Límites, Depuración, Pods, Servicio Web, Solución de Problemas + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan aumentar los límites de memoria para los pods de servicios web para ejecutar scripts que requieren muchos recursos y desean crear pods de depuración dedicados para ejecutar comandos en consola. 
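Para dimensionar el nuevo límite ayuda tener claras las unidades de memoria de Kubernetes (`Mi`, `Gi`, etc.). Un bosquejo ilustrativo en Python que convierte esas cantidades a bytes (script genérico, ajeno a la plataforma; la tabla de sufijos es la estándar de Kubernetes):

```python
# Sufijos de memoria de Kubernetes: binarios (Ki, Mi, Gi) y decimales (K, M, G)
SUFIJOS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "K": 1000, "M": 1000**2, "G": 1000**3,
}

def a_bytes(cantidad):
    """Convierte una cantidad de memoria de Kubernetes (p. ej. '768Mi') a bytes."""
    # Probar primero los sufijos de dos letras para no confundir "Mi" con "M"
    for sufijo, factor in sorted(SUFIJOS.items(), key=lambda kv: -len(kv[0])):
        if cantidad.endswith(sufijo):
            return int(cantidad[: -len(sufijo)]) * factor
    return int(cantidad)  # sin sufijo: bytes

print(a_bytes("768Mi"))  # 805306368 (el límite por defecto mencionado arriba)
print(a_bytes("2Gi"))    # 2147483648 (un posible límite ampliado)
```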
+ +**Síntomas Observados:** + +- Pods de servicios web alcanzando límites de memoria (768M por defecto) +- Fallos en scripts debido a memoria insuficiente +- Necesidad de pods temporales para ejecutar comandos +- Requisito de acceso a consola para ejecutar scripts de mantenimiento + +**Configuración Relevante:** + +- Límite de memoria por defecto: 768M +- Tipo de carga de trabajo: Servicio Web +- Plataforma: entorno Kubernetes de SleakOps +- Necesidad de contenedores temporales de depuración + +**Condiciones de Error:** + +- Exceso del límite de memoria durante la ejecución de scripts +- Errores de falta de memoria en los registros de la aplicación +- Necesidad de un entorno para ejecución de comandos ad-hoc + +## Solución Detallada + + + +Para aumentar los límites de memoria de tu servicio web: + +1. **Navega a Cargas de Trabajo**: + + - Ve a **Cargas de Trabajo** → **Servicio Web** + - Encuentra tu servicio web en la lista + +2. **Editar el Servicio**: + + - Haz clic en el servicio 'web' que deseas modificar + - Haz clic en **Editar** para abrir la configuración + +3. **Configurar Recursos**: + + - Navega al **último paso** de los formularios de configuración + - En la sección **Recursos**, encontrarás configuraciones mínimas y máximas de recursos + - Incrementa el límite **máximo de memoria** al valor deseado (por ejemplo, 2048M o 4096M) + +4. **Desplegar Cambios**: + - Asegúrate de que **Desplegar** esté activado/marcado + - Haz clic en **Guardar** o **Actualizar** + - Espera a que finalice el despliegue + +```yaml +# Ejemplo de configuración de recursos +resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2048Mi" # Incrementado desde 768Mi + cpu: "1000m" +``` + +**Nota**: El proceso de despliegue actualizará estos valores automáticamente en el clúster. + + + + + +Para crear un pod de depuración para ejecutar comandos en consola: + +1. 
**Crear un Nuevo Job**: + + - Ve a **Cargas de Trabajo** → **Jobs** + - Haz clic en **Crear Nuevo Job** + +2. **Configurar el Job**: + + - **Nombre**: Ponle un nombre descriptivo como `debug-pod` o `script-runner` + - **URL de la Imagen**: Déjalo **vacío** (usará la misma imagen que tu servicio web) + - **Etiqueta de la Imagen**: Déjalo **vacío** + - **Comando**: Ingresa `sleep infinity` o `sleep 999999` + +3. **Establecer Recursos** (Segundo Paso): + - Configura la memoria y CPU que este pod de depuración debe tener + - Establece límites apropiados según los requisitos de tu script + +```yaml +# Ejemplo de configuración de job de depuración +apiVersion: batch/v1 +kind: Job +metadata: + name: debug-pod +spec: + template: + spec: + containers: + - name: debug + image: your-app-image # Igual que el servicio web + command: ["sleep", "infinity"] + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "2048Mi" + cpu: "1000m" + restartPolicy: Never +``` + +4. **Desplegar el Job**: + - Haz clic en **Crear** para desplegar el pod de depuración + - Espera a que esté en estado **Running** + + + + + +Una vez que tu pod de depuración esté en ejecución: + +1. **Navega al Panel de Kubernetes**: + + - Ve a la sección **Kubernetes** en SleakOps + - Encuentra tu pod de depuración en la lista de pods + +2. **Abrir Acceso a Consola**: + + - Haz clic en el pod de depuración + - Busca el botón **Shell** o **Terminal** + - Haz clic para abrir una sesión de terminal + +3. **Ejecuta Tus Scripts**: + + ```bash + # Ejemplos de comandos que puedes ejecutar + php artisan migrate + php artisan cache:clear + npm run build + python manage.py migrate + ``` + +4. 
**Limpieza**: + - Una vez terminado, elimina el pod de depuración + - Regresa a **Jobs** y elimina el job de depuración + - Esto previene el uso innecesario de recursos + + + + + +**Buenas Prácticas para Límites de Memoria:** + +- Comienza con 2 veces tu límite actual y monitorea el uso +- Usa herramientas de monitoreo para entender el consumo real de memoria +- Considera si los scripts pueden ser optimizados en lugar de solo aumentar límites +- Establece adecuadamente tanto requests como limits + +**Buenas Prácticas para Pods de Depuración:** + +- Siempre limpia los pods de depuración después de usarlos +- Usa nombres descriptivos para fácil identificación +- Establece límites de recursos apropiados para evitar agotamiento del clúster +- Considera usar `kubectl exec` directamente si tienes acceso al clúster + +**Enfoques Alternativos:** + +1. **Jobs Programados**: Para scripts recurrentes, crea CronJobs de Kubernetes adecuados +2. **Contenedores Init**: Para scripts de configuración únicos durante el despliegue +3. 
**Contenedores Sidecar**: Para tareas de mantenimiento continuas + +```yaml +# Ejemplo de un job de mantenimiento adecuado +apiVersion: batch/v1 +kind: CronJob +metadata: + name: maintenance-script +spec: + schedule: "0 2 * * *" # Diario a las 2 AM + jobTemplate: + spec: + template: + spec: + containers: + - name: maintenance + image: your-app-image + command: + ["/bin/sh", "-c", "php artisan queue:work --stop-when-empty"] + resources: + limits: + memory: "1024Mi" + restartPolicy: OnFailure +``` + + + +--- + +_Esta FAQ fue generada automáticamente el 15 de enero de 2024 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-public-access-configuration.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-public-access-configuration.mdx new file mode 100644 index 000000000..d678f2a8b --- /dev/null +++ b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-public-access-configuration.mdx @@ -0,0 +1,146 @@ +--- +sidebar_position: 3 +title: "Configuración de Acceso Público para Workloads" +description: "Cómo configurar workloads para acceso público sin conexión VPN" +date: "2025-03-18" +category: "workload" +tags: ["workload", "acceso-publico", "vpn", "configuración", "staging"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Configuración de Acceso Público para Workloads + +**Fecha:** 18 de marzo de 2025 +**Categoría:** Workload +**Etiquetas:** Workload, Acceso Público, VPN, Configuración, Staging + +## Descripción del Problema + +**Contexto:** Los usuarios necesitan acceder a entornos de staging sin estar conectados a VPN, pero los workloads están configurados por defecto para acceso privado en SleakOps. 
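Como comprobación rápida del tipo de acceso, puede clasificarse la IP a la que resuelve el dominio del workload: un balanceador interno (solo alcanzable vía VPN) usa rangos privados RFC 1918, mientras que uno público usa direcciones enrutables desde Internet. Bosquejo ilustrativo con la librería estándar de Python (las IPs del ejemplo son ficticias):

```python
import ipaddress

def es_privada(direccion):
    """True si la IP pertenece a un rango privado (alcanzable solo vía VPN o red interna)."""
    return ipaddress.ip_address(direccion).is_private

print(es_privada("10.0.12.34"))   # True: típico de un balanceador interno
print(es_privada("52.45.120.7"))  # False: dirección pública, accesible sin VPN
```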
+ +**Síntomas Observados:** + +- El workload de staging solo es accesible cuando se está conectado a VPN +- Los miembros del equipo no pueden acceder al entorno de staging sin conexión VPN +- Necesidad de hacer el workload accesible públicamente para un acceso más amplio del equipo + +**Configuración Relevante:** + +- Entorno: Staging +- Método de acceso: Actualmente solo VPN +- Acceso deseado: Acceso desde internet público +- Plataforma: Gestión de workloads en SleakOps + +**Condiciones de Error:** + +- Workload inaccesible sin conexión VPN +- Tiempo de espera agotado al acceder desde internet público +- Necesidad de modificar configuración del workload para acceso público + +## Solución Detallada + + + +Para hacer un workload accesible públicamente en SleakOps: + +1. **Navega a tu proyecto** en el panel de SleakOps +2. **Selecciona el workload** que deseas hacer público +3. **Ve a la configuración** o ajustes del workload +4. **Encuentra la sección de configuración de red/acceso** +5. **Cambia el tipo de acceso** de "Privado" a "Público" +6. **Guarda y redepliega** el workload + +El workload ahora será accesible desde internet público sin requerir conexión VPN. + + + + + +**Acceso Privado (Por Defecto):** + +- El workload solo es accesible a través de VPN +- Mayor seguridad para aplicaciones internas +- Requiere conexión VPN para todo acceso + +**Acceso Público:** + +- El workload es accesible desde internet público +- Adecuado para entornos de staging y aplicaciones públicas +- No requiere conexión VPN +- Debe usarse con mecanismos adecuados de autenticación y seguridad + + + + + +Al hacer workloads accesibles públicamente: + +1. **Habilitar autenticación**: Asegúrate de que tu aplicación tenga mecanismos adecuados de inicio de sesión +2. **Usar HTTPS**: Siempre activa SSL/TLS para workloads públicos +3. **Implementar limitación de tasa**: Protege contra abusos y ataques DDoS +4. 
**Monitorear registros de acceso**: Lleva control de quién accede a tu aplicación +5. **Actualizaciones regulares de seguridad**: Mantén tu aplicación y dependencias actualizadas + +```yaml +# Ejemplo de configuración de workload +apiVersion: v1 +kind: Service +metadata: + name: staging-app +spec: + type: LoadBalancer # Para acceso público + ports: + - port: 80 + targetPort: 3000 + selector: + app: staging-app +``` + + + + + +Si aún tienes problemas después de hacer público el workload: + +1. **Verifica la propagación DNS**: Puede tardar algunos minutos en propagarse +2. **Confirma el estado del balanceador de carga**: Asegúrate de que esté configurado correctamente +3. **Revisa los grupos de seguridad**: Verifica que los puertos necesarios estén abiertos +4. **Revisa los registros de la aplicación**: Busca errores específicos de la aplicación +5. **Prueba desde diferentes redes**: Verifica el acceso desde múltiples ubicaciones + +**Comandos comunes para solución de problemas:** + +```bash +# Ver estado de servicios +kubectl get services + +# Ver configuración de ingress +kubectl get ingress + +# Ver logs del workload +kubectl logs -f deployment/your-workload-name +``` + + + + + +**Entornos de Staging:** + +- Pueden hacerse públicos para colaboración del equipo +- Deben tener autenticación básica +- Considera listas blancas de IP para mayor seguridad + +**Entornos de Producción:** + +- Evaluar cuidadosamente antes de hacer públicos +- Implementar medidas de seguridad completas +- Usar monitoreo y alertas adecuadas +- Considerar estrategias de despliegue gradual + + + +--- + +_Esta FAQ fue generada automáticamente el 18 de marzo de 2025 basada en una consulta real de usuario._ diff --git a/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-traffic-routing-issues.mdx b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-traffic-routing-issues.mdx new file mode 100644 index 000000000..f918f56bb --- /dev/null +++ 
b/i18n/es/docusaurus-plugin-content-docs/current/troubleshooting/workload-traffic-routing-issues.mdx @@ -0,0 +1,245 @@ +--- +sidebar_position: 3 +title: "Problemas de Enrutamiento de Tráfico a Pods Backend" +description: "Solución para problemas de enrutamiento de tráfico al acceder a servicios backend a través de subdominios" +date: "2024-01-15" +category: "workload" +tags: + ["redes", "enrutamiento", "subdominio", "backend", "solución de problemas"] +--- + +import TroubleshootingItem from "@site/src/components/HomepageFeatures/troubleshootingitem"; + +# Problemas de Enrutamiento de Tráfico a Pods Backend + +**Fecha:** 15 de enero de 2024 +**Categoría:** Carga de trabajo +**Etiquetas:** Redes, Enrutamiento, Subdominio, Backend, Solución de problemas + +## Descripción del Problema + +**Contexto:** El usuario tiene un servicio backend desplegado en SleakOps que puede conectarse a la base de datos con éxito, pero el tráfico externo a través de subdominios no está llegando a los pods backend. + +**Síntomas Observados:** + +- El reenvío de puerto directo al pod funciona en localhost +- El acceso externo a través del subdominio falla con errores de enrutamiento +- El backend puede conectarse a la base de datos exitosamente +- La versión 1 de la base de datos está funcionando correctamente +- Las solicitudes HTTP nunca llegan al servicio backend + +**Configuración Relevante:** + +- Servicio backend: desplegado y en ejecución +- Base de datos: versión 1, conectada exitosamente +- Redes: subdominio configurado pero no enruta tráfico +- Reenvío de puerto: funciona localmente vía kubectl + +**Condiciones de Error:** + +- El acceso al subdominio falla con errores de redirección +- El enrutamiento de tráfico falla entre ingress y pods backend +- Las solicitudes externas no llegan a la aplicación +- El reenvío de puerto local funciona correctamente + +## Solución Detallada + + + +Este tipo de problema típicamente ocurre debido a uno de estos problemas comunes: + +1. 
**Configuración de ingress**: reglas de ingress faltantes o incorrectas +2. **Configuración del servicio**: servicio no expone correctamente el backend +3. **Configuración DNS**: subdominio no configurado correctamente +4. **Políticas de red**: bloqueo del tráfico externo +5. **Problemas con el balanceador de carga**: balanceador externo no enruta correctamente + +Primero, verifica que tu servicio esté correctamente configurado y en ejecución: + +```bash +kubectl get services +kubectl get pods +kubectl get ingress +``` + + + + + +Revisa si tu servicio backend está correctamente configurado: + +```bash +# Ver detalles del servicio +kubectl describe service your-backend-service + +# Verificar endpoints del servicio +kubectl get endpoints your-backend-service +``` + +Asegúrate de que la configuración de tu servicio incluya: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: backend-service +spec: + selector: + app: your-backend-app + ports: + - protocol: TCP + port: 80 + targetPort: 8080 # Tu puerto backend + type: ClusterIP +``` + + + + + +Verifica que tu ingress esté configurado correctamente para el subdominio: + +```bash +# Ver estado del ingress +kubectl get ingress +kubectl describe ingress your-ingress-name +``` + +Asegúrate de que la configuración de tu ingress sea correcta: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: backend-ingress + annotations: + kubernetes.io/ingress.class: nginx +spec: + rules: + - host: your-subdomain.yourdomain.com + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: backend-service + port: + number: 80 +``` + + + + + +Comprueba si tu subdominio está configurado correctamente: + +1. **Prueba resolución DNS**: + +```bash +nslookup your-subdomain.yourdomain.com +dig your-subdomain.yourdomain.com +``` + +2. 
**Verifica que el subdominio apunte a la IP correcta**: + + - El subdominio debe apuntar a la IP del balanceador de carga de tu clúster + - Obtén la IP del balanceador: `kubectl get ingress` + +3. **Verifica en el panel de SleakOps**: + - Ve a la configuración de tu proyecto + - Revisa la configuración del subdominio + - Asegúrate de que esté correctamente vinculado a tu servicio backend + + + + + +Prueba la conectividad en diferentes niveles: + +1. **Probar el pod directamente**: + +```bash +kubectl port-forward pod/your-backend-pod 8080:8080 +curl http://localhost:8080/health +``` + +2. **Probar el servicio internamente**: + +```bash +kubectl run test-pod --image=curlimages/curl -it --rm -- sh +# Dentro del pod: +curl http://backend-service/health +``` + +3. **Probar el ingress internamente**: + +```bash +curl -H "Host: your-subdomain.yourdomain.com" http://INGRESS_IP/health +``` + +4. **Probar el acceso externo**: + +```bash +curl https://your-subdomain.yourdomain.com/health +``` + + + + + +**Si el servicio no funciona:** + +- Verifica que el selector del servicio coincida con las etiquetas del pod +- Verifica que la configuración de puertos coincida con el puerto del contenedor + +**Si el ingress no funciona:** + +- Asegúrate de que el controlador de ingress esté en ejecución: `kubectl get pods -n ingress-nginx` +- Revisa que la clase de ingress (`ingressClassName`) sea la correcta +- Verifica que la configuración del host coincida con tu subdominio + +**Si DNS no resuelve:** + +- Revisa la configuración del subdominio en el panel de SleakOps +- Verifica la propagación DNS (puede tardar hasta 24 horas) +- Intenta usar un servidor DNS diferente para pruebas + +**Si hay problemas con SSL/TLS:** + +- Revisa el estado del certificado: `kubectl describe certificate` +- Verifica que cert-manager esté funcionando: `kubectl get pods -n cert-manager` + + + + + +En la plataforma SleakOps: + +1. 
**Verifica la configuración del servicio**: + + - Ve a tu proyecto → Servicios + - Verifica que el servicio backend esté correctamente configurado + - Revisa los mapeos de puertos y los chequeos de salud + +2. **Verifica la configuración del subdominio**: + + - Ve a configuración del proyecto → Dominios + - Asegúrate de que el subdominio esté activo y configurado correctamente + - Revisa el estado del certificado SSL + +3. **Revisa los logs**: + + - Consulta los logs del servicio backend en el panel de SleakOps + - Busca errores de arranque o problemas de conectividad + - Revisa los logs del controlador de ingress si están disponibles + +4. **Políticas de red**: + - Asegúrate de que no haya políticas de red que bloqueen el tráfico + - Verifica si hay grupos de seguridad o reglas de firewall que interfieran + + + +--- + +_Este artículo de solución de problemas fue generado automáticamente el 15 de enero de 2024 a partir de una consulta real de un usuario._ diff --git a/src/components/HomepageFeatures/faqitem.jsx b/src/components/HomepageFeatures/troubleshootingitem.jsx similarity index 76% rename from src/components/HomepageFeatures/faqitem.jsx rename to src/components/HomepageFeatures/troubleshootingitem.jsx index a39872905..e140c277e 100644 --- a/src/components/HomepageFeatures/faqitem.jsx +++ b/src/components/HomepageFeatures/troubleshootingitem.jsx @@ -1,12 +1,12 @@ import React, { useEffect, useRef } from "react"; import { useLocation } from "react-router-dom"; -const FAQItem = ({ id, summary, children }) => { +const TroubleshootingItem = ({ id, summary, children }) => { const detailsRef = useRef(null); const location = useLocation(); useEffect(() => { - // Check if the current hash matches this FAQ's ID + // Check if the current hash matches this troubleshooting item's ID if (location.hash === `#${id}`) { if (detailsRef.current) { detailsRef.current.open = true; @@ -23,4 +23,4 @@ const FAQItem = ({ id, summary, children }) => { ); }; -export default FAQItem; +export default TroubleshootingItem; diff --git
a/src/css/custom.css index d7b1f87c7..935f12edf 100644 --- a/src/css/custom.css +++ b/src/css/custom.css @@ -105,4 +105,168 @@ details { summary > h3 { margin: 0px +} + +/* Troubleshooting Specific Styles */ +.troubleshooting-header { + display: flex; + justify-content: space-between; + align-items: center; + margin-bottom: 1rem; +} + +.troubleshooting-meta { + display: flex; + justify-content: space-between; + align-items: center; + margin: 1rem 0; + padding: 0.75rem; + background-color: var(--ifm-color-emphasis-100); + border-radius: 6px; + font-size: 0.9rem; +} + +.troubleshooting-meta strong { + color: var(--ifm-color-primary); +} + +.troubleshooting-original-query { + background-color: var(--ifm-color-emphasis-50); + border-left: 4px solid var(--ifm-color-primary); + padding: 1rem; + margin: 1rem 0; +} + +.troubleshooting-original-query blockquote { + margin: 0; + font-style: italic; +} + +.troubleshooting-status { + display: inline-block; + padding: 0.25rem 0.75rem; + border-radius: 12px; + font-size: 0.8rem; + font-weight: 600; + text-transform: uppercase; +} + +.troubleshooting-status.resolved { + background-color: var(--ifm-color-success-contrast-background); + color: var(--ifm-color-success-contrast-foreground); +} + +.troubleshooting-status.pending { + background-color: var(--ifm-color-warning-contrast-background); + color: var(--ifm-color-warning-contrast-foreground); +} + +.troubleshooting-status.in-progress { + background-color: var(--ifm-color-info-contrast-background); + color: var(--ifm-color-info-contrast-foreground); +} + +.troubleshooting-followup { + margin-top: 2rem; + padding: 1rem; + background-color: var(--ifm-color-emphasis-50); + border-radius: 8px; +} + +.troubleshooting-followup h2 { + margin-top: 0; + font-size: 1.1rem; +} + +.troubleshooting-related-links { + background-color: var(--ifm-color-emphasis-50); + padding: 1rem; + border-radius: 8px; + margin: 1.5rem 0; +} + +.troubleshooting-related-links h2 { + margin-top: 0; + 
font-size: 1.1rem; +} + +.troubleshooting-related-links ul { + margin-bottom: 0; +} + +.troubleshooting-timestamp { + font-size: 0.8rem; + color: var(--ifm-color-emphasis-600); + font-style: italic; + text-align: center; + margin-top: 2rem; + padding-top: 1rem; + border-top: 1px solid var(--ifm-color-emphasis-300); +} + +/* Troubleshooting Item enhancements */ +details[id] { + margin: 1.5rem 0; + border: 1px solid var(--ifm-color-emphasis-300); + border-radius: 8px; + overflow: hidden; +} + +details[id] summary { + background-color: var(--ifm-color-emphasis-100); + padding: 1rem; + cursor: pointer; + font-weight: 600; + transition: background-color 0.2s ease; +} + +details[id] summary:hover { + background-color: var(--ifm-color-emphasis-200); +} + +details[id] > div { + padding: 1rem; + background-color: var(--ifm-background-color); +} + +/* DocCard customization for Troubleshooting index */ +.theme-doc-card-container { + margin: 0.5rem 0; +} + +/* DocCardList with col--12-cards className - Full width cards */ +.col--12-cards { + display: flex; + flex-direction: column; + gap: 1rem; +} + +.col--12-cards .theme-doc-card { + width: 100% !important; + max-width: 100% !important; + flex: 1 1 100% !important; +} + +.col--12-cards .theme-doc-card-container { + width: 100% !important; + max-width: 100% !important; +} + +/* Alternative approach - if the above doesn't work, try this */ +.col--12-cards article { + width: 100% !important; + max-width: 100% !important; + flex-basis: 100% !important; +} + +/* Responsive adjustments */ +@media (max-width: 768px) { + .troubleshooting-meta { + flex-direction: column; + gap: 0.5rem; + } + + .troubleshooting-meta > div { + margin: 0; + } } \ No newline at end of file diff --git a/static/img/network/sleakops-network.png b/static/img/network/sleakops-network.png new file mode 100644 index 000000000..5c5f8abe1 Binary files /dev/null and b/static/img/network/sleakops-network.png differ