diff --git a/custom-gateway/01-customer-request.txt b/custom-gateway/01-customer-request.txt new file mode 100644 index 0000000..7a2c166 --- /dev/null +++ b/custom-gateway/01-customer-request.txt @@ -0,0 +1,78 @@ +Hello Simon, + +Thanks for your detailed response, + +We do have most of the network config that you mentioned as part of 2.0 + +My specific question would be if we can use our TGW instead of dedicated Quilt NAT Gateways? + +Additionally, +Can Quilt function if we route 0.0.0.0/0 through Transit Gateway instead of NAT Gateway? + +Current: Private Subnet → NAT Gateway → Internet +Desired: Private Subnet → TGW → Corporate Firewall → Internet + +Do Lambda functions and ECS containers require direct internet access to external (non-AWS) services? +If they only call AWS services, we can use VPC Endpoints and avoid internet routing entirely. If they call external APIs/registries, we need TGW routing to work. + +Do they call external APIs?, Pull from public Docker Hub? Download from PyPI/npm at runtime? + +Which AWS services does Quilt need outbound access to? + Do we need to add more VPC Endpoints for all AWS services Quilt uses, so those calls bypass the firewall. Current endpoints: S3, DynamoDB, execute-api + +Need to know if we need ECR? CloudWatch? STS? Others? +Thanks, and Regards, + +Customer Contact + + +From: Simon Kohnstamm from Quilt Data, Inc. +Date: Tuesday, January 27, 2026 at 9:00 AM +To: Customer Contact , ernest +Cc: Network Team , Security Team , Infrastructure Team +Subject: [EXTERNAL] Re: Quilt application integration with Customer's Network Infrastructure + +CAUTION: External Email. THINK BEFORE YOU CLICK. It could be a phishing email. +Do not click links or open attachments unless you recognize the sender and are expecting the attachment or link. +Hi Customer Contact, +Thanks for the detailed note. Yes, Quilt supports integration into an existing corporate network/VPC and is designed to be private-by-default. 
Our current "Network 2.0" architecture places most services in private subnets and supports internal-only access via private load balancers and VPC endpoints. (See README.md and t4/template/PRIVATE_ENDPOINTS.md.) +Can Quilt run inside an existing corporate network? + +Yes. We support deploying into an existing VPC, using customer-provided subnets and security groups. For Network 2.0 (what you're on), you provide: +Private subnets for service containers +Intra subnets for DB/Search +Optional public subnets if you want an internet-facing ELB +A "UserSecurityGroup" that controls ingress to the load balancer +(and optionally, if you want the API Gateway to run inside of your VPC) a VPC Endpoint for execute-api +Network requirements / dependencies + +Network 2.0 defaults: +Private subnets for ECS/Lambda +Intra subnets for DB/Elasticsearch +ELB defaults to internal +API Gateway and Lambdas run inside the VPC +Outbound access for private subnets is via NAT or VPC endpoints (for AWS services). If you want fully private egress, we typically use VPC endpoints for services like S3, ECR, CloudWatch, and STS. +Firewall rules / ports + +From the deployment templates: +Inbound to ELB: TCP 443 and 80 (80 redirects to 443). +DB access: TCP 5432 is only allowed from the DB accessor security group (internal). +Everything else remains internal to the VPC and is controlled by security groups. +Best practices / documentation + +Use Network 2.0 defaults (private-by-default) and internal ELB where possible. +Potential impacts on performance/functionality + +No functional limitations are expected. The main impact is access path: if the ELB is internal, users will need VPN/Direct Connect/Transit Gateway connectivity. For outbound calls to AWS services, NAT or VPC endpoints are required. Performance is generally comparable; any added latency is typically due to the corporate network path rather than Quilt itself. 
+If helpful, we're happy to jump on a call and review your target topology (internal-only vs. internet-facing, VPC endpoint strategy, etc.) and map it to the required parameters. + +Best regards, + +Simon + + + +Simon Kohnstamm +Service And Support +QUILT.BIO +See All Tickets diff --git a/custom-gateway/02-customer-issue.md b/custom-gateway/02-customer-issue.md new file mode 100644 index 0000000..c7214a1 --- /dev/null +++ b/custom-gateway/02-customer-issue.md @@ -0,0 +1,344 @@ +# Product Management Summary: Customer Custom Network Routing Request + +**Date:** February 2, 2026 +**Customer:** Customer Organization +**Request Type:** Custom Network Architecture Support +**Priority:** High - Blocking Production Deployment + +--- + +## 1. Executive Summary + +The customer organization is requesting support for routing their Quilt deployment through their Transit Gateway (TGW) infrastructure instead of using Quilt's default NAT Gateway setup. This represents a common enterprise requirement where customers need Quilt to integrate with their existing network architecture for security, compliance, and operational reasons. The request requires clarification on Quilt's external service dependencies and network requirements to enable proper routing configuration. + +**Key Ask:** Route all egress traffic (0.0.0.0/0) through customer's Transit Gateway instead of NAT Gateway, while maintaining full Quilt functionality. + +--- + +## 2. 
Customer Context + +### Organization +- **Company:** Customer Organization +- **Contact:** Customer Contact (contact@customer.com) +- **Industry:** Enterprise +- **Scale:** Enterprise customer + +### Current Situation +- Customer has an established AWS network architecture with Transit Gateway +- They want to deploy Quilt within their existing VPC/networking infrastructure +- All egress traffic must route through their TGW for security/compliance +- This is blocking their production deployment of Quilt + +### Strategic Context +- Represents common enterprise networking pattern +- Likely affects other enterprise customers with similar requirements +- Shows need for Quilt to support flexible network architectures +- May indicate gap in deployment documentation/configuration options + +--- + +## 3. Core Requirements + +### Primary Requirement +Route all Quilt egress traffic (0.0.0.0/0) through customer's Transit Gateway instead of NAT Gateway. + +### Specific Configuration Needs +1. **No NAT Gateway:** Customer wants to eliminate Quilt-managed NAT Gateways +2. **TGW Routing:** All outbound traffic should route through their TGW +3. **AWS Service Access:** Quilt components must still access required AWS services +4. **Maintained Functionality:** All Quilt features must work without degradation + +### Architecture Constraints +- Must work within customer's existing VPC structure +- Must comply with customer's network security policies +- Must support Lambda and ECS workloads +- Must handle both AWS service calls and external API calls + +--- + +## 4. Technical Questions Asked + +The customer has specific technical questions that need answers: + +### Network Architecture Questions + +1. **Primary Routing Question:** + > "Can we route 0.0.0.0/0 through TGW instead of NAT Gateway?" + - Need to confirm if this routing pattern is supported + - Identify any Quilt-specific routing requirements + +2. 
**Service Dependencies:** + > "Do Lambda/ECS need to call any external services other than AWS services?" + - Critical for routing design + - Determines if TGW needs internet egress or just AWS service access + +3. **AWS Service Requirements:** + > "Which AWS services does Quilt need to call?" + - Complete list needed for: + - VPC Endpoint planning + - Security group configuration + - Route table design + - Examples likely include: S3, DynamoDB, SQS, SNS, CloudWatch, etc. + +4. **VPC Endpoints:** + > "Do we need VPC endpoints for AWS services?" + - Preferred approach for AWS service access in private subnets + - Reduces/eliminates need for NAT Gateway or TGW internet routing + - Need to provide complete list of required VPC endpoints + +### Additional Implied Questions +- What are the minimum network requirements for Quilt? +- Are there any services that specifically require NAT Gateway? +- Can Quilt components run entirely in private subnets? +- What are the latency/bandwidth requirements? + +--- + +## 5. 
Business Impact + +### Impact to Customer +- **Deployment Blocked:** Cannot proceed with production deployment +- **Security Compliance:** Need to maintain network security posture +- **Cost Control:** TGW may reduce NAT Gateway costs +- **Operational Integration:** Want Quilt to fit existing infrastructure +- **Timeline Risk:** Delay affects their project timelines + +### Impact to Quilt +- **Revenue Risk:** Enterprise deal potentially blocked +- **Product Gap:** May indicate limitation in network flexibility +- **Customer Satisfaction:** Responsiveness affects relationship +- **Competitive Position:** Competitors may support this use case +- **Technical Debt:** May need architecture changes to support + +### Broader Market Impact +- **Enterprise Adoption:** Common requirement for large organizations +- **Product-Market Fit:** Shows need for enterprise networking support +- **Differentiation Opportunity:** Better support could be competitive advantage +- **Documentation Gap:** May need better network architecture docs +- **Sales Enablement:** Sales team needs clear guidance on network requirements + +--- + +## 6. Dependencies & Blockers + +### Information Needed (CRITICAL) + +1. **Complete AWS Service List:** + - All AWS services Quilt Lambda functions call + - All AWS services Quilt ECS tasks call + - Service-specific requirements (regional vs. global endpoints) + - Authentication methods (IAM roles, API keys, etc.) + +2. **External Service Dependencies:** + - Any third-party APIs called by Quilt + - Webhooks or callbacks that need internet access + - License validation or telemetry endpoints + - Container registries (ECR, Docker Hub, etc.) + +3. **Network Requirements:** + - Bandwidth requirements + - Latency sensitivity + - Port/protocol requirements + - Any multicast or broadcast needs + +4. 
**Current Architecture Documentation:** + - Existing network diagrams + - Default VPC/subnet configuration + - Security group templates + - IAM role assumptions + +### Technical Decisions Needed + +1. **Support Strategy:** + - Should Quilt officially support TGW-only deployments? + - Should this be a configuration option or custom deployment? + - How to maintain compatibility with existing deployments? + +2. **VPC Endpoint Strategy:** + - Which VPC endpoints should be mandatory vs. optional? + - Should Quilt CDK create VPC endpoints automatically? + - How to handle VPC endpoint costs in pricing model? + +3. **Documentation Updates:** + - Network architecture guide needed + - VPC endpoint setup instructions + - Custom routing configuration examples + - Troubleshooting guide for network issues + +### Stakeholder Alignment Needed + +- **Engineering:** Can we support this configuration? +- **Solutions Architecture:** What's the recommended approach? +- **Product:** Should this be a standard feature? +- **Sales:** What's the business priority? +- **Support:** Can we support troubleshooting? +- **Security:** Any security implications? + +--- + +## 7. Recommended Next Steps + +### Immediate Actions (This Week) + +1. **Gather Service Dependencies (DAY 1)** + - [ ] Audit all Lambda functions for AWS service calls + - [ ] Audit all ECS tasks for AWS service calls + - [ ] Identify external API dependencies + - [ ] Document required network endpoints + - **Owner:** Engineering Team + - **Output:** Complete service dependency list + +2. **Document VPC Endpoint Requirements (DAY 2)** + - [ ] Create list of required VPC endpoints + - [ ] Document optional VPC endpoints + - [ ] Estimate VPC endpoint costs + - [ ] Create VPC endpoint setup guide + - **Owner:** Solutions Architecture + - **Output:** VPC Endpoint guide + +3. 
**Respond to Customer (DAY 3)** + - [ ] Send complete AWS service list + - [ ] Confirm external service dependencies + - [ ] Provide VPC endpoint recommendations + - [ ] Offer architecture review call + - **Owner:** Product Manager + Solutions Architect + - **Output:** Detailed technical response + +### Short-term Actions (Next 2 Weeks) + +4. **Create Reference Architecture** + - [ ] Design TGW-based network architecture + - [ ] Create network diagrams + - [ ] Document routing configuration + - [ ] Test with pilot customer + - **Owner:** Solutions Architecture + - **Output:** Reference architecture document + +5. **Update Documentation** + - [ ] Add network architecture section to docs + - [ ] Create TGW deployment guide + - [ ] Document VPC endpoint setup + - [ ] Add troubleshooting guide + - **Owner:** Technical Writing + Engineering + - **Output:** Updated documentation + +6. **Enable Customer Deployment** + - [ ] Schedule architecture review with customer + - [ ] Validate their proposed design + - [ ] Provide deployment support + - [ ] Monitor deployment success + - **Owner:** Solutions Architecture + Support + - **Output:** Successful production deployment + +### Medium-term Actions (Next Quarter) + +7. **Product Enhancement Planning** + - [ ] Evaluate making TGW support a standard feature + - [ ] Design CDK configuration options + - [ ] Plan VPC endpoint automation + - [ ] Create network architecture testing + - **Owner:** Product Management + - **Output:** Product roadmap items + +8. **Sales Enablement** + - [ ] Create network requirements guide for sales + - [ ] Document enterprise networking capabilities + - [ ] Train solutions architects + - [ ] Add to RFP response templates + - **Owner:** Product Marketing + - **Output:** Sales enablement materials + +9. 
**Market Research** + - [ ] Survey other enterprise customers + - [ ] Identify common network patterns + - [ ] Benchmark competitor capabilities + - [ ] Prioritize network features + - **Owner:** Product Management + - **Output:** Network feature prioritization + +--- + +## 8. Success Metrics + +### Immediate Success (Customer-Specific) +- Customer successfully deploys Quilt with TGW routing +- All Quilt functionality works as expected +- No performance degradation +- Customer satisfaction score: 9+/10 + +### Product Success (Organization-Wide) +- Reduction in network-related support tickets +- Increase in enterprise customer adoption +- Improved sales cycle time for enterprise deals +- Positive feedback on network flexibility + +### Business Success +- Customer deployment generates reference architecture +- Convert to long-term customer +- Enable 3+ similar enterprise deployments +- Establish Quilt as enterprise-ready solution + +--- + +## 9. Risk Assessment + +### High Risks +- **Incomplete Service List:** May miss critical dependencies +- **Latency Issues:** VPC endpoints may introduce latency +- **Cost Surprise:** VPC endpoint costs may be significant +- **Support Complexity:** Harder to troubleshoot customer networks + +### Medium Risks +- **Documentation Gaps:** Customers may struggle with setup +- **Version Compatibility:** Future Quilt versions may add dependencies +- **Regional Limitations:** Some VPC endpoints not available in all regions +- **Performance Variation:** Customer TGW performance varies + +### Mitigation Strategies +- Comprehensive testing in customer-like environment +- Clear documentation and setup automation +- Ongoing monitoring of service dependencies +- Regular architecture reviews with customers + +--- + +## 10. Open Questions + +1. Does Quilt currently have any telemetry or phone-home requirements? +2. What container registries does Quilt pull from (ECR, Docker Hub)? +3. 
Are there any licensing or authentication services called at runtime? +4. How are Quilt updates delivered (new Lambda code, new containers)? +5. What's the expected bandwidth usage pattern? +6. Are there any services that specifically require public internet access? +7. How do we validate that TGW routing is working correctly? +8. What monitoring is needed to detect network issues? + +--- + +## Appendix: Technical Context + +### Transit Gateway (TGW) Overview +- AWS service for connecting multiple VPCs and on-premises networks +- Acts as cloud router with centralized control +- Common in enterprise AWS architectures +- Supports routing to internet via attached VPCs +- Can integrate with AWS services via VPC endpoints + +### VPC Endpoints +- Private connections to AWS services without internet gateway +- Interface Endpoints (powered by PrivateLink) for most services +- Gateway Endpoints for S3 and DynamoDB +- Eliminate need for NAT Gateway for AWS service access +- Per-endpoint costs vary by service and data transfer + +### Network Architecture Patterns +1. **Default Pattern:** Private subnet → NAT Gateway → Internet Gateway +2. **VPC Endpoint Pattern:** Private subnet → VPC Endpoints → AWS Services +3. **TGW Pattern (Customer Request):** Private subnet → TGW → Customer Network +4. 
**Hybrid Pattern:** TGW for some traffic, VPC Endpoints for AWS services + +--- + +**Next Review Date:** February 9, 2026 +**Document Owner:** Product Manager +**Stakeholders:** Engineering, Solutions Architecture, Sales, Support diff --git a/custom-gateway/03-gateway-audit.md b/custom-gateway/03-gateway-audit.md new file mode 100644 index 0000000..f686d74 --- /dev/null +++ b/custom-gateway/03-gateway-audit.md @@ -0,0 +1,851 @@ +# Quilt Deployment Network Dependencies Audit + +**Date:** February 2, 2026 +**Auditor:** Engineering Team +**Purpose:** Document all AWS services and external dependencies for Transit Gateway routing decisions +**Repository:** ~/GitHub/deployment/ + +--- + +## Executive Summary + +This audit identifies **40+ AWS services** and **multiple external dependencies** that Quilt's deployment architecture requires. The findings directly answer the customer's questions about Transit Gateway routing feasibility. + +**Key Findings:** +- ✅ Most AWS services can be accessed via VPC Endpoints (eliminating NAT/TGW internet routing) +- ⚠️ Internet egress is needed for external services (telemetry, SSO providers) and, unless ECR VPC endpoints are deployed, for ECR image pulls +- ✅ Lambda and ECS can run entirely in private subnets with proper VPC endpoint configuration +- ⚠️ Optional features (SSO, telemetry) can be disabled to reduce external dependencies + +--- + +## Quick Answer to Customer's Questions + +### Q1: Can we route 0.0.0.0/0 through TGW instead of NAT Gateway? +**Answer:** Yes, with the following conditions: +- TGW must route to internet for: ECR pulls (unless ECR VPC endpoints are deployed), telemetry (optional), SSO providers (optional) +- VPC Endpoints should be configured for AWS services to bypass TGW/internet routing +- Alternatively, deploy VPC Interface Endpoints for all AWS services Quilt uses to minimize the traffic that must traverse the TGW (see recommendations below) + +### Q2: Do Lambda/ECS need to call external (non-AWS) services?
+**Answer:** Yes, but mostly optional: +- **Required:** ECR (AWS service) for pulling Docker images +- **Optional:** Quilt telemetry service (`telemetry.quiltdata.cloud`) +- **Optional:** SSO providers (Google, Azure, Okta, OneLogin) +- **Optional:** External MCP/Benchling APIs (if configured) + +### Q3: Which AWS services does Quilt need? +**Answer:** See complete list below. Primary services: +- **Core:** S3, RDS (PostgreSQL), ElasticSearch, ECS, Lambda +- **Messaging:** SQS, SNS, EventBridge +- **Networking:** VPC, ALB, Service Discovery +- **Monitoring:** CloudWatch Logs/Metrics, CloudWatch Synthetics +- **Analytics:** Athena, Glue, Firehose, CloudTrail +- **Security:** IAM, KMS, WAF v2 + +### Q4: Which VPC Endpoints do we need? +**Answer:** See "Recommended VPC Endpoint Configuration" section below. + +--- + +## AWS Services Inventory + +### 1. Compute Services + +#### **Lambda (AWS Lambda)** +- **Usage:** + - Search indexing (SearchHandler, EsIngest, ManifestIndexer) + - Package creation/promotion handlers + - API Gateway integrations + - S3 to EventBridge conversion + - DuckDB select operations + - Custom CloudFormation resource handlers +- **Location:** `t4/template/search.py`, `pkg_push_lambdas.py`, `api_services.py` +- **Network:** Configurable with `lambdas_in_vpc` parameter +- **VPC Endpoint:** Use API Gateway endpoint if API Gateway is in VPC +- **Egress Needs:** AWS API calls, S3 access, CloudWatch Logs + +#### **ECS (Elastic Container Service)** +- **Usage:** + - Registry service container + - MCP (Model Context Protocol) server + - Benchling integration service + - S3 proxy service + - Voila notebook service + - Bucket scanner tasks + - Migration tasks +- **Location:** `t4/template/ecs.py`, `containers.py` +- **Network:** Private subnets with Service Discovery +- **Egress Needs:** + - ECR image pulls (required) + - CloudWatch Logs + - S3 access + - External SSO APIs (optional) + - Telemetry (optional) + +#### **EC2 (Elastic Compute Cloud)** +- 
**Usage:** + - VPC infrastructure + - NAT Gateways (can be replaced by TGW) + - Security groups + - Voila instance (optional) +- **Location:** `t4/template/network.py`, `voila.py` +- **Components:** VPC, Subnets, Route Tables, Internet Gateway, NAT Gateway + +--- + +### 2. Storage Services + +#### **S3 (Simple Storage Service)** +- **Usage:** + - Data bucket storage for packages (primary use) + - Analytics bucket for usage data + - CloudTrail logs storage + - Audit trail storage + - Lambda code storage + - Synthetics canary results + - Service discovery bucket +- **Location:** Used throughout all templates +- **VPC Endpoint:** ✅ Gateway Endpoint (currently deployed) +- **Encryption:** KMS encryption supported +- **Access Pattern:** Heavy read/write from Lambda and ECS + +#### **RDS (Relational Database Service)** +- **Usage:** PostgreSQL 15.12 for registry data +- **Location:** `t4/template/database.py` +- **Configuration:** + - Multi-AZ optional + - Storage encryption enabled + - Private subnet deployment + - CloudWatch Logs export (upgrade logs) +- **Port:** 5432 (internal only) +- **Network:** Private subnets, no internet access needed + +--- + +### 3. Search & Database + +#### **ElasticSearch (OpenSearch Service)** +- **Usage:** Full-text search and indexing for objects and packages +- **Location:** `t4/template/search.py` +- **Configuration:** + - VPC or public deployment + - Multi-AZ with zone awareness + - Encryption at rest and in-transit + - CloudWatch logging optional +- **Port:** 443 (HTTPS) +- **Network:** Private subnet deployment recommended +- **Access:** Lambda functions and ECS tasks + +--- + +### 4. 
Messaging & Event Services + +#### **SQS (Simple Queue Service)** +- **Usage:** + - Search indexing queues + - Package events queue + - Dead letter queues + - Event batching for Lambda +- **Location:** `t4/template/search.py`, `events.py` +- **Features:** Visibility timeout, DLQ, encryption +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **SNS (Simple Notification Service)** +- **Usage:** + - Canary failure notifications (email) + - S3 bucket event notifications + - Topic-based messaging +- **Location:** `t4/template/sns_kms.py`, `status/canaries.py` +- **Encryption:** KMS-encrypted topics +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **EventBridge (CloudWatch Events)** +- **Usage:** + - S3 to EventBridge event conversion + - Scheduled events (canaries) + - Service event routing + - Synthetics state changes +- **Location:** `t4/template/events.py`, `s3_sns_to_eventbridge.py` +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +--- + +### 5. 
Networking & Load Balancing + +#### **VPC (Virtual Private Cloud)** +- **Components:** + - Multiple subnets (public, private, intra) + - NAT Gateway (can be replaced by TGW) + - Internet Gateway + - Security groups + - Route tables + - VPC Endpoints +- **Location:** `t4/template/network.py` +- **Current Architecture:** Private subnets → NAT Gateway → Internet Gateway + +#### **Application Load Balancer (ALB)** +- **Usage:** + - HTTPS/HTTP routing + - Path-based routing for services + - Health checks + - Private and public listeners +- **Services Routed:** + - Priority 24: MCP server + - Priority 25: Catalog + - Priority 26: Benchling + - Others: Registry, S3 proxy +- **Ports:** 80 (redirects to 443), 443 (HTTPS) +- **Location:** `t4/template/network.py` + +#### **Service Discovery (AWS Cloud Map)** +- **Usage:** Private DNS namespace for ECS services +- **Services Registered:** + - registry + - mcp + - benchling + - s3-proxy + - catalog +- **DNS TTL:** 10 seconds +- **Location:** `t4/template/dns.py` +- **Network:** Internal VPC only + +#### **VPC Endpoints** +- **Currently Deployed:** + - S3 Gateway Endpoint (for private S3 access) + - API Gateway Interface Endpoint (optional, if `api_gateway_in_vpc=true`) +- **Available but Not Deployed:** See recommendations section + +--- + +### 6. 
Security & Identity + +#### **IAM (Identity & Access Management)** +- **Usage:** + - Lambda execution roles + - ECS task roles + - Service-to-service permissions + - User access policies (read/write/QPE) + - Cross-service assume role policies +- **Location:** All template files +- **Key Roles:** + - Lambda execution roles + - ECS task roles + - Database accessor roles + - User roles (QPE, Read, Write) + +#### **KMS (Key Management Service)** +- **Usage:** + - SNS topic encryption + - S3 bucket encryption + - RDS database encryption + - Service authentication (RSA_4096 for JWT signing) +- **Location:** `t4/template/sns_kms.py`, multiple files +- **VPC Endpoint:** ✅ Interface Endpoint available + +#### **WAF v2 (Web Application Firewall)** +- **Usage:** + - ALB protection + - Geo-blocking (optional) + - Rate-based rules + - Account takeover prevention (ATP) + - Account creation fraud prevention (ACFP) +- **Location:** `t4/template/waf.py` + +#### **ACM (AWS Certificate Manager)** +- **Usage:** SSL/TLS certificates for ALB +- **Location:** `t4/template/s3_proxy.py` + +--- + +### 7. Logging & Monitoring + +#### **CloudWatch Logs** +- **Usage:** + - ECS container logs + - Lambda function logs + - ElasticSearch logs + - Audit trail logs + - Synthetics canary logs + - ALB access logs +- **Retention:** 90 days (configurable via `LOG_RETENTION_DAYS`) +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **CloudWatch (Metrics & Alarms)** +- **Usage:** + - CPU/memory metrics + - Request count + - Latency tracking + - Custom metrics +- **VPC Endpoint:** ✅ Interface Endpoint available (`monitoring` service, separate from the CloudWatch Logs `logs` endpoint; not deployed by default) + +#### **CloudWatch Synthetics** +- **Usage:** + - Canary tests for catalog + - Bucket access validation + - Package push/search testing + - Scheduled monitoring (hourly) +- **Location:** `t4/template/status/canaries.py` +- **Alerts:** SNS notifications on failure + +--- + +### 8.
Analytics & Query Services + +#### **Athena** +- **Usage:** + - Analytics queries on S3 data + - Audit trail querying + - User-provisioned databases +- **Location:** `t4/template/audit_trail.py`, `analytics.py`, `user_athena.py` +- **Query Results:** Stored in S3 +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **Glue (Data Catalog)** +- **Usage:** + - Database and table definitions + - Metadata catalog for audit/analytics + - Schema management +- **Location:** `t4/template/audit_trail.py`, `analytics.py` +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **Kinesis Data Firehose** +- **Usage:** + - Audit trail delivery stream + - Extended S3 destination + - Lambda-based data transformation + - Partitioned delivery +- **Location:** `t4/template/audit_trail.py` +- **Destination:** S3 with partitioning +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +#### **CloudTrail** +- **Usage:** Object access tracking for analytics +- **Location:** `t4/template/analytics.py` +- **Features:** + - Multi-region trail + - S3 event recording + - Optional (can use existing trail) +- **Storage:** S3 bucket + +--- + +### 9. Configuration & Deployment + +#### **CloudFormation** +- **Usage:** Infrastructure as Code deployment +- **Location:** Entire deployment architecture +- **Stack Management:** Template generation via CDK + +#### **SSM Parameter Store** +- **Usage:** Indexing per-bucket configurations +- **Location:** `t4/template/search.py` +- **VPC Endpoint:** ✅ Interface Endpoint available (not deployed by default) + +--- + +## External Services (Non-AWS) + +### 1. 
Telemetry & Analytics + +#### **Quilt Telemetry Service** +- **URL:** `https://telemetry.quiltdata.cloud/Prod/metrics/installer` +- **Location:** `installer/quilt_stack_installer/session_log.py` +- **Purpose:** Installer usage metrics +- **Data Sent:** + - Session ID + - Installation events + - CloudFormation stack events + - Platform info +- **Optional:** ✅ Can be disabled via `DISABLE_QUILT_TELEMETRY` env var +- **Network Requirement:** HTTPS (443) to external service + +#### **Mixpanel** +- **Configuration:** Token in environment (`constants["mixpanel"]`) +- **Purpose:** Client-side analytics for catalog UI +- **Used By:** Web catalog, registry container +- **Optional:** ✅ Can be disabled +- **Network Requirement:** HTTPS (443) to mixpanel.com + +--- + +### 2. Third-Party Authentication (SSO) + +All SSO providers are **optional** and can be disabled: + +#### **Google OAuth** +- **Location:** `t4/template/containers.py`, `parameters.py` +- **Environment Variables:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET` +- **Purpose:** Social sign-in +- **Network Requirement:** HTTPS (443) to accounts.google.com +- **Optional:** ✅ Yes + +#### **Azure AD (Microsoft Entra)** +- **Environment Variables:** `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_BASE_URL` +- **Purpose:** Enterprise SSO +- **Network Requirement:** HTTPS (443) to login.microsoftonline.com +- **Optional:** ✅ Yes + +#### **Okta** +- **Environment Variables:** `OKTA_CLIENT_ID`, `OKTA_CLIENT_SECRET` +- **Purpose:** Enterprise SSO +- **Network Requirement:** HTTPS (443) to customer's Okta domain +- **Optional:** ✅ Yes + +#### **OneLogin** +- **Environment Variables:** `ONELOGIN_CLIENT_ID`, `ONELOGIN_CLIENT_SECRET` +- **Purpose:** Enterprise SSO +- **Network Requirement:** HTTPS (443) to api.onelogin.com +- **Optional:** ✅ Yes + +--- + +### 3. 
Container Image Registries + +#### **AWS ECR (Elastic Container Registry)** +- **Account (Quilt Images):** `709825985650` (Marketplace) +- **Account (Custom):** Customer account +- **Region:** `us-east-1` (Marketplace), customer region (custom) +- **Repositories:** + - `quilt-data/quilt-payg-*` (pay-as-you-go) + - `quiltdata/catalog` + - `quiltdata/nginx` + - `quiltdata/registry` + - `quiltdata/s3-proxy` + - `quiltdata/voila` (optional) + - `quiltdata/mcp` + - `quiltdata/benchling` (optional) +- **Network Requirement:** HTTPS (443) to ECR API and S3 (for image layers) +- **VPC Endpoint:** ✅ ECR Interface Endpoint available +- **Required:** ✅ Yes - ECS tasks must pull images + +#### **Benchling Special Case** +- **Account:** `712023778557` (Quilt central) +- **Region:** `us-east-1` +- **Repository:** `quiltdata/benchling` +- **Full URI:** `712023778557.dkr.ecr.us-east-1.amazonaws.com/quiltdata/benchling:latest` +- **Used For:** Benchling webhook integration service +- **Note:** Hardcoded to central ECR account + +--- + +### 4. External APIs + +#### **Benchling API** (Optional) +- **Location:** `t4/template/benchling.py` +- **Purpose:** LIMS integration +- **Access:** Customer's Benchling instance +- **Ports:** 443 (HTTPS) +- **Optional:** ✅ Yes - only if Benchling integration enabled +- **Network:** Can be internal (VPC) or external + +#### **MCP Server External** (Optional) +- **Configuration:** `RemoteMCPUrl` parameter +- **Location:** `t4/template/parameters.py` +- **Purpose:** Model Context Protocol for AI +- **Optional:** ✅ Yes - only if external MCP configured +- **Network:** HTTPS (443) to configured endpoint + +--- + +### 5. 
Email Services (Optional) + +#### **SMTP Server** +- **Configuration:** `EMAIL_SERVER` environment variable +- **Location:** `t4/template/containers.py` +- **Purpose:** Email notifications +- **Optional:** ✅ Yes - only if email configured +- **Network:** SMTP ports (25/465/587) + +--- + +## Network Architecture Analysis + +### Current Architecture (NAT Gateway) + +``` +Private Subnet (Lambda/ECS) + ↓ +NAT Gateway + ↓ +Internet Gateway + ↓ +Internet (external services, ECR, SSO, telemetry) +``` + +### Proposed Architecture (Transit Gateway) + +``` +Private Subnet (Lambda/ECS) + ↓ +Transit Gateway + ↓ +Corporate Network/Firewall + ↓ +Internet (external services, ECR, SSO, telemetry) +``` + +### Hybrid Architecture (Recommended) + +``` +Private Subnet (Lambda/ECS) + ├─→ VPC Endpoints → AWS Services (S3, SQS, SNS, CloudWatch, etc.) + └─→ Transit Gateway → Corporate Network → Internet (ECR, SSO, telemetry) +``` + +--- + +## Egress Requirements Summary + +### Required External Access + +| Destination | Port | Purpose | Optional? | +|-------------|------|---------|-----------| +| ECR API (*.amazonaws.com) | 443 | Pull Docker images | ❌ Required | +| S3 (*.amazonaws.com) | 443 | ECR image layers | ❌ Required (or use VPC endpoint) | + +### Optional External Access + +| Destination | Port | Purpose | Optional? 
| +|-------------|------|---------|-----------| +| telemetry.quiltdata.cloud | 443 | Usage metrics | ✅ Yes | +| accounts.google.com | 443 | Google OAuth | ✅ Yes | +| login.microsoftonline.com | 443 | Azure AD | ✅ Yes | +| *.okta.com | 443 | Okta SSO | ✅ Yes | +| api.onelogin.com | 443 | OneLogin SSO | ✅ Yes | +| mixpanel.com | 443 | Analytics | ✅ Yes | +| Customer SMTP server | 25/465/587 | Email | ✅ Yes | +| Customer Benchling | 443 | LIMS integration | ✅ Yes | +| Customer MCP server | 443 | AI integration | ✅ Yes | + +### Internal (VPC-Only) Access + +| Service | Port | Communication | +|---------|------|---------------| +| RDS PostgreSQL | 5432 | Lambda/ECS → Database | +| ElasticSearch | 443 | Lambda/ECS → Search | +| ALB | 80/443 | Internet → Services | +| Service Discovery | 53 | ECS → ECS (DNS) | +| ECS Services | Various | Internal service mesh | + +--- + +## Recommended VPC Endpoint Configuration + +For Transit Gateway routing with minimal internet egress, deploy these VPC Interface Endpoints: + +### Tier 1: Essential (Recommended) + +| Service | Endpoint Type | Cost/Month (approx) | Benefit | +|---------|---------------|---------------------|---------| +| **S3** | Gateway | Free | Already deployed ✅ | +| **CloudWatch Logs** | Interface | $7 + data | Essential for logging | +| **ECR API** | Interface | $7 + data | Docker image pulls | +| **ECR Docker** | Interface | $7 + data | Docker image layers | +| **SQS** | Interface | $7 + data | Message queuing | +| **SNS** | Interface | $7 + data | Notifications | + +**Tier 1 Cost:** ~$35/month + data transfer + +### Tier 2: High Value (Strongly Recommended) + +| Service | Endpoint Type | Cost/Month (approx) | Benefit | +|---------|---------------|---------------------|---------| +| **Lambda** | Interface | $7 + data | Lambda management | +| **ECS** | Interface | $7 + data | ECS task management | +| **EventBridge** | Interface | $7 + data | Event routing | +| **KMS** | Interface | $7 + data | Encryption operations 
| +| **API Gateway** | Interface | $7 + data | API calls | + +**Tier 2 Cost:** ~$35/month + data transfer + +### Tier 3: Analytics & Management (Optional) + +| Service | Endpoint Type | Cost/Month (approx) | Benefit | +|---------|---------------|---------------------|---------| +| **Athena** | Interface | $7 + data | Analytics queries | +| **Glue** | Interface | $7 + data | Data catalog | +| **Kinesis Firehose** | Interface | $7 + data | Stream delivery | +| **SSM** | Interface | $7 + data | Parameter Store | +| **CloudFormation** | Interface | $7 + data | Stack updates | + +**Tier 3 Cost:** ~$35/month + data transfer + +### Total VPC Endpoint Cost Estimate +- **Tier 1 only:** ~$35/month + data +- **Tier 1 + 2:** ~$70/month + data +- **All tiers:** ~$105/month + data +- **Data transfer:** Typically $0.01/GB (far cheaper than NAT Gateway at $0.045/GB) + +**Cost Comparison:** +- NAT Gateway: ~$32/month base + $0.045/GB data +- VPC Endpoints (Tier 1+2): ~$70/month + $0.01/GB data +- **Break-even:** ~1,100 GB/month of traffic (the point where $32 + $0.045/GB equals $70 + $0.01/GB) + +--- + +## Transit Gateway Routing Configuration + +### Routing Rules Required + +#### Route Table for Private Subnets + +``` +Destination Target Purpose +----------------------------------------------------------- +10.0.0.0/16 Local Intra-VPC communication +pl-xxxxx (S3) vpce-xxxxx S3 via VPC Gateway Endpoint +0.0.0.0/0 tgw-xxxxx All other traffic via TGW +``` + +#### Services Requiring External Routing via TGW + +1. **ECR Image Pulls** (if not using ECR VPC endpoints) + - Destination: `*.ecr.us-east-1.amazonaws.com`, `*.s3.amazonaws.com` + - Protocol: HTTPS (443) + - Frequency: On deployment/task start + +2. **Quilt Telemetry** (optional, can disable) + - Destination: `telemetry.quiltdata.cloud` + - Protocol: HTTPS (443) + - Frequency: On installer events + +3. **SSO Providers** (optional, if configured) + - Destination: Various (google.com, microsoftonline.com, etc.) 
+ - Protocol: HTTPS (443) + - Frequency: On user authentication + +#### Services NOT Requiring External Routing + +✅ Can use VPC Endpoints: +- S3 +- CloudWatch Logs +- SQS, SNS +- EventBridge +- Athena, Glue, Firehose +- API Gateway +- KMS +- SSM + +✅ Entirely internal: +- RDS PostgreSQL +- ElasticSearch +- Service Discovery (Cloud Map) +- ECS service-to-service communication + +--- + +## Testing & Validation Plan + +### Phase 1: VPC Endpoint Validation +1. Deploy Tier 1 VPC endpoints +2. Test Lambda S3 access via gateway endpoint +3. Test ECS CloudWatch Logs via interface endpoint +4. Verify no traffic to NAT Gateway for AWS services + +### Phase 2: TGW Routing Validation +1. Update route tables to point 0.0.0.0/0 to TGW +2. Test ECR image pulls via TGW +3. Test external SSO authentication (if configured) +4. Verify telemetry calls route via TGW (if enabled) + +### Phase 3: Functional Testing +1. Deploy full Quilt stack +2. Test package push/pull operations +3. Test search indexing +4. Test catalog access +5. Monitor CloudWatch Logs for connection errors +6. Verify no connection timeouts + +### Phase 4: Performance Validation +1. Measure S3 operation latency +2. Measure ECR pull times +3. Compare against NAT Gateway baseline +4. Validate throughput for large file transfers + +--- + +## Recommendations for Customer + +### 1. Deploy Essential VPC Endpoints (Tier 1) + +Deploy these endpoints to eliminate most NAT/TGW routing: +- ✅ S3 Gateway Endpoint (already deployed) +- CloudWatch Logs +- ECR API +- ECR Docker +- SQS +- SNS + +**Benefit:** Eliminates NAT Gateway cost for 90%+ of AWS API traffic + +### 2. Configure TGW Routing for Remaining Traffic + +Point 0.0.0.0/0 to Transit Gateway for: +- ECR image pulls (or use ECR VPC endpoints from Tier 1) +- External SSO (if needed) +- Telemetry (or disable) + +**Benefit:** Centralized routing control, compliance with network policies + +### 3. 
Consider Disabling Optional External Services + +To minimize TGW internet routing requirements: +- Set `DISABLE_QUILT_TELEMETRY=true` (no telemetry) +- Don't configure external SSO (use IAM-based auth) +- Use internal MCP/Benchling (if applicable) + +**Benefit:** Reduces external dependencies to just ECR + +### 4. Use ECR VPC Endpoints for Fully Private Architecture + +Deploy ECR API and ECR Docker VPC endpoints: +- Zero internet routing needed +- All traffic stays within AWS network +- Eliminates TGW internet routing entirely + +**Benefit:** Fully private architecture, no firewall rules for internet access + +### 5. Monitoring & Validation + +Set up CloudWatch metrics and alerts for: +- VPC endpoint connection counts +- Failed DNS resolutions +- ECS task launch failures +- Lambda timeout errors + +**Benefit:** Early detection of routing issues + +--- + +## Implementation Checklist for Customer + +### Pre-Deployment +- [ ] Inventory existing VPC endpoints in target VPC +- [ ] Confirm TGW attachment to target VPC +- [ ] Verify TGW routing to internet (if needed) +- [ ] Plan DNS resolution for VPC endpoints +- [ ] Review security group rules for VPC endpoints + +### VPC Endpoint Deployment +- [ ] Deploy S3 Gateway Endpoint (if not exists) +- [ ] Deploy CloudWatch Logs Interface Endpoint +- [ ] Deploy ECR API Interface Endpoint +- [ ] Deploy ECR Docker Interface Endpoint +- [ ] Deploy SQS Interface Endpoint +- [ ] Deploy SNS Interface Endpoint +- [ ] Enable Private DNS for all interface endpoints +- [ ] Update security groups to allow VPC endpoint access + +### Route Table Configuration +- [ ] Backup existing route tables +- [ ] Update private subnet route tables: + - [ ] Keep local VPC routes + - [ ] Keep S3 Gateway Endpoint route + - [ ] Change 0.0.0.0/0 target from NAT Gateway to TGW +- [ ] Verify route table associations + +### Quilt Configuration +- [ ] Set `DISABLE_QUILT_TELEMETRY=true` (optional) +- [ ] Configure ECR repository (customer account or Quilt 
account) +- [ ] Decide on SSO providers (or disable) +- [ ] Configure `lambdas_in_vpc=true` +- [ ] Configure `api_gateway_in_vpc` (optional) + +### Testing +- [ ] Deploy Quilt stack +- [ ] Verify ECS task launches successfully +- [ ] Verify ECR image pulls work +- [ ] Test S3 bucket access +- [ ] Test search indexing +- [ ] Test package push/pull +- [ ] Monitor CloudWatch Logs for errors +- [ ] Performance test: measure latency vs baseline + +### Post-Deployment +- [ ] Remove NAT Gateway (if no longer needed) +- [ ] Update documentation +- [ ] Set up monitoring/alerts +- [ ] Schedule performance review + +--- + +## Security Considerations + +### Private Architecture Benefits +✅ No public IPs for Lambda/ECS +✅ All AWS API calls via private network +✅ Reduced attack surface +✅ Compliance with network isolation policies + +### Transit Gateway Security +⚠️ Ensure TGW has proper routing rules +⚠️ Firewall rules for external access +⚠️ Monitor TGW traffic for anomalies +⚠️ Regularly audit TGW route tables + +### VPC Endpoint Security +✅ Private DNS resolves AWS service names to endpoint IPs inside the VPC +✅ Endpoint policies can restrict access +✅ Security groups control endpoint access +⚠️ Ensure endpoint security groups allow required traffic + +--- + +## Cost Analysis + +### Current Architecture (NAT Gateway) +- **NAT Gateway:** $32.85/month (730 hours × $0.045) +- **Data Processing:** $0.045/GB +- **Total (1 TB/month):** $32.85 + $46.08 = **$78.93/month** + +### Proposed Architecture (TGW + VPC Endpoints) +- **TGW Attachment:** $36.50/month (730 hours × $0.05) +- **TGW Data:** $0.02/GB +- **VPC Endpoints (Tier 1):** $35/month + $0.01/GB +- **Total (1 TB/month):** $36.50 + $20.48 + $35 + $10.24 = **$102.22/month** + +### Fully Private Architecture (TGW + All Endpoints) +- **TGW Attachment:** $36.50/month (minimal traffic) +- **VPC Endpoints (All tiers):** $105/month + $0.01/GB +- **Total (1 TB/month):** $36.50 + $105 + $10.24 = **$151.74/month** + +### Cost Observations +- ⚠️ TGW + VPC endpoints 
cost more than NAT Gateway alone +- ✅ However, TGW cost is **shared** across all VPCs (sunk cost for customer) +- ✅ VPC endpoints eliminate data charges for AWS API calls +- ✅ Marginal cost for Quilt is just VPC endpoints (~$35-105/month) +- ✅ For multi-VPC environments, TGW + VPC endpoints is more cost-effective + +--- + +## Conclusion + +### Can Customer Use Transit Gateway? **YES ✅** + +Quilt can successfully operate with Transit Gateway routing instead of NAT Gateway, with the following configuration: + +1. **Deploy Tier 1 VPC Endpoints** to eliminate most external routing +2. **Route 0.0.0.0/0 via TGW** for remaining traffic (ECR pulls, optional SSO) +3. **Optionally disable external services** (telemetry, SSO) to minimize TGW internet routing +4. **For fully private architecture**, deploy all VPC endpoints and eliminate internet routing entirely + +### Benefits for Customer +- ✅ Compliance with network security policies +- ✅ Centralized routing control via TGW +- ✅ Eliminates per-VPC NAT Gateway costs +- ✅ Consistent with enterprise network architecture +- ✅ All Quilt functionality preserved + +### Next Steps +1. Review VPC endpoint requirements with customer's network team +2. Provide CDK template modifications for VPC endpoint deployment +3. Schedule deployment and testing window +4. 
Perform phased rollout with validation at each step + +--- + +**Audit Completed By:** Engineering Team +**Review Date:** February 2, 2026 +**Next Review:** Post-deployment validation + diff --git a/custom-gateway/04-gateway-workaround.md b/custom-gateway/04-gateway-workaround.md new file mode 100644 index 0000000..4afd9dc --- /dev/null +++ b/custom-gateway/04-gateway-workaround.md @@ -0,0 +1,423 @@ +# Customer Gateway Workaround - Simplest Fix + +**Date:** February 2, 2026 +**For:** Customer Organization +**Goal:** Use Transit Gateway instead of NAT Gateway with minimal code changes + +--- + +## TL;DR - The Simple Answer + +**Customer can use their Transit Gateway with ZERO code changes to Quilt!** + +The key is that the customer is already configured with `network.vpn: true` in their variant, which sets `existing_vpc: true`. This means: + +✅ **Customer controls their own routing via their own route tables** +✅ **Quilt doesn't create NAT Gateway when `existing_vpc: true`** +✅ **Just provide TGW-configured subnets as parameters** + +--- + +## Current Customer Configuration + +From customer's variant files (`customer-prod.yaml`, `customer-staging.yaml`, `customer-dev.yaml`): + +```yaml +factory: + network: + vpn: true # This sets existing_vpc: true + deployment: tf +``` + +This configuration means: +- Customer provides their own VPC +- Customer provides their own subnets +- Customer controls routing via their own route tables +- **Quilt does NOT create NAT Gateway** + +--- + +## What Customer Needs to Do (Zero Code Changes Required) + +### Step 1: Prepare Subnets with TGW Routing + +Create or use existing subnets in your VPC with route tables that look like: + +``` +Destination Target Notes +----------------------------------------------------------- +10.0.0.0/16 local Intra-VPC traffic +0.0.0.0/0 tgw-xxxxx All internet via TGW +``` + +You need three types of subnets: + +1. 
**Private Subnets** (for ECS tasks and Lambda functions) + - Route 0.0.0.0/0 → TGW + - 2 subnets in different AZs + - Example: 10.0.1.0/24, 10.0.2.0/24 + +2. **Intra Subnets** (for RDS and ElasticSearch) + - No internet routing at all + - 2 subnets in different AZs + - Example: 10.0.3.0/24, 10.0.4.0/24 + +3. **User/Public Subnets** (for load balancer) + - For VPN access: Private subnets (same as #1) + - For internet: Public subnets with route 0.0.0.0/0 → IGW + +### Step 2: Deploy VPC Endpoints (Recommended) + +To minimize TGW internet traffic, deploy these VPC Interface Endpoints: + +**Essential (Tier 1):** +- `com.amazonaws.us-east-1.s3` (Gateway - free!) +- `com.amazonaws.us-east-1.logs` (CloudWatch Logs) +- `com.amazonaws.us-east-1.ecr.api` (ECR API) +- `com.amazonaws.us-east-1.ecr.dkr` (ECR Docker) +- `com.amazonaws.us-east-1.sqs` +- `com.amazonaws.us-east-1.sns` + +With these endpoints, most AWS API calls bypass TGW entirely and stay within AWS network. + +### Step 3: Provide Parameters During Deployment + +When deploying the Quilt stack, provide these parameters: + +```bash +# VPC Parameters +VPC=vpc-xxxxx # Your VPC ID +Subnets=subnet-xxx1,subnet-xxx2 # Private subnets with TGW routing +IntraSubnets=subnet-xxx3,subnet-xxx4 # Intra subnets (no internet) +UserSubnets=subnet-xxx1,subnet-xxx2 # Same as Subnets for VPN access +UserSecurityGroup=sg-xxxxx # Security group for load balancer ingress +``` + +**Important:** The `Subnets` parameter description says "Must route traffic to public AWS services (e.g. via NAT Gateway)" but this is just a **comment**, not a technical requirement. 
The actual requirement is: + +> "Subnets must be able to reach AWS services" + +This can be satisfied via: +- ✅ NAT Gateway (Quilt's default) +- ✅ Transit Gateway (customer's preferred) +- ✅ VPC Endpoints (most private) + +### Step 4: Configure External Services (Optional) + +If you want to minimize TGW internet routing: + +**Option A: Disable Telemetry** +```bash +export DISABLE_QUILT_TELEMETRY=true +``` + +**Option B: Skip External SSO** +- Don't configure Google/Azure/Okta/OneLogin credentials +- Use IAM-based authentication instead + +**Option C: Use VPC Endpoints for Everything** +- Deploy all VPC endpoints from Tier 1 + 2 (see 03-gateway-audit.md) +- Only ECR pulls will need external routing (if using Quilt ECR) + +--- + +## Code Analysis: Why This Works + +### When `existing_vpc: true` + +From `t4/template/network.py` line 246: + +```python +if env["options"]["existing_vpc"]: + vpc_id = Ref("VPC") + subnet_ids = Ref("Subnets") + # ... Quilt uses YOUR subnets, YOUR route tables +``` + +**Quilt doesn't create NAT Gateway at all!** + +### When `existing_vpc: false` + +From `t4/template/network.py` lines 393-399: + +```python +nat_gateway = ec2.NatGateway( + f"NatGateway{name}", + template=cft, + AllocationId=GetAtt(elastic_ip, "AllocationId"), + ConnectivityType="public", + SubnetId=public_subnet.ref(), +) +``` + +**Quilt creates NAT Gateway ONLY when it creates the VPC itself.** + +--- + +## Routing Architecture Comparison + +### Current Assumption (NAT Gateway) + +``` +ECS Task/Lambda → Private Subnet → NAT Gateway → Internet Gateway → AWS Services +``` + +### Customer's Transit Gateway Setup + +``` +ECS Task/Lambda → Private Subnet → Transit Gateway → Corporate Network → AWS Services + ↓ + VPC Endpoints (for most AWS services) +``` + +### Recommended Hybrid (Best Performance) + +``` +ECS Task/Lambda → Private Subnet → VPC Endpoints → AWS Services (S3, SQS, etc.) 
+ ↘ Transit Gateway → Corporate Network → Internet + (ECR, optional SSO) +``` + +--- + +## What Needs Internet Access via TGW + +### Required (if not using VPC endpoints): + +1. **ECR Image Pulls** + - `*.ecr.us-east-1.amazonaws.com` (API) + - `*.s3.amazonaws.com` (image layers) + - Or deploy ECR VPC endpoints to avoid this + +2. **AWS Service APIs** + - S3, CloudWatch, SQS, SNS, etc. + - Or deploy VPC endpoints to avoid this + +### Optional (can be disabled): + +3. **Quilt Telemetry** + - `telemetry.quiltdata.cloud` + - Disable with `DISABLE_QUILT_TELEMETRY=true` + +4. **SSO Providers** + - `accounts.google.com`, `login.microsoftonline.com`, etc. + - Don't configure SSO to avoid this + +--- + +## Terraform vs CloudFormation Note + +Customer is using `deployment: tf` (Terraform), which means they're likely already managing their own VPC infrastructure via Terraform. + +**Recommendation:** +1. Customer's Terraform manages: VPC, subnets, route tables, TGW attachment, VPC endpoints +2. Quilt's Terraform references: Existing VPC and subnets via parameters +3. 
No conflict, clean separation of concerns + +--- + +## Testing Checklist + +### Phase 1: Pre-Deployment Validation + +- [ ] Verify TGW attachment to target VPC +- [ ] Verify route tables point 0.0.0.0/0 to TGW +- [ ] Verify TGW routes to internet (or corporate firewall → internet) +- [ ] Deploy VPC endpoints (at least S3 Gateway) +- [ ] Test DNS resolution from private subnets + +### Phase 2: Deployment + +- [ ] Deploy Quilt with `existing_vpc: true` +- [ ] Provide TGW-configured subnets as parameters +- [ ] Monitor CloudFormation/Terraform logs +- [ ] Verify no NAT Gateway created + +### Phase 3: Functional Testing + +- [ ] Verify ECS tasks launch successfully +- [ ] Check CloudWatch Logs for errors +- [ ] Test ECR image pulls (check ECS task startup time) +- [ ] Test S3 access (upload/download packages) +- [ ] Test search indexing +- [ ] Test catalog access + +### Phase 4: Network Validation + +- [ ] Verify no traffic to NAT Gateway (shouldn't exist) +- [ ] Verify traffic to VPC endpoints (if deployed) +- [ ] Verify traffic to TGW (for internet-bound requests) +- [ ] Check TGW metrics in CloudWatch +- [ ] Validate no connection timeouts + +--- + +## Troubleshooting + +### Issue: ECS tasks fail to start + +**Possible Cause:** Cannot pull Docker images from ECR +**Solution:** +1. Deploy ECR VPC endpoints (`ecr.api` and `ecr.dkr`) +2. Or verify TGW routes to `*.ecr.us-east-1.amazonaws.com` +3. Check security groups allow HTTPS (443) from private subnets + +### Issue: Lambda timeout errors + +**Possible Cause:** Cannot reach AWS services +**Solution:** +1. Deploy VPC endpoints for services Lambda needs (S3, SQS, SNS) +2. Or verify TGW routes to `*.amazonaws.com` +3. Check Lambda VPC configuration and security groups + +### Issue: Search indexing fails + +**Possible Cause:** ElasticSearch in VPC cannot communicate +**Solution:** +1. Verify ElasticSearch is in intra subnets (no internet needed) +2. 
Check security groups allow traffic from Lambda/ECS to ElasticSearch +3. ElasticSearch should NOT need TGW or internet access + +### Issue: Database connection errors + +**Possible Cause:** RDS in wrong subnets +**Solution:** +1. Verify RDS is in intra subnets (no internet needed) +2. RDS should NEVER need TGW or internet access +3. Check security groups allow traffic from ECS/Lambda to RDS + +--- + +## Cost Comparison + +### Option 1: NAT Gateway (Quilt Default) +- NAT Gateway: $32.85/month (730 hours × $0.045) +- Data Processing: $0.045/GB +- **Total (1 TB/month):** $32.85 + $46.08 = **$78.93/month** + +### Option 2: TGW Only (Customer's Request) +- TGW Attachment: $36.50/month (730 hours × $0.05) +- TGW Data: $0.02/GB +- **Total (1 TB/month):** $36.50 + $20.48 = **$56.98/month** +- **But:** TGW cost is shared across all VPCs (sunk cost) +- **Marginal cost for Quilt:** Just the data transfer (~$20/month) + +### Option 3: TGW + VPC Endpoints (Recommended) +- TGW Attachment: $36.50/month (shared/sunk) +- VPC Endpoints (Tier 1): $35/month + $0.01/GB +- TGW Data (minimal): ~$2-5/month (only ECR/telemetry) +- **Total (1 TB/month):** $36.50 + $35 + $10.24 + $2 = **$83.74/month** +- **Marginal cost for Quilt:** ~$47/month (VPC endpoints + minimal TGW data) + +**For Customer:** +- TGW attachment cost is already paid (shared resource) +- Only new cost is VPC endpoints +- **Net new cost: ~$35-47/month** (much cheaper than NAT Gateway data charges at scale) + +--- + +## Recommended Action Plan + +### Immediate (This Week) + +1. **Confirm Customer's Current Setup** + - Are they using `existing_vpc: true`? (Yes, based on `network.vpn: true`) + - What subnets are they currently providing? + - Are those subnets routing through TGW already? + +2. **Deploy VPC Endpoints (Tier 1)** + - S3 Gateway Endpoint (free) + - CloudWatch Logs Interface Endpoint + - ECR API and ECR Docker Interface Endpoints + - SQS and SNS Interface Endpoints + +3. 
**Test Deployment** + - Deploy to dev environment first + - Verify all functionality works + - Monitor for connection issues + - Validate performance + +### Short-term (Next 2 Weeks) + +4. **Deploy to Staging** + - Use TGW-configured subnets + - Full functional testing + - Performance benchmarking + +5. **Document Learnings** + - Update customer's deployment documentation + - Create runbook for TGW deployments + - Share with other customers who might want this + +### Medium-term (Next Month) + +6. **Deploy to Production** + - After successful staging validation + - Monitor closely for first 48 hours + - Compare metrics to baseline + +7. **Product Enhancement** + - Update parameter descriptions to be less NAT Gateway-specific + - Add TGW example to documentation + - Consider adding VPC endpoint auto-deployment option + +--- + +## Documentation Updates Needed + +### In Template Parameter Descriptions + +**Current (misleading):** +``` +"List of private subnets for Quilt service containers. +Must route traffic to public AWS services (e.g. via NAT Gateway)." +``` + +**Better (accurate):** +``` +"List of private subnets for Quilt service containers. +Must have outbound connectivity to AWS services (via NAT Gateway, +Transit Gateway, or VPC Endpoints)." +``` + +### In Deployment Documentation + +Add section: +- "Using Transit Gateway Instead of NAT Gateway" +- "Enterprise Network Integration" +- "VPC Endpoint Configuration" + +--- + +## Key Takeaway for Customer + +**You don't need to modify any Quilt code or templates!** + +The solution is configuration-only: + +1. ✅ Use `existing_vpc: true` (you already have this via `network.vpn: true`) +2. ✅ Provide your TGW-configured private subnets as the `Subnets` parameter +3. ✅ Deploy VPC endpoints for optimal performance +4. ✅ Optionally disable telemetry and external SSO + +**That's it! No code changes required.** + +The parameter description saying "e.g. via NAT Gateway" is just an example, not a requirement. 
The actual requirement is "can reach AWS services" which your TGW + VPC endpoint setup satisfies. + +--- + +## Next Steps for Customer + +1. **Share your subnet IDs** that are configured with TGW routing +2. **Confirm which VPC endpoints** you've already deployed +3. **Test in dev environment** with your TGW-configured subnets +4. **Report any issues** so we can help troubleshoot + +We're here to help make this work smoothly! + +--- + +**Contact:** +- Quilt Engineering Team +- Simon Kohnstamm (support@quiltdata.com) +- Ernest (ernest@quilt.bio) diff --git a/custom-gateway/05-transit-gateway-howto.md b/custom-gateway/05-transit-gateway-howto.md new file mode 100644 index 0000000..14cf6c5 --- /dev/null +++ b/custom-gateway/05-transit-gateway-howto.md @@ -0,0 +1,964 @@ +# How to Deploy Quilt with Transit Gateway Routing + +**Audience:** Enterprise customers with existing Transit Gateway infrastructure +**Goal:** Deploy Quilt using Transit Gateway instead of NAT Gateway for outbound routing +**Difficulty:** Intermediate (requires AWS networking knowledge) + +--- + +## Overview + +This guide explains how to deploy Quilt in an existing VPC using AWS Transit Gateway for outbound connectivity instead of the default NAT Gateway configuration. This is common in enterprise environments where centralized routing and network security policies are managed through Transit Gateway. 
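
At the route-table level, the difference between the default and the Transit Gateway configuration comes down to which target carries the `0.0.0.0/0` route. The sketch below is a minimal illustration of that check, assuming route tables are given as dicts in the shape returned by the EC2 `DescribeRouteTables` API (a `Routes` list whose entries carry `DestinationCidrBlock` plus one of `TransitGatewayId`, `NatGatewayId`, or `GatewayId`). The function name `classify_default_route` and the sample tables are hypothetical and are not part of Quilt.

```python
from typing import Any, Dict


def classify_default_route(route_table: Dict[str, Any]) -> str:
    """Report which kind of target carries 0.0.0.0/0 in a route table.

    `route_table` is assumed to follow the EC2 DescribeRouteTables shape.
    Routes without a DestinationCidrBlock (for example prefix-list routes
    created by an S3 gateway endpoint) are skipped.
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("TransitGatewayId"):
            return "transit-gateway"
        if route.get("NatGatewayId"):
            return "nat-gateway"
        if str(route.get("GatewayId", "")).startswith("igw-"):
            return "internet-gateway"
        return "other"
    # No default route at all, which is what an intra subnet should have.
    return "none"


# Hypothetical sample data using the placeholder IDs from this guide.
private_rt = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationPrefixListId": "pl-xxxxx", "GatewayId": "vpce-xxxxx"},
        {"DestinationCidrBlock": "0.0.0.0/0", "TransitGatewayId": "tgw-xxxxx"},
    ]
}
intra_rt = {
    "Routes": [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]
}

print(classify_default_route(private_rt))
print(classify_default_route(intra_rt))
```

In a real environment the dicts would come from `aws ec2 describe-route-tables` output or an SDK call. The classification only inspects the default route and says nothing about what the Transit Gateway does with the traffic downstream.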
+ +### When to Use This Guide + +Use Transit Gateway routing when: +- ✅ You have an existing Transit Gateway hub-and-spoke architecture +- ✅ Your corporate policy requires all outbound traffic to route through TGW +- ✅ You want to centralize network routing and firewall policies +- ✅ You're deploying into an existing VPC with pre-configured routing + +### Prerequisites + +- Existing VPC with Transit Gateway attachment +- Understanding of AWS networking (VPC, subnets, route tables, security groups) +- Familiarity with Transit Gateway routing +- Access to deploy VPC endpoints +- Quilt deployment using `existing_vpc: true` configuration + +--- + +## Architecture Patterns + +### Pattern 1: Default Quilt Architecture (NAT Gateway) + +``` +┌─────────────────────────────────────────────────┐ +│ VPC (Created by Quilt) │ +│ │ +│ ┌──────────────┐ ┌─────────────┐ │ +│ │ ECS/Lambda │──────│ NAT Gateway │─────────┼──> Internet +│ │ (Private │ │ │ │ (AWS APIs, +│ │ Subnet) │ │ │ │ ECR, etc.) +│ └──────────────┘ └─────────────┘ │ +│ │ +└─────────────────────────────────────────────────┘ +``` + +**Characteristics:** +- Quilt creates VPC, subnets, and NAT Gateway +- Each AZ has its own NAT Gateway for high availability +- Cost: ~$32/month per NAT Gateway + $0.045/GB data transfer + +### Pattern 2: Transit Gateway Routing + +``` +┌─────────────────────────────────────────────────┐ +│ Customer VPC │ +│ │ +│ ┌──────────────┐ ┌─────────────┐ │ +│ │ ECS/Lambda │──────│ TGW │─────────┼──> Corporate Network +│ │ (Private │ │ Attachment │ │ └─> Firewall +│ │ Subnet) │ │ │ │ └─> Internet +│ └──────────────┘ └─────────────┘ │ +│ │ +└─────────────────────────────────────────────────┘ +``` + +**Characteristics:** +- Customer manages VPC, subnets, and routing +- All outbound traffic goes through TGW to corporate network +- TGW cost is shared across all VPCs +- Centralized security and routing policies + +### Pattern 3: Hybrid with VPC Endpoints (Recommended) + +``` 
+┌─────────────────────────────────────────────────┐ +│ Customer VPC │ +│ │ +│ ┌──────────────┐ │ +│ │ ECS/Lambda │──────┐ │ +│ │ (Private │ │ │ +│ │ Subnet) │ │ │ +│ └──────────────┘ │ │ +│ │ │ +│ ┌────┴────┐ │ +│ │ Route │ │ +│ │ Decision│ │ +│ └────┬────┘ │ +│ │ │ +│ ┌───────────────┼───────────────┐ │ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ VPC │ │ VPC │ │ TGW │───┼──> Internet +│ │ Endpoint │ │ Endpoint │ │ │ │ (minimal) +│ │ (S3) │ │ (ECR) │ │ │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ Most AWS API traffic External traffic │ +│ stays in AWS network via TGW │ +└─────────────────────────────────────────────────┘ +``` + +**Characteristics:** +- Best of both worlds: private AWS service access + TGW for internet +- 90%+ of traffic uses VPC endpoints (no TGW data charges) +- Only external services (ECR, telemetry, SSO) use TGW +- Optimal performance and security + +--- + +## Step-by-Step Implementation + +### Phase 1: Network Preparation + +#### 1.1 Create or Identify Subnets + +You need three types of subnets: + +**Private Subnets** (for ECS tasks and Lambda functions) +- Purpose: Run Quilt service containers and Lambda functions +- Routing: 0.0.0.0/0 → Transit Gateway +- Quantity: 2 subnets in different Availability Zones +- Example CIDRs: 10.0.1.0/24, 10.0.2.0/24 + +**Intra Subnets** (for RDS and ElasticSearch) +- Purpose: Database and search cluster (no internet access needed) +- Routing: No default route (local VPC only) +- Quantity: 2 subnets in different Availability Zones +- Example CIDRs: 10.0.3.0/24, 10.0.4.0/24 + +**User/Load Balancer Subnets** +- For VPN/internal access: Use private subnets (same as above) +- For public access: Public subnets with 0.0.0.0/0 → Internet Gateway +- Quantity: 2 subnets in different Availability Zones + +#### 1.2 Configure Route Tables + +**Route Table for Private Subnets:** +``` +Destination Target Notes +----------------------------------------------------------- +10.0.0.0/16 
local Intra-VPC communication +0.0.0.0/0 tgw-xxxxx All internet via TGW +``` + +**Route Table for Intra Subnets:** +``` +Destination Target Notes +----------------------------------------------------------- +10.0.0.0/16 local Intra-VPC only (no internet) +``` + +**Route Table for Public Subnets** (if using public load balancer): +``` +Destination Target Notes +----------------------------------------------------------- +10.0.0.0/16 local Intra-VPC communication +0.0.0.0/0 igw-xxxxx Internet via Internet Gateway +``` + +#### 1.3 Verify Transit Gateway Configuration + +Ensure your Transit Gateway is configured to route traffic: + +```bash +# Check TGW attachment +aws ec2 describe-transit-gateway-attachments \ + --filters "Name=vpc-id,Values=vpc-xxxxx" + +# Check TGW route table +aws ec2 describe-transit-gateway-route-tables \ + --filters "Name=transit-gateway-id,Values=tgw-xxxxx" +``` + +Verify TGW routes traffic to: +- Your corporate network/firewall +- Internet (directly or via firewall) +- DNS resolvers + +### Phase 2: Deploy VPC Endpoints (Strongly Recommended) + +Deploy VPC Interface Endpoints to minimize TGW internet traffic and improve performance. + +#### 2.1 Essential VPC Endpoints (Tier 1) + +These endpoints handle 90%+ of Quilt's AWS API traffic: + +**Deploy via Console:** +1. Go to VPC → Endpoints → Create Endpoint +2. Select service name +3. Choose your VPC +4. Select private subnets +5. Enable "Private DNS name" +6. Create security group allowing HTTPS (443) from private subnets + +**Deploy via CLI:** + +```bash +# S3 Gateway Endpoint (FREE!) 
+aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.s3 \ + --route-table-ids rtb-xxxxx rtb-yyyyy + +# CloudWatch Logs +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.logs \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# ECR API +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.ecr.api \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# ECR Docker (for image layers) +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.ecr.dkr \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# SQS +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.sqs \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# SNS +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.sns \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled +``` + +**Security Group for VPC Endpoints:** +``` +Ingress: + - Port 443, Source: Private subnet CIDRs (10.0.1.0/24, 10.0.2.0/24) +Egress: + - None required (endpoints are destination, not source) +``` + +**Estimated Cost:** ~$35/month + $0.01/GB (much cheaper than NAT Gateway's $0.045/GB) + +#### 2.2 Additional VPC Endpoints (Tier 2 - Optional but Recommended) + +```bash +# EventBridge (Events) +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.events \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + 
--security-group-ids sg-xxxxx \ + --private-dns-enabled + +# KMS (encryption operations) +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.kms \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# SSM Parameter Store +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.ssm \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled +``` + +**Estimated Cost:** Additional ~$35/month + $0.01/GB + +#### 2.3 Analytics VPC Endpoints (Tier 3 - Optional) + +If using Quilt's analytics features: + +```bash +# Athena +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.athena \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# Glue (Data Catalog) +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.glue \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled + +# Kinesis Firehose +aws ec2 create-vpc-endpoint \ + --vpc-id vpc-xxxxx \ + --service-name com.amazonaws.us-east-1.kinesis-firehose \ + --vpc-endpoint-type Interface \ + --subnet-ids subnet-xxxxx subnet-yyyyy \ + --security-group-ids sg-xxxxx \ + --private-dns-enabled +``` + +### Phase 3: Configure Quilt Deployment + +#### 3.1 Update Variant Configuration + +In your environment variant YAML file: + +```yaml +factory: + network: + vpn: true # Sets existing_vpc: true + vpc: theirs # Use customer-provided VPC (not applicable with vpn:true) + deployment: tf # or 'cf' for CloudFormation + +options: + existing_vpc: true # Implicit when network.vpn: true + network_version: "2.0" + lambdas_in_vpc: true + api_gateway_in_vpc: true # 
Requires VPC endpoint for API Gateway + elb_scheme: internal # For VPN access + # elb_scheme: internet-facing # For public access +``` + +#### 3.2 Prepare Deployment Parameters + +Create a parameters file or prepare CLI arguments: + +```yaml +# parameters.yaml +Parameters: + # Network Configuration + VPC: vpc-xxxxx + Subnets: + - subnet-private1-id + - subnet-private2-id + IntraSubnets: + - subnet-intra1-id + - subnet-intra2-id + UserSubnets: + - subnet-private1-id # Same as Subnets for VPN + - subnet-private2-id + # Or for public access: + # PublicSubnets: + # - subnet-public1-id + # - subnet-public2-id + + UserSecurityGroup: sg-xxxxx # Allow 443/80 from your users + + # VPC Endpoint for API Gateway (if api_gateway_in_vpc: true) + ApiGatewayVPCEndpoint: vpce-xxxxx + + # Database Configuration + DBUser: quilt_admin + DBPassword: + + # Certificates + CertificateArnELB: arn:aws:acm:us-east-1:xxxxx:certificate/xxxxx + + # Admin Configuration + AdminEmail: admin@yourcompany.com + QuiltWebHost: quilt.yourcompany.com +``` + +#### 3.3 Optional: Minimize External Dependencies + +To reduce TGW internet traffic, disable optional external services: + +**Disable Telemetry:** +Add to your environment configuration: +```yaml +# In deployment environment or container environment variables +DISABLE_QUILT_TELEMETRY: "true" +``` + +**Skip External SSO:** +Don't configure Google/Azure/Okta/OneLogin credentials. Use IAM-based authentication instead. + +**Use Local ECR:** +Set `local_ecr: true` in options to use your account's ECR instead of Quilt's central registry. 
+ +```yaml +options: + local_ecr: true +``` + +### Phase 4: Deploy and Validate + +#### 4.1 Deploy Quilt Stack + +**Using Terraform:** +```bash +cd deployment/t4 +make variant=your-variant-name +cd ../tf +terraform init +terraform plan +terraform apply +``` + +**Using CloudFormation:** +```bash +cd deployment/t4 +make variant=your-variant-name +aws cloudformation create-stack \ + --stack-name quilt-production \ + --template-body file://cloudformation.json \ + --parameters file://parameters.yaml \ + --capabilities CAPABILITY_IAM +``` + +#### 4.2 Monitor Deployment + +Watch for common issues: + +```bash +# Check CloudFormation events +aws cloudformation describe-stack-events \ + --stack-name quilt-production \ + --max-items 20 + +# Or Terraform output +terraform apply -no-color 2>&1 | tee deploy.log + +# Monitor ECS task launches +aws ecs list-tasks --cluster +aws ecs describe-tasks --cluster --tasks + +# Check CloudWatch Logs +aws logs tail /aws/ecs/ --follow +``` + +#### 4.3 Validate Network Connectivity + +**Test VPC Endpoints:** +```bash +# From a bastion or test instance in private subnet +nslookup logs.us-east-1.amazonaws.com +# Should resolve to private IP (10.x.x.x) + +nslookup ecr.us-east-1.amazonaws.com +# Should resolve to private IP (10.x.x.x) +``` + +**Test TGW Routing:** +```bash +# Check route table +aws ec2 describe-route-tables --route-table-ids rtb-xxxxx + +# Verify TGW attachment +aws ec2 describe-transit-gateway-vpc-attachments \ + --filters "Name=vpc-id,Values=vpc-xxxxx" +``` + +**Test Application Functionality:** +1. Access catalog via VPN: https://quilt.yourcompany.com +2. Upload a test package +3. Search for objects +4. Download a file +5. 
Check that all features work as expected + +#### 4.4 Performance Validation + +Monitor key metrics in CloudWatch: + +- ECS task startup time (should be similar to baseline) +- S3 operation latency (should be better with S3 Gateway Endpoint) +- Search indexing performance +- API response times +- Lambda execution duration + +**Expected Performance:** +- With VPC Endpoints: Similar or better than NAT Gateway +- Without VPC Endpoints: Slightly higher latency due to TGW hop + +--- + +## Traffic Flow Analysis + +### With VPC Endpoints (Recommended) + +| Service | Traffic Path | Internet Required? | +|---------|--------------|-------------------| +| S3 API calls | Private subnet → S3 Gateway Endpoint | ❌ No | +| CloudWatch Logs | Private subnet → Logs VPC Endpoint | ❌ No | +| SQS messages | Private subnet → SQS VPC Endpoint | ❌ No | +| ECR image pulls | Private subnet → ECR VPC Endpoints | ❌ No | +| RDS queries | Private subnet → Intra subnet (local) | ❌ No | +| ElasticSearch | Private subnet → Intra subnet (local) | ❌ No | +| Telemetry (optional) | Private subnet → TGW → Internet | ✅ Yes | +| SSO (optional) | Private subnet → TGW → Internet | ✅ Yes | + +**Result:** 95%+ of traffic stays within AWS network, minimal TGW internet routing. + +### Without VPC Endpoints (Not Recommended) + +| Service | Traffic Path | Internet Required? | +|---------|--------------|-------------------| +| S3 API calls | Private subnet → TGW → Internet → S3 | ✅ Yes | +| CloudWatch Logs | Private subnet → TGW → Internet → CloudWatch | ✅ Yes | +| All AWS APIs | Private subnet → TGW → Internet → AWS | ✅ Yes | + +**Result:** High TGW data transfer costs, higher latency, more complex firewall rules. 
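The two tables above depend on which endpoints are actually present in the VPC. As a rough sketch (not part of Quilt's tooling), the shell function below reports which of the Tier 1 services from this guide have no endpoint yet. The function name `missing_tier1_endpoints` and the `TIER1_SERVICES` variable are hypothetical names introduced for illustration; the service list is assumed to match the Tier 1 set (S3, CloudWatch Logs, ECR API, ECR Docker, SQS, SNS).

```bash
#!/usr/bin/env bash
# Tier 1 services from this guide (assumed list; adjust to your deployment).
TIER1_SERVICES="s3 logs ecr.api ecr.dkr sqs sns"

# missing_tier1_endpoints REGION [SERVICE_NAME...]
# Prints one Tier 1 service per line for which no endpoint service name
# (com.amazonaws.REGION.SERVICE) appears among the remaining arguments.
missing_tier1_endpoints() {
  local region="$1"
  shift
  local deployed=" $* "   # pad with spaces so each name can be matched whole
  local svc
  for svc in $TIER1_SERVICES; do
    case "$deployed" in
      *" com.amazonaws.$region.$svc "*) ;;   # endpoint already present
      *) echo "$svc" ;;                      # endpoint missing
    esac
  done
}

# Example with only the S3 and Logs endpoints deployed:
missing_tier1_endpoints us-east-1 \
  com.amazonaws.us-east-1.s3 \
  com.amazonaws.us-east-1.logs
```

In practice the service names could be fed from `aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-xxxxx --query 'VpcEndpoints[].ServiceName' --output text`, passed unquoted so that each name becomes its own argument.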
+ +--- + +## Firewall Configuration + +If your TGW routes through a corporate firewall, you'll need to allow: + +### With VPC Endpoints (Minimal Rules) + +**HTTPS (443) Outbound:** +- `*.ecr.us-east-1.amazonaws.com` (if not using ECR VPC endpoints) +- `telemetry.quiltdata.cloud` (if telemetry enabled) +- `accounts.google.com` (if Google SSO enabled) +- `login.microsoftonline.com` (if Azure SSO enabled) +- `*.okta.com` (if Okta SSO enabled) + +**DNS (53) Outbound:** +- Your DNS resolvers + +### Without VPC Endpoints (Extensive Rules) + +**HTTPS (443) Outbound:** +- `*.amazonaws.com` (all AWS services) +- `*.cloudfront.net` (CloudFront) +- Plus all external services listed above + +--- + +## Troubleshooting + +### Issue: ECS Tasks Fail to Start + +**Symptoms:** +- Tasks transition from PENDING to STOPPED +- Error: "CannotPullContainerError" + +**Diagnosis:** +```bash +# Check task stopped reason +aws ecs describe-tasks --cluster --tasks + +# Check CloudWatch Logs +aws logs tail /aws/ecs/ --since 30m +``` + +**Solutions:** +1. Deploy ECR VPC endpoints (`ecr.api` and `ecr.dkr`) +2. Verify TGW routes to `*.ecr.amazonaws.com` +3. Check security groups allow HTTPS (443) outbound +4. Verify DNS resolution works from private subnets + +### Issue: Lambda Functions Timeout + +**Symptoms:** +- Lambda functions timeout at 30s or configured limit +- CloudWatch Logs show connection errors + +**Diagnosis:** +```bash +# Check Lambda logs +aws logs tail /aws/lambda/ --since 30m --follow + +# Look for connection errors, DNS failures +``` + +**Solutions:** +1. Deploy VPC endpoints for services Lambda calls (S3, SQS, SNS) +2. Verify Lambda security group allows HTTPS outbound +3. Check Lambda has ENI in correct subnets +4. 
Increase Lambda timeout if needed (but shouldn't be necessary) + +### Issue: Search Indexing Fails + +**Symptoms:** +- Objects uploaded but not appearing in search +- SQS queue growing without processing + +**Diagnosis:** +```bash +# Check indexing Lambda logs +aws logs tail /aws/lambda/indexer --follow + +# Check SQS queue depth +aws sqs get-queue-attributes \ + --queue-url https://sqs.us-east-1.amazonaws.com/.../indexing \ + --attribute-names ApproximateNumberOfMessages +``` + +**Solutions:** +1. Verify ElasticSearch is in intra subnets +2. Check security group allows Lambda → ElasticSearch (port 443) +3. ElasticSearch should NOT need internet access +4. Verify Lambda can reach ElasticSearch endpoint + +### Issue: Database Connection Errors + +**Symptoms:** +- ECS tasks crash with "connection refused" +- Registry service unable to start + +**Diagnosis:** +```bash +# Check registry container logs +aws logs tail /aws/ecs/registry --follow + +# Check RDS endpoint +aws rds describe-db-instances --db-instance-identifier +``` + +**Solutions:** +1. Verify RDS is in intra subnets +2. Check security group allows ECS/Lambda → RDS (port 5432) +3. RDS should NEVER need internet access +4. Verify database endpoint resolution from private subnets + +### Issue: High TGW Data Transfer Costs + +**Symptoms:** +- Unexpectedly high TGW data processing charges +- CloudWatch metrics show high TGW bytes + +**Solutions:** +1. Deploy missing VPC endpoints (especially S3, CloudWatch, ECR) +2. Enable VPC Flow Logs to identify traffic patterns +3. Check for unnecessary external API calls +4. 
Consider disabling telemetry and external SSO + +### Issue: Slow Performance + +**Symptoms:** +- Catalog loads slowly +- Package operations take longer than expected + +**Diagnosis:** +```bash +# Check VPC endpoint state and configuration +aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxxxx + +# Check load balancer latency (TargetResponseTime is an AWS/ApplicationELB metric) +aws cloudwatch get-metric-statistics \ + --namespace AWS/ApplicationELB \ + --metric-name TargetResponseTime \ + --dimensions Name=LoadBalancer,Value=... \ + --start-time 2026-02-02T00:00:00Z \ + --end-time 2026-02-02T23:59:59Z \ + --period 3600 \ + --statistics Average +``` + +**Solutions:** +1. Ensure S3 Gateway Endpoint is deployed (huge performance impact) +2. Deploy ECR VPC endpoints for faster image pulls +3. Verify TGW is not congested (check TGW CloudWatch metrics) +4. Consider enabling accelerated networking on EC2 instances + +--- + +## Cost Analysis + +### Scenario 1: NAT Gateway (Default) + +**Monthly Costs (per AZ):** +- NAT Gateway: $32.40/month (720 hours × $0.045) +- Data Processing: $0.045/GB + +**Total (2 AZs, 1 TB = 1,024 GB data/month):** +- NAT Gateway: $64.80 +- Data Processing: $46.08 +- **Total: $110.88/month** + +### Scenario 2: Transit Gateway Only + +**Monthly Costs:** +- TGW Attachment: $36.50/month (730 hours × $0.05) +- TGW Data Processing: $0.02/GB + +**Total (1 TB = 1,024 GB data/month):** +- TGW Attachment: $36.50 +- Data Processing: $20.48 +- **Total: $56.98/month** + +**Savings:** $53.90/month vs NAT Gateway + +**However:** TGW cost is typically shared across many VPCs, so the marginal cost is just data processing (~$20/month). 
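The totals in Scenarios 1 and 2 follow one formula: hours × hourly rate × number of gateways or attachments, plus per-GB rate × GB processed. The sketch below is a hypothetical helper, not part of any Quilt or AWS tool, that reproduces those totals. The function name `monthly_cost` is an assumption, and the rates are the ones quoted in this guide, so check them against current AWS pricing for your region before relying on the output.

```bash
#!/usr/bin/env bash
# monthly_cost HOURS HOURLY_RATE COUNT PER_GB_RATE GB
# Prints the fixed charge (HOURS * HOURLY_RATE * COUNT) plus the
# data-processing charge (PER_GB_RATE * GB), rounded to two decimals.
monthly_cost() {
  awk -v h="$1" -v r="$2" -v n="$3" -v p="$4" -v g="$5" \
    'BEGIN { printf "%.2f\n", h * r * n + p * g }'
}

# Scenario 1: two NAT Gateways (one per AZ), 1 TB = 1024 GB processed
monthly_cost 720 0.045 2 0.045 1024   # prints 110.88

# Scenario 2: one TGW attachment, 1 TB = 1024 GB processed
monthly_cost 730 0.05 1 0.02 1024     # prints 56.98
```

Passing `0` as the hourly rate gives the marginal data-processing cost alone, which is the relevant figure when the TGW attachment is already paid for by other workloads.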
+ +### Scenario 3: Transit Gateway + VPC Endpoints (Recommended) + +**Monthly Costs:** +- TGW Attachment: $36.50/month (shared) +- VPC Endpoints (Tier 1): ~$35/month (6 endpoints × ~$6) +- TGW Data (minimal): ~$2-5/month (only external traffic) +- VPC Endpoint Data: $0.01/GB + +**Total (1 TB data/month, 90% via VPC endpoints):** +- TGW Attachment: $36.50 +- VPC Endpoints: $35.00 +- TGW Data (100 GB): $2.05 +- VPC Endpoint Data (900 GB): $9.24 +- **Total: $82.79/month** + +**vs NAT Gateway:** Saves $28/month +**vs TGW only:** Costs $26/month more, but much better performance and security + +--- + +## Best Practices + +### Network Design + +1. **Always use Network 2.0** with private subnets and proper subnet segmentation +2. **Deploy VPC endpoints** for all AWS services Quilt uses +3. **Use separate route tables** for private, intra, and public subnets +4. **Enable VPC Flow Logs** to monitor traffic patterns +5. **Use security groups** as primary firewall, not NACLs + +### Security + +1. **Enable private DNS** for all VPC endpoints +2. **Restrict security groups** to minimum required ports +3. **Use separate intra subnets** for RDS/ElasticSearch (no internet) +4. **Enable encryption** at rest and in transit +5. **Audit TGW routes** regularly for unexpected changes + +### Operations + +1. **Document your network architecture** with diagrams +2. **Create runbooks** for common troubleshooting scenarios +3. **Set up CloudWatch alarms** for network issues +4. **Monitor TGW CloudWatch metrics** for congestion +5. **Test failover scenarios** (TGW attachment failure, etc.) + +### Cost Optimization + +1. **Deploy Tier 1 VPC endpoints minimum** to eliminate most data transfer +2. **Disable optional external services** (telemetry, external SSO) +3. **Use S3 Gateway Endpoint** (free!) instead of routing S3 via TGW +4. **Monitor VPC Endpoint costs** and optimize based on usage patterns +5. 
**Consider Reserved Capacity** for TGW if heavily used + +--- + +## Verification Checklist + +### Pre-Deployment + +- [ ] TGW attached to target VPC +- [ ] Route tables configured with 0.0.0.0/0 → TGW +- [ ] TGW routes to internet (directly or via firewall) +- [ ] DNS resolution works from private subnets +- [ ] Security groups created for VPC endpoints +- [ ] VPC endpoints deployed (at least S3 Gateway) +- [ ] Firewall rules configured (if applicable) +- [ ] Subnet IDs documented +- [ ] Parameters file prepared + +### Post-Deployment + +- [ ] CloudFormation/Terraform deployment succeeded +- [ ] No NAT Gateway created (verify in VPC console) +- [ ] ECS tasks launched successfully +- [ ] CloudWatch Logs receiving data +- [ ] RDS database accessible from ECS +- [ ] ElasticSearch accessible from Lambda/ECS +- [ ] Catalog accessible via VPN/public internet +- [ ] Test package upload successful +- [ ] Test search query returns results +- [ ] Test file download works +- [ ] VPC endpoints showing usage in CloudWatch metrics +- [ ] TGW metrics show expected traffic patterns +- [ ] No connection timeout errors in logs + +### Performance Validation + +- [ ] ECS task startup time < 2 minutes +- [ ] S3 operations < 500ms latency +- [ ] Search queries < 1 second +- [ ] API response time < 2 seconds +- [ ] No Lambda timeout errors +- [ ] CloudWatch metrics show healthy state + +--- + +## Additional Resources + +### AWS Documentation + +- [AWS Transit Gateway](https://docs.aws.amazon.com/vpc/latest/tgw/) +- [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints.html) +- [VPC Endpoint Services (AWS PrivateLink)](https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-services-overview.html) +- [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html) + +### Quilt Documentation + +- Network Architecture Guide (README.md) +- Private Endpoints Configuration (t4/template/PRIVATE_ENDPOINTS.md) +- Environment Configuration Schema 
(t4/template/environment/env_schema.py) + +### Support + +For questions or issues: +- Email: support@quiltdata.com +- Documentation: https://docs.quiltdata.com +- GitHub Issues: https://github.com/quiltdata/quilt + +--- + +## Appendix: Example Terraform Configuration + +```hcl +# Example: Create VPC endpoints for Quilt deployment + +locals { + vpc_id = "vpc-xxxxx" + private_subnet_ids = ["subnet-xxxxx", "subnet-yyyyy"] + vpc_endpoint_sg_id = aws_security_group.vpc_endpoints.id +} + +# Security Group for VPC Endpoints +resource "aws_security_group" "vpc_endpoints" { + name = "vpc-endpoints-quilt" + description = "Allow HTTPS from private subnets to VPC endpoints" + vpc_id = local.vpc_id + + ingress { + from_port = 443 + to_port = 443 + protocol = "tcp" + cidr_blocks = ["10.0.1.0/24", "10.0.2.0/24"] # Private subnet CIDRs + } + + tags = { + Name = "vpc-endpoints-quilt" + } +} + +# S3 Gateway Endpoint (FREE!) +resource "aws_vpc_endpoint" "s3" { + vpc_id = local.vpc_id + service_name = "com.amazonaws.us-east-1.s3" + vpc_endpoint_type = "Gateway" + route_table_ids = [aws_route_table.private.id] + + tags = { + Name = "s3-gateway-endpoint" + } +} + +# CloudWatch Logs Interface Endpoint +resource "aws_vpc_endpoint" "logs" { + vpc_id = local.vpc_id + service_name = "com.amazonaws.us-east-1.logs" + vpc_endpoint_type = "Interface" + subnet_ids = local.private_subnet_ids + security_group_ids = [local.vpc_endpoint_sg_id] + private_dns_enabled = true + + tags = { + Name = "logs-interface-endpoint" + } +} + +# ECR API Interface Endpoint +resource "aws_vpc_endpoint" "ecr_api" { + vpc_id = local.vpc_id + service_name = "com.amazonaws.us-east-1.ecr.api" + vpc_endpoint_type = "Interface" + subnet_ids = local.private_subnet_ids + security_group_ids = [local.vpc_endpoint_sg_id] + private_dns_enabled = true + + tags = { + Name = "ecr-api-interface-endpoint" + } +} + +# ECR Docker Interface Endpoint +resource "aws_vpc_endpoint" "ecr_dkr" { + vpc_id = local.vpc_id + service_name = 
"com.amazonaws.us-east-1.ecr.dkr" + vpc_endpoint_type = "Interface" + subnet_ids = local.private_subnet_ids + security_group_ids = [local.vpc_endpoint_sg_id] + private_dns_enabled = true + + tags = { + Name = "ecr-dkr-interface-endpoint" + } +} + +# SQS Interface Endpoint +resource "aws_vpc_endpoint" "sqs" { + vpc_id = local.vpc_id + service_name = "com.amazonaws.us-east-1.sqs" + vpc_endpoint_type = "Interface" + subnet_ids = local.private_subnet_ids + security_group_ids = [local.vpc_endpoint_sg_id] + private_dns_enabled = true + + tags = { + Name = "sqs-interface-endpoint" + } +} + +# SNS Interface Endpoint +resource "aws_vpc_endpoint" "sns" { + vpc_id = local.vpc_id + service_name = "com.amazonaws.us-east-1.sns" + vpc_endpoint_type = "Interface" + subnet_ids = local.private_subnet_ids + security_group_ids = [local.vpc_endpoint_sg_id] + private_dns_enabled = true + + tags = { + Name = "sns-interface-endpoint" + } +} + +# Output endpoint IDs for reference +output "vpc_endpoint_ids" { + value = { + s3 = aws_vpc_endpoint.s3.id + logs = aws_vpc_endpoint.logs.id + ecr_api = aws_vpc_endpoint.ecr_api.id + ecr_dkr = aws_vpc_endpoint.ecr_dkr.id + sqs = aws_vpc_endpoint.sqs.id + sns = aws_vpc_endpoint.sns.id + } +} +``` + +--- + +**Document Version:** 1.0 +**Last Updated:** February 2, 2026 +**Maintained By:** Quilt Engineering Team diff --git a/howto-3-transit-gateway-deployment.md b/howto-3-transit-gateway-deployment.md new file mode 100644 index 0000000..f108a71 --- /dev/null +++ b/howto-3-transit-gateway-deployment.md @@ -0,0 +1,170 @@ +# How-To: Deploy Quilt with Transit Gateway + +## Tags + +`aws`, `networking`, `transit-gateway`, `vpc-endpoints`, `enterprise` + +## Summary + +> Deploy Quilt using Transit Gateway by providing TGW-configured subnets. Optionally deploy VPC endpoints to reduce TGW data charges. 
+ +--- + +## Prerequisites + +- VPC with Transit Gateway attachment (TGW routes to internet) +- Quilt deployment configured with `network.vpn: true` (sets `existing_vpc: true`) +- AWS networking knowledge (VPC, subnets, route tables, security groups) + +### Subnet Requirements + +**Private Subnets** (2, different AZs): + +- Route: `0.0.0.0/0 → tgw-xxxxx` +- For: ECS containers, Lambda functions + +**Intra Subnets** (2, different AZs): + +- Route: Local VPC only +- For: RDS, ElasticSearch + +**User Subnets** (load balancer): + +- Internal: Use private subnets +- Public: Use public subnets with IGW + +--- + +## Step 1: Deploy VPC Endpoints (Strongly Recommended) + +Configuring these essential endpoints costs ~$35/month, but can reduce TGW charges by 90%+. + +```bash +VPC_ID="vpc-xxxxx" +REGION=$(aws configure get region) +PRIVATE_SUBNET_1="subnet-xxxxx" +PRIVATE_SUBNET_2="subnet-yyyyy" + +# Create security group for endpoints +VPCE_SG=$(aws ec2 create-security-group \ + --group-name quilt-vpc-endpoints \ + --description "VPC endpoints for Quilt" \ + --vpc-id $VPC_ID \ + --query 'GroupId' --output text) + +# Allow HTTPS from private subnets +aws ec2 authorize-security-group-ingress \ + --group-id $VPCE_SG \ + --protocol tcp --port 443 --cidr 10.0.0.0/16 # Adjust CIDR + +# S3 Gateway (FREE) +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.s3 \ + --route-table-ids rtb-private1 rtb-private2 \ + --vpc-endpoint-type Gateway + +# CloudWatch Logs +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.logs \ + --vpc-endpoint-type Interface \ + --subnet-ids $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \ + --security-group-ids $VPCE_SG \ + --private-dns-enabled + +# ECR (for Docker images) +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.ecr.api \ + --vpc-endpoint-type Interface \ + --subnet-ids $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \ + --security-group-ids $VPCE_SG \ + 
--private-dns-enabled + +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.ecr.dkr \ + --vpc-endpoint-type Interface \ + --subnet-ids $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \ + --security-group-ids $VPCE_SG \ + --private-dns-enabled + +# SQS +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.sqs \ + --vpc-endpoint-type Interface \ + --subnet-ids $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \ + --security-group-ids $VPCE_SG \ + --private-dns-enabled + +# SNS +aws ec2 create-vpc-endpoint \ + --vpc-id $VPC_ID \ + --service-name com.amazonaws.$REGION.sns \ + --vpc-endpoint-type Interface \ + --subnet-ids $PRIVATE_SUBNET_1 $PRIVATE_SUBNET_2 \ + --security-group-ids $VPCE_SG \ + --private-dns-enabled +``` + +--- + +## Step 2: Configure Firewall Rules (If Applicable) + +If TGW routes through firewall, allow HTTPS (443) to: + +- `telemetry.quiltdata.cloud` (if telemetry enabled) +- `login.microsoftonline.com` (if Microsoft Entra SSO) +- `*.okta.com` or `*.oktapreview.com` (if Okta SSO) +- `accounts.google.com` (if Google SSO) +- `*.amazonaws.com` (if no VPC endpoints) + +--- + +## Step 3: Deploy Quilt Stack + +When deploying the CloudFormation stack, add these Transit Gateway-specific parameters: + +```yaml +VPC: vpc-xxxxx +Subnets: subnet-private1,subnet-private2 # Private subnets with TGW routing for ECS/Lambda +IntraSubnets: subnet-intra1,subnet-intra2 # Isolated subnets for RDS/ElasticSearch (VPC-only) +UserSubnets: subnet-private1,subnet-private2 # Load balancer subnets (same as Subnets for internal) +UserSecurityGroup: sg-xxxxx # Security group allowing ingress to load balancer +``` + +--- + +## Step 4: Validate & Troubleshoot + +Run these tests from within your VPC (EC2 instance, bastion host, or VPN-connected machine): + +Test deployment: + +```bash +STACK_NAME="your-stack" +REGISTRY=$(aws cloudformation describe-stacks --stack-name $STACK_NAME \ + --query 
'Stacks[0].Outputs[?OutputKey==`RegistryHost`].OutputValue' --output text) +curl -s -o /dev/null -w "%{http_code}" https://$REGISTRY/ # Expect: 200 or 302 +``` + +Verify interface VPC endpoints (with private DNS enabled, these names should resolve to private IPs inside your VPC CIDR, e.g. 10.x.x.x). The S3 gateway endpoint works through route tables rather than DNS, so S3 hostnames still resolve to public IPs even when the endpoint is in use: + +```bash +nslookup logs.$REGION.amazonaws.com +nslookup sqs.$REGION.amazonaws.com +``` + +**Common Issues:** + +**ECS "CannotPullContainerError":** Deploy ECR VPC endpoints or verify TGW routes to `*.ecr.amazonaws.com` + +**Lambda timeouts:** Deploy VPC endpoints or verify security groups allow 443 outbound + +**High TGW costs:** Deploy missing VPC endpoints or set `DISABLE_QUILT_TELEMETRY=true` + +--- + +**Support:**