-
Notifications
You must be signed in to change notification settings - Fork 848
Fix origin_server_auth URL encoding for mixed-encoding URLs #12802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Added unit tests and autest to demonstrate the isUriEncoded() bug in the origin_server_auth plugin (s3_auth). Bug: When a URL has mixed encoding (some characters encoded, some not), isUriEncoded() incorrectly returns true upon finding the first %XX sequence, assuming the entire string is already encoded. This causes canonicalEncode() to skip re-encoding unencoded characters like parentheses. Example: Client sends: /app/(channel)/%5B%5Bparts%5D%5D/page.js isUriEncoded(): Returns true (finds %5B) canonicalEncode(): Returns as-is (thinks already encoded) Signature for: /app/(channel)/%5B%5Bparts%5D%5D/page.js <- WRONG S3 expects: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js This results in S3 returning 403 SignatureDoesNotMatch. Unit tests marked with [!mayfail] to document expected-to-fail behavior. Autest passes with mock server but would fail with real S3.
Fixed the isUriEncoded() and canonicalEncode() functions to properly handle URLs with mixed encoding (some characters encoded, some not). Bug: URLs like /app/(channel)/%5B%5Bparts%5D%5D/page.js would cause S3 signature mismatch because: - isUriEncoded() found %5B and returned true (incorrectly) - canonicalEncode() returned the string as-is - Signature was calculated for partially-encoded path - S3 expected signature for fully-encoded path Fix: 1. isUriEncoded() now checks the ENTIRE string and returns false if ANY character that should be encoded is found unencoded. 2. canonicalEncode() now always decodes first, then re-encodes using AWS canonical rules. This ensures consistent output regardless of input. Added uriDecode() helper function to decode percent-encoded characters. Updated unit tests to verify the correct behavior. This fixes YTSATS-4835.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug in the origin_server_auth plugin where mixed URL encoding (some characters encoded, others not) caused AWS SigV4 signature mismatches with S3/GCP.
Changes:
- Added
uriDecode()function to decode percent-encoded URLs - Modified
isUriEncoded()to check if ALL characters requiring encoding are encoded (not just if ANY encoded character exists) - Modified
canonicalEncode()to decode-then-reencode partially-encoded URLs for consistent canonical output
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| plugins/origin_server_auth/aws_auth_v4.cc | Added uriDecode() function and fixed isUriEncoded()/canonicalEncode() to handle mixed encoding |
| plugins/origin_server_auth/unit_tests/test_aws_auth_v4.h | Exposed uriDecode() and canonicalEncode() functions for unit testing |
| plugins/origin_server_auth/unit_tests/test_aws_auth_v4.cc | Added comprehensive unit tests for mixed encoding scenarios and S3 safe characters |
| tests/gold_tests/pluginTest/origin_server_auth/s3_url_encoding.test.py | Added integration test demonstrating the bug and verifying the fix |
| tests/gold_tests/pluginTest/origin_server_auth/rules/s3_url_encoding.test_input | Added test configuration with fake AWS credentials |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /* If we found at least one encoded character and no unencoded special chars, | ||
| * the string is fully encoded. If no encoded chars were found, the string | ||
| * contains only unreserved chars (and possibly '/' for object names). */ | ||
| return foundEncodedChar; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isUriEncoded function returns false for strings containing only unreserved characters (like "/test/file.txt" or "plain-text.js"). While functionally correct (these strings will pass through decode/encode unchanged in canonicalEncode), this introduces unnecessary overhead. Consider optimizing by returning true when the loop completes without finding any characters that need encoding, indicating the string is already in its canonical form.
| Bug: When a URL has SOME characters encoded (e.g., %5B) but others NOT encoded | ||
| (e.g., parentheses), the isUriEncoded() function incorrectly returns true, | ||
| causing canonicalEncode() to skip re-encoding. This results in a signature | ||
| calculated for a partially-encoded path, which won't match what S3 expects. | ||
| Example: | ||
| Client sends: /app/(channel)/%5B%5Bparts%5D%5D/page.js | ||
| isUriEncoded(): Returns true (finds %5B) | ||
| canonicalEncode(): Returns as-is (thinks already encoded) | ||
| Signature for: /app/(channel)/%5B%5Bparts%5D%5D/page.js <- WRONG | ||
| S3 expects: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js <- CORRECT | ||
| This test sends requests with various URL encodings and captures what | ||
| the origin receives, including the Authorization header signature. | ||
| ''' | ||
| # Licensed to the Apache Software Foundation (ASF) under one | ||
| # or more contributor license agreements. See the NOTICE file | ||
| # distributed with this work for additional information | ||
| # regarding copyright ownership. The ASF licenses this file | ||
| # to you under the Apache License, Version 2.0 (the | ||
| # "License"); you may not use this file except in compliance | ||
| # with the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| Test.testName = "origin_server_auth: URL encoding bug" | ||
| Test.Summary = ''' | ||
| Test for S3 signature calculation with mixed URL encoding | ||
| ''' | ||
| Test.ContinueOnFail = True | ||
|
|
||
|
|
||
| class S3UrlEncodingTest: | ||
| """ | ||
| Test class for verifying URL encoding behavior in S3 auth signature calculation. | ||
| """ | ||
|
|
||
| def __init__(self): | ||
| self.setupOriginServer() | ||
| self.setupTS() | ||
|
|
||
| def setupOriginServer(self): | ||
| """ | ||
| Create an origin server that captures and logs the request path and headers. | ||
| ATS preserves URL encoding from the client, so we need to match exactly. | ||
| """ | ||
| self.server = Test.MakeOriginServer("s3_mock") | ||
|
|
||
| # Test case 1: Fully unencoded path with parentheses | ||
| # Client sends: /bucket/app/(channel)/test.js (unencoded) | ||
| # ATS forwards: /bucket/app/(channel)/test.js (as-is) | ||
| request1 = { | ||
| "headers": "GET /bucket/app/(channel)/test.js HTTP/1.1\r\nHost: s3.amazonaws.com\r\n\r\n", | ||
| "timestamp": "1469733493.993", | ||
| "body": "" | ||
| } | ||
| response1 = { | ||
| "headers": "HTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: 7\r\n\r\n", | ||
| "timestamp": "1469733493.993", | ||
| "body": "test1ok" | ||
| } | ||
| self.server.addResponse("sessionlog.log", request1, response1) | ||
|
|
||
| # Test case 2: Fully encoded parentheses | ||
| # Client sends: /bucket/app/%28channel%29/test.js (encoded) | ||
| # ATS forwards: /bucket/app/%28channel%29/test.js (as-is, preserves encoding) | ||
| request2 = { | ||
| "headers": "GET /bucket/app/%28channel%29/test.js HTTP/1.1\r\nHost: s3.amazonaws.com\r\n\r\n", | ||
| "timestamp": "1469733493.993", | ||
| "body": "" | ||
| } | ||
| response2 = { | ||
| "headers": "HTTP/1.1 200 OK\r\nConnection: close\r\nContent-Length: 7\r\n\r\n", | ||
| "timestamp": "1469733493.993", | ||
| "body": "test2ok" | ||
| } | ||
| self.server.addResponse("sessionlog.log", request2, response2) | ||
|
|
||
| # Test case 3: Mixed encoding - BUG CASE | ||
| # Client sends: /bucket/app/(channel)/%5B%5Bparts%5D%5D/page.js | ||
| # - Parentheses () NOT encoded | ||
| # - Brackets [] ARE encoded as %5B%5D | ||
| # ATS forwards: /bucket/app/(channel)/%5B%5Bparts%5D%5D/page.js (as-is) | ||
| # | ||
| # BUG: isUriEncoded() sees %5B -> returns true -> canonicalEncode() skips encoding |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments on lines 7-10 and 97 describe the bug in present tense ("the isUriEncoded() function incorrectly returns true"), but this PR fixes that bug. These comments should be updated to past tense or rephrased to indicate this was the bug that has been fixed, for example: "This test verifies the fix for a bug where mixed URL encoding..." or "Previously, when a URL had SOME characters encoded..."
| * BUG: isUriEncoded() returns true if ANY %XX sequence is found, even if other | ||
| * reserved characters are NOT encoded. This causes canonicalEncode() to skip | ||
| * encoding, resulting in signature mismatch with S3. | ||
| * | ||
| * Example from YTSATS-4835: | ||
| * Client sends: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| * ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode) | ||
| * isUriEncoded() sees %5B -> returns true (incorrectly assumes fully encoded) | ||
| * canonicalEncode() returns as-is: /app/(channel)/%5B%5Bparts%5D%5D/page.js | ||
| * Signature calculated for partially-encoded path | ||
| * S3 expects signature for: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| * Result: 403 signature mismatch |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment block on lines 145-158 describes the bug in present tense ("BUG: isUriEncoded() returns true if ANY %XX sequence is found..."), but this PR fixes that bug. The comment should be updated to indicate this was a historical bug that has been fixed, for example: "This test verifies the fix for a bug where isUriEncoded() would return true if ANY %XX sequence was found..."
| * BUG: isUriEncoded() returns true if ANY %XX sequence is found, even if other | |
| * reserved characters are NOT encoded. This causes canonicalEncode() to skip | |
| * encoding, resulting in signature mismatch with S3. | |
| * | |
| * Example from YTSATS-4835: | |
| * Client sends: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | |
| * ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode) | |
| * isUriEncoded() sees %5B -> returns true (incorrectly assumes fully encoded) | |
| * canonicalEncode() returns as-is: /app/(channel)/%5B%5Bparts%5D%5D/page.js | |
| * Signature calculated for partially-encoded path | |
| * S3 expects signature for: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | |
| * Result: 403 signature mismatch | |
| * This test verifies the fix for a bug where isUriEncoded() would return true | |
| * if ANY %XX sequence was found, even if other reserved characters were NOT | |
| * encoded. That behavior caused canonicalEncode() to skip encoding, resulting | |
| * in a signature mismatch with S3. | |
| * | |
| * Historical example from YTSATS-4835: | |
| * Client sends: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | |
| * ATS decodes to: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partial decode) | |
| * isUriEncoded() saw %5B -> returned true (incorrectly assumed fully encoded) | |
| * canonicalEncode() returned as-is: /app/(channel)/%5B%5Bparts%5D%5D/page.js | |
| * Signature was calculated for the partially-encoded path | |
| * S3 expected the signature for: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | |
| * Result: 403 signature mismatch (now prevented by this fix) |
| /* Not fully encoded (either unencoded or partially encoded). | ||
| * Decode first to avoid double-encoding, then re-encode with AWS canonical rules. | ||
| * | ||
| * This fixes a bug where mixed-encoding URLs (e.g., /app/(channel)/%5B%5Bparts%5D%5D/) | ||
| * would cause signature mismatch. The parentheses were not encoded but brackets were. | ||
| * | ||
| * Example: | ||
| * - Input: /app/(channel)/%5B%5Bparts%5D%5D/page.js (partially encoded) | ||
| * - Decode: /app/(channel)/[[parts]]/page.js | ||
| * - Encode: /app/%28channel%29/%5B%5Bparts%5D%5D/page.js | ||
| */ | ||
| String decoded = uriDecode(in); | ||
| String canonical = uriEncode(decoded, isObjectName); | ||
| return canonical; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decode-then-reencode approach in canonicalEncode may have unintended behavior with plus signs. If a URL contains "%2B" (encoded plus sign), it will be decoded to "+" and then re-encoded as "%20" (space) by uriEncode (lines 125-128), changing its semantic meaning. While this matches the existing AWS-specific behavior where '+' is treated as space, it means that "%2B" sequences in mixed-encoded URLs will be converted to "%20". Consider adding a test case to document this behavior and verify it matches AWS SigV4 expectations for plus sign handling in URI paths.
| if ('%' == c) { | ||
| if (pos + 2 < in.length() && std::isxdigit(in[pos + 1]) && std::isxdigit(in[pos + 2])) { | ||
| /* if string was encoded we should have exactly 2 hexadecimal chars following it */ | ||
| return true; | ||
| /* Valid encoded sequence found */ | ||
| foundEncodedChar = true; | ||
| pos += 2; /* Skip past the hex digits */ | ||
| continue; | ||
| } else { | ||
| /* lonely '%' should have been encoded with %25 according to the RFC so likely not encoded */ | ||
| /* lonely '%' should have been encoded with %25 according to the RFC */ | ||
| return false; | ||
| } | ||
| } | ||
|
|
||
| /* Any other character that reaches here should have been encoded but wasn't. | ||
| * This includes: space, '/', '(', ')', '[', ']', etc. | ||
| * Return false to indicate the string is not fully encoded. */ | ||
| return false; | ||
| } | ||
|
|
||
| return false; | ||
| /* If we found at least one encoded character and no unencoded special chars, | ||
| * the string is fully encoded. If no encoded chars were found, the string | ||
| * contains only unreserved chars (and possibly '/' for object names). */ | ||
| return foundEncodedChar; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isUriEncoded function accepts both uppercase and lowercase hex digits (via std::isxdigit), and canonicalEncode returns such strings unchanged. However, AWS SigV4 spec requires uppercase hex digits (e.g., "%2F" not "%2f"). URLs with lowercase hex encoding (e.g., "/test%2ffile") will bypass the decode/encode normalization and retain lowercase hex, potentially causing signature mismatches. Consider either: (1) modifying isUriEncoded to return false for lowercase hex digits, forcing normalization through decode/encode, or (2) add a case-normalization step for fully-encoded strings.
Summary
Fixed the
isUriEncoded()andcanonicalEncode()functions in the origin_server_auth plugin to properly handle URLs with mixed encoding (some characters encoded, some not).Bug Description
When a URL has mixed encoding (e.g.,
/app/(channel)/%5B%5Bparts%5D%5D/page.jswhere parentheses are NOT encoded but brackets ARE encoded), the AWS v4 signature calculation was incorrect:isUriEncoded()found%5Band returnedtrue, incorrectly assuming the entire string was fully encodedcanonicalEncode()returned the string as-isAWS SigV4 Canonical URI Encoding
Per the AWS SigV4 spec:
This means characters like
(,),!,*,'must be percent-encoded in the canonical URI for signature calculation, even though they are listed as "safe" for S3 object key names.(%28))%29)!%21)*%2A)'%27)Fix
isUriEncoded(): Now checks the ENTIRE string and returnsfalseif ANY character that should be encoded is found unencodedcanonicalEncode(): For partially-encoded strings, decodes first then re-encodes to ensure consistent canonical outputuriDecode()helper functionTesting
!,*,',(,))References