Standardize Endpoint invoke lifecycle with centralized error handling #61
Closed
acere wants to merge 2 commits into awslabs:main from
Conversation
Refactor the `Endpoint` base class to provide a structured invoke lifecycle via `__init_subclass__` wrapping:

- `prepare_payload(payload, **kwargs)` → merge kwargs, inject provider fields
- `invoke(payload)` → API call + `parse_response()` (abstract)
- `parse_response(raw_response, start_t)` → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:

- Error handling: exceptions → error `InvocationResponse` with partial data
- Timing: `time_to_last_token` back-fill for non-streaming endpoints
- Metadata: `input_payload`, `input_prompt`, `id` always populated
- `_parse_payload` for input prompt extraction (token counting fallback)

Additional improvements:

- Add `num_tokens_input_cached` field for prompt caching (Bedrock + OpenAI)
- Extract AWS RequestId as response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding
- Define `BEDROCK_STREAM_ERROR_TYPES` as a shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all `_parse_response` methods
- Remove uuid4/error handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
- Add `num_tokens_input_cached` to `Result.stats` aggregation metrics and `total_cached_input_tokens` to run-level stats
- Add integration test for ConverseStream prompt caching with a unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying mid-stream errors (TimeoutError, ConnectionError) are caught by the invoke wrapper for BedrockConverseStream, BedrockInvokeStream, and OpenAICompletionStreamEndpoint
- Add demo notebook comparing TTFT with/without prompt caching, using a CacheBuster callback to guarantee cache misses
- Sort imports across the codebase (`ruff --select I`)
- Update metrics documentation with new stats fields
Author
Superseded by PR #58, which now includes all changes from this PR (endpoint lifecycle refactor, prompt caching metrics, mid-stream error handling, demo notebook). The combined branch was force-pushed to
Summary
Refactors the `Endpoint` base class to eliminate duplicated error handling, timing, and metadata boilerplate across all 12 endpoint implementations. Closes #60
What changed
Base class (`base.py`)

The `Endpoint` class now provides a structured invoke lifecycle via `__init_subclass__` wrapping. Subclasses define three methods:

- `prepare_payload(payload, **kwargs)`
- `invoke(payload)`
- `parse_response(raw_response, start_t)`

The wrapper automatically handles:
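To make the three-method lifecycle concrete, here is a minimal, runnable sketch of a hypothetical subclass. The class name, `default_params`, and `_call_api` are illustrative inventions, not part of the library; the real base class would wrap these methods with error handling and metadata.

```python
import time


class MyJSONEndpoint:
    """Hypothetical subclass: only the three lifecycle methods are defined;
    the base-class wrapper (not shown) supplies error handling and metadata."""

    default_params = {"temperature": 0.2}  # provider-specific fields (illustrative)

    def prepare_payload(self, payload, **kwargs):
        # Merge caller kwargs and inject provider fields
        return {**self.default_params, **payload, **kwargs}

    def invoke(self, payload):
        start_t = time.perf_counter()
        raw = self._call_api(payload)  # stand-in for the provider SDK call
        return self.parse_response(raw, start_t)

    def parse_response(self, raw_response, start_t):
        # Extract text, token counts, and timing into a response dict
        return {
            "response_text": raw_response["text"],
            "num_tokens_output": raw_response["usage"]["output_tokens"],
            "time_to_last_token": time.perf_counter() - start_t,
        }

    def _call_api(self, payload):
        # Fake API so the sketch is runnable without a network call
        return {"text": "ok", "usage": {"output_tokens": 1}}
```

A caller would chain the steps: `ep.invoke(ep.prepare_payload({"prompt": "hi"}, max_tokens=5))`.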
- Exceptions → error `InvocationResponse` with payload attached
- `time_to_last_token` back-filled for non-streaming endpoints
- `input_payload`, `input_prompt`, `id` always populated
- `_parse_payload` — extracts human-readable prompt for observability and token counting fallback

`InvocationResponse`

New field `num_tokens_input_cached` — input tokens served from prompt cache. Populated by Bedrock (`cacheReadInputTokens`) and OpenAI (`cached_tokens`).

Endpoint improvements
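The centralized wrapping can be sketched as follows. This is a simplified stand-in, not the PR's actual code: the class and method names mirror the description above, but the wrapper here returns a plain dict instead of the library's `InvocationResponse`.

```python
import functools
import time


class Endpoint:
    """Simplified sketch of lifecycle wrapping via __init_subclass__."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Wrap each subclass's invoke so every endpoint shares the same
        # error handling and timing, with no per-class boilerplate.
        if "invoke" in cls.__dict__:
            cls.invoke = Endpoint._wrap_invoke(cls.__dict__["invoke"])

    @staticmethod
    def _wrap_invoke(fn):
        @functools.wraps(fn)
        def wrapper(self, payload):
            start_t = time.perf_counter()
            try:
                response = fn(self, payload)
            except Exception as exc:
                # Exceptions become an error response instead of propagating
                response = {"error": repr(exc)}
            # Metadata and timing are always populated, even on failure
            response.setdefault("input_payload", payload)
            response.setdefault("time_to_last_token", time.perf_counter() - start_t)
            return response

        return wrapper


class FailingEndpoint(Endpoint):
    def invoke(self, payload):
        raise TimeoutError("upstream timed out")


resp = FailingEndpoint().invoke({"prompt": "hi"})
# resp carries the error string plus input_payload and time_to_last_token
```

Because the wrapping happens at class-creation time, subclasses stay free of try/except and metadata bookkeeping.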
- Extract AWS `ResponseMetadata.RequestId` as the response ID (Bedrock and SageMaker)
- Extract `RetryAttempts` for SageMaker (Bedrock already had this)
- `BEDROCK_STREAM_ERROR_TYPES` defined as a shared `frozenset` constant, used by both Converse and InvokeModel stream parsers
- Remove redundant try/except from all `_parse_response` methods

Before/after (e.g. `OpenAIResponseEndpoint.invoke`)

Before (27 lines with 5 duplicate except handlers):
After (3 lines):
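The PR's before/after snippets did not survive extraction. As a rough illustration of the "after" shape described above, here is a hedged sketch: the class name and `_client_call` stub are invented for the example, and the real method would call the OpenAI SDK.

```python
import time


class OpenAIResponseEndpointSketch:
    """Illustrative 'after' shape: invoke is just call-then-parse, because
    the base-class wrapper now owns error handling and metadata."""

    def invoke(self, payload):
        start_t = time.perf_counter()
        raw_response = self._client_call(payload)  # provider SDK call
        return self.parse_response(raw_response, start_t)

    def parse_response(self, raw_response, start_t):
        # Real code would also extract token counts and timing
        return {"response_text": raw_response["text"]}

    def _client_call(self, payload):
        # Stub standing in for the OpenAI client request
        return {"text": "hello"}
```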
Documentation
- `metrics.md` — per-request fields table expanded from 6 to 12 fields
- `key_concepts.md` — `Endpoint` description explains the invoke lifecycle
- `connect_endpoints.md` — added custom endpoint example with the new abstract methods

Testing
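The commit notes mention unit tests verifying that mid-stream errors are caught by the invoke wrapper. A self-contained sketch of that style of test is below; `catch_invoke_errors` and `StreamEndpoint` are hypothetical stand-ins for the base-class wrapper and a streaming endpoint, written so the example runs without the library.

```python
import functools


def catch_invoke_errors(fn):
    """Hypothetical mini version of the base-class wrapper, for the test."""

    @functools.wraps(fn)
    def wrapper(self, payload):
        try:
            return fn(self, payload)
        except Exception as exc:
            return {"error": repr(exc), "input_payload": payload}

    return wrapper


class StreamEndpoint:
    @catch_invoke_errors
    def invoke(self, payload):
        chunks = []
        for event in self._stream(payload):  # raises mid-stream
            chunks.append(event)
        return {"response_text": "".join(chunks)}

    def _stream(self, payload):
        yield "partial "
        raise TimeoutError("stream stalled")


def test_mid_stream_timeout_becomes_error_response():
    resp = StreamEndpoint().invoke({"prompt": "hi"})
    assert "TimeoutError" in resp["error"]
    assert resp["input_payload"] == {"prompt": "hi"}
```

The same pattern would apply to `ConnectionError` and to each streaming endpoint class listed in the commit message.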