Optimize WSGI/ASGI performance #13

Merged
benoitc merged 52 commits into main from
feature/wsgi-asgi-performance-optimization
Mar 24, 2026
Conversation


benoitc (Owner) commented Mar 17, 2026

Summary

  • Add Python context pool with scheduler affinity for reduced context switching
  • Implement zero-copy body streaming via py_buffer
  • Add environ template caching and BytesIO/ASGIResponse pooling
  • Cache lifespan state in handler state at startup
  • Stream pooled responses directly to client
  • Unify ASGI handler and simplify message protocol
  • Use lazy state proxy with callbacks for ASGI state access
  • Use event loop pool for ASGI task distribution
  • Reuse default py_context_router pool instead of per-mount workers
  • Remove dead code (hornbeam_pool, unused streaming functions)
  • Bump erlang_python to 2.2.0

benoitc added 30 commits March 12, 2026 12:38
- Fix erlang_python git repo URL (erlang-python not erlang_python)
- Replace py:bind/py:unbind with py:context/py:contexts_started
- Replace py:ctx_call with py:call
- Replace py:with_context with direct py:call
- Update py:call signatures to use options map for timeout
- Create hornbeam_request.erl for pre-parsing HTTP requests in Erlang
- Add to_wsgi_header_key/1 for header format conversion
- Add build_wsgi_tuple/2 and build_asgi_scope/2 functions
- Add BytesIO pool to WSGI runner to reduce allocation overhead
- Add environ template for O(1) environ creation
- Add run_wsgi_fast/3 and create_environ_from_tuple/1 for fast path
- Add response object pool with reset() method
- Add _get_response() and _return_response() pool functions
- Pool size of 100 responses for high-throughput scenarios
- Use _ENVIRON_TEMPLATE.copy() instead of inline dict creation
- Use pooled BytesIO for wsgi.input
- Return BytesIO to pool after request completion
Workers receive requests via channels and loop continuously,
reducing Python startup overhead. Features heartbeat monitoring,
scheduler affinity routing, and automatic restart on failure.

Enable via mount config: pool_enabled => true
Use cowboy stream_reply/stream_body instead of collecting
chunks before sending. Reduces memory usage and latency
for large streaming responses.
Remove per-mount heartbeat_interval and heartbeat_timeout options.
Use module constants (5s interval, 15s timeout) for all workers.
Store channels as tuple instead of individual entries.
One lookup instead of two: get tuple, then element().
Store channels as {{pool, MountId}, Ch1, Ch2, ...} in ETS.
Handler gets channel via single lookup_element call.
Remove persistent_term usage entirely.
Store as {{MountId, Idx}, Channel} for direct lookup.
No tuple manipulation needed.
Call py_context:extend_erlang_module_in_context/1 before importing
hornbeam_wsgi_worker and hornbeam_asgi_worker. This ensures the
erlang module is fully extended with send/call/schedule_inline
before the workers check HAS_ERLANG at import time.

Also adds noop_asgi.py benchmark app for ASGI testing.

Performance results:
- WSGI single context: 76K req/sec, 13us latency
- WSGI 14 workers: 62-64K req/sec
- ASGI single worker: 60K req/sec
- ASGI 14 workers: 113K req/sec, 9us latency
- Replace pooled worker architecture with context_call + schedule_inline
- WSGI now uses py_nif:context_call() with schedule_inline for yielding
- ASGI uses py_event_loop for async execution
- Remove hornbeam_worker_arbiter and hornbeam_worker_pool (obsolete)
- Simplify hornbeam_handler to single codepath
- Update Python workers for schedule_inline continuation pattern
- Add max_concurrent config for erlang_python
- Small bodies (< 64KB): buffered path (read fully before Python call)
- Large bodies (>= 64KB): stream via py_channel in 64KB chunks
- Add handle_wsgi_streaming entry point in Python worker
- StreamingBodyReader reads body chunks from channel
- Add hornbeam_context_pool:add_paths/1 to set pythonpath in all contexts
- Fix Channel.receive() to use timeout_ms parameter
- Add wsgi_body_chunk_size and wsgi_streaming_threshold config options
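The buffered-versus-streaming split can be sketched like this. The 64 KB constants mirror the defaults stated above; the function names are illustrative, not the module's API.

```python
# Hypothetical constants mirroring the config options above.
WSGI_STREAMING_THRESHOLD = 64 * 1024   # wsgi_streaming_threshold
WSGI_BODY_CHUNK_SIZE = 64 * 1024       # wsgi_body_chunk_size

def choose_body_path(content_length):
    """Pick buffered vs streaming delivery based on body size."""
    if content_length < WSGI_STREAMING_THRESHOLD:
        return "buffered"    # read fully before the Python call
    return "streaming"       # deliver via py_channel in chunks

def iter_chunks(body):
    # Split a large body into fixed-size chunks for channel streaming.
    for i in range(0, len(body), WSGI_BODY_CHUNK_SIZE):
        yield body[i:i + WSGI_BODY_CHUNK_SIZE]
```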
- Replace handle_wsgi_buffered/handle_wsgi_streaming with single handle_wsgi
- Add ChannelBuffer (inherits io.BufferedIOBase) for wsgi.input
- Body delivered via channel for all sizes (small: {body, Data}, large: chunks)
- Add hop-by-hop header filtering for HTTP compliance
- Remove BytesIO pool and StreamingBodyReader class
- Use py_buffer API for request body (zero-copy shared memory)
- Use erlang.send instead of erlang.reply for responses
- Skip buffer creation for bodyless GET/HEAD/DELETE/OPTIONS
- Preload WSGI app at startup in all contexts
- Single message path for simple [body] list responses
- Use worker mode instead of subinterpreter for contexts

Benchmark: 56,720 req/sec (5.8x faster than Gunicorn)
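Two pieces of the commit above, the `ChannelBuffer` for `wsgi.input` and the hop-by-hop filter, can be sketched as below. The real buffer reads from an erlang_python channel; a deque of pre-delivered chunks stands in for it here, and the class internals are assumptions.

```python
import io
from collections import deque

# Hop-by-hop headers must not be forwarded (RFC 9110 section 7.6.1).
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate",
    "proxy-authorization", "te", "trailer",
    "transfer-encoding", "upgrade",
}

def filter_headers(headers):
    # headers: iterable of (name, value) pairs from the app.
    return [(k, v) for k, v in headers if k.lower() not in HOP_BY_HOP]

class ChannelBuffer(io.BufferedIOBase):
    """Minimal wsgi.input backed by body chunks.

    Stand-in: chunks arrive as an iterable instead of a live channel.
    """
    def __init__(self, chunks):
        self._chunks = deque(chunks)
        self._buf = b""

    def readable(self):
        return True

    def read(self, size=-1):
        while size < 0 or len(self._buf) < size:
            if not self._chunks:
                break
            self._buf += self._chunks.popleft()
        if size < 0:
            data, self._buf = self._buf, b""
        else:
            data, self._buf = self._buf[:size], self._buf[size:]
        return data
```

Inheriting `io.BufferedIOBase` gives the app `readline`/`readlines` and iteration for free on top of `read`.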
- Rename workers config option to num_contexts for clarity
- Default num_contexts to erlang:system_info(schedulers)
- Restart context pool when num_contexts changes
- Fix _get_app safety check for multi-app scenarios
- Remove unused Python runtime functions
- Replace py_event_loop:create_task with spawn_task (fire-and-forget)
- Use py_buffer for request body streaming (consistent with WSGI)
- Handle async_result messages in receive loops
- Add asgi_noop_app.py benchmark app
- Simplify ASGI worker to use erlang.run() pattern

ASGI performance: ~39k req/sec (WSGI: ~62.5k req/sec)
- Add pre-computed ASGI_SCOPE_TEMPLATE macro for static scope fields
- Cache _erlang_send function reference to avoid attribute lookup per call
- Store cached send in _ASGISend.__slots__ for instance-level access

Performance improvement:
- Before: ~39k req/sec
- After:  ~63k req/sec (+62%)
- ASGI now matches WSGI performance
Check body size threshold before checking more_body flag.
This ensures large single-chunk responses are streamed instead
of buffered, preventing memory issues with large responses.

- Reorder threshold check to happen first
- Stream if total_size >= BUFFER_THRESHOLD (64KB)
- Add test app for large response validation
Fetch lifespan_state once when configuring cowboy routes instead
of calling hornbeam_lifespan:get_state() on every request.

- Add lifespan_state to HandlerState in start_listener
- Add lifespan_state to multi-app HandlerState
- Use cached state in build_scope instead of ETS lookup
- Fix quadratic buffering in ASGI send with O(1) size tracking
- Use create_task instead of spawn_task to avoid process overhead
- Move pythonpath setup to mount registration (not per-request)
- Implement ASGI request body streaming with more_body support
- Wire up WSGI tuple fast path for O(1) environ creation

ASGI now at 86% of WSGI throughput (67.5K vs 78.3K req/s).
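The quadratic-buffering fix amounts to the standard list-append pattern with a running size counter; a minimal sketch (class name hypothetical):

```python
class ResponseBuffer:
    """Accumulate body chunks with O(1) size tracking.

    Appending to a list and keeping a running total avoids the
    quadratic cost of repeated bytes concatenation (body += chunk),
    which copies the whole buffer on every append.
    """
    def __init__(self):
        self._chunks = []
        self.total_size = 0   # checked against the stream threshold

    def append(self, chunk):
        self._chunks.append(chunk)
        self.total_size += len(chunk)   # O(1) per chunk

    def getvalue(self):
        return b"".join(self._chunks)   # single O(n) join at the end
```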
Use Python-side lifespan state dict from hornbeam_lifespan_runner
instead of the Erlang-provided copy. This ensures state modifications
made by request handlers persist across requests per ASGI spec.
- Add context_mode option (worker | owngil) to hornbeam and context pool
- owngil mode uses per-interpreter GIL for true parallelism (Python 3.12+)
- Update benchmark to support WSGI owngil testing via PYTHON_CONFIG env var
- Rebuild erlang_python when PYTHON_CONFIG is set for correct Python version
Use hornbeam_context_pool instead of py:context() to ensure priv/ is in
sys.path when calling Python lifespan functions. Also use py_nif:context_call
with empty options map to avoid passing timeout as Python kwargs.
- Add _MutableStateProxy in hornbeam_asgi_worker.py that syncs
  scope['state'] mutations to Erlang ETS via erlang.send()
- Add update_state/2 and update_state/3 to hornbeam_lifespan.erl
- Add handle_info for {<<"update_state">>, Key, Value} messages
- Read fresh lifespan state from ETS per request (not cached)
- Update lifespan_test_app.py to prefer scope state over module state
- Requires erlang-python with erlang.whereis() support
benoitc added 22 commits March 19, 2026 16:04
- Remove unused buffering logic in _ASGISend
- Stream all responses directly through ByteChannel
- Fix default status code from 400 to 200 on http.response.start
- Remove unused fast path response handler
- Remove debug logging statements
- Add hop-by-hop header filtering to streaming path
- Close request channel when response starts
- Raise RuntimeError if http.response.start sent twice
- Raise RuntimeError if http.response.body sent before start
- Raise RuntimeError if send called after response completed
- Raise OSError on client disconnect per ASGI spec 2.4
Switch from py_event_loop to py_event_loop_pool for better
load distribution across multiple event loops. Process affinity
ensures ordered execution for requests from the same handler.

Benchmark shows improved scaling at higher concurrency:
- 200 connections: 25.4k req/s
- 400 connections: 27.5k req/s
WSGI worker:
- Remove unnecessary decode() calls (erlang_python handles in C)
- Add documentation for binary-to-string conversion

Lifespan runner:
- Add per-mount lifespan support for multi-app mode
- Each mount gets isolated state dict
- Add startup_mount/shutdown_mount functions

hornbeam.erl:
- Pass mount_id to lifespan startup for state isolation
- Build mount-specific options for lifespan protocol
- Add chunk coalescing in drain_response_channel (4KB threshold, 1ms timeout)
  to batch small chunks and reduce per-request syscall overhead
- Unify scope builders: use hornbeam_request:build_asgi_scope everywhere,
  remove duplicate build_scope from handler
- Optimize hooks: store individual hooks in persistent_term with direct keys
  for zero-overhead check when no hooks configured
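The chunk-coalescing idea (4 KB threshold, 1 ms quiet timeout) can be illustrated in Python, though the real `drain_response_channel` is Erlang; here a `queue.Queue` of byte chunks stands in for the response channel, with `None` marking end of response.

```python
import queue

COALESCE_THRESHOLD = 4096   # flush once 4KB is buffered
COALESCE_TIMEOUT = 0.001    # ...or after 1ms of quiet

def drain_coalesced(chan, send):
    """Batch small chunks before sending to cut per-chunk syscalls."""
    pending, size, done = [], 0, False
    while not done:
        try:
            chunk = chan.get(timeout=COALESCE_TIMEOUT)
            if chunk is None:          # end of response
                done = True
            else:
                pending.append(chunk)
                size += len(chunk)
            idle = False
        except queue.Empty:
            idle = True                # 1ms with nothing new: flush
        if size and (done or idle or size >= COALESCE_THRESHOLD):
            send(b"".join(pending))    # one write for many chunks
            pending, size = [], 0
```

Apps that `send()` many tiny body chunks thus cost one socket write per batch rather than one per chunk, while the 1 ms timeout keeps latency bounded for slow producers.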
Skip channel/pump for small bodies (<64KB): pass directly to Python.
Large bodies still use channel streaming. Reduces process spawns and
memory pressure for typical requests.
Prototype using Cowboy's async body reading with push/pull pattern:
- cowboy_req:cast for async body chunks
- ASGIProtocol class mirroring asyncio.Protocol interface
- Buffer + asyncio.Event for ASGI receive()

Use worker_class => asgi_loop to test. Not yet optimized for production.
- Remove response channel, use erlang.send() directly for body
- Remove drain loop with timer polling
- Remove Python buffer/event/reader task pattern
- Read from request channel directly in receive()

Result: -130 lines, +21% GET, +10% POST throughput
- Rename hornbeam_asgi_loop to hornbeam_asgi
- Remove old ASGI handler code from hornbeam_handler.erl
- Use direct reply for responses with Content-Length
- Use chunked encoding for responses without Content-Length
- Fix empty body handling: check Transfer-Encoding too
- Simplify Python worker: no buffering, read channel directly

Removes ~900 lines, all 38 ASGI tests pass.
- Cache pid/streamid in state for direct send (avoid map lookups)
- Use direct Pid ! message instead of cowboy_req:cast for body reading
- Replace 5+ message types with simplified protocol:
  - start_response: headers + first chunk
  - chunk: subsequent body chunks
  - fin: end of response
- Remove headers_sent/buffered_headers state fields
Module is already imported by Erlang via ensure_all_imported,
so use sys.modules lookup with caching instead of importlib.
Replace py_event_loop_pool with py_event_loop:get_loop() to remove
pool routing overhead.
- Register lifespan_state_get/set callbacks at startup in hornbeam_lifespan
- Replace _MutableStateProxy with _LazyStateProxy in Python
- Remove state from ASGI scope building, Python fetches lazily

Benefits:
- No erlang.whereis() on every request
- State only fetched when actually accessed
- Direct ETS access via callbacks, no message passing
- hornbeam_context_pool now uses py_context_router for context lifecycle
- Cache NIF refs in persistent_term for O(1) access without message passing
- Cache state proxies per mount_id to avoid allocation per request
- Use py_import:add_path and py_import:ensure_imported for mount setup
- hornbeam_context_pool now caches NIF refs from default pool
- Wait for py_context_router to be ready before caching
- Remove workers and pool_enabled from mount type (use shared pool)
- Simplify mount type to just routing config
- Add [Unreleased] changelog section with performance metrics
- Remove per-mount workers option from docs (now uses shared pool)
- Add notes explaining shared py_context_router pool architecture
- Delete hornbeam_pool.erl (replaced by hornbeam_context_pool)
- Remove unused get_context_rr/0 and stats/0 from hornbeam_context_pool
- Remove dead handle_request and _process_environ from WSGI worker
- Remove unused streaming code from ASGI runner
- Fix comment in hornbeam_sup.erl
@benoitc benoitc merged commit 3057536 into main Mar 24, 2026
11 checks passed