Optimize WSGI/ASGI performance #13

Merged
benoitc merged 52 commits into main from
feature/wsgi-asgi-performance-optimization
Mar 24, 2026
Conversation


benoitc (Owner) commented Mar 17, 2026

Summary

  • Add Python context pool with scheduler affinity for reduced context switching
  • Implement zero-copy body streaming via py_buffer
  • Add environ template caching and BytesIO/ASGIResponse pooling
  • Cache lifespan state in handler state at startup
  • Stream pooled responses directly to client
  • Unify ASGI handler and simplify message protocol
  • Use lazy state proxy with callbacks for ASGI state access
  • Use event loop pool for ASGI task distribution
  • Reuse default py_context_router pool instead of per-mount workers
  • Remove dead code (hornbeam_pool, unused streaming functions)
  • Bump erlang_python to 2.2.0

benoitc added 30 commits March 12, 2026 12:38
- Fix erlang_python git repo URL (erlang-python not erlang_python)
- Replace py:bind/py:unbind with py:context/py:contexts_started
- Replace py:ctx_call with py:call
- Replace py:with_context with direct py:call
- Update py:call signatures to use options map for timeout
- Create hornbeam_request.erl for pre-parsing HTTP requests in Erlang
- Add to_wsgi_header_key/1 for header format conversion
- Add build_wsgi_tuple/2 and build_asgi_scope/2 functions
- Add BytesIO pool to WSGI runner to reduce allocation overhead
- Add environ template for O(1) environ creation
- Add run_wsgi_fast/3 and create_environ_from_tuple/1 for fast path
- Add response object pool with reset() method
- Add _get_response() and _return_response() pool functions
- Pool size of 100 responses for high-throughput scenarios
- Use _ENVIRON_TEMPLATE.copy() instead of inline dict creation
- Use pooled BytesIO for wsgi.input
- Return BytesIO to pool after request completion
Workers receive requests via channels and loop continuously,
reducing Python startup overhead. Features heartbeat monitoring,
scheduler affinity routing, and automatic restart on failure.

Enable via mount config: pool_enabled => true
Use cowboy stream_reply/stream_body instead of collecting
chunks before sending. Reduces memory usage and latency
for large streaming responses.
Remove per-mount heartbeat_interval and heartbeat_timeout options.
Use module constants (5s interval, 15s timeout) for all workers.
Store channels as tuple instead of individual entries.
One lookup instead of two: get tuple, then element().
Store channels as {{pool, MountId}, Ch1, Ch2, ...} in ETS.
Handler gets channel via single lookup_element call.
Remove persistent_term usage entirely.
Store as {{MountId, Idx}, Channel} for direct lookup.
No tuple manipulation needed.
Call py_context:extend_erlang_module_in_context/1 before importing
hornbeam_wsgi_worker and hornbeam_asgi_worker. This ensures the
erlang module is fully extended with send/call/schedule_inline
before the workers check HAS_ERLANG at import time.

Also adds noop_asgi.py benchmark app for ASGI testing.

Performance results:
- WSGI single context: 76K req/sec, 13us latency
- WSGI 14 workers: 62-64K req/sec
- ASGI single worker: 60K req/sec
- ASGI 14 workers: 113K req/sec, 9us latency
- Replace pooled worker architecture with context_call + schedule_inline
- WSGI now uses py_nif:context_call() with schedule_inline for yielding
- ASGI uses py_event_loop for async execution
- Remove hornbeam_worker_arbiter and hornbeam_worker_pool (obsolete)
- Simplify hornbeam_handler to single codepath
- Update Python workers for schedule_inline continuation pattern
- Add max_concurrent config for erlang_python
- Small bodies (< 64KB): buffered path (read fully before Python call)
- Large bodies (>= 64KB): stream via py_channel in 64KB chunks
- Add handle_wsgi_streaming entry point in Python worker
- StreamingBodyReader reads body chunks from channel
- Add hornbeam_context_pool:add_paths/1 to set pythonpath in all contexts
- Fix Channel.receive() to use timeout_ms parameter
- Add wsgi_body_chunk_size and wsgi_streaming_threshold config options
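The buffered-versus-streaming split can be sketched like this. The 64 KB constants mirror the defaults stated above; the function names are illustrative, not the module's API.

```python
# Hypothetical constants mirroring the config options above.
WSGI_STREAMING_THRESHOLD = 64 * 1024   # wsgi_streaming_threshold
WSGI_BODY_CHUNK_SIZE = 64 * 1024       # wsgi_body_chunk_size

def choose_body_path(content_length):
    """Pick buffered vs streaming delivery based on body size."""
    if content_length < WSGI_STREAMING_THRESHOLD:
        return "buffered"    # read fully before the Python call
    return "streaming"       # deliver via py_channel in chunks

def iter_chunks(body):
    # Split a large body into fixed-size chunks for channel streaming.
    for i in range(0, len(body), WSGI_BODY_CHUNK_SIZE):
        yield body[i:i + WSGI_BODY_CHUNK_SIZE]
```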
- Replace handle_wsgi_buffered/handle_wsgi_streaming with single handle_wsgi
- Add ChannelBuffer (inherits io.BufferedIOBase) for wsgi.input
- Body delivered via channel for all sizes (small: {body, Data}, large: chunks)
- Add hop-by-hop header filtering for HTTP compliance
- Remove BytesIO pool and StreamingBodyReader class
- Use py_buffer API for request body (zero-copy shared memory)
- Use erlang.send instead of erlang.reply for responses
- Skip buffer creation for bodyless GET/HEAD/DELETE/OPTIONS
- Preload WSGI app at startup in all contexts
- Single message path for simple [body] list responses
- Use worker mode instead of subinterpreter for contexts

Benchmark: 56,720 req/sec (5.8x faster than Gunicorn)
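Two pieces of the commit above, the `ChannelBuffer` for `wsgi.input` and the hop-by-hop filter, can be sketched as below. The real buffer reads from an erlang_python channel; a deque of pre-delivered chunks stands in for it here, and the class internals are assumptions.

```python
import io
from collections import deque

# Hop-by-hop headers must not be forwarded (RFC 9110 section 7.6.1).
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate",
    "proxy-authorization", "te", "trailer",
    "transfer-encoding", "upgrade",
}

def filter_headers(headers):
    # headers: iterable of (name, value) pairs from the app.
    return [(k, v) for k, v in headers if k.lower() not in HOP_BY_HOP]

class ChannelBuffer(io.BufferedIOBase):
    """Minimal wsgi.input backed by body chunks.

    Stand-in: chunks arrive as an iterable instead of a live channel.
    """
    def __init__(self, chunks):
        self._chunks = deque(chunks)
        self._buf = b""

    def readable(self):
        return True

    def read(self, size=-1):
        while size < 0 or len(self._buf) < size:
            if not self._chunks:
                break
            self._buf += self._chunks.popleft()
        if size < 0:
            data, self._buf = self._buf, b""
        else:
            data, self._buf = self._buf[:size], self._buf[size:]
        return data
```

Inheriting `io.BufferedIOBase` gives the app `readline`/`readlines` and iteration for free on top of `read`.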
- Rename workers config option to num_contexts for clarity
- Default num_contexts to erlang:system_info(schedulers)
- Restart context pool when num_contexts changes
- Fix _get_app safety check for multi-app scenarios
- Remove unused Python runtime functions
- Replace py_event_loop:create_task with spawn_task (fire-and-forget)
- Use py_buffer for request body streaming (consistent with WSGI)
- Handle async_result messages in receive loops
- Add asgi_noop_app.py benchmark app
- Simplify ASGI worker to use erlang.run() pattern

ASGI performance: ~39k req/sec (WSGI: ~62.5k req/sec)
- Add pre-computed ASGI_SCOPE_TEMPLATE macro for static scope fields
- Cache _erlang_send function reference to avoid attribute lookup per call
- Store cached send in _ASGISend.__slots__ for instance-level access

Performance improvement:
- Before: ~39k req/sec
- After:  ~63k req/sec (+62%)
- ASGI now matches WSGI performance
Check body size threshold before checking more_body flag.
This ensures large single-chunk responses are streamed instead
of buffered, preventing memory issues with large responses.

- Reorder threshold check to happen first
- Stream if total_size >= BUFFER_THRESHOLD (64KB)
- Add test app for large response validation
Fetch lifespan_state once when configuring cowboy routes instead
of calling hornbeam_lifespan:get_state() on every request.

- Add lifespan_state to HandlerState in start_listener
- Add lifespan_state to multi-app HandlerState
- Use cached state in build_scope instead of ETS lookup
- Fix quadratic buffering in ASGI send with O(1) size tracking
- Use create_task instead of spawn_task to avoid process overhead
- Move pythonpath setup to mount registration (not per-request)
- Implement ASGI request body streaming with more_body support
- Wire up WSGI tuple fast path for O(1) environ creation

ASGI now at 86% of WSGI throughput (67.5K vs 78.3K req/s).
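The quadratic-buffering fix amounts to the standard list-append pattern with a running size counter; a minimal sketch (class name hypothetical):

```python
class ResponseBuffer:
    """Accumulate body chunks with O(1) size tracking.

    Appending to a list and keeping a running total avoids the
    quadratic cost of repeated bytes concatenation (body += chunk),
    which copies the whole buffer on every append.
    """
    def __init__(self):
        self._chunks = []
        self.total_size = 0   # checked against the stream threshold

    def append(self, chunk):
        self._chunks.append(chunk)
        self.total_size += len(chunk)   # O(1) per chunk

    def getvalue(self):
        return b"".join(self._chunks)   # single O(n) join at the end
```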
Use Python-side lifespan state dict from hornbeam_lifespan_runner
instead of the Erlang-provided copy. This ensures state modifications
made by request handlers persist across requests per ASGI spec.
- Add context_mode option (worker | owngil) to hornbeam and context pool
- owngil mode uses per-interpreter GIL for true parallelism (Python 3.12+)
- Update benchmark to support WSGI owngil testing via PYTHON_CONFIG env var
- Rebuild erlang_python when PYTHON_CONFIG is set for correct Python version
Use hornbeam_context_pool instead of py:context() to ensure priv/ is in
sys.path when calling Python lifespan functions. Also use py_nif:context_call
with empty options map to avoid passing timeout as Python kwargs.
- Add _MutableStateProxy in hornbeam_asgi_worker.py that syncs
  scope['state'] mutations to Erlang ETS via erlang.send()
- Add update_state/2 and update_state/3 to hornbeam_lifespan.erl
- Add handle_info for {<<"update_state">>, Key, Value} messages
- Read fresh lifespan state from ETS per request (not cached)
- Update lifespan_test_app.py to prefer scope state over module state
- Requires erlang-python with erlang.whereis() support
benoitc added 22 commits March 19, 2026 16:04
- Remove unused buffering logic in _ASGISend
- Stream all responses directly through ByteChannel
- Fix default status code from 400 to 200 on http.response.start
- Remove unused fast path response handler
- Remove debug logging statements
- Add hop-by-hop header filtering to streaming path
- Close request channel when response starts
- Raise RuntimeError if http.response.start sent twice
- Raise RuntimeError if http.response.body sent before start
- Raise RuntimeError if send called after response completed
- Raise OSError on client disconnect per ASGI spec 2.4
Switch from py_event_loop to py_event_loop_pool for better
load distribution across multiple event loops. Process affinity
ensures ordered execution for requests from the same handler.

Benchmark shows improved scaling at higher concurrency:
- 200 connections: 25.4k req/s
- 400 connections: 27.5k req/s
WSGI worker:
- Remove unnecessary decode() calls (erlang_python handles in C)
- Add documentation for binary-to-string conversion

Lifespan runner:
- Add per-mount lifespan support for multi-app mode
- Each mount gets isolated state dict
- Add startup_mount/shutdown_mount functions

hornbeam.erl:
- Pass mount_id to lifespan startup for state isolation
- Build mount-specific options for lifespan protocol
- Add chunk coalescing in drain_response_channel (4KB threshold, 1ms timeout)
  to batch small chunks and reduce per-request syscall overhead
- Unify scope builders: use hornbeam_request:build_asgi_scope everywhere,
  remove duplicate build_scope from handler
- Optimize hooks: store individual hooks in persistent_term with direct keys
  for zero-overhead check when no hooks configured
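The chunk-coalescing idea (4 KB threshold, 1 ms quiet timeout) can be illustrated in Python, though the real `drain_response_channel` is Erlang; here a `queue.Queue` of byte chunks stands in for the response channel, with `None` marking end of response.

```python
import queue

COALESCE_THRESHOLD = 4096   # flush once 4KB is buffered
COALESCE_TIMEOUT = 0.001    # ...or after 1ms of quiet

def drain_coalesced(chan, send):
    """Batch small chunks before sending to cut per-chunk syscalls."""
    pending, size, done = [], 0, False
    while not done:
        try:
            chunk = chan.get(timeout=COALESCE_TIMEOUT)
            if chunk is None:          # end of response
                done = True
            else:
                pending.append(chunk)
                size += len(chunk)
            idle = False
        except queue.Empty:
            idle = True                # 1ms with nothing new: flush
        if size and (done or idle or size >= COALESCE_THRESHOLD):
            send(b"".join(pending))    # one write for many chunks
            pending, size = [], 0
```

Apps that `send()` many tiny body chunks thus cost one socket write per batch rather than one per chunk, while the 1 ms timeout keeps latency bounded for slow producers.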
Skip channel/pump for small bodies (<64KB): pass directly to Python.
Large bodies still use channel streaming. Reduces process spawns and
memory pressure for typical requests.
Prototype using Cowboy's async body reading with push/pull pattern:
- cowboy_req:cast for async body chunks
- ASGIProtocol class mirroring asyncio.Protocol interface
- Buffer + asyncio.Event for ASGI receive()

Use worker_class => asgi_loop to test. Not yet optimized for production.
- Remove response channel, use erlang.send() directly for body
- Remove drain loop with timer polling
- Remove Python buffer/event/reader task pattern
- Read from request channel directly in receive()

Result: -130 lines, +21% GET, +10% POST throughput
- Rename hornbeam_asgi_loop to hornbeam_asgi
- Remove old ASGI handler code from hornbeam_handler.erl
- Use direct reply for responses with Content-Length
- Use chunked encoding for responses without Content-Length
- Fix empty body handling: check Transfer-Encoding too
- Simplify Python worker: no buffering, read channel directly

Removes ~900 lines, all 38 ASGI tests pass.
- Cache pid/streamid in state for direct send (avoid map lookups)
- Use direct Pid ! message instead of cowboy_req:cast for body reading
- Replace 5+ message types with simplified protocol:
  - start_response: headers + first chunk
  - chunk: subsequent body chunks
  - fin: end of response
- Remove headers_sent/buffered_headers state fields
Module is already imported by Erlang via ensure_all_imported,
so use sys.modules lookup with caching instead of importlib.
Replace py_event_loop_pool with py_event_loop:get_loop() to remove
pool routing overhead.
- Register lifespan_state_get/set callbacks at startup in hornbeam_lifespan
- Replace _MutableStateProxy with _LazyStateProxy in Python
- Remove state from ASGI scope building, Python fetches lazily

Benefits:
- No erlang.whereis() on every request
- State only fetched when actually accessed
- Direct ETS access via callbacks, no message passing
- hornbeam_context_pool now uses py_context_router for context lifecycle
- Cache NIF refs in persistent_term for O(1) access without message passing
- Cache state proxies per mount_id to avoid allocation per request
- Use py_import:add_path and py_import:ensure_imported for mount setup
- hornbeam_context_pool now caches NIF refs from default pool
- Wait for py_context_router to be ready before caching
- Remove workers and pool_enabled from mount type (use shared pool)
- Simplify mount type to just routing config
- Add [Unreleased] changelog section with performance metrics
- Remove per-mount workers option from docs (now uses shared pool)
- Add notes explaining shared py_context_router pool architecture
- Delete hornbeam_pool.erl (replaced by hornbeam_context_pool)
- Remove unused get_context_rr/0 and stats/0 from hornbeam_context_pool
- Remove dead handle_request and _process_environ from WSGI worker
- Remove unused streaming code from ASGI runner
- Fix comment in hornbeam_sup.erl
@benoitc benoitc merged commit 3057536 into main Mar 24, 2026
11 checks passed