While running the BentoML service with 4 workers (each with 1 thread), it appears that the incoming HTTP requests are not evenly balanced across the worker processes.
I'd like to know if there's any configuration I'm missing.
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43376 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.820ms (trace=2cf1120afa437d5ebe8f9792eb3519b0,span=2507eaf9014aba6e,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43608 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 202.028ms (trace=5565f4c5a624f3cda4c4835b2853f727,span=0e51701b5e614c96,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43816 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.912ms (trace=d475a2d347df94ca4edd3da858d16905,span=7b44259dafc4f72b,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44090 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.699ms (trace=a539c617542d29cb80aa70dc8b2ee42d,span=3d80958ead45d70c,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44270 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.748ms (trace=fdb5dc2c2773f0d87a2ff20acdd45eb3,span=d83e432d948c33dc,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms (trace=84fc319d34bfa9337a2eb08314015323,span=d0b1e9ead2bb78eb,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44962 (scheme=http,method=POST,path=/classify,type=application/json,length=78) (status=200,type=application/json,length=10) 201.729ms (trace=ae1647b92484b84a80ba7cee49d9667c,span=537587c62b086b50,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45190 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.974ms (trace=ac90d9510dc60c5162309c900911d846,span=a1842bae213272ce,sampled=0,service.name=Predictor)
2025-08-06T06:00:55+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45382 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/
...
Here are the simple statistics: every request in the log excerpt above is handled by the same worker process (`Predictor:3` appears in every line).
All the requests should be evenly distributed among the workers.
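The per-worker breakdown can be recomputed from the attached access log with a small sketch like the following (the `[entry_service:Predictor:N]` tag format is taken from the log excerpt above; the helper name and command-line usage are hypothetical):

```python
import re
import sys
from collections import Counter

# Matches the worker tag in BentoML access-log lines, e.g.
#   [entry_service:Predictor:3] 172.16.140.42:43376 ... path=/classify ...
worker_re = re.compile(r"\[entry_service:Predictor:(\d+)\]")

def per_worker_counts(lines):
    """Count how many requests each worker index served."""
    counts = Counter()
    for line in lines:
        m = worker_re.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    # e.g. python count_workers.py bentoml_test_server.log
    with open(sys.argv[1]) as f:
        for worker, n in sorted(per_worker_counts(f).items()):
            print(f"worker {worker}: {n} requests")
```

An even distribution would show roughly equal counts for workers 1 through 4; the logs above would instead show everything under worker 3.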
# service3.py
import bentoml
import logging
import time

bentoml_logger = logging.getLogger("bentoml")


@bentoml.service(workers=4, threads=1)
class Predictor:
    def __init__(self):
        pass

    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> list[float]:
        """
        input_ids example:
        [[82, 13, 59, 45, 97, 36, 74, 6, 91, 12, 33, 19, 77, 68, 40, 50]]
        """
        time.sleep(0.2)  # simulate ~200 ms of inference work
        return [0.1, 0.2]
Full logs are attached here.
bentoml_test_server.log
To reproduce
Serve service3.py above (e.g. `bentoml serve service3:Predictor`) and send repeated POST requests to /classify.
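A minimal client sketch for driving the service; the endpoint path and the payload shape are taken from the logs and the docstring above, while `localhost:3000` (BentoML's default serving address) and the request count are assumptions:

```python
import json
import urllib.request

# Assumption: the service is running locally on BentoML's default port 3000.
URL = "http://localhost:3000/classify"

# Payload shape taken from the classify() docstring in service3.py.
PAYLOAD = json.dumps(
    {"input_ids": [[82, 13, 59, 45, 97, 36, 74, 6, 91, 12, 33, 19, 77, 68, 40, 50]]}
).encode()

def send_one() -> bytes:
    """POST one classification request and return the raw response body."""
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    for _ in range(20):
        print(send_one())
```

The per-worker distribution can then be checked against the server's access log.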
Environment
bentoml: 1.4.19