Environment
What version are you running? 25.9.0
Steps to Reproduce
- Have a Kafka Cluster
- Use SSL
- Take one broker down to test fail-over (cut either network or power so it is completely unreachable)
- Snuba doesn't drop the connection to the dead broker
- Snuba is stuck in an SSL handshake error loop until restarted
The following environment variables are set:
DEFAULT_BROKERS: kafka-broker-1,kafka-broker-2,...,kafka-broker-9
KAFKA_SECURITY_PROTOCOL: SSL
KAFKA_SSL_CA_PATH: /etc/ssl/certs/my-ca.pem
KAFKA_SSL_CERT_PATH: client.crt
KAFKA_SSL_KEY_PATH: client.key
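For reference, the variables above roughly correspond to the librdkafka client properties below. This is only a sketch: the property names are real librdkafka settings, but the mapping table and the `build_rdkafka_config` helper are hypothetical and are not taken from Snuba's settings code.

```python
import os

# Hypothetical mapping from the environment variables in this report to
# librdkafka property names. Snuba's actual settings code may differ.
ENV_TO_RDKAFKA = {
    "DEFAULT_BROKERS": "bootstrap.servers",
    "KAFKA_SECURITY_PROTOCOL": "security.protocol",
    "KAFKA_SSL_CA_PATH": "ssl.ca.location",
    "KAFKA_SSL_CERT_PATH": "ssl.certificate.location",
    "KAFKA_SSL_KEY_PATH": "ssl.key.location",
}


def build_rdkafka_config(env=None):
    """Return a librdkafka-style config dict for the variables that are set."""
    env = os.environ if env is None else env
    return {
        prop: env[name]
        for name, prop in ENV_TO_RDKAFKA.items()
        if name in env
    }


if __name__ == "__main__":
    # Print the effective client properties for the current environment.
    for key, value in sorted(build_rdkafka_config().items()):
        print(f"{key}={value}")
```

Printing the resulting properties from inside the affected container makes it easy to confirm that every consumer sees the same broker list and certificate paths.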
Expected Result
Snuba drops the broken connection and connects to another working broker.
The health-check file is not created while the consumer is in a non-working state:
--health-check-file /tmp/health.txt
Actual Result
Snuba keeps trying to do an SSL handshake with the dead broker.
%4|1771925946.628|FAIL|rdkafka#producer-1| [thrd:ssl://kafka-broker-1:9093/bootstra]: ssl://kafka-broker-1:9093/1: Connection setup timed out in state CONNECT (after 30027ms in state CONNECT, 1 identical error(s) suppressed
The health-check file is still being created, so the container is never restarted automatically.
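To illustrate why this blocks automatic restarts, here is a minimal sketch of the kind of liveness probe a container runtime might run against the health-check file. It assumes the file is touched periodically while the consumer considers itself healthy. The `health_file_is_fresh` helper and the 60-second threshold are hypothetical and not part of Snuba.

```python
import os
import sys
import time


def health_file_is_fresh(path, max_age_seconds, now=None):
    """Return True if `path` exists and was modified within `max_age_seconds`."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file counts as unhealthy.
        return False
    return (now - mtime) <= max_age_seconds


if __name__ == "__main__":
    # Hypothetical threshold; exit non-zero so the runtime restarts the container.
    healthy = health_file_is_fresh("/tmp/health.txt", max_age_seconds=60)
    sys.exit(0 if healthy else 1)
```

Because the file keeps being refreshed while the consumer is stuck in the handshake loop, a probe like this keeps passing and the container is never restarted.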
Additional information
Some Snuba consumers do drop the connection (I see about 2-5 errors in the log) and connect to a working broker, while others don't. I haven't found out why it sometimes works and sometimes doesn't.