This repository was archived by the owner on Aug 12, 2024. It is now read-only.

SERVFAIL when trying to resolve a service address #44

@caesar-ralf

What happened?

After upgrading com.spotify:dns from version 3.1.5 to 3.2.2, lookups for some of our services started failing with SERVFAIL even though the service exists.

What was expected?

Since the public API of com.spotify:dns has no apparent breaking changes, we expected the upgrade not to affect functionality.

How to reproduce

We haven't found a reliable way to reproduce this, and we haven't managed to pin down the cause. It seems related to concurrency, since the problem only appears intermittently. I am more than glad to show the issue happening in a service.
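One way to probe the suspected concurrency dependence would be to hammer the same lookup from several threads and count the failures. The following is only a minimal sketch, not something we have confirmed reproduces the problem: the class `ConcurrentLookupProbe` and its `run` method are hypothetical names, the lookup is passed in as a plain `Function`, and it assumes the resolver reports failures by throwing an unchecked exception, as the `DnsException` in the stack trace below appears to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs the same lookup from many threads and tallies the outcomes.
final class ConcurrentLookupProbe {

  /** Number of lookups that returned normally and number that threw. */
  static final class Result {
    final int successes;
    final int failures;

    Result(int successes, int failures) {
      this.successes = successes;
      this.failures = failures;
    }
  }

  /**
   * Calls lookup.apply(fqdn) callsPerThread times on each of `threads` worker threads.
   * Assumption: a failed lookup surfaces as a RuntimeException.
   */
  static Result run(Function<String, ?> lookup, String fqdn, int threads, int callsPerThread) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<int[]>> tasks = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        tasks.add(() -> {
          int ok = 0;
          int failed = 0;
          for (int i = 0; i < callsPerThread; i++) {
            try {
              lookup.apply(fqdn);
              ok++;
            } catch (RuntimeException e) {
              failed++; // e.g. a SERVFAIL reported by the resolver
            }
          }
          return new int[] {ok, failed};
        });
      }

      int ok = 0;
      int failed = 0;
      // invokeAll blocks until every task has finished.
      for (Future<int[]> future : pool.invokeAll(tasks)) {
        int[] counts = future.get();
        ok += counts[0];
        failed += counts[1];
      }
      return new Result(ok, failed);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for lookups", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("lookup task failed unexpectedly", e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With com.spotify:dns on the classpath, the lookup argument could presumably be `resolver::resolve` for a `DnsSrvResolver` instance, with the failing SRV name as `fqdn`. Comparing the failure counts between 3.1.5 and 3.2.2 might show whether the number of threads matters.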

Context

We need to upgrade dnsjava:dnsjava from version 2.x to 3.x, and we saw that com.spotify:dns made this change in version 3.2.0. We tested it in a few services and they seemed to work fine, so we decided to roll the change out to all of our users. Some of those services, as far as we can tell the ones using gRPC, then started getting SERVFAIL intermittently.

Here is an anonymised stack trace:

Jul 15, 2021 4:29:20 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<38>: (${PROTOCOL}://${SERVICE})] Failed to resolve name. status=Status{code=UNAVAILABLE, description=null, cause=java.util.concurrent.CompletionException: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL 
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
	at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:683)
	at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:658)
	at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2094)
	at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$4(DnsSrvNameResolver.java:160)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL 
	at com.spotify.dns.XBillDnsSrvResolver.resolve(XBillDnsSrvResolver.java:60)
	at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$0(DnsSrvNameResolver.java:162)
	at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:680)
	... 6 more
}

We tried bumping dnsjava:dnsjava from 3.0.2 to 3.4.0 and the problem seemed to go away, but after about 10 minutes of the service running it started again. I am not sure whether this was a local problem.

Running dig srv ${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS} returns some hosts, as expected. Reverting to com.spotify:dns:3.1.5 and dnsjava:dnsjava:2.x makes the problem go away.

Java version used during the test:

$ java -version
> openjdk version "11.0.10" 2021-01-19 LTS
> OpenJDK Runtime Environment Corretto-11.0.10.9.1 (build 11.0.10+9-LTS)
> OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (build 11.0.10+9-LTS, mixed mode)
