grpcproxy: fix CAS writes and AC reads failing against strict gRPC backends#886

Open
lance-tan-ai wants to merge 5 commits into buchgr:master from lance-tan-ai:fix-grpc-proxy-finish-write

Conversation

@lance-tan-ai lance-tan-ai commented Feb 24, 2026

Errors

When using the gRPC proxy backend, both CAS write-throughs and AC read-throughs fail against gRPC cache backends that strictly enforce the Remote Execution API spec.

CAS writes fail with finished_writing='false':

GRPC PROXY WRITE CAS ...: rpc error: code = InvalidArgument desc = Request completed before all data was sent: digest='.../167222', actual_size='167222', finished_writing='false'

After partially fixing that, writes then fail with a write_offset mismatch:

GRPC PROXY WRITE CAS ...: rpc error: code = InvalidArgument desc = write_offset: Current write_offset (0) does not match expected one (4572)

AC reads fail with an invalid digest:

GRPC PROXY GET AC ...: rpc error: code = InvalidArgument desc = action_digest: invalid or missing action digest

Root causes

1. finish_write never set on the final ByteStream WriteRequest

The ByteStream protocol requires the last WriteRequest to have finish_write = true to signal upload completion. The previous loop never set it — calling CloseAndRecv() alone is not sufficient for backends that require an explicit terminal message.

2. write_offset not tracked across chunks

Each WriteRequest must carry the cumulative byte offset of data sent so far. The previous implementation always sent write_offset = 0, causing backends to reject multi-chunk uploads.

3. Action digest size_bytes sent as -1 in GetActionResult

grpc_ac.go passes unknownActionResultSize (-1) to cache.Get() because the ActionResult size is not known at request time. The proxy was forwarding this same -1 as size_bytes in the GetActionResult Digest, which backends reject.

The correct value is ActionDigest.SizeBytes from the client request (the size of the Action, not the ActionResult). Since the Proxy.Get interface doesn't carry the action digest size, it is threaded via context.WithValue using a package-level typed key.

Fix

  • Track finishWrite from io.EOF and set FinishWrite: true on the corresponding WriteRequest, handling both EOF-with-data (n > 0) and EOF-only (n == 0).
  • Track a running writeOffset and include it in every WriteRequest.
  • In grpc_ac.go, store req.ActionDigest.SizeBytes in context before calling cache.Get(). In the proxy's AC path, read it back to construct a valid Digest for GetActionResult.

Validation

After deploying with these fixes and clearing the local disk cache to force proxy fallbacks, all CAS writes and AC reads succeed:

GRPC PROXY GET AC 4dfa76214a85a477dc124e75adfe9415d07ce06d6abedcf77513f1dc2b395738: Success
GRPC PROXY CONTAINS CAS ce443496d644a99cc58a1da41a40aaf18d356abec228f4e1864ff72702887105: Success
GRPC PROXY GET AC 7a04428fcdb0388652bc0193c22a9fa5249f54170dca5cb76622c101ee0c6e00: Success
GRPC PROXY GET AC fda4ef0d282774ed830a2492be005785fd7c5e68011cd523c45a9c458bb1d011: Success
GRPC PROXY GET AC 5f82c3f1c16e0bb4a00ad91c9018198d55243af7c4b55e1882759ee53e3fc248: Success
GRPC PROXY GET AC ae8217d7f508d073256d729f7b0b1b14affa20576c2c180904f1679bdd265c42: Success
GRPC PROXY GET AC 31ae1f5e45defefd1ff2338a4ebd96d34823c4030d50ec99f2b959ea78ebf65a: Success
GRPC PROXY GET AC ea8e8bc151329e536b073d202b2af4c701df28b6ce484e0da6ee10cbd6337837: Success
GRPC PROXY GET AC 5aab10c5e7a8f372b250bcc6b69403e85976e257f364d283577d552684a1b41e: Success

The ByteStream Write protocol requires the last WriteRequest to have
finish_write=true to signal that the upload is complete. Without it,
servers that strictly enforce the spec (e.g. EngFlow) reject the write
with: "Request completed before all data was sent: finished_writing=false".

The previous loop only called CloseAndRecv() after reading n==0, which
closes the gRPC stream client-side but never sends a message with
finish_write=true. io.Reader is also permitted to return n>0 alongside
io.EOF in the same call, so the last data chunk could be sent without
the finish flag and the n==0 branch might never be reached.

Fix: track whether the most recent Read returned io.EOF (finishWrite),
set FinishWrite on the WriteRequest for that chunk, and call
CloseAndRecv immediately after — handling both the "EOF with data" and
"EOF alone" cases.

Co-authored-by: Cursor <cursoragent@cursor.com>

lance-tan-ai commented Feb 24, 2026

@buchgr Mind taking a look at this? We are hitting this issue when trying to write CAS blobs through bazel-remote's gRPC proxy.

FWIW, our setup is the basic bazel-remote configuration, just pointed at the gRPC proxy backend. We see this error on every proxy write.

lance-tan-ai and others added 4 commits February 24, 2026 09:50
The previous fix set FinishWrite=true on the data chunk when Read
returned (n>0, io.EOF) simultaneously, but missed the case where Read
returns (n>0, nil) for the last chunk and then (0, io.EOF) on the
following call — which is the more common io.Reader behaviour.

In that case the data was fully sent across previous iterations but no
message with FinishWrite=true was ever transmitted; CloseAndRecv just
closed the stream client-side, leaving the server reporting
finished_writing=false.

Fix: when finishWrite is true and n==0, send a zero-data WriteRequest
with FinishWrite=true before calling CloseAndRecv so the server always
receives an explicit completion signal.

Co-authored-by: Cursor <cursoragent@cursor.com>
The ByteStream Write protocol requires each WriteRequest to carry the
byte offset of its data relative to the start of the resource. Without
it the server rejects the terminal finish_write=true message with:

  write_offset: Current write_offset (0) does not match expected one (<size>)

Track a running writeOffset, increment it after each data chunk, and
pass it on every WriteRequest — including the zero-data terminal message
sent when io.Reader returns (0, io.EOF) separately from the last chunk.

Co-authored-by: Cursor <cursoragent@cursor.com>
Get() received the action digest size as the size parameter but the AC
branch ignored it, always sending SizeBytes=-1 to the upstream. Servers
that require a valid digest (e.g. EngFlow) reject this with:

  action_digest: invalid or missing action digest

Fix: pass size through directly; it comes from the client's original
GetActionResult request and is always the correct action digest size.

Co-authored-by: Cursor <cursoragent@cursor.com>
grpc_ac.go passes -1 for the ActionResult size (unknown at request time),
but the same -1 was also being used as the action digest size_bytes in the
proxy's GetActionResult call, which gRPC remote caches reject as invalid.

Thread the action digest size_bytes via context so the proxy can construct
a correct Digest without changing the Proxy.Get interface.

Co-authored-by: Cursor <cursoragent@cursor.com>
@lance-tan-ai lance-tan-ai changed the title grpcproxy: set finish_write=true on final ByteStream WriteRequest grpcproxy: fix CAS writes and AC reads failing against strict gRPC backends Feb 24, 2026

mostynb commented Feb 24, 2026

Hi, which server are you seeing these errors with?

@lance-tan-ai

> Hi, which server are you seeing these errors with?

@mostynb We are seeing this with EngFlow servers.


lance-tan-ai commented Feb 27, 2026

For some context, this is our bazel-remote-compose.yaml:

services:
  bazel-remote:
    build:
      context: .
      dockerfile: Dockerfile
    restart: unless-stopped
    logging:
      driver: journald
    environment:
      BAZEL_REMOTE_DIR: /tmp/bazel-remote/
      BAZEL_REMOTE_MAX_SIZE: 5000
      BAZEL_REMOTE_PORT: 8080
      BAZEL_REMOTE_STORAGE_MODE: uncompressed
      BAZEL_REMOTE_ENABLE_ENDPOINT_METRICS: true
      BAZEL_REMOTE_MAX_BLOB_SIZE: 524288000
      BAZEL_REMOTE_MAX_PROXY_BLOB_SIZE: 524288000
      BAZEL_REMOTE_GRPC_PROXY_URL: grpcs://<name>.cluster.engflow.com
      BAZEL_REMOTE_GRPC_PROXY_CERT_FILE: /etc/bazel-remote/engflow.crt
      BAZEL_REMOTE_GRPC_PROXY_KEY_FILE: /etc/bazel-remote/engflow.key


mostynb commented Feb 27, 2026

I refreshed my memory a bit: bazel-remote's grpc proxy backend is a relatively recent addition, and should probably still be marked as experimental. It has the best chance of working with another bazel-remote instance as the grpc proxy backend.

The reason for this is historical: bazel-remote originally only supported bazel's older HTTP cache protocol, which does not provide full digest information to the server the way the newer grpc protocol does. Some non-trivial refactoring is required to improve the situation, and I suspect that your changes to GetActionResult would not be enough to make the current setup work well. I would prefer to take some time to fix this instead of sneaking extra data through with the context.

However the grpcproxy bytestream changes look OK. Would you be interested in reducing the scope of this PR to focus on this?

@lance-tan-ai

> I refreshed my memory a bit: bazel-remote's grpc proxy backend is a relatively recent addition, and should probably still be marked as experimental. It has the best chance of working with another bazel-remote instance as the grpc proxy backend.
>
> The reason for this is historical: bazel-remote originally only supported bazel's older HTTP cache protocol, which does not provide full digest information to the server the way the newer grpc protocol does. Some non-trivial refactoring is required to improve the situation, and I suspect that your changes to GetActionResult would not be enough to make the current setup work well. I would prefer to take some time to fix this instead of sneaking extra data through with the context.
>
> However the grpcproxy bytestream changes look OK. Would you be interested in reducing the scope of this PR to focus on this?

Will descope the PR. I will create an issue to outline all of the issues I was seeing with migrating to a gRPC backend since there were additional patches that I needed to apply for it to work out-of-the-box.
