feat(embeddings): support passthrough remote model ids #156

Open
sanikolaev wants to merge 1 commit into master from ae/arbitrary-models

@sanikolaev
Collaborator

Related issue #155

1. Allow explicit provider-prefixed passthrough model ids for remote endpoints
- keep the existing slash-prefixed forms (openai/..., voyage/..., jina/...) working as before
- add explicit colon-prefixed forms (openai:..., voyage:..., jina:...)
- when the colon form is used, pass the model id through after stripping only the provider prefix
- this allows OpenAI-compatible custom endpoints to receive full upstream model ids unchanged, for example:
  - openai:openai/text-embedding-ada-002
  - openai:jinaai/jina-embeddings-v3
- preserve strict built-in validation for default provider endpoints while allowing passthrough mode for custom API_URL-based setups

2. Allow CMake to pass optional cargo features to the embeddings crate
- add EMBEDDINGS_CARGO_FEATURE_ARGS in cmake/build_embeddings.cmake
- if EMBEDDINGS_CARGO_FEATURES is set, convert it to a valid cargo CLI fragment:
  --features <value>
- this makes it possible to configure builds such as download-ort from the CMake side without hard-coding the flag in the build script
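
The prefix handling described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: `Provider`, `RemoteModel`, and `parse_remote_model` are hypothetical names, and the built-in validation for the slash form is assumed to happen elsewhere.

```rust
/// Remote providers named in the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    OpenAi,
    Voyage,
    Jina,
}

/// How the part after the provider prefix should be treated (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RemoteModel {
    /// `provider/model`: the existing form, still subject to built-in validation.
    BuiltIn { provider: Provider, model: String },
    /// `provider:model`: explicit passthrough, only the prefix is stripped.
    Passthrough { provider: Provider, model: String },
}

const PROVIDERS: [(&str, Provider); 3] = [
    ("openai", Provider::OpenAi),
    ("voyage", Provider::Voyage),
    ("jina", Provider::Jina),
];

/// Returns `None` when the name has no known provider prefix, for example a
/// local model such as `sentence-transformers/all-MiniLM-L6-v2`.
pub fn parse_remote_model(name: &str) -> Option<RemoteModel> {
    for &(prefix, provider) in PROVIDERS.iter() {
        let rest = match name.strip_prefix(prefix) {
            Some(rest) => rest,
            None => continue,
        };
        // The separator directly after the provider name selects the mode.
        if let Some(model) = rest.strip_prefix(':') {
            return non_empty(model).map(|model| RemoteModel::Passthrough { provider, model });
        }
        if let Some(model) = rest.strip_prefix('/') {
            return non_empty(model).map(|model| RemoteModel::BuiltIn { provider, model });
        }
        // Anything else (e.g. `jinaai/...`) is not this provider's prefix.
    }
    None
}

/// Rejects an empty model id such as the one in `openai:`.
fn non_empty(model: &str) -> Option<String> {
    if model.is_empty() {
        None
    } else {
        Some(model.to_string())
    }
}
```

With this sketch, `openai:jinaai/jina-embeddings-v3` yields a passthrough model id of `jinaai/jina-embeddings-v3`, while a bare `jinaai/jina-embeddings-v3` is not treated as a Jina model because `jina` is not followed by a separator.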

Additional remote-model adjustment:
- cache inferred embedding dimensionality in remote providers so passthrough/custom models can learn their vector dimension from a successful response instead of requiring a built-in static mapping
- apply that caching approach consistently across OpenAI, Voyage, and Jina
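
One way to picture the dimension caching is the sketch below. It is only an assumption about the shape of the idea: `DimensionCache` and its methods are hypothetical names, and rejecting a later response whose length differs from the cached value is a design choice made here for illustration, not something the description states.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical per-model cache; a stored 0 means "dimension not learned yet".
pub struct DimensionCache {
    dims: AtomicUsize,
}

impl DimensionCache {
    /// `known` is the static built-in dimension, if the model has one.
    pub fn new(known: Option<usize>) -> Self {
        DimensionCache {
            dims: AtomicUsize::new(known.unwrap_or(0)),
        }
    }

    /// Returns the cached dimension, or `None` before the first response.
    pub fn get(&self) -> Option<usize> {
        match self.dims.load(Ordering::Relaxed) {
            0 => None,
            n => Some(n),
        }
    }

    /// Records the dimension seen in a successful response.
    /// The first non-empty embedding sets the cache; later ones must match it.
    pub fn observe(&self, embedding: &[f32]) -> Result<usize, String> {
        let len = embedding.len();
        if len == 0 {
            return Err("empty embedding in response".to_string());
        }
        match self
            .dims
            .compare_exchange(0, len, Ordering::Relaxed, Ordering::Relaxed)
        {
            // The cache was empty and now holds `len`.
            Ok(_) => Ok(len),
            // The cache already held the same dimension.
            Err(cached) if cached == len => Ok(len),
            // Assumption: a mismatch is reported rather than overwriting the cache.
            Err(cached) => Err(format!("expected {} dimensions, got {}", cached, len)),
        }
    }
}
```

A passthrough model would start with `DimensionCache::new(None)` and learn its size from the first response, while a built-in model could seed the cache from its static mapping.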

@sanikolaev force-pushed the ae/arbitrary-models branch from 3d92e0e to ecba6b3 on April 17, 2026 04:52

@github-actions

github-actions Bot commented Apr 17, 2026

Windows test results

5 files, 5 suites, 21m 18s ⏱️
499 tests: 463 ✅, 20 💤, 16 ❌
507 runs: 471 ✅, 20 💤, 16 ❌

For more details on these failures, see this check.

Results for commit ecba6b3.

♻️ This comment has been updated with latest results.

@github-actions

clt

❌ CLT tests in test/clt-tests/mcl/
✅ OK: 22
❌ Failed: 1
⏳ Duration: 969s
👉 Check Action Results for commit ff0a3b0

Failed tests:

test/clt-tests/mcl/auto-embeddings-json-api.rec
––– input –––
rm -f /var/log/manticore/searchd.log; stdbuf -oL searchd $SEARCHD_FLAGS > /dev/null; if timeout 10 grep -qm1 '\[BUDDY\] started' <(tail -n 1000 -f /var/log/manticore/searchd.log); then echo 'Buddy started!'; else echo 'Timeout or failed!'; cat /var/log/manticore/searchd.log;fi
––– output –––
OK
––– input –––
apt-get install jq -y > /dev/null; echo $?
––– output –––
- debconf: delaying package configuration, since apt-utils is not installed
+ E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/j/jq/libjq1_1.7.1-3ubuntu0.24.04.1_amd64.deb  Connection failed [IP: 185.125.190.82 80]
- 0
+ E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
+ 100
––– input –––
mysql -h0 -P9306 -e "CREATE TABLE test_json_columnar (
    title TEXT,
    content TEXT,
    embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2'
    MODEL_NAME='sentence-transformers/all-MiniLM-L6-v2'
    FROM='title, content'
) engine='columnar'"; echo $?
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "SHOW CREATE TABLE test_json_columnar" | grep -o "model_name='sentence-transformers/all-MiniLM-L6-v2'"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/insert -d '{"index":"test_json_columnar","id":1,"doc":{"title":"machine learning","content":"neural networks"}}' | jq -r 'if ._id then ._id else "inserted" end'
––– output –––
- inserted
+ bash: line 19: jq: command not found
––– input –––
mysql -h0 -P9306 -e "SELECT id FROM test_json_columnar WHERE KNN(embedding, 1, 'machine learning neural networks')"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/bulk -H "Content-Type: application/x-ndjson" -d '
{"insert":{"index":"test_json_columnar","id":2,"doc":{"title":"computer vision","content":"image recognition"}}}
{"insert":{"index":"test_json_columnar","id":3,"doc":{"title":"NLP","content":"text processing"}}}
' | jq '{created: .items[0].bulk.created}'
––– output –––
- {
+ bash: line 26: jq: command not found
-   "created": 2
- }
––– input –––
mysql -h0 -P9306 -e "SELECT COUNT(*) FROM test_json_columnar WHERE id IN (2,3)"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/replace -d '{"index":"test_json_columnar","id":1,"doc":{"title":"updated ML","content":"updated networks"}}' | jq -r '.result'
––– output –––
- updated
+ bash: line 30: jq: command not found
––– input –––
mysql -h0 -P9306 -e "SELECT title FROM test_json_columnar WHERE id=1 AND KNN(embedding, 1, 'updated ML networks')"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "INSERT INTO test_json_columnar (id, title, content) VALUES (100, 'test', 'data')";
curl -s -X POST http://localhost:9308/insert -d '{"index":"test_json_columnar","id":101,"doc":{"title":"test","content":"data"}}' > /dev/null
––– output –––
OK
––– input –––
mysql -h0 -P9306 --batch --skip-column-names -e "SELECT embedding FROM test_json_columnar WHERE id=100" > /tmp/v1.txt
mysql -h0 -P9306 --batch --skip-column-names -e "SELECT embedding FROM test_json_columnar WHERE id=101" > /tmp/v2.txt
diff -q /tmp/v1.txt /tmp/v2.txt > /dev/null && echo "Vectors identical" || echo "Vectors differ"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "SELECT COUNT(*) FROM test_json_columnar"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "FLUSH RAMCHUNK test_json_columnar; OPTIMIZE TABLE test_json_columnar OPTION sync=1, cutoff=1"; echo $?
––– output –––
OK
––– input –––
VECTOR=$(python3 -c "print(','.join(['0.01']*384))")
curl -s -X POST http://localhost:9308/search -d "{\"index\":\"test_json_columnar\",\"knn\":{\"field\":\"embedding\",\"query_vector\":[$VECTOR],\"k\":2}}" | jq -r '.hits.total // "0"'
––– output –––
- 5
+ bash: line 46: jq: command not found
––– input –––
mysql -h0 -P9306 -e "CREATE TABLE no_auto_embed (title TEXT, vec FLOAT_VECTOR KNN_TYPE='hnsw' KNN_DIMS='384' HNSW_SIMILARITY='l2') engine='columnar'"
––– output –––
OK
––– input –––
VECTOR=$(python3 -c "print(','.join(['0.5']*384))")
curl -s -X POST http://localhost:9308/insert -d "{\"index\":\"no_auto_embed\",\"id\":1,\"doc\":{\"title\":\"test\",\"vec\":[$VECTOR]}}" | jq -r 'if ._id then ._id else "inserted" end'
––– output –––
- inserted
+ bash: line 51: jq: command not found
––– input –––
QUERY_VEC=$(python3 -c "print(','.join(['0.5']*384))")
curl -s -X POST http://localhost:9308/search -d "{\"index\":\"no_auto_embed\",\"knn\":{\"field\":\"vec\",\"query_vector\":[$QUERY_VEC],\"k\":1}}" | jq -r '.hits.total // "0"'
––– output –––
- 1
+ bash: line 54: jq: command not found
––– input –––
mysql -h0 -P9306 -e "CREATE TABLE test_json_rowwise (
    title TEXT,
    content TEXT,
    embedding FLOAT_VECTOR KNN_TYPE='hnsw' HNSW_SIMILARITY='l2'
    MODEL_NAME='sentence-transformers/all-MiniLM-L6-v2'
    FROM='title, content'
) engine='rowwise'"; echo $?
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "SHOW CREATE TABLE test_json_rowwise" | grep -o "model_name='sentence-transformers/all-MiniLM-L6-v2'"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/insert -d '{"index":"test_json_rowwise","id":1,"doc":{"title":"machine learning","content":"neural networks"}}' | jq -r 'if ._id then ._id else "inserted" end'
––– output –––
- inserted
+ bash: line 66: jq: command not found
––– input –––
mysql -h0 -P9306 -e "SELECT id FROM test_json_rowwise WHERE KNN(embedding, 1, 'machine learning neural networks')"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/bulk -H "Content-Type: application/x-ndjson" -d '
{"insert":{"index":"test_json_rowwise","id":2,"doc":{"title":"computer vision","content":"image recognition"}}}
{"insert":{"index":"test_json_rowwise","id":3,"doc":{"title":"NLP","content":"text processing"}}}
' | jq '{created: .items[0].bulk.created}'
––– output –––
- {
+ bash: line 73: jq: command not found
-   "created": 2
- }
––– input –––
mysql -h0 -P9306 -e "SELECT COUNT(*) FROM test_json_rowwise WHERE id IN (2,3)"
––– output –––
OK
––– input –––
curl -s -X POST http://localhost:9308/replace -d '{"index":"test_json_rowwise","id":1,"doc":{"title":"updated ML","content":"updated networks"}}' | jq -r '.result'
––– output –––
- updated
+ bash: line 77: jq: command not found
––– input –––
mysql -h0 -P9306 -e "SELECT title FROM test_json_rowwise WHERE id=1 AND KNN(embedding, 1, 'updated ML networks')"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "INSERT INTO test_json_rowwise (id, title, content) VALUES (100, 'test', 'data')";
curl -s -X POST http://localhost:9308/insert -d '{"index":"test_json_rowwise","id":101,"doc":{"title":"test","content":"data"}}' > /dev/null
––– output –––
OK
––– input –––
mysql -h0 -P9306 --batch --skip-column-names -e "SELECT embedding FROM test_json_rowwise WHERE id=100" > /tmp/v1.txt
mysql -h0 -P9306 --batch --skip-column-names -e "SELECT embedding FROM test_json_rowwise WHERE id=101" > /tmp/v2.txt
diff -q /tmp/v1.txt /tmp/v2.txt > /dev/null && echo "Vectors identical" || echo "Vectors differ"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "SELECT COUNT(*) FROM test_json_rowwise"
––– output –––
OK
––– input –––
mysql -h0 -P9306 -e "FLUSH RAMCHUNK test_json_rowwise; OPTIMIZE TABLE test_json_rowwise OPTION sync=1, cutoff=1"; echo $?
––– output –––
OK
––– input –––
VECTOR=$(python3 -c "print(','.join(['0.01']*384))")
curl -s -X POST http://localhost:9308/search -d "{\"index\":\"test_json_rowwise\",\"knn\":{\"field\":\"embedding\",\"query_vector\":[$VECTOR],\"k\":2}}" | jq -r '.hits.total // "0"'
––– output –––
- 5
+ bash: line 93: jq: command not found
