fix(spacemit): fix MiniCPM-V SMT multimodal inference #15
Open
oscar1229 wants to merge 1 commit into
Fixes three independent bugs that prevented MiniCPM-V from running via the SMT media backend with multi-threaded warmup and multi-turn image conversations.
* fix(spacemit): fix IME paired-lane GEMM threadpool deadlock
The IME GEMM kernels (forward_mul_mat and the mul_mat_id MoE path)
rendezvous thread pairs (2k, 2k+1) on a spine_barrier built for two
participants, so both lanes must call spine_barrier_wait() the same
number of times. The old per-thread loop could iterate a different
number of times per lane when gemm_n was not a multiple of
NB_COLS*nth, and the trailing even thread on odd nth had no partner,
so warmup hung with -t 8. Drive the loop from a pair-aligned base
with a per-lane offset (both lanes always iterate equally; an
out-of-range lane skips the GEMM but still hits the barrier) and
guard the barrier with has_pair so a partnerless thread never waits.
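
The loop change described above can be illustrated with a small single-threaded model that only counts GEMM calls and barrier waits per thread. This is a sketch under stated assumptions, not the kernel code from this PR: `LaneStats`, `old_loop`, and `paired_loop` are hypothetical names, and the stride of `nb_cols * nth` per iteration is inferred from the commit message rather than taken from the source.

```cpp
#include <cassert>

// Hypothetical counters: how often one thread runs the GEMM body and how
// often it would call spine_barrier_wait().
struct LaneStats {
    int gemm_calls;
    int barrier_waits;
};

// Old shape (assumed): each thread strides over its own column blocks and
// waits once per iteration, so two lanes of a pair can disagree on the count.
LaneStats old_loop(int ith, int nth, int gemm_n, int nb_cols) {
    assert(nth > 0 && nb_cols > 0 && ith >= 0 && ith < nth);
    LaneStats s{0, 0};
    for (int col = ith * nb_cols; col < gemm_n; col += nth * nb_cols) {
        s.gemm_calls++;
        s.barrier_waits++;
    }
    return s;
}

// New shape: the loop is driven by the pair-aligned base (the even lane's
// column), and each lane adds its own offset inside the body.
LaneStats paired_loop(int ith, int nth, int gemm_n, int nb_cols) {
    assert(nth > 0 && nb_cols > 0 && ith >= 0 && ith < nth);
    const int  lane      = ith & 1;          // 0 = even lane, 1 = odd lane
    const bool has_pair  = (ith ^ 1) < nth;  // false for the trailing thread on odd nth
    const int  pair_base = (ith - lane) * nb_cols;

    LaneStats s{0, 0};
    for (int base = pair_base; base < gemm_n; base += nth * nb_cols) {
        const int col = base + lane * nb_cols;
        if (col < gemm_n) {
            s.gemm_calls++;     // in-range lane does the work
        }
        if (has_pair) {
            s.barrier_waits++;  // out-of-range lane still reaches the barrier
        }
    }
    return s;
}
```

With `gemm_n = 40`, `nb_cols = 16`, and `nth = 8`, the old loop has thread 2 wait once while thread 3 never enters the loop; the paired loop has both wait exactly once, with thread 3 skipping the GEMM.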
* server: force full re-prefill for multimodal FULL-only KV cache reuse
MiniCPM-V runs on the qwen35 hybrid (SSM + periodic full-attention)
backend whose KV memory only supports full sequence removal. On a
multi-turn request, partial prompt-cache reuse would either restore a
context checkpoint (resurrecting a KV state inconsistent with the
external smt/ONNX vision embeddings) or call partial memory_seq_rm on
FULL-only memory, which returns false and triggers GGML_ABORT. When
the context is multimodal and the reused prefix is partial, force a
full re-prefill (pos_next = 0, n_past = 0) before the checkpoint /
seq_rm path. Pure-append turns and non-multimodal contexts are
unaffected.
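
The reuse decision can be summarised as a small pure function. This is a minimal sketch, not the server code: `ReusePlan`, `plan_reuse`, `n_cached`, and `n_common` are hypothetical names, and setting `pos_next` equal to `n_past` in the reuse case is a simplifying assumption (with image chunks the position and the token count may differ).

```cpp
#include <cassert>

// Hypothetical result of the decision: where decoding resumes.
struct ReusePlan {
    int n_past;    // number of cached tokens kept
    int pos_next;  // next position to decode (assumed equal to n_past here)
};

// n_cached: tokens currently held in the slot's cache.
// n_common: length of the prefix shared with the new prompt.
ReusePlan plan_reuse(bool is_multimodal, int n_cached, int n_common) {
    assert(n_common >= 0 && n_common <= n_cached);

    // A partial reuse would need to drop the cached tail, which FULL-only
    // memory cannot do; a pure append keeps the whole cache.
    const bool partial = n_common < n_cached;

    if (is_multimodal && partial) {
        return ReusePlan{0, 0};  // force a full re-prefill
    }
    return ReusePlan{n_common, n_common};
}
```

For example, `plan_reuse(true, 100, 60)` yields a full re-prefill, while `plan_reuse(true, 100, 100)` (pure append) and `plan_reuse(false, 100, 60)` (text-only) keep the shared prefix.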
* feat(mtmd): add MiniCPM-V SMT vision preprocessing
The MiniCPM-V SMT vision ONNX export does not normalize pixels
internally. Detect minicpmv / minicpm_v / minicpm-v architectures and
route them through rgb_u8_to_chw_f32_with_config, which reads
rescale_factor / image_mean / image_std from config.json's
vision_preprocess block and emits a CHW float32 tensor. Target
defaults to 448x448, overridable via vision_model.input_width/height.
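
As a rough illustration of the normalization step, the sketch below converts an interleaved RGB `uint8` image into a planar CHW `float32` buffer. It is not the `rgb_u8_to_chw_f32_with_config` implementation from this PR: the `VisionPreprocess` struct and `hwc_u8_to_chw_f32` function are hypothetical, the formula `(pixel * rescale_factor - mean) / std` is the common image-processor convention and is assumed here, and config parsing and resizing to the target size are omitted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the vision_preprocess block in config.json.
struct VisionPreprocess {
    float                rescale_factor;  // e.g. 1/255 to map u8 into [0, 1]
    std::array<float, 3> image_mean;      // per-channel mean (R, G, B)
    std::array<float, 3> image_std;       // per-channel std  (R, G, B)
};

// Input:  interleaved HWC bytes (R, G, B, R, G, B, ...), already at target size.
// Output: planar CHW floats (all R, then all G, then all B).
std::vector<float> hwc_u8_to_chw_f32(const std::vector<std::uint8_t> & rgb,
                                     int width, int height,
                                     const VisionPreprocess & cfg) {
    assert(width > 0 && height > 0);
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    assert(rgb.size() == plane * 3);

    std::vector<float> out(plane * 3);
    for (std::size_t i = 0; i < plane; ++i) {
        for (std::size_t c = 0; c < 3; ++c) {
            const float scaled = rgb[i * 3 + c] * cfg.rescale_factor;
            out[c * plane + i] = (scaled - cfg.image_mean[c]) / cfg.image_std[c];
        }
    }
    return out;
}
```

With the illustrative values `rescale_factor = 1/255` and mean and std of `0.5` per channel, a byte of 0 maps to -1 and a byte of 255 maps to approximately 1.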