fix(spacemit): fix MiniCPM-V SMT multimodal inference #15

Open
oscar1229 wants to merge 1 commit into spacemit-com:spacemit-mtmd from oscar1229:spacemit-mtmd

@oscar1229

Fixes three independent bugs that prevented MiniCPM-V from running via the SMT media backend with multi-threaded warmup and multi-turn image conversations.

  • fix(spacemit): fix IME paired-lane GEMM threadpool deadlock

    The IME GEMM kernels (forward_mul_mat and the mul_mat_id MoE path)
    rendezvous thread pairs (2k, 2k+1) on a spine_barrier built for two
    participants, so both lanes must call spine_barrier_wait() the same
    number of times. The old per-thread loop could iterate a different
    number of times per lane when gemm_n was not a multiple of
    NB_COLS*nth, and the trailing even thread on odd nth had no partner,
    so warmup hung with -t 8. Drive the loop from a pair-aligned base
    with a per-lane offset (both lanes always iterate equally; an
    out-of-range lane skips the GEMM but still hits the barrier) and
    guard the barrier with has_pair so a partnerless thread never waits.

  • server: force full re-prefill for multimodal FULL-only KV cache reuse

    MiniCPM-V runs on the qwen35 hybrid (SSM + periodic full-attention)
    backend whose KV memory only supports full sequence removal. On a
    multi-turn request, partial prompt-cache reuse would either restore a
    context checkpoint (resurrecting a KV state inconsistent with the
    external smt/ONNX vision embeddings) or call partial memory_seq_rm on
    FULL-only memory, which returns false and triggers GGML_ABORT. When
    the context is multimodal and the reused prefix is partial, force a
    full re-prefill (pos_next = 0, n_past = 0) before the checkpoint /
    seq_rm path. Pure-append turns and non-multimodal contexts are
    unaffected.

  • feat(mtmd): add MiniCPM-V SMT vision preprocessing

    The MiniCPM-V SMT vision ONNX export does not normalize pixels
    internally. Detect minicpmv / minicpm_v / minicpm-v architectures and
    route them through rgb_u8_to_chw_f32_with_config, which reads
    rescale_factor / image_mean / image_std from config.json's
    vision_preprocess block and emits a CHW float32 tensor. The target
    size defaults to 448x448 and can be overridden via
    vision_model.input_width and vision_model.input_height.
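
For the paired-lane GEMM fix, the difference between the old and new loop shapes can be illustrated by counting iterations instead of running kernels. This is a minimal sketch, not the code in the patch: `LaneWork`, `per_thread_iterations`, and `paired_lane_schedule` are hypothetical names, and the exact striding (a pair-aligned base advancing by `nth * nb_cols`, with the odd lane offset by `nb_cols`) is an assumption derived from the description above.

```cpp
#include <cassert>

// Work one thread ends up doing for a GEMM with gemm_n output columns.
struct LaneWork {
    int gemm_tiles;     // nb_cols-wide column tiles this thread computes
    int barrier_waits;  // times this thread would call the pair barrier
};

// Old shape (as described): every thread strides on its own, so the two
// lanes of a pair can run a different number of iterations.
int per_thread_iterations(int ith, int nth, int gemm_n, int nb_cols) {
    assert(nth > 0 && ith >= 0 && ith < nth && nb_cols > 0);
    int iters = 0;
    for (int col = ith * nb_cols; col < gemm_n; col += nth * nb_cols) {
        ++iters;
    }
    return iters;
}

// New shape (assumed): both lanes of pair (2k, 2k+1) loop over the same
// pair-aligned base, so they always iterate the same number of times.
LaneWork paired_lane_schedule(int ith, int nth, int gemm_n, int nb_cols) {
    assert(nth > 0 && ith >= 0 && ith < nth && nb_cols > 0);
    const int  lane      = ith & 1;                 // 0 = even lane, 1 = odd lane
    const int  pair_base = (ith - lane) * nb_cols;  // identical for both lanes
    const bool has_pair  = (ith ^ 1) < nth;         // false for a trailing even thread

    LaneWork work{0, 0};
    for (int base = pair_base; base < gemm_n; base += nth * nb_cols) {
        const int col = base + lane * nb_cols;      // per-lane offset
        if (col < gemm_n) {
            ++work.gemm_tiles;                      // the real kernel would run here
        }
        if (has_pair) {
            ++work.barrier_waits;                   // stands in for spine_barrier_wait()
        }
    }
    return work;
}
```

With `nth = 8`, `nb_cols = 16`, and `gemm_n = 144`, the old shape gives thread 0 two iterations and thread 1 only one, which is the mismatch that would leave one lane waiting forever. In the paired shape both threads reach the barrier twice, and thread 1 simply skips the out-of-range tile.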
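
For the server fix, the re-prefill decision can be reduced to a small pure function. The sketch below is illustrative only: `PrefillPlan` and `plan_prefix_reuse` are hypothetical names, and treating "partial reuse" as "the common prefix is shorter than what is already cached" is an assumption about how the patch defines it.

```cpp
#include <cassert>

// Where prompt processing should resume for the next turn.
struct PrefillPlan {
    int  n_past;                 // tokens kept from the cache
    int  pos_next;               // next position to decode at
    bool forced_full_reprefill;  // true when the cache was discarded on purpose
};

// n_cached: tokens currently held for the sequence.
// n_common: length of the prefix shared by the cache and the new prompt.
PrefillPlan plan_prefix_reuse(bool is_multimodal, int n_cached, int n_common) {
    assert(n_cached >= 0 && n_common >= 0 && n_common <= n_cached);

    // Partial reuse would need to drop the cached tail, which FULL-only
    // memory cannot do; a pure-append turn keeps the whole cache.
    const bool partial_reuse = n_common < n_cached;

    if (is_multimodal && partial_reuse) {
        return {0, 0, true};     // restart from scratch before any seq_rm
    }
    return {n_common, n_common, false};
}
```

A multimodal turn that shares only 60 of 100 cached tokens restarts at position 0, while a pure-append turn (100 of 100) or a text-only context keeps the usual reuse path.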
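
For the vision preprocessing change, the normalization and layout step can be sketched as follows. This is not the body of rgb_u8_to_chw_f32_with_config: `VisionPreprocess` and `rgb_u8_to_chw_f32_sketch` are hypothetical names, resizing to the target size is left out, and the formula `(pixel * rescale_factor - mean) / std` is an assumption based on the usual meaning of these config fields.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Values that would come from the vision_preprocess block of config.json.
struct VisionPreprocess {
    float rescale_factor;             // e.g. 1/255 to map u8 into [0, 1]
    std::array<float, 3> image_mean;  // per-channel mean (R, G, B)
    std::array<float, 3> image_std;   // per-channel std  (R, G, B)
};

// Convert interleaved RGB u8 (HWC) into planar float32 (CHW), normalized.
std::vector<float> rgb_u8_to_chw_f32_sketch(const std::vector<std::uint8_t> & rgb,
                                            int width, int height,
                                            const VisionPreprocess & cfg) {
    assert(width > 0 && height > 0);
    const std::size_t plane = static_cast<std::size_t>(width) * height;
    assert(rgb.size() == plane * 3);

    std::vector<float> chw(plane * 3);
    for (std::size_t i = 0; i < plane; ++i) {      // pixel index, row-major
        for (std::size_t c = 0; c < 3; ++c) {      // channel index
            const float scaled = rgb[i * 3 + c] * cfg.rescale_factor;
            chw[c * plane + i] = (scaled - cfg.image_mean[c]) / cfg.image_std[c];
        }
    }
    return chw;
}
```

With a rescale factor of 1/255 and mean and std both 0.5, a pixel value of 255 maps to roughly 1 and a value of 0 maps to -1, and all red values are stored before any green or blue ones.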
