First, I tried placing the target model and the draft model on different GPUs, and I got an error:
CUDA_VISIBLE_DEVICES=1,0 DFLASH_TARGET_GPU=0 DFLASH_DRAFT_GPU=1 DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 python tests/bench_niah_cpp.py --bin ../dflash/build/test_dflash --target ../../ModelsIA/Qwen/Qwen3.6-27B-Q4_K_M.gguf --draft-spec ../../ModelsIA/Qwen/draft/model.safetensors --drafter-gguf ../../ModelsIA/Qwen/drafter/Qwen3-0.6B-BF16.gguf --cases /tmp/niah_128k.jsonl --keep-ratio 0.05 --n-gen 256
[init] spawning daemon: ../dflash/build/test_dflash
[cfg] seq_verify=0 fast_rollback=1 ddtree=1 budget=16 temp=1.00 chain_seed=1 fa_window=0 draft_swa=0 draft_ctx_max=4096 draft_feature_mirror=0 peer_access=0 target_gpu=0 draft_gpu=1
[test_dflash] arch=qwen35 daemon -> dispatching to run_qwen35_daemon (max_ctx=16384 stream_fd=5)
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 36857 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
Device 1: NVIDIA GeForce RTX 4080 Laptop GPU, compute capability 8.9, VMM: yes, VRAM: 12281 MiB
[loader] eos_id=248046 eos_chat_id=-1
[target] target loaded: layers [0,64) output=1, 850 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
draft load: safetensors: 'layers.0.self_attn.k_norm.weight' shape[0]=128 expected 256
Traceback (most recent call last):
  File "/home/dimanodg/myproject/lucebox-hub/pflash/tests/bench_niah_cpp.py", line 184, in <module>
    main()
  File "/home/dimanodg/myproject/lucebox-hub/pflash/tests/bench_niah_cpp.py", line 123, in main
    dflash = DflashClient(
             ^^^^^^^^^^^^^
  File "/home/dimanodg/myproject/lucebox-hub/pflash/pflash/dflash_client.py", line 91, in __init__
    self._wait_until_loaded(timeout=boot_timeout_s, vram_mib=boot_vram_mib)
  File "/home/dimanodg/myproject/lucebox-hub/pflash/pflash/dflash_client.py", line 101, in _wait_until_loaded
    raise RuntimeError(
RuntimeError: dflash daemon exited before weights finished loading. Check the daemon's stderr.
Then I used only the RTX 3090 and got the same error:
CUDA_VISIBLE_DEVICES=1 DFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 python tests/bench_niah_cpp.py --bin ../dflash/build/test_dflash --target ../../ModelsIA/Qwen/Qwen3.6-27B-Q4_K_M.gguf --draft-spec ../../ModelsIA/Qwen/draft/model.safetensors --drafter-gguf ../../ModelsIA/Qwen/drafter/Qwen3-0.6B-BF16.gguf --cases /tmp/niah_128k.jsonl --keep-ratio 0.05 --n-gen 256
[init] spawning daemon: ../dflash/build/test_dflash
[cfg] seq_verify=0 fast_rollback=1 ddtree=1 budget=16 temp=1.00 chain_seed=1 fa_window=0 draft_swa=0 draft_ctx_max=4096 draft_feature_mirror=0 peer_access=0 target_gpu=0 draft_gpu=0
[test_dflash] arch=qwen35 daemon -> dispatching to run_qwen35_daemon (max_ctx=16384 stream_fd=5)
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24575 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24575 MiB
[loader] eos_id=248046 eos_chat_id=-1
[target] target loaded: layers [0,64) output=1, 850 tensors on GPU 14.99 GiB, tok_embd 682 MiB CPU-only (q4_K)
draft load: safetensors: 'layers.0.self_attn.k_norm.weight' shape[0]=128 expected 256
Traceback (most recent call last):
  File "/home/dimanodg/myproject/lucebox-hub/pflash/tests/bench_niah_cpp.py", line 184, in <module>
    main()
  File "/home/dimanodg/myproject/lucebox-hub/pflash/tests/bench_niah_cpp.py", line 123, in main
    dflash = DflashClient(
             ^^^^^^^^^^^^^
  File "/home/dimanodg/myproject/lucebox-hub/pflash/pflash/dflash_client.py", line 91, in __init__
    self._wait_until_loaded(timeout=boot_timeout_s, vram_mib=boot_vram_mib)
  File "/home/dimanodg/myproject/lucebox-hub/pflash/pflash/dflash_client.py", line 101, in _wait_until_loaded
    raise RuntimeError(
RuntimeError: dflash daemon exited before weights finished loading. Check the daemon's stderr.
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
Please tell me the possible reasons for this error and how to fix it.
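A possible first diagnostic step, independent of the daemon: both runs stop at the same line, `draft load: safetensors: 'layers.0.self_attn.k_norm.weight' shape[0]=128 expected 256`, so the GPU layout is probably not the cause. In Qwen3-style attention the `k_norm` weight typically has one entry per head dimension, which suggests (as an assumption, not something the log confirms) that the file passed to `--draft-spec` was built with a different head dimension than the loader expects for this target. The sketch below reads only the header of a `.safetensors` file (an 8-byte little-endian length followed by a JSON table) and lists the shapes of the norm tensors. The function names `read_safetensors_shapes` and `find_norm_shapes` are hypothetical helpers, not part of pflash or dflash.

```python
import json
import struct
import sys


def read_safetensors_shapes(path):
    """Return {tensor_name: shape} read from a .safetensors header.

    Only the header is parsed; no tensor data is loaded into memory.
    """
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit size of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: info["shape"]
        for name, info in header.items()
        if name != "__metadata__"
    }


def find_norm_shapes(
    shapes,
    suffixes=("self_attn.k_norm.weight", "self_attn.q_norm.weight"),
):
    """Keep only the q/k norm tensors, whose length usually equals head_dim."""
    return {name: shape for name, shape in shapes.items() if name.endswith(suffixes)}


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python check_draft.py ../../ModelsIA/Qwen/draft/model.safetensors
    for name, shape in sorted(find_norm_shapes(read_safetensors_shapes(sys.argv[1])).items()):
        print(name, shape)
```

If the printed shapes are `[128]` while the loader expects 256, the draft checkpoint and the target model are likely not a matching pair, and re-downloading or regenerating the draft for this exact target would be the thing to try before changing any CUDA settings.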