diff --git a/docs/gguf-conversion.md b/docs/gguf-conversion.md new file mode 100644 index 000000000..e0a90c9be --- /dev/null +++ b/docs/gguf-conversion.md @@ -0,0 +1,99 @@ +# GGUF Conversion Guide + +Complete workflow for generating quantized GGUF files from DeepSeek V4 Flash safetensors weights. + +## Quick Start (one command) + +```bash +# From the ds4 repo root: +bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc) +``` + +This generates `template.gguf`, runs the quantizer, and creates the final GGUF. See the next section for what happens under the hood. + +## Manual Steps + +### 1. Build + +```bash +# Linux: +make cpu -j$(nproc) +make -C gguf-tools -j$(nproc) +# macOS: +make -j$(sysctl -n hw.ncpu) +make -C gguf-tools -j$(sysctl -n hw.ncpu) +``` + +### 2. Generate GGUF Template + +```bash +python3 gguf-tools/gen_gguf_template.py \ + --hf /path/to/DeepSeek-V4-Flash-4Expert \ + --out /tmp/template.gguf +``` + +The template (~5.6 MB) contains complete metadata, tokenizer data, and tensor descriptors. It does **not** contain weight data — only tensor names, shapes, and types. + +### 3. Quantize and Convert + +```bash +./gguf-tools/deepseek4-quantize \ + --hf /path/to/DeepSeek-V4-Flash-4Expert \ + --template /tmp/template.gguf \ + --out /path/to/output/model-q4k.gguf \ + --experts q4_k \ + --attention-proj q8_0 \ + --attention f16 \ + --shared q8_0 \ + --output q8_0 \ + --embedding f16 \ + --dense f16 \ + --n-experts 256 \ + --threads 12 +``` + +> **Note**: If the output file already exists, you must add `--overwrite` or the tool will error. 
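A quick way to sanity-check either `/tmp/template.gguf` or the quantized output is to read the fixed-size GGUF header. The sketch below is illustrative and not part of this PR: `read_gguf_header` is a hypothetical helper name, and it assumes the little-endian GGUF v3 layout (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count), which is the field order `gen_gguf_template.py` writes.

```python
import struct


def read_gguf_header(path):
    """Read the 24-byte GGUF header and return its three counters.

    Hypothetical helper for illustration; assumes a little-endian file.
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a GGUF header")
    # uint32 version, uint64 tensor count, uint64 metadata KV count
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```

Calling `read_gguf_header("/tmp/template.gguf")` should report version 3 and a tensor count equal to the "Total tensor descriptors" line printed by the template generator; a mismatch suggests a stale or truncated template.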
+ +#### Quantization Options Reference + +| Flag | Typical value | Description | +|------|--------------|-------------| +| `--experts` | `q4_k` | Routed experts MoE FFN (w1/w2/w3) | +| `--attention-proj` | `q8_0` | Attention projection matrices (q/kv/output_a/output_b) | +| `--attention` | `f16` | Other 2D attention/compressor/indexer tensors | +| `--shared` | `q8_0` | Shared expert FFN | +| `--output` | `q8_0` | Output projection (output.*) | +| `--embedding` | `f16` | Token embedding layer | +| `--dense` | `f16` | Remaining 2D+ tensors not matched above | +| `--n-experts` | from template | Number of routed experts (read from template metadata if omitted) | +| `--threads` | `8` | Parallel worker count | + +### 4. Test + +```bash +ln -sfn /path/to/output/model-q4k.gguf ds4flash.gguf +./ds4 -p "Hello" -n 100 +``` + +## How It Works + +### Template Generation (gen_gguf_template.py) + +1. Reads `model.safetensors.index.json` for tensor names and shapes +2. Maps HF tensor names to GGUF names using the same `layer_map` as `deepseek4-quantize.c` +3. Sets regular tensor types to F32 (routed expert tensors to F16). 1D tensors (norms, scales, biases) remain F32. 2D+ tensors get their type overridden by the quantizer policy. +4. Writes a GGUF file containing metadata + tokenizer + tensor descriptors + +### Quantizer (deepseek4-quantize) + +1. Loads the template to obtain all tensor descriptors +2. For each tensor: determines the final type using the user-specified quantization policy +3. Reads safetensors weights, performs quantization, writes to the output GGUF +4. 
Produces a ready-to-use GGUF file + +## Notes + +- **Regenerate the template** whenever the model's tensor set changes (Manual Steps, step 2) +- **Type conversion**: `gen_gguf_template.py` automatically handles I64 → I32 conversion (for the `tid2eid` routing table) +- **1D tensors** (norms, scales, biases) are always stored as F32 and never quantized +- **Large model**: Q4_K output is approximately 153 GiB; ensure sufficient disk space diff --git a/docs/test-pr-on-linux.md b/docs/test-pr-on-linux.md new file mode 100644 index 000000000..9a3a2ef51 --- /dev/null +++ b/docs/test-pr-on-linux.md @@ -0,0 +1,90 @@ +# Testing the 4Expert PR on a Fresh Linux Machine + +## Quick Start (single command) + +```bash +git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert +bash test-4expert.sh +``` + +This runs all 5 steps: +1. Build ds4 + gguf-tools +2. Download 4Expert safetensors weights (~130 GiB) +3. Generate GGUF template from metadata +4. Quantize to Q4_K GGUF (~153 GiB output, ~30 min with 20 threads) +5. Link GGUF and run `./ds4 -p "..." -n 100` + +After step 4 completes, the GGUF file is reusable; skip re-conversion on subsequent runs. + +## Custom Paths + +```bash +bash test-4expert.sh /path/to/existing/weights 16 +``` + +- 1st arg: path to safetensors directory (skips download if `model.safetensors.index.json` exists) +- 2nd arg: number of threads (default all cores) + +## Pre-Quantized (skip conversion, download GGUF directly) + +```bash +git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert +make cpu -j$(nproc) +make -C gguf-tools -j$(nproc) + +pip install -q huggingface_hub +python3 -c " +from huggingface_hub import hf_hub_download +hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.') +" + +ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf +./ds4 -p "Hello" -n 100 +``` + +## Manual Steps (if the one-click script doesn't work) + +### 1.
Build + +```bash +make -C gguf-tools -j$(nproc) +make cpu -j$(nproc) +``` + +### 2. Download weights + +```bash +pip install huggingface_hub +python3 -c " +from huggingface_hub import snapshot_download +snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert') +" +``` + +### 3. Generate template + quantize + +```bash +python3 gguf-tools/gen_gguf_template.py \ + --hf ./DeepSeek-V4-Flash-4Expert \ + --out template.gguf + +./gguf-tools/deepseek4-quantize \ + --hf ./DeepSeek-V4-Flash-4Expert \ + --template template.gguf \ + --out ds4flash-4expert.gguf \ + --experts q4_k \ + --attention-proj q8_0 \ + --attention f16 \ + --shared q8_0 \ + --output q8_0 \ + --embedding f16 \ + --dense f16 \ + --threads $(nproc) +``` + +### 4. Test + +```bash +ln -sfn ds4flash-4expert.gguf ds4flash.gguf +./ds4 -p "The weather is great today" -n 100 +``` diff --git a/ds4.c b/ds4.c index 640511eb0..82a00265e 100644 --- a/ds4.c +++ b/ds4.c @@ -189,7 +189,7 @@ static const ds4_shape DS4_SHAPE_FLASH = { .n_lora_q = 1024, .n_lora_o = 1024, .n_expert = 256, - .n_expert_used = 6, + .n_expert_used = 4, .n_expert_shared = 1, .n_ff_exp = 2048, .n_hash_layer = 3, @@ -263,7 +263,7 @@ static ds4_shape g_ds4_shape = { .n_lora_q = 1024, .n_lora_o = 1024, .n_expert = 256, - .n_expert_used = 6, + .n_expert_used = 4, .n_expert_shared = 1, .n_ff_exp = 2048, .n_hash_layer = 3, @@ -3756,15 +3756,17 @@ static void ds4_select_shape_from_metadata( uint32_t n_indexer_top_k, uint32_t n_hc, uint32_t n_hc_sinkhorn_iter) { - if (ds4_shape_matches_metadata(&DS4_SHAPE_FLASH, - n_layer, n_embd, n_vocab, n_head, n_head_kv, - n_head_dim, n_value_dim, n_rot, n_lora_q, - n_lora_o, n_out_group, n_expert, - n_expert_used, n_ff_exp, n_expert_shared, - n_hash_layer, n_swa, n_indexer_head, - n_indexer_head_dim, n_indexer_top_k, n_hc, - n_hc_sinkhorn_iter)) { + if ((n_expert_used == 4 || n_expert_used == 6) && + ds4_shape_matches_metadata(&DS4_SHAPE_FLASH, + n_layer, n_embd, n_vocab, n_head, 
n_head_kv, + n_head_dim, n_value_dim, n_rot, n_lora_q, + n_lora_o, n_out_group, n_expert, + 4, n_ff_exp, n_expert_shared, + n_hash_layer, n_swa, n_indexer_head, + n_indexer_head_dim, n_indexer_top_k, n_hc, + n_hc_sinkhorn_iter)) { g_ds4_shape = DS4_SHAPE_FLASH; + if (n_expert_used == 6) g_ds4_shape.n_expert_used = 6; return; } if (ds4_shape_matches_metadata(&DS4_SHAPE_PRO, diff --git a/ds4_cuda.cu b/ds4_cuda.cu index 188b341ad..65bb6b852 100644 --- a/ds4_cuda.cu +++ b/ds4_cuda.cu @@ -5972,28 +5972,29 @@ __global__ static void router_select_kernel( uint32_t hash_rows, uint32_t n_tokens, int has_bias, - int hash_mode) { + int hash_mode, + uint32_t n_expert_used) { uint32_t t = blockIdx.x; if (t >= n_tokens || threadIdx.x != 0) return; const float *log = logits + (uint64_t)t * 256; float *prob = probs + (uint64_t)t * 256; - int32_t *sel = selected + (uint64_t)t * 6; - float *w = weights + (uint64_t)t * 6; + int32_t *sel = selected + (uint64_t)t * n_expert_used; + float *w = weights + (uint64_t)t * n_expert_used; for (int i = 0; i < 256; i++) prob[i] = sqrtf(softplus_dev(log[i])); if (hash_mode) { int32_t tok = tokens ? tokens[t] : token_scalar; if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0; - const int32_t *row = hash + (uint64_t)tok * 6; - for (int i = 0; i < 6; i++) sel[i] = row[i]; + const int32_t *row = hash + (uint64_t)tok * n_expert_used; + for (int i = 0; i < (int)n_expert_used; i++) sel[i] = row[i]; } else { - for (int i = 0; i < 6; i++) sel[i] = -1; + for (int i = 0; i < (int)n_expert_used; i++) sel[i] = -1; for (int i = 0; i < 256; i++) { float score = prob[i] + (has_bias ? bias[i] : 0.0f); - for (int j = 0; j < 6; j++) { + for (int j = 0; j < (int)n_expert_used; j++) { if (sel[j] < 0 || score > prob[sel[j]] + (has_bias ? 
bias[sel[j]] : 0.0f)) { - for (int k = 5; k > j; k--) sel[k] = sel[k - 1]; + for (int k = (int)n_expert_used - 1; k > j; k--) sel[k] = sel[k - 1]; sel[j] = i; break; } @@ -6002,14 +6003,14 @@ __global__ static void router_select_kernel( } float sum = 0.0f; - for (int i = 0; i < 6; i++) { + for (int i = 0; i < (int)n_expert_used; i++) { int e = sel[i]; float v = (e >= 0 && e < 256) ? prob[e] : 0.0f; w[i] = v; sum += v; } sum = fmaxf(sum, 6.103515625e-5f); - for (int i = 0; i < 6; i++) w[i] = w[i] / sum * 1.5f; + for (int i = 0; i < (int)n_expert_used; i++) w[i] = w[i] / sum * 1.5f; } __global__ static void router_select_parallel_kernel( @@ -6024,14 +6025,15 @@ __global__ static void router_select_parallel_kernel( uint32_t hash_rows, uint32_t n_tokens, int has_bias, - int hash_mode) { + int hash_mode, + uint32_t n_expert_used) { uint32_t t = blockIdx.x; uint32_t i = threadIdx.x; if (t >= n_tokens || i >= 256u) return; const float *log = logits + (uint64_t)t * 256; float *prob = probs + (uint64_t)t * 256; - int32_t *sel = selected + (uint64_t)t * 6; - float *w = weights + (uint64_t)t * 6; + int32_t *sel = selected + (uint64_t)t * n_expert_used; + float *w = weights + (uint64_t)t * n_expert_used; __shared__ float sprob[256]; const float p = sqrtf(softplus_dev(log[i])); @@ -6043,15 +6045,15 @@ __global__ static void router_select_parallel_kernel( if (hash_mode) { int32_t tok = tokens ? tokens[t] : token_scalar; if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0; - const int32_t *row = hash + (uint64_t)tok * 6; - for (int j = 0; j < 6; j++) sel[j] = row[j]; + const int32_t *row = hash + (uint64_t)tok * n_expert_used; + for (int j = 0; j < (int)n_expert_used; j++) sel[j] = row[j]; } else { - for (int j = 0; j < 6; j++) sel[j] = -1; + for (int j = 0; j < (int)n_expert_used; j++) sel[j] = -1; for (int e = 0; e < 256; e++) { float score = sprob[e] + (has_bias ? 
bias[e] : 0.0f); - for (int j = 0; j < 6; j++) { + for (int j = 0; j < (int)n_expert_used; j++) { if (sel[j] < 0 || score > sprob[sel[j]] + (has_bias ? bias[sel[j]] : 0.0f)) { - for (int k = 5; k > j; k--) sel[k] = sel[k - 1]; + for (int k = (int)n_expert_used - 1; k > j; k--) sel[k] = sel[k - 1]; sel[j] = e; break; } @@ -6060,14 +6062,14 @@ __global__ static void router_select_parallel_kernel( } float sum = 0.0f; - for (int j = 0; j < 6; j++) { + for (int j = 0; j < (int)n_expert_used; j++) { int e = sel[j]; float v = (e >= 0 && e < 256) ? sprob[e] : 0.0f; w[j] = v; sum += v; } sum = fmaxf(sum, 6.103515625e-5f); - for (int j = 0; j < 6; j++) w[j] = w[j] / sum * 1.5f; + for (int j = 0; j < (int)n_expert_used; j++) w[j] = w[j] / sum * 1.5f; } __device__ __forceinline__ static bool router_score_better(float av, uint32_t ai, float bv, uint32_t bi) { @@ -6086,7 +6088,8 @@ __global__ static void router_select_warp_topk_kernel( uint32_t hash_rows, uint32_t n_tokens, int has_bias, - int hash_mode) { + int hash_mode, + uint32_t n_expert_used) { const uint32_t lane = threadIdx.x; const uint32_t row_in_block = threadIdx.y; const uint32_t t = blockIdx.x * blockDim.y + row_in_block; @@ -6094,8 +6097,8 @@ __global__ static void router_select_warp_topk_kernel( const float *log = logits + (uint64_t)t * 256u; float *prob = probs + (uint64_t)t * 256u; - int32_t *sel = selected + (uint64_t)t * 6u; - float *w = weights + (uint64_t)t * 6u; + int32_t *sel = selected + (uint64_t)t * n_expert_used; + float *w = weights + (uint64_t)t * n_expert_used; __shared__ float sprob[4][256]; float local_prob[8]; float local_score[8]; @@ -6115,10 +6118,10 @@ __global__ static void router_select_warp_topk_kernel( if (lane == 0) { int32_t tok = tokens ? 
tokens[t] : token_scalar; if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0; - const int32_t *row = hash + (uint64_t)tok * 6u; + const int32_t *row = hash + (uint64_t)tok * n_expert_used; float sum = 0.0f; #pragma unroll - for (uint32_t j = 0; j < 6u; j++) { + for (uint32_t j = 0; j < n_expert_used; j++) { const int32_t e = row[j]; sel[j] = e; const float v = (e >= 0 && e < 256) ? sprob[row_in_block][(uint32_t)e] : 0.0f; @@ -6127,7 +6130,7 @@ __global__ static void router_select_warp_topk_kernel( } sum = fmaxf(sum, 6.103515625e-5f); #pragma unroll - for (uint32_t j = 0; j < 6u; j++) w[j] = w[j] / sum * 1.5f; + for (uint32_t j = 0; j < n_expert_used; j++) w[j] = w[j] / sum * 1.5f; } return; } @@ -6135,7 +6138,7 @@ __global__ static void router_select_warp_topk_kernel( float out_prob[6] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f}; uint32_t out_idx[6] = {0, 0, 0, 0, 0, 0}; #pragma unroll - for (uint32_t k = 0; k < 6u; k++) { + for (uint32_t k = 0; k < n_expert_used; k++) { float best_score = -INFINITY; float best_prob = 0.0f; uint32_t best_idx = UINT32_MAX; @@ -6174,14 +6177,14 @@ __global__ static void router_select_warp_topk_kernel( if (lane == 0) { float sum = 0.0f; #pragma unroll - for (uint32_t j = 0; j < 6u; j++) { + for (uint32_t j = 0; j < n_expert_used; j++) { sel[j] = (int32_t)out_idx[j]; w[j] = out_prob[j]; sum += out_prob[j]; } sum = fmaxf(sum, 6.103515625e-5f); #pragma unroll - for (uint32_t j = 0; j < 6u; j++) w[j] = w[j] / sum * 1.5f; + for (uint32_t j = 0; j < n_expert_used; j++) w[j] = w[j] / sum * 1.5f; } } @@ -9530,7 +9533,7 @@ extern "C" int ds4_gpu_directional_steering_project_tensor( } extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_tensor *weights, ds4_gpu_tensor *probs, const void *model_map, uint64_t model_size, uint64_t bias_offset, uint64_t hash_offset, uint32_t hash_rows, uint32_t token, uint32_t n_expert, uint32_t n_expert_used, float expert_weight_scale, uint32_t n_expert_groups, uint32_t n_group_used, bool 
has_bias, bool hash_mode, const ds4_gpu_tensor *logits) { if (!selected || !weights || !probs || !logits || !model_map || n_expert_groups > 1u || n_group_used > 0u) return 0; - if (n_expert != 256u || n_expert_used != 6u || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0; + if (n_expert != 256u || (n_expert_used != 6u && n_expert_used != 4u) || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0; int32_t tok = (int32_t)token; int ok = 1; const float *bias = NULL; @@ -9541,7 +9544,7 @@ extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_te if (!bias) ok = 0; } if (ok && hash_mode) { - const uint64_t hash_bytes = (uint64_t)hash_rows * 6u * sizeof(int32_t); + const uint64_t hash_bytes = (uint64_t)hash_rows * n_expert_used * sizeof(int32_t); if (hash_offset > model_size || hash_bytes > model_size - hash_offset) ok = 0; else hash = (const int32_t *)cuda_model_range_ptr(model_map, hash_offset, hash_bytes, "router_hash"); if (!hash) ok = 0; @@ -9551,29 +9554,29 @@ extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_te getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) { dim3 block(32, 4, 1); router_select_warp_topk_kernel<<<1, block>>>((int32_t *)selected->ptr, (float *)weights->ptr, (float *)probs->ptr, - bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, - has_bias && !hash_mode, hash_mode); + bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, + has_bias && !hash_mode, hash_mode, n_expert_used); } else if (getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) { router_select_parallel_kernel<<<1, 256>>>((int32_t *)selected->ptr, (float *)weights->ptr, (float *)probs->ptr, - bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, - has_bias && !hash_mode, hash_mode); + bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, + has_bias && !hash_mode, hash_mode, n_expert_used); } else { router_select_kernel<<<1, 1>>>((int32_t *)selected->ptr, (float *)weights->ptr, 
(float *)probs->ptr, - bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, - has_bias && !hash_mode, hash_mode); + bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1, + has_bias && !hash_mode, hash_mode, n_expert_used); } ok = cuda_ok(cudaGetLastError(), "router_select launch"); } return ok; } extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_gpu_tensor *weights, ds4_gpu_tensor *probs, const void *model_map, uint64_t model_size, uint64_t bias_offset, uint64_t hash_offset, uint32_t hash_rows, uint32_t n_expert_groups, uint32_t n_group_used, bool has_bias, bool hash_mode, const ds4_gpu_tensor *logits, const ds4_gpu_tensor *tokens, uint32_t n_expert, uint32_t n_expert_used, float expert_weight_scale, uint32_t n_tokens) { - if (n_expert != 256u || n_expert_used != 6u || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0; + if (n_expert != 256u || (n_expert_used != 6u && n_expert_used != 4u) || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0; if (!selected || !weights || !probs || !logits || !tokens || !model_map || n_tokens == 0 || n_expert_groups > 1u || n_group_used > 0u || logits->bytes < (uint64_t)n_tokens * 256u * sizeof(float) || probs->bytes < (uint64_t)n_tokens * 256u * sizeof(float) || - selected->bytes < (uint64_t)n_tokens * 6u * sizeof(int32_t) || - weights->bytes < (uint64_t)n_tokens * 6u * sizeof(float)) { + selected->bytes < (uint64_t)n_tokens * (uint64_t)n_expert_used * sizeof(int32_t) || + weights->bytes < (uint64_t)n_tokens * (uint64_t)n_expert_used * sizeof(float)) { return 0; } const float *bias = NULL; @@ -9584,7 +9587,7 @@ extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_ if (!bias) return 0; } if (hash_mode) { - const uint64_t hash_bytes = (uint64_t)hash_rows * 6u * sizeof(int32_t); + const uint64_t hash_bytes = (uint64_t)hash_rows * n_expert_used * sizeof(int32_t); if (hash_offset > model_size || hash_bytes > model_size - hash_offset) return 0; hash = 
(const int32_t *)cuda_model_range_ptr(model_map, hash_offset, hash_bytes, "router_hash"); if (!hash) return 0; @@ -9603,33 +9606,36 @@ extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_ hash_rows, n_tokens, has_bias && !hash_mode, - hash_mode); + hash_mode, + n_expert_used); } else if (getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) { router_select_parallel_kernel<<>>((int32_t *)selected->ptr, - (float *)weights->ptr, - (float *)probs->ptr, - bias, - hash, - (const float *)logits->ptr, - (const int32_t *)tokens->ptr, - 0, - hash_rows, - n_tokens, - has_bias && !hash_mode, - hash_mode); + (float *)weights->ptr, + (float *)probs->ptr, + bias, + hash, + (const float *)logits->ptr, + (const int32_t *)tokens->ptr, + 0, + hash_rows, + n_tokens, + has_bias && !hash_mode, + hash_mode, + n_expert_used); } else { router_select_kernel<<>>((int32_t *)selected->ptr, - (float *)weights->ptr, - (float *)probs->ptr, - bias, - hash, - (const float *)logits->ptr, - (const int32_t *)tokens->ptr, - 0, - hash_rows, - n_tokens, - has_bias && !hash_mode, - hash_mode); + (float *)weights->ptr, + (float *)probs->ptr, + bias, + hash, + (const float *)logits->ptr, + (const int32_t *)tokens->ptr, + 0, + hash_rows, + n_tokens, + has_bias && !hash_mode, + hash_mode, + n_expert_used); } return cuda_ok(cudaGetLastError(), "router_select launch"); } diff --git a/gguf-tools/gen_gguf_template.py b/gguf-tools/gen_gguf_template.py new file mode 100644 index 000000000..6ae91faea --- /dev/null +++ b/gguf-tools/gen_gguf_template.py @@ -0,0 +1,432 @@ +#!/usr/bin/env python3.12 +"""Generate a GGUF template for deepseek4-quantize from safetensors metadata. + +The template contains only GGUF header + metadata + tensor info descriptors. +The deepseek4-quantize tool will regenerate all tensor data from safetensors. 
+ +Usage: + python3.12 gen_gguf_template.py --hf ../DeepSeek-V4-Flash-4Expert --out template.gguf +""" + +import json, struct, sys, os, argparse +from collections import OrderedDict + +# ── GGUF constants ── +GGUF_MAGIC = b"GGUF" +GGUF_VERSION = 3 +GGUF_TYPE_UINT8 = 0 +GGUF_TYPE_INT8 = 1 +GGUF_TYPE_UINT16 = 2 +GGUF_TYPE_INT16 = 3 +GGUF_TYPE_UINT32 = 4 +GGUF_TYPE_INT32 = 5 +GGUF_TYPE_FLOAT32 = 6 +GGUF_TYPE_BOOL = 7 +GGUF_TYPE_STRING = 8 +GGUF_TYPE_ARRAY = 9 +GGUF_TYPE_UINT64 = 10 +GGUF_TYPE_INT64 = 11 +GGUF_TYPE_FLOAT64 = 12 + +# GGUF tensor types (dtype for tensors) +GGUF_TENSOR_F32 = 0 +GGUF_TENSOR_F16 = 1 +GGUF_TENSOR_Q4_0 = 2 +GGUF_TENSOR_Q4_1 = 3 +GGUF_TENSOR_Q8_0 = 8 +GGUF_TENSOR_Q8_1 = 9 +GGUF_TENSOR_Q2_K = 10 +GGUF_TENSOR_Q3_K = 11 +GGUF_TENSOR_Q4_K = 12 +GGUF_TENSOR_Q5_K = 13 +GGUF_TENSOR_Q6_K = 14 +GGUF_TENSOR_Q8_K = 15 +GGUF_TENSOR_IQ2_XXS = 16 +GGUF_TENSOR_IQ2_XS = 17 +GGUF_TENSOR_IQ3_XXS = 18 +GGUF_TENSOR_IQ1_S = 19 +GGUF_TENSOR_IQ4_NL = 20 +GGUF_TENSOR_IQ3_S = 21 +GGUF_TENSOR_IQ2_S = 22 +GGUF_TENSOR_IQ4_XS = 23 +GGUF_TENSOR_I8 = 24 +GGUF_TENSOR_I16 = 25 +GGUF_TENSOR_I32 = 26 +GGUF_TENSOR_I64 = 27 +GGUF_TENSOR_F64 = 28 +GGUF_TENSOR_IQ1_M = 29 +GGUF_TENSOR_BF16 = 30 + +def write_u8(f, v): f.write(struct.pack(' HF safetensors name suffix ── +LAYER_MAP = OrderedDict([ + ("hc_attn_base.weight", "hc_attn_base"), + ("hc_attn_fn.weight", "hc_attn_fn"), + ("hc_attn_scale.weight", "hc_attn_scale"), + ("hc_ffn_base.weight", "hc_ffn_base"), + ("hc_ffn_fn.weight", "hc_ffn_fn"), + ("hc_ffn_scale.weight", "hc_ffn_scale"), + ("attn_sinks.weight", "attn.attn_sink"), + ("attn_q_a.weight", "attn.wq_a.weight"), + ("attn_q_b.weight", "attn.wq_b.weight"), + ("attn_q_a_norm.weight", "attn.q_norm.weight"), + ("attn_kv.weight", "attn.wkv.weight"), + ("attn_kv_a_norm.weight", "attn.kv_norm.weight"), + ("attn_output_a.weight", "attn.wo_a.weight"), + ("attn_output_b.weight", "attn.wo_b.weight"), + ("attn_compressor_ape.weight", "attn.compressor.ape"), + 
("attn_compressor_kv.weight", "attn.compressor.wkv.weight"), + ("attn_compressor_gate.weight", "attn.compressor.wgate.weight"), + ("attn_compressor_norm.weight", "attn.compressor.norm.weight"), + ("indexer.attn_q_b.weight", "attn.indexer.wq_b.weight"), + ("indexer.proj.weight", "attn.indexer.weights_proj.weight"), + ("indexer_compressor_ape.weight", "attn.indexer.compressor.ape"), + ("indexer_compressor_kv.weight", "attn.indexer.compressor.wkv.weight"), + ("indexer_compressor_gate.weight", "attn.indexer.compressor.wgate.weight"), + ("indexer_compressor_norm.weight", "attn.indexer.compressor.norm.weight"), + ("attn_norm.weight", "attn_norm.weight"), + ("ffn_norm.weight", "ffn_norm.weight"), + ("ffn_gate_shexp.weight", "ffn.shared_experts.w1.weight"), + ("ffn_up_shexp.weight", "ffn.shared_experts.w3.weight"), + ("ffn_down_shexp.weight", "ffn.shared_experts.w2.weight"), + ("ffn_gate_inp.weight", "ffn.gate.weight"), + ("exp_probs_b.bias", "ffn.gate.bias"), + ("ffn_gate_tid2eid.weight", "ffn.gate.tid2eid"), +]) + +# HF tensor name to scale companion +HF_SCALE_SUFFIXES = { + ".weight": ".scale", + ".bias": ".scale", # actually no, but some have scales +} + +def hf_dtype_to_gguf(hf_dtype, tensor_name=""): + """Map HF safetensors dtype to GGUF template tensor type. + + Simple rule: everything -> F32 (the quant policy handles 2D+ tensors via + --experts/--attention/--dense flags), except I64 -> I32 (tid2eid routing table). 
+ + F32 is safe because: + - 1D tensors (norms/scales/bias): policy does NOT apply, F32 preserved + - 2D tensors: policy applies and overrides F32 to the correct quant type + """ + if hf_dtype == "I64": + return GGUF_TENSOR_I32 + return GGUF_TENSOR_F32 + +def main(): + parser = argparse.ArgumentParser(description="Generate GGUF template for deepseek4-quantize") + parser.add_argument("--hf", required=True, help="HuggingFace model directory") + parser.add_argument("--out", required=True, help="Output GGUF template path") + parser.add_argument("--n-experts", type=int, default=256, help="Number of routed experts") + parser.add_argument("--n-layers", type=int, default=43, help="Number of layers") + parser.add_argument("--n-expert-used", type=int, default=4, help="Experts used per token") + args = parser.parse_args() + + # ── 1. Read safetensors index ── + index_path = os.path.join(args.hf, "model.safetensors.index.json") + if not os.path.exists(index_path): + print(f"Error: {index_path} not found") + sys.exit(1) + with open(index_path) as f: + index = json.load(f) + + weight_map = index["weight_map"] + print(f"Found {len(weight_map)} tensors in index") + + # Determine shapes by reading all safetensors headers + hf_shapes = {} # hf_name -> (shape, dtype) + st_files = sorted(set(weight_map.values())) + print(f"Reading {len(st_files)} safetensors files for tensor metadata...") + for st_file in st_files: + st_path = os.path.join(args.hf, st_file) + with open(st_path, "rb") as f: + header_size = struct.unpack('= 3: + continue + if "compressor" in gguf_suffix and layer >= 3: + continue + if "indexer" in gguf_suffix: + continue # indexer tensors have different naming + continue + shape, dtype = hf_shapes[hf_name] + dims = list(reversed(shape)) + tensor_infos.append((f"blk.{layer}.{gguf_suffix}", len(dims), dims, + hf_dtype_to_gguf(hf_shapes[hf_name][1]))) + + # Add scale tensors (for quantized weights with .scale companions) + extra_tensors = [] + for name, rank, dims, dtype 
in tensor_infos: + # Check if there's a .scale companion in HF + # GGUF naming for scales: weight_name is the base, scale is embedded + pass # The quantizer handles scale merging + + # Expert tensors (routed experts) + # GGUF shapes from ds4.c tensor_expect_routed_expert: + # gate/up: [DS4_N_EMBD, DS4_N_FF_EXP, DS4_N_EXPERT] = [4096, 2048, 256] + # down: [DS4_N_FF_EXP, DS4_N_EMBD, DS4_N_EXPERT] = [2048, 4096, 256] + for layer in range(args.n_layers): + for expert_type, hf_pattern, dims in [ + ("ffn_gate_exps.weight", "w1", [4096, 2048, args.n_experts]), + ("ffn_up_exps.weight", "w3", [4096, 2048, args.n_experts]), + ("ffn_down_exps.weight", "w2", [2048, 4096, args.n_experts]), + ]: + tensor_infos.append((f"blk.{layer}.{expert_type}", len(dims), dims, + GGUF_TENSOR_F16)) # policy will quantize + + print(f"Total tensor descriptors: {len(tensor_infos)}") + + # ── 4. Write GGUF file ── + with open(args.out, "wb") as f: + # Magic + Version + f.write(GGUF_MAGIC) + write_u32(f, GGUF_VERSION) + write_u64(f, len(tensor_infos)) # n_tensors + write_u64(f, len(metadata)) # n_kv + + # Write metadata KVs + for key, typecode, val in metadata: + write_string(f, key) + write_u32(f, typecode) + write_gguf_value(f, typecode, val) + + # Write tensor infos + for name, rank, dims, dtype in tensor_infos: + write_string(f, name) + write_u32(f, rank) + for d in dims: + write_u64(f, d) + write_u32(f, dtype) + write_u64(f, 0) # offset (placeholder, quantizer will rewrite) + + # Pad to alignment + pos = f.tell() + aligned = ((pos + 31) // 32) * 32 + f.write(b'\0' * (aligned - pos)) + + size_mb = os.path.getsize(args.out) / (1024 * 1024) + print(f"\nTemplate written: {args.out} ({size_mb:.1f} MB)") + print(f"\nNow run:") + print(f" ./gguf-tools/deepseek4-quantize \\") + print(f" --hf {args.hf} \\") + print(f" --template {args.out} \\") + print(f" --out ds4flash-4expert.gguf \\") + print(f" --experts q4_k \\") + print(f" --attention-proj q8_0 \\") + print(f" --shared q8_0 \\") + print(f" 
--output q8_0 \\") + print(f" --embedding f16 \\") + print(f" --dense f16 \\") + print(f" --n-experts {args.n_experts} \\") + print(f" --overwrite") + +if __name__ == "__main__": + main() diff --git a/test-4expert.sh b/test-4expert.sh new file mode 100644 index 000000000..efbb3d53e --- /dev/null +++ b/test-4expert.sh @@ -0,0 +1,93 @@ +#!/bin/bash +# Test 4Expert PR end-to-end on a fresh Linux machine. +# +# Usage: +# bash test-4expert.sh # all defaults +# bash test-4expert.sh /path/to/weights 16 # custom weight dir and threads +# +# This script: +# 1. Clones and builds ds4 (4expert branch) +# 2. Downloads 4Expert safetensors (if not already present) +# 3. Generates GGUF template from safetensors metadata +# 4. Quantizes to Q4_K GGUF (~153 GiB output) +# 5. Links GGUF and runs a test inference +set -euo pipefail + +WEIGHTS_DIR="${1:-DeepSeek-V4-Flash-4Expert}" +THREADS="${2:-$(nproc)}" +OUT_GGUF="ds4flash-4expert.gguf" +TEMPLATE="template.gguf" +BRANCH="4expert" + +echo "============================================" +echo " 4Expert PR End-to-End Test" +echo "============================================" +echo " weights dir : $WEIGHTS_DIR" +echo " threads : $THREADS" +echo " output : $OUT_GGUF" +echo "" + +# ── Step 1: Clone and build ── +if [ ! -f ds4.c ]; then + echo "==> [1/5] Cloning ds4 ($BRANCH branch) ..." + git clone https://github.com/yuhai-china/ds4 + cd ds4 + git checkout "$BRANCH" +else + echo "==> [1/5] Already in ds4 repo, building ..." +fi + +echo " Building gguf-tools ..." +make -C gguf-tools -j"$THREADS" +echo " Building ds4 (CPU) ..." +make cpu -j"$THREADS" +echo "" + +# ── Step 2: Download weights if needed ── +if [ ! -f "$WEIGHTS_DIR/model.safetensors.index.json" ]; then + echo "==> [2/5] Downloading 4Expert safetensors ..." 
+ pip install -q huggingface_hub + python3 -c " +from huggingface_hub import snapshot_download +snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='$WEIGHTS_DIR') +" +else + echo "==> [2/5] Weights found at $WEIGHTS_DIR" +fi +echo "" + +# ── Step 3: Generate template ── +if [ ! -f "$TEMPLATE" ]; then + echo "==> [3/5] Generating GGUF template ..." + python3 gguf-tools/gen_gguf_template.py --hf "$WEIGHTS_DIR" --out "$TEMPLATE" +else + echo "==> [3/5] Template already exists: $TEMPLATE" +fi +echo "" + +# ── Step 4: Quantize ── +echo "==> [4/5] Quantizing ($THREADS threads, ~153 GiB output) ..." +./gguf-tools/deepseek4-quantize \ + --hf "$WEIGHTS_DIR" \ + --template "$TEMPLATE" \ + --out "$OUT_GGUF" \ + --experts q4_k \ + --attention-proj q8_0 \ + --attention f16 \ + --shared q8_0 \ + --output q8_0 \ + --embedding f16 \ + --dense f16 \ + --threads "$THREADS" +echo "" + +# ── Step 5: Test inference ── +echo "==> [5/5] Running test inference ..." +ln -sfn "$OUT_GGUF" ds4flash.gguf +./ds4 -p "The weather is great today" -n 100 + +echo "" +echo "============================================" +echo " Test complete. If you see output above," +echo " 4Expert support is working." +echo "============================================"
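For reviewers who want to check the `ds4_cuda.cu` router change without a GPU, the non-hash path of `router_select_kernel` can be approximated on the CPU. This is a rough reference sketch, not code from the PR: `router_select` and `softplus` are hypothetical names, `softplus` is assumed to match `softplus_dev` as a plain `log(1 + exp(x))`, and Python floats are 64-bit, so results will differ from the kernel's `float` arithmetic in the low digits.

```python
import math

N_EXPERT = 256            # routed experts, as hard-coded in the kernels
WEIGHT_SCALE = 1.5        # expert_weight_scale the CUDA entry points require
MIN_SUM = 6.103515625e-5  # lower clamp applied to the weight sum


def softplus(x):
    """Numerically stable log(1 + exp(x)); assumed to match softplus_dev."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def router_select(logits, n_expert_used, bias=None):
    """Return (selected, weights) for one token in non-hash mode."""
    if len(logits) != N_EXPERT:
        raise ValueError("expected 256 logits")
    if n_expert_used not in (4, 6):
        raise ValueError("the patched entry points accept only 4 or 6")
    prob = [math.sqrt(softplus(x)) for x in logits]
    # Bias affects ranking only; the weights use the unbiased prob.
    score = [p + (bias[i] if bias is not None else 0.0)
             for i, p in enumerate(prob)]
    # The kernel inserts with a strict '>', so ties keep the lower index.
    order = sorted(range(N_EXPERT), key=lambda i: (-score[i], i))
    selected = order[:n_expert_used]
    total = max(sum(prob[e] for e in selected), MIN_SUM)
    weights = [prob[e] / total * WEIGHT_SCALE for e in selected]
    return selected, weights
```

Because the weights are renormalized over the selected experts, they sum to 1.5 for both `n_expert_used == 4` and `n_expert_used == 6`; going from 6 to 4 drops the two lowest-ranked experts and raises the weights of the remaining four.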