antirez · yuhai-china · Jun 28, 2026 · Jun 29, 2026
diff --git a/docs/gguf-conversion.md b/docs/gguf-conversion.md
@@ -0,0 +1,99 @@
+# GGUF Conversion Guide
+
+Complete workflow for generating quantized GGUF files from DeepSeek V4 Flash safetensors weights.
+
+## Quick Start (one command)
+
+```bash
+# From the ds4 repo root (nproc is Linux-only; on macOS substitute $(sysctl -n hw.ncpu)):
+bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)
+```
+
+The script generates `template.gguf`, runs the quantizer, and writes the final GGUF. The Manual Steps section below runs each stage by hand, and How It Works explains what each stage does.
+
+## Manual Steps
+
+### 1. Build
+
+```bash
+# Linux:
+make cpu -j$(nproc)
+make -C gguf-tools -j$(nproc)
+# macOS:
+make -j$(sysctl -n hw.ncpu)
+make -C gguf-tools -j$(sysctl -n hw.ncpu)
+```
+
+### 2. Generate GGUF Template
+
+```bash
+python3 gguf-tools/gen_gguf_template.py \
+  --hf /path/to/DeepSeek-V4-Flash-4Expert \
+  --out /tmp/template.gguf
+```
+
+The template (~5.6 MB) contains complete metadata, tokenizer data, and tensor descriptors. It does **not** contain weight data — only tensor names, shapes, and types.
+
+### 3. Quantize and Convert
+
+```bash
+./gguf-tools/deepseek4-quantize \
+  --hf /path/to/DeepSeek-V4-Flash-4Expert \
+  --template /tmp/template.gguf \
+  --out /path/to/output/model-q4k.gguf \
+  --experts q4_k \
+  --attention-proj q8_0 \
+  --attention f16 \
+  --shared q8_0 \
+  --output q8_0 \
+  --embedding f16 \
+  --dense f16 \
+  --n-experts 256 \
+  --threads 12
+```
+
+> **Note**: If the output file already exists, you must add `--overwrite` or the tool will error.
+
+#### Quantization Options Reference
+
+| Flag | Typical value | Description |
+|------|--------------|-------------|
+| `--experts` | `q4_k` | Routed experts MoE FFN (w1/w2/w3) |
+| `--attention-proj` | `q8_0` | Attention projection matrices (q/kv/output_a/output_b) |
+| `--attention` | `f16` | Other 2D attention/compressor/indexer tensors |
+| `--shared` | `q8_0` | Shared expert FFN |
+| `--output` | `q8_0` | Output projection (output.*) |
+| `--embedding` | `f16` | Token embedding layer |
+| `--dense` | `f16` | Remaining 2D+ tensors not matched above |
+| `--n-experts` | from template | Number of routed experts (read from template metadata if omitted) |
+| `--threads` | `8` | Parallel worker count |
+
+### 4. Test
+
+```bash
+ln -sfn /path/to/output/model-q4k.gguf ds4flash.gguf
+./ds4 -p "Hello" -n 100
+```
+
+## How It Works
+
+### Template Generation (gen_gguf_template.py)
+
+1. Reads `model.safetensors.index.json` for tensor names and shapes
+2. Maps HF tensor names to GGUF names using the same `layer_map` as `deepseek4-quantize.c`
+3. Sets initial tensor types: F32 for regular tensors and F16 for routed expert tensors. 1D tensors (norms, scales, biases) remain F32. Tensors with two or more dimensions have their type overridden later by the quantizer policy.
+4. Writes a GGUF file containing metadata + tokenizer + tensor descriptors
+
+### Quantizer (deepseek4-quantize)
+
+1. Loads the template to obtain all tensor descriptors
+2. For each tensor: determines the final type using the user-specified quantization policy
+3. Reads safetensors weights, performs quantization, writes to the output GGUF
+4. Produces a ready-to-use GGUF file
+
+## Notes
+
+- **Regenerate the template** whenever the model's tensor set changes (Manual Steps, step 2)
+- **Type conversion**: `gen_gguf_template.py` automatically handles I64 → I32 conversion (for the `tid2eid` routing table)
+- **1D tensors** (norms, scales, biases) are always stored as F32 and never quantized
+- **Large model**: Q4_K output is approximately 153 GiB; ensure sufficient disk space
diff --git a/docs/test-pr-on-linux.md b/docs/test-pr-on-linux.md
@@ -0,0 +1,90 @@
+# Testing the 4Expert PR on a Fresh Linux Machine
+
+## Quick Start (single command)
+
+```bash
+git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
+bash test-4expert.sh
+```
+
+This runs all 5 steps:
+1. Build ds4 + gguf-tools
+2. Download 4Expert safetensors weights (~130 GiB)
+3. Generate GGUF template from metadata
+4. Quantize to Q4_K GGUF (~153 GiB output, ~30 min with 20 threads)
+5. Link GGUF and run `./ds4 -p "..." -n 100`
+
+After step 4 completes, the GGUF file can be reused, so there is no need to re-convert on subsequent runs.
+
+## Custom Paths
+
+```bash
+bash test-4expert.sh /path/to/existing/weights 16
+```
+
+- 1st arg: path to safetensors directory (skips download if `model.safetensors.index.json` exists)
+- 2nd arg: number of threads (default: all cores)
+
+## Pre-Quantized (skip conversion, download GGUF directly)
+
+```bash
+git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
+make cpu -j$(nproc)
+make -C gguf-tools -j$(nproc)
+
+pip install -q huggingface_hub
+python3 -c "
+from huggingface_hub import hf_hub_download
+hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
+"
+
+ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
+./ds4 -p "Hello" -n 100
+```
+
+## Manual Steps (if the single-command script fails)
+
+### 1. Build
+
+```bash
+make -C gguf-tools -j$(nproc)
+make cpu -j$(nproc)
+```
+
+### 2. Download weights
+
+```bash
+pip install huggingface_hub
+python3 -c "
+from huggingface_hub import snapshot_download
+snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
+"
+```
+
+### 3. Generate template + quantize
+
+```bash
+python3 gguf-tools/gen_gguf_template.py \
+  --hf ./DeepSeek-V4-Flash-4Expert \
+  --out template.gguf
+
+./gguf-tools/deepseek4-quantize \
+  --hf ./DeepSeek-V4-Flash-4Expert \
+  --template template.gguf \
+  --out ds4flash-4expert.gguf \
+  --experts q4_k \
+  --attention-proj q8_0 \
+  --attention f16 \
+  --shared q8_0 \
+  --output q8_0 \
+  --embedding f16 \
+  --dense f16 \
+  --threads $(nproc)
+```
+
+### 4. Test
+
+```bash
+ln -sfn ds4flash-4expert.gguf ds4flash.gguf
+./ds4 -p "The weather is great today" -n 100
+```
diff --git a/ds4.c b/ds4.c
@@ -189,7 +189,7 @@ static const ds4_shape DS4_SHAPE_FLASH = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -263,7 +263,7 @@ static ds4_shape g_ds4_shape = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -3756,15 +3756,17 @@ static void ds4_select_shape_from_metadata(
         uint32_t n_indexer_top_k,
         uint32_t n_hc,
         uint32_t n_hc_sinkhorn_iter) {
-    if (ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
-                                   n_layer, n_embd, n_vocab, n_head, n_head_kv,
-                                   n_head_dim, n_value_dim, n_rot, n_lora_q,
-                                   n_lora_o, n_out_group, n_expert,
-                                   n_expert_used, n_ff_exp, n_expert_shared,
-                                   n_hash_layer, n_swa, n_indexer_head,
-                                   n_indexer_head_dim, n_indexer_top_k, n_hc,
-                                   n_hc_sinkhorn_iter)) {
+    if ((n_expert_used == 4 || n_expert_used == 6) &&
+        ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
+                                    n_layer, n_embd, n_vocab, n_head, n_head_kv,
+                                    n_head_dim, n_value_dim, n_rot, n_lora_q,
+                                    n_lora_o, n_out_group, n_expert,
+                                    4, n_ff_exp, n_expert_shared,
+                                    n_hash_layer, n_swa, n_indexer_head,
+                                    n_indexer_head_dim, n_indexer_top_k, n_hc,
+                                    n_hc_sinkhorn_iter)) {
         g_ds4_shape = DS4_SHAPE_FLASH;
+        if (n_expert_used == 6) g_ds4_shape.n_expert_used = 6;
         return;
     }
     if (ds4_shape_matches_metadata(&DS4_SHAPE_PRO,