diff --git a/docs/gguf-conversion.md b/docs/gguf-conversion.md
new file mode 100644
index 000000000..e0a90c9be
--- /dev/null
+++ b/docs/gguf-conversion.md
@@ -0,0 +1,99 @@
+# GGUF Conversion Guide
+
+Complete workflow for generating quantized GGUF files from DeepSeek V4 Flash safetensors weights.
+
+## Quick Start (one command)
+
+```bash
+# From the ds4 repo root:
+bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)
+```
+
+This generates `template.gguf`, runs the quantizer, and writes the final GGUF. See Manual Steps below for the individual commands and How It Works for what happens under the hood.
+
+## Manual Steps
+
+### 1. Build
+
+```bash
+# Linux:
+make cpu -j$(nproc)
+make -C gguf-tools -j$(nproc)
+# macOS:
+make -j$(sysctl -n hw.ncpu)
+make -C gguf-tools -j$(sysctl -n hw.ncpu)
+```
+
+### 2. Generate GGUF Template
+
+```bash
+python3 gguf-tools/gen_gguf_template.py \
+  --hf /path/to/DeepSeek-V4-Flash-4Expert \
+  --out /tmp/template.gguf
+```
+
+The template (~5.6 MB) contains complete metadata, tokenizer data, and tensor descriptors. It does **not** contain weight data — only tensor names, shapes, and types.
+
+### 3. Quantize and Convert
+
+```bash
+./gguf-tools/deepseek4-quantize \
+  --hf /path/to/DeepSeek-V4-Flash-4Expert \
+  --template /tmp/template.gguf \
+  --out /path/to/output/model-q4k.gguf \
+  --experts q4_k \
+  --attention-proj q8_0 \
+  --attention f16 \
+  --shared q8_0 \
+  --output q8_0 \
+  --embedding f16 \
+  --dense f16 \
+  --n-experts 256 \
+  --threads 12
+```
+
+> **Note**: If the output file already exists, pass `--overwrite`; otherwise the tool fails with an error.
+
+#### Quantization Options Reference
+
+| Flag | Typical value | Description |
+|------|--------------|-------------|
+| `--experts` | `q4_k` | Routed experts MoE FFN (w1/w2/w3) |
+| `--attention-proj` | `q8_0` | Attention projection matrices (q/kv/output_a/output_b) |
+| `--attention` | `f16` | Other 2D attention/compressor/indexer tensors |
+| `--shared` | `q8_0` | Shared expert FFN |
+| `--output` | `q8_0` | Output projection (output.*) |
+| `--embedding` | `f16` | Token embedding layer |
+| `--dense` | `f16` | Remaining 2D+ tensors not matched above |
+| `--n-experts` | from template | Number of routed experts (read from template metadata if omitted) |
+| `--threads` | `8` | Parallel worker count |
+
+### 4. Test
+
+```bash
+ln -sfn /path/to/output/model-q4k.gguf ds4flash.gguf
+./ds4 -p "Hello" -n 100
+```
+
+## How It Works
+
+### Template Generation (gen_gguf_template.py)
+
+1. Reads `model.safetensors.index.json` for tensor names and shapes
+2. Maps HF tensor names to GGUF names using the same `layer_map` as `deepseek4-quantize.c`
+3. Assigns placeholder tensor types: F32 for regular tensors and F16 for routed expert tensors. 1D tensors (norms, scales, biases) stay F32 in the final file; 2D+ tensors have their placeholder type overridden by the quantizer policy.
+4. Writes a GGUF file containing metadata + tokenizer + tensor descriptors
+
+### Quantizer (deepseek4-quantize)
+
+1. Loads the template to obtain all tensor descriptors
+2. For each tensor: determines the final type using the user-specified quantization policy
+3. Reads safetensors weights, performs quantization, writes to the output GGUF
+4. Produces a ready-to-use GGUF file
+
+## Notes
+
+- **Regenerate the template** whenever the model's tensor set changes (step 2)
+- **Type conversion**: `gen_gguf_template.py` automatically handles I64 → I32 conversion (for the `tid2eid` routing table)
+- **1D tensors** (norms, scales, biases) are always stored as F32 and never quantized
+- **Large model**: Q4_K output is approximately 153 GiB; ensure sufficient disk space
diff --git a/docs/test-pr-on-linux.md b/docs/test-pr-on-linux.md
new file mode 100644
index 000000000..9a3a2ef51
--- /dev/null
+++ b/docs/test-pr-on-linux.md
@@ -0,0 +1,90 @@
+# Testing the 4Expert PR on a Fresh Linux Machine
+
+## Quick Start (single command)
+
+```bash
+git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
+bash test-4expert.sh
+```
+
+This runs all 5 steps:
+1. Build ds4 + gguf-tools
+2. Download 4Expert safetensors weights (~130 GiB)
+3. Generate GGUF template from metadata
+4. Quantize to Q4_K GGUF (~153 GiB output, ~30 min with 20 threads)
+5. Link GGUF and run `./ds4 -p "..." -n 100`
+
+After step 4 completes, the GGUF file is reusable — skip re-conversion on subsequent runs.
+
+## Custom Paths
+
+```bash
+bash test-4expert.sh /path/to/existing/weights 16
+```
+
+- 1st arg: path to safetensors directory (skips download if `model.safetensors.index.json` exists)
+- 2nd arg: number of threads (default all cores)
+
+## Pre-Quantized (skip conversion, download GGUF directly)
+
+```bash
+git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
+make cpu -j$(nproc)
+make -C gguf-tools -j$(nproc)
+
+pip install -q huggingface_hub
+python3 -c "
+from huggingface_hub import hf_hub_download
+hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
+"
+
+ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
+./ds4 -p "Hello" -n 100
+```
+
+## Manual Steps (if the single-command script doesn't work)
+
+### 1. Build
+
+```bash
+make -C gguf-tools -j$(nproc)
+make cpu -j$(nproc)
+```
+
+### 2. Download weights
+
+```bash
+pip install huggingface_hub
+python3 -c "
+from huggingface_hub import snapshot_download
+snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
+"
+```
+
+### 3. Generate template + quantize
+
+```bash
+python3 gguf-tools/gen_gguf_template.py \
+  --hf ./DeepSeek-V4-Flash-4Expert \
+  --out template.gguf
+
+./gguf-tools/deepseek4-quantize \
+  --hf ./DeepSeek-V4-Flash-4Expert \
+  --template template.gguf \
+  --out ds4flash-4expert.gguf \
+  --experts q4_k \
+  --attention-proj q8_0 \
+  --attention f16 \
+  --shared q8_0 \
+  --output q8_0 \
+  --embedding f16 \
+  --dense f16 \
+  --threads $(nproc)
+```
+
+### 4. Test
+
+```bash
+ln -sfn ds4flash-4expert.gguf ds4flash.gguf
+./ds4 -p "The weather is great today" -n 100
+```
diff --git a/ds4.c b/ds4.c
index 640511eb0..82a00265e 100644
--- a/ds4.c
+++ b/ds4.c
@@ -189,7 +189,7 @@ static const ds4_shape DS4_SHAPE_FLASH = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -263,7 +263,7 @@ static ds4_shape g_ds4_shape = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -3756,15 +3756,17 @@ static void ds4_select_shape_from_metadata(
         uint32_t n_indexer_top_k,
         uint32_t n_hc,
         uint32_t n_hc_sinkhorn_iter) {
-    if (ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
-                                   n_layer, n_embd, n_vocab, n_head, n_head_kv,
-                                   n_head_dim, n_value_dim, n_rot, n_lora_q,
-                                   n_lora_o, n_out_group, n_expert,
-                                   n_expert_used, n_ff_exp, n_expert_shared,
-                                   n_hash_layer, n_swa, n_indexer_head,
-                                   n_indexer_head_dim, n_indexer_top_k, n_hc,
-                                   n_hc_sinkhorn_iter)) {
+    if ((n_expert_used == 4 || n_expert_used == 6) &&
+        ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
+                                    n_layer, n_embd, n_vocab, n_head, n_head_kv,
+                                    n_head_dim, n_value_dim, n_rot, n_lora_q,
+                                    n_lora_o, n_out_group, n_expert,
+                                    4, n_ff_exp, n_expert_shared,
+                                    n_hash_layer, n_swa, n_indexer_head,
+                                    n_indexer_head_dim, n_indexer_top_k, n_hc,
+                                    n_hc_sinkhorn_iter)) {
         g_ds4_shape = DS4_SHAPE_FLASH;
+        if (n_expert_used == 6) g_ds4_shape.n_expert_used = 6;
         return;
     }
     if (ds4_shape_matches_metadata(&DS4_SHAPE_PRO,
diff --git a/ds4_cuda.cu b/ds4_cuda.cu
index 188b341ad..65bb6b852 100644
--- a/ds4_cuda.cu
+++ b/ds4_cuda.cu
@@ -5972,28 +5972,29 @@ __global__ static void router_select_kernel(
         uint32_t hash_rows,
         uint32_t n_tokens,
         int has_bias,
-        int hash_mode) {
+        int hash_mode,
+        uint32_t n_expert_used) {
     uint32_t t = blockIdx.x;
     if (t >= n_tokens || threadIdx.x != 0) return;
     const float *log = logits + (uint64_t)t * 256;
     float *prob = probs + (uint64_t)t * 256;
-    int32_t *sel = selected + (uint64_t)t * 6;
-    float *w = weights + (uint64_t)t * 6;
+    int32_t *sel = selected + (uint64_t)t * n_expert_used;
+    float *w = weights + (uint64_t)t * n_expert_used;
 
     for (int i = 0; i < 256; i++) prob[i] = sqrtf(softplus_dev(log[i]));
 
     if (hash_mode) {
         int32_t tok = tokens ? tokens[t] : token_scalar;
         if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0;
-        const int32_t *row = hash + (uint64_t)tok * 6;
-        for (int i = 0; i < 6; i++) sel[i] = row[i];
+        const int32_t *row = hash + (uint64_t)tok * n_expert_used;
+        for (int i = 0; i < (int)n_expert_used; i++) sel[i] = row[i];
     } else {
-        for (int i = 0; i < 6; i++) sel[i] = -1;
+        for (int i = 0; i < (int)n_expert_used; i++) sel[i] = -1;
         for (int i = 0; i < 256; i++) {
             float score = prob[i] + (has_bias ? bias[i] : 0.0f);
-            for (int j = 0; j < 6; j++) {
+            for (int j = 0; j < (int)n_expert_used; j++) {
                 if (sel[j] < 0 || score > prob[sel[j]] + (has_bias ? bias[sel[j]] : 0.0f)) {
-                    for (int k = 5; k > j; k--) sel[k] = sel[k - 1];
+                    for (int k = (int)n_expert_used - 1; k > j; k--) sel[k] = sel[k - 1];
                     sel[j] = i;
                     break;
                 }
@@ -6002,14 +6003,14 @@ __global__ static void router_select_kernel(
     }
 
     float sum = 0.0f;
-    for (int i = 0; i < 6; i++) {
+    for (int i = 0; i < (int)n_expert_used; i++) {
         int e = sel[i];
         float v = (e >= 0 && e < 256) ? prob[e] : 0.0f;
         w[i] = v;
         sum += v;
     }
     sum = fmaxf(sum, 6.103515625e-5f);
-    for (int i = 0; i < 6; i++) w[i] = w[i] / sum * 1.5f;
+    for (int i = 0; i < (int)n_expert_used; i++) w[i] = w[i] / sum * 1.5f;
 }
 
 __global__ static void router_select_parallel_kernel(
@@ -6024,14 +6025,15 @@ __global__ static void router_select_parallel_kernel(
         uint32_t hash_rows,
         uint32_t n_tokens,
         int has_bias,
-        int hash_mode) {
+        int hash_mode,
+        uint32_t n_expert_used) {
     uint32_t t = blockIdx.x;
     uint32_t i = threadIdx.x;
     if (t >= n_tokens || i >= 256u) return;
     const float *log = logits + (uint64_t)t * 256;
     float *prob = probs + (uint64_t)t * 256;
-    int32_t *sel = selected + (uint64_t)t * 6;
-    float *w = weights + (uint64_t)t * 6;
+    int32_t *sel = selected + (uint64_t)t * n_expert_used;
+    float *w = weights + (uint64_t)t * n_expert_used;
     __shared__ float sprob[256];
 
     const float p = sqrtf(softplus_dev(log[i]));
@@ -6043,15 +6045,15 @@ __global__ static void router_select_parallel_kernel(
     if (hash_mode) {
         int32_t tok = tokens ? tokens[t] : token_scalar;
         if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0;
-        const int32_t *row = hash + (uint64_t)tok * 6;
-        for (int j = 0; j < 6; j++) sel[j] = row[j];
+        const int32_t *row = hash + (uint64_t)tok * n_expert_used;
+        for (int j = 0; j < (int)n_expert_used; j++) sel[j] = row[j];
     } else {
-        for (int j = 0; j < 6; j++) sel[j] = -1;
+        for (int j = 0; j < (int)n_expert_used; j++) sel[j] = -1;
         for (int e = 0; e < 256; e++) {
             float score = sprob[e] + (has_bias ? bias[e] : 0.0f);
-            for (int j = 0; j < 6; j++) {
+            for (int j = 0; j < (int)n_expert_used; j++) {
                 if (sel[j] < 0 || score > sprob[sel[j]] + (has_bias ? bias[sel[j]] : 0.0f)) {
-                    for (int k = 5; k > j; k--) sel[k] = sel[k - 1];
+                    for (int k = (int)n_expert_used - 1; k > j; k--) sel[k] = sel[k - 1];
                     sel[j] = e;
                     break;
                 }
@@ -6060,14 +6062,14 @@ __global__ static void router_select_parallel_kernel(
     }
 
     float sum = 0.0f;
-    for (int j = 0; j < 6; j++) {
+    for (int j = 0; j < (int)n_expert_used; j++) {
         int e = sel[j];
         float v = (e >= 0 && e < 256) ? sprob[e] : 0.0f;
         w[j] = v;
         sum += v;
     }
     sum = fmaxf(sum, 6.103515625e-5f);
-    for (int j = 0; j < 6; j++) w[j] = w[j] / sum * 1.5f;
+    for (int j = 0; j < (int)n_expert_used; j++) w[j] = w[j] / sum * 1.5f;
 }
 
 __device__ __forceinline__ static bool router_score_better(float av, uint32_t ai, float bv, uint32_t bi) {
@@ -6086,7 +6088,8 @@ __global__ static void router_select_warp_topk_kernel(
         uint32_t hash_rows,
         uint32_t n_tokens,
         int has_bias,
-        int hash_mode) {
+        int hash_mode,
+        uint32_t n_expert_used) {
     const uint32_t lane = threadIdx.x;
     const uint32_t row_in_block = threadIdx.y;
     const uint32_t t = blockIdx.x * blockDim.y + row_in_block;
@@ -6094,8 +6097,8 @@ __global__ static void router_select_warp_topk_kernel(
 
     const float *log = logits + (uint64_t)t * 256u;
     float *prob = probs + (uint64_t)t * 256u;
-    int32_t *sel = selected + (uint64_t)t * 6u;
-    float *w = weights + (uint64_t)t * 6u;
+    int32_t *sel = selected + (uint64_t)t * n_expert_used;
+    float *w = weights + (uint64_t)t * n_expert_used;
     __shared__ float sprob[4][256];
     float local_prob[8];
     float local_score[8];
@@ -6115,10 +6118,10 @@ __global__ static void router_select_warp_topk_kernel(
         if (lane == 0) {
             int32_t tok = tokens ? tokens[t] : token_scalar;
             if (tok < 0 || (uint32_t)tok >= hash_rows) tok = 0;
-            const int32_t *row = hash + (uint64_t)tok * 6u;
+            const int32_t *row = hash + (uint64_t)tok * n_expert_used;
             float sum = 0.0f;
             #pragma unroll
-            for (uint32_t j = 0; j < 6u; j++) {
+            for (uint32_t j = 0; j < n_expert_used; j++) {
                 const int32_t e = row[j];
                 sel[j] = e;
                 const float v = (e >= 0 && e < 256) ? sprob[row_in_block][(uint32_t)e] : 0.0f;
@@ -6127,7 +6130,7 @@ __global__ static void router_select_warp_topk_kernel(
             }
             sum = fmaxf(sum, 6.103515625e-5f);
             #pragma unroll
-            for (uint32_t j = 0; j < 6u; j++) w[j] = w[j] / sum * 1.5f;
+            for (uint32_t j = 0; j < n_expert_used; j++) w[j] = w[j] / sum * 1.5f;
         }
         return;
     }
@@ -6135,7 +6138,7 @@ __global__ static void router_select_warp_topk_kernel(
     float out_prob[6] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
     uint32_t out_idx[6] = {0, 0, 0, 0, 0, 0};
     #pragma unroll
-    for (uint32_t k = 0; k < 6u; k++) {
+    for (uint32_t k = 0; k < n_expert_used; k++) {
         float best_score = -INFINITY;
         float best_prob = 0.0f;
         uint32_t best_idx = UINT32_MAX;
@@ -6174,14 +6177,14 @@ __global__ static void router_select_warp_topk_kernel(
     if (lane == 0) {
         float sum = 0.0f;
         #pragma unroll
-        for (uint32_t j = 0; j < 6u; j++) {
+        for (uint32_t j = 0; j < n_expert_used; j++) {
             sel[j] = (int32_t)out_idx[j];
             w[j] = out_prob[j];
             sum += out_prob[j];
         }
         sum = fmaxf(sum, 6.103515625e-5f);
         #pragma unroll
-        for (uint32_t j = 0; j < 6u; j++) w[j] = w[j] / sum * 1.5f;
+        for (uint32_t j = 0; j < n_expert_used; j++) w[j] = w[j] / sum * 1.5f;
     }
 }
 
@@ -9530,7 +9533,7 @@ extern "C" int ds4_gpu_directional_steering_project_tensor(
 }
 extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_tensor *weights, ds4_gpu_tensor *probs, const void *model_map, uint64_t model_size, uint64_t bias_offset, uint64_t hash_offset, uint32_t hash_rows, uint32_t token, uint32_t n_expert, uint32_t n_expert_used, float expert_weight_scale, uint32_t n_expert_groups, uint32_t n_group_used, bool has_bias, bool hash_mode, const ds4_gpu_tensor *logits) {
     if (!selected || !weights || !probs || !logits || !model_map || n_expert_groups > 1u || n_group_used > 0u) return 0;
-    if (n_expert != 256u || n_expert_used != 6u || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0;
+    if (n_expert != 256u || (n_expert_used != 6u && n_expert_used != 4u) || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0;
     int32_t tok = (int32_t)token;
     int ok = 1;
     const float *bias = NULL;
@@ -9541,7 +9544,7 @@ extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_te
         if (!bias) ok = 0;
     }
     if (ok && hash_mode) {
-        const uint64_t hash_bytes = (uint64_t)hash_rows * 6u * sizeof(int32_t);
+        const uint64_t hash_bytes = (uint64_t)hash_rows * n_expert_used * sizeof(int32_t);
         if (hash_offset > model_size || hash_bytes > model_size - hash_offset) ok = 0;
         else hash = (const int32_t *)cuda_model_range_ptr(model_map, hash_offset, hash_bytes, "router_hash");
         if (!hash) ok = 0;
@@ -9551,29 +9554,29 @@ extern "C" int ds4_gpu_router_select_tensor(ds4_gpu_tensor *selected, ds4_gpu_te
             getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) {
             dim3 block(32, 4, 1);
             router_select_warp_topk_kernel<<<1, block>>>((int32_t *)selected->ptr, (float *)weights->ptr, (float *)probs->ptr,
-                                                         bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
-                                                         has_bias && !hash_mode, hash_mode);
+                                                          bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
+                                                          has_bias && !hash_mode, hash_mode, n_expert_used);
         } else if (getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) {
             router_select_parallel_kernel<<<1, 256>>>((int32_t *)selected->ptr, (float *)weights->ptr, (float *)probs->ptr,
-                                                      bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
-                                                      has_bias && !hash_mode, hash_mode);
+                                                       bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
+                                                       has_bias && !hash_mode, hash_mode, n_expert_used);
         } else {
             router_select_kernel<<<1, 1>>>((int32_t *)selected->ptr, (float *)weights->ptr, (float *)probs->ptr,
-                                          bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
-                                          has_bias && !hash_mode, hash_mode);
+                                           bias, hash, (const float *)logits->ptr, NULL, tok, hash_rows, 1,
+                                           has_bias && !hash_mode, hash_mode, n_expert_used);
         }
         ok = cuda_ok(cudaGetLastError(), "router_select launch");
     }
     return ok;
 }
 extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_gpu_tensor *weights, ds4_gpu_tensor *probs, const void *model_map, uint64_t model_size, uint64_t bias_offset, uint64_t hash_offset, uint32_t hash_rows, uint32_t n_expert_groups, uint32_t n_group_used, bool has_bias, bool hash_mode, const ds4_gpu_tensor *logits, const ds4_gpu_tensor *tokens, uint32_t n_expert, uint32_t n_expert_used, float expert_weight_scale, uint32_t n_tokens) {
-    if (n_expert != 256u || n_expert_used != 6u || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0;
+    if (n_expert != 256u || (n_expert_used != 6u && n_expert_used != 4u) || fabsf(expert_weight_scale - 1.5f) > 1.0e-6f) return 0;
     if (!selected || !weights || !probs || !logits || !tokens || !model_map || n_tokens == 0 ||
         n_expert_groups > 1u || n_group_used > 0u ||
         logits->bytes < (uint64_t)n_tokens * 256u * sizeof(float) ||
         probs->bytes < (uint64_t)n_tokens * 256u * sizeof(float) ||
-        selected->bytes < (uint64_t)n_tokens * 6u * sizeof(int32_t) ||
-        weights->bytes < (uint64_t)n_tokens * 6u * sizeof(float)) {
+        selected->bytes < (uint64_t)n_tokens * (uint64_t)n_expert_used * sizeof(int32_t) ||
+        weights->bytes < (uint64_t)n_tokens * (uint64_t)n_expert_used * sizeof(float)) {
         return 0;
     }
     const float *bias = NULL;
@@ -9584,7 +9587,7 @@ extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_
         if (!bias) return 0;
     }
     if (hash_mode) {
-        const uint64_t hash_bytes = (uint64_t)hash_rows * 6u * sizeof(int32_t);
+        const uint64_t hash_bytes = (uint64_t)hash_rows * n_expert_used * sizeof(int32_t);
         if (hash_offset > model_size || hash_bytes > model_size - hash_offset) return 0;
         hash = (const int32_t *)cuda_model_range_ptr(model_map, hash_offset, hash_bytes, "router_hash");
         if (!hash) return 0;
@@ -9603,33 +9606,36 @@ extern "C" int ds4_gpu_router_select_batch_tensor(ds4_gpu_tensor *selected, ds4_
                                                                         hash_rows,
                                                                         n_tokens,
                                                                         has_bias && !hash_mode,
-                                                                        hash_mode);
+                                                                        hash_mode,
+                                                                        n_expert_used);
     } else if (getenv("DS4_CUDA_NO_PARALLEL_ROUTER_SELECT") == NULL) {
         router_select_parallel_kernel<<<n_tokens, 256>>>((int32_t *)selected->ptr,
-                                                         (float *)weights->ptr,
-                                                         (float *)probs->ptr,
-                                                         bias,
-                                                         hash,
-                                                         (const float *)logits->ptr,
-                                                         (const int32_t *)tokens->ptr,
-                                                         0,
-                                                         hash_rows,
-                                                         n_tokens,
-                                                         has_bias && !hash_mode,
-                                                         hash_mode);
+                                                          (float *)weights->ptr,
+                                                          (float *)probs->ptr,
+                                                          bias,
+                                                          hash,
+                                                          (const float *)logits->ptr,
+                                                          (const int32_t *)tokens->ptr,
+                                                          0,
+                                                          hash_rows,
+                                                          n_tokens,
+                                                          has_bias && !hash_mode,
+                                                          hash_mode,
+                                                          n_expert_used);
     } else {
         router_select_kernel<<<n_tokens, 1>>>((int32_t *)selected->ptr,
-                                              (float *)weights->ptr,
-                                              (float *)probs->ptr,
-                                              bias,
-                                              hash,
-                                              (const float *)logits->ptr,
-                                              (const int32_t *)tokens->ptr,
-                                              0,
-                                              hash_rows,
-                                              n_tokens,
-                                              has_bias && !hash_mode,
-                                              hash_mode);
+                                               (float *)weights->ptr,
+                                               (float *)probs->ptr,
+                                               bias,
+                                               hash,
+                                               (const float *)logits->ptr,
+                                               (const int32_t *)tokens->ptr,
+                                               0,
+                                               hash_rows,
+                                               n_tokens,
+                                               has_bias && !hash_mode,
+                                               hash_mode,
+                                               n_expert_used);
     }
     return cuda_ok(cudaGetLastError(), "router_select launch");
 }
diff --git a/gguf-tools/gen_gguf_template.py b/gguf-tools/gen_gguf_template.py
new file mode 100644
index 000000000..6ae91faea
--- /dev/null
+++ b/gguf-tools/gen_gguf_template.py
@@ -0,0 +1,432 @@
+#!/usr/bin/env python3.12
+"""Generate a GGUF template for deepseek4-quantize from safetensors metadata.
+
+The template contains only GGUF header + metadata + tensor info descriptors.
+The deepseek4-quantize tool will regenerate all tensor data from safetensors.
+
+Usage:
+    python3.12 gen_gguf_template.py --hf ../DeepSeek-V4-Flash-4Expert --out template.gguf
+"""
+
+import json, struct, sys, os, argparse
+from collections import OrderedDict
+
+# ── GGUF constants ──
+GGUF_MAGIC = b"GGUF"
+GGUF_VERSION = 3
+GGUF_TYPE_UINT8 = 0
+GGUF_TYPE_INT8 = 1
+GGUF_TYPE_UINT16 = 2
+GGUF_TYPE_INT16 = 3
+GGUF_TYPE_UINT32 = 4
+GGUF_TYPE_INT32 = 5
+GGUF_TYPE_FLOAT32 = 6
+GGUF_TYPE_BOOL = 7
+GGUF_TYPE_STRING = 8
+GGUF_TYPE_ARRAY = 9
+GGUF_TYPE_UINT64 = 10
+GGUF_TYPE_INT64 = 11
+GGUF_TYPE_FLOAT64 = 12
+
+# GGUF tensor types (dtype for tensors)
+GGUF_TENSOR_F32 = 0
+GGUF_TENSOR_F16 = 1
+GGUF_TENSOR_Q4_0 = 2
+GGUF_TENSOR_Q4_1 = 3
+GGUF_TENSOR_Q8_0 = 8
+GGUF_TENSOR_Q8_1 = 9
+GGUF_TENSOR_Q2_K = 10
+GGUF_TENSOR_Q3_K = 11
+GGUF_TENSOR_Q4_K = 12
+GGUF_TENSOR_Q5_K = 13
+GGUF_TENSOR_Q6_K = 14
+GGUF_TENSOR_Q8_K = 15
+GGUF_TENSOR_IQ2_XXS = 16
+GGUF_TENSOR_IQ2_XS = 17
+GGUF_TENSOR_IQ3_XXS = 18
+GGUF_TENSOR_IQ1_S = 19
+GGUF_TENSOR_IQ4_NL = 20
+GGUF_TENSOR_IQ3_S = 21
+GGUF_TENSOR_IQ2_S = 22
+GGUF_TENSOR_IQ4_XS = 23
+GGUF_TENSOR_I8 = 24
+GGUF_TENSOR_I16 = 25
+GGUF_TENSOR_I32 = 26
+GGUF_TENSOR_I64 = 27
+GGUF_TENSOR_F64 = 28
+GGUF_TENSOR_IQ1_M = 29
+GGUF_TENSOR_BF16 = 30
+
+def write_u8(f, v): f.write(struct.pack('<B', v))
+def write_u16(f, v): f.write(struct.pack('<H', v))
+def write_u32(f, v): f.write(struct.pack('<I', v))
+def write_i32(f, v): f.write(struct.pack('<i', v))
+def write_u64(f, v): f.write(struct.pack('<Q', v))
+def write_f32(f, v): f.write(struct.pack('<f', v))
+def write_f64(f, v): f.write(struct.pack('<d', v))
+def write_bool(f, v): f.write(struct.pack('<?', v))
+def write_string(f, s):
+    b = s.encode()
+    write_u64(f, len(b))
+    f.write(b)
+
+def write_gguf_value(f, typecode, value):
+    """Write a GGUF metadata value of the given type."""
+    if typecode == GGUF_TYPE_UINT32:
+        write_u32(f, value)
+    elif typecode == GGUF_TYPE_UINT64:
+        write_u64(f, value)
+    elif typecode == GGUF_TYPE_FLOAT32:
+        write_f32(f, value)
+    elif typecode == GGUF_TYPE_FLOAT64:
+        write_f64(f, value)
+    elif typecode == GGUF_TYPE_BOOL:
+        write_bool(f, value)
+    elif typecode == GGUF_TYPE_STRING:
+        write_string(f, value)
+    elif typecode == GGUF_TYPE_UINT8:
+        write_u8(f, value)
+    elif typecode == GGUF_TYPE_INT32:
+        write_i32(f, value)
+    elif typecode == GGUF_TYPE_ARRAY:
+        elem_type, arr = value
+        write_u32(f, elem_type)
+        write_u64(f, len(arr))
+        for v in arr:
+            write_gguf_value(f, elem_type, v)
+    else:
+        raise ValueError(f"Unknown GGUF type: {typecode}")
+
+# ── Layer mapping: GGUF name suffix -> HF safetensors name suffix ──
+LAYER_MAP = OrderedDict([
+    ("hc_attn_base.weight",             "hc_attn_base"),
+    ("hc_attn_fn.weight",               "hc_attn_fn"),
+    ("hc_attn_scale.weight",            "hc_attn_scale"),
+    ("hc_ffn_base.weight",              "hc_ffn_base"),
+    ("hc_ffn_fn.weight",                "hc_ffn_fn"),
+    ("hc_ffn_scale.weight",             "hc_ffn_scale"),
+    ("attn_sinks.weight",               "attn.attn_sink"),
+    ("attn_q_a.weight",                "attn.wq_a.weight"),
+    ("attn_q_b.weight",                "attn.wq_b.weight"),
+    ("attn_q_a_norm.weight",           "attn.q_norm.weight"),
+    ("attn_kv.weight",                 "attn.wkv.weight"),
+    ("attn_kv_a_norm.weight",          "attn.kv_norm.weight"),
+    ("attn_output_a.weight",           "attn.wo_a.weight"),
+    ("attn_output_b.weight",           "attn.wo_b.weight"),
+    ("attn_compressor_ape.weight",     "attn.compressor.ape"),
+    ("attn_compressor_kv.weight",      "attn.compressor.wkv.weight"),
+    ("attn_compressor_gate.weight",    "attn.compressor.wgate.weight"),
+    ("attn_compressor_norm.weight",    "attn.compressor.norm.weight"),
+    ("indexer.attn_q_b.weight",        "attn.indexer.wq_b.weight"),
+    ("indexer.proj.weight",            "attn.indexer.weights_proj.weight"),
+    ("indexer_compressor_ape.weight",  "attn.indexer.compressor.ape"),
+    ("indexer_compressor_kv.weight",   "attn.indexer.compressor.wkv.weight"),
+    ("indexer_compressor_gate.weight", "attn.indexer.compressor.wgate.weight"),
+    ("indexer_compressor_norm.weight", "attn.indexer.compressor.norm.weight"),
+    ("attn_norm.weight",               "attn_norm.weight"),
+    ("ffn_norm.weight",                "ffn_norm.weight"),
+    ("ffn_gate_shexp.weight",          "ffn.shared_experts.w1.weight"),
+    ("ffn_up_shexp.weight",            "ffn.shared_experts.w3.weight"),
+    ("ffn_down_shexp.weight",          "ffn.shared_experts.w2.weight"),
+    ("ffn_gate_inp.weight",            "ffn.gate.weight"),
+    ("exp_probs_b.bias",               "ffn.gate.bias"),
+    ("ffn_gate_tid2eid.weight",        "ffn.gate.tid2eid"),
+])
+
+# HF tensor name to scale companion
+HF_SCALE_SUFFIXES = {
+    ".weight": ".scale",
+    ".bias": ".scale",  # actually no, but some have scales
+}
+
+def hf_dtype_to_gguf(hf_dtype, tensor_name=""):
+    """Map HF safetensors dtype to GGUF template tensor type.
+
+    Simple rule: everything -> F32 (the quant policy handles 2D+ tensors via
+    --experts/--attention/--dense flags), except I64 -> I32 (tid2eid routing table).
+
+    F32 is safe because:
+    - 1D tensors (norms/scales/bias): policy does NOT apply, F32 preserved
+    - 2D tensors: policy applies and overrides F32 to the correct quant type
+    """
+    if hf_dtype == "I64":
+        return GGUF_TENSOR_I32
+    return GGUF_TENSOR_F32
+
+def main():
+    parser = argparse.ArgumentParser(description="Generate GGUF template for deepseek4-quantize")
+    parser.add_argument("--hf", required=True, help="HuggingFace model directory")
+    parser.add_argument("--out", required=True, help="Output GGUF template path")
+    parser.add_argument("--n-experts", type=int, default=256, help="Number of routed experts")
+    parser.add_argument("--n-layers", type=int, default=43, help="Number of layers")
+    parser.add_argument("--n-expert-used", type=int, default=4, help="Experts used per token")
+    args = parser.parse_args()
+
+    # ── 1. Read safetensors index ──
+    index_path = os.path.join(args.hf, "model.safetensors.index.json")
+    if not os.path.exists(index_path):
+        print(f"Error: {index_path} not found", file=sys.stderr)
+        sys.exit(1)
+    with open(index_path) as f:
+        index = json.load(f)
+
+    weight_map = index["weight_map"]
+    print(f"Found {len(weight_map)} tensors in index")
+
+    # Determine shapes by reading all safetensors headers
+    hf_shapes = {}  # hf_name -> (shape, dtype)
+    st_files = sorted(set(weight_map.values()))
+    print(f"Reading {len(st_files)} safetensors files for tensor metadata...")
+    for st_file in st_files:
+        st_path = os.path.join(args.hf, st_file)
+        with open(st_path, "rb") as f:
+            header_size = struct.unpack('<Q', f.read(8))[0]
+            header = json.loads(f.read(header_size).decode())
+        for name in header.keys() - {"__metadata__"}:  # optional non-tensor header entry
+            hf_shapes[name] = (tuple(header[name]["shape"]), header[name]["dtype"])
+
+    print(f"Parsed {len(hf_shapes)} tensor shapes. Writing template...")
+
+    # ── 2. Build metadata KVs ──
+    metadata = []
+
+    def add_u32(key, val):
+        metadata.append((key, GGUF_TYPE_UINT32, val))
+
+    def add_u64(key, val):
+        metadata.append((key, GGUF_TYPE_UINT64, val))
+
+    def add_f32(key, val):
+        metadata.append((key, GGUF_TYPE_FLOAT32, val))
+
+    def add_bool(key, val):
+        metadata.append((key, GGUF_TYPE_BOOL, val))
+
+    def add_str(key, val):
+        metadata.append((key, GGUF_TYPE_STRING, val))
+
+    def add_f64(key, val):
+        metadata.append((key, GGUF_TYPE_FLOAT64, val))
+
+    def add_arr(key, elem_type, val_list):
+        metadata.append((key, GGUF_TYPE_ARRAY, (elem_type, val_list)))
+
+    add_str("general.architecture", "deepseek4")
+    add_str("general.name", "DeepSeek V4 Flash 4Expert")
+    add_u32("general.alignment", 32)
+
+    # Pre-tokenizer
+    add_str("tokenizer.ggml.pre", "joyai-llm")
+
+    # Model config
+    add_u32("deepseek4.block_count", args.n_layers)
+    add_u64("deepseek4.context_length", 65536)
+    add_u32("deepseek4.embedding_length", 4096)
+    add_u32("deepseek4.vocab_size", 129280)
+    add_u32("deepseek4.attention.head_count", 64)
+    add_u32("deepseek4.attention.head_count_kv", 1)
+    add_u32("deepseek4.attention.key_length", 512)
+    add_u32("deepseek4.attention.value_length", 512)
+    add_u32("deepseek4.rope.dimension_count", 64)
+    add_u32("deepseek4.attention.q_lora_rank", 1024)
+    add_u32("deepseek4.attention.output_lora_rank", 1024)
+    add_u32("deepseek4.attention.output_group_count", 8)
+    add_u32("deepseek4.expert_count", args.n_experts)
+    add_u32("deepseek4.expert_used_count", args.n_expert_used)
+    add_u32("deepseek4.expert_feed_forward_length", 2048)
+    add_u32("deepseek4.expert_shared_count", 1)
+    add_u32("deepseek4.hash_layer_count", 3)
+    add_u32("deepseek4.attention.sliding_window", 128)
+    add_u32("deepseek4.attention.indexer.head_count", 64)
+    add_u32("deepseek4.attention.indexer.key_length", 128)
+    add_u32("deepseek4.attention.indexer.top_k", 512)
+    add_u32("deepseek4.hyper_connection.count", 4)
+    add_u32("deepseek4.hyper_connection.sinkhorn_iterations", 20)
+
+    add_f32("deepseek4.attention.layer_norm_rms_epsilon", 1.0e-6)
+    add_f32("deepseek4.hyper_connection.epsilon", 1.0e-6)
+    add_f32("deepseek4.expert_weights_scale", 1.5)
+    add_bool("deepseek4.expert_weights_norm", True)
+    add_f64("deepseek4.rope.freq_base", 10000.0)
+    add_f64("deepseek4.rope.scaling.factor", 16.0)
+    add_f32("deepseek4.rope.scaling.yarn_beta_fast", 32.0)
+    add_f32("deepseek4.rope.scaling.yarn_beta_slow", 1.0)
+    add_f64("deepseek4.attention.compress_rope_freq_base", 160000.0)
+    add_u64("deepseek4.rope.scaling.original_context_length", 65536)
+
+    # Compress ratios per layer (required array)
+    compress_ratios = []
+    for il in range(args.n_layers):
+        if il < 2:
+            compress_ratios.append(0)
+        elif il % 2 == 0:
+            compress_ratios.append(4)
+        else:
+            compress_ratios.append(128)
+    add_arr("deepseek4.attention.compress_ratios", GGUF_TYPE_UINT32, compress_ratios)
+
+    # SwiGLU clamp exponents (required float array per layer)
+    swiglu_clamp = [10.0] * args.n_layers
+    add_arr("deepseek4.swiglu_clamp_exp", GGUF_TYPE_FLOAT32, swiglu_clamp)
+
+    # ── Tokenizer data ──
+    with open(os.path.join(args.hf, "tokenizer_config.json"), encoding="utf-8") as f:
+        tok_config = json.load(f)
+    with open(os.path.join(args.hf, "tokenizer.json"), encoding="utf-8") as f:
+        tok_data = json.load(f)
+
+    model_type = tok_data.get("model", {}).get("type", "gpt2").lower()
+    add_str("tokenizer.ggml.model", model_type)
+
+    # Build vocabulary: base vocab (0..127999) + added tokens (128000+)
+    vocab_dict = tok_data.get("model", {}).get("vocab", {})
+    merges = tok_data.get("model", {}).get("merges", [])
+    added_tokens = tok_data.get("added_tokens", [])
+
+    # Sort by ID
+    base_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])
+    added_vocab = sorted(added_tokens, key=lambda x: x["id"])
+
+    # Combine: base vocab (IDs 0..127999), then added tokens (IDs 128000+)
+    all_tokens = [t for t, _ in base_vocab]
+    # Place each added token at its own ID (expected to fill 128000..129279)
+    for at in added_vocab:
+        tid = at["id"]
+        # Extend tokens list if needed
+        while len(all_tokens) <= tid:
+            all_tokens.append("")
+        all_tokens[tid] = at["content"]
+
+    add_arr("tokenizer.ggml.tokens", GGUF_TYPE_STRING, all_tokens)
+
+    # Scores: BPE tokens don't always have scores, so every token
+    # (special or not) gets 0.0 as the default.
+    scores = [0.0] * len(all_tokens)
+    add_arr("tokenizer.ggml.scores", GGUF_TYPE_FLOAT32, scores)
+
+    # Token types (llama.cpp numbering): 0=UNDEFINED, 1=NORMAL, 2=UNKNOWN, 3=CONTROL
+    token_types = [1] * len(all_tokens)
+    token_types[0] = 3   # BOS = control
+    token_types[1] = 3   # EOS = control
+    token_types[2] = 3   # PAD = control
+    add_arr("tokenizer.ggml.token_type", GGUF_TYPE_UINT32, token_types)
+
+    # Merges (BPE)
+    add_arr("tokenizer.ggml.merges", GGUF_TYPE_STRING, merges)
+
+    # Special token IDs
+    add_u32("tokenizer.ggml.bos_token_id", 0)
+    add_u32("tokenizer.ggml.eos_token_id", 1)
+    add_bool("tokenizer.ggml.add_bos_token", False)
+    add_bool("tokenizer.ggml.add_eos_token", False)
+
+    # Chat template
+    chat_template = tok_config.get("chat_template", "")
+    if chat_template:
+        add_str("tokenizer.chat_template", chat_template)
+
+    # ── 3. Build tensor list ──
+    tensor_infos = []  # (name, rank, dims_list, gguf_type)
+
+    # Top-level tensors
+    top_tensors = {
+        "token_embd.weight": "embed.weight",
+        "output.weight": "head.weight",
+        "output_norm.weight": "norm.weight",
+        "output_hc_base.weight": "hc_head_base",
+        "output_hc_fn.weight": "hc_head_fn",
+        "output_hc_scale.weight": "hc_head_scale",
+    }
+    for gguf_name, hf_name in top_tensors.items():
+        if hf_name in hf_shapes:
+            shape, hf_dtype = hf_shapes[hf_name]
+            dims = list(reversed(shape))  # GGUF uses reversed dims
+            gguf_type = hf_dtype_to_gguf(hf_dtype)
+            tensor_infos.append((gguf_name, len(dims), dims, gguf_type))
+
+    # Per-layer tensors
+    for layer in range(args.n_layers):
+        for gguf_suffix, hf_suffix in LAYER_MAP.items():
+            hf_name = f"layers.{layer}.{hf_suffix}"
+            if hf_name not in hf_shapes:
+                # Some tensors only exist in certain layers, so a name missing
+                # from the checkpoint is skipped rather than reported. Cases
+                # noted so far:
+                #   - ffn_gate_tid2eid.weight in layers >= 3
+                #   - compressor tensors in layers >= 3
+                #   - indexer tensors (different naming)
+                # All of them are handled the same way.
+                continue
+            shape, hf_dtype = hf_shapes[hf_name]
+            dims = list(reversed(shape))
+            tensor_infos.append((f"blk.{layer}.{gguf_suffix}", len(dims), dims,
+                                hf_dtype_to_gguf(hf_dtype)))
+
+    # Scale tensors
+    # Some HF weights are stored quantized with a ".scale" companion tensor.
+    # The template emits no descriptor for those companions: in GGUF naming
+    # the weight name is the base and the scale is embedded, so the quantizer
+    # handles scale merging during conversion and nothing needs to be added
+    # to tensor_infos here.
+
+    # Expert tensors (routed experts)
+    # GGUF shapes from ds4.c tensor_expect_routed_expert:
+    #   gate/up: [DS4_N_EMBD, DS4_N_FF_EXP, DS4_N_EXPERT] = [4096, 2048, 256]
+    #   down:    [DS4_N_FF_EXP, DS4_N_EMBD, DS4_N_EXPERT] = [2048, 4096, 256]
+    for layer in range(args.n_layers):
+        for expert_type, hf_pattern, dims in [
+            ("ffn_gate_exps.weight", "w1", [4096, 2048, args.n_experts]),
+            ("ffn_up_exps.weight",   "w3", [4096, 2048, args.n_experts]),
+            ("ffn_down_exps.weight", "w2", [2048, 4096, args.n_experts]),
+        ]:
+            tensor_infos.append((f"blk.{layer}.{expert_type}", len(dims), dims,
+                                GGUF_TENSOR_F16))  # policy will quantize
+
+    print(f"Total tensor descriptors: {len(tensor_infos)}")
+
+    # ── 4. Write GGUF file ──
+    with open(args.out, "wb") as f:
+        # Magic + Version
+        f.write(GGUF_MAGIC)
+        write_u32(f, GGUF_VERSION)
+        write_u64(f, len(tensor_infos))  # n_tensors
+        write_u64(f, len(metadata))      # n_kv
+
+        # Write metadata KVs
+        for key, typecode, val in metadata:
+            write_string(f, key)
+            write_u32(f, typecode)
+            write_gguf_value(f, typecode, val)
+
+        # Write tensor infos
+        for name, rank, dims, dtype in tensor_infos:
+            write_string(f, name)
+            write_u32(f, rank)
+            for d in dims:
+                write_u64(f, d)
+            write_u32(f, dtype)
+            write_u64(f, 0)  # offset (placeholder, quantizer will rewrite)
+
+        # Pad to alignment
+        pos = f.tell()
+        aligned = ((pos + 31) // 32) * 32
+        f.write(b'\0' * (aligned - pos))
+
+    size_mb = os.path.getsize(args.out) / (1024 * 1024)
+    print(f"\nTemplate written: {args.out} ({size_mb:.1f} MB)")
+    print("\nNow run:")
+    print("  ./gguf-tools/deepseek4-quantize \\")
+    print(f"    --hf {args.hf} \\")
+    print(f"    --template {args.out} \\")
+    print("    --out ds4flash-4expert.gguf \\")
+    print("    --experts q4_k \\")
+    print("    --attention-proj q8_0 \\")
+    print("    --shared q8_0 \\")
+    print("    --output q8_0 \\")
+    print("    --embedding f16 \\")
+    print("    --dense f16 \\")
+    print(f"    --n-experts {args.n_experts} \\")
+    print("    --overwrite")
+
+if __name__ == "__main__":
+    main()
diff --git a/test-4expert.sh b/test-4expert.sh
new file mode 100644
index 000000000..efbb3d53e
--- /dev/null
+++ b/test-4expert.sh
@@ -0,0 +1,93 @@
+#!/bin/bash
+# Test 4Expert PR end-to-end on a fresh Linux machine.
+#
+# Usage:
+#   bash test-4expert.sh                      # all defaults
+#   bash test-4expert.sh /path/to/weights 16   # custom weight dir and threads
+#
+# This script:
+#   1. Clones and builds ds4 (4expert branch)
+#   2. Downloads 4Expert safetensors (if not already present)
+#   3. Generates GGUF template from safetensors metadata
+#   4. Quantizes to Q4_K GGUF (~153 GiB output)
+#   5. Links GGUF and runs a test inference
+set -euo pipefail
+
+WEIGHTS_DIR="${1:-DeepSeek-V4-Flash-4Expert}"
+THREADS="${2:-$(nproc)}"
+OUT_GGUF="ds4flash-4expert.gguf"
+TEMPLATE="template.gguf"
+BRANCH="4expert"
+
+echo "============================================"
+echo " 4Expert PR End-to-End Test"
+echo "============================================"
+echo " weights dir : $WEIGHTS_DIR"
+echo " threads     : $THREADS"
+echo " output      : $OUT_GGUF"
+echo ""
+
+# ── Step 1: Clone and build ──
+if [ ! -f ds4.c ]; then
+    echo "==> [1/5] Cloning ds4 ($BRANCH branch) ..."
+    git clone https://github.com/yuhai-china/ds4
+    cd ds4
+    git checkout "$BRANCH"
+else
+    echo "==> [1/5] Already in ds4 repo, building ..."
+fi
+
+echo "     Building gguf-tools ..."
+make -C gguf-tools -j"$THREADS"
+echo "     Building ds4 (CPU) ..."
+make cpu -j"$THREADS"
+echo ""
+
+# ── Step 2: Download weights if needed ──
+if [ ! -f "$WEIGHTS_DIR/model.safetensors.index.json" ]; then
+    echo "==> [2/5] Downloading 4Expert safetensors ..."
+    python3 -m pip install -q huggingface_hub
+    python3 -c "
+import sys, huggingface_hub
+huggingface_hub.snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir=sys.argv[1])
+" "$WEIGHTS_DIR"
+else
+    echo "==> [2/5] Weights found at $WEIGHTS_DIR"
+fi
+echo ""
+
+# ── Step 3: Generate template ──
+if [ ! -f "$TEMPLATE" ]; then
+    echo "==> [3/5] Generating GGUF template ..."
+    python3 gguf-tools/gen_gguf_template.py --hf "$WEIGHTS_DIR" --out "$TEMPLATE"
+else
+    echo "==> [3/5] Template already exists: $TEMPLATE"
+fi
+echo ""
+
+# ── Step 4: Quantize ──
+echo "==> [4/5] Quantizing ($THREADS threads, ~153 GiB output) ..."
+./gguf-tools/deepseek4-quantize \
+    --hf "$WEIGHTS_DIR" \
+    --template "$TEMPLATE" \
+    --out "$OUT_GGUF" \
+    --experts q4_k \
+    --attention-proj q8_0 \
+    --attention f16 \
+    --shared q8_0 \
+    --output q8_0 \
+    --embedding f16 \
+    --dense f16 \
+    --threads "$THREADS"
+echo ""
+
+# ── Step 5: Test inference ──
+echo "==> [5/5] Running test inference ..."
+ln -sfn "$OUT_GGUF" ds4flash.gguf
+./ds4 -p "The weather is great today" -n 100
+
+echo ""
+echo "============================================"
+echo " Test complete. If you see output above,"
+echo " 4Expert support is working."
+echo "============================================"