99 changes: 99 additions & 0 deletions docs/gguf-conversion.md
@@ -0,0 +1,99 @@
# GGUF Conversion Guide

Complete workflow for generating quantized GGUF files from DeepSeek V4 Flash safetensors weights.

## Quick Start (one command)

```bash
# From the ds4 repo root:
bash test-4expert.sh /path/to/DeepSeek-V4-Flash-4Expert $(nproc)
```

This generates `template.gguf`, runs the quantizer, and writes the final GGUF. The Manual Steps section below walks through each stage by hand, and How It Works explains what the tools do internally.

## Manual Steps

### 1. Build

```bash
# Linux:
make cpu -j$(nproc)
make -C gguf-tools -j$(nproc)
# macOS:
make -j$(sysctl -n hw.ncpu)
make -C gguf-tools -j$(sysctl -n hw.ncpu)
```

### 2. Generate GGUF Template

```bash
python3 gguf-tools/gen_gguf_template.py \
--hf /path/to/DeepSeek-V4-Flash-4Expert \
--out /tmp/template.gguf
```

The template (~5.6 MB) contains complete metadata, tokenizer data, and tensor descriptors. It does **not** contain weight data — only tensor names, shapes, and types.
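
To sanity-check a generated template before starting a long quantization run, you can read its fixed-size header. The sketch below assumes the standard little-endian GGUF v2/v3 header layout (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count). The function name `read_gguf_header` is made up for this illustration and is not part of gguf-tools.

```python
import struct
from typing import BinaryIO, NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Read the fixed-size GGUF header (assumes little-endian GGUF v2/v3)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    raw = stream.read(20)
    if len(raw) != 20:
        raise ValueError("truncated GGUF header")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return GGUFHeader(version, tensor_count, kv_count)
```

For example, `read_gguf_header(open("/tmp/template.gguf", "rb"))` should report a non-zero tensor count for a usable template.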

### 3. Quantize and Convert

```bash
./gguf-tools/deepseek4-quantize \
--hf /path/to/DeepSeek-V4-Flash-4Expert \
--template /tmp/template.gguf \
--out /path/to/output/model-q4k.gguf \
--experts q4_k \
--attention-proj q8_0 \
--attention f16 \
--shared q8_0 \
--output q8_0 \
--embedding f16 \
--dense f16 \
--n-experts 256 \
--threads 12
```

> **Note**: If the output file already exists, you must add `--overwrite` or the tool will error.

#### Quantization Options Reference

| Flag | Typical value | Description |
|------|--------------|-------------|
| `--experts` | `q4_k` | Routed experts MoE FFN (w1/w2/w3) |
| `--attention-proj` | `q8_0` | Attention projection matrices (q/kv/output_a/output_b) |
| `--attention` | `f16` | Other 2D attention/compressor/indexer tensors |
| `--shared` | `q8_0` | Shared expert FFN |
| `--output` | `q8_0` | Output projection (output.*) |
| `--embedding` | `f16` | Token embedding layer |
| `--dense` | `f16` | Remaining 2D+ tensors not matched above |
| `--n-experts` | from template | Number of routed experts (read from template metadata if omitted) |
| `--threads` | `8` | Parallel worker count |

### 4. Test

```bash
ln -sfn /path/to/output/model-q4k.gguf ds4flash.gguf
./ds4 -p "Hello" -n 100
```

## How It Works

### Template Generation (gen_gguf_template.py)

1. Reads `model.safetensors.index.json` for tensor names and shapes
2. Maps HF tensor names to GGUF names using the same `layer_map` as `deepseek4-quantize.c`
3. Assigns placeholder types: F32 for regular tensors and F16 for routed expert tensors. 1D tensors (norms, scales, biases) stay F32; the quantizer later overrides the type of every 2D+ tensor according to its policy.
4. Writes a GGUF file containing metadata + tokenizer + tensor descriptors
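
As a rough illustration of step 1, the sketch below reads tensor names, dtypes, and shapes from a single safetensors shard. It assumes the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header) and that shapes come from the shard headers. `gen_gguf_template.py` may obtain them differently, and `read_safetensors_shapes` is a hypothetical helper, not a function from the script.

```python
import json
import struct
from typing import BinaryIO, Dict, Tuple


def read_safetensors_shapes(stream: BinaryIO) -> Dict[str, Tuple[str, Tuple[int, ...]]]:
    """Return {tensor_name: (dtype, shape)} from a safetensors header."""
    # The file starts with the JSON header length as a little-endian uint64.
    (header_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(header_len).decode("utf-8"))
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }
```

Only the header is read, so this stays fast even for multi-gigabyte shards.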

### Quantizer (deepseek4-quantize)

1. Loads the template to obtain all tensor descriptors
2. For each tensor: determines the final type using the user-specified quantization policy
3. Reads safetensors weights, performs quantization, writes to the output GGUF
4. Produces a ready-to-use GGUF file
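
The type decision in step 2 can be pictured as a first-match lookup over tensor categories. The sketch below is illustrative only: the category keys mirror the command-line flags from the options table, but the name fragments in `CATEGORY_PATTERNS` are assumptions invented for this example. The real matching rules live in `deepseek4-quantize.c` and may differ.

```python
from typing import Dict, Sequence

# Hypothetical name fragments per category (assumed, not taken from the tool).
# Order matters: the first matching category wins.
CATEGORY_PATTERNS = [
    ("experts", ("_exps.",)),
    ("shared", ("_shexp.",)),
    ("attention-proj", ("attn_q", "attn_kv", "attn_output")),
    ("attention", ("attn_", "compressor", "indexer")),
    ("output", ("output.",)),
    ("embedding", ("token_embd.",)),
]

# Values from the example command in this guide.
POLICY = {
    "experts": "q4_k",
    "attention-proj": "q8_0",
    "attention": "f16",
    "shared": "q8_0",
    "output": "q8_0",
    "embedding": "f16",
    "dense": "f16",
}


def pick_type(name: str, shape: Sequence[int], policy: Dict[str, str]) -> str:
    """Choose the output type for one tensor."""
    if len(shape) < 2:
        return "f32"  # 1D tensors are never quantized
    for category, fragments in CATEGORY_PATTERNS:
        if any(fragment in name for fragment in fragments):
            return policy[category]
    return policy["dense"]  # fallback for remaining 2D+ tensors
```

Putting the more specific categories first keeps a projection tensor from falling through to the broader attention rule.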

## Notes

- **Regenerate the template** whenever the model's tensor set changes (step 2)
- **Type conversion**: `gen_gguf_template.py` automatically handles I64 → I32 conversion (for the `tid2eid` routing table)
- **1D tensors** (norms, scales, biases) are always stored as F32 and never quantized
- **Large model**: Q4_K output is approximately 153 GiB; ensure sufficient disk space
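
The I64 → I32 conversion mentioned above amounts to narrowing each value after checking that it fits. A minimal sketch follows, assuming little-endian raw tensor bytes; `narrow_i64_to_i32` is a hypothetical helper and not necessarily how `gen_gguf_template.py` implements the conversion.

```python
import struct


def narrow_i64_to_i32(raw: bytes) -> bytes:
    """Convert little-endian int64 values to little-endian int32, checking range."""
    if len(raw) % 8 != 0:
        raise ValueError("byte length is not a multiple of 8")
    count = len(raw) // 8
    values = struct.unpack(f"<{count}q", raw)
    for v in values:
        # Reject anything that would be silently truncated.
        if not -(2**31) <= v < 2**31:
            raise OverflowError(f"value {v} does not fit in int32")
    return struct.pack(f"<{count}i", *values)
```

Expert indices in a routing table are small, so the range check should never fire in practice; it is there to fail loudly rather than corrupt data.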
90 changes: 90 additions & 0 deletions docs/test-pr-on-linux.md
@@ -0,0 +1,90 @@
# Testing the 4Expert PR on a Fresh Linux Machine

## Quick Start (single command)

```bash
git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
bash test-4expert.sh
```

This runs all 5 steps:
1. Build ds4 + gguf-tools
2. Download 4Expert safetensors weights (~130 GiB)
3. Generate GGUF template from metadata
4. Quantize to Q4_K GGUF (~153 GiB output, ~30 min with 20 threads)
5. Link GGUF and run `./ds4 -p "..." -n 100`

After step 4 completes, the GGUF file is reusable — skip re-conversion on subsequent runs.
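
Since the download (~130 GiB) and the quantized output (~153 GiB) are both large, it can be worth checking free disk space before starting. The helper below is a small sketch using the standard library; `enough_free_space` is a name made up for this example and is not part of `test-4expert.sh`.

```python
import shutil

GIB = 1024 ** 3


def enough_free_space(directory: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `directory` has the requested free space."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gib * GIB
```

If the weights and the output share one filesystem, a call such as `enough_free_space(".", 130 + 153)` covers both.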

## Custom Paths

```bash
bash test-4expert.sh /path/to/existing/weights 16
```

- 1st arg: path to safetensors directory (skips download if `model.safetensors.index.json` exists)
- 2nd arg: number of threads (default: all cores)

## Pre-Quantized (skip conversion, download GGUF directly)

```bash
git clone https://github.com/yuhai-china/ds4 && cd ds4 && git checkout 4expert
make cpu -j$(nproc)
make -C gguf-tools -j$(nproc)

pip install -q huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('cloudyu/DeepSeek-V4-Flash-4Expert-GGUF', 'DeepSeek-V4-Flash-4Expert-Q4K.gguf', local_dir='.')
"

ln -sfn DeepSeek-V4-Flash-4Expert-Q4K.gguf ds4flash.gguf
./ds4 -p "Hello" -n 100
```

## Manual Steps (if the single-command script doesn't work)

### 1. Build

```bash
make -C gguf-tools -j$(nproc)
make cpu -j$(nproc)
```

### 2. Download weights

```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('cloudyu/DeepSeek-V4-Flash-4Expert', local_dir='./DeepSeek-V4-Flash-4Expert')
"
```

### 3. Generate template + quantize

```bash
python3 gguf-tools/gen_gguf_template.py \
--hf ./DeepSeek-V4-Flash-4Expert \
--out template.gguf

./gguf-tools/deepseek4-quantize \
--hf ./DeepSeek-V4-Flash-4Expert \
--template template.gguf \
--out ds4flash-4expert.gguf \
--experts q4_k \
--attention-proj q8_0 \
--attention f16 \
--shared q8_0 \
--output q8_0 \
--embedding f16 \
--dense f16 \
--threads $(nproc)
```

### 4. Test

```bash
ln -sfn ds4flash-4expert.gguf ds4flash.gguf
./ds4 -p "The weather is great today" -n 100
```
22 changes: 12 additions & 10 deletions ds4.c
@@ -189,7 +189,7 @@ static const ds4_shape DS4_SHAPE_FLASH = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -263,7 +263,7 @@ static ds4_shape g_ds4_shape = {
     .n_lora_q = 1024,
     .n_lora_o = 1024,
     .n_expert = 256,
-    .n_expert_used = 6,
+    .n_expert_used = 4,
     .n_expert_shared = 1,
     .n_ff_exp = 2048,
     .n_hash_layer = 3,
@@ -3756,15 +3756,17 @@ static void ds4_select_shape_from_metadata(
         uint32_t n_indexer_top_k,
         uint32_t n_hc,
         uint32_t n_hc_sinkhorn_iter) {
-    if (ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
-                                   n_layer, n_embd, n_vocab, n_head, n_head_kv,
-                                   n_head_dim, n_value_dim, n_rot, n_lora_q,
-                                   n_lora_o, n_out_group, n_expert,
-                                   n_expert_used, n_ff_exp, n_expert_shared,
-                                   n_hash_layer, n_swa, n_indexer_head,
-                                   n_indexer_head_dim, n_indexer_top_k, n_hc,
-                                   n_hc_sinkhorn_iter)) {
+    if ((n_expert_used == 4 || n_expert_used == 6) &&
+        ds4_shape_matches_metadata(&DS4_SHAPE_FLASH,
+                                   n_layer, n_embd, n_vocab, n_head, n_head_kv,
+                                   n_head_dim, n_value_dim, n_rot, n_lora_q,
+                                   n_lora_o, n_out_group, n_expert,
+                                   4, n_ff_exp, n_expert_shared,
+                                   n_hash_layer, n_swa, n_indexer_head,
+                                   n_indexer_head_dim, n_indexer_top_k, n_hc,
+                                   n_hc_sinkhorn_iter)) {
         g_ds4_shape = DS4_SHAPE_FLASH;
+        if (n_expert_used == 6) g_ds4_shape.n_expert_used = 6;
         return;
     }
     if (ds4_shape_matches_metadata(&DS4_SHAPE_PRO,