203 commits
3e7cd4f
Add Metal 4 M5 scaffold
ivanfioravanti May 10, 2026
c96c567
Improve Metal MPP diagnostics and safe defaults
ivanfioravanti May 10, 2026
1217d71
Tune Metal MPP defaults and thinking checkpoints
ivanfioravanti May 11, 2026
ff2d499
Improve Metal MPP prefill throughput
ivanfioravanti May 11, 2026
8664f51
Add low-power Metal MPP Q8 profile
ivanfioravanti May 12, 2026
7c1d873
feat(server): add /v1/messages/count_tokens endpoint
audreyt May 12, 2026
88f46a1
README.md: add usage instructions for swival.dev
jedisct1 May 12, 2026
38750e8
metal: add Apple M5 simdgroup_matrix matmul fast path
jedisct1 May 12, 2026
04d18a7
metal: use M5-private scratch buffers for hot intermediates
jedisct1 May 12, 2026
1ed09cf
Update README.md
jedisct1 May 12, 2026
18d0e43
metal: keep hazard tracking for private scratch buffers
jedisct1 May 12, 2026
7bf1ba2
Merge remote-tracking branch 'ivan/codex/metal4-m5-scaffold' into HEAD
audreyt May 13, 2026
aef8dfe
Merge remote-tracking branch 'swival/m5' into HEAD
audreyt May 13, 2026
ed5f13e
fix(metal): move FC_MUL_MM_M5_SGMATRIX off slot 702 to break MPP/M5 c…
audreyt May 12, 2026
0f4f366
* download cyberneurova-DeepSeek-V4-Flash-abliterated-IQ2XXS-w2Q2K-AP…
audreyt May 13, 2026
86a2e1b
test: refresh CyberNeurova fixtures
audreyt May 13, 2026
fb883f3
* re-add uncertainty_ablit_imatrix
audreyt May 13, 2026
ed0ffac
* mention directional steering
audreyt May 13, 2026
97aed45
fix(server): stabilize steered thinking generation
audreyt May 13, 2026
9dc5540
docs: refresh M5 benchmark table
audreyt May 13, 2026
e637613
fix(server): make seeded tool ids deterministic
audreyt May 13, 2026
c4613b9
Merge pull request #2 from audreyt/codex/deterministic-tool-ids
audreyt May 13, 2026
7f5f8a3
Update aligned imatrix steering and quantizer
audreyt May 13, 2026
132fdf6
Merge remote-tracking branch 'origin/main'
audreyt May 13, 2026
db68c34
Merge branch 'feat/count-tokens'
audreyt May 13, 2026
6f6027b
Merge branch 'feat/v1-responses'
audreyt May 13, 2026
cc4e65c
Point q2-imatrix at aligned variant
audreyt May 13, 2026
12ca86a
Clarify m5 comparison baseline as antirez/main
audreyt May 13, 2026
b7dcb2c
Label m5 prefill column as m5+MPP
audreyt May 13, 2026
87c6d3e
Add Metal 4 M5 scaffold
ivanfioravanti May 10, 2026
a50dd90
Improve Metal MPP diagnostics and safe defaults
ivanfioravanti May 10, 2026
e823fe2
Tune Metal MPP defaults and thinking checkpoints
ivanfioravanti May 11, 2026
f5363ab
Improve Metal MPP prefill throughput
ivanfioravanti May 11, 2026
77eafa2
Add low-power Metal MPP Q8 profile
ivanfioravanti May 12, 2026
0dd25e1
Add M5 Max drift-patch macro plumbing and --dump-logits tooling
ivanfioravanti May 13, 2026
670411d
Stabilize HC mixer sigmoid behind DS4_METAL_HC_STABLE (default on)
ivanfioravanti May 13, 2026
ae34183
Unify RMSNorm scale formula behind DS4_METAL_NORM_RSQRT_DISABLE (defa…
ivanfioravanti May 13, 2026
6240bdb
Add diagnostic DS4_METAL_KV_RAW_F32 to skip FP16 KV round-trip
ivanfioravanti May 13, 2026
a822317
Add diagnostic DS4_METAL_ROPE_EXP2_LOG2 RoPE angle path
ivanfioravanti May 13, 2026
a544c53
Fix DS4_METAL_TENSOR_MATMUL_DISABLE host dispatch
ivanfioravanti May 13, 2026
eeed77e
Default Metal Tensor Q8_0 matmul OFF on M5 Max
ivanfioravanti May 13, 2026
2dfac58
Add DS4_METAL_MATH_SAFE diagnostic to pin shader library to IEEE-754
ivanfioravanti May 13, 2026
fd7e9fa
Fix: F16 compressor Tensor matmul incorrectly coupled to Q8 default
ivanfioravanti May 13, 2026
08de0d4
Fix Q8 MPP kernel test: reference must take the legacy path
ivanfioravanti May 13, 2026
49c1137
Update README to match new M5 Tensor defaults and refreshed drift num…
ivanfioravanti May 13, 2026
ed5bbb9
Merge remote-tracking branch 'refs/remotes/origin/pr/15' into codex/s…
audreyt May 13, 2026
f818208
Merge pull request #3 from audreyt/codex/sync-pr-15-metal4-m5
audreyt May 13, 2026
7d08f3e
Fix Metal Tensor merge and refresh M5 benchmarks
audreyt May 13, 2026
1671cfb
Merge remote-tracking branch 'origin/responses-api'
audreyt May 14, 2026
54a91d5
Merge remote-tracking branch 'origin/main'
audreyt May 14, 2026
c2db366
test: refresh abliterated logprob fixture
audreyt May 14, 2026
45e45ca
Merge remote-tracking branch 'origin/main'
audreyt May 14, 2026
72f190e
docs: tune steering default for tool prompts
audreyt May 14, 2026
9a8271d
server: make directional steering tool-safe
audreyt May 14, 2026
6185263
Merge branch 'codex/tool-safe-steering'
audreyt May 14, 2026
b7c2305
server: make directional steering tool-safe
audreyt May 14, 2026
7f966fb
server: default steering to final-answer policy
audreyt May 14, 2026
e701844
Merge branch 'codex/tool-safe-steering'
audreyt May 14, 2026
91f485d
Merge remote-tracking branch 'origin/main'
audreyt May 15, 2026
6e7d92f
Merge remote-tracking branch 'origin/main'
audreyt May 15, 2026
32f5fca
Add Metal 4 M5 scaffold
ivanfioravanti May 10, 2026
2b16436
Improve Metal MPP diagnostics and safe defaults
ivanfioravanti May 10, 2026
8285710
Tune Metal MPP defaults and thinking checkpoints
ivanfioravanti May 11, 2026
0fc7f33
Improve Metal MPP prefill throughput
ivanfioravanti May 11, 2026
98ba58e
Add low-power Metal MPP Q8 profile
ivanfioravanti May 12, 2026
6e80bcd
Add M5 Max drift-patch macro plumbing and --dump-logits tooling
ivanfioravanti May 13, 2026
890654f
Stabilize HC mixer sigmoid behind DS4_METAL_HC_STABLE (default on)
ivanfioravanti May 13, 2026
862fdd5
Unify RMSNorm scale formula behind DS4_METAL_NORM_RSQRT_DISABLE (defa…
ivanfioravanti May 13, 2026
909394f
Add diagnostic DS4_METAL_KV_RAW_F32 to skip FP16 KV round-trip
ivanfioravanti May 13, 2026
5bbfaed
Add diagnostic DS4_METAL_ROPE_EXP2_LOG2 RoPE angle path
ivanfioravanti May 13, 2026
54bd72b
Fix DS4_METAL_TENSOR_MATMUL_DISABLE host dispatch
ivanfioravanti May 13, 2026
567c143
Default Metal Tensor Q8_0 matmul OFF on M5 Max
ivanfioravanti May 13, 2026
4ecfd1f
Add DS4_METAL_MATH_SAFE diagnostic to pin shader library to IEEE-754
ivanfioravanti May 13, 2026
6116a06
Fix: F16 compressor Tensor matmul incorrectly coupled to Q8 default
ivanfioravanti May 13, 2026
80b6edf
Fix Q8 MPP kernel test: reference must take the legacy path
ivanfioravanti May 13, 2026
0e87fb0
Update README to match new M5 Tensor defaults and refreshed drift num…
ivanfioravanti May 13, 2026
7f8a10c
Establish Metal Tensor prefill drift baseline
ivanfioravanti May 14, 2026
a310836
Tune routed MoE Tensor default window
ivanfioravanti May 14, 2026
e66caf2
Tune routed MoE down Tensor window
ivanfioravanti May 14, 2026
deacaac
Tune routed MoE gate up Tensor window
ivanfioravanti May 14, 2026
849cbcf
Document latest Tensor prefill candidate results
ivanfioravanti May 14, 2026
8448056
Record experimental MoE layout drift check
ivanfioravanti May 14, 2026
c792c9c
Document route-specific MoE Tensor sweep
ivanfioravanti May 14, 2026
fda92d5
Document dense Q8 Tensor prototype results
ivanfioravanti May 14, 2026
1c2dd84
Document attention output direct RHS check
ivanfioravanti May 14, 2026
f47c36b
Document wide F16 Tensor rejection
ivanfioravanti May 14, 2026
0f5f4c6
Document Tensor prefill baseline tooling
ivanfioravanti May 16, 2026
948ecc5
Fix Tensor drift test naming and vector path
ivanfioravanti May 16, 2026
d754d2f
Merge remote-tracking branch 'origin/main'
audreyt May 16, 2026
869fd6b
Merge updated PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@9…
audreyt May 16, 2026
2070e73
test: refresh local logprob fixture under strict MPP_OFF config
audreyt May 16, 2026
dea8e00
docs: dedup Metal 4 section and refresh M5 Max perf/drift numbers
audreyt May 16, 2026
2413fa4
Merge branch 'main' into codex/metal4-m5-scaffold
ivanfioravanti May 16, 2026
a6bb015
Tune routed MoE Tensor default window
ivanfioravanti May 16, 2026
0e20681
Merge PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@a6bb015)
audreyt May 16, 2026
604cdd1
docs: refresh README perf table under layer-40..42 MoE default
audreyt May 16, 2026
6abe9e6
docs: refresh README headline benchmark — honest post-PR-#15 numbers
audreyt May 16, 2026
262a026
docs: bench each fork against its own preferred IQ2XXS gguf
audreyt May 17, 2026
97227eb
Expand safe routed MoE Tensor window
ivanfioravanti May 17, 2026
25db703
Merge PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@97227eb)
audreyt May 17, 2026
e0e6109
Use private Metal scratch on M5
ivanfioravanti May 17, 2026
d000757
Merge PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@e0e6109)
audreyt May 17, 2026
1b46137
Fix CUDA object rebuild on config changes
audreyt May 17, 2026
a3c9b9d
Merge branch 'codex/rebuild-cuda-object-on-config-change'
audreyt May 17, 2026
36adc5b
docs: refresh M5 Max bench numbers post private-scratch and wider saf…
audreyt May 17, 2026
dfd94fa
Add Metal 4 M5 scaffold
ivanfioravanti May 10, 2026
97a3618
Improve Metal MPP diagnostics and safe defaults
ivanfioravanti May 10, 2026
b87f0e5
Tune Metal MPP defaults and thinking checkpoints
ivanfioravanti May 11, 2026
dc5cf8b
Improve Metal MPP prefill throughput
ivanfioravanti May 11, 2026
bec2e3f
Add low-power Metal MPP Q8 profile
ivanfioravanti May 12, 2026
31285fb
Add M5 Max drift-patch macro plumbing and --dump-logits tooling
ivanfioravanti May 13, 2026
05524f9
Stabilize HC mixer sigmoid behind DS4_METAL_HC_STABLE (default on)
ivanfioravanti May 13, 2026
e232d6b
Unify RMSNorm scale formula behind DS4_METAL_NORM_RSQRT_DISABLE (defa…
ivanfioravanti May 13, 2026
62a0587
Add diagnostic DS4_METAL_KV_RAW_F32 to skip FP16 KV round-trip
ivanfioravanti May 13, 2026
d408b50
Add diagnostic DS4_METAL_ROPE_EXP2_LOG2 RoPE angle path
ivanfioravanti May 13, 2026
f871eb6
Fix DS4_METAL_TENSOR_MATMUL_DISABLE host dispatch
ivanfioravanti May 13, 2026
ad83f09
Default Metal Tensor Q8_0 matmul OFF on M5 Max
ivanfioravanti May 13, 2026
025bb36
Add DS4_METAL_MATH_SAFE diagnostic to pin shader library to IEEE-754
ivanfioravanti May 13, 2026
a40e402
Fix: F16 compressor Tensor matmul incorrectly coupled to Q8 default
ivanfioravanti May 13, 2026
560d936
Fix Q8 MPP kernel test: reference must take the legacy path
ivanfioravanti May 13, 2026
65dfee8
Update README to match new M5 Tensor defaults and refreshed drift num…
ivanfioravanti May 13, 2026
fdd387b
Establish Metal Tensor prefill drift baseline
ivanfioravanti May 14, 2026
1538c21
Tune routed MoE Tensor default window
ivanfioravanti May 14, 2026
da969fb
Tune routed MoE down Tensor window
ivanfioravanti May 14, 2026
d19dff0
Tune routed MoE gate up Tensor window
ivanfioravanti May 14, 2026
abbfeb5
Document latest Tensor prefill candidate results
ivanfioravanti May 14, 2026
e3fdca8
Record experimental MoE layout drift check
ivanfioravanti May 14, 2026
61f85fb
Document route-specific MoE Tensor sweep
ivanfioravanti May 14, 2026
45ff978
Document dense Q8 Tensor prototype results
ivanfioravanti May 14, 2026
46a22c9
Document attention output direct RHS check
ivanfioravanti May 14, 2026
d6ecb31
Document wide F16 Tensor rejection
ivanfioravanti May 14, 2026
b958e76
Document Tensor prefill baseline tooling
ivanfioravanti May 16, 2026
274d309
Fix Tensor drift test naming and vector path
ivanfioravanti May 16, 2026
fa881b8
Tune routed MoE Tensor default window
ivanfioravanti May 16, 2026
6420524
Expand safe routed MoE Tensor window
ivanfioravanti May 17, 2026
46b5da3
Use private Metal scratch on M5
ivanfioravanti May 17, 2026
e1a4fa6
Document eval token-count drift gate
ivanfioravanti May 17, 2026
126651e
Move routed MoE up Tensor default to layer 37
ivanfioravanti May 17, 2026
f0bab5d
Lower routed MoE Tensor default layers
ivanfioravanti May 18, 2026
493c29c
Merge PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@f0bab5d)
audreyt May 18, 2026
3bbe235
Fix merge artifact: remove duplicate function and kernel definitions
audreyt May 18, 2026
5536a4c
Graft Q8 prefill profile/compare hooks and refresh M5 Max headline
audreyt May 18, 2026
0515e1c
Merge antirez/main
audreyt May 21, 2026
59a2ccf
Add Metal 4 M5 scaffold
ivanfioravanti May 10, 2026
04bc09a
Improve Metal MPP diagnostics and safe defaults
ivanfioravanti May 10, 2026
2239241
Tune Metal MPP defaults and thinking checkpoints
ivanfioravanti May 11, 2026
2fa510f
Improve Metal MPP prefill throughput
ivanfioravanti May 11, 2026
95762cf
Add low-power Metal MPP Q8 profile
ivanfioravanti May 12, 2026
5d549e9
Add M5 Max drift-patch macro plumbing and --dump-logits tooling
ivanfioravanti May 13, 2026
97d966e
Stabilize HC mixer sigmoid behind DS4_METAL_HC_STABLE (default on)
ivanfioravanti May 13, 2026
ef4b2cc
Unify RMSNorm scale formula behind DS4_METAL_NORM_RSQRT_DISABLE (defa…
ivanfioravanti May 13, 2026
4ac218f
Add diagnostic DS4_METAL_KV_RAW_F32 to skip FP16 KV round-trip
ivanfioravanti May 13, 2026
2562846
Add diagnostic DS4_METAL_ROPE_EXP2_LOG2 RoPE angle path
ivanfioravanti May 13, 2026
63a35db
Fix DS4_METAL_TENSOR_MATMUL_DISABLE host dispatch
ivanfioravanti May 13, 2026
b78ae9c
Default Metal Tensor Q8_0 matmul OFF on M5 Max
ivanfioravanti May 13, 2026
9f1380c
Add DS4_METAL_MATH_SAFE diagnostic to pin shader library to IEEE-754
ivanfioravanti May 13, 2026
5c6a460
Fix: F16 compressor Tensor matmul incorrectly coupled to Q8 default
ivanfioravanti May 13, 2026
779fa5a
Fix Q8 MPP kernel test: reference must take the legacy path
ivanfioravanti May 13, 2026
568ae1b
Update README to match new M5 Tensor defaults and refreshed drift num…
ivanfioravanti May 13, 2026
7455051
Establish Metal Tensor prefill drift baseline
ivanfioravanti May 14, 2026
374df30
Tune routed MoE Tensor default window
ivanfioravanti May 14, 2026
5814d0c
Tune routed MoE down Tensor window
ivanfioravanti May 14, 2026
38cce28
Tune routed MoE gate up Tensor window
ivanfioravanti May 14, 2026
941f7c4
Document latest Tensor prefill candidate results
ivanfioravanti May 14, 2026
7312587
Record experimental MoE layout drift check
ivanfioravanti May 14, 2026
650851b
Document route-specific MoE Tensor sweep
ivanfioravanti May 14, 2026
96aa8fc
Document dense Q8 Tensor prototype results
ivanfioravanti May 14, 2026
ad32365
Document attention output direct RHS check
ivanfioravanti May 14, 2026
eaba5b8
Document wide F16 Tensor rejection
ivanfioravanti May 14, 2026
3ecbf46
Document Tensor prefill baseline tooling
ivanfioravanti May 16, 2026
7d878db
Fix Tensor drift test naming and vector path
ivanfioravanti May 16, 2026
61345a1
Tune routed MoE Tensor default window
ivanfioravanti May 16, 2026
b84dd2d
Expand safe routed MoE Tensor window
ivanfioravanti May 17, 2026
56c8c80
Use private Metal scratch on M5
ivanfioravanti May 17, 2026
1da4fc7
Document eval token-count drift gate
ivanfioravanti May 17, 2026
068d8dd
Move routed MoE up Tensor default to layer 37
ivanfioravanti May 17, 2026
18e3190
Lower routed MoE Tensor default layers
ivanfioravanti May 18, 2026
b109d85
Add ds4-eval trace regrading
ivanfioravanti May 22, 2026
bd93bbf
Merge rebased PR #15 (Metal 4 M5 Tensor prefill, ivanfioravanti/ds4@b…
audreyt May 22, 2026
8a813e8
Merge upstream antirez/ds4
audreyt May 23, 2026
e819866
test: regenerate official.vec logprob vectors post-upstream short pre…
audreyt May 23, 2026
57c0693
Merge upstream antirez/ds4 (NAX + DSML repair)
claude May 23, 2026
683ff4d
Merge pull request #6 from audreyt/claude/antirez-upstream-sync-kTnuO
audreyt May 23, 2026
15e65d8
Merge remote-tracking branch 'origin/main'
audreyt May 23, 2026
ea4bf92
docs: refocus README on abliteration steering
audreyt May 23, 2026
02ebe9d
Merge remote-tracking branch 'origin/main'
audreyt May 23, 2026
ff88cd3
Merge remote-tracking branch 'origin/main'
audreyt May 23, 2026
7a13917
Merge remote-tracking branch 'origin/main'
audreyt May 27, 2026
1b62cf5
Merge remote-tracking branch 'origin/main'
audreyt May 27, 2026
97319db
Merge remote-tracking branch 'origin/main'
audreyt Jun 1, 2026
fd151f7
Merge remote-tracking branch 'origin/main'
audreyt Jun 19, 2026
96a6f14
Fix missing closing brace in server_apply_decode_directional_steering
audreyt Jun 19, 2026
d333fb6
fix: update regen script and regenerate official.vec
audreyt Jun 19, 2026
b59a1cf
Delete au-ai-pass-zh-tw-2026-06-12.md
audreyt Jun 21, 2026
d6fc639
Add guarded DSpark loader plumbing
audreyt Jun 29, 2026
9142703
Add DeepSpec nonseq draft-head schema gate
audreyt Jun 29, 2026
29eacb2
Add DeepSpec DSpark scaffold
audreyt Jun 29, 2026
9bbadcc
Add DSpark speculative draft runtime
audreyt Jun 30, 2026
4eb9d98
gitignore: ignore /logs/ and built tests/test_q4k_dot
audreyt Jun 30, 2026
e926dda
chore(clawpatch): removed local clawpatch scan artifacts
audreyt Jun 30, 2026
8f084c3
Make DSpark nonseq runtime ready
audreyt Jun 30, 2026
e5ae2b3
Merge DSpark nonseq runtime
audreyt Jun 30, 2026
4c0209d
Commit DSpark partial accepts from prefix checkpoints
audreyt Jun 30, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,6 +6,7 @@
 /ds4_native
 /ds4_server_test
 /ds4_test
+/tests/test_q4k_dot
 /ds4flash.gguf
 /TODO.md
 /gguf/
@@ -16,3 +17,4 @@ __pycache__/
 /misc/
 .*.swp
 .DS_Store
+/logs/
18 changes: 18 additions & 0 deletions AGENT.md
@@ -28,6 +28,24 @@ Objective-C only where Metal requires it and Metal kernels under `metal/`.
 - Avoid large CPU inference runs on macOS; the CPU path has previously exposed kernel VM failures with very large mappings.
 - Do not run multiple huge model processes concurrently. The instance lock is intentional.
 
+
+## Repository Maintenance
+
+- In this checkout, `origin` is the audreyt/ds4 fork. To chase antirez
+  upstream, fetch it explicitly:
+  `git fetch https://github.com/antirez/ds4.git main:refs/remotes/antirez/main`.
+- Compare and merge `antirez/main`; do not merge `origin/main` expecting
+  upstream changes.
+- Leave upstream-chase merge commits unpushed unless the user asks.
+- Preserve local README/MODEL_CARD benchmark numbers unless replaced by fresh
+  local measurements.
+- Treat scheduling, KV-cache lifetime, attention math, tokenizer behavior,
+  model shape, and tensor metadata conflicts as correctness-sensitive.
+- Keep CUDA/ROCm parity in view when upstream changes Metal logic.
+- Historical branch note: older M5 side experiments used Metal function
+  constant slot 703 after avoiding slot 702. Verify current branches before
+  reusing those slots.
+
 ## Layout
 
 - `ds4.c`: model loading, tokenizer, CPU reference code, Metal graph scheduling,
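The upstream-chase steps in the AGENT.md notes above can be sketched as a small POSIX shell function. This is an illustrative sketch, not a script that ships with the repository. `upstream_chase` is a hypothetical name. The upstream URL is passed as a parameter; in this checkout it would be `https://github.com/antirez/ds4.git`. The function assumes the upstream branch is `main`. It fetches upstream into a dedicated `refs/remotes/<name>/main` ref, reports how many upstream commits `HEAD` lacks, and merges them locally without pushing.

```sh
# Hypothetical helper sketching the upstream-chase workflow from AGENT.md.
# Usage: upstream_chase REPO_DIR UPSTREAM_URL NAME
# Fetches UPSTREAM_URL's main into refs/remotes/NAME/main (no `git remote add`
# needed), prints how many upstream commits HEAD is missing, then merges them.
# It never pushes; publishing the merge is left to the user.
upstream_chase() {
    repo=$1 url=$2 name=$3
    # Explicit refspec: upstream lands in its own remote-tracking ref, so
    # origin/main (the fork) is never mistaken for upstream.
    git -C "$repo" fetch -q "$url" "main:refs/remotes/$name/main" || return 1
    behind=$(git -C "$repo" rev-list --count "HEAD..$name/main") || return 1
    echo "$behind"
    if [ "$behind" -gt 0 ]; then
        # Merge into the local branch only; review before pushing anything.
        git -C "$repo" merge -q --no-edit "$name/main" >/dev/null || return 1
    fi
}
```

For example, `upstream_chase . https://github.com/antirez/ds4.git antirez` would merge `antirez/main` into the current branch and leave the result unpushed, as the notes ask.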
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -129,6 +129,17 @@ and background load when comparing two commits. For backend work, run at least
 one before/after CSV and compare both `prefill_tps` and `gen_tps`. Generation is
 greedy and skips EOS so each frontier gets the same number of generated tokens.
 
+When comparing this fork against antirez/main or another fork:
+
+- Use each fork's preferred quant only when the question is fork-vs-fork
+  behavior; use the same GGUF when isolating runtime or kernel changes.
+- Run baseline and candidate sequentially on the same machine. Do not run two
+  huge model processes concurrently.
+- Use the current sweep above unless the PR notes explicitly choose another
+  sweep. Do not copy old README tables or stale agent playbooks.
+- Report exact commits, model files, CSV paths, backend, hardware, and thermal
+  state.
+
 To generate a graph for a CSV:
 
 ```sh
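The CONTRIBUTING guidance to compare both `prefill_tps` and `gen_tps` between a baseline CSV and a candidate CSV can be sketched with `awk`. The sketch assumes the bench CSV is comma-separated, has a header row naming those columns, and contains no quoted fields. `csv_mean` and `compare_runs` are hypothetical helpers, not part of the repository. Averaging over every row of the sweep is only a coarse summary, so the per-frontier rows still deserve a look.

```sh
# Hypothetical helpers for a quick before/after summary of two bench CSVs.
# Assumes: comma-separated, header row on line 1, no quoted fields.

# csv_mean FILE COLUMN: mean of the named numeric column (exit 1 if absent/empty).
csv_mean() {
    awk -F, -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "" { sum += $c; n++ }
        END { if (!n) exit 1; printf "%.2f\n", sum / n }
    ' "$1"
}

# compare_runs BASE.csv CAND.csv: print each throughput column and its change.
compare_runs() {
    for col in prefill_tps gen_tps; do
        base=$(csv_mean "$1" "$col") || return 1
        cand=$(csv_mean "$2" "$col") || return 1
        awk -v c="$col" -v b="$base" -v k="$cand" \
            'BEGIN { printf "%s %.2f -> %.2f (%+.1f%%)\n", c, b, k, (k - b) / b * 100 }'
    done
}
```

Reporting both columns side by side keeps a prefill win from hiding a generation regression, which is the point of the "compare both" rule.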
39 changes: 28 additions & 11 deletions Makefile
@@ -13,12 +13,13 @@ OBJCFLAGS ?= -O3 -ffast-math $(DEBUG_FLAGS) $(NATIVE_CPU_FLAG) -Wall -Wextra -fo
 
 LDLIBS ?= -lm -pthread
 METAL_SRCS := $(wildcard metal/*.metal)
+CUDA_CONFIG := .ds4_cuda.config
 ROCM_SRCS := $(wildcard rocm/*.cuh)
 
 ifeq ($(UNAME_S),Darwin)
 METAL_LDLIBS := $(LDLIBS) -framework Foundation -framework Metal
-CORE_OBJS = ds4.o ds4_distributed.o ds4_ssd.o ds4_metal.o
-CPU_CORE_OBJS = ds4_cpu.o ds4_distributed.o ds4_ssd.o
+CORE_OBJS = ds4.o ds4_distributed.o ds4_ssd.o ds4_dspark_runtime.o ds4_metal.o
+CPU_CORE_OBJS = ds4_cpu.o ds4_distributed.o ds4_ssd.o ds4_dspark_runtime.o
 else
 CFLAGS += -D_GNU_SOURCE -fno-finite-math-only
 CUDA_HOME ?= /usr/local/cuda
@@ -28,8 +29,8 @@ ifneq ($(strip $(CUDA_ARCH)),)
 NVCC_ARCH_FLAGS := -arch=$(CUDA_ARCH)
 endif
 NVCCFLAGS ?= -O3 -g -lineinfo --use_fast_math $(NVCC_ARCH_FLAGS) -Xcompiler $(NATIVE_CPU_FLAG) -Xcompiler -pthread
-CORE_OBJS = ds4.o ds4_distributed.o ds4_ssd.o ds4_cuda.o
-CPU_CORE_OBJS = ds4_cpu.o ds4_distributed.o ds4_ssd.o
+CORE_OBJS = ds4.o ds4_distributed.o ds4_ssd.o ds4_dspark_runtime.o ds4_cuda.o
+CPU_CORE_OBJS = ds4_cpu.o ds4_distributed.o ds4_ssd.o ds4_dspark_runtime.o
 CUDA_LDLIBS ?= -lm -Xcompiler -pthread -L$(CUDA_HOME)/targets/sbsa-linux/lib -L$(CUDA_HOME)/lib64 -lcudart -lcublas
 HIPCC ?= $(shell command -v hipcc 2>/dev/null || echo /opt/rocm/bin/hipcc)
 ROCM_ARCH ?= gfx1151
@@ -40,7 +41,7 @@ DS4_LINK_LIBS ?= $(CUDA_LDLIBS)
 METAL_LDLIBS := $(LDLIBS)
 endif
 
-.PHONY: all help clean test cpu cuda cuda-spark cuda-generic cuda-regression strix-halo rocm
+.PHONY: all help clean test cpu cuda cuda-spark cuda-generic cuda-regression strix-halo rocm FORCE
 
 ifeq ($(UNAME_S),Darwin)
 all: ds4 ds4-server ds4-bench ds4-eval ds4-agent
@@ -106,7 +107,7 @@ cuda:
 
 strix-halo:
 $(MAKE) -B ds4 ds4-server ds4-bench ds4-eval ds4-agent \
-CORE_OBJS="ds4.o ds4_distributed.o ds4_ssd.o ds4_rocm.o" \
+CORE_OBJS="ds4.o ds4_distributed.o ds4_ssd.o ds4_dspark_runtime.o ds4_rocm.o" \
 CFLAGS="$(CFLAGS) -DDS4_ROCM_BUILD" \
 DS4_LINK="$(HIPCC) $(ROCM_CFLAGS)" \
 DS4_LINK_LIBS="$(ROCM_LDLIBS)"
@@ -137,13 +138,29 @@ cpu: ds4_cli_cpu.o ds4_server_cpu.o ds4_bench_cpu.o ds4_eval_cpu.o ds4_agent_cpu
 
 cuda-regression: tests/cuda_long_context_smoke
 ./tests/cuda_long_context_smoke
+
+$(CUDA_CONFIG): FORCE
+@tmp="$@.tmp"; \
+{ \
+printf '%s\n' "CUDA_ARCH=$(CUDA_ARCH)"; \
+printf '%s\n' "NVCC=$(NVCC)"; \
+printf '%s\n' "NVCCFLAGS=$(NVCCFLAGS)"; \
+} > "$$tmp"; \
+if test -r "$@" && cmp -s "$$tmp" "$@"; then \
+rm -f "$$tmp"; \
+else \
+mv "$$tmp" "$@"; \
+rm -f ds4_cuda.o; \
+fi
 endif
 
-ds4.o: ds4.c ds4.h ds4_ssd.h ds4_distributed.h ds4_gpu.h
+ds4.o: ds4.c ds4.h ds4_ssd.h ds4_distributed.h ds4_dspark_runtime.h ds4_gpu.h
 $(CC) $(CFLAGS) -c -o $@ ds4.c
 
 ds4_ssd.o: ds4_ssd.c ds4_ssd.h
 $(CC) $(CFLAGS) -c -o $@ ds4_ssd.c
 
+ds4_dspark_runtime.o: ds4_dspark_runtime.c ds4_dspark_runtime.h ds4.h
+$(CC) $(CFLAGS) -c -o $@ ds4_dspark_runtime.c
+
 ds4_cli.o: ds4_cli.c ds4.h ds4_ssd.h ds4_distributed.h ds4_help.h linenoise.h
 $(CC) $(CFLAGS) -c -o $@ ds4_cli.c
@@ -187,7 +204,7 @@ rax.o: rax.c rax.h rax_malloc.h
 linenoise.o: linenoise.c linenoise.h
 $(CC) $(CFLAGS) -c -o $@ linenoise.c
 
-ds4_cpu.o: ds4.c ds4.h ds4_ssd.h ds4_distributed.h ds4_gpu.h
+ds4_cpu.o: ds4.c ds4.h ds4_ssd.h ds4_distributed.h ds4_dspark_runtime.h ds4_gpu.h
 $(CC) $(CFLAGS) -DDS4_NO_GPU -c -o $@ ds4.c
 
 ds4_cli_cpu.o: ds4_cli.c ds4.h ds4_ssd.h ds4_distributed.h ds4_help.h linenoise.h
@@ -208,7 +225,7 @@ ds4_agent_cpu.o: ds4_agent.c ds4.h ds4_ssd.h ds4_distributed.h ds4_help.h ds4_kv
 ds4_metal.o: ds4_metal.m ds4_gpu.h $(METAL_SRCS)
 $(CC) $(OBJCFLAGS) -c -o $@ ds4_metal.m
 
-ds4_cuda.o: ds4_cuda.cu ds4_gpu.h ds4_iq2_tables_cuda.inc
+ds4_cuda.o: ds4_cuda.cu ds4_gpu.h ds4_iq2_tables_cuda.inc $(CUDA_CONFIG)
 $(NVCC) $(NVCCFLAGS) -c -o $@ ds4_cuda.cu
 
 ds4_rocm.o: ds4_rocm.cu ds4_gpu.h ds4_iq2_tables_cuda.inc $(ROCM_SRCS)
@@ -241,4 +258,4 @@ q4k-dot-test: tests/test_q4k_dot.c
 ./tests/test_q4k_dot
 
 clean:
-rm -f ds4 ds4-server ds4-bench ds4-eval ds4-agent ds4_cpu ds4_native ds4_server_test ds4_test ds4_agent_test tests/test_q4k_dot *.o tests/cuda_long_context_smoke tests/cuda_long_context_smoke.o
+rm -f ds4 ds4-server ds4-bench ds4-eval ds4-agent ds4_cpu ds4_native ds4_server_test ds4_test ds4_agent_test tests/test_q4k_dot *.o tests/cuda_long_context_smoke tests/cuda_long_context_smoke.o $(CUDA_CONFIG) $(CUDA_CONFIG).tmp
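The `$(CUDA_CONFIG)` rule added to the Makefile is a config-stamp pattern. It depends on `FORCE`, which is declared phony, so its recipe runs on every build. However, it only replaces `.ds4_cuda.config` and deletes `ds4_cuda.o` when `CUDA_ARCH`, `NVCC` or `NVCCFLAGS` differ from the recorded values. When nothing changed, the stamp file is left untouched. The function below is a hypothetical stand-alone mirror of that compare-and-replace recipe, so the logic can be exercised outside `make`. `update_stamp` and its arguments are illustrative names, not part of the build.

```sh
# Hypothetical function mirroring the $(CUDA_CONFIG) recipe's logic.
# update_stamp STAMP OBJECT LINE...
# Writes LINE... to STAMP.tmp; if STAMP is missing or differs, installs the
# new stamp and deletes OBJECT so the next build recompiles it. Otherwise the
# old stamp (and its timestamp) is kept. Prints "changed" or "unchanged".
update_stamp() {
    stamp=$1 obj=$2
    shift 2
    tmp="$stamp.tmp"
    printf '%s\n' "$@" > "$tmp"
    if test -r "$stamp" && cmp -s "$tmp" "$stamp"; then
        rm -f "$tmp"          # same configuration: leave the stamp untouched
        echo unchanged
    else
        mv "$tmp" "$stamp"    # new configuration: refresh the stamp...
        rm -f "$obj"          # ...and drop the stale object
        echo changed
    fi
}
```

Writing to a temporary file and comparing with `cmp -s` before `mv` is what keeps an unchanged configuration from bumping the stamp's timestamp.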