# DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package
This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.
## Quantization

- Dense/shared tensors: native DeepSeek FP8, represented as `F8_E4M3_B128` in GGUF.
- Routed MoE expert tensors: native DeepSeek FP4, represented as `MXFP4` in the sidecar manifest.
- Embeddings, output, norms, routing metadata, and IDs may remain `BF16`, `F32`, or `I32` where appropriate.
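As a rough size intuition (general bit-width arithmetic, not measured from this package): FP4 halves the per-weight storage of FP8 and quarters BF16, ignoring block-scale overhead.

```shell
# Back-of-envelope bytes-per-weight comparison (block-scale overhead ignored):
# BF16 = 16 bits, FP8 = 8 bits, FP4 = 4 bits per weight element.
awk 'BEGIN {
  printf "FP8 vs BF16: %.0f%%\n", 100 * 8 / 16;
  printf "FP4 vs BF16: %.0f%%\n", 100 * 4 / 16;
}'
```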
The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.
## Files

```
dense/
  model-dense.gguf
  flashmoe-package.json
sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin
```
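The sidecar holds one binary bank per block; with 43 blocks, the banks run `layer_000.bin` through `layer_042.bin`, zero-padded to three digits. A quick sketch (illustrative only, not a tool shipped with the package) to enumerate the expected filenames:

```shell
# Enumerate the expected sidecar bank filenames: one per block (43 total),
# with the layer index zero-padded to 3 digits.
for i in $(seq 0 42); do
  printf 'sidecar/layer_%03d.bin\n' "$i"
done | tail -n 2
# → sidecar/layer_041.bin
# → sidecar/layer_042.bin
```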
## Model Details

- Architecture: `deepseek4`
- Blocks: 43
- Experts: 256
- Active experts per token: 6
- Context length: 1048576
- Dense GGUF tensors: 1199
- Routed expert sidecar entries: 129
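With top-k routing selecting 6 of 256 experts, only a small fraction of the routed expert weights is touched per token in each MoE layer, which is what makes streaming the expert banks from SSD viable:

```shell
# 6 of 256 routed experts are active per token per MoE layer.
awk 'BEGIN { printf "%.2f%% of routed experts active per token\n", 100 * 6 / 256 }'
# → 2.34% of routed experts active per token
```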
## Example

```shell
./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256
```
This package is not a standalone dense-only GGUF; it requires a Flash-MoE-aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
## Base model

This package (`anemll/DeepSeek-V4-Flash-FP4-FP8-SSD`) is derived from `deepseek-ai/DeepSeek-V4-Flash`.