DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

Quantization

  • Dense/shared tensors: native DeepSeek FP8, represented as F8_E4M3_B128 in GGUF.
  • Routed MoE expert tensors: native DeepSeek FP4, represented as MXFP4 in the sidecar manifest (decoding sketched below).
  • Embeddings, output, norms, routing metadata, and IDs may remain BF16, F32, or I32 where appropriate.
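
A minimal decoding sketch for the MXFP4 side, assuming the OCP microscaling layout (32-element blocks, one shared E8M0 power-of-two scale byte plus 16 bytes of packed 4-bit E2M1 values). The 17-byte block size and low-nibble-first packing are assumptions, not a confirmed spec of this package:

import numpy as np

# The 16 representable E2M1 (FP4) values, indexed by the 4-bit code
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32)

def dequant_mxfp4_block(block: bytes) -> np.ndarray:
    """Decode one MXFP4 block: 1 E8M0 scale byte + 16 packed FP4 bytes."""
    assert len(block) == 17
    scale = 2.0 ** (block[0] - 127)      # E8M0 scale: 2^(e - 127)
    packed = np.frombuffer(block, dtype=np.uint8, offset=1)
    codes = np.empty(32, dtype=np.uint8)
    codes[0::2] = packed & 0x0F          # low nibble first (assumed order)
    codes[1::2] = packed >> 4
    return FP4_VALUES[codes] * scale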

The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.
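
A minimal sketch of how a runtime might locate one expert bank in the sidecar. The manifest schema used here (tensors, layer, offset, size fields) is hypothetical, for illustration only:

import json
import numpy as np
from pathlib import Path

sidecar = Path("sidecar")
manifest = json.loads((sidecar / "manifest.json").read_text())

# Hypothetical schema: each entry records which layer bank holds the tensor
# and at what byte range, so a single expert's weights can be streamed from
# SSD without reading the rest of the bank.
entry = manifest["tensors"][0]
bank = sidecar / f"layer_{entry['layer']:03d}.bin"
raw = np.memmap(bank, dtype=np.uint8, mode="r",
                offset=entry["offset"], shape=(entry["size"],))

Memory-mapping keeps cold experts on disk; only the pages for experts that are actually routed to get faulted in.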

Files

dense/
  model-dense.gguf
  flashmoe-package.json

sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin

Model Details

  • Architecture: deepseek4
  • Blocks: 43
  • Experts: 256
  • Active experts per token: 6
  • Context length: 1,048,576 (1M tokens)
  • Dense GGUF tensors: 1199
  • Routed expert sidecar entries: 129
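
These figures cross-check. A small sketch, assuming the usual three routed expert tensors per MoE block (gate, up, down), which is an inference from the counts rather than a documented fact:

blocks, experts, topk = 43, 256, 6
tensors_per_block = 3                     # gate/up/down: an assumption
assert blocks * tensors_per_block == 129  # matches the sidecar entry count
print(f"active experts per token: {topk}/{experts} = {topk/experts:.2%}")  # 2.34%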

Example

./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256

This package is not a standalone dense-only GGUF. Use a Flash-MoE aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
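
As a rough mental model of slot-bank mode (a conceptual sketch, not the actual llama.cpp implementation): a fixed number of resident slots caches routed expert weights, and a miss streams the expert in from the sidecar after evicting the least recently used slot. Under that reading, --moe-slot-bank 96 sizes the cache at 96 slots, and --moe-topk 6 matches the model's 6 active experts per token.

from collections import OrderedDict

class SlotBank:
    """Conceptual LRU slot bank for routed experts (illustrative only)."""

    def __init__(self, n_slots, load_expert):
        self.n_slots = n_slots          # e.g. 96, per --moe-slot-bank
        self.load_expert = load_expert  # callback: (layer, expert) -> weights
        self.slots = OrderedDict()      # (layer, expert) -> resident weights

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.slots:
            self.slots.move_to_end(key)     # hit: mark most recently used
            return self.slots[key]
        if len(self.slots) >= self.n_slots:
            self.slots.popitem(last=False)  # full: evict the LRU slot
        self.slots[key] = self.load_expert(layer, expert)  # stream from SSD
        return self.slots[key]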
