DeepSeek V4 Flash FP4/FP8 SSD Flash-MoE Package

This repository contains an SSD Flash-MoE package for DeepSeek V4 Flash. It is intended for runtimes that can load a dense GGUF plus a routed expert sidecar.

Quantization

  • Dense/shared tensors: native DeepSeek FP8, represented as F8_E4M3_B128 in GGUF.
  • Routed MoE expert tensors: native DeepSeek FP4, represented as MXFP4 in the sidecar manifest (decoding sketched below).
  • Embeddings, output, norms, routing metadata, and IDs may remain BF16, F32, or I32 where appropriate.
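
A minimal decoding sketch for the MXFP4 side, assuming the OCP microscaling layout (32-element blocks, one shared E8M0 power-of-two scale byte plus 16 bytes of packed 4-bit E2M1 values). The 17-byte block size and low-nibble-first packing are assumptions, not a confirmed spec of this package:

import numpy as np

# The 16 representable E2M1 (FP4) values, indexed by the 4-bit code
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32)

def dequant_mxfp4_block(block: bytes) -> np.ndarray:
    """Decode one MXFP4 block: 1 E8M0 scale byte + 16 packed FP4 bytes."""
    assert len(block) == 17
    scale = 2.0 ** (block[0] - 127)      # E8M0 scale: 2^(e - 127)
    packed = np.frombuffer(block, dtype=np.uint8, offset=1)
    codes = np.empty(32, dtype=np.uint8)
    codes[0::2] = packed & 0x0F          # low nibble first (assumed order)
    codes[1::2] = packed >> 4
    return FP4_VALUES[codes] * scale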

The routed expert tensors are not stored in the dense GGUF. They are stored in the sidecar as layer-major binary banks.
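
A minimal sketch of how a runtime might locate one expert bank in the sidecar. The manifest schema used here (tensors, layer, offset, size fields) is hypothetical, for illustration only:

import json
import numpy as np
from pathlib import Path

sidecar = Path("sidecar")
manifest = json.loads((sidecar / "manifest.json").read_text())

# Hypothetical schema: each entry records which layer bank holds the tensor
# and at what byte range, so a single expert's weights can be streamed from
# SSD without reading the rest of the bank.
entry = manifest["tensors"][0]
bank = sidecar / f"layer_{entry['layer']:03d}.bin"
raw = np.memmap(bank, dtype=np.uint8, mode="r",
                offset=entry["offset"], shape=(entry["size"],))

Memory-mapping keeps cold experts on disk; only the pages for experts that are actually routed to get faulted in.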

Files

dense/
  model-dense.gguf
  flashmoe-package.json

sidecar/
  manifest.json
  layer_000.bin
  ...
  layer_042.bin

Model Details

  • Architecture: deepseek4
  • Blocks: 43
  • Experts: 256
  • Active experts per token: 6
  • Context length: 1,048,576 (1M tokens)
  • Dense GGUF tensors: 1199
  • Routed expert sidecar entries: 129
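
These figures cross-check. A small sketch, assuming the usual three routed expert tensors per MoE block (gate, up, down), which is an inference from the counts rather than a documented fact:

blocks, experts, topk = 43, 256, 6
tensors_per_block = 3                     # gate/up/down: an assumption
assert blocks * tensors_per_block == 129  # matches the sidecar entry count
print(f"active experts per token: {topk}/{experts} = {topk/experts:.2%}")  # 2.34%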

Example

./build/bin/llama-cli \
  -m dense/model-dense.gguf \
  --moe-mode slot-bank \
  --moe-sidecar sidecar \
  --moe-slot-bank 96 \
  --moe-topk 6 \
  -ngl 999 \
  --moe-cache-io-split 4 \
  -c 8192 \
  -b 128 \
  -ub 1 \
  --no-warmup \
  -p "What is Apple Neural Engine?" \
  -n 256

This package is not a standalone dense-only GGUF. Use a Flash-MoE aware llama.cpp build that supports DeepSeek V4 Flash, slot-bank mode, and the native FP8/FP4 tensor types.
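
As a rough mental model of slot-bank mode (a conceptual sketch, not the actual llama.cpp implementation): a fixed number of resident slots caches routed expert weights, and a miss streams the expert in from the sidecar after evicting the least recently used slot. Under that reading, --moe-slot-bank 96 sizes the cache at 96 slots, and --moe-topk 6 matches the model's 6 active experts per token.

from collections import OrderedDict

class SlotBank:
    """Conceptual LRU slot bank for routed experts (illustrative only)."""

    def __init__(self, n_slots, load_expert):
        self.n_slots = n_slots          # e.g. 96, per --moe-slot-bank
        self.load_expert = load_expert  # callback: (layer, expert) -> weights
        self.slots = OrderedDict()      # (layer, expert) -> resident weights

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self.slots:
            self.slots.move_to_end(key)     # hit: mark most recently used
            return self.slots[key]
        if len(self.slots) >= self.n_slots:
            self.slots.popitem(last=False)  # full: evict the LRU slot
        self.slots[key] = self.load_expert(layer, expert)  # stream from SSD
        return self.slots[key]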
