---
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- phi4mm
- gguf
- quantized
- q4_k_m
- cpu
- ollama
- llama-cpp
language:
- multilingual
- ar
- zh
- cs
- da
- nl
- en
- fi
- fr
- de
- he
- hu
- it
- ja
- ko
- "no"
- pl
- pt
- ru
- es
- sv
- th
- tr
- uk
pipeline_tag: text-generation
---

# Phi-4 Multimodal Instruct — GGUF Quantizations

CPU-optimised GGUF quantizations of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) produced with [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Model Summary

Phi-4-multimodal-instruct is a **5.6 B parameter** lightweight multimodal foundation model by Microsoft that processes text, image, and audio inputs. The backbone LLM is **Phi-4-Mini** (3.8 B). This repository contains GGUF quantizations for local CPU and GPU deployment.

- **Context length:** 128 K tokens (131 072)
- **Architecture:** `phi3` (GGUF)
- **License:** [MIT](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/LICENSE) — © Microsoft Corporation

## Available Files

| File | Quant | Size | BPW | Best for |
|------|-------|------|-----|----------|
| `phi4-mm-Q4_K_M.gguf` | Q4_K_M | 2.37 GB | 5.18 | CPU inference, 8 GB RAM systems |
| `phi4-mm-Q8_0.gguf` | Q8_0 | 3.90 GB | 8.00 | GPU/high-RAM systems, near-lossless |
| `phi4-mm-f16.gguf` | F16 | 7.17 GB | 16.00 | Source / re-quantization base |

## Quantization Method

Quantized using `llama-quantize` from [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) (build 8334).

```bash
llama-quantize phi4-mm-f16.gguf phi4-mm-Q4_K_M.gguf Q4_K_M 16
```

Source weights converted from the original safetensors using `convert_hf_to_gguf.py`.

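
The conversion and quantization steps described in this section could be chained roughly as follows. This is a minimal sketch, not the script actually used for these files: the `LLAMA_CPP_DIR`, `HF_MODEL_DIR`, `THREADS`, and `DRY_RUN` variables and the `run` and `build_quants` helpers are hypothetical names, and the converter arguments (`--outfile`, `--outtype f16`) are an assumption, since the text above only states that `convert_hf_to_gguf.py` was used. By default the script only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Sketch of a convert-then-quantize pipeline. All paths are placeholders.
LLAMA_CPP_DIR="${LLAMA_CPP_DIR:-./llama.cpp}"                 # assumed llama.cpp checkout
HF_MODEL_DIR="${HF_MODEL_DIR:-./Phi-4-multimodal-instruct}"  # assumed local safetensors snapshot
THREADS="${THREADS:-16}"                                     # thread count passed to llama-quantize

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

build_quants() {
  # Step 1: safetensors -> F16 GGUF (arguments are an assumption).
  run python3 "$LLAMA_CPP_DIR/convert_hf_to_gguf.py" "$HF_MODEL_DIR" \
    --outfile phi4-mm-f16.gguf --outtype f16 || return 1

  # Step 2: F16 GGUF -> each quantization listed under Available Files.
  for q in Q4_K_M Q8_0; do
    run "$LLAMA_CPP_DIR/build/bin/llama-quantize" \
      phi4-mm-f16.gguf "phi4-mm-$q.gguf" "$q" "$THREADS" || return 1
  done
}

build_quants
```

Setting `DRY_RUN=0` executes the commands instead of printing them, once both paths point at a real llama.cpp build and a downloaded model snapshot.
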
## Recommended Usage

### Ollama (CPU — Intel NUC / low-power hardware)

```bash
# Pull and run (once uploaded to Ollama registry)
ollama run phi4-mm-nuc

# Or import from GGUF directly with a Modelfile:
ollama create phi4-mm-nuc -f Modelfile
ollama run phi4-mm-nuc
```

**Modelfile for 8 GB RAM / no GPU:**

```
FROM phi4-mm-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER num_thread 8
PARAMETER num_gpu 0
PARAMETER flash_attn false
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.1
SYSTEM "You are a helpful, accurate, and concise AI assistant."
```

### llama.cpp CLI (GPU, full quality)

```bash
./build/bin/llama-cli \
  -m phi4-mm-Q8_0.gguf \
  --ctx-size 65536 \
  --flash-attn on \
  --kv-offload \
  -ngl 99 \
  --threads 16
```

### OpenClaw Integration

In `~/.openclaw/openclaw.json`:

```json
{
  "agent": {
    "model": "ollama/phi4-mm-nuc"
  },
  "modelConfigs": {
    "ollama/phi4-mm-nuc": {
      "provider": "ollama",
      "model": "phi4-mm-nuc",
      "baseUrl": "http://localhost:11434"
    }
  }
}
```

## Hardware Notes

| Hardware | Recommended quant | Context (tokens) | Notes |
|---|---|---|---|
| Intel NUC 11th Gen, 8 GB RAM | Q4_K_M | 8 192 | CPU-only, `num_gpu 0` |
| Laptop / desktop, 16 GB RAM | Q5_K_M or Q8_0 | 16 384 | CPU or iGPU; Q5_K_M is not listed under Available Files and would need to be quantized from the F16 file |
| GPU with ≥ 8 GB VRAM | Q8_0 or F16 | 32 768–65 536 | Full `-ngl 99` offload |

## License

This repository redistributes quantized weights derived from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) under the original **MIT License**.

```
MIT License

Copyright (c) Microsoft Corporation.
```

The quantization tooling (llama.cpp) is also MIT licensed. See the [llama.cpp LICENSE](https://github.com/ggerganov/llama.cpp/blob/master/LICENSE).

## Attribution

- Original model: Microsoft Research
- Quantization: produced with [llama.cpp](https://github.com/ggerganov/llama.cpp) by Georgi Gerganov et al.
- Technical report: [arXiv:2503.01743](https://arxiv.org/abs/2503.01743)