Ollama refuses to load it...


Sadly... I got: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'

It seems that GGUF files for Qwen3-VL created directly with llama.cpp don't work when Ollama tries to load them...

Wow... I moved from Ollama to llama.cpp and it’s like stepping into a different universe. The model works great; it’s incredible and beyond words. Thanks!!!

My bad, I'm a bit late, and I'm glad you figured it out yourself!

Your Ollama might need to be updated; Qwen3-VL is a very recent architecture. But at this point, if you've moved to llama.cpp, you might as well keep using it: it tends to have all the cutting-edge features and runs pretty fast.
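For anyone landing here with the same error, here's a minimal sketch of what loading the GGUF could look like through the llama-cpp-python bindings, assuming a build recent enough to include the qwen3vl architecture. The file name and parameter values are placeholders, not this repo's exact file names, and this only covers text-only prompting (vision input additionally needs the separate mmproj file):

```python
# Minimal text-only sketch using llama-cpp-python (pip install llama-cpp-python).
# Model path and parameters are placeholders, not this repo's actual file names.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-VL-32B-Thinking-heretic.Q4_K_M.gguf",  # hypothetical local GGUF
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what MXFP4 quantization is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```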

Thanks! Your model got me to switch to llama.cpp, and there’s no going back. :)

One thing... Is there any chance you could apply the heretic procedure to gpt-oss-20b? The heretically ablated gpt-oss-20b models available on Hugging Face are broken beyond recognition. Your Qwen3-VL model performs extremely well, and I'm curious whether your approach could unlock gpt-oss-20b's potential. Pretty please?

Sorry, I noticed that you only quantized coder3101's previous Qwen3-VL-32B-Thinking-heretic. Please, forget my last message. ;)

Yeah, sorry, I don't have the compute to do more than just quantization. No problem.

Recently, coder3101 kindly created a heretic version of gpt-oss-20b that you can find at https://huggingface.co/coder3101/gpt-oss-20b-heretic ... you might want to quantize it... Pretty please? ;)

I probably can. I'll give it a go and leave it running today. I'll have to run some tests on it too.

But I have to warn you: quantizing GPT-OSS-20B is much more damaging than it is for other models. Beyond the practicality of having a GGUF that runs on llama.cpp-powered backends, I wouldn't recommend it.

Why? Well, the official GPT-OSS model is already a 4-bit model (MXFP4); it was trained directly in that format. So doing a GGUF pass is like quantizing it twice, and thrice with heretic on top. With the base model already in MXFP4, it's a lot less tolerant to format conversion. The person who ran heretic on it had to convert it back to BF16 (and despite being "larger", that's still an approximation of the actual MXFP4 values). Now I'd have to turn the thing back into a GGUF, and then quantize it down to something usable, memory-wise, making it an approximation of an approximation (which is usually a big no-no in this field).

That's why most quants you'll find behave weirdly or are plain unusable; GPT-OSS needs special care. Modern llama.cpp tools can convert the model while keeping its MXFP4 weights intact, which is what I would have done here, but given that coder3101 already converted and damaged the weights (I can only hope it was because it wasn't possible to run heretic on it otherwise), I can't undo that damage.
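To make the "approximation of an approximation" point concrete, here's a toy sketch. It is not the real MXFP4 or GGUF codecs, just two simplified symmetric 4-bit block quantizers with mismatched block widths (standing in for MXFP4's 32-element blocks and a GGUF k-quant's 256-element superblocks), showing how a second pass compounds the error of the first:

```python
# Toy illustration only: simplified symmetric block quantization, NOT the actual
# MXFP4 or GGUF formats. It shows why re-quantizing already-quantized weights
# (an "approximation of an approximation") hurts more than a single clean pass.
import numpy as np

def block_quantize(x, bits=4, block=32):
    """Fake block quantizer: one float scale per block of `block` values."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)               # dequantized approximation

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32 * 4096).astype(np.float32)  # fake weight tensor

pass1  = block_quantize(w, bits=4, block=32)       # stand-in for the original MXFP4 weights
single = block_quantize(w, bits=4, block=256)      # clean single pass at GGUF-like blocking
double = block_quantize(pass1, bits=4, block=256)  # re-quantizing already-quantized weights

print("RMS error, single 4-bit pass:", rms(w, single))
print("RMS error, double 4-bit pass:", rms(w, double))
```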

Given I'm already downloading the files, I'll attempt a few different quants, but don't expect miracles. Q4 will probably behave very badly; Q6 or Q8 might be usable, though.
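For reference, this is roughly the pipeline making those quants involves, sketched from Python for consistency with the examples above (in practice you'd just call the two llama.cpp tools from a shell). The directory name, output file names, binary path, and the list of quant types are placeholders:

```python
# Rough sketch of the GGUF conversion + quantization pipeline. Paths and quant
# types are placeholders; convert_hf_to_gguf.py and llama-quantize are the
# tools that ship with llama.cpp.
import subprocess

# 1. Convert the (already BF16) Hugging Face checkpoint to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "gpt-oss-20b-heretic/",
     "--outfile", "gpt-oss-20b-heretic-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Re-quantize it down to memory-friendly sizes (each a further approximation).
for qtype in ["Q8_0", "Q6_K", "Q4_K_M"]:
    subprocess.run(
        ["./llama-quantize", "gpt-oss-20b-heretic-f16.gguf",
         f"gpt-oss-20b-heretic-{qtype}.gguf", qtype],
        check=True,
    )
```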

Thanks for the insightful answer. Now I see that GGUF's quantization blocks most likely don't align with MXFP4's original 32-element scaling blocks, and that's what's destroying coherence entirely...

Pretty much. I still made the quants. Q8 probably works okay; anything below will be degraded in different ways.

https://huggingface.co/SerialKicked/GPT-OSS-20B-Heretic-GGUF/

Cheers.
