Ollama refuses to load it...


Sadly... I got: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'

It seems that GGUF files for Qwen3-VL created directly with llama.cpp don't work when Ollama tries to load them...

Wow... I moved from Ollama to llama.cpp and it’s like stepping into a different universe. The model works great; it’s incredible and beyond words. Thanks!!!

My bad, I'm a bit late, and I'm glad you figured it out yourself!

Your Ollama might need to be updated; Qwen3-VL is a very recent architecture. But at this point, if you've moved to llama.cpp, you might as well keep using it: it tends to have all the cutting-edge features and runs pretty fast.
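For anyone landing here with the same error, here's a minimal sketch of what loading the GGUF could look like through the llama-cpp-python bindings, assuming a build recent enough to include the qwen3vl architecture. The file name and parameter values are placeholders, not this repo's exact file names, and this only covers text-only prompting (vision input additionally needs the separate mmproj file):

```python
# Minimal text-only sketch using llama-cpp-python (pip install llama-cpp-python).
# Model path and parameters are placeholders, not this repo's actual file names.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-VL-32B-Thinking-heretic.Q4_K_M.gguf",  # hypothetical local GGUF
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what MXFP4 quantization is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```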

Thanks! Your model got me to switch to llama.cpp, and there’s no going back. :)

One thing... Is there any chance you could apply the heretic procedure to gpt-oss-20b? The heretically ablated gpt-oss-20b models available on Hugging Face are broken beyond recognition. Your Qwen3-VL model performs extremely well, and I'm curious whether your approach could unlock gpt-oss-20b's potential. Pretty please?

Sorry, I noticed that you only quantized coder3101's previous Qwen3-VL-32B-Thinking-heretic. Please, forget my last message. ;)

Yeah, sorry, I don't have the compute to do more than just quantization. No problem.

Recently, coder3101 kindly created a heretic version of gpt-oss-20b that you can find at https://huggingface.co/coder3101/gpt-oss-20b-heretic ... you might want to quantize it... Pretty please? ;)

I probably can. I'll give it a go and leave it running today. I'll have to run some tests on it too.

But I have to warn you: quantizing GPT-OSS-20B is much more damaging than it is for other models. Beyond the practicality of having a GGUF that runs on llama.cpp-powered backends, I wouldn't recommend it.

Why? Well, the official GPT-OSS model is already a 4-bit model (MXFP4); it was trained directly in that format. So doing a GGUF pass is like quantizing it twice, and thrice with heretic on top. With the base model already in MXFP4, it's a lot less tolerant to format conversion. The person who ran heretic on it had to convert it back to BF16 (and despite being "larger", that's still an approximation of the actual MXFP4 values). Now I'd have to turn the thing back into a GGUF, and then quantize it down to something usable, memory-wise, making it an approximation of an approximation (which is usually a big no-no in this field).

That's why most quants you'll find behave weirdly or are plain unusable; GPT-OSS needs special care. Modern llama.cpp tools can convert the model while keeping its MXFP4 weights intact, which is what I would have done here, but given that coder3101 already converted and damaged the weights (I can only hope it was because it wasn't possible to run heretic on it otherwise), I can't undo that damage.
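To make the "approximation of an approximation" point concrete, here's a toy sketch. It is not the real MXFP4 or GGUF codecs, just two simplified symmetric 4-bit block quantizers with mismatched block widths (standing in for MXFP4's 32-element blocks and a GGUF k-quant's 256-element superblocks), showing how a second pass compounds the error of the first:

```python
# Toy illustration only: simplified symmetric block quantization, NOT the actual
# MXFP4 or GGUF formats. It shows why re-quantizing already-quantized weights
# (an "approximation of an approximation") hurts more than a single clean pass.
import numpy as np

def block_quantize(x, bits=4, block=32):
    """Fake block quantizer: one float scale per block of `block` values."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)               # dequantized approximation

def rms(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32 * 4096).astype(np.float32)  # fake weight tensor

pass1  = block_quantize(w, bits=4, block=32)       # stand-in for the original MXFP4 weights
single = block_quantize(w, bits=4, block=256)      # clean single pass at GGUF-like blocking
double = block_quantize(pass1, bits=4, block=256)  # re-quantizing already-quantized weights

print("RMS error, single 4-bit pass:", rms(w, single))
print("RMS error, double 4-bit pass:", rms(w, double))
```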

Given I'm already downloading the files, I'll attempt a few different quants, but don't expect miracles. Q4 will probably behave very badly; Q6 or Q8 might be usable, though.
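For reference, this is roughly the pipeline making those quants involves, sketched from Python for consistency with the examples above (in practice you'd just call the two llama.cpp tools from a shell). The directory name, output file names, binary path, and the list of quant types are placeholders:

```python
# Rough sketch of the GGUF conversion + quantization pipeline. Paths and quant
# types are placeholders; convert_hf_to_gguf.py and llama-quantize are the
# tools that ship with llama.cpp.
import subprocess

# 1. Convert the (already BF16) Hugging Face checkpoint to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", "gpt-oss-20b-heretic/",
     "--outfile", "gpt-oss-20b-heretic-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2. Re-quantize it down to memory-friendly sizes (each a further approximation).
for qtype in ["Q8_0", "Q6_K", "Q4_K_M"]:
    subprocess.run(
        ["./llama-quantize", "gpt-oss-20b-heretic-f16.gguf",
         f"gpt-oss-20b-heretic-{qtype}.gguf", qtype],
        check=True,
    )
```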

Thanks for the insightful answer. Now I see that GGUF's quantization blocks most likely don't align with MXFP4's original 32-element scaling blocks, and that's what's destroying coherence entirely...

Pretty much. I still made the quants. Q8 probably works okay; anything below will be degraded in different ways.

https://huggingface.co/SerialKicked/GPT-OSS-20B-Heretic-GGUF/

Cheers.
