gemma3-4b-vqa-33k

Direct QLoRA VQA fine-tune on 33k Sinhala VQA samples (simple prompt). Best performing model in the study (Exp 2c).

Part of the research: Benchmarking and Adapting Compact Multimodal Models for Sinhala Visual Question Answering

  • Base model: google/gemma-3-4b-it
  • Experiment: Exp 2c — Direct VQA QLoRA (33k, simple prompt)
  • Training data: Siluni/sinhala-vqa-dataset (33k samples)
  • Method: QLoRA (4-bit NF4, LoRA rank 16, alpha 32)
  • Prompt: Simple Sinhala prompt (simple_p)
  • Note: Due to a labelling error, checkpoint filenames carry an incorrect sample count; the actual dataset size is 33k.
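
The LoRA settings above (rank 16, alpha 32) keep the trainable-parameter count tiny relative to the base weights. A back-of-the-envelope sketch, where the hidden size of 2560 is an illustrative assumption rather than a value taken from the model config:

```python
r = 16                 # LoRA rank, from the card
d = 2560               # hidden size -- illustrative assumption, check the config
full = d * d           # parameters in one full d x d projection matrix
lora = r * (d + d)     # extra parameters from the rank-r A and B factors
print(f"{lora:,} LoRA params per matrix, {lora / full:.2%} of the full matrix")
```

Each adapted projection trains at roughly 1% of its full size, which is a large part of why QLoRA fits on a single GPU.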

Loading

from transformers import AutoProcessor, Gemma3ForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
).eval()

processor = AutoProcessor.from_pretrained("Siluni/gemma3-4b-vqa-33k")
model = PeftModel.from_pretrained(base_model, "Siluni/gemma3-4b-vqa-33k").eval()
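
As a rough memory sanity check, the NF4 quantization configured above stores each base-model weight in half a byte; the ~4B parameter count below is an approximation, not an exact figure from the config:

```python
n_params = 4e9                  # ~4B parameters (approximate)
weight_bytes = n_params * 0.5   # NF4: 4 bits = 0.5 bytes per weight
gib = weight_bytes / 2**30
print(f"~{gib:.1f} GiB of 4-bit base weights")
```

Actual usage is higher once activations, the KV cache, and quantization overhead are included.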

Inference Example

from PIL import Image

# Load any image
image = Image.open("your_image.jpg").convert("RGB")
question = "රූපයේ ඇත්තේ කුමක්ද?"  # "What is in the image?"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text",  "text": question},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(model.device)

# Gemma-3 expects token_type_ids; add them if the processor did not
if "token_type_ids" not in inputs:
    inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

answer = processor.decode(output[0][input_len:], skip_special_tokens=True).strip()
print(answer)
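
The example above loads the image from disk, but the processor only needs an RGB PIL.Image, wherever it comes from. A minimal helper sketch (the name `to_rgb_image` is ours, not part of any library):

```python
import io
from PIL import Image

def to_rgb_image(source):
    """Return an RGB PIL.Image from a path, raw bytes, or an existing image."""
    if isinstance(source, Image.Image):
        img = source
    elif isinstance(source, (bytes, bytearray)):
        img = Image.open(io.BytesIO(source))
    else:
        img = Image.open(source)  # file path or file-like object
    return img.convert("RGB")
```

For example, an image fetched over HTTP can be passed as `to_rgb_image(requests.get(url).content)`.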

Citation

@misc{keerthiratne2025sinhalavqa,
  title        = {Benchmarking and Adapting Compact Multimodal Models for Sinhala Visual Question Answering},
  author       = {Keerthiratne, Siluni and Weerasinghe, Ruvan and Sumanathilaka, Deshan},
  year         = {2025},
  institution  = {Informatics Institute of Technology / Robert Gordon University},
}