HydraQwen3.5-4B

Dual-head retrieval + generation adapter on Qwen/Qwen3.5-4B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.

Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (toggling the full-attention layers between causal and bidirectional), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, a single VLM instance serves both ColBERT-style late-interaction document retrieval and autoregressive generation, with no generation training and peak VRAM roughly half that of running the two models separately.
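The core invariant can be sketched in plain PyTorch (a minimal illustration, not the colpali_engine/PEFT implementation; `ToggleableLoRALinear` is a hypothetical class): the adapter is an additive low-rank delta on top of a frozen base layer, so switching it off recovers the base output exactly, with no weights merged or mutated.

```python
import torch
import torch.nn as nn

class ToggleableLoRALinear(nn.Module):
    """Linear layer with an additive LoRA delta that can be switched off.

    Sketch of the Hydra toggle: the base weight stays frozen, the adapter
    contributes (x A^T B^T) * scaling on top, and disabling it recovers
    the base layer's output exactly.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r
        self.enabled = True  # retrieval mode when True, generation mode when False

    def forward(self, x):
        y = self.base(x)
        if self.enabled:
            y = y + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return y

base = nn.Linear(16, 16)
w_before = base.weight.detach().clone()
layer = ToggleableLoRALinear(base)
layer.lora_B.data.normal_()          # pretend the adapter was trained

x = torch.randn(2, 16)
layer.enabled = False                # "generation mode": adapter off
assert torch.equal(layer(x), base(x))        # identical to the base layer
assert torch.equal(base.weight, w_before)    # base weights never touched
```

The real toggle additionally restores the causal attention mask on the full-attention layers; the additive-delta property above is what makes the weight side of the switch free.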

This 4B release is a larger-scale instantiation of the same mechanism, scaled to the full ~760K multilingual data mix.

Paper-cited measurements for this adapter live in results/. Highlights:

| Measurement | Result | Artifacts |
|---|---|---|
| vs same-recipe baseline (22 tasks) | ΔnDCG@5 = +0.0005, paired t p = 0.89 | retrieval/vs_baseline/ |
| Gen ANLS (DocVQA + ChartQA + InfoVQA + TextVQA, 15,301 samples) | max \|Δ\| = 0.0043 | generation/4bench/ |
| Byte-identical greedy (5k DocVQA, T=0) | 71.08% full / 88.05% non-truncated | generation/byte_identical/ |
| Mode-switching latency (50 iters, B200) | 6.4 ms round-trip | efficiency/mode_switch/ |
| Bitwise weight equivalence | 426/426 LM tensors match | generation_equivalence/ |
| Mode-switch VRAM | 10.77 GB peak (62.7% vs two-model) | mode_switch_vram/ |
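The latency number is a round-trip toggle time. The measurement pattern can be sketched as follows (a simplified stand-in for the benchmark behind efficiency/mode_switch/; the `toggle_on`/`toggle_off` callables are placeholders for the actual LoRA and attention-mode switches, which the real bench runs on a B200):

```python
import time

def time_round_trip(toggle_on, toggle_off, iters=50):
    """Median round-trip mode-switch latency in milliseconds.

    Each iteration switches into retrieval mode and back into generation
    mode; the median over `iters` samples smooths out warm-up noise.
    """
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        toggle_on()
        toggle_off()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    return samples[len(samples) // 2] * 1000.0
```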

Datasets

Trained on the concatenation of:

Results β€” ViDoRe (MTEB) nDCG@5

V1 (10 tasks)

| Task | nDCG@5 |
|---|---|
| VidoreArxivQARetrieval | 0.9193 |
| VidoreDocVQARetrieval | 0.6593 |
| VidoreInfoVQARetrieval | 0.9328 |
| VidoreShiftProjectRetrieval | 0.9133 |
| VidoreSyntheticDocQAAIRetrieval | 1.0000 |
| VidoreSyntheticDocQAEnergyRetrieval | 0.9776 |
| VidoreSyntheticDocQAGovernmentReportsRetrieval | 0.9742 |
| VidoreSyntheticDocQAHealthcareIndustryRetrieval | 0.9963 |
| VidoreTabfquadRetrieval | 0.8951 |
| VidoreTatdqaRetrieval | 0.8141 |
| **avg (10/10 tasks)** | **0.9082** |

V2 (4 tasks)

| Task | nDCG@5 |
|---|---|
| Vidore2BioMedicalLecturesRetrieval | 0.5939 |
| Vidore2ESGReportsHLRetrieval | 0.7328 |
| Vidore2ESGReportsRetrieval | 0.5207 |
| Vidore2EconomicsReportsRetrieval | 0.4314 |
| **avg (4/4 tasks)** | **0.5697** |

V3 (8 tasks)

| Task | nDCG@5 |
|---|---|
| Vidore3ComputerScienceRetrieval | 0.7279 |
| Vidore3EnergyRetrieval | 0.6682 |
| Vidore3FinanceEnRetrieval | 0.5860 |
| Vidore3FinanceFrRetrieval | 0.4694 |
| Vidore3HrRetrieval | 0.5162 |
| Vidore3IndustrialRetrieval | 0.4699 |
| Vidore3PharmaceuticalsRetrieval | 0.5958 |
| Vidore3PhysicsRetrieval | 0.4782 |
| **avg (8/8 tasks)** | **0.5639** |

All 22 tasks: nDCG@5 0.7215, nDCG@10 0.7410. Per-task JSONs in results/evals/.

Generation equivalence

When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-4B. The audit in scripts/test_gen_equivalence_4b.py checks three invariants and a per-tensor state-dict comparison:

  • adapter_config.json has no gen-path modules (lm_head, embed_tokens) in modules_to_save (only the retrieval-side custom_text_proj is saved as a full module)
  • adapter_model.safetensors contains only LoRA A/B pairs and the custom_text_proj weight; nothing touches lm_head or embed_tokens
  • Every language_model weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly-loaded base model (426/426 tensors match bitwise)

The left panel of results/generation_equivalence/plot.png shows the three invariants (all pass); the right panel shows the state-dict comparison across all language-model weight tensors.
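The tensor-level part of the audit is a per-key equality walk over two state dicts. A minimal sketch (not the actual scripts/test_gen_equivalence_4b.py, which also checks adapter_config.json and the safetensors key set; for bf16 weights without NaNs, elementwise equality coincides with a byte-for-byte match):

```python
import torch
import torch.nn as nn

def audit_state_dicts(sd_hydra, sd_base):
    """Count how many base-model tensors are exactly reproduced.

    Returns (matched, total) over the keys of the base state dict, where
    a match requires the tensor to exist under the same key and be
    elementwise identical.
    """
    matched = sum(
        1 for k in sd_base
        if k in sd_hydra and torch.equal(sd_hydra[k], sd_base[k])
    )
    return matched, len(sd_base)

# Toy check: a layer restored from a saved state dict matches 2/2 tensors
base = nn.Linear(8, 8)
clone = nn.Linear(8, 8)
clone.load_state_dict(base.state_dict())
m, n = audit_state_dicts(clone.state_dict(), base.state_dict())
assert (m, n) == (2, 2)  # weight + bias
```

The real audit reports 426/426 matches over the language_model tensors.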

Mode-switching VRAM

The dual-head design means one process holds one set of weights, toggling the LoRA in place. A conventional setup needs two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (scripts/bench_mode_switch_vram_4b.py).

| Mode | Peak VRAM (GB) |
|---|---|
| Vanilla base generation | 9.46 |
| Hydra retrieval (LoRA on) | 9.57 |
| Hydra generation (LoRA off) | 10.77 |
| Two-model deployment (separate retriever + generator) | 28.85 |

Hydra peaks at 10.77 GB vs 28.85 GB for the two-model setup, a 62.7% reduction (18.08 GB saved).

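Peak VRAM per mode can be measured with PyTorch's CUDA memory counters; a sketch of the methodology (the real scripts/bench_mode_switch_vram_4b.py runs the actual retrieval and generation workloads; `fn` here is a placeholder workload):

```python
import torch

def peak_vram_gb(fn):
    """Run `fn` and report the peak CUDA allocation in GB.

    Resets the allocator's peak statistics first so each mode is measured
    in isolation; returns None when no GPU is available.
    """
    if not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```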

Training recipe

| Parameter | Value |
|---|---|
| Base | Qwen/Qwen3.5-4B |
| Loss | ColbertLoss, τ = 0.02 |
| LoRA r / α / dropout | 32 / 32 / 0.05 |
| target_modules | Qwen3.5 LM + MLP projections + custom_text_proj |
| modules_to_save | ["custom_text_proj"] (full base weight saved, not LoRA delta) |
| Optimizer | adamw_torch |
| Scheduler | cosine, 2.5% warmup |
| Learning rate | 5e-5 |
| Effective batch | 252 (per-device 36, grad_accum 1) |
| Precision | bf16 |
| Epochs / steps | 1 / 3020 |
| Seed | 42 |
| Gradient checkpointing | on |
| Bidirectional attention | 8/32 full-attention layers |
| Embedding dim | 320 |

Loss curve: results/loss_curve.png.

Usage

```python
import torch
import torch.nn as nn
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-4B"
ADAPTER = "athrael-soju/HydraQwen3.5-4B"

model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
del fcg
torch.cuda.empty_cache()

# Swap the projection head to dim=320 before attaching the adapter
model.custom_text_proj = nn.Linear(2560, 320, bias=True, dtype=torch.bfloat16)
model.config.dim = 320

model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)

# Retrieval: adapter on (see train_hydra_4b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)` using the saved lm_head.
```

Environment

```shell
pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d
```

Resume training

Full optimizer, scheduler, and RNG state are in checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/. Resume with the same WORLD_SIZE as the original run (one per-rank rng_state_*.pth was written).

```shell
torchrun --nproc_per_node=<N> train_hydra_4b.py \
  --output-dir ./out --base-model Qwen/Qwen3.5-4B \
  --resume-from-checkpoint ./checkpoints/step-3000 --seed 42 \
  --text-proj-dim 320 --gradient-checkpointing
```

Baseline comparator (matched single-head)

The baseline/ folder holds a matched single-head, retrieval-only ColQwen3.5-4B trained with the identical recipe (LoRA r=32 / alpha=32 / dropout=0.05, 3020 steps of DocMatix, seed 123). It exists solely as the paired control for the retrieval claim in results/retrieval/vs_baseline/: with no architecture or hyperparameter changes, the dual-head modification adds no measurable retrieval cost.

| Comparison | All 22 tasks |
|---|---|
| Hydra-4B (dual-head) mean nDCG@5 | 0.7215 |
| Baseline (single-head) mean nDCG@5 | 0.7210 |
| Mean Δ | +0.0005 |
| Paired t-test | p = 0.89 |
| Wilcoxon | p = 1.00 |
| Cohen's d (paired) | 0.03 |
| 95% CI on Δ | [−0.006, +0.007] |
| Wins / Ties / Losses | 9 / 2 / 11 |
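The statistics above follow the standard paired-test recipe over per-task nDCG@5 scores. A sketch of the computation (illustrative only; the actual analysis lives in results/retrieval/vs_baseline/analyze.py):

```python
import numpy as np
from scipy import stats

def paired_comparison(hydra, baseline):
    """Paired significance tests over per-task nDCG@5 scores.

    Returns the mean per-task delta, paired t-test and Wilcoxon
    signed-rank p-values, and Cohen's d for paired samples
    (mean of deltas over their standard deviation).
    """
    hydra, baseline = np.asarray(hydra), np.asarray(baseline)
    delta = hydra - baseline
    t_p = stats.ttest_rel(hydra, baseline).pvalue
    w_p = stats.wilcoxon(delta).pvalue if np.any(delta) else 1.0
    d = delta.mean() / delta.std(ddof=1)
    return {"mean_delta": float(delta.mean()), "t_p": float(t_p),
            "wilcoxon_p": float(w_p), "cohens_d": float(d)}
```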

Contents: final adapter (step-3020), 5 resume checkpoints (step-1500..3020), 22 per-task Vidore JSONs under baseline/results/eval__colqwen_baseline_r32_seed123/, train.log, and the training script used to produce it.

Files

  • adapter_config.json, adapter_model.safetensors – LoRA
  • lm_head.pt – base lm_head, for LoRA-off generation
  • config.json – ColQwen3_5 config (records dim=320)
  • requirements.txt – pinned dependencies
  • train_hydra_4b.py – training entrypoint
  • scripts/
    • test_gen_equivalence_4b.py – LoRA-off vs base bitwise weight audit
    • bench_mode_switch_vram_4b.py – peak VRAM across operating modes
  • baseline/ – matched single-head baseline (r=32, seed 123); see "Baseline comparator" section
    • adapter_model.safetensors, adapter_config.json, config.json, lm_head.pt
    • checkpoints/step-{1500,2000,2500,3000,3020}/ – resume state
    • results/eval__colqwen_baseline_r32_seed123/ – 22 per-task MTEB JSONs
    • train.log, train_hydra_4b_v3.py
  • results/
    • evals/ – per-task MTEB scored JSONs (V1 + V2 + V3) + model_meta.json
    • retrieval/vs_baseline/ – paired stat analysis (baseline seed 123): stat_analysis.json, analyze.py
    • generation/4bench/ – per-bench ANLS reports (DocVQA, ChartQA, InfoVQA, TextVQA) + combined_summary.json
    • generation/byte_identical/ – greedy byte-identical audit + summary.md
    • generation_equivalence/ – report.json + plot.png (bitwise weight audit)
    • efficiency/mode_switch/ – mode_switch_latency_report.json (50-iter bench)
    • mode_switch_vram/ – report.json + plot.png
    • loss_curve.png, loss_raw.csv
  • checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/ – full resume state (optimizer + scheduler + RNG × 7 + training_args)

Citation

```bibtex
@misc{hydra4b,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/HydraQwen3.5-4B},
}
```