# HydraQwen3.5-4B
Dual-head retrieval + generation adapter on Qwen/Qwen3.5-4B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.
Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (a causal ↔ bidirectional toggle on the full-attention layers), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, one VLM instance can serve both ColBERT-style late-interaction document retrieval and autoregressive generation without any generation training, at roughly half the peak VRAM of running the two models separately.
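The preservation argument can be seen in a toy linear layer: the LoRA contribution is a separate additive low-rank term, so skipping it reproduces the base output exactly. A minimal sketch (illustrative, not the paper's code):

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 16, 8, 4
W = torch.randn(d_out, d_in)      # frozen base weight, never modified
A = torch.randn(r, d_in) * 0.01   # LoRA down-projection
B = torch.randn(d_out, r)         # LoRA up-projection

def forward(x, adapter_on):
    y = x @ W.T                   # base path
    if adapter_on:
        y = y + (x @ A.T) @ B.T   # additive low-rank delta, kept separate from W
    return y

x = torch.randn(3, d_in)
y_base = x @ W.T
assert torch.equal(forward(x, adapter_on=False), y_base)     # bit-for-bit recovery
assert not torch.equal(forward(x, adapter_on=True), y_base)  # retrieval path differs
```

Because `W` is never mutated, the adapter-off path is the identical computation the base model performs, not an approximation of it.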
This 4B release instantiates the same mechanism at larger scale, trained on the full ~760K-example multilingual data mix.
- Paper: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
- Collection: Hydra: Dual-Head Retrieval and Generation
- Sister model: HydraQwen3.5-0.8B (smaller-scale Hydra)
- Demo: HydraQwen3.5-0.8B-demo (Gradio Space, 0.8B)
Paper-cited measurements for this adapter live in `results/`. Highlights:

| Measurement | Result | Artifacts |
|---|---|---|
| vs same-recipe baseline (22 tasks) | ΔnDCG@5 = +0.0005, paired t p = 0.89 | `retrieval/vs_baseline/` |
| Gen ANLS (DocVQA+ChartQA+InfoVQA+TextVQA, 15,301 samples) | max \|Δ\| = 0.0043 | `generation/4bench/` |
| Byte-identical greedy (5k DocVQA, T=0) | 71.08% full / 88.05% non-truncated | `generation/byte_identical/` |
| Mode-switching latency (50 iters, B200) | 6.4 ms round-trip | `efficiency/mode_switch/` |
| Bitwise weight equivalence | 426/426 LM tensors match | `generation_equivalence/` |
| Mode-switch VRAM | 10.77 GB peak (62.7% vs two-model) | `mode_switch_vram/` |
## Datasets
Trained on the concatenation of:
- `vidore/colpali_train_set`: ColPali-style query/document pairs (ArxivQA, DocVQA, InfoVQA, etc.)
- `openbmb/VisRAG-Ret-Train-Synthetic-data`: synthetic retrieval queries
- `openbmb/VisRAG-Ret-Train-In-domain-data`: in-domain retrieval queries
- `llamaindex/vdr-multilingual-train`: multilingual visual document retrieval (en/de/es/fr/it)
## Results: ViDoRe (MTEB) nDCG@5
### V1 (10 tasks)
| Task | nDCG@5 |
|---|---|
| VidoreArxivQARetrieval | 0.9193 |
| VidoreDocVQARetrieval | 0.6593 |
| VidoreInfoVQARetrieval | 0.9328 |
| VidoreShiftProjectRetrieval | 0.9133 |
| VidoreSyntheticDocQAAIRetrieval | 1.0000 |
| VidoreSyntheticDocQAEnergyRetrieval | 0.9776 |
| VidoreSyntheticDocQAGovernmentReportsRetrieval | 0.9742 |
| VidoreSyntheticDocQAHealthcareIndustryRetrieval | 0.9963 |
| VidoreTabfquadRetrieval | 0.8951 |
| VidoreTatdqaRetrieval | 0.8141 |
| avg (10/10 tasks) | 0.9082 |
### V2 (4 tasks)
| Task | nDCG@5 |
|---|---|
| Vidore2BioMedicalLecturesRetrieval | 0.5939 |
| Vidore2ESGReportsHLRetrieval | 0.7328 |
| Vidore2ESGReportsRetrieval | 0.5207 |
| Vidore2EconomicsReportsRetrieval | 0.4314 |
| avg (4/4 tasks) | 0.5697 |
### V3 (8 tasks)
| Task | nDCG@5 |
|---|---|
| Vidore3ComputerScienceRetrieval | 0.7279 |
| Vidore3EnergyRetrieval | 0.6682 |
| Vidore3FinanceEnRetrieval | 0.5860 |
| Vidore3FinanceFrRetrieval | 0.4694 |
| Vidore3HrRetrieval | 0.5162 |
| Vidore3IndustrialRetrieval | 0.4699 |
| Vidore3PharmaceuticalsRetrieval | 0.5958 |
| Vidore3PhysicsRetrieval | 0.4782 |
| avg (8/8 tasks) | 0.5639 |
All 22 tasks: nDCG@5 0.7215, nDCG@10 0.7410. Per-task JSONs in `results/evals/`.
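For reference, the nDCG@k metric reported above can be computed from a ranked relevance list with the standard log2 discount. A minimal sketch (the scores in this card come from the MTEB harness, not this snippet):

```python
import math

def dcg(rels):
    """Discounted cumulative gain with the standard log2(rank+1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=5):
    """DCG of the ranking's top-k, normalised by the ideal (sorted) top-k."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

# Binary relevance, the one relevant page retrieved at rank 2:
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))  # → 0.6309
```

A perfect ranking scores 1.0; placing the relevant page lower in the top-5 discounts the score logarithmically.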
## Generation equivalence
When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-4B. The audit in `scripts/test_gen_equivalence_4b.py` checks three invariants and a per-tensor state-dict comparison:

1. `adapter_config.json` has no gen-path modules (`lm_head`, `embed_tokens`) in `modules_to_save` (only the retrieval-side `custom_text_proj` is saved as a full module)
2. `adapter_model.safetensors` contains only LoRA A/B pairs and the `custom_text_proj` weight; nothing touches `lm_head` or `embed_tokens`
3. Every `language_model` weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly loaded base model (426/426 tensors match bitwise)
In `results/generation_equivalence/plot.png`, the left panel shows the three invariants (all pass) and the right panel shows the state-dict comparison across all language-model weight tensors.
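The state-dict comparison boils down to an exact per-tensor equality check. A minimal sketch with toy dictionaries (illustrative; `torch.equal` is an exact value comparison, with dtype and shape checked separately):

```python
import torch

def audit_state_dicts(hydra_sd, base_sd):
    """Split tensor names into exact matches vs mismatches against the base."""
    matched, mismatched = [], []
    for name, base_t in base_sd.items():
        t = hydra_sd[name]
        ok = (t.dtype == base_t.dtype
              and t.shape == base_t.shape
              and torch.equal(t, base_t))
        (matched if ok else mismatched).append(name)
    return matched, mismatched

# Toy example: cloned dicts pass 2/2, as the real audit reports 426/426.
sd = {"layers.0.weight": torch.randn(4, 4), "lm_head.weight": torch.randn(4, 4)}
matched, mismatched = audit_state_dicts({k: v.clone() for k, v in sd.items()}, sd)
assert len(matched) == 2 and mismatched == []
```

Any single perturbed element would move that tensor's name into `mismatched`, which is what makes the 426/426 result a bitwise claim rather than a tolerance-based one.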
## Mode-switching VRAM
The dual-head design means one process holds a single set of weights and toggles the LoRA in place; a conventional setup keeps two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (`scripts/bench_mode_switch_vram_4b.py`).
| Mode | Peak VRAM (GB) |
|---|---|
| Vanilla base generation | 9.46 |
| Hydra retrieval (LoRA on) | 9.57 |
| Hydra generation (LoRA off) | 10.77 |
| Two-model deployment (separate retriever + generator) | 28.85 |
Hydra peaks at 10.77 GB vs 28.85 GB for the two-model setup, a 62.7% reduction (18.08 GB saved).
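The per-mode peaks are read from CUDA's peak-allocation counter, reset before each phase. A minimal sketch of that style of measurement (hypothetical helper, not the benchmark script itself; returns None on CPU-only machines):

```python
import torch

def peak_vram_gb(fn):
    """Run fn and report its peak CUDA allocation in GB (None without a GPU)."""
    if not torch.cuda.is_available():
        fn()
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # zero the high-water mark
    fn()
    torch.cuda.synchronize()              # make sure all kernels finished
    return torch.cuda.max_memory_allocated() / 1024**3

# Usage against a real model, e.g.:
#   peak_vram_gb(lambda: model.generate(**inputs))
print(peak_vram_gb(lambda: torch.zeros(8)))
```

Resetting the counter between the retrieval and generation phases is what lets a single process report separate per-mode peaks rather than one cumulative maximum.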
## Training recipe
| Parameter | Value |
|---|---|
| Base | Qwen/Qwen3.5-4B |
| Loss | ColbertLoss, τ = 0.02 |
| LoRA r / α / dropout | 32 / 32 / 0.05 |
| `target_modules` | Qwen3.5 LM + MLP projections + `custom_text_proj` |
| `modules_to_save` | `["custom_text_proj"]` (full base weight saved, not LoRA delta) |
| Optimizer | adamw_torch |
| Scheduler | cosine, 2.5% warmup |
| Learning rate | 5e-5 |
| Effective batch | 252 (per-device 36, grad_accum 1) |
| Precision | bf16 |
| Epochs / steps | 1 / 3020 |
| Seed | 42 |
| Gradient checkpointing | on |
| Bidirectional attention | 8/32 full-attention layers |
| Embedding dim | 320 |
Loss curve: results/loss_curve.png.
## Usage

```python
import torch
import torch.nn as nn
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-4B"
ADAPTER = "athrael-soju/HydraQwen3.5-4B"

model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
del fcg
torch.cuda.empty_cache()

# Swap projection head to dim=320 before attaching the adapter
model.custom_text_proj = nn.Linear(2560, 320, bias=True, dtype=torch.bfloat16)
model.config.dim = 320

model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)

# Retrieval: adapter on (see train_hydra_4b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)` using the saved lm_head.
```
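At query time, retrieval scores the 320-dim multi-vector embeddings with ColBERT-style MaxSim: each query token takes its best match among the document tokens and the per-token scores are summed. A minimal sketch with random unit vectors (illustrative; in practice the colpali_engine processor provides a batched scorer over real embeddings):

```python
import torch
import torch.nn.functional as F

def maxsim_score(q, d):
    """Late-interaction score. q: (n_q, dim), d: (n_d, dim), rows L2-normalised."""
    sim = q @ d.T                        # (n_q, n_d) token-token similarities
    return sim.max(dim=1).values.sum().item()  # best doc token per query token

torch.manual_seed(0)
dim = 320                                # matches this adapter's embedding dim
q = F.normalize(torch.randn(12, dim), dim=-1)          # 12 query tokens
d = F.normalize(torch.randn(700, dim), dim=-1)         # ~700 page tokens
irrelevant = F.normalize(torch.randn(700, dim), dim=-1)

# A page containing the query tokens verbatim outscores an unrelated page:
d_with_q = torch.cat([q, d])
assert maxsim_score(q, d_with_q) > maxsim_score(q, irrelevant)
```

Because the score is a max over document tokens per query token, a page is rewarded for containing good matches anywhere on it, which is what makes late interaction robust for dense visual documents.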
## Environment

```shell
pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d
```
## Resume training

Full optimizer, scheduler, and RNG state are in `checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/`. Resume with the same WORLD_SIZE as the original run (one `rng_state_*.pth` was written per rank).

```shell
torchrun --nproc_per_node=<N> train_hydra_4b.py \
  --output-dir ./out --base-model Qwen/Qwen3.5-4B \
  --resume-from-checkpoint ./checkpoints/step-3000 --seed 42 \
  --text-proj-dim 320 --gradient-checkpointing
```
## Baseline comparator (matched single-head)
The `baseline/` folder holds a matched single-head, retrieval-only ColQwen3.5-4B trained with the identical recipe (LoRA r=32 / α=32 / dropout=0.05, 3020 steps of DocMatix, seed 123). It exists solely as the paired control for the retrieval claim in `results/retrieval/vs_baseline/`: with no architecture or hyperparameter changes, the dual-head modification adds no measurable retrieval cost.
| Comparison | All 22 tasks |
|---|---|
| Hydra-4B (dual-head) mean nDCG@5 | 0.7215 |
| Baseline (single-head) mean nDCG@5 | 0.7210 |
| Mean Δ | +0.0005 |
| Paired t-test | p = 0.89 |
| Wilcoxon | p = 1.00 |
| Cohen's d (paired) | 0.03 |
| 95% CI on Δ | [−0.006, +0.007] |
| Wins / Ties / Losses | 9 / 2 / 11 |
Contents: final adapter (step-3020), 5 resume checkpoints (step-1500..3020), 22 per-task ViDoRe JSONs under `baseline/results/eval__colqwen_baseline_r32_seed123/`, `train.log`, and the training script used to produce it.
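The paired analysis in `retrieval/vs_baseline/` tests the per-task deltas against zero. A stdlib-only sketch of the t statistic, with hypothetical deltas (not the paper's values; the shipped `analyze.py` also runs Wilcoxon and bootstraps the CI):

```python
import math
import statistics

def paired_t(deltas):
    """t statistic and degrees of freedom for H0: mean per-task delta == 0."""
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)          # sample std dev of the deltas
    t = mean / (sd / math.sqrt(n))         # one-sample t on the differences
    return t, n - 1

# Hypothetical per-task nDCG@5 deltas (dual-head minus single-head):
deltas = [0.004, -0.003, 0.001, 0.002, -0.002, 0.000, 0.001, -0.001]
t, df = paired_t(deltas)
print(round(t, 3), df)  # |t| well below any significance threshold
```

A |t| this small at 7 degrees of freedom corresponds to a large p-value, i.e. no detectable retrieval cost, which mirrors the p = 0.89 result in the table above.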
## Files
- `adapter_config.json`, `adapter_model.safetensors`: LoRA adapter
- `lm_head.pt`: base lm_head, for LoRA-off generation
- `config.json`: ColQwen3_5 config (records `dim=320`)
- `requirements.txt`: pinned dependencies
- `train_hydra_4b.py`: training entrypoint
- `scripts/test_gen_equivalence_4b.py`: LoRA-off vs base bitwise weight audit
- `scripts/bench_mode_switch_vram_4b.py`: peak VRAM across operating modes
- `baseline/`: matched single-head baseline (r=32, seed 123); see "Baseline comparator" section
  - `adapter_model.safetensors`, `adapter_config.json`, `config.json`, `lm_head.pt`
  - `checkpoints/step-{1500,2000,2500,3000,3020}/`: resume state
  - `results/eval__colqwen_baseline_r32_seed123/`: 22 per-task MTEB JSONs
  - `train.log`, `train_hydra_4b_v3.py`
- `results/`
  - `evals/`: per-task MTEB scored JSONs (V1 + V2 + V3) + `model_meta.json`
  - `retrieval/vs_baseline/`: paired stat analysis (baseline seed 123): `stat_analysis.json`, `analyze.py`
  - `generation/4bench/`: per-bench ANLS reports (DocVQA, ChartQA, InfoVQA, TextVQA) + `combined_summary.json`
  - `generation/byte_identical/`: greedy byte-identical audit + `summary.md`
  - `generation_equivalence/`: `report.json` + `plot.png` (bitwise weight audit)
  - `efficiency/mode_switch/`: `mode_switch_latency_report.json` (50-iter bench)
  - `mode_switch_vram/`: `report.json` + `plot.png`
  - `loss_curve.png`, `loss_raw.csv`
- `checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/`: full resume state (optimizer + scheduler + RNG × 7 + training_args)
## Citation

```bibtex
@misc{hydra4b,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/HydraQwen3.5-4B},
}
```