# HydraQwen3.5-4B
Dual-head retrieval + generation adapter on Qwen/Qwen3.5-4B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.
Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (a causal ↔ bidirectional toggle on the full-attention layers), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, one VLM instance can serve both ColBERT-style late-interaction document retrieval and autoregressive generation without any generation training, at roughly half the peak VRAM of running the two models separately.
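The preservation argument can be seen in a toy linear layer: the LoRA contribution is a separate additive low-rank term, so skipping it reproduces the base output exactly. A minimal sketch (illustrative, not the paper's code):

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 16, 8, 4
W = torch.randn(d_out, d_in)      # frozen base weight, never modified
A = torch.randn(r, d_in) * 0.01   # LoRA down-projection
B = torch.randn(d_out, r)         # LoRA up-projection

def forward(x, adapter_on):
    y = x @ W.T                   # base path
    if adapter_on:
        y = y + (x @ A.T) @ B.T   # additive low-rank delta, kept separate from W
    return y

x = torch.randn(3, d_in)
y_base = x @ W.T
assert torch.equal(forward(x, adapter_on=False), y_base)     # bit-for-bit recovery
assert not torch.equal(forward(x, adapter_on=True), y_base)  # retrieval path differs
```

Because `W` is never mutated, the adapter-off path is the identical computation the base model performs, not an approximation of it.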
This 4B release instantiates the same mechanism at larger scale, trained on the full ~760K-example multilingual data mix.
- Paper: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
- Collection: Hydra: Dual-Head Retrieval and Generation
- Sister model: HydraQwen3.5-0.8B (smaller-scale Hydra)
- Demo: HydraQwen3.5-0.8B-demo (Gradio Space, 0.8B)
Paper-cited measurements for this adapter live in `results/`. Highlights:

| Measurement | Result | Artifacts |
|---|---|---|
| vs same-recipe baseline (22 tasks) | ΔnDCG@5 = +0.0005, paired t p = 0.89 | `retrieval/vs_baseline/` |
| Gen ANLS (DocVQA+ChartQA+InfoVQA+TextVQA, 15,301 samples) | max \|Δ\| = 0.0043 | `generation/4bench/` |
| Byte-identical greedy (5k DocVQA, T=0) | 71.08% full / 88.05% non-truncated | `generation/byte_identical/` |
| Mode-switching latency (50 iters, B200) | 6.4 ms round-trip | `efficiency/mode_switch/` |
| Bitwise weight equivalence | 426/426 LM tensors match | `generation_equivalence/` |
| Mode-switch VRAM | 10.77 GB peak (62.7% vs two-model) | `mode_switch_vram/` |
## Datasets
Trained on the concatenation of:
- `vidore/colpali_train_set`: ColPali-style query/document pairs (ArxivQA, DocVQA, InfoVQA, etc.)
- `openbmb/VisRAG-Ret-Train-Synthetic-data`: synthetic retrieval queries
- `openbmb/VisRAG-Ret-Train-In-domain-data`: in-domain retrieval queries
- `llamaindex/vdr-multilingual-train`: multilingual visual document retrieval (en/de/es/fr/it)
## Results: ViDoRe (MTEB) nDCG@5
### V1 (10 tasks)
| Task | nDCG@5 |
|---|---|
| VidoreArxivQARetrieval | 0.9193 |
| VidoreDocVQARetrieval | 0.6593 |
| VidoreInfoVQARetrieval | 0.9328 |
| VidoreShiftProjectRetrieval | 0.9133 |
| VidoreSyntheticDocQAAIRetrieval | 1.0000 |
| VidoreSyntheticDocQAEnergyRetrieval | 0.9776 |
| VidoreSyntheticDocQAGovernmentReportsRetrieval | 0.9742 |
| VidoreSyntheticDocQAHealthcareIndustryRetrieval | 0.9963 |
| VidoreTabfquadRetrieval | 0.8951 |
| VidoreTatdqaRetrieval | 0.8141 |
| avg (10/10 tasks) | 0.9082 |
### V2 (4 tasks)
| Task | nDCG@5 |
|---|---|
| Vidore2BioMedicalLecturesRetrieval | 0.5939 |
| Vidore2ESGReportsHLRetrieval | 0.7328 |
| Vidore2ESGReportsRetrieval | 0.5207 |
| Vidore2EconomicsReportsRetrieval | 0.4314 |
| avg (4/4 tasks) | 0.5697 |
### V3 (8 tasks)
| Task | nDCG@5 |
|---|---|
| Vidore3ComputerScienceRetrieval | 0.7279 |
| Vidore3EnergyRetrieval | 0.6682 |
| Vidore3FinanceEnRetrieval | 0.5860 |
| Vidore3FinanceFrRetrieval | 0.4694 |
| Vidore3HrRetrieval | 0.5162 |
| Vidore3IndustrialRetrieval | 0.4699 |
| Vidore3PharmaceuticalsRetrieval | 0.5958 |
| Vidore3PhysicsRetrieval | 0.4782 |
| avg (8/8 tasks) | 0.5639 |
All 22 tasks: nDCG@5 0.7215, nDCG@10 0.7410. Per-task JSONs in `results/evals/`.
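For reference, the nDCG@k metric reported above can be computed from a ranked relevance list with the standard log2 discount. A minimal sketch (the scores in this card come from the MTEB harness, not this snippet):

```python
import math

def dcg(rels):
    """Discounted cumulative gain with the standard log2(rank+1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=5):
    """DCG of the ranking's top-k, normalised by the ideal (sorted) top-k."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

# Binary relevance, the one relevant page retrieved at rank 2:
print(round(ndcg_at_k([0, 1, 0, 0, 0]), 4))  # → 0.6309
```

A perfect ranking scores 1.0; placing the relevant page lower in the top-5 discounts the score logarithmically.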
## Generation equivalence
When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-4B. The audit in `scripts/test_gen_equivalence_4b.py` checks three invariants and a per-tensor state-dict comparison:

1. `adapter_config.json` has no gen-path modules (`lm_head`, `embed_tokens`) in `modules_to_save` (only the retrieval-side `custom_text_proj` is saved as a full module)
2. `adapter_model.safetensors` contains only LoRA A/B pairs and the `custom_text_proj` weight; nothing touches `lm_head` or `embed_tokens`
3. Every `language_model` weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly loaded base model (426/426 tensors match bitwise)
In `results/generation_equivalence/plot.png`, the left panel shows the three invariants (all pass) and the right panel shows the state-dict comparison across all language-model weight tensors.
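The state-dict comparison boils down to an exact per-tensor equality check. A minimal sketch with toy dictionaries (illustrative; `torch.equal` is an exact value comparison, with dtype and shape checked separately):

```python
import torch

def audit_state_dicts(hydra_sd, base_sd):
    """Split tensor names into exact matches vs mismatches against the base."""
    matched, mismatched = [], []
    for name, base_t in base_sd.items():
        t = hydra_sd[name]
        ok = (t.dtype == base_t.dtype
              and t.shape == base_t.shape
              and torch.equal(t, base_t))
        (matched if ok else mismatched).append(name)
    return matched, mismatched

# Toy example: cloned dicts pass 2/2, as the real audit reports 426/426.
sd = {"layers.0.weight": torch.randn(4, 4), "lm_head.weight": torch.randn(4, 4)}
matched, mismatched = audit_state_dicts({k: v.clone() for k, v in sd.items()}, sd)
assert len(matched) == 2 and mismatched == []
```

Any single perturbed element would move that tensor's name into `mismatched`, which is what makes the 426/426 result a bitwise claim rather than a tolerance-based one.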
## Mode-switching VRAM
The dual-head design means one process holds a single set of weights and toggles the LoRA in place; a conventional setup keeps two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (`scripts/bench_mode_switch_vram_4b.py`).
| Mode | Peak VRAM (GB) |
|---|---|
| Vanilla base generation | 9.46 |
| Hydra retrieval (LoRA on) | 9.57 |
| Hydra generation (LoRA off) | 10.77 |
| Two-model deployment (separate retriever + generator) | 28.85 |
Hydra peaks at 10.77 GB vs 28.85 GB for the two-model setup, a 62.7% reduction (18.08 GB saved).
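The per-mode peaks are read from CUDA's peak-allocation counter, reset before each phase. A minimal sketch of that style of measurement (hypothetical helper, not the benchmark script itself; returns None on CPU-only machines):

```python
import torch

def peak_vram_gb(fn):
    """Run fn and report its peak CUDA allocation in GB (None without a GPU)."""
    if not torch.cuda.is_available():
        fn()
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # zero the high-water mark
    fn()
    torch.cuda.synchronize()              # make sure all kernels finished
    return torch.cuda.max_memory_allocated() / 1024**3

# Usage against a real model, e.g.:
#   peak_vram_gb(lambda: model.generate(**inputs))
print(peak_vram_gb(lambda: torch.zeros(8)))
```

Resetting the counter between the retrieval and generation phases is what lets a single process report separate per-mode peaks rather than one cumulative maximum.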
## Training recipe
| Parameter | Value |
|---|---|
| Base | Qwen/Qwen3.5-4B |
| Loss | ColbertLoss, τ = 0.02 |
| LoRA r / α / dropout | 32 / 32 / 0.05 |
| `target_modules` | Qwen3.5 LM + MLP projections + `custom_text_proj` |
| `modules_to_save` | `["custom_text_proj"]` (full base weight saved, not LoRA delta) |
| Optimizer | adamw_torch |
| Scheduler | cosine, 2.5% warmup |
| Learning rate | 5e-5 |
| Effective batch | 252 (per-device 36, grad_accum 1) |
| Precision | bf16 |
| Epochs / steps | 1 / 3020 |
| Seed | 42 |
| Gradient checkpointing | on |
| Bidirectional attention | 8/32 full-attention layers |
| Embedding dim | 320 |
Loss curve: results/loss_curve.png.
## Usage

```python
import torch
import torch.nn as nn
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-4B"
ADAPTER = "athrael-soju/HydraQwen3.5-4B"

model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
del fcg
torch.cuda.empty_cache()

# Swap projection head to dim=320 before attaching the adapter
model.custom_text_proj = nn.Linear(2560, 320, bias=True, dtype=torch.bfloat16)
model.config.dim = 320

model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)

# Retrieval: adapter on (see train_hydra_4b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)` using the saved lm_head.
```
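At query time, retrieval scores the 320-dim multi-vector embeddings with ColBERT-style MaxSim: each query token takes its best match among the document tokens and the per-token scores are summed. A minimal sketch with random unit vectors (illustrative; in practice the colpali_engine processor provides a batched scorer over real embeddings):

```python
import torch
import torch.nn.functional as F

def maxsim_score(q, d):
    """Late-interaction score. q: (n_q, dim), d: (n_d, dim), rows L2-normalised."""
    sim = q @ d.T                        # (n_q, n_d) token-token similarities
    return sim.max(dim=1).values.sum().item()  # best doc token per query token

torch.manual_seed(0)
dim = 320                                # matches this adapter's embedding dim
q = F.normalize(torch.randn(12, dim), dim=-1)          # 12 query tokens
d = F.normalize(torch.randn(700, dim), dim=-1)         # ~700 page tokens
irrelevant = F.normalize(torch.randn(700, dim), dim=-1)

# A page containing the query tokens verbatim outscores an unrelated page:
d_with_q = torch.cat([q, d])
assert maxsim_score(q, d_with_q) > maxsim_score(q, irrelevant)
```

Because the score is a max over document tokens per query token, a page is rewarded for containing good matches anywhere on it, which is what makes late interaction robust for dense visual documents.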
## Environment

```shell
pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d
```
## Resume training

Full optimizer, scheduler, and RNG state are in `checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/`. Resume with the same WORLD_SIZE as the original run (one `rng_state_*.pth` was written per rank).

```shell
torchrun --nproc_per_node=<N> train_hydra_4b.py \
  --output-dir ./out --base-model Qwen/Qwen3.5-4B \
  --resume-from-checkpoint ./checkpoints/step-3000 --seed 42 \
  --text-proj-dim 320 --gradient-checkpointing
```
## Baseline comparator (matched single-head)
The `baseline/` folder holds a matched single-head, retrieval-only ColQwen3.5-4B trained with the identical recipe (LoRA r=32 / α=32 / dropout=0.05, 3020 steps of DocMatix, seed 123). It exists solely as the paired control for the retrieval claim in `results/retrieval/vs_baseline/`: with no architecture or hyperparameter changes, the dual-head modification adds no measurable retrieval cost.
| Comparison | All 22 tasks |
|---|---|
| Hydra-4B (dual-head) mean nDCG@5 | 0.7215 |
| Baseline (single-head) mean nDCG@5 | 0.7210 |
| Mean Δ | +0.0005 |
| Paired t-test | p = 0.89 |
| Wilcoxon | p = 1.00 |
| Cohen's d (paired) | 0.03 |
| 95% CI on Δ | [−0.006, +0.007] |
| Wins / Ties / Losses | 9 / 2 / 11 |
Contents: final adapter (step-3020), 5 resume checkpoints (step-1500..3020), 22 per-task ViDoRe JSONs under `baseline/results/eval__colqwen_baseline_r32_seed123/`, `train.log`, and the training script used to produce it.
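The paired analysis in `retrieval/vs_baseline/` tests the per-task deltas against zero. A stdlib-only sketch of the t statistic, with hypothetical deltas (not the paper's values; the shipped `analyze.py` also runs Wilcoxon and bootstraps the CI):

```python
import math
import statistics

def paired_t(deltas):
    """t statistic and degrees of freedom for H0: mean per-task delta == 0."""
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas)          # sample std dev of the deltas
    t = mean / (sd / math.sqrt(n))         # one-sample t on the differences
    return t, n - 1

# Hypothetical per-task nDCG@5 deltas (dual-head minus single-head):
deltas = [0.004, -0.003, 0.001, 0.002, -0.002, 0.000, 0.001, -0.001]
t, df = paired_t(deltas)
print(round(t, 3), df)  # |t| well below any significance threshold
```

A |t| this small at 7 degrees of freedom corresponds to a large p-value, i.e. no detectable retrieval cost, which mirrors the p = 0.89 result in the table above.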
## Files
- `adapter_config.json`, `adapter_model.safetensors`: LoRA adapter
- `lm_head.pt`: base lm_head, for LoRA-off generation
- `config.json`: ColQwen3_5 config (records `dim=320`)
- `requirements.txt`: pinned dependencies
- `train_hydra_4b.py`: training entrypoint
- `scripts/test_gen_equivalence_4b.py`: LoRA-off vs base bitwise weight audit
- `scripts/bench_mode_switch_vram_4b.py`: peak VRAM across operating modes
- `baseline/`: matched single-head baseline (r=32, seed 123); see "Baseline comparator" section
  - `adapter_model.safetensors`, `adapter_config.json`, `config.json`, `lm_head.pt`
  - `checkpoints/step-{1500,2000,2500,3000,3020}/`: resume state
  - `results/eval__colqwen_baseline_r32_seed123/`: 22 per-task MTEB JSONs
  - `train.log`, `train_hydra_4b_v3.py`
- `results/`
  - `evals/`: per-task MTEB scored JSONs (V1 + V2 + V3) + `model_meta.json`
  - `retrieval/vs_baseline/`: paired stat analysis (baseline seed 123): `stat_analysis.json`, `analyze.py`
  - `generation/4bench/`: per-bench ANLS reports (DocVQA, ChartQA, InfoVQA, TextVQA) + `combined_summary.json`
  - `generation/byte_identical/`: greedy byte-identical audit + `summary.md`
  - `generation_equivalence/`: `report.json` + `plot.png` (bitwise weight audit)
  - `efficiency/mode_switch/`: `mode_switch_latency_report.json` (50-iter bench)
  - `mode_switch_vram/`: `report.json` + `plot.png`
  - `loss_curve.png`, `loss_raw.csv`
- `checkpoints/step-{500,1000,1500,2000,2500,3000,3020}/`: full resume state (optimizer + scheduler + RNG × 7 + training_args)
## Citation

```bibtex
@misc{hydra4b,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/HydraQwen3.5-4B},
}
```