Chaperone-Thinking-LQ-1.0

A domain-optimized reasoning model built on DeepSeek-R1-Distill-Qwen-32B and refined through a multi-stage pipeline of GPTQ quantization-aware training and QLoRA fine-tuning. It achieves 84% on MedQA (within 4 points of GPT-4o) in a ~20GB package that fits on a single NVIDIA L40 or L40S GPU.

Fully open-source under CC-BY-4.0.


Highlights

  • Base model: DeepSeek-R1-Distill-Qwen-32B (32B parameters)
  • Size reduction: ~60GB → ~20GB (4-bit GPTQ)
  • MedQA accuracy: 84% (GPT-4o: ~88%)
  • Hardware target: Runs on a single NVIDIA L40, L40S, or A100 GPU
  • License: CC-BY-4.0
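The ~60GB → ~20GB reduction follows from the bit-width arithmetic; here is a minimal sketch. The per-group metadata cost (one fp16 scale plus a packed zero-point per 128-weight group) is an illustrative assumption, not the exact on-disk layout of this checkpoint:

```python
def quantized_size_gb(n_params: float, bits: int, group_size: int = 128) -> float:
    """Approximate on-disk size of a per-group quantized model.

    Each weight stores `bits` bits; each group of `group_size` weights
    adds ~2.5 bytes of metadata (fp16 scale + packed zero-point).
    """
    weight_bytes = n_params * bits / 8
    meta_bytes = (n_params / group_size) * 2.5
    return (weight_bytes + meta_bytes) / 1e9

fp16_gb = 32e9 * 2 / 1e9            # 16-bit baseline: ~64 GB of raw weights
int4_gb = quantized_size_gb(32e9, 4)  # packed 4-bit + metadata: ~16.6 GB
```

The remaining gap to the quoted ~20GB comes from unquantized tensors (embeddings, norms) and file overhead, which this sketch omits.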

How We Built It

This model is not a simple quantization. It was produced through a four-stage pipeline:

| Stage | Method | What it does |
|-------|--------|--------------|
| 1. Quantization | 4-bit GPTQ | Compresses weights from ~60GB to ~20GB for efficient inference |
| 2. Quantization-aware training | GPTQ-based QAT with calibration | Minimizes accuracy loss during quantization by optimizing scale/zero-point parameters against a calibration dataset |
| 3. Domain fine-tuning | QLoRA | Adapts the quantized model on medical and scientific corpora, recovering and improving accuracy for domain-specific reasoning |
| 4. Transparency | Adaptive layer removal | Removes the identity adaptive layer so the model correctly attributes its foundational architecture to its original creators |
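The scale/zero-point parameters optimized in stage 2 can be illustrated with a toy per-tensor example. This is a plain min/max asymmetric quantizer in NumPy, not the actual GPTQ implementation (GPTQ works per-group with Hessian-weighted error compensation), but it shows what the scale and zero-point do:

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Asymmetric 4-bit quantization: map [w.min(), w.max()] onto integers [0, 15]."""
    qmin, qmax = 0, 15
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(np.round(-w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), float(scale), zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate fp32 weights from the 4-bit codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, s, z = quantize_4bit(w)
w_hat = dequantize(q, s, z)
err = float(np.abs(w - w_hat).max())  # worst-case error is on the order of one scale step
```

Calibration-based QAT, as used here, tunes `scale` and `zero_point` (and, via QLoRA, low-rank weight updates) so that this reconstruction error matters as little as possible on real activations.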

Benchmark Results

MedQA

| Model | Accuracy |
|-------|----------|
| Chaperone-Thinking-LQ-1.0 | 84% |
| GPT-4o | 88% |

Multi-Model Comparison

| Benchmark | DeepSeek-R1 | OpenAI-o1-1217 | DeepSeek-R1-32B | OpenAI-o1-mini | Chaperone-Thinking-LQ-1.0 |
|-----------|-------------|----------------|-----------------|----------------|---------------------------|
| AIME 2024 | 79.8 | 79.2 | 72.6 | 63.6 | 66.7 |
| GPQA Diamond | 71.5 | 75.7 | 62.1 | 60.0 | 56.7 |
| MATH-500 | 97.3 | 96.4 | 94.3 | 90.0 | 91.9 |
| MMLU | 90.8 | 91.8 | 87.4 | 85.2 | 85.9 |

Chaperone-Thinking-LQ-1.0 delivers competitive performance against full-precision frontier models at roughly one-third the model size.

Speed & Latency

| Metric | Chaperone-Thinking-LQ-1.0 | DeepSeek-R1-Distill-Qwen-32B |
|--------|---------------------------|------------------------------|
| Throughput | 36.86 tok/s | 22.84 tok/s |
| Latency p50 | 11.49 s | 20.10 s |
| Latency p95 | 13.06 s | 20.11 s |

1.6x higher throughput with ~43% lower median latency. Averages over 10 trials, concurrency=1, max_tokens=512, temperature=0.
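A measurement harness along these lines reproduces how such p50/p95 figures are computed (this is a hypothetical sketch, not the exact benchmark script; `dummy_generate` stands in for a real inference call and its sleep/return values are placeholders):

```python
import statistics
import time

def benchmark(generate, prompt: str, trials: int = 10, max_tokens: int = 512):
    """Measure per-request latency and decode throughput at concurrency=1."""
    latencies, throughputs = [], []
    for _ in range(trials):
        t0 = time.perf_counter()
        n_tokens = generate(prompt, max_tokens=max_tokens)  # returns tokens produced
        dt = time.perf_counter() - t0
        latencies.append(dt)
        throughputs.append(n_tokens / dt)
    percentiles = statistics.quantiles(latencies, n=20)  # 5% steps
    return {
        "throughput_tok_s": statistics.mean(throughputs),
        "latency_p50": statistics.median(latencies),
        "latency_p95": percentiles[18],  # 19th of 19 cut points = 95th percentile
    }

def dummy_generate(prompt: str, max_tokens: int = 512) -> int:
    """Stub standing in for a real model call."""
    time.sleep(0.01)  # pretend inference time
    return max_tokens

stats = benchmark(dummy_generate, "Explain GPTQ in one paragraph.", trials=10)
```

In the real setting, `generate` would call the deployed model with temperature=0 and the reported numbers would be averaged as described above.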


Model Details

| Detail | Value |
|--------|-------|
| Base model | DeepSeek-R1-Distill-Qwen-32B |
| Parameters | 32 billion |
| Quantization | 4-bit GPTQ |
| Fine-tuning | QLoRA on medical/scientific corpora |
| Model size | ~20GB |
| Precision | torch.float16 |
| Evaluation hardware | NVIDIA A100 80GB PCIe |
| CUDA | 12.4 |
| PyTorch | 2.6.0+cu124 |

Intended Use

  • Medical and clinical reasoning tasks
  • Scientific Q&A and research workflows
  • Enterprise deployments requiring data sovereignty (on-premises, private cloud)
  • Domain-specific text analysis and insight extraction

Limitations

  • 4-bit quantization introduces some accuracy trade-off on general benchmarks vs. the full-precision base model
  • Domain fine-tuning is optimized for medical/scientific reasoning; general-purpose performance may differ
  • Not intended as a replacement for professional medical judgment

Citation

If you use this model, please cite:

@misc{chaperone-thinking-lq,
  title={Chaperone-Thinking-LQ-1.0: Domain-Optimized Reasoning via GPTQ-QAT and QLoRA},
  author={Empirisch Technologies},
  year={2025},
  url={https://huggingface.co/empirischtech}
}

Links

  • Model repository: https://huggingface.co/empirischtech/DeepSeek-R1-Distill-Qwen-32B-gptq-4bit