Chaperone-Thinking-LQ-1.0

A domain-optimized reasoning model built on DeepSeek-R1-Distill-Qwen-32B and refined through a multi-stage pipeline of GPTQ quantization-aware training and QLoRA fine-tuning. It achieves 84% on MedQA (within 4 points of GPT-4o) in a ~20GB package that fits on a single NVIDIA L40 or L40S GPU.

Fully open-source under CC-BY-4.0.


Highlights

  • Base model: DeepSeek-R1-Distill-Qwen-32B (32B parameters)
  • Size reduction: ~60GB → ~20GB (4-bit GPTQ)
  • MedQA accuracy: 84% (GPT-4o: ~88%)
  • Hardware target: Runs on a single NVIDIA L40, L40S, or A100 GPU
  • License: CC-BY-4.0
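The ~60GB → ~20GB reduction follows from the bit-width arithmetic; here is a minimal sketch. The per-group metadata cost (one fp16 scale plus a packed zero-point per 128-weight group) is an illustrative assumption, not the exact on-disk layout of this checkpoint:

```python
def quantized_size_gb(n_params: float, bits: int, group_size: int = 128) -> float:
    """Approximate on-disk size of a per-group quantized model.

    Each weight stores `bits` bits; each group of `group_size` weights
    adds ~2.5 bytes of metadata (fp16 scale + packed zero-point).
    """
    weight_bytes = n_params * bits / 8
    meta_bytes = (n_params / group_size) * 2.5
    return (weight_bytes + meta_bytes) / 1e9

fp16_gb = 32e9 * 2 / 1e9            # 16-bit baseline: ~64 GB of raw weights
int4_gb = quantized_size_gb(32e9, 4)  # packed 4-bit + metadata: ~16.6 GB
```

The remaining gap to the quoted ~20GB comes from unquantized tensors (embeddings, norms) and file overhead, which this sketch omits.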

How We Built It

This model is not a simple quantization. It was produced through a four-stage pipeline:

| Stage | Method | What it does |
|-------|--------|--------------|
| 1. Quantization | 4-bit GPTQ | Compresses weights from ~60GB to ~20GB for efficient inference |
| 2. Quantization-aware training | GPTQ-based QAT with calibration | Minimizes accuracy loss during quantization by optimizing scale/zero-point parameters against a calibration dataset |
| 3. Domain fine-tuning | QLoRA | Adapts the quantized model on medical and scientific corpora, recovering and improving accuracy for domain-specific reasoning |
| 4. Transparency | Adaptive layer removal | Removes the identity adaptive layer so the model correctly attributes its foundational architecture to its original creators |
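The scale/zero-point parameters optimized in stage 2 can be illustrated with a toy per-tensor example. This is a plain min/max asymmetric quantizer in NumPy, not the actual GPTQ implementation (GPTQ works per-group with Hessian-weighted error compensation), but it shows what the scale and zero-point do:

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Asymmetric 4-bit quantization: map [w.min(), w.max()] onto integers [0, 15]."""
    qmin, qmax = 0, 15
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(np.round(-w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), float(scale), zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate fp32 weights from the 4-bit codes."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, s, z = quantize_4bit(w)
w_hat = dequantize(q, s, z)
err = float(np.abs(w - w_hat).max())  # worst-case error is on the order of one scale step
```

Calibration-based QAT, as used here, tunes `scale` and `zero_point` (and, via QLoRA, low-rank weight updates) so that this reconstruction error matters as little as possible on real activations.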

Benchmark Results

MedQA

| Model | Accuracy |
|-------|----------|
| Chaperone-Thinking-LQ-1.0 | 84% |
| GPT-4o | 88% |

Multi-Model Comparison

| Benchmark | DeepSeek-R1 | OpenAI-o1-1217 | DeepSeek-R1-32B | OpenAI-o1-mini | Chaperone-Thinking-LQ-1.0 |
|-----------|-------------|----------------|-----------------|----------------|---------------------------|
| AIME 2024 | 79.8 | 79.2 | 72.6 | 63.6 | 66.7 |
| GPQA Diamond | 71.5 | 75.7 | 62.1 | 60.0 | 56.7 |
| MATH-500 | 97.3 | 96.4 | 94.3 | 90.0 | 91.9 |
| MMLU | 90.8 | 91.8 | 87.4 | 85.2 | 85.9 |

Chaperone-Thinking-LQ-1.0 delivers competitive performance against full-precision frontier models at roughly one-third the model size.

Speed & Latency

| Metric | Chaperone-Thinking-LQ-1.0 | DeepSeek-R1-Distill-Qwen-32B |
|--------|---------------------------|------------------------------|
| Throughput | 36.86 tok/s | 22.84 tok/s |
| Latency p50 | 11.49 s | 20.10 s |
| Latency p95 | 13.06 s | 20.11 s |

1.6x higher throughput with ~43% lower median latency. Averages over 10 trials, concurrency=1, max_tokens=512, temperature=0.
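A measurement harness along these lines reproduces how such p50/p95 figures are computed (this is a hypothetical sketch, not the exact benchmark script; `dummy_generate` stands in for a real inference call and its sleep/return values are placeholders):

```python
import statistics
import time

def benchmark(generate, prompt: str, trials: int = 10, max_tokens: int = 512):
    """Measure per-request latency and decode throughput at concurrency=1."""
    latencies, throughputs = [], []
    for _ in range(trials):
        t0 = time.perf_counter()
        n_tokens = generate(prompt, max_tokens=max_tokens)  # returns tokens produced
        dt = time.perf_counter() - t0
        latencies.append(dt)
        throughputs.append(n_tokens / dt)
    percentiles = statistics.quantiles(latencies, n=20)  # 5% steps
    return {
        "throughput_tok_s": statistics.mean(throughputs),
        "latency_p50": statistics.median(latencies),
        "latency_p95": percentiles[18],  # 19th of 19 cut points = 95th percentile
    }

def dummy_generate(prompt: str, max_tokens: int = 512) -> int:
    """Stub standing in for a real model call."""
    time.sleep(0.01)  # pretend inference time
    return max_tokens

stats = benchmark(dummy_generate, "Explain GPTQ in one paragraph.", trials=10)
```

In the real setting, `generate` would call the deployed model with temperature=0 and the reported numbers would be averaged as described above.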


Model Details

| Detail | Value |
|--------|-------|
| Base model | DeepSeek-R1-Distill-Qwen-32B |
| Parameters | 32 billion |
| Quantization | 4-bit GPTQ |
| Fine-tuning | QLoRA on medical/scientific corpora |
| Model size | ~20GB |
| Precision | torch.float16 |
| Evaluation hardware | NVIDIA A100 80GB PCIe |
| CUDA | 12.4 |
| PyTorch | 2.6.0+cu124 |

Intended Use

  • Medical and clinical reasoning tasks
  • Scientific Q&A and research workflows
  • Enterprise deployments requiring data sovereignty (on-premises, private cloud)
  • Domain-specific text analysis and insight extraction

Limitations

  • 4-bit quantization introduces some accuracy trade-off on general benchmarks vs. the full-precision base model
  • Domain fine-tuning is optimized for medical/scientific reasoning; general-purpose performance may differ
  • Not intended as a replacement for professional medical judgment

Citation

If you use this model, please cite:

@misc{chaperone-thinking-lq,
  title={Chaperone-Thinking-LQ-1.0: Domain-Optimized Reasoning via GPTQ-QAT and QLoRA},
  author={Empirisch Technologies},
  year={2025},
  url={https://huggingface.co/empirischtech}
}

Links

  • Model repository: https://huggingface.co/empirischtech/DeepSeek-R1-Distill-Qwen-32B-gptq-4bit