💡 Introduction
Eagle3-Qwen3-32B-zh is a retrained model based on the open-source Qwen3-32B. It is designed for the Eagle-3 speculative decoding algorithm and features bilingual (Chinese/English) acceleration capabilities, aiming to speed up the inference phase of Large Language Models (LLMs). Trained on a mixed Chinese-English dataset, this model demonstrates an improved acceptance rate for Chinese text, making it highly suitable for mixed-language inference tasks.
To reduce training costs, we built a low-cost Eagle-3 training pipeline on top of the Eagle and SpecForge frameworks, optimized specifically for consumer-grade GPUs and all-in-one AI server production environments. The end-to-end training process runs efficiently on NVIDIA RTX 4090 workstations and completes in approximately one week.
Notably, our draft model, trained on only about 100k samples, outperforms currently available open-source weights of the same class (showing slightly better performance in English and significant improvements in Chinese acceleration). In real-world tests on a 4x NVIDIA RTX 4090 workstation, we achieved a decoding acceleration of up to nearly 2x on GSM8K with a concurrency of 4.
Note: The configuration files in this project target the SGLang framework. If you intend to use vLLM or other frameworks, please adjust the configurations accordingly.
⚙️ Training Configuration
Operating under limited hardware resources, we adapted the Eagle and SpecForge open-source projects to build a cost-effective Eagle-3 weight-training tool for consumer-grade graphics cards. This tool supports efficient Eagle-3 training for standard-sized LLMs in workstation environments.
- Dataset: We constructed a training dataset of 100k text samples from ShareGPT-68K (English) and ShareGPT-Chinese-English-90k (Chinese). The prompt portion keeps the original dataset content, while the output portion was regenerated with the Qwen3-32B model (see the sketch after this list).
- Training Environment: Training was conducted on an 8× NVIDIA RTX 4090 (24 GB) setup and took approximately one week.
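The output-regeneration step can be reproduced against any OpenAI-compatible endpoint serving Qwen3-32B. The sketch below only illustrates the idea; the file names, endpoint URL, and single-turn handling are assumptions, not the exact pipeline used for this release.

```python
# Minimal sketch of the output-regeneration step (not the exact pipeline used here).
# Assumptions: an OpenAI-compatible endpoint serving Qwen3-32B is running at
# http://localhost:30000/v1, and the input file is a JSON list of ShareGPT-style
# records: {"conversations": [{"from": "human", "value": "..."}, ...]}.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

with open("sharegpt_prompts.json", encoding="utf-8") as f:
    records = json.load(f)

regenerated = []
for rec in records:
    # Keep the original prompt, regenerate the answer with the target model.
    prompt = next(t["value"] for t in rec["conversations"] if t["from"] == "human")
    resp = client.chat.completions.create(
        model="qwen",  # must match --served-model-name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    regenerated.append({
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": resp.choices[0].message.content},
        ]
    })

with open("eagle3_train_data.json", "w", encoding="utf-8") as f:
    json.dump(regenerated, f, ensure_ascii=False, indent=2)
```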
🖥️ Inference Launch Command
To launch the EAGLE-3 speculative decoding service with SGLang, use the following command:
```bash
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --speculative-algo EAGLE3 \
  --speculative-draft Zjcxy-SmartAI/Eagle3-Qwen3-32B-zh \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16 \
  --mem-fraction 0.7 \
  --dtype float16 \
  --served-model-name qwen \
  --tp-size 4 \
  --cuda-graph-max-bs 4
```
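Once the server is up, speculative decoding is transparent to clients: requests go through SGLang's OpenAI-compatible API exactly as for the base model. A minimal query sketch (assuming the default port 30000):

```python
# Minimal client sketch for the EAGLE-3-accelerated service.
# Assumes SGLang's default port 30000; pass --port at launch to change it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",  # matches --served-model-name
    messages=[{"role": "user", "content": "用三句话介绍一下投机解码。"}],
    max_tokens=512,
    temperature=0,
)
print(resp.choices[0].message.content)
```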
To launch the original model service with SGLang (for comparison experiments), use the following command:
```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --mem-fraction 0.7 \
  --dtype float16 \
  --served-model-name qwen \
  --tp-size 4 \
  --cuda-graph-max-bs 4
```
📊 Performance Evaluation
We conducted systematic testing on 4x NVIDIA RTX 4090 GPUs using the Eagle-3 algorithm on the following datasets:
- MT-bench: Contains 80 high-quality multi-turn English dialogues covering various domains.
- MT-bench-zh: A Chinese version of the MT-bench dataset, translated with the official Tongyi Qianwen service and manually corrected.
- C-Eval: Composed of 104 samples, with 2 samples extracted from each category of the C-Eval dataset.
- GSM8K: 100 samples randomly extracted from the GSM8K dataset.
Note: All tests were conducted with a concurrency of 4 and a maximum generation length of 512 tokens. In the tables below, τ denotes the average acceptance length.
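The speedup figures reported below are wall-clock ratios between the baseline server and the EAGLE-3 server on identical prompts. A simplified sketch of such a measurement (the ports, prompt file, and timing method are assumptions, not the exact benchmark harness):

```python
# Sketch: time generation at concurrency 4 with a 512-token cap and report
# speedup = baseline_time / eagle3_time. Assumes two servers are running:
# EAGLE-3 on port 30000 and the baseline on port 30001; the prompt file is hypothetical.
import json
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI


def run_benchmark(base_url: str, prompts: list, concurrency: int = 4) -> float:
    client = OpenAI(base_url=base_url, api_key="EMPTY")

    def one_request(prompt: str) -> None:
        client.chat.completions.create(
            model="qwen",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
            temperature=0,
        )

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, prompts))
    return time.perf_counter() - start


with open("gsm8k_prompts.json", encoding="utf-8") as f:
    prompts = json.load(f)  # a plain list of question strings

t_eagle = run_benchmark("http://localhost:30000/v1", prompts)
t_base = run_benchmark("http://localhost:30001/v1", prompts)
print(f"Wall-clock speedup: {t_base / t_eagle:.2f}x")
```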
To ensure the robustness of our conclusions, we performed the following experiments:
🧪 Experiment 1
We configured inference to perform 5 speculative decoding steps per iteration, keep the top-4 highest-probability tokens at each step, and verify at most 16 draft tokens in parallel:
```bash
--speculative-num-steps 5 \
--speculative-eagle-topk 4 \
--speculative-num-draft-tokens 16 \
```
| Temperature | Model | MT-bench-zh Speedup | MT-bench-zh τ | MT-bench Speedup | MT-bench τ | C-Eval Speedup | C-Eval τ | GSM8K Speedup | GSM8K τ |
|---|---|---|---|---|---|---|---|---|---|
| T=0 | Zjcxy-SmartAI/Eagle3-Qwen3-32B | 1.63x | 3.37 | 1.73x | 3.57 | 1.61x | 3.3 | 1.91x | 4.03 |
| T=0 | RedHatAI/Qwen3-32B-speculator.eagle3 | 0.69x | 1.4 | 1.58x | 3.24 | 1.36x | 0.66 | 1.66x | 3.53 |
🧪 Experiment 2
Building on Experiment 1, we ran the tests with the officially recommended sampling parameters for Qwen3-32B: Temperature=0.6, TopP=0.95, TopK=20, and MinP=0.
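These sampling parameters can be supplied per request. The sketch below passes them through SGLang's native `/generate` endpoint; the exact request schema is an assumption, so check the SGLang documentation for the version you run.

```python
# Sketch: a request with the recommended Qwen3-32B sampling parameters
# (Temperature=0.6, TopP=0.95, TopK=20, MinP=0) via SGLang's native endpoint.
# The endpoint path and field names are assumptions for illustration.
import requests

payload = {
    "text": "请用一段话解释什么是投机解码。",
    "sampling_params": {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "max_new_tokens": 512,
    },
}
resp = requests.post("http://localhost:30000/generate", json=payload, timeout=600)
print(resp.json()["text"])
```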
| Temperature | Model | MT-bench-zh Speedup | MT-bench-zh τ | MT-bench Speedup | MT-bench τ | C-Eval Speedup | C-Eval τ | GSM8K Speedup | GSM8K τ |
|---|---|---|---|---|---|---|---|---|---|
| T=0.6 | Zjcxy-SmartAI/Eagle3-Qwen3-32B-zh | 1.42x | 3.27 | 1.5x | 3.49 | 1.28x | 3 | 1.67x | 3.99 |
| T=0.6 | RedHatAI/Qwen3-32B-speculator.eagle3 | 0.62x | 1.39 | 1.36x | 3.15 | - | - | 1.48x | 3.53 |
✍️ Conclusion
Based on our comparative analysis, we draw the following conclusions:
- Overall, with a training dataset of only 100k samples, the model delivers solid performance across all evaluations. Compared with currently available open-source weights of the same scale, its token prediction capability and acceleration effect show significant gains on Chinese tasks while maintaining an advantage on English tasks. The experiments further indicate that the model boosts inference efficiency by nearly 1.5x in both Chinese and English scenarios.
🔗 Relevant Links
- Qwen3-32B Open-source Weights: https://www.modelscope.cn/models/Qwen/Qwen3-32B
- Eagle Open-source Repository: https://github.com/SafeAILab/EAGLE
- SpecForge Framework: https://github.com/sgl-project/SpecForge