Optimum
Optimum is an optimization library that supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. It is designed to improve performance on specific hardware, such as Intel CPUs/HPUs, AMD GPUs, and Furiosa NPUs, and on accelerated runtimes like ONNX Runtime.