Wondering how you tested this with 2x RTX 6000 Pro
#2 by csabakecskemeti - opened
Is there some fancy offloading to system memory in vLLM or TensorRT? This NVFP4 format isn't supposed to fit in 192 GB of VRAM.
Please share the trick.
same here
I uploaded some reference CPU code and the quantization script. The vLLM / DeepGEMM hackery is still too hacky at the moment; I'm also taking a look at Triton. The main issue is that most of the optimized paths are targeting sm100. It will be a bit before there are better RTX Pro 6000 impls for this.
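For anyone curious what "reference CPU code" for this format roughly has to do: a minimal dequantization sketch, assuming the publicly documented NVFP4 layout (4-bit E2M1 element codes, a per-micro-block FP8 scale, and a global FP32 tensor scale). This is not the author's actual script, just an illustration of the math:

```python
import numpy as np

# E2M1 (FP4) magnitude table: 3 low bits index the value; bit 3 is the sign.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequant_nvfp4_block(codes, block_scale, tensor_scale=1.0):
    """Dequantize one micro-block of 4-bit E2M1 codes.

    codes: sequence of 4-bit codes, one per element (low nibble used)
    block_scale: per-block scale (stored as FP8 E4M3 in the real format)
    tensor_scale: global FP32 scale applied on top of the block scale
    """
    codes = np.asarray(codes, dtype=np.uint8)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    mag = E2M1[codes & 0x7]
    return sign * mag * np.float32(block_scale) * np.float32(tensor_scale)

# Codes for [+1.0, -0.5, +6.0, 0.0] with a block scale of 2.0:
vals = dequant_nvfp4_block([0b0010, 0b1001, 0b0111, 0b0000], block_scale=2.0)
# -> [ 2.0, -1.0, 12.0, 0.0 ]
```

In the real format the micro-blocks are 16 elements wide and the block scales are themselves quantized to FP8, which is what makes the optimized sm100 kernel paths non-trivial to port.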
--cpu-offload-gb 150
I have 512 GB RAM and 2x RTX PRO 6000 Blackwell. Can you tell me the full command to start this? Ubuntu 24.04.
vllm serve $model --tensor-parallel-size 2 --cpu-offload-gb 150. You need to make sure NCCL is functional as well, to support inter-GPU communication.
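Spelled out as a launch script (the model path is a placeholder; adjust the offload size to however much free system RAM you actually have):

```shell
#!/usr/bin/env bash
# Hypothetical model path -- point this at the actual repo or local dir.
MODEL=/path/to/nvfp4-model

# --tensor-parallel-size 2 splits the model across both RTX PRO 6000s;
# --cpu-offload-gb 150 spills up to 150 GB of weights into system RAM,
# which is the trick that lets it run in 192 GB of total VRAM.
vllm serve "$MODEL" \
  --tensor-parallel-size 2 \
  --cpu-offload-gb 150
```

Expect a large latency hit on the offloaded layers since they stream over PCIe on every forward pass.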