Wondering how you tested this with 2x RTX 6000 Pro

#2
by csabakecskemeti - opened

Is there some fancy offloading to system memory in vLLM or TensorRT? This NVFP4 format is not supposed to fit in 192 GB of VRAM.
Please share the trick.
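A rough back-of-envelope for why offloading is needed. All concrete numbers here are assumptions for illustration (the thread doesn't name the model size); NVFP4 stores weights at ~4 bits per parameter plus per-block FP8 scales (~0.5 bits/param at block size 16):

```python
def weight_gb(params_b: float, bits_per_param: float = 4.5) -> float:
    """Approximate weight memory in GB for params_b billion parameters.

    4.5 bits/param assumes NVFP4: 4-bit values + ~0.5 bits/param of
    per-block scale overhead. KV cache and activations are extra.
    """
    return params_b * 1e9 * bits_per_param / 8 / 1e9

# Hypothetical ~400B-parameter model (assumption, not from the thread)
model_gb = weight_gb(400)            # ~225 GB of weights alone
vram_gb = 2 * 96                     # 2x RTX PRO 6000 Blackwell, 96 GB each
offload_gb = max(0.0, model_gb - vram_gb)
print(f"weights ~ {model_gb:.0f} GB, VRAM = {vram_gb} GB, "
      f"need >= {offload_gb:.0f} GB offloaded to CPU")
```

Anything beyond the 192 GB of combined VRAM has to spill to system RAM, which is what `--cpu-offload-gb` provides.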

I uploaded some reference CPU code and the quantization script. The vLLM / DeepGEMM hackery is still too hacky at the moment; I'm also taking a look at Triton. The main issue is that most of the optimized paths target sm100. It will be a bit before there are better RTX PRO 6000 implementations for this.

--cpu-offload-gb 150

I have 512 GB RAM and 2x RTX PRO 6000 Blackwell. Can you tell me the full command to start this? Ubuntu 24.04.

Do you know which parts it offloads to the CPU when using this? @eousphoros

vllm serve $model --tensor-parallel-size 2 --cpu-offload-gb 150. You need to make sure NCCL is functional as well to support inter-GPU communication.
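Written out as a runnable sketch (assuming vLLM's standard flag spellings; `$model` stays the placeholder from the reply above):

```shell
# Assumes vLLM is installed and $model holds the model ID or local path.
# --tensor-parallel-size 2 splits the model across both GPUs (requires working NCCL);
# --cpu-offload-gb 150 spills up to 150 GB of weights to system RAM.
vllm serve "$model" \
    --tensor-parallel-size 2 \
    --cpu-offload-gb 150
```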
