NVIDIA’s High-End GeForce RTX 5090 & RTX PRO 6000 GPUs Reportedly Affected by Virtualization Bug, Requiring Full System Reboot to Recover

NVIDIA’s flagship GPUs, the GeForce RTX 5090 and the RTX PRO 6000, have reportedly run into a new bug that leaves them unresponsive when used under virtualization.

NVIDIA’s Flagship Blackwell GPUs Are Becoming ‘Unresponsive’ After Extensive VM Usage

CloudRift, a GPU cloud for developers, was the first to report crashing issues with NVIDIA’s high-end GPUs. According to the company, the cards became completely unresponsive after a ‘few days’ of VM usage, and they could not be accessed again until the host node was rebooted. The problem is claimed to be specific to the RTX 5090 and the RTX PRO 6000; models such as the RTX 4090, the Hopper-based H100, and the Blackwell-based B200 aren’t affected for now.

The problem specifically occurs when the GPU is passed through to a VM using the VFIO device driver: after a Function Level Reset (FLR), the GPU stops responding entirely. The unresponsive device then triggers a kernel ‘soft lockup’, which leaves both the host and guest environments deadlocked. The only way out is to reboot the host machine, which is a difficult procedure for CloudRift given the number of guest machines it runs.
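For readers who want to check how a card is set up on their own Linux host, here is a minimal sketch that reads the standard PCI sysfs entries to see whether a device is bound to `vfio-pci` and which reset methods the kernel lists for it. It only inspects state and never triggers a reset. The function name `describe_pci_device` and the example address are our own placeholders, not something taken from CloudRift’s report, and the `reset_method` file is assumed to be present only on newer kernels.

```python
import os
from pathlib import Path


def describe_pci_device(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Summarize driver binding and reset support for one PCI device.

    `bdf` is the PCI address (domain:bus:device.function). The helper is a
    hypothetical illustration; it only reads sysfs and never writes to it.
    """
    dev = Path(sysfs_root) / bdf
    if not dev.is_dir():
        raise FileNotFoundError(f"no PCI device at {dev}")

    # The 'driver' entry is a symlink to the bound driver, absent if unbound.
    driver_link = dev / "driver"
    driver = os.path.basename(os.readlink(driver_link)) if driver_link.is_symlink() else None

    # 'reset_method' lists the reset mechanisms the kernel may use, in order.
    # Assumption: the file may be missing on older kernels, so treat it as optional.
    reset_method_file = dev / "reset_method"
    methods = reset_method_file.read_text().split() if reset_method_file.is_file() else []

    return {
        "bdf": bdf,
        "driver": driver,
        "bound_to_vfio": driver == "vfio-pci",
        # 'reset' exists only when the kernel knows some way to reset the device.
        "has_reset": (dev / "reset").exists(),
        "reset_methods": methods,
        "flr_listed": "flr" in methods,
    }


if __name__ == "__main__":
    # Placeholder address; substitute the one reported by `lspci -D` on your host.
    try:
        print(describe_pci_device("0000:01:00.0"))
    except FileNotFoundError as err:
        print(err)
```

A result with `bound_to_vfio` and `flr_listed` both true only means the card is configured in the way the reports describe; it does not indicate that the card will actually hang.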

The issue isn’t limited to CloudRift. A Proxmox user has reported a similar problem, where the entire host crashed after a Windows guest was shut down. According to that user, NVIDIA has responded, saying it has been able to reproduce the issue and is working on a fix. We are still waiting on official confirmation from NVIDIA, but the problem appears to be specific to Blackwell-based GPUs.

CloudRift has also put out a $1,000 bug bounty for anyone who can fix or mitigate the issue, and we expect NVIDIA to release a fix soon, considering that the bug is affecting crucial AI workloads.

