Dual Reticle GPU With Over 20K Cores, 288 GB HBM3e Memory at 8 TB/s & 50% Faster Than GB200

NVIDIA has provided an in-depth breakdown of its fastest chip for AI, the Blackwell Ultra GB300, which is 50% faster than GB200 & packs 288 GB memory.

NVIDIA’s Blackwell Ultra “GB300” Is The Miracle Chip For AI, 50% Faster Than GB200 And Packs 288 GB of Memory

A few days ago, NVIDIA rolled out an article giving a breakdown of its latest and greatest AI chip, the GB300 Blackwell Ultra. This chip is now in full production and has already been rolled out to key customers. While the chip is an extension of the Blackwell solution, it does offer a significant upgrade in terms of performance and features.

Just like how the NVIDIA Super series is a better version of the original RTX gaming cards, the Ultra series is an enhanced version of the AI chips that were initially introduced. NVIDIA didn’t use Ultra branding in previous lineups, such as Hopper and Volta, but those technically had enhanced versions as well (the H200, for instance). Plus, even though Ultra chips are better on a hardware level, software updates and optimizations also deliver substantial gains on non-Ultra chips.

So, what is Blackwell Ultra GB300? Well, as said above, it is an enhanced version that makes use of two reticle-sized dies, connected by NVIDIA’s NV-HBI high-bandwidth interface so that they present as a single GPU. The chip is quite dense: built on TSMC’s 4NP node (a 5nm-class process optimized for NVIDIA), it houses a total of 208 billion transistors, while the NV-HBI interface provides 10 TB/s of bandwidth between the two GPU dies.

The NVIDIA Blackwell Ultra GB300 GPU packs a total of 160 SMs, each with 128 CUDA cores, four 5th Gen Tensor cores with FP8, FP6, and NVFP4 precision compute, 256 KB of Tensor memory (TMEM), and SFUs. That adds up to a total of 20,480 CUDA cores and 640 Tensor cores, plus 40 MB of TMEM.
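The per-chip totals follow directly from the per-SM figures; a quick back-of-the-envelope check, using only numbers quoted above:

```python
# Per-SM figures from NVIDIA's published breakdown of the GB300 GPU.
SMS = 160              # streaming multiprocessors per GPU
CUDA_PER_SM = 128      # CUDA cores per SM
TENSOR_PER_SM = 4      # 5th Gen Tensor cores per SM
TMEM_PER_SM_KB = 256   # Tensor memory (TMEM) per SM, in KB

cuda_cores = SMS * CUDA_PER_SM          # total CUDA cores
tensor_cores = SMS * TENSOR_PER_SM      # total Tensor cores
tmem_mb = SMS * TMEM_PER_SM_KB / 1024   # total TMEM in MB

print(cuda_cores, tensor_cores, tmem_mb)  # -> 20480 640 40.0
```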

| Feature | Hopper | Blackwell | Blackwell Ultra |
| --- | --- | --- | --- |
| Manufacturing process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| Transistors | 80B | 208B | 208B |
| Dies per GPU | 1 | 2 | 2 |
| NVFP4 dense / sparse performance | — | 10 / 20 PetaFLOPS | 15 / 20 PetaFLOPS |
| FP8 dense / sparse performance | 2 / 4 PetaFLOPS | 5 / 10 PetaFLOPS | 5 / 10 PetaFLOPS |
| Attention acceleration (SFU EX2) | 4.5 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s |
| Max HBM capacity | 80 GB HBM (H100) / 141 GB HBM3E (H200) | 192 GB HBM3E | 288 GB HBM3E |
| Max HBM bandwidth | 3.35 TB/s (H100) / 4.8 TB/s (H200) | 8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| Max power (TGP) | Up to 700W | Up to 1,200W | Up to 1,400W |

The 5th Gen Tensor Cores are where all the magic happens, as they are responsible for all the AI compute operations. NVIDIA has delivered major innovations in each generation of Tensor Cores for its GPUs, such as:

  • NVIDIA Volta: 8-thread MMA units, FP16 with FP32 accumulation for training.
  • NVIDIA Ampere: Full warp-wide MMA, BF16, and TensorFloat-32 formats.
  • NVIDIA Hopper: Warp-group MMA across 128 threads, Transformer Engine with FP8 support.
  • NVIDIA Blackwell: 2nd Gen Transformer Engine with FP8, FP6, and NVFP4 compute, plus TMEM memory.

Blackwell Ultra also brings a huge upgrade to memory, offering 288 GB of HBM3E capacity versus a max of 192 GB on the previous Blackwell GB200 solutions. This upgrade is what allows NVIDIA to support multi-trillion-parameter AI models. The memory comes in 8 stacks managed by 16 512-bit controllers (an 8,192-bit wide interface) and delivers 8 TB/s of bandwidth per GPU. The memory enables:

  • Complete model residence: 300B+ parameter models without memory offloading.
  • Extended context lengths: Larger KV cache capacity for transformer models.
  • Improved compute efficiency: Higher compute-to-memory ratios for diverse workloads.
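As a rough sizing sketch of the first point (the formula and the weights-only assumption are ours for illustration, not NVIDIA's methodology), low-precision formats are what make complete residence of a 300B-parameter model plausible within 288 GB, ignoring KV cache and activation overhead:

```python
# Hypothetical sketch: weight memory needed to hold a model entirely in HBM,
# counting weights only (KV cache and runtime overhead are ignored here).
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

HBM_GB = 288  # Blackwell Ultra per-GPU HBM3E capacity

for fmt, bits in [("FP16", 16), ("FP8", 8), ("NVFP4", 4)]:
    need = weights_gb(300, bits)  # a 300B-parameter model
    print(f"{fmt}: {need:.0f} GB -> fits in {HBM_GB} GB: {need <= HBM_GB}")
# FP16: 600 GB -> fits: False
# FP8:  300 GB -> fits: False
# NVFP4: 150 GB -> fits: True
```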

The interconnect on Blackwell Ultra is the same NVLink provided by the NVLink Switch and NVLink-C2C, and there’s also a PCIe Gen6 x16 interface for connecting to the host CPU. The following are the NVLink 5 and host-side connectivity features/specs:

  • Per-GPU Bandwidth: 1.8 TB/s bidirectional (18 links x 100 GB/s)
  • Performance Scaling: 2x improvement over NVLink 4 (Hopper GPU)
  • Maximum Topology: 576 GPUs in non-blocking compute fabric
  • Rack-Scale Integration: 72-GPU NVL72 configurations with 130 TB/s aggregate bandwidth
  • PCIe Interface: Gen6 × 16 lanes (256 GB/s bidirectional)
  • NVLink-C2C: Grace CPU-GPU communication with memory coherency (900 GB/s)
| Interconnect | Hopper GPU | Blackwell GPU | Blackwell Ultra GPU |
| --- | --- | --- | --- |
| NVLink (GPU-GPU) | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| NVLink-C2C (CPU-GPU) | 900 GB/s | 900 GB/s | 900 GB/s |
| PCIe Interface | 128 GB/s (Gen 5) | 256 GB/s (Gen 6) | 256 GB/s (Gen 6) |
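The NVLink 5 figures quoted above are internally consistent, as a quick check on the per-link and rack-scale numbers shows:

```python
# Sanity check of the NVLink 5 bandwidth figures quoted in the article.
links_per_gpu = 18
per_link_gb_s = 100  # GB/s per link, bidirectional

per_gpu_gb_s = links_per_gpu * per_link_gb_s       # 1,800 GB/s = 1.8 TB/s
nvl72_aggregate_tb_s = 72 * per_gpu_gb_s / 1000    # 129.6, i.e. the quoted ~130 TB/s

print(per_gpu_gb_s, nvl72_aggregate_tb_s)  # -> 1800 129.6
```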

The result is that NVIDIA’s Blackwell Ultra GB300 platform is able to achieve a 50% increase in dense low-precision compute output using the new NVFP4 standard. The new format delivers near-FP8 accuracy, with differences often under 1%, while reducing the memory footprint by roughly 1.8x versus FP8 and 3.5x versus FP16.
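Those footprint ratios can be reproduced if NVFP4 costs roughly 4.5 bits per value once per-block scale factors are included; that effective size is our assumption for illustration, not an official figure:

```python
# Assumed: NVFP4 stores 4-bit values plus per-block scale metadata,
# for an effective cost of ~4.5 bits per value (illustrative figure).
effective_nvfp4_bits = 4.5

print(round(8 / effective_nvfp4_bits, 2))   # vs FP8  -> 1.78 (~1.8x)
print(round(16 / effective_nvfp4_bits, 2))  # vs FP16 -> 3.56 (~3.5x)
```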

Blackwell Ultra also sees advanced scheduling management and new Enterprise-grade security features, such as:

  • Enhanced GigaThread Engine: Next-generation work scheduler providing improved context switching performance and optimized workload distribution across all 160 SMs.
  • Multi-Instance GPU (MIG): Blackwell Ultra GPUs can be partitioned into different-sized MIG instances. For example, an administrator can create two instances with 140 GB of memory each, four instances with 70 GB each, or seven instances with 34 GB each, enabling secure multi-tenancy with predictable performance isolation.
  • Confidential computing and secure AI: Secure and performant protection for sensitive AI models and data, extending hardware-based Trusted Execution Environment (TEE) to GPUs with industry-first TEE-I/O capabilities in the Blackwell architecture and inline NVLink protection for near-identical throughput when compared to unencrypted modes.
  • Advanced RAS engine: AI-powered reliability, availability, and serviceability (RAS) system monitoring thousands of parameters to predict failures, optimize maintenance schedules, and maximize system uptime in large-scale deployments.

Performance efficiency is another area where Blackwell Ultra GB300 takes charge, offering higher throughput per megawatt (TPS/MW) than Blackwell GB200.

All this shows that NVIDIA is simply at the top of the AI ladder with engineering marvels such as Blackwell and Blackwell Ultra. Its in-depth software support and optimizations have been ticking all the boxes, and the annual hardware cadence plus increased R&D spending should keep the company ahead for several years to come.

