NVIDIA Launches Server Certification Program, Offering Direct Technical Support

While a good deal of NVIDIA’s success in servers over the last decade has of course come from their proficient GPUs, as a business NVIDIA these days is much more than a fabless GPU designer. With more software engineers than hardware engineers on staff, it’s software and ecosystem plays that have really cemented NVIDIA’s position as the top GPU manufacturer, and created a larger market for their GPUs. At the same time, it’s these ecosystem plays that have allowed NVIDIA to build a profit-printing machine, diversifying beyond just GPU sales and moving into systems, software, support, and other avenues.

To that end, NVIDIA this morning is formally rolling out a new ecosystem play aimed at high-end deep learning servers, which the company is branding as NVIDIA-Certified Systems. Soft-launched back in the fall, the program is getting a proper introduction today, with the company detailing how it works and announcing some of the partners. Under NVIDIA’s plan, going forward customers can opt to buy NVIDIA-Certified Systems if they want an extra guarantee on system performance and reliability, and can also opt to buy support contracts that give them access to direct, full-stack technical support from NVIDIA.

Conceptually, the certification program is rather straightforward, due in large part to its hardware requirements. Systems first need to be using NVIDIA’s A100 accelerators, along with Mellanox Ethernet adapters and DPUs. In other words, the servers already need to be using NVIDIA silicon where available. OEMs can then submit systems meeting these hardware requirements to NVIDIA, which will test the systems across multiple metrics, including multi-GPU and multi-node DL performance, network performance, storage performance, and security (secure boot/root of trust). Systems that pass these tests can then be labeled as NVIDIA-Certified.

Those certified systems, in turn, are eligible for additional full-stack technical support through NVIDIA and the OEM. Customers can opt to buy multi-year support contracts, which entitle them to support through both the OEM and NVIDIA. NVIDIA essentially assumes responsibility for all software support above the OS, including their hardware drivers, CUDA, their wide collection of frameworks and libraries, and even major open source libraries like TensorFlow. That last item is what makes NVIDIA’s support proposition particularly valuable, as they’re essentially committing to helping customers with any kind of GPU or deep learning-related software issue.

Of course, that support won’t come for free: this is where NVIDIA will be making their money. While NVIDIA is not charging OEMs for certification (so there’s no additional certification tax baked into the hardware), support contracts are priced based on the number of GPUs. In one example, NVIDIA has stated that a three-year support contract for a dual-A100 system would be $4,299, or about $717 per year per GPU. So one can imagine how quickly this ratchets up for larger 4-way and 8-way A100 systems, and then again for multiple nodes.
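To make that arithmetic concrete, here is a minimal Python sketch that reproduces the per-GPU figure from the one quoted example and then extrapolates it to larger systems. The only inputs taken from NVIDIA's example are the $4,299 price, the three-year term, and the two GPUs. The function names are illustrative, and the linear scaling by GPU and node count is purely an assumption: NVIDIA has said only that pricing is based on the number of GPUs, so actual quotes may differ.

```python
# Figures from the quoted example: a 3-year contract for a dual-A100 system.
CONTRACT_PRICE_USD = 4299
CONTRACT_YEARS = 3
GPUS_IN_EXAMPLE = 2


def per_gpu_per_year(price_usd, years, gpus):
    """Effective support cost per GPU per year for a given contract."""
    return price_usd / (years * gpus)


def estimated_contract(gpus, years=CONTRACT_YEARS, nodes=1):
    """Rough contract estimate for a larger deployment.

    ASSUMPTION: cost scales linearly with GPU count and node count at the
    rate implied by the dual-A100 example. This is not a published price list.
    """
    rate = per_gpu_per_year(CONTRACT_PRICE_USD, CONTRACT_YEARS, GPUS_IN_EXAMPLE)
    return rate * gpus * years * nodes


if __name__ == "__main__":
    rate = per_gpu_per_year(CONTRACT_PRICE_USD, CONTRACT_YEARS, GPUS_IN_EXAMPLE)
    print(f"Per GPU, per year: ${rate:,.2f}")
    for gpus in (2, 4, 8):
        total = estimated_contract(gpus)
        print(f"{gpus}-GPU system, {CONTRACT_YEARS} years: ${total:,.2f}")
```

Under that linear assumption, a single 8-way system would come to roughly four times the dual-GPU contract over the same three years, before multiplying again for additional nodes.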

For NVIDIA and its OEM partners, the creation of a certification program is a straightforward way to try to further grow the market for deep learning servers, especially for mid-sized businesses. The market for AI hardware has been booming, and NVIDIA wants to keep it that way by making it easier for potential customers to use their wares. NVIDIA already has the top-end of the market covered in this respect with their direct relationships with the hyperscalers – and by extension their small-cap cloud computing customers – so a hardware certification program fills the middle tier for organizations that are going to run their own servers, but aren’t going to be a massive customer that gets personalized attention.

As for those customers, NVIDIA’s server certification and support programs are designed to eliminate (or at least mitigate) the risks of making significant investments into NVIDIA hardware. That means being able to buy a system where the vendor (in this case the duo of NVIDIA and the OEM) can vouch for the performance of the system, as well as guarantee it will be able to properly run various AI packages, such as NVIDIA’s NGC catalog of GPU-optimized and containerized software.

Altogether, NVIDIA is launching with 14 certified systems, with the promise of more to come. For the first wave of systems, participating OEMs include Dell, Gigabyte, HPE, Inspur, and Supermicro, all of whom are frequent participants in new NVIDIA server initiatives.

With all that said, NVIDIA’s server certification program is unlikely to significantly change how things work for most of the company’s customers, but it’s a program that seems primed to address a specific niche for NVIDIA and its OEM partners. For companies that are interested in GPU computing but are looking for a greater degree of support and certainty, this would address those needs. And, to bring things full circle, it’s exactly by addressing those sorts of needs with ecosystem plays like server certification that NVIDIA has been so successful in the server GPU market over the last decade.