What is our primary use case?
I have been using NVIDIA AI Enterprise for two to three years. I was first introduced to the product through an NVIDIA sales representative I worked with at Dell Technologies, where I supported numerous large-scale AI and high-performance computing products with NVIDIA AI Enterprise.
NVIDIA vGPU, one of the compute layers, has been one of the most common use cases for NVIDIA AI Enterprise. It enables multiple virtual machines to share a physical GPU, which optimizes resource utilization and reduces hardware operating costs. Because NVIDIA AI Enterprise is typically licensed per GPU, it is important that customers get the best value for their money, and NVIDIA vGPU for compute nodes has really been helpful. I have also used this in a number of large-scale RFPs and RFIs.
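As a rough illustration of why GPU sharing matters under per-GPU licensing, the sketch below spreads one license cost across the virtual machines that share a GPU. The price and the VM counts are hypothetical placeholders chosen for easy arithmetic, not NVIDIA list pricing.

```python
def license_cost_per_vm(license_cost_per_gpu: float, vms_per_gpu: int) -> float:
    """Effective license cost carried by each VM when several VMs share one GPU."""
    if vms_per_gpu < 1:
        raise ValueError("vms_per_gpu must be at least 1")
    return license_cost_per_gpu / vms_per_gpu


# Hypothetical figure for illustration only; not actual pricing.
EXAMPLE_LICENSE_COST = 1000.0

if __name__ == "__main__":
    for vms in (1, 2, 4, 8):
        cost = license_cost_per_vm(EXAMPLE_LICENSE_COST, vms)
        print(f"{vms} VM(s) per GPU: {cost:.2f} per VM")
```

The more virtual machines a single licensed GPU can usefully serve, the lower the effective license cost per workload, which is the "value for money" point made above.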
I primarily work with AI workloads on a hybrid cloud model because the public cloud lacks the security posture required by organizations such as the Department of Defense and military organizations. The private cloud, while very secure, is also quite expensive. The hybrid approach is very helpful, with primarily on-prem infrastructure for rack integration plus some remote connectivity options. Systems obtain their network addresses via DHCP (Dynamic Host Configuration Protocol), and customers can use tools such as PuTTY or VS Code to SSH into a server remotely.
I have also been using a couple of other software development kits and libraries. NVIDIA NeMo provides data curation tools that help clean the data and supports model training and fine-tuning, both of which are very important to my work. NVIDIA AI Blueprints are also very important because they enable retrieval augmented generation, or RAG.
NVIDIA AI Enterprise offers a large range of libraries, and its catalogs give you the information necessary to run AI workloads securely. That has been a very important use case: NeMo serves as the data curation engine for retrieval augmented generation, and TensorRT, NVIDIA's inference optimization library, can be used from Jupyter notebooks when developing code. There are also other options such as NVIDIA's digital twin tooling, which lets you build a virtual layer on top of a physical data center, along with tools and APIs such as NVIDIA Base Command Manager. The vast majority of these libraries are open source and can be found on platforms such as GitHub and GitLab.
What is most valuable?
NVIDIA AI Enterprise has impacted my organization positively for a number of reasons. Researching organizational information is much more efficient because we have consolidated sites such as SharePoint, and NVIDIA AI Enterprise helps us access resources much quicker without needing to search the web for article after article. Additionally, there have been productivity gains from optimizing workloads with retrieval augmented generation and running demos on an AI workstation or laptop, leading to a 200 percent increase in productivity.
The accuracy of NVIDIA AI Enterprise has been exceptional, particularly when using generative AI such as retrieval augmented generation. The platform is built on reinforcement learning and model training with extensive libraries, making accuracy and reliability standout features. I believe this to be one of the best advantages of NVIDIA AI Enterprise, and ongoing training continues to reduce errors. Models are never perfect, because humans and data curation are not perfect, so I believe that increased customer support, such as a real-time support desk, would help give customers the right information when they run into problems on this type of platform.
What needs improvement?
There should be more marketing presence for NVIDIA AI Enterprise. There are numerous training options available, but I feel that many people do not always know where to go because there are so many resources. I recommend creating a weekly or monthly newsletter depending on the subscription type, as there are different levels and layers of NVIDIA AI Enterprise software. The best approach is to make information widely accessible and provide relevant training and content not just for software engineers and developers but for a wide range of audiences.
To further emphasize the need for improvements, I think NVIDIA AI Enterprise should add more marketing, training, and collaborative material. It would also be very helpful to have people available for online chats to answer basic questions for newcomers. Investing in young people is also important, as they are the future; K through 12 schools and universities should have access to this type of information.
The governance and security of NVIDIA AI Enterprise need improvement. Security capabilities such as zero trust architecture (ZTA) are crucial because everyone needs a secure software solution. While NVIDIA AI Enterprise does implement security hardening of endpoints, it does not carry all federal compliance certifications, such as FIPS, which sets requirements for cryptographic modules and the handling of cryptographic keys. FIPS 140-2 and FIPS 140-3 validation, data-at-rest encryption, and other security measures are necessary additions to NVIDIA AI Enterprise, especially for US federal government clients such as the Department of Defense, and would enhance governance, surveillance, and security.
Reinforcing the need for improvements, I see a requirement for more human contact on support tickets. It would be beneficial if NVIDIA AI Enterprise allowed customers to reach someone for support quickly, without delays. I have experienced situations with Dell customers where a ticket bounces back and forth between support teams, and those handoffs need to be reduced for better efficiency.
For how long have I used the solution?
I have been using NVIDIA AI Enterprise for two to three years.
What do I think about the stability of the solution?
NVIDIA AI Enterprise is a stable platform, releasing quarterly updates that customers can access.
What do I think about the scalability of the solution?
The scalability of NVIDIA AI Enterprise is absolutely incredible because it layers across numerous GPUs and racks. I have designed systems with up to 12 compute racks, four storage racks, and several networking cables and cards, which are crucial. I have observed NVIDIA AI Enterprise scaling up to at least 512 GPUs simultaneously.
How are customer service and support?
Customer support varies based on the support level purchased, for example ProSupport Plus with a mission-critical four-hour response. While this level guarantees quick access, there are sometimes delays because support can bounce between Dell, NVIDIA, and other involved partners and vendors. I believe there is room for improvement regarding transparency and communication in customer support.
I would rate customer support a seven, as there are metrics assessing effectiveness, time to value, and return on investment for customers. However, there have been delays in communication and responsibilities between companies such as Dell and NVIDIA, creating confusion regarding who owns specific responsibilities. I would like better communication between both parties, which would require investing in highly skilled AI services departments and customer support, including the online chat I previously mentioned.
Which solution did I use previously and why did I switch?
I was previously using a combination of Red Hat and other orchestration platforms on Ubuntu Linux, which the federal government primarily utilizes. While Red Hat is crucial and works across many servers, it is not always the latest or most advanced, and its licensing costs have become expensive. The same situation applies to VMware's private cloud offerings, where costs have also escalated.
What was our ROI?
The return on investment has shown up as significant savings in both money and time. There has not been a reduction in employees, and nobody wants their job to be replaced by AI in any capacity. However, GPU orchestration, especially through Run:ai, has increased effectiveness and efficiency. NVIDIA has also invested in orchestration by acquiring the company behind Slurm, a popular job scheduling tool for high-performance computing. Overall, I estimate roughly a 250 percent return on investment. Millions of dollars are being reinvested into hardware, and savings from GPU orchestration are now allocated to power and cooling for liquid-cooled and air-cooled data center GPUs.
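To make the roughly 250 percent figure concrete, here is a minimal sketch of the conventional ROI formula (net gain divided by cost). It assumes that conventional definition is the one behind the figure above, and the dollar amounts are hypothetical.

```python
def roi_percent(total_return: float, cost: float) -> float:
    """Conventional ROI: net gain (return minus cost) as a percentage of cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (total_return - cost) / cost * 100.0


def total_return_for_roi(cost: float, roi_pct: float) -> float:
    """Total return needed to reach a given ROI percentage on a given cost."""
    return cost * (1 + roi_pct / 100.0)


if __name__ == "__main__":
    # Hypothetical spend, used only to show the arithmetic.
    spend = 100.0
    needed = total_return_for_roi(spend, 250.0)
    print(f"Return needed for 250% ROI on {spend:.0f}: {needed:.0f}")
    print(f"ROI check: {roi_percent(needed, spend):.0f}%")
```

Under this definition, a 250 percent ROI means every dollar spent comes back as the original dollar plus two and a half dollars of net gain.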
What's my experience with pricing, setup cost, and licensing?
As a solution architect, I am not too involved in the pricing, setup cost, and licensing process. I am responsible for creating the bill of materials, detailing the items needed for compute servers, storage nodes, and networking fabric. The account team, including the account executive, sales executive, and storage executive, translates the technical components into list pricing and discounts. I am aware that NVIDIA runs promotions, typically quarterly, including bundles for Omniverse and Run:ai for GPU orchestration targeted at specific types of GPUs. NVIDIA AI Enterprise is licensed at a per-GPU cost.
Which other solutions did I evaluate?
I evaluated other options before choosing NVIDIA AI Enterprise, as discussed previously.
What other advice do I have?
My advice for others considering NVIDIA AI Enterprise is to conduct thorough research and talk with your facilities team. Understanding the rack layout, data center size, floor height, and the humidity and airflow (CFM) in the room is essential. You must determine whether you have the plumbing for AI data center needs, the floor capacity to support the weight of heavy racks (typically 2,000 to 3,000 pounds), and essential infrastructure components such as shock pallets, doors, heat exchangers, and chillers. Once these components are settled, you can have conversations about the appropriate type of NVIDIA AI Enterprise support based on your GPUs.
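One of these facility checks, floor loading, can be sketched as a simple calculation. The example below is only an illustration: the rack names, footprints, and floor rating are hypothetical, and it treats each rack as a uniform static load, which is a simplification that a facilities engineer would refine.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    weight_lb: float       # fully loaded rack weight in pounds
    footprint_sqft: float  # floor area the rack rests on, in square feet


def floor_load_psf(rack: Rack) -> float:
    """Uniform static load the rack places on the floor, in pounds per square foot."""
    if rack.footprint_sqft <= 0:
        raise ValueError("footprint_sqft must be positive")
    return rack.weight_lb / rack.footprint_sqft


def racks_exceeding_rating(racks: list[Rack], floor_rating_psf: float) -> list[str]:
    """Names of racks whose static load is above the floor's rated capacity."""
    return [r.name for r in racks if floor_load_psf(r) > floor_rating_psf]


if __name__ == "__main__":
    # Hypothetical racks using weights in the range mentioned above.
    racks = [
        Rack("compute-01", weight_lb=3000.0, footprint_sqft=8.0),
        Rack("storage-01", weight_lb=2000.0, footprint_sqft=8.0),
    ]
    # Hypothetical floor rating; the real value comes from the facilities team.
    print(racks_exceeding_rating(racks, floor_rating_psf=300.0))
```

Any rack flagged by a check like this is a prompt for a conversation with the facilities team before the hardware order is finalized.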
The NVIDIA AI Enterprise platform continues to evolve, and the more often customers go online and teach themselves about it, the better. NVIDIA Omniverse Enterprise is a collaborative environment for 3D workflows. When you make a digital twin, you are creating a 3D layer that virtualizes a hardware infrastructure platform, bringing the ideas to life.
NVIDIA AI Enterprise is primarily deployed in my organization through a hybrid cloud, as I discussed earlier. Hybrid cloud combines private on-prem infrastructure with public cloud services, offering the best of both worlds. Data that needs to stay on-prem can live in a secure environment, while archival or secondary storage can sit in the cloud, which can reduce costs. Working with a company such as Equinix for colocation and for moving data back and forth plays a crucial role in the deployment because it provides a scalable, flexible approach, with private cloud environments making the most sense for the customers I work with. I am providing this review with an overall rating of 9.
Which deployment model are you using for this solution?
Hybrid Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Google