What is our primary use case?
NVIDIA DGX is a crucial component within our HPC cluster, which comprises two 250 machines.
How has it helped my organization?
Using DGX systems has been beneficial, especially when dealing with large language models or resource-intensive applications. The all-in-one design of the server is great for users because everything runs on a single box. It is versatile, supporting both single-node and multi-node setups through Mellanox interconnectivity. This is handy when you need to scale up or require significant resources. The customized software from NVIDIA adds an extra layer of efficiency compared to other options.
What is most valuable?
The most valuable thing about DGX systems is NVLink, the high-speed interconnect that links the GPUs so they work together really well. The GPUs are powerful and tightly integrated with the server. The systems also have dedicated cards for communication between different GPUs, which helps when we want to move data around quickly without involving the CPU. The servers now use AMD processors, which perform well. Overall, DGX systems are well-built and powerful.
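One way to see which GPUs are joined by NVLink is to read the topology matrix printed by `nvidia-smi topo -m`. The following is a minimal sketch, not something taken from our setup: the function name `nvlink_pairs` and the sample matrix are made up for illustration, and real output carries extra columns and a legend (and sometimes terminal escape codes) that may need more careful handling.

```python
import re

# Illustrative sample only: shaped like `nvidia-smi topo -m` output,
# not captured from a real DGX system.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    NIC0
GPU0     X      NV12    NV12    PXB
GPU1    NV12     X      NV12    SYS
GPU2    NV12    NV12     X      SYS
NIC0    PXB     SYS     SYS      X
"""


def nvlink_pairs(topo_text):
    """Return {(gpu_a, gpu_b): link_count} for GPU pairs joined by NVLink.

    Hypothetical helper: it assumes the first non-empty line is the column
    header and that NVLink cells look like "NV<number>".
    """
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    header = lines[0].split()
    pairs = {}
    for line in lines[1:]:
        cells = line.split()
        row = cells[0]
        if not row.startswith("GPU"):
            continue  # skip NIC rows, legend lines, and anything else
        for col, cell in zip(header, cells[1:]):
            match = re.fullmatch(r"NV(\d+)", cell)
            if not (match and col.startswith("GPU")):
                continue
            # Keep each pair once, ordered by GPU index.
            if int(row[3:]) < int(col[3:]):
                pairs[(row, col)] = int(match.group(1))
    return pairs


if __name__ == "__main__":
    for (a, b), links in sorted(nvlink_pairs(SAMPLE_TOPO).items()):
        print(f"{a} <-> {b}: {links} NVLink links")
```

On a real machine, the text passed to the function would come from running `nvidia-smi topo -m` rather than from the sample string.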
What needs improvement?
One thing that could be better in DGX systems is their power consumption. NVIDIA has been making improvements, but finding the right balance between performance and power draw is a challenge. It would be great to see more progress in making the system more power-efficient. In terms of features, I would like to see enhanced application support in DGX systems. Currently, they are strong in machine learning and typical vector-based applications, but it would be beneficial to broaden their capabilities to support a more diverse range of applications.
For how long have I used the solution?
I have been working with NVIDIA DGX Systems for over three years.
What do I think about the stability of the solution?
The product is quite stable.
What do I think about the scalability of the solution?
The product is highly scalable. DGX Systems follow the concept of DGX boxes, allowing you to stack them up and create a supportive infrastructure. This design makes it well-suited for scaling up, providing flexibility and scalability for handling increasing workloads.
How was the initial setup?
The initial setup of the DGX server was quite straightforward. We treated it like any other server during deployment. It went to the data center, where they set it up, placed it in the rack, and enabled it. The deployment process was familiar, using our standard tools like Foreman and Ansible. Since the operating system is supported, we didn't encounter any specific challenges. For deploying the DGX server, we typically need two people for software tasks and sometimes vendor assistance for hardware setup. The process takes about four hours, with NVIDIA firmware updates taking the most time (around two hours), and the rest dedicated to OS and Ansible deployment. Maintaining the DGX server is pretty straightforward. We treat it like any other server, with around 10% downtime, while the rest of the cluster remains up.
What was our ROI?
We have seen a good return on investment with NVIDIA DGX.
What's my experience with pricing, setup cost, and licensing?
The prices for DGX are pretty high, and not everyone can afford them. We only have a few out of our total servers because of the cost. It would be great if the prices could come down in the future to make it more accessible.
Which other solutions did I evaluate?
NVIDIA DGX has tight GPU integration and strong interconnect capabilities within the server, giving it an edge. While Dell and HP have similar concepts, the direct connection to NVIDIA and firmware assurance set DGX apart. Better software support and optimized firmware make it a standout choice, and having direct contact with NVIDIA adds a unique advantage.
What other advice do I have?
My advice to anyone considering this product: if you are technically capable, I recommend going for NVIDIA DGX Systems. It is excellent for heavy machine learning and deep learning workloads. However, for more typical HPC tasks or less GPU-intensive work, you might consider a more cost-effective CPU-based server from Dell or HP. You should choose based on your specific workload needs. Overall, I would rate NVIDIA DGX Systems as a nine out of ten.
Which deployment model are you using for this solution?
On-premises
