My recent trip to Supercomputing 2015 opened my eyes to some very interesting new applications for Optical Circuit Switched (OCS) networking in high-performance computing (HPC), including hosted graphics processing units (GPUs) and field-programmable gate arrays (FPGAs).
Both of these drive new levels of application performance and functionality, and both could benefit from the CALIENT vPod Data Center network architecture. A vPod fabric can help create resource pools and expand access to those resources while also lowering the operating and capital costs of deploying these new services.
Is There a Fabric For GPU Networking?
Graphics processing performance is increasingly important for many desktop applications. This goes beyond traditional uses such as CAD systems, 3D scientific modeling, and data visualization: many applications meant to run on PCs, laptops, and mobile devices now have 3D visualization or other graphics requirements that exceed their built-in GPU capabilities.
This is leading to a trend of hosting GPUs in the data center, connected to the servers running the application. GPUs are assigned to a server based on the performance the application requires: some applications run on nodes with two GPUs per server, while others use 16 GPUs per server or more.
This introduces a challenge, because a data center manager must custom-build different nodes to support a variety of workloads, each requiring a different mix of compute, networking, and storage resources.
Today, GPUs are hardwired to a server through a copper PCIe switch, with a cable that can be no longer than two meters. This fixes the ratio of GPUs per server and makes those GPUs captive to that particular server, so other servers cannot take advantage of any spare cycles on them. As interface speeds increase, there is a shift toward single-mode fiber.
What’s intriguing is to think about creating a GPU pool connected to servers using CALIENT’s vPod Data Center architecture, powered by its S-Series Optical Circuit Switches and the LightConnect™ Fabric Manager. LightConnect enables virtual pod (vPod)-based data centers by orchestrating an S-Series interconnect fabric for intra-data-center connectivity, allowing compute and storage resources to be shared between physical pods.
With this network architecture, a connection can be made between any server and any GPU. Ratios could be set more easily and changed as application performance requirements evolve. This allows true pooling of GPU resources to maximize utilization and make the entire initiative more cost-effective.
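As a rough illustration of what any-server-to-any-GPU pooling buys you, here is a minimal sketch. The class and method names are hypothetical, not a real CALIENT API; real orchestration would go through the LightConnect Fabric Manager. The point is simply that GPU-to-server ratios become a bookkeeping decision rather than a cabling decision:

```python
# Hypothetical sketch of GPU pooling over an optical circuit switch.
# GpuPool, attach, and detach_all are illustrative names, not a real API.

class GpuPool:
    def __init__(self, gpu_ids):
        self.free = set(gpu_ids)   # GPUs not currently cross-connected
        self.assigned = {}         # server name -> set of GPU ids

    def attach(self, server, count):
        """Cross-connect `count` free GPUs to `server`."""
        if count > len(self.free):
            raise RuntimeError("GPU pool exhausted")
        picked = {self.free.pop() for _ in range(count)}
        self.assigned.setdefault(server, set()).update(picked)
        return picked

    def detach_all(self, server):
        """Tear down the circuits and return the GPUs to the pool."""
        self.free.update(self.assigned.pop(server, set()))

pool = GpuPool(range(16))
pool.attach("node-a", 2)    # light workload: 2 GPUs for this server
pool.attach("node-b", 8)    # heavier workload: 8 GPUs
pool.detach_all("node-a")   # job done: GPUs go back to the pool
print(len(pool.free))       # prints 8
```

With fixed copper cabling, the two GPUs freed by node-a would stay captive to it; with a switched fabric they are immediately available to any other server.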
Similarly, the trend toward high-performance FPGAs in supercomputing applications can benefit from vPod technology. These FPGAs are built into HPC processor boards and offer topology adaptation and low-latency, processor-to-processor communications.
The flexible use of these FPGAs creates a need to change the network topologies of the FPGA boards. But they aren’t a pooled resource, which limits the ability to make these changes over the network. One reason is a maximum latency budget of 150 ns, 100 ns of which can be consumed by the optics. That leaves only 50 ns for router and switch hops between two servers. The CALIENT-powered vPod architecture provides an ultra-low-latency connection of less than 30 ns, making it a perfect fit for pooling these FPGA resources.
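The budget math above can be checked with a quick back-of-the-envelope calculation. All figures come straight from the paragraph; the per-hop number is the sub-30 ns latency cited for the CALIENT fabric:

```python
# Back-of-the-envelope FPGA latency budget check (all figures in ns, from the text).
TOTAL_BUDGET_NS = 150   # maximum end-to-end latency budget
OPTICS_NS = 100         # portion consumed by the optics
OCS_HOP_NS = 30         # upper bound for one CALIENT OCS connection

switching_budget = TOTAL_BUDGET_NS - OPTICS_NS    # what remains for switching
fits = OCS_HOP_NS <= switching_budget             # does an OCS hop fit?
print(switching_budget, fits)                     # prints: 50 True
```

A conventional packet-switched hop would typically consume hundreds of nanoseconds or more, which is why the 50 ns remainder effectively rules it out.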
Many of the folks that I spoke with at SC15 had no idea that they could use OCS to pool these valuable resources and reduce their costs. I hope this blog post and further publications will help to raise awareness about the exciting possibilities.