Why a robust network fabric is critical for AI workloads

Making the move into AI has big implications for any IT infrastructure, including your enterprise network. Here’s a look at the challenges AI poses to networks and what to do about them.

Look at any list of top IT trends today and you’ll inevitably find AI. Thanks largely to the launch of ChatGPT in 2022, AI has gone from futuristic to mainstream seemingly in the blink of an eye.

Since the launch of ChatGPT, a survey of senior executives across a range of industries by the consulting firm EY found a nearly twofold increase in the number of AI investments exceeding $10 million. Goldman Sachs estimates the collective spending on AI—from the smallest corporations to the established tech giants—is now over $1 trillion. Whether it’s in marketing, manufacturing, healthcare, finance—you name it—AI now plays a central role in corporate growth strategies.

Getting AI right—training models and deploying them as fully functioning inference engines—requires vast amounts of data and processing power. That means data centers filled with high-powered graphics processing units (GPUs) such as those from NVIDIA.

However, organizations looking to deploy AI also need to take a close look at their enterprise networks. When they do, they’ll likely discover that their legacy network infrastructures are simply not up to the AI challenge.

 

AI pushes networks to the breaking point

Most enterprise networks were originally designed to connect headquarters locations with data centers and branch offices using a mix of private line solutions (such as Multiprotocol Label Switching, or MPLS) with public internet links. This widely used hybrid approach has been under pressure for years thanks to the demands posed by voice and video traffic, the Internet of Things, and Big Data applications.

Now, AI is pushing already stressed enterprise networks to the breaking point. Make no mistake about it, this is a major inflection point. Here’s why:

‘Elephant flows’ of data

The sheer volume of data required for training and operating AI models dwarfs previous IT requirements and leads to punishingly difficult traffic patterns. While pre-AI era traffic largely consisted of many relatively small, discrete data flows measured in kilobytes or megabytes—database calls, web server requests—AI workloads are measured in terabytes and even petabytes. In the AI world, they’re referred to as ‘elephant flows.’

New levels of latency

From a performance perspective, voice and video calls have always been among the most demanding applications for any network. We’ve all experienced calls where audio or video degrades because of network congestion and latency. With AI, both latency and bandwidth requirements jump to a whole new level. The data rates needed for voice and video are child’s play by comparison—2-4 Mbps for an HD video conference vs. 10-40 Gbps for AI model training.
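A rough back-of-the-envelope sketch makes the gap concrete. Using the data rates from the comparison above (and ignoring protocol overhead and congestion), here is how long it would take to move a 1 TB training shard over each class of link:

```python
# Back-of-the-envelope: time to move a 1 TB training shard
# at video-conference bandwidth vs. an AI-class link.
# Ideal-case math only; real transfers are slower.

def transfer_seconds(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Ideal transfer time: bytes * 8 bits / link rate in bits per second."""
    return size_bytes * 8 / rate_bits_per_sec

ONE_TB = 1e12       # bytes
video_link = 4e6    # 4 Mbps, HD video conference
ai_link = 40e9      # 40 Gbps, AI model-training link

print(f"1 TB at 4 Mbps:  {transfer_seconds(ONE_TB, video_link) / 86400:.1f} days")
print(f"1 TB at 40 Gbps: {transfer_seconds(ONE_TB, ai_link):.0f} seconds")
```

At 4 Mbps the transfer takes over three weeks; at 40 Gbps it takes about three minutes. Elephant flows simply do not fit through yesterday’s pipes.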

No performance compromises

With traditional network applications, a certain amount of performance degradation is tolerable. If a packet of data gets delayed or even dropped, the result might be a jittery video call, but the application can still function. Not so with AI. Many AI workloads, such as the distributed training runs behind autonomous vehicle models, can’t proceed until all the data arrives. While that’s taking place, those expensive NVIDIA GPUs sit idle. Meta (Facebook) has estimated that one-third of the elapsed time in AI and machine learning (ML) workloads is spent waiting for the network.
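To see what that one-third figure means in dollars, consider a small illustrative calculation. The cluster size and hourly GPU rate below are hypothetical, chosen only to show the shape of the math:

```python
# Illustrative only: dollar value of GPU time lost to network waits,
# applying Meta's estimate that ~1/3 of AI/ML elapsed time is spent
# waiting on the network. Cluster size and price are hypothetical.

def idle_cost(gpus: int, hourly_rate_usd: float, hours: float,
              network_wait_fraction: float = 1 / 3) -> float:
    """GPU-hours spent waiting on the network, priced in dollars."""
    return gpus * hourly_rate_usd * hours * network_wait_fraction

# Hypothetical: 256 GPUs at $2/hour over a one-week (168 h) training run.
print(f"${idle_cost(256, 2.0, 168):,.0f} of GPU time lost to network waits")
```

Even at these modest assumed prices, a week-long run burns tens of thousands of dollars on GPUs that are simply waiting for data.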

 

Unlocking an AI-ready network

So, what do networks need in order to handle the new world of AI? At Zenlayer, our entire focus is on helping companies implement low-latency, high-performance networking solutions for the most demanding applications. Companies around the world in e-learning, financial trading, media streaming, online gaming, and other industries rely on Zenlayer’s proven ability to design global networks that deliver a seamless user experience. Now we’re doing it for companies looking to deploy, connect, and scale AI models.

Whether you rely on a provider such as Zenlayer to build and manage your network—or piece together solutions on your own—here are key things to consider in addressing the challenges posed by AI:

Go Glocal

Gathering and processing data for AI is an undertaking that spans the globe. But achieving the low latency and performance demanded by AI requires a local, on-the-ground presence. At Zenlayer, we call this going “Glocal”: expanding a network’s global reach while also increasing the number of local points of presence (PoPs). No matter what you do to accelerate an AI application, performance will always be bound by the speed of light and physical distance. That’s why we’ve established over 350 edge PoPs worldwide and built a high-speed fabric for AI in Asia. This network is anchored by hubs in Singapore and Johor, where industry giants like Microsoft and NVIDIA are investing heavily to drive AI growth. The closer you are to the point where data is being generated or used, the better the performance and user experience.
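The speed-of-light bound mentioned above can be quantified. Light travels through optical fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum), which puts a hard floor under round-trip time no matter how well the network is engineered:

```python
# Lower bound on round-trip time imposed by the speed of light in fiber
# (~200,000 km/s, i.e. ~200 km of one-way fiber per millisecond).
# No routing or protocol optimization can beat this floor.

FIBER_KM_PER_MS = 200.0  # approximate one-way fiber distance per millisecond

def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber for a given one-way distance."""
    return 2 * distance_km / FIBER_KM_PER_MS

print(f"Nearby edge PoP (50 km):      {min_rtt_ms(50):.1f} ms RTT minimum")
print(f"Transoceanic link (10000 km): {min_rtt_ms(10000):.0f} ms RTT minimum")
```

A PoP 50 km away can, in principle, answer in half a millisecond; a server an ocean away can never answer in less than about 100 ms. Physics is the argument for going local.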

Avoid public cloud lock-in

Many organizations use their network to connect to one of the major cloud providers for processing large-scale AI training data. This keeps things simple, but over-reliance on a single cloud provider can lead to inflated operational costs—you are locked into the provider’s data ingress and egress charges—while also introducing latency, data sovereignty, and security concerns. Also, public cloud providers tend to have a limited global presence. Worldwide, there are more than 400 cities with populations over a million, but fewer than 10 percent of them have public cloud nodes.

A better approach is to use a mix of cloud providers and on-premises solutions. We make this easy. In addition to connecting more than 350 PoPs around the world, we also maintain direct connections to the top cloud providers such as AWS, Google Cloud, Alibaba Cloud, Tencent Cloud, Huawei Cloud, and IBM Cloud. Our customers get access to both cloud and on-premises resources through a single port, adjusting bandwidth and connectivity according to their changing demand levels. The result: complete flexibility without any cloud lock-in.

Focus on fabric

While enterprise networks have typically been based on a centralized hub-and-spoke arrangement, the AI era is going to accelerate the shift to more densely interconnected network fabrics. With a network fabric, such as a mesh network, each node (e.g., device, router, or other networking hardware) is interconnected with multiple other nodes. This makes the network very flexible in routing data, reducing latency and costs while boosting transmission speeds.
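The hop-count advantage of a mesh over hub-and-spoke can be sketched with a toy topology and a breadth-first search. The node names below are illustrative, not a real network:

```python
# Sketch: minimum hop counts in a hub-and-spoke topology vs. a mesh.
# Topologies are toy examples; node names are hypothetical.
from collections import deque

def min_hops(graph, src, dst):
    """Breadth-first search for the fewest hops between two nodes."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, hops + 1))
    return None  # unreachable

# Hub-and-spoke: every edge site must transit the central hub.
hub_spoke = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}

# Mesh fabric: edge sites also peer directly with one another.
mesh = {"hub": ["a", "b", "c"], "a": ["hub", "b", "c"],
        "b": ["hub", "a", "c"], "c": ["hub", "a", "b"]}

print(min_hops(hub_spoke, "a", "c"))  # 2 hops, via the hub
print(min_hops(mesh, "a", "c"))       # 1 direct hop
```

Each hop eliminated removes a queue, a serialization delay, and a potential point of congestion, which is why dense interconnection pays off for elephant flows.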

At Zenlayer, we take this a step further through extensive peering with leading global and regional internet service providers. As a result, we’ve been able to reduce the average number of hops between source and destination by 50% or more. It’s a key reason we can reach 85% of internet users in less than 25 milliseconds.

However, in any network, individual links and paths can still get overwhelmed and experience failure. That’s why smart routing is essential: using a controller (increasingly AI-based) to dynamically select optimal paths based on application requirements and scheduling policies. Again, we take this a step further, assessing the reliability of both the primary transmitting path as well as multiple backup paths.
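At its core, the path-selection logic described above reduces to choosing the best healthy path from a continuously measured set. The sketch below is a simplified illustration, not Zenlayer's actual controller; path names and metrics are hypothetical:

```python
# Simplified sketch of smart routing: choose the lowest-latency healthy
# path, falling back to backups when the primary degrades or fails.
# Path names and metrics are hypothetical illustrations.

def pick_path(paths):
    """paths: list of dicts with 'name', 'latency_ms', and 'healthy' keys.
    Returns the lowest-latency healthy path, or None if all are down."""
    healthy = [p for p in paths if p["healthy"]]
    return min(healthy, key=lambda p: p["latency_ms"], default=None)

paths = [
    {"name": "primary",  "latency_ms": 18, "healthy": False},  # link failure
    {"name": "backup-1", "latency_ms": 24, "healthy": True},
    {"name": "backup-2", "latency_ms": 31, "healthy": True},
]

best = pick_path(paths)
print(best["name"] if best else "no path available")  # -> backup-1
```

A production controller would add hysteresis, per-application policies, and continuous probing, but the principle is the same: keep scoring every path, primary and backup alike, and reroute the moment a better one exists.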

Think beyond the network

Networks don’t operate in a vacuum. In addition to performance issues such as bandwidth and latency, there are many other factors to take into consideration, including:

  • Security: AI workloads are subject to a wide spectrum of attacks such as tampering with prompts and poisoning the data pool. It’s critical that organizations maintain a consistently high level of security across any environment.
  • Non-IT Infrastructure: Operating in global markets requires adapting to a wide range of non-IT infrastructure issues including transportation, water, electrical power, and the local environment. How will all of those NVIDIA GPUs get to where they need to go? Is there enough water and power for cooling? What is the local vulnerability to earthquakes, tsunamis, and typhoons?
  • Compliance: As you rely on your network to expand into new regions, don’t underestimate what it takes to comply with local regulations on things like content monitoring, in-country presence, and data privacy. There are big differences between regions.

 

What’s next for your network in the era of AI?

We are still at the very early stages of the AI revolution, but our experience at Zenlayer has established beyond any doubt that the high bandwidth and performance requirements of AI represent a turning point in IT infrastructure and network design.

What’s the right path forward for your organization? Whether it’s sourcing NVIDIA GPUs, configuring the right mix of cloud and physical PoPs, or addressing AI-driven security and compliance issues, our AI experts welcome the opportunity to learn about your needs and help you take advantage of our hyperconnected network fabric for AI.

Feel free to contact a Zenlayer AI expert and let’s get the conversation started.

 
