The AI compute market is seeing something of a renaissance with the boom in generative AI. At the same time, research into alternative methods of accelerating AI workloads has also been in full swing. Researchers from the University of Washington, in conjunction with a researcher from Microsoft, have found a more efficient way to serve LLM workloads.
They published their findings in a paper titled ‘Chiplet Cloud’, detailing their plan to build an AI supercomputer out of chiplets. Compared to general-purpose GPUs, this computing model achieves a 94x improvement in total cost of ownership (TCO) per generated token. Even when pitted against Google’s built-for-AI TPUv4, the new architecture sees a 15x improvement.
As covered by AIM previously, the industry as a whole is moving towards specialised chip design. Following in the footsteps of companies like Cerebras, SambaNova, and Graphcore, the chiplet cloud might just represent the future of AI compute in the enterprise.
Chiplet cloud explained
The paper describes an architecture wherein purpose-built ASICs (application-specific integrated circuits), manufactured as chiplets, make up the bulk of the computing power. Chiplet-based design is already mainstream, as seen in its adoption by Intel (Meteor Lake) and AMD (Zen). While these chipmakers use chiplets to assemble general-purpose processors, the paper proposes that the whole architecture be constructed of specialised ASIC chiplets.
By creating an ASIC optimised for matrix maths, which makes up the bulk of AI compute workloads, the researchers showed a huge performance increase and cost savings over GPUs. In terms of total cost of ownership per generated token, the chiplet cloud saw a 94x improvement over a cloud of NVIDIA’s last-gen A100 GPUs.
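To make the metric concrete, here is a minimal sketch of how total cost of ownership (TCO) per token could be computed. The function name, its parameters, and the numbers in the usage example are illustrative assumptions, not figures or formulas taken from the paper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tco_per_token(capex_usd, opex_usd_per_year, years, tokens_per_second, utilisation):
    """Lifetime cost divided by lifetime tokens generated.

    All inputs are hypothetical; the paper's own cost model may differ.
    """
    total_cost = capex_usd + opex_usd_per_year * years
    total_tokens = tokens_per_second * utilisation * SECONDS_PER_YEAR * years
    return total_cost / total_tokens


if __name__ == "__main__":
    # Made-up numbers, purely to show how the ratio between two systems is formed.
    baseline = tco_per_token(1_000_000, 200_000, 3, 10_000, 0.5)
    candidate = tco_per_token(500_000, 50_000, 3, 40_000, 0.9)
    print(f"baseline / candidate = {baseline / candidate:.1f}x")
```

The point of the sketch is that the metric rewards three things at once: cheaper hardware, lower running costs, and more tokens actually produced per second of ownership.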
These cost savings mainly stem from the silicon-level optimisations that come with creating customised chips. In addition to the optimisations for matrix maths, the chip also has huge amounts of memory in the form of SRAM (static random access memory). This is one of the most important parts of any system for LLM workloads, as it allows the model to be stored in fast memory.
This has long been an issue with GPUs, as even the fastest off-chip memory today cannot keep up with LLMs’ requirements. This leads to a memory bottleneck, wherein the GPU is not used to its fullest potential due to memory bandwidth constraints. The chiplet cloud does not fall prey to this issue, as it has extremely low-latency memory placed right next to the processing chips.
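A rough back-of-the-envelope sketch shows why memory bandwidth caps token generation. It assumes that every model weight must be read once per generated token for a single request, and it ignores batching, caching, and compute time. The model size and bandwidth figures are illustrative assumptions, not measurements from the paper.

```python
def max_tokens_per_second(model_bytes, memory_bandwidth_bytes_per_s):
    """Upper bound on single-request decode speed when weight reads dominate."""
    return memory_bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Hypothetical model: 175 billion parameters stored at 2 bytes each.
    model_bytes = 175e9 * 2
    # Hypothetical memory systems: one off-chip, one much faster on-chip.
    for label, bandwidth in [("off-chip, 2 TB/s", 2e12), ("on-chip, 100 TB/s", 100e12)]:
        rate = max_tokens_per_second(model_bytes, bandwidth)
        print(f"{label}: at most {rate:.1f} tokens/s per request")
```

Under these assumptions, the compute units sit idle whenever they finish their arithmetic faster than the weights can arrive, which is the underutilisation described above.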
These chips are then connected together using a 2D torus structure, which the researchers say is flexible enough for different kinds of AI workloads. These features are only half the story, as the main benefit of introducing a chiplet cloud lies in cost reduction.
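For readers unfamiliar with the term, a 2D torus is a grid whose edges wrap around, so every chip has exactly four neighbours and no chip sits on a border. The sketch below illustrates that general topology only; the function names and grid size are hypothetical and do not describe the paper's actual interconnect.

```python
def torus_neighbours(x, y, width, height):
    """Return the four neighbours of node (x, y) on a width x height 2D torus."""
    return [
        ((x - 1) % width, y),   # left, wrapping to the last column
        ((x + 1) % width, y),   # right, wrapping to the first column
        (x, (y - 1) % height),  # up, wrapping to the last row
        (x, (y + 1) % height),  # down, wrapping to the first row
    ]


def torus_hops(a, b, width, height):
    """Minimum number of hops between nodes a and b, using wraparound links."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


if __name__ == "__main__":
    print(torus_neighbours(0, 0, 4, 4))
    print(torus_hops((0, 0), (3, 3), 4, 4))
```

The wraparound links are what keep worst-case distances short: on a 4x4 grid, opposite corners are two hops apart instead of six.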
Future of AI compute
As mentioned previously, cloud service providers are reaching into their deep pockets to fund research into specialised AI chips. AWS has Trainium and Inferentia, and Google has TPUs, but Microsoft had fallen behind, until now. This research holds the potential to change the way enterprises approach cloud compute for AI.
To begin with, the cost of manufacturing the nodes required for the chiplet cloud would be a drop in the ocean compared to that of competing systems. Researchers estimated the cost of building a comparable GPU cluster at $40 billion, not counting the operating expenses that come with such powerful machines.
On the other hand, the cost of the chiplet cloud was estimated to be around $35 million, which makes it highly competitive, especially when considering its huge efficiency gains. Moreover, breaking down the silicon chip into chiplets improves manufacturing yield, further driving down the cost of ownership.
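The yield argument can be illustrated with the simple Poisson yield model, in which the chance that a die is defect-free falls exponentially with its area. This is a textbook approximation swapped in for illustration, not the cost model used in the paper. The die sizes and defect density below are assumed values, and packaging and testing costs are ignored.

```python
import math


def die_yield(area_mm2, defects_per_cm2):
    """Poisson yield model: probability that a die of the given area has no defects."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def wafer_area_per_good_mm2(area_mm2, defects_per_cm2):
    """Silicon area that must be fabricated, on average, per mm^2 of working die."""
    return 1.0 / die_yield(area_mm2, defects_per_cm2)


if __name__ == "__main__":
    defect_density = 0.1  # defects per cm^2, an assumed illustrative value
    for label, area in [("one 800 mm^2 die", 800), ("100 mm^2 chiplets", 100)]:
        y = die_yield(area, defect_density)
        waste = wafer_area_per_good_mm2(area, defect_density)
        print(f"{label}: yield {y:.0%}, {waste:.2f} mm^2 fabricated per good mm^2")
```

Because a single defect scraps only one small chiplet rather than an entire large die, far less fabricated silicon is thrown away for the same amount of working silicon.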
In addition to this, these ASICs will also be utilised at their full capacity due to the 2D torus architecture, as opposed to 40% utilisation on TPUs and 50% on GPUs for LLM workloads. These chips can also be deployed as per the software and hardware requirements of the companies, making them even more suited for cloud deployment.
The chiplet cloud’s compute type and memory capacity can be changed depending on the type of model being deployed on it. This alone will have AI-first companies queuing up for the product, as the custom-sized clouds can help them save costs while optimising for narrow use-cases. Moreover, the cloud can also be configured for either latency or TCO per token, meaning that companies can opt to have their models run either as fast as possible or as cheaply as possible.
The possibilities are endless with the chiplet cloud architecture, which might also be why Microsoft is conducting research into this field. If this undertaking makes its way into Azure, Microsoft would not only have a unique bargaining chip against AWS and GCP, but could also supercharge OpenAI’s APIs and its own Azure OpenAI service. While it is still in the research phase, the chiplet cloud’s various benefits might make it the go-to cloud compute for AI.