Cerebras Systems, a pioneer in accelerating artificial intelligence (AI) compute, unveiled the Cerebras Wafer-Scale Cluster, delivering near-perfect linear scaling across hundreds of millions of AI-optimized compute cores while avoiding the pain of distributed compute. With a Wafer-Scale Cluster, users can distribute even the largest language models from a Jupyter notebook running on a laptop with just a few keystrokes. This replaces months of painstaking work with clusters of graphics processing units (GPUs).
Large language models (LLMs) are transforming entire industries across healthcare and life sciences, energy, financial services, transportation, entertainment, and more. However, training large models on traditional hardware is challenging and time-consuming, and has been accomplished successfully by only a few organizations. It requires months of complex distributed-computing work before any training can even occur. In fact, training these models is so uncommon that a successful training run is frequently deemed worthy of publication.
Cerebras Wafer-Scale Clusters allow one to quickly, simply, and easily build clusters that support the largest LLMs. By exclusively using data parallelism, Cerebras avoids the pain of distributed computing. Instead, Cerebras Wafer-Scale Clusters deliver push-button allocation of work to compute, and linear performance scaling from a single CS-2 to up to 192 CS-2 systems. Wafer-Scale Clusters make scaling the largest models dead simple. From a digital notebook on a laptop, the largest of LLMs like GPT-3 can be spread over a cluster of CS-2s with a single keystroke, trained, and the results evaluated. Switching between a 1B, 20B, and 175B parameter model is similarly simple, as is allocating an LLM to 850,000 AI cores (1 CS-2), 3.4 million compute cores (4 CS-2s), and 13.6 million cores (16 CS-2s). Each one of these actions would have taken months of work on a cluster of graphics processing units.
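The core counts quoted above follow directly from the 850,000 cores in a single CS-2. As a quick arithmetic check, here is a minimal Python sketch; the function name `total_cores` is hypothetical and is not part of any Cerebras software.

```python
# Cores per CS-2 system, as stated in the text.
CORES_PER_CS2 = 850_000


def total_cores(num_systems: int) -> int:
    """Return the total AI core count for a cluster of CS-2 systems.

    Hypothetical helper for illustration only.
    """
    if num_systems < 1:
        raise ValueError("a cluster needs at least one CS-2")
    return num_systems * CORES_PER_CS2


# Print the totals for the three cluster sizes mentioned in the text.
for n in (1, 4, 16):
    print(f"{n:>2} CS-2 -> {total_cores(n):,} cores")
```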
The key to the new Cerebras Wafer-Scale Cluster is the exclusive use of data parallelism. Data parallelism is the preferred approach for all AI work. However, data parallelism requires that all the calculations, including the largest matrix multiplications of the largest layer, fit on a single device, and that all the parameters fit in the device’s memory. Only the CS-2, and not graphics processing units, achieves both characteristics for LLMs.
The Cerebras WSE-2 is the largest processor ever built. It is 56 times larger than the largest GPU, has 123 times more cores, 1,000 times more on-chip memory, 12,000 times more memory bandwidth, and 45,000 times more fabric bandwidth. The WSE-2 is the size of a dinner plate, while the largest graphics processing unit is the size of a postage stamp.
The sheer size and computational resources of the WSE-2 allow Cerebras to fit the largest layers of the largest neural networks onto a single device. In fact, the WSE-2 can fit layers 1,000 times larger than the largest layer in the largest existing natural language processing (NLP) network. This means work never needs to be broken up and spread across multiple processors. Smaller graphics processing units, by contrast, must routinely break up work and spread it across multiple processors.
MemoryX enables Cerebras to disaggregate parameter storage from compute without suffering the penalty usually associated with off-chip memory. Storage for model parameters is in the separate MemoryX system, while all the compute is in the CS-2. By disaggregating compute from memory, the MemoryX provides nearly unbounded amounts of storage for the parameters and optimizer states.
MemoryX streams weights to the CS-2, where the activations reside. In return, the CS-2 streams back the gradients. MemoryX uses these gradients, in combination with stored optimizer parameters, to compute the weight updates for the next training iteration. This process is repeated until training is complete. MemoryX enables even a single CS-2 to support a model with trillions of parameters.
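The weight-streaming loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Cerebras software: the class `ParameterStore` and the function `compute_gradients` are hypothetical stand-ins for MemoryX and the CS-2, the model is a two-parameter linear fit, and the optimizer is plain gradient descent whose only stored optimizer parameter is the learning rate.

```python
class ParameterStore:
    """Hypothetical stand-in for MemoryX: holds weights and optimizer state."""

    def __init__(self, weights, lr=0.1):
        self.weights = list(weights)
        self.lr = lr  # the only optimizer parameter in this toy sketch

    def stream_weights(self):
        # Send a copy of the current weights to the compute device.
        return list(self.weights)

    def apply_gradients(self, grads):
        # The weight update happens on the storage side, not on the device.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grads)]


def compute_gradients(weights, batch):
    """Hypothetical stand-in for the CS-2: forward and backward pass.

    Model: y = w0 + w1 * x, loss: mean squared error over the batch.
    """
    g0 = g1 = 0.0
    for x, y in batch:
        err = (weights[0] + weights[1] * x) - y
        g0 += 2 * err
        g1 += 2 * err * x
    n = len(batch)
    return [g0 / n, g1 / n]


def train(store, batch, steps):
    for _ in range(steps):
        weights = store.stream_weights()           # store -> device
        grads = compute_gradients(weights, batch)  # device computes gradients
        store.apply_gradients(grads)               # device -> store, then update
    return store.weights


# Toy data drawn from y = 1 + 2x.
batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
store = ParameterStore([0.0, 0.0], lr=0.1)
final = train(store, batch, steps=500)
print(final)
```

In this sketch the compute function never holds the weights between iterations, which mirrors the separation of parameter storage from compute that the text describes.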
While MemoryX adds vast parameter storage capabilities, SwarmX connects MemoryX to clusters of CS-2s, enabling the CS-2s to scale out and the cluster to run strictly data parallel. SwarmX forms a broadcast-reduce fabric. The parameters stored in MemoryX are replicated in hardware and broadcast across the SwarmX fabric to multiple CS-2s. The SwarmX fabric then reduces the gradients sent back from the CS-2s, providing a single gradient stream to MemoryX.
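The broadcast-reduce pattern can be illustrated with a small Python sketch. It is a software analogy only, with hypothetical function names, and it assumes the reduction is a sum of per-worker gradients (the text does not specify the reduction operator). The point it shows is that splitting a batch across workers and reducing their gradients gives the same result as computing the gradient on one device.

```python
def local_gradient(weight, shard):
    """Per-worker gradient sum for the model y = weight * x (squared error)."""
    return sum(2 * (weight * x - y) * x for x, y in shard)


def broadcast(weight, num_workers):
    """Replicate the same weight to every worker (hypothetical helper)."""
    return [weight] * num_workers


def reduce_gradients(partial_grads):
    """Combine per-worker gradients into one stream (assumed here to be a sum)."""
    return sum(partial_grads)


def data_parallel_gradient(weight, data, num_workers):
    # Each worker receives a disjoint shard of the batch.
    shards = [data[i::num_workers] for i in range(num_workers)]
    replicas = broadcast(weight, num_workers)
    partial = [local_gradient(w, s) for w, s in zip(replicas, shards)]
    # Normalize by the full batch size after the reduction.
    return reduce_gradients(partial) / len(data)


# Toy data drawn from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
single = data_parallel_gradient(0.5, data, num_workers=1)
multi = data_parallel_gradient(0.5, data, num_workers=4)
print(single, multi)
```

Because every worker sees identical weights and only the data differs, the parameter store receives one reduced gradient regardless of how many workers contributed to it.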
Based on the CS-2, MemoryX, and SwarmX, the Cerebras Wafer-Scale Cluster is the only cluster in AI compute that enables strict linear scaling of models with billions, tens of billions, hundreds of billions, and trillions of parameters. If users go from one CS-2 to two CS-2s in a cluster, the time to train is cut in half. If users go from one CS-2 to four CS-2s, training time is cut to one-fourth. This is an exceptionally rare characteristic in cluster computing, and it is profoundly cost- and power-efficient. Unlike in GPU clusters, performance in a Cerebras cluster increases linearly as users add more compute.
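Under strict linear scaling, training time is simply the single-system time divided by the number of systems. The sketch below works through that arithmetic; the function name and the 24-hour baseline are hypothetical values chosen for illustration, not measured results.

```python
def ideal_training_time(single_system_hours: float, num_systems: int) -> float:
    """Training time under ideal linear scaling: T(N) = T(1) / N."""
    if num_systems < 1:
        raise ValueError("a cluster needs at least one system")
    return single_system_hours / num_systems


# Hypothetical baseline: a job that takes 24 hours on one CS-2.
baseline_hours = 24.0
for n in (1, 2, 4):
    hours = ideal_training_time(baseline_hours, n)
    print(f"{n} system(s): {hours} hours")
```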