Don't Let Storage Be Your AI Training Kryptonite
Fast storage plays a critical role in AI model training workflows. Saving and loading checkpoints takes time, and with slow storage these operations can become bottlenecks. Faster storage shortens saves and loads, minimizing the time spent on I/O and keeping the overall training process efficient. That benefits data scientists and reduces training costs, because GPUs are not left idle for long periods. Solidigm QLC drives tick the performance and reliability boxes, and Solidigm also partners with some of the best vendors in the world.
So, what is checkpointing?
The endeavour to train large, sophisticated models often necessitates the management of vast datasets. Checkpointing emerges as a pivotal tool, significantly alleviating the burdens of AI researchers by safeguarding progress and mitigating potential setbacks. However, the efficacy of checkpointing is contingent upon a critical component: high-performance storage infrastructure.
Checkpointing is a vital practice in model training, creating snapshots of progress to safeguard against disruptions like system crashes, power outages or hardware failure. These snapshots, containing the model’s current parameters, serve as invaluable insurance, ensuring that extensive training efforts aren’t lost in an instant and can be resumed rather than starting over.
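To make the idea concrete, here is a minimal sketch of a checkpoint-and-resume loop. It uses only the Python standard library, and the function names (`save_checkpoint`, `load_checkpoint`, `train`), the pickle file format and the toy "training update" are illustrative assumptions rather than anything a particular framework prescribes. Real training frameworks ship their own checkpoint APIs, but the pattern is the same: write a snapshot periodically, and on restart pick up from the last one.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, params):
    """Write a snapshot atomically, so a crash mid-write never
    corrupts the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "params": params}, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to storage before renaming
    os.replace(tmp_path, path)  # atomic swap to the final name


def load_checkpoint(path):
    """Return (step, params) from the snapshot, or (0, None) if none exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["params"]


def train(path, total_steps, checkpoint_every):
    """Toy training loop that resumes from the last checkpoint if present."""
    step, params = load_checkpoint(path)
    if params is None:
        params = {"w": 0.0}  # hypothetical initial model state
    while step < total_steps:
        params["w"] += 0.5  # stand-in for a real training update
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, params)
    return step, params
```

In this sketch, every call to `save_checkpoint` blocks the loop until the data is on disk, which is exactly where storage speed shows up in training time.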
Yet, while checkpointing is a lifesaver, it can also introduce challenges. Storing and retrieving these checkpoints can strain systems, especially with large models producing hefty checkpoint files, sometimes tens of terabytes in size. Slow storage can delay these processes, significantly extending training durations and the associated costs when using expensive computational resources such as GPUs. This is where high-performance storage solutions step in, streamlining data transfer and keeping training cycles running smoothly.
In essence, fast storage empowers checkpointing to truly act as a pit crew for AI models, ensuring swift recovery from setbacks and keeping them on the fast track to learning.
A New Player… with Deep Roots
Solidigm is a new company, founded in December 2021. However, it benefits from the extensive experience of two industry leaders in memory and storage.
- Intel’s NAND and SSD Business: This division has decades of experience creating innovative flash memory and solid-state drive technologies.
- SK hynix: A leading South Korean semiconductor manufacturer, recognized for its global reach and manufacturing expertise.
This partnership has resulted in a new force in the data storage industry. Solidigm is well-positioned to be a key player in the evolving data storage market, and already has a large and impressive product portfolio.
Solidigm QLC portfolio is up to the task
AI data storage needs are ever-growing. In addition to high performance, there is also demand in the storage industry for:
- High-capacity drives: With larger and larger training sets being utilised, drive capacity becomes a critical factor in the density of storage deployed for AI workflows
- Dense storage enclosures: A smaller form-factor allows for significantly higher storage density compared to traditional options, thus reducing deployment footprint in valuable data centre real estate
The Solidigm D5-P5336 QLC SSD portfolio meets both of these demands by maximizing storage density and capacity, all whilst providing crucial performance in AI workflows, such as checkpointing.
With the Solidigm range now available in capacities of up to 61.44TB per drive and support for both U.2 15mm and E1.L 9.5mm form factors, let’s look at a typical 2U storage enclosure to get a sense for the density that can be achieved.
Using U.2 15mm form factor drives, a 2U enclosure can hold around 24-26 drives, for a total raw capacity of roughly 1.47PB with 24 drives. Switching to an E1.L 9.5mm form factor results in even more density. A 1U configuration can accommodate a whopping 32 drives, more than doubling the drive count per rack unit. Two such units hold 64 drives in 2U, or around 3.93PB, which is roughly 2.7x the capacity of the 24-drive U.2 enclosure.
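The density figures are simply drive count multiplied by per-drive capacity. The short sketch below works through that arithmetic, assuming 24 U.2 drives per 2U enclosure (the low end of the range quoted above), two 32-drive 1U E1.L units per 2U, and decimal units (1PB = 1,000TB). The function name is illustrative.

```python
DRIVE_TB = 61.44  # largest per-drive capacity mentioned in the text


def raw_capacity_pb(drives, drive_tb=DRIVE_TB):
    """Raw capacity in petabytes (decimal: 1 PB = 1000 TB)."""
    return drives * drive_tb / 1000


u2_2u = raw_capacity_pb(24)       # assumed: 24 U.2 drives in one 2U enclosure
e1l_2u = raw_capacity_pb(2 * 32)  # assumed: two 1U units of 32 E1.L drives each
ratio = e1l_2u / u2_2u

print(f"U.2  2U: {u2_2u:.2f} PB")
print(f"E1.L 2U: {e1l_2u:.2f} PB")
print(f"Capacity ratio: {ratio:.2f}x")
```

Under these assumptions the 24-drive U.2 enclosure comes to about 1.47PB raw and the 64-drive E1.L configuration to about 3.93PB. Usable capacity will be lower once RAID or erasure-coding overhead is accounted for.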
This density does not come at the cost of performance. The Solidigm D5-P5336 QLC SSD boasts some serious performance numbers to keep those AI workflows running optimally: up to 7,000 MB/s sequential read bandwidth and 3,000 MB/s sequential write.
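To relate those bandwidth figures to checkpointing, a rough back-of-the-envelope estimate divides checkpoint size by aggregate drive bandwidth. The sketch below assumes a hypothetical 10TB checkpoint striped evenly across 24 drives with perfect parallel scaling and no network, filesystem or CPU overhead, so it gives a best-case lower bound on time rather than a measured result. The function name and the scenario are illustrative.

```python
def transfer_seconds(size_tb, drives, per_drive_mb_s):
    """Best-case time to move a checkpoint, assuming it is striped evenly
    across all drives and bandwidth scales linearly (an idealisation)."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / (drives * per_drive_mb_s)


CHECKPOINT_TB = 10  # hypothetical checkpoint size
DRIVES = 24         # hypothetical enclosure population

write_s = transfer_seconds(CHECKPOINT_TB, DRIVES, 3000)  # sequential write figure
read_s = transfer_seconds(CHECKPOINT_TB, DRIVES, 7000)   # sequential read figure

print(f"Save:    {write_s / 60:.1f} minutes")
print(f"Restore: {read_s / 60:.1f} minutes")
```

Even in this idealised case, a save takes a couple of minutes, and any time the GPUs spend waiting on it is time they are not training. Slower storage stretches that window proportionally.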
Wrapping up
In conclusion, minimizing training inefficiencies caused by slow storage is paramount for successful AI model development. Frequent checkpointing, crucial for mitigating training risks, can be hindered by traditional storage solutions. Solidigm’s high-performance QLC drives address this challenge by providing exceptional read/write speeds and high capacities, allowing researchers to store massive datasets and perform efficient checkpointing.
The dense storage enclosures enabled by Solidigm's drive form factors further enhance data centre space utilization by maximizing storage capacity within a smaller footprint. With Solidigm's QLC drive portfolio, researchers can focus on what truly matters: developing groundbreaking AI models, without storage bottlenecks hindering their progress.