Guest Post: All-NVMe Flash Storage for AI and ML File Workloads was originally published on HPE’s Community Blog
Learn how HPE servers and the Qumulo File Data Platform deliver All-NVMe Flash storage for AI and ML/DL workloads to support growing unstructured data demands with high-throughput performance and ease of use.
Why AI, ML, and DL are optimal use cases for NVMe flash storage
Deep learning (DL) workflows use file sizes between 64 KB and 1 MB. Saturating a GPU-based artificial intelligence (AI) server, such as the HPE Apollo 6500 system, which needs at least 20 GB/s, requires thousands of HDDs. NVMe is the answer for machine learning (ML) and DL workloads because NVMe flash drives can deliver up to 1000x the performance of HDDs and can exceed 5x the performance of the fastest SATA SSDs in AI training scenarios.*
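As a rough back-of-envelope check on the "thousands of HDDs" figure, the arithmetic can be sketched as follows. The per-drive numbers (10 ms average access time, 200 MB/s sequential transfer rate) are illustrative assumptions rather than figures from this post, and the function name is hypothetical.

```python
import math


def hdds_needed(target_bytes_per_s, io_size_bytes,
                access_time_s=0.010, seq_bytes_per_s=200e6):
    """Estimate how many HDDs are needed to sustain a random-read target.

    access_time_s and seq_bytes_per_s are assumed values for a spinning
    disk; they do not come from the article.
    """
    # Each random I/O pays the positioning cost plus the transfer time.
    time_per_io = access_time_s + io_size_bytes / seq_bytes_per_s
    per_drive_throughput = io_size_bytes / time_per_io
    return math.ceil(target_bytes_per_s / per_drive_throughput)


TARGET = 20e9  # 20 GB/s, the figure quoted for a GPU server above

small_files = hdds_needed(TARGET, 64 * 1024)    # 64 KB random reads
large_files = hdds_needed(TARGET, 1024 * 1024)  # 1 MB random reads
print(small_files, large_files)
```

Under these assumptions, the 64 KB end of the range needs a drive count in the thousands, while 1 MB reads need a few hundred. The estimate is dominated by positioning time, which is the cost that flash removes.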
As we have seen in the previous blogs in this data store series, three main dimensions come into play when selecting a data platform for AI, ML, and DL workloads:
- Performance: ML/DL requires multi-gigabyte-per-second I/O rates. Storage systems must deliver the performance required during AI/ML training to avoid “starving” the GPUs and thereby prolonging the run.
- Scalability: More data is better! This is the AI mantra. Machine learning projects require huge data sets for model training, resulting in constant data growth over time.
- Operations: Data platform systems must be easy to use, deliver consistent performance to applications, and have limited downtime. Excessive downtime, spotty performance, or a need for extensive operational skills will delay AI projects and increase platform TCO.
Generally, existing storage systems sacrifice one or more of these dimensions:
- Direct-attached storage (DAS) is generally the initial choice of AI projects because it can provide consistent performance, but it presents scalability limits, creates isolated data sets, and makes it hard to share data sets across multiple computing units.
- Shared file systems like the Hadoop Distributed File System (HDFS) solve the capacity issues, but they present performance limits, especially for the small, random I/O patterns that are common in many DL use cases.
- Parallel file systems such as GPFS and Lustre have been designed for high-throughput performance and for sharing large data sets, but they are extremely complicated to operate.
Qumulo file data platform optimizes these three dimensions: performance, scale, and ease of use
With its scale-out, flash-first architecture and a file data platform purpose-built for massive concurrency across all data types, Qumulo delivers on all of these dimensions. It keeps required configuration and management complexity to a bare minimum. It allows seamless and linear scalability from TBs to PBs, all in a single namespace. And lastly, it provides the persistent high performance and concurrency needed to accelerate AI and ML workloads at scale.
Qumulo flash-first file data platform
Qumulo’s multi-protocol file data platform makes it easy for organizations to store, manage, and build applications and workflows with data in its native file form, on-premises and in the cloud, with real-time visibility and total freedom.
Qumulo is more economical than legacy storage while delivering leading performance. The solution provides real-time analytics to help save time and money while increasing performance. Continuous replication allows data to move where it’s needed, when it’s needed: on-premises, in the public cloud, or in multicloud environments. Built-in data protection provides integrated snapshots and copy to native S3.
Qumulo’s flash-first file data platform has been certified and optimized on the HPE Apollo 4000 systems and the HPE ProLiant DL325 Gen 10 Plus server family to deliver a cost-effective, high-performance solution at petabyte scale, designed for AI-centric workloads.
Here is a high-level architecture diagram of the Qumulo File Data Platform.
The Qumulo File Data Platform includes powerful real-time analytics for insight into data usage and performance, data security with software-based encryption, and data protection with data services such as continuous replication and snapshots. It also simplifies the management of massive amounts of unstructured data. The Qumulo File Data Platform is designed to scale on demand with ease.
Qumulo’s data services allow data stored in a Qumulo File Data Platform to be viewed both in its current form and in previous versions via snapshots. These snapshots use a unique write-out-of-place methodology that only consumes space when changes occur. Snapshot policies can also be linked with replication policies. This enables snapshots to be replicated to a second Qumulo file data platform, and it enables frequent snapshots to be kept on one Qumulo cluster and less frequent snapshots on another, which is a common enterprise strategy for protecting against data loss and ransomware.
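To illustrate why a write-out-of-place snapshot only consumes space when data changes, here is a minimal toy model. It is a sketch of the general technique, not Qumulo's implementation; the class and method names are hypothetical.

```python
class SnapshotVolume:
    """Toy block store in which snapshots share unchanged blocks."""

    def __init__(self):
        self.store = {}      # block id -> bytes (physical storage)
        self.live = {}       # logical address -> block id (current view)
        self.snapshots = {}  # snapshot name -> frozen copy of the map
        self._next_id = 0

    def write(self, address, data):
        # Never overwrite in place: always allocate a fresh block.
        block_id = self._next_id
        self._next_id += 1
        self.store[block_id] = data
        self.live[address] = block_id

    def snapshot(self, name):
        # Copies only the address map, not the data blocks.
        self.snapshots[name] = dict(self.live)

    def read(self, address, snapshot=None):
        view = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[view[address]]

    def blocks_in_use(self):
        # Blocks referenced by the live view or by any snapshot.
        referenced = set(self.live.values())
        for view in self.snapshots.values():
            referenced.update(view.values())
        return len(referenced)


vol = SnapshotVolume()
vol.write(0, b"alpha")
vol.write(1, b"beta")
vol.snapshot("monday")
before = vol.blocks_in_use()  # the snapshot shares both blocks with the live view
vol.write(1, b"gamma")        # the change lands in a new block
after = vol.blocks_in_use()
```

Taking the snapshot adds no data blocks. Only the later write to address 1 adds one, and the old contents stay readable through the "monday" snapshot.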
Replication enables users to copy, move, and synchronize data across multiple Qumulo File Data Platforms. This replication technology offers two core capabilities: efficient data movement and granular identification of changed data. Qumulo’s replication is continuous (any new changes to a replicated directory are identified and moved), asynchronous, and unidirectional.
Object store replication enables any Qumulo File Data Platform to treat a cloud object storage service (e.g., Amazon S3) as a suitable replication target. Using Qumulo Shift, users can copy data from a Qumulo namespace to a cloud object store, either once or on a continuous basis, and vice versa. Data moved to an object store is stored in an open and non-proprietary format, enabling creators to leverage that data via applications that connect directly to the Amazon S3 cloud object store, in Amazon S3 native format.
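The idea of unidirectional, change-based replication can be sketched with two in-memory dictionaries standing in for a replicated directory on two clusters. This is an assumption-heavy illustration: a real system would track changes as they happen rather than compare every file, and mirroring deletions to the target is a choice made here for the sketch. `replicate_once` is a hypothetical name.

```python
import hashlib


def digest(data):
    """Content fingerprint used to detect changed files."""
    return hashlib.sha256(data).hexdigest()


def replicate_once(source, target):
    """Push changes from source to target, never the other way.

    source and target are plain dicts mapping path -> bytes.
    Returns the sorted list of paths that were copied or deleted.
    """
    changed = []
    for path, data in source.items():
        if path not in target or digest(target[path]) != digest(data):
            target[path] = data  # move only data that differs
            changed.append(path)
    for path in list(target):
        if path not in source:   # assumption: deletions are mirrored
            del target[path]
            changed.append(path)
    return sorted(changed)


source = {"a.txt": b"1", "b.txt": b"2"}
target = {}
first_pass = replicate_once(source, target)   # copies both files
source["b.txt"] = b"3"
second_pass = replicate_once(source, target)  # copies only the changed file
```

Calling `replicate_once` repeatedly in the background approximates "continuous" and "asynchronous": the source never waits on the target, and each pass moves only what changed since the last one.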
Quotas enable users to control the growth of any subset of a Qumulo namespace. Quotas act as independent limits on the size of any directory, preventing data growth when the capacity limit is reached.
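The behavior described above, with independent limits on any directory, can be sketched as a small bookkeeping table. This is a hypothetical model and not a Qumulo API: `QuotaTable`, `set_quota`, and `try_write` are names invented for illustration.

```python
from pathlib import PurePosixPath


class QuotaTable:
    """Hypothetical per-directory quota bookkeeping."""

    def __init__(self):
        self.limits = {}  # directory -> limit in bytes
        self.usage = {}   # directory -> bytes stored beneath it

    def set_quota(self, directory, limit_bytes):
        self.limits[PurePosixPath(directory)] = limit_bytes

    def try_write(self, file_path, size_bytes):
        """Account for a new file unless an enclosing quota would be exceeded."""
        # Every directory above the file, e.g. /a/b/f -> /a/b, /a, /
        dirs = list(PurePosixPath(file_path).parents)
        for d in dirs:
            limit = self.limits.get(d)
            if limit is not None and self.usage.get(d, 0) + size_bytes > limit:
                return False  # growth is blocked at this directory
        for d in dirs:
            self.usage[d] = self.usage.get(d, 0) + size_bytes
        return True


quotas = QuotaTable()
quotas.set_quota("/projects/ml", 100)
ok_first = quotas.try_write("/projects/ml/train.bin", 60)   # fits under the limit
ok_second = quotas.try_write("/projects/ml/val.bin", 60)    # would exceed 100 bytes
ok_other = quotas.try_write("/projects/other/data.bin", 60) # no quota applies here
```

Because each limit is checked independently, a quota on one directory does not restrict its siblings, and nested quotas all have to be satisfied for a write to go through.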
Qumulo file system
Qumulo’s file data platform is a software-defined, distributed, shared-nothing architecture that runs on bare metal on data center hardware, including HPE ProLiant Gen 10 servers and HPE Apollo Gen 10 servers. It also runs natively on public cloud infrastructure. Qumulo scales linearly as the amount of data grows. Simply add nodes and the Qumulo software automatically rebalances data and performance across the cluster.
The Qumulo file system organizes all the data it stores into a single namespace. This namespace is POSIX-compliant and maintains the permissions and identity information that support the full semantics available over the NFS or SMB protocols as well as a REST API. Like all file systems, the Qumulo file system organizes data into directories and presents data to SMB and NFS clients. However, the Qumulo File Data Platform has several unique properties: the use of B-trees, a real-time analytics engine, and cross-protocol permissions (XPP).
Qumulo scalable block storage
Scalable block storage (SBS) is the foundation of the Qumulo File Data Platform. SBS leverages these core technologies to enable scale, portability, protection, and performance: a virtualized block system, erasure coding, a global transaction system, and an intelligent cache.
The storage capacity of a Qumulo system is conceptually organized into a single, protected virtual address space. Each protected address within that space stores a 4 KB block of bytes. Each of those “blocks” is protected using an erasure coding scheme to ensure redundancy in the face of storage device failure. The entire file system is stored within the protected virtual address space provided by SBS, including the directory structure, user data, file metadata, analytics, and configuration information.
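To make block-level erasure coding concrete, here is a minimal sketch that uses single XOR parity, the simplest possible erasure code. This post does not say which scheme SBS uses, so the code below is a stand-in for the general idea. Production systems typically use codes that survive more than one failure.

```python
BLOCK_SIZE = 4096  # 4 KB blocks, as described above


def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    result = bytearray(BLOCK_SIZE)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def encode(data_blocks):
    """Return the stripe: the data blocks plus one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data blocks, each imagined to live on a different device.
data = [bytes([n]) * BLOCK_SIZE for n in (1, 2, 3)]
stripe = encode(data)

# Simulate losing the device that holds block 1 and rebuild it.
rebuilt = recover(stripe, lost_index=1)
assert rebuilt == data[1]
```

In this sketch, three data blocks plus one parity block give 75% storage efficiency. Because protection is applied per block rather than per file, that efficiency does not depend on file size.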
SBS uses the principles of massively scalable distributed databases and is optimized for the specialized needs of file-based data. SBS is the block layer of the Qumulo File Data Platform, which makes the file system simpler to implement and extremely robust. SBS also gives the file system massive scalability, optimized performance, and data protection.
Qumulo’s block-based protection, as implemented by SBS, provides outstanding performance in environments that have petabytes of data and workloads with mixed file sizes. SBS has many benefits, including:
- Fast rebuild times in case of a failed disk drive
- Ability to continue normal file operations during rebuild operations
- No performance degradation due to contention between normal file writes and rebuild writes
- Equal storage efficiency for small files and large files
- Real-time accurate reporting of usable space
- Efficient transactions that allow Qumulo clusters to scale to many hundreds of nodes
- Built-in tiering of hot/cold data that gives flash performance at archive prices
Qumulo’s file data platform includes cloud-based monitoring and trend analysis:
- Cloud monitoring includes proactive detection of events such as disk failures to prevent problems before they happen.
- Historical trends help lower costs and optimize workflows for the best use of your storage investment.
To learn more about Qumulo, see the Qumulo technical guide.
Qumulo’s file data platform has been optimized for the HPE ProLiant DL325 Gen 10 Plus servers using All-NVMe and the very latest industry-standard components. HPE ProLiant servers enable the extremely consistent, scalable, and high-performance file storage that is needed to support AI and ML workloads.
In addition to the All-NVMe configuration, the Qumulo File Data Platform can be configured in hybrid mode, combining an all-flash SSD tier for high performance and an HDD tier for lower cost. In this configuration, files can be automatically moved across tiers to optimize performance and costs throughout the AI development lifecycle. Qumulo has a flash-first architecture in which 100% of the writes go to SSDs; with the intelligent machine learning cache, most reads come from either RAM or SSDs.
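The flash-first idea in hybrid mode can be sketched with a toy two-tier store in which every write lands on the SSD tier and the coldest blocks are demoted to HDD. The least-recently-used policy below is a deliberately simple substitute for the intelligent cache mentioned above, whose behavior this post does not describe. The RAM tier is omitted and all names are hypothetical.

```python
from collections import OrderedDict


class HybridStore:
    """Toy two-tier store: writes go to SSD, cold blocks move to HDD."""

    def __init__(self, ssd_capacity):
        self.ssd_capacity = ssd_capacity
        self.ssd = OrderedDict()  # key -> data, least recently used first
        self.hdd = {}

    def _touch(self, key, data):
        # Place (or refresh) a block on SSD, then demote the coldest blocks.
        self.ssd[key] = data
        self.ssd.move_to_end(key)
        while len(self.ssd) > self.ssd_capacity:
            cold_key, cold_data = self.ssd.popitem(last=False)
            self.hdd[cold_key] = cold_data

    def write(self, key, data):
        self.hdd.pop(key, None)  # drop any stale copy on the HDD tier
        self._touch(key, data)

    def read(self, key):
        """Return (data, name of the tier that served the read)."""
        if key in self.ssd:
            data = self.ssd[key]
            self._touch(key, data)
            return data, "ssd"
        data = self.hdd.pop(key)  # promote the block on access
        self._touch(key, data)
        return data, "hdd"


store = HybridStore(ssd_capacity=2)
for name in ("a", "b", "c"):
    store.write(name, name.encode())  # "a" is demoted once "c" arrives

first_read = store.read("a")   # served from HDD, then promoted
second_read = store.read("a")  # now served from SSD
```

A working set that is read repeatedly ends up on flash, while data that has gone cold drifts to the cheaper tier.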
Why HPE and Qumulo are better together
HPE All-NVMe Flash systems with Qumulo’s file data platform effectively address:
- Growing unstructured data needs: Scale and manage billions of files with instant control, at lower cost and with high performance, on-premises, in the cloud, or spanning both, now and into the future.
- High-throughput performance needs for AI and ML applications and services: Feed GB/s to GPU-based servers.
- Ease-of-operation needs: Lower TCO and reduce system downtime.
Read about HPE solutions for Qumulo. And stay tuned to this blog series for more information on HPE data store solutions for AI and advanced analytics.
Watch this on-demand webinar to learn how Qumulo and HPE are delivering simplicity and performance in unstructured data environments. It features Ben Gitenstein, Qumulo’s VP of Product, and Stephen Bacon, Director, Product Management & Systems Engineering for Scale-Out Data Analytics & Data Storage Platforms at HPE.