Accelerating AI Data Workflows with the Qumulo Cloud Data Platform

December 17, 2024

Authored by:

Douglas Gourlay
Co-authored by Dack Busch, Steve Philips, and Brandon Whitelaw

Introduction

In the era of artificial intelligence (AI) and machine learning (ML), efficiently processing massive amounts of unstructured data is table stakes. Enterprises and government agencies increasingly leverage AI to gain insights, improve operations, and drive innovation. However, data management challenges can hinder AI initiatives, especially in hybrid and multi-cloud environments. The Qumulo Cloud Data Platform addresses these challenges by providing a seamless, high-performance solution for AI Data Acceleration in the public cloud.

Qumulo Cloud Data Platform Overview

The Qumulo Cloud Data Platform is a comprehensive solution that unifies data storage and management across on-premises and public, private, and hybrid cloud environments. It comprises three core components:

On-Premises Qumulo Clusters: These are deployed across data centers, research campuses, hospitals, and other major locations, hosting exabytes of unstructured file and object data. They provide high-performance storage optimized for large-scale workloads across all points on the price/performance curve.
Cloud-Native Qumulo Instances: Deployed in major cloud providers such as AWS, Azure, Google Cloud Platform (GCP), and Oracle Cloud Infrastructure (OCI), these instances extend Qumulo’s capabilities to the cloud, enabling scalable and flexible data storage with performance comparable to parallel file systems and economics comparable to on-premises storage offerings.
Global Data Fabric: This is the data backbone that integrates the on-premises and multi-cloud instances into a cohesive system. It allows any AI model or service to access the same data simultaneously, whether the GPUs are on-premises or in the cloud, so that GPU resources can be chosen based on availability and pricing. It offers:
- Strictly Consistent Global Namespace (GNS): Ensures data consistency across all locations.
- Edge Read/Write GNS Caching: Provides low-latency access to frequently used data at the edge and across geographically dispersed data centers, clouds, or a combination of both.
- Clustered Read/Write Persistent Data Store Caching: Enhances performance by caching data closer to compute resources and reduces S3/Blob/GCS API transaction costs.
- Network-Aware Quality of Service and Efficient Network Utilization: Optimizes data transfer across wide area networks (WANs) based on network conditions.

Accelerating AI Workloads

Intelligent Data Movement

The Qumulo Cloud Data Platform enables intelligent and efficient data movement across the Global Data Fabric. Data can be streamed on demand at the block level from any location and delivered across the WAN to cloud-based read/write clusters. These clusters use low-cost, high-durability S3 storage as a persistence layer and cache data intelligently on NVMe instance-attached disks in EC2. As a result, data is readily available to feed GPU instances at speeds unmatched by traditional cloud-based file storage offerings.
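To make the caching idea concrete, the following is a minimal sketch of a read-through block cache placed in front of an object store. It is not Qumulo's implementation: the class names ObjectStore and BlockCache, the tiny block size, and the least-recently-used eviction policy are all assumptions chosen for illustration. The sketch only shows the general pattern of fetching blocks on demand and serving repeat reads from a faster local tier.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is easy to follow


class ObjectStore:
    """Hypothetical stand-in for an S3-like persistence layer; counts GETs."""

    def __init__(self, data: bytes):
        self._data = data
        self.get_calls = 0

    def get_block(self, index: int) -> bytes:
        self.get_calls += 1
        start = index * BLOCK_SIZE
        return self._data[start:start + BLOCK_SIZE]


class BlockCache:
    """Hypothetical read-through LRU cache standing in for local NVMe."""

    def __init__(self, store: ObjectStore, capacity_blocks: int):
        self._store = store
        self._capacity = capacity_blocks
        self._blocks = OrderedDict()  # block index -> block bytes, oldest first

    def read_block(self, index: int) -> bytes:
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        block = self._store.get_block(index)  # cache miss: fetch on demand
        self._blocks[index] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used
        return block

    def read(self, offset: int, length: int) -> bytes:
        """Read a byte range, fetching only the blocks that cover it."""
        if length <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self.read_block(i) for i in range(first, last + 1))
        skip = offset - first * BLOCK_SIZE
        return data[skip:skip + length]


store = ObjectStore(b"abcdefghijklmnop")  # 16 bytes, i.e. 4 blocks
cache = BlockCache(store, capacity_blocks=2)
cache.read(2, 4)  # cold read: touches blocks 0 and 1
cache.read(0, 8)  # warm read: the same two blocks, served from the cache
print(store.get_calls)  # 2
```

In this toy run the second read generates no additional requests to the backing store. A production cache would also need to handle writes, consistency across nodes, and prefetching, none of which are modeled here.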

Performance Enhancements

Reduced GPU Execution Time: Qumulo improves GPU execution time by up to 40% by accelerating data transfer from the Cloud Native Qumulo-powered file storage to the cloud-hosted GPU system, avoiding the S3 to GPU data copy phase. This optimization addresses the bottleneck and costs often caused by loading data from object or file stores to local NVMe storage on GPU instances before training execution starts.
Cost Savings on S3 API Calls: The Cloud Data Platform employs machine learning-based predictive read caching, and it compacts and compresses the write cache so that many small writes are combined into fewer S3 API calls. This approach reduces S3 API charges by up to 90%, resulting in significant cost savings.
Optimized GPU Instances: Because data is served from the Qumulo cache rather than staged locally, GPU EC2 instances no longer need local NVMe storage, which allows lower-cost GPU instances to be used without compromising performance.
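The effect of combining writes on API call counts can be illustrated with a small sketch. The code below is an assumption-laden toy, not Qumulo's write cache: CountingStore, CoalescingWriteCache, and put_reduction are hypothetical names, the flush threshold is arbitrary, and compression is not modeled. It buffers small writes and flushes them as one larger object, then reports the fraction of PUT requests avoided compared with issuing one PUT per write.

```python
class CountingStore:
    """Hypothetical stand-in for an object store; counts PUT requests."""

    def __init__(self):
        self.put_calls = 0
        self.objects = []

    def put(self, payload: bytes) -> None:
        self.put_calls += 1
        self.objects.append(payload)


class CoalescingWriteCache:
    """Buffers small writes and flushes them together as one object."""

    def __init__(self, store: CountingStore, flush_threshold: int):
        self._store = store
        self._threshold = flush_threshold  # bytes buffered before a flush
        self._pending = []
        self._pending_bytes = 0

    def write(self, payload: bytes) -> None:
        self._pending.append(payload)
        self._pending_bytes += len(payload)
        if self._pending_bytes >= self._threshold:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        self._store.put(b"".join(self._pending))  # one PUT for many writes
        self._pending = []
        self._pending_bytes = 0


def put_reduction(num_writes: int, write_size: int, threshold: int) -> float:
    """Return the fraction of PUT calls avoided versus one PUT per write."""
    store = CountingStore()
    cache = CoalescingWriteCache(store, threshold)
    for _ in range(num_writes):
        cache.write(b"x" * write_size)
    cache.flush()  # push out any remainder below the threshold
    return 1 - store.put_calls / num_writes


# Illustrative numbers only: 1000 writes of 8 bytes, flushed every 64 bytes.
print(put_reduction(1000, 8, 64))  # 0.875
```

The reduction in this sketch depends entirely on the chosen write size and threshold, so the printed figure says nothing about real-world savings. A real write cache would also have to persist buffered data so that a failure before a flush does not lose writes.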

Cloud-Based AI for Enterprises

Many enterprises and government agencies do not require full-time GPU clusters for training workloads. Qumulo’s position is that generative AI (GenAI) workloads—training, tuning, and inference—will primarily be cloud-based for most organizations. The advantages include:

Maintained Data Governance: Existing data provenance and governance requirements are upheld, ensuring compliance and data security with reduced risk.
Reduced Capital Expenditure: Eliminates the need for substantial investment in GPU acquisition and reduces runtime processing costs.
Elastic Resource Consumption: Provides the flexibility to scale resources up or down based on workload demands, optimizing operational expenses. This is critically important because 80% of AI development involves wrangling data and refining models before running the training job.
Accelerated Processing Time: Expedites AI workflows by up to 40%, enhancing agility and time-to-insight.

Leveraging Public and Commercial GenAI Models

Qumulo recognizes that most enterprises will consume public or commercial GenAI models rather than build their own proprietary ones. To support this, Qumulo has developed the following:

Robust API Integration: The Qumulo Cloud Data Platform offers APIs that can interface with cloud-based AI services, including large language models (LLMs) and the AI/ML development tools available today from major cloud providers such as Microsoft and AWS.
Secure Data Handling: Leveraging techniques such as Retrieval-Augmented Generation (RAG) and proper data governance policies, enterprises can utilize public or open LLMs while ensuring their data is not used in future training datasets, thereby maintaining data privacy and intellectual property protection.
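As a rough illustration of the Retrieval-Augmented Generation pattern mentioned above, the sketch below selects the documents most relevant to a question and places only those excerpts into a prompt. Everything in it is an assumption for illustration: the function names, the sample documents, and the simple word-overlap ranking (real systems typically use vector embeddings). It does not call any model or any Qumulo API; it stops at prompt construction.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: dict, top_k: int = 2) -> list:
    """Rank documents by shared query words; return the top document names."""
    query_words = tokenize(query)
    scored = []
    for name, text in documents.items():
        overlap = len(query_words & tokenize(text))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


def build_prompt(query: str, documents: dict, top_k: int = 2) -> str:
    """Assemble a prompt that carries only the retrieved excerpts."""
    names = retrieve(query, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in names)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical documents standing in for files on an enterprise share.
docs = {
    "policy.txt": "Backups are retained for ninety days.",
    "network.txt": "The WAN link connects two data centers.",
    "menu.txt": "Lunch is served at noon.",
}
print(retrieve("How long are backups retained?", docs))  # ['policy.txt']
```

The point of the pattern is that enterprise data is supplied to the model at query time rather than being used to train or fine-tune it. Whether a given provider retains or reuses prompt contents still depends on that provider's terms and on the governance policies in place.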

Conclusion

The Qumulo Cloud Data Platform offers a robust solution for accelerating AI data workflows in the public cloud. By unifying on-premises and cloud environments through its Global Data Fabric, Qumulo addresses the challenges of data management and movement at scale. Enterprises can achieve significant performance gains and cost reductions while maintaining compliance with data governance standards. Furthermore, by facilitating integration with public GenAI models while safeguarding data, Qumulo empowers organizations to leverage AI technologies effectively without compromising security or incurring unnecessary expenses.

Key Benefits

Flexible: Addresses the performance, capacity, and security needs of the entire AI data lifecycle (data ingest, data transformation, and data loading), enabling a seamless end-to-end data pipeline.
Performance: Faster data load times improve GPU execution time and economics.
Limitless: Provides seamless, secure data access between public and private clouds and between organizations, opening up transformative business and research opportunities.
Cost Efficiency: Up to 90% reduction in S3 API charges; enables utilization of lower-cost GPU instances without local NVMe.
Scalability: Elastic consumption of GPU resources.
Data Governance: Maintains existing data provenance and compliance requirements.
Data Durability: Multi-AZ support and parallel S3 erasure coding further enhance the legendary durability of AWS S3.
Security: Prevents enterprise data from being used in external model training through secure API integrations.

By adopting the Qumulo Cloud Data Platform, organizations are equipped with the tools to handle the demands of modern AI workloads efficiently and securely. This enables them to position themselves at the forefront of AI innovation, delivering competitive advantage and enabling transformative business opportunities.