Scalable Storage for Data Analytics Workflows
The ability to make informed decisions from large datasets is critical for today’s enterprises. The intelligence that companies gain from data analytics fuels their growth and ability to compete.
For example, online advertisers rely on data analytics to optimize ad yield and predict buyer behavior. Social media platforms use it to gain insight into what’s important to their users. Logistics companies analyze vast amounts of data from sensors and devices (IoT) to lower costs and speed delivery. Data analytics is central to the development of autonomous vehicle technologies.
Data sources for analysis include mobile phones, sensors and wearable devices, as well as applications and infrastructure in the data center and the cloud.
Adequate storage is a pressing problem for data analytics of all kinds. Three questions stand out:
- How should storage be attached to the compute resources to ensure high availability of data with low latency and horizontal scalability?
- What are the requirements for a file storage system to serve these demanding workloads?
- What are the best strategies for scaling storage over time?
Storage demands of data analytics
Data analytics can generate insights from massive data sets or data streams with a variety of workflows. Two of these workflows are batch (big data) analytics and streaming analytics.
Whether batch or streaming, data analytics demands high performance from the file storage system. One solution has been to attach the compute resources directly to the storage resources. Direct-attached storage creates data silos and is difficult to manage and scale efficiently, but the idea that proximity would ensure performance drove its popularity. Its use for data analytics arose from two assumptions: that disk bandwidth exceeds network bandwidth, and that disk I/O makes up a considerable fraction of a task’s lifetime.
With increased networking speeds and more computationally complex analytic techniques, these assumptions no longer hold. Highly scalable network-attached storage can now outperform direct-attached storage. In addition, storage accessed via a network is cost competitive and won’t create data silos. Today, a more effective strategy for data analytics workflows, such as those that use Apache Spark or Spark Streaming, is to scale compute and storage separately with high-performance network-attached storage.
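The two assumptions above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are invented for this example, and the disk throughput, data size and compute time are round, hypothetical numbers rather than measurements of any system.

```python
def link_mb_per_s(gigabits_per_s: float) -> float:
    """Convert a nominal link speed in Gb/s to MB/s (decimal units, no protocol overhead)."""
    return gigabits_per_s * 1000 / 8


def io_fraction(data_mb: float, bandwidth_mb_per_s: float, compute_s: float) -> float:
    """Share of a task's lifetime spent on I/O, assuming I/O and compute do not overlap."""
    io_s = data_mb / bandwidth_mb_per_s
    return io_s / (io_s + compute_s)


if __name__ == "__main__":
    # Hypothetical figure: one hard disk streaming sequentially at about 150 MB/s.
    disk_mb_per_s = 150.0
    for gbps in (1, 10):
        print(f"{gbps} Gb/s link: {link_mb_per_s(gbps):.0f} MB/s vs disk {disk_mb_per_s:.0f} MB/s")

    # Hypothetical task: read 1,000 MB, then compute for 10 s or for 100 s.
    for compute_s in (10.0, 100.0):
        share = io_fraction(1000.0, disk_mb_per_s, compute_s)
        print(f"compute {compute_s:.0f} s: I/O is {share:.0%} of the task")
```

Under these assumed figures, a 1 Gb/s link is slower than a single disk while a 10 Gb/s link is several times faster, and the I/O share of a task shrinks as the computation per byte grows. Real results depend on the hardware, the protocol overhead and the workload.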
QF2 for data analytics
Qumulo File Fabric (QF2) is a modern file storage system that has the performance, scalability and enterprise features required by data analytics workloads. QF2 runs on standard hardware on premises and as EC2 instances on AWS.
Get your results faster
QF2 has better sustained read throughput than direct-attached storage for analytic workloads. QF2, operating over today’s fast networks, outperforms HDFS infrastructure. The performance edge of QF2 comes from its hybrid SSD/HDD architecture and its advanced distributed file system technology.
Buy only the storage you need
QF2 decouples storage from compute. With QF2, customers have control over how much storage they buy and can avoid overprovisioning. Customers save money by buying only the storage they need, regardless of how their compute cluster grows.
In addition, QF2 uses efficient data protection based on erasure coding at the block level, instead of inefficient and expensive mirrored file copies. Efficient protection gives you more usable capacity on your storage system. You’ll save money on disks as well as infrastructure costs such as power and cooling.
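The capacity argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are invented for this example, and the 6+2 erasure coding layout and the three-copy mirroring are hypothetical parameters, not a statement of how QF2 is configured.

```python
def usable_fraction_mirroring(copies: int) -> float:
    """Fraction of raw capacity that holds unique data when each block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 / copies


def usable_fraction_erasure(data_blocks: int, parity_blocks: int) -> float:
    """Fraction of raw capacity that holds data when each stripe has data and parity blocks."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need at least one data block and a non-negative parity count")
    return data_blocks / (data_blocks + parity_blocks)


def raw_tb_needed(usable_tb: float, usable_fraction: float) -> float:
    """Raw capacity required to provide `usable_tb` of usable space."""
    return usable_tb / usable_fraction


if __name__ == "__main__":
    target_tb = 300.0  # hypothetical usable capacity target
    mirrored = raw_tb_needed(target_tb, usable_fraction_mirroring(3))
    erasure = raw_tb_needed(target_tb, usable_fraction_erasure(6, 2))
    print(f"3 mirrored copies: {mirrored:.0f} TB raw")
    print(f"6+2 erasure coding: {erasure:.0f} TB raw")
```

With these assumed parameters, erasure coding needs well under half the raw capacity that three mirrored copies would need for the same usable space. The actual ratio depends on the protection scheme chosen.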
Eliminate data silos
Solve storage problems in real time
Customers need to do more than warehouse their data. They need to manage it. QF2 lets administrators find and solve problems in real time. For example, an administrator can easily determine I/O hotspots and apply capacity quotas that take effect immediately. QF2 makes it easy to manage projects and users with insight into how the storage is being used.
Run in the cloud and on premises
Many data analytics workloads can benefit from running in the cloud, as well as on premises.
QF2 operates both on premises and on AWS, with the highest performance and best scalability of any file-based cloud offering. With QF2, cloud and on-prem clusters work together to provide scalable performance with a unified file storage fabric.
QF2 uses continuous replication to move data where it’s needed, when it’s needed. Continuous replication means customers can easily transfer data from their on-prem QF2 cluster to their QF2 cluster in AWS, perform their computations, and then transfer the results back to the on-prem storage.
The ability to run the same data analytics workflow in cloud and on-premises environments ensures consistency and reduces development costs. It also gives customers the ability to choose where they place their workloads based on business decisions rather than technical limitations.
Data analytics workflow
Here is an example of a streaming data analytics workflow that shows QF2 as the central storage for the entire process, from ingesting the data to displaying it and acting on it.
Input can come from devices, such as cell phones, scientific instruments, autonomous vehicles and serial devices. It can also come from applications, which typically store their data in QF2 and then send a link to the event data flow software packages. The compute resources process the data and both store and retrieve files from QF2. Finally, the results are delivered and either displayed as information on a dashboard or used to trigger a particular action, such as a security alert.
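The stages described above (ingest, process, deliver) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Qumulo's software: a temporary local directory stands in for a shared file system mount, and the function names, file layout, event fields and alert threshold are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def ingest(storage: Path, events: list[dict]) -> Path:
    """Write incoming events to shared storage as one JSON object per line."""
    path = storage / "incoming" / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for event in events:
            handle.write(json.dumps(event) + "\n")
    return path


def process(storage: Path, source: Path, threshold: float) -> Path:
    """Read events from storage, flag readings above the threshold, store a summary."""
    lines = source.read_text(encoding="utf-8").splitlines()
    readings = [json.loads(line) for line in lines if line]
    alerts = [r for r in readings if r["value"] > threshold]
    summary = {"count": len(readings), "alerts": alerts}
    result = storage / "results" / "summary.json"
    result.parent.mkdir(parents=True, exist_ok=True)
    result.write_text(json.dumps(summary), encoding="utf-8")
    return result


def deliver(result: Path) -> list[str]:
    """Turn stored results into alert messages for a dashboard or alerting system."""
    summary = json.loads(result.read_text(encoding="utf-8"))
    return [f"ALERT: {a['device']} reported {a['value']}" for a in summary["alerts"]]


def run_demo() -> list[str]:
    """Run all three stages against a temporary directory standing in for the mount."""
    with tempfile.TemporaryDirectory() as tmp:
        storage = Path(tmp)
        events = [
            {"device": "sensor-1", "value": 42.0},
            {"device": "sensor-2", "value": 91.5},
        ]
        source = ingest(storage, events)
        result = process(storage, source, threshold=80.0)
        return deliver(result)


if __name__ == "__main__":
    for message in run_demo():
        print(message)
```

Every stage reads from and writes to the same storage path, which is the point of the workflow: the devices, the compute resources and the dashboard share one file system rather than copying data between silos.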
“Managing data with QF2 is so simple it’s hard to describe the impact. It has given us tremendous ROI in terms of time saved and problems eliminated, and having that reliable storage we can finally trust makes us eager to use it more broadly throughout the company.”
John Beck — IT Manager Hyundai MOBIS