Setting the stage: The emergence of the cloud data lake
The public cloud has fundamentally changed the economics and competitive dynamics of nearly every industry. CIOs and CEOs from the smallest startups to the biggest multinationals are wrestling with the ramifications of limitless infrastructure and services available to them, and their competitors, with just a few commands and a credit card. Barriers to entering new markets are falling and time to market for new products is shrinking, which makes leaders both excited and terrified.
Unstructured data is at the heart of these fundamental shifts. Images, videos, log files, genomes, maps, and text files are the raw materials these organizations use to create new innovation. Consider a research computing center at one of the world’s largest universities. This group serves scientists from around the world as they seek to understand the origins of our sun and the mutations of a gene. For this research center, success is defined by attracting the most talented scientists tackling the biggest problems. The elasticity of the public cloud makes that possible, by enabling the center to create new compute and storage resources for their best researchers with a few lines of code, and to share their end results around the world.
But to make that elasticity work, the research center needs an accessible data layer, open enough to foster collaboration but controlled enough to protect intellectual property. The public clouds have solved this problem with a well-known architecture known as the “data lake.” These large unstructured data repositories combine multiple data sources into one pool, monitored and governed by shared management systems. With the right permissions, any researcher can access that data from anywhere to run their experiments.
The challenge: File-based data
The cloud data lake works well for many types of data. If the data is mostly finished (i.e., it won’t change very much), is application-independent, and has an infrequent or streaming-only IO pattern, then the cloud data lake serves it well. However, not all unstructured data fits that mold. Some data is created and processed by a file-based application, changes frequently as it is processed, and has a “small update” IO pattern (where the file is changed repeatedly through the course of a workflow). The legacy cloud data lake fails these data types.
Take, for example, the videos and images that modern studios use to create a film. Much like our research center, the modern studio competes for the most talented artists and uses the elasticity of the cloud to make those professional magicians productive at any hour of the day and without delay. However, the applications that edit and transform raw images and videos into film are file-based, and the artist workflow is made of many changes to many files as the film moves through the digital production line. A legacy cloud data lake built solely on, for example, AWS S3 will not serve this workload well.
The breakdown is both technical and economic. The technical challenge lies at the heart of the current approach to data lakes. Most cloud providers build their data lakes around object systems (e.g., S3 in AWS). While powerfully scalable and highly customizable, these systems fundamentally assume that individual objects are “immutable”: when a change is made to an object, the system does not update the object in place; it destroys and re-creates the entire object. For a file-based workflow this is a real problem, because file-based applications assume that the underlying data can be changed repeatedly. Without that assumption, our research center and film studio must re-work their applications or ask their end users to change their workflows, both of which make it harder for those organizations to attract the best talent in their industries.
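A minimal sketch of the difference, in illustrative Python rather than any particular SDK: a file system rewrites only the changed bytes in place, while an immutable object store must read and re-write the whole object for even a tiny edit. The dict-backed `store` below stands in for an object service.

```python
import os
import tempfile

def update_file(path, offset, data):
    """File semantics: seek to the offset and rewrite only the changed bytes."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return len(data)  # bytes moved

def update_object(store, key, offset, data):
    """Object semantics: objects are immutable, so a small edit means
    reading the whole object, modifying it, and re-writing it in full."""
    body = bytearray(store[key])           # full read (GET)
    body[offset:offset + len(data)] = data
    store[key] = bytes(body)               # full re-write (PUT) replaces the object
    return len(store[key])                 # bytes moved

# A 1 MiB asset receiving a 4-byte edit:
blob = bytes(1024 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(blob)
    path = f.name
store = {"asset.mov": blob}

print(update_file(path, 100, b"EDIT"))               # prints 4
print(update_object(store, "asset.mov", 100, b"EDIT"))  # prints 1048576
os.unlink(path)
```

Repeated across the thousands of small writes in a production workflow, that per-edit amplification is the technical gap described above.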
The economic breakdown has to do with the pricing models of cloud object storage services. The major object services charge customers for individual operations against their data. Take as an example a relatively small 20TB object data set. In S3, the cost to store this data is only ~$420/mo., and if the data is accessed infrequently that will be the only bill the research center or film studio sees. However, as soon as small random IO is performed against the data, that bill can skyrocket to over $100,000/mo. The reason is simple: object services charge per operation, so the bill scales with IO. So long as the data set is at the heart of an IOPS-heavy workload, the economic model of today’s cloud data lake breaks down.
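The arithmetic can be sketched with approximate, S3-like list prices. The per-GB and per-request figures below are assumptions for illustration, not a quote, and the sustained write rate is a hypothetical IOPS-heavy workload:

```python
# Assumed, S3-like list prices (illustrative only):
STORAGE_PRICE = 0.021          # $ per GB-month
WRITE_REQUEST_PRICE = 0.005 / 1000  # $ per write request

storage_gb = 20 * 1024         # 20 TB data set
storage_bill = storage_gb * STORAGE_PRICE
print(f"storage:  ${storage_bill:,.2f}/mo")   # ≈ $430/mo

# Now add a sustained small-random-write workload:
iops = 8000                    # hypothetical sustained writes per second
seconds_per_month = 30 * 24 * 3600
request_bill = iops * seconds_per_month * WRITE_REQUEST_PRICE
print(f"requests: ${request_bill:,.2f}/mo")   # ≈ $103,680/mo
```

At these assumed rates, per-operation charges dwarf the storage bill by more than two orders of magnitude, which is the breakdown the paragraph above describes.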
A way forward: The cloud file lake
File-based applications are best served by file-based storage. These applications are mission-critical enablers of innovation and demand infrastructure that is built to make them successful. That is why file systems have existed for decades and why new file systems (and file services) are being developed all the time. We believe that the modern data lake should include a scalable, performant, and cloud-native file system as part of its fundamental architecture.
These “cloud file lakes” would offer customers the ability to store file data as it was intended to be stored: as files. This new approach to the data lake creates a single scalable file namespace in a public cloud, with the features and capabilities of a modern file system. This will let customers:
- Use the applications their talented end users expect (and know) and not re-build their applications for object storage
- Protect intellectual property using standard identity access methodologies proven in every modern enterprise (e.g. Active Directory)
- Share data across organizational boundaries using the reach of the cloud, while maintaining the organizational structure of their file systems
Finally, and maybe most importantly, a “cloud file lake” offers access for free: IO to a given file in a cloud file lake is included in the cost of the namespace. This makes it possible to run high-IO workloads in the public cloud at reasonable cost, and without fear that an active user or application will create a budget-breaking bill.
The requirements: What to look for in a file lake
A real cloud file lake must, at its heart, be a scalable file system. In order to serve large scale file workloads, the cloud file lake must be able to grow in capacity and performance to meet the needs of the workflow. At the same time, it must offer the core features of an enterprise-ready file system needed to serve multiple workloads. Some key capabilities we believe are central to any cloud file lake:
- Scale to petabytes, hundreds of GB/s and hundreds of thousands of IOPS in a single namespace
- Serve Windows, Linux, and Mac clients (and applications) without any customization and from the same namespace
- Offer standard enterprise file management tools such as quotas and snapshots so that administrators can protect data and avoid cost overruns
- Integrate with Active Directory and LDAP, and offer granular permission control (across Windows/Mac/Linux) to control intellectual property risk
- Be manageable entirely from an API or a command line so that the file lake can be created, reported on, and managed from standard orchestration tools like AWS CloudFormation templates (CFTs)
Finally, a cloud file lake should not live on an island. Whether through native features or simple integration with Lambda functions, a cloud file lake should enable customers to import data from S3 or other cloud object stores for processing and to export data to object data lakes when the file-based work is done.
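Under the assumption that the file lake namespace can be mounted into a Lambda function (e.g., over NFS), such an import hook might be sketched as follows. The mount path, function names, and bucket layout here are hypothetical; only the S3 event shape and the boto3 `download_file` call follow AWS conventions:

```python
import os
import urllib.parse

FILE_LAKE_MOUNT = "/mnt/filelake/incoming"  # hypothetical mount point

def copy_object_to_file_lake(s3_client, bucket, key, mount=FILE_LAKE_MOUNT):
    """Download one object into the mounted file namespace,
    preserving its object key as a relative file path."""
    dest = os.path.join(mount, key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    s3_client.download_file(bucket, key, dest)
    return dest

def handler(event, context):
    """Lambda entry point, triggered by an S3 ObjectCreated notification."""
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        copy_object_to_file_lake(s3, bucket, key)
```

The reverse direction (exporting finished files back to an object data lake) would be the mirror image: walk the finished directory and upload each file as an object.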
Qumulo: The first cloud file lake
Qumulo has spent the last several years building a scalable cloud-native file system. Our product combines the rich enterprise controls of a modern file product with the scale of a distributed shared-nothing architecture in a cloud-native package. Our customers use our product to make movies, sequence genomes, and map the ocean floor.
Qumulo offers a single file system with the following benefits:
- Scale to petabytes and serve the most demanding workloads, without paying extra for IO
- Serve NFS, SMB, and API-based applications for Windows, Mac, and Linux clients without changing your workflow
- Manage your storage with charge-back quotas, replication, and snapshots
- See into your storage with industry-leading real time visibility so you know which workloads are driving your bill
- Protect your IP with enterprise-grade permissions, role-based access controls, and audit logs
Of course, that’s just the beginning; we aren’t done yet. We are hard at work building more capabilities that make the file lake even more powerful and unleash your cloud file workloads. And as a Qumulo cloud subscriber, you get access to all of those features for free, simply by signing up.
Innovation-led organizations around the world are turning to the public cloud to create new products, to make new discoveries, and to accomplish their missions. At the heart of that work is file-based data. At Qumulo, we believe those workloads are best served by a data lake built on technology that unleashes the potential of that file data.
Ben Gitenstein runs Product at Qumulo. He and his team of product managers and data scientists have conducted nearly 1,000 interviews with storage users and analyzed millions of data points to understand customer needs and the direction of the storage market. Prior to working at Qumulo, Ben spent five years at Microsoft, where he split his time between Corporate Strategy and Product Planning.