At Qumulo, we’ve dedicated years to creating Scale Anywhere enterprise-wide primary storage systems and evolving them into a true Cloud Data Platform serving high-performance computing, supercomputing, artificial intelligence, content creation, healthcare, life sciences, defense/intelligence, and research. One of the most impactful use cases for our technology has been supporting autonomous driving clusters, the AI training clusters behind Advanced Driver Assistance Systems (ADAS). These clusters, foundational to autonomous vehicle development, leverage Qumulo’s unique strengths in managing massive datasets with a mix of large and small files, offering unmatched durability, consistency, and scalability across public, hybrid, and private cloud environments.
While large language models (LLMs) like GPT-4 have dominated the headlines for their ability to write stories, refine language, or even crack decent jokes, ADAS clusters serve a more mission-critical purpose: enhancing driver safety, optimizing fuel usage, and ultimately saving lives. Each of these computationally intensive domains, ADAS and LLMs, has nuanced differences that bring both challenges and opportunities. While LLMs may capture the public imagination, at Qumulo we’re proud to power the data systems behind many of the world’s largest ADAS clusters, a transformative application that touches everyone’s lives on the road, improving safety and efficiency.
Purpose and Nature of Data
Large Language Models (LLMs):
LLMs are designed to process and generate human-like text. Their datasets consist of tokens derived from natural language sources such as books, articles, websites, and code repositories. These datasets emphasize linguistic generalization, requiring data to be diverse and representative of the language(s) the model will serve. Tokenization—a process where text is broken into subword units or words—allows for efficient representation of the data.
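To make tokenization concrete, here is a minimal sketch using the open-source tiktoken library; the cl100k_base encoding is chosen purely as an example, and other tokenizers (SentencePiece, various BPE implementations) follow the same basic idea.

```python
# Minimal tokenization sketch. The encoding name is an illustrative choice,
# not a statement about how any particular model discussed here was trained.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Autonomous vehicles generate terabytes of sensor data daily."
token_ids = enc.encode(text)                    # text -> integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # each ID maps back to a subword

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(pieces)
```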
Autonomous Driving / Advanced Driver Assistance Systems (ADAS):
Autonomous vehicles rely on sensor data to navigate real-world environments. These datasets include raw, uncompressed outputs from cameras, LiDAR, radar, GPS, and inertial measurement units (IMUs). The goal is to train models to understand spatial environments, recognize objects, and make real-time decisions. ADAS datasets must capture not only common driving scenarios but also rare edge cases, such as adverse weather conditions or unusual pedestrian behavior.
Dataset Sizes: A Quantitative Perspective
LLMs:
The scale of LLM datasets is typically measured in tokens. For example:
- GPT-3 was trained on approximately 300 billion tokens, equivalent to ~570 GB of compressed data or several terabytes uncompressed (Brown et al., 2020).
- Modern LLMs like GPT-4 likely utilize datasets on the order of 1–2 petabytes or more, particularly when incorporating multimodal and multilingual sources. This is equivalent to approximately one hundred 8K RAW feature-length films.
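The arithmetic behind these figures is straightforward; the sketch below reproduces it using the common rule of thumb of roughly four bytes of English text per token, which is an assumption rather than a published specification.

```python
# Back-of-the-envelope conversion from token counts to storage.
# BYTES_PER_TOKEN is a rough rule of thumb for English text, not an exact figure.
TOKENS_GPT3 = 300e9      # ~300 billion training tokens (Brown et al., 2020)
BYTES_PER_TOKEN = 4      # assumed average

raw_bytes = TOKENS_GPT3 * BYTES_PER_TOKEN
print(f"Uncompressed text: ~{raw_bytes / 1e12:.1f} TB")        # ~1.2 TB

# Plain text compresses well; a ~2x ratio lands near the cited ~570 GB figure.
print(f"At 2x compression: ~{raw_bytes / 2 / 1e9:.0f} GB")     # ~600 GB
```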
ADAS:
ADAS datasets are measured in raw data storage due to the uncompressed nature of sensor outputs:
- A single autonomous vehicle generates 1–10 terabytes of data daily (Waymo, 2023).
- Fleet-wide datasets, used by companies such as Tesla and Waymo, range from 100 to 500 petabytes annually. For context, Tesla’s fleet collects over 1 million miles of driving data daily (Tesla AI Day, 2021). Compared with LLM training datasets, this is approximately 25,000 8K RAW feature-length films every year, or roughly 32 years of modern filmmaking.
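To see how those fleet-scale numbers come together, here is a sketch of the same estimate with every input stated explicitly; the per-vehicle retention rate, fleet size, and nominal size of an 8K RAW feature film are all illustrative assumptions, not vendor-published figures.

```python
# Back-of-the-envelope estimate of annual fleet data volume.
# Every constant below is an illustrative assumption for this sketch.
TB, PB = 1e12, 1e15

tb_per_vehicle_per_day = 5       # within the 1-10 TB/day range cited above
data_collection_vehicles = 200   # hypothetical number of vehicles retaining data
days_per_year = 365

annual_bytes = tb_per_vehicle_per_day * TB * data_collection_vehicles * days_per_year
print(f"Annual fleet data: ~{annual_bytes / PB:.0f} PB")                    # ~365 PB

# Express the same volume in "8K RAW feature films", assuming ~15 TB per film.
tb_per_8k_raw_film = 15
films = annual_bytes / (tb_per_8k_raw_film * TB)
print(f"Equivalent to ~{films:,.0f} 8K RAW feature-length films per year")  # ~24,000-25,000
```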
Diversity and Structure of Data
LLMs:
- Highly compressed data due to tokenization and deduplication processes.
- Prioritizes diversity across domains (e.g., scientific papers, fiction, code) to ensure generalization.
- Significant preprocessing is performed to filter low-quality or biased text (OpenAI, 2020).
ADAS:
- Video: High-resolution (1080p or 4K) recordings at 30–60 frames per second.
- LiDAR: Millions of 3D points per second (the sketch below translates these sensor rates into raw storage volumes).
- A significant portion of data is used for simulation and validation, particularly for rare edge cases.
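Here is a rough sketch of the raw data rates those sensor figures imply; the resolution, frame rate, bit depth, point rate, and bytes-per-point values are assumed examples, not datasheet values for any specific sensor.

```python
# Estimate raw (uncompressed) sensor data rates. All parameters are
# illustrative assumptions, not specifications of any particular sensor.

# Camera: 1080p (1920x1080), 30 fps, 3 bytes per pixel (8-bit RGB)
width, height, fps, bytes_per_pixel = 1920, 1080, 30, 3
camera_bps = width * height * bytes_per_pixel * fps
print(f"One 1080p camera: ~{camera_bps / 1e6:.0f} MB/s raw")     # ~187 MB/s

# LiDAR: 2 million points per second, 16 bytes per point (x, y, z, intensity)
points_per_s, bytes_per_point = 2_000_000, 16
lidar_bps = points_per_s * bytes_per_point
print(f"One LiDAR unit: ~{lidar_bps / 1e6:.0f} MB/s raw")        # ~32 MB/s

# One camera plus one LiDAR over an 8-hour driving day:
daily_bytes = (camera_bps + lidar_bps) * 8 * 3600
print(f"Camera + LiDAR, 8 hours: ~{daily_bytes / 1e12:.1f} TB")  # ~6.3 TB
```

Even this modest configuration lands in the 1–10 TB per vehicle per day range cited earlier; multi-camera rigs, 4K capture, and 60 fps rates push the figures several times higher, which is why fleet totals climb into the hundreds of petabytes.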
Computational Challenges
LLMs:
- Training involves billions to trillions of parameters, requiring high-throughput processing of tokenized datasets.
- Training GPT-3 required approximately 3,640 petaflop/s-days of compute (Brown et al., 2020); the sketch after this list cross-checks that figure.
- Optimized data pipelines (e.g., tokenization, batching) reduce the effective dataset size during training.
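That compute figure can be cross-checked with a common rule of thumb from the scaling-law literature, which puts training cost at roughly 6 x parameters x tokens floating-point operations; the sketch below runs the arithmetic in both directions using the figures cited above.

```python
# Cross-check the ~3,640 petaflop/s-day training figure for GPT-3 against the
# common "compute ~= 6 * parameters * tokens" rule of thumb.
params = 175e9     # GPT-3 parameter count (Brown et al., 2020)
tokens = 300e9     # training tokens

flops_estimate = 6 * params * tokens
print(f"6 * N * D estimate:    ~{flops_estimate:.2e} FLOPs")   # ~3.15e+23

# Convert petaflop/s-days to total floating-point operations.
pf_s_days = 3640
flops_reported = pf_s_days * 1e15 * 86_400                     # 86,400 seconds per day
print(f"3,640 petaflop/s-days: ~{flops_reported:.2e} FLOPs")   # ~3.14e+23
```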
ADAS:
- Processing involves time-series data and spatial modeling, often requiring real-time performance (the latency-budget sketch after this list makes that constraint concrete).
- Simulation environments (e.g., CARLA, NVIDIA DRIVE) are used to augment training, which adds to computational complexity.
- Specialized hardware, such as GPUs and dedicated TPUs, together with high-core-count single-socket CISC CPUs, processes large raw datasets for training and inference.
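The real-time requirement translates into hard per-frame latency budgets; the frame rates in the sketch below are assumed examples rather than requirements of any particular platform.

```python
# Real-time perception means the whole pipeline (decode, preprocess, inference,
# post-process) must finish within the interval between sensor frames.
# The sensor rates below are assumed examples.
for sensor, hz in [("camera @ 30 fps", 30), ("camera @ 60 fps", 60), ("LiDAR @ 10 Hz", 10)]:
    budget_ms = 1000 / hz
    print(f"{sensor:>16}: {budget_ms:5.1f} ms per-frame budget")
```

Training, by contrast, is a throughput problem: the goal is to keep accelerators saturated with raw sensor data streamed from shared storage.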
Data Longevity and Growth
LLMs:
- Dataset size increases incrementally with model complexity. However, growth slows due to diminishing returns at scale (Kaplan et al., 2020).
- Older datasets remain relevant, as linguistic fundamentals do not change rapidly.
ADAS:
- Dataset growth is exponential due to:
  - Increasing fleet sizes and higher adoption rates.
  - Advances in sensor technology (higher resolution and sampling rates).
  - Expanding coverage of edge cases for robust generalization.
- Older datasets may become obsolete as vehicle and sensor technologies evolve.
Dataset Comparisons
| Aspect | LLMs | ADAS/Autonomous Driving |
|---|---|---|
| Dataset Size | Terabytes to low petabytes | Hundreds of petabytes |
| Data Type | Text (tokens) | Video, LiDAR, Radar, GPS, GIS, Satellite Imagery |
| Compression | Highly compressed (tokenization) | Minimal compression (raw data) |
| Purpose | Linguistic understanding | Real-time spatial decision-making, saving lives and improving transportation safety |
| Growth | Slower scaling with diminishing returns | Exponential growth (fleet, sensors) |
Conclusion
Freedom of Choice
This leads to a broader strategic decision: should accelerated computing infrastructure be built on-premises, or is it more efficient to utilize the scalability and capacity of public cloud environments, connecting datasets seamlessly across hybrid infrastructures? At Qumulo, we aim to empower our customers to excel in both scenarios, breaking down technological barriers so they can make the best business, engineering, and operational decisions for their unique needs. To learn more about the groundbreaking performance Qumulo has delivered in the public cloud environment using our Cloud Data Platform, check out this video.