Brief
Snowflake
decouple compute and persistent storage, providing warehousing as a service. It is in production for over 7 years.
Design
Intermediate data is generated by query operators (e.g., joins) and is usually consumed by nodes participating in executing that query. It is short-lived, and prefer low-latency high-throughput access.
As nodes are added and removed, the ephemeral storage system does not require data repartitioning or reshuffling. Tasks executing query operations (e.g., joins) on a given compute node write intermediate data locally; and, tasks consuming the intermediate data read it either locally or remotely over the network (depending on the node where the task is scheduled).
Elasticity:
- Persistent storage - easy, offloaded to S3
- Compute - easy, pre-warmed pool of VMs
- Ephemeral storage - challenging, due to co-location with compute
Key finding:
- Customers submit a wide variety of query types.
- Intermediate data sizes can vary over multiple orders of magnitude across queries.
- Even with a small amount of local storage capacity, skewed access distributions and temporal access patterns common in data warehouses enable reasonably high average cache hit rates (60-80% depending on the type of query) for persistent data accesses.
- Several of our customers exploit our support for elasticity (for 20% of the clusters).
- Peak resource utilization can be high, the average resource utilization is usually low.