What is a Clustering Key and Why is it Important?

Chadd Kenney

One of the daily benefits of building a co-pilot for workload optimization is talking to data engineers about how they approach optimization in their organizations. One topic that comes up continuously is optimizing table clustering keys (often just called cluster keys). Last week, a large enterprise explained that they spent a whole week of an engineer's time researching and investigating how to optimize the clustering key of a single dataset, all because the benefits were so compelling: boosting query performance, optimizing data access, and reducing costs all at the same time. If you pair clustering key optimization with warehouse optimization and bin-packed queries, you can see something truly magical (but I will leave that for a future blog post). In this blog post, we will take the mystery out of clustering keys by exploring what they are, explaining why they are crucial for optimizing Snowflake databases, discussing optimization options, and looking at how Bluesky approaches the problem for customers. So let’s dig right in.

What is a Table Clustering Key?

Snowflake made giant leaps and bounds in effectively querying large datasets by changing how data is logically and physically stored into micro-partitions. Micro-partitions are small, immutable, compressed pieces of a larger table; queries become more efficient because smaller amounts of data are scanned, assuming the cluster key is well chosen. The clustering key is a design feature that specifies the columns by which the data is sorted and organized into these micro-partitions, improving data locality and making query processing more efficient because Snowflake can prune, or skip, partitions that are irrelevant to the query.
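
To make this concrete, here is a minimal sketch of how a clustering key is declared. The SALES table and its columns are hypothetical, while CREATE TABLE ... CLUSTER BY and ALTER TABLE ... CLUSTER BY are standard Snowflake DDL:

-- Declare a clustering key when the table is created (hypothetical table and columns)
CREATE TABLE SALES (
  SALE_DATE DATE,
  REGION    VARCHAR,
  AMOUNT    NUMBER
) CLUSTER BY (SALE_DATE);

-- Or add / change the clustering key on an existing table
ALTER TABLE SALES CLUSTER BY (SALE_DATE);

Once a key is defined, Snowflake's automatic clustering service maintains the micro-partition layout on that key as data changes.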

This typically works well when a table is first created, but as data is updated over time, tables can become poorly clustered along various dimensions, resulting in unnecessary table scans that slow down queries.

Why is Optimization Important?

Optimizing a table's clustering key in Snowflake has multiple benefits, all of which have implications for query optimization and performance:

  • Data Locality: Fast data access starts with good data locality. When queries are executed, the system scans only the micro-partitions that contain data relevant to the query and skips the rest. Keeping related data close together results in faster query processing times and reduced costs.
  • Better Data Caching: Snowflake caches the data pulled into the virtual warehouse and uses a first-in-first-out cache replacement policy, which means that keeping the most relevant micro-partitions in the virtual warehouse will significantly improve performance. The way to do that is twofold: first, bring only the most relevant data into the cache, and second, make sure no irrelevant data pushes it out on subsequent queries. Appropriately clustered tables achieve both.
  • Reduced I/O Operations: I come from storage, so this one is the most intuitive to me. Reading and writing less equates to increased efficiency; in other words, the number of I/O operations a query performs largely determines its efficiency. Configuring your table clustering keys correctly can significantly reduce the number of I/O operations required for query execution. For example, when data is tightly grouped within a small number of micro-partitions, the system can read fewer blocks from storage, minimizing disk I/O and accelerating query performance. (A quick way to check how well a table is currently clustered is sketched just after this list.)
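
Snowflake also exposes system functions that report how well a table is clustered for a given set of columns, which is a quick way to gauge how much pruning you can expect. A minimal sketch, reusing the hypothetical SALES table and REGION column from above:

-- Average clustering depth for the REGION column (lower is better)
SELECT SYSTEM$CLUSTERING_DEPTH('SALES', '(REGION)');

-- Richer JSON output: partition counts, depth histogram, overlap statistics
SELECT SYSTEM$CLUSTERING_INFORMATION('SALES', '(REGION)');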

What is the best way to optimize your Table Clustering Key?

Selecting an appropriate clustering key is a critical decision that impacts the performance of your Snowflake database. When done correctly, Snowflake can use the cluster key to prune or ignore partitions that are not relevant to the query, significantly reducing the amount of data that needs to be scanned and increasing query performance. It is a non-trivial task to select an ideal cluster key, as doing so requires knowledge of the table structure, the data within it, how it changes over time, and query patterns against the table. 

For example, date fields are common cluster keys, as many users will query tables looking for data from a specific period, such as sales from the past month. However, what if the most common queries instead look at the region where sales occurred? Using the date field as the cluster key may then be inappropriate: querying by region with a date cluster key would force all partitions to be scanned, increasing costs and reducing performance.

Let’s assume the most common queries against our sales data table filter by the sale's geographic region instead of the date; clustering by the Region column could be a better choice than the date, as it would allow Snowflake to prune more partitions. If I am only interested in customers in the East region, there is no need to scan other micro-partitions when the cluster key is set correctly. This results in the lowest cost and highest performance.
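
As a hedged sketch of that scenario (the SALES table, REGION column, and 'EAST' value are made up for illustration):

-- Cluster the table on the column most queries filter by
ALTER TABLE SALES CLUSTER BY (REGION);

-- This query now only needs to scan micro-partitions containing 'EAST' rows
SELECT SUM(AMOUNT)
FROM SALES
WHERE REGION = 'EAST';

In the query profile, comparing “Partitions scanned” to “Partitions total” is an easy way to confirm the pruning is actually happening.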

The goal is to choose cluster keys for your data sets that create intelligent partitions for your query workloads. Consider factors such as:

  • Distribution of Data: If the data is not relatively evenly split by the cluster key, scanning these partitions becomes less efficient, or we may even have to scan multiple partitions for the same value. This is why columns with high cardinality (i.e., many distinct and non-overlapping values for your data set), or columns used in filtering or joining operations, often make excellent choices for clustering keys. It’s similar to labeling the drawers on a filing cabinet: if I organize my files intelligently at the drawer level, I won’t have to open very many to find the specific files I’m looking for. (A quick way to eyeball distribution with plain SQL is sketched after this list.)
  • Reader Benefits and Writer Overhead: In some cases, a table clustering key can make a set of reader queries 10x+ faster by reducing their scans, but at the price of increased writer (auto-clustering) cost, and users need to decide on the trade-off.
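
For the distribution point above, a quick back-of-the-envelope check of a candidate key's cardinality and skew needs nothing more than plain SQL (again using the hypothetical SALES table):

-- How many distinct values does the candidate key have relative to the row count?
SELECT COUNT(DISTINCT REGION) AS distinct_regions,
       COUNT(*)               AS total_rows
FROM SALES;

-- Is the data reasonably evenly distributed across those values?
SELECT REGION, COUNT(*) AS row_count
FROM SALES
GROUP BY 1
ORDER BY 2 DESC;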

The best way to choose a clustering key is to consider the real-world reader and writer queries hitting the table, the cardinality of the data, and the usage patterns across time. You should also consider how much it would cost to recluster the table initially and what the ongoing cost would be over time. Finally, you want to decide how many columns to include in the clustering key. The most effective way to do all of that is to continuously run a simulation over all of these variables and determine the cost and benefit of each option. That’s where Bluesky comes in.
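
That said, if you want a quick manual data point on the write side of the trade-off first, the credits a table spends on auto-clustering are visible in Snowflake's ACCOUNT_USAGE data. A minimal sketch, assuming the hypothetical SALES table from earlier and that auto-clustering is enabled (note that ACCOUNT_USAGE views have some reporting latency):

-- Daily credits spent on auto-clustering for one table
SELECT TO_DATE(START_TIME) AS day,
       SUM(CREDITS_USED)   AS clustering_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY
WHERE TABLE_NAME = 'SALES'
GROUP BY 1
ORDER BY 1;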

Why is Bluesky’s Clustering Key Approach Different?

The big win for customers is that Bluesky has context, and the platform shines at scale by reducing the effort of analyzing your own data to find optimizations. Bluesky scans all workloads via the metadata customers provide, finds reader/writer patterns, and compares reader and writer tradeoffs. This data is fed into our proprietary algorithms to make recommendations 100% customized to your environment. These optimizations are provided to you without effort, and our recommendations are simple and easy to implement. Think: build more, figure this stuff out less. But enough talking; let's see it in action.

Table clustering findings can be found in Storage on the left-hand side of our navigation. From here, you can see if you have any clustering key optimizations. In the case below, I have 2 critical optimizations, which, if clicked on, will filter the optimizations for quick access.

When clicking on a finding, users get a quick overview of the finding, savings opportunity, recommendation, risks, and effort. This information helps users determine if they would like to execute this finding. Let’s dig into each section:

  • Finding Overview: This is a quick rundown of the problem we have identified. In this case, we identified that the READ cost is relatively high due to large table scans while, at the same time, the write frequency and volume are relatively light, and the table has no cluster key.
  • Savings Opportunity: The savings opportunity is the potential savings from this optimization. In this case, by adding the recommended clustering key, we estimate that $15,000 could be saved annually, based on an 80% reduction in the portion of execution time dedicated to table scans.
  • Recommendation: We supply recommendations to remediate the issue and unlock performance and cost savings. In this case, we recommend adding the optimal clustering key (DS in this example) to the table to benefit from faster reads, using the following SQL query:

ALTER TABLE ANALYTICS.DBT_SEAN.BLOCK_MAPPING_V2 CLUSTER BY (DS);

  • Risks: Defining the risk is important to customers so they can decide whether a finding is a good idea to implement or whether they should first talk to others on their team. We typically tell customers that anything scored 1 to 2 is simple, low risk, and most likely a no-brainer; 3 to 4 is medium risk and something to review with the team before implementing; and 5 is something to be careful with and potentially test in a development environment first to ensure no impact. In this case: “The main risk associated with adding a new clustering key is the potential for increased auto-clustering cost to maintain the table layout. In our analysis of historic trends, the writes are not that frequent, but any backfill job that does not insert the data sorted with respect to the cluster key can increase the credits spent on auto-clustering.” (A sketch of a sorted backfill appears after this list.) Estimated risk score (1-5): 2
  • Effort: Last is the effort to implement. Changing a table clustering key is very straightforward and requires minimal effort: executing a single SQL command. Effort level estimate (1-5): 1
  • Review and Apply: Customers have two options to apply a finding: copy the command next to the SQL query, or apply the finding directly if read/write access is enabled on the account. Simple and easy either way.
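
On the backfill caveat from the risk note above, here is a hedged sketch of what inserting data sorted on the cluster key looks like; the staging table name is hypothetical, while the target table and DS key come from the recommendation above:

-- Sorting the backfill on the clustering key keeps new micro-partitions
-- well-clustered and reduces the work left for auto-clustering
INSERT INTO ANALYTICS.DBT_SEAN.BLOCK_MAPPING_V2
SELECT *
FROM ANALYTICS.DBT_SEAN.BLOCK_MAPPING_V2_STAGING  -- hypothetical staging source
ORDER BY DS;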

Lots of fun stuff to digest and understand about Snowflake's table clustering keys, but one thing is for sure: they are a vital tool for optimizing query performance and overall data warehouse efficiency. By choosing and maintaining a good clustering key, you can significantly reduce query execution times, enhance data locality, and make the most of Snowflake's powerful architecture. While you can try to manage this yourself, it is unfortunately very time-consuming and complex; the good news is that platforms such as Bluesky are here to help unlock peak database performance.

Until next time, build more and stay optimized, my friends!