New shard key distribute data – SingleStore Support

{question}

How to test that a new shard key will evenly distribute data?

{question}

{answer}

Ideally, shard keys in Singlestore should distribute data evenly: after data is ingested, each partition across the cluster should end up storing approximately equal amounts of data, as opposed to having a "lopsided" cluster in which some partitions have much more data than others. The latter scenario is referred to as "skew", and should be avoided whenever possible.

When choosing a candidate shard key for a table, the cardinality and underlying data skew of the columns(s) should be considered. Ideally, higher cardinality is better, since a higher number of unique values makes it easier to spread them out evenly across the cluster. To see this in an extreme example, consider a shard key column with only two values: this would be impossible to distribute evenly over more than 2 partitions. Likewise, underlying data skew on the shard key column can propagate through to data skew in Singlestore. For example, consider sharding a table of customers on last names: this might conceivably be acceptable if the set of a customer last names have even alphabetical distribution, but if some letters (E, T, A) occur much more than others (Z, Q) then there's a potential for skew to arise. In particular, if a large percentage of customers happened to have the same last name, then the partition containing those rows might end up considerably larger than other partitions.

Skew can always be avoided by using either an empty shard key or sharding on a column (or columns) known to be unique (note that this doesn't necessarily require adding a UNIQUE KEY in Singlestore). There may be scenarios in which it's acceptable to choose a shard key with some skew in exchange for optimizing filter criteria or JOIN conditions, particularly if the size of the table in question is small. For small enough tables, opting to use a reference table instead (which, unlike sharded tables, are replicated in their entirety on every node in a cluster) may also be an option.

After a table is created and data is ingested, there are several diagnostic queries that can be used to monitor skew, which are described in detail in the following guide: Detecting and Resolving Data Skew.

{answer}

Articles in this section

Related articles