Why are some of my leaves using more memory or disk than others?
Several factors can cause some leaf nodes in a SingleStore cluster to use more memory or disk than others.
Under ideal operating conditions, a SingleStore cluster distributes its data evenly across the cluster's leaf nodes. When a query runs against the cluster, the nodes can "share the load", so no single node bears a disproportionate share of the work. When data is unevenly distributed among the leaves, we call that data skew. Skew can arise for several reasons, including the table's schema (specifically its shard key) and the cluster changing in size without the data being rebalanced.
You can find information about detecting and correcting skew within our documentation.
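As a starting point, a query along these lines can surface tables whose rows are spread unevenly across partitions. This is a sketch adapted from the skew-detection approach described in the SingleStore documentation; it assumes the information_schema.TABLE_STATISTICS view exposes per-partition row counts in a ROWS column:

```sql
-- Rough skew check: compare per-partition row counts for each table.
-- row_skew near 0 means even distribution; larger values suggest skew.
SELECT
    DATABASE_NAME,
    TABLE_NAME,
    FLOOR(AVG(ROWS)) AS avg_rows_per_partition,
    ROUND(STDDEV(ROWS) / AVG(ROWS), 3) AS row_skew
FROM information_schema.TABLE_STATISTICS
GROUP BY DATABASE_NAME, TABLE_NAME
HAVING SUM(ROWS) > 10000          -- ignore tiny tables where skew is noise
ORDER BY row_skew DESC;
```

Tables at the top of the result are the first candidates to investigate for a poorly chosen shard key.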
Bad Shard Key
Picking a proper shard key when creating the tables in your database is important. The shard key is the input to the hashing function that assigns each row to a partition (and therefore to a leaf in the cluster). Because of this, choosing a column with high cardinality, one with many distinct values, helps ensure the data is evenly distributed across the leaves. If a column with low cardinality is chosen, data skew is likely, because rows with equal shard key values hash to, and are stored in, the same partition.
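As a sketch (table and column names here are illustrative), a nearly unique column such as a user ID makes a better shard key than a low-cardinality column such as a status flag:

```sql
-- Good: user_id is (nearly) unique, so rows hash evenly across partitions.
CREATE TABLE events_good (
    user_id BIGINT NOT NULL,
    status  VARCHAR(10),
    payload JSON,
    SHARD KEY (user_id)
);

-- Risky: status has only a handful of distinct values, so all rows sharing
-- a status hash to the same partition, concentrating data and query load.
CREATE TABLE events_bad (
    user_id BIGINT NOT NULL,
    status  VARCHAR(10),
    payload JSON,
    SHARD KEY (status)
);
```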
The shard key must be chosen when the table is created, since it determines how rows are distributed across partitions. Once a table is created with a shard key, the key cannot be changed without creating a new table and re-ingesting the data so it is reorganized under the new shard key.
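One common pattern for that re-ingestion, sketched below with illustrative table names, is to create a new table with the desired shard key, copy the data over, and then swap the tables (exact rename syntax may vary by version, so check the documentation for your release):

```sql
-- 1. Create a new table with the desired shard key.
CREATE TABLE events_new (
    user_id BIGINT NOT NULL,
    status  VARCHAR(10),
    payload JSON,
    SHARD KEY (user_id)
);

-- 2. Re-ingest: rows are re-hashed by the new shard key as they are inserted.
--    Pause writes to events_old before this step so no rows are missed.
INSERT INTO events_new SELECT * FROM events_old;

-- 3. Swap the tables so the new layout takes the old name.
DROP TABLE events_old;
ALTER TABLE events_new RENAME TO events_old;
```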
Over time, a cluster can change in size, and the data within it needs to be redistributed across the changing number of nodes to maintain balance and prevent skew.
Because of this need for equal data distribution, SingleStore has built-in features that let you see whether the partitions that make up a database need to be rebalanced across the leaf nodes, and to perform that rebalance. REBALANCE PARTITIONS rebalances one database at a time, and (from version 7.3) REBALANCE ALL DATABASES rebalances every database at once.
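A sketch of that workflow (the database name is illustrative); EXPLAIN previews the planned partition moves before anything runs:

```sql
-- Preview what a rebalance would do without executing it.
EXPLAIN REBALANCE PARTITIONS ON my_db;

-- Rebalance a single database's partitions across the leaf nodes.
REBALANCE PARTITIONS ON my_db;

-- From version 7.3: rebalance every database in the cluster at once.
REBALANCE ALL DATABASES;
```

Running the EXPLAIN form first is a cheap way to confirm whether the cluster actually has anything to move after a resize.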