How to achieve local join, and why is it essential in distributed databases?
Local or co-located joins occur when the data being joined between two tables is guaranteed to be located on the same partition in both tables. It is easier to explain with an example:
Consider two tables, A and B, each of which has a column called “customer_id”. Naturally, we want to join the two tables on this column. However, in the general case, there is no guarantee that if rows in table A with a particular value of customer_id (e.g., customer_id=5) exist on a particular partition (e.g. partition 2), that rows in table B with that same value of customer_id will lie on the same partition. So, in general, each partition needs to communicate with every other partition in the database to check whether there are any rows that satisfy the join condition, i.e. partition 2 needs to check every other partition to see if any of them has rows with customer_id=5.
However, if both tables are sharded on customer_id, then the above condition is guaranteed: if partition 2 has the rows in table A with customer_id=5, then it is guaranteed that all the rows in table B with customer_id=5 will also lie on partition 2. This means that during the join operation, partition 2 does not need to communicate with any other partitions in the cluster, greatly reducing resource utilization as well as network latency delays.