{question}
Why do I see that the rows processed are significantly higher than the actual rows in the case of a reference table?
{question}
{answer}
When joining a sharded table with a reference table, it's normal to observe that the number of rows processed is much higher than the actual number of rows in the reference table. This behavior is due to how reference tables are stored and used during query execution.
Reference tables are replicated across all the nodes in the cluster, meaning each node holds a complete copy of the reference table. During a join operation, every partition processes its portion of the sharded table and joins it with the full reference table available locally.
Example:
If a reference table contains 100,000 rows and your cluster has 10 nodes(2 Agg + 8 leaf), each of those nodes joins its data with the 100,000-row reference table. As a result, the reference table's rows are processed 10 times—once per partition—resulting in a total of 1,000,000 rows being processed overall.
This behavior is expected and highlights why we recommend keeping reference tables small. As the number of partitions increases, the processing overhead grows linearly with the size of the reference table, which can impact query performance.
{answer}