How does a bloom filter improve hash join performance?
A Bloom filter is a probabilistic data structure used to check whether an element in one set is also a member of another set. Since checking set overlap is the underlying logical operation during a JOIN in a relational database, Bloom filters are used to optimize JOINs in SingleStore under certain circumstances.
A basic high-level description of the bloom filter algorithm is as follows:
The elements of one set X are fed through a collection of N hash functions to produce the Bloom filter itself
To test if an element of another set Y is in X, apply the N hash functions to the element and see if the result exists in the Bloom filter.
If the hashed test element does not exist in the Bloom filter, then it cannot have existed in the other set X (because if it did, then one of the hash functions in N would have added it to the filter)
If the hashed test element does exist in the Bloom filter, then either it does exist in X, or it is a false positive (since hash functions are many-to-one: they will map different source elements to the same filter element). This is the sense in which Bloom filters are probabilistic.
The false-positive rate can be adjusted by changing the number of hash functions in N and the size of the filter itself (i.e., the number of unique values that the hash functions can map to)
Bloom filters are not a SingleStore-specific concept; more detailed discussions of the concept and algorithm can be readily found online.
Since a Bloom filter guarantees no false negatives but allows for the possibility of false positives, they are likely to improve performance when there is relatively little overlap between two sets/tables being JOINed, since in this case, the query execution engine knows a JOIN condition is not met as soon as an element fails to pass the filter. If two sets/tables have a lot of overlap, then elements that pass the filter still have to be checked deterministically to verify that the JOIN condition is met (i.e., to rule out a false positive), so in this case, a Bloom filter can actually degrade performance.
Generally, the query optimizer will decide whether or not a Bloom filter should be used based on table statistics. Still, you can force the use of Bloom Filter either by a hint or a session variable:
SELECT WITH(force_bloom_filter=true) ...
When you check a profile using The Studio Visual Explain, a Bloom Filter shows as the following icon: