{question}
How does a bloom filter improve hash join performance?
{question}
{answer}
A Bloom filter is a probabilistic data structure used to check whether an element in one set is also a member of another set. Since checking set overlap is the underlying logical operation during a JOIN in a relational database, Bloom filters are used to optimize JOINs in SingleStore under certain circumstances.
A basic high-level description of the bloom filter algorithm is as follows:
-
The elements of one set X are fed through a collection of N hash functions to produce the Bloom filter itself
-
To test if an element of another set Y is in X, apply the N hash functions to the element and see if the result exists in the Bloom filter.
-
If the hashed test element does not exist in the Bloom filter, then it cannot have existed in the other set X (because if it did, then one of the hash functions in N would have added it to the filter)
-
If the hashed test element does exist in the Bloom filter, then either it does exist in X, or it is a false positive (since hash functions are many-to-one: they will map different source elements to the same filter element). This is the sense in which Bloom filters are probabilistic.
-
The false-positive rate can be adjusted by changing the number of hash functions in N and the size of the filter itself (i.e., the number of unique values that the hash functions can map to)
-
Bloom filters are not a SingleStore-specific concept; more detailed discussions of the concept and algorithm can be readily found online.
Since a Bloom filter guarantees no false negatives but allows for the possibility of false positives, they are likely to improve performance when there is relatively little overlap between two sets/tables being JOINed, since in this case, the query execution engine knows a JOIN condition is not met as soon as an element fails to pass the filter. If two sets/tables have a lot of overlap, then elements that pass the filter still have to be checked deterministically to verify that the JOIN condition is met (i.e., to rule out a false positive), so in this case, a Bloom filter can actually degrade performance.
Generally, the query optimizer will decide whether or not a Bloom filter should be used based on table statistics. Still, you can force the use of Bloom Filter either by a hint or a session variable:
SELECT WITH(force_bloom_filter=true) ...
Or
SET force_bloom_filters=on;
When you check a profile using The Studio Visual Explain, a Bloom Filter shows as the following icon:
{answer}