# Filter Bloom in Java using Guava

- Transfer

Good day to all.

We launched a new course - “Algorithms for Developers” , designed for those to pull up knowledge of various structures and algorithms for data processing, solving algebraic problems and dynamic programming problems for various languages. So today we are sharing a short note about the work of the Bloom filter in Java.

In this article we will look at the structure of the Bloom filter from the Guava library. The Bloom filter is a probabilistic data structure with efficient memory usage that we can use to answer the question “Is this element contained in a set?”.

There

However, the Bloom filter

To learn more about how the Bloom filter works, check out this guide.

We’ll use the Guava filter implementation of Bloom, so add a Guava dependency. You

can find the latest version in Maven Central .

Bloom filter is designed to be

Due to its economy, the Bloom filter fits easily in memory even with a large number of elements. Some databases, including Cassandra and Oracle, use this filter for initial verification before accessing the disk or in the cache, for example, when a request is received for a specific ID.

If the filter says that the ID does not exist, the database may stop processing the request and return to the client. Otherwise, the query goes to the disk and returns the item found.

Suppose we want to create a Bloom filter for not more than 500 integers, and the one percent (0.01) probability of false positives triples us.

For this we can use the class

Now add some numbers to the filter:

We added only three elements and determined the maximum number of inserts - 500, so our Bloom filter should produce fairly accurate results. Test it using the method

As the name implies, we cannot be 100% sure that this element really exists in the filter if the method returns true.

In our case, when it

When designing a Bloom filter, it is important to provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives more often than we would like. Consider an example.

Suppose we create a filter with the desired probability of false positives equal to one percent and the expected number of elements equal to five, but then insert 100,000 elements:

Expecting such a small number of elements, the filter will take very little memory.

However, after adding a much larger number of elements, the filter becomes supersaturated and has a higher probability of producing a false positive result than the desired one percent.

Note that the method

In this quick tutorial, we examined the probabilistic nature of the Bloom filter's data structure using the Guava implementation.

Implementation of all these examples and code snippets can be found in the GitHub project .

This is a Maven project, so import and launch should be easy.

We are waiting for comments and questions, which, as always, you can leave here, and go on an open day to the course instructor Mikhail Gorshkov .

We launched a new course - “Algorithms for Developers” , designed for those to pull up knowledge of various structures and algorithms for data processing, solving algebraic problems and dynamic programming problems for various languages. So today we are sharing a short note about the work of the Bloom filter in Java.

**Introduction**In this article we will look at the structure of the Bloom filter from the Guava library. The Bloom filter is a probabilistic data structure with efficient memory usage that we can use to answer the question “Is this element contained in a set?”.

There

**are no false negatives in**Bloom’s filtertherefore, if it returns false, you can be 100% sure that this element is not in the set.However, the Bloom filter

**can return false positives**, so when returning true, there is a high probability that the element really exists in the set, but the probability is not 100%.To learn more about how the Bloom filter works, check out this guide.

**Maven dependency**We’ll use the Guava filter implementation of Bloom, so add a Guava dependency. You

can find the latest version in Maven Central .

```
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>22.0</version>
</dependency>
```

**Why use a bloom filter?**Bloom filter is designed to be

**economical and fast**. When using it, we can specify the*probability of false positive responses that we are ready to accept*, and, in accordance with the configuration, the Bloom filter will take as little memory as possible.Due to its economy, the Bloom filter fits easily in memory even with a large number of elements. Some databases, including Cassandra and Oracle, use this filter for initial verification before accessing the disk or in the cache, for example, when a request is received for a specific ID.

If the filter says that the ID does not exist, the database may stop processing the request and return to the client. Otherwise, the query goes to the disk and returns the item found.

**Creating a Bloom Filter**Suppose we want to create a Bloom filter for not more than 500 integers, and the one percent (0.01) probability of false positives triples us.

For this we can use the class

`BloomFilter `

from the Guava library. You must pass the number of elements reported to the filter, and the desired probability of false positives:```
BloomFilter<Integer> filter = BloomFilter.create(
Funnels.integerFunnel(),
500,
0.01);
```

Now add some numbers to the filter:

```
filter.put(1);
filter.put(2);
filter.put(3);
```

We added only three elements and determined the maximum number of inserts - 500, so our Bloom filter should produce fairly accurate results. Test it using the method

`mightContain()`

:```
assertThat(filter.mightContain(1)).isTrue();
assertThat(filter.mightContain(2)).isTrue();
assertThat(filter.mightContain(3)).isTrue();
assertThat(filter.mightContain(100)).isFalse();
```

As the name implies, we cannot be 100% sure that this element really exists in the filter if the method returns true.

In our case, when it

`mightContain()`

returns true, 99% that the element is in the filter, and 1%, that the result is false positive. If the filter returns false, you can be 100% sure that the item is missing. **Supersaturated Bloom Filter**When designing a Bloom filter, it is important to provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives more often than we would like. Consider an example.

Suppose we create a filter with the desired probability of false positives equal to one percent and the expected number of elements equal to five, but then insert 100,000 elements:

```
BloomFilter<Integer> filter = BloomFilter.create(
Funnels.integerFunnel(),
5,
0.01);
IntStream.range(0, 100_000).forEach(filter::put);
```

Expecting such a small number of elements, the filter will take very little memory.

However, after adding a much larger number of elements, the filter becomes supersaturated and has a higher probability of producing a false positive result than the desired one percent.

Note that the method

`mightContatin()`

returns true even for a value that we have not previously inserted into the filter. **Conclusion**In this quick tutorial, we examined the probabilistic nature of the Bloom filter's data structure using the Guava implementation.

Implementation of all these examples and code snippets can be found in the GitHub project .

This is a Maven project, so import and launch should be easy.

**THE END**We are waiting for comments and questions, which, as always, you can leave here, and go on an open day to the course instructor Mikhail Gorshkov .