TL;DR: With linear probing, we can delete elements from an open addressing hash table without tombstones. Here are the C and the C++ implementations.
Introduction
When implementing a hash table based on open addressing, we usually set a tombstone for each deleted element, which indicates that the bucket used to hold an element. These tombstones maintain proper probe chains in the presence of hash collisions. They are critical to the correctness of open-addressing hash tables. My khash.h and most hash table implementations use tombstones.
However, the tombstone strategy is problematic. Tombstones waste memory and may increase the frequency of expensive rehashing. To alleviate these adverse effects, the Facebook F14 hash table implements reference-counted tombstones. Such tombstones are removed from the hash table when they are at the end of probe chains. Note that F14 still uses tombstones; it just removes them more effectively.
Deletions without tombstones for linear probing
For a long time, I thought tombstones were inevitable. I was wrong. A recent Reddit post pointed out that the Wikipedia page on linear probing has offered a no-tombstone solution for years. The basic idea is not complex: when we delete an element, we move the next element into the deleted bucket if it was pushed away from its home bucket by a hash collision; we repeat this process until we reach an empty bucket. In C++, this algorithm only takes ~10 lines.
The current Wikipedia page describes the idea well, but it is incomplete: you can't implement the algorithm from the description there alone. Fortunately, a Google search directed me to an old StackOverflow answer which gives the correct pseudocode. Unlike F14 or robin-hood hashing, this algorithm doesn't require any additional storage, not even a single bit.
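To make the idea concrete, here is a minimal sketch of a linear-probing hash set with backward-shift deletion. This is my own illustration of the algorithm from the Wikipedia/StackOverflow pseudocode, not the actual khashl code; all names are made up, the capacity is fixed (no rehashing), and the hash simply masks the low bits for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative linear-probing set with no tombstones (not khashl's code).
class FlatSet {
    std::vector<uint32_t> key_;
    std::vector<bool> used_;
    uint32_t mask_; // capacity - 1; capacity must be a power of two

    // Multiplicative hash, low bits masked (khashl keeps the top bits).
    uint32_t home(uint32_t k) const { return (k * 2654435769u) & mask_; }

public:
    explicit FlatSet(uint32_t capacity_pow2)
        : key_(capacity_pow2), used_(capacity_pow2), mask_(capacity_pow2 - 1) {}

    bool insert(uint32_t k) { // sketch: caller must keep the table non-full
        uint32_t i = home(k);
        while (used_[i]) {
            if (key_[i] == k) return false;
            i = (i + 1) & mask_;
        }
        key_[i] = k; used_[i] = true;
        return true;
    }

    bool contains(uint32_t k) const {
        for (uint32_t i = home(k); used_[i]; i = (i + 1) & mask_)
            if (key_[i] == k) return true;
        return false;
    }

    bool erase(uint32_t k) {
        uint32_t i = home(k);
        for (;;) { // find the key; stop at an empty bucket
            if (!used_[i]) return false;
            if (key_[i] == k) break;
            i = (i + 1) & mask_;
        }
        // Backward shift: i is the hole; scan j forward to the chain's end.
        uint32_t j = i;
        for (;;) {
            j = (j + 1) & mask_;
            if (!used_[j]) break; // end of the probe chain
            uint32_t h = home(key_[j]);
            // Move key_[j] into the hole iff its home bucket h is NOT
            // cyclically in the interval (i, j]; otherwise it would become
            // unreachable from h.
            if ((j > i) ? (h <= i || h > j) : (h <= i && h > j)) {
                key_[i] = key_[j];
                i = j; // the hole advances to j
            }
        }
        used_[i] = false;
        return true;
    }
};
```

The cyclic-interval test is the subtle part the Wikipedia prose glosses over: an element may only fill the hole if the hole lies on the probe path from that element's home bucket, and the two branches handle wrap-around of the index space.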
Implementation
I implemented the algorithm in a new experimental hash table library khashl.h along with its C++ version khashl.hpp. It uses Fibonacci hashing and optional hash value caching, as described in my earlier post, and one bit per bucket to indicate whether the bucket is empty.
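The two details mentioned above can be sketched as follows. This is my reconstruction of the general techniques, not khashl's actual code; the function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Fibonacci hashing: multiply by 2^32 / golden ratio and keep the top p
// bits, giving a bucket index in [0, 2^p). Scrambles poor integer hashes.
inline uint32_t fib_bucket(uint32_t h, int p) {
    return (h * 2654435769u) >> (32 - p);
}

// One "used" bit per bucket, stored in a side array of 32-bit words,
// instead of per-bucket empty/tombstone flags.
inline bool bit_get(const std::vector<uint32_t>& bits, uint32_t i) {
    return (bits[i >> 5] >> (i & 31)) & 1u;
}
inline void bit_set(std::vector<uint32_t>& bits, uint32_t i) {
    bits[i >> 5] |= 1u << (i & 31);
}
```

Because deletion no longer needs a "deleted" state, one bit per bucket suffices to distinguish empty from occupied, which is where part of the memory saving comes from.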
Benchmark
I modified my earlier benchmark to evaluate deletions. The new benchmark feeds each hash table library a list of random integers. We insert an integer if it is absent from the table; we delete an integer if it is already in the table. For the largest dataset, there are 50 million input integers but there are only 6.2 million integers left in the final table. There are plenty of deletions.
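The workload can be paraphrased as the following sketch, here driven against std::unordered_set as a stand-in for the benchmarked libraries; the RNG, function name, and key range are illustrative, not the benchmark's actual parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Toggle workload: for each pseudo-random integer, insert it if absent,
// delete it if present. The final table keeps integers seen an odd
// number of times, so deletions are frequent.
inline size_t toggle_workload(std::unordered_set<uint32_t>& s,
                              uint32_t n, uint32_t seed) {
    uint32_t x = seed;
    for (uint32_t i = 0; i < n; ++i) {
        x = x * 1664525u + 1013904223u; // simple LCG as a stand-in RNG
        uint32_t key = x % 100u;        // small range to force deletions
        auto it = s.find(key);
        if (it == s.end()) s.insert(key);
        else s.erase(it);
    }
    return s.size();
}
```

Since each step changes the table size by exactly one, the final size has the same parity as the number of inputs, and it is far smaller than the input count, matching the 50-million-in, 6.2-million-left shape of the real benchmark.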
The timing and memory of several hash table libraries are shown below:
In the figure, each library is associated with six points corresponding to 10, 18, 26, 34, 42 and 50 million input integers. The red circle line shows khashl, the new implementation. It has a lower memory footprint across the board.
Interestingly, khashl is slower than my older khash.h. This may be caused by a combination of two effects. First, due to the presence of tombstones, khash.h has to double the number of buckets, resulting in fewer collisions. It implicitly trades memory for speed. Second, khashl may need to move multiple elements upon a deletion. Copying buckets can be slow. That said, the new deletion algorithm is only a little slower than khash.h and is faster than many other libraries for this particular task. It might also become faster than khash under a different workload (e.g. a large batch of deletions). In addition, khashl has simpler insertion. It is faster than khash even if no deletions are involved.
Conclusion
Considering the clear advantage in memory, I think the new deletion algorithm without tombstones is overall better than traditional algorithms. It should become the preferred way to implement hash tables. I am surprised that I only found this algorithm a couple of days ago.
Appendix: comments on other libraries
- I am only showing fast libraries in the plot. Adding slow libraries will squeeze all the current points into a corner, making them harder to see.
- Many libraries are also evaluated in another benchmark.
- Abseil/absl hash map was not very impressive in my earlier benchmark, but the recent version seems better.
- phmap is largely a more portable version of an older Abseil hash map. It is not as fast as Abseil nowadays.
- Consistent with the other benchmark, emilib is the fastest on inserting 32-bit integers. It implements a relatively simple hash table with linear probing. Emilib is faster than khashl possibly because 1) it uses a small load factor of 67% (vs 75% with khashl) and 2) it puts the empty/deleted bits inside each bucket, which may help cache efficiency. Emilib is very slow on deletions. I am not sure why.
Update on 2019-12-28: added absl::flat_hash_map and removed the rant about Abseil.
Why not move the last eligible element in the probe, rather than all of them? If you’ve got 10 collisions, why move all 10 instead of just swapping index 9 with the deleted index, ensuring that you’ve moved the newly emptied cell to the end of the probe?
When you get 10 hash collisions, the probe chain may be longer than 10, say 12, because it can also include elements sitting in their home buckets. If you only move the last element, deleting one of those home-bucket elements later may break the probe chain. I am not 100% sure about this, though.
Try my new and fast hash map: https://github.com/ktprime/emhash/blob/master/hash_table5.hpp (fast insertion) and https://github.com/ktprime/emhash/blob/master/hash_table7.hpp (high load factor). My hash map never uses tombstones and its load factor can be set as high as 0.999. Can you benchmark it?
The issue with linear probing is primary clustering, which causes the hash table distribution to degrade to long runs of filled buckets after a while; this is self-reinforcing as bigger clusters attract more collisions (cf. raindrops on a windscreen).
If your key distribution is perfectly random, the load factor remains low, or inserts/deletes are well balanced, it may take longer to show up, but it's a fundamental issue that will bite you eventually.