Monday, September 30, 2019

Cassandra : Compaction, compaction and compaction

Hi Folks,

The word "compaction" is heavily used in the Cassandra world. You can see it everywhere while reading documentation, blog posts, mailing lists and so on.

Sometimes the use of compaction in combination with another word can be a bit misleading. Let's take a step back and sum up the different concepts where the word compaction is used:

Machine makes cars compact, Fig. 1

[Compaction]

Let's call this the default one: it is what people usually mean when they talk about Cassandra compaction. Cassandra has a very efficient write path. The idea is "let's put the data into a file (the commit log) and in memory (the memtable) as fast as possible". In a second step, the data is flushed to disk as is, in an SSTable. OK, but then, to enable Cassandra to also have an efficient read path, especially when you have to read from disk, you need to tidy up and rearrange all the files. This process is called compaction. A bit more technically: SSTables are immutable, and compaction is the action of generating new SSTables by merging the old ones and purging obsolete data (duplicates, data with an expired TTL and deleted data marked by tombstones).
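
To make the merge-and-purge idea concrete, here is a toy sketch in Python. It is not how Cassandra is implemented: the Cell type, the compact function and the sample data are made up for illustration, and real compaction only purges tombstones once gc_grace_seconds has elapsed, which this sketch ignores.

```python
from typing import Dict, List, NamedTuple, Optional


class Cell(NamedTuple):
    value: Optional[str]  # None marks a tombstone (a delete)
    write_ts: int         # write timestamp; the highest one wins


def compact(sstables: List[Dict[str, Cell]], purge_tombstones: bool = True) -> Dict[str, Cell]:
    """Merge several immutable tables into one new table (toy model)."""
    merged: Dict[str, Cell] = {}
    for table in sstables:
        for key, cell in table.items():
            current = merged.get(key)
            # Keep only the most recent version of each key.
            if current is None or cell.write_ts > current.write_ts:
                merged[key] = cell
    if purge_tombstones:
        # Simplification: drop every tombstone, ignoring gc_grace_seconds.
        merged = {k: c for k, c in merged.items() if c.value is not None}
    return dict(sorted(merged.items()))


old_1 = {"a": Cell("1", 10), "b": Cell("2", 10)}
old_2 = {"a": Cell("3", 20), "c": Cell("4", 20)}
old_3 = {"b": Cell(None, 30)}  # a delete of "b"

new_sstable = compact([old_1, old_2, old_3])
print(new_sstable)
```

The three old tables are never modified; the result is a new table where "a" holds its latest value and "b" is gone.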

Minor [Compaction]

We are still talking about the default compaction. Minor compaction is not really something minor: it is the compaction handled automatically by Cassandra as a background process, according to the chosen compaction strategy (see below).

Major [Compaction]

We are still talking about the default compaction mechanism in Cassandra. Contrary to minor compaction, major compaction is triggered by a manual action on a node (using nodetool compact). Major compaction can behave differently depending on the compaction strategy (see below). The other main difference between major and minor compaction, and the one that explains the naming, is that a major compaction has a bigger impact in terms of I/O.
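
If you want to script that manual action, a minimal sketch could look like the following. The helper names and the my_keyspace / my_table identifiers are hypothetical; the only thing taken from Cassandra is the basic "nodetool compact [keyspace [table]]" form, and it assumes nodetool is on the PATH of the node you run it on.

```python
import subprocess
from typing import List, Optional


def build_compact_command(keyspace: Optional[str] = None, table: Optional[str] = None) -> List[str]:
    """Build the argument list for a major compaction (hypothetical helper)."""
    cmd = ["nodetool", "compact"]
    if keyspace:
        cmd.append(keyspace)
        if table:
            cmd.append(table)
    return cmd


def run_major_compaction(keyspace: str, table: str) -> None:
    # Acts on the local node only; raises CalledProcessError if nodetool fails.
    subprocess.run(build_compact_command(keyspace, table), check=True)


# Only prints the command; call run_major_compaction() to actually execute it.
print(" ".join(build_compact_command("my_keyspace", "my_table")))
```

Restricting the command to one keyspace and table is a way to keep the I/O impact under control.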

[Compaction] Strategy

As already said, compaction merges SSTables into new ones. Depending on your use case, Cassandra proposes different strategies to compact the data into new SSTables. A default strategy is applied if you do not specify one, but you can choose the one you want per table. There are 3 main compaction strategies.
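
Choosing the strategy per table is done through the table's compaction option in CQL. Here is a small sketch that only builds the statement as a string; the function name and the my_keyspace / my_table identifiers are hypothetical, and you would still have to run the statement with cqlsh or a driver.

```python
STRATEGIES = {
    "SizeTieredCompactionStrategy",
    "LeveledCompactionStrategy",
    "TimeWindowCompactionStrategy",
}


def alter_compaction_cql(keyspace: str, table: str, strategy: str) -> str:
    """Return an ALTER TABLE statement that sets the compaction strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown compaction strategy: {strategy}")
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH compaction = {{'class': '{strategy}'}};"
    )


statement = alter_compaction_cql("my_keyspace", "my_table", "LeveledCompactionStrategy")
print(statement)
```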

Size Tiered [Compaction] Strategy

This is the default compaction strategy. It fits write-heavy and general workloads. More documentation here:
http://cassandra.apache.org/doc/latest/operating/compaction.html#size-tiered-compaction-strategy
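
To give an intuition of "size tiered", here is a simplified sketch of the bucketing idea: SSTables of similar size are grouped together, and a group becomes a compaction candidate once it holds enough of them. The function names and sizes are made up; bucket_low, bucket_high and min_threshold mirror the names and default values of the strategy's options, but the real implementation has more rules (for example min_sstable_size and max_threshold) that are left out here.

```python
from typing import List


def size_tiered_buckets(
    sizes_mb: List[float], bucket_low: float = 0.5, bucket_high: float = 1.5
) -> List[List[float]]:
    """Group SSTable sizes into buckets of similar size (simplified)."""
    buckets: List[List[float]] = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # A table joins a bucket if it is close enough to the bucket average.
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb: List[float], min_threshold: int = 4) -> List[List[float]]:
    """Buckets holding at least min_threshold SSTables are worth compacting."""
    return [b for b in size_tiered_buckets(sizes_mb) if len(b) >= min_threshold]


sizes = [10, 11, 12, 9, 100, 110, 400]
print(size_tiered_buckets(sizes))    # [[9, 10, 11, 12], [100, 110], [400]]
print(compaction_candidates(sizes))  # [[9, 10, 11, 12]]
```

Only the four small SSTables would be merged here; the result is one bigger SSTable that later joins a bucket of bigger tables, hence the tiers.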

Leveled [Compaction] Strategy

This compaction strategy fits read-heavy workloads well. It involves a bit more I/O than Size Tiered Compaction, so it can be a good idea to combine it with SSDs. More documentation here:
http://cassandra.apache.org/doc/latest/operating/compaction.html#leveled-compaction-strategy

Time Window [Compaction] Strategy

This compaction strategy fits perfectly for time series. Basically, the data is compacted according to time. More documentation here:
http://cassandra.apache.org/doc/latest/operating/compaction.html#timewindowcompactionstrategy-operational-concerns

A good blog post which dives deep into it:
https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
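
As a rough illustration of the time window idea, here is a sketch that groups SSTables into fixed-size windows based on a timestamp. The helper names, the SSTable names and the timestamps are made up, and it assumes each SSTable is assigned to a window by its newest timestamp; the real strategy is configured through the compaction_window_unit and compaction_window_size options.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def window_start(ts_seconds: int, window_seconds: int) -> int:
    """Round a timestamp down to the start of its time window."""
    return ts_seconds - (ts_seconds % window_seconds)


def group_by_window(sstables: List[Tuple[str, int]], window_seconds: int) -> Dict[int, List[str]]:
    """Group (name, newest_timestamp) pairs by the window they fall into."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for name, max_ts in sstables:
        windows[window_start(max_ts, window_seconds)].append(name)
    return dict(windows)


HOUR = 3600
sstables = [("sst-1", 100), ("sst-2", 3500), ("sst-3", 3700), ("sst-4", 7300)]
print(group_by_window(sstables, HOUR))
# {0: ['sst-1', 'sst-2'], 3600: ['sst-3'], 7200: ['sst-4']}
```

SSTables that share a window are compacted together, so old windows stay untouched while new data keeps arriving in the current one.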

Validation [Compaction]

This one is a false friend from my point of view. It's considered a compaction but is not really about compaction. Validation compaction is the process of building a Merkle tree on each node during a repair. It's nevertheless called validation compaction because this action is controlled by the  
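
For readers who have never met a Merkle tree, here is a minimal sketch of the principle: hash the data range by range, then hash the hashes pairwise until a single root remains. Two replicas with the same root hold the same data. Everything here (function names, SHA-256, the sample ranges) is an assumption for illustration and not the structure Cassandra actually builds.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Hash leaves pairwise, level by level, until one root hash remains."""
    if not leaves:
        return _h(b"")
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def mismatching_ranges(replica_a: List[bytes], replica_b: List[bytes]) -> List[int]:
    """Indexes of the ranges whose leaf hashes differ between two replicas."""
    return [i for i, (a, b) in enumerate(zip(replica_a, replica_b)) if _h(a) != _h(b)]


replica_a = [b"range0:v1", b"range1:v1", b"range2:v1", b"range3:v1"]
replica_b = [b"range0:v1", b"range1:v1", b"range2:v2", b"range3:v1"]

in_sync = merkle_root(replica_a) == merkle_root(replica_b)
print(in_sync)                                   # False
print(mismatching_ranges(replica_a, replica_b))  # [2]
```

Comparing trees instead of raw data means only the ranges that differ need to be exchanged during the repair.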

Anti[Compaction]

Anticompaction occurs during incremental repair. The goal is to split the repaired data and the unrepaired data into two separate SSTables. The two sets of data can no longer be compacted together, which is why it's called anticompaction.
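
Here is a toy sketch of that split. It models an SSTable as a dict keyed by token and assumes the repaired range is start-exclusive and end-inclusive; the function name, tokens and values are made up for illustration.

```python
from typing import Dict, Tuple


def anticompact(
    sstable: Dict[int, str], repaired_range: Tuple[int, int]
) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Split one table into a repaired part and an unrepaired part (toy model)."""
    low, high = repaired_range
    repaired = {t: v for t, v in sstable.items() if low < t <= high}
    unrepaired = {t: v for t, v in sstable.items() if not (low < t <= high)}
    return repaired, unrepaired


sstable = {10: "a", 40: "b", 75: "c", 90: "d"}
repaired, unrepaired = anticompact(sstable, (30, 80))
print(repaired)    # {40: 'b', 75: 'c'}
print(unrepaired)  # {10: 'a', 90: 'd'}
```

One SSTable goes in and two come out, which is the opposite of what a regular compaction does.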

Conclusion

Hopefully you now have a better picture of what compaction means depending on the context in which it is used in the Cassandra world.
