We will look at different deployment models for executing compactions asynchronously.
For Merge-On-Read tables, data is stored using a combination of columnar (e.g., Parquet) and row-based (e.g., Avro) file formats.
Updates are logged to delta files and later compacted, either synchronously or asynchronously, to produce new versions
of the columnar files. One of the main motivations behind Merge-On-Read is to reduce data latency when ingesting records.
Hence, it makes sense to run compaction asynchronously without blocking ingestion.
Async Compaction is performed in 2 steps:
Compaction Scheduling: This is done by the ingestion job. In this step, Hudi scans the partitions and selects the file
slices to be compacted. A compaction plan is finally written to the Hudi timeline.
Compaction Execution: A separate process reads the compaction plan and performs compaction of file slices.
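To make the two-step protocol concrete, here is a minimal sketch in plain Python. This is not Hudi's actual API; the file layout, function names, and plan format are invented for illustration. The key idea it shows is that scheduling and execution only communicate through a plan persisted on the timeline, so they can run in different processes:

```python
import json
import os
import tempfile

def schedule_compaction(timeline_dir, file_slices, instant_time):
    """Step 1 (ingestion job): select file slices and persist a plan to the timeline."""
    plan = {"instant": instant_time, "slices": file_slices}
    path = os.path.join(timeline_dir, f"{instant_time}.compaction.requested")
    with open(path, "w") as f:
        json.dump(plan, f)
    return path

def execute_compaction(timeline_dir, instant_time):
    """Step 2 (separate process): read the plan and compact the listed slices."""
    path = os.path.join(timeline_dir, f"{instant_time}.compaction.requested")
    with open(path) as f:
        plan = json.load(f)
    # In Hudi, compaction merges each slice's delta log files into a new
    # columnar base file; here we just report what would be compacted.
    return [f"compacted {s}" for s in plan["slices"]]

timeline = tempfile.mkdtemp()
schedule_compaction(timeline, ["partition0/fileA", "partition0/fileB"], "20200101000000")
print(execute_compaction(timeline, "20200101000000"))
```

Because the plan is durable, the executor can run inside the same application (as in the streaming modes below) or as an entirely separate job.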
There are a few ways to execute compactions asynchronously.
Spark Structured Streaming
With 0.6.0, we now have support for running async compactions in Spark
Structured Streaming jobs. Compactions are scheduled and executed asynchronously inside the
streaming job. Async compaction is enabled by default for Structured Streaming jobs
on Merge-On-Read tables.
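As a sketch of what such a streaming write looks like in PySpark (the DataFrame `df`, table name, key fields, and paths are placeholders; the option keys follow Hudi's datasource configuration names, and the async-compaction option is shown explicitly even though it defaults to true for streaming writes):

```python
# Illustrative PySpark snippet; requires a Spark session with the Hudi bundle
# on the classpath, so it is not runnable standalone.
(df.writeStream
   .format("hudi")
   .option("hoodie.table.name", "my_mor_table")                  # placeholder table name
   .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
   .option("hoodie.datasource.write.recordkey.field", "uuid")    # placeholder record key
   .option("hoodie.datasource.write.precombine.field", "ts")     # placeholder ordering field
   .option("hoodie.datasource.compaction.async.enable", "true")  # default for streaming MOR writes
   .option("checkpointLocation", "/tmp/checkpoints/my_mor_table")
   .outputMode("append")
   .start("/tmp/hudi/my_mor_table"))
```

With these options, compactions are scheduled and executed in the background of the same streaming application, so ingestion micro-batches are not blocked.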
Hudi DeltaStreamer
Hudi DeltaStreamer provides a continuous ingestion mode, where a single long-running Spark application
ingests data into a Hudi table continuously from upstream sources. In this mode, Hudi supports managing asynchronous
compactions. Here is an example snippet for running in continuous mode with async compactions:
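A sketch of such an invocation is below; the bundle jar path and version, source class, ordering field, and properties file are assumptions to adapt to your deployment:

```shell
# Run HoodieDeltaStreamer in continuous mode on a Merge-On-Read table;
# compactions are scheduled and executed asynchronously within the same app.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /path/to/hudi-utilities-bundle_2.11-0.6.0.jar \
  --table-type MERGE_ON_READ \
  --target-base-path /path/to/hudi_table \
  --target-table hudi_table \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --props /path/to/source.properties \
  --continuous
```

The `--continuous` flag keeps the application running and ingesting, while `--table-type MERGE_ON_READ` enables the delta-log write path whose compactions are managed asynchronously.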