Asynchronous Clustering using Hudi

August 23, 2021 · 9 min read

Asynchronous clustering is a way to run Hudi's clustering table service in the background, reorganizing data for improved query performance while writers continue ingesting into the table without being blocked. In one of the previous blog posts, we introduced clustering and learned how to set up inline clustering. In this post, we will discuss what has changed since then and see how asynchronous clustering can be set up using HoodieClusteringJob as well as the DeltaStreamer utility.

info

For a newer, format-agnostic take on this topic, see our 2025 clustering post on what clustering means in an open data lakehouse.

Introduction

At a high level, clustering creates a plan based on a configurable strategy, groups eligible files based on specific criteria and then executes the plan. Hudi supports multiple writers and provides snapshot isolation between writers and table services, thus allowing writers to continue with ingestion while clustering runs in the background. For a more detailed overview of the clustering architecture, please check out the previous blog post.

Clustering Strategies

As mentioned before, both the clustering plan and its execution depend on configurable strategies. These strategies can be broadly classified into three types: plan strategy, execution strategy and update strategy.

Plan Strategy

This strategy comes into play while creating the clustering plan. It helps decide which file groups should be clustered. Let's look at the different plan strategies that are available with Hudi. Note that these strategies are easily pluggable using the hoodie.clustering.plan.strategy.class config.

SparkSizeBasedClusteringPlanStrategy: It selects file slices based on the small file limit of base files and creates clustering groups up to the max file size allowed per group. The max size can be specified using the hoodie.clustering.plan.strategy.max.bytes.per.group config. This strategy is useful for stitching together medium-sized files into larger ones, reducing the number of files spread across cold partitions.
SparkRecentDaysClusteringPlanStrategy: It looks back over the partitions of the previous 'N' days and creates a plan that will cluster the 'small' file slices within those partitions. This is the default strategy. It could be useful when the workload is predictable and data is partitioned by time.
SparkSelectedPartitionsClusteringPlanStrategy: In case you want to cluster only specific partitions within a range, no matter how old or new those partitions are, this strategy could be useful. To use it, you additionally need to set the two configs below (both begin and end partitions are inclusive):

hoodie.clustering.plan.strategy.cluster.begin.partition
hoodie.clustering.plan.strategy.cluster.end.partition

note

All the strategies are partition-aware and the latter two are still bound by the size limits of the first strategy.
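
As a minimal sketch of the selected-partitions strategy described above, the snippet below writes a properties file that restricts clustering to a partition range. The file name and the partition values are hypothetical placeholders, and the fully qualified class name is an assumption based on where the Spark plan strategies lived around Hudi 0.9.0; check it against your Hudi version.

```shell
# Write a hypothetical properties file that clusters only a range of partitions.
cat > clustering-selected-partitions.properties <<'EOF'
# Plan strategy class (package name assumed from Hudi 0.9.0; verify for your version)
hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSelectedPartitionsClusteringPlanStrategy
# Inclusive partition range; placeholder values for a date-partitioned table
hoodie.clustering.plan.strategy.cluster.begin.partition=2021/08/01
hoodie.clustering.plan.strategy.cluster.end.partition=2021/08/15
# The size limits of the size-based strategy still apply
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
EOF

# Show the generated file
cat clustering-selected-partitions.properties
```

The resulting file can be passed to HoodieClusteringJob or DeltaStreamer through --props, like the other properties files in this post.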

Execution Strategy

After building the clustering groups in the planning phase, Hudi applies the execution strategy to each group, primarily based on sort columns and size. The strategy can be specified using the hoodie.clustering.execution.strategy.class config.

SparkSortAndSizeExecutionStrategy is the default strategy. Users can specify the columns to sort the data by during clustering using the hoodie.clustering.plan.strategy.sort.columns config. Apart from that, we can also set the max file size for the parquet files produced by clustering using hoodie.clustering.plan.strategy.target.file.max.bytes. The strategy uses bulk insert to write data into new files, in which case Hudi implicitly uses a partitioner that sorts based on the specified columns. In this way, the strategy changes the data layout in a way that not only improves query performance but also balances rewrite overhead automatically.

This strategy can be executed either as a single Spark job or as multiple jobs, depending on the number of clustering groups created in the planning phase. By default, Hudi will submit multiple Spark jobs and union the results. In case you want to force Hudi to use a single Spark job, set the execution strategy class config to SingleSparkJobExecutionStrategy.

Update Strategy

Currently, clustering can only be scheduled for tables/partitions not receiving any concurrent updates. By default, the config for update strategy is set to SparkRejectUpdateStrategy. If a file group receives updates during clustering, this strategy rejects the updates and throws an exception. However, in some use-cases updates are very sparse and do not touch most file groups, so rejecting every update by default is unnecessarily restrictive. In such use-cases, users can set the config to SparkAllowUpdateStrategy.
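
To make the switch concrete, here is a small sketch that writes a properties file enabling async clustering with the allow-updates strategy. The file name is hypothetical, and both the config key hoodie.clustering.updates.strategy and the class's package are assumptions based on Hudi 0.9.0; please confirm them against the configuration reference for your release.

```shell
# Write a hypothetical properties file that allows updates during clustering.
cat > clustering-allow-updates.properties <<'EOF'
hoodie.clustering.async.enabled=true
# Key and package name assumed from Hudi 0.9.0; verify for your version
hoodie.clustering.updates.strategy=org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy
EOF

# Show the generated file
cat clustering-allow-updates.properties
```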

We discussed the critical strategy configurations. All other configurations related to clustering are listed here. Out of this list, a few configurations that will be very useful are:

Config key	Remarks	Default
`hoodie.clustering.async.enabled`	Enable running of clustering service, asynchronously as writes happen on the table.	False
`hoodie.clustering.async.max.commits`	Control frequency of async clustering by specifying after how many commits clustering should be triggered.	4
`hoodie.clustering.preserve.commit.metadata`	When rewriting data, preserves existing _hoodie_commit_time. This means users can run incremental queries on clustered data without any side-effects.	False

Asynchronous Clustering

Previously, we have seen how users can set up inline clustering. Additionally, users can leverage HoodieClusteringJob to set up 2-step asynchronous clustering.

HoodieClusteringJob

With the release of Hudi version 0.9.0, we can schedule as well as execute clustering in the same step. We just need to specify the --mode or -m option. There are three modes:

schedule: Make a clustering plan. This gives an instant which can be passed in execute mode.
execute: Execute a clustering plan at given instant which means --instant-time is required here.
scheduleAndExecute: Make a clustering plan first and execute that plan immediately.
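
The two-step flow (schedule first, execute later) can be sketched as below. This is a dry run: the helper clustering_cmd is a hypothetical function that only prints the spark-submit command for a given mode rather than running it, the paths mirror the sample command in this post, and the instant value is a placeholder that you would replace with the instant produced by the schedule step.

```shell
# Dry-run sketch: prints the spark-submit commands instead of executing them.
JAR=/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar
PROPS=/path/to/config/clusteringjob.properties
BASE_PATH=/path/to/hudi_table/basePath
TABLE=hudi_table_schedule_clustering

# Hypothetical helper: $1 is the mode, any remaining arguments are appended as-is.
clustering_cmd() {
  mode="$1"
  shift
  echo spark-submit \
    --class org.apache.hudi.utilities.HoodieClusteringJob \
    "$JAR" \
    --props "$PROPS" \
    --mode "$mode" \
    --base-path "$BASE_PATH" \
    --table-name "$TABLE" \
    --spark-memory 1g "$@"
}

# Step 1: create a clustering plan.
clustering_cmd schedule

# Step 2: execute the plan at the instant obtained from step 1 (placeholder value).
INSTANT=20210823120000
clustering_cmd execute --instant-time "$INSTANT"
```

Dropping the echo turns the helper into a real launcher. Splitting the steps is useful when planning and execution should run on different schedules or with different resources; scheduleAndExecute covers the simpler case.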

Note that to run this job while the original writer is still running, please enable multi-writing:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
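
A ZooKeeper-based lock provider also needs to know where ZooKeeper lives. The sketch below writes a fuller multi-writer properties file; the file name, host, port, lock key and base path are placeholder values, and the extra keys follow Hudi's concurrency control documentation as I understand it, so treat them as assumptions to verify for your version.

```shell
# Write a hypothetical multi-writer properties file (all values are placeholders).
cat > multi-writer.properties <<'EOF'
hoodie.write.concurrency.mode=optimistic_concurrency_control
# Lazy cleaning of failed writes is generally recommended with multiple writers
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=hudi_table
hoodie.write.lock.zookeeper.base_path=/hudi/locks
EOF

# Show the generated file
cat multi-writer.properties
```

The same lock settings should be used by both the ingestion writer and the clustering job so that they contend on the same lock.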

A sample spark-submit command to set up HoodieClusteringJob is as below:

spark-submit \
--class org.apache.hudi.utilities.HoodieClusteringJob \
/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
--props /path/to/config/clusteringjob.properties \
--mode scheduleAndExecute \
--base-path /path/to/hudi_table/basePath \
--table-name hudi_table_schedule_clustering \
--spark-memory 1g

A sample clusteringjob.properties file:

hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=column1,column2

HoodieDeltaStreamer

This brings us to our users' favorite utility in Hudi. Now, we can trigger asynchronous clustering with DeltaStreamer. Just set the hoodie.clustering.async.enabled config to true and specify the other clustering configs in a properties file whose location can be passed as --props when starting the DeltaStreamer (just like in the case of HoodieClusteringJob).

A sample spark-submit command to set up HoodieDeltaStreamer is as below:

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/path/to/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.9.0-SNAPSHOT.jar \
--props /path/to/config/clustering_kafka.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
--table-type COPY_ON_WRITE \
--target-base-path /path/to/hudi_table/basePath \
--target-table impressions_cow_cluster \
--op INSERT \
--hoodie-conf hoodie.clustering.async.enabled=true \
--continuous
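
For completeness, here is a sketch of what clustering_kafka.properties might contain. The record key, partition path, topic name and endpoints are hypothetical placeholders, and the Kafka and schema registry keys follow the DeltaStreamer documentation as I recall it; verify them for your setup. The clustering keys are the same ones used in the HoodieClusteringJob sample above.

```shell
# Write a hypothetical clustering_kafka.properties file (all values are placeholders).
cat > clustering_kafka.properties <<'EOF'
# Table keys (placeholder field names)
hoodie.datasource.write.recordkey.field=impressionid
hoodie.datasource.write.partitionpath.field=userid
# Kafka source and schema registry (placeholder endpoints)
hoodie.deltastreamer.source.kafka.topic=impressions
hoodie.deltastreamer.schemaprovider.registry.url=http://localhost:8081/subjects/impressions-value/versions/latest
bootstrap.servers=localhost:9092
auto.offset.reset=earliest
# Async clustering
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.sort.columns=column1,column2
EOF

# Show the generated file
cat clustering_kafka.properties
```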

Spark Structured Streaming

We can also enable asynchronous clustering with Spark structured streaming sink as shown below.

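import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieClusteringConfig, HoodieWriteConfig}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.OutputMode

// Base path of the Hudi table (placeholder; adjust to your environment)
val basePath = "/path/to/hudi_table/basePath"

// Write options shared by all streaming writes to the table
val commonOpts = Map(
   "hoodie.insert.shuffle.parallelism" -> "4",
   "hoodie.upsert.shuffle.parallelism" -> "4",
   DataSourceWriteOptions.RECORDKEY_FIELD.key -> "_row_key",
   DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "partition",
   DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "timestamp",
   HoodieWriteConfig.TBL_NAME.key -> "hoodie_test"
)

// Adds the async clustering settings on top of the common write options
def getAsyncClusteringOpts(isAsyncClustering: String,
                           clusteringNumCommit: String,
                           executionStrategy: String): Map[String, String] = {
   commonOpts ++ Map(
      DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE.key -> isAsyncClustering,
      HoodieClusteringConfig.ASYNC_CLUSTERING_MAX_COMMITS.key -> clusteringNumCommit,
      HoodieClusteringConfig.EXECUTION_STRATEGY_CLASS_NAME.key -> executionStrategy
   )
}

// Starts the streaming write in the background and waits up to 10 seconds for it
def initStreamingWriteFuture(hudiOptions: Map[String, String]): Future[Unit] = {
   val streamingInput: DataFrame = ??? // define the source of streaming
   Future {
      println("streaming starting")
      streamingInput
              .writeStream
              .format("org.apache.hudi")
              .options(hudiOptions)
              .option("checkpointLocation", basePath + "/checkpoint")
              .outputMode(OutputMode.Append())
              .start()
              .awaitTermination(10000)
      println("streaming ends")
   }
}

def structuredStreamingWithClustering(): Unit = {
   // Generate a data frame and write it to the streaming source location here.
   val hudiOptions = getAsyncClusteringOpts("true", "1", "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
   val f1 = initStreamingWriteFuture(hudiOptions)
   Await.result(f1, Duration.Inf)
}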

Conclusion and Future Work

In this post, we discussed different clustering strategies and how to set up asynchronous clustering. The story is not over yet and future work entails:

Support clustering with updates.
CLI tools to support clustering.

Please follow this JIRA to learn more about active development on this issue. We look forward to contributions from the community. Hope you enjoyed this post. Put your Hudi on and keep streaming!

FAQ

What is asynchronous clustering in Apache Hudi?

Asynchronous clustering runs Hudi's clustering table service in the background while regular writers keep ingesting into the table. Hudi's multi-writer support provides snapshot isolation between table services, so data can be reorganized for better query performance without compromising ingestion speed.

What clustering plan strategies does Hudi provide?

Hudi ships three pluggable plan strategies. SparkSizeBasedClusteringPlanStrategy groups small file slices up to a max size per group, SparkRecentDaysClusteringPlanStrategy (the default when this post was written; current releases default to SparkSizeBasedClusteringPlanStrategy) clusters small files in the previous N days of partitions, and SparkSelectedPartitionsClusteringPlanStrategy clusters only partitions within a configured begin and end range.

What are the modes of HoodieClusteringJob?

Since Hudi 0.9.0, HoodieClusteringJob supports three modes passed via the -m option. The schedule mode makes a clustering plan and returns an instant, the execute mode runs a plan at a given instant time, and scheduleAndExecute does both in one step.

How do I enable asynchronous clustering with DeltaStreamer?

Set hoodie.clustering.async.enabled to true and put the other clustering configs in the properties file passed via the --props option, then run DeltaStreamer in continuous mode. The same config also enables async clustering with a Spark structured streaming sink.

Can clustering run on a table that is receiving updates?

By default no, clustering can only be scheduled for tables or partitions not receiving concurrent updates, and the default SparkRejectUpdateStrategy throws an exception if a file group gets updates during clustering. If updates are sparse, you can set the update strategy to SparkAllowUpdateStrategy, and running HoodieClusteringJob alongside a writer requires enabling optimistic concurrency control with a lock provider.
