
Microsoft Azure

This page explains how to use Hudi on Microsoft Azure.

Disclaimer

This page is maintained by the Hudi community. If the information is inaccurate or you have additional information to add, please feel free to create a GitHub issue. Contributions are highly appreciated.

Supported Storage System

Hudi supports two Azure storage systems:

Azure Blob Storage
Azure Data Lake Storage Gen 2

Verified Combination of Spark and storage system

HDInsight Spark2.4 on Azure Data Lake Storage Gen 2

This combination works out of the box. No extra config needed.

Databricks Spark2.4 on Azure Data Lake Storage Gen 2

Import the Hudi jar into the Databricks workspace.

Mount the file system with dbutils:

dbutils.fs.mount(
  source = "abfss://xxx@xxx.dfs.core.windows.net",
  mountPoint = "/mountpoint",
  extraConfigs = configs)
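
The `configs` value passed to the mount call above is not defined on this page. As a minimal sketch, assuming the mount authenticates with an Azure AD service principal over OAuth, it could be a map of the standard Hadoop ABFS settings. All angle-bracket values are placeholders, and the choice of OAuth is an assumption rather than something this page prescribes.

```scala
// Sketch only: OAuth (service principal) settings for an ADLS Gen2 mount.
// Every <...> value is a placeholder to be replaced with your own.
val configs: Map[String, String] = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" ->
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  // Prefer reading the secret from a secret store instead of hard-coding it.
  "fs.azure.account.oauth2.client.secret" -> "<client-secret>",
  "fs.azure.account.oauth2.client.endpoint" ->
    "https://login.microsoftonline.com/<directory-id>/oauth2/token"
)
```

Define the map before calling `dbutils.fs.mount` so that `extraConfigs = configs` resolves.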

When writing a Hudi dataset, use the abfss URL:

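import org.apache.spark.sql.SaveMode

// abfss URLs take the form abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
inputDF.write
  .format("org.apache.hudi")
  .options(opts)
  .mode(SaveMode.Append)
  .save("abfss://<<container>>@<<storage-account>>.dfs.core.windows.net/hudi-tables/customer")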
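
The `opts` map used in the write above is left undefined on this page. A minimal sketch of what it might contain follows. The table name and the column names `customer_id`, `country`, and `updated_at` are hypothetical, and the option keys should be checked against the configuration reference for your Hudi version.

```scala
// Sketch only: basic Hudi write options with hypothetical table and column names.
val opts: Map[String, String] = Map(
  "hoodie.table.name" -> "customer",
  // Column that uniquely identifies a record.
  "hoodie.datasource.write.recordkey.field" -> "customer_id",
  // Column used to partition the table on storage.
  "hoodie.datasource.write.partitionpath.field" -> "country",
  // Column used to pick the latest record when keys collide.
  "hoodie.datasource.write.precombine.field" -> "updated_at"
)
```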

When reading a Hudi dataset, use the mount point:

spark.read
  .format("org.apache.hudi")
  .load("/mountpoint/hudi-tables/customer")

Concurrency Control

As of Hudi 1.2.0, the storage-based lock provider supports Azure Data Lake Storage Gen2 (abfs://, abfss://) and Azure Blob Storage (wasb://, wasbs://) base paths for concurrency control. This allows multi-writer pipelines on Azure to use storage-native conditional writes for locking, without requiring external systems such as ZooKeeper or Hive Metastore.

Add hudi-azure-bundle to your classpath and set:

hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
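
As a rough sketch, the lock provider setting is usually combined with Hudi's general multi-writer options when passed to a Spark writer. The concurrency mode and failed-writes cleaning policy below come from Hudi's general multi-writer setup and are assumptions here, not something this page states; the exact key for the cleaning policy may differ between Hudi releases.

```scala
// Sketch only: multi-writer options to merge into the writer's option map.
val lockOpts: Map[String, String] = Map(
  // Assumed: optimistic concurrency control is what enables multi-writer mode.
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  // Lock provider named on this page.
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.StorageBasedLockProvider",
  // Assumed: lazy cleaning of failed writes; verify the key for your version.
  "hoodie.cleaner.policy.failed.writes" -> "LAZY"
)
```

These entries can be merged with the other write options, for example `opts ++ lockOpts`, before calling `.options(...)`.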

The lock client supports multiple Azure authentication methods (connection string, SAS token, managed identity, service principal, and DefaultAzureCredential). See Concurrency Control — Azure Storage-Based Lock for the full configuration reference and authentication precedence.

Blogs

How to use Apache Hudi with Databricks
