Apache Hudi | An Open Source Data Lake Platform

What is Hudi

Apache Hudi is an open data lakehouse platform, built on a high-performance open table format to bring database functionality to your data lakes. Hudi replaces slow, old-school batch data processing with a powerful incremental processing framework for low-latency, minute-level analytics.

Integrations

Data Streaming

Apache Kafka

Apache Pulsar

Databases

PostgreSQL

MySQL

CDC

Debezium

Apache Flink CDC

File Formats

Apache Parquet

Apache ORC

Apache Avro

CSV

JSON

Lake Storage

Apache Hadoop

Amazon S3

Google Cloud Storage

Azure Blob Storage

Alibaba Cloud

IBM Cloud

Oracle Cloud

Tencent Cloud

MinIO

Data Catalogs

AWS Glue Data Catalog

Google BigQuery

Apache Hive Metastore

DataHub

Apache XTable (Incubating) (For sync)

Data Warehouses

Amazon Redshift

ClickHouse

Interactive Analytics

Presto

Trino

Apache Hive

AWS Athena

Google BigQuery

Apache Doris

StarRocks

Apache Impala

Data Processing

Apache Spark

Apache Flink

Databricks

AWS EMR

Azure HDInsight

Onehouse

Ray

Daft

Orchestration

dbt

Apache Airflow

Hudi Features

Mutability support for all workload shapes & sizes

Quickly update & delete data with fast, pluggable indexing. This includes database CDC and high-scale streaming data, with best-in-class support for out-of-order records, bursty traffic & data deduplication.
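To make the update and delete semantics concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Hudi's API: the `upsert` and `delete` helpers and the `id` and `ts` field names are assumptions chosen for illustration. It shows how a record key plus an ordering field lets late-arriving and duplicate records resolve to one latest version per key.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def upsert(table: Dict[str, Record], batch: Iterable[Record],
           key_field: str = "id", ordering_field: str = "ts") -> Dict[str, Record]:
    """Merge a batch into the table, keeping the latest version of each key."""
    for rec in batch:
        key = rec[key_field]
        current = table.get(key)
        # A late-arriving record with an older ordering value is ignored.
        if current is None or rec[ordering_field] >= current[ordering_field]:
            table[key] = rec
    return table


def delete(table: Dict[str, Record], keys: Iterable[str]) -> Dict[str, Record]:
    """Remove the given keys; keys that are absent are skipped silently."""
    for key in keys:
        table.pop(key, None)
    return table


table: Dict[str, Record] = {}
upsert(table, [
    {"id": "a", "ts": 2, "value": "new"},
    {"id": "a", "ts": 1, "value": "old"},    # out of order: arrives after ts=2
    {"id": "b", "ts": 1, "value": "first"},
    {"id": "b", "ts": 1, "value": "first"},  # exact duplicate
])
delete(table, ["b"])
```

In this toy model a dictionary lookup plays the role that an index plays in a real table: it locates the existing version of a key without scanning all the data.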

Unlock 10x efficiency by incrementally processing new data

Replace old-school batch pipelines with incremental streaming on your data lake. Experience faster ingestion and lower processing times for your data pipelines.
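The idea of incremental processing can be sketched in a few lines of plain Python. This is a conceptual model only, not Hudi's API: the `incremental_read` helper, the shape of the timeline, and the timestamp format are assumptions for illustration. A downstream job remembers the last commit it consumed and reads only the records written after it, instead of rescanning the whole table.

```python
from typing import Any, Dict, List, Tuple

# Each commit is (commit_time, records written by that commit).
# Commit times are fixed-width strings, so string comparison matches time order.
Commit = Tuple[str, List[Dict[str, Any]]]


def incremental_read(timeline: List[Commit],
                     after: str) -> Tuple[List[Dict[str, Any]], str]:
    """Return records from commits strictly after `after`, plus the new checkpoint."""
    new_records: List[Dict[str, Any]] = []
    checkpoint = after
    for commit_time, records in timeline:
        if commit_time > after:
            new_records.extend(records)
            checkpoint = max(checkpoint, commit_time)
    return new_records, checkpoint


timeline: List[Commit] = [
    ("20240101000000", [{"id": "a", "value": 1}]),
    ("20240102000000", [{"id": "b", "value": 2}]),
    ("20240103000000", [{"id": "a", "value": 3}]),
]

# The job already processed the first commit, so it resumes from that checkpoint.
changes, checkpoint = incremental_read(timeline, after="20240101000000")
```

Passing the returned `checkpoint` to the next run makes each run process only what changed since the previous one.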

ACID transactional guarantees for your data lake

Atomic writes, with relational/streaming data consistency models, snapshot isolation and non-blocking concurrency controls tailored for longer-running lake transactions.

Analyze historical data with time travel

Query historical data with the ability to roll back to a table version; debug data versions to understand what changed over time; audit data changes by viewing the commit history.
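Time travel can be pictured as replaying the commit history up to a chosen instant. The sketch below is a toy model in plain Python, not Hudi's API: `snapshot_as_of`, `commit_history`, and the sample timeline are hypothetical names and data used only to illustrate the idea.

```python
from typing import Any, Dict, List, Tuple

# Each commit is (commit_time, records upserted by that commit).
Commit = Tuple[str, List[Dict[str, Any]]]


def snapshot_as_of(timeline: List[Commit], as_of: str,
                   key_field: str = "id") -> Dict[str, Dict[str, Any]]:
    """Rebuild the table state as it was at commit time `as_of` (inclusive)."""
    state: Dict[str, Dict[str, Any]] = {}
    for commit_time, records in sorted(timeline, key=lambda c: c[0]):
        if commit_time > as_of:
            break  # later commits are invisible to this snapshot
        for rec in records:
            state[rec[key_field]] = rec
    return state


def commit_history(timeline: List[Commit]) -> List[str]:
    """List commit times in order, as an audit trail of when data changed."""
    return sorted(commit_time for commit_time, _ in timeline)


timeline: List[Commit] = [
    ("20240101000000", [{"id": "a", "value": 1}]),
    ("20240102000000", [{"id": "b", "value": 2}]),
    ("20240103000000", [{"id": "a", "value": 3}]),
]

past = snapshot_as_of(timeline, "20240102000000")
latest = snapshot_as_of(timeline, "20240103000000")
```

Comparing `past` and `latest` shows which keys changed between two versions, which is the debugging use case described above.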

Interoperable multi-cloud ecosystem support

Built on open data formats, with extensive support across the cloud vendor ecosystem and plug-and-play options for popular data sources & query engines.

Automatic table services for a high-performance lakehouse

Fully automated table services that continuously schedule & orchestrate clustering, compaction, cleaning, file sizing & indexing to ensure tables are always optimized.

Open Data Lakehouse platform to get you going faster

Effortlessly build your lakehouse with built-in tools for automatic ingestion from services like Debezium and Kafka, automatic catalog sync to major cloud engines & more.

Query acceleration through multimodal indexes

Experience faster write transactions on huge/wide tables & faster query performance with a first-of-its-kind multimodal indexing subsystem.

Resilient Pipelines with schema evolution & enforcement

Easily change the current schema of a Hudi table to adapt to data that changes over time, and keep pipelines resilient by failing fast and avoiding data corruption.
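The combination of evolution and enforcement can be illustrated with a small plain-Python sketch. It is a toy model under stated assumptions, not Hudi's actual rules or API: the `evolve` function, the `SchemaMismatchError` class, and the `ALLOWED_PROMOTIONS` set are hypothetical. Compatible changes (a new column, a widening type promotion) are accepted, while an incompatible change fails before any data is written.

```python
from typing import Dict

Schema = Dict[str, str]  # column name -> type name

# Widening promotions this toy model accepts (an assumption for illustration).
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}


class SchemaMismatchError(ValueError):
    """Raised when an incoming schema is incompatible with the table schema."""


def evolve(table_schema: Schema, incoming: Schema) -> Schema:
    """Return the evolved schema, or fail fast on an incompatible change."""
    evolved = dict(table_schema)
    for column, new_type in incoming.items():
        old_type = table_schema.get(column)
        if old_type is None:
            evolved[column] = new_type          # evolution: new column added
        elif old_type == new_type:
            continue                            # unchanged column
        elif (old_type, new_type) in ALLOWED_PROMOTIONS:
            evolved[column] = new_type          # evolution: type widened
        else:
            # enforcement: reject the write instead of corrupting the table
            raise SchemaMismatchError(
                f"column {column!r}: cannot change {old_type} to {new_type}")
    return evolved


table_schema = {"id": "string", "amount": "int"}
evolved = evolve(table_schema, {"id": "string", "amount": "long", "note": "string"})

try:
    evolve(table_schema, {"id": "string", "amount": "string"})
    rejected = False
except SchemaMismatchError:
    rejected = True
```

Raising before the write starts is what "failing fast" means here: the bad batch never reaches the table, so existing readers keep seeing consistent data.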

Why Hudi

The most innovative and completely open data lakehouse platform in the industry!

Trusted Platform

Battle-tested and proven in production in some of the largest data lakes on the planet.

Open Source

Hudi is a thriving & growing community that is built with contributions from people around the globe.

High Performance

Hudi's storage format is purpose-built to continuously deliver performance as data scales.

Data streams

Take advantage of built-in CDC sources and tools for streaming ingestion.

Hudi Blogs

Migrating from Apache Hive Tables to Apache Hudi

Sivabalan Narayanan

July 30, 2026

Join our Community

Get technical help, influence the product roadmap & see what’s new with Hudi!

YouTube

GitHub

Slack

Mailing

Hudi Blogs

Migrating from Parquet to Apache Hudi: A Practical Guide

Using Apache Hudi with Apache Iceberg: Interoperability via Apache XTable

Open Table Format vs Data Lakehouse: What's the Difference?

Data Lakehouse vs Data Warehouse vs Data Lake: What's the Difference?

What is CDC on a Data Lake?

What is a Streaming Data Lake?

What is ACID on a Data Lake?

Can a Lakehouse Really Run Maintenance Without Blocking Writes?

What is Incremental ETL on a Data Lake?
