Apache Hudi | An Open Source Data Lake Platform

What is Hudi

Apache Hudi is an open data lakehouse platform built on a high-performance open table format that brings database functionality to your data lake. Hudi replaces slow, old-school batch data processing with an incremental processing framework for low-latency, minute-level analytics.

Integrations

Data Streaming

Apache Kafka

Apache Pulsar

Databases

PostgreSQL

MySQL

CDC

Debezium

Apache Flink CDC

File Formats

Apache Parquet

Apache ORC

Apache Avro

CSV

JSON

Lake Storage

Apache Hadoop

Amazon S3

Google Cloud Storage

Azure Blob Storage

Alibaba Cloud

IBM Cloud

Oracle Cloud

Tencent Cloud

MinIO

Data Catalogs

AWS Glue Data Catalog

Google BigQuery

Apache Hive Metastore

DataHub

Apache XTable (Incubating) (For sync)

Data Warehouses

Amazon Redshift

ClickHouse

Interactive Analytics

Presto

Trino

Apache Hive

Amazon Athena

Google BigQuery

Apache Doris

StarRocks

Apache Impala

Data Processing

Apache Spark

Apache Flink

Databricks

Amazon EMR

Azure HDInsight

Onehouse

Ray

Daft

Orchestration

dbt

Apache Airflow

Hudi Features

Mutability support for all workload shapes & sizes

Quickly update & delete data with fast, pluggable indexing. This includes database CDC and high-scale streaming data, with best-in-class support for out-of-order records, bursty traffic & data deduplication.
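
As a rough illustration of the idea, the sketch below models a keyed upsert in plain Python. It is not Hudi's API or implementation: the `upsert` and `delete` helpers, the in-memory dict standing in for an index, and the `id` and `ts` field names are all assumptions made for this example. It shows how a record key plus an ordering field lets late-arriving (out-of-order) records and duplicates resolve to a single latest row per key.

```python
from typing import Any, Dict, Iterable

Record = Dict[str, Any]


def upsert(table: Dict[str, Record], batch: Iterable[Record],
           key_field: str = "id", ordering_field: str = "ts") -> Dict[str, Record]:
    """Merge a batch into a keyed table, keeping the newest record per key.

    The dict plays the role of an index: it maps a record key straight to
    the stored record, so an update never needs a full scan.
    """
    for rec in batch:
        key = rec[key_field]
        current = table.get(key)
        # An older record arriving late must not overwrite a newer one.
        if current is None or rec[ordering_field] >= current[ordering_field]:
            table[key] = rec
    return table


def delete(table: Dict[str, Record], keys: Iterable[str]) -> Dict[str, Record]:
    """Remove records by key; unknown keys are ignored."""
    for key in keys:
        table.pop(key, None)
    return table


table: Dict[str, Record] = {}
upsert(table, [
    {"id": "a", "ts": 2, "value": "new"},
    {"id": "a", "ts": 1, "value": "old"},   # out-of-order duplicate
    {"id": "b", "ts": 1, "value": "x"},
])
delete(table, ["b"])
```

After these calls the table holds one row for key `a`, the one with `ts` 2, and no row for key `b`.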

Unlock 10x efficiency by incrementally processing new data

Replace old-school batch pipelines with incremental streaming on your data lake. Experience faster ingestion and lower processing times for your data pipelines.
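
A minimal sketch of what incremental processing means in practice, written as a toy model rather than with Hudi's actual readers: each commit is a `(commit_time, records)` pair, and a consumer keeps a checkpoint so that each run reads only what was committed after it. The `read_incremental` function and the timestamp strings are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List, Tuple

# (commit_time, records written by that commit); commit times are
# fixed-width strings, so string order matches chronological order.
Commit = Tuple[str, List[Dict[str, Any]]]


def read_incremental(timeline: List[Commit], after: str) -> Tuple[List[Dict[str, Any]], str]:
    """Return the records committed strictly after `after`, plus the new checkpoint."""
    new_records: List[Dict[str, Any]] = []
    checkpoint = after
    for commit_time, records in timeline:
        if commit_time > after:
            new_records.extend(records)
            checkpoint = max(checkpoint, commit_time)
    return new_records, checkpoint


timeline: List[Commit] = [
    ("20240101000000", [{"id": 1}]),
    ("20240102000000", [{"id": 2}, {"id": 3}]),
    ("20240103000000", [{"id": 4}]),
]

# The first run has already consumed the first commit, so it sees only the rest.
records, checkpoint = read_incremental(timeline, after="20240101000000")
# A second run from the new checkpoint finds nothing new to process.
nothing, same_checkpoint = read_incremental(timeline, after=checkpoint)
```

A batch pipeline would reprocess all four records on every run; the incremental consumer touches three on the first run and none on the second.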

ACID transactional guarantees for your data lake

Atomic writes, with relational/streaming data consistency models, snapshot isolation and non-blocking concurrency controls tailored for longer-running lake transactions.

Analyze historical data with time travel

Query historical data with the ability to roll back to a table version; debug data versions to understand what changed over time; audit data changes by viewing the commit history.
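
Conceptually, time travel amounts to replaying the commit history up to a chosen point. The sketch below is a simplified model under stated assumptions, not Hudi's implementation: `snapshot_as_of` is a hypothetical helper, each commit is a dict of key-to-value changes, and a value of `None` marks a delete.

```python
from typing import Any, Dict, List, Optional, Tuple

# (commit_time, {key: new_value, or None to delete the key})
Commit = Tuple[str, Dict[str, Optional[Any]]]


def snapshot_as_of(timeline: List[Commit], as_of: str) -> Dict[str, Any]:
    """Rebuild the table state as it was at commit time `as_of` (inclusive)."""
    state: Dict[str, Any] = {}
    for commit_time, changes in sorted(timeline, key=lambda c: c[0]):
        if commit_time > as_of:
            break  # later commits are invisible to this snapshot
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state


history: List[Commit] = [
    ("20240101000000", {"a": 1, "b": 2}),
    ("20240102000000", {"a": 10}),
    ("20240103000000", {"b": None}),
]

before_delete = snapshot_as_of(history, "20240102000000")
latest = snapshot_as_of(history, "20240103000000")
```

Comparing `before_delete` with `latest` shows what changed between the two versions, which is the basis for debugging and auditing against the commit history.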

Interoperable multi-cloud ecosystem support

Built on open data formats with extensive ecosystem support across cloud vendors, with plug-and-play options for popular data sources & query engines.

Automatic table services for a high-performance lakehouse

Fully automated table services that continuously schedule & orchestrate clustering, compaction, cleaning, file sizing & indexing to ensure tables are always optimized.

Open data lakehouse platform to get you going faster

Effortlessly build your lakehouse with built-in tools for auto ingestion from services like Debezium and Kafka and auto catalog sync to major cloud engines & more.

Query acceleration through multimodal indexes

Experience faster write transactions on huge/wide tables & faster query performance with a first-of-its-kind multimodal indexing subsystem.
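
One way an index can speed up queries is data skipping: if per-file minimum and maximum values of a column are known, files that cannot match a range filter are never opened. The following is a toy sketch of that single idea, assuming a hypothetical `files_to_scan` helper and made-up file names; it does not reflect how Hudi stores or looks up its indexes.

```python
from typing import Dict, List, Tuple

# file name -> (min value, max value) of one column in that file
ColumnStats = Dict[str, Tuple[int, int]]


def files_to_scan(stats: ColumnStats, low: int, high: int) -> List[str]:
    """Return the files whose [min, max] range overlaps the query range [low, high]."""
    return sorted(
        name
        for name, (col_min, col_max) in stats.items()
        if col_min <= high and col_max >= low
    )


stats: ColumnStats = {
    "file-1.parquet": (0, 99),
    "file-2.parquet": (100, 199),
    "file-3.parquet": (200, 299),
}

# A filter such as `WHERE col BETWEEN 150 AND 210` can skip file-1 entirely.
candidates = files_to_scan(stats, 150, 210)
```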

Resilient pipelines with schema evolution & enforcement

Evolve the schema of a Hudi table as your data changes over time, and keep pipelines resilient with schema enforcement that fails fast and avoids data corruption.
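
To make the distinction concrete, here is a minimal sketch of the two behaviors in plain Python. The rules are deliberately simplified assumptions (new fields may be added, existing field types may not change) and the `evolve_schema` and `enforce` helpers are hypothetical; Hudi's actual evolution rules are richer than this.

```python
from typing import Any, Dict

Schema = Dict[str, type]


def evolve_schema(current: Schema, incoming: Schema) -> Schema:
    """Accept new fields, but reject a type change on an existing field."""
    merged = dict(current)
    for name, field_type in incoming.items():
        if name in current and current[name] is not field_type:
            raise TypeError(
                f"field {name!r}: {current[name].__name__} -> {field_type.__name__} not allowed"
            )
        merged[name] = field_type
    return merged


def enforce(schema: Schema, record: Dict[str, Any]) -> Dict[str, Any]:
    """Fail fast on unknown fields or wrong types; fill missing fields with None."""
    for name, value in record.items():
        if name not in schema:
            raise ValueError(f"unknown field {name!r}")
        if value is not None and not isinstance(value, schema[name]):
            raise TypeError(f"field {name!r} expects {schema[name].__name__}")
    return {name: record.get(name) for name in schema}


schema_v1: Schema = {"id": int, "name": str}
schema_v2 = evolve_schema(schema_v1, {"id": int, "name": str, "email": str})

# A record written before the new column existed still fits the new schema.
row = enforce(schema_v2, {"id": 1, "name": "Ada"})
```

Rejecting a bad record at write time, rather than discovering it at query time, is what keeps a malformed batch from corrupting the table.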