Apache Hudi | An Open Source Data Lake Platform

What is Hudi

Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi replaces slow, old-school batch data processing with an incremental processing framework for low-latency, minute-level analytics.

Hudi Features

Mutability support for all data lake workloads

Quickly update & delete data with Hudi’s fast, pluggable indexing. This includes streaming workloads, with full support for out-of-order data, bursty traffic & data deduplication.
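To make the idea of keyed updates, deletes, deduplication and out-of-order handling concrete, here is a minimal sketch in plain Python. It is a toy model, not Hudi's API or implementation: the `upsert` function, the `Record` shape and the integer ordering value are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record shape: (key, ordering_value, payload).
# A payload of None marks a delete for that key.
Record = Tuple[str, int, Optional[dict]]
Table = Dict[str, Tuple[int, dict]]


def upsert(table: Table, batch: Iterable[Record]) -> Table:
    """Return a new table with the batch applied (toy model only)."""
    # Step 1: deduplicate the batch, keeping the highest ordering value per key.
    latest: Dict[str, Record] = {}
    for key, ordering, payload in batch:
        if key not in latest or ordering >= latest[key][1]:
            latest[key] = (key, ordering, payload)

    # Step 2: merge into a copy of the table, skipping late arrivals that are
    # older than the version already stored.
    result = dict(table)
    for key, ordering, payload in latest.values():
        stored = result.get(key)
        if stored is not None and stored[0] > ordering:
            continue  # out-of-order record: the stored version is newer
        if payload is None:
            # Delete. This toy keeps no tombstone, so it does not remember
            # the delete for records that arrive even later.
            result.pop(key, None)
        else:
            result[key] = (ordering, payload)
    return result
```

The key lookup in step 2 is a plain dictionary access here; in a real table that lookup is the job of the index, which is why index speed matters for updates and deletes.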

Improved efficiency by incrementally processing new data

Replace old-school batch pipelines with incremental streaming on your data lake. Experience faster ingestion and lower processing times for analytical workloads.

ACID transactional guarantees for your data lake

Bring transactional guarantees to your data lake, with consistent, atomic writes and concurrency controls tailored for longer-running lake transactions.

Unlock historical data with time travel

Query historical data with the ability to roll back to a table version; debug data versions to understand what changed over time; audit data changes by viewing the commit history.
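The sketch below illustrates, under simplifying assumptions, how a commit history can support querying a past version, auditing changes and rolling back. It is a toy in-memory model in plain Python, not Hudi's timeline or API: the `Timeline` class, its method names and the string instants are hypothetical.

```python
from typing import Dict, List, Tuple


class Timeline:
    """Toy commit log: each commit is (instant, changes); a None value deletes a key."""

    def __init__(self) -> None:
        self.commits: List[Tuple[str, Dict[str, object]]] = []

    def commit(self, instant: str, changes: Dict[str, object]) -> None:
        # Instants are assumed to be sortable strings that only move forward.
        if self.commits and instant <= self.commits[-1][0]:
            raise ValueError("instants must be strictly increasing")
        self.commits.append((instant, dict(changes)))

    def as_of(self, instant: str) -> Dict[str, object]:
        """Rebuild the table state as it was at the given instant."""
        state: Dict[str, object] = {}
        for ts, changes in self.commits:
            if ts > instant:
                break
            for key, value in changes.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

    def history(self) -> List[str]:
        """List commit instants, oldest first, for auditing."""
        return [ts for ts, _ in self.commits]

    def rollback_to(self, instant: str) -> None:
        """Discard every commit made after the given instant."""
        self.commits = [c for c in self.commits if c[0] <= instant]
```

A short usage example: after committing changes at instants "20240101", "20240102" and "20240103", calling `as_of("20240102")` returns the table as it looked after the second commit, and `rollback_to("20240101")` discards the later two.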

Interoperable multi-cloud ecosystem support

Extensive ecosystem support with plug-and-play options for popular data sources & query engines. Build future-proof architectures interoperable with your vendor of choice.

Comprehensive table services for high-performance analytics

Fully automated table services that continuously schedule & orchestrate clustering, compaction, cleaning, file sizing & indexing to ensure tables are always ready.

A rich platform to build your lakehouse faster

Build your lakehouse with built-in tools for automatic ingestion from services like Debezium and Kafka, automatic catalog sync for easy discoverability, and more.

Query acceleration through multi-modal indexes

Experience faster write transactions on huge or wide tables and faster query performance with a first-of-its-kind multi-modal indexing subsystem.

Resilient pipelines with schema evolution & enforcement

Easily change the schema of a Hudi table to adapt to data that changes over time, and keep pipelines resilient by failing fast and avoiding data corruption.

Why Hudi

Take advantage of Hudi’s platform with rich services and tools to make your data lake actionable for applications like personalization, machine learning, customer 360 and more!

Trusted Platform

Battle tested and proven in production in some of the largest data lakes on the planet.

Open Source

Hudi has a thriving, growing community built on contributions from people around the globe.

Derived tables

Seamlessly create and manage SQL tables on your data lake to build multi-stage incremental pipelines.

Data streams

Take advantage of built-in CDC sources and tools for streaming ingestion.

Hudi Blogs

What is a Data Lakehouse & How does it Work?

Dipankar Mazumdar

July 11, 2024

Join our Community

Get technical help, influence the product roadmap & see what’s new with Hudi!

GitHub

Slack

Twitter

YouTube

Mailing list

More Hudi Blogs

How to use Apache Hudi with Databricks

Apache Hudi: A Deep Dive with Python Code Examples

Apache Hudi vs. Delta Lake: Choosing the Right Tool for Your Data Lake on AWS

Use AWS Data Exchange to seamlessly share Apache Hudi datasets

Apache Hudi on AWS Glue

Building Analytical Apps on the Lakehouse using Apache Hudi, Daft & Streamlit

Learn how to read Hudi data with AWS Glue Ray using Daft (No Spark)

How to Query Apache Hudi Tables with Python Using Daft: A Spark-Free Approach

Apache Hudi vs Apache Iceberg: A Comprehensive Comparison
