Overview
Welcome to Apache Hudi! This overview provides a high-level summary of what Apache Hudi is and orients you on how to learn more and get started.
What is Apache Hudi
Apache Hudi (pronounced “hoodie”) is the next-generation streaming data lake platform. Apache Hudi brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping your data in open source file formats.
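For a flavor of what this looks like in practice, here is a minimal sketch of an upsert into a Hudi table using the Spark DataFrame API; the table name, base path, and field names (`uuid`, `ts`, `rider`, `fare`, `city`) are illustrative assumptions, not values prescribed by the docs.

```scala
// A minimal sketch (not the official quick start): upserting a few records into a
// Hudi table from spark-shell. Table name, base path and field names are illustrative.
import org.apache.spark.sql.SaveMode
import spark.implicits._   // `spark` is the SparkSession provided by spark-shell

val basePath = "file:///tmp/hudi_trips"   // could equally be s3://, gs://, hdfs://, ...
val df = Seq(
  ("id-1", "rider-A", 27.70, 1695159649L, "san_francisco"),
  ("id-2", "rider-B", 33.90, 1695159650L, "sao_paulo")
).toDF("uuid", "rider", "fare", "ts", "city")

df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").     // key used to match records for upserts/deletes
  option("hoodie.datasource.write.partitionpath.field", "city"). // partition column
  option("hoodie.datasource.write.precombine.field", "ts").      // on duplicate keys, the latest ts wins
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(basePath)
```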
Not only is Apache Hudi great for streaming workloads, but it also allows you to create efficient incremental batch pipelines. Read the docs for more use case descriptions, and check out who's using Hudi to see how some of the largest organizations in the world, including Uber, Amazon, ByteDance, Robinhood, and more, are transforming their production data lakes with Hudi.
Apache Hudi can easily be used on any cloud storage platform. Hudi’s advanced performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Flink, Presto, Trino, Hive, and more.
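To make the query-engine point concrete with Spark specifically, reading a Hudi table back is just an ordinary DataFrame load; the path below reuses the hypothetical table from the sketch above, and a snapshot query needs no extra options:

```scala
// Sketch: querying the hypothetical Hudi table written above with Spark.
// Snapshot queries are the default, so no special read options are required.
val tripsDF = spark.read.format("hudi").load("file:///tmp/hudi_trips")
tripsDF.createOrReplaceTempView("hudi_trips_view")
spark.sql("SELECT uuid, rider, fare, city FROM hudi_trips_view WHERE fare > 30.0").show()
```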
Core Concepts to Learn
If you are relatively new to Apache Hudi, it is important to be familiar with a few core concepts:
- Hudi Timeline – How Hudi manages transactions and other table services
- Hudi File Layout – How the files are laid out on storage
- Hudi Table Types – COPY_ON_WRITE and MERGE_ON_READ
- Hudi Query Types – Snapshot Queries, Incremental Queries, Read-Optimized Queries (see the sketch below)
See more in the "Concepts" section of the docs.
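As a small illustration of where two of these concepts surface in code, the sketch below picks a table type at write time and a query type at read time using the Spark datasource options; the path, table name, and begin instant time are placeholders, not values from the docs.

```scala
// Sketch: choosing a table type at write time and a query type at read time.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val morPath = "file:///tmp/hudi_trips_mor"
val df = Seq(("id-1", "rider-A", 27.70, 1695159649L, "san_francisco")).toDF("uuid", "rider", "fare", "ts", "city")

// Table type is fixed when the table is created: COPY_ON_WRITE (the default) or MERGE_ON_READ.
df.write.format("hudi").
  option("hoodie.table.name", "hudi_trips_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(morPath)

// Query type is chosen per read: snapshot (the default), incremental, or read_optimized.
val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20240101000000"). // only commits after this instant
  load(morPath)
incrementalDF.show()
```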
Take a look at recent blog posts that go in depth on certain topics or use cases.
Getting Started
Sometimes the fastest way to learn is by doing. Try out these Quick Start resources to get up and running in minutes:
- Spark Quick Start Guide – if you primarily use Apache Spark
- Flink Quick Start Guide – if you primarily use Apache Flink
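If you would rather experiment from a standalone Scala application than the spark-shell the quick start guides use, the sketch below shows one common way to configure a SparkSession for Hudi; the Hudi Spark bundle still has to be on the classpath, and its exact artifact coordinates depend on your Spark and Scala versions, so take those from the quick start guide rather than from here.

```scala
// Sketch: a SparkSession configured for Hudi in a standalone Scala app.
// The Hudi Spark bundle must be on the classpath (e.g. via --packages or your build tool);
// see the Spark Quick Start Guide for the coordinates matching your Spark/Scala versions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().
  appName("hudi-quickstart").
  master("local[*]").   // local mode, just for experimenting
  config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").
  getOrCreate()
```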
If you want to experience Apache Hudi integrated into an end-to-end demo with Kafka, Spark, Hive, Presto, etc., try out the Docker Demo: