
Blogs

Welcome to Apache Hudi blogs! Here you'll find the latest articles, tutorials, and updates from the Hudi community.

All Blog Posts

Deep Dive Into Hudi’s Indexing Subsystem (Part 1 of 2)
Shiyan Xu
October 29, 2025

For decades, databases have relied on indexes—specialized data structures—to dramatically improve read and write performance by quickly locating specific records. Apache Hudi extends this fundamental principle to the data lakehouse with a unique and powerful approach. Every Hudi table contains a self-managed metadata table that functions as an indexing subsystem, enabling efficient data skipping and fast record lookups across a wide range of read and write scenarios.

Partition Stats: Enhancing Column Stats in Hudi 1.0
Aditya Goenka and Shiyan Xu
October 22, 2025

For those tracking Apache Hudi's performance enhancements, the introduction of the column stats index was a significant development, as detailed in an earlier blog post. It advanced query optimization with a straightforward yet highly effective concept: storing lightweight, file-level statistics (such as min/max values and null counts) for specific columns. These statistics let query engines skip files that cannot match a predicate, which substantially improves query performance.

Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph
Jaz Samantha Ku, in collaboration with Shiyan Xu
October 2, 2025

CrowdStrike’s 2025 Global Threat Report puts average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are alerted to a potential breach, attackers have long since infiltrated the system. And that’s assuming they get alerted at all. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation.

Automatic Record Key Generation in Apache Hudi
Shiyan Xu
September 17, 2025

In database systems, the primary key is a foundational design principle for managing data at the record level. Its function is to provide each record with a unique and stable logical identifier, which decouples the record's identity from its physical location on storage. While using direct physical address pointers (e.g., the position inside a file used as a key) can be convenient, the physical address can change when records are moved around within the table by operations such as clustering or z-ordering.

Modernizing Data Infrastructure at Peloton Using Apache Hudi
Amaresh Bingumalla, Thinh Kenny Vu, Gabriel Wang, Arun Vasudevan, in collaboration with Dipankar Mazumdar
July 15, 2025

Peloton re-architected its data platform using Apache Hudi to overcome snapshot delays, rigid service coupling, and high operational costs. By adopting CDC-based ingestion from PostgreSQL and DynamoDB, moving from CoW to MoR tables, and leveraging asynchronous services with fine-grained schema control, Peloton achieved 10-minute ingestion cycles, reduced compute/storage overhead, and enabled time travel and GDPR compliance.
