Ziyue Guan from Bytedance shares the experience of building an ExaByte(EB)-level data lake using Apache Hudi at Bytedance.
In one of the previous blog posts, we introduced a new kind of table service called clustering to reorganize data for improved query performance without compromising on ingestion speed. We learnt how to setup inline clustering. In this post, we will discuss what has changed since then and see how asynchronous clustering can be setup using HoodieClusteringJob as well as DeltaStreamer utility.
In this post we will talk about a new deltastreamer source which reliably and efficiently processes new data files as they arrive in AWS S3. As of today, to ingest data from S3 into Hudi, users leverage DFS source whose path selector would identify the source files modified since the last checkpoint based on max modification time. The problem with this approach is that modification time precision is upto seconds in S3. It maybe possible that there were many files (beyond what the configurable source limit allows) modifed in that second and some files might be skipped. For more details, please refer to HUDI-1723. While the workaround is to ignore the source limit and keep reading, the problem motivated us to redesign so that users can reliably ingest from S3.
Hudi supports fully automatic cleanup of uncommitted data on storage during its write operations. Write operations in an Apache Hudi table use markers to efficiently track the data files written to storage. In this blog, we dive into the design of the existing direct marker file mechanism and explain its performance problems on cloud storage like AWS S3 for very large writes. We demonstrate how we improve write performance with introduction of timeline-server-based markers.
Apache Hudi helps you build and manage data lakes with different table types, config knobs to cater to everyone's need.
Hudi adds per record metadata fields like
_hoodie_commit_time which serves multiple purposes.
They assist in avoiding re-computing the record key, partition path during merges, compaction and other table operations
and also assists in supporting record-level incremental queries (in comparison to other table formats, that merely track files).
In addition, it ensures data quality by ensuring unique key constraints are enforced even if the key field changes for a given table, during its lifetime.
But one of the repeated asks from the community is to leverage existing fields and not to add additional meta fields, for simple use-cases where such benefits are not desired or key changes are very rare.
The schema used for data exchange between services can change rapidly with new business requirements. Apache Hudi is often used in combination with kafka as a event stream where all events are transmitted according to a record schema. In our case a Confluent schema registry is used to maintain the schema and as schema evolves, newer versions are updated in the schema registry.
As early as 2016, we set out a bold, new vision reimagining batch data processing through a new “incremental” data processing stack - alongside the existing batch and streaming stacks. While a stream processing pipeline does row-oriented processing, delivering a few seconds of processing latency, an incremental pipeline would apply the same principles to columnar data in the data lake, delivering orders of magnitude improvements in processing efficiency within few minutes, on extremely scalable batch storage/compute infrastructure. This new stack would be able to effortlessly support regular batch processing for bulk reprocessing/backfilling as well. Hudi was built as the manifestation of this vision, rooted in real, hard problems faced at Uber and later took a life of its own in the open source community. Together, we have been able to usher in fully incremental data ingestion and moderately complex ETLs on data lakes already.
Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users on how to maintain just the required number of old file versions so that long running readers do not fail.
Apache Hudi is a data lake platform technology that provides several functionalities needed to build and manage data lakes. One such key feature that hudi provides is self-managing file sizing so that users don’t need to worry about manual table maintenance. Having a lot of small files will make it harder to achieve good query performance, due to query engines having to open/read/close files way too many times, to plan and execute queries. But for streaming data lake use-cases, inherently ingests are going to end up having smaller volume of writes, which might result in lot of small files if no special handling is done.
Every record in Hudi is uniquely identified by a primary key, which is a pair of record key and partition path where the record belongs to. Using primary keys, Hudi can impose a) partition level uniqueness integrity constraint b) enable fast updates and deletes on records. One should choose the partitioning scheme wisely as it could be a determining factor for your ingestion and query latency.