Record Mergers
Hudi handles mutations to records and streaming data, as we briefly touched upon in timeline ordering section. To provide users full-fledged support for stream processing, Hudi goes all the way making the storage engine and the underlying storage format understand how to merge changes to the same record key, that may arrive even in different order at different times. With the rise of mobile applications and IoT, these scenarios have become the normal than an exception. For e.g. a social networking application uploading user events several hours after they happened, when the user connects to WiFi networks.
To achieve this, Hudi supports merge modes, which define how the base and log files are ordered in a file slice and further how different records with the same record key within that file slice are merged consistently to produce the same deterministic results for snapshot queries, writers and table services. Specifically, there are three merge modes supported as a table-level configuration, invoked in the following places.
-
(writing) Combining multiple change records for the same record key while reading input data during writes. This is an optional optimization that reduces the number of records written to log files to improve query and write performance subsequently.
-
(writing) Merging final change record (partial/full update/delete) against existing record in storage for CoW tables.
-
(compaction) Compaction service merges all change records in log files against base files, respecting the merge mode.
-
(query) Merging change records in log files, after filtering/projections against base file for MoR table queries.
Note that the merge mode should not be altered once the table is created to avoid inconsistent behavior due to compaction producing different merge results when switching between the modes.