Apache Hudi 1.0.0-beta1 is the first beta release of Apache Hudi. This release is meant for early adopters to try out the new features and provide feedback. The release is not meant for production use.
This release contains major format changes, as we will see in the highlights below. As such, migration will be required when the release is made generally available (GA). However, we encourage users to try out the features on new tables.
Given that the timeline format and log file format have changed in this beta release, it is recommended not to attempt rolling upgrades from older versions to this release.
Timeline
- All commit metadata is now serialized to Avro. This allows us to add new fields in the future without breaking compatibility and also maintains uniformity in metadata across all actions.
- All completed commit metadata file names will also include the completion time. All actions in requested/inflight states are stored in the active timeline as files named <begin_instant_time>.<action_type>.<requested|inflight>. Completed actions are stored along with a time that denotes when the action was completed, in a file named <begin_instant_time>_<completion_instant_time>.<action_type>. This allows us to implement file slicing for non-blocking concurrency control.
- Completed actions, their plans and completion metadata are stored in a more scalable LSM-tree-based timeline organized in an **archived** storage location under the .hoodie metadata path. It consists of Apache Parquet files with action instant data and bookkeeping metadata files. Check out the timeline docs for more details.
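To make the naming scheme concrete, here is a pure-Python sketch (not Hudi's code) that parses timeline instant file names as described above; the `_` separator between begin and completion times is an assumption of this sketch:

```python
# Hypothetical sketch: parse Hudi 1.x timeline instant file names.
# Completed instants: <begin_instant_time>_<completion_instant_time>.<action_type>
# Pending instants:   <begin_instant_time>.<action_type>.<requested|inflight>
def parse_instant(filename: str) -> dict:
    parts = filename.split(".")
    if parts[-1] in ("requested", "inflight"):
        # pending instant: no completion time in the name yet
        begin, action, state = parts[0], parts[1], parts[2]
        return {"begin": begin, "completed": None, "action": action, "state": state}
    # completed instant: name carries both begin and completion times
    times, action = parts[0], parts[1]
    begin, completed = times.split("_")
    return {"begin": begin, "completed": completed, "action": action, "state": "completed"}

print(parse_instant("20240101120000_20240101120005.commit"))
print(parse_instant("20240101120000.commit.inflight"))
```

Having the completion time directly in the file name means a reader can order completed actions without opening each metadata file.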
Log File Format
- In addition to the fields in the log file header, we also store record positions. Refer to the latest spec for more details. This allows us to do position-based merging (apart from key-based merging) and skip pages based on positions.
- Log file names will now have the deltacommit instant time instead of the base commit instant time.
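To illustrate why stored record positions help, here is a pure-Python sketch (not Hudi's implementation) contrasting key-based and position-based merging of log records into a base file:

```python
# Illustrative sketch of key-based vs position-based merging (not Hudi's code).
# base: records in base-file order; log records update existing records.
def merge_by_key(base, log):
    # key-based: build a map from key -> latest log record, probe per base record
    updates = {rec["key"]: rec for rec in log}
    return [updates.get(rec["key"], rec) for rec in base]

def merge_by_position(base, log_with_positions):
    # position-based: apply each update directly at its recorded base-file
    # position, skipping key lookups (and, in a columnar reader, whole
    # unaffected pages can be skipped based on position ranges)
    merged = list(base)
    for pos, rec in log_with_positions:
        merged[pos] = rec
    return merged

base = [{"key": "a", "v": 1}, {"key": "b", "v": 2}, {"key": "c", "v": 3}]
log = [{"key": "b", "v": 20}]
assert merge_by_key(base, log) == merge_by_position(base, [(1, {"key": "b", "v": 20})])
```

Both paths produce the same merged result; the position-based path avoids building and probing a key map.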
Multiple base file formats
Now you can have multiple base file formats in a Hudi table; even the same file group can have multiple base file formats. Set the table config `hoodie.table.multiple.base.file.formats.enable` to use this feature, and whenever you need to change the format, simply specify it in the `hoodie.base.file.format` config. Currently, only Parquet, ORC and HFile formats are supported. This unlocks multiple benefits, including choosing a file format suitable to an index, and supporting emerging formats for ML/AI.
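In practice this might look like the following configs (a sketch using only the config keys named above; the values shown are illustrative):

```
# table config: allow multiple base file formats in the table
hoodie.table.multiple.base.file.formats.enable=true

# config for subsequent writes: e.g., switching the base file format to HFile
hoodie.base.file.format=hfile
```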
Non-blocking Concurrency Control
A new concurrency control mode called NON_BLOCKING_CONCURRENCY_CONTROL is introduced in this release. Unlike OCC, multiple writers can operate on the table with non-blocking conflict resolution: the writers can write into the same file group, with conflicts resolved automatically by the query reader and the compactor. The new concurrency mode is currently available for preview in version 1.0.0-beta only. You can read more about it under section Model C: Multi-writer. A complete example with multiple Flink streaming writers is available here. You can follow the RFC and the JIRA for more details.
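The resolution idea can be sketched in pure Python (an illustration of the concept, not Hudi's implementation): two writers append updates to the same file group without locking, and the reader or compactor resolves conflicts by ordering the deltacommits on their completion time from the timeline.

```python
# Illustrative sketch: resolve concurrent updates by completion-time ordering.
def resolve(base, deltacommits):
    merged = dict(base)
    # apply deltacommits in completion-time order; later completions win per key
    for _, updates in sorted(deltacommits, key=lambda d: d[0]):
        merged.update(updates)
    return merged

base = {"k1": "v0", "k2": "v0"}
writer_a = ("t5", {"k1": "a"})              # deltacommit completed at t5
writer_b = ("t9", {"k1": "b", "k2": "b"})   # deltacommit completed at t9
print(resolve(base, [writer_b, writer_a]))
```

Because every completed action carries its completion time in the timeline, the resolution is deterministic regardless of the order in which the writers' commits are observed.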
Functional Index
A functional index is an index on a function of a column. It is a new addition to Hudi's multi-modal indexing subsystem which provides a faster access method and also absorbs partitioning as part of the indexing system. Now you can simply create and drop an index using SQL syntax as follows:
-- Create Index
CREATE INDEX [IF NOT EXISTS] index_name ON [TABLE] table_name
[USING index_type]
(column_name1 [OPTIONS(key1=value1, key2=value2, ...)], column_name2 [OPTIONS(key1=value1, key2=value2, ...)], ...)
[OPTIONS (key1=value1, key2=value2, ...)]
-- Drop Index
DROP INDEX [IF EXISTS] index_name ON [TABLE] table_name
- `index_name` is the name of the index to be created or dropped.
- `table_name` is the name of the table on which the index is created or dropped.
- `index_type` is the type of the index to be created. Currently, only certain index types are supported.
- `column_name` is the name of the column on which the index is created.
- Both the index and the column on which the index is created can be qualified with options in the form of key-value pairs.
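As a concrete instance of the syntax above (the table, column, and index names here are hypothetical):

```
-- hypothetical example following the syntax above
CREATE INDEX IF NOT EXISTS idx_event_city ON TABLE hudi_events (city);

DROP INDEX IF EXISTS idx_event_city ON hudi_events;
```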
Record Merger API
The HoodieRecordPayload interface was deprecated in favor of the new HoodieRecordMerger interface in version 0.13.0. The new interface has been further enhanced to support all kinds of merging operations. In particular, it now supports partial merging and custom checks before flushing merged records to disk. Please check the documentation of the API for more details.
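The semantics of a partial merge can be sketched in a few lines (a pure-Python illustration of the concept, not the actual HoodieRecordMerger interface; the field names and validation rule are made up for the example):

```python
# Conceptual sketch of partial merging: the incoming record carries only the
# changed fields; the merger overlays them on the stored record instead of
# replacing it wholesale.
def partial_merge(older: dict, newer_partial: dict) -> dict:
    merged = dict(older)
    for field, value in newer_partial.items():
        merged[field] = value            # newer partial fields win
    return merged

def validate_before_flush(record: dict) -> bool:
    # example of a custom check run before flushing merged records to disk
    return record.get("id") is not None

stored = {"id": 1, "name": "a", "city": "NYC"}
update = {"city": "SF"}                  # partial update: only 'city' changed
merged = partial_merge(stored, update)
assert validate_before_flush(merged)
print(merged)
```

Untouched fields (`id`, `name` here) survive the merge, which is what lets partial updates avoid rewriting full records.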
New FileGroup Reader
In addition to key-based merging of records in log files with base files for queries on MOR tables, we have implemented position-based merging and position-based page skipping. The new reader has shown impressive performance gains for partial updates with key-based merging. For a 1TB MOR table with 100 partitions and 80% random updates in subsequent commits, the new reader is 5.7x faster for snapshot queries, with 70x reduced write amplification. However, for position-based merging, the gains are yet to be realized as filter pushdown support is in progress. The new reader is enabled by default for all new tables. The following configs are used to control the reader:
- a config to toggle the new reader (enabled by default)
- a config to enable position-based merging, if required
A few things to note for the new reader:
- It is only applicable to COW or MOR tables with base files in Parquet format.
- Only snapshot queries for COW table, and snapshot queries and read-optimized queries for MOR table are supported.
- Currently, the reader cannot push down data filters to the scan. It is recommended to use key-based merging for now.
Raw Release Notes
The raw release notes are available here.