All Configurations
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels.
- Hudi Table Config: Basic Hudi Table configuration parameters.
- Environment Config: Hudi supports passing configurations via a configuration file
hudi-defaults.confin which each line consists of a key and a value separated by whitespace or = sign. For example:
hoodie.datasource.hive_sync.mode jdbc
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000
hoodie.datasource.hive_sync.support_timestamp false
It helps to have a central configuration file for your common cross job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements.
Hudi always loads the configuration file under default directory file:/etc/hudi/conf, if exists, to set the default configs. Besides, you can specify another configuration directory location by setting the HUDI_CONF_DIR environment variable. The configs stored in HUDI_CONF_DIR/hudi-defaults.conf are loaded, overriding any configs already set by the config file in the default directory.
- Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
- Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction or choosing query type to read.
- Write Client Configs: Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.
- Reader Configs: Please fill in the description for Config Group Name: Reader Configs
- Metastore and Catalog Sync Configs: Configurations used by the Hudi to sync metadata to external metastores and catalogs.
- Metrics Configs: These set of configs are used to enable monitoring and reporting of key Hudi stats and metrics.
- Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload which simply update table with the latest/last-written record. This can be overridden to a custom class extending HoodieRecordPayload class, on both datasource and WriteClient levels.
- Kafka Connect Configs: These set of configs are used for Kafka Connect Sink Connector for writing Hudi Tables
- Amazon Web Services Configs: Configurations specific to Amazon Web Services.
- Hudi Streamer Configs: These set of configs are used for Hudi Streamer utility which provides the way to ingest from different sources such as DFS or Kafka.
In the tables below (N/A) means there is no default value set
Externalized Config File
Instead of directly passing configuration settings to every Hudi job, you can also centrally set them in a configuration
file hudi-defaults.conf. By default, Hudi would load the configuration file under /etc/hudi/conf directory. You can
specify a different configuration directory location by setting the HUDI_CONF_DIR environment variable. This can be
useful for uniformly enforcing repeated configs (like Hive sync or write/index tuning), across your entire data lake.
Hudi Table Config
Basic Hudi Table configuration parameters.
Hudi Table Basic Configs
Configurations of the Hudi Table like type of ingestion, storage formats, hive table name etc. Configurations are loaded from hoodie.properties, these properties are usually set during initializing a path as hoodie base path and never changes during the lifetime of a hoodie table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.bootstrap.base.path | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi tableConfig Param: BOOTSTRAP_BASE_PATH |
| hoodie.compaction.payload.class | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.Config Param: PAYLOAD_CLASS_NAME |
| hoodie.database.name | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific databaseConfig Param: DATABASE_NAME |
| hoodie.record.merge.mode | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.Config Param: RECORD_MERGE_MODESince Version: 1.0.0 |
| hoodie.record.merge.strategy.id | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.write.record.merge.custom.implementation.classes which has the same merger strategy idConfig Param: RECORD_MERGE_STRATEGY_IDSince Version: 0.13.0 |
| hoodie.table.checksum | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.Config Param: TABLE_CHECKSUMSince Version: 0.11.0 |
| hoodie.table.create.schema | (N/A) | Schema used when creating the tableConfig Param: CREATE_SCHEMA |
| hoodie.table.index.defs.path | (N/A) | Relative path to table base path where the index definitions are storedConfig Param: RELATIVE_INDEX_DEFINITION_PATHSince Version: 1.0.0 |
| hoodie.table.keygenerator.class | (N/A) | Key Generator class property for the hoodie tableConfig Param: KEY_GENERATOR_CLASS_NAME |
| hoodie.table.keygenerator.type | (N/A) | Key Generator type to determine key generator classConfig Param: KEY_GENERATOR_TYPESince Version: 1.0.0 |
| hoodie.table.legacy.payload.class | (N/A) | Payload class to indicate the payload class that is used to create the table and is not used anymore.Config Param: LEGACY_PAYLOAD_CLASS_NAMESince Version: 1.1.0 |
| hoodie.table.metadata.partitions | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readersConfig Param: TABLE_METADATA_PARTITIONSSince Version: 0.11.0 |
| hoodie.table.metadata.partitions.inflight | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.Config Param: TABLE_METADATA_PARTITIONS_INFLIGHTSince Version: 0.11.0 |
| hoodie.table.name | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.Config Param: NAME |
| hoodie.table.ordering.fields | (N/A) | Comma separated fields used in records merging comparison. By default, when two records have the same key value, the largest value for the ordering field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.Config Param: ORDERING_FIELDS |
| hoodie.table.partial.update.mode | (N/A) | This property when set, will define how two versions of the record will be merged together when records are partially formedConfig Param: PARTIAL_UPDATE_MODESince Version: 1.1.0 |
| hoodie.table.partition.fields | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generatorsConfig Param: PARTITION_FIELDS |
| hoodie.table.precombine.field | (N/A) | Comma separated fields used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.Config Param: PRECOMBINE_FIELD |
| hoodie.table.recordkey.fields | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.Config Param: RECORDKEY_FIELDS |
| hoodie.table.secondary.indexes.metadata | (N/A) | The metadata of secondary indexesConfig Param: SECONDARY_INDEXES_METADATASince Version: 0.13.0 |
| hoodie.timeline.layout.version | (N/A) | Version of timeline used, by the table.Config Param: TIMELINE_LAYOUT_VERSION |
| hoodie.archivelog.folder | archived | path under the meta folder, to store archived timeline instants at.Config Param: ARCHIVELOG_FOLDER |
| hoodie.bootstrap.index.class | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.Config Param: BOOTSTRAP_INDEX_CLASS_NAME |
| hoodie.bootstrap.index.enable | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.Config Param: BOOTSTRAP_INDEX_ENABLE |
| hoodie.bootstrap.index.type | HFILE | Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data.Config Param: BOOTSTRAP_INDEX_TYPESince Version: 1.0.0 |
| hoodie.datasource.write.hive_style_partitioning | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)Config Param: HIVE_STYLE_PARTITIONING_ENABLE |
| hoodie.partition.metafile.use.base.format | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.Config Param: PARTITION_METAFILE_USE_BASE_FORMAT |
| hoodie.populate.meta.fields | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processingConfig Param: POPULATE_META_FIELDS |
| hoodie.table.base.file.format | PARQUET | Base file format to store all the base file data.Config Param: BASE_FILE_FORMAT |
| hoodie.table.cdc.enabled | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.Config Param: CDC_ENABLEDSince Version: 0.13.0 |
| hoodie.table.cdc.supplemental.logging.mode | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.Config Param: CDC_SUPPLEMENTAL_LOGGING_MODESince Version: 0.13.0 |
| hoodie.table.format | native | Table format name used when writing to the table.Config Param: TABLE_FORMAT |
| hoodie.table.initial.version | NINE | Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup.Config Param: INITIAL_VERSIONSince Version: 1.0.0 |
| hoodie.table.log.file.format | HOODIE_LOG | Log format used for the delta logs.Config Param: LOG_FILE_FORMAT |
| hoodie.table.multiple.base.file.formats.enable | false | When set to true, the table can support reading and writing multiple base file formats.Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLESince Version: 1.0.0 |
| hoodie.table.timeline.timezone | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is defaultConfig Param: TIMELINE_TIMEZONE |
| hoodie.table.type | COPY_ON_WRITE | The table type for the underlying data.Config Param: TYPE |
| hoodie.table.version | NINE | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.Config Param: VERSION |
| hoodie.timeline.history.path | history | path under the meta folder, to store timeline history at.Config Param: TIMELINE_HISTORY_PATH |
| hoodie.timeline.path | timeline | path under the meta folder, to store timeline instants at.Config Param: TIMELINE_PATH |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.write.drop.partition.columns | false | When set to true, will not write the partition columns into hudi. By default, false.Config Param: DROP_PARTITION_COLUMNS |
| hoodie.datasource.write.partitionpath.urlencode | false | Should we url encode the partition path value, before creating the folder structure.Config Param: URL_ENCODE_PARTITIONING |
| hoodie.datasource.write.slash.separated.date.partitioning | false | Flag to indicate whether to use slash separated date partitioning. If set to true, date partition values in yyyy-MM-dd format will be transformed to yyyy/MM/dd directory structure. By default false. Cannot be used together with hive-style partitioning.Config Param: SLASH_SEPARATED_DATE_PARTITIONING |
| hoodie.table.partition_extractor_class | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: PARTITION_EXTRACTOR_CLASS |
Spark Datasource Configs
These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
Read Options
Options useful for reading tables via read.format.option(...)
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.read.begin.instanttime | (N/A) | Required when hoodie.datasource.query.type is set to incremental. Represents the completion time to start incrementally pulling data from. The completion time here need not necessarily correspond to an instant on the timeline. New data written with completion_time >= START_COMMIT are fetched out. For e.g: ‘20170901080000’ will get all new data written on or after Sep 1, 2017 08:00AM.Config Param: START_COMMITSince Version: 0.9.0 |
| hoodie.datasource.read.end.instanttime | (N/A) | Used when hoodie.datasource.query.type is set to incremental. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified.Config Param: END_COMMITSince Version: 0.9.0 |
| hoodie.datasource.read.incr.table.version | (N/A) | The table version assumed for incremental readConfig Param: INCREMENTAL_READ_TABLE_VERSIONSince Version: 1.0.0 |
| hoodie.datasource.read.streaming.table.version | (N/A) | The table version assumed for streaming readConfig Param: STREAMING_READ_TABLE_VERSIONSince Version: 1.0.0 |
| hoodie.datasource.write.precombine.field | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge modeConfig Param: READ_PRE_COMBINE_FIELD |
| hoodie.datasource.query.type | snapshot | Whether data needs to be read, in incremental mode (new data since an instantTime) (or) read_optimized mode (obtain latest view, based on base files) (or) snapshot mode (obtain latest view, by merging base and (if any) log files)Config Param: QUERY_TYPESince Version: 0.9.0 |
| Config Name | Default | Description |
|---|---|---|
| as.of.instant | (N/A) | The query instant for time travel. Without specified this option, we query the latest snapshot.Config Param: TIME_TRAVEL_AS_OF_INSTANT |
| hoodie.datasource.read.paths | (N/A) | Comma separated list of file paths to read within a Hudi table.Config Param: READ_PATHSSince Version: 0.9.0 |
| hoodie.datasource.merge.type | payload_combine | For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records without merging; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records (combines the two records with same key for base and log file records), then read the left log file recordsConfig Param: REALTIME_MERGE |
| hoodie.datasource.query.incremental.format | latest_state | This config is used alone with the 'incremental' query type.When set to 'latest_state', it returns the latest records' values.When set to 'cdc', it returns the cdc data.Config Param: INCREMENTAL_FORMATSince Version: 0.13.0 |
| hoodie.datasource.read.create.filesystem.relation | false | When this is set, the relation created by DefaultSource is for a view representing the result set of the table valued function hudi_filesystem_view(...)Config Param: CREATE_FILESYSTEM_RELATIONSince Version: 1.0.0 |
| hoodie.datasource.read.extract.partition.values.from.path | false | When set to true, values for partition columns (partition values) will be extracted from physical partition path (default Spark behavior). When set to false partition values will be read from the data file (in Hudi partition columns are persisted by default). This config is a fallback allowing to preserve existing behavior, and should not be used otherwise.Config Param: EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATHSince Version: 0.11.0 |
| hoodie.datasource.read.file.index.list.partitions.from.catalog | false | Controls whether partition listing is obtained from the catalog instead of listing the file system to avoid recursively listing of large number of directories. Enabling this can reduce large amount of listing calls and speed up the queries for very large tables. This is only necessary when MDT is not enabled on the dataset as otherwise the MDT can provide the partition listing faster and without any actual listing on the file system.Config Param: FILE_INDEX_PARTITION_LISTING_VIA_CATALOGSince Version: 1.2.0 |
| hoodie.datasource.read.file.index.listing.mode | lazy | Overrides Hudi's file-index implementation's file listing mode: when set to 'eager', file-index will list all partition paths and corresponding file slices w/in them eagerly, during initialization, prior to partition-pruning kicking in, meaning that all partitions will be listed including ones that might be subsequently pruned out; when set to 'lazy', partitions and file-slices w/in them will be listed lazily (ie when they actually accessed, instead of when file-index is initialized) allowing partition pruning to occur before that, only listing partitions that has already been pruned. Please note that, this config is provided purely to allow to fallback to behavior existing prior to 0.13.0 release, and will be deprecated soon after.Config Param: FILE_INDEX_LISTING_MODE_OVERRIDESince Version: 0.13.0 |
| hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled | true | Controls whether partition-path prefix analysis is enabled w/in the file-index, allowing to avoid necessity to recursively list deep folder structures of partitioned tables w/ multiple partition columns, by carefully analyzing provided partition-column predicates and deducing corresponding partition-path prefix from them (if possible).Config Param: FILE_INDEX_LISTING_PARTITION_PATH_PREFIX_ANALYSIS_ENABLEDSince Version: 0.13.0 |
| hoodie.datasource.read.file.index.optimize.listing.using.path.filter | false | Controls whether file listing is done using the HoodieROTablePathFilter. This is mainly necessary when the metadata table is not enabled or corrupted and the job is doing recursive calls to fetch the partition paths and the dataset has multiple versions of the same file in the same partition and it could lead to Out of Memory on the driver if the dataset is too large. Another important limitation is that this config should not be used if there are bootstrap files present in the file system. NOTE: Only works for COW tables with snapshot queries.Config Param: FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTERSince Version: 1.2.0 |
| hoodie.datasource.read.incr.fallback.fulltablescan.enable | true | When doing an incremental query whether we should fall back to full table scans if file does not exist.Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES |
| hoodie.datasource.read.incr.fallback.fulltablescan.enable | true | When doing an incremental query whether we should fall back to full table scans if file does not exist.Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN |
| hoodie.datasource.read.incr.filters | For use-cases like DeltaStreamer which reads from Hoodie Incremental table and applies opaque map functions, filters appearing late in the sequence of transformations cannot be automatically pushed down. This option allows setting filters directly on Hoodie Source.Config Param: PUSH_DOWN_INCR_FILTERSSince Version: 0.9.0 | |
| hoodie.datasource.read.incr.path.glob | For the use-cases like users only want to incremental pull from certain partitions instead of the full table. This option allows using glob pattern to directly filter on path.Config Param: INCR_PATH_GLOBSince Version: 0.9.0 | |
| hoodie.datasource.read.incr.skip_cluster | false | Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read to improve read performance.Config Param: INCREMENTAL_READ_SKIP_CLUSTERSince Version: 1.0.1 |
| hoodie.datasource.read.incr.skip_compact | false | Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.Config Param: INCREMENTAL_READ_SKIP_COMPACTSince Version: 1.0.1 |
| hoodie.datasource.read.partition.value.extractor.enabled | false | This config helps whether PartitionValueExtractor interface can be used for parsing partition values from partition path. When this config is enabled, it uses PartitionValueExtractor class value stored in the hoodie.properties file.Config Param: USE_PARTITION_VALUE_EXTRACTOR_ON_READSince Version: 1.2.0 |
| hoodie.datasource.read.schema.use.end.instanttime | false | Uses end instant schema when incrementally fetched data to. Default: users latest instant schema.Config Param: INCREMENTAL_READ_SCHEMA_USE_END_INSTANTTIMESince Version: 0.9.0 |
| hoodie.datasource.read.table.valued.function.filesystem.relation.subpath | A regex under the table's base path to get file system view informationConfig Param: FILESYSTEM_RELATION_ARG_SUBPATHSince Version: 1.0.0 | |
| hoodie.datasource.read.table.valued.function.timeline.relation | false | When this is set, the relation created by DefaultSource is for a view representing the result set of the table valued function hudi_query_timeline(...)Config Param: CREATE_TIMELINE_RELATIONSince Version: 1.0.0 |
| hoodie.datasource.read.table.valued.function.timeline.relation.archived | false | When this is set, the result set of the table valued function hudi_query_timeline(...) will include archived timelineConfig Param: TIMELINE_RELATION_ARG_ARCHIVED_TIMELINESince Version: 1.0.0 |
| hoodie.datasource.streaming.startOffset | earliest | Start offset to pull data from hoodie streaming source. allow earliest, latest, and specified start instant timeConfig Param: START_OFFSETSince Version: 0.13.0 |
| hoodie.enable.data.skipping | true | Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over filesConfig Param: ENABLE_DATA_SKIPPINGSince Version: 0.10.0 |
| hoodie.file.index.enable | true | Enables use of the spark file index implementation for Hudi, that speeds up listing of large tables.Config Param: ENABLE_HOODIE_FILE_INDEX |
| hoodie.read.timeline.holes.resolution.policy | FAIL | When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest) that are produced by concurrent writers and could lead to potential data loss. This config allows users to have different ways of handling this situation. The valid values are [FAIL, BLOCK, USE_TRANSITION_TIME]: Use FAIL to throw an exception when hollow commit is detected. This is helpful when hollow commits are not expected. Use BLOCK to block processing commits from going beyond the hollow ones. This fits the case where waiting for hollow commits to finish is acceptable. Use USE_TRANSITION_TIME (experimental) to query commits in range by state transition time (completion time), instead of commit time (start time). Using this mode will result in begin.instanttime and end.instanttime using stateTransitionTime instead of the instant's commit time.Config Param: INCREMENTAL_READ_HANDLE_HOLLOW_COMMITSince Version: 0.14.0 |
| hoodie.schema.on.read.enable | false | Enables support for Schema Evolution featureConfig Param: SCHEMA_EVOLUTION_ENABLED |
| hoodie.spark.polaris.catalog.class | org.apache.polaris.spark.SparkCatalog | Fully qualified class name of the catalog that is used by the Polaris spark client.Config Param: POLARIS_CATALOG_CLASS_NAMESince Version: 1.1.0 |
Write Options
You can pass down any of the WriteClient level configs directly using options() or option(k,v) methods.
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(HoodieTableConfig.ORDERING_FIELDS(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
Options useful for writing tables via write.format.option(...)
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.mode | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.Config Param: HIVE_SYNC_MODE |
| hoodie.datasource.write.partitionpath.field | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()Config Param: PARTITIONPATH_FIELD |
| hoodie.datasource.write.precombine.field | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge modeConfig Param: ORDERING_FIELDS |
| hoodie.datasource.write.precombine.field | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge modeConfig Param: PRECOMBINE_FIELD |
| hoodie.datasource.write.recordkey.field | (N/A) | Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: RECORDKEY_FIELD |
| hoodie.datasource.write.secondarykey.column | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: SECONDARYKEY_COLUMN_NAME |
| hoodie.write.record.merge.mode | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.Config Param: RECORD_MERGE_MODESince Version: 1.0.0 |
| hoodie.clustering.async.enabled | false | Enable running of clustering service, asynchronously as inserts happen on the table.Config Param: ASYNC_CLUSTERING_ENABLESince Version: 0.7.0 |
| hoodie.clustering.inline | false | Turn on inline clustering - clustering will be run after each write operation is completeConfig Param: INLINE_CLUSTERING_ENABLESince Version: 0.7.0 |
| hoodie.datasource.hive_sync.enable | false | When set to true, register/sync the table to Apache Hive metastore.Config Param: HIVE_SYNC_ENABLED |
| hoodie.datasource.hive_sync.jdbcurl | jdbc:hive2://localhost:10000 | Hive metastore urlConfig Param: HIVE_URL |
| hoodie.datasource.hive_sync.metastore.uris | thrift://localhost:9083 | Hive metastore urlConfig Param: METASTORE_URIS |
| hoodie.datasource.meta.sync.enable | false | Enable Syncing the Hudi Table with an external meta store or data catalog.Config Param: META_SYNC_ENABLED |
| hoodie.datasource.write.hive_style_partitioning | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)Config Param: HIVE_STYLE_PARTITIONING |
| hoodie.datasource.write.operation | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.Config Param: OPERATIONSince Version: 0.9.0 |
| hoodie.datasource.write.table.type | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.Config Param: TABLE_TYPESince Version: 0.9.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.serde_properties | (N/A) | Serde properties to hive table.Config Param: HIVE_TABLE_SERDE_PROPERTIES |
| hoodie.datasource.hive_sync.table_properties | (N/A) | Additional properties to store with table.Config Param: HIVE_TABLE_PROPERTIES |
| hoodie.datasource.overwrite.mode | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteModeConfig Param: OVERWRITE_MODESince Version: 0.14.0 |
| hoodie.datasource.partition_extractor_class | (N/A) | PartitionValueExtractor implementation used by Spark datasource write/read paths to parse partition values from partition paths.Config Param: PARTITION_EXTRACTOR_CLASSSince Version: 1.2.0 |
| hoodie.datasource.write.partitions.to.delete | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *Config Param: PARTITIONS_TO_DELETESince Version: 0.9.0 |
| hoodie.datasource.write.payload.class | (N/A) | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effectiveConfig Param: PAYLOAD_CLASS_NAME |
| hoodie.datasource.write.table.name | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.Config Param: TABLE_NAMESince Version: 0.9.0 |
| hoodie.write.record.merge.custom.implementation.classes | (N/A) | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These record merge impls will filter by hoodie.write.record.merge.strategy.idHudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)Config Param: RECORD_MERGE_IMPL_CLASSESSince Version: 0.13.0 |
| hoodie.write.record.merge.strategy.id | (N/A) | ID of record merge strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.write.record.merge.custom.implementation.classes which has the same merge strategy idConfig Param: RECORD_MERGE_STRATEGY_IDSince Version: 0.13.0 |
| hoodie.datasource.compaction.async.enable | true | Controls whether async compaction should be turned on for MOR table writing.Config Param: ASYNC_COMPACT_ENABLE |
| hoodie.datasource.hive_sync.auto_create_database | true | Auto create hive database if does not existsConfig Param: HIVE_AUTO_CREATE_DATABASE |
| hoodie.datasource.hive_sync.base_file_format | PARQUET | Base file format for the sync.Config Param: HIVE_BASE_FILE_FORMAT |
| hoodie.datasource.hive_sync.batch_num | 1000 | The number of partitions one batch when synchronous partitions to hive.Config Param: HIVE_BATCH_SYNC_PARTITION_NUM |
| hoodie.datasource.hive_sync.bucket_sync | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'Config Param: HIVE_SYNC_BUCKET_SYNC |
| hoodie.datasource.hive_sync.create_managed_table | false | Whether to sync the table as managed table.Config Param: HIVE_CREATE_MANAGED_TABLE |
| hoodie.datasource.hive_sync.database | default | The name of the destination database that we should sync the hudi table to.Config Param: HIVE_DATABASE |
| hoodie.datasource.hive_sync.ignore_exceptions | false | Ignore exceptions when syncing with Hive.Config Param: HIVE_IGNORE_EXCEPTIONS |
| hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: HIVE_PARTITION_EXTRACTOR_CLASS |
| hoodie.datasource.hive_sync.partition_fields | Field in the table to use for determining hive partition columns.Config Param: HIVE_PARTITION_FIELDS | |
| hoodie.datasource.hive_sync.password | hive | hive password to useConfig Param: HIVE_PASS |
| hoodie.datasource.hive_sync.skip_ro_suffix | false | Skip the _ro suffix for Read optimized table, when registeringConfig Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE |
| hoodie.datasource.hive_sync.support_timestamp | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUEConfig Param: HIVE_SUPPORT_TIMESTAMP_TYPE |
| hoodie.datasource.hive_sync.sync_as_datasource | true | Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE |
| hoodie.datasource.hive_sync.sync_comment | false | Whether to sync the table column comments while syncing the table.Config Param: HIVE_SYNC_COMMENT |
| hoodie.datasource.hive_sync.table | unknown | The name of the destination table that we should sync the hudi table to.Config Param: HIVE_TABLE |
| hoodie.datasource.hive_sync.use_jdbc | true | Use JDBC when hive synchronization is enabledConfig Param: HIVE_USE_JDBC |
| hoodie.datasource.hive_sync.username | hive | hive user name to useConfig Param: HIVE_USER |
| hoodie.datasource.insert.dup.policy | none | Note This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching records from incoming will be dropped and the rest will be ingested. Third option is "fail" which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest.Config Param: INSERT_DUP_POLICYSince Version: 0.14.0 |
| hoodie.datasource.meta_sync.condition.sync | false | If true, only sync on conditions like schema change or partition change.Config Param: HIVE_CONDITIONAL_SYNC |
| hoodie.datasource.write.commitmeta.key.prefix | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timelineConfig Param: COMMIT_METADATA_KEYPREFIXSince Version: 0.9.0 |
| hoodie.datasource.write.drop.partition.columns | false | When set to true, will not write the partition columns into hudi. By default, false.Config Param: DROP_PARTITION_COLUMNS |
| hoodie.datasource.write.insert.drop.duplicates | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> Note Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config hoodie.datasource.insert.dup.policy instead for a simplified duplicate handling experience. The new config will be incorporated into all other writing flows and this config will be fully deprecated in future releases.Config Param: INSERT_DROP_DUPSSince Version: 0.9.0 |
| hoodie.datasource.write.keygenerator.class | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements org.apache.hudi.keygen.KeyGeneratorConfig Param: KEYGENERATOR_CLASS_NAMESince Version: 0.9.0 |
| hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in row-writer path, while it will be written as long value 1483023240000000 in non row-writer path. If enabled, then the timestamp value will be written in both the cases.Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLEDSince Version: 0.10.1 |
| hoodie.datasource.write.partitionpath.urlencode | false | Should we url encode the partition path value, before creating the folder structure.Config Param: URL_ENCODE_PARTITIONING |
| hoodie.datasource.write.reconcile.schema | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)Config Param: RECONCILE_SCHEMA |
| hoodie.datasource.write.row.writer.enable | true | When set to true, will perform write operations directly using the spark native Row representation, avoiding any additional conversion costs.Config Param: ENABLE_ROW_WRITERSince Version: 0.9.0 |
| hoodie.datasource.write.streaming.checkpoint.identifier | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(batch id to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(batch id is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data.Config Param: STREAMING_CHECKPOINT_IDENTIFIERSince Version: 0.13.0 |
| hoodie.datasource.write.streaming.disable.compaction | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases.Config Param: STREAMING_DISABLE_COMPACTIONSince Version: 0.14.0 |
| hoodie.datasource.write.streaming.ignore.failed.batch | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.Config Param: STREAMING_IGNORE_FAILED_BATCHSince Version: 0.9.0 |
| hoodie.datasource.write.streaming.retry.count | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.Config Param: STREAMING_RETRY_CNTSince Version: 0.9.0 |
| hoodie.datasource.write.streaming.retry.interval.ms | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatchConfig Param: STREAMING_RETRY_INTERVAL_MSSince Version: 0.9.0 |
| hoodie.meta.sync.client.tool.class | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAMESince Version: 0.9.0 |
| hoodie.spark.sql.insert.into.operation | insert | Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. "bulk_insert" is generally meant for initial loads and is known to be performant compared to insert. "insert" is the default value for this config and does small file handling in addition to bulk_insert, but will ensure to retain duplicates if ingested. If you may use INSERT_INTO for mutable dataset, then you may have to set this config value to "upsert". With upsert, Hudi will merge multiple versions of the record identified by record key configuration into one final record.Config Param: SPARK_SQL_INSERT_INTO_OPERATIONSince Version: 0.14.0 |
| hoodie.spark.sql.merge.into.partial.updates | true | Whether to write partial updates to the data blocks containing updates in MOR tables with Spark SQL MERGE INTO statement. The data blocks containing partial updates have a schema with a subset of fields compared to the full schema of the table.Config Param: ENABLE_MERGE_INTO_PARTIAL_UPDATESSince Version: 1.0.0 |
| hoodie.spark.sql.optimized.writes.enable | true | Controls whether spark sql prepped update and delete are enabled.Config Param: SPARK_SQL_OPTIMIZED_WRITESSince Version: 0.14.0 |
| hoodie.sql.bulk.insert.enable | false | When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.Config Param: SQL_ENABLE_BULK_INSERT |
| hoodie.sql.insert.mode | upsert | Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness constraint which do not allow duplicate record.While for non-strict mode, hudi just do the insert operation for the pk-table. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation and hoodie.datasource.insert.dup.policy as you see fit.Config Param: SQL_INSERT_MODE |
| hoodie.streamer.source.kafka.value.deserializer.class | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the recordsConfig Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASSSince Version: 0.9.0 |
| hoodie.write.set.null.for.missing.columns | false | When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.Config Param: SET_NULL_FOR_MISSING_COLUMNSSince Version: 0.14.1 |
PreCommit Validator Configurations
The following set of configurations help validate new data before commits.
| Config Name | Default | Description |
|---|---|---|
| hoodie.precommit.validators | Comma separated list of class names that can be invoked to validate commitConfig Param: VALIDATOR_CLASS_NAMES | |
| hoodie.precommit.validators.equality.sql.queries | Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example: "select count(*) from <TABLE_NAME> Note <TABLE_NAME> is replaced by table state before and after commit.Config Param: EQUALITY_SQL_QUERIES | |
| hoodie.precommit.validators.failure.policy | FAIL | Policy for handling pre-commit validation failures. FAIL (default): validation failures block the commit with an exception. WARN_LOG: validation failures emit a warning log but allow the commit to proceed. Useful for monitoring data quality without impacting write availability.Config Param: VALIDATION_FAILURE_POLICYSince Version: 1.2.0 |
| hoodie.precommit.validators.inequality.sql.queries | Spark SQL queries to run on table before committing new data to validate state before and after commit.Multiple queries separated by ';' delimiter are supported.Example query: 'select count(*) from <TABLE_NAME> where col=null'Note <TABLE_NAME> variable is expected to be present in query.Config Param: INEQUALITY_SQL_QUERIES | |
| hoodie.precommit.validators.single.value.sql.queries | Spark SQL queries to run on table before committing new data to validate state after commit.Multiple queries separated by ';' delimiter are supported.Expected result is included as part of query separated by '#'. Example query: 'query1#result1:query2#result2'Note <TABLE_NAME> variable is expected to be present in query.Config Param: SINGLE_VALUE_SQL_QUERIES | |
| hoodie.precommit.validators.streaming.offset.tolerance.percentage | 0.0 | Tolerance percentage for streaming offset validation (used by org.apache.hudi.client.validator.StreamingOffsetValidator and org.apache.hudi.sink.validator.FlinkKafkaOffsetValidator). The validator compares the offset difference (expected records from source) with actual records written. If the deviation exceeds this percentage, the commit is rejected or warned depending on the validation failure policy. For upsert workloads with deduplication, set a higher tolerance. Default is 0.0 (strict mode, exact match required).Config Param: STREAMING_OFFSET_TOLERANCE_PERCENTAGESince Version: 1.2.0 |
Flink Sql Configs
These configs control the Hudi Flink SQL source/sink connectors, providing ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction or choosing query type to read.
Flink Options
Flink jobs using the SQL can be configured through the options in WITH clause. The actual datasource level configs are listed below.
| Config Name | Default | Description |
|---|---|---|
| hoodie.database.name | (N/A) | Database name to register to Hive metastoreConfig Param: DATABASE_NAME |
| hoodie.datasource.write.recordkey.field | (N/A) | Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: RECORD_KEY_FIELD |
| hoodie.table.name | (N/A) | Table name to register to Hive metastoreConfig Param: TABLE_NAME |
| ordering.fields | (N/A) | Comma separated list of fields used in records merging. When two records have the same key value, we will pick the one with the largest value for the ordering field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. Config precombine.field is now deprecated, please use ordering.fields instead.Config Param: ORDERING_FIELDS |
| path | (N/A) | Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfullyConfig Param: PATH |
| read.commits.limit | (N/A) | The maximum number of commits allowed to read in each instant check, if it is streaming read, the avg read instants number per-second would be 'read.commits.limit'/'read.streaming.check-interval', by default no limitConfig Param: READ_COMMITS_LIMIT |
| read.end-commit | (N/A) | End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'Config Param: READ_END_COMMIT |
| read.start-commit | (N/A) | Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming readConfig Param: READ_START_COMMIT |
| archive.max_commits | 50 | Max number of commits to keep before archiving older commits into a sequential log, default 50Config Param: ARCHIVE_MAX_COMMITS |
| archive.min_commits | 40 | Min number of commits to keep before archiving older commits into a sequential log, default 40Config Param: ARCHIVE_MIN_COMMITS |
| cdc.enabled | false | When enable, persist the change data if necessary, and can be queried as a CDC query modeConfig Param: CDC_ENABLED |
| cdc.supplemental.logging.mode | DATA_BEFORE_AFTER | Setting 'op_key_only' persists the 'op' and the record key only, setting 'data_before' persists the additional 'before' image, and setting 'data_before_after' persists the additional 'before' and 'after' images.Config Param: SUPPLEMENTAL_LOGGING_MODE |
| changelog.enabled | false | Whether to keep all the intermediate changes, we try to keep all the changes of a record when enabled: 1). The sink accept the UPDATE_BEFORE message; 2). The source try to emit every changes of a record. The semantics is best effort because the compaction job would finally merge all changes of a record into one. default false to have UPSERT semanticsConfig Param: CHANGELOG_ENABLED |
| clean.async.enabled | true | Whether to cleanup the old commits immediately on new commits, enabled by defaultConfig Param: CLEAN_ASYNC_ENABLED |
| clean.retain_commits | 30 | Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30Config Param: CLEAN_RETAIN_COMMITS |
| clustering.async.enabled | false | Async Clustering, default falseConfig Param: CLUSTERING_ASYNC_ENABLED |
| clustering.plan.strategy.small.file.limit | 600 | Files smaller than the size specified here are candidates for clustering, default 600 MBConfig Param: CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT |
| clustering.plan.strategy.target.file.max.bytes | 1073741824 | Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GBConfig Param: CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES |
| compaction.async.enabled | true | Async Compaction, enabled by default for MORConfig Param: COMPACTION_ASYNC_ENABLED |
| compaction.delta_commits | 5 | Max delta commits needed to trigger compaction, default 5 commitsConfig Param: COMPACTION_DELTA_COMMITS |
| hive_sync.enabled | false | Asynchronously sync Hive meta to HMS, default falseConfig Param: HIVE_SYNC_ENABLED |
| hive_sync.jdbc_url | jdbc:hive2://localhost:10000 | Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000'Config Param: HIVE_SYNC_JDBC_URL |
| hive_sync.metastore.uris | Metastore uris for hive sync, default ''Config Param: HIVE_SYNC_METASTORE_URIS | |
| hive_sync.mode | HMS | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms'Config Param: HIVE_SYNC_MODE |
| hoodie.datasource.query.type | snapshot | Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data) .Default: snapshotConfig Param: QUERY_TYPE |
| hoodie.datasource.write.hive_style_partitioning | false | Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)Config Param: HIVE_STYLE_PARTITIONING |
| hoodie.datasource.write.partitionpath.field | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString(), default ''Config Param: PARTITION_PATH_FIELD | |
| index.type | FLINK_STATE | Index type of Flink write job, default is using state backed index.Config Param: INDEX_TYPE |
| lookup.async | false | Whether to enable async lookup join.Config Param: LOOKUP_ASYNC |
| lookup.async-thread-number | 16 | The thread number for lookup async.Config Param: LOOKUP_ASYNC_THREAD_NUMBER |
| lookup.join.cache.ttl | PT1H | The cache TTL (e.g. 10min) for the build table in lookup join.Config Param: LOOKUP_JOIN_CACHE_TTL |
| lookup.join.cache.type | heap | The storage backend for the lookup join cache. Possible values: 'heap' (default) stores all dimension-table rows in JVM heap memory (may cause OutOfMemoryError for large tables); 'rocksdb' stores rows off-heap in an embedded RocksDB instance on local disk, which avoids OOM at the cost of additional serialization overhead.Config Param: LOOKUP_JOIN_CACHE_TYPE |
| lookup.join.rocksdb.path | /var/folders/60/wk8qzx310fd32b2dp7mhzvdc0000gn/T//hudi-lookup-rocksdb | Local directory path for storing RocksDB data when 'lookup.join.cache.type' is set to 'rocksdb'. Each task manager will create a unique subdirectory under this path. The directory is cleaned up when the lookup function is closed.Config Param: LOOKUP_JOIN_ROCKSDB_PATH |
| metadata.compaction.async.enabled | true | Whether to enable async compaction for metadata table,if true, the compaction for metadata table will be performed in the compaction pipeline, default enabled.Config Param: METADATA_COMPACTION_ASYNC_ENABLED |
| metadata.compaction.delta_commits | 10 | Max delta commits for metadata table to trigger compaction, default 10Config Param: METADATA_COMPACTION_DELTA_COMMITS |
| metadata.enabled | true | Enable the internal metadata table which serves table metadata like level file listings, default enabledConfig Param: METADATA_ENABLED |
| read.source-v2.enabled | false | Whether to use Flink FLIP27 new source to consume data files.Config Param: READ_SOURCE_V2_ENABLED |
| read.splits.limit | 2147483647 | The maximum number of splits allowed to read in each instant check, if it is streaming read, the avg read splits number per-second would be 'read.splits.limit'/'read.streaming.check-interval', by default no limitConfig Param: READ_SPLITS_LIMIT |
| read.streaming.enabled | false | Whether to read as streaming source, default falseConfig Param: READ_AS_STREAMING |
| read.streaming.skip_insertoverwrite | false | Whether to skip insert overwrite instants to avoid reading base files of insert overwrite operations for streaming read. In streaming scenarios, insert overwrite is usually used to repair data, here you can control the visibility of downstream streaming read.Config Param: READ_STREAMING_SKIP_INSERT_OVERWRITE |
| table.type | COPY_ON_WRITE | Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READConfig Param: TABLE_TYPE |
| write.insert.partitioner.class.name | Insert partitioner to use aiming to re-balance records and reducing small file number in the scenario of multi-level partitioning. For example dt/hour/eventIDCurrently support org.apache.hudi.sink.partitioner.GroupedInsertPartitionerConfig Param: INSERT_PARTITIONER_CLASS_NAME | |
| write.insert.partitioner.default_parallelism_per_partition | 30 | The parallelism to use in each partition when using GroupedInsertPartitioner.Config Param: DEFAULT_PARALLELISM_PER_PARTITION |
| write.operation | upsert | The write operation, that this write should doConfig Param: OPERATION |
| write.parquet.max.file.size | 120 | Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.Config Param: WRITE_PARQUET_MAX_FILE_SIZE |
| Config Name | Default | Description |
|---|---|---|
| clustering.tasks | (N/A) | Parallelism of tasks that do actual clustering, default same as the write task parallelismConfig Param: CLUSTERING_TASKS |
| compaction.tasks | (N/A) | Parallelism of tasks that do actual compaction, default same as the write task parallelismConfig Param: COMPACTION_TASKS |
| hive_sync.conf.dir | (N/A) | The hive configuration directory, where the hive-site.xml lies in, the file should be put on the client machineConfig Param: HIVE_SYNC_CONF_DIR |
| hive_sync.serde_properties | (N/A) | Serde properties to hive table, the data format is k1=v1 k2=v2Config Param: HIVE_SYNC_TABLE_SERDE_PROPERTIES |
| hive_sync.table_properties | (N/A) | Additional properties to store with table, the data format is k1=v1 k2=v2Config Param: HIVE_SYNC_TABLE_PROPERTIES |
| hoodie.bucket.index.partition.expressions | (N/A) | Users can use this parameter to specify expression and the corresponding bucket numbers (separated by commas).Multiple rules are separated by semicolons like hoodie.bucket.index.partition.expressions=expression1,bucket-number1;expression2,bucket-number2Config Param: BUCKET_INDEX_PARTITION_EXPRESSIONS |
| hoodie.datasource.write.keygenerator.class | (N/A) | Key generator class, that implements will extract the key out of incoming recordConfig Param: KEYGEN_CLASS_NAME |
| hoodie.table.partition_extractor_class | (N/A) | Partition value extractor class helps extract the partition value from partition pathsConfig Param: PARTITION_VALUE_EXTRACTOR |
| hoodie.write.record.merge.mode | (N/A) | Using EVENT_TIME_ORDERING to merge records by ordering value,using COMMIT_TIME_ORDERING to merge records by commit time,and using CUSTOM to merge records by the user specified logic.Config Param: RECORD_MERGE_MODE |
| index.write.tasks | (N/A) | Parallelism of tasks that do the index writing, default is the parallelism of the execution environmentConfig Param: INDEX_WRITE_TASKS |
| kafka.offset.trace.dc | (N/A) | Data center for checkpoint offset lookupConfig Param: DC |
| kafka.offset.trace.env | (N/A) | Environment for checkpoint offset lookupConfig Param: ENV |
| kafka.offset.trace.hadoop.user | (N/A) | Hadoop user for checkpoint offset lookupConfig Param: HADOOP_USER |
| kafka.offset.trace.job.name | (N/A) | Flink job name for checkpoint offset lookupConfig Param: JOB_NAME |
| kafka.offset.trace.source.cluster | (N/A) | Source Kafka cluster nameConfig Param: SOURCE_KAFKA_CLUSTER |
| kafka.offset.trace.target.cluster | (N/A) | Target Kafka cluster nameConfig Param: TARGET_KAFKA_CLUSTER |
| kafka.offset.trace.topic.id | (N/A) | Topic ID for checkpoint offset requestsConfig Param: TOPIC_ID |
| kafka.offset.trace.topic.name | (N/A) | Kafka topic name for storing topic metadata along with offsetsConfig Param: KAFKA_TOPIC_NAME |
| read.tasks | (N/A) | Parallelism of tasks that do actual read, default is the parallelism of the execution environmentConfig Param: READ_TASKS |
| source.avro-schema | (N/A) | Source avro schema string, the parsed schema is used for deserializationConfig Param: SOURCE_AVRO_SCHEMA |
| source.avro-schema.path | (N/A) | Source avro schema file path, the parsed schema is used for deserializationConfig Param: SOURCE_AVRO_SCHEMA_PATH |
| write.bucket_assign.tasks | (N/A) | Parallelism of tasks that do bucket assign, default same as the write task parallelismConfig Param: BUCKET_ASSIGN_TASKS |
| write.buffer.sort.keys | (N/A) | Sort keys concatenated by comma for buffer sort in append write function. Data is sorted within the buffer configured by number of records or buffer size. The order of entire written file is not guaranteed.Config Param: WRITE_BUFFER_SORT_KEYS |
| write.index_bootstrap.tasks | (N/A) | Parallelism of tasks that do index bootstrap, default same as the write task parallelismConfig Param: INDEX_BOOTSTRAP_TASKS |
| write.partition.format | (N/A) | Partition path format, only valid when 'write.datetime.partitioning' is true, default is: 1) 'yyyyMMddHH' for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL; 2) 'yyyyMMdd' for DATE and INT.Config Param: PARTITION_FORMAT |
| write.tasks | (N/A) | Parallelism of tasks that do actual write, default is the parallelism of the execution environmentConfig Param: WRITE_TASKS |
| clean.policy | KEEP_LATEST_COMMITS | Clean policy to manage the Hudi table. Available option: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS.Default is KEEP_LATEST_COMMITS.Config Param: CLEAN_POLICY |
| clean.retain_file_versions | 5 | Number of file versions to retain. default 5Config Param: CLEAN_RETAIN_FILE_VERSIONS |
| clean.retain_hours | 24 | Number of hours for which commits need to be retained. This config provides a more flexible option ascompared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.Config Param: CLEAN_RETAIN_HOURS |
| clustering.delta_commits | 4 | Max delta commits needed to trigger clustering, default 4 commitsConfig Param: CLUSTERING_DELTA_COMMITS |
| clustering.plan.partition.filter.mode | NONE | Partition filter mode used in the creation of clustering plan. Available values are - NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate.RECENT_DAYS: keep a continuous range of partitions, worked together with configs 'clustering.plan.strategy.daybased.lookback.partitions' and 'clustering.plan.strategy.daybased.skipfromlatest.partitions.SELECTED_PARTITIONS: keep partitions that are in the specified range ['clustering.plan.strategy.cluster.begin.partition', 'clustering.plan.strategy.cluster.end.partition'].DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.Config Param: CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME |
| clustering.plan.strategy.class | org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringPlanStrategy | Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the last N (determined by clustering.plan.strategy.daybased.lookback.partitions) day based partitions picks the small file slices within those partitions.Config Param: CLUSTERING_PLAN_STRATEGY_CLASS |
| clustering.plan.strategy.cluster.begin.partition | Begin partition used to filter partition (inclusive)Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_BEGIN_PARTITION | |
| clustering.plan.strategy.cluster.end.partition | End partition used to filter partition (inclusive)Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_END_PARTITION | |
| clustering.plan.strategy.daybased.lookback.partitions | 2 | Number of partitions to list to create ClusteringPlan, default is 2Config Param: CLUSTERING_TARGET_PARTITIONS |
| clustering.plan.strategy.daybased.skipfromlatest.partitions | 0 | Number of partitions to skip from latest when choosing partitions to create ClusteringPlanConfig Param: CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST |
| clustering.plan.strategy.max.num.groups | 30 | Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism, default is 30Config Param: CLUSTERING_MAX_NUM_GROUPS |
| clustering.plan.strategy.partition.regex.pattern | Filter clustering partitions that matched regex patternConfig Param: CLUSTERING_PLAN_STRATEGY_PARTITION_REGEX_PATTERN | |
| clustering.plan.strategy.partition.selected | Partitions to run clusteringConfig Param: CLUSTERING_PLAN_STRATEGY_PARTITION_SELECTED | |
| clustering.plan.strategy.sort.columns | Columns to sort the data by when clusteringConfig Param: CLUSTERING_SORT_COLUMNS | |
| clustering.schedule.enabled | false | Schedule the cluster plan, default falseConfig Param: CLUSTERING_SCHEDULE_ENABLED |
| compaction.delta_seconds | 3600 | Max delta seconds time needed to trigger compaction, default 1 hourConfig Param: COMPACTION_DELTA_SECONDS |
| compaction.max_memory | 100 | Max memory in MB for compaction spillable map, default 100MBConfig Param: COMPACTION_MAX_MEMORY |
| compaction.operation.execute.async.enabled | true | Whether the compaction operation should be executed asynchronously on compact operator, default enabled.Config Param: COMPACTION_OPERATION_EXECUTE_ASYNC_ENABLED |
| compaction.schedule.enabled | true | Schedule the compaction plan, enabled by default for MORConfig Param: COMPACTION_SCHEDULE_ENABLED |
| compaction.target_io | 512000 | Target IO in MB for per compaction (both read and write), default 500 GBConfig Param: COMPACTION_TARGET_IO |
| compaction.timeout.seconds | 1200 | Max timeout time in seconds for online compaction to rollback, default 20 minutesConfig Param: COMPACTION_TIMEOUT_SECONDS |
| compaction.trigger.strategy | num_commits | Strategy to trigger compaction, options are 'num_commits': trigger compaction when there are at least N delta commits after last completed compaction; 'num_commits_after_last_request': trigger compaction when there are at least N delta commits after last completed/requested compaction; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits'Config Param: COMPACTION_TRIGGER_STRATEGY |
| hive_sync.assume_date_partitioning | false | Assume partitioning is yyyy/mm/dd, default falseConfig Param: HIVE_SYNC_ASSUME_DATE_PARTITION |
| hive_sync.auto_create_db | true | Auto create hive database if it does not exists, default trueConfig Param: HIVE_SYNC_AUTO_CREATE_DB |
| hive_sync.db | default | Database name for hive sync, default 'default'Config Param: HIVE_SYNC_DB |
| hive_sync.file_format | PARQUET | File format for hive sync, default 'PARQUET'Config Param: HIVE_SYNC_FILE_FORMAT |
| hive_sync.ignore_exceptions | false | Ignore exceptions during hive synchronization, default falseConfig Param: HIVE_SYNC_IGNORE_EXCEPTIONS |
| hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Tool to extract the partition value from HDFS path, default 'MultiPartKeysValueExtractor'Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME |
| hive_sync.partition_fields | Partition fields for hive sync, default ''Config Param: HIVE_SYNC_PARTITION_FIELDS | |
| hive_sync.password | hive | Password for hive sync, default 'hive'Config Param: HIVE_SYNC_PASSWORD |
| hive_sync.skip_ro_suffix | false | Skip the _ro suffix for Read optimized table when registering, default falseConfig Param: HIVE_SYNC_SKIP_RO_SUFFIX |
| hive_sync.support_timestamp | true | INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility.Config Param: HIVE_SYNC_SUPPORT_TIMESTAMP |
| hive_sync.table | unknown | Table name for hive sync, default 'unknown'Config Param: HIVE_SYNC_TABLE |
| hive_sync.table.strategy | ALL | Hive table synchronization strategy. Available option: RO, RT, ALL.Config Param: HIVE_SYNC_TABLE_STRATEGY |
| hive_sync.use_jdbc | true | Use JDBC when hive synchronization is enabled, default trueConfig Param: HIVE_SYNC_USE_JDBC |
| hive_sync.username | hive | Username for hive sync, default 'hive'Config Param: HIVE_SYNC_USERNAME |
| hoodie.bucket.index.hash.field | Index key field. Value to be used as hashing to find the bucket ID. Should be a subset of or equal to the recordKey fields. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: INDEX_KEY_FIELD | |
| hoodie.bucket.index.num.buckets | 4 | Hudi bucket number per partition. Only affected if using Hudi bucket index.Config Param: BUCKET_INDEX_NUM_BUCKETS |
| hoodie.bucket.index.partition.rule.type | regex | Rule parser for expressions when using partition level bucket index, default regex.Config Param: BUCKET_INDEX_PARTITION_RULE |
| hoodie.datasource.merge.type | payload_combine | For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records without merging; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records (combines the two records with same key for base and log file records), then read the left log file recordsConfig Param: MERGE_TYPE |
| hoodie.datasource.write.keygenerator.type | SIMPLE | Key generator type, that implements will extract the key out of incoming record. Note This is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead.Config Param: KEYGEN_TYPE |
| hoodie.datasource.write.partitionpath.urlencode | false | Whether to encode the partition path url, default falseConfig Param: URL_ENCODE_PARTITIONING |
| hoodie.index.bucket.engine | SIMPLE | Type of bucket index engine. Available options: [SIMPLE |
| hoodie.table.format | native | Table format produced by this writer.Config Param: WRITE_TABLE_FORMAT |
| hoodie.write.table.version | 9 | Table version produced by this writer.Config Param: WRITE_TABLE_VERSION |
| index.bootstrap.enabled | false | Whether to bootstrap the index state from existing hoodie table, default falseConfig Param: INDEX_BOOTSTRAP_ENABLED |
| index.bootstrap.rocksdb.path | /var/folders/60/wk8qzx310fd32b2dp7mhzvdc0000gn/T/ | Local directory path for RocksDB when bootstrap is enabled for record level index type.Each task manager creates a unique subdirectory under this path.Config Param: INDEX_BOOTSTRAP_ROCKSDB_PATH |
| index.global.enabled | true | Whether to update index for the old partition path if same key record with different partition path came in, default trueConfig Param: INDEX_GLOBAL_ENABLED |
| index.partition.regex | .* | Whether to load partitions in state if partition path matching, default *Config Param: INDEX_PARTITION_REGEX |
| index.rli.cache.concurrent.partitions.num | 2 | Expected number of partitions whose partitioned RLI caches are updated concurrently. Used to infer the initial memory size for each partition cache as INDEX_RLI_CACHE_SIZE / concurrency when historical cache usage is unavailable.Config Param: INDEX_RLI_CACHE_CONCURRENT_PARTITIONS_NUM |
| index.rli.cache.size | 256 | Maximum memory allocated for the record level index cache per bucket-assign task. The memory size of each individual cache within a checkpoint interval is dynamically calculated based on the average memory size of caches for historical checkpoints.Config Param: INDEX_RLI_CACHE_SIZE |
| index.rli.lookup.minibatch.size | 1000 | The maximum number of input records can be buffered for miniBatch during record index lookup. MiniBatch is an optimization to buffer input records to reduce the number of individual index lookups, which can significantly improve performance compared to processing each record individually. Default value is 1000, which is also the minimum value for the minibatch size, when the configured size is less than 1000, the default value will be used.Config Param: INDEX_RLI_LOOKUP_MINIBATCH_SIZE |
| index.rli.write.buffer.size | 100 | Maximum memory in MB for the buffer of index record writing operator, when the threshold hits, it flushes the index data to avoid OOM, default 100MB.Config Param: INDEX_RLI_WRITE_BUFFER_SIZE |
| index.state.ttl | 0.0 | Index state ttl in days, default stores the index permanentlyConfig Param: INDEX_STATE_TTL |
| kafka.offset.trace.caller.service.name | ingestion-rt | Caller service name for checkpoint service RPC headersConfig Param: CALLER_SERVICE_NAME |
| kafka.offset.trace.checkpoint.service | athena-job-manager | Checkpoint service name for offset lookupConfig Param: ATHENA_SERVICE |
| kafka.offset.trace.service.name | ingestion-rt | Service name for checkpoint offset lookupConfig Param: SERVICE_NAME |
| kafka.offset.trace.service.tier | DEFAULT | Service tier for checkpoint offset lookup (e.g., DEFAULT, CRITICAL)Config Param: SERVICE_TIER |
| metadata.compaction.schedule.enabled | true | Schedule the compaction plan for metadata table, enabled by default.Config Param: METADATA_COMPACTION_SCHEDULE_ENABLED |
| partition.default_name | HIVE_DEFAULT_PARTITION | The default partition name in case the dynamic partition column value is null/empty stringConfig Param: PARTITION_DEFAULT_NAME |
| payload.class | org.apache.hudi.common.model.EventTimeAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for the option in-effectiveConfig Param: PAYLOAD_CLASS_NAME |
| read.cdc.from.changelog | true | Whether to consume the delta changes only from the cdc changelog files. When CDC is enabled, i). for COW table, the changelog is generated on each file update; ii). for MOR table, the changelog is generated on compaction. By default, always read from the changelog file, once it is disabled, the reader would infer the changes based on the file slice dependencies.Config Param: READ_CDC_FROM_CHANGELOG |
| read.data.skipping.enabled | false | Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over filesConfig Param: READ_DATA_SKIPPING_ENABLED |
| read.data.skipping.rli.keys.max.num | 8 | Record Level index statistics will be read from metadata table (MDT) for data skipping optimization, and currently the index statistics are collected by a single process. This config is used to constrain the maximum number of hoodie keys that can be read from MDT without sacrificing any performance. If the number of hoodie keys from query predicate is greater than the maximum value, the query will fallback to skip the record level index filtering. E.g., given query: SELECT * FROM T WHERE uuid IN (1,2,3,4,5,6,7,8,9), the number of hoodie keys is 9, and the maximum value is 8, so the source will not perform record level index filtering.Config Param: READ_DATA_SKIPPING_RLI_KEYS_MAX_NUM |
| read.streaming.check-interval | 60 | Check interval for streaming read of SECOND, default 1 minuteConfig Param: READ_STREAMING_CHECK_INTERVAL |
| read.streaming.skip_clustering | true | Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read to improve read performance.Config Param: READ_STREAMING_SKIP_CLUSTERING |
| read.streaming.skip_compaction | true | Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance. This option can be used to avoid reading duplicates when changelog mode is enabled, it is a solution to keep data integrity Config Param: READ_STREAMING_SKIP_COMPACT |
| read.utc-timezone | true | Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone, by default trueConfig Param: READ_UTC_TIMEZONE |
| record.merger.impls | org.apache.hudi.client.model.EventTimeFlinkRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by record.merger.strategy. Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)Config Param: RECORD_MERGER_IMPLS |
| record.merger.strategy | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in record.merger.impls which has the same merger strategy idConfig Param: RECORD_MERGER_STRATEGY_ID |
| write.batch.size | 256.0 | Batch buffer size in MB to flush data into the underneath filesystem, default 256MBConfig Param: WRITE_BATCH_SIZE |
| write.buffer.disruptor.ring.size | 16384 | Ring buffer size for Disruptor (must be power of 2), default 16384Config Param: WRITE_BUFFER_DISRUPTOR_RING_SIZE |
| write.buffer.disruptor.wait.strategy | BLOCKING_WAIT | Wait strategy for Disruptor: BLOCKING_WAIT (default), SLEEPING_WAIT, YIELDING_WAIT, BUSY_SPIN_WAITConfig Param: WRITE_BUFFER_DISRUPTOR_WAIT_STRATEGY |
| write.buffer.memory.type | ON_HEAP | The memory type used for the write buffer. Supported values are ON_HEAP (default) and MANAGED. ON_HEAP uses JVM heap memory, while MANAGED uses Flink managed memory which is accounted for in the task manager's memory budget and helps avoid OOM errors in containerized environments.Config Param: WRITE_BUFFER_MEMORY_TYPE |
| write.buffer.size | 1000 | Record count threshold for flushing the sort buffer in append write function. When buffer reaches this count, it is sorted and written. The order of entire written file is not guaranteed.Config Param: WRITE_BUFFER_SIZE |
| write.buffer.sort.continuous.drain.size | 1 | Number of records to drain each time the max capacity is reached when using continuous sorting. Default value of 1 provides smooth, incremental draining. Can be increased for batching if needed (e.g., 10, 100). Larger values reduce drain frequency but may cause latency spikes.Config Param: WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_SIZE |
| write.buffer.sort.enabled | false | Deprecated: use write.buffer.type=DISRUPTOR instead. Whether to enable buffer sort within append write function. The order of entire written file is not guaranteed.Config Param: WRITE_BUFFER_SORT_ENABLED |
| write.buffer.type | NONE | Buffer type for append write function: NONE (no buffer sort, default), BOUNDED_IN_MEMORY (double buffer with async write), DISRUPTOR (ring buffer with async write, recommended for better throughput), CONTINUOUS_SORT (TreeMap-based continuous sorting with incremental draining)Config Param: WRITE_BUFFER_TYPE |
| write.bulk_insert.shuffle_input | true | Whether to shuffle the inputs by specific fields for bulk insert tasks, default trueConfig Param: WRITE_BULK_INSERT_SHUFFLE_INPUT |
| write.bulk_insert.sort_input | true | Whether to sort the inputs by specific fields for bulk insert tasks, default trueConfig Param: WRITE_BULK_INSERT_SORT_INPUT |
| write.bulk_insert.sort_input.by_record_key | false | Whether to sort the inputs by record keys for bulk insert tasks, default falseConfig Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY |
| write.client.id | Unique identifier used to distinguish different writer pipelines for concurrent modeConfig Param: WRITE_CLIENT_ID | |
| write.commit.ack.timeout | 300000 | Timeout limit for a writer task after it finishes a checkpoint and waits for the instant commit success, only for internal use.Config Param: WRITE_COMMIT_ACK_TIMEOUT |
| write.extra.metadata.enabled | false | If enabled, the checkpoint Id will also be written to hudi metadata.Config Param: WRITE_EXTRA_METADATA_ENABLED |
| write.fail.fast | false | Flag to indicate whether to fail job immediately when an error record is detected. Currently, this option is only applied to Flink append write functions.Config Param: WRITE_FAIL_FAST |
| write.ignore.failed | false | Flag to indicate whether to ignore any non exception error (e.g. writestatus error). within a checkpoint batch. By default false. Turning this on, could hide the write status errors while the flink checkpoint moves ahead. So, would recommend users to use this with caution.Config Param: IGNORE_FAILED |
| write.incremental.job.graph.generation | false | Flag saying whether the incremental job graph generation is enabled.Config Param: WRITE_INCREMENTAL_JOB_GRAPH_GENERATION |
| write.insert.cluster | false | Whether to merge small files for insert mode, if true, the write throughput will decrease because the read/write of existing small file, only valid for COW table, default falseConfig Param: INSERT_CLUSTER |
| write.log.max.size | 1024 | Maximum size allowed in MB for a log file before it is rolled over to the next version, default 1GBConfig Param: WRITE_LOG_MAX_SIZE |
| write.log_block.size | 128 | Max log block size in MB for log file, default 128MBConfig Param: WRITE_LOG_BLOCK_SIZE |
| write.memory.segment.page.size | 32768 | Page size for memory segment used for write buffer.Config Param: WRITE_MEMORY_SEGMENT_PAGE_SIZE |
| write.merge.max_memory | 100 | Max memory in MB for merge, default 100MBConfig Param: WRITE_MERGE_MAX_MEMORY |
| write.parquet.block.size | 120 | Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.Config Param: WRITE_PARQUET_BLOCK_SIZE |
| write.parquet.page.size | 1 | Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.Config Param: WRITE_PARQUET_PAGE_SIZE |
| write.partition.overwrite.mode | STATIC | When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. Static mode deletes all the partitions that match the partition specification(e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. Dynamic mode doesn't delete partitions ahead, and only overwrite those partitions that have data written into it at runtime. By default we use static mode to keep the same behavior of previous version.Config Param: WRITE_PARTITION_OVERWRITE_MODE |
| write.precombine | false | Flag to indicate whether to drop duplicates before insert/upsert. By default these cases will accept duplicates, to gain extra performance: 1) insert operation; 2) upsert for MOR table, the MOR table deduplicate on readingConfig Param: PRE_COMBINE |
| write.rate.limit | 0 | Write record rate limit per second to prevent traffic jitter and improve stability, default 0 (no limit)Config Param: WRITE_RATE_LIMIT |
| write.retry.interval.ms | 2000 | Flag to indicate how long (by millisecond) before a retry should issued for failed checkpoint batch. By default 2000 and it will be doubled by every retryConfig Param: RETRY_INTERVAL_MS |
| write.retry.times | 3 | Flag to indicate how many times streaming job should retry for a failed checkpoint batch. By default 3Config Param: RETRY_TIMES |
| write.sort.memory | 128 | Sort memory in MB, default 128MBConfig Param: WRITE_SORT_MEMORY |
| write.task.max.size | 1024.0 | Maximum memory in MB for a write task, when the threshold hits, it flushes the max size data bucket to avoid OOM, default 1GBConfig Param: WRITE_TASK_MAX_SIZE |
| write.utc-timezone | true | Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Default value is utc timezone for forward compatibility.Config Param: WRITE_UTC_TIMEZONE |
Write Client Configs
Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.
Common Configurations
The following set of configurations are common across Hudi.
| Config Name | Default | Description |
|---|---|---|
| hoodie.base.path | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.Config Param: BASE_PATH |
| Config Name | Default | Description |
|---|---|---|
| as.of.instant | (N/A) | The query instant for time travel. Without specified this option, we query the latest snapshot.Config Param: TIMESTAMP_AS_OF |
| hoodie.memory.compaction.max.size | (N/A) | Maximum amount of memory used in bytes for compaction operations in bytes , before spilling to local storage.Config Param: MAX_MEMORY_FOR_COMPACTION |
| hoodie.common.diskmap.compression.enabled | true | Turn on compression for BITCASK disk map used by the External Spillable MapConfig Param: DISK_MAP_BITCASK_COMPRESSION_ENABLED |
| hoodie.common.spillable.diskmap.type | BITCASK | When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. By default, we use a persistent hashmap based loosely on bitcask, that offers O(1) inserts, lookups. Change this to ROCKS_DB to prefer using rocksDB, for handling the spill.Config Param: SPILLABLE_DISK_MAP_TYPE |
| hoodie.datasource.write.reconcile.schema | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch)Config Param: RECONCILE_SCHEMA |
| hoodie.file.index.cache.spillable.mem | 524288000 | Amount of memory to be used in bytes for holding cachedAllInputFileSlices in org.apache.hudi.BaseHoodieTableFileIndex.Config Param: HOODIE_FILE_INDEX_SPILLABLE_MEMORYSince Version: 1.1.0 |
| hoodie.file.index.cache.use.spillable.map | false | Property to enable spillable map for caching input file slices in org.apache.hudi.BaseHoodieTableFileIndexConfig Param: HOODIE_FILE_INDEX_USE_SPILLABLE_MAPSince Version: 1.1.0 |
| hoodie.fs.atomic_creation.support | This config is used to specify the file system which supports atomic file creation . atomic means that an operation either succeeds and has an effect or has fails and has no effect; now this feature is used by FileSystemLockProvider to guaranteeing that only one writer can create the lock file at a time. since some FS does not support atomic file creation (eg: S3), we decide the FileSystemLockProvider only support HDFS,local FS and View FS as default. if you want to use FileSystemLockProvider with other FS, you can set this config with the FS scheme, eg: fs1,fs2Config Param: HOODIE_FS_ATOMIC_CREATION_SUPPORTSince Version: 0.14.0 | |
| hoodie.memory.dfs.buffer.max.size | 16777216 | Property to control the max memory in bytes for dfs input stream buffer sizeConfig Param: MAX_DFS_STREAM_BUFFER_SIZE |
| hoodie.read.timeline.holes.resolution.policy | FAIL | When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest) that are produced by concurrent writers and could lead to potential data loss. This config allows users to have different ways of handling this situation. The valid values are [FAIL, BLOCK, USE_TRANSITION_TIME]: Use FAIL to throw an exception when hollow commit is detected. This is helpful when hollow commits are not expected. Use BLOCK to block processing commits from going beyond the hollow ones. This fits the case where waiting for hollow commits to finish is acceptable. Use USE_TRANSITION_TIME (experimental) to query commits in range by state transition time (completion time), instead of commit time (start time). Using this mode will result in begin.instanttime and end.instanttime using stateTransitionTime instead of the instant's commit time.Config Param: INCREMENTAL_READ_HANDLE_HOLLOW_COMMITSince Version: 0.14.0 |
| hoodie.schema.on.read.enable | false | Enables support for Schema Evolution featureConfig Param: SCHEMA_EVOLUTION_ENABLE |
| hoodie.write.set.null.for.missing.columns | false | When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.Config Param: SET_NULL_FOR_MISSING_COLUMNSSince Version: 0.14.1 |
Memory Configurations
Controls memory usage for compaction and merges, performed internally by Hudi.
| Config Name | Default | Description |
|---|---|---|
| hoodie.memory.compaction.max.size | (N/A) | Maximum amount of memory used in bytes for compaction operations in bytes , before spilling to local storage.Config Param: MAX_MEMORY_FOR_COMPACTION |
| hoodie.memory.spillable.map.path | (N/A) | Default file path for spillable mapConfig Param: SPILLABLE_MAP_BASE_PATH |
| hoodie.memory.compaction.fraction | 0.6 | HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable mapConfig Param: MAX_MEMORY_FRACTION_FOR_COMPACTION |
| hoodie.memory.dfs.buffer.max.size | 16777216 | Property to control the max memory in bytes for dfs input stream buffer sizeConfig Param: MAX_DFS_STREAM_BUFFER_SIZE |
| hoodie.memory.merge.fraction | 0.6 | This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during mergeConfig Param: MAX_MEMORY_FRACTION_FOR_MERGE |
| hoodie.memory.merge.max.size | 1073741824 | Maximum amount of memory used in bytes for merge operations, before spilling to local storage.Config Param: MAX_MEMORY_FOR_MERGE |
| hoodie.memory.writestatus.failure.fraction | 0.1 | Property to control how what fraction of the failed record, exceptions we report back to driver. Default is 10%. If set to 100%, with lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.Config Param: WRITESTATUS_FAILURE_FRACTION |
Metadata Configs
Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries.
| Config Name | Default | Description |
|---|---|---|
| hoodie.index.name | (N/A) | Name of the expression index. This is also used for the partition name in the metadata table.Config Param: SECONDARY_INDEX_NAMESince Version: 1.0.0 |
| hoodie.index.name | (N/A) | Name of the expression index. This is also used for the partition name in the metadata table.Config Param: EXPRESSION_INDEX_NAMESince Version: 1.0.0 |
| hoodie.metadata.index.drop | (N/A) | Drop the specified index. The value should be the name of the index to delete. You can check index names using SHOW INDEXES command. The index name either starts with or matches exactly can be one of the following: files, column_stats, bloom_filters, record_index, expr_index_, secondary_index_, partition_stats, filesConfig Param: DROP_METADATA_INDEXSince Version: 1.0.1 |
| hoodie.expression.index.type | COLUMN_STATS | Type of the expression index. Default is column_stats if there are no functions and expressions in the command. Valid options could be BITMAP, COLUMN_STATS, LUCENE, etc. If index_type is not provided, and there are functions or expressions in the command then a expression index using column stats will be created.Config Param: EXPRESSION_INDEX_TYPESince Version: 1.0.0 |
| hoodie.metadata.enable | true | Enable the internal metadata table which serves table metadata like level file listingsConfig Param: ENABLESince Version: 0.7.0 |
| hoodie.metadata.index.bloom.filter.enable | false | Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTERSince Version: 0.11.0 |
| hoodie.metadata.index.column.stats.enable | false | Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups. For the Spark engine, this config defaults to true (enabled), overriding the base default of false. For Flink and Java engines, this remains false by default.Config Param: ENABLE_METADATA_INDEX_COLUMN_STATSSince Version: 0.11.0 |
| hoodie.metadata.index.expression.enable | false | Enable expression index within the metadata table. When this configuration property is enabled (true), the Hudi writer automatically keeps all expression indexes consistent with the data table. When disabled (false), all expression indexes are deleted. Note that individual expression index can only be created through a CREATE INDEX and deleted through a DROP INDEX statement in Spark SQL.Config Param: EXPRESSION_INDEX_ENABLE_PROPSince Version: 1.0.0 |
| hoodie.metadata.index.secondary.enable | true | Enable secondary index within the metadata table. When this configuration property is enabled (true), the Hudi writer automatically keeps all secondary indexes consistent with the data table. When disabled (false), all secondary indexes are deleted. Note that individual secondary index can only be created through a CREATE INDEX and deleted through a DROP INDEX statement in Spark SQL. Config Param: SECONDARY_INDEX_ENABLE_PROPSince Version: 1.0.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.metadata.index.bloom.filter.column.list | (N/A) | Comma-separated list of columns for which bloom filter index will be built. If not set, only record key will be indexed.Config Param: BLOOM_FILTER_INDEX_FOR_COLUMNSSince Version: 0.11.0 |
| hoodie.metadata.index.column.stats.column.list | (N/A) | Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed. For nested fields within ARRAY types, use: field.list.element (e.g., items.list.element or items.list.element.price). For nested fields within MAP types, use: field.key_value.key for keys or field.key_value.value for values (e.g., metadata.key_value.key, metadata.key_value.value, or metadata.key_value.value.nested_field).Config Param: COLUMN_STATS_INDEX_FOR_COLUMNSSince Version: 0.11.0 |
| hoodie.metadata.index.column.stats.processing.mode.override | (N/A) | By default Column Stats Index is automatically determining whether it should be read and processed either'in-memory' (w/in executing process) or using Spark (on a cluster), based on some factors like the size of the Index and how many columns are read. This config allows to override this behavior.Config Param: COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDESince Version: 0.12.0 |
| hoodie.metadata.index.expression.column | (N/A) | Column for which expression index will be built.Config Param: EXPRESSION_INDEX_COLUMNSince Version: 1.0.1 |
| hoodie.metadata.index.expression.options | (N/A) | Options for the expression index, e.g. "expr='from_unixtime', format='yyyy-MM-dd'"Config Param: EXPRESSION_INDEX_OPTIONSSince Version: 1.0.1 |
| hoodie.metadata.index.secondary.column | (N/A) | Column for which secondary index will be built.Config Param: SECONDARY_INDEX_COLUMNSince Version: 1.0.1 |
| _hoodie.metadata.ignore.spurious.deletes | true | There are cases when extra files are requested to be deleted from metadata table which are never added before. This config determines how to handle such spurious deletesConfig Param: IGNORE_SPURIOUS_DELETESSince Version: 0.10.0 |
| hoodie.file.listing.parallelism | 200 | Parallelism to use, when listing the table on lake storage.Config Param: FILE_LISTING_PARALLELISM_VALUESince Version: 0.7.0 |
| hoodie.metadata.auto.delete.partitions | true | When enabled (default), metadata table partitions (indexes) that are disabled in the write config will be automatically deleted. When disabled, users must explicitly drop metadata partitions using the hudi-cli or DROP INDEX command. Disabling automatic deletion is recommended for production datasets with multiple writers to avoid accidental deletion of large metadata indexes due to misconfiguration.Config Param: AUTO_DELETE_PARTITIONSSince Version: 1.2.0 |
| hoodie.metadata.auto.initialize | true | Initializes the metadata table by reading from the file system when the table is first created. Enabled by default. Warning: This should only be disabled when manually constructing the metadata table outside of typical Hudi writer flows.Config Param: AUTO_INITIALIZESince Version: 0.14.0 |
| hoodie.metadata.bloom.filter.dynamic.max.entries | 100000 | The threshold for the maximum number of keys to record in a dynamic bloom filter row for the files in the metadata table. Only applies if the filter type (hoodie.metadata.bloom.filter.type ) is BloomFilterTypeCode.DYNAMIC_V0.Config Param: BLOOM_FILTER_DYNAMIC_MAX_ENTRIESSince Version: 1.1.0 |
| hoodie.metadata.bloom.filter.enable | false | Whether to use bloom filter in the files for lookup in the metadata table.Config Param: BLOOM_FILTER_ENABLESince Version: 1.1.0 |
| hoodie.metadata.bloom.filter.fpp | 0.000000001 | Expected probability a false positive in a bloom filter for the files in the metadata table. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives. If the number of entries added to bloom filter exceeds the configured value (hoodie.metadata.bloom.num_entries), then this fpp may not be honored.Config Param: BLOOM_FILTER_FPPSince Version: 1.1.0 |
| hoodie.metadata.bloom.filter.num.entries | 10000 | This is the number of entries stored in a bloom filter for the files in the metadata table. The rationale for the default: 10000 is chosen to be a good tradeoff between false positive rate and storage size. Warning: Setting this very low generates a lot of false positives and the metadata table reading has to scan a lot more files than it has to and setting this to a very high number increases the size every base file linearly (roughly 4KB for every 50000 entries). This config is also used with DYNAMIC bloom filter which determines the initial size for the bloom.Config Param: BLOOM_FILTER_NUM_ENTRIESSince Version: 1.1.0 |
| hoodie.metadata.bloom.filter.type | DYNAMIC_V0 | Bloom filter type for the files in the metadata table org.apache.hudi.common.bloom.BloomFilterTypeCode: Filter type used by Bloom filter. SIMPLE: Bloom filter that is based on the configured size. DYNAMIC_V0(default): Bloom filter that is auto sized based on number of keys.Config Param: BLOOM_FILTER_TYPESince Version: 1.1.0 |
| hoodie.metadata.compact.max.delta.commits | 10 | Controls how often the metadata table is compacted.Config Param: COMPACT_NUM_DELTA_COMMITSSince Version: 0.7.0 |
| hoodie.metadata.compact.max.delta.seconds | 7200 | Number of elapsed seconds after the last compaction, before scheduling a new one (for metadata table). This config takes effect only for the compaction triggering strategy based on the elapsed time, i.e., TIME_ELAPSED, NUM_AND_TIME, and NUM_OR_TIME.Config Param: COMPACT_TIME_DELTA_SECONDSSince Version: 1.2.0 |
| hoodie.metadata.compact.trigger.strategy | NUM_COMMITS | Controls how compaction scheduling is triggered for metadata table,by time or num delta commits or combination of both. Config Param: COMPACT_TRIGGER_STRATEGYSince Version: 1.2.0 |
| hoodie.metadata.derive.from.datatable.clean.policy | true | This config determines whether the cleaner policy should use data table's cleaner policy.Config Param: DERIVE_FROM_DATA_TABLE_CLEAN_POLICYSince Version: 1.2.0 |
| hoodie.metadata.dir.filter.regex | Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time.Config Param: DIR_FILTER_REGEXSince Version: 0.7.0 | |
| hoodie.metadata.file.cache.max.size.mb | 50 | Max size in MB below which metadata file (HFile) will be downloaded and cached entirely for the HFileReader.Config Param: METADATA_FILE_CACHE_MAX_SIZE_MBSince Version: 1.1.0 |
| hoodie.metadata.global.record.level.index.enable | false | Create the HUDI Record Index within the Metadata TableConfig Param: GLOBAL_RECORD_LEVEL_INDEX_ENABLE_PROPSince Version: 0.14.0 |
| hoodie.metadata.global.record.level.index.max.filegroup.count | 10000 | Maximum number of file groups to use for Record Index.Config Param: GLOBAL_RECORD_LEVEL_INDEX_MAX_FILE_GROUP_COUNT_PROPSince Version: 0.14.0 |
| hoodie.metadata.global.record.level.index.min.filegroup.count | 10 | Minimum number of file groups to use for Record Index.Config Param: GLOBAL_RECORD_LEVEL_INDEX_MIN_FILE_GROUP_COUNT_PROPSince Version: 0.14.0 |
| hoodie.metadata.index.async | false | Enable asynchronous indexing of metadata table.Config Param: ASYNC_INDEX_ENABLESince Version: 0.11.0 |
| hoodie.metadata.index.bloom.filter.file.group.count | 4 | Metadata bloom filter index partition file group count. This controls the size of the base and log files and read parallelism in the bloom filter index partition. The recommendation is to size the file group count such that the base files are under 1GB.Config Param: METADATA_INDEX_BLOOM_FILTER_FILE_GROUP_COUNTSince Version: 0.11.0 |
| hoodie.metadata.index.bloom.filter.parallelism | 200 | Parallelism to use for generating bloom filter index in metadata table.Config Param: BLOOM_FILTER_INDEX_PARALLELISMSince Version: 0.11.0 |
| hoodie.metadata.index.check.timeout.seconds | 900 | After the async indexer has finished indexing upto the base instant, it will ensure that all inflight writers reliably write index updates as well. If this timeout expires, then the indexer will abort itself safely.Config Param: METADATA_INDEX_CHECK_TIMEOUT_SECONDSSince Version: 0.11.0 |
| hoodie.metadata.index.column.stats.file.group.count | 2 | Metadata column stats partition file group count. This controls the size of the base and log files and read parallelism in the column stats index partition. The recommendation is to size the file group count such that the base files are under 1GB.Config Param: METADATA_INDEX_COLUMN_STATS_FILE_GROUP_COUNTSince Version: 0.11.0 |
| hoodie.metadata.index.column.stats.inMemory.projection.threshold | 100000 | When reading Column Stats Index, if the size of the expected resulting projection is below the in-memory threshold (counted by the # of rows), it will be attempted to be loaded "in-memory" (ie not using the execution engine like Spark, Flink, etc). If the value is above the threshold execution engine will be used to compose the projection.Config Param: COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLDSince Version: 0.12.0 |
| hoodie.metadata.index.column.stats.max.columns.to.index | 32 | Maximum number of columns to generate column stats for. If the config hoodie.metadata.index.column.stats.column.list is set, this config will be ignored. If the config hoodie.metadata.index.column.stats.column.list is not set, the column stats of the first n columns (n defined by this config) in the table schema are generated.Config Param: COLUMN_STATS_INDEX_MAX_COLUMNSSince Version: 1.0.0 |
| hoodie.metadata.index.column.stats.parallelism | 200 | Parallelism to use, when generating column stats index.Config Param: COLUMN_STATS_INDEX_PARALLELISMSince Version: 0.11.0 |
| hoodie.metadata.index.expression.file.group.count | 2 | Metadata expression index partition file group count.Config Param: EXPRESSION_INDEX_FILE_GROUP_COUNTSince Version: 1.0.0 |
| hoodie.metadata.index.expression.parallelism | 200 | Parallelism to use, when generating expression index.Config Param: EXPRESSION_INDEX_PARALLELISMSince Version: 1.0.0 |
| hoodie.metadata.index.partition.stats.file.group.count | 1 | Metadata partition stats file group count. This controls the size of the base and log files and read parallelism in the partition stats index.Config Param: METADATA_INDEX_PARTITION_STATS_FILE_GROUP_COUNTSince Version: 1.0.0 |
| hoodie.metadata.index.partition.stats.parallelism | 200 | Parallelism to use, when generating partition stats index.Config Param: PARTITION_STATS_INDEX_PARALLELISMSince Version: 1.0.0 |
| hoodie.metadata.index.secondary.parallelism | 200 | Parallelism to use, when generating secondary index.Config Param: SECONDARY_INDEX_PARALLELISMSince Version: 1.0.0 |
| hoodie.metadata.log.compaction.blocks.threshold | 5 | Controls the criteria to log compacted files groups in metadata table.Config Param: LOG_COMPACT_BLOCKS_THRESHOLDSince Version: 0.14.0 |
| hoodie.metadata.log.compaction.enable | false | This configs enables logcompaction for the metadata table.Config Param: ENABLE_LOG_COMPACTION_ON_METADATA_TABLESince Version: 0.14.0 |
| hoodie.metadata.max.deltacommits.when_pending | 1000 | When there is a pending instant in data table, this config limits the allowed number of deltacommits in metadata table to prevent the metadata table's timeline from growing unboundedly as compaction won't be triggered due to the pending data table instant.Config Param: METADATA_MAX_NUM_DELTACOMMITS_WHEN_PENDINGSince Version: 0.14.0 |
| hoodie.metadata.max.init.parallelism | 100000 | Maximum parallelism to use when initializing Record Index.Config Param: RECORD_INDEX_MAX_PARALLELISMSince Version: 0.14.0 |
| hoodie.metadata.max.logfile.size | 2147483648 | Maximum size in bytes of a single log file. Larger log files can contain larger log blocks thereby reducing the number of blocks to search for keysConfig Param: MAX_LOG_FILE_SIZE_BYTES_PROPSince Version: 0.14.0 |
| hoodie.metadata.max.reader.buffer.size | 10485760 | Max memory to use for the reader buffer while merging log blocksConfig Param: MAX_READER_BUFFER_SIZE_PROPSince Version: 0.14.0 |
| hoodie.metadata.max.reader.memory | 1073741824 | Max memory to use for the reader to read from metadataConfig Param: MAX_READER_MEMORY_PROPSince Version: 0.14.0 |
| hoodie.metadata.metrics.enable | false | Enable publishing of metrics around metadata table.Config Param: METRICS_ENABLESince Version: 0.7.0 |
| hoodie.metadata.optimized.log.blocks.scan.enable | false | Optimized log blocks scanner that addresses all the multi-writer use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written by log compaction.Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCANSince Version: 0.13.0 |
| hoodie.metadata.range.repartition.random.seed | 42 | Random seed used for sampling during range-based repartitioning. This ensures reproducible results across runs.Config Param: RANGE_REPARTITION_RANDOM_SEEDSince Version: 1.1.0 |
| hoodie.metadata.range.repartition.sampling.fraction | 0.01 | Sampling fraction used for range-based repartitioning during metadata table lookups. This controls the accuracy vs performance trade-off for key distribution sampling.Config Param: RANGE_REPARTITION_SAMPLING_FRACTIONSince Version: 1.1.0 |
| hoodie.metadata.range.repartition.target.records.per.partition | 10000 | Target number of records per partition during range-based repartitioning. This helps control the size of each partition for optimal processing.Config Param: RANGE_REPARTITION_TARGET_RECORDS_PER_PARTITIONSince Version: 1.1.0 |
| hoodie.metadata.record.index.enable.validation.on.initialization | false | Enable validation of record index after initialization by comparing the expected record count with the actual record count stored in the metadata table. This validation runs in a distributed manner using the compute engine. Disabled by default as it adds overhead to the initialization process.Config Param: RECORD_INDEX_INITIALIZATION_VALIDATION_ENABLESince Version: 1.1.0 |
| hoodie.metadata.record.index.growth.factor | 2.0 | The current number of records are multiplied by this number when estimating the number of file groups to create automatically. This helps account for growth in the number of records in the dataset.Config Param: RECORD_INDEX_GROWTH_FACTOR_PROPSince Version: 0.14.0 |
| hoodie.metadata.record.index.max.filegroup.size | 1073741824 | Maximum size in bytes of a single file group. Large file group takes longer to compact.Config Param: RECORD_INDEX_MAX_FILE_GROUP_SIZE_BYTES_PROPSince Version: 0.14.0 |
| hoodie.metadata.record.level.index.defer.init | false | When enabled, defers RLI initialization to 2nd commit for a fresh table. This should help with determining the file group count dynamically for Record Index (global and non-global RLI)Config Param: DEFER_RLI_INIT_FOR_FRESH_TABLESince Version: 1.2.0 |
| hoodie.metadata.record.level.index.enable | false | Create the HUDI Record Index within the Metadata Table for a partitioned dataset where a pair of partition path and record key is unique across the entire tableConfig Param: RECORD_LEVEL_INDEX_ENABLE_PROPSince Version: 1.1.0 |
| hoodie.metadata.record.level.index.max.filegroup.count | 10 | Maximum number of file groups to use for Partitioned Record Index.Config Param: RECORD_LEVEL_INDEX_MAX_FILE_GROUP_COUNT_PROPSince Version: 1.1.0 |
| hoodie.metadata.record.level.index.min.filegroup.count | 1 | Minimum number of file groups to use for Partitioned Record Index.Config Param: RECORD_LEVEL_INDEX_MIN_FILE_GROUP_COUNT_PROPSince Version: 1.1.0 |
| hoodie.metadata.record.preparation.parallelism | 0 | when set to positive number, metadata table record preparation stages honor the set value for number of tasks. If not, number of write status's from data table writes will be used for metadata table record preparationConfig Param: RECORD_PREPARATION_PARALLELISMSince Version: 1.1.0 |
| hoodie.metadata.repartition.default.partitions | 200 | Default number of partitions to use when repartitioning is needed. This provides a reasonable level of parallelism for metadata table operations.Config Param: REPARTITION_DEFAULT_PARTITIONSSince Version: 1.1.0 |
| hoodie.metadata.repartition.min.partitions.threshold | 100 | Minimum number of partitions threshold below which repartitioning is triggered. When the number of partitions is below this threshold, data will be repartitioned for better parallelism.Config Param: REPARTITION_MIN_PARTITIONS_THRESHOLDSince Version: 1.1.0 |
| hoodie.metadata.spillable.map.path | Path on local storage to use, when keys read from metadata are held in a spillable map.Config Param: SPILLABLE_MAP_DIR_PROPSince Version: 0.14.0 | |
| hoodie.metadata.streaming.write.datatable.write.statuses.coalesce.divisor | 5000 | When streaming writes to metadata table is enabled via hoodie.metadata.streaming.write.enabled, the data table write statuses are unioned with metadata table write statuses before triggering the entire write dag. The data table write statuses will be coalesce down to the number of write statuses divided by the specified divisor to avoid triggering thousands of no-op tasks for the data table writes which have their status cached.Config Param: STREAMING_WRITE_DATATABLE_WRITE_STATUSES_COALESCE_DIVISORSince Version: 1.1.0 |
| hoodie.metadata.streaming.write.enabled | false | Whether to enable streaming writes to metadata table or not. With streaming writes, we execute writes to both data table and metadata table in streaming manner rather than two disjoint writes. By default streaming writes to metadata table is enabled for SPARK engine for incremental operations and disabled for all other cases.Config Param: STREAMING_WRITE_ENABLEDSince Version: 1.1.0 |
| hoodie.metadata.table.service.manager.actions | Comma-separated list of table service actions on the metadata table that should be delegated to the table service manager. Currently supported actions are: compaction, logcompaction.Config Param: TABLE_SERVICE_MANAGER_ACTIONS | |
| hoodie.metadata.table.service.manager.enabled | false | If true, delegate specified table service actions on the metadata table to the table service manager instead of executing them inline. This prevents the current writer from executing compaction/logcompaction on the metadata table, allowing a separate async pipeline to handle them.Config Param: TABLE_SERVICE_MANAGER_ENABLED |
| hoodie.metadata.write.concurrency.mode | SINGLE_WRITER | Change this to OPTIMISTIC_CONCURRENCY_CONTROL when MDT operations are being performed from an external concurrent writer (such as a table service platform) so that appropriate locks are taken.Config Param: METADATA_WRITE_CONCURRENCY_MODE |
| hoodie.metadata.write.fail.on.table.service.failures | true | when set to true, it fails the job on metadata table's table services operation failureConfig Param: FAIL_ON_TABLE_SERVICE_FAILURESSince Version: 1.2.0 |
Metaserver Configs
Configurations used by the Hudi Metaserver.
| Config Name | Default | Description |
|---|---|---|
| hoodie.database.name | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific databaseConfig Param: DATABASE_NAMESince Version: 0.13.0 |
| hoodie.table.name | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.Config Param: TABLE_NAMESince Version: 0.13.0 |
| hoodie.metaserver.connect.retries | 3 | Number of retries while opening a connection to metaserverConfig Param: METASERVER_CONNECTION_RETRIESSince Version: 0.13.0 |
| hoodie.metaserver.connect.retry.delay | 1 | Number of seconds for the client to wait between consecutive connection attemptsConfig Param: METASERVER_CONNECTION_RETRY_DELAYSince Version: 0.13.0 |
| hoodie.metaserver.enabled | false | Enable Hudi metaserver for storing Hudi tables' metadata.Config Param: METASERVER_ENABLESince Version: 0.13.0 |
| hoodie.metaserver.uris | thrift://localhost:9090 | Metaserver server urisConfig Param: METASERVER_URLSSince Version: 0.13.0 |
Storage Configs
Configurations that control aspects around writing, sizing, reading base and log files.
| Config Name | Default | Description |
|---|---|---|
| hoodie.hfile.writes.allow.duplicates | false | When bootstrapping RI, if the main dataset contains duplicates then it will fail the bootstrap job. TO avoid the failure and bootstrap the RI with dups this config can be set to true. One thing to note is that, there is no deterministic way to specify which among these records will be ingested into RI.Config Param: HFILE_WRITER_TO_ALLOW_DUPLICATES |
| hoodie.parquet.compression.codec | gzip | Compression Codec for parquet filesConfig Param: PARQUET_COMPRESSION_CODEC_NAME |
| hoodie.parquet.max.file.size | 125829120 | Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.Config Param: PARQUET_MAX_FILE_SIZE |
| Config Name | Default | Description |
|---|---|---|
| hoodie.logfile.data.block.format | (N/A) | Format of the data block within delta logs. Following formats are currently supported "avro", "hfile", "parquet"Config Param: LOGFILE_DATA_BLOCK_FORMAT |
| hoodie.parquet.write.config.injector.class | (N/A) | Config injector implementation for HoodieParquetConfigInjector class, for users willing to inject some custom configs to parquet writersConfig Param: HOODIE_PARQUET_CONFIG_INJECTOR_CLASSSince Version: 1.2.0 |
| hoodie.parquet.writelegacyformat.enabled | (N/A) | Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet's fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format.Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED |
| hoodie.avro.write.support.class | org.apache.hudi.avro.HoodieAvroWriteSupport | Provided write support class should extend HoodieAvroWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context.Config Param: HOODIE_AVRO_WRITE_SUPPORT_CLASSSince Version: 0.14.0 |
| hoodie.bloom.index.filter.dynamic.max.entries | 100000 | The threshold for the maximum number of keys to record in a dynamic Bloom filter row. Only applies if filter type is BloomFilterTypeCode.DYNAMIC_V0.Config Param: BLOOM_FILTER_DYNAMIC_MAX_ENTRIES |
| hoodie.bloom.index.filter.type | DYNAMIC_V0 | org.apache.hudi.common.bloom.BloomFilterTypeCode: Filter type used by Bloom filter. SIMPLE: Bloom filter that is based on the configured size. DYNAMIC_V0(default): Bloom filter that is auto sized based on number of keys.Config Param: BLOOM_FILTER_TYPE |
| hoodie.hfile.block.size | 1048576 | Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially faster lookup times.Config Param: HFILE_BLOCK_SIZE |
| hoodie.hfile.compression.algorithm | GZ | Compression codec to use for hfile base files.Config Param: HFILE_COMPRESSION_ALGORITHM_NAME |
| hoodie.hfile.max.file.size | 125829120 | Target file size in bytes for HFile base files.Config Param: HFILE_MAX_FILE_SIZE |
| hoodie.index.bloom.fpp | 0.000000001 | Only applies if index type is BLOOM. Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives. If the number of entries added to bloom filter exceeds the configured value (hoodie.index.bloom.num_entries), then this fpp may not be honored.Config Param: BLOOM_FILTER_FPP_VALUE |
| hoodie.index.bloom.num_entries | 60000 | Only applies if index type is BLOOM. This is the number of entries to be stored in the bloom filter. The rationale for the default: Assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. Warning: Setting this very low, will generate a lot of false positives and index lookup will have to scan a lot more files than it has to and setting this to a very high number will increase the size every base file linearly (roughly 4KB for every 50000 entries). This config is also used with DYNAMIC bloom filter which determines the initial size for the bloom.Config Param: BLOOM_FILTER_NUM_ENTRIES_VALUE |
| hoodie.io.factory.class | org.apache.hudi.io.storage.hadoop.HoodieHadoopIOFactory | The fully-qualified class name of the factory class to return readers and writers of files used by Hudi. The provided class should implement org.apache.hudi.io.storage.HoodieIOFactory.Config Param: HOODIE_IO_FACTORY_CLASSSince Version: 0.15.0 |
| hoodie.lance.max.file.size | 125829120 | Target file size in bytes for Lance base files.Config Param: LANCE_MAX_FILE_SIZE |
| hoodie.lance.read.allocator.size.bytes | 268435456 | Maximum size in bytes of the Arrow child allocator used by the Lance reader for the per-read data buffers. Raise this for files with very wide rows or large blob columns.Config Param: LANCE_READ_ALLOCATOR_SIZE_BYTES |
| hoodie.lance.read.metadata.allocator.size.bytes | 8388608 | Maximum size in bytes of the Arrow child allocator used by the Lance reader for footer/metadata operations (schema, bloom filter). Independent of the data allocator since metadata allocations are small and short-lived.Config Param: LANCE_READ_METADATA_ALLOCATOR_SIZE_BYTES |
| hoodie.lance.write.allocator.size.bytes | 268435456 | Maximum size in bytes of the Arrow child allocator used by the Lance writer for buffering in-flight batch data. Must be large enough that the Arrow BaseLargeVariableWidthVector's power-of-2 doubling growth never requests a buffer exceeding this cap, otherwise writes fail with OutOfMemoryException. The default of 256MB clears the 128MB doubling step with headroom; pair with hoodie.lance.write.flush.byte.watermark to bound in-flight memory regardless of blob size or row count.Config Param: LANCE_WRITE_ALLOCATOR_SIZE_BYTES |
| hoodie.lance.write.flush.byte.watermark | 100663296 | Byte-size watermark on the Lance writer's in-flight Arrow buffers; the writer flushes the current batch when the sum of FieldVector buffer sizes reaches this value, in addition to the row-count batch threshold. Keeps the data buffer below the next power-of-2 doubling step so reallocation cannot exceed hoodie.lance.write.allocator.size.bytes. Default is roughly 3/8 of the allocator size, leaving room for offset/validity buffers.Config Param: LANCE_WRITE_FLUSH_BYTE_WATERMARK |
| hoodie.logfile.data.block.max.size | 268435456 | LogFile Data block max size in bytes. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory.Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE |
| hoodie.logfile.max.size | 1073741824 | LogFile max size in bytes. This is the maximum size allowed for a log file before it is rolled over to the next version.Config Param: LOGFILE_MAX_SIZE |
| hoodie.logfile.to.parquet.compression.ratio | 0.35 | Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.When encoding log blocks in parquet format, increase this value for a more accurate estimationConfig Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION |
| hoodie.orc.block.size | 125829120 | ORC block size, recommended to be aligned with the target file size.Config Param: ORC_BLOCK_SIZE |
| hoodie.orc.compression.codec | ZLIB | Compression codec to use for ORC base files.Config Param: ORC_COMPRESSION_CODEC_NAME |
| hoodie.orc.max.file.size | 125829120 | Target file size in bytes for ORC base files.Config Param: ORC_FILE_MAX_SIZE |
| hoodie.orc.stripe.size | 67108864 | Size of the memory buffer in bytes for writingConfig Param: ORC_STRIPE_SIZE |
| hoodie.parquet.block.size | 125829120 | Parquet RowGroup size in bytes. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.Config Param: PARQUET_BLOCK_SIZE |
| hoodie.parquet.bloom.filter.enabled | true | Control whether to write bloom filter or not. Default true. We can set to false in non bloom index cases for CPU resource saving.Config Param: PARQUET_WITH_BLOOM_FILTER_ENABLEDSince Version: 0.15.0 |
| hoodie.parquet.compression.ratio | 0.1 | Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized filesConfig Param: PARQUET_COMPRESSION_RATIO_FRACTION |
| hoodie.parquet.dictionary.enabled | true | Whether to use dictionary encodingConfig Param: PARQUET_DICTIONARY_ENABLED |
| hoodie.parquet.field_id.write.enabled | true | Would only be effective with Spark 3.3+. Sets spark.sql.parquet.fieldId.write.enabled. If enabled, Spark will write out parquet native field ids that are stored inside StructField's metadata as parquet.field.id to parquet files.Config Param: PARQUET_FIELD_ID_WRITE_ENABLEDSince Version: 0.12.0 |
| hoodie.parquet.flink.rowdata.write.support.class | org.apache.hudi.io.storage.row.HoodieRowDataParquetWriteSupport | Provided write support class should extend HoodieRowDataParquetWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write support.Config Param: HOODIE_PARQUET_FLINK_ROW_DATA_WRITE_SUPPORT_CLASSSince Version: 1.1.0 |
| hoodie.parquet.outputtimestamptype | TIMESTAMP_MICROS | Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE |
| hoodie.parquet.page.size | 1048576 | Parquet page size in bytes. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.Config Param: PARQUET_PAGE_SIZE |
| hoodie.parquet.spark.row.write.support.class | org.apache.hudi.io.storage.row.HoodieRowParquetWriteSupport | Provided write support class should extend HoodieRowParquetWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context when hoodie.datasource.write.row.writer.enable=true.Config Param: HOODIE_PARQUET_SPARK_ROW_WRITE_SUPPORT_CLASSSince Version: 0.15.0 |
| hoodie.parquet.write.utc-timezone.enabled | true | Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Default value is utc timezone for forward compatibility.Config Param: WRITE_UTC_TIMEZONESince Version: 1.1.0 |
| hoodie.storage.class | org.apache.hudi.storage.hadoop.HoodieHadoopStorage | The fully-qualified class name of the HoodieStorage implementation class to instantiate. The provided class should implement org.apache.hudi.storage.HoodieStorageConfig Param: HOODIE_STORAGE_CLASSSince Version: 0.15.0 |
Consistency Guard Configurations
The consistency guard related config options, to help talk to eventually consistent object storage.(Tip: S3 is NOT eventually consistent anymore!)
| Config Name | Default | Description |
|---|---|---|
| _hoodie.optimistic.consistency.guard.enable | false | Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period.Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLESince Version: 0.6.0 |
| hoodie.consistency.check.enabled | false | Enabled to handle S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in the future releases.Config Param: ENABLESince Version: 0.5.0Deprecated since: 0.7.0 |
| hoodie.consistency.check.initial_interval_ms | 400 | Amount of time (in ms) to wait, before checking for consistency after an operation on storage.Config Param: INITIAL_CHECK_INTERVAL_MSSince Version: 0.5.0Deprecated since: 0.7.0 |
| hoodie.consistency.check.max_checks | 6 | Maximum number of consistency checks to perform, with exponential backoff.Config Param: MAX_CHECKSSince Version: 0.5.0Deprecated since: 0.7.0 |
| hoodie.consistency.check.max_interval_ms | 20000 | Maximum amount of time (in ms), to wait for consistency checking.Config Param: MAX_CHECK_INTERVAL_MSSince Version: 0.5.0Deprecated since: 0.7.0 |
| hoodie.optimistic.consistency.guard.sleep_time_ms | 500 | Amount of time (in ms), to wait after which we assume storage is consistent.Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MSSince Version: 0.6.0 |
FileSystem Guard Configurations
The filesystem retry related config options, to help deal with runtime exception like list/get/put/delete performance issues.
| Config Name | Default | Description |
|---|---|---|
| hoodie.filesystem.operation.retry.enable | false | Enabled to handle list/get/delete etc file system performance issue.Config Param: FILESYSTEM_RETRY_ENABLESince Version: 0.11.0 |
| hoodie.filesystem.operation.retry.exceptions | The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from FileSystemConfig Param: RETRY_EXCEPTIONSSince Version: 0.11.0 | |
| hoodie.filesystem.operation.retry.initial_interval_ms | 100 | Amount of time (in ms) to wait, before retry to do operations on storage.Config Param: INITIAL_RETRY_INTERVAL_MSSince Version: 0.11.0 |
| hoodie.filesystem.operation.retry.max_interval_ms | 2000 | Maximum amount of time (in ms), to wait for next retry.Config Param: MAX_RETRY_INTERVAL_MSSince Version: 0.11.0 |
| hoodie.filesystem.operation.retry.max_numbers | 4 | Maximum number of retry actions to perform, with exponential backoff.Config Param: MAX_RETRY_NUMBERSSince Version: 0.11.0 |
File System View Storage Configurations
Configurations that control how file metadata is stored by Hudi, for transaction processing and queries.
| Config Name | Default | Description |
|---|---|---|
| hoodie.filesystem.remote.backup.view.enable | true | Config to control whether backup needs to be configured if clients were not able to reach timeline service.Config Param: REMOTE_BACKUP_VIEW_ENABLE |
| hoodie.filesystem.view.incr.timeline.sync.enable | false | Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline.Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE |
| hoodie.filesystem.view.remote.host | localhost | We expect this to be rarely hand configured.Config Param: REMOTE_HOST_NAME |
| hoodie.filesystem.view.remote.port | 26754 | Port to serve file system view queries, when remote. We expect this to be rarely hand configured.Config Param: REMOTE_PORT_NUM |
| hoodie.filesystem.view.remote.retry.enable | false | Whether to enable API request retry for remote file system view.Config Param: REMOTE_RETRY_ENABLESince Version: 0.12.1 |
| hoodie.filesystem.view.remote.retry.exceptions | The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from Remote Request.Config Param: RETRY_EXCEPTIONSSince Version: 0.12.1 | |
| hoodie.filesystem.view.remote.retry.initial_interval_ms | 100 | Amount of time (in ms) to wait, before retry to do operations on storage.Config Param: REMOTE_INITIAL_RETRY_INTERVAL_MSSince Version: 0.12.1 |
| hoodie.filesystem.view.remote.retry.max_interval_ms | 2000 | Maximum amount of time (in ms), to wait for next retry.Config Param: REMOTE_MAX_RETRY_INTERVAL_MSSince Version: 0.12.1 |
| hoodie.filesystem.view.remote.retry.max_numbers | 3 | Maximum number of retry for API requests against a remote file system view. e.g timeline server.Config Param: REMOTE_MAX_RETRY_NUMBERSSince Version: 0.12.1 |
| hoodie.filesystem.view.remote.timeout.secs | 300 | Timeout in seconds, to wait for API requests against a remote file system view. e.g timeline server.Config Param: REMOTE_TIMEOUT_SECS |
| hoodie.filesystem.view.rocksdb.base.path | /tmp/hoodie_timeline_rocksdb | Path on local storage to use, when storing file system view in embedded kv store/rocksdb.Config Param: ROCKSDB_BASE_PATH |
| hoodie.filesystem.view.secondary.type | MEMORY | Specifies the secondary form of storage for file system view, if the primary (e.g timeline server) is unavailable.Config Param: SECONDARY_VIEW_TYPE |
| hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction | 0.05 | Fraction of the file system view memory, to be used for holding mapping to bootstrap base files.Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION |
| hoodie.filesystem.view.spillable.clustering.mem.fraction | 0.02 | Fraction of the file system view memory, to be used for holding clustering related metadata.Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION |
| hoodie.filesystem.view.spillable.compaction.mem.fraction | 0.1 | Fraction of the file system view memory, to be used for holding compaction related metadata.Config Param: SPILLABLE_COMPACTION_MEM_FRACTION |
| hoodie.filesystem.view.spillable.dir | /tmp/ | Path on local storage to use, when file system view is held in a spillable map.Config Param: SPILLABLE_DIR |
| hoodie.filesystem.view.spillable.log.compaction.mem.fraction | 0.02 | Fraction of the file system view memory, to be used for holding log compaction related metadata.Config Param: SPILLABLE_LOG_COMPACTION_MEM_FRACTIONSince Version: 0.13.0 |
| hoodie.filesystem.view.spillable.mem | 104857600 | Amount of memory to be used in bytes for holding file system view, before spilling to disk.Config Param: SPILLABLE_MEMORY |
| hoodie.filesystem.view.spillable.replaced.mem.fraction | 0.05 | Fraction of the file system view memory, to be used for holding replace commit related metadata.Config Param: SPILLABLE_REPLACED_MEM_FRACTION |
| hoodie.filesystem.view.type | MEMORY | File system view provides APIs for viewing the files on the underlying lake storage, as file groups and file slices. This config controls how such a view is held. Options include MEMORY,SPILLABLE_DISK,EMBEDDED_KV_STORE,REMOTE_ONLY,REMOTE_FIRST which provide different trade offs for memory usage and API request performance.Config Param: VIEW_TYPE |
Archival Configs
Configurations that control archival.
| Config Name | Default | Description |
|---|---|---|
| hoodie.keep.max.commits | 30 | Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. Config Param: MAX_COMMITS_TO_KEEP |
| hoodie.keep.min.commits | 20 | Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.Config Param: MIN_COMMITS_TO_KEEP |
| Config Name | Default | Description |
|---|---|---|
| hoodie.archive.async | false | Only applies when hoodie.archive.automatic is turned on. When turned on runs archiver async with writing, which can speed up overall write performance.Config Param: ASYNC_ARCHIVESince Version: 0.11.0 |
| hoodie.archive.automatic | true | When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It's recommended to enable this, to ensure number of active commits is bounded.Config Param: AUTO_ARCHIVE |
| hoodie.archive.beyond.savepoint | false | If enabled, archival will proceed beyond savepoint, skipping savepoint commits. If disabled, archival will stop at the earliest savepoint commit.Config Param: ARCHIVE_BEYOND_SAVEPOINTSince Version: 0.12.0 |
| hoodie.archive.block.on.latest.clean.ectr | false | If enabled, archival will not archive commits beyond the Earliest Commit To Retain (ECTR) from the last completed clean. ECTR represents the oldest commit whose data files are still needed by the table and have not yet been cleaned up. Blocking archival at this point ensures that timeline metadata is not removed for commits whose data files still exist on storage, preventing inconsistencies between the timeline and the actual data.Config Param: BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTRSince Version: 1.2.0 |
| hoodie.archive.delete.parallelism | 100 | When performing archival operation, Hudi needs to delete the files of the archived instants in the active timeline in .hoodie folder. The file deletion also happens after merging small archived files into larger ones if enabled. This config limits the Spark parallelism for deleting files in both cases, i.e., parallelism of deleting files does not go above the configured value and the parallelism is the number of files to delete if smaller than the configured value. If you see that the file deletion in archival operation is slow because of the limited parallelism, you can increase this to tune the performance.Config Param: DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE |
| hoodie.commits.archival.batch | 10 | Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.Config Param: COMMITS_ARCHIVAL_BATCH_SIZE |
| hoodie.timeline.compaction.batch.size | 10 | The number of small files to compact at once.Config Param: TIMELINE_COMPACTION_BATCH_SIZE |
| hoodie.timeline.compaction.target.file.max.bytes | 1048576000 | Max size (in bytes) for each archived timeline file.Config Param: TIMELINE_COMPACTION_TARGET_FILE_MAX_BYTES |
| hoodie.timeline.manifest.retained.versions | 3 | Number of timeline manifest versions to retain.Config Param: TIMELINE_MANIFEST_RETAINED_VERSIONS |
Bootstrap Configs
Configurations that control how you want to bootstrap your existing tables for the first time into hudi. The bootstrap operation can flexibly avoid copying data over before you can use Hudi and support running the existing writers and new hudi writers in parallel, to validate the migration.
| Config Name | Default | Description |
|---|---|---|
| hoodie.bootstrap.base.path | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi tableConfig Param: BASE_PATHSince Version: 0.6.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.bootstrap.data.queries.only | false | Improves query performance, but queries cannot use hudi metadata fieldsConfig Param: DATA_QUERIES_ONLYSince Version: 0.14.0 |
| hoodie.bootstrap.full.input.provider | org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider | Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORDConfig Param: FULL_BOOTSTRAP_INPUT_PROVIDER_CLASS_NAMESince Version: 0.6.0 |
| hoodie.bootstrap.index.class | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping a skeleton base file to a bootstrap base file.Config Param: INDEX_CLASS_NAMESince Version: 0.6.0 |
| hoodie.bootstrap.mode.selector | org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrappedConfig Param: MODE_SELECTOR_CLASS_NAMESince Version: 0.6.0 |
| hoodie.bootstrap.mode.selector.regex | .* | Matches each bootstrap dataset partition against this regex and applies the mode below to it.Config Param: PARTITION_SELECTOR_REGEX_PATTERNSince Version: 0.6.0 |
| hoodie.bootstrap.mode.selector.regex.mode | METADATA_ONLY | org.apache.hudi.client.bootstrap.BootstrapMode: Bootstrap mode for importing an existing table into Hudi FULL_RECORD: In this mode, the full record data is copied into hudi and metadata columns are added. A full record bootstrap is functionally equivalent to a bulk-insert. After a full record bootstrap, Hudi will function properly even if the original table is modified or deleted. METADATA_ONLY(default): In this mode, the full record data is not copied into Hudi therefore it avoids full cost of rewriting the dataset. Instead, 'skeleton' files containing just the corresponding metadata columns are added to the Hudi table. Hudi relies on the data in the original table and will face data-loss or corruption if files in the original table location are deleted or modified.Config Param: PARTITION_SELECTOR_REGEX_MODESince Version: 0.6.0 |
| hoodie.bootstrap.parallelism | 1500 | For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (hoodie.bulkinsert.shuffle.parallelism), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this.Config Param: PARALLELISM_VALUESince Version: 0.6.0 |
| hoodie.bootstrap.partitionpath.translator.class | org.apache.hudi.client.bootstrap.translator.IdentityBootstrapPartitionPathTranslator | Translates the partition paths from the bootstrapped data into how is laid out as a Hudi table.Config Param: PARTITION_PATH_TRANSLATOR_CLASS_NAMESince Version: 0.6.0 |
Clean Configs
Cleaning (reclamation of older/unused file groups/slices).
| Config Name | Default | Description |
|---|---|---|
| hoodie.clean.async.enabled | false | Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance.Config Param: ASYNC_CLEAN |
| hoodie.clean.commits.retained | 10 | When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.Config Param: CLEANER_COMMITS_RETAINED |
| hoodie.prewrite.cleaner.policy | NONE | org.apache.hudi.common.model.HoodiePreWriteCleanerPolicy: If set, force attempting clean and/or failed writes rollback before starting a new ingestion write commit. This should only be set to ensure that data files do not build up on DFS if an ingestion writer is perpetually failing before completing a CLEAN. NONE(default): No pre-write clean or rollback. Default behavior. CLEAN: Force a CLEAN table service call before starting the write (also performs rollback of failed writes). ROLLBACK_FAILED_WRITES: Only perform rollback of failed writes before starting the write.Config Param: PREWRITE_CLEANER_POLICYSince Version: 1.2.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.clean.partition.filter.regex | (N/A) | When incremental clean is disabled, this regex can be used to filter the partitions to be cleaned. Only partitions matching this regex pattern will be cleaned. This can be useful for very large tables to avoid OOM issues during cleaning. If both this config and hoodie.clean.partition.filter.selected are set, the selected partitions take precedence.Config Param: CLEAN_PARTITION_FILTER_REGEXSince Version: 1.2.0 |
| hoodie.clean.partition.filter.selected | (N/A) | When incremental clean is disabled, this comma-separated list of partitions can be used to filter the partitions to be cleaned. Only the specified partitions will be cleaned. This can be useful for very large tables to avoid OOM issues during cleaning. If both this config and hoodie.clean.partition.filter.regex are set, the selected partitions take precedence.Config Param: CLEAN_PARTITION_FILTER_SELECTEDSince Version: 1.2.0 |
| hoodie.clean.automatic | true | When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded.Config Param: AUTO_CLEAN |
| hoodie.clean.delete.bootstrap.base.file | false | When set to true, cleaner also deletes the bootstrap base file when it's skeleton base file is cleaned. Turn this to true, if you want to ensure the bootstrap dataset storage is reclaimed over time, as the table receives updates/deletes. Another reason to turn this on, would be to ensure data residing in bootstrap base files are also physically deleted, to comply with data privacy enforcement processes.Config Param: CLEANER_BOOTSTRAP_BASE_FILE_ENABLE |
| hoodie.clean.failed.writes.policy | EAGER | org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy: Policy that controls how to clean up failed writes. Hudi will delete any files written by failed writes to re-claim space. EAGER(default): Clean failed writes inline after every write operation. LAZY: Clean failed writes lazily after heartbeat timeout when the cleaning service runs. This policy is required when multi-writers are enabled. NEVER: Never clean failed writes.Config Param: FAILED_WRITES_CLEANER_POLICY |
| hoodie.clean.fileversions.retained | 3 | When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning.Config Param: CLEANER_FILE_VERSIONS_RETAINED |
| hoodie.clean.hours.retained | 24 | When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.Config Param: CLEANER_HOURS_RETAINED |
| hoodie.clean.incremental.enabled | true | When enabled, the plans for each cleaner service run is computed incrementally off the events in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full table for each planning (even with a metadata table).Config Param: CLEANER_INCREMENTAL_MODE_ENABLE |
| hoodie.clean.max.commits.to.clean | 9223372036854775807 | Maximum number of commits to clean in one clean commit. Applicable only when the clean policy is based on KEEP_LATEST_COMMITS or KEEP_LATEST_HOURSConfig Param: MAX_COMMITS_TO_CLEAN |
| hoodie.clean.multiple.enabled | false | Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests should be mutually exclusive, .i.e. a 2nd clean will not be scheduled if another clean is not yet completed to avoid repeat cleaning of same files, they might want to disable this config.Config Param: ALLOW_MULTIPLE_CLEANSSince Version: 0.11.0Deprecated since: 0.15.0 |
| hoodie.clean.optimize.using.local.engine.context | true | Optimizes clean planning by using local engine context (driver-only) for metadata tables and non-partitioned datasets. This allows handling OOM errors during clean planning by scaling only driver memory instead of all executor memory. Some datasets with large record_index partitions can cause OOM errors during file listing in clean planning. By using local engine context, file listing is performed on the driver, allowing targeted memory scaling. When enabled, both non-partitioned datasets and metadata tables use the driver for scheduling cleans.Config Param: CLEAN_OPTIMIZE_USING_LOCAL_ENGINE_CONTEXTSince Version: 1.2.0 |
| hoodie.clean.parallelism | 200 | This config controls the behavior of both the cleaning plan and cleaning execution. Deriving the cleaning plan is parallelized at the table partition level, i.e., each table partition is processed by one Spark task to figure out the files to clean. The cleaner picks the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. The clean execution, i.e., the file deletion, is parallelized at file level, which is the unit of Spark task distribution. Similarly, the actual parallelism cannot exceed the configured value if the number of files is larger. If cleaning plan or execution is slow due to limited parallelism, you can increase this to tune the performance..Config Param: CLEANER_PARALLELISM_VALUE |
| hoodie.clean.policy | KEEP_LATEST_COMMITS | org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.clean.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.clean.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.clean.hours.retained" is explicitly set only.Config Param: CLEANER_POLICY |
| hoodie.clean.trigger.max.commits | 1 | Number of commits after the last clean operation, before scheduling of a new clean is attempted.Config Param: CLEAN_TRIGGER_MAX_COMMITS |
| hoodie.clean.trigger.strategy | NUM_COMMITS | org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by hoodie.clean.trigger.max.commits.Config Param: CLEAN_TRIGGER_STRATEGY |
| hoodie.write.empty.clean.interval.hours | -1 | In some cases empty clean commit needs to be created to ensure the clean planner does not look through entire dataset if there are no clean plans. This is possible for append-only dataset. Also, for these datasets we cannot ignore clean completely since in the future there could be upsert or replace operations. By creating empty clean commit, earliest_commit_to_retain value will be updated so that now clean planner can only check for partitions that are modified after the last empty clean's earliest_commit_toRetain value thereby optimizing the clean planningConfig Param: INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS |
Clustering Configs
Configurations that control the clustering table service in hudi, which optimizes the storage layout for better query performance by sorting and sizing data files.
| Config Name | Default | Description |
|---|---|---|
| hoodie.clustering.async.enabled | false | Enable running of clustering service, asynchronously as inserts happen on the table.Config Param: ASYNC_CLUSTERING_ENABLESince Version: 0.7.0 |
| hoodie.clustering.inline | false | Turn on inline clustering - clustering will be run after each write operation is completeConfig Param: INLINE_CLUSTERINGSince Version: 0.7.0 |
| hoodie.clustering.plan.generation.use.local.engine.context | false | When enabled, uses a local engine context (e.g., driver-side in Spark) instead of the distributed engine context to compute clustering groups for each partition during clustering plan generation. By default this is disabled, meaning the distributed engine context is used (e.g., with Spark, each partition's clustering groups are computed in a separate Spark task). Enable this for cases where there are guaranteed to only be a few partitions with many files in the clustering plan, and it would be more resource-efficient to compute locally on the driver rather than allocate executor resources.Config Param: PLAN_GENERATION_USE_LOCAL_ENGINE_CONTEXTSince Version: 1.2.0 |
| hoodie.clustering.plan.strategy.small.file.limit | 314572800 | Files smaller than the size in bytes specified here are candidates for clusteringConfig Param: PLAN_STRATEGY_SMALL_FILE_LIMITSince Version: 0.7.0 |
| hoodie.clustering.plan.strategy.target.file.max.bytes | 1073741824 | Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groupsConfig Param: PLAN_STRATEGY_TARGET_FILE_MAX_BYTESSince Version: 0.7.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.clustering.plan.strategy.cluster.begin.partition | (N/A) | Begin partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONSConfig Param: PARTITION_FILTER_BEGIN_PARTITIONSince Version: 0.11.0 |
| hoodie.clustering.plan.strategy.cluster.end.partition | (N/A) | End partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONSConfig Param: PARTITION_FILTER_END_PARTITIONSince Version: 0.11.0 |
| hoodie.clustering.plan.strategy.earliest.commit.to.cluster | (N/A) | Earliest commit time (exclusive) to start clustering from. Only commits after this time will be considered for commit-based clustering plan strategy.Config Param: PLAN_STRATEGY_EARLIEST_COMMIT_TO_CLUSTERSince Version: 0.14.0 |
| hoodie.clustering.plan.strategy.partition.regex.pattern | (N/A) | Filter clustering partitions that matched regex patternConfig Param: PARTITION_REGEX_PATTERNSince Version: 0.11.0 |
| hoodie.clustering.plan.strategy.partition.selected | (N/A) | Partitions to run clusteringConfig Param: PARTITION_SELECTEDSince Version: 0.11.0 |
| hoodie.clustering.plan.strategy.sort.columns | (N/A) | Columns to sort the data by when clusteringConfig Param: PLAN_STRATEGY_SORT_COLUMNSSince Version: 0.7.0 |
| hoodie.clustering.async.max.commits | 4 | Config to control frequency of async clusteringConfig Param: ASYNC_CLUSTERING_MAX_COMMITSSince Version: 0.9.0 |
| hoodie.clustering.enable.expirations | false | When enabled, rollback of failed writes (under LAZY cleaning policy) will also attempt to rollback clustering replacecommit instants whose heartbeat has expired. Clustering jobs will start a heartbeat before scheduling a plan, so that other writers can detect stale/failed clustering attempts. Note that the same client must be used to schedule, execute, and commit the clustering instant. And a clustering plan cannot be re-attemptedConfig Param: ENABLE_EXPIRATIONS |
| hoodie.clustering.execution.strategy.class | org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy | Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in th plan by the specified columns, while meeting the configured target file sizes.Config Param: EXECUTION_STRATEGY_CLASS_NAMESince Version: 0.7.0 |
| hoodie.clustering.expiration.threshold.mins | 60 | When hoodie.clustering.enable.expirations is enabled, a clustering instant will not be considered expired unless its instant creation time is at least this many minutes old. This serves as a guardrail to avoid unnecessary work in rolling back clustering instants that other writers are already attempting to roll back.Config Param: EXPIRATION_THRESHOLD_MINS |
| hoodie.clustering.group.read.parallelism | 20 | Maximum number of parallelism when Spark read records from clustering group.Config Param: CLUSTERING_GROUP_READ_PARALLELISMSince Version: 1.0.0 |
| hoodie.clustering.inline.max.commits | 4 | Config to control frequency of clustering planningConfig Param: INLINE_CLUSTERING_MAX_COMMITSSince Version: 0.7.0 |
| hoodie.clustering.max.parallelism | 15 | Maximum number of parallelism jobs submitted in clustering operation. If the resource is sufficient(Like Spark engine has enough idle executors), increasing this value will let the clustering job run faster, while it will give additional pressure to the execution engines to manage more concurrent running jobs.Config Param: CLUSTERING_MAX_PARALLELISMSince Version: 0.14.0 |
| hoodie.clustering.plan.partition.filter.mode | NONE | org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode: Partition filter mode used in the creation of clustering plan. NONE(default): Do not filter partitions. The clustering plan will include all partitions that have clustering candidates. RECENT_DAYS: This filter assumes that your data is partitioned by date. The clustering plan will only include partitions from K days ago to N days ago, where K >= N. K is determined by hoodie.clustering.plan.strategy.daybased.lookback.partitions and N is determined by hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions. SELECTED_PARTITIONS: The clustering plan will include only partition paths with names that sort within the inclusive range [hoodie.clustering.plan.strategy.cluster.begin.partition, hoodie.clustering.plan.strategy.cluster.end.partition]. DAY_ROLLING: To determine the partitions in the clustering plan, the eligible partitions will be sorted in ascending order. Each partition will have an index i in that list. The clustering plan will only contain partitions such that i mod 24 = H, where H is the current hour of the day (from 0 to 23).Config Param: PLAN_PARTITION_FILTER_MODE_NAMESince Version: 0.11.0 |
| hoodie.clustering.plan.strategy.binary.copy.schema.evolution.enable | false | Enable schema evolution support for binary file stitching during clustering. When enabled, allows clustering of files with different but compatible schemas (e.g., files with added columns). When disabled (default), only files with identical schemas will be clustered together, providing better performance but requiring schema consistency across all files in a clustering group.Config Param: FILE_STITCHING_BINARY_COPY_SCHEMA_EVOLUTION_ENABLESince Version: 1.1.0 |
| hoodie.clustering.plan.strategy.class | org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy | Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the clustering small file size limit (determined by hoodie.clustering.plan.strategy.small.file.limit) to pick the small file slices within partitions for clustering.Config Param: PLAN_STRATEGY_CLASS_NAMESince Version: 0.7.0 |
| hoodie.clustering.plan.strategy.daybased.lookback.partitions | 2 | Number of partitions to list to create ClusteringPlanConfig Param: DAYBASED_LOOKBACK_PARTITIONSSince Version: 0.7.0 |
| hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions | 0 | Number of partitions to skip from latest when choosing partitions to create ClusteringPlanConfig Param: PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATESTSince Version: 0.9.0 |
| hoodie.clustering.plan.strategy.file.slices.sort.by | SIZE | Comma-separated list of fields to sort file slices by when packing files together within a partition to create clustering groups. Available fields: INSTANT_TIME (sort by commit time ascending, so that older data files are clustered first), SIZE (sort by file size descending). For example, 'INSTANT_TIME,SIZE' sorts by commit time first then by size. Default 'SIZE' sorts by file size only.Config Param: PLAN_STRATEGY_FILE_SLICES_SORT_BYSince Version: 1.2.0 |
| hoodie.clustering.plan.strategy.max.bytes.per.group | 2147483648 | Each clustering operation can create multiple output file groups. Total amount of data processed by clustering operation is defined by below two properties (CLUSTERING_MAX_BYTES_PER_GROUP * CLUSTERING_MAX_NUM_GROUPS). Max amount of data to be included in one groupConfig Param: PLAN_STRATEGY_MAX_BYTES_PER_OUTPUT_FILEGROUPSince Version: 0.7.0 |
| hoodie.clustering.plan.strategy.max.num.groups | 30 | Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelismConfig Param: PLAN_STRATEGY_MAX_GROUPSSince Version: 0.7.0 |
| hoodie.clustering.plan.strategy.single.group.clustering.enabled | true | Whether to generate clustering plan when there is only one file group involved, by default trueConfig Param: PLAN_STRATEGY_SINGLE_GROUP_CLUSTERING_ENABLEDSince Version: 0.14.0 |
| hoodie.clustering.rollback.pending.replacecommit.on.conflict | false | If updates are allowed to file groups pending clustering, then set this config to rollback failed or pending clustering instants. Pending clustering will be rolled back ONLY IF there is conflict between incoming upsert and filegroup to be clustered. Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in rare scenarios, for example, when the clustering completes after instants are fetched but before rollback completed.Config Param: ROLLBACK_PENDING_CLUSTERING_ON_CONFLICTSince Version: 0.10.0 |
| hoodie.clustering.schedule.inline | false | When set to true, clustering service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async clustering(execution) for the one scheduled by this writer. Users can choose to set both hoodie.clustering.inline and hoodie.clustering.schedule.inline to false and have both scheduling and execution triggered by any async process, on which case hoodie.clustering.async.enabled is expected to be set to true. But if hoodie.clustering.inline is set to false, and hoodie.clustering.schedule.inline is set to true, regular writers will schedule clustering inline, but users are expected to trigger async job for execution. If hoodie.clustering.inline is set to true, regular writers will do both scheduling and execution inline for clusteringConfig Param: SCHEDULE_INLINE_CLUSTERING |
| hoodie.clustering.updates.strategy | Determines how to handle updates, deletes to file groups that are under clustering. Default strategy just rejects the updateConfig Param: UPDATES_STRATEGYSince Version: 0.7.0 | |
| hoodie.layout.optimize.build.curve.sample.size | 200000 | Determines target sample size used by the Boundary-based Interleaved Index method of building space-filling curve. Larger sample size entails better layout optimization outcomes, at the expense of higher memory footprint.Config Param: LAYOUT_OPTIMIZE_BUILD_CURVE_SAMPLE_SIZESince Version: 0.10.0 |
| hoodie.layout.optimize.curve.build.method | DIRECT | org.apache.hudi.config.HoodieClusteringConfig$SpatialCurveCompositionStrategyType: This configuration only has effect if hoodie.layout.optimize.strategy is set to either "z-order" or "hilbert" (i.e. leveraging space-filling curves). This configuration controls the type of a strategy to use for building the space-filling curves, tackling specifically how the Strings are ordered based on the curve. Since we truncate the String to 8 bytes for ordering, there are two issues: (1) it can lead to poor aggregation effect, (2) the truncation of String longer than 8 bytes loses the precision, if the Strings are different but the 8-byte prefix is the same. The boundary-based interleaved index method ("SAMPLE") has better generalization, solving the two problems above, but is slower than direct method ("DIRECT"). User should benchmark the write and query performance before tweaking this in production, if this is actually a problem. Please refer to RFC-28 for more details. DIRECT(default): This strategy builds the spatial curve in full, filling in all of the individual points corresponding to each individual record, which requires less compute. SAMPLE: This strategy leverages boundary-base interleaved index method (described in more details in Amazon DynamoDB blog https://aws.amazon.com/cn/blogs/database/tag/z-order/) and produces a better layout compared to DIRECT strategy. It requires more compute and is slower.Config Param: LAYOUT_OPTIMIZE_SPATIAL_CURVE_BUILD_METHODSince Version: 0.10.0 |
| hoodie.layout.optimize.data.skipping.enable | true | Enable data skipping by collecting statistics once layout optimization is complete.Config Param: LAYOUT_OPTIMIZE_DATA_SKIPPING_ENABLESince Version: 0.10.0Deprecated since: 0.11.0 |
| hoodie.layout.optimize.enable | false | This setting has no effect. Please refer to clustering configuration, as well as LAYOUT_OPTIMIZE_STRATEGY config to enable advanced record layout optimization strategiesConfig Param: LAYOUT_OPTIMIZE_ENABLESince Version: 0.10.0Deprecated since: 0.11.0 |
| hoodie.layout.optimize.strategy | LINEAR | org.apache.hudi.config.HoodieClusteringConfig$LayoutOptimizationStrategy: Determines ordering strategy for records layout optimization. LINEAR(default): Orders records lexicographically ZORDER: Orders records along Z-order spatial-curve. HILBERT: Orders records along Hilbert's spatial-curve.Config Param: LAYOUT_OPTIMIZE_STRATEGYSince Version: 0.10.0 |
Compaction Configs
Configurations that control compaction (merging of log files onto a new base files).
| Config Name | Default | Description |
|---|---|---|
| hoodie.compact.inline | false | When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.Config Param: INLINE_COMPACT |
| hoodie.compact.inline.max.delta.commits | 5 | Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This config takes effect only for the compaction triggering strategy based on the number of commits, i.e., NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, NUM_AND_TIME, and NUM_OR_TIME.Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS |
| Config Name | Default | Description |
|---|---|---|
| hoodie.compaction.partition.path.regex | (N/A) | Used to specify the partition path regex for compaction. Only partitions that match the regex will be compacted. Only be used when configure PartitionRegexBasedCompactionStrategy.Config Param: COMPACTION_SPECIFY_PARTITION_PATH_REGEX |
| hoodie.compact.inline.max.delta.seconds | 3600 | Number of elapsed seconds after the last compaction, before scheduling a new one. This config takes effect only for the compaction triggering strategy based on the elapsed time, i.e., TIME_ELAPSED, NUM_AND_TIME, and NUM_OR_TIME.Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS |
| hoodie.compact.inline.trigger.strategy | NUM_COMMITS | org.apache.hudi.table.action.compact.CompactionTriggerStrategy: Controls when compaction is scheduled. NUM_COMMITS(default): triggers compaction when there are at least N delta commits after last completed compaction. NUM_COMMITS_AFTER_LAST_REQUEST: triggers compaction when there are at least N delta commits after last completed or requested compaction. TIME_ELAPSED: triggers compaction after N seconds since last compaction. NUM_AND_TIME: triggers compaction when both there are at least N delta commits and N seconds elapsed (both must be satisfied) after last completed compaction. NUM_OR_TIME: triggers compaction when both there are at least N delta commits or N seconds elapsed (either condition is satisfied) after last completed compaction.Config Param: INLINE_COMPACT_TRIGGER_STRATEGY |
| hoodie.compact.schedule.inline | false | When set to true, compaction service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async compaction(execution) for the one scheduled by this writer. Users can choose to set both hoodie.compact.inline and hoodie.compact.schedule.inline to false and have both scheduling and execution triggered by any async process. But if hoodie.compact.inline is set to false, and hoodie.compact.schedule.inline is set to true, regular writers will schedule compaction inline, but users are expected to trigger async job for execution. If hoodie.compact.inline is set to true, regular writers will do both scheduling and execution inline for compactionConfig Param: SCHEDULE_INLINE_COMPACT |
| hoodie.compaction.daybased.target.partitions | 10 | Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.Config Param: TARGET_PARTITIONS_PER_DAYBASED_COMPACTION |
| hoodie.compaction.logfile.num.threshold | 0 | Only if the log file num is greater than the threshold, the file group will be compacted.Config Param: COMPACTION_LOG_FILE_NUM_THRESHOLDSince Version: 0.13.0 |
| hoodie.compaction.logfile.size.threshold | 0 | Only if the log file size is greater than the threshold in bytes, the file group will be compacted.Config Param: COMPACTION_LOG_FILE_SIZE_THRESHOLD |
| hoodie.compaction.plan.generator | org.apache.hudi.table.action.compact.plan.generators.HoodieCompactionPlanGenerator | Compaction plan generator for data files. Override with a custom plan generator if there's a need to use extraMetadata in the compaction plan for optimizations, ignore otherwiseConfig Param: COMPACTION_PLAN_GENERATOR |
| hoodie.compaction.strategy | org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy | Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data. The strategy can be composed with multiple strategies by concatenating the class names with ','.Config Param: COMPACTION_STRATEGY |
| hoodie.compaction.target.io | 512000 | Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode.Config Param: TARGET_IO_PER_COMPACTION_IN_MB |
| hoodie.copyonwrite.insert.auto.split | true | Config to control whether we control insert split sizes automatically based on average record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome.Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS |
| hoodie.copyonwrite.insert.split.size | 500000 | Number of inserts assigned for each partition/bucket for writing. We based the default on writing out 100MB files, with at least 1kb records (100K records per file), and over provision to 500K. As long as auto-tuning of splits is turned on, this only affects the first write, where there is no history to learn record sizes from.Config Param: COPY_ON_WRITE_INSERT_SPLIT_SIZE |
| hoodie.copyonwrite.record.size.estimate | 1024 | The average record size. If not explicitly specified, hudi will compute the record size estimate compute dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files.Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE |
| hoodie.log.compaction.blocks.threshold | 5 | Log compaction can be scheduled if the no. of log blocks crosses this threshold value. This is effective only when log compaction is enabled via hoodie.log.compaction.inlineConfig Param: LOG_COMPACTION_BLOCKS_THRESHOLDSince Version: 0.13.0 |
| hoodie.log.compaction.enable | false | By enabling log compaction through this config, log compaction will also get enabled for the metadata table.Config Param: ENABLE_LOG_COMPACTIONSince Version: 0.14.0 |
| hoodie.log.compaction.inline | false | When set to true, logcompaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.Config Param: INLINE_LOG_COMPACTSince Version: 0.13.0 |
| hoodie.parquet.small.file.limit | 104857600 | During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file. Also note that if this set <= 0, will not try to get small files and directly write new filesConfig Param: PARQUET_SMALL_FILE_LIMIT |
| hoodie.record.size.estimation.threshold | 1.0 | We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)Config Param: RECORD_SIZE_ESTIMATION_THRESHOLD |
Error table Configs
Configurations that are required for Error table configs
| Config Name | Default | Description |
|---|---|---|
| hoodie.errortable.base.path | (N/A) | Base path for error table under which all error records would be stored.Config Param: ERROR_TABLE_BASE_PATH |
| hoodie.errortable.target.table.name | (N/A) | Table name to be used for the error tableConfig Param: ERROR_TARGET_TABLE |
| hoodie.errortable.write.class | (N/A) | Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this configConfig Param: ERROR_TABLE_WRITE_CLASS |
| hoodie.errortable.enable | false | Config to enable error table. If the config is enabled, all the records with processing error in DeltaStreamer are transferred to error table.Config Param: ERROR_TABLE_ENABLED |
| hoodie.errortable.insert.shuffle.parallelism | 200 | Config to set insert shuffle parallelism. The config is similar to hoodie.insert.shuffle.parallelism config but applies to the error table.Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE |
| hoodie.errortable.source.rdd.persist | false | Enabling this config, persists the sourceRDD to disk which helps in faster processing of data table + error table write DAGConfig Param: ERROR_TABLE_PERSIST_SOURCE_RDD |
| hoodie.errortable.upsert.shuffle.parallelism | 200 | Config to set upsert shuffle parallelism. The config is similar to hoodie.upsert.shuffle.parallelism config but applies to the error table.Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE |
| hoodie.errortable.validate.recordcreation.enable | true | Records that fail to be created due to keygeneration failure or other issues will be sent to the Error TableConfig Param: ERROR_ENABLE_VALIDATE_RECORD_CREATIONSince Version: 0.15.0 |
| hoodie.errortable.validate.targetschema.enable | false | Records with schema mismatch with Target Schema are sent to Error Table.Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA |
| hoodie.errortable.write.failure.strategy | ROLLBACK_COMMIT | The config specifies the failure strategy if error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered) , LOG_ERROR (Error is logged but the base table write succeeds) ]Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY |
| hoodie.errortable.write.union.enable | false | Enable error table union with data table when writing for improved commit performance. By default it is disabled meaning data table and error table writes are sequentialConfig Param: ENABLE_ERROR_TABLE_WRITE_UNIFICATION |
Layout Configs
Configurations that control storage layout and data distribution, which defines how the files are organized within a table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.storage.layout.partitioner.class | (N/A) | Partitioner class, it is used to distribute data in a specific way.Config Param: LAYOUT_PARTITIONER_CLASS_NAME |
| hoodie.storage.layout.type | DEFAULT | org.apache.hudi.table.storage.HoodieStorageLayout$LayoutType: Determines how the files are organized within a table. DEFAULT(default): Each file group contains records of a certain set of keys, without particular grouping criteria. BUCKET: Each file group contains records of a set of keys which map to a certain range of hash values, so that using the hash function can easily identify the file group a record belongs to, based on the record key.Config Param: LAYOUT_TYPE |
PreWrite Validator Configurations
The following set of configurations help validate data before writes.
| Config Name | Default | Description |
|---|---|---|
| hoodie.prewrite.validators | Comma separated list of class names that can be invoked to validate before write operationsConfig Param: VALIDATOR_CLASS_NAMES |
TTL management Configs
Data ttl management
| Config Name | Default | Description |
|---|---|---|
| hoodie.partition.ttl.strategy.class | (N/A) | Config to provide a strategy class (subclass of PartitionTTLStrategy) to get the expired partitionsConfig Param: PARTITION_TTL_STRATEGY_CLASS_NAMESince Version: 1.0.0 |
| hoodie.partition.ttl.strategy.partition.selected | (N/A) | Partitions to manage ttlConfig Param: PARTITION_SELECTEDSince Version: 1.0.0 |
| hoodie.partition.ttl.inline | false | When enabled, the partition ttl management service is invoked immediately after each commit, to delete exipired partitionsConfig Param: INLINE_PARTITION_TTLSince Version: 1.0.0 |
| hoodie.partition.ttl.management.strategy.type | KEEP_BY_TIME | Partition ttl management strategy type to determine the strategy classConfig Param: PARTITION_TTL_STRATEGY_TYPESince Version: 1.0.0 |
| hoodie.partition.ttl.strategy.days.retain | -1 | Partition ttl management KEEP_BY_TIME strategy days retainConfig Param: DAYS_RETAINSince Version: 1.0.0 |
| hoodie.partition.ttl.strategy.max.delete.partitions | 1000 | max partitions to delete in partition ttl managementConfig Param: MAX_PARTITION_TO_DELETESince Version: 1.0.0 |
Write Configurations
Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g Spark datasources, Flink sink) and utilities (e.g Hudi Streamer).
| Config Name | Default | Description |
|---|---|---|
| hoodie.base.path | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.Config Param: BASE_PATH |
| hoodie.datasource.write.precombine.field | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge modeConfig Param: PRECOMBINE_FIELD_NAME |
| hoodie.table.name | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be same across runs.Config Param: TBL_NAME |
| hoodie.write.record.merge.mode | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.Config Param: RECORD_MERGE_MODESince Version: 1.0.0 |
| hoodie.block.writes.on.speculative.execution | true | When enabled (default), throws an exception if Spark speculative execution is enabled during the client creation. This prevents potential data corruption due to duplicate writes by speculative executors. Set to false only if you understand the risks of running with Spark speculative execution enabled.Config Param: BLOCK_WRITES_ON_SPECULATIVE_EXECUTIONSince Version: 0.14.0 |
| hoodie.fail.job.on.duplicate.data.file.detection | false | If config is enabled, entire job is failed on invalid file detectionConfig Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION |
| hoodie.write.auto.upgrade | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.Config Param: AUTO_UPGRADE_VERSIONSince Version: 1.0.0 |
| hoodie.write.can.ignore.post.commit.failures | false | When this config is true, any failures in post-commit operations are ignored and do not kill the application.Config Param: CAN_IGNORE_POST_COMMIT_FAILURES |
| hoodie.write.concurrency.mode | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.Config Param: WRITE_CONCURRENCY_MODE |
| hoodie.write.ignore.failed | true | Flag to indicate whether to ignore any non exception error (e.g. write status error).By default true for backward compatibility.Config Param: IGNORE_FAILEDSince Version: |
| hoodie.write.table.version | 9 | The table version this writer is storing the table in. This should match the current table version.Config Param: WRITE_TABLE_VERSIONSince Version: 1.0.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.avro.schema | (N/A) | Schema string representing the current write schema of the table. Hudi passes this to implementations of HoodieRecordPayload to convert incoming records to avro. This is also used as the write schema evolving records during an update.Config Param: AVRO_SCHEMA_STRING |
| hoodie.bulkinsert.user.defined.partitioner.class | (N/A) | If specified, this class will be used to re-partition records before they are bulk inserted. This can be used to sort, pack, cluster data optimally for common query patterns. For now we support a build-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner which can does sorting based on specified column values set by hoodie.bulkinsert.user.defined.partitioner.sort.columnsConfig Param: BULKINSERT_USER_DEFINED_PARTITIONER_CLASS_NAME |
| hoodie.bulkinsert.user.defined.partitioner.sort.columns | (N/A) | Columns to sort the data by when use org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner as user defined partitioner during bulk_insert. For example 'column1,column2'Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_SORT_COLUMNS |
| hoodie.datasource.write.keygenerator.class | (N/A) | Key generator class, that implements org.apache.hudi.keygen.KeyGenerator extract a key out of incoming records.Config Param: KEYGENERATOR_CLASS_NAME |
| hoodie.datasource.write.payload.class | (N/A) | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effectiveConfig Param: WRITE_PAYLOAD_CLASS_NAME |
| hoodie.internal.schema | (N/A) | Schema string representing the latest schema of the table. Hudi passes this to implementations of evolution of schemaConfig Param: INTERNAL_SCHEMA_STRING |
| hoodie.write.record.merge.custom.implementation.classes | (N/A) | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These record merge impls will filter by hoodie.write.record.merge.strategy.idHudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)Config Param: RECORD_MERGE_IMPL_CLASSESSince Version: 0.13.0 |
| hoodie.write.record.merge.strategy.id | (N/A) | ID of record merge strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.write.record.merge.custom.implementation.classes which has the same merge strategy idConfig Param: RECORD_MERGE_STRATEGY_IDSince Version: 0.13.0 |
| hoodie.write.schema | (N/A) | Config allowing to override writer's schema. This might be necessary in cases when writer's schema derived from the incoming dataset might actually be different from the schema we actually want to use when writing. This, for ex, could be the case for'partial-update' use-cases (like MERGE INTO Spark SQL statement for ex) where only a projection of the incoming dataset might be used to update the records in the existing table, prompting us to override the writer's schemaConfig Param: WRITE_SCHEMA_OVERRIDE |
| _.hoodie.allow.multi.write.on.same.instant | false | Config Param: ALLOW_MULTI_WRITE_ON_SAME_INSTANT_ENABLE |
| _hoodie.record.size.estimator.max.commits | 5 | The maximum number of commits that will be read to estimate the avg record size. This makes sure we parse a limited number of commit metadata, as parsing the entire active timeline can be expensive and unnecessary.Config Param: RECORD_SIZE_ESTIMATOR_MAX_COMMITSSince Version: 1.0.0 |
| hoodie.allow.empty.commit | true | Whether to allow generation of empty commits, even if no data was written in the commit. It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting dataConfig Param: ALLOW_EMPTY_COMMIT |
| hoodie.allow.operation.metadata.field | false | Whether to include '_hoodie_operation' in the metadata fields. Once enabled, all the changes of a record are persisted to the delta log directly without mergeConfig Param: ALLOW_OPERATION_METADATA_FIELDSince Version: 0.9.0 |
| hoodie.auto.adjust.lock.configs | false | Auto adjust lock configurations when metadata table is enabled and for async table services.Config Param: AUTO_ADJUST_LOCK_CONFIGSSince Version: 0.11.0 |
| hoodie.avro.schema.external.transformation | false | When enabled, records in older schema are rewritten into newer schema during upsert,delete and background compaction,clustering operations.Config Param: AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE |
| hoodie.avro.schema.validate | false | Validate the schema used for the write against the latest schema, for backwards compatibility.Config Param: AVRO_SCHEMA_VALIDATE_ENABLE |
| hoodie.base.file.format | PARQUET | File format to store all the base file data. org.apache.hudi.common.model.HoodieFileFormat: Hoodie file formats. PARQUET(default): Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. HFILE: (internal config) File format for metadata table. A file of sorted key/value pairs. Both keys and values are byte arrays. ORC: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.Config Param: BASE_FILE_FORMAT |
| hoodie.bulkinsert.shuffle.parallelism | 0 | For large initial imports using bulk_insert operation, controls the parallelism to use for sort modes or custom partitioning done before writing records to the table. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data or the parallelism based on the logical plan for row writer. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the bulk insert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.Config Param: BULKINSERT_PARALLELISM_VALUE |
| hoodie.bulkinsert.sort.mode | NONE | org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting records during bulk insert. NONE(default): No sorting. Fastest and matches spark.write.parquet() in number of files and overhead. GLOBAL_SORT: This ensures best file sizes, with lowest memory overhead at cost of sorting. PARTITION_SORT: Strikes a balance by only sorting within a Spark RDD partition, still keeping the memory overhead of writing low. File sizing is not as good as GLOBAL_SORT. PARTITION_PATH_REPARTITION: This ensures that the data for a single physical partition in the table is written by the same Spark executor. This should only be used when input data is evenly distributed across different partition paths. If data is skewed (most records are intended for a handful of partition paths among all) then this can cause an imbalance among Spark executors. PARTITION_PATH_REPARTITION_AND_SORT: This ensures that the data for a single physical partition in the table is written by the same Spark executor. This should only be used when input data is evenly distributed across different partition paths. Compared to PARTITION_PATH_REPARTITION, this sort mode does an additional step of sorting the records based on the partition path within a single Spark partition, given that data for multiple physical partitions can be sent to the same Spark partition and executor. If data is skewed (most records are intended for a handful of partition paths among all) then this can cause an imbalance among Spark executors.Config Param: BULK_INSERT_SORT_MODE |
| hoodie.bulkinsert.sort.suffix.record_key | false | When using user defined sort columns there can be possibility of skew because spark's RangePartitioner used in sort can reduce the number of outputSparkPartitionsif the sampled dataset has a low cardinality on the provided sort columns. This can cause an increase in commit durations as we are not leveraging the original parallelism.Enabling this config suffixes the record key at the end to avoid skew.This config is used by RowCustomColumnsSortPartitioner, RDDCustomColumnsSortPartitioner and JavaCustomColumnsSortPartitionerConfig Param: BULKINSERT_SUFFIX_RECORD_KEY_SORT_COLUMNSSince Version: 1.0.0 |
| hoodie.cdc.file.group.iterator.memory.spill.bytes | 104857600 | Amount of memory in bytes to be used in bytes for CDCFileGroupIterator holding data in-memory, before spilling to disk.Config Param: CDC_FILE_GROUP_ITERATOR_MEMORY_SPILL_BYTESSince Version: 1.0.1 |
| hoodie.client.heartbeat.interval_in_ms | 60000 | Writers perform heartbeats to indicate liveness. Controls how often (in ms), such heartbeats are registered to lake storage.Config Param: CLIENT_HEARTBEAT_INTERVAL_IN_MS |
| hoodie.client.heartbeat.tolerable.misses | 2 | Number of heartbeat misses, before a writer is deemed not alive and all pending writes are aborted.Config Param: CLIENT_HEARTBEAT_NUM_TOLERABLE_MISSES |
| hoodie.client.init.callback.classes | Fully-qualified class names of the Hudi client init callbacks to run at the initialization of the Hudi client. The class names are separated by ,. The class must be a subclass of org.apache.hudi.callback.HoodieClientInitCallback.By default, no Hudi client init callback is executed.Config Param: CLIENT_INIT_CALLBACK_CLASS_NAMESSince Version: 0.14.0 | |
| hoodie.clustering.fail.on.pending.ingestion.during.conflict.resolution | false | Only applicable when "hoodie.write.concurrency.mode" is set to OCC or NBCC and the conflict resolution strategy ("hoodie.write.conflict.resolution.strategy") is set to PreferWriterConflictResolutionStrategy. When enabled, proactively prevents clustering from committing if there are any ongoing ingestion writes that have not transitioned from requested to inflight yet and have an active heartbeat, since ingestion may be targeting the same files and should have precedence.Config Param: CLUSTERING_BLOCK_FOR_PENDING_INGESTION |
| hoodie.combine.before.delete | true | During delete operations, controls whether we should combine deletes (and potentially also upserts) before writing to storage.Config Param: COMBINE_BEFORE_DELETE |
| hoodie.combine.before.insert | false | When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage.Config Param: COMBINE_BEFORE_INSERT |
| hoodie.combine.before.upsert | true | When upserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.Config Param: COMBINE_BEFORE_UPSERT |
| hoodie.compact.merge.handle.class | org.apache.hudi.io.FileGroupReaderBasedMergeHandle | Merge handle class for compactionConfig Param: COMPACT_MERGE_HANDLE_CLASS_NAMESince Version: 1.1.0 |
| hoodie.consistency.check.initial_interval_ms | 2000 | Initial time between successive attempts to ensure written data's metadata is consistent on storage. Grows with exponential backoff after the initial value.Config Param: INITIAL_CONSISTENCY_CHECK_INTERVAL_MS |
| hoodie.consistency.check.max_checks | 7 | Maximum number of checks, for consistency of written data.Config Param: MAX_CONSISTENCY_CHECKS |
| hoodie.consistency.check.max_interval_ms | 300000 | Max time to wait between successive attempts at performing consistency checksConfig Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS |
| hoodie.datasource.write.keygenerator.type | SIMPLE | Note This is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead. org.apache.hudi.keygen.constant.KeyGeneratorType: Key generator type, indicating the key generator class to use, that implements org.apache.hudi.keygen.KeyGenerator. SIMPLE(default): Simple key generator, which takes names of fields to be used for recordKey and partitionPath as configs. SIMPLE_AVRO: Simple key generator, which takes names of fields to be used for recordKey and partitionPath as configs. COMPLEX: Complex key generator, which takes names of fields to be used for recordKey and partitionPath as configs. COMPLEX_AVRO: Complex key generator, which takes names of fields to be used for recordKey and partitionPath as configs. TIMESTAMP: Timestamp-based key generator, that relies on timestamps for partitioning field. Still picks record key by name. TIMESTAMP_AVRO: Timestamp-based key generator, that relies on timestamps for partitioning field. Still picks record key by name. CUSTOM: This is a generic implementation type of KeyGenerator where users can configure record key as a single field or a combination of fields. Similarly partition path can be configured to have multiple fields or only one field. This KeyGenerator expects value for prop "hoodie.datasource.write.partitionpath.field" in a specific format. For example: properties.put("hoodie.datasource.write.partitionpath.field", "field1:PartitionKeyType1,field2:PartitionKeyType2"). CUSTOM_AVRO: This is a generic implementation type of KeyGenerator where users can configure record key as a single field or a combination of fields. Similarly partition path can be configured to have multiple fields or only one field. This KeyGenerator expects value for prop "hoodie.datasource.write.partitionpath.field" in a specific format. For example: properties.put("hoodie.datasource.write.partitionpath.field", "field1:PartitionKeyType1,field2:PartitionKeyType2"). NON_PARTITION: Simple Key generator for non-partitioned tables. NON_PARTITION_AVRO: Simple Key generator for non-partitioned tables. GLOBAL_DELETE: Key generator for deletes using global indices. GLOBAL_DELETE_AVRO: Key generator for deletes using global indices. AUTO_RECORD: Automatic record key generation. AUTO_RECORD_AVRO: Automatic record key generation. HOODIE_TABLE_METADATA: Custom key generator for the Hudi table metadata. SPARK_SQL: Custom spark-sql specific KeyGenerator overriding behavior handling TimestampType partition values. SPARK_SQL_UUID: A KeyGenerator which use the uuid as the record key. SPARK_SQL_MERGE_INTO: Meant to be used internally for the spark sql MERGE INTO command. USER_PROVIDED: A KeyGenerator specified from the configuration.Config Param: KEYGENERATOR_TYPE |
| hoodie.datasource.write.schema.allow.auto.evolution.column.drop | false | Controls whether table's schema is allowed to automatically evolve when incoming batch's schema can have any of the columns dropped. By default, Hudi will not allow this kind of (auto) schema evolution. Set this config to true to allow table's schema to be updated automatically when columns are dropped from the new incoming batch.Config Param: SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROPSince Version: 0.13.0 |
| hoodie.delete.shuffle.parallelism | 0 | Parallelism used for delete operation. Delete operations also performs shuffles, similar to upsert operation. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism.Config Param: DELETE_PARALLELISM_VALUE |
| hoodie.embed.timeline.server | true | When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics),running on each writer's driver process, accepting requests during the write from executors.Config Param: EMBEDDED_TIMELINE_SERVER_ENABLE |
| hoodie.embed.timeline.server.async | false | Controls whether or not, the requests to the timeline server are processed in asynchronous fashion, potentially improving throughput.Config Param: EMBEDDED_TIMELINE_SERVER_USE_ASYNC_ENABLE |
| hoodie.embed.timeline.server.gzip | true | Controls whether gzip compression is used, for large responses from the timeline server, to improve latency.Config Param: EMBEDDED_TIMELINE_SERVER_COMPRESS_ENABLE |
| hoodie.embed.timeline.server.port | 0 | Port at which the timeline server listens for requests. When running embedded in each writer, it picks a free port and communicates to all the executors. This should rarely be changed.Config Param: EMBEDDED_TIMELINE_SERVER_PORT_NUM |
| hoodie.embed.timeline.server.reuse.enabled | false | Controls whether the timeline server instance should be cached and reused across the tablesto avoid startup costs and server overhead. This should only be used if you are running multiple writers in the same JVM.Config Param: EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED |
| hoodie.embed.timeline.server.threads | -1 | Number of threads to serve requests in the timeline server. By default, auto configured based on the number of underlying cores.Config Param: EMBEDDED_TIMELINE_NUM_SERVER_THREADS |
| hoodie.fail.on.timeline.archiving | true | Timeline archiving removes older instants from the timeline, after each write operation, to minimize metadata overhead. Controls whether or not, the write should be failed as well, if such archiving fails.Config Param: FAIL_ON_TIMELINE_ARCHIVING_ENABLE |
| hoodie.fail.writes.on.inline.table.service.exception | true | Table services such as compaction and clustering can fail and prevent syncing to the metaclient. Set this to true to fail writes when table services failConfig Param: FAIL_ON_INLINE_TABLE_SERVICE_EXCEPTIONSince Version: 0.13.0 |
| hoodie.fileid.prefix.provider.class | org.apache.hudi.table.RandomFileIdPrefixProvider | File Id Prefix provider class, that implements org.apache.hudi.fileid.FileIdPrefixProviderConfig Param: FILEID_PREFIX_PROVIDER_CLASSSince Version: 0.10.0 |
| hoodie.finalize.write.parallelism | 200 | Parallelism for the write finalization internal operation, which involves removing any partially written files from lake storage, before committing the write. Reduce this value, if the high number of tasks incur delays for smaller tables or low latency writes.Config Param: FINALIZE_WRITE_PARALLELISM_VALUE |
| hoodie.insert.shuffle.parallelism | 0 | Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the insert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.Config Param: INSERT_PARALLELISM_VALUE |
| hoodie.markers.delete.parallelism | 100 | Determines the parallelism for deleting marker files, which are used to track all files (valid or invalid/partial) written during a write operation. Increase this value if delays are observed, with large batch writes.Config Param: MARKERS_DELETE_PARALLELISM_VALUE |
| hoodie.markers.timeline_server_based.batch.interval_ms | 50 | The batch interval in milliseconds for marker creation batch processingConfig Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_INTERVAL_MSSince Version: 0.9.0 |
| hoodie.markers.timeline_server_based.batch.num_threads | 20 | Number of threads to use for batch processing marker creation requests at the timeline serverConfig Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_NUM_THREADSSince Version: 0.9.0 |
| hoodie.merge.allow.duplicate.on.inserts | true | When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing). This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.Config Param: MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE |
| hoodie.merge.data.validation.enabled | false | When enabled, data validation checks are performed during merges to ensure expected number of records after merge operation.Config Param: MERGE_DATA_VALIDATION_CHECK_ENABLE |
| hoodie.merge.small.file.group.candidates.limit | 1 | Limits number of file groups, whose base file satisfies small-file limit, to consider for appending records during upsert operation. Only applicable to MOR tablesConfig Param: MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT |
| hoodie.record.size.estimator.average.metadata.size | 0 | The approximate metadata size in bytes to subtract from the file size when estimating the record size.Config Param: RECORD_SIZE_ESTIMATOR_AVERAGE_METADATA_SIZESince Version: 1.0.0 |
| hoodie.record.size.estimator.class | org.apache.hudi.estimator.AverageRecordSizeEstimator | Class that estimates the size of records written by implementing org.apache.hudi.estimator.RecordSizeEstimator. Default implementation is org.apache.hudi.estimator.AverageRecordSizeEstimatorConfig Param: RECORD_SIZE_ESTIMATOR_CLASS_NAMESince Version: 1.0.0 |
| hoodie.release.resource.on.completion.enable | true | Control to enable release all persist rdds when the spark job finish.Config Param: RELEASE_RESOURCE_ENABLESince Version: 0.11.0 |
| hoodie.rollback.avoid.duplicate.plan | false | When enabled in multi-writer mode, before scheduling a new rollback plan, the writer reloads the timeline under lock to check if another writer already scheduled one for the same failed commit. This avoids duplicate rollback instants and uses heartbeats to ensure only one writer executes the rollback at a time.Config Param: ROLLBACK_AVOID_DUPLICATE_PLAN |
| hoodie.rollback.instant.backup.dir | .rollback_backup | Path where instants being rolled back are copied. If not absolute path then a directory relative to .hoodie folder is created.Config Param: ROLLBACK_INSTANT_BACKUP_DIRECTORY |
| hoodie.rollback.instant.backup.enabled | false | Backup instants removed during rollback and restore (useful for debugging)Config Param: ROLLBACK_INSTANT_BACKUP_ENABLED |
| hoodie.rollback.parallelism | 100 | This config controls the parallelism for rollback of commits. Rollbacks perform deletion of files or logging delete blocks to file groups on storage in parallel. The configure value limits the parallelism so that the number of Spark tasks do not exceed the value. If rollback is slow due to the limited parallelism, you can increase this to tune the performance.Config Param: ROLLBACK_PARALLELISM_VALUE |
| hoodie.rollback.using.markers | true | Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned on by default.Config Param: ROLLBACK_USING_MARKERS_ENABLE |
| hoodie.sensitive.config.keys | ssl,tls,sasl,auth,credentials | Comma separated list of filters for sensitive config keys. Hudi Streamer will not print any configuration which contains the configured filter. For example with a configured filter ssl, value for config ssl.trustore.location would be masked.Config Param: SENSITIVE_CONFIG_KEYS_FILTERSince Version: 0.14.0 |
| hoodie.skip.default.partition.validation | false | When table is upgraded from pre 0.12 to 0.12, we check for "default" partition and fail if found one. Users are expected to rewrite the data in those partitions. Enabling this config will bypass this validationConfig Param: SKIP_DEFAULT_PARTITION_VALIDATIONSince Version: 0.12.0 |
| hoodie.table.services.enabled | true | Master control to disable all table services including archive, clean, compact, cluster, etc.Config Param: TABLE_SERVICES_ENABLEDSince Version: 0.11.0 |
| hoodie.table.services.incremental.enabled | true | Whether to enable incremental table service. So far Clustering and Compaction support incremental processing.Config Param: INCREMENTAL_TABLE_SERVICE_ENABLEDSince Version: 1.0.0 |
| hoodie.timeline.layout.version | 2 | Controls the layout of the timeline. Version 0 relied on renames, Version 1 (default) models the timeline as an immutable log relying only on atomic writes for object storage.Config Param: TIMELINE_LAYOUT_VERSION_NUMSince Version: 0.5.1 |
| hoodie.upsert.shuffle.parallelism | 0 | Parallelism to use for upsert operation on the table. Upserts can shuffle data to perform index lookups, file sizing, bin packing records optimally into file groups. Before 0.13.0 release, if users do not configure it, Hudi would use 200 as the default shuffle parallelism. From 0.13.0 onwards Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured parallelism is used in defining the actual parallelism. If you observe small files from the upsert operation, we suggest configuring this shuffle parallelism explicitly, so that the parallelism is around total_input_data_size/120MB.Config Param: UPSERT_PARALLELISM_VALUE |
| hoodie.write.application.id | Unknown | Application identifier (e.g. Spark application id) used to populate lock metadata so lock holders can be identified.Config Param: APPLICATION_ID |
| hoodie.write.buffer.limit.bytes | 4194304 | Size of in-memory buffer used for parallelizing network reads and lake storage writes.Config Param: WRITE_BUFFER_LIMIT_BYTES_VALUE |
| hoodie.write.buffer.record.cache.limit | 131072 | Maximum queue size of in-memory buffer for parallelizing network reads and lake storage writes.Config Param: WRITE_BUFFER_RECORD_CACHE_LIMITSince Version: 0.15.0 |
| hoodie.write.buffer.record.sampling.rate | 64 | Sampling rate of in-memory buffer used to estimate object size. Higher value lead to lower CPU usage.Config Param: WRITE_BUFFER_RECORD_SAMPLING_RATESince Version: 0.15.0 |
| hoodie.write.complex.keygen.new.encoding | false | This config only takes effect for writing table version 8 and below. If set to false, the record key field name is encoded and prepended in the case where a single record key field is used in the complex key generator, i.e., record keys stored in _hoodie_record_key meta field is in the format of <field_name>:<field_value>, which conforms to the behavior in 0.14.0 release and older. If set to true, the record key field name is not encoded under the same case in the complex key generator, i.e., record keys stored in _hoodie_record_key meta field is in the format of <field_value>, which conforms to the behavior in 0.14.1, 0.15.0, 1.0.0, 1.0.1, 1.0.2 releases.Config Param: COMPLEX_KEYGEN_NEW_ENCODINGSince Version: 1.1.0 |
| hoodie.write.complex.keygen.validation.enable | true | This config only takes effect for writing table version 8 and below, upgrade or downgrade. If set to true, the writer enables the validation on whether the table uses the complex key generator with a single record key field, which can be affected by a breaking change in 0.14.1, 0.15.0, 1.0.0, 1.0.1, 1.0.2 releases, causing key encoding change and potential duplicates in the table. The validation fails the pipeline if the table meets the condition for the user to take proper action. The user can turn this validation off by setting the config to false, after evaluating the table and situation and doing table repair if needed.Config Param: ENABLE_COMPLEX_KEYGEN_VALIDATIONSince Version: 1.1.0 |
| hoodie.write.concat.handle.class | org.apache.hudi.io.HoodieConcatHandle | The merge handle class to use to concat the records from a base file with an iterator of incoming records.Config Param: CONCAT_HANDLE_CLASS_NAMESince Version: 1.1.0 |
| hoodie.write.concurrency.async.conflict.detector.initial_delay_ms | 0 | Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The time in milliseconds to delay the first execution of async marker-based conflict detection.Config Param: ASYNC_CONFLICT_DETECTOR_INITIAL_DELAY_MSSince Version: 0.13.0 |
| hoodie.write.concurrency.async.conflict.detector.period_ms | 30000 | Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The period in milliseconds between successive executions of async marker-based conflict detection.Config Param: ASYNC_CONFLICT_DETECTOR_PERIOD_MSSince Version: 0.13.0 |
| hoodie.write.concurrency.early.conflict.check.commit.conflict | false | Whether to enable commit conflict checking or not during early conflict detection.Config Param: EARLY_CONFLICT_DETECTION_CHECK_COMMIT_CONFLICTSince Version: 0.13.0 |
| hoodie.write.concurrency.early.conflict.detection.enable | false | Whether to enable early conflict detection based on markers. It eagerly detects writing conflict before create markers and fails fast if a conflict is detected, to release cluster compute resources as soon as possible.Config Param: EARLY_CONFLICT_DETECTION_ENABLESince Version: 0.13.0 |
| hoodie.write.concurrency.early.conflict.detection.strategy | The class name of the early conflict detection strategy to use. This should be a subclass of org.apache.hudi.common.conflict.detection.EarlyConflictDetectionStrategy.Config Param: EARLY_CONFLICT_DETECTION_STRATEGY_CLASS_NAMESince Version: 0.13.0 | |
| hoodie.write.concurrency.schema.conflict.resolution.enable | true | If turned on, we detect and abort incompatible concurrent schema evolution.Config Param: ENABLE_SCHEMA_CONFLICT_RESOLUTIONSince Version: 1.1.0 |
| hoodie.write.executor.disruptor.buffer.limit.bytes | 1024 | The size of the Disruptor Executor ring buffer, must be power of 2Config Param: WRITE_EXECUTOR_DISRUPTOR_BUFFER_LIMIT_BYTESSince Version: 0.13.0 |
| hoodie.write.executor.disruptor.wait.strategy | BLOCKING_WAIT | org.apache.hudi.common.util.queue.DisruptorWaitStrategyType: Strategy employed for making Disruptor Executor wait on a cursor. BLOCKING_WAIT(default): The slowest of the available wait strategies. However, it is the most conservative with the respect to CPU usage and will give the most consistent behaviour across the widest variety of deployment options. SLEEPING_WAIT: Like the BLOCKING_WAIT strategy, it attempts to be conservative with CPU usage by using a simple busy wait loop. The difference is that the SLEEPING_WAIT strategy uses a call to LockSupport.parkNanos(1) in the middle of the loop. On a typical Linux system this will pause the thread for around 60µs. YIELDING_WAIT: The YIELDING_WAIT strategy is one of two wait strategy that can be used in low-latency systems. It is designed for cases where there is an opportunity to burn CPU cycles with the goal of improving latency. The YIELDING_WAIT strategy will busy spin, waiting for the sequence to increment to the appropriate value. Inside the body of the loop Thread#yield() will be called allowing other queued threads to run. This is the recommended wait strategy when you need very high performance, and the number of EventHandler threads is lower than the total number of logical cores, such as when hyper-threading is enabled. BUSY_SPIN_WAIT: The BUSY_SPIN_WAIT strategy is the highest performing wait strategy. Like the YIELDING_WAIT strategy, it can be used in low-latency systems, but puts the highest constraints on the deployment environment.Config Param: WRITE_EXECUTOR_DISRUPTOR_WAIT_STRATEGYSince Version: 0.13.0 |
| hoodie.write.executor.type | SIMPLE | org.apache.hudi.common.util.queue.ExecutorType: Types of executor that implements org.apache.hudi.common.util.queue.HoodieExecutor. The executor orchestrates concurrent producers and consumers communicating through a message queue. BOUNDED_IN_MEMORY: Executor which orchestrates concurrent producers and consumers communicating through a bounded in-memory message queue using LinkedBlockingQueue. This queue will use extra lock to balance producers and consumers. DISRUPTOR: Executor which orchestrates concurrent producers and consumers communicating through disruptor as a lock free message queue to gain better writing performance. Although DisruptorExecutor is still an experimental feature. SIMPLE(default): Executor with no inner message queue and no inner lock. Consuming and writing records from iterator directly. The advantage is that there is no need for additional memory and cpu resources due to lock or multithreading. The disadvantage is that the executor is a single-write-single-read model, cannot support functions such as speed limit and can not de-couple the network read (shuffle read) and network write (writing objects/files to storage) anymore.Config Param: WRITE_EXECUTOR_TYPESince Version: 0.13.0 |
| hoodie.write.markers.type | TIMELINE_SERVER_BASED | org.apache.hudi.common.table.marker.MarkerType: Marker type indicating how markers are stored in the file system, used for identifying the files written and cleaning up files not committed which should be deleted. DIRECT: Individual marker file corresponding to each data file is directly created by the writer. TIMELINE_SERVER_BASED(default): Marker operations are all handled at the timeline service which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency. If HDFS is used or timeline server is disabled, DIRECT markers are used as fallback even if this is configured. This configuration does not take effect for Spark structured streaming; DIRECT markers are always used.Config Param: MARKERS_TYPESince Version: 0.9.0 |
| hoodie.write.merge.handle.class | org.apache.hudi.io.FileGroupReaderBasedMergeHandle | The merge handle class that implements interface HoodieMergeHandle to merge the records from a base file with an iterator of incoming records or a map of updates and deletes from log files at a file group level.Config Param: MERGE_HANDLE_CLASS_NAMESince Version: 1.1.0 |
| hoodie.write.merge.handle.fallback | true | When using a custom Hoodie Merge Handle Implementation controlled by the config hoodie.write.merge.handle.class or when using a custom Hoodie Concat Handle Implementation controlled by the config hoodie.write.concat.handle.class, enabling this config results in fallback to the default implementations if instantiation of the custom implementation failsConfig Param: MERGE_HANDLE_PERFORM_FALLBACKSince Version: 1.1.0 |
| hoodie.write.num.retries.on.conflict.failures | 0 | Maximum number of times to retry a batch on conflict failure.Config Param: NUM_RETRIES_ON_CONFLICT_FAILURESSince Version: 0.14.0 |
| hoodie.write.partial.update.schema | Avro schema of the partial updates. This is automatically set by the Hudi write client and user is not expected to manually change the value.Config Param: WRITE_PARTIAL_UPDATE_SCHEMASince Version: 1.0.0 | |
| hoodie.write.record.positions | true | Whether to write record positions to the block header for data blocks containing updates and delete blocks. The record positions can be used to improve the performance of merging records from base and log files.Config Param: WRITE_RECORD_POSITIONSSince Version: 1.0.0 |
| hoodie.write.rolling.metadata.keys | Comma-separated list of extra metadata keys that should be automatically carried forward to every new commit. These keys will be read from recent commit metadata and included in new commits, ensuring they remain accessible without walking the timeline or worrying about archival. This is useful for tracking checkpoint information (e.g., Kafka offsets, Flink checkpoints) or any metadata that needs to persist across commits. New values override old ones. Only applies to data table commits.Config Param: ROLLING_METADATA_KEYSSince Version: 1.2.0 | |
| hoodie.write.rolling.metadata.timeline.lookback.commits | 10 | Maximum number of completed commits to walk back in the timeline when searching for rolling metadata keys. If a rolling metadata key is not found in the latest commit, the system will walk back up to this many commits to find the most recent value. This ensures rolling metadata is preserved even if some commits don't update all keys. Higher values provide more resilience but may impact performance. Only applies when hoodie.write.rolling.metadata.keys is configured.Config Param: ROLLING_METADATA_TIMELINE_LOOKBACK_COMMITSSince Version: 1.2.0 |
| hoodie.write.status.storage.level | MEMORY_AND_DISK_SER | Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.Config Param: WRITE_STATUS_STORAGE_LEVEL_VALUE |
| hoodie.write.tagged.record.storage.level | MEMORY_AND_DISK_SER | Determine what level of persistence is used to cache write RDDs. Refer to org.apache.spark.storage.StorageLevel for different valuesConfig Param: TAGGED_RECORD_STORAGE_LEVEL_VALUE |
| hoodie.write.track.event.time.watermark | false | Records event time watermark metadata in commit metadata when enabledConfig Param: TRACK_EVENT_TIME_WATERMARKSince Version: 1.1.0 |
| hoodie.writestatus.class | org.apache.hudi.client.WriteStatus | Subclass of org.apache.hudi.client.WriteStatus to be used to collect information about a write. Can be overridden to collection additional metrics/statistics about the data if needed.Config Param: WRITE_STATUS_CLASS_NAME |
Commit Callback Configs
Configurations controlling callback behavior into HTTP endpoints, to push notifications on commits on hudi tables.
Write commit callback configs
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.commit.callback.http.custom.headers | (N/A) | Http callback custom headers. Format: HeaderName1:HeaderValue1;HeaderName2:HeaderValue2Config Param: CALLBACK_HTTP_CUSTOM_HEADERSSince Version: 0.15.0 |
| hoodie.write.commit.callback.http.url | (N/A) | Callback host to be sent along with callback messagesConfig Param: CALLBACK_HTTP_URLSince Version: 0.6.0 |
| hoodie.write.commit.callback.class | org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback | Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by defaultConfig Param: CALLBACK_CLASS_NAMESince Version: 0.6.0 |
| hoodie.write.commit.callback.http.api.key | hudi_write_commit_http_callback | Http callback API key. hudi_write_commit_http_callback by defaultConfig Param: CALLBACK_HTTP_API_KEY_VALUESince Version: 0.6.0 |
| hoodie.write.commit.callback.http.timeout.seconds | 30 | Callback timeout in seconds.Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDSSince Version: 0.6.0 |
| hoodie.write.commit.callback.on | false | Turn commit callback on/off. off by default.Config Param: TURN_CALLBACK_ONSince Version: 0.6.0 |
Write commit Kafka callback configs
Controls notifications sent to Kafka, on events happening to a hudi table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.commit.callback.kafka.bootstrap.servers | (N/A) | Bootstrap servers of kafka cluster, to be used for publishing commit metadata.Config Param: BOOTSTRAP_SERVERSSince Version: 0.7.0 |
| hoodie.write.commit.callback.kafka.partition | (N/A) | It may be desirable to serialize all changes into a single Kafka partition for providing strict ordering. By default, Kafka messages are keyed by table name, which guarantees ordering at the table level, but not globally (or when new partitions are added)Config Param: PARTITIONSince Version: 0.7.0 |
| hoodie.write.commit.callback.kafka.topic | (N/A) | Kafka topic name to publish timeline activity into.Config Param: TOPICSince Version: 0.7.0 |
| hoodie.write.commit.callback.kafka.acks | all | kafka acks level, all by default to ensure strong durability.Config Param: ACKSSince Version: 0.7.0 |
| hoodie.write.commit.callback.kafka.retries | 3 | Times to retry the produce. 3 by defaultConfig Param: RETRIESSince Version: 0.7.0 |
Write commit pulsar callback configs
Controls notifications sent to pulsar, on events happening to a hudi table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.commit.callback.pulsar.broker.service.url | (N/A) | Server's url of pulsar cluster, to be used for publishing commit metadata.Config Param: BROKER_SERVICE_URLSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.topic | (N/A) | pulsar topic name to publish timeline activity into.Config Param: TOPICSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.connection-timeout | 10s | Duration of waiting for a connection to a broker to be established.Config Param: CONNECTION_TIMEOUTSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.keepalive-interval | 30s | Duration of keeping alive interval for each client broker connection.Config Param: KEEPALIVE_INTERVALSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.operation-timeout | 30s | Duration of waiting for completing an operation.Config Param: OPERATION_TIMEOUTSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.producer.block-if-queue-full | true | When the queue is full, the method is blocked instead of an exception is thrown.Config Param: PRODUCER_BLOCK_QUEUE_FULLSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.producer.pending-queue-size | 1000 | The maximum size of a queue holding pending messages.Config Param: PRODUCER_PENDING_QUEUE_SIZESince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.producer.pending-total-size | 50000 | The maximum number of pending messages across partitions.Config Param: PRODUCER_PENDING_SIZESince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.producer.route-mode | RoundRobinPartition | Message routing logic for producers on partitioned topics.Config Param: PRODUCER_ROUTE_MODESince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.producer.send-timeout | 30s | The timeout in each sending to pulsar.Config Param: PRODUCER_SEND_TIMEOUTSince Version: 0.11.0 |
| hoodie.write.commit.callback.pulsar.request-timeout | 60s | Duration of waiting for completing a request.Config Param: REQUEST_TIMEOUTSince Version: 0.11.0 |
Lock Configs
Configurations that control locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi's own table services are auto managed internally.
Common Lock Configurations
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.lock.heartbeat_interval_ms | 60000 | Heartbeat interval in ms, to send a heartbeat to indicate that hive client holding locks.Config Param: LOCK_HEARTBEAT_INTERVAL_MSSince Version: 0.15.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.lock.filesystem.path | (N/A) | For DFS based lock providers, path to store the locks under. use Table's meta path as defaultConfig Param: FILESYSTEM_LOCK_PATHSince Version: 0.8.0 |
| hoodie.write.lock.hivemetastore.database | (N/A) | For Hive based lock provider, the Hive database to acquire lock againstConfig Param: HIVE_DATABASE_NAMESince Version: 0.8.0 |
| hoodie.write.lock.hivemetastore.table | (N/A) | For Hive based lock provider, the Hive table to acquire lock againstConfig Param: HIVE_TABLE_NAMESince Version: 0.8.0 |
| hoodie.write.lock.hivemetastore.uris | (N/A) | For Hive based lock provider, the Hive metastore URI to acquire locks against.Config Param: HIVE_METASTORE_URISince Version: 0.8.0 |
| hoodie.write.lock.provider | (N/A) | Lock provider class name, user can provide their own implementation of LockProvider which should be subclass of org.apache.hudi.common.lock.LockProviderConfig Param: LOCK_PROVIDER_CLASS_NAMESince Version: 0.8.0 |
| hoodie.write.lock.zookeeper.base_path | (N/A) | The base path on Zookeeper under which to create lock related ZNodes. This should be same for all concurrent writers to the same tableConfig Param: ZK_BASE_PATHSince Version: 0.8.0 |
| hoodie.write.lock.zookeeper.port | (N/A) | Zookeeper port to connect to.Config Param: ZK_PORTSince Version: 0.8.0 |
| hoodie.write.lock.zookeeper.url | (N/A) | Zookeeper URL to connect to.Config Param: ZK_CONNECT_URLSince Version: 0.8.0 |
| hoodie.write.lock.client.num_retries | 50 | Maximum number of times to retry to acquire lock additionally from the lock manager.Config Param: LOCK_ACQUIRE_CLIENT_NUM_RETRIESSince Version: 0.8.0 |
| hoodie.write.lock.client.wait_time_ms_between_retry | 5000 | Amount of time to wait between retries on the lock provider by the lock managerConfig Param: LOCK_ACQUIRE_CLIENT_RETRY_WAIT_TIME_IN_MILLISSince Version: 0.8.0 |
| hoodie.write.lock.conflict.resolution.strategy | org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy | Lock provider class name, this should be subclass of org.apache.hudi.client.transaction.ConflictResolutionStrategyConfig Param: WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAMESince Version: 0.8.0 |
| hoodie.write.lock.filesystem.expire | 0 | For DFS based lock providers, expire time in minutes, must be a non-negative number, default means no expireConfig Param: FILESYSTEM_LOCK_EXPIRESince Version: 0.12.0 |
| hoodie.write.lock.max_wait_time_ms_between_retry | 16000 | Maximum amount of time to wait between retries by lock provider client. This bounds the maximum delay from the exponential backoff. Currently used by ZK based lock provider only.Config Param: LOCK_ACQUIRE_RETRY_MAX_WAIT_TIME_IN_MILLISSince Version: 0.8.0 |
| hoodie.write.lock.num_retries | 15 | Maximum number of times to retry lock acquire, at each lock providerConfig Param: LOCK_ACQUIRE_NUM_RETRIESSince Version: 0.8.0 |
| hoodie.write.lock.wait_time_ms | 60000 | Timeout in ms, to wait on an individual lock acquire() call, at the lock provider.Config Param: LOCK_ACQUIRE_WAIT_TIMEOUT_MSSince Version: 0.8.0 |
| hoodie.write.lock.wait_time_ms_between_retry | 1000 | Initial amount of time to wait between retries to acquire locks, subsequent retries will exponentially backoff.Config Param: LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLISSince Version: 0.8.0 |
| hoodie.write.lock.zookeeper.connection_timeout_ms | 15000 | Timeout in ms, to wait for establishing connection with Zookeeper.Config Param: ZK_CONNECTION_TIMEOUT_MSSince Version: 0.8.0 |
| hoodie.write.lock.zookeeper.lock_key | Key name under base_path at which to create a ZNode and acquire lock. Final path on zk will look like base_path/lock_key. If this parameter is not set, we would set it as the table nameConfig Param: ZK_LOCK_KEYSince Version: 0.8.0 | |
| hoodie.write.lock.zookeeper.session_timeout_ms | 60000 | Timeout in ms, to wait after losing connection to ZooKeeper, before the session is expiredConfig Param: ZK_SESSION_TIMEOUT_MSSince Version: 0.8.0 |
Azure based Locks Configurations
Configs that control Azure Blob/ADLS based locking mechanisms required for concurrency control between writers to a Hudi table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.lock.azure.client.id | (N/A) | For Azure based lock provider, Azure AD application (client) ID used together with 'hoodie.write.lock.azure.client.tenant.id' and 'hoodie.write.lock.azure.client.secret' to authenticate via service principal (ClientSecretCredential). All three must be set for this auth mode to activate.Config Param: AZURE_CLIENT_IDSince Version: 1.2.0 |
| hoodie.write.lock.azure.client.secret | (N/A) | For Azure based lock provider, Azure AD client secret used together with 'hoodie.write.lock.azure.client.tenant.id' and 'hoodie.write.lock.azure.client.id' to authenticate via service principal (ClientSecretCredential). All three must be set for this auth mode to activate.Config Param: AZURE_CLIENT_SECRETSince Version: 1.2.0 |
| hoodie.write.lock.azure.client.tenant.id | (N/A) | For Azure based lock provider, Azure AD tenant ID used together with 'hoodie.write.lock.azure.client.id' and 'hoodie.write.lock.azure.client.secret' to authenticate via service principal (ClientSecretCredential). All three must be set for this auth mode to activate.Config Param: AZURE_CLIENT_TENANT_IDSince Version: 1.2.0 |
| hoodie.write.lock.azure.connection.string | (N/A) | For Azure based lock provider, optional Azure Storage connection string used for authenticating BlobServiceClient.Config Param: AZURE_CONNECTION_STRINGSince Version: 1.2.0 |
| hoodie.write.lock.azure.managed.identity.client.id | (N/A) | For Azure based lock provider, client ID of a user-assigned managed identity to authenticate BlobServiceClient with ManagedIdentityCredential.Config Param: AZURE_MANAGED_IDENTITY_CLIENT_IDSince Version: 1.2.0 |
| hoodie.write.lock.azure.sas.token | (N/A) | For Azure based lock provider, optional SAS token used for authenticating BlobServiceClient when connection string is not provided. SAS token is not recommended for production use by Azure.Config Param: AZURE_SAS_TOKENSince Version: 1.2.0 |
DynamoDB based Locks Configurations
Configs that control DynamoDB based locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi's own table services are auto managed internally.
| Config Name | Default | Description |
|---|---|---|
| hoodie.write.lock.dynamodb.endpoint_url | (N/A) | For DynamoDB based lock provider, the url endpoint used for Amazon DynamoDB service. Useful for development with a local dynamodb instance.Config Param: DYNAMODB_ENDPOINT_URLSince Version: 0.10.1 |
| hoodie.write.lock.dynamodb.billing_mode | PAY_PER_REQUEST | For DynamoDB based lock provider, by default it is PAY_PER_REQUEST mode. Alternative is PROVISIONED.Config Param: DYNAMODB_LOCK_BILLING_MODESince Version: 0.10.0 |
| hoodie.write.lock.dynamodb.partition_key | For DynamoDB based lock provider, the partition key for the DynamoDB lock table. Each Hudi dataset should has it's unique key so concurrent writers could refer to the same partition key. By default we use the Hudi table name specified to be the partition keyConfig Param: DYNAMODB_LOCK_PARTITION_KEYSince Version: 0.10.0 | |
| hoodie.write.lock.dynamodb.read_capacity | 20 | For DynamoDB based lock provider, read capacity units when using PROVISIONED billing modeConfig Param: DYNAMODB_LOCK_READ_CAPACITYSince Version: 0.10.0 |
| hoodie.write.lock.dynamodb.region | us-east-1 | For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service. Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1Config Param: DYNAMODB_LOCK_REGIONSince Version: 0.10.0 |
| hoodie.write.lock.dynamodb.table | hudi_locks | For DynamoDB based lock provider, the name of the DynamoDB table acting as lock tableConfig Param: DYNAMODB_LOCK_TABLE_NAMESince Version: 0.10.0 |
| hoodie.write.lock.dynamodb.table_creation_timeout | 120000 | For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB tableConfig Param: DYNAMODB_LOCK_TABLE_CREATION_TIMEOUTSince Version: 0.10.0 |
| hoodie.write.lock.dynamodb.write_capacity | 10 | For DynamoDB based lock provider, write capacity units when using PROVISIONED billing modeConfig Param: DYNAMODB_LOCK_WRITE_CAPACITYSince Version: 0.10.0 |
| hoodie.write.lock.wait_time_ms | 60000 | Lock Acquire Wait Timeout in millisecondsConfig Param: LOCK_ACQUIRE_WAIT_TIMEOUT_MS_PROP_KEYSince Version: 0.10.0 |
Key Generator Configs
Hudi maintains keys (record key + partition path) for uniquely identifying a particular record. These configs allow developers to setup the Key generator class that extracts these out of incoming records.
Key Generator Options
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.write.partitionpath.field | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()Config Param: PARTITIONPATH_FIELD_NAME |
| hoodie.datasource.write.recordkey.field | (N/A) | Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: RECORDKEY_FIELD_NAME |
| hoodie.datasource.write.secondarykey.column | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.cConfig Param: SECONDARYKEY_COLUMN_NAME |
| hoodie.datasource.write.hive_style_partitioning | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)Config Param: HIVE_STYLE_PARTITIONING_ENABLE |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in row-writer path, while it will be written as long value 1483023240000000 in non row-writer path. If enabled, then the timestamp value will be written in both the cases.Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLEDSince Version: 0.10.1 |
| hoodie.datasource.write.partitionpath.urlencode | false | Should we url encode the partition path value, before creating the folder structure.Config Param: URL_ENCODE_PARTITIONING |
| hoodie.datasource.write.slash.separated.date.partitioning | false | Flag to indicate whether to use slash separated date partitioning. If set to true, date partition values in yyyy-MM-dd format will be transformed to yyyy/MM/dd directory structure. By default false. Cannot be used together with hive-style partitioning.Config Param: SLASH_SEPARATED_DATE_PARTITIONING |
Timestamp-based key generator configs
Configs used for TimestampBasedKeyGenerator which relies on timestamps for the partition field. The field values are interpreted as timestamps and not just converted to string while generating partition path value for records. Record key is same as before where it is chosen by field name.
| Config Name | Default | Description |
|---|---|---|
| hoodie.keygen.timebased.timestamp.type | (N/A) | Timestamp type of the field, which should be one of the timestamp types supported: UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, EPOCHMICROSECONDS, SCALAR.Config Param: TIMESTAMP_TYPE_FIELD |
| hoodie.keygen.datetime.parser.class | org.apache.hudi.keygen.parser.HoodieDateTimeParser | Date time parser class name.Config Param: DATE_TIME_PARSER |
| hoodie.keygen.timebased.input.dateformat | Input date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.Config Param: TIMESTAMP_INPUT_DATE_FORMAT | |
| hoodie.keygen.timebased.input.dateformat.list.delimiter.regex | , | The delimiter for allowed input date format list, usually ,.Config Param: TIMESTAMP_INPUT_DATE_FORMAT_LIST_DELIMITER_REGEX |
| hoodie.keygen.timebased.input.timezone | UTC | Timezone of the input timestamp, such as UTC.Config Param: TIMESTAMP_INPUT_TIMEZONE_FORMAT |
| hoodie.keygen.timebased.output.dateformat | Output date format such as yyyy-MM-dd'T'HH:mm:ss.SSSZ.Config Param: TIMESTAMP_OUTPUT_DATE_FORMAT | |
| hoodie.keygen.timebased.output.timezone | UTC | Timezone of the output timestamp, such as UTC.Config Param: TIMESTAMP_OUTPUT_TIMEZONE_FORMAT |
| hoodie.keygen.timebased.timestamp.scalar.time.unit | SECONDS | When timestamp type SCALAR is used, this specifies the time unit, with allowed unit specified by TimeUnit enums (NANOSECONDS, MICROSECONDS, MILLISECONDS, SECONDS, MINUTES, HOURS, DAYS).Config Param: INPUT_TIME_UNIT |
| hoodie.keygen.timebased.timezone | UTC | Timezone of both input and output timestamp if they are the same, such as UTC. Please use hoodie.keygen.timebased.input.timezone and hoodie.keygen.timebased.output.timezone instead if the input and output timezones are different.Config Param: TIMESTAMP_TIMEZONE_FORMAT |
Index Configs
Configurations that control indexing behavior, which tags incoming records as either inserts or updates to older records.
Common Index Configs
| Config Name | Default | Description |
|---|---|---|
| hoodie.expression.index.function | (N/A) | Function to be used for building the expression index.Config Param: INDEX_FUNCTIONSince Version: 1.0.0 |
| hoodie.index.name | (N/A) | Name of the expression index. This is also used for the partition name in the metadata table.Config Param: INDEX_NAMESince Version: 1.0.0 |
| hoodie.table.checksum | (N/A) | Index definition checksum is used to guard against partial writes in HDFS. It is added as the last entry in index.properties and then used to validate while reading table config.Config Param: INDEX_DEFINITION_CHECKSUMSince Version: 1.0.0 |
| hoodie.expression.index.type | COLUMN_STATS | Type of the expression index. Default is column_stats if there are no functions and expressions in the command. Valid options could be BITMAP, COLUMN_STATS, LUCENE, etc. If index_type is not provided, and there are functions or expressions in the command then a expression index using column stats will be created.Config Param: INDEX_TYPESince Version: 1.0.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.expression.index.range.metadata.storage.level | MEMORY_AND_DISK_SER | Determine what level of persistence is used to cache range metadata RDDs created to compute expression index. Refer to org.apache.spark.storage.StorageLevel for different valuesConfig Param: EXPRESSION_INDEX_RANGE_METADATA_STORAGE_LEVEL_VALUESince Version: 1.2.0 |
Common Index Configs
| Config Name | Default | Description |
|---|---|---|
| hoodie.index.type | (N/A) | org.apache.hudi.index.HoodieIndex$IndexType: Determines how input records are indexed, i.e., looked up based on the key for the location in the existing table. Default is SIMPLE on Spark engine, and INMEMORY on Flink and Java engines. INMEMORY: Uses in-memory hashmap in Spark and Java engine and Flink in-memory state in Flink for indexing. BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced inside partitions. GLOBAL_BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced across all partitions in the table. SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.Key uniqueness is enforced inside partitions. GLOBAL_SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage.Key uniqueness is enforced across all partitions in the table. BUCKET: locates the file group containing the record fast by using bucket hashing, particularly beneficial in large scale. Use hoodie.index.bucket.engine to choose bucket engine type, i.e., how buckets are generated. FLINK_STATE: Internal Config for indexing based on Flink state. RECORD_INDEX: Index which saves the record key to location mappings in the HUDI Metadata Table. Record index is a global index, enforcing key uniqueness across all partitions in the table. Supports sharding to achieve very high scale. For a table with keys that are only unique inside each partition, use RECORD_LEVEL_INDEX instead. This enum is deprecated. Use GLOBAL_RECORD_LEVEL_INDEX for global uniqueness of record keys or RECORD_LEVEL_INDEX for partition-level uniqueness of record keys. GLOBAL_RECORD_LEVEL_INDEX: Index which saves the record key to location mappings in the HUDI Metadata Table. Record index is a global index, enforcing key uniqueness across all partitions in the table. Supports sharding to achieve very high scale. For a table with keys that are only unique inside each partition, use RECORD_LEVEL_INDEX instead. RECORD_LEVEL_INDEX: Index which saves the record key to location mappings in the HUDI Metadata Table. Supports sharding to achieve very high scale. This is a non global index, where keys can be replicated across partitions, since a pair of partition path and record keys will uniquely map to a location using this index. If users expect record keys to be unique across all partitions, use GLOBAL_RECORD_LEVEL_INDEX instead.Config Param: INDEX_TYPE |
| hoodie.bucket.index.query.pruning | true | Control if table with bucket index use bucket query or notConfig Param: BUCKET_QUERY_INDEX |
| hoodie.bucket.index.remote.partitioner.enable | false | Use Remote Partitioner using centralized allocation of partition IDs to do repartition based on bucket aiming to resolve data skew. Default local hash partitionerConfig Param: BUCKET_PARTITIONER |
| Config Name | Default | Description |
|---|---|---|
| hoodie.bucket.index.hash.field | (N/A) | Index key. It is used to index the record and find its file group. If not set, use record key field as defaultConfig Param: BUCKET_INDEX_HASH_FIELD |
| hoodie.bucket.index.max.num.buckets | (N/A) | Only applies if bucket index engine is consistent hashing. Determine the upper bound of the number of buckets in the hudi table. Bucket resizing cannot be done higher than this max limit.Config Param: BUCKET_INDEX_MAX_NUM_BUCKETSSince Version: 0.13.0 |
| hoodie.bucket.index.min.num.buckets | (N/A) | Only applies if bucket index engine is consistent hashing. Determine the lower bound of the number of buckets in the hudi table. Bucket resizing cannot be done lower than this min limit.Config Param: BUCKET_INDEX_MIN_NUM_BUCKETSSince Version: 0.13.0 |
| hoodie.bucket.index.partition.expressions | (N/A) | Users can use this parameter to specify expression and the corresponding bucket numbers (separated by commas).Multiple rules are separated by semicolons like hoodie.bucket.index.partition.expressions=expression1,bucket-number1;expression2,bucket-number2Config Param: BUCKET_INDEX_PARTITION_EXPRESSIONS |
| hoodie.bloom.index.bucketized.checking | true | Only applies if index type is BLOOM. When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookupConfig Param: BLOOM_INDEX_BUCKETIZED_CHECKING |
| hoodie.bloom.index.bucketized.checking.enable.dynamic.parallelism | false | Only applies if index type is BLOOM and the bucketized bloom filtering is enabled. When true, the index parallelism is determined by the number of file groups to look up and the number of keys per bucket to split comparisons within a file group; otherwise, the index parallelism is limited by the input parallelism. PLEASE NOTE that if the bloom index parallelism (hoodie.bloom.index.parallelism) is configured, the bloom index parallelism takes effect instead of the input parallelism and always limits the number of buckets calculated based on the number of keys per bucket in the bucketized bloom filtering.Config Param: BLOOM_INDEX_BUCKETIZED_CHECKING_ENABLE_DYNAMIC_PARALLELISMSince Version: 1.1.0 |
| hoodie.bloom.index.fileid.key.sorting.enable | false | Only applies if index type is BLOOM. When true, the global sorting based on the fileId and key is enabled during key lookup. This reduces skew in the key lookup in the bloom index.Config Param: BLOOM_INDEX_FILE_GROUP_ID_KEY_SORTINGSince Version: 1.1.0 |
| hoodie.bloom.index.input.storage.level | MEMORY_AND_DISK_SER | Only applies when #bloomIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different valuesConfig Param: BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE |
| hoodie.bloom.index.keys.per.bucket | 10000000 | Only applies if bloomIndexBucketizedChecking is enabled and index type is bloom. This configuration controls the “bucket” size which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookup. A higher value would amortize the fixed cost of reading a bloom filter to memory.Config Param: BLOOM_INDEX_KEYS_PER_BUCKET |
| hoodie.bloom.index.parallelism | 0 | Only applies if index type is BLOOM. This is the amount of parallelism for index lookup, which involves a shuffle. By default, this is auto computed based on input workload characteristics. If the parallelism is explicitly configured by the user, the user-configured value is used in defining the actual parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.Config Param: BLOOM_INDEX_PARALLELISM |
| hoodie.bloom.index.prune.by.ranges | true | Only applies if index type is BLOOM. When true, range information from files to leveraged speed up index lookups. Particularly helpful, if the key has a monotonously increasing prefix, such as timestamp. If the record key is completely random, it is better to turn this off, since range pruning will only add extra overhead to the index lookup.Config Param: BLOOM_INDEX_PRUNE_BY_RANGES |
| hoodie.bloom.index.update.partition.path | true | Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partitionConfig Param: BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE |
| hoodie.bloom.index.use.caching | true | Only applies if index type is BLOOM.When true, the input RDD will cached to speed up index lookup by reducing IO for computing parallelism or affected partitionsConfig Param: BLOOM_INDEX_USE_CACHING |
| hoodie.bloom.index.use.metadata | false | Only applies if index type is BLOOM.When true, the index lookup uses bloom filters and column stats from metadata table when available to speed up the process.Config Param: BLOOM_INDEX_USE_METADATASince Version: 0.11.0 |
| hoodie.bloom.index.use.treebased.filter | true | Only applies if index type is BLOOM. When true, interval tree based file pruning optimization is enabled. This mode speeds-up file-pruning based on key ranges when compared with the brute-force modeConfig Param: BLOOM_INDEX_TREE_BASED_FILTER |
| hoodie.bucket.index.merge.threshold | 0.2 | Control if buckets should be merged when using consistent hashing bucket indexSpecifically, if a file slice size is smaller than hoodie.xxxx.max.file.size * threshold, then it will be consideredas a merge candidate.Config Param: BUCKET_MERGE_THRESHOLDSince Version: 0.13.0 |
| hoodie.bucket.index.num.buckets | 256 | Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, and each partition is divided to N buckets.Config Param: BUCKET_INDEX_NUM_BUCKETS |
| hoodie.bucket.index.partition.rule.type | regex | Rule parser for expressions when using partition level bucket index, default regex.Config Param: BUCKET_INDEX_PARTITION_RULE_TYPE |
| hoodie.bucket.index.split.threshold | 2.0 | Control if the bucket should be split when using consistent hashing bucket index.Specifically, if a file slice size reaches hoodie.xxxx.max.file.size * threshold, then split will be carried out.Config Param: BUCKET_SPLIT_THRESHOLDSince Version: 0.13.0 |
| hoodie.global.index.reconcile.parallelism | 60 | Only applies if index type is GLOBAL_BLOOM or GLOBAL_SIMPLE. This controls the parallelism for deduplication during indexing where more than 1 record could be tagged due to partition update.Config Param: GLOBAL_INDEX_RECONCILE_PARALLELISM |
| hoodie.global.simple.index.parallelism | 0 | Only applies if index type is GLOBAL_SIMPLE. This limits the parallelism of fetching records from the base files of all table partitions. The index picks the configured parallelism if the number of base files is larger than this configured value; otherwise, the number of base files is used as the parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.Config Param: GLOBAL_SIMPLE_INDEX_PARALLELISM |
| hoodie.index.bucket.engine | SIMPLE | org.apache.hudi.index.HoodieIndex$BucketIndexEngineType: Determines the type of bucketing or hashing to use when hoodie.index.type is set to BUCKET. SIMPLE(default): Uses a fixed number of buckets for file groups which cannot shrink or expand. This works for both COW and MOR tables. CONSISTENT_HASHING: Supports dynamic number of buckets with bucket resizing to properly size each bucket. This solves potential data skew problem where one bucket can be significantly larger than others in SIMPLE engine type. This only works with MOR tables.Config Param: BUCKET_INDEX_ENGINE_TYPESince Version: 0.11.0 |
| hoodie.index.class | Full path of user-defined index class and must be a subclass of HoodieIndex class. It will take precedence over the hoodie.index.type configuration if specifiedConfig Param: INDEX_CLASS_NAME | |
| hoodie.record.index.input.storage.level | MEMORY_AND_DISK_SER | Only applies when #recordIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different valuesConfig Param: RECORD_INDEX_INPUT_STORAGE_LEVEL_VALUESince Version: 0.14.0 |
| hoodie.record.index.update.partition.path | false | Similar to Key: 'hoodie.bloom.index.update.partition.path' , default: true , isAdvanced: true , description: Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition since version: version is not defined deprecated after: version is not defined, but for record index.Config Param: RECORD_INDEX_UPDATE_PARTITION_PATH_ENABLESince Version: 0.14.0 |
| hoodie.record.index.use.caching | true | Only applies if index type is RECORD_INDEX.When true, the input RDD will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitionsConfig Param: RECORD_INDEX_USE_CACHINGSince Version: 0.14.0 |
| hoodie.simple.index.input.storage.level | MEMORY_AND_DISK_SER | Only applies when #simpleIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. Refer to org.apache.spark.storage.StorageLevel for different valuesConfig Param: SIMPLE_INDEX_INPUT_STORAGE_LEVEL_VALUE |
| hoodie.simple.index.parallelism | 0 | Only applies if index type is SIMPLE. This limits the parallelism of fetching records from the base files of affected partitions. By default, this is auto computed based on input workload characteristics. If the parallelism is explicitly configured by the user, the user-configured value is used in defining the actual parallelism. If the indexing stage is slow due to the limited parallelism, you can increase this to tune the performance.Config Param: SIMPLE_INDEX_PARALLELISM |
| hoodie.simple.index.update.partition.path | true | Similar to Key: 'hoodie.bloom.index.update.partition.path' , default: true , isAdvanced: true , description: Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition since version: version is not defined deprecated after: version is not defined, but for simple index.Config Param: SIMPLE_INDEX_UPDATE_PARTITION_PATH_ENABLE |
| hoodie.simple.index.use.caching | true | Only applies if index type is SIMPLE. When true, the incoming writes will cached to speed up index lookup by reducing IO for computing parallelism or affected partitionsConfig Param: SIMPLE_INDEX_USE_CACHING |
Reader Configs
Please fill in the description for Config Group Name: Reader Configs
Reader Configs
Configurations that control file group reading.
| Config Name | Default | Description |
|---|---|---|
| hoodie.compaction.lazy.block.read | true | When merging the delta log files, this config helps to choose whether the log blocks should be read lazily or not. Choose true to use lazy block reading (low memory usage, but incurs seeks to each block header) or false for immediate block read (higher memory usage)Config Param: COMPACTION_LAZY_BLOCK_READ_ENABLE |
| hoodie.compaction.reverse.log.read | false | HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the reader reads the logfile in reverse direction, from pos=file_length to pos=0Config Param: COMPACTION_REVERSE_LOG_READ_ENABLE |
| hoodie.datasource.merge.type | payload_combine | For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records without merging; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records (combines the two records with same key for base and log file records), then read the left log file recordsConfig Param: MERGE_TYPE |
| hoodie.file.group.reader.enabled | true | Use engine agnostic file group reader if enabledConfig Param: FILE_GROUP_READER_ENABLEDSince Version: 1.0.0 |
| hoodie.hfile.block.cache.enabled | true | Enable HFile block-level caching for metadata files. This caches frequently accessed HFile blocks in memory to reduce I/O operations during metadata queries. Improves performance for workloads with repeated metadata access patterns.Config Param: HFILE_BLOCK_CACHE_ENABLEDSince Version: 1.1.0 |
| hoodie.hfile.block.cache.size | 100 | Maximum number of HFile blocks to cache in memory per metadata file reader. Higher values improve cache hit rates but consume more memory. Only effective when hfile.block.cache.enabled is true.Config Param: HFILE_BLOCK_CACHE_SIZESince Version: 1.1.0 |
| hoodie.hfile.block.cache.ttl.minutes | 60 | Time-to-live (TTL) in minutes for cached HFile blocks. Blocks are evicted from the cache after this duration to prevent memory leaks. Only effective when hfile.block.cache.enabled is true.Config Param: HFILE_BLOCK_CACHE_TTL_MINUTESSince Version: 1.1.0 |
| hoodie.merge.use.record.positions | true | Whether to use positions in the block header for data blocks containing updates and delete blocks for merging.Config Param: MERGE_USE_RECORD_POSITIONSSince Version: 1.0.0 |
| hoodie.read.blob.inline.mode | DESCRIPTOR | How Hudi interprets INLINE BLOB values on read. DESCRIPTOR (default) returns an OUT_OF_LINE-shaped reference pointing at the backing Lance file with the INLINE payload's position and size, so callers can skip the byte content read. CONTENT returns the raw inline bytes directly in the data field on every read.Config Param: BLOB_INLINE_READ_MODESince Version: 1.2.0 |
Metastore and Catalog Sync Configs
Configurations used by the Hudi to sync metadata to external metastores and catalogs.
Common Metadata Sync Configs
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.meta.sync.enable | false | Enable Syncing the Hudi Table with an external meta store or data catalog.Config Param: META_SYNC_ENABLED |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.base_file_format | PARQUET | Base file format for the sync.Config Param: META_SYNC_BASE_FILE_FORMAT |
| hoodie.datasource.hive_sync.database | default | The name of the destination database that we should sync the hudi table to.Config Param: META_SYNC_DATABASE_NAME |
| hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS |
| hoodie.datasource.hive_sync.partition_fields | Field in the table to use for determining hive partition columns.Config Param: META_SYNC_PARTITION_FIELDS | |
| hoodie.datasource.hive_sync.table | unknown | The name of the destination table that we should sync the hudi table to.Config Param: META_SYNC_TABLE_NAME |
| hoodie.datasource.meta.sync.base.path | Base path of the hoodie table to syncConfig Param: META_SYNC_BASE_PATH | |
| hoodie.datasource.meta_sync.condition.sync | false | If true, only sync on conditions like schema change or partition change.Config Param: META_SYNC_CONDITIONAL_SYNC |
| hoodie.meta.sync.decode_partition | false | If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.Config Param: META_SYNC_DECODE_PARTITION |
| hoodie.meta.sync.incremental | true | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.Config Param: META_SYNC_INCREMENTALSince Version: 0.14.0 |
| hoodie.meta.sync.metadata_file_listing | true | Enable the internal metadata table for file listing for syncing with metastoresConfig Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA |
| hoodie.meta.sync.no_partition_metadata | false | If true, the partition metadata will not be synced to the metastore. This is useful when the partition metadata is large, and the partition info can be obtained from Hudi's internal metadata table. Note, Key: 'hoodie.metadata.enable' , default: true , isAdvanced: false , description: Enable the internal metadata table which serves table metadata like level file listings since version: 0.7.0 deprecated after: version is not defined must be set to true.Config Param: META_SYNC_NO_PARTITION_METADATASince Version: 1.0.0 |
| hoodie.meta.sync.sync_snapshot_with_table_name | true | sync meta info to origin table if enableConfig Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAMESince Version: 0.14.0 |
| hoodie.meta.sync.touch.partitions.enabled | false | If true, TOUCH partition events will be emitted during meta sync. TOUCH events indicate partitions that exist in both storage and metastore, no schema or location change, but the partition has received data.Config Param: META_SYNC_TOUCH_PARTITIONS_ENABLEDSince Version: 1.2.0 |
| hoodie.meta_sync.spark.version | The spark version used when syncing with a metastore.Config Param: META_SYNC_SPARK_VERSION |
Glue catalog sync based client Configurations
Configs that control Glue catalog sync based client.
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.meta.sync.glue.partition_index_fields | Specify the partitions fields to index on aws glue. Separate the fields by semicolon. By default, when the feature is enabled, all the partition will be indexed. You can create up to three indexes, separate them by comma. Eg: col1;col2;col3,col2,col3Config Param: META_SYNC_PARTITION_INDEX_FIELDSSince Version: 0.15.0 | |
| hoodie.datasource.meta.sync.glue.partition_index_fields.enable | false | Enable aws glue partition index feature, to speedup partition based query patternConfig Param: META_SYNC_PARTITION_INDEX_FIELDS_ENABLESince Version: 0.15.0 |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.meta.sync.glue.catalogId | (N/A) | The catalogId needs to be populated for syncing hoodie tables in a different AWS accountConfig Param: GLUE_CATALOG_IDSince Version: 1.1.0 |
| hoodie.datasource.meta.sync.glue.resource_tags | (N/A) | Tags to be applied to AWS Glue databases and tables during sync. Format: key1:value1,key2:value2Config Param: GLUE_SYNC_RESOURCE_TAGSSince Version: 1.1.0 |
| hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism | 1 | Parallelism for listing all partitions(first time sync). Should be in interval [1, 10].Config Param: ALL_PARTITIONS_READ_PARALLELISMSince Version: 0.15.0 |
| hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism | 1 | Parallelism for listing changed partitions(second and subsequent syncs).Config Param: CHANGED_PARTITIONS_READ_PARALLELISMSince Version: 0.15.0 |
| hoodie.datasource.meta.sync.glue.database_name | The name of the destination database that we should sync the hudi table to.Config Param: GLUE_SYNC_DATABASE_NAMESince Version: 1.1.0 | |
| hoodie.datasource.meta.sync.glue.metadata_file_listing | false | Makes athena use the metadata table to list partitions and files. Currently it won't benefit from other features such stats indexesConfig Param: GLUE_METADATA_FILE_LISTINGSince Version: 0.14.0 |
| hoodie.datasource.meta.sync.glue.partition_change_parallelism | 1 | Parallelism for change operations - such as create/update/delete.Config Param: PARTITION_CHANGE_PARALLELISMSince Version: 0.15.0 |
| hoodie.datasource.meta.sync.glue.recreate_table_on_error | false | Glue sync may fail if the Glue table exists with partitions differing from the Hoodie table or if schema evolution is not supported by Glue.Enabling this configuration will drop and create the table to match the Hoodie configConfig Param: RECREATE_GLUE_TABLE_ON_ERRORSince Version: 0.14.0 |
| hoodie.datasource.meta.sync.glue.skip_table_archive | true | Glue catalog sync based client will skip archiving the table version if this config is set to trueConfig Param: GLUE_SKIP_TABLE_ARCHIVESince Version: 0.14.0 |
| hoodie.datasource.meta.sync.glue.table_name | The name of the destination table that we should sync the hudi table to.Config Param: GLUE_SYNC_TABLE_NAMESince Version: 1.1.0 |
BigQuery Sync Configs
Configurations used by the Hudi to sync metadata to Google BigQuery.
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.meta.sync.enable | false | Enable Syncing the Hudi Table with an external meta store or data catalog.Config Param: META_SYNC_ENABLED |
| Config Name | Default | Description |
|---|---|---|
| hoodie.gcp.bigquery.sync.big_lake_connection_id | (N/A) | The Big Lake connection ID to useConfig Param: BIGQUERY_SYNC_BIG_LAKE_CONNECTION_IDSince Version: 0.14.1 |
| hoodie.gcp.bigquery.sync.billing.project.id | (N/A) | Name of the billing project id in BigQuery. By default it uses the configuration from hoodie.gcp.bigquery.sync.project_id if this configuration is not set. This can only be used with manifest file based approachConfig Param: BIGQUERY_SYNC_BILLING_PROJECT_IDSince Version: 1.0.0 |
| hoodie.gcp.bigquery.sync.dataset_location | (N/A) | Location of the target dataset in BigQueryConfig Param: BIGQUERY_SYNC_DATASET_LOCATION |
| hoodie.gcp.bigquery.sync.project_id | (N/A) | Name of the target project in BigQueryConfig Param: BIGQUERY_SYNC_PROJECT_ID |
| hoodie.gcp.bigquery.sync.source_uri | (N/A) | Name of the source uri gcs path of the tableConfig Param: BIGQUERY_SYNC_SOURCE_URI |
| hoodie.gcp.bigquery.sync.source_uri_prefix | (N/A) | Name of the source uri gcs path prefix of the tableConfig Param: BIGQUERY_SYNC_SOURCE_URI_PREFIX |
| hoodie.datasource.hive_sync.base_file_format | PARQUET | Base file format for the sync.Config Param: META_SYNC_BASE_FILE_FORMAT |
| hoodie.datasource.hive_sync.database | default | The name of the destination database that we should sync the hudi table to.Config Param: META_SYNC_DATABASE_NAME |
| hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS |
| hoodie.datasource.hive_sync.partition_fields | Field in the table to use for determining hive partition columns.Config Param: META_SYNC_PARTITION_FIELDS | |
| hoodie.datasource.hive_sync.table | unknown | The name of the destination table that we should sync the hudi table to.Config Param: META_SYNC_TABLE_NAME |
| hoodie.datasource.meta.sync.base.path | Base path of the hoodie table to syncConfig Param: META_SYNC_BASE_PATH | |
| hoodie.datasource.meta_sync.condition.sync | false | If true, only sync on conditions like schema change or partition change.Config Param: META_SYNC_CONDITIONAL_SYNC |
| hoodie.gcp.bigquery.sync.dataset_name | Name of the target dataset in BigQueryConfig Param: BIGQUERY_SYNC_DATASET_NAME | |
| hoodie.gcp.bigquery.sync.partition_fields | Comma-delimited partition fields. Default to non-partitioned.Config Param: BIGQUERY_SYNC_PARTITION_FIELDS | |
| hoodie.gcp.bigquery.sync.require_partition_filter | false | If true, configure table to require a partition filter to be specified when querying the tableConfig Param: BIGQUERY_SYNC_REQUIRE_PARTITION_FILTERSince Version: 0.14.1 |
| hoodie.gcp.bigquery.sync.table_name | Name of the target table in BigQueryConfig Param: BIGQUERY_SYNC_TABLE_NAME | |
| hoodie.gcp.bigquery.sync.use_bq_manifest_file | false | If true, generate a manifest file with data file absolute paths and use BigQuery manifest file support to directly create one external table over the Hudi table. If false (default), generate a manifest file with data file names and create two external tables and one view in BigQuery. Query the view for the same results as querying the Hudi tableConfig Param: BIGQUERY_SYNC_USE_BQ_MANIFEST_FILESince Version: 0.14.0 |
| hoodie.gcp.bigquery.sync.use_file_listing_from_metadata | true | Fetch file listing from Hudi's metadataConfig Param: BIGQUERY_SYNC_USE_FILE_LISTING_FROM_METADATA |
| hoodie.meta.sync.decode_partition | false | If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.Config Param: META_SYNC_DECODE_PARTITION |
| hoodie.meta.sync.incremental | true | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.Config Param: META_SYNC_INCREMENTALSince Version: 0.14.0 |
| hoodie.meta.sync.metadata_file_listing | true | Enable the internal metadata table for file listing for syncing with metastoresConfig Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA |
| hoodie.meta.sync.no_partition_metadata | false | If true, the partition metadata will not be synced to the metastore. This is useful when the partition metadata is large, and the partition info can be obtained from Hudi's internal metadata table. Note, Key: 'hoodie.metadata.enable' , default: true , isAdvanced: false , description: Enable the internal metadata table which serves table metadata like level file listings since version: 0.7.0 deprecated after: version is not defined must be set to true.Config Param: META_SYNC_NO_PARTITION_METADATASince Version: 1.0.0 |
| hoodie.meta.sync.sync_snapshot_with_table_name | true | sync meta info to origin table if enableConfig Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAMESince Version: 0.14.0 |
| hoodie.meta.sync.touch.partitions.enabled | false | If true, TOUCH partition events will be emitted during meta sync. TOUCH events indicate partitions that exist in both storage and metastore, no schema or location change, but the partition has received data.Config Param: META_SYNC_TOUCH_PARTITIONS_ENABLEDSince Version: 1.2.0 |
| hoodie.meta_sync.spark.version | The spark version used when syncing with a metastore.Config Param: META_SYNC_SPARK_VERSION |
Hive Sync Configs
Configurations used by the Hudi to sync metadata to Hive Metastore.
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.mode | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.Config Param: HIVE_SYNC_MODE |
| hoodie.datasource.hive_sync.enable | false | When set to true, register/sync the table to Apache Hive metastore.Config Param: HIVE_SYNC_ENABLED |
| hoodie.datasource.hive_sync.jdbcurl | jdbc:hive2://localhost:10000 | Hive metastore urlConfig Param: HIVE_URL |
| hoodie.datasource.hive_sync.metastore.uris | thrift://localhost:9083 | Hive metastore urlConfig Param: METASTORE_URIS |
| hoodie.datasource.meta.sync.enable | false | Enable Syncing the Hudi Table with an external meta store or data catalog.Config Param: META_SYNC_ENABLED |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.serde_properties | (N/A) | Serde properties to hive table.Config Param: HIVE_TABLE_SERDE_PROPERTIES |
| hoodie.datasource.hive_sync.table_properties | (N/A) | Additional properties to store with table.Config Param: HIVE_TABLE_PROPERTIES |
| hoodie.datasource.hive_sync.auto_create_database | true | Auto create hive database if does not existsConfig Param: HIVE_AUTO_CREATE_DATABASE |
| hoodie.datasource.hive_sync.base_file_format | PARQUET | Base file format for the sync.Config Param: META_SYNC_BASE_FILE_FORMAT |
| hoodie.datasource.hive_sync.batch_num | 1000 | The number of partitions one batch when synchronous partitions to hive.Config Param: HIVE_BATCH_SYNC_PARTITION_NUM |
| hoodie.datasource.hive_sync.bucket_sync | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'Config Param: HIVE_SYNC_BUCKET_SYNC |
| hoodie.datasource.hive_sync.bucket_sync_spec | The hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'Config Param: HIVE_SYNC_BUCKET_SYNC_SPEC | |
| hoodie.datasource.hive_sync.create_managed_table | false | Whether to sync the table as managed table.Config Param: HIVE_CREATE_MANAGED_TABLE |
| hoodie.datasource.hive_sync.database | default | The name of the destination database that we should sync the hudi table to.Config Param: META_SYNC_DATABASE_NAME |
| hoodie.datasource.hive_sync.filter_pushdown_enabled | false | Whether to enable push down partitions by filterConfig Param: HIVE_SYNC_FILTER_PUSHDOWN_ENABLED |
| hoodie.datasource.hive_sync.filter_pushdown_max_size | 1000 | Max size limit to push down partition filters, if the estimate push down filters exceed this size, will directly try to fetch all partitions between the min/max.In case of glue metastore, this value should be reduced because it has a filter length limit.Config Param: HIVE_SYNC_FILTER_PUSHDOWN_MAX_SIZE |
| hoodie.datasource.hive_sync.ignore_exceptions | false | Ignore exceptions when syncing with Hive.Config Param: HIVE_IGNORE_EXCEPTIONS |
| hoodie.datasource.hive_sync.omit_metadata_fields | false | Whether to omit the hoodie metadata fields in the target table.Config Param: HIVE_SYNC_OMIT_METADATA_FIELDSSince Version: 0.13.0 |
| hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS |
| hoodie.datasource.hive_sync.partition_fields | Field in the table to use for determining hive partition columns.Config Param: META_SYNC_PARTITION_FIELDS | |
| hoodie.datasource.hive_sync.password | hive | hive password to useConfig Param: HIVE_PASS |
| hoodie.datasource.hive_sync.recreate_table_on_error | false | Hive sync may fail if the Hive table exists with partitions differing from the Hoodie table or if schema evolution if not supported by Hive.Enabling this configuration will drop and create the table to match the Hoodie configConfig Param: RECREATE_HIVE_TABLE_ON_ERRORSince Version: 0.14.0 |
| hoodie.datasource.hive_sync.schema_string_length_thresh | 4000 | Config Param: HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD |
| hoodie.datasource.hive_sync.skip_ro_suffix | false | Skip the _ro suffix for Read optimized table, when registeringConfig Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE |
| hoodie.datasource.hive_sync.support_timestamp | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUEConfig Param: HIVE_SUPPORT_TIMESTAMP_TYPE |
| hoodie.datasource.hive_sync.sync_as_datasource | true | Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE |
| hoodie.datasource.hive_sync.sync_comment | false | Whether to sync the table column comments while syncing the table.Config Param: HIVE_SYNC_COMMENT |
| hoodie.datasource.hive_sync.table | unknown | The name of the destination table that we should sync the hudi table to.Config Param: META_SYNC_TABLE_NAME |
| hoodie.datasource.hive_sync.table.strategy | ALL | Hive table synchronization strategy. Available option: RO, RT, ALL.Config Param: HIVE_SYNC_TABLE_STRATEGYSince Version: 0.13.0 |
| hoodie.datasource.hive_sync.use_jdbc | true | Use JDBC when hive synchronization is enabledConfig Param: HIVE_USE_JDBC |
| hoodie.datasource.hive_sync.use_spark_catalog | false | Use Spark catalog backed IMetaStoreClient implementation for hive sync.Config Param: HIVE_SYNC_USE_SPARK_CATALOG |
| hoodie.datasource.hive_sync.username | hive | hive user name to useConfig Param: HIVE_USER |
| hoodie.datasource.meta.sync.base.path | Base path of the hoodie table to syncConfig Param: META_SYNC_BASE_PATH | |
| hoodie.datasource.meta_sync.condition.sync | false | If true, only sync on conditions like schema change or partition change.Config Param: META_SYNC_CONDITIONAL_SYNC |
| hoodie.meta.sync.decode_partition | false | If true, meta sync will url-decode the partition path, as it is deemed as url-encoded. Default to false.Config Param: META_SYNC_DECODE_PARTITION |
| hoodie.meta.sync.incremental | true | Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.Config Param: META_SYNC_INCREMENTALSince Version: 0.14.0 |
| hoodie.meta.sync.metadata_file_listing | true | Enable the internal metadata table for file listing for syncing with metastoresConfig Param: META_SYNC_USE_FILE_LISTING_FROM_METADATA |
| hoodie.meta.sync.no_partition_metadata | false | If true, the partition metadata will not be synced to the metastore. This is useful when the partition metadata is large, and the partition info can be obtained from Hudi's internal metadata table. Note, Key: 'hoodie.metadata.enable' , default: true , isAdvanced: false , description: Enable the internal metadata table which serves table metadata like level file listings since version: 0.7.0 deprecated after: version is not defined must be set to true.Config Param: META_SYNC_NO_PARTITION_METADATASince Version: 1.0.0 |
| hoodie.meta.sync.sync_snapshot_with_table_name | true | sync meta info to origin table if enableConfig Param: META_SYNC_SNAPSHOT_WITH_TABLE_NAMESince Version: 0.14.0 |
| hoodie.meta.sync.touch.partitions.enabled | false | If true, TOUCH partition events will be emitted during meta sync. TOUCH events indicate partitions that exist in both storage and metastore, no schema or location change, but the partition has received data.Config Param: META_SYNC_TOUCH_PARTITIONS_ENABLEDSince Version: 1.2.0 |
| hoodie.meta_sync.spark.version | The spark version used when syncing with a metastore.Config Param: META_SYNC_SPARK_VERSION |
Global Hive Sync Configs
Global replication configurations used by the Hudi to sync metadata to Hive Metastore.
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.mode | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.Config Param: HIVE_SYNC_MODE |
| hoodie.datasource.hive_sync.enable | false | When set to true, register/sync the table to Apache Hive metastore.Config Param: HIVE_SYNC_ENABLED |
| hoodie.datasource.hive_sync.jdbcurl | jdbc:hive2://localhost:10000 | Hive metastore urlConfig Param: HIVE_URL |
| hoodie.datasource.hive_sync.metastore.uris | thrift://localhost:9083 | Hive metastore urlConfig Param: METASTORE_URIS |
| hoodie.datasource.meta.sync.enable | false | Enable Syncing the Hudi Table with an external meta store or data catalog.Config Param: META_SYNC_ENABLED |
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.serde_properties | (N/A) | Serde properties to hive table.Config Param: HIVE_TABLE_SERDE_PROPERTIES |
| hoodie.datasource.hive_sync.table_properties | (N/A) | Additional properties to store with table.Config Param: HIVE_TABLE_PROPERTIES |
| hoodie.meta_sync.global.replicate.timestamp | (N/A) | Config Param: META_SYNC_GLOBAL_REPLICATE_TIMESTAMP |
| hoodie.datasource.hive_sync.auto_create_database | true | Auto create hive database if does not existsConfig Param: HIVE_AUTO_CREATE_DATABASE |
| hoodie.datasource.hive_sync.base_file_format | PARQUET | Base file format for the sync.Config Param: META_SYNC_BASE_FILE_FORMAT |
| hoodie.datasource.hive_sync.batch_num | 1000 | The number of partitions one batch when synchronous partitions to hive.Config Param: HIVE_BATCH_SYNC_PARTITION_NUM |
| hoodie.datasource.hive_sync.bucket_sync | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'Config Param: HIVE_SYNC_BUCKET_SYNC |
| hoodie.datasource.hive_sync.bucket_sync_spec | The hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'Config Param: HIVE_SYNC_BUCKET_SYNC_SPEC | |
| hoodie.datasource.hive_sync.create_managed_table | false | Whether to sync the table as managed table.Config Param: HIVE_CREATE_MANAGED_TABLE |
| hoodie.datasource.hive_sync.database | default | The name of the destination database that we should sync the hudi table to.Config Param: META_SYNC_DATABASE_NAME |
| hoodie.datasource.hive_sync.filter_pushdown_enabled | false | Whether to enable push down partitions by filterConfig Param: HIVE_SYNC_FILTER_PUSHDOWN_ENABLED |
| hoodie.datasource.hive_sync.filter_pushdown_max_size | 1000 | Max size limit to push down partition filters, if the estimate push down filters exceed this size, will directly try to fetch all partitions between the min/max.In case of glue metastore, this value should be reduced because it has a filter length limit.Config Param: HIVE_SYNC_FILTER_PUSHDOWN_MAX_SIZE |
| hoodie.datasource.hive_sync.ignore_exceptions | false | Ignore exceptions when syncing with Hive.Config Param: HIVE_IGNORE_EXCEPTIONS |
| hoodie.datasource.hive_sync.omit_metadata_fields | false | Whether to omit the hoodie metadata fields in the target table.Config Param: HIVE_SYNC_OMIT_METADATA_FIELDSSince Version: 0.13.0 |
| hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default is inferred based on partition configuration.Config Param: META_SYNC_PARTITION_EXTRACTOR_CLASS |
| hoodie.datasource.hive_sync.partition_fields | Field in the table to use for determining hive partition columns.Config Param: META_SYNC_PARTITION_FIELDS | |
| hoodie.datasource.hive_sync.password | hive | hive password to useConfig Param: HIVE_PASS |
| hoodie.datasource.hive_sync.recreate_table_on_error | false | Hive sync may fail if the Hive table exists with partitions differing from the Hoodie table or if schema evolution if not supported by Hive.Enabling this configuration will drop and create the table to match the Hoodie configConfig Param: RECREATE_HIVE_TABLE_ON_ERRORSince Version: 0.14.0 |
| hoodie.datasource.hive_sync.schema_string_length_thresh | 4000 | Config Param: HIVE_SYNC_SCHEMA_STRING_LENGTH_THRESHOLD |
| hoodie.datasource.hive_sync.skip_ro_suffix | false | Skip the _ro suffix for Read optimized table, when registeringConfig Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE |
| hoodie.datasource.hive_sync.support_timestamp | false |