Version: 1.0.2

Basic Configurations

This page covers the basic configurations you may use to write/read Hudi tables. It features only a subset of the most frequently used configurations; for the full list of configs, see the All Configurations page.

Hudi Table Config: Basic Hudi Table configuration parameters.
Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
Write Client Configs: Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, these configs may need to be tweaked from time to time to optimize for specific workloads.
Metastore and Catalog Sync Configs: Configurations used by Hudi to sync metadata to external metastores and catalogs.
Metrics Configs: This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.
Kafka Connect Configs: This set of configs is used by the Kafka Connect Sink Connector for writing Hudi tables.
Hudi Streamer Configs: This set of configs is used by the Hudi Streamer utility, which provides a way to ingest from different sources such as DFS or Kafka.

Note: In the tables below, (N/A) means that no default value is set.

Hudi Table Config

Basic Hudi Table configuration parameters.

Hudi Table Basic Configs

Configurations of the Hudi table, such as the type of ingestion, storage formats, Hive table name, etc. Configurations are loaded from hoodie.properties; these properties are usually set when a path is initialized as the Hudi base path and never change during the lifetime of the table.
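
As a rough illustration of how table configs relate to their defaults, the sketch below parses a hoodie.properties-style text and overlays it on a few defaults taken from the table that follows. The sample file contents and the helper names (`parse_properties`, `effective_table_config`) are assumptions made for this example; this is not Hudi's own loader, and a real hoodie.properties file usually contains more keys.

```python
# Hypothetical hoodie.properties contents, for illustration only.
SAMPLE = """\
# illustrative table properties
hoodie.table.name=trips
hoodie.table.type=MERGE_ON_READ
hoodie.table.recordkey.fields=uuid
hoodie.table.precombine.field=ts
"""

# A few defaults copied from the Hudi Table Basic Configs table.
DEFAULTS = {
    "hoodie.table.type": "COPY_ON_WRITE",
    "hoodie.table.base.file.format": "PARQUET",
    "hoodie.table.log.file.format": "HOODIE_LOG",
    "hoodie.populate.meta.fields": "true",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def effective_table_config(text):
    """Overlay the parsed properties on top of the documented defaults."""
    return {**DEFAULTS, **parse_properties(text)}


config = effective_table_config(SAMPLE)
print(config["hoodie.table.type"])              # MERGE_ON_READ (set in the file)
print(config["hoodie.table.base.file.format"])  # PARQUET (default)
```

Keys with a default of (N/A) have no fallback in this model, so they appear in the result only when the file sets them.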

Config Name	Default	Description
hoodie.bootstrap.base.path	(N/A)	Base path of the dataset that needs to be bootstrapped as a Hudi table `Config Param: BOOTSTRAP_BASE_PATH`
hoodie.compaction.payload.class	(N/A)	Payload class to use for performing merges and compactions, i.e., merging delta logs with the current base file to produce a new base file. `Config Param: PAYLOAD_CLASS_NAME`
hoodie.database.name	(N/A)	Database name. If different databases contain tables with the same name, set this so that incremental queries resolve the table name within a specific database. `Config Param: DATABASE_NAME`
hoodie.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: Uses transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Uses event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Uses custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.record.merge.strategy.id	(N/A)	ID of the merger strategy. Hudi picks the HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` that have the same merger strategy ID. `Config Param: RECORD_MERGE_STRATEGY_ID` `Since Version: 0.13.0`
hoodie.table.checksum	(N/A)	Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config. `Config Param: TABLE_CHECKSUM` `Since Version: 0.11.0`
hoodie.table.create.schema	(N/A)	Schema used when creating the table `Config Param: CREATE_SCHEMA`
hoodie.table.index.defs.path	(N/A)	Relative path to table base path where the index definitions are stored `Config Param: RELATIVE_INDEX_DEFINITION_PATH` `Since Version: 1.0.0`
hoodie.table.keygenerator.class	(N/A)	Key Generator class property for the hoodie table `Config Param: KEY_GENERATOR_CLASS_NAME`
hoodie.table.keygenerator.type	(N/A)	Key Generator type to determine key generator class `Config Param: KEY_GENERATOR_TYPE` `Since Version: 1.0.0`
hoodie.table.metadata.partitions	(N/A)	Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by readers. `Config Param: TABLE_METADATA_PARTITIONS` `Since Version: 0.11.0`
hoodie.table.metadata.partitions.inflight	(N/A)	Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers. `Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT` `Since Version: 0.11.0`
hoodie.table.name	(N/A)	Table name that will be used for registering with Hive. Needs to be same across runs. `Config Param: NAME`
hoodie.table.partition.fields	(N/A)	Comma-separated field names used to partition the table. These field names also include the partition type, which is used by custom key generators. `Config Param: PARTITION_FIELDS`
hoodie.table.precombine.field	(N/A)	Field used in preCombining before the actual write. By default, when two records have the same key value, the record with the largest value for the precombine field, as determined by Object.compareTo(..), is picked. `Config Param: PRECOMBINE_FIELD`
hoodie.table.recordkey.fields	(N/A)	Columns used to uniquely identify records in the table. Concatenated values of these fields are used as the record key component of HoodieKey. `Config Param: RECORDKEY_FIELDS`
hoodie.table.secondary.indexes.metadata	(N/A)	The metadata of secondary indexes `Config Param: SECONDARY_INDEXES_METADATA` `Since Version: 0.13.0`
hoodie.timeline.layout.version	(N/A)	Version of the timeline used by the table. `Config Param: TIMELINE_LAYOUT_VERSION`
hoodie.archivelog.folder	archived	Path under the meta folder where archived timeline instants are stored. `Config Param: ARCHIVELOG_FOLDER`
hoodie.bootstrap.index.class	org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex	Implementation to use for mapping base files to the bootstrap base files that contain the actual data. `Config Param: BOOTSTRAP_INDEX_CLASS_NAME`
hoodie.bootstrap.index.enable	true	Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined. Default is true. `Config Param: BOOTSTRAP_INDEX_ENABLE`
hoodie.bootstrap.index.type	HFILE	Bootstrap index type; determines which implementation to use for mapping base files to the bootstrap base files that contain the actual data. `Config Param: BOOTSTRAP_INDEX_TYPE` `Since Version: 1.0.0`
hoodie.datasource.write.hive_style_partitioning	false	Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING_ENABLE`
hoodie.partition.metafile.use.base.format	false	If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files. `Config Param: PARTITION_METAFILE_USE_BASE_FORMAT`
hoodie.populate.meta.fields	true	When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing `Config Param: POPULATE_META_FIELDS`
hoodie.table.base.file.format	PARQUET	Base file format to store all the base file data. `Config Param: BASE_FILE_FORMAT`
hoodie.table.cdc.enabled	false	When enabled, persists the change data if necessary so that the table can be queried in CDC query mode. `Config Param: CDC_ENABLED` `Since Version: 0.13.0`
hoodie.table.cdc.supplemental.logging.mode	DATA_BEFORE_AFTER	org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs. `Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE` `Since Version: 0.13.0`
hoodie.table.initial.version	EIGHT	Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup. `Config Param: INITIAL_VERSION` `Since Version: 1.0.0`
hoodie.table.log.file.format	HOODIE_LOG	Log format used for the delta logs. `Config Param: LOG_FILE_FORMAT`
hoodie.table.multiple.base.file.formats.enable	false	When set to true, the table can support reading and writing multiple base file formats. `Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE` `Since Version: 1.0.0`
hoodie.table.timeline.timezone	LOCAL	Time zone of the Hudi commit timeline, for example utc or local. The default is local. `Config Param: TIMELINE_TIMEZONE`
hoodie.table.type	COPY_ON_WRITE	The table type for the underlying data. `Config Param: TYPE`
hoodie.table.version	EIGHT	Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes. `Config Param: VERSION`
hoodie.timeline.history.path	history	Path under the meta folder where timeline history is stored. `Config Param: TIMELINE_HISTORY_PATH`
hoodie.timeline.path	timeline	Path under the meta folder where timeline instants are stored. `Config Param: TIMELINE_PATH`
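
The difference between the COMMIT_TIME_ORDERING and EVENT_TIME_ORDERING values of hoodie.record.merge.mode can be illustrated with a small model. This is only a sketch of the semantics described in the table: the `Record` type, the `merge` helper, and the tie-breaking rule (the incoming record wins on equal values) are assumptions made for the example, not Hudi's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    event_time: int   # stands in for the preCombine / ordering field
    commit_time: str  # time of the transaction that wrote the record


def merge(existing, incoming, mode):
    """Return the record that survives for one key under the given mode."""
    if mode == "COMMIT_TIME_ORDERING":
        # The record from the later transaction overwrites the earlier one.
        incoming_wins = incoming.commit_time >= existing.commit_time
    elif mode == "EVENT_TIME_ORDERING":
        # The record with the larger event time wins, whatever its commit time.
        incoming_wins = incoming.event_time >= existing.event_time
    else:
        raise ValueError(f"merge mode not covered by this sketch: {mode}")
    return incoming if incoming_wins else existing


# A late-arriving update: written in a later transaction, but older in event time.
stored = Record("k1", "fresh", event_time=200, commit_time="20240101000000")
late = Record("k1", "stale", event_time=100, commit_time="20240102000000")

print(merge(stored, late, "COMMIT_TIME_ORDERING").value)  # stale
print(merge(stored, late, "EVENT_TIME_ORDERING").value)   # fresh
```

Under event-time ordering the late-arriving record is discarded, which is why that mode needs an event time or preCombine field to be specified.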

Config Name	Default	Description
hoodie.datasource.read.begin.instanttime	(N/A)	Required when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to start incrementally pulling data from. The completion time here need not necessarily correspond to an instant on the timeline. New data written with completion_time >= START_COMMIT is fetched. For example, ‘20170901080000’ fetches all new data written on or after Sep 1, 2017 08:00AM. `Config Param: START_COMMIT`
hoodie.datasource.read.end.instanttime	(N/A)	Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified, the latest commit completion time from the timeline is assumed. When specified, new data written with completion_time <= END_COMMIT is fetched. Point-in-time queries make more sense with both begin and end completion times specified. `Config Param: END_COMMIT`
hoodie.datasource.read.incr.table.version	(N/A)	The table version assumed for incremental read `Config Param: INCREMENTAL_READ_TABLE_VERSION`
hoodie.datasource.read.streaming.table.version	(N/A)	The table version assumed for streaming read `Config Param: STREAMING_READ_TABLE_VERSION`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: READ_PRE_COMBINE_FIELD`
hoodie.datasource.query.type	snapshot	Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) `Config Param: QUERY_TYPE`
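
A minimal sketch of assembling the read options for an incremental query follows. The helper name `incremental_read_options` is hypothetical; the option keys come from the table above. The commented-out line shows how such a dictionary is typically passed to the Spark DataFrame reader, assuming a SparkSession with the Hudi bundle available and a `base_path` variable, neither of which is set up here.

```python
def incremental_read_options(begin_time, end_time=None):
    """Build Spark datasource options for an incremental read."""
    if not begin_time:
        # The begin time is required when the query type is incremental.
        raise ValueError(
            "hoodie.datasource.read.begin.instanttime is required "
            "for incremental queries"
        )
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_time,
    }
    if end_time is not None:
        # Optional upper bound; without it the latest completion time is used.
        options["hoodie.datasource.read.end.instanttime"] = end_time
    return options


opts = incremental_read_options("20170901080000", end_time="20170902080000")
for key, value in opts.items():
    print(f"{key}={value}")

# Usage with Spark (requires a session with Hudi on the classpath):
# df = spark.read.format("hudi").options(**opts).load(base_path)
```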

Config Name	Default	Description
hoodie.datasource.hive_sync.mode	(N/A)	Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. `Config Param: HIVE_SYNC_MODE`
hoodie.datasource.write.partitionpath.field	(N/A)	Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString() `Config Param: PARTITIONPATH_FIELD`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD`
hoodie.datasource.write.recordkey.field	(N/A)	Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: RECORDKEY_FIELD`
hoodie.datasource.write.secondarykey.column	(N/A)	Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: SECONDARYKEY_COLUMN_NAME`
hoodie.write.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: Uses transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Uses event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Uses custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.clustering.async.enabled	false	Enable running of clustering service, asynchronously as inserts happen on the table. `Config Param: ASYNC_CLUSTERING_ENABLE` `Since Version: 0.7.0`
hoodie.clustering.inline	false	Turn on inline clustering - clustering will be run after each write operation is complete `Config Param: INLINE_CLUSTERING_ENABLE` `Since Version: 0.7.0`
hoodie.datasource.hive_sync.enable	false	When set to true, register/sync the table to Apache Hive metastore. `Config Param: HIVE_SYNC_ENABLED`
hoodie.datasource.hive_sync.jdbcurl	jdbc:hive2://localhost:10000	Hive metastore url `Config Param: HIVE_URL`
hoodie.datasource.hive_sync.metastore.uris	thrift://localhost:9083	Hive metastore url `Config Param: METASTORE_URIS`
hoodie.datasource.meta.sync.enable	false	Enable Syncing the Hudi Table with an external meta store or data catalog. `Config Param: META_SYNC_ENABLED`
hoodie.datasource.write.hive_style_partitioning	false	Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING`
hoodie.datasource.write.operation	upsert	Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and from then on use upsert/insert. bulk_insert uses a disk-based write path to scale to large inputs without needing to cache them. `Config Param: OPERATION`
hoodie.datasource.write.table.type	COPY_ON_WRITE	The table type for the underlying data, for this write. This can’t change between writes. `Config Param: TABLE_TYPE`
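
The write options above are usually supplied together. The sketch below assembles them into one dictionary and rejects values not named on this page. The helper `write_options` and the sample field names are assumptions made for illustration; the validation sets cover only the operations and table types listed here, not everything Hudi may accept.

```python
# Only the values named on this page; Hudi may accept others.
OPERATIONS = {"upsert", "insert", "bulk_insert"}
TABLE_TYPES = {"COPY_ON_WRITE", "MERGE_ON_READ"}


def write_options(table_name, record_key, precombine, partition_path,
                  operation="upsert", table_type="COPY_ON_WRITE"):
    """Build Spark datasource write options from the configs in the table."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation for this sketch: {operation}")
    if table_type not in TABLE_TYPES:
        raise ValueError(f"unknown table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.table.type": table_type,
    }


# Hypothetical table and column names.
opts = write_options("trips", "uuid", "ts", "city", operation="bulk_insert")
print(opts["hoodie.datasource.write.operation"])   # bulk_insert
print(opts["hoodie.datasource.write.table.type"])  # COPY_ON_WRITE

# Usage with Spark (requires a session with Hudi on the classpath):
# df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Because the table type cannot change between writes, keeping it in one shared helper like this is one way to avoid passing different values by accident.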

Config Name	Default	Description
hoodie.database.name	(N/A)	Database name to register to Hive metastore `Config Param: DATABASE_NAME`
hoodie.table.name	(N/A)	Table name to register to Hive metastore `Config Param: TABLE_NAME`
path	(N/A)	Base path for the target hoodie table. The path is created if it does not exist; otherwise, a Hudi table is expected to have been initialized there successfully. `Config Param: PATH`
read.commits.limit	(N/A)	The maximum number of commits allowed to be read in each instant check. For a streaming read, the average number of instants read per second is 'read.commits.limit'/'read.streaming.check-interval'. By default there is no limit. `Config Param: READ_COMMITS_LIMIT`
read.end-commit	(N/A)	End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss' `Config Param: READ_END_COMMIT`
read.start-commit	(N/A)	Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read `Config Param: READ_START_COMMIT`
archive.max_commits	50	Max number of commits to keep before archiving older commits into a sequential log, default 50 `Config Param: ARCHIVE_MAX_COMMITS`
archive.min_commits	40	Min number of commits to keep before archiving older commits into a sequential log, default 40 `Config Param: ARCHIVE_MIN_COMMITS`
cdc.enabled	false	When enabled, persists the change data if necessary so that the table can be queried in CDC query mode. `Config Param: CDC_ENABLED`
cdc.supplemental.logging.mode	DATA_BEFORE_AFTER	Setting 'op_key_only' persists the 'op' and the record key only, setting 'data_before' persists the additional 'before' image, and setting 'data_before_after' persists the additional 'before' and 'after' images. `Config Param: SUPPLEMENTAL_LOGGING_MODE`
changelog.enabled	false	Whether to keep all the intermediate changes. When enabled, Hudi tries to keep all the changes of a record: 1) the sink accepts the UPDATE_BEFORE message; 2) the source tries to emit every change of a record. The semantics are best effort because the compaction job eventually merges all changes of a record into one. Defaults to false, which gives UPSERT semantics. `Config Param: CHANGELOG_ENABLED`
clean.async.enabled	true	Whether to clean up the old commits immediately on new commits, enabled by default `Config Param: CLEAN_ASYNC_ENABLED`
clean.retain_commits	30	Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30 `Config Param: CLEAN_RETAIN_COMMITS`
clustering.async.enabled	false	Async Clustering, default false `Config Param: CLUSTERING_ASYNC_ENABLED`
clustering.plan.strategy.small.file.limit	600	Files smaller than the size specified here are candidates for clustering, default 600 MB `Config Param: CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT`
clustering.plan.strategy.target.file.max.bytes	1073741824	Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB `Config Param: CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES`
compaction.async.enabled	true	Async Compaction, enabled by default for MOR `Config Param: COMPACTION_ASYNC_ENABLED`
compaction.delta_commits	5	Max delta commits needed to trigger compaction, default 5 commits `Config Param: COMPACTION_DELTA_COMMITS`
hive_sync.enabled	false	Asynchronously sync Hive meta to HMS, default false `Config Param: HIVE_SYNC_ENABLED`
hive_sync.jdbc_url	jdbc:hive2://localhost:10000	Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000' `Config Param: HIVE_SYNC_JDBC_URL`
hive_sync.metastore.uris		Metastore uris for hive sync, default '' `Config Param: HIVE_SYNC_METASTORE_URIS`
hive_sync.mode	HMS	Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms' `Config Param: HIVE_SYNC_MODE`
hoodie.datasource.query.type	snapshot	Decides how data files need to be read: 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) Incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data). Default: snapshot `Config Param: QUERY_TYPE`
hoodie.datasource.write.hive_style_partitioning	false	Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING`
hoodie.datasource.write.partitionpath.field		Partition path field. Value to be used at the `partitionPath` component of `HoodieKey`. Actual value obtained by invoking .toString(), default '' `Config Param: PARTITION_PATH_FIELD`
hoodie.datasource.write.recordkey.field	uuid	Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: RECORD_KEY_FIELD`
index.type	FLINK_STATE	Index type of Flink write job, default is using state backed index. `Config Param: INDEX_TYPE`
lookup.join.cache.ttl	PT1H	The cache TTL (e.g. 10min) for the build table in lookup join. `Config Param: LOOKUP_JOIN_CACHE_TTL`
metadata.compaction.delta_commits	10	Max delta commits for metadata table to trigger compaction, default 10 `Config Param: METADATA_COMPACTION_DELTA_COMMITS`
metadata.enabled	true	Enable the internal metadata table, which serves table metadata such as file listings, default enabled `Config Param: METADATA_ENABLED`
precombine.field	ts	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD`
read.streaming.enabled	false	Whether to read as streaming source, default false `Config Param: READ_AS_STREAMING`
read.streaming.skip_insertoverwrite	false	Whether to skip insert overwrite instants to avoid reading base files of insert overwrite operations for streaming read. In streaming scenarios, insert overwrite is usually used to repair data, here you can control the visibility of downstream streaming read. `Config Param: READ_STREAMING_SKIP_INSERT_OVERWRITE`
table.type	COPY_ON_WRITE	Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ `Config Param: TABLE_TYPE`
write.operation	upsert	The write operation, that this write should do `Config Param: OPERATION`
write.parquet.max.file.size	120	Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. `Config Param: WRITE_PARQUET_MAX_FILE_SIZE`
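
In Flink SQL, these options are given in the WITH clause of a table definition. The sketch below only renders such a statement as text; the helper `hudi_flink_ddl`, the table name, the columns, and the path are all hypothetical. The column names `uuid` and `ts` are chosen to match the documented defaults for the record key and precombine fields.

```python
def hudi_flink_ddl(table, columns, path, options=None):
    """Render a Flink SQL CREATE TABLE statement for a Hudi table."""
    column_sql = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    # 'connector' and 'path' come first, followed by any extra options.
    with_options = {"connector": "hudi", "path": path, **(options or {})}
    option_sql = ",\n".join(
        f"  '{key}' = '{value}'" for key, value in with_options.items()
    )
    return f"CREATE TABLE {table} (\n{column_sql}\n) WITH (\n{option_sql}\n)"


ddl = hudi_flink_ddl(
    "trips",
    [("uuid", "STRING"), ("rider", "STRING"), ("ts", "TIMESTAMP(3)")],
    "file:///tmp/hudi/trips",
    {
        "table.type": "MERGE_ON_READ",
        "precombine.field": "ts",
        "compaction.delta_commits": "5",
    },
)
print(ddl)
```

Options left out of the WITH clause fall back to the defaults listed in the table, so the statement only needs the values that differ.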

Config Name	Default	Description
hoodie.index.name	(N/A)	Name of the expression index. This is also used for the partition name in the metadata table. `Config Param: SECONDARY_INDEX_NAME` `Since Version: 1.0.0`
hoodie.index.name	(N/A)	Name of the expression index. This is also used for the partition name in the metadata table. `Config Param: EXPRESSION_INDEX_NAME` `Since Version: 1.0.0`
hoodie.metadata.index.drop	(N/A)	Drop the specified index. The value should be the name of the index to delete. You can check index names using the `SHOW INDEXES` command. The index name must either match exactly or start with one of the following: files, column_stats, bloom_filters, record_index, expr_index_, secondary_index_, partition_stats `Config Param: DROP_METADATA_INDEX` `Since Version: 1.0.1`
hoodie.expression.index.type	COLUMN_STATS	Type of the expression index. Default is `column_stats` if there are no functions and expressions in the command. Valid options could be BITMAP, COLUMN_STATS, LUCENE, etc. If index_type is not provided and there are functions or expressions in the command, then an expression index using column stats will be created. `Config Param: EXPRESSION_INDEX_TYPE` `Since Version: 1.0.0`
hoodie.metadata.enable	true	Enable the internal metadata table, which serves table metadata such as file listings `Config Param: ENABLE` `Since Version: 0.7.0`
hoodie.metadata.index.bloom.filter.enable	false	Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups. `Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.enable	false	Enable indexing column ranges of user data files under the metadata table. When enabled, the metadata table will have a partition to store the column ranges, which will be used for pruning files during index lookups. `Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS` `Since Version: 0.11.0`
hoodie.metadata.index.expression.enable	false	Enable expression index within the metadata table. When this configuration property is enabled (`true`), the Hudi writer automatically keeps all expression indexes consistent with the data table. When disabled (`false`), all expression indexes are deleted. Note that an individual expression index can only be created through a `CREATE INDEX` statement and deleted through a `DROP INDEX` statement in Spark SQL. `Config Param: EXPRESSION_INDEX_ENABLE_PROP` `Since Version: 1.0.0`
hoodie.metadata.index.partition.stats.enable	false	Enable aggregating stats for each column at the storage partition level. Enabling this can improve query performance by leveraging partition and column stats for (partition) filtering. Important: The default value for this configuration is dynamically set based on the effective value of hoodie.metadata.index.column.stats.enable. If column stats index is enabled (default for Spark engine), partition stats indexing will also be enabled by default. Conversely, if column stats indexing is disabled (default for Flink and Java engines), partition stats indexing will also be disabled by default. `Config Param: ENABLE_METADATA_INDEX_PARTITION_STATS` `Since Version: 1.0.0`
hoodie.metadata.index.secondary.enable	true	Enable secondary index within the metadata table. When this configuration property is enabled (`true`), the Hudi writer automatically keeps all secondary indexes consistent with the data table. When disabled (`false`), all secondary indexes are deleted. Note that an individual secondary index can only be created through a `CREATE INDEX` statement and deleted through a `DROP INDEX` statement in Spark SQL. `Config Param: SECONDARY_INDEX_ENABLE_PROP` `Since Version: 1.0.0`

Config Name	Default	Description
hoodie.parquet.compression.codec	gzip	Compression Codec for parquet files `Config Param: PARQUET_COMPRESSION_CODEC_NAME`
hoodie.parquet.max.file.size	125829120	Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. `Config Param: PARQUET_MAX_FILE_SIZE`
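
Since hoodie.parquet.max.file.size is expressed in bytes, a small conversion helper makes the values easier to read and set. This is plain arithmetic; the helper names are made up for the example.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib):
    """Convert a size in MiB to the byte value used by the config."""
    return mib * MIB


def bytes_to_mib(size_in_bytes):
    """Convert a byte value back to MiB for readability."""
    return size_in_bytes / MIB


print(bytes_to_mib(125829120))  # 120.0, the default shown above
print(mib_to_bytes(256))        # 268435456, a value for a 256 MiB target
```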

Config Name	Default	Description
hoodie.keep.max.commits	30	Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. `Config Param: MAX_COMMITS_TO_KEEP`
hoodie.keep.min.commits	20	Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. `Config Param: MIN_COMMITS_TO_KEEP`
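
How the two archival bounds interact can be sketched with a simplified model: nothing is archived until the active timeline grows beyond the maximum, and then the oldest instants are archived until only the minimum remain. The function below is an illustrative assumption, not Hudi's archival code, and it ignores other factors that may hold instants back from archiving.

```python
def instants_to_archive(active_instants, keep_max=30, keep_min=20):
    """Return the oldest instants to archive (input is ordered oldest first)."""
    if keep_min > keep_max:
        raise ValueError("keep_min must not exceed keep_max")
    if len(active_instants) <= keep_max:
        return []  # still within the allowed maximum, nothing to do
    # Archive from the oldest end until keep_min instants stay active.
    return active_instants[: len(active_instants) - keep_min]


timeline = [f"instant-{i:03d}" for i in range(1, 32)]  # 31 instants
archived = instants_to_archive(timeline)
print(len(archived), archived[0], archived[-1])  # 11 instant-001 instant-011
```

In this model the active timeline oscillates between the two bounds, which keeps archival from running after every single write.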

Config Name	Default	Description
hoodie.clean.async.enabled	false	Only applies when hoodie.clean.automatic is turned on. When turned on, runs the cleaner asynchronously with writing, which can speed up overall write performance. `Config Param: ASYNC_CLEAN`
hoodie.clean.commits.retained	10	When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. `Config Param: CLEANER_COMMITS_RETAINED`
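
The retention formula in the description of hoodie.clean.commits.retained, num_of_commits * time_between_commits, can be worked through directly. The 30-minute commit interval below is a made-up example value.

```python
from datetime import timedelta


def retention_window(commits_retained, time_between_commits):
    """Approximate how far back incremental queries can reach."""
    return commits_retained * time_between_commits


# Default of 10 retained commits with a hypothetical commit every 30 minutes.
window = retention_window(10, timedelta(minutes=30))
print(window)  # 5:00:00
```

With these numbers, an incremental consumer that falls more than about five hours behind may find that the data it needs has already been cleaned.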

Config Name	Default	Description
hoodie.compact.inline	false	When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path. `Config Param: INLINE_COMPACT`
hoodie.compact.inline.max.delta.commits	5	Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This config takes effect only for the compaction triggering strategy based on the number of commits, i.e., NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, NUM_AND_TIME, and NUM_OR_TIME. `Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS`
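
For the commit-count-based trigger, the effect of hoodie.compact.inline.max.delta.commits can be modeled as a counter that resets after each compaction. This sketch assumes inline compaction is checked after every write and that compaction is scheduled once the count reaches the threshold; the function name is hypothetical.

```python
def compaction_points(total_delta_commits, max_delta_commits=5):
    """Return the delta-commit numbers after which compaction is scheduled."""
    points = []
    since_last_compaction = 0
    for commit_number in range(1, total_delta_commits + 1):
        since_last_compaction += 1
        if since_last_compaction >= max_delta_commits:
            points.append(commit_number)
            since_last_compaction = 0  # the count restarts after compaction
    return points


print(compaction_points(12))                       # [5, 10]
print(compaction_points(7, max_delta_commits=3))   # [3, 6]
```

A lower threshold compacts more often, which keeps fewer log files to merge at read time at the cost of more work on the write path.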

Config Name	Default	Description
hoodie.errortable.base.path	(N/A)	Base path for error table under which all error records would be stored. `Config Param: ERROR_TABLE_BASE_PATH`
hoodie.errortable.target.table.name	(N/A)	Table name to be used for the error table `Config Param: ERROR_TARGET_TABLE`
hoodie.errortable.write.class	(N/A)	Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this config `Config Param: ERROR_TABLE_WRITE_CLASS`
hoodie.errortable.enable	false	Config to enable the error table. If enabled, all records with processing errors in DeltaStreamer are transferred to the error table. `Config Param: ERROR_TABLE_ENABLED`
hoodie.errortable.insert.shuffle.parallelism	200	Config to set insert shuffle parallelism. The config is similar to hoodie.insert.shuffle.parallelism config but applies to the error table. `Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE`
hoodie.errortable.source.rdd.persist	false	When enabled, persists the sourceRDD to disk, which helps speed up processing of the data table + error table write DAG. `Config Param: ERROR_TABLE_PERSIST_SOURCE_RDD`
hoodie.errortable.upsert.shuffle.parallelism	200	Config to set upsert shuffle parallelism. The config is similar to hoodie.upsert.shuffle.parallelism config but applies to the error table. `Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE`
hoodie.errortable.validate.recordcreation.enable	true	Records that fail to be created due to keygeneration failure or other issues will be sent to the Error Table `Config Param: ERROR_ENABLE_VALIDATE_RECORD_CREATION` `Since Version: 0.15.0`
hoodie.errortable.validate.targetschema.enable	false	Records with schema mismatch with Target Schema are sent to Error Table. `Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA`
hoodie.errortable.write.failure.strategy	ROLLBACK_COMMIT	The config specifies the failure strategy if error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered) , LOG_ERROR (Error is logged but the base table write succeeds) ] `Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY`
hoodie.errortable.write.union.enable	false	Enable error table union with data table when writing for improved commit performance. By default it is disabled meaning data table and error table writes are sequential `Config Param: ENABLE_ERROR_TABLE_WRITE_UNIFICATION`
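
As a sketch of how the error table settings fit together, the helper below builds a properties fragment that enables the error table and checks the failure strategy against the two values listed above. The helper name, the sample path, and the sample table name are assumptions; treating the base path and table name as mandatory is a choice made for this example because they have no default.

```python
FAILURE_STRATEGIES = {"ROLLBACK_COMMIT", "LOG_ERROR"}


def error_table_properties(base_path, table_name,
                           failure_strategy="ROLLBACK_COMMIT"):
    """Render error table settings as key=value lines."""
    if not base_path or not table_name:
        raise ValueError("base path and table name have no default value")
    if failure_strategy not in FAILURE_STRATEGIES:
        raise ValueError(f"unknown failure strategy: {failure_strategy}")
    props = {
        "hoodie.errortable.enable": "true",
        "hoodie.errortable.base.path": base_path,
        "hoodie.errortable.target.table.name": table_name,
        "hoodie.errortable.write.failure.strategy": failure_strategy,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical path and table name.
text = error_table_properties("s3://bucket/errors/trips", "trips_errors",
                              failure_strategy="LOG_ERROR")
print(text)
```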

Config Name	Default	Description
hoodie.base.path	(N/A)	Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g., hdfs://, s3://, etc.). Hudi stores all the main metadata about commits, savepoints, cleaning audit logs, etc. in the .hoodie directory under this base path directory. `Config Param: BASE_PATH`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD_NAME`
hoodie.table.name	(N/A)	Table name that will be used for registering with metastores like HMS. Needs to be same across runs. `Config Param: TBL_NAME`
hoodie.write.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.fail.job.on.duplicate.data.file.detection	false	If enabled, the entire job fails when an invalid data file is detected. `Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION`
hoodie.instant_state.timeline_server_based.enabled	false	If enabled, writers get instant state from the timeline server rather than requesting DFS directly. `Config Param: INSTANT_STATE_TIMELINE_SERVER_BASED` `Since Version: 1.0.0`
hoodie.instant_state.timeline_server_based.force_refresh.request.number	100	Number of requests that triggers a refresh of the instant state cache. `Config Param: INSTANT_STATE_TIMELINE_SERVER_BASED_FORCE_REFRESH_REQUEST_NUMBER` `Since Version: 1.0.0`
hoodie.write.auto.upgrade	true	If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower. `Config Param: AUTO_UPGRADE_VERSION` `Since Version: 1.0.0`
hoodie.write.concurrency.mode	SINGLE_WRITER	org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER (default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group, with the conflicts resolved automatically by the query reader and the compactor. `Config Param: WRITE_CONCURRENCY_MODE`
hoodie.write.table.version	8	The table version this writer is storing the table in. This should match the current table version. `Config Param: WRITE_TABLE_VERSION` `Since Version: 1.0.0`
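
The difference between the COMMIT_TIME_ORDERING and EVENT_TIME_ORDERING merge modes described above can be illustrated with a small, self-contained Python sketch. It does not use Hudi itself: the `Record` class, the `merge` function and the tie-breaking rule (the incoming record wins on equal times) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    event_time: int   # plays the role of the precombine (ordering) field
    commit_time: int  # hypothetical stand-in for the time of the writing transaction


def merge(existing: Record, incoming: Record, mode: str) -> Record:
    """Pick the surviving record for one key under the given merge mode."""
    if existing.key != incoming.key:
        raise ValueError("merge only applies to records with the same key")
    if mode == "COMMIT_TIME_ORDERING":
        # The record from the later transaction overwrites the earlier one.
        return incoming if incoming.commit_time >= existing.commit_time else existing
    if mode == "EVENT_TIME_ORDERING":
        # The record with the larger event time wins, regardless of commit time.
        return incoming if incoming.event_time >= existing.event_time else existing
    raise ValueError(f"mode not covered by this sketch: {mode}")


# A late-arriving update: written later, but carrying an older event time.
old = Record("id-1", "first", event_time=200, commit_time=1)
late = Record("id-1", "late-arriving", event_time=100, commit_time=2)

print(merge(old, late, "COMMIT_TIME_ORDERING").value)  # late-arriving
print(merge(old, late, "EVENT_TIME_ORDERING").value)   # first
```

With commit time ordering the late-arriving update replaces the stored record; with event time ordering the stored record survives because its event time is larger.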

Config Name	Default	Description
hoodie.expression.index.function	(N/A)	Function to be used for building the expression index. `Config Param: INDEX_FUNCTION` `Since Version: 1.0.0`
hoodie.index.name	(N/A)	Name of the expression index. This is also used for the partition name in the metadata table. `Config Param: INDEX_NAME` `Since Version: 1.0.0`
hoodie.table.checksum	(N/A)	Index definition checksum is used to guard against partial writes in HDFS. It is added as the last entry in index.properties and then used to validate while reading table config. `Config Param: INDEX_DEFINITION_CHECKSUM` `Since Version: 1.0.0`
hoodie.expression.index.type	COLUMN_STATS	Type of the expression index. Default is `column_stats` if there are no functions and expressions in the command. Valid options could be BITMAP, COLUMN_STATS, LUCENE, etc. If index_type is not provided and there are functions or expressions in the command, then an expression index using column stats will be created. `Config Param: INDEX_TYPE` `Since Version: 1.0.0`

Config Name	Default	Description
hoodie.index.type	(N/A)	org.apache.hudi.index.HoodieIndex$IndexType: Determines how input records are indexed, i.e., looked up based on the key for the location in the existing table. Default is SIMPLE on Spark engine, and INMEMORY on Flink and Java engines. HBASE: Uses an externally managed Apache HBase table to store the record key to location mapping. HBase index is a global index, enforcing key uniqueness across all partitions in the table. INMEMORY: Uses an in-memory hashmap in Spark and Java engines, and Flink in-memory state in Flink, for indexing. BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced inside partitions. GLOBAL_BLOOM: Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Key uniqueness is enforced across all partitions in the table. SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage. Key uniqueness is enforced inside partitions. GLOBAL_SIMPLE: Performs a lean join of the incoming update/delete records against keys extracted from the table on storage. Key uniqueness is enforced across all partitions in the table. BUCKET: Quickly locates the file group containing the record by using bucket hashing, which is particularly beneficial at large scale. Use `hoodie.index.bucket.engine` to choose the bucket engine type, i.e., how buckets are generated. FLINK_STATE: Internal config for indexing based on Flink state. RECORD_INDEX: Index which saves the record key to location mappings in the Hudi metadata table. Record index is a global index, enforcing key uniqueness across all partitions in the table. Supports sharding to achieve very high scale. `Config Param: INDEX_TYPE`
hoodie.bucket.index.query.pruning	true	Controls whether queries on a table with a bucket index use bucket pruning. `Config Param: BUCKET_QUERY_INDEX`
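
As a rough illustration of the bucket hashing idea behind the BUCKET index type, the sketch below maps record keys to a fixed number of buckets. It is not Hudi's implementation: the function names, the use of CRC32 as the hash, and the bucket count are all assumptions chosen so the example is deterministic and self-contained.

```python
import zlib
from typing import Dict, List


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket id in the range [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    # CRC32 is used only because it is stable across runs; it is an
    # assumption of this sketch, not the hash function Hudi uses.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def group_by_bucket(record_keys: List[str], num_buckets: int) -> Dict[int, List[str]]:
    """Group incoming record keys by the bucket they hash to."""
    groups: Dict[int, List[str]] = {}
    for key in record_keys:
        groups.setdefault(bucket_for(key, num_buckets), []).append(key)
    return groups


keys = ["user-1", "user-2", "user-3", "user-1"]
groups = group_by_bucket(keys, num_buckets=4)
for bucket_id in sorted(groups):
    print(bucket_id, groups[bucket_id])
```

Because the same key always hashes to the same bucket, a lookup or an equality filter on the key only needs to consider one bucket instead of scanning every file group, which is the intuition behind bucket-based query pruning.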

Config Name	Default	Description
hoodie.datasource.meta.sync.glue.partition_index_fields		Specify the partition fields to index on AWS Glue. Separate the fields of one index with semicolons. By default, when the feature is enabled, all the partition fields will be indexed. You can create up to three indexes; separate them with commas. E.g.: col1;col2;col3,col2,col3 `Config Param: META_SYNC_PARTITION_INDEX_FIELDS` `Since Version: 0.15.0`
hoodie.datasource.meta.sync.glue.partition_index_fields.enable	false	Enable the AWS Glue partition index feature to speed up partition-based query patterns. `Config Param: META_SYNC_PARTITION_INDEX_FIELDS_ENABLE` `Since Version: 0.15.0`
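
The value format of `hoodie.datasource.meta.sync.glue.partition_index_fields` (commas between indexes, semicolons between the fields of one index, at most three indexes) can be checked before a job is submitted. The helper below is a hypothetical sketch of such a check, not part of Hudi; the function name and the error handling are assumptions.

```python
from typing import List


def parse_partition_index_fields(value: str, max_indexes: int = 3) -> List[List[str]]:
    """Split the config value into a list of indexes, each a list of field names."""
    if not value.strip():
        # An empty value is the default: no explicit index definition.
        return []
    indexes = []
    for index_spec in value.split(","):
        fields = [f.strip() for f in index_spec.split(";") if f.strip()]
        if fields:
            indexes.append(fields)
    if len(indexes) > max_indexes:
        raise ValueError(f"at most {max_indexes} indexes are allowed, got {len(indexes)}")
    return indexes


# The example value from the table above defines three indexes.
print(parse_partition_index_fields("col1;col2;col3,col2,col3"))
# [['col1', 'col2', 'col3'], ['col2'], ['col3']]
```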

Config Name	Default	Description
hoodie.metrics.on	false	Turn metrics reporting on or off. Off by default. `Config Param: TURN_METRICS_ON` `Since Version: 0.5.0`
hoodie.metrics.reporter.type	GRAPHITE	Type of metrics reporter. `Config Param: METRICS_REPORTER_TYPE_VALUE` `Since Version: 0.5.0`
hoodie.metricscompaction.log.blocks.on	false	Turn metrics reporting for log blocks with compaction commit on or off. Off by default. `Config Param: TURN_METRICS_COMPACTION_LOG_BLOCKS_ON` `Since Version: 0.14.0`

Config Name	Default	Description
hoodie.metrics.m3.env	production	M3 tag to label the environment (defaults to 'production'), applied to all metrics. `Config Param: M3_ENV` `Since Version: 0.15.0`
hoodie.metrics.m3.host	localhost	M3 host to connect to. `Config Param: M3_SERVER_HOST_NAME` `Since Version: 0.15.0`
hoodie.metrics.m3.port	9052	M3 port to connect to. `Config Param: M3_SERVER_PORT_NUM` `Since Version: 0.15.0`
hoodie.metrics.m3.service	hoodie	M3 tag to label the service name (defaults to 'hoodie'), applied to all metrics. `Config Param: M3_SERVICE` `Since Version: 0.15.0`
hoodie.metrics.m3.tags		Optional M3 tags applied to all metrics. `Config Param: M3_TAGS` `Since Version: 0.15.0`
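
As a sketch of how the metrics settings above might be assembled into one options map, the helper below collects the M3 keys from this table together with `hoodie.metrics.on` and `hoodie.metrics.reporter.type`. The function name is hypothetical, and the reporter type value `M3` is an assumption; check the All Configurations page for the values your Hudi version accepts.

```python
from typing import Dict


def m3_metrics_options(
    host: str = "localhost",
    port: int = 9052,
    service: str = "hoodie",
    env: str = "production",
) -> Dict[str, str]:
    """Build a string-to-string options map; defaults mirror the table above."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return {
        "hoodie.metrics.on": "true",
        # Assumption: "M3" is the reporter type value that selects the M3 reporter.
        "hoodie.metrics.reporter.type": "M3",
        "hoodie.metrics.m3.host": host,
        "hoodie.metrics.m3.port": str(port),
        "hoodie.metrics.m3.service": service,
        "hoodie.metrics.m3.env": env,
    }


opts = m3_metrics_options(host="m3.example.internal", env="staging")  # hypothetical host
for key, value in opts.items():
    print(f"{key}={value}")
```

A map like this could then be passed along with the other write options, for example through the `options(**opts)` call of a Spark DataFrame writer.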

Config Name	Default	Description
hoodie.streamer.transformer.sql	(N/A)	SQL Query to be executed during write `Config Param: TRANSFORMER_SQL`
hoodie.streamer.transformer.sql.file	(N/A)	File with a SQL script to be executed during write `Config Param: TRANSFORMER_SQL_FILE`

Config Name	Default	Description
hoodie.streamer.source.kafka.topic	(N/A)	Kafka topic name. `Config Param: KAFKA_TOPIC_NAME`
hoodie.streamer.source.kafka.proto.value.deserializer.class	org.apache.kafka.common.serialization.ByteArrayDeserializer	Kafka Proto Payload Deserializer Class `Config Param: KAFKA_PROTO_VALUE_DESERIALIZER_CLASS` `Since Version: 0.15.0`

Config Name	Default	Description
hoodie.streamer.source.pulsar.topic	(N/A)	Name of the target Pulsar topic to source data from `Config Param: PULSAR_SOURCE_TOPIC_NAME`
hoodie.streamer.source.pulsar.endpoint.admin.url	http://localhost:8080	Admin URL of the target Pulsar endpoint (the default has the form 'http://host:port'). `Config Param: PULSAR_SOURCE_ADMIN_ENDPOINT_URL`
hoodie.streamer.source.pulsar.endpoint.service.url	pulsar://localhost:6650	Service URL of the target Pulsar endpoint (of the form 'pulsar://host:port'). `Config Param: PULSAR_SOURCE_SERVICE_ENDPOINT_URL`

Config Name	Default	Description
hoodie.streamer.schemaprovider.registry.targetUrl	(N/A)	Schema registry URL for the schema of the target you are writing to, e.g. https://foo:bar@schemaregistry.org `Config Param: TARGET_SCHEMA_REGISTRY_URL`
hoodie.streamer.schemaprovider.registry.url	(N/A)	Schema registry URL for the schema of the source you are reading from, e.g. https://foo:bar@schemaregistry.org `Config Param: SRC_SCHEMA_REGISTRY_URL`
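
Hudi Streamer settings such as the Kafka topic and the schema registry URLs are plain key/value pairs, so they can be kept in a properties-style file. The sketch below renders such a file from a dictionary. The helper name, the topic name and the registry URL are hypothetical placeholders; only the config keys come from the tables above.

```python
from typing import Dict


def render_properties(props: Dict[str, str]) -> str:
    """Render key/value pairs as 'key=value' lines, one per line."""
    lines = []
    for key, value in props.items():
        if not key or "=" in key or "\n" in value:
            raise ValueError(f"cannot render entry: {key!r}")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


streamer_props = {
    # Hypothetical values; replace with your own topic and registry endpoint.
    "hoodie.streamer.source.kafka.topic": "orders",
    "hoodie.streamer.schemaprovider.registry.url": "https://schemaregistry.example.org/subjects/orders-value/versions/latest",
}

text = render_properties(streamer_props)
print(text, end="")
```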

File-based Schema Provider Configs

Config Name	Default	Description
hoodie.streamer.schemaprovider.source.schema.file	(N/A)	File containing the schema of the source you are reading from `Config Param: SOURCE_SCHEMA_FILE`
hoodie.streamer.schemaprovider.target.schema.file	(N/A)	File containing the schema of the target you are writing to `Config Param: TARGET_SCHEMA_FILE`