Version: 1.0.2

All Configurations

This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels.

Hudi Table Config: Basic Hudi Table configuration parameters.
Environment Config: Hudi supports passing configurations via a configuration file hudi-defaults.conf in which each line consists of a key and a value separated by whitespace or = sign. For example:

hoodie.datasource.hive_sync.mode               jdbc
hoodie.datasource.hive_sync.jdbcurl            jdbc:hive2://localhost:10000
hoodie.datasource.hive_sync.support_timestamp  false

It helps to have a central configuration file for your common cross job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements.

Hudi always loads the configuration file under default directory file:/etc/hudi/conf, if exists, to set the default configs. Besides, you can specify another configuration directory location by setting the HUDI_CONF_DIR environment variable. The configs stored in HUDI_CONF_DIR/hudi-defaults.conf are loaded, overriding any configs already set by the config file in the default directory.

Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction or choosing query type to read.
Write Client Configs: Internally, the Hudi datasource uses a RDD based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time-time these configs may need to be tweaked to optimize for specific workloads.
Reader Configs: Please fill in the description for Config Group Name: Reader Configs
Metastore and Catalog Sync Configs: Configurations used by the Hudi to sync metadata to external metastores and catalogs.
Metrics Configs: These set of configs are used to enable monitoring and reporting of key Hudi stats and metrics.
Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on incoming new record and stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload which simply update table with the latest/last-written record. This can be overridden to a custom class extending HoodieRecordPayload class, on both datasource and WriteClient levels.
Kafka Connect Configs: These set of configs are used for Kafka Connect Sink Connector for writing Hudi Tables
Amazon Web Services Configs: Configurations specific to Amazon Web Services.
Hudi Streamer Configs: These set of configs are used for Hudi Streamer utility which provides the way to ingest from different sources such as DFS or Kafka.

note

In the tables below (N/A) means there is no default value set

Externalized Config File

Instead of directly passing configuration settings to every Hudi job, you can also centrally set them in a configuration file hudi-defaults.conf. By default, Hudi would load the configuration file under /etc/hudi/conf directory. You can specify a different configuration directory location by setting the HUDI_CONF_DIR environment variable. This can be useful for uniformly enforcing repeated configs (like Hive sync or write/index tuning), across your entire data lake.

Hudi Table Config

Basic Hudi Table configuration parameters.

Hudi Table Basic Configs

Configurations of the Hudi Table like type of ingestion, storage formats, hive table name etc. Configurations are loaded from hoodie.properties, these properties are usually set during initializing a path as hoodie base path and never changes during the lifetime of a hoodie table.

Basic Configs

Config Name	Default	Description
hoodie.bootstrap.base.path	(N/A)	Base path of the dataset that needs to be bootstrapped as a Hudi table `Config Param: BOOTSTRAP_BASE_PATH`
hoodie.compaction.payload.class	(N/A)	Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file. `Config Param: PAYLOAD_CLASS_NAME`
hoodie.database.name	(N/A)	Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database `Config Param: DATABASE_NAME`
hoodie.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.record.merge.strategy.id	(N/A)	Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id `Config Param: RECORD_MERGE_STRATEGY_ID` `Since Version: 0.13.0`
hoodie.table.checksum	(N/A)	Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config. `Config Param: TABLE_CHECKSUM` `Since Version: 0.11.0`
hoodie.table.create.schema	(N/A)	Schema used when creating the table `Config Param: CREATE_SCHEMA`
hoodie.table.index.defs.path	(N/A)	Relative path to table base path where the index definitions are stored `Config Param: RELATIVE_INDEX_DEFINITION_PATH` `Since Version: 1.0.0`
hoodie.table.keygenerator.class	(N/A)	Key Generator class property for the hoodie table `Config Param: KEY_GENERATOR_CLASS_NAME`
hoodie.table.keygenerator.type	(N/A)	Key Generator type to determine key generator class `Config Param: KEY_GENERATOR_TYPE` `Since Version: 1.0.0`
hoodie.table.metadata.partitions	(N/A)	Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers `Config Param: TABLE_METADATA_PARTITIONS` `Since Version: 0.11.0`
hoodie.table.metadata.partitions.inflight	(N/A)	Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers. `Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT` `Since Version: 0.11.0`
hoodie.table.name	(N/A)	Table name that will be used for registering with Hive. Needs to be same across runs. `Config Param: NAME`
hoodie.table.partition.fields	(N/A)	Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators `Config Param: PARTITION_FIELDS`
hoodie.table.precombine.field	(N/A)	Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. `Config Param: PRECOMBINE_FIELD`
hoodie.table.recordkey.fields	(N/A)	Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey. `Config Param: RECORDKEY_FIELDS`
hoodie.table.secondary.indexes.metadata	(N/A)	The metadata of secondary indexes `Config Param: SECONDARY_INDEXES_METADATA` `Since Version: 0.13.0`
hoodie.timeline.layout.version	(N/A)	Version of timeline used, by the table. `Config Param: TIMELINE_LAYOUT_VERSION`
hoodie.archivelog.folder	archived	path under the meta folder, to store archived timeline instants at. `Config Param: ARCHIVELOG_FOLDER`
hoodie.bootstrap.index.class	org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex	Implementation to use, for mapping base files to bootstrap base file, that contain actual data. `Config Param: BOOTSTRAP_INDEX_CLASS_NAME`
hoodie.bootstrap.index.enable	true	Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true. `Config Param: BOOTSTRAP_INDEX_ENABLE`
hoodie.bootstrap.index.type	HFILE	Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data. `Config Param: BOOTSTRAP_INDEX_TYPE` `Since Version: 1.0.0`
hoodie.datasource.write.hive_style_partitioning	false	Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING_ENABLE`
hoodie.partition.metafile.use.base.format	false	If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files. `Config Param: PARTITION_METAFILE_USE_BASE_FORMAT`
hoodie.populate.meta.fields	true	When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing `Config Param: POPULATE_META_FIELDS`
hoodie.table.base.file.format	PARQUET	Base file format to store all the base file data. `Config Param: BASE_FILE_FORMAT`
hoodie.table.cdc.enabled	false	When enable, persist the change data if necessary, and can be queried as a CDC query mode. `Config Param: CDC_ENABLED` `Since Version: 0.13.0`
hoodie.table.cdc.supplemental.logging.mode	DATA_BEFORE_AFTER	org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs. `Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE` `Since Version: 0.13.0`
hoodie.table.initial.version	EIGHT	Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup. `Config Param: INITIAL_VERSION` `Since Version: 1.0.0`
hoodie.table.log.file.format	HOODIE_LOG	Log format used for the delta logs. `Config Param: LOG_FILE_FORMAT`
hoodie.table.multiple.base.file.formats.enable	false	When set to true, the table can support reading and writing multiple base file formats. `Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE` `Since Version: 1.0.0`
hoodie.table.timeline.timezone	LOCAL	User can set hoodie commit timeline timezone, such as utc, local and so on. local is default `Config Param: TIMELINE_TIMEZONE`
hoodie.table.type	COPY_ON_WRITE	The table type for the underlying data. `Config Param: TYPE`
hoodie.table.version	EIGHT	Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes. `Config Param: VERSION`
hoodie.timeline.history.path	history	path under the meta folder, to store timeline history at. `Config Param: TIMELINE_HISTORY_PATH`
hoodie.timeline.path	timeline	path under the meta folder, to store timeline instants at. `Config Param: TIMELINE_PATH`

Config Name	Default	Description
hoodie.datasource.write.drop.partition.columns	false	When set to true, will not write the partition columns into hudi. By default, false. `Config Param: DROP_PARTITION_COLUMNS`
hoodie.datasource.write.partitionpath.urlencode	false	Should we url encode the partition path value, before creating the folder structure. `Config Param: URL_ENCODE_PARTITIONING`

Config Name	Default	Description
hoodie.datasource.read.begin.instanttime	(N/A)	Required when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to start incrementally pulling data from. The completion time here need not necessarily correspond to an instant on the timeline. New data written with completion_time >= START_COMMIT are fetched out. For e.g: ‘20170901080000’ will get all new data written on or after Sep 1, 2017 08:00AM. `Config Param: START_COMMIT`
hoodie.datasource.read.end.instanttime	(N/A)	Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified. `Config Param: END_COMMIT`
hoodie.datasource.read.incr.table.version	(N/A)	The table version assumed for incremental read `Config Param: INCREMENTAL_READ_TABLE_VERSION`
hoodie.datasource.read.streaming.table.version	(N/A)	The table version assumed for streaming read `Config Param: STREAMING_READ_TABLE_VERSION`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: READ_PRE_COMBINE_FIELD`
hoodie.datasource.query.type	snapshot	Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files) `Config Param: QUERY_TYPE`

Config Name	Default	Description
as.of.instant	(N/A)	The query instant for time travel. Without specified this option, we query the latest snapshot. `Config Param: TIME_TRAVEL_AS_OF_INSTANT`
hoodie.datasource.read.paths	(N/A)	Comma separated list of file paths to read within a Hudi table. `Config Param: READ_PATHS`
hoodie.datasource.merge.type	payload_combine	For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records without merging; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records (combines the two records with same key for base and log file records), then read the left log file records `Config Param: REALTIME_MERGE`
hoodie.datasource.query.incremental.format	latest_state	This config is used alone with the 'incremental' query type.When set to 'latest_state', it returns the latest records' values.When set to 'cdc', it returns the cdc data. `Config Param: INCREMENTAL_FORMAT` `Since Version: 0.13.0`
hoodie.datasource.read.create.filesystem.relation	false	When this is set, the relation created by DefaultSource is for a view representing the result set of the table valued function hudi_filesystem_view(...) `Config Param: CREATE_FILESYSTEM_RELATION` `Since Version: 1.0.0`
hoodie.datasource.read.extract.partition.values.from.path	false	When set to true, values for partition columns (partition values) will be extracted from physical partition path (default Spark behavior). When set to false partition values will be read from the data file (in Hudi partition columns are persisted by default). This config is a fallback allowing to preserve existing behavior, and should not be used otherwise. `Config Param: EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH` `Since Version: 0.11.0`
hoodie.datasource.read.file.index.listing.mode	lazy	Overrides Hudi's file-index implementation's file listing mode: when set to 'eager', file-index will list all partition paths and corresponding file slices w/in them eagerly, during initialization, prior to partition-pruning kicking in, meaning that all partitions will be listed including ones that might be subsequently pruned out; when set to 'lazy', partitions and file-slices w/in them will be listed lazily (ie when they actually accessed, instead of when file-index is initialized) allowing partition pruning to occur before that, only listing partitions that has already been pruned. Please note that, this config is provided purely to allow to fallback to behavior existing prior to 0.13.0 release, and will be deprecated soon after. `Config Param: FILE_INDEX_LISTING_MODE_OVERRIDE` `Since Version: 0.13.0`
hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled	true	Controls whether partition-path prefix analysis is enabled w/in the file-index, allowing to avoid necessity to recursively list deep folder structures of partitioned tables w/ multiple partition columns, by carefully analyzing provided partition-column predicates and deducing corresponding partition-path prefix from them (if possible). `Config Param: FILE_INDEX_LISTING_PARTITION_PATH_PREFIX_ANALYSIS_ENABLED` `Since Version: 0.13.0`
hoodie.datasource.read.incr.fallback.fulltablescan.enable	true	When doing an incremental query whether we should fall back to full table scans if file does not exist. `Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN`
hoodie.datasource.read.incr.filters		For use-cases like DeltaStreamer which reads from Hoodie Incremental table and applies opaque map functions, filters appearing late in the sequence of transformations cannot be automatically pushed down. This option allows setting filters directly on Hoodie Source. `Config Param: PUSH_DOWN_INCR_FILTERS`
hoodie.datasource.read.incr.path.glob		For the use-cases like users only want to incremental pull from certain partitions instead of the full table. This option allows using glob pattern to directly filter on path. `Config Param: INCR_PATH_GLOB`
hoodie.datasource.read.incr.skip_cluster	false	Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read to improve read performance. `Config Param: INCREMENTAL_READ_SKIP_CLUSTER`
hoodie.datasource.read.incr.skip_compact	false	Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance. `Config Param: INCREMENTAL_READ_SKIP_COMPACT`
hoodie.datasource.read.schema.use.end.instanttime	false	Uses end instant schema when incrementally fetched data to. Default: users latest instant schema. `Config Param: INCREMENTAL_READ_SCHEMA_USE_END_INSTANTTIME`
hoodie.datasource.read.table.valued.function.filesystem.relation.subpath		A regex under the table's base path to get file system view information `Config Param: FILESYSTEM_RELATION_ARG_SUBPATH` `Since Version: 1.0.0`
hoodie.datasource.read.table.valued.function.timeline.relation	false	When this is set, the relation created by DefaultSource is for a view representing the result set of the table valued function hudi_query_timeline(...) `Config Param: CREATE_TIMELINE_RELATION` `Since Version: 1.0.0`
hoodie.datasource.read.table.valued.function.timeline.relation.archived	false	When this is set, the result set of the table valued function hudi_query_timeline(...) will include archived timeline `Config Param: TIMELINE_RELATION_ARG_ARCHIVED_TIMELINE` `Since Version: 1.0.0`
hoodie.datasource.streaming.startOffset	earliest	Start offset to pull data from hoodie streaming source. allow earliest, latest, and specified start instant time `Config Param: START_OFFSET` `Since Version: 0.13.0`
hoodie.enable.data.skipping	true	Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over files `Config Param: ENABLE_DATA_SKIPPING` `Since Version: 0.10.0`
hoodie.file.index.enable	true	Enables use of the spark file index implementation for Hudi, that speeds up listing of large tables. `Config Param: ENABLE_HOODIE_FILE_INDEX`
hoodie.read.timeline.holes.resolution.policy	FAIL	When doing incremental queries, there could be hollow commits (requested or inflight commits that are not the latest) that are produced by concurrent writers and could lead to potential data loss. This config allows users to have different ways of handling this situation. The valid values are [FAIL, BLOCK, USE_TRANSITION_TIME]: Use `FAIL` to throw an exception when hollow commit is detected. This is helpful when hollow commits are not expected. Use `BLOCK` to block processing commits from going beyond the hollow ones. This fits the case where waiting for hollow commits to finish is acceptable. Use `USE_TRANSITION_TIME` (experimental) to query commits in range by state transition time (completion time), instead of commit time (start time). Using this mode will result in `begin.instanttime` and `end.instanttime` using `stateTransitionTime` instead of the instant's commit time. `Config Param: INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT` `Since Version: 0.14.0`
hoodie.schema.on.read.enable	false	Enables support for Schema Evolution feature `Config Param: SCHEMA_EVOLUTION_ENABLED`

Config Name	Default	Description
hoodie.datasource.hive_sync.mode	(N/A)	Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. `Config Param: HIVE_SYNC_MODE`
hoodie.datasource.write.partitionpath.field	(N/A)	Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString() `Config Param: PARTITIONPATH_FIELD`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD`
hoodie.datasource.write.recordkey.field	(N/A)	Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: RECORDKEY_FIELD`
hoodie.datasource.write.secondarykey.column	(N/A)	Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: SECONDARYKEY_COLUMN_NAME`
hoodie.write.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.clustering.async.enabled	false	Enable running of clustering service, asynchronously as inserts happen on the table. `Config Param: ASYNC_CLUSTERING_ENABLE` `Since Version: 0.7.0`
hoodie.clustering.inline	false	Turn on inline clustering - clustering will be run after each write operation is complete `Config Param: INLINE_CLUSTERING_ENABLE` `Since Version: 0.7.0`
hoodie.datasource.hive_sync.enable	false	When set to true, register/sync the table to Apache Hive metastore. `Config Param: HIVE_SYNC_ENABLED`
hoodie.datasource.hive_sync.jdbcurl	jdbc:hive2://localhost:10000	Hive metastore url `Config Param: HIVE_URL`
hoodie.datasource.hive_sync.metastore.uris	thrift://localhost:9083	Hive metastore url `Config Param: METASTORE_URIS`
hoodie.datasource.meta.sync.enable	false	Enable Syncing the Hudi Table with an external meta store or data catalog. `Config Param: META_SYNC_ENABLED`
hoodie.datasource.write.hive_style_partitioning	false	Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING`
hoodie.datasource.write.operation	upsert	Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it. `Config Param: OPERATION`
hoodie.datasource.write.table.type	COPY_ON_WRITE	The table type for the underlying data, for this write. This can’t change between writes. `Config Param: TABLE_TYPE`

Config Name	Default	Description
hoodie.datasource.hive_sync.serde_properties	(N/A)	Serde properties to hive table. `Config Param: HIVE_TABLE_SERDE_PROPERTIES`
hoodie.datasource.hive_sync.table_properties	(N/A)	Additional properties to store with table. `Config Param: HIVE_TABLE_PROPERTIES`
hoodie.datasource.overwrite.mode	(N/A)	Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode `Config Param: OVERWRITE_MODE` `Since Version: 0.14.0`
hoodie.datasource.write.partitions.to.delete	(N/A)	Comma separated list of partitions to delete. Allows use of wildcard * `Config Param: PARTITIONS_TO_DELETE`
hoodie.datasource.write.payload.class	(N/A)	Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective `Config Param: PAYLOAD_CLASS_NAME`
hoodie.datasource.write.table.name	(N/A)	Table name for the datasource write. Also used to register the table into meta stores. `Config Param: TABLE_NAME`
hoodie.write.record.merge.custom.implementation.classes	(N/A)	List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These record merge impls will filter by hoodie.write.record.merge.strategy.idHudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc) `Config Param: RECORD_MERGE_IMPL_CLASSES` `Since Version: 0.13.0`
hoodie.write.record.merge.strategy.id	(N/A)	ID of record merge strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merge strategy id `Config Param: RECORD_MERGE_STRATEGY_ID` `Since Version: 0.13.0`
hoodie.datasource.compaction.async.enable	true	Controls whether async compaction should be turned on for MOR table writing. `Config Param: ASYNC_COMPACT_ENABLE`
hoodie.datasource.hive_sync.auto_create_database	true	Auto create hive database if does not exists `Config Param: HIVE_AUTO_CREATE_DATABASE`
hoodie.datasource.hive_sync.base_file_format	PARQUET	Base file format for the sync. `Config Param: HIVE_BASE_FILE_FORMAT`
hoodie.datasource.hive_sync.batch_num	1000	The number of partitions one batch when synchronous partitions to hive. `Config Param: HIVE_BATCH_SYNC_PARTITION_NUM`
hoodie.datasource.hive_sync.bucket_sync	false	Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS' `Config Param: HIVE_SYNC_BUCKET_SYNC`
hoodie.datasource.hive_sync.create_managed_table	false	Whether to sync the table as managed table. `Config Param: HIVE_CREATE_MANAGED_TABLE`
hoodie.datasource.hive_sync.database	default	The name of the destination database that we should sync the hudi table to. `Config Param: HIVE_DATABASE`
hoodie.datasource.hive_sync.ignore_exceptions	false	Ignore exceptions when syncing with Hive. `Config Param: HIVE_IGNORE_EXCEPTIONS`
hoodie.datasource.hive_sync.partition_extractor_class	org.apache.hudi.hive.MultiPartKeysValueExtractor	Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'. `Config Param: HIVE_PARTITION_EXTRACTOR_CLASS`
hoodie.datasource.hive_sync.partition_fields		Field in the table to use for determining hive partition columns. `Config Param: HIVE_PARTITION_FIELDS`
hoodie.datasource.hive_sync.password	hive	hive password to use `Config Param: HIVE_PASS`
hoodie.datasource.hive_sync.skip_ro_suffix	false	Skip the _ro suffix for Read optimized table, when registering `Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE`
hoodie.datasource.hive_sync.support_timestamp	false	‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE `Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE`
hoodie.datasource.hive_sync.sync_as_datasource	true	`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE`
hoodie.datasource.hive_sync.sync_comment	false	Whether to sync the table column comments while syncing the table. `Config Param: HIVE_SYNC_COMMENT`
hoodie.datasource.hive_sync.table	unknown	The name of the destination table that we should sync the hudi table to. `Config Param: HIVE_TABLE`
hoodie.datasource.hive_sync.use_jdbc	true	Use JDBC when hive synchronization is enabled `Config Param: HIVE_USE_JDBC`
hoodie.datasource.hive_sync.use_pre_apache_input_format	false	Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format `Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT`
hoodie.datasource.hive_sync.username	hive	hive user name to use `Config Param: HIVE_USER`
hoodie.datasource.insert.dup.policy	none	Note This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching records from incoming will be dropped and the rest will be ingested. Third option is "fail" which will fail the write operation when same records are re-ingested. In other words, a given record as deduced by the key generation policy can be ingested only once to the target table of interest. `Config Param: INSERT_DUP_POLICY` `Since Version: 0.14.0`
hoodie.datasource.meta_sync.condition.sync	false	If true, only sync on conditions like schema change or partition change. `Config Param: HIVE_CONDITIONAL_SYNC`
hoodie.datasource.write.commitmeta.key.prefix	_	Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline `Config Param: COMMIT_METADATA_KEYPREFIX`
hoodie.datasource.write.drop.partition.columns	false	When set to true, will not write the partition columns into hudi. By default, false. `Config Param: DROP_PARTITION_COLUMNS`
hoodie.datasource.write.insert.drop.duplicates	false	If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> Note Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` instead for a simplified duplicate handling experience. The new config will be incorporated into all other writing flows and this config will be fully deprecated in future releases. `Config Param: INSERT_DROP_DUPS`
hoodie.datasource.write.keygenerator.class	org.apache.hudi.keygen.SimpleKeyGenerator	Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator` `Config Param: KEYGENERATOR_CLASS_NAME`
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled	false	When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value `2016-12-29 09:54:00` will be written as timestamp `2016-12-29 09:54:00.0` in row-writer path, while it will be written as long value `1483023240000000` in non row-writer path. If enabled, then the timestamp value will be written in both the cases. `Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED` `Since Version: 0.10.1`
hoodie.datasource.write.partitionpath.urlencode	false	Should we url encode the partition path value, before creating the folder structure. `Config Param: URL_ENCODE_PARTITIONING`
hoodie.datasource.write.reconcile.schema	false	This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema reconciliation is ENABLED, writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us, to always extend the table's schema during evolution and never lose the data (when, for ex, existing column is being dropped in a new batch) `Config Param: RECONCILE_SCHEMA`
hoodie.datasource.write.row.writer.enable	true	When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs. `Config Param: ENABLE_ROW_WRITER`
hoodie.datasource.write.streaming.checkpoint.identifier	default_single_writer	A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpoint info in the memory. This could introduce the potential issue that the job is restart(`batch id` is lost) while spark checkpoint write fails, causing spark will retry and rewrite the data. `Config Param: STREAMING_CHECKPOINT_IDENTIFIER` `Since Version: 0.13.0`
hoodie.datasource.write.streaming.disable.compaction	false	By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately to manage resources across table services and regular ingestion pipeline and so this could be preferred on such cases. `Config Param: STREAMING_DISABLE_COMPACTION` `Since Version: 0.14.0`
hoodie.datasource.write.streaming.ignore.failed.batch	false	Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution. `Config Param: STREAMING_IGNORE_FAILED_BATCH`
hoodie.datasource.write.streaming.retry.count	3	Config to indicate how many times streaming job should retry for a failed micro batch. `Config Param: STREAMING_RETRY_CNT`
hoodie.datasource.write.streaming.retry.interval.ms	2000	Config to indicate how long (by millisecond) before a retry should issued for failed microbatch `Config Param: STREAMING_RETRY_INTERVAL_MS`
hoodie.meta.sync.client.tool.class	org.apache.hudi.hive.HiveSyncTool	Sync tool class name used to sync to metastore. Defaults to Hive. `Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME`
hoodie.spark.sql.insert.into.operation	insert	Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not do small file management. If you prefer hudi to automatically manage small files, then you can go with "insert". There is no precombine (if there are duplicates within the same batch being ingested, same dups will be ingested) with bulk_insert and insert and there is no index look up as well. If you may use INSERT_INTO for mutable dataset, then you may have to set this config value to "upsert". With upsert, you will get both precombine and updates to existing records on storage is also honored. If not, you may see duplicates. `Config Param: SPARK_SQL_INSERT_INTO_OPERATION` `Since Version: 0.14.0`
hoodie.spark.sql.merge.into.partial.updates	true	Whether to write partial updates to the data blocks containing updates in MOR tables with Spark SQL MERGE INTO statement. The data blocks containing partial updates have a schema with a subset of fields compared to the full schema of the table. `Config Param: ENABLE_MERGE_INTO_PARTIAL_UPDATES` `Since Version: 1.0.0`
hoodie.spark.sql.optimized.writes.enable	true	Controls whether spark sql prepped update and delete are enabled. `Config Param: SPARK_SQL_OPTIMIZED_WRITES` `Since Version: 0.14.0`
hoodie.sql.bulk.insert.enable	false	When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead. `Config Param: SQL_ENABLE_BULK_INSERT`
hoodie.sql.insert.mode	upsert	Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness constraint which do not allow duplicate record.While for non-strict mode, hudi just do the insert operation for the pk-table. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation and hoodie.datasource.insert.dup.policy as you see fit. `Config Param: SQL_INSERT_MODE`
hoodie.streamer.source.kafka.value.deserializer.class	io.confluent.kafka.serializers.KafkaAvroDeserializer	This class is used by kafka client to deserialize the records `Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS` `Since Version: 0.9.0`
hoodie.write.set.null.for.missing.columns	false	When a nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation. `Config Param: SET_NULL_FOR_MISSING_COLUMNS` `Since Version: 0.14.1`

Config Name	Default	Description
hoodie.precommit.validators		Comma separated list of class names that can be invoked to validate commit `Config Param: VALIDATOR_CLASS_NAMES`
hoodie.precommit.validators.equality.sql.queries		Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example: "select count(*) from <TABLE_NAME> Note <TABLE_NAME> is replaced by table state before and after commit. `Config Param: EQUALITY_SQL_QUERIES`
hoodie.precommit.validators.inequality.sql.queries		Spark SQL queries to run on table before committing new data to validate state before and after commit.Multiple queries separated by ';' delimiter are supported.Example query: 'select count(*) from <TABLE_NAME> where col=null'Note <TABLE_NAME> variable is expected to be present in query. `Config Param: INEQUALITY_SQL_QUERIES`
hoodie.precommit.validators.single.value.sql.queries		Spark SQL queries to run on table before committing new data to validate state after commit.Multiple queries separated by ';' delimiter are supported.Expected result is included as part of query separated by '#'. Example query: 'query1#result1:query2#result2'Note <TABLE_NAME> variable is expected to be present in query. `Config Param: SINGLE_VALUE_SQL_QUERIES`

Config Name	Default	Description
hoodie.database.name	(N/A)	Database name to register to Hive metastore `Config Param: DATABASE_NAME`
hoodie.table.name	(N/A)	Table name to register to Hive metastore `Config Param: TABLE_NAME`
path	(N/A)	Base path for the target hoodie table. The path would be created if it does not exist, otherwise a Hoodie table expects to be initialized successfully `Config Param: PATH`
read.commits.limit	(N/A)	The maximum number of commits allowed to read in each instant check, if it is streaming read, the avg read instants number per-second would be 'read.commits.limit'/'read.streaming.check-interval', by default no limit `Config Param: READ_COMMITS_LIMIT`
read.end-commit	(N/A)	End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss' `Config Param: READ_END_COMMIT`
read.start-commit	(N/A)	Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read `Config Param: READ_START_COMMIT`
archive.max_commits	50	Max number of commits to keep before archiving older commits into a sequential log, default 50 `Config Param: ARCHIVE_MAX_COMMITS`
archive.min_commits	40	Min number of commits to keep before archiving older commits into a sequential log, default 40 `Config Param: ARCHIVE_MIN_COMMITS`
cdc.enabled	false	When enable, persist the change data if necessary, and can be queried as a CDC query mode `Config Param: CDC_ENABLED`
cdc.supplemental.logging.mode	DATA_BEFORE_AFTER	Setting 'op_key_only' persists the 'op' and the record key only, setting 'data_before' persists the additional 'before' image, and setting 'data_before_after' persists the additional 'before' and 'after' images. `Config Param: SUPPLEMENTAL_LOGGING_MODE`
changelog.enabled	false	Whether to keep all the intermediate changes, we try to keep all the changes of a record when enabled: 1). The sink accept the UPDATE_BEFORE message; 2). The source try to emit every changes of a record. The semantics is best effort because the compaction job would finally merge all changes of a record into one. default false to have UPSERT semantics `Config Param: CHANGELOG_ENABLED`
clean.async.enabled	true	Whether to cleanup the old commits immediately on new commits, enabled by default `Config Param: CLEAN_ASYNC_ENABLED`
clean.retain_commits	30	Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30 `Config Param: CLEAN_RETAIN_COMMITS`
clustering.async.enabled	false	Async Clustering, default false `Config Param: CLUSTERING_ASYNC_ENABLED`
clustering.plan.strategy.small.file.limit	600	Files smaller than the size specified here are candidates for clustering, default 600 MB `Config Param: CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT`
clustering.plan.strategy.target.file.max.bytes	1073741824	Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB `Config Param: CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES`
compaction.async.enabled	true	Async Compaction, enabled by default for MOR `Config Param: COMPACTION_ASYNC_ENABLED`
compaction.delta_commits	5	Max delta commits needed to trigger compaction, default 5 commits `Config Param: COMPACTION_DELTA_COMMITS`
hive_sync.enabled	false	Asynchronously sync Hive meta to HMS, default false `Config Param: HIVE_SYNC_ENABLED`
hive_sync.jdbc_url	jdbc:hive2://localhost:10000	Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000' `Config Param: HIVE_SYNC_JDBC_URL`
hive_sync.metastore.uris		Metastore uris for hive sync, default '' `Config Param: HIVE_SYNC_METASTORE_URIS`
hive_sync.mode	HMS	Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms' `Config Param: HIVE_SYNC_MODE`
hoodie.datasource.query.type	snapshot	Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data) .Default: snapshot `Config Param: QUERY_TYPE`
hoodie.datasource.write.hive_style_partitioning	false	Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values) `Config Param: HIVE_STYLE_PARTITIONING`
hoodie.datasource.write.partitionpath.field		Partition path field. Value to be used at the `partitionPath` component of `HoodieKey`. Actual value obtained by invoking .toString(), default '' `Config Param: PARTITION_PATH_FIELD`
hoodie.datasource.write.recordkey.field	uuid	Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: RECORD_KEY_FIELD`
index.type	FLINK_STATE	Index type of Flink write job, default is using state backed index. `Config Param: INDEX_TYPE`
lookup.join.cache.ttl	PT1H	The cache TTL (e.g. 10min) for the build table in lookup join. `Config Param: LOOKUP_JOIN_CACHE_TTL`
metadata.compaction.delta_commits	10	Max delta commits for metadata table to trigger compaction, default 10 `Config Param: METADATA_COMPACTION_DELTA_COMMITS`
metadata.enabled	true	Enable the internal metadata table which serves table metadata like level file listings, default enabled `Config Param: METADATA_ENABLED`
precombine.field	ts	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD`
read.streaming.enabled	false	Whether to read as streaming source, default false `Config Param: READ_AS_STREAMING`
read.streaming.skip_insertoverwrite	false	Whether to skip insert overwrite instants to avoid reading base files of insert overwrite operations for streaming read. In streaming scenarios, insert overwrite is usually used to repair data, here you can control the visibility of downstream streaming read. `Config Param: READ_STREAMING_SKIP_INSERT_OVERWRITE`
table.type	COPY_ON_WRITE	Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ `Config Param: TABLE_TYPE`
write.operation	upsert	The write operation, that this write should do `Config Param: OPERATION`
write.parquet.max.file.size	120	Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. `Config Param: WRITE_PARQUET_MAX_FILE_SIZE`

Config Name	Default	Description
clustering.tasks	(N/A)	Parallelism of tasks that do actual clustering, default same as the write task parallelism `Config Param: CLUSTERING_TASKS`
compaction.tasks	(N/A)	Parallelism of tasks that do actual compaction, default same as the write task parallelism `Config Param: COMPACTION_TASKS`
hive_sync.conf.dir	(N/A)	The hive configuration directory, where the hive-site.xml lies in, the file should be put on the client machine `Config Param: HIVE_SYNC_CONF_DIR`
hive_sync.serde_properties	(N/A)	Serde properties to hive table, the data format is k1=v1 k2=v2 `Config Param: HIVE_SYNC_TABLE_SERDE_PROPERTIES`
hive_sync.table_properties	(N/A)	Additional properties to store with table, the data format is k1=v1 k2=v2 `Config Param: HIVE_SYNC_TABLE_PROPERTIES`
hoodie.datasource.write.keygenerator.class	(N/A)	Key generator class, that implements will extract the key out of incoming record `Config Param: KEYGEN_CLASS_NAME`
read.tasks	(N/A)	Parallelism of tasks that do actual read, default is the parallelism of the execution environment `Config Param: READ_TASKS`
source.avro-schema	(N/A)	Source avro schema string, the parsed schema is used for deserialization `Config Param: SOURCE_AVRO_SCHEMA`
source.avro-schema.path	(N/A)	Source avro schema file path, the parsed schema is used for deserialization `Config Param: SOURCE_AVRO_SCHEMA_PATH`
write.bucket_assign.tasks	(N/A)	Parallelism of tasks that do bucket assign, default same as the write task parallelism `Config Param: BUCKET_ASSIGN_TASKS`
write.index_bootstrap.tasks	(N/A)	Parallelism of tasks that do index bootstrap, default same as the write task parallelism `Config Param: INDEX_BOOTSTRAP_TASKS`
write.partition.format	(N/A)	Partition path format, only valid when 'write.datetime.partitioning' is true, default is: 1) 'yyyyMMddHH' for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL; 2) 'yyyyMMdd' for DATE and INT. `Config Param: PARTITION_FORMAT`
write.tasks	(N/A)	Parallelism of tasks that do actual write, default is the parallelism of the execution environment `Config Param: WRITE_TASKS`
clean.policy	KEEP_LATEST_COMMITS	Clean policy to manage the Hudi table. Available option: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS.Default is KEEP_LATEST_COMMITS. `Config Param: CLEAN_POLICY`
clean.retain_file_versions	5	Number of file versions to retain. default 5 `Config Param: CLEAN_RETAIN_FILE_VERSIONS`
clean.retain_hours	24	Number of hours for which commits need to be retained. This config provides a more flexible option ascompared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. `Config Param: CLEAN_RETAIN_HOURS`
clustering.delta_commits	4	Max delta commits needed to trigger clustering, default 4 commits `Config Param: CLUSTERING_DELTA_COMMITS`
clustering.plan.partition.filter.mode	NONE	Partition filter mode used in the creation of clustering plan. Available values are - NONE: do not filter table partition and thus the clustering plan will include all partitions that have clustering candidate.RECENT_DAYS: keep a continuous range of partitions, worked together with configs 'clustering.plan.strategy.daybased.lookback.partitions' and 'clustering.plan.strategy.daybased.skipfromlatest.partitions.SELECTED_PARTITIONS: keep partitions that are in the specified range ['clustering.plan.strategy.cluster.begin.partition', 'clustering.plan.strategy.cluster.end.partition'].DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour. `Config Param: CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME`
clustering.plan.strategy.class	org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringPlanStrategy	Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the last N (determined by clustering.plan.strategy.daybased.lookback.partitions) day based partitions picks the small file slices within those partitions. `Config Param: CLUSTERING_PLAN_STRATEGY_CLASS`
clustering.plan.strategy.cluster.begin.partition		Begin partition used to filter partition (inclusive) `Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_BEGIN_PARTITION`
clustering.plan.strategy.cluster.end.partition		End partition used to filter partition (inclusive) `Config Param: CLUSTERING_PLAN_STRATEGY_CLUSTER_END_PARTITION`
clustering.plan.strategy.daybased.lookback.partitions	2	Number of partitions to list to create ClusteringPlan, default is 2 `Config Param: CLUSTERING_TARGET_PARTITIONS`
clustering.plan.strategy.daybased.skipfromlatest.partitions	0	Number of partitions to skip from latest when choosing partitions to create ClusteringPlan `Config Param: CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST`
clustering.plan.strategy.max.num.groups	30	Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism, default is 30 `Config Param: CLUSTERING_MAX_NUM_GROUPS`
clustering.plan.strategy.partition.regex.pattern		Filter clustering partitions that matched regex pattern `Config Param: CLUSTERING_PLAN_STRATEGY_PARTITION_REGEX_PATTERN`
clustering.plan.strategy.partition.selected		Partitions to run clustering `Config Param: CLUSTERING_PLAN_STRATEGY_PARTITION_SELECTED`
clustering.plan.strategy.sort.columns		Columns to sort the data by when clustering `Config Param: CLUSTERING_SORT_COLUMNS`
clustering.schedule.enabled	false	Schedule the cluster plan, default false `Config Param: CLUSTERING_SCHEDULE_ENABLED`
compaction.delta_seconds	3600	Max delta seconds time needed to trigger compaction, default 1 hour `Config Param: COMPACTION_DELTA_SECONDS`
compaction.max_memory	100	Max memory in MB for compaction spillable map, default 100MB `Config Param: COMPACTION_MAX_MEMORY`
compaction.schedule.enabled	true	Schedule the compaction plan, enabled by default for MOR `Config Param: COMPACTION_SCHEDULE_ENABLED`
compaction.target_io	512000	Target IO in MB for per compaction (both read and write), default 500 GB `Config Param: COMPACTION_TARGET_IO`
compaction.timeout.seconds	1200	Max timeout time in seconds for online compaction to rollback, default 20 minutes `Config Param: COMPACTION_TIMEOUT_SECONDS`
compaction.trigger.strategy	num_commits	Strategy to trigger compaction, options are 'num_commits': trigger compaction when there are at least N delta commits after last completed compaction; 'num_commits_after_last_request': trigger compaction when there are at least N delta commits after last completed/requested compaction; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits' `Config Param: COMPACTION_TRIGGER_STRATEGY`
hive_sync.assume_date_partitioning	false	Assume partitioning is yyyy/mm/dd, default false `Config Param: HIVE_SYNC_ASSUME_DATE_PARTITION`
hive_sync.auto_create_db	true	Auto create hive database if it does not exists, default true `Config Param: HIVE_SYNC_AUTO_CREATE_DB`
hive_sync.db	default	Database name for hive sync, default 'default' `Config Param: HIVE_SYNC_DB`
hive_sync.file_format	PARQUET	File format for hive sync, default 'PARQUET' `Config Param: HIVE_SYNC_FILE_FORMAT`
hive_sync.ignore_exceptions	false	Ignore exceptions during hive synchronization, default false `Config Param: HIVE_SYNC_IGNORE_EXCEPTIONS`
hive_sync.partition_extractor_class	org.apache.hudi.hive.MultiPartKeysValueExtractor	Tool to extract the partition value from HDFS path, default 'MultiPartKeysValueExtractor' `Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME`
hive_sync.partition_fields		Partition fields for hive sync, default '' `Config Param: HIVE_SYNC_PARTITION_FIELDS`
hive_sync.password	hive	Password for hive sync, default 'hive' `Config Param: HIVE_SYNC_PASSWORD`
hive_sync.skip_ro_suffix	false	Skip the _ro suffix for Read optimized table when registering, default false `Config Param: HIVE_SYNC_SKIP_RO_SUFFIX`
hive_sync.support_timestamp	true	INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility. `Config Param: HIVE_SYNC_SUPPORT_TIMESTAMP`
hive_sync.table	unknown	Table name for hive sync, default 'unknown' `Config Param: HIVE_SYNC_TABLE`
hive_sync.table.strategy	ALL	Hive table synchronization strategy. Available option: RO, RT, ALL. `Config Param: HIVE_SYNC_TABLE_STRATEGY`
hive_sync.use_jdbc	true	Use JDBC when hive synchronization is enabled, default true `Config Param: HIVE_SYNC_USE_JDBC`
hive_sync.username	hive	Username for hive sync, default 'hive' `Config Param: HIVE_SYNC_USERNAME`
hoodie.bucket.index.hash.field		Index key field. Value to be used as hashing to find the bucket ID. Should be a subset of or equal to the recordKey fields. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` `Config Param: INDEX_KEY_FIELD`
hoodie.bucket.index.num.buckets	4	Hudi bucket number per partition. Only affected if using Hudi bucket index. `Config Param: BUCKET_INDEX_NUM_BUCKETS`
hoodie.datasource.merge.type	payload_combine	For Snapshot query on merge on read table. Use this key to define how the payloads are merged, in 1) skip_merge: read the base file records plus the log file records without merging; 2) payload_combine: read the base file records first, for each record in base file, checks whether the key is in the log file records (combines the two records with same key for base and log file records), then read the left log file records `Config Param: MERGE_TYPE`
hoodie.datasource.write.keygenerator.type	SIMPLE	Key generator type, that implements will extract the key out of incoming record. Note This is being actively worked on. Please use `hoodie.datasource.write.keygenerator.class` instead. `Config Param: KEYGEN_TYPE`
hoodie.datasource.write.partitionpath.urlencode	false	Whether to encode the partition path url, default false `Config Param: URL_ENCODE_PARTITIONING`
hoodie.index.bucket.engine	SIMPLE	Type of bucket index engine. Available options: [SIMPLE
hoodie.write.table.version	8	Table version produced by this writer. `Config Param: WRITE_TABLE_VERSION`
index.bootstrap.enabled	false	Whether to bootstrap the index state from existing hoodie table, default false `Config Param: INDEX_BOOTSTRAP_ENABLED`
index.global.enabled	true	Whether to update index for the old partition path if same key record with different partition path came in, default true `Config Param: INDEX_GLOBAL_ENABLED`
index.partition.regex	.*	Whether to load partitions in state if partition path matching， default `*` `Config Param: INDEX_PARTITION_REGEX`
index.state.ttl	0.0	Index state ttl in days, default stores the index permanently `Config Param: INDEX_STATE_TTL`
partition.default_name	HIVE_DEFAULT_PARTITION	The default partition name in case the dynamic partition column value is null/empty string `Config Param: PARTITION_DEFAULT_NAME`
payload.class	org.apache.hudi.common.model.EventTimeAvroPayload	Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for the option in-effective `Config Param: PAYLOAD_CLASS_NAME`
read.cdc.from.changelog	true	Whether to consume the delta changes only from the cdc changelog files. When CDC is enabled, i). for COW table, the changelog is generated on each file update; ii). for MOR table, the changelog is generated on compaction. By default, always read from the changelog file, once it is disabled, the reader would infer the changes based on the file slice dependencies. `Config Param: READ_CDC_FROM_CHANGELOG`
read.data.skipping.enabled	false	Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over files `Config Param: READ_DATA_SKIPPING_ENABLED`
read.streaming.check-interval	60	Check interval for streaming read of SECOND, default 1 minute `Config Param: READ_STREAMING_CHECK_INTERVAL`
read.streaming.skip_clustering	true	Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read to improve read performance. `Config Param: READ_STREAMING_SKIP_CLUSTERING`
read.streaming.skip_compaction	true	Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance. This option can be used to avoid reading duplicates when changelog mode is enabled, it is a solution to keep data integrity `Config Param: READ_STREAMING_SKIP_COMPACT`
read.utc-timezone	true	Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone, by default true `Config Param: READ_UTC_TIMEZONE`
record.merger.impls	org.apache.hudi.common.model.HoodieAvroRecordMerger	List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by record.merger.strategy. Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc) `Config Param: RECORD_MERGER_IMPLS`
record.merger.strategy	eeb8d96f-b1e4-49fd-bbf8-28ac514178e5	Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in record.merger.impls which has the same merger strategy id `Config Param: RECORD_MERGER_STRATEGY_ID`
write.batch.size	256.0	Batch buffer size in MB to flush data into the underneath filesystem, default 256MB `Config Param: WRITE_BATCH_SIZE`
write.bulk_insert.shuffle_input	true	Whether to shuffle the inputs by specific fields for bulk insert tasks, default true `Config Param: WRITE_BULK_INSERT_SHUFFLE_INPUT`
write.bulk_insert.sort_input	true	Whether to sort the inputs by specific fields for bulk insert tasks, default true `Config Param: WRITE_BULK_INSERT_SORT_INPUT`
write.bulk_insert.sort_input.by_record_key	false	Whether to sort the inputs by record keys for bulk insert tasks, default false `Config Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY`
write.client.id		Unique identifier used to distinguish different writer pipelines for concurrent mode `Config Param: WRITE_CLIENT_ID`
write.commit.ack.timeout	-1	Timeout limit for a writer task after it finishes a checkpoint and waits for the instant commit success, only for internal use `Config Param: WRITE_COMMIT_ACK_TIMEOUT`
write.ignore.failed	false	Flag to indicate whether to ignore any non exception error (e.g. writestatus error). within a checkpoint batch. By default false. Turning this on, could hide the write status errors while the flink checkpoint moves ahead. So, would recommend users to use this with caution. `Config Param: IGNORE_FAILED`
write.insert.cluster	false	Whether to merge small files for insert mode, if true, the write throughput will decrease because the read/write of existing small file, only valid for COW table, default false `Config Param: INSERT_CLUSTER`
write.log.max.size	1024	Maximum size allowed in MB for a log file before it is rolled over to the next version, default 1GB `Config Param: WRITE_LOG_MAX_SIZE`
write.log_block.size	128	Max log block size in MB for log file, default 128MB `Config Param: WRITE_LOG_BLOCK_SIZE`
write.merge.max_memory	100	Max memory in MB for merge, default 100MB `Config Param: WRITE_MERGE_MAX_MEMORY`
write.parquet.block.size	120	Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group. `Config Param: WRITE_PARQUET_BLOCK_SIZE`
write.parquet.page.size	1	Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately. `Config Param: WRITE_PARQUET_PAGE_SIZE`
write.partition.overwrite.mode	STATIC	When INSERT OVERWRITE a partitioned data source table, we currently support 2 modes: static and dynamic. Static mode deletes all the partitions that match the partition specification(e.g. PARTITION(a=1,b)) in the INSERT statement, before overwriting. Dynamic mode doesn't delete partitions ahead, and only overwrite those partitions that have data written into it at runtime. By default we use static mode to keep the same behavior of previous version. `Config Param: WRITE_PARTITION_OVERWRITE_MODE`
write.precombine	false	Flag to indicate whether to drop duplicates before insert/upsert. By default these cases will accept duplicates, to gain extra performance: 1) insert operation; 2) upsert for MOR table, the MOR table deduplicate on reading `Config Param: PRE_COMBINE`
write.rate.limit	0	Write record rate limit per second to prevent traffic jitter and improve stability, default 0 (no limit) `Config Param: WRITE_RATE_LIMIT`
write.retry.interval.ms	2000	Flag to indicate how long (by millisecond) before a retry should issued for failed checkpoint batch. By default 2000 and it will be doubled by every retry `Config Param: RETRY_INTERVAL_MS`
write.retry.times	3	Flag to indicate how many times streaming job should retry for a failed checkpoint batch. By default 3 `Config Param: RETRY_TIMES`
write.sort.memory	128	Sort memory in MB, default 128MB `Config Param: WRITE_SORT_MEMORY`
write.task.max.size	1024.0	Maximum memory in MB for a write task, when the threshold hits, it flushes the max size data bucket to avoid OOM, default 1GB `Config Param: WRITE_TASK_MAX_SIZE`
write.utc-timezone	true	Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Default value is utc timezone for forward compatibility. `Config Param: WRITE_UTC_TIMEZONE`

Config Name	Default	Description
hoodie.memory.compaction.max.size	(N/A)	Maximum amount of memory used in bytes for compaction operations in bytes , before spilling to local storage. `Config Param: MAX_MEMORY_FOR_COMPACTION`
hoodie.memory.spillable.map.path	(N/A)	Default file path for spillable map `Config Param: SPILLABLE_MAP_BASE_PATH`
hoodie.memory.compaction.fraction	0.6	HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map `Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION`
hoodie.memory.dfs.buffer.max.size	16777216	Property to control the max memory in bytes for dfs input stream buffer size `Config Param: MAX_DFS_STREAM_BUFFER_SIZE`
hoodie.memory.merge.fraction	0.6	This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge `Config Param: MAX_MEMORY_FRACTION_FOR_MERGE`
hoodie.memory.merge.max.size	1073741824	Maximum amount of memory used in bytes for merge operations, before spilling to local storage. `Config Param: MAX_MEMORY_FOR_MERGE`
hoodie.memory.writestatus.failure.fraction	0.1	Property to control how what fraction of the failed record, exceptions we report back to driver. Default is 10%. If set to 100%, with lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors. `Config Param: WRITESTATUS_FAILURE_FRACTION`

Config Name	Default	Description
hoodie.index.name	(N/A)	Name of the expression index. This is also used for the partition name in the metadata table. `Config Param: SECONDARY_INDEX_NAME` `Since Version: 1.0.0`
hoodie.index.name	(N/A)	Name of the expression index. This is also used for the partition name in the metadata table. `Config Param: EXPRESSION_INDEX_NAME` `Since Version: 1.0.0`
hoodie.metadata.index.drop	(N/A)	Drop the specified index. The value should be the name of the index to delete. You can check index names using `SHOW INDEXES` command. The index name either starts with or matches exactly can be one of the following: files, column_stats, bloom_filters, record_index, expr_index_, secondary_index_, partition_stats, files `Config Param: DROP_METADATA_INDEX` `Since Version: 1.0.1`
hoodie.expression.index.type	COLUMN_STATS	Type of the expression index. Default is `column_stats` if there are no functions and expressions in the command. Valid options could be BITMAP, COLUMN_STATS, LUCENE, etc. If index_type is not provided, and there are functions or expressions in the command then a expression index using column stats will be created. `Config Param: EXPRESSION_INDEX_TYPE` `Since Version: 1.0.0`
hoodie.metadata.enable	true	Enable the internal metadata table which serves table metadata like level file listings `Config Param: ENABLE` `Since Version: 0.7.0`
hoodie.metadata.index.bloom.filter.enable	false	Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups. `Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.enable	false	Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups. `Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS` `Since Version: 0.11.0`
hoodie.metadata.index.expression.enable	false	Enable expression index within the metadata table. When this configuration property is enabled (`true`), the Hudi writer automatically keeps all expression indexes consistent with the data table. When disabled (`false`), all expression indexes are deleted. Note that individual expression index can only be created through a `CREATE INDEX` and deleted through a `DROP INDEX` statement in Spark SQL. `Config Param: EXPRESSION_INDEX_ENABLE_PROP` `Since Version: 1.0.0`
hoodie.metadata.index.partition.stats.enable	false	Enable aggregating stats for each column at the storage partition level. Enabling this can improve query performance by leveraging partition and column stats for (partition) filtering. Important: The default value for this configuration is dynamically set based on the effective value of hoodie.metadata.index.column.stats.enable. If column stats index is enabled (default for Spark engine), partition stats indexing will also be enabled by default. Conversely, if column stats indexing is disabled (default for Flink and Java engines), partition stats indexing will also be disabled by default. `Config Param: ENABLE_METADATA_INDEX_PARTITION_STATS` `Since Version: 1.0.0`
hoodie.metadata.index.secondary.enable	true	Enable secondary index within the metadata table. When this configuration property is enabled (`true`), the Hudi writer automatically keeps all secondary indexes consistent with the data table. When disabled (`false`), all secondary indexes are deleted. Note that individual secondary index can only be created through a `CREATE INDEX` and deleted through a `DROP INDEX` statement in Spark SQL. `Config Param: SECONDARY_INDEX_ENABLE_PROP` `Since Version: 1.0.0`

Config Name	Default	Description
hoodie.write.commit.callback.http.custom.headers	(N/A)	Http callback custom headers. Format: HeaderName1:HeaderValue1;HeaderName2:HeaderValue2 `Config Param: CALLBACK_HTTP_CUSTOM_HEADERS` `Since Version: 0.15.0`
hoodie.write.commit.callback.http.url	(N/A)	Callback host to be sent along with callback messages `Config Param: CALLBACK_HTTP_URL` `Since Version: 0.6.0`
hoodie.write.commit.callback.class	org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback	Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default `Config Param: CALLBACK_CLASS_NAME` `Since Version: 0.6.0`
hoodie.write.commit.callback.http.api.key	hudi_write_commit_http_callback	Http callback API key. hudi_write_commit_http_callback by default `Config Param: CALLBACK_HTTP_API_KEY_VALUE` `Since Version: 0.6.0`
hoodie.write.commit.callback.http.timeout.seconds	30	Callback timeout in seconds. `Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS` `Since Version: 0.6.0`
hoodie.write.commit.callback.on	false	Turn commit callback on/off. off by default. `Config Param: TURN_CALLBACK_ON` `Since Version: 0.6.0`

Config Name	Default	Description
hoodie.metadata.index.bloom.filter.column.list	(N/A)	Comma-separated list of columns for which bloom filter index will be built. If not set, only record key will be indexed. `Config Param: BLOOM_FILTER_INDEX_FOR_COLUMNS` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.column.list	(N/A)	Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed `Config Param: COLUMN_STATS_INDEX_FOR_COLUMNS` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.processing.mode.override	(N/A)	By default Column Stats Index is automatically determining whether it should be read and processed either'in-memory' (w/in executing process) or using Spark (on a cluster), based on some factors like the size of the Index and how many columns are read. This config allows to override this behavior. `Config Param: COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDE` `Since Version: 0.12.0`
hoodie.metadata.index.expression.column	(N/A)	Column for which expression index will be built. `Config Param: EXPRESSION_INDEX_COLUMN` `Since Version: 1.0.1`
hoodie.metadata.index.expression.options	(N/A)	Options for the expression index, e.g. "expr='from_unixtime', format='yyyy-MM-dd'" `Config Param: EXPRESSION_INDEX_OPTIONS` `Since Version: 1.0.1`
hoodie.metadata.index.secondary.column	(N/A)	Column for which secondary index will be built. `Config Param: SECONDARY_INDEX_COLUMN` `Since Version: 1.0.1`
_hoodie.metadata.ignore.spurious.deletes	true	There are cases when extra files are requested to be deleted from metadata table which are never added before. This config determines how to handle such spurious deletes `Config Param: IGNORE_SPURIOUS_DELETES` `Since Version: 0.10.0`
hoodie.file.listing.parallelism	200	Parallelism to use, when listing the table on lake storage. `Config Param: FILE_LISTING_PARALLELISM_VALUE` `Since Version: 0.7.0`
hoodie.metadata.auto.initialize	true	Initializes the metadata table by reading from the file system when the table is first created. Enabled by default. Warning: This should only be disabled when manually constructing the metadata table outside of typical Hudi writer flows. `Config Param: AUTO_INITIALIZE` `Since Version: 0.14.0`
hoodie.metadata.compact.max.delta.commits	10	Controls how often the metadata table is compacted. `Config Param: COMPACT_NUM_DELTA_COMMITS` `Since Version: 0.7.0`
hoodie.metadata.dir.filter.regex		Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time. `Config Param: DIR_FILTER_REGEX` `Since Version: 0.7.0`
hoodie.metadata.index.async	false	Enable asynchronous indexing of metadata table. `Config Param: ASYNC_INDEX_ENABLE` `Since Version: 0.11.0`
hoodie.metadata.index.bloom.filter.file.group.count	4	Metadata bloom filter index partition file group count. This controls the size of the base and log files and read parallelism in the bloom filter index partition. The recommendation is to size the file group count such that the base files are under 1GB. `Config Param: METADATA_INDEX_BLOOM_FILTER_FILE_GROUP_COUNT` `Since Version: 0.11.0`
hoodie.metadata.index.bloom.filter.parallelism	200	Parallelism to use for generating bloom filter index in metadata table. `Config Param: BLOOM_FILTER_INDEX_PARALLELISM` `Since Version: 0.11.0`
hoodie.metadata.index.check.timeout.seconds	900	After the async indexer has finished indexing upto the base instant, it will ensure that all inflight writers reliably write index updates as well. If this timeout expires, then the indexer will abort itself safely. `Config Param: METADATA_INDEX_CHECK_TIMEOUT_SECONDS` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.file.group.count	2	Metadata column stats partition file group count. This controls the size of the base and log files and read parallelism in the column stats index partition. The recommendation is to size the file group count such that the base files are under 1GB. `Config Param: METADATA_INDEX_COLUMN_STATS_FILE_GROUP_COUNT` `Since Version: 0.11.0`
hoodie.metadata.index.column.stats.inMemory.projection.threshold	100000	When reading Column Stats Index, if the size of the expected resulting projection is below the in-memory threshold (counted by the # of rows), it will be attempted to be loaded "in-memory" (ie not using the execution engine like Spark, Flink, etc). If the value is above the threshold execution engine will be used to compose the projection. `Config Param: COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLD` `Since Version: 0.12.0`
hoodie.metadata.index.column.stats.max.columns.to.index	32	Maximum number of columns to generate column stats for. If the config `hoodie.metadata.index.column.stats.column.list` is set, this config will be ignored. If the config `hoodie.metadata.index.column.stats.column.list` is not set, the column stats of the first `n` columns (`n` defined by this config) in the table schema are generated. `Config Param: COLUMN_STATS_INDEX_MAX_COLUMNS` `Since Version: 1.0.0`
hoodie.metadata.index.column.stats.parallelism	200	Parallelism to use, when generating column stats index. `Config Param: COLUMN_STATS_INDEX_PARALLELISM` `Since Version: 0.11.0`
hoodie.metadata.index.expression.file.group.count	2	Metadata expression index partition file group count. `Config Param: EXPRESSION_INDEX_FILE_GROUP_COUNT` `Since Version: 1.0.0`
hoodie.metadata.index.expression.parallelism	200	Parallelism to use, when generating expression index. `Config Param: EXPRESSION_INDEX_PARALLELISM` `Since Version: 1.0.0`
hoodie.metadata.index.partition.stats.file.group.count	1	Metadata partition stats file group count. This controls the size of the base and log files and read parallelism in the partition stats index. `Config Param: METADATA_INDEX_PARTITION_STATS_FILE_GROUP_COUNT` `Since Version: 1.0.0`
hoodie.metadata.index.partition.stats.parallelism	200	Parallelism to use, when generating partition stats index. `Config Param: PARTITION_STATS_INDEX_PARALLELISM` `Since Version: 1.0.0`
hoodie.metadata.index.secondary.parallelism	200	Parallelism to use, when generating secondary index. `Config Param: SECONDARY_INDEX_PARALLELISM` `Since Version: 1.0.0`
hoodie.metadata.log.compaction.blocks.threshold	5	Controls the criteria to log compacted files groups in metadata table. `Config Param: LOG_COMPACT_BLOCKS_THRESHOLD` `Since Version: 0.14.0`
hoodie.metadata.log.compaction.enable	false	This configs enables logcompaction for the metadata table. `Config Param: ENABLE_LOG_COMPACTION_ON_METADATA_TABLE` `Since Version: 0.14.0`
hoodie.metadata.max.deltacommits.when_pending	1000	When there is a pending instant in data table, this config limits the allowed number of deltacommits in metadata table to prevent the metadata table's timeline from growing unboundedly as compaction won't be triggered due to the pending data table instant. `Config Param: METADATA_MAX_NUM_DELTACOMMITS_WHEN_PENDING` `Since Version: 0.14.0`
hoodie.metadata.max.init.parallelism	100000	Maximum parallelism to use when initializing Record Index. `Config Param: RECORD_INDEX_MAX_PARALLELISM` `Since Version: 0.14.0`
hoodie.metadata.max.logfile.size	2147483648	Maximum size in bytes of a single log file. Larger log files can contain larger log blocks thereby reducing the number of blocks to search for keys `Config Param: MAX_LOG_FILE_SIZE_BYTES_PROP` `Since Version: 0.14.0`
hoodie.metadata.max.reader.buffer.size	10485760	Max memory to use for the reader buffer while merging log blocks `Config Param: MAX_READER_BUFFER_SIZE_PROP` `Since Version: 0.14.0`
hoodie.metadata.max.reader.memory	1073741824	Max memory to use for the reader to read from metadata `Config Param: MAX_READER_MEMORY_PROP` `Since Version: 0.14.0`
hoodie.metadata.metrics.enable	false	Enable publishing of metrics around metadata table. `Config Param: METRICS_ENABLE` `Since Version: 0.7.0`
hoodie.metadata.optimized.log.blocks.scan.enable	false	Optimized log blocks scanner that addresses all the multi-writer use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written by log compaction. `Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN` `Since Version: 0.13.0`
hoodie.metadata.record.index.enable	false	Create the HUDI Record Index within the Metadata Table `Config Param: RECORD_INDEX_ENABLE_PROP` `Since Version: 0.14.0`
hoodie.metadata.record.index.growth.factor	2.0	The current number of records are multiplied by this number when estimating the number of file groups to create automatically. This helps account for growth in the number of records in the dataset. `Config Param: RECORD_INDEX_GROWTH_FACTOR_PROP` `Since Version: 0.14.0`
hoodie.metadata.record.index.max.filegroup.count	10000	Maximum number of file groups to use for Record Index. `Config Param: RECORD_INDEX_MAX_FILE_GROUP_COUNT_PROP` `Since Version: 0.14.0`
hoodie.metadata.record.index.max.filegroup.size	1073741824	Maximum size in bytes of a single file group. Large file group takes longer to compact. `Config Param: RECORD_INDEX_MAX_FILE_GROUP_SIZE_BYTES_PROP` `Since Version: 0.14.0`
hoodie.metadata.record.index.min.filegroup.count	10	Minimum number of file groups to use for Record Index. `Config Param: RECORD_INDEX_MIN_FILE_GROUP_COUNT_PROP` `Since Version: 0.14.0`
hoodie.metadata.spillable.map.path		Path on local storage to use, when keys read from metadata are held in a spillable map. `Config Param: SPILLABLE_MAP_DIR_PROP` `Since Version: 0.14.0`

Config Name	Default	Description
hoodie.parquet.compression.codec	gzip	Compression Codec for parquet files `Config Param: PARQUET_COMPRESSION_CODEC_NAME`
hoodie.parquet.max.file.size	125829120	Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance. `Config Param: PARQUET_MAX_FILE_SIZE`

Config Name	Default	Description
hoodie.logfile.data.block.format	(N/A)	Format of the data block within delta logs. Following formats are currently supported "avro", "hfile", "parquet" `Config Param: LOGFILE_DATA_BLOCK_FORMAT`
hoodie.parquet.writelegacyformat.enabled	(N/A)	Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet's fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. `Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED`
hoodie.avro.write.support.class	org.apache.hudi.avro.HoodieAvroWriteSupport	Provided write support class should extend HoodieAvroWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context. `Config Param: HOODIE_AVRO_WRITE_SUPPORT_CLASS` `Since Version: 0.14.0`
hoodie.bloom.index.filter.dynamic.max.entries	100000	The threshold for the maximum number of keys to record in a dynamic Bloom filter row. Only applies if filter type is BloomFilterTypeCode.DYNAMIC_V0. `Config Param: BLOOM_FILTER_DYNAMIC_MAX_ENTRIES`
hoodie.bloom.index.filter.type	DYNAMIC_V0	org.apache.hudi.common.bloom.BloomFilterTypeCode: Filter type used by Bloom filter. SIMPLE: Bloom filter that is based on the configured size. DYNAMIC_V0(default): Bloom filter that is auto sized based on number of keys. `Config Param: BLOOM_FILTER_TYPE`
hoodie.hfile.block.size	1048576	Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially faster lookup times. `Config Param: HFILE_BLOCK_SIZE`
hoodie.hfile.compression.algorithm	GZ	Compression codec to use for hfile base files. `Config Param: HFILE_COMPRESSION_ALGORITHM_NAME`
hoodie.hfile.max.file.size	125829120	Target file size in bytes for HFile base files. `Config Param: HFILE_MAX_FILE_SIZE`
hoodie.index.bloom.fpp	0.000000001	Only applies if index type is BLOOM. Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001), we like to tradeoff disk space for lower false positives. If the number of entries added to bloom filter exceeds the configured value (hoodie.index.bloom.num_entries), then this fpp may not be honored. `Config Param: BLOOM_FILTER_FPP_VALUE`
hoodie.index.bloom.num_entries	60000	Only applies if index type is BLOOM. This is the number of entries to be stored in the bloom filter. The rationale for the default: Assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb and hence we approx a total of 130K records in a file. The default (60000) is roughly half of this approximation. Warning: Setting this very low, will generate a lot of false positives and index lookup will have to scan a lot more files than it has to and setting this to a very high number will increase the size every base file linearly (roughly 4KB for every 50000 entries). This config is also used with DYNAMIC bloom filter which determines the initial size for the bloom. `Config Param: BLOOM_FILTER_NUM_ENTRIES_VALUE`
hoodie.io.factory.class	org.apache.hudi.io.hadoop.HoodieHadoopIOFactory	The fully-qualified class name of the factory class to return readers and writers of files used by Hudi. The provided class should implement `org.apache.hudi.io.storage.HoodieIOFactory`. `Config Param: HOODIE_IO_FACTORY_CLASS` `Since Version: 0.15.0`
hoodie.logfile.data.block.max.size	268435456	LogFile Data block max size in bytes. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent from OOM errors. This size should be greater than the JVM memory. `Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE`
hoodie.logfile.max.size	1073741824	LogFile max size in bytes. This is the maximum size allowed for a log file before it is rolled over to the next version. `Config Param: LOGFILE_MAX_SIZE`
hoodie.logfile.to.parquet.compression.ratio	0.35	Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file. `Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION`
hoodie.orc.block.size	125829120	ORC block size, recommended to be aligned with the target file size. `Config Param: ORC_BLOCK_SIZE`
hoodie.orc.compression.codec	ZLIB	Compression codec to use for ORC base files. `Config Param: ORC_COMPRESSION_CODEC_NAME`
hoodie.orc.max.file.size	125829120	Target file size in bytes for ORC base files. `Config Param: ORC_FILE_MAX_SIZE`
hoodie.orc.stripe.size	67108864	Size of the memory buffer in bytes for writing `Config Param: ORC_STRIPE_SIZE`
hoodie.parquet.block.size	125829120	Parquet RowGroup size in bytes. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group. `Config Param: PARQUET_BLOCK_SIZE`
hoodie.parquet.bloom.filter.enabled	true	Control whether to write bloom filter or not. Default true. We can set to false in non bloom index cases for CPU resource saving. `Config Param: PARQUET_WITH_BLOOM_FILTER_ENABLED` `Since Version: 0.15.0`
hoodie.parquet.compression.ratio	0.1	Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files `Config Param: PARQUET_COMPRESSION_RATIO_FRACTION`
hoodie.parquet.dictionary.enabled	true	Whether to use dictionary encoding `Config Param: PARQUET_DICTIONARY_ENABLED`
hoodie.parquet.field_id.write.enabled	true	Would only be effective with Spark 3.3+. Sets spark.sql.parquet.fieldId.write.enabled. If enabled, Spark will write out parquet native field ids that are stored inside StructField's metadata as parquet.field.id to parquet files. `Config Param: PARQUET_FIELD_ID_WRITE_ENABLED` `Since Version: 0.12.0`
hoodie.parquet.outputtimestamptype	TIMESTAMP_MICROS	Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files. `Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE`
hoodie.parquet.page.size	1048576	Parquet page size in bytes. Page is the unit of read within a parquet file. Within a block, pages are compressed separately. `Config Param: PARQUET_PAGE_SIZE`
hoodie.parquet.spark.row.write.support.class	org.apache.hudi.io.storage.row.HoodieRowParquetWriteSupport	Provided write support class should extend HoodieRowParquetWriteSupport class and it is loaded at runtime. This is only required when trying to override the existing write context when `hoodie.datasource.write.row.writer.enable=true`. `Config Param: HOODIE_PARQUET_SPARK_ROW_WRITE_SUPPORT_CLASS` `Since Version: 0.15.0`
hoodie.storage.class	org.apache.hudi.storage.hadoop.HoodieHadoopStorage	The fully-qualified class name of the `HoodieStorage` implementation class to instantiate. The provided class should implement `org.apache.hudi.storage.HoodieStorage` `Config Param: HOODIE_STORAGE_CLASS` `Since Version: 0.15.0`

Config Name	Default	Description
_hoodie.optimistic.consistency.guard.enable	false	Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period. `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLE` `Since Version: 0.6.0`
hoodie.consistency.check.enabled	false	Enabled to handle S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in the future releases. `Config Param: ENABLE` `Since Version: 0.5.0` `Deprecated since: 0.7.0`
hoodie.consistency.check.initial_interval_ms	400	Amount of time (in ms) to wait, before checking for consistency after an operation on storage. `Config Param: INITIAL_CHECK_INTERVAL_MS` `Since Version: 0.5.0` `Deprecated since: 0.7.0`
hoodie.consistency.check.max_checks	6	Maximum number of consistency checks to perform, with exponential backoff. `Config Param: MAX_CHECKS` `Since Version: 0.5.0` `Deprecated since: 0.7.0`
hoodie.consistency.check.max_interval_ms	20000	Maximum amount of time (in ms), to wait for consistency checking. `Config Param: MAX_CHECK_INTERVAL_MS` `Since Version: 0.5.0` `Deprecated since: 0.7.0`
hoodie.optimistic.consistency.guard.sleep_time_ms	500	Amount of time (in ms), to wait after which we assume storage is consistent. `Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MS` `Since Version: 0.6.0`

Config Name	Default	Description
hoodie.filesystem.operation.retry.enable	false	Enabled to handle list/get/delete etc file system performance issue. `Config Param: FILESYSTEM_RETRY_ENABLE` `Since Version: 0.11.0`
hoodie.filesystem.operation.retry.exceptions		The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from FileSystem `Config Param: RETRY_EXCEPTIONS` `Since Version: 0.11.0`
hoodie.filesystem.operation.retry.initial_interval_ms	100	Amount of time (in ms) to wait, before retry to do operations on storage. `Config Param: INITIAL_RETRY_INTERVAL_MS` `Since Version: 0.11.0`
hoodie.filesystem.operation.retry.max_interval_ms	2000	Maximum amount of time (in ms), to wait for next retry. `Config Param: MAX_RETRY_INTERVAL_MS` `Since Version: 0.11.0`
hoodie.filesystem.operation.retry.max_numbers	4	Maximum number of retry actions to perform, with exponential backoff. `Config Param: MAX_RETRY_NUMBERS` `Since Version: 0.11.0`

Config Name	Default	Description
hoodie.filesystem.remote.backup.view.enable	true	Config to control whether backup needs to be configured if clients were not able to reach timeline service. `Config Param: REMOTE_BACKUP_VIEW_ENABLE`
hoodie.filesystem.view.incr.timeline.sync.enable	false	Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline. `Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE`
hoodie.filesystem.view.remote.host	localhost	We expect this to be rarely hand configured. `Config Param: REMOTE_HOST_NAME`
hoodie.filesystem.view.remote.port	26754	Port to serve file system view queries, when remote. We expect this to be rarely hand configured. `Config Param: REMOTE_PORT_NUM`
hoodie.filesystem.view.remote.retry.enable	false	Whether to enable API request retry for remote file system view. `Config Param: REMOTE_RETRY_ENABLE` `Since Version: 0.12.1`
hoodie.filesystem.view.remote.retry.exceptions		The class name of the Exception that needs to be retried, separated by commas. Default is empty which means retry all the IOException and RuntimeException from Remote Request. `Config Param: RETRY_EXCEPTIONS` `Since Version: 0.12.1`
hoodie.filesystem.view.remote.retry.initial_interval_ms	100	Amount of time (in ms) to wait, before retry to do operations on storage. `Config Param: REMOTE_INITIAL_RETRY_INTERVAL_MS` `Since Version: 0.12.1`
hoodie.filesystem.view.remote.retry.max_interval_ms	2000	Maximum amount of time (in ms), to wait for next retry. `Config Param: REMOTE_MAX_RETRY_INTERVAL_MS` `Since Version: 0.12.1`
hoodie.filesystem.view.remote.retry.max_numbers	3	Maximum number of retry for API requests against a remote file system view. e.g timeline server. `Config Param: REMOTE_MAX_RETRY_NUMBERS` `Since Version: 0.12.1`
hoodie.filesystem.view.remote.timeout.secs	300	Timeout in seconds, to wait for API requests against a remote file system view. e.g timeline server. `Config Param: REMOTE_TIMEOUT_SECS`
hoodie.filesystem.view.rocksdb.base.path	/tmp/hoodie_timeline_rocksdb	Path on local storage to use, when storing file system view in embedded kv store/rocksdb. `Config Param: ROCKSDB_BASE_PATH`
hoodie.filesystem.view.secondary.type	MEMORY	Specifies the secondary form of storage for file system view, if the primary (e.g timeline server) is unavailable. `Config Param: SECONDARY_VIEW_TYPE`
hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction	0.05	Fraction of the file system view memory, to be used for holding mapping to bootstrap base files. `Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION`
hoodie.filesystem.view.spillable.clustering.mem.fraction	0.02	Fraction of the file system view memory, to be used for holding clustering related metadata. `Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION`
hoodie.filesystem.view.spillable.compaction.mem.fraction	0.1	Fraction of the file system view memory, to be used for holding compaction related metadata. `Config Param: SPILLABLE_COMPACTION_MEM_FRACTION`
hoodie.filesystem.view.spillable.dir	/tmp/	Path on local storage to use, when file system view is held in a spillable map. `Config Param: SPILLABLE_DIR`
hoodie.filesystem.view.spillable.log.compaction.mem.fraction	0.02	Fraction of the file system view memory, to be used for holding log compaction related metadata. `Config Param: SPILLABLE_LOG_COMPACTION_MEM_FRACTION` `Since Version: 0.13.0`
hoodie.filesystem.view.spillable.mem	104857600	Amount of memory to be used in bytes for holding file system view, before spilling to disk. `Config Param: SPILLABLE_MEMORY`
hoodie.filesystem.view.spillable.replaced.mem.fraction	0.05	Fraction of the file system view memory, to be used for holding replace commit related metadata. `Config Param: SPILLABLE_REPLACED_MEM_FRACTION`
hoodie.filesystem.view.type	MEMORY	File system view provides APIs for viewing the files on the underlying lake storage, as file groups and file slices. This config controls how such a view is held. Options include MEMORY,SPILLABLE_DISK,EMBEDDED_KV_STORE,REMOTE_ONLY,REMOTE_FIRST which provide different trade offs for memory usage and API request performance. `Config Param: VIEW_TYPE`

Config Name	Default	Description
hoodie.keep.max.commits	30	Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline. `Config Param: MAX_COMMITS_TO_KEEP`
hoodie.keep.min.commits	20	Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline. `Config Param: MIN_COMMITS_TO_KEEP`

Config Name	Default	Description
hoodie.archive.async	false	Only applies when hoodie.archive.automatic is turned on. When turned on runs archiver async with writing, which can speed up overall write performance. `Config Param: ASYNC_ARCHIVE` `Since Version: 0.11.0`
hoodie.archive.automatic	true	When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It's recommended to enable this, to ensure number of active commits is bounded. `Config Param: AUTO_ARCHIVE`
hoodie.archive.beyond.savepoint	false	If enabled, archival will proceed beyond savepoint, skipping savepoint commits. If disabled, archival will stop at the earliest savepoint commit. `Config Param: ARCHIVE_BEYOND_SAVEPOINT` `Since Version: 0.12.0`
hoodie.archive.delete.parallelism	100	When performing archival operation, Hudi needs to delete the files of the archived instants in the active timeline in .hoodie folder. The file deletion also happens after merging small archived files into larger ones if enabled. This config limits the Spark parallelism for deleting files in both cases, i.e., parallelism of deleting files does not go above the configured value and the parallelism is the number of files to delete if smaller than the configured value. If you see that the file deletion in archival operation is slow because of the limited parallelism, you can increase this to tune the performance. `Config Param: DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE`
hoodie.commits.archival.batch	10	Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size. `Config Param: COMMITS_ARCHIVAL_BATCH_SIZE`
hoodie.timeline.compaction.batch.size	10	The number of small files to compact at once. `Config Param: TIMELINE_COMPACTION_BATCH_SIZE`

Config Name	Default	Description
hoodie.bootstrap.data.queries.only	false	Improves query performance, but queries cannot use hudi metadata fields `Config Param: DATA_QUERIES_ONLY` `Since Version: 0.14.0`
hoodie.bootstrap.full.input.provider	org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider	Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD `Config Param: FULL_BOOTSTRAP_INPUT_PROVIDER_CLASS_NAME` `Since Version: 0.6.0`
hoodie.bootstrap.index.class	org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex	Implementation to use, for mapping a skeleton base file to a bootstrap base file. `Config Param: INDEX_CLASS_NAME` `Since Version: 0.6.0`
hoodie.bootstrap.mode.selector	org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector	Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped `Config Param: MODE_SELECTOR_CLASS_NAME` `Since Version: 0.6.0`
hoodie.bootstrap.mode.selector.regex	.*	Matches each bootstrap dataset partition against this regex and applies the mode below to it. `Config Param: PARTITION_SELECTOR_REGEX_PATTERN` `Since Version: 0.6.0`
hoodie.bootstrap.mode.selector.regex.mode	METADATA_ONLY	org.apache.hudi.client.bootstrap.BootstrapMode: Bootstrap mode for importing an existing table into Hudi FULL_RECORD: In this mode, the full record data is copied into hudi and metadata columns are added. A full record bootstrap is functionally equivalent to a bulk-insert. After a full record bootstrap, Hudi will function properly even if the original table is modified or deleted. METADATA_ONLY(default): In this mode, the full record data is not copied into Hudi therefore it avoids full cost of rewriting the dataset. Instead, 'skeleton' files containing just the corresponding metadata columns are added to the Hudi table. Hudi relies on the data in the original table and will face data-loss or corruption if files in the original table location are deleted or modified. `Config Param: PARTITION_SELECTOR_REGEX_MODE` `Since Version: 0.6.0`
hoodie.bootstrap.parallelism	1500	For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (`hoodie.bulkinsert.shuffle.parallelism`), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this. `Config Param: PARALLELISM_VALUE` `Since Version: 0.6.0`
hoodie.bootstrap.partitionpath.translator.class	org.apache.hudi.client.bootstrap.translator.IdentityBootstrapPartitionPathTranslator	Translates the partition paths from the bootstrapped data into how is laid out as a Hudi table. `Config Param: PARTITION_PATH_TRANSLATOR_CLASS_NAME` `Since Version: 0.6.0`

Config Name	Default	Description
hoodie.clean.async.enabled	false	Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance. `Config Param: ASYNC_CLEAN`
hoodie.clean.commits.retained	10	When KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries. `Config Param: CLEANER_COMMITS_RETAINED`

Externalized Config File​

Hudi Table Config​

Hudi Table Basic Configs​

Spark Datasource Configs​

Read Options​

Write Options​

PreCommit Validator Configurations​

Flink Sql Configs​

Flink Options​

Write Client Configs​

Common Configurations​

Memory Configurations​

Metadata Configs​

Metaserver Configs​

Storage Configs​

Consistency Guard Configurations​

FileSystem Guard Configurations​

File System View Storage Configurations​

Archival Configs​

Bootstrap Configs​

Clean Configs​

Clustering Configs​

Compaction Configs​

Error table Configs​

Layout Configs​

TTL management Configs​

Write Configurations​

Commit Callback Configs​

Write commit callback configs​

Write commit Kafka callback configs​

Write commit pulsar callback configs​

Lock Configs​

Common Lock Configurations​

DynamoDB based Locks Configurations​

Key Generator Configs​

Key Generator Options​

Timestamp-based key generator configs​

Index Configs​

Common Index Configs​

Common Index Configs​

HBase Index Configs​

Reader Configs​

Reader Configs​

Metastore and Catalog Sync Configs​

Common Metadata Sync Configs​

Glue catalog sync based client Configurations​

BigQuery Sync Configs​

Hive Sync Configs​

Global Hive Sync Configs​

DataHub Sync Configs​

Metrics Configs​

Metrics Configurations for Amazon CloudWatch​

Metrics Configurations​

Metrics Configurations for Datadog reporter​

Metrics Configurations for Graphite​

Metrics Configurations for Jmx​

Metrics Configurations for M3​

Metrics Configurations for Prometheus​

Record Payload Config​

Payload Configurations​

Kafka Connect Configs​

Kafka Sink Connect Configurations​

Amazon Web Services Configs​

Amazon Web Services Configs​

Hudi Streamer Configs​

Hudi Streamer Configs​

Hudi Streamer SQL Transformer Configs​

Hudi Streamer Source Configs​

Cloud Source Configs​

DFS Path Selector Configs​

Date Partition Path Selector Configs​

GCS Events Source Configs​

Hive Incremental Pulling Source Configs​

Hudi Incremental Source Configs​

JDBC Source Configs​

Json Kafka Post Processor Configs​

Kafka Source Configs​

Parquet DFS Source Configs​

Pulsar Source Configs​

S3 Event-based Hudi Incremental Source Configs​

Externalized Config File

Hudi Table Config

Hudi Table Basic Configs

Spark Datasource Configs

Read Options

Write Options

PreCommit Validator Configurations

Flink Sql Configs

Flink Options

Write Client Configs

Common Configurations

Memory Configurations

Metadata Configs

Metaserver Configs

Storage Configs

Consistency Guard Configurations

FileSystem Guard Configurations

File System View Storage Configurations

Archival Configs

Bootstrap Configs

Clean Configs

Clustering Configs

Compaction Configs

Error table Configs

Layout Configs

TTL management Configs

Write Configurations

Commit Callback Configs

Write commit callback configs

Write commit Kafka callback configs

Write commit pulsar callback configs

Lock Configs

Common Lock Configurations

DynamoDB based Locks Configurations

Key Generator Configs

Key Generator Options

Timestamp-based key generator configs

Index Configs

Common Index Configs

Common Index Configs

HBase Index Configs

Reader Configs

Reader Configs

Metastore and Catalog Sync Configs

Common Metadata Sync Configs

Glue catalog sync based client Configurations

BigQuery Sync Configs

Hive Sync Configs

Global Hive Sync Configs

DataHub Sync Configs

Metrics Configs

Metrics Configurations for Amazon CloudWatch

Metrics Configurations

Metrics Configurations for Datadog reporter

Metrics Configurations for Graphite

Metrics Configurations for Jmx

Metrics Configurations for M3

Metrics Configurations for Prometheus

Record Payload Config

Payload Configurations

Kafka Connect Configs

Kafka Sink Connect Configurations

Amazon Web Services Configs

Amazon Web Services Configs

Hudi Streamer Configs

Hudi Streamer Configs

Hudi Streamer SQL Transformer Configs

Hudi Streamer Source Configs

Cloud Source Configs

DFS Path Selector Configs

Date Partition Path Selector Configs

GCS Events Source Configs

Hive Incremental Pulling Source Configs

Hudi Incremental Source Configs

JDBC Source Configs

Json Kafka Post Processor Configs

Kafka Source Configs

Parquet DFS Source Configs

Pulsar Source Configs

S3 Event-based Hudi Incremental Source Configs

Config Name	Default	Description
hoodie.clean.automatic	true	When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded. `Config Param: AUTO_CLEAN`
hoodie.clean.delete.bootstrap.base.file	false	When set to true, cleaner also deletes the bootstrap base file when it's skeleton base file is cleaned. Turn this to true, if you want to ensure the bootstrap dataset storage is reclaimed over time, as the table receives updates/deletes. Another reason to turn this on, would be to ensure data residing in bootstrap base files are also physically deleted, to comply with data privacy enforcement processes. `Config Param: CLEANER_BOOTSTRAP_BASE_FILE_ENABLE`
hoodie.clean.failed.writes.policy	EAGER	org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy: Policy that controls how to clean up failed writes. Hudi will delete any files written by failed writes to re-claim space. EAGER(default): Clean failed writes inline after every write operation. LAZY: Clean failed writes lazily after heartbeat timeout when the cleaning service runs. This policy is required when multi-writers are enabled. NEVER: Never clean failed writes. `Config Param: FAILED_WRITES_CLEANER_POLICY`
hoodie.clean.fileversions.retained	3	When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning. `Config Param: CLEANER_FILE_VERSIONS_RETAINED`
hoodie.clean.hours.retained	24	When KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned. `Config Param: CLEANER_HOURS_RETAINED`
hoodie.clean.incremental.enabled	true	When enabled, the plans for each cleaner service run is computed incrementally off the events in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full table for each planning (even with a metadata table). `Config Param: CLEANER_INCREMENTAL_MODE_ENABLE`
hoodie.clean.multiple.enabled	false	Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests should be mutually exclusive, .i.e. a 2nd clean will not be scheduled if another clean is not yet completed to avoid repeat cleaning of same files, they might want to disable this config. `Config Param: ALLOW_MULTIPLE_CLEANS` `Since Version: 0.11.0` `Deprecated since: 0.15.0`
hoodie.clean.parallelism	200	This config controls the behavior of both the cleaning plan and cleaning execution. Deriving the cleaning plan is parallelized at the table partition level, i.e., each table partition is processed by one Spark task to figure out the files to clean. The cleaner picks the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. The clean execution, i.e., the file deletion, is parallelized at file level, which is the unit of Spark task distribution. Similarly, the actual parallelism cannot exceed the configured value if the number of files is larger. If cleaning plan or execution is slow due to limited parallelism, you can increase this to tune the performance.. `Config Param: CLEANER_PARALLELISM_VALUE`
hoodie.clean.policy	KEEP_LATEST_COMMITS	org.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.clean.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.clean.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.clean.hours.retained" is explicitly set only. `Config Param: CLEANER_POLICY`
hoodie.clean.trigger.max.commits	1	Number of commits after the last clean operation, before scheduling of a new clean is attempted. `Config Param: CLEAN_MAX_COMMITS`
hoodie.clean.trigger.strategy	NUM_COMMITS	org.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by `hoodie.clean.trigger.max.commits`. `Config Param: CLEAN_TRIGGER_STRATEGY`

Config Name	Default	Description
hoodie.clustering.plan.strategy.cluster.begin.partition	(N/A)	Begin partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONS `Config Param: PARTITION_FILTER_BEGIN_PARTITION` `Since Version: 0.11.0`
hoodie.clustering.plan.strategy.cluster.end.partition	(N/A)	End partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONS `Config Param: PARTITION_FILTER_END_PARTITION` `Since Version: 0.11.0`
hoodie.clustering.plan.strategy.partition.regex.pattern	(N/A)	Filter clustering partitions that matched regex pattern `Config Param: PARTITION_REGEX_PATTERN` `Since Version: 0.11.0`
hoodie.clustering.plan.strategy.partition.selected	(N/A)	Partitions to run clustering `Config Param: PARTITION_SELECTED` `Since Version: 0.11.0`
hoodie.clustering.plan.strategy.sort.columns	(N/A)	Columns to sort the data by when clustering `Config Param: PLAN_STRATEGY_SORT_COLUMNS` `Since Version: 0.7.0`
hoodie.clustering.async.max.commits	4	Config to control frequency of async clustering `Config Param: ASYNC_CLUSTERING_MAX_COMMITS` `Since Version: 0.9.0`
hoodie.clustering.execution.strategy.class	org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy	Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in th plan by the specified columns, while meeting the configured target file sizes. `Config Param: EXECUTION_STRATEGY_CLASS_NAME` `Since Version: 0.7.0`
hoodie.clustering.group.read.parallelism	20	Maximum number of parallelism when Spark read records from clustering group. `Config Param: CLUSTERING_GROUP_READ_PARALLELISM` `Since Version: 1.0.0`
hoodie.clustering.inline.max.commits	4	Config to control frequency of clustering planning `Config Param: INLINE_CLUSTERING_MAX_COMMITS` `Since Version: 0.7.0`
hoodie.clustering.max.parallelism	15	Maximum number of parallelism jobs submitted in clustering operation. If the resource is sufficient(Like Spark engine has enough idle executors), increasing this value will let the clustering job run faster, while it will give additional pressure to the execution engines to manage more concurrent running jobs. `Config Param: CLUSTERING_MAX_PARALLELISM` `Since Version: 0.14.0`
hoodie.clustering.plan.partition.filter.mode	NONE	org.apache.hudi.table.action.cluster.ClusteringPlanPartitionFilterMode: Partition filter mode used in the creation of clustering plan. NONE(default): Do not filter partitions. The clustering plan will include all partitions that have clustering candidates. RECENT_DAYS: This filter assumes that your data is partitioned by date. The clustering plan will only include partitions from K days ago to N days ago, where K >= N. K is determined by `hoodie.clustering.plan.strategy.daybased.lookback.partitions` and N is determined by `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`. SELECTED_PARTITIONS: The clustering plan will include only partition paths with names that sort within the inclusive range [`hoodie.clustering.plan.strategy.cluster.begin.partition`, `hoodie.clustering.plan.strategy.cluster.end.partition`]. DAY_ROLLING: To determine the partitions in the clustering plan, the eligible partitions will be sorted in ascending order. Each partition will have an index i in that list. The clustering plan will only contain partitions such that i mod 24 = H, where H is the current hour of the day (from 0 to 23). `Config Param: PLAN_PARTITION_FILTER_MODE_NAME` `Since Version: 0.11.0`
hoodie.clustering.plan.strategy.class	org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy	Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the clustering small file size limit (determined by hoodie.clustering.plan.strategy.small.file.limit) to pick the small file slices within partitions for clustering. `Config Param: PLAN_STRATEGY_CLASS_NAME` `Since Version: 0.7.0`
hoodie.clustering.plan.strategy.daybased.lookback.partitions	2	Number of partitions to list to create ClusteringPlan `Config Param: DAYBASED_LOOKBACK_PARTITIONS` `Since Version: 0.7.0`
hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions	0	Number of partitions to skip from latest when choosing partitions to create ClusteringPlan `Config Param: PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST` `Since Version: 0.9.0`
hoodie.clustering.plan.strategy.max.bytes.per.group	2147483648	Each clustering operation can create multiple output file groups. Total amount of data processed by clustering operation is defined by below two properties (CLUSTERING_MAX_BYTES_PER_GROUP * CLUSTERING_MAX_NUM_GROUPS). Max amount of data to be included in one group `Config Param: PLAN_STRATEGY_MAX_BYTES_PER_OUTPUT_FILEGROUP` `Since Version: 0.7.0`
hoodie.clustering.plan.strategy.max.num.groups	30	Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism `Config Param: PLAN_STRATEGY_MAX_GROUPS` `Since Version: 0.7.0`
hoodie.clustering.plan.strategy.single.group.clustering.enabled	true	Whether to generate clustering plan when there is only one file group involved, by default true `Config Param: PLAN_STRATEGY_SINGLE_GROUP_CLUSTERING_ENABLED` `Since Version: 0.14.0`
hoodie.clustering.rollback.pending.replacecommit.on.conflict	false	If updates are allowed to file groups pending clustering, then set this config to rollback failed or pending clustering instants. Pending clustering will be rolled back ONLY IF there is conflict between incoming upsert and filegroup to be clustered. Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to race condition in rare scenarios, for example, when the clustering completes after instants are fetched but before rollback completed. `Config Param: ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT` `Since Version: 0.10.0`
hoodie.clustering.schedule.inline	false	When set to true, clustering service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async clustering(execution) for the one scheduled by this writer. Users can choose to set both `hoodie.clustering.inline` and `hoodie.clustering.schedule.inline` to false and have both scheduling and execution triggered by any async process, on which case `hoodie.clustering.async.enabled` is expected to be set to true. But if `hoodie.clustering.inline` is set to false, and `hoodie.clustering.schedule.inline` is set to true, regular writers will schedule clustering inline, but users are expected to trigger async job for execution. If `hoodie.clustering.inline` is set to true, regular writers will do both scheduling and execution inline for clustering `Config Param: SCHEDULE_INLINE_CLUSTERING`
hoodie.clustering.updates.strategy	org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy	Determines how to handle updates, deletes to file groups that are under clustering. Default strategy just rejects the update `Config Param: UPDATES_STRATEGY` `Since Version: 0.7.0`
hoodie.layout.optimize.build.curve.sample.size	200000	Determines target sample size used by the Boundary-based Interleaved Index method of building space-filling curve. Larger sample size entails better layout optimization outcomes, at the expense of higher memory footprint. `Config Param: LAYOUT_OPTIMIZE_BUILD_CURVE_SAMPLE_SIZE` `Since Version: 0.10.0`
hoodie.layout.optimize.curve.build.method	DIRECT	org.apache.hudi.config.HoodieClusteringConfig$SpatialCurveCompositionStrategyType: This configuration only has effect if `hoodie.layout.optimize.strategy` is set to either "z-order" or "hilbert" (i.e. leveraging space-filling curves). This configuration controls the type of a strategy to use for building the space-filling curves, tackling specifically how the Strings are ordered based on the curve. Since we truncate the String to 8 bytes for ordering, there are two issues: (1) it can lead to poor aggregation effect, (2) the truncation of String longer than 8 bytes loses the precision, if the Strings are different but the 8-byte prefix is the same. The boundary-based interleaved index method ("SAMPLE") has better generalization, solving the two problems above, but is slower than direct method ("DIRECT"). User should benchmark the write and query performance before tweaking this in production, if this is actually a problem. Please refer to RFC-28 for more details. DIRECT(default): This strategy builds the spatial curve in full, filling in all of the individual points corresponding to each individual record, which requires less compute. SAMPLE: This strategy leverages boundary-base interleaved index method (described in more details in Amazon DynamoDB blog https://aws.amazon.com/cn/blogs/database/tag/z-order/) and produces a better layout compared to DIRECT strategy. It requires more compute and is slower. `Config Param: LAYOUT_OPTIMIZE_SPATIAL_CURVE_BUILD_METHOD` `Since Version: 0.10.0`
hoodie.layout.optimize.data.skipping.enable	true	Enable data skipping by collecting statistics once layout optimization is complete. `Config Param: LAYOUT_OPTIMIZE_DATA_SKIPPING_ENABLE` `Since Version: 0.10.0` `Deprecated since: 0.11.0`
hoodie.layout.optimize.enable	false	This setting has no effect. Please refer to clustering configuration, as well as LAYOUT_OPTIMIZE_STRATEGY config to enable advanced record layout optimization strategies `Config Param: LAYOUT_OPTIMIZE_ENABLE` `Since Version: 0.10.0` `Deprecated since: 0.11.0`
hoodie.layout.optimize.strategy	LINEAR	org.apache.hudi.config.HoodieClusteringConfig$LayoutOptimizationStrategy: Determines ordering strategy for records layout optimization. LINEAR(default): Orders records lexicographically ZORDER: Orders records along Z-order spatial-curve. HILBERT: Orders records along Hilbert's spatial-curve. `Config Param: LAYOUT_OPTIMIZE_STRATEGY` `Since Version: 0.10.0`

Config Name	Default	Description
hoodie.compact.inline	false	When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path. `Config Param: INLINE_COMPACT`
hoodie.compact.inline.max.delta.commits	5	Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This config takes effect only for the compaction triggering strategy based on the number of commits, i.e., NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, NUM_AND_TIME, and NUM_OR_TIME. `Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS`

Config Name	Default	Description
hoodie.compaction.partition.path.regex	(N/A)	Used to specify the partition path regex for compaction. Only partitions that match the regex will be compacted. Only be used when configure PartitionRegexBasedCompactionStrategy. `Config Param: COMPACTION_SPECIFY_PARTITION_PATH_REGEX`
hoodie.compact.inline.max.delta.seconds	3600	Number of elapsed seconds after the last compaction, before scheduling a new one. This config takes effect only for the compaction triggering strategy based on the elapsed time, i.e., TIME_ELAPSED, NUM_AND_TIME, and NUM_OR_TIME. `Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS`
hoodie.compact.inline.trigger.strategy	NUM_COMMITS	org.apache.hudi.table.action.compact.CompactionTriggerStrategy: Controls when compaction is scheduled. NUM_COMMITS(default): triggers compaction when there are at least N delta commits after last completed compaction. NUM_COMMITS_AFTER_LAST_REQUEST: triggers compaction when there are at least N delta commits after last completed or requested compaction. TIME_ELAPSED: triggers compaction after N seconds since last compaction. NUM_AND_TIME: triggers compaction when both there are at least N delta commits and N seconds elapsed (both must be satisfied) after last completed compaction. NUM_OR_TIME: triggers compaction when both there are at least N delta commits or N seconds elapsed (either condition is satisfied) after last completed compaction. `Config Param: INLINE_COMPACT_TRIGGER_STRATEGY`
hoodie.compact.schedule.inline	false	When set to true, compaction service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async compaction(execution) for the one scheduled by this writer. Users can choose to set both `hoodie.compact.inline` and `hoodie.compact.schedule.inline` to false and have both scheduling and execution triggered by any async process. But if `hoodie.compact.inline` is set to false, and `hoodie.compact.schedule.inline` is set to true, regular writers will schedule compaction inline, but users are expected to trigger async job for execution. If `hoodie.compact.inline` is set to true, regular writers will do both scheduling and execution inline for compaction `Config Param: SCHEDULE_INLINE_COMPACT`
hoodie.compaction.daybased.target.partitions	10	Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run. `Config Param: TARGET_PARTITIONS_PER_DAYBASED_COMPACTION`
hoodie.compaction.logfile.num.threshold	0	Only if the log file num is greater than the threshold, the file group will be compacted. `Config Param: COMPACTION_LOG_FILE_NUM_THRESHOLD` `Since Version: 0.13.0`
hoodie.compaction.logfile.size.threshold	0	Only if the log file size is greater than the threshold in bytes, the file group will be compacted. `Config Param: COMPACTION_LOG_FILE_SIZE_THRESHOLD`
hoodie.compaction.plan.generator	org.apache.hudi.table.action.compact.plan.generators.HoodieCompactionPlanGenerator	Compaction plan generator for data files. Override with a custom plan generator if there's a need to use extraMetadata in the compaction plan for optimizations, ignore otherwise `Config Param: COMPACTION_PLAN_GENERATOR`
hoodie.compaction.strategy	org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy	Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default. Hudi picks the log file with most accumulated unmerged data. The strategy can be composed with multiple strategies by concatenating the class names with ','. `Config Param: COMPACTION_STRATEGY`
hoodie.compaction.target.io	512000	Amount of MBs to spend during compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run inline mode. `Config Param: TARGET_IO_PER_COMPACTION_IN_MB`
hoodie.copyonwrite.insert.auto.split	true	Config to control whether we control insert split sizes automatically based on average record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome. `Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS`
hoodie.copyonwrite.insert.split.size	500000	Number of inserts assigned for each partition/bucket for writing. We based the default on writing out 100MB files, with at least 1kb records (100K records per file), and over provision to 500K. As long as auto-tuning of splits is turned on, this only affects the first write, where there is no history to learn record sizes from. `Config Param: COPY_ON_WRITE_INSERT_SPLIT_SIZE`
hoodie.copyonwrite.record.size.estimate	1024	The average record size. If not explicitly specified, hudi will compute the record size estimate compute dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files. `Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE`
hoodie.log.compaction.blocks.threshold	5	Log compaction can be scheduled if the no. of log blocks crosses this threshold value. This is effective only when log compaction is enabled via hoodie.log.compaction.inline `Config Param: LOG_COMPACTION_BLOCKS_THRESHOLD` `Since Version: 0.13.0`
hoodie.log.compaction.enable	false	By enabling log compaction through this config, log compaction will also get enabled for the metadata table. `Config Param: ENABLE_LOG_COMPACTION` `Since Version: 0.14.0`
hoodie.log.compaction.inline	false	When set to true, logcompaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path. `Config Param: INLINE_LOG_COMPACT` `Since Version: 0.13.0`
hoodie.parquet.small.file.limit	104857600	During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a `small file`. By default, treat any file <= 100MB as a small file. Also note that if this set <= 0, will not try to get small files and directly write new files `Config Param: PARQUET_SMALL_FILE_LIMIT`
hoodie.record.size.estimation.threshold	1.0	We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold) `Config Param: RECORD_SIZE_ESTIMATION_THRESHOLD`

Config Name	Default	Description
hoodie.errortable.base.path	(N/A)	Base path for error table under which all error records would be stored. `Config Param: ERROR_TABLE_BASE_PATH`
hoodie.errortable.target.table.name	(N/A)	Table name to be used for the error table `Config Param: ERROR_TARGET_TABLE`
hoodie.errortable.write.class	(N/A)	Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this config `Config Param: ERROR_TABLE_WRITE_CLASS`
hoodie.errortable.enable	false	Config to enable error table. If the config is enabled, all the records with processing error in DeltaStreamer are transferred to error table. `Config Param: ERROR_TABLE_ENABLED`
hoodie.errortable.insert.shuffle.parallelism	200	Config to set insert shuffle parallelism. The config is similar to hoodie.insert.shuffle.parallelism config but applies to the error table. `Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE`
hoodie.errortable.source.rdd.persist	false	Enabling this config, persists the sourceRDD to disk which helps in faster processing of data table + error table write DAG `Config Param: ERROR_TABLE_PERSIST_SOURCE_RDD`
hoodie.errortable.upsert.shuffle.parallelism	200	Config to set upsert shuffle parallelism. The config is similar to hoodie.upsert.shuffle.parallelism config but applies to the error table. `Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE`
hoodie.errortable.validate.recordcreation.enable	true	Records that fail to be created due to keygeneration failure or other issues will be sent to the Error Table `Config Param: ERROR_ENABLE_VALIDATE_RECORD_CREATION` `Since Version: 0.15.0`
hoodie.errortable.validate.targetschema.enable	false	Records with schema mismatch with Target Schema are sent to Error Table. `Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA`
hoodie.errortable.write.failure.strategy	ROLLBACK_COMMIT	The config specifies the failure strategy if error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered) , LOG_ERROR (Error is logged but the base table write succeeds) ] `Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY`
hoodie.errortable.write.union.enable	false	Enable error table union with data table when writing for improved commit performance. By default it is disabled meaning data table and error table writes are sequential `Config Param: ENABLE_ERROR_TABLE_WRITE_UNIFICATION`

Config Name	Default	Description
hoodie.storage.layout.partitioner.class	(N/A)	Partitioner class, it is used to distribute data in a specific way. `Config Param: LAYOUT_PARTITIONER_CLASS_NAME`
hoodie.storage.layout.type	DEFAULT	org.apache.hudi.table.storage.HoodieStorageLayout$LayoutType: Determines how the files are organized within a table. DEFAULT(default): Each file group contains records of a certain set of keys, without particular grouping criteria. BUCKET: Each file group contains records of a set of keys which map to a certain range of hash values, so that using the hash function can easily identify the file group a record belongs to, based on the record key. `Config Param: LAYOUT_TYPE`

Config Name	Default	Description
hoodie.partition.ttl.strategy.class	(N/A)	Config to provide a strategy class (subclass of PartitionTTLStrategy) to get the expired partitions `Config Param: PARTITION_TTL_STRATEGY_CLASS_NAME` `Since Version: 1.0.0`
hoodie.partition.ttl.strategy.partition.selected	(N/A)	Partitions to manage ttl `Config Param: PARTITION_SELECTED` `Since Version: 1.0.0`
hoodie.partition.ttl.inline	false	When enabled, the partition ttl management service is invoked immediately after each commit, to delete exipired partitions `Config Param: INLINE_PARTITION_TTL` `Since Version: 1.0.0`
hoodie.partition.ttl.management.strategy.type	KEEP_BY_TIME	Partition ttl management strategy type to determine the strategy class `Config Param: PARTITION_TTL_STRATEGY_TYPE` `Since Version: 1.0.0`
hoodie.partition.ttl.strategy.days.retain	-1	Partition ttl management KEEP_BY_TIME strategy days retain `Config Param: DAYS_RETAIN` `Since Version: 1.0.0`
hoodie.partition.ttl.strategy.max.delete.partitions	1000	max partitions to delete in partition ttl management `Config Param: MAX_PARTITION_TO_DELETE` `Since Version: 1.0.0`

Config Name	Default	Description
hoodie.base.path	(N/A)	Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory. `Config Param: BASE_PATH`
hoodie.datasource.write.precombine.field	(N/A)	Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..) `Config Param: PRECOMBINE_FIELD_NAME`
hoodie.table.name	(N/A)	Table name that will be used for registering with metastores like HMS. Needs to be same across runs. `Config Param: TBL_NAME`
hoodie.write.record.merge.mode	(N/A)	org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user. `Config Param: RECORD_MERGE_MODE` `Since Version: 1.0.0`
hoodie.fail.job.on.duplicate.data.file.detection	false	If config is enabled, entire job is failed on invalid file detection `Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION`
hoodie.instant_state.timeline_server_based.enabled	false	If enabled, writers get instant state from timeline server rather than requesting DFS directly `Config Param: INSTANT_STATE_TIMELINE_SERVER_BASED` `Since Version: 1.0.0`
hoodie.instant_state.timeline_server_based.force_refresh.request.number	100	Number of requests to trigger instant state cache refreshing `Config Param: INSTANT_STATE_TIMELINE_SERVER_BASED_FORCE_REFRESH_REQUEST_NUMBER` `Since Version: 1.0.0`
hoodie.write.auto.upgrade	true	If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower. `Config Param: AUTO_UPGRADE_VERSION` `Since Version: 1.0.0`
hoodie.write.concurrency.mode	SINGLE_WRITER	org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor. `Config Param: WRITE_CONCURRENCY_MODE`
hoodie.write.table.version	8	The table version this writer is storing the table in. This should match the current table version. `Config Param: WRITE_TABLE_VERSION` `Since Version: 1.0.0`

Config Name	Default	Description
hoodie.write.commit.callback.kafka.bootstrap.servers	(N/A)	Bootstrap servers of kafka cluster, to be used for publishing commit metadata. `Config Param: BOOTSTRAP_SERVERS` `Since Version: 0.7.0`
hoodie.write.commit.callback.kafka.partition	(N/A)	It may be desirable to serialize all changes into a single Kafka partition for providing strict ordering. By default, Kafka messages are keyed by table name, which guarantees ordering at the table level, but not globally (or when new partitions are added) `Config Param: PARTITION` `Since Version: 0.7.0`
hoodie.write.commit.callback.kafka.topic	(N/A)	Kafka topic name to publish timeline activity into. `Config Param: TOPIC` `Since Version: 0.7.0`
hoodie.write.commit.callback.kafka.acks	all	kafka acks level, all by default to ensure strong durability. `Config Param: ACKS` `Since Version: 0.7.0`
hoodie.write.commit.callback.kafka.retries	3	Times to retry the produce. 3 by default `Config Param: RETRIES` `Since Version: 0.7.0`

Config Name	Default	Description
hoodie.write.commit.callback.pulsar.broker.service.url	(N/A)	Server's url of pulsar cluster, to be used for publishing commit metadata. `Config Param: BROKER_SERVICE_URL` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.topic	(N/A)	pulsar topic name to publish timeline activity into. `Config Param: TOPIC` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.connection-timeout	10s	Duration of waiting for a connection to a broker to be established. `Config Param: CONNECTION_TIMEOUT` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.keepalive-interval	30s	Duration of keeping alive interval for each client broker connection. `Config Param: KEEPALIVE_INTERVAL` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.operation-timeout	30s	Duration of waiting for completing an operation. `Config Param: OPERATION_TIMEOUT` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.producer.block-if-queue-full	true	When the queue is full, the method is blocked instead of an exception is thrown. `Config Param: PRODUCER_BLOCK_QUEUE_FULL` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.producer.pending-queue-size	1000	The maximum size of a queue holding pending messages. `Config Param: PRODUCER_PENDING_QUEUE_SIZE` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.producer.pending-total-size	50000	The maximum number of pending messages across partitions. `Config Param: PRODUCER_PENDING_SIZE` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.producer.route-mode	RoundRobinPartition	Message routing logic for producers on partitioned topics. `Config Param: PRODUCER_ROUTE_MODE` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.producer.send-timeout	30s	The timeout in each sending to pulsar. `Config Param: PRODUCER_SEND_TIMEOUT` `Since Version: 0.11.0`
hoodie.write.commit.callback.pulsar.request-timeout	60s	Duration of waiting for completing a request. `Config Param: REQUEST_TIMEOUT` `Since Version: 0.11.0`