All Configurations
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at a few levels.
- Environment Config: Hudi supports passing configurations via a configuration file hudi-defaults.conf in which each line consists of a key and a value separated by whitespace or the = sign. For example:
hoodie.datasource.hive_sync.mode jdbc
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000
hoodie.datasource.hive_sync.support_timestamp false
It helps to have a central configuration file for your common cross-job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements. By default, Hudi loads the configuration file from the /etc/hudi/conf directory. You can specify a different configuration directory by setting the HUDI_CONF_DIR environment variable.
- Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
- Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
- Write Client Configs: Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
- Metastore and Catalog Sync Configs: Configurations used by Hudi to sync metadata to external metastores and catalogs.
- Metrics Configs: These configs are used to enable monitoring and reporting of key Hudi stats and metrics.
- Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels.
- Kafka Connect Configs: These configs are used by the Kafka Connect Sink Connector for writing Hudi tables.
- Amazon Web Services Configs: Configurations specific to Amazon Web Services.
Externalized Config File
Instead of directly passing configuration settings to every Hudi job, you can also centrally set them in a configuration file hudi-defaults.conf. By default, Hudi loads the configuration file from the /etc/hudi/conf directory. You can specify a different configuration directory by setting the HUDI_CONF_DIR environment variable. This can be useful for uniformly enforcing repeated configs (like Hive sync or write/index tuning) across your entire data lake.
Environment Config
Hudi supports passing configurations via a configuration file hudi-defaults.conf in which each line consists of a key and a value separated by whitespace or the = sign. For example:
hoodie.datasource.hive_sync.mode jdbc
hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://localhost:10000
hoodie.datasource.hive_sync.support_timestamp false
It helps to have a central configuration file for your common cross-job configurations/tunings, so all the jobs on your cluster can utilize it. It also works with Spark SQL DML/DDL, and helps avoid having to pass configs inside the SQL statements. By default, Hudi loads the configuration file from the /etc/hudi/conf directory. You can specify a different configuration directory by setting the HUDI_CONF_DIR environment variable.
Spark Datasource Configs
These configs control the Hudi Spark Datasource, providing ability to define keys/partitioning, pick out the write operation, specify how to merge records or choosing query type to read.
Read Options
Options useful for reading tables via read.format.option(...)
Config Class: org.apache.hudi.DataSourceOptions.scala
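For example, these options can be supplied on the Spark DataFrame reader. The sketch below is a minimal illustration, assuming an existing SparkSession (spark) with the Hudi bundle on the classpath and a Hudi table at a placeholder basePath; the instant times are illustrative values only.

// Minimal sketch: passing read options on the Spark DataFrame reader.
// spark, basePath and the instant times are placeholders to adapt.
val basePath = "s3://bucket/path/to/hudi_table"

// Snapshot query as of a past instant (time travel)
val timeTravelDF = spark.read.format("hudi")
  .option("as.of.instant", "20220307091628793")
  .load(basePath)

// Incremental query: fetch records written after a given instant time
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20220307091628793")
  .load(basePath)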
as.of.instant
The query instant for time travel. If this option is not specified, we query the latest snapshot.
Default Value: N/A
(Required)
Config Param: TIME_TRAVEL_AS_OF_INSTANT
hoodie.datasource.read.begin.instanttime
Instant time to start incrementally pulling data from. The instant time here need not necessarily correspond to an instant on the timeline. New data written with an instant_time > BEGIN_INSTANTTIME is fetched. For example, '20170901080000' will get all new data written after Sep 1, 2017 08:00AM.
Default Value: N/A
(Required)
Config Param: BEGIN_INSTANTTIME
hoodie.datasource.read.end.instanttime
Instant time to limit incrementally fetched data to. New data written with an instant_time <= END_INSTANTTIME is fetched.
Default Value: N/A
(Required)
Config Param: END_INSTANTTIME
hoodie.datasource.read.paths
Comma separated list of file paths to read within a Hudi table.
Default Value: N/A
(Required)
Config Param: READ_PATHS
hoodie.datasource.merge.type
For Snapshot query on a merge-on-read table, controls whether we invoke the record payload implementation to merge (payload_combine) or skip merging altogether (skip_merge).
Default Value: payload_combine (Optional)
Config Param: REALTIME_MERGE
hoodie.datasource.query.incremental.format
This config is used along with the 'incremental' query type. When set to 'latest_state', it returns the latest records' values. When set to 'cdc', it returns the CDC data.
Default Value: latest_state (Optional)
Config Param: INCREMENTAL_FORMAT
Since Version: 0.13.0
hoodie.datasource.query.type
Whether data needs to be read in incremental mode (new data since an instantTime), Read Optimized mode (obtain latest view based on base files), or Snapshot mode (obtain latest view by merging base and (if any) log files).
Default Value: snapshot (Optional)
Config Param: QUERY_TYPE
hoodie.datasource.read.extract.partition.values.from.path
When set to true, values for partition columns (partition values) will be extracted from the physical partition path (default Spark behavior). When set to false, partition values will be read from the data file (in Hudi, partition columns are persisted by default). This config is a fallback that preserves existing behavior, and should not be used otherwise.
Default Value: false (Optional)
Config Param: EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH
Since Version: 0.11.0
hoodie.datasource.read.file.index.listing.mode
Overrides Hudi's file-index implementation's file listing mode: when set to 'eager', the file-index will list all partition paths and the corresponding file slices within them eagerly, during initialization, prior to partition-pruning kicking in, meaning that all partitions will be listed, including ones that might subsequently be pruned out; when set to 'lazy', partitions and the file-slices within them will be listed lazily (i.e. when they are actually accessed, instead of when the file-index is initialized), allowing partition pruning to occur before listing, so that only partitions surviving pruning are listed. Please note that this config is provided purely to allow a fallback to the behavior existing prior to the 0.13.0 release, and will be deprecated soon after.
Default Value: lazy (Optional)
Config Param: FILE_INDEX_LISTING_MODE_OVERRIDE
Since Version: 0.13.0
hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled
Controls whether partition-path prefix analysis is enabled within the file-index, avoiding the need to recursively list deep folder structures of tables partitioned by multiple partition columns, by carefully analyzing the provided partition-column predicates and deducing the corresponding partition-path prefix from them (if possible).
Default Value: true (Optional)
Config Param: FILE_INDEX_LISTING_PARTITION_PATH_PREFIX_ANALYSIS_ENABLED
Since Version: 0.13.0
hoodie.datasource.read.incr.fallback.fulltablescan.enable
When doing an incremental query, whether we should fall back to full table scans if a file does not exist.
Default Value: false (Optional)
Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES
hoodie.datasource.read.incr.filters
For use-cases like DeltaStreamer which reads from Hoodie Incremental table and applies opaque map functions, filters appearing late in the sequence of transformations cannot be automatically pushed down. This option allows setting filters directly on Hoodie Source.
Default Value: (Optional)
Config Param: PUSH_DOWN_INCR_FILTERS
hoodie.datasource.read.incr.path.glob
For use-cases where users want to incrementally pull only from certain partitions instead of the full table. This option allows using a glob pattern to directly filter on the path.
Default Value: (Optional)
Config Param: INCR_PATH_GLOB
hoodie.datasource.read.schema.use.end.instanttime
Uses the end instant's schema for incrementally fetched data. Default: uses the latest instant's schema.
Default Value: false (Optional)
Config Param: INCREMENTAL_READ_SCHEMA_USE_END_INSTANTTIME
hoodie.datasource.streaming.startOffset
Start offset to pull data from the hoodie streaming source. Allowed values are earliest, latest, and a specified start instant time.
Default Value: earliest (Optional)
Config Param: START_OFFSET
Since Version: 0.13.0
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: READ_PRE_COMBINE_FIELD
hoodie.enable.data.skipping
Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over files
Default Value: false (Optional)
Config Param: ENABLE_DATA_SKIPPING
Since Version: 0.10.0
hoodie.file.index.enable
Enables use of the spark file index implementation for Hudi, that speeds up listing of large tables.
Default Value: true (Optional)
Config Param: ENABLE_HOODIE_FILE_INDEX
Deprecated Version: 0.11.0
hoodie.schema.on.read.enable
Enables support for Schema Evolution feature
Default Value: false (Optional)
Config Param: SCHEMA_EVOLUTION_ENABLED
Write Options
You can pass down any of the WriteClient level configs directly using the options() or option(k,v) methods.
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
Options useful for writing tables via write.format.option(...)
Config Class: org.apache.hudi.DataSourceOptions.scala
hoodie.datasource.hive_sync.mode
Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Default Value: N/A
(Required)
Config Param: HIVE_SYNC_MODE
hoodie.datasource.hive_sync.serde_properties
Serde properties to hive table.
Default Value: N/A
(Required)
Config Param: HIVE_TABLE_SERDE_PROPERTIES
hoodie.datasource.hive_sync.table_properties
Additional properties to store with table.
Default Value: N/A
(Required)
Config Param: HIVE_TABLE_PROPERTIES
hoodie.datasource.write.partitionpath.field
Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
Default Value: N/A
(Required)
Config Param: PARTITIONPATH_FIELD
hoodie.datasource.write.partitions.to.delete
Comma separated list of partitions to delete. Allows use of wildcard *
Default Value: N/A
(Required)
Config Param: PARTITIONS_TO_DELETE
hoodie.datasource.write.recordkey.field
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. a.b.c
Default Value: N/A
(Required)
Config Param: RECORDKEY_FIELD
hoodie.datasource.write.table.name
Table name for the datasource write. Also used to register the table into meta stores.
Default Value: N/A
(Required)
Config Param: TABLE_NAME
hoodie.clustering.async.enabled
Enable running of clustering service, asynchronously as inserts happen on the table.
Default Value: false (Optional)
Config Param: ASYNC_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.clustering.inline
Turn on inline clustering - clustering will be run after each write operation is complete
Default Value: false (Optional)
Config Param: INLINE_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.datasource.compaction.async.enable
Controls whether async compaction should be turned on for MOR table writing.
Default Value: true (Optional)
Config Param: ASYNC_COMPACT_ENABLE
hoodie.datasource.hive_sync.assume_date_partitioning
Assume partitioning is yyyy/MM/dd
Default Value: false (Optional)
Config Param: HIVE_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.auto_create_database
Auto create hive database if it does not exist
Default Value: true (Optional)
Config Param: HIVE_AUTO_CREATE_DATABASE
hoodie.datasource.hive_sync.base_file_format
Base file format for the sync.
Default Value: PARQUET (Optional)
Config Param: HIVE_BASE_FILE_FORMAT
hoodie.datasource.hive_sync.batch_num
The number of partitions synced to hive in one batch.
Default Value: 1000 (Optional)
Config Param: HIVE_BATCH_SYNC_PARTITION_NUM
hoodie.datasource.hive_sync.bucket_sync
Whether to sync the hive metastore bucket specification when using bucket index. The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'
Default Value: false (Optional)
Config Param: HIVE_SYNC_BUCKET_SYNC
hoodie.datasource.hive_sync.create_managed_table
Whether to sync the table as managed table.
Default Value: false (Optional)
Config Param: HIVE_CREATE_MANAGED_TABLE
hoodie.datasource.hive_sync.database
The name of the destination database that we should sync the hudi table to.
Default Value: default (Optional)
Config Param: HIVE_DATABASE
hoodie.datasource.hive_sync.enable
When set to true, register/sync the table to Apache Hive metastore.
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED
hoodie.datasource.hive_sync.ignore_exceptions
Ignore exceptions when syncing with Hive.
Default Value: false (Optional)
Config Param: HIVE_IGNORE_EXCEPTIONS
hoodie.datasource.hive_sync.jdbcurl
Hive JDBC URL to use for sync
Default Value: jdbc:hive2://localhost:10000 (Optional)
Config Param: HIVE_URL
hoodie.datasource.hive_sync.metastore.uris
Hive metastore URIs to use for sync
Default Value: thrift://localhost:9083 (Optional)
Config Param: METASTORE_URIS
hoodie.datasource.hive_sync.partition_extractor_class
Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.
Default Value: org.apache.hudi.hive.MultiPartKeysValueExtractor (Optional)
Config Param: HIVE_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.partition_fields
Field in the table to use for determining hive partition columns.
Default Value: (Optional)
Config Param: HIVE_PARTITION_FIELDS
hoodie.datasource.hive_sync.password
hive password to use
Default Value: hive (Optional)
Config Param: HIVE_PASS
hoodie.datasource.hive_sync.skip_ro_suffix
Skip the _ro suffix for Read optimized table, when registering
Default Value: false (Optional)
Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE
hoodie.datasource.hive_sync.support_timestamp
‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.
Default Value: false (Optional)
Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE
hoodie.datasource.hive_sync.sync_as_datasource
Default Value: true (Optional)
Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE
hoodie.datasource.hive_sync.sync_comment
Whether to sync the table column comments while syncing the table.
Default Value: false (Optional)
Config Param: HIVE_SYNC_COMMENT
hoodie.datasource.hive_sync.table
The name of the destination table that we should sync the hudi table to.
Default Value: unknown (Optional)
Config Param: HIVE_TABLE
hoodie.datasource.hive_sync.use_jdbc
Use JDBC when hive synchronization is enabled
Default Value: true (Optional)
Config Param: HIVE_USE_JDBC
Deprecated Version: 0.9.0
hoodie.datasource.hive_sync.use_pre_apache_input_format
Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format
Default Value: false (Optional)
Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT
hoodie.datasource.hive_sync.username
hive user name to use
Default Value: hive (Optional)
Config Param: HIVE_USER
hoodie.datasource.meta.sync.enable
Enable Syncing the Hudi Table with an external meta store or data catalog.
Default Value: false (Optional)
Config Param: META_SYNC_ENABLED
hoodie.datasource.meta_sync.condition.sync
If true, only sync on conditions like schema change or partition change.
Default Value: false (Optional)
Config Param: HIVE_CONDITIONAL_SYNC
hoodie.datasource.write.commitmeta.key.prefix
Option keys beginning with this prefix are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information in a consistent way with the hudi timeline
Default Value: _ (Optional)
Config Param: COMMIT_METADATA_KEYPREFIX
hoodie.datasource.write.drop.partition.columns
When set to true, will not write the partition columns into hudi. By default, false.
Default Value: false (Optional)
Config Param: DROP_PARTITION_COLUMNS
hoodie.datasource.write.hive_style_partitioning
Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING
hoodie.datasource.write.insert.drop.duplicates
If set to true, filters out all duplicate records from incoming dataframe, during insert operations.
Default Value: false (Optional)
Config Param: INSERT_DROP_DUPS
hoodie.datasource.write.keygenerator.class
Key generator class that implements org.apache.hudi.keygen.KeyGenerator
Default Value: org.apache.hudi.keygen.SimpleKeyGenerator (Optional)
Config Param: KEYGENERATOR_CLASS_NAME
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled
When set to true, a consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break pipelines that deploy either the fully row-writer path or the non row-writer path. For example, if it is kept disabled then a record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in the row-writer path, while it will be written as long value 1483023240000000 in the non row-writer path. If enabled, then the timestamp value will be written in both cases.
Default Value: false (Optional)
Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED
hoodie.datasource.write.operation
Whether to do upsert, insert or bulkinsert for the write operation. Use bulkinsert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path to scale to load large inputs without the need to cache them.
Default Value: upsert (Optional)
Config Param: OPERATION
hoodie.datasource.write.partitionpath.urlencode
Should we url encode the partition path value, before creating the folder structure.
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
hoodie.datasource.write.payload.class
Payload class used. Override this if you'd like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD
hoodie.datasource.write.reconcile.schema
This config controls how the writer's schema will be selected based on the incoming batch's schema as well as the existing table's one. When schema reconciliation is DISABLED, the incoming batch's schema will be picked as the writer-schema (therefore updating the table's schema). When schema reconciliation is ENABLED, the writer-schema will be picked such that the table's schema (after the txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us to always extend the table's schema during evolution and never lose data (when, for example, an existing column is being dropped in a new batch)
Default Value: false (Optional)
Config Param: RECONCILE_SCHEMA
hoodie.datasource.write.record.merger.impls
List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will be filtered by hoodie.datasource.write.record.merger.strategy. Hudi will pick the most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc.)
Default Value: org.apache.hudi.common.model.HoodieAvroRecordMerger (Optional)
Config Param: RECORD_MERGER_IMPLS
Since Version: 0.13.0
hoodie.datasource.write.record.merger.strategy
Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id
Default Value: eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 (Optional)
Config Param: RECORD_MERGER_STRATEGY
Since Version: 0.13.0
hoodie.datasource.write.row.writer.enable
When set to true, will perform write operations directly using the Spark-native Row representation, avoiding any additional conversion costs.
Default Value: true (Optional)
Config Param: ENABLE_ROW_WRITER
hoodie.datasource.write.streaming.checkpoint.identifier
A stream identifier used by Hudi to fetch the right checkpoint (batch id, to be more specific) corresponding to this writer. Please keep the identifier unique for each writer in a multi-writer scenario. If the value is not set, the checkpoint info is only kept in memory. This could introduce a potential issue where the job is restarted (batch id is lost) while the Spark checkpoint write fails, causing Spark to retry and rewrite the data.
Default Value: default_single_writer (Optional)
Config Param: STREAMING_CHECKPOINT_IDENTIFIER
Since Version: 0.13.0
hoodie.datasource.write.streaming.ignore.failed.batch
Config to indicate whether to ignore any non-exception error (e.g. writestatus error) within a streaming microbatch. Turning this on could hide write status errors while the Spark checkpoint moves ahead, so we recommend users use this with caution.
Default Value: false (Optional)
Config Param: STREAMING_IGNORE_FAILED_BATCH
hoodie.datasource.write.streaming.retry.count
Config to indicate how many times streaming job should retry for a failed micro batch.
Default Value: 3 (Optional)
Config Param: STREAMING_RETRY_CNT
hoodie.datasource.write.streaming.retry.interval.ms
Config to indicate how long (in milliseconds) before a retry should be issued for a failed microbatch
Default Value: 2000 (Optional)
Config Param: STREAMING_RETRY_INTERVAL_MS
hoodie.datasource.write.table.type
The table type for the underlying data, for this write. This can’t change between writes.
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE
hoodie.deltastreamer.source.kafka.value.deserializer.class
This class is used by kafka client to deserialize the records
Default Value: io.confluent.kafka.serializers.KafkaAvroDeserializer (Optional)
Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS
Since Version: 0.9.0
hoodie.meta.sync.client.tool.class
Sync tool class name used to sync to metastore. Defaults to Hive.
Default Value: org.apache.hudi.hive.HiveSyncTool (Optional)
Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME
hoodie.sql.bulk.insert.enable
When set to true, the sql insert statement will use bulk insert.
Default Value: false (Optional)
Config Param: SQL_ENABLE_BULK_INSERT
hoodie.sql.insert.mode
Insert mode when inserting data into a pk-table. The optional modes are: upsert, strict and non-strict. For upsert mode, the insert statement does an upsert operation on the pk-table, which updates duplicate records. For strict mode, the insert statement keeps the primary key uniqueness constraint and does not allow duplicate records. For non-strict mode, hudi just does the insert operation on the pk-table.
Default Value: upsert (Optional)
Config Param: SQL_INSERT_MODE
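For instance, Hive sync can be turned on for a datasource write by passing the hoodie.datasource.hive_sync.* options above. The following is a minimal sketch; inputDF, tableName, basePath, the metastore URI, database/table names and field names are placeholders to adapt, and hms sync mode is assumed here.

// Minimal sketch: enabling Hive metastore sync on a Spark datasource write.
// All values below are illustrative; adapt them to your environment.
import org.apache.spark.sql.SaveMode

inputDF.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://localhost:9083")
  .option("hoodie.datasource.hive_sync.database", "default")
  .option("hoodie.datasource.hive_sync.table", tableName)
  .option("hoodie.datasource.hive_sync.partition_fields", "partition")
  .mode(SaveMode.Append)
  .save(basePath)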
PreCommit Validator Configurations
The following set of configurations help validate new data before commits.
Config Class: org.apache.hudi.config.HoodiePreCommitValidatorConfig
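The validator class and its SQL query (described below) can be passed as write options. The sketch below assumes the SqlQueryEqualityPreCommitValidator class bundled with Hudi (org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator); treat the class name, query, and the inputDF/tableName/basePath objects as placeholders.

// Minimal sketch: attaching a pre-commit validator to a Spark datasource write.
// The validator class and query are assumptions to adapt to your own checks.
import org.apache.spark.sql.SaveMode

inputDF.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.precommit.validators",
    "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator")
  // <TABLE_NAME> is substituted with the table state before and after the commit
  .option("hoodie.precommit.validators.equality.sql.queries",
    "select count(*) from <TABLE_NAME>")
  .mode(SaveMode.Append)
  .save(basePath)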
hoodie.precommit.validators
Comma separated list of class names that can be invoked to validate commit
Default Value: (Optional)
Config Param: VALIDATOR_CLASS_NAMES
hoodie.precommit.validators.equality.sql.queries
Spark SQL queries to run on the table before committing new data, to validate state before and after commit. Multiple queries separated by the ';' delimiter are supported. Example: "select count(*) from <TABLE_NAME>". Note <TABLE_NAME> is replaced by table state before and after commit.
Default Value: (Optional)
Config Param: EQUALITY_SQL_QUERIES
hoodie.precommit.validators.inequality.sql.queries
Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example query: 'select count(*) from <TABLE_NAME> where col=null'. Note <TABLE_NAME> variable is expected to be present in query.
Default Value: (Optional)
Config Param: INEQUALITY_SQL_QUERIES
hoodie.precommit.validators.single.value.sql.queries
Spark SQL queries to run on table before committing new data to validate state after commit. Multiple queries separated by ';' delimiter are supported. Expected result is included as part of query separated by '#'. Example query: 'query1#result1:query2#result2'. Note <TABLE_NAME> variable is expected to be present in query.
Default Value: (Optional)
Config Param: SINGLE_VALUE_SQL_QUERIES
Flink Sql Configs
These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
Flink Options
Flink jobs using SQL can be configured through the options in the WITH clause. The actual datasource level configs are listed below.
Config Class: org.apache.hudi.configuration.FlinkOptions
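For example, these options go into the WITH clause of the Flink SQL table DDL. The sketch below submits the DDL through Flink's TableEnvironment from Scala; the schema, path and option values are placeholders, and the hudi-flink bundle is assumed to be on the classpath.

// Minimal sketch: setting Hudi Flink options in the WITH clause of a table DDL.
// Schema, path and option values are illustrative only.
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tableEnv = TableEnvironment.create(
  EnvironmentSettings.newInstance().inStreamingMode().build())

tableEnv.executeSql(
  """CREATE TABLE hudi_sink (
    |  uuid STRING PRIMARY KEY NOT ENFORCED,
    |  name STRING,
    |  ts TIMESTAMP(3),
    |  `partition` STRING
    |) PARTITIONED BY (`partition`) WITH (
    |  'connector' = 'hudi',
    |  'path' = 's3://bucket/path/to/hudi_table',
    |  'table.type' = 'MERGE_ON_READ',
    |  'precombine.field' = 'ts',
    |  'compaction.async.enabled' = 'true'
    |)""".stripMargin)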
clustering.tasks
Parallelism of tasks that do actual clustering, default same as the write task parallelism
Default Value: N/A
(Required)
Config Param: CLUSTERING_TASKS
compaction.tasks
Parallelism of tasks that do actual compaction, default same as the write task parallelism
Default Value: N/A
(Required)
Config Param: COMPACTION_TASKS
hive_sync.conf.dir
The hive configuration directory where hive-site.xml is located; the file should be put on the client machine
Default Value: N/A
(Required)
Config Param: HIVE_SYNC_CONF_DIR
hive_sync.serde_properties
Serde properties to hive table, the data format is k1=v1 k2=v2
Default Value: N/A
(Required)
Config Param: HIVE_SYNC_TABLE_SERDE_PROPERTIES
hive_sync.table_properties
Additional properties to store with table, the data format is k1=v1 k2=v2
Default Value: N/A
(Required)
Config Param: HIVE_SYNC_TABLE_PROPERTIES
hoodie.database.name
Database name to register to Hive metastore
Default Value: N/A
(Required)
Config Param: DATABASE_NAME
hoodie.datasource.write.keygenerator.class
Key generator class that will extract the key out of the incoming record
Default Value: N/A
(Required)
Config Param: KEYGEN_CLASS_NAME
hoodie.table.name
Table name to register to Hive metastore
Default Value: N/A
(Required)
Config Param: TABLE_NAME
path
Base path for the target hoodie table. The path will be created if it does not exist; otherwise a Hudi table is expected to have been initialized successfully there
Default Value: N/A
(Required)
Config Param: PATH
read.end-commit
End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'
Default Value: N/A
(Required)
Config Param: READ_END_COMMIT
read.start-commit
Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read
Default Value: N/A
(Required)
Config Param: READ_START_COMMIT
read.tasks
Parallelism of tasks that do actual read, default is the parallelism of the execution environment
Default Value: N/A
(Required)
Config Param: READ_TASKS
source.avro-schema
Source avro schema string, the parsed schema is used for deserialization
Default Value: N/A
(Required)
Config Param: SOURCE_AVRO_SCHEMA
source.avro-schema.path
Source avro schema file path, the parsed schema is used for deserialization
Default Value: N/A
(Required)
Config Param: SOURCE_AVRO_SCHEMA_PATH
write.bucket_assign.tasks
Parallelism of tasks that do bucket assign, default same as the write task parallelism
Default Value: N/A
(Required)
Config Param: BUCKET_ASSIGN_TASKS
write.index_bootstrap.tasks
Parallelism of tasks that do index bootstrap, default same as the write task parallelism
Default Value: N/A
(Required)
Config Param: INDEX_BOOTSTRAP_TASKS
write.partition.format
Partition path format, only valid when 'write.datetime.partitioning' is true, default is:
- 'yyyyMMddHH' for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL;
- 'yyyyMMdd' for DATE and INT.
Default Value: N/A
(Required)
Config Param: PARTITION_FORMAT
write.tasks
Parallelism of tasks that do actual write, default is the parallelism of the execution environment
Default Value: N/A
(Required)
Config Param: WRITE_TASKS
archive.max_commits
Max number of commits to keep before archiving older commits into a sequential log, default 50
Default Value: 50 (Optional)
Config Param: ARCHIVE_MAX_COMMITS
archive.min_commits
Min number of commits to keep before archiving older commits into a sequential log, default 40
Default Value: 40 (Optional)
Config Param: ARCHIVE_MIN_COMMITS
cdc.enabled
When enabled, persist the change data if necessary, and it can be queried in CDC query mode
Default Value: false (Optional)
Config Param: CDC_ENABLED
cdc.supplemental.logging.mode
Setting 'op_key_only' persists the 'op' and the record key only, setting 'data_before' persists the additional 'before' image, and setting 'data_before_after' persists the additional 'before' and 'after' images.
Default Value: data_before_after (Optional)
Config Param: SUPPLEMENTAL_LOGGING_MODE
changelog.enabled
Whether to keep all the intermediate changes. When enabled, we try to keep all the changes of a record: 1) the sink accepts UPDATE_BEFORE messages; 2) the source tries to emit every change of a record. The semantics are best-effort because the compaction job will finally merge all changes of a record into one. Default false, i.e. UPSERT semantics
Default Value: false (Optional)
Config Param: CHANGELOG_ENABLED
clean.async.enabled
Whether to cleanup the old commits immediately on new commits, enabled by default
Default Value: true (Optional)
Config Param: CLEAN_ASYNC_ENABLED
clean.policy
Clean policy to manage the Hudi table. Available options: KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS. Default is KEEP_LATEST_COMMITS.
Default Value: KEEP_LATEST_COMMITS (Optional)
Config Param: CLEAN_POLICY
clean.retain_commits
Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30
Default Value: 30 (Optional)
Config Param: CLEAN_RETAIN_COMMITS
clean.retain_file_versions
Number of file versions to retain. default 5
Default Value: 5 (Optional)
Config Param: CLEAN_RETAIN_FILE_VERSIONS
clean.retain_hours
Number of hours for which commits need to be retained. This config provides a more flexible option as compared to the number of commits retained for the cleaning service. Setting this property ensures that all files, except the latest in a file group, corresponding to commits with commit times older than the configured number of hours are cleaned.
Default Value: 24 (Optional)
Config Param: CLEAN_RETAIN_HOURS
clustering.async.enabled
Async Clustering, default false
Default Value: false (Optional)
Config Param: CLUSTERING_ASYNC_ENABLED
clustering.delta_commits
Max delta commits needed to trigger clustering, default 4 commits
Default Value: 4 (Optional)
Config Param: CLUSTERING_DELTA_COMMITS
clustering.plan.partition.filter.mode
Partition filter mode used in the creation of the clustering plan. Available values are:
- NONE: do not filter partitions; the clustering plan will include all partitions that have clustering candidates.
- RECENT_DAYS: keep a continuous range of partitions, working together with the configs 'hoodie.clustering.plan.strategy.daybased.lookback.partitions' and 'hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions'.
- SELECTED_PARTITIONS: keep partitions that are in the specified range ['hoodie.clustering.plan.strategy.cluster.begin.partition', 'hoodie.clustering.plan.strategy.cluster.end.partition'].
- DAY_ROLLING: cluster partitions on a rolling basis by the hour, to avoid clustering all partitions each time; this strategy sorts the partitions in ascending order and chooses the partitions whose index modulo 24 equals the current hour.
Default Value: NONE (Optional)
Config Param: CLUSTERING_PLAN_PARTITION_FILTER_MODE_NAME
clustering.plan.strategy.class
Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create the clustering plan, i.e. select which file groups are being clustered. The default strategy looks at the last N day-based partitions (determined by clustering.plan.strategy.daybased.lookback.partitions) and picks the small file slices within those partitions.
Default Value: org.apache.hudi.client.clustering.plan.strategy.FlinkSizeBasedClusteringPlanStrategy (Optional)
Config Param: CLUSTERING_PLAN_STRATEGY_CLASS
clustering.plan.strategy.daybased.lookback.partitions
Number of partitions to list to create ClusteringPlan, default is 2
Default Value: 2 (Optional)
Config Param: CLUSTERING_TARGET_PARTITIONS
clustering.plan.strategy.daybased.skipfromlatest.partitions
Number of partitions to skip from latest when choosing partitions to create ClusteringPlan
Default Value: 0 (Optional)
Config Param: CLUSTERING_PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST
clustering.plan.strategy.max.num.groups
Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism, default is 30
Default Value: 30 (Optional)
Config Param: CLUSTERING_MAX_NUM_GROUPS
clustering.plan.strategy.small.file.limit
Files smaller than the size specified here are candidates for clustering, default 600 MB
Default Value: 600 (Optional)
Config Param: CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT
clustering.plan.strategy.sort.columns
Columns to sort the data by when clustering
Default Value: (Optional)
Config Param: CLUSTERING_SORT_COLUMNS
clustering.plan.strategy.target.file.max.bytes
Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups, default 1 GB
Default Value: 1073741824 (Optional)
Config Param: CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES
clustering.schedule.enabled
Schedule the cluster plan, default false
Default Value: false (Optional)
Config Param: CLUSTERING_SCHEDULE_ENABLED
compaction.async.enabled
Async Compaction, enabled by default for MOR
Default Value: true (Optional)
Config Param: COMPACTION_ASYNC_ENABLED
compaction.delta_commits
Max delta commits needed to trigger compaction, default 5 commits
Default Value: 5 (Optional)
Config Param: COMPACTION_DELTA_COMMITS
compaction.delta_seconds
Max delta seconds time needed to trigger compaction, default 1 hour
Default Value: 3600 (Optional)
Config Param: COMPACTION_DELTA_SECONDS
compaction.max_memory
Max memory in MB for compaction spillable map, default 100MB
Default Value: 100 (Optional)
Config Param: COMPACTION_MAX_MEMORY
compaction.schedule.enabled
Schedule the compaction plan, enabled by default for MOR
Default Value: true (Optional)
Config Param: COMPACTION_SCHEDULE_ENABLED
compaction.target_io
Target IO in MB per compaction (both read and write), default 500 GB
Default Value: 512000 (Optional)
Config Param: COMPACTION_TARGET_IO
compaction.timeout.seconds
Max timeout time in seconds for online compaction to rollback, default 20 minutes
Default Value: 1200 (Optional)
Config Param: COMPACTION_TIMEOUT_SECONDS
compaction.trigger.strategy
Strategy to trigger compaction, options are 'num_commits': trigger compaction when reach N delta commits; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits'
Default Value: num_commits (Optional)
Config Param: COMPACTION_TRIGGER_STRATEGY
hive_sync.assume_date_partitioning
Assume partitioning is yyyy/mm/dd, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_ASSUME_DATE_PARTITION
hive_sync.auto_create_db
Auto create hive database if it does not exist, default true
Default Value: true (Optional)
Config Param: HIVE_SYNC_AUTO_CREATE_DB
hive_sync.db
Database name for hive sync, default 'default'
Default Value: default (Optional)
Config Param: HIVE_SYNC_DB
hive_sync.enabled
Asynchronously sync Hive meta to HMS, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED
hive_sync.file_format
File format for hive sync, default 'PARQUET'
Default Value: PARQUET (Optional)
Config Param: HIVE_SYNC_FILE_FORMAT
hive_sync.ignore_exceptions
Ignore exceptions during hive synchronization, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_IGNORE_EXCEPTIONS
hive_sync.jdbc_url
Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000'
Default Value: jdbc:hive2://localhost:10000 (Optional)
Config Param: HIVE_SYNC_JDBC_URL
hive_sync.metastore.uris
Metastore uris for hive sync, default ''
Default Value: (Optional)
Config Param: HIVE_SYNC_METASTORE_URIS
hive_sync.mode
Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'hms'
Default Value: HMS (Optional)
Config Param: HIVE_SYNC_MODE
hive_sync.partition_extractor_class
Tool to extract the partition value from HDFS path, default 'MultiPartKeysValueExtractor'
Default Value: org.apache.hudi.hive.MultiPartKeysValueExtractor (Optional)
Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME
hive_sync.partition_fields
Partition fields for hive sync, default ''
Default Value: (Optional)
Config Param: HIVE_SYNC_PARTITION_FIELDS
hive_sync.password
Password for hive sync, default 'hive'
Default Value: hive (Optional)
Config Param: HIVE_SYNC_PASSWORD
hive_sync.skip_ro_suffix
Skip the _ro suffix for Read optimized table when registering, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_SKIP_RO_SUFFIX
hive_sync.support_timestamp
INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility.
Default Value: true (Optional)
Config Param: HIVE_SYNC_SUPPORT_TIMESTAMP
hive_sync.table
Table name for hive sync, default 'unknown'
Default Value: unknown (Optional)
Config Param: HIVE_SYNC_TABLE
hive_sync.table.strategy
Hive table synchronization strategy. Available option: RO, RT, ALL.
Default Value: ALL (Optional)
Config Param: HIVE_SYNC_TABLE_STRATEGY
hive_sync.use_jdbc
Use JDBC when hive synchronization is enabled, default true
Default Value: true (Optional)
Config Param: HIVE_SYNC_USE_JDBC
hive_sync.username
Username for hive sync, default 'hive'
Default Value: hive (Optional)
Config Param: HIVE_SYNC_USERNAME
hoodie.bucket.index.hash.field
Index key field. Value to be used as hashing to find the bucket ID. Should be a subset of or equal to the recordKey fields. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. a.b.c
Default Value: (Optional)
Config Param: INDEX_KEY_FIELD
hoodie.bucket.index.num.buckets
Hudi bucket number per partition. Only affected if using Hudi bucket index.
Default Value: 4 (Optional)
Config Param: BUCKET_INDEX_NUM_BUCKETS
hoodie.datasource.merge.type
For Snapshot query on a merge-on-read table. Use this key to define how the payloads are merged: 1) skip_merge: read the base file records plus the log file records; 2) payload_combine: read the base file records first; for each record in the base file, check whether the key is in the log file records (combining the two records with the same key for base and log file records), then read the remaining log file records
Default Value: payload_combine (Optional)
Config Param: MERGE_TYPE
hoodie.datasource.query.type
Decides how data files need to be read: 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data). Default: snapshot
Default Value: snapshot (Optional)
Config Param: QUERY_TYPE
hoodie.datasource.write.hive_style_partitioning
Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING
hoodie.datasource.write.keygenerator.type
Key generator type, used to extract the key out of the incoming record. Note: this is being actively worked on; please use hoodie.datasource.write.keygenerator.class instead.
Default Value: SIMPLE (Optional)
Config Param: KEYGEN_TYPE
hoodie.datasource.write.partitionpath.field
Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString(), default ''
Default Value: (Optional)
Config Param: PARTITION_PATH_FIELD
hoodie.datasource.write.partitionpath.urlencode
Whether to encode the partition path url, default false
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
hoodie.datasource.write.recordkey.field
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. a.b.c
Default Value: uuid (Optional)
Config Param: RECORD_KEY_FIELD
index.bootstrap.enabled
Whether to bootstrap the index state from existing hoodie table, default false
Default Value: false (Optional)
Config Param: INDEX_BOOTSTRAP_ENABLED
index.global.enabled
Whether to update index for the old partition path if same key record with different partition path came in, default true
Default Value: true (Optional)
Config Param: INDEX_GLOBAL_ENABLED
index.partition.regex
Whether to load a partition into state if its partition path matches the regex, default '.*'
Default Value: .* (Optional)
Config Param: INDEX_PARTITION_REGEX
index.state.ttl
Index state ttl in days, default stores the index permanently
Default Value: 0.0 (Optional)
Config Param: INDEX_STATE_TTL
index.type
Index type of Flink write job, default is using state backed index.
Default Value: FLINK_STATE (Optional)
Config Param: INDEX_TYPE
metadata.compaction.delta_commits
Max delta commits for metadata table to trigger compaction, default 10
Default Value: 10 (Optional)
Config Param: METADATA_COMPACTION_DELTA_COMMITS
metadata.enabled
Enable the internal metadata table which serves table metadata like file listings, default disabled
Default Value: false (Optional)
Config Param: METADATA_ENABLED
partition.default_name
The default partition name in case the dynamic partition column value is null/empty string
Default Value: __HIVE_DEFAULT_PARTITION__ (Optional)
Config Param: PARTITION_DEFAULT_NAME
payload.class
Payload class used. Override this if you'd like to roll your own merge logic when upserting/inserting. This will render any value set for the option ineffective
Default Value: org.apache.hudi.common.model.EventTimeAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME
precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD
read.data.skipping.enabled
Enables data-skipping, allowing queries to leverage indexes to reduce the search space by skipping over files
Default Value: false (Optional)
Config Param: READ_DATA_SKIPPING_ENABLED
read.streaming.check-interval
Check interval for streaming read, in seconds; default 1 minute
Default Value: 60 (Optional)
Config Param: READ_STREAMING_CHECK_INTERVAL
read.streaming.enabled
Whether to read as streaming source, default false
Default Value: false (Optional)
Config Param: READ_AS_STREAMING
read.streaming.skip_clustering
Whether to skip clustering instants for streaming read, to avoid reading duplicates
Default Value: false (Optional)
Config Param: READ_STREAMING_SKIP_CLUSTERING
read.streaming.skip_compaction
Whether to skip compaction instants for streaming read, there are two cases that this option can be used to avoid reading duplicates:
- you are definitely sure that the consumer reads faster than any compaction instants, usually with a delta time compaction strategy that is long enough, e.g. one week;
- changelog mode is enabled, this option is a solution to keep data integrity
Default Value: false (Optional)
Config Param: READ_STREAMING_SKIP_COMPACT
read.utc-timezone
Use UTC timezone or local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x uses local timezone, but Hive 3.x uses UTC timezone; by default true
Default Value: true (Optional)
Config Param: UTC_TIMEZONE
record.merger.impls
List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will be filtered by record.merger.strategy. Hudi will pick the most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc.)
Default Value: org.apache.hudi.common.model.HoodieAvroRecordMerger (Optional)
Config Param: RECORD_MERGER_IMPLS
record.merger.strategy
Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in record.merger.impls which have the same merger strategy id
Default Value: eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 (Optional)
Config Param: RECORD_MERGER_STRATEGY
table.type
Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE
write.batch.size
Batch buffer size in MB to flush data into the underneath filesystem, default 256MB
Default Value: 256.0 (Optional)
Config Param: WRITE_BATCH_SIZE
write.bulk_insert.shuffle_input
Whether to shuffle the inputs by specific fields for bulk insert tasks, default true
Default Value: true (Optional)
Config Param: WRITE_BULK_INSERT_SHUFFLE_INPUT
write.bulk_insert.sort_input
Whether to sort the inputs by specific fields for bulk insert tasks, default true
Default Value: true (Optional)
Config Param: WRITE_BULK_INSERT_SORT_INPUT
write.commit.ack.timeout
Timeout limit for a writer task after it finishes a checkpoint and waits for the instant commit success, only for internal use
Default Value: -1 (Optional)
Config Param: WRITE_COMMIT_ACK_TIMEOUT
write.ignore.failed
Flag to indicate whether to ignore any non-exception error (e.g. writestatus error) within a checkpoint batch. By default false. Turning this on could hide write status errors while the checkpoint moves ahead, so we recommend users use this with caution.
Default Value: false (Optional)
Config Param: IGNORE_FAILED
write.insert.cluster
Whether to merge small files for insert mode. If true, write throughput will decrease because of the read/write of existing small files; only valid for COW tables, default false
Default Value: false (Optional)
Config Param: INSERT_CLUSTER
write.log.max.size
Maximum size allowed in MB for a log file before it is rolled over to the next version, default 1GB
Default Value: 1024 (Optional)
Config Param: WRITE_LOG_MAX_SIZE
write.log_block.size
Max log block size in MB for log file, default 128MB
Default Value: 128 (Optional)
Config Param: WRITE_LOG_BLOCK_SIZE
write.merge.max_memory
Max memory in MB for merge, default 100MB
Default Value: 100 (Optional)
Config Param: WRITE_MERGE_MAX_MEMORY
write.operation
The write operation, that this write should do
Default Value: upsert (Optional)
Config Param: OPERATION
write.parquet.block.size
Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_BLOCK_SIZE
write.parquet.max.file.size
Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_MAX_FILE_SIZE
write.parquet.page.size
Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Default Value: 1 (Optional)
Config Param: WRITE_PARQUET_PAGE_SIZE
write.precombine
Flag to indicate whether to drop duplicates before insert/upsert. By default these cases will accept duplicates, to gain extra performance:
- insert operation;
- upsert for MOR table; the MOR table deduplicates on reading
Default Value: false (Optional)
Config Param: PRE_COMBINE
write.rate.limit
Write record rate limit per second to prevent traffic jitter and improve stability, default 0 (no limit)
Default Value: 0 (Optional)
Config Param: WRITE_RATE_LIMIT
write.retry.interval.ms
Flag to indicate how long (in milliseconds) before a retry should be issued for a failed checkpoint batch. By default 2000, and it will be doubled with every retry
Default Value: 2000 (Optional)
Config Param: RETRY_INTERVAL_MS
write.retry.times
Flag to indicate how many times streaming job should retry for a failed checkpoint batch. By default 3
Default Value: 3 (Optional)
Config Param: RETRY_TIMES
write.sort.memory
Sort memory in MB, default 128MB
Default Value: 128 (Optional)
Config Param: WRITE_SORT_MEMORY
write.task.max.size
Maximum memory in MB for a write task, when the threshold hits, it flushes the max size data bucket to avoid OOM, default 1GB
Default Value: 1024.0 (Optional)
Config Param: WRITE_TASK_MAX_SIZE
Write Client Configs
Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
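Since any of these write client configs can be passed through the datasource options (as shown earlier), lower level services can also be tuned per write. A minimal sketch, using a few of the clean and memory configs documented below with illustrative values; inputDF, tableName and basePath are placeholders.

// Minimal sketch: tuning write-client level services through datasource options.
// The keys are write client configs described in the sections below; values are
// illustrative only.
import org.apache.spark.sql.SaveMode

inputDF.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
  .option("hoodie.cleaner.commits.retained", "10")
  .option("hoodie.clean.async", "true")
  .option("hoodie.memory.merge.max.size", "1073741824")
  .mode(SaveMode.Append)
  .save(basePath)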
Layout Configs
Configurations that control storage layout and data distribution, which defines how the files are organized within a table.
Config Class
: org.apache.hudi.config.HoodieLayoutConfig
hoodie.storage.layout.partitioner.class
Partitioner class, it is used to distribute data in a specific way.
Default Value: N/A
(Required)
Config Param: LAYOUT_PARTITIONER_CLASS_NAME
hoodie.storage.layout.type
Type of storage layout. Possible options are [DEFAULT | BUCKET]
Default Value: DEFAULT (Optional)
Config Param: LAYOUT_TYPE
Clean Configs
Cleaning (reclamation of older/unused file groups/slices).
Config Class
: org.apache.hudi.config.HoodieCleanConfig
hoodie.clean.allow.multiple
Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests are mutually exclusive, i.e. a 2nd clean will not be scheduled if another clean has not yet completed (to avoid repeat cleaning of the same files), they might want to disable this config.
Default Value: true (Optional)
Config Param: ALLOW_MULTIPLE_CLEANS
Since Version: 0.11.0
hoodie.clean.async
Only applies when hoodie.clean.automatic is turned on. When turned on runs cleaner async with writing, which can speed up overall write performance.
Default Value: false (Optional)
Config Param: ASYNC_CLEAN
hoodie.clean.automatic
When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded.
Default Value: true (Optional)
Config Param: AUTO_CLEAN
hoodie.clean.max.commits
Number of commits after the last clean operation, before scheduling of a new clean is attempted.
Default Value: 1 (Optional)
Config Param: CLEAN_MAX_COMMITS
hoodie.clean.trigger.strategy
Controls how cleaning is scheduled. Valid options: NUM_COMMITS
Default Value: NUM_COMMITS (Optional)
Config Param: CLEAN_TRIGGER_STRATEGY
hoodie.cleaner.commits.retained
Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.
Default Value: 10 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
hoodie.cleaner.delete.bootstrap.base.file
When set to true, the cleaner also deletes the bootstrap base file when its skeleton base file is cleaned. Turn this to true if you want to ensure the bootstrap dataset storage is reclaimed over time, as the table receives updates/deletes. Another reason to turn this on would be to ensure data residing in bootstrap base files is also physically deleted, to comply with data privacy enforcement processes.
Default Value: false (Optional)
Config Param: CLEANER_BOOTSTRAP_BASE_FILE_ENABLE
hoodie.cleaner.fileversions.retained
When KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning.
Default Value: 3 (Optional)
Config Param: CLEANER_FILE_VERSIONS_RETAINED
hoodie.cleaner.hours.retained
Number of hours for which commits need to be retained. This config provides a more flexible option as compared to the number of commits retained for the cleaning service. Setting this property ensures that all files, except the latest in a file group, corresponding to commits with commit times older than the configured number of hours are cleaned.
Default Value: 24 (Optional)
Config Param: CLEANER_HOURS_RETAINED
hoodie.cleaner.incremental.mode
When enabled, the plans for each cleaner service run are computed incrementally off the events in the timeline since the last cleaner run. This is much more efficient than obtaining listings for the full table for each planning (even with a metadata table).
Default Value: true (Optional)
Config Param: CLEANER_INCREMENTAL_MODE_ENABLE
hoodie.cleaner.parallelism
Parallelism for the cleaning operation. Increase this if cleaning becomes slow.
Default Value: 200 (Optional)
Config Param: CLEANER_PARALLELISM_VALUE
hoodie.cleaner.policy
Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. By default, the cleaner spares the file slices written by the last N commits, determined by hoodie.cleaner.commits.retained. Long running queries may often refer to older file slices and will break if those are cleaned before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time
Default Value: KEEP_LATEST_COMMITS (Optional)
Config Param: CLEANER_POLICY
hoodie.cleaner.policy.failed.writes
Cleaning policy for failed writes to be used. Hudi will delete any files written by failed writes to re-claim space. Choose to perform this rollback of failed writes eagerly before every writer starts (only supported for single writer) or lazily by the cleaner (required for multi-writers)
Default Value: EAGER (Optional)
Config Param: FAILED_WRITES_CLEANER_POLICY
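As a rough illustration of how the cleaning configs above fit together, the following hudi-defaults.conf snippet retains the file slices written by the last 10 commits and rolls back failed writes eagerly; the values simply restate the defaults listed above and are placeholders, not tuning recommendations:
hoodie.cleaner.policy KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained 10
hoodie.cleaner.policy.failed.writes EAGER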
Memory Configurations
Controls memory usage for compaction and merges, performed internally by Hudi.
Config Class
: org.apache.hudi.config.HoodieMemoryConfig
hoodie.memory.compaction.max.size
Maximum amount of memory in bytes used for compaction operations, before spilling to local storage.
Default Value: N/A
(Required)
Config Param: MAX_MEMORY_FOR_COMPACTION
hoodie.memory.compaction.fraction
HoodieCompactedLogScanner reads log blocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the maximum allowable in-memory footprint of the spillable map.
Default Value: 0.6 (Optional)
Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION
hoodie.memory.dfs.buffer.max.size
Property to control the max memory in bytes for the DFS input stream buffer size.
Default Value: 16777216 (Optional)
Config Param: MAX_DFS_STREAM_BUFFER_SIZE
hoodie.memory.merge.fraction
This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge
Default Value: 0.6 (Optional)
Config Param: MAX_MEMORY_FRACTION_FOR_MERGE
hoodie.memory.merge.max.size
Maximum amount of memory used in bytes for merge operations, before spilling to local storage.
Default Value: 1073741824 (Optional)
Config Param: MAX_MEMORY_FOR_MERGE
hoodie.memory.spillable.map.path
Default file path for spillable map
Default Value: /tmp/ (Optional)
Config Param: SPILLABLE_MAP_BASE_PATH
hoodie.memory.writestatus.failure.fraction
Property to control what fraction of the failed records and exceptions we report back to the driver. Default is 10%. If set to 100%, with a lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.
Default Value: 0.1 (Optional)
Config Param: WRITESTATUS_FAILURE_FRACTION
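As a sketch, the memory configs above are typically set together in hudi-defaults.conf; the values below restate the defaults shown above and are placeholders to adjust for your environment and available executor memory:
hoodie.memory.merge.max.size 1073741824
hoodie.memory.compaction.fraction 0.6
hoodie.memory.spillable.map.path /tmp/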
Archival Configs
Configurations that control archival.
Config Class
: org.apache.hudi.config.HoodieArchivalConfig
hoodie.archive.async
Only applies when hoodie.archive.automatic is turned on. When turned on, runs the archiver asynchronously with writing, which can speed up overall write performance.
Default Value: false (Optional)
Config Param: ASYNC_ARCHIVE
Since Version: 0.11.0
hoodie.archive.automatic
When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It's recommended to enable this, to ensure number of active commits is bounded.
Default Value: true (Optional)
Config Param: AUTO_ARCHIVE
hoodie.archive.beyond.savepoint
If enabled, archival will proceed beyond savepoint, skipping savepoint commits. If disabled, archival will stop at the earliest savepoint commit.
Default Value: false (Optional)
Config Param: ARCHIVE_BEYOND_SAVEPOINT
Since Version: 0.12.0
hoodie.archive.delete.parallelism
Parallelism for deleting archived hoodie commits.
Default Value: 100 (Optional)
Config Param: DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE
hoodie.archive.merge.enable
When enabled, Hudi will auto-merge several small archive files into a larger one. It's useful when the storage scheme doesn't support the append operation.
Default Value: false (Optional)
Config Param: ARCHIVE_MERGE_ENABLE
hoodie.archive.merge.files.batch.size
The number of small archive files to be merged at once.
Default Value: 10 (Optional)
Config Param: ARCHIVE_MERGE_FILES_BATCH_SIZE
hoodie.archive.merge.small.file.limit.bytes
This config sets the archive file size limit below which an archive file becomes a candidate to be selected as such a small file.
Default Value: 20971520 (Optional)
Config Param: ARCHIVE_MERGE_SMALL_FILE_LIMIT_BYTES
hoodie.commits.archival.batch
Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.
Default Value: 10 (Optional)
Config Param: COMMITS_ARCHIVAL_BATCH_SIZE
hoodie.keep.max.commits
Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline.
Default Value: 30 (Optional)
Config Param: MAX_COMMITS_TO_KEEP
hoodie.keep.min.commits
Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.
Default Value: 20 (Optional)
Config Param: MIN_COMMITS_TO_KEEP
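As a minimal sketch of the archival knobs above (the values mirror the defaults and are illustrative only), note that hoodie.keep.min.commits is typically kept larger than hoodie.cleaner.commits.retained so that commits are cleaned before they are archived:
hoodie.keep.min.commits 20
hoodie.keep.max.commits 30
hoodie.archive.async false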
Metadata Configs
Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries.
Config Class
: org.apache.hudi.common.config.HoodieMetadataConfig
hoodie.metadata.index.bloom.filter.column.list
Comma-separated list of columns for which bloom filter index will be built. If not set, only record key will be indexed.
Default Value: N/A
(Required)
Config Param: BLOOM_FILTER_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.column.list
Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed
Default Value: N/A
(Required)
Config Param: COLUMN_STATS_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.processing.mode.override
By default, the Column Stats Index automatically determines whether it should be read and processed either 'in-memory' (within the executing process) or using Spark (on a cluster), based on factors like the size of the index and how many columns are read. This config allows overriding this behavior.
Default Value: N/A
(Required)
Config Param: COLUMN_STATS_INDEX_PROCESSING_MODE_OVERRIDE
Since Version: 0.12.0
_hoodie.metadata.ignore.spurious.deletes
There are cases when extra files are requested to be deleted from the metadata table that were never added before. This config determines how to handle such spurious deletes.
Default Value: true (Optional)
Config Param: IGNORE_SPURIOUS_DELETES
Since Version: 0.10.0
hoodie.assume.date.partitioning
Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually
Default Value: false (Optional)
Config Param: ASSUME_DATE_PARTITIONING
Since Version: 0.3.0
hoodie.file.listing.parallelism
Parallelism to use, when listing the table on lake storage.
Default Value: 200 (Optional)
Config Param: FILE_LISTING_PARALLELISM_VALUE
Since Version: 0.7.0
hoodie.metadata.clean.async
Enable asynchronous cleaning for metadata table. This is an internal config and setting this will not overwrite the value actually used.
Default Value: false (Optional)
Config Param: ASYNC_CLEAN_ENABLE
Since Version: 0.7.0
hoodie.metadata.cleaner.commits.retained
Number of commits to retain, without cleaning, on metadata table. This is an internal config and setting this will not overwrite the actual value used.
Default Value: 3 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
Since Version: 0.7.0
hoodie.metadata.compact.max.delta.commits
Controls how often the metadata table is compacted.
Default Value: 10 (Optional)
Config Param: COMPACT_NUM_DELTA_COMMITS
Since Version: 0.7.0
hoodie.metadata.dir.filter.regex
Directories matching this regex, will be filtered out when initializing metadata table from lake storage for the first time.
Default Value: (Optional)
Config Param: DIR_FILTER_REGEX
Since Version: 0.7.0
hoodie.metadata.enable
Enable the internal metadata table, which serves table metadata like file listings.
Default Value: true (Optional)
Config Param: ENABLE
Since Version: 0.7.0
hoodie.metadata.enable.full.scan.log.files
Enable full scanning of log files while reading log records. If disabled, Hudi looks up only the entries of interest. This is an internal config and setting this will not overwrite the actual value used.
Default Value: true (Optional)
Config Param: ENABLE_FULL_SCAN_LOG_FILES
Since Version: 0.10.0
hoodie.metadata.index.async
Enable asynchronous indexing of metadata table.
Default Value: false (Optional)
Config Param: ASYNC_INDEX_ENABLE
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.enable
Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.
Default Value: false (Optional)
Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.file.group.count
Metadata bloom filter index partition file group count. This controls the size of the base and log files and read parallelism in the bloom filter index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Default Value: 4 (Optional)
Config Param: METADATA_INDEX_BLOOM_FILTER_FILE_GROUP_COUNT
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.parallelism
Parallelism to use for generating bloom filter index in metadata table.
Default Value: 200 (Optional)
Config Param: BLOOM_FILTER_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.index.check.timeout.seconds
After the async indexer has finished indexing up to the base instant, it will ensure that all inflight writers reliably write index updates as well. If this timeout expires, then the indexer will abort itself safely.
Default Value: 900 (Optional)
Config Param: METADATA_INDEX_CHECK_TIMEOUT_SECONDS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.enable
Enable indexing column ranges of user data files under the metadata table. When enabled, the metadata table will have a partition to store the column ranges, which will be used for pruning files during index lookups.
Default Value: false (Optional)
Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS
Since Version: 0.11.0
hoodie.metadata.index.column.stats.file.group.count
Metadata column stats partition file group count. This controls the size of the base and log files and read parallelism in the column stats index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Default Value: 2 (Optional)
Config Param: METADATA_INDEX_COLUMN_STATS_FILE_GROUP_COUNT
Since Version: 0.11.0
hoodie.metadata.index.column.stats.inMemory.projection.threshold
When reading the Column Stats Index, if the size of the expected resulting projection is below the in-memory threshold (counted by the # of rows), it will be attempted to be loaded "in-memory" (i.e. not using the execution engine like Spark, Flink, etc.). If the value is above the threshold, the execution engine will be used to compose the projection.
Default Value: 100000 (Optional)
Config Param: COLUMN_STATS_INDEX_IN_MEMORY_PROJECTION_THRESHOLD
Since Version: 0.12.0
hoodie.metadata.index.column.stats.parallelism
Parallelism to use, when generating column stats index.
Default Value: 10 (Optional)
Config Param: COLUMN_STATS_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.insert.parallelism
Parallelism to use when inserting to the metadata table
Default Value: 1 (Optional)
Config Param: INSERT_PARALLELISM_VALUE
Since Version: 0.7.0
hoodie.metadata.keep.max.commits
Similar to hoodie.metadata.keep.min.commits, this config controls the maximum number of instants to retain in the active timeline.
Default Value: 30 (Optional)
Config Param: MAX_COMMITS_TO_KEEP
Since Version: 0.7.0
hoodie.metadata.keep.min.commits
Archiving service moves older entries from metadata table’s timeline into an archived log after each write, to keep the overhead constant, even as the metadata table size grows. This config controls the minimum number of instants to retain in the active timeline.
Default Value: 20 (Optional)
Config Param: MIN_COMMITS_TO_KEEP
Since Version: 0.7.0
hoodie.metadata.metrics.enable
Enable publishing of metrics around metadata table.
Default Value: false (Optional)
Config Param: METRICS_ENABLE
Since Version: 0.7.0
hoodie.metadata.optimized.log.blocks.scan.enable
Optimized log blocks scanner that addresses all the multiwriter use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written by log compaction.
Default Value: false (Optional)
Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN
Since Version: 0.13.0
hoodie.metadata.populate.meta.fields
When enabled, populates all meta fields. When disabled, no meta fields are populated. This is an internal config and setting this will not overwrite the actual value used.
Default Value: false (Optional)
Config Param: POPULATE_META_FIELDS
Since Version: 0.10.0
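For illustration, enabling the metadata table along with the bloom filter and column stats indexes described above could look like the snippet below in hudi-defaults.conf; the column names col_a,col_b are hypothetical placeholders for columns in your own table:
hoodie.metadata.enable true
hoodie.metadata.index.bloom.filter.enable true
hoodie.metadata.index.column.stats.enable true
hoodie.metadata.index.column.stats.column.list col_a,col_b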
Consistency Guard Configurations
The consistency guard related config options, to help talk to eventually consistent object storage. (Tip: S3 is NOT eventually consistent anymore!)
Config Class
: org.apache.hudi.common.fs.ConsistencyGuardConfig
_hoodie.optimistic.consistency.guard.enable
Enable consistency guard, which optimistically assumes consistency is achieved after a certain time period.
Default Value: false (Optional)
Config Param: OPTIMISTIC_CONSISTENCY_GUARD_ENABLE
Since Version: 0.6.0
hoodie.consistency.check.enabled
Enable to handle the S3 eventual consistency issue. This property is no longer required since S3 is now strongly consistent. Will be removed in future releases.
Default Value: false (Optional)
Config Param: ENABLE
Since Version: 0.5.0
Deprecated Version: 0.7.0
hoodie.consistency.check.initial_interval_ms
Amount of time (in ms) to wait, before checking for consistency after an operation on storage.
Default Value: 400 (Optional)
Config Param: INITIAL_CHECK_INTERVAL_MS
Since Version: 0.5.0
Deprecated Version: 0.7.0
hoodie.consistency.check.max_checks
Maximum number of consistency checks to perform, with exponential backoff.
Default Value: 6 (Optional)
Config Param: MAX_CHECKS
Since Version: 0.5.0
Deprecated Version: 0.7.0
hoodie.consistency.check.max_interval_ms
Maximum amount of time (in ms), to wait for consistency checking.
Default Value: 20000 (Optional)
Config Param: MAX_CHECK_INTERVAL_MS
Since Version: 0.5.0
Deprecated Version: 0.7.0
hoodie.optimistic.consistency.guard.sleep_time_ms
Amount of time (in ms), to wait after which we assume storage is consistent.
Default Value: 500 (Optional)
Config Param: OPTIMISTIC_CONSISTENCY_GUARD_SLEEP_TIME_MS
Since Version: 0.6.0
FileSystem Guard Configurations
The filesystem retry related config options, to help deal with runtime exceptions like list/get/put/delete performance issues.
Config Class
: org.apache.hudi.common.fs.FileSystemRetryConfig
hoodie.filesystem.operation.retry.enable
Enable to handle list/get/delete etc. file system performance issues.
Default Value: false (Optional)
Config Param: FILESYSTEM_RETRY_ENABLE
Since Version: 0.11.0
hoodie.filesystem.operation.retry.exceptions
The class names of the exceptions that need to be retried, separated by commas. Default is empty, which means retrying all IOExceptions and RuntimeExceptions from the FileSystem.
Default Value: (Optional)
Config Param: RETRY_EXCEPTIONS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.initial_interval_ms
Amount of time (in ms) to wait, before retrying operations on storage.
Default Value: 100 (Optional)
Config Param: INITIAL_RETRY_INTERVAL_MS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.max_interval_ms
Maximum amount of time (in ms), to wait for next retry.
Default Value: 2000 (Optional)
Config Param: MAX_RETRY_INTERVAL_MS
Since Version: 0.11.0
hoodie.filesystem.operation.retry.max_numbers
Maximum number of retry actions to perform, with exponential backoff.
Default Value: 4 (Optional)
Config Param: MAX_RETRY_NUMBERS
Since Version: 0.11.0
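A minimal sketch of turning on filesystem retries using the options above; the interval and retry counts restate the listed defaults and are placeholders to tune per storage backend:
hoodie.filesystem.operation.retry.enable true
hoodie.filesystem.operation.retry.initial_interval_ms 100
hoodie.filesystem.operation.retry.max_interval_ms 2000
hoodie.filesystem.operation.retry.max_numbers 4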
Write Configurations
Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g Spark datasources, Flink sink) and utilities (e.g DeltaStreamer).
Config Class
: org.apache.hudi.config.HoodieWriteConfig
hoodie.avro.schema
Schema string representing the current write schema of the table. Hudi passes this to implementations of HoodieRecordPayload to convert incoming records to avro. This is also used as the write schema when evolving records during an update.
Default Value: N/A
(Required)
Config Param: AVRO_SCHEMA_STRING
hoodie.base.path
Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
Default Value: N/A
(Required)
Config Param: BASE_PATH
hoodie.bulkinsert.user.defined.partitioner.class
If specified, this class will be used to re-partition records before they are bulk inserted. This can be used to sort, pack, cluster data optimally for common query patterns. For now, we support a built-in user defined bulkinsert partitioner org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner which does sorting based on the specified column values set by hoodie.bulkinsert.user.defined.partitioner.sort.columns
Default Value: N/A
(Required)
Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_CLASS_NAME
hoodie.bulkinsert.user.defined.partitioner.sort.columns
Columns to sort the data by when using org.apache.hudi.execution.bulkinsert.RDDCustomColumnsSortPartitioner as the user defined partitioner during bulk_insert. For example 'column1,column2'
Default Value: N/A
(Required)
Config Param: BULKINSERT_USER_DEFINED_PARTITIONER_SORT_COLUMNS
hoodie.datasource.write.keygenerator.class
Key generator class, that implements org.apache.hudi.keygen.KeyGenerator to extract a key out of incoming records.
Default Value: N/A
(Required)
Config Param: KEYGENERATOR_CLASS_NAME
hoodie.internal.schema
Schema string representing the latest schema of the table. Hudi passes this to implementations of schema evolution.
Default Value: N/A
(Required)
Config Param: INTERNAL_SCHEMA_STRING
hoodie.table.name
Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
Default Value: N/A
(Required)
Config Param: TBL_NAME
hoodie.write.schema
Config allowing to override the writer's schema. This might be necessary in cases when the writer's schema, derived from the incoming dataset, is different from the schema we actually want to use when writing. This, for example, could be the case for 'partial-update' use-cases (like the MERGE INTO Spark SQL statement) where only a projection of the incoming dataset might be used to update the records in the existing table, prompting us to override the writer's schema.
Default Value: N/A
(Required)
Config Param: WRITE_SCHEMA_OVERRIDE
_.hoodie.allow.multi.write.on.same.instant
Default Value: false (Optional)
Config Param: ALLOW_MULTI_WRITE_ON_SAME_INSTANT_ENABLE
hoodie.allow.empty.commit
Whether to allow generation of empty commits, even if no data was written in the commit. It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data
Default Value: true (Optional)
Config Param: ALLOW_EMPTY_COMMIT
hoodie.allow.operation.metadata.field
Whether to include '_hoodie_operation' in the metadata fields. Once enabled, all the changes of a record are persisted to the delta log directly without merge
Default Value: false (Optional)
Config Param: ALLOW_OPERATION_METADATA_FIELD
Since Version: 0.9.0
hoodie.auto.adjust.lock.configs
Auto adjust lock configurations when metadata table is enabled and for async table services.
Default Value: false (Optional)
Config Param: AUTO_ADJUST_LOCK_CONFIGS
Since Version: 0.11.0
hoodie.auto.commit
Controls whether a write operation should auto commit. This can be turned off to perform inspection of the uncommitted write before deciding to commit.
Default Value: true (Optional)
Config Param: AUTO_COMMIT_ENABLE
hoodie.avro.schema.external.transformation
When enabled, records in older schema are rewritten into newer schema during upsert, delete and background compaction, clustering operations.
Default Value: false (Optional)
Config Param: AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE
hoodie.avro.schema.validate
Validate the schema used for the write against the latest schema, for backwards compatibility.
Default Value: false (Optional)
Config Param: AVRO_SCHEMA_VALIDATE_ENABLE
hoodie.bulkinsert.shuffle.parallelism
For large initial imports using bulk_insert operation, controls the parallelism to use for sort modes or custom partitioning done before writing records to the table.
Default Value: 0 (Optional)
Config Param: BULKINSERT_PARALLELISM_VALUE
hoodie.bulkinsert.sort.mode
Sorting modes to use for sorting records for bulk insert. This is used when the user defined partitioner hoodie.bulkinsert.user.defined.partitioner.class is not configured. Available values are:
- GLOBAL_SORT: Ensures best file sizes, with lowest memory overhead, at the cost of sorting.
- PARTITION_SORT: Strikes a balance by only sorting within a partition, still keeping the memory overhead of writing low, with best-effort file sizing.
- PARTITION_PATH_REPARTITION: Ensures that the data for a single physical partition in the table is written by the same Spark executor; best for input data evenly distributed across different partition paths. This can cause imbalance among Spark executors if the input data is skewed, i.e., most records are intended for a handful of partition paths among all.
- PARTITION_PATH_REPARTITION_AND_SORT: Same as PARTITION_PATH_REPARTITION, with an additional step of sorting the records based on the partition path within a single Spark partition, given that data for multiple physical partitions can be sent to the same Spark partition and executor. This can cause imbalance among Spark executors if the input data is skewed.
- NONE: No sorting. Fastest and matches spark.write.parquet() in terms of number of files and overheads.
Default Value: NONE (Optional)
Config Param: BULK_INSERT_SORT_MODE
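For example, a bulk_insert job that wants globally sorted output might set the following; the parallelism value is a placeholder to size for your input, and the sort modes are described above:
hoodie.bulkinsert.shuffle.parallelism 200
hoodie.bulkinsert.sort.mode GLOBAL_SORT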
hoodie.client.heartbeat.interval_in_ms
Writers perform heartbeats to indicate liveness. Controls how often (in ms), such heartbeats are registered to lake storage.
Default Value: 60000 (Optional)
Config Param: CLIENT_HEARTBEAT_INTERVAL_IN_MS
hoodie.client.heartbeat.tolerable.misses
Number of heartbeat misses, before a writer is deemed not alive and all pending writes are aborted.
Default Value: 2 (Optional)
Config Param: CLIENT_HEARTBEAT_NUM_TOLERABLE_MISSES
hoodie.combine.before.delete
During delete operations, controls whether we should combine deletes (and potentially also upserts) before writing to storage.
Default Value: true (Optional)
Config Param: COMBINE_BEFORE_DELETE
hoodie.combine.before.insert
When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage.
Default Value: false (Optional)
Config Param: COMBINE_BEFORE_INSERT
hoodie.combine.before.upsert
When upserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.
Default Value: true (Optional)
Config Param: COMBINE_BEFORE_UPSERT
hoodie.consistency.check.initial_interval_ms
Initial time between successive attempts to ensure written data's metadata is consistent on storage. Grows with exponential backoff after the initial value.
Default Value: 2000 (Optional)
Config Param: INITIAL_CONSISTENCY_CHECK_INTERVAL_MS
hoodie.consistency.check.max_checks
Maximum number of checks, for consistency of written data.
Default Value: 7 (Optional)
Config Param: MAX_CONSISTENCY_CHECKS
hoodie.consistency.check.max_interval_ms
Max time to wait between successive attempts at performing consistency checks
Default Value: 300000 (Optional)
Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS
hoodie.datasource.write.keygenerator.type
Easily configure one of the built-in key generators, instead of specifying the key generator class. Currently supports SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE. Note: This is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead.
Default Value: SIMPLE (Optional)
Config Param: KEYGENERATOR_TYPE
hoodie.datasource.write.payload.class
Payload class used. Override this, if you'd like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective.
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: WRITE_PAYLOAD_CLASS_NAME
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD_NAME
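As a sketch of how the merge-related options above combine, the defaults amount to the snippet below; the precombine field ts is just the default placeholder and should point at an ordering field present in your data:
hoodie.datasource.write.precombine.field ts
hoodie.datasource.write.payload.class org.apache.hudi.common.model.OverwriteWithLatestAvroPayload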
hoodie.datasource.write.record.merger.impls
List of HoodieMerger implementations constituting Hudi's merging strategy, based on the engine used. These merger implementations are filtered by hoodie.datasource.write.record.merger.strategy; Hudi will pick the most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc.)
Default Value: org.apache.hudi.common.model.HoodieAvroRecordMerger (Optional)
Config Param: RECORD_MERGER_IMPLS
Since Version: 0.13.0
hoodie.datasource.write.record.merger.strategy
Id of the merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id.
Default Value: eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 (Optional)
Config Param: RECORD_MERGER_STRATEGY
Since Version: 0.13.0
hoodie.datasource.write.schema.allow.auto.evolution.column.drop
Controls whether table's schema is allowed to automatically evolve when incoming batch's schema can have any of the columns dropped. By default, Hudi will not allow this kind of (auto) schema evolution. Set this config to true to allow table's schema to be updated automatically when columns are dropped from the new incoming batch.
Default Value: false (Optional)
Config Param: SCHEMA_ALLOW_AUTO_EVOLUTION_COLUMN_DROP
Since Version: 0.13.0
hoodie.delete.shuffle.parallelism
Parallelism used for the delete operation. Delete operations also perform shuffles, similar to the upsert operation.
Default Value: 0 (Optional)
Config Param: DELETE_PARALLELISM_VALUE
hoodie.embed.timeline.server
When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics), running on each writer's driver process, accepting requests during the write from executors.
Default Value: true (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_ENABLE
hoodie.embed.timeline.server.async
Controls whether or not, the requests to the timeline server are processed in asynchronous fashion, potentially improving throughput.
Default Value: false (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_USE_ASYNC_ENABLE
hoodie.embed.timeline.server.gzip
Controls whether gzip compression is used, for large responses from the timeline server, to improve latency.
Default Value: true (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_COMPRESS_ENABLE
hoodie.embed.timeline.server.port
Port at which the timeline server listens for requests. When running embedded in each writer, it picks a free port and communicates to all the executors. This should rarely be changed.
Default Value: 0 (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_PORT_NUM
hoodie.embed.timeline.server.reuse.enabled
Controls whether the timeline server instance should be cached and reused across the JVM (across task lifecycles) to avoid startup costs. This should rarely be changed.
Default Value: false (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_REUSE_ENABLED
hoodie.embed.timeline.server.threads
Number of threads to serve requests in the timeline server. By default, auto configured based on the number of underlying cores.
Default Value: -1 (Optional)
Config Param: EMBEDDED_TIMELINE_NUM_SERVER_THREADS
hoodie.fail.on.timeline.archiving
Timeline archiving removes older instants from the timeline, after each write operation, to minimize metadata overhead. Controls whether or not, the write should be failed as well, if such archiving fails.
Default Value: true (Optional)
Config Param: FAIL_ON_TIMELINE_ARCHIVING_ENABLE
hoodie.fail.writes.on.inline.table.service.exception
Table services such as compaction and clustering can fail and prevent syncing to the metaclient. Set this to true to fail writes when table services fail
Default Value: true (Optional)
Config Param: FAIL_ON_INLINE_TABLE_SERVICE_EXCEPTION
Since Version: 0.13.0
hoodie.fileid.prefix.provider.class
File Id Prefix provider class, that implements
org.apache.hudi.fileid.FileIdPrefixProvider
Default Value: org.apache.hudi.table.RandomFileIdPrefixProvider (Optional)
Config Param: FILEID_PREFIX_PROVIDER_CLASS
Since Version: 0.10.0
hoodie.finalize.write.parallelism
Parallelism for the write finalization internal operation, which involves removing any partially written files from lake storage, before committing the write. Reduce this value, if the high number of tasks incur delays for smaller tables or low latency writes.
Default Value: 200 (Optional)
Config Param: FINALIZE_WRITE_PARALLELISM_VALUE
hoodie.insert.shuffle.parallelism
Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.
Default Value: 0 (Optional)
Config Param: INSERT_PARALLELISM_VALUE
hoodie.markers.delete.parallelism
Determines the parallelism for deleting marker files, which are used to track all files (valid or invalid/partial) written during a write operation. Increase this value if delays are observed, with large batch writes.
Default Value: 100 (Optional)
Config Param: MARKERS_DELETE_PARALLELISM_VALUE
hoodie.markers.timeline_server_based.batch.interval_ms
The batch interval in milliseconds for marker creation batch processing
Default Value: 50 (Optional)
Config Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_INTERVAL_MS
Since Version: 0.9.0
hoodie.markers.timeline_server_based.batch.num_threads
Number of threads to use for batch processing marker creation requests at the timeline server
Default Value: 20 (Optional)
Config Param: MARKERS_TIMELINE_SERVER_BASED_BATCH_NUM_THREADS
Since Version: 0.9.0
hoodie.merge.allow.duplicate.on.inserts
When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing). This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.
Default Value: false (Optional)
Config Param: MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE
hoodie.merge.data.validation.enabled
When enabled, data validation checks are performed during merges to ensure expected number of records after merge operation.
Default Value: false (Optional)
Config Param: MERGE_DATA_VALIDATION_CHECK_ENABLE
hoodie.merge.small.file.group.candidates.limit
Limits number of file groups, whose base file satisfies small-file limit, to consider for appending records during upsert operation. Only applicable to MOR tables
Default Value: 1 (Optional)
Config Param: MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT
hoodie.release.resource.on.completion.enable
Control to enable releasing all persisted RDDs when the Spark job finishes.
Default Value: true (Optional)
Config Param: RELEASE_RESOURCE_ENABLE
Since Version: 0.11.0
hoodie.rollback.parallelism
Parallelism for rollback of commits. Rollbacks perform delete of files or logging delete blocks to file groups on storage in parallel.
Default Value: 100 (Optional)
Config Param: ROLLBACK_PARALLELISM_VALUE
hoodie.rollback.using.markers
Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned on by default.
Default Value: true (Optional)
Config Param: ROLLBACK_USING_MARKERS_ENABLE
hoodie.schema.cache.enable
Cache query internalSchemas on the driver/executor side.
Default Value: false (Optional)
Config Param: ENABLE_INTERNAL_SCHEMA_CACHE
hoodie.skip.default.partition.validation
When table is upgraded from pre 0.12 to 0.12, we check for "default" partition and fail if found one. Users are expected to rewrite the data in those partitions. Enabling this config will bypass this validation
Default Value: false (Optional)
Config Param: SKIP_DEFAULT_PARTITION_VALIDATION
Since Version: 0.12.0
hoodie.table.base.file.format
Base file format to store all the base file data.
Default Value: PARQUET (Optional)
Config Param: BASE_FILE_FORMAT
hoodie.table.services.enabled
Master control to disable all table services including archive, clean, compact, cluster, etc.
Default Value: true (Optional)
Config Param: TABLE_SERVICES_ENABLED
Since Version: 0.11.0
hoodie.timeline.layout.version
Controls the layout of the timeline. Version 0 relied on renames, Version 1 (default) models the timeline as an immutable log relying only on atomic writes for object storage.
Default Value: 1 (Optional)
Config Param: TIMELINE_LAYOUT_VERSION_NUM
Since Version: 0.5.1
hoodie.upsert.shuffle.parallelism
Parallelism to use for upsert operation on the table. Upserts can shuffle data to perform index lookups, file sizing, bin packing records optimally into file groups.
Default Value: 0 (Optional)
Config Param: UPSERT_PARALLELISM_VALUE
hoodie.write.buffer.limit.bytes
Size of in-memory buffer used for parallelizing network reads and lake storage writes.
Default Value: 4194304 (Optional)
Config Param: WRITE_BUFFER_LIMIT_BYTES_VALUE
hoodie.write.concurrency.async.conflict.detector.initial_delay_ms
Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The time in milliseconds to delay the first execution of async marker-based conflict detection.
Default Value: 0 (Optional)
Config Param: ASYNC_CONFLICT_DETECTOR_INITIAL_DELAY_MS
Since Version: 0.13.0
hoodie.write.concurrency.async.conflict.detector.period_ms
Used for timeline-server-based markers with AsyncTimelineServerBasedDetectionStrategy. The period in milliseconds between successive executions of async marker-based conflict detection.
Default Value: 30000 (Optional)
Config Param: ASYNC_CONFLICT_DETECTOR_PERIOD_MS
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.check.commit.conflict
Whether to enable commit conflict checking or not during early conflict detection.
Default Value: false (Optional)
Config Param: EARLY_CONFLICT_DETECTION_CHECK_COMMIT_CONFLICT
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.detection.enable
Whether to enable early conflict detection based on markers. It eagerly detects writing conflicts before creating markers and fails fast if a conflict is detected, to release cluster compute resources as soon as possible.
Default Value: false (Optional)
Config Param: EARLY_CONFLICT_DETECTION_ENABLE
Since Version: 0.13.0
hoodie.write.concurrency.early.conflict.detection.strategy
The class name of the early conflict detection strategy to use. This should be a subclass of org.apache.hudi.common.conflict.detection.EarlyConflictDetectionStrategy.
Default Value: (Optional)
Config Param: EARLY_CONFLICT_DETECTION_STRATEGY_CLASS_NAME
Since Version: 0.13.0
hoodie.write.concurrency.mode
Enable different concurrency modes. Options are: SINGLE_WRITER: Only one active writer to the table; maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table and exactly one of them succeeds if a conflict (writes affect the same file group) is detected.
Default Value: SINGLE_WRITER (Optional)
Config Param: WRITE_CONCURRENCY_MODE
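For example, a multi-writer setup based on the descriptions above would switch the concurrency mode and let the cleaner roll back failed writes lazily, as noted under hoodie.cleaner.policy.failed.writes. Note that optimistic concurrency also requires a lock provider, configured via the lock configs, which are not part of this section:
hoodie.write.concurrency.mode OPTIMISTIC_CONCURRENCY_CONTROL
hoodie.cleaner.policy.failed.writes LAZY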
hoodie.write.executor.disruptor.buffer.limit.bytes
The size of the Disruptor Executor ring buffer; must be a power of 2.
Default Value: 1024 (Optional)
Config Param: WRITE_EXECUTOR_DISRUPTOR_BUFFER_LIMIT_BYTES
Since Version: 0.13.0
hoodie.write.executor.disruptor.wait.strategy
Strategy employed for making the Disruptor Executor wait on a cursor. Other options are: SLEEPING_WAIT, which attempts to be conservative with CPU usage by using a simple busy wait loop; YIELDING_WAIT, which is designed for cases where there is the option to burn CPU cycles with the goal of improving latency; BUSY_SPIN_WAIT, which can be used in low-latency systems, but puts the highest constraints on the deployment environment.
Default Value: BLOCKING_WAIT (Optional)
Config Param: WRITE_EXECUTOR_DISRUPTOR_WAIT_STRATEGY
Since Version: 0.13.0
hoodie.write.executor.type
Set the executor which orchestrates concurrent producers and consumers communicating through a message queue. BOUNDED_IN_MEMORY: Uses a LinkedBlockingQueue as a bounded in-memory queue; this queue uses an extra lock to balance producers and consumers. DISRUPTOR: Uses a disruptor, a lock-free message queue, as the inner message queue; this may gain better write performance if the lock was the bottleneck. SIMPLE: Executor with no inner message queue and no inner lock, consuming and writing records from the iterator directly. Compared with BOUNDED_IN_MEMORY and DISRUPTOR, this executor has no need for additional memory and CPU resources due to locking or multithreading, but also loses some benefits such as speed limiting. Note that DISRUPTOR and SIMPLE are still experimental.
Default Value: SIMPLE (Optional)
Config Param: WRITE_EXECUTOR_TYPE
Since Version: 0.13.0
hoodie.write.markers.type
Marker type to use. Two modes are supported: - DIRECT: individual marker file corresponding to each data file is directly created by the writer. - TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service which serves as a proxy; new marker entries are batch processed and stored in a limited number of underlying files for efficiency. If HDFS is used or the timeline server is disabled, DIRECT markers are used as a fallback even if this is configured. For Spark structured streaming, this configuration does not take effect, i.e., DIRECT markers are always used for Spark structured streaming.
Default Value: TIMELINE_SERVER_BASED (Optional)
Config Param: MARKERS_TYPE
Since Version: 0.9.0
hoodie.write.status.storage.level
Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. This controls how that information is cached for inspection by clients. We rarely expect this to be changed.
Default Value: MEMORY_AND_DISK_SER (Optional)
Config Param: WRITE_STATUS_STORAGE_LEVEL_VALUE
hoodie.writestatus.class
Subclass of org.apache.hudi.client.WriteStatus to be used to collect information about a write. Can be overridden to collect additional metrics/statistics about the data if needed.
Default Value: org.apache.hudi.client.WriteStatus (Optional)
Config Param: WRITE_STATUS_CLASS_NAME
Metastore Configs
Configurations used by the Hudi Metastore.
Config Class
: org.apache.hudi.common.config.HoodieMetaserverConfig
hoodie.database.name
Database name that will be used for incremental query. If different databases have the same table name during an incremental query, we can set it to limit the table name to a specific database.
Default Value: N/A
(Required)
Config Param: DATABASE_NAME
Since Version: 0.13.0
hoodie.table.name
Table name that will be used for registering with Hive. Needs to be same across runs.
Default Value: N/A
(Required)
Config Param: TABLE_NAME
Since Version: 0.13.0
hoodie.metaserver.connect.retries
Number of retries while opening a connection to metastore
Default Value: 3 (Optional)
Config Param: METASERVER_CONNECTION_RETRIES
Since Version: 0.13.0
hoodie.metaserver.connect.retry.delay
Number of seconds for the client to wait between consecutive connection attempts
Default Value: 1 (Optional)
Config Param: METASERVER_CONNECTION_RETRY_DELAY
Since Version: 0.13.0
hoodie.metaserver.enabled
Enable Hudi metaserver for storing Hudi tables' metadata.
Default Value: false (Optional)
Config Param: METASERVER_ENABLE
Since Version: 0.13.0
hoodie.metaserver.uris
Metastore server uris
Default Value: thrift://localhost:9090 (Optional)
Config Param: METASERVER_URLS
Since Version: 0.13.0
Key Generator Options
Hudi maintains keys (record key + partition path) for uniquely identifying a particular record. This config allows developers to set up the key generator class that will extract these out of incoming records.
Config Class
: org.apache.hudi.keygen.constant.KeyGeneratorOptions
hoodie.datasource.write.partitionpath.field
Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
Default Value: N/A
(Required)
Config Param: PARTITIONPATH_FIELD_NAME
hoodie.datasource.write.recordkey.field
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. a.b.c
Default Value: N/A
(Required)
Config Param: RECORDKEY_FIELD_NAME
hoodie.datasource.write.hive_style_partitioning
Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING_ENABLE
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled
When set to true, a consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break pipelines that deploy either the fully row-writer path or the non row-writer path. For example, if it is kept disabled, then a record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in the row-writer path, while it will be written as long value 1483023240000000 in the non row-writer path. If enabled, the timestamp value will be written in both cases.
Default Value: false (Optional)
Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED
hoodie.datasource.write.partitionpath.urlencode
Should we url encode the partition path value, before creating the folder structure.
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
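Putting the key generator options together, a typical write might configure the record key, partition path and Hive-style partitioning as below; the field names id and created_date are hypothetical placeholders for columns in your dataset:
hoodie.datasource.write.recordkey.field id
hoodie.datasource.write.partitionpath.field created_date
hoodie.datasource.write.hive_style_partitioning true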
Storage Configs
Configurations that control aspects around writing, sizing, reading base and log files.
Config Class
: org.apache.hudi.common.config.HoodieStorageConfig
hoodie.logfile.data.block.format
Format of the data block within delta logs. Following formats are currently supported "avro", "hfile", "parquet"
Default Value: N/A
(Required)
Config Param: LOGFILE_DATA_BLOCK_FORMAT
hoodie.hfile.block.size
Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially faster lookup times.
Default Value: 1048576 (Optional)
Config Param: HFILE_BLOCK_SIZE
hoodie.hfile.compression.algorithm
Compression codec to use for hfile base files.
Default Value: GZ (Optional)
Config Param: HFILE_COMPRESSION_ALGORITHM_NAME
hoodie.hfile.max.file.size
Target file size in bytes for HFile base files.
Default Value: 125829120 (Optional)
Config Param: HFILE_MAX_FILE_SIZE
hoodie.logfile.data.block.max.size
LogFile Data block max size in bytes. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent OOM errors. This size should be smaller than the JVM memory.
Default Value: 268435456 (Optional)
Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE
hoodie.logfile.max.size
LogFile max size in bytes. This is the maximum size allowed for a log file before it is rolled over to the next version.
Default Value: 1073741824 (Optional)
Config Param: LOGFILE_MAX_SIZE
hoodie.logfile.to.parquet.compression.ratio
Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.
Default Value: 0.35 (Optional)
Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.orc.block.size
ORC block size, recommended to be aligned with the target file size.
Default Value: 125829120 (Optional)
Config Param: ORC_BLOCK_SIZE
hoodie.orc.compression.codec
Compression codec to use for ORC base files.
Default Value: ZLIB (Optional)
Config Param: ORC_COMPRESSION_CODEC_NAME
hoodie.orc.max.file.size
Target file size in bytes for ORC base files.
Default Value: 125829120 (Optional)
Config Param: ORC_FILE_MAX_SIZE
hoodie.orc.stripe.size
Size of the memory buffer in bytes used for writing ORC stripes.
Default Value: 67108864 (Optional)
Config Param: ORC_STRIPE_SIZE
hoodie.parquet.block.size
Parquet RowGroup size in bytes. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Default Value: 125829120 (Optional)
Config Param: PARQUET_BLOCK_SIZE
hoodie.parquet.compression.codec
Compression Codec for parquet files
Default Value: gzip (Optional)
Config Param: PARQUET_COMPRESSION_CODEC_NAME
hoodie.parquet.compression.ratio
Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files
Default Value: 0.1 (Optional)
Config Param: PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.parquet.dictionary.enabled
Whether to use dictionary encoding
Default Value: true (Optional)
Config Param: PARQUET_DICTIONARY_ENABLED
hoodie.parquet.field_id.write.enabled
Would only be effective with Spark 3.3+. Sets spark.sql.parquet.fieldId.write.enabled. If enabled, Spark will write out parquet native field ids that are stored inside StructField's metadata as parquet.field.id to parquet files.
Default Value: true (Optional)
Config Param: PARQUET_FIELD_ID_WRITE_ENABLED
Since Version: 0.12.0
hoodie.parquet.max.file.size
Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 125829120 (Optional)
Config Param: PARQUET_MAX_FILE_SIZE
hoodie.parquet.outputtimestamptype
Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.
Default Value: TIMESTAMP_MICROS (Optional)
Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE
hoodie.parquet.page.size
Parquet page size in bytes. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Default Value: 1048576 (Optional)
Config Param: PARQUET_PAGE_SIZE
hoodie.parquet.writelegacyformat.enabled
Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet's fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format.
Default Value: false (Optional)
Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED
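As an illustrative sketch (values are placeholders, not recommendations), parquet base file sizing and compression are usually tuned together in hudi-defaults.conf:
hoodie.parquet.max.file.size 125829120
hoodie.parquet.block.size 125829120
hoodie.parquet.compression.codec snappy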
Compaction Configs
Configurations that control compaction (merging of log files onto new base files).
Config Class
: org.apache.hudi.config.HoodieCompactionConfig
hoodie.compact.inline
When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.
Default Value: false (Optional)
Config Param: INLINE_COMPACT
hoodie.compact.inline.max.delta.commits
Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This config takes effect only for the compaction triggering strategy based on the number of commits, i.e., NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, NUM_AND_TIME, and NUM_OR_TIME.
Default Value: 5 (Optional)
Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS
hoodie.compact.inline.max.delta.seconds
Number of elapsed seconds after the last compaction, before scheduling a new one. This config takes effect only for the compaction triggering strategy based on the elapsed time, i.e., TIME_ELAPSED, NUM_AND_TIME, and NUM_OR_TIME.
Default Value: 3600 (Optional)
Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS
hoodie.compact.inline.trigger.strategy
Controls how compaction scheduling is triggered, by time or num delta commits or combination of both. Valid options: NUM_COMMITS, NUM_COMMITS_AFTER_LAST_REQUEST, TIME_ELAPSED, NUM_AND_TIME, NUM_OR_TIME
Default Value: NUM_COMMITS (Optional)
Config Param: INLINE_COMPACT_TRIGGER_STRATEGY
hoodie.compact.schedule.inline
When set to true, compaction service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async compaction (execution) for the one scheduled by this writer. Users can choose to set both hoodie.compact.inline and hoodie.compact.schedule.inline to false and have both scheduling and execution triggered by any async process. But if hoodie.compact.inline is set to false, and hoodie.compact.schedule.inline is set to true, regular writers will schedule compaction inline, but users are expected to trigger an async job for execution. If hoodie.compact.inline is set to true, regular writers will do both scheduling and execution inline for compaction.
Default Value: false (Optional)
Config Param: SCHEDULE_INLINE_COMPACT
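For example, the "schedule inline, execute async" combination described above would look like the snippet below; the delta commit threshold is a placeholder, and a separate async job is still expected to execute the scheduled compactions:
hoodie.compact.inline false
hoodie.compact.schedule.inline true
hoodie.compact.inline.max.delta.commits 5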
hoodie.compaction.daybased.target.partitions
Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.
Default Value: 10 (Optional)
Config Param: TARGET_PARTITIONS_PER_DAYBASED_COMPACTION
hoodie.compaction.lazy.block.read
When merging the delta log files, this config helps to choose whether the log blocks should be read lazily or not. Choose true to use lazy block reading (low memory usage, but incurs seeks to each block header) or false for immediate block read (higher memory usage)
Default Value: true (Optional)
Config Param: COMPACTION_LAZY_BLOCK_READ_ENABLE
hoodie.compaction.logfile.num.threshold
Only if the log file num is greater than the threshold, the file group will be compacted.
Default Value: 0 (Optional)
Config Param: COMPACTION_LOG_FILE_NUM_THRESHOLD
Since Version: 0.13.0
hoodie.compaction.logfile.size.threshold
Only if the log file size is greater than the threshold in bytes, the file group will be compacted.
Default Value: 0 (Optional)
Config Param: COMPACTION_LOG_FILE_SIZE_THRESHOLD
hoodie.compaction.preserve.commit.metadata
When rewriting data, preserves existing hoodie_commit_time
Default Value: true (Optional)
Config Param: PRESERVE_COMMIT_METADATA
Since Version: 0.11.0
hoodie.compaction.reverse.log.read
HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the reader reads the logfile in reverse direction, from pos=file_length to pos=0
Default Value: false (Optional)
Config Param: COMPACTION_REVERSE_LOG_READ_ENABLE
hoodie.compaction.strategy
Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default, Hudi picks the log file with the most accumulated unmerged data.
Default Value: org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy (Optional)
Config Param: COMPACTION_STRATEGY
hoodie.compaction.target.io
Amount of MBs to spend during a compaction run for the LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency while compaction is run in inline mode.
Default Value: 512000 (Optional)
Config Param: TARGET_IO_PER_COMPACTION_IN_MB
hoodie.copyonwrite.insert.auto.split
Config to control whether insert split sizes are determined automatically based on average record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome.
Default Value: true (Optional)
Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS
hoodie.copyonwrite.insert.split.size
Number of inserts assigned for each partition/bucket for writing. We based the default on writing out 100MB files, with at least 1kb records (100K records per file), and over provision to 500K. As long as auto-tuning of splits is turned on, this only affects the first write, where there is no history to learn record sizes from.
Default Value: 500000 (Optional)
Config Param: COPY_ON_WRITE_INSERT_SPLIT_SIZE
hoodie.copyonwrite.record.size.estimate
The average record size. If not explicitly specified, Hudi will compute the record size estimate dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files.
Default Value: 1024 (Optional)
Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE
hoodie.log.compaction.blocks.threshold
Log compaction can be scheduled if the no. of log blocks crosses this threshold value. This is effective only when log compaction is enabled via hoodie.log.compaction.inline
Default Value: 5 (Optional)
Config Param: LOG_COMPACTION_BLOCKS_THRESHOLD
Since Version: 0.13.0
hoodie.log.compaction.inline
When set to true, the log compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.
Default Value: false (Optional)
Config Param: INLINE_LOG_COMPACT
Since Version: 0.13.0
hoodie.optimized.log.blocks.scan.enable
New optimized scan for log blocks that handles all multi-writer use-cases while appending to log files. It also differentiates original blocks written by ingestion writers and compacted blocks written by log compaction.
Default Value: false (Optional)
Config Param: ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN
Since Version: 0.13.0
hoodie.parquet.small.file.limit
During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep the number of files at an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file. Also note that if this is set <= 0, Hudi will not try to get small files and will directly write new files.
Default Value: 104857600 (Optional)
Config Param: PARQUET_SMALL_FILE_LIMIT
hoodie.record.size.estimation.threshold
We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)
Default Value: 1.0 (Optional)
Config Param: RECORD_SIZE_ESTIMATION_THRESHOLD
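As a sketch of how the small-file handling configs above interact (values restate the defaults and are placeholders), files below the small file limit are targets for expansion, and the record size estimate seeds bin-packing until commit metadata is available to learn from:
hoodie.parquet.small.file.limit 104857600
hoodie.copyonwrite.record.size.estimate 1024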
File System View Storage Configurations
Configurations that control how file metadata is stored by Hudi, for transaction processing and queries.
Config Class
: org.apache.hudi.common.table.view.FileSystemViewStorageConfig
hoodie.filesystem.remote.backup.view.enable
Config to control whether backup needs to be configured if clients were not able to reach timeline service.
Default Value: true (Optional)
Config Param: REMOTE_BACKUP_VIEW_ENABLE
hoodie.filesystem.view.incr.timeline.sync.enable
Controls whether or not, the file system view is incrementally updated as new actions are performed on the timeline.
Default Value: false (Optional)
Config Param: INCREMENTAL_TIMELINE_SYNC_ENABLE
hoodie.filesystem.view.remote.host
We expect this to be rarely hand configured.
Default Value: localhost (Optional)
Config Param: REMOTE_HOST_NAME
hoodie.filesystem.view.remote.port
Port to serve file system view queries, when remote. We expect this to be rarely hand configured.
Default Value: 26754 (Optional)
Config Param: REMOTE_PORT_NUM
hoodie.filesystem.view.remote.retry.enable
Whether to enable API request retry for remote file system view.
Default Value: false (Optional)
Config Param: REMOTE_RETRY_ENABLE
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.exceptions
The class names of the exceptions that need to be retried, separated by commas. Default is empty, which means retrying all IOExceptions and RuntimeExceptions from the Remote Request.
Default Value: (Optional)
Config Param: RETRY_EXCEPTIONS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.initial_interval_ms
Amount of time (in ms) to wait, before retrying operations on storage.
Default Value: 100 (Optional)
Config Param: REMOTE_INITIAL_RETRY_INTERVAL_MS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.max_interval_ms
Maximum amount of time (in ms), to wait for next retry.
Default Value: 2000 (Optional)
Config Param: REMOTE_MAX_RETRY_INTERVAL_MS
Since Version: 0.12.1
hoodie.filesystem.view.remote.retry.max_numbers
Maximum number of retries for API requests against a remote file system view, e.g. the timeline server.
Default Value: 3 (Optional)
Config Param: REMOTE_MAX_RETRY_NUMBERS
Since Version: 0.12.1
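A minimal sketch of enabling retries against the remote timeline server using the retry configs above (the interval and count values shown are just the defaults):
hoodie.filesystem.view.remote.retry.enable true
hoodie.filesystem.view.remote.retry.max_numbers 3
hoodie.filesystem.view.remote.retry.initial_interval_ms 100
hoodie.filesystem.view.remote.retry.max_interval_ms 2000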
hoodie.filesystem.view.remote.timeout.secs
Timeout in seconds to wait for API requests against a remote file system view, e.g. the timeline server.
Default Value: 300 (Optional)
Config Param: REMOTE_TIMEOUT_SECS
hoodie.filesystem.view.rocksdb.base.path
Path on local storage to use, when storing file system view in embedded kv store/rocksdb.
Default Value: /tmp/hoodie_timeline_rocksdb (Optional)
Config Param: ROCKSDB_BASE_PATH
hoodie.filesystem.view.secondary.type
Specifies the secondary form of storage for the file system view, if the primary (e.g. timeline server) is unavailable.
Default Value: MEMORY (Optional)
Config Param: SECONDARY_VIEW_TYPE
hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction
Fraction of the file system view memory, to be used for holding mapping to bootstrap base files.
Default Value: 0.05 (Optional)
Config Param: BOOTSTRAP_BASE_FILE_MEM_FRACTION
hoodie.filesystem.view.spillable.clustering.mem.fraction
Fraction of the file system view memory, to be used for holding clustering related metadata.
Default Value: 0.01 (Optional)
Config Param: SPILLABLE_CLUSTERING_MEM_FRACTION
hoodie.filesystem.view.spillable.compaction.mem.fraction
Fraction of the file system view memory, to be used for holding compaction related metadata.
Default Value: 0.8 (Optional)
Config Param: SPILLABLE_COMPACTION_MEM_FRACTION
hoodie.filesystem.view.spillable.dir
Path on local storage to use, when file system view is held in a spillable map.
Default Value: /tmp/ (Optional)
Config Param: SPILLABLE_DIR
hoodie.filesystem.view.spillable.log.compaction.mem.fraction
Fraction of the file system view memory, to be used for holding log compaction related metadata.
Default Value: 0.8 (Optional)
Config Param: SPILLABLE_LOG_COMPACTION_MEM_FRACTION
Since Version: 0.13.0
hoodie.filesystem.view.spillable.mem
Amount of memory to be used in bytes for holding file system view, before spilling to disk.
Default Value: 104857600 (Optional)
Config Param: SPILLABLE_MEMORY
hoodie.filesystem.view.spillable.replaced.mem.fraction
Fraction of the file system view memory, to be used for holding replace commit related metadata.
Default Value: 0.01 (Optional)
Config Param: SPILLABLE_REPLACED_MEM_FRACTION
hoodie.filesystem.view.type
File system view provides APIs for viewing the files on the underlying lake storage, as file groups and file slices. This config controls how such a view is held. Options include MEMORY, SPILLABLE_DISK, EMBEDDED_KV_STORE, REMOTE_ONLY, REMOTE_FIRST, which provide different trade-offs for memory usage and API request performance.
Default Value: MEMORY (Optional)
Config Param: VIEW_TYPE
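For instance, a memory-constrained writer could hold the view in a spillable map using the configs above (the memory budget and spill directory shown are the defaults):
hoodie.filesystem.view.type SPILLABLE_DISK
hoodie.filesystem.view.spillable.mem 104857600
hoodie.filesystem.view.spillable.dir /tmp/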
Clustering Configs
Configurations that control the clustering table service in hudi, which optimizes the storage layout for better query performance by sorting and sizing data files.
Config Class
: org.apache.hudi.config.HoodieClusteringConfig
hoodie.clustering.plan.strategy.cluster.begin.partition
Begin partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONS
Default Value: N/A
(Required)
Config Param: PARTITION_FILTER_BEGIN_PARTITION
Since Version: 0.11.0
hoodie.clustering.plan.strategy.cluster.end.partition
End partition used to filter partition (inclusive), only effective when the filter mode 'hoodie.clustering.plan.partition.filter.mode' is SELECTED_PARTITIONS
Default Value: N/A
(Required)
Config Param: PARTITION_FILTER_END_PARTITION
Since Version: 0.11.0
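As a sketch, restricting a clustering plan to an explicit partition range could look like the following; the partition values are illustrative and must match the table's partition path format:
hoodie.clustering.plan.partition.filter.mode SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition 2023/01/01
hoodie.clustering.plan.strategy.cluster.end.partition 2023/01/31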
hoodie.clustering.plan.strategy.partition.regex.pattern
Filter clustering partitions that matched regex pattern
Default Value: N/A
(Required)
Config Param: PARTITION_REGEX_PATTERN
Since Version: 0.11.0
hoodie.clustering.plan.strategy.partition.selected
Partitions to run clustering
Default Value: N/A
(Required)
Config Param: PARTITION_SELECTED
Since Version: 0.11.0
hoodie.clustering.plan.strategy.sort.columns
Columns to sort the data by when clustering
Default Value: N/A
(Required)
Config Param: PLAN_STRATEGY_SORT_COLUMNS
Since Version: 0.7.0
hoodie.clustering.async.enabled
Enable running of clustering service, asynchronously as inserts happen on the table.
Default Value: false (Optional)
Config Param: ASYNC_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.clustering.async.max.commits
Config to control frequency of async clustering
Default Value: 4 (Optional)
Config Param: ASYNC_CLUSTERING_MAX_COMMITS
Since Version: 0.9.0
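For example, to run clustering asynchronously as inserts happen, planning every 4 commits (4 being the default frequency):
hoodie.clustering.async.enabled true
hoodie.clustering.async.max.commits 4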
hoodie.clustering.execution.strategy.class
Config to provide a strategy class (subclass of RunClusteringStrategy) to define how the clustering plan is executed. By default, we sort the file groups in the plan by the specified columns, while meeting the configured target file sizes.
Default Value: org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy (Optional)
Config Param: EXECUTION_STRATEGY_CLASS_NAME
Since Version: 0.7.0
hoodie.clustering.inline
Turn on inline clustering - clustering will be run after each write operation is complete
Default Value: false (Optional)
Config Param: INLINE_CLUSTERING
Since Version: 0.7.0
hoodie.clustering.inline.max.commits
Config to control frequency of clustering planning
Default Value: 4 (Optional)
Config Param: INLINE_CLUSTERING_MAX_COMMITS
Since Version: 0.7.0
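Correspondingly, inline clustering after every 4th commit would be configured as:
hoodie.clustering.inline true
hoodie.clustering.inline.max.commits 4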
hoodie.clustering.plan.partition.filter.mode
Partition filter mode used in the creation of the clustering plan. Available values are - NONE: do not filter table partitions, and thus the clustering plan will include all partitions that have clustering candidates. RECENT_DAYS: keep a continuous range of partitions, working together with the configs 'hoodie.clustering.plan.strategy.daybased.lookback.partitions' and 'hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions'. SELECTED_PARTITIONS: keep partitions that are in the specified range ['hoodie.clustering.plan.strategy.cluster.begin.partition', 'hoodie.clustering.plan.strategy.cluster.end.partition']. DAY_ROLLING: cluster partitions on a rolling basis by the hour, to avoid clustering all partitions each time; this strategy sorts the partitions in ascending order and chooses the partitions whose index modulo 24 equals the current hour.
Default Value: NONE (Optional)
Config Param: PLAN_PARTITION_FILTER_MODE_NAME
Since Version: 0.11.0
hoodie.clustering.plan.strategy.class
Config to provide a strategy class (subclass of ClusteringPlanStrategy) to create clustering plan i.e select what file groups are being clustered. Default strategy, looks at the clustering small file size limit (determined by hoodie.clustering.plan.strategy.small.file.limit) to pick the small file slices within partitions for clustering.
Default Value: org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy (Optional)
Config Param: PLAN_STRATEGY_CLASS_NAME
Since Version: 0.7.0
hoodie.clustering.plan.strategy.daybased.lookback.partitions
Number of partitions to list to create ClusteringPlan
Default Value: 2 (Optional)
Config Param: DAYBASED_LOOKBACK_PARTITIONS
Since Version: 0.7.0
hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions
Number of partitions to skip from latest when choosing partitions to create ClusteringPlan
Default Value: 0 (Optional)
Config Param: PLAN_STRATEGY_SKIP_PARTITIONS_FROM_LATEST
Since Version: 0.9.0
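Combining the two configs above with the RECENT_DAYS filter mode, a plan that covers the 2 most recent partitions while skipping the latest one might look like this (the skip value of 1 is illustrative):
hoodie.clustering.plan.partition.filter.mode RECENT_DAYS
hoodie.clustering.plan.strategy.daybased.lookback.partitions 2
hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions 1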
hoodie.clustering.plan.strategy.max.bytes.per.group
Each clustering operation can create multiple output file groups. The total amount of data processed by a clustering operation is bounded by the product of two properties (CLUSTERING_MAX_BYTES_PER_GROUP * CLUSTERING_MAX_NUM_GROUPS). This config sets the max amount of data to be included in one group.
Default Value: 2147483648 (Optional)
Config Param: PLAN_STRATEGY_MAX_BYTES_PER_OUTPUT_FILEGROUP
Since Version: 0.7.0
hoodie.clustering.plan.strategy.max.num.groups
Maximum number of groups to create as part of ClusteringPlan. Increasing groups will increase parallelism
Default Value: 30 (Optional)
Config Param: PLAN_STRATEGY_MAX_GROUPS
Since Version: 0.7.0
hoodie.clustering.plan.strategy.small.file.limit
Files smaller than the size in bytes specified here are candidates for clustering
Default Value: 314572800 (Optional)
Config Param: PLAN_STRATEGY_SMALL_FILE_LIMIT
Since Version: 0.7.0
hoodie.clustering.plan.strategy.target.file.max.bytes
Each group can produce 'N' (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups
Default Value: 1073741824 (Optional)
Config Param: PLAN_STRATEGY_TARGET_FILE_MAX_BYTES
Since Version: 0.7.0
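Putting the sizing knobs together, the defaults mean: files under ~300MB are clustering candidates, each output file targets ~1GB, each group processes at most ~2GB, and at most 30 groups are planned, i.e. roughly 60GB per clustering operation. Spelled out:
hoodie.clustering.plan.strategy.small.file.limit 314572800
hoodie.clustering.plan.strategy.target.file.max.bytes 1073741824
hoodie.clustering.plan.strategy.max.bytes.per.group 2147483648
hoodie.clustering.plan.strategy.max.num.groups 30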
hoodie.clustering.preserve.commit.metadata
When rewriting data, preserves existing hoodie_commit_time
Default Value: true (Optional)
Config Param: PRESERVE_COMMIT_METADATA
Since Version: 0.9.0
hoodie.clustering.rollback.pending.replacecommit.on.conflict
If updates are allowed to file groups pending clustering, then set this config to roll back failed or pending clustering instants. Pending clustering will be rolled back ONLY IF there is a conflict between the incoming upsert and a file group to be clustered. Please exercise caution while setting this config, especially when clustering is done very frequently. This could lead to a race condition in rare scenarios, for example, when the clustering completes after instants are fetched but before the rollback completes.
Default Value: false (Optional)
Config Param: ROLLBACK_PENDING_CLUSTERING_ON_CONFLICT
Since Version: 0.10.0
hoodie.clustering.schedule.inline
When set to true, the clustering service will be attempted for inline scheduling after each write. Users have to ensure they have a separate job to run async clustering (execution) for the plan scheduled by this writer. Users can choose to set both hoodie.clustering.inline and hoodie.clustering.schedule.inline to false and have both scheduling and execution triggered by any async process, in which case hoodie.clustering.async.enabled is expected to be set to true. But if hoodie.clustering.inline is set to false and hoodie.clustering.schedule.inline is set to true, regular writers will schedule clustering inline, but users are expected to trigger an async job for execution. If hoodie.clustering.inline is set to true, regular writers will do both scheduling and execution inline for clustering
Default Value: false (Optional)
Config Param: SCHEDULE_INLINE_CLUSTERING
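A minimal sketch of the split-responsibility setup described above, where regular writers only schedule the plan and a separate async job is expected to execute it:
hoodie.clustering.inline false
hoodie.clustering.schedule.inline true
hoodie.clustering.async.enabled false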
hoodie.clustering.updates.strategy
Determines how to handle updates, deletes to file groups that are under clustering. Default strategy just rejects the update
Default Value: org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy (Optional)
Config Param: UPDATES_STRATEGY
Since Version: 0.7.0
hoodie.layout.optimize.build.curve.sample.size
Determines target sample size used by the Boundary-based Interleaved Index method of building space-filling curve. Larger sample size entails better layout optimization outcomes, at the expense of higher memory footprint.
Default Value: 200000 (Optional)
Config Param: LAYOUT_OPTIMIZE_BUILD_CURVE_SAMPLE_SIZE
Since Version: 0.10.0
hoodie.layout.optimize.curve.build.method
Controls how data is sampled to build the space-filling curves. Two methods: "direct", "sample". The direct method is faster than sampling; however, the sample method produces a better data layout.
Default Value: direct (Optional)
Config Param: LAYOUT_OPTIMIZE_SPATIAL_CURVE_BUILD_METHOD
Since Version: 0.10.0
hoodie.layout.optimize.data.skipping.enable
Enable data skipping by collecting statistics once layout optimization is complete.
Default Value: true (Optional)
Config Param: LAYOUT_OPTIMIZE_DATA_SKIPPING_ENABLE
Since Version: 0.10.0
Deprecated Version: 0.11.0
hoodie.layout.optimize.enable
This setting has no effect. Please refer to clustering configuration, as well as LAYOUT_OPTIMIZE_STRATEGY config to enable advanced record layout optimization strategies
Default Value: false (Optional)
Config Param: LAYOUT_OPTIMIZE_ENABLE
Since Version: 0.10.0
Deprecated Version: 0.11.0
hoodie.layout.optimize.strategy
Determines the ordering strategy used in records layout optimization. Currently, "linear", "z-order" and "hilbert" values are supported.
Default Value: linear (Optional)
Config Param: LAYOUT_OPTIMIZE_STRATEGY
Since Version: 0.10.0
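For example, combining inline clustering with a z-order layout over two columns (the column names here are illustrative placeholders):
hoodie.clustering.inline true
hoodie.clustering.plan.strategy.sort.columns city,trip_date
hoodie.layout.optimize.strategy z-order
hoodie.layout.optimize.curve.build.method sample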
Common Configurations
The following set of configurations are common across Hudi.
Config Class
: org.apache.hudi.common.config.HoodieCommonConfig
as.of.instant
The query instant for time travel. Without specifying this option, we query the latest snapshot.
Default Value: N/A
(Required)
Config Param: TIMESTAMP_AS_OF
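A sketch of a time-travel read, assuming an instant in Hudi's commit-timestamp format (the timestamp shown is illustrative):
as.of.instant 20230101120000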
hoodie.common.diskmap.compression.enabled
Turn on compression for BITCASK disk map used by the External Spillable Map
Default Value: true (Optional)
Config Param: DISK_MAP_BITCASK_COMPRESSION_ENABLED
hoodie.common.spillable.diskmap.type
When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. By default, we use a persistent hashmap based loosely on bitcask, that offers O(1) inserts and lookups. Change this to ROCKS_DB to prefer using RocksDB for handling the spill.
Default Value: BITCASK (Optional)
Config Param: SPILLABLE_DISK_MAP_TYPE
hoodie.datasource.write.reconcile.schema
This config controls how the writer's schema will be selected, based on the incoming batch's schema as well as the existing table's schema. When schema reconciliation is DISABLED, the incoming batch's schema will be picked as the writer-schema (therefore updating the table's schema). When schema reconciliation is ENABLED, the writer-schema will be picked such that the table's schema (after the txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same. This enables us to always extend the table's schema during evolution and never lose data (when, for example, an existing column is being dropped in a new batch)
Default Value: false (Optional)
Config Param: RECONCILE_SCHEMA
hoodie.schema.on.read.enable
Enables support for Schema Evolution feature
Default Value: false (Optional)
Config Param: SCHEMA_EVOLUTION_ENABLE
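To opt into schema evolution on read while also keeping the table schema from losing columns on write, the two configs above could be combined as:
hoodie.schema.on.read.enable true
hoodie.datasource.write.reconcile.schema true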
Bootstrap Configs
Configurations that control how you want to bootstrap your existing tables for the first time into hudi. The bootstrap operation can flexibly avoid copying data over before you can use Hudi, and supports running the existing writers and new Hudi writers in parallel, to validate the migration.
Config Class
: org.apache.hudi.config.HoodieBootstrapConfig
hoodie.bootstrap.base.path
Base path of the dataset that needs to be bootstrapped as a Hudi table
Default Value: N/A
(Required)
Config Param: BASE_PATH
Since Version: 0.6.0
hoodie.bootstrap.keygen.class
Key generator implementation to be used for generating keys from the bootstrapped dataset
Default Value: N/A
(Required)
Config Param: KEYGEN_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.full.input.provider
Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD
Default Value: org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider (Optional)
Config Param: FULL_BOOTSTRAP_INPUT_PROVIDER_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.index.class
Implementation to use, for mapping a skeleton base file to a bootstrap base file.
Default Value: org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex (Optional)
Config Param: INDEX_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.keygen.type
Type of built-in key generator; currently supports SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE
Default Value: SIMPLE (Optional)
Config Param: KEYGEN_TYPE
Since Version: 0.9.0
hoodie.bootstrap.mode.selector
Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped
Default Value: org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector (Optional)
Config Param: MODE_SELECTOR_CLASS_NAME
Since Version: 0.6.0
hoodie.bootstrap.mode.selector.regex
Matches each bootstrap dataset partition against this regex and applies the mode below to it.
Default Value: .* (Optional)
Config Param: PARTITION_SELECTOR_REGEX_PATTERN
Since Version: 0.6.0
hoodie.bootstrap.mode.selector.regex.mode
Bootstrap mode to apply for partition paths, that match regex above. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding full cost of rewriting the dataset. FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
Default Value: METADATA_ONLY (Optional)
Config Param: PARTITION_SELECTOR_REGEX_MODE
Since Version: 0.6.0
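As a sketch, a mixed bootstrap that fully rewrites only partitions matching a regex might look like the following; the base path and regex are illustrative, and the selector class name assumes the regex-based selector shipped alongside the default MetadataOnlyBootstrapModeSelector:
hoodie.bootstrap.base.path s3://my-bucket/legacy_table
hoodie.bootstrap.mode.selector org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
hoodie.bootstrap.mode.selector.regex 2023/.*
hoodie.bootstrap.mode.selector.regex.mode FULL_RECORD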
hoodie.bootstrap.parallelism
Parallelism value to be used to bootstrap data into hudi
Default Value: 1500 (Optional)
Config Param: PARALLELISM_VALUE
Since Version: 0.6.0
hoodie.bootstrap.partitionpath.translator.class
Translates the partition paths from the bootstrapped data into how they are laid out as a Hudi table.
Default Value: org.apache.hudi.client.bootstrap.translator.IdentityBootstrapPartitionPathTranslator (Optional)
Config Param: PARTITION_PATH_TRANSLATOR_CLASS_NAME
Since Version: 0.6.0
Commit Callback Configs
Configurations controlling callback behavior into HTTP endpoints, to push notifications on commits on hudi tables.
Write commit callback configs
Config Class
: org.apache.hudi.config.HoodieWriteCommitCallbackConfig
hoodie.write.commit.callback.http.url
Callback URL to which commit callback messages are sent.
Default Value: N/A
(Required)
Config Param: CALLBACK_HTTP_URL
Since Version: 0.6.0
hoodie.write.commit.callback.class
Full path of the callback class; must be a subclass of the HoodieWriteCommitCallback class. org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default
Default Value: org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback (Optional)
Config Param: CALLBACK_CLASS_NAME
Since Version: 0.6.0
hoodie.write.commit.callback.http.api.key
Http callback API key. hudi_write_commit_http_callback by default
Default Value: hudi_write_commit_http_callback (Optional)
Config Param: CALLBACK_HTTP_API_KEY_VALUE
Since Version: 0.6.0
hoodie.write.commit.callback.http.timeout.seconds
Callback timeout in seconds. 3 by default
Default Value: 3 (Optional)
Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS
Since Version: 0.6.0
hoodie.write.commit.callback.on
Turn commit callback on/off. off by default.
Default Value: false (Optional)
Config Param: TURN_CALLBACK_ON
Since Version: 0.6.0
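For instance, enabling HTTP commit callbacks against an illustrative endpoint (the callback class and timeout fall back to the defaults shown above):
hoodie.write.commit.callback.on true
hoodie.write.commit.callback.http.url http://callback-host:8000/hudi/commits
hoodie.write.commit.callback.http.timeout.seconds 3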
Write commit pulsar callback configs
Controls notifications sent to pulsar, on events happening to a hudi table.
Config Class
: org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallbackConfig
hoodie.write.commit.callback.pulsar.broker.service.url
Broker service URL of the Pulsar cluster, to be used for publishing commit metadata.
Default Value: N/A
(Required)
Config Param: BROKER_SERVICE_URL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.topic
Pulsar topic name to publish timeline activity into.
Default Value: N/A
(Required)
Config Param: TOPIC
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.connection-timeout
Duration of waiting for a connection to a broker to be established.
Default Value: 10s (Optional)
Config Param: CONNECTION_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.keepalive-interval
Duration of keeping alive interval for each client broker connection.
Default Value: 30s (Optional)
Config Param: KEEPALIVE_INTERVAL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.operation-timeout
Duration of waiting for completing an operation.
Default Value: 30s (Optional)
Config Param: OPERATION_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.block-if-queue-full
When the queue is full, the send call blocks instead of throwing an exception.
Default Value: true (Optional)
Config Param: PRODUCER_BLOCK_QUEUE_FULL
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.pending-queue-size
The maximum size of a queue holding pending messages.
Default Value: 1000 (Optional)
Config Param: PRODUCER_PENDING_QUEUE_SIZE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.pending-total-size
The maximum number of pending messages across partitions.
Default Value: 50000 (Optional)
Config Param: PRODUCER_PENDING_SIZE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.route-mode
Message routing logic for producers on partitioned topics.
Default Value: RoundRobinPartition (Optional)
Config Param: PRODUCER_ROUTE_MODE
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.producer.send-timeout
The timeout for each send to Pulsar.
Default Value: 30s (Optional)
Config Param: PRODUCER_SEND_TIMEOUT
Since Version: 0.11.0
hoodie.write.commit.callback.pulsar.request-timeout
Duration of waiting for completing a request.
Default Value: 60s (Optional)
Config Param: REQUEST_TIMEOUT
Since Version: 0.11.0
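A sketch of a Pulsar callback setup; the callback class name is assumed to be the Pulsar implementation shipped alongside the config class above, and the broker URL and topic are illustrative:
hoodie.write.commit.callback.on true
hoodie.write.commit.callback.class org.apache.hudi.utilities.callback.pulsar.HoodieWriteCommitPulsarCallback
hoodie.write.commit.callback.pulsar.broker.service.url pulsar://localhost:6650
hoodie.write.commit.callback.pulsar.topic hudi-commit-events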
Write commit Kafka callback configs
Controls notifications sent to Kafka, on events happening to a hudi table.
Config Class
: org.apache.hudi.utilities.callback.kafka.HoodieWriteCommitKafkaCallbackConfig
hoodie.write.commit.callback.kafka.bootstrap.servers
Bootstrap servers of kafka cluster, to be used for publishing commit metadata.
Default Value: N/A
(Required)
Config Param: BOOTSTRAP_SERVERS
Since Version: 0.7.0
hoodie.write.commit.callback.kafka.partition