Version: 0.11.1

Basic Configurations

This page covers the basic configurations you may use to write/read Hudi tables. This page only features a subset of the most frequently used configurations. For a full list of all configs, please visit the All Configurations page.

  • Spark Datasource Configs: These configs control the Hudi Spark Datasource, letting you define keys/partitioning, choose the write operation, specify how to merge records, or choose the query type for reads.
  • Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, letting you define record keys, choose the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type for reads.
  • Write Client Configs: Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
  • Metrics Configs: This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.
  • Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert, based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels.

Spark Datasource Configs

These configs control the Hudi Spark Datasource, letting you define keys/partitioning, choose the write operation, specify how to merge records, or choose the query type for reads.

Read Options

Options useful for reading tables via read.format.option(...)

Config Class: org.apache.hudi.DataSourceOptions.scala

hoodie.datasource.query.type

Query type for the read: incremental mode (new data since an instantTime), read optimized mode (latest view, based on base files only), or snapshot mode (latest view, obtained by merging base files and any log files).
Default Value: snapshot (Optional)
Config Param: QUERY_TYPE
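
As a rough illustration of the query types described above, the sketch below builds the option maps that could be handed to the Spark reader. The class and method names are hypothetical, and the begin-instant key is an assumption taken from outside this page (check the All Configurations page before relying on it).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HudiReadOptionsSketch {
    // Option map for a snapshot query (the default query type on this page).
    static Map<String, String> snapshot() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.query.type", "snapshot");
        return opts;
    }

    // Option map for an incremental query that reads data written after the given instant.
    static Map<String, String> incremental(String beginInstantTime) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.query.type", "incremental");
        // Assumption: this key is not documented on this page.
        opts.put("hoodie.datasource.read.begin.instanttime", beginInstantTime);
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
        System.out.println(incremental("20220101000000"));
        // With Spark and Hudi on the classpath, the map would be passed along the lines of:
        // spark.read().format("hudi").options(opts).load(basePath)
    }
}
```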


Write Options

You can pass down any of the WriteClient level configs directly using options() or option(k,v) methods.

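import org.apache.hudi.DataSourceWriteOptions;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.sql.SaveMode;

// inputDF is a Dataset<Row>; clientOpts is a Map<String, String> of write client configs
inputDF.write()
    .format("org.apache.hudi")
    .options(clientOpts) // any of the Hudi client opts can be passed in as well
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath);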

Options useful for writing tables via write.format.option(...)

Config Class: org.apache.hudi.DataSourceOptions.scala

hoodie.datasource.write.operation

Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path to scale to large inputs without needing to cache them.
Default Value: upsert (Optional)
Config Param: OPERATION


hoodie.datasource.write.table.type

The table type for the underlying data, for this write. This can’t change between writes.
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE


hoodie.datasource.write.table.name

Table name for the datasource write. Also used to register the table into meta stores.
Default Value: N/A (Required)
Config Param: TABLE_NAME


hoodie.datasource.write.recordkey.field

Record key field. Value to be used as the recordKey component of HoodieKey. The actual value is obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c
Default Value: uuid (Optional)
Config Param: RECORDKEY_FIELD


hoodie.datasource.write.partitionpath.field

Partition path field. Value to be used as the partitionPath component of HoodieKey. The actual value is obtained by invoking .toString() on the field value.
Default Value: N/A (Required)
Config Param: PARTITIONPATH_FIELD


hoodie.datasource.write.keygenerator.class

Key generator class, that implements org.apache.hudi.keygen.KeyGenerator
Default Value: org.apache.hudi.keygen.SimpleKeyGenerator (Optional)
Config Param: KEYGENERATOR_CLASS_NAME


hoodie.datasource.write.precombine.field

Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD
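
To make the precombine rule above concrete, here is a minimal sketch of the idea: among incoming records that share a key, keep the one with the largest precombine value. The Rec class and its long-valued ts field are hypothetical stand-ins, and this is not Hudi's actual implementation; ties keep the earlier record here, which is an arbitrary choice of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrecombineSketch {
    // Hypothetical record shape: a record key, a precombine value, and a payload.
    static final class Rec {
        final String key;
        final long ts;
        final String value;

        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    // For each key, keep the record whose precombine value (ts) is largest.
    static Map<String, Rec> precombine(List<Rec> incoming) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : incoming) {
            // merge() stores r if the key is new, otherwise keeps the winner of the comparison.
            latest.merge(r.key, r, (kept, candidate) ->
                Long.compare(candidate.ts, kept.ts) > 0 ? candidate : kept);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> batch = List.of(
            new Rec("k1", 1L, "old"),
            new Rec("k1", 5L, "new"),
            new Rec("k2", 3L, "only"));
        precombine(batch).forEach((k, r) -> System.out.println(k + " -> " + r.value));
    }
}
```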


hoodie.datasource.write.payload.class

Payload class used. Override this if you want to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective.
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME


hoodie.datasource.write.partitionpath.urlencode

Whether to URL-encode the partition path value before creating the folder structure.
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING


hoodie.datasource.hive_sync.enable

When set to true, register/sync the table to Apache Hive metastore
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED


hoodie.datasource.hive_sync.mode

Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Default Value: N/A (Required)
Config Param: HIVE_SYNC_MODE


hoodie.datasource.write.hive_style_partitioning

Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. False by default (the names of partition folders are only partition values).
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING
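
The effect of hoodie.datasource.write.hive_style_partitioning and hoodie.datasource.write.partitionpath.urlencode on folder names can be sketched roughly as follows. The helper is hypothetical and uses java.net.URLEncoder as a stand-in; Hudi's own encoder may escape a different set of characters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PartitionPathSketch {
    // Derives a partition folder name from a column name and value under the two flags.
    static String partitionFolder(String column, String value, boolean hiveStyle, boolean urlEncode) {
        // Stand-in encoding; assumption: Hudi's real escaping rules may differ.
        String v = urlEncode ? URLEncoder.encode(value, StandardCharsets.UTF_8) : value;
        return hiveStyle ? column + "=" + v : v;
    }

    public static void main(String[] args) {
        System.out.println(partitionFolder("region", "emea", false, false));
        System.out.println(partitionFolder("region", "emea", true, false));
        System.out.println(partitionFolder("dt", "2022/01/01", true, true));
    }
}
```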


hoodie.datasource.hive_sync.partition_fields

Field in the table to use for determining hive partition columns.
Default Value: (Optional)
Config Param: HIVE_PARTITION_FIELDS


hoodie.datasource.hive_sync.partition_extractor_class

Class which implements PartitionValueExtractor to extract the partition values, default 'SlashEncodedDayPartitionValueExtractor'.
Default Value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)
Config Param: HIVE_PARTITION_EXTRACTOR_CLASS
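
Putting several of the write options above together, the following sketch assembles a plain string-keyed option map for an upsert with Hive sync enabled. The field names (uuid, region, ts) and the helper itself are hypothetical. It assumes that HoodieWriteConfig.TABLE_NAME resolves to hoodie.table.name, and that MultiPartKeysValueExtractor is a suitable extractor for a non-date partition column; verify both against your Hudi version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HudiWriteOptionsSketch {
    // Builds write options for an upsert into a copy-on-write table with Hive sync turned on.
    static Map<String, String> upsertWithHiveSync(String tableName) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.table.name", tableName); // assumption: key behind HoodieWriteConfig.TABLE_NAME
        opts.put("hoodie.datasource.write.operation", "upsert");
        opts.put("hoodie.datasource.write.table.type", "COPY_ON_WRITE");
        opts.put("hoodie.datasource.write.recordkey.field", "uuid");
        opts.put("hoodie.datasource.write.partitionpath.field", "region");
        opts.put("hoodie.datasource.write.precombine.field", "ts");
        opts.put("hoodie.datasource.write.hive_style_partitioning", "true");
        opts.put("hoodie.datasource.hive_sync.enable", "true");
        opts.put("hoodie.datasource.hive_sync.mode", "hms");
        opts.put("hoodie.datasource.hive_sync.partition_fields", "region");
        // Assumption: extractor class chosen for illustration, not named on this page.
        opts.put("hoodie.datasource.hive_sync.partition_extractor_class",
                 "org.apache.hudi.hive.MultiPartKeysValueExtractor");
        return opts;
    }

    public static void main(String[] args) {
        upsertWithHiveSync("demo_table").forEach((k, v) -> System.out.println(k + "=" + v));
        // With Spark on the classpath: inputDF.write().format("org.apache.hudi").options(opts)...
    }
}
```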


Flink Sql Configs

These configs control the Hudi Flink SQL source/sink connectors, letting you define record keys, choose the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type for reads.

path

Base path for the target Hudi table. The path is created if it does not exist; otherwise, a successfully initialized Hudi table is expected at that path.
Default Value: N/A (Required)
Config Param: PATH


hoodie.table.name

Table name to register to Hive metastore
Default Value: N/A (Required)
Config Param: TABLE_NAME


table.type

Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE


write.operation

The write operation, that this write should do
Default Value: upsert (Optional)
Config Param: OPERATION


write.tasks

Parallelism of tasks that do actual write, default is 4
Default Value: 4 (Optional)
Config Param: WRITE_TASKS


write.bucket_assign.tasks

Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment
Default Value: N/A (Required)
Config Param: BUCKET_ASSIGN_TASKS


write.precombine

Flag to indicate whether to drop duplicates before insert/upsert. By default, the following cases accept duplicates, to gain extra performance:

  1. insert operation;
  2. upsert for a MOR table, since the MOR table deduplicates on read.

Default Value: false (Optional)
Config Param: PRE_COMBINE


read.tasks

Parallelism of tasks that do actual read, default is 4
Default Value: 4 (Optional)
Config Param: READ_TASKS


read.start-commit

Start commit instant for reading. The commit time format should be 'yyyyMMddHHmmss'. By default, a streaming read starts from the latest instant.
Default Value: N/A (Required)
Config Param: READ_START_COMMIT


read.streaming.enabled

Whether to read as streaming source, default false
Default Value: false (Optional)
Config Param: READ_AS_STREAMING


compaction.tasks

Parallelism of tasks that do actual compaction, default is 4
Default Value: 4 (Optional)
Config Param: COMPACTION_TASKS


hoodie.datasource.write.hive_style_partitioning

Whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. False by default (the names of partition folders are only partition values).
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING


hive_sync.enable

Asynchronously sync Hive meta to HMS, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED


hive_sync.mode

Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'
Default Value: jdbc (Optional)
Config Param: HIVE_SYNC_MODE


hive_sync.table

Table name for hive sync, default 'unknown'
Default Value: unknown (Optional)
Config Param: HIVE_SYNC_TABLE


hive_sync.db

Database name for hive sync, default 'default'
Default Value: default (Optional)
Config Param: HIVE_SYNC_DB


hive_sync.partition_extractor_class

Tool to extract the partition value from HDFS path, default 'SlashEncodedDayPartitionValueExtractor'
Default Value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)
Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME


hive_sync.metastore.uris

Metastore uris for hive sync, default ''
Default Value: (Optional)
Config Param: HIVE_SYNC_METASTORE_URIS


Write Client Configs

Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.

Storage Configs

Configurations that control aspects around writing, sizing, reading base and log files.

Config Class: org.apache.hudi.config.HoodieStorageConfig

write.parquet.block.size

Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_BLOCK_SIZE


write.parquet.max.file.size

Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_MAX_FILE_SIZE


Metadata Configs

Configurations used by the Hudi Metadata Table. This table maintains metadata about a given Hudi table (e.g., file listings) to avoid the overhead of accessing cloud storage during queries.

Config Class: org.apache.hudi.common.config.HoodieMetadataConfig

hoodie.metadata.enable

Enable the internal metadata table, which serves table metadata such as file listings.
Default Value: true (Optional)
Config Param: ENABLE
Since Version: 0.7.0


Write Configurations

Configurations that control write behavior on Hudi tables. These can be passed down directly from higher level frameworks (e.g., Spark datasources, Flink sink) and utilities (e.g., DeltaStreamer).

Config Class: org.apache.hudi.config.HoodieWriteConfig

hoodie.combine.before.upsert

When upserted records share the same key, controls whether they should first be combined (i.e., de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no incoming duplicates; otherwise it can lead to duplicate keys and violate the uniqueness guarantees.
Default Value: true (Optional)
Config Param: COMBINE_BEFORE_UPSERT


hoodie.write.markers.type

Marker type to use. Two modes are supported:

  • DIRECT: an individual marker file corresponding to each data file is created directly by the writer.
  • TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service, which serves as a proxy. New marker entries are batch processed and stored in a limited number of underlying files for efficiency.

If HDFS is used or the timeline server is disabled, DIRECT markers are used as a fallback even if this is configured. For Spark structured streaming, this configuration does not take effect, i.e., DIRECT markers are always used for Spark structured streaming.
Default Value: TIMELINE_SERVER_BASED (Optional)
Config Param: MARKERS_TYPE
Since Version: 0.9.0
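
The fallback rules described for hoodie.write.markers.type can be summarized as a small decision function. This is only a sketch of the documented behavior with hypothetical names and boolean inputs, not the code Hudi runs.

```java
class MarkerTypeSketch {
    enum MarkerType { DIRECT, TIMELINE_SERVER_BASED }

    // Returns the marker type that would effectively be used under the documented fallbacks.
    static MarkerType effective(MarkerType configured,
                                boolean onHdfs,
                                boolean timelineServerEnabled,
                                boolean sparkStructuredStreaming) {
        // DIRECT is used for structured streaming, on HDFS, or when the timeline server is off.
        if (sparkStructuredStreaming || onHdfs || !timelineServerEnabled) {
            return MarkerType.DIRECT;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(effective(MarkerType.TIMELINE_SERVER_BASED, false, true, false));
        System.out.println(effective(MarkerType.TIMELINE_SERVER_BASED, true, true, false));
    }
}
```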


hoodie.insert.shuffle.parallelism

Parallelism for inserting records into the table. Inserts can shuffle data before writing to tune file sizes and optimize the storage layout.
Default Value: 200 (Optional)
Config Param: INSERT_PARALLELISM_VALUE


hoodie.rollback.parallelism

Parallelism for rollback of commits. Rollbacks perform delete of files or logging delete blocks to file groups on storage in parallel.
Default Value: 100 (Optional)
Config Param: ROLLBACK_PARALLELISM_VALUE


hoodie.combine.before.delete

During delete operations, controls whether we should combine deletes (and potentially also upserts) before writing to storage.
Default Value: true (Optional)
Config Param: COMBINE_BEFORE_DELETE


hoodie.combine.before.insert

When inserted records share the same key, controls whether they should first be combined (i.e., de-duplicated) before writing to storage. When set to true, the precombine field value is used to reduce all records that share the same key.
Default Value: false (Optional)
Config Param: COMBINE_BEFORE_INSERT


hoodie.bulkinsert.shuffle.parallelism

For large initial imports using bulk_insert operation, controls the parallelism to use for sort modes or custom partitioning done before writing records to the table.
Default Value: 200 (Optional)
Config Param: BULKINSERT_PARALLELISM_VALUE


hoodie.delete.shuffle.parallelism

Parallelism used for “delete” operation. Delete operations also perform shuffles, similar to upsert operation.
Default Value: 200 (Optional)
Config Param: DELETE_PARALLELISM_VALUE


hoodie.bulkinsert.sort.mode

Sorting mode to use for sorting records for bulk insert. This is used when hoodie.bulkinsert.user.defined.partitioner.class is not configured. Available values are:

  • GLOBAL_SORT: ensures the best file sizes, with the lowest memory overhead, at the cost of sorting.
  • PARTITION_SORT: strikes a balance by only sorting within a partition, still keeping the memory overhead of writing lowest, with best-effort file sizing.
  • NONE: no sorting. Fastest, and matches spark.write.parquet() in terms of number of files and overheads.

Default Value: GLOBAL_SORT (Optional)
Config Param: BULK_INSERT_SORT_MODE


hoodie.embed.timeline.server

When true, spins up an instance of the timeline server (a meta server that serves cached file listings and statistics), running on each writer's driver process and accepting requests from executors during the write.
Default Value: true (Optional)
Config Param: EMBEDDED_TIMELINE_SERVER_ENABLE


hoodie.upsert.shuffle.parallelism

Parallelism to use for upsert operation on the table. Upserts can shuffle data to perform index lookups, file sizing, bin packing records optimally into file groups.
Default Value: 200 (Optional)
Config Param: UPSERT_PARALLELISM_VALUE


hoodie.rollback.using.markers

Enables a more efficient mechanism for rollbacks based on the marker files generated during the writes. Turned on by default.
Default Value: true (Optional)
Config Param: ROLLBACK_USING_MARKERS_ENABLE


hoodie.finalize.write.parallelism

Parallelism for the internal write finalization operation, which involves removing any partially written files from lake storage before committing the write. Reduce this value if the high number of tasks incurs delays for smaller tables or low latency writes.
Default Value: 200 (Optional)
Config Param: FINALIZE_WRITE_PARALLELISM_VALUE


Compaction Configs

Configurations that control compaction (merging of log files onto new base files) as well as cleaning (reclamation of older/unused file groups/slices).

Config Class: org.apache.hudi.config.HoodieCompactionConfig

hoodie.cleaner.policy

Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. By default, the cleaner spares the file slices written by the last N commits, determined by hoodie.cleaner.commits.retained. Long running query plans may often refer to older file slices and will break if those are cleaned before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time.
Default Value: KEEP_LATEST_COMMITS (Optional)
Config Param: CLEANER_POLICY


hoodie.copyonwrite.record.size.estimate

The average record size. If not explicitly specified, hudi will compute the record size estimate dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files.
Default Value: 1024 (Optional)
Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE


hoodie.compact.inline.max.delta.seconds

Number of elapsed seconds after the last compaction, before scheduling a new one.
Default Value: 3600 (Optional)
Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS


hoodie.cleaner.commits.retained

Number of commits to retain when the cleaner is triggered with the KEEP_LATEST_COMMITS cleaning policy. Configure this so that the longest running query is able to succeed. This also directly translates into how much data retention the table supports for incremental queries.
Default Value: 10 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
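
One rough way to pick a value for hoodie.cleaner.commits.retained is to divide the longest expected query time by the interval between commits. The sketch below does that with a ceiling division and adds one extra commit as a safety margin; both the helper and the margin are assumptions of this example rather than a rule stated by Hudi.

```java
class CleanerRetentionSketch {
    // Estimates how many commits must be retained so the longest query still finds its file slices.
    static int commitsToRetain(long longestQuerySeconds, long commitIntervalSeconds) {
        // Ceiling division: number of commits that can land while the query is running.
        long commitsDuringQuery =
            (longestQuerySeconds + commitIntervalSeconds - 1) / commitIntervalSeconds;
        // Plus one commit of headroom (arbitrary margin chosen for this sketch).
        return (int) commitsDuringQuery + 1;
    }

    public static void main(String[] args) {
        // A one-hour query against a table that commits every ten minutes.
        System.out.println(commitsToRetain(3600, 600));
    }
}
```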


hoodie.clean.async

Only applies when hoodie.clean.automatic is turned on. When turned on, the cleaner runs asynchronously alongside writing, which can speed up overall write performance.
Default Value: false (Optional)
Config Param: ASYNC_CLEAN


hoodie.clean.automatic

When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded.
Default Value: true (Optional)
Config Param: AUTO_CLEAN


hoodie.commits.archival.batch

Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.
Default Value: 10 (Optional)
Config Param: COMMITS_ARCHIVAL_BATCH_SIZE


hoodie.compact.inline

When set to true, compaction service is triggered after each write. While being simpler operationally, this adds extra latency on the write path.
Default Value: false (Optional)
Config Param: INLINE_COMPACT


hoodie.parquet.small.file.limit

During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file.
Default Value: 104857600 (Optional)
Config Param: PARQUET_SMALL_FILE_LIMIT


hoodie.compaction.strategy

Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default, Hudi picks the log file with most accumulated unmerged data
Default Value: org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy (Optional)
Config Param: COMPACTION_STRATEGY


hoodie.archive.automatic

When enabled, the archival table service is invoked immediately after each commit, to archive commits if we cross a maximum value of commits. It's recommended to enable this, to ensure number of active commits is bounded.
Default Value: true (Optional)
Config Param: AUTO_ARCHIVE


hoodie.copyonwrite.insert.auto.split

Controls whether insert split sizes are determined automatically, based on average record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome.
Default Value: true (Optional)
Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS


hoodie.compact.inline.max.delta.commits

Number of delta commits after the last compaction, before scheduling of a new compaction is attempted. This is used when the compaction trigger strategy involves the number of commits, for example NUM_COMMITS, NUM_AND_TIME, NUM_OR_TIME.
Default Value: 5 (Optional)
Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS


hoodie.keep.min.commits

Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.
Default Value: 20 (Optional)
Config Param: MIN_COMMITS_TO_KEEP


hoodie.cleaner.parallelism

Parallelism for the cleaning operation. Increase this if cleaning becomes slow.
Default Value: 200 (Optional)
Config Param: CLEANER_PARALLELISM_VALUE


hoodie.record.size.estimation.threshold

We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)
Default Value: 1.0 (Optional)
Config Param: RECORD_SIZE_ESTIMATION_THRESHOLD
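
The reverse search described above can be sketched as follows: walk commits from newest to oldest and use the first one that wrote more than PARQUET_SMALL_FILE_LIMIT_BYTES * threshold bytes. The CommitStats type, the rounding and the fallback to a configured estimate are assumptions of this sketch, not Hudi's exact code.

```java
import java.util.List;

class RecordSizeEstimateSketch {
    // Hypothetical summary of a commit's metadata.
    static final class CommitStats {
        final long totalBytesWritten;
        final long totalRecordsWritten;

        CommitStats(long totalBytesWritten, long totalRecordsWritten) {
            this.totalBytesWritten = totalBytesWritten;
            this.totalRecordsWritten = totalRecordsWritten;
        }
    }

    // Returns the average record size from the newest commit that is large enough to trust.
    static long estimate(List<CommitStats> newestFirst,
                         long smallFileLimitBytes,
                         double threshold,
                         long fallbackRecordSize) {
        double minBytes = smallFileLimitBytes * threshold;
        for (CommitStats c : newestFirst) {
            if (c.totalBytesWritten > minBytes && c.totalRecordsWritten > 0) {
                return (long) Math.ceil((double) c.totalBytesWritten / c.totalRecordsWritten);
            }
        }
        // No commit was large enough: fall back to the configured estimate.
        return fallbackRecordSize;
    }

    public static void main(String[] args) {
        List<CommitStats> commits = List.of(
            new CommitStats(1_000_000L, 1_000L),       // too small to trust, skipped
            new CommitStats(314_572_800L, 204_800L));  // 300MB across 204800 records
        System.out.println(estimate(commits, 104_857_600L, 1.0, 1024L));
    }
}
```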


hoodie.compact.inline.trigger.strategy

Controls how compaction scheduling is triggered: by time, by number of delta commits, or by a combination of both. Valid options: NUM_COMMITS, TIME_ELAPSED, NUM_AND_TIME, NUM_OR_TIME
Default Value: NUM_COMMITS (Optional)
Config Param: INLINE_COMPACT_TRIGGER_STRATEGY
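
How the four trigger strategies combine hoodie.compact.inline.max.delta.commits and hoodie.compact.inline.max.delta.seconds can be sketched as below. The names are hypothetical, and the use of >= for both comparisons is an assumption; the exact boundary conditions in Hudi may differ.

```java
class CompactionTriggerSketch {
    enum Strategy { NUM_COMMITS, TIME_ELAPSED, NUM_AND_TIME, NUM_OR_TIME }

    // Decides whether a new compaction would be scheduled under the given strategy.
    static boolean shouldSchedule(Strategy strategy,
                                  int deltaCommitsSinceLastCompaction,
                                  long secondsSinceLastCompaction,
                                  int maxDeltaCommits,
                                  long maxDeltaSeconds) {
        boolean byCommits = deltaCommitsSinceLastCompaction >= maxDeltaCommits;
        boolean byTime = secondsSinceLastCompaction >= maxDeltaSeconds;
        switch (strategy) {
            case NUM_COMMITS:  return byCommits;
            case TIME_ELAPSED: return byTime;
            case NUM_AND_TIME: return byCommits && byTime;
            case NUM_OR_TIME:  return byCommits || byTime;
            default: throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        // Defaults from this page: 5 delta commits, 3600 seconds.
        System.out.println(shouldSchedule(Strategy.NUM_COMMITS, 5, 120, 5, 3600));
        System.out.println(shouldSchedule(Strategy.NUM_AND_TIME, 5, 120, 5, 3600));
    }
}
```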


hoodie.keep.max.commits

The archiving service moves older entries from the timeline into an archived log after each write, to keep the metadata overhead constant even as the table size grows. This config controls the maximum number of instants to retain in the active timeline.
Default Value: 30 (Optional)
Config Param: MAX_COMMITS_TO_KEEP


hoodie.copyonwrite.insert.split.size

Number of inserts assigned for each partition/bucket for writing. We based the default on writing out 100MB files, with at least 1kb records (100K records per file), and over provision to 500K. As long as auto-tuning of splits is turned on, this only affects the first write, where there is no history to learn record sizes from.
Default Value: 500000 (Optional)
Config Param: COPY_ON_WRITE_INSERT_SPLIT_SIZE


File System View Storage Configurations

Configurations that control how file metadata is stored by Hudi, for transaction processing and queries.

Config Class: org.apache.hudi.common.table.view.FileSystemViewStorageConfig

hoodie.filesystem.view.type

File system view provides APIs for viewing the files on the underlying lake storage, as file groups and file slices. This config controls how such a view is held. Options include MEMORY, SPILLABLE_DISK, EMBEDDED_KV_STORE, REMOTE_ONLY and REMOTE_FIRST, which provide different trade-offs for memory usage and API request performance.
Default Value: MEMORY (Optional)
Config Param: VIEW_TYPE


hoodie.filesystem.view.secondary.type

Specifies the secondary form of storage for the file system view, used if the primary (e.g., timeline server) is unavailable.
Default Value: MEMORY (Optional)
Config Param: SECONDARY_VIEW_TYPE


Index Configs

Configurations that control indexing behavior, which tags incoming records as either inserts or updates to older records.

Config Class: org.apache.hudi.config.HoodieIndexConfig

hoodie.index.type

Type of index to use. Default is Bloom filter. Possible options are [BLOOM | GLOBAL_BLOOM | SIMPLE | GLOBAL_SIMPLE | INMEMORY | HBASE | BUCKET]. Bloom filters remove the dependency on an external system and are stored in the footer of the Parquet data files.
Default Value: N/A (Required)
Config Param: INDEX_TYPE


hoodie.index.bloom.fpp

Only applies if index type is BLOOM. Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001); we like to trade off disk space for lower false positives. If the number of entries added to the bloom filter exceeds the configured value (hoodie.index.bloom.num_entries), then this fpp may not be honored.
Default Value: 0.000000001 (Optional)
Config Param: BLOOM_FILTER_FPP_VALUE


hoodie.index.bloom.num_entries

Only applies if index type is BLOOM. This is the number of entries to be stored in the bloom filter. The rationale for the default: assume the maxParquetFileSize is 128MB and averageRecordSize is 1kb, which gives approximately 130K records in a file. The default (60000) is roughly half of this approximation. Warning: Setting this very low will generate a lot of false positives, and index lookup will have to scan a lot more files than it has to; setting this to a very high number will increase the size of every base file linearly (roughly 4KB for every 50000 entries). This config is also used with the DYNAMIC bloom filter, where it determines the initial size of the bloom filter.
Default Value: 60000 (Optional)
Config Param: BLOOM_FILTER_NUM_ENTRIES_VALUE
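
The arithmetic behind the default can be checked with a few lines. The sketch below recomputes the records-per-file figure from the 128MB and 1kb assumptions, and applies the "roughly 4KB for every 50000 entries" rule of thumb from the description as a simple linear estimate; the helper names are hypothetical and the size estimate is only as accurate as that rule of thumb.

```java
class BloomSizingSketch {
    // Approximate number of records that fit in one base file.
    static long recordsPerFile(long maxFileBytes, long avgRecordBytes) {
        return maxFileBytes / avgRecordBytes;
    }

    // Linear estimate of bloom filter size: roughly 4KB per 50000 entries.
    static long approxFilterBytes(long numEntries) {
        return numEntries * 4096L / 50_000L;
    }

    public static void main(String[] args) {
        long perFile = recordsPerFile(128L * 1024 * 1024, 1024L);
        System.out.println("records per file: " + perFile);      // about 130K
        System.out.println("half of that:     " + perFile / 2);  // near the 60000 default
        System.out.println("approx bytes for 60000 entries: " + approxFilterBytes(60_000L));
    }
}
```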


hoodie.bloom.index.update.partition.path

Only applies if index type is GLOBAL_BLOOM. When set to true, an update including the partition path of a record that already exists will result in inserting the incoming record into the new partition and deleting the original record in the old partition. When set to false, the original record will only be updated in the old partition
Default Value: true (Optional)
Config Param: BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE


hoodie.bloom.index.use.caching

Only applies if index type is BLOOM. When true, the input RDD will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitions.
Default Value: true (Optional)
Config Param: BLOOM_INDEX_USE_CACHING


hoodie.bloom.index.parallelism

Only applies if index type is BLOOM. This is the amount of parallelism for index lookup, which involves a shuffle. By default, this is auto computed based on input workload characteristics.
Default Value: 0 (Optional)
Config Param: BLOOM_INDEX_PARALLELISM


hoodie.bloom.index.prune.by.ranges

Only applies if index type is BLOOM. When true, range information from files is leveraged to speed up index lookups. Particularly helpful if the key has a monotonically increasing prefix, such as a timestamp. If the record key is completely random, it is better to turn this off, since range pruning will only add extra overhead to the index lookup.
Default Value: true (Optional)
Config Param: BLOOM_INDEX_PRUNE_BY_RANGES


hoodie.bloom.index.filter.type

Filter type used. Default is BloomFilterTypeCode.DYNAMIC_V0. Available values are [BloomFilterTypeCode.SIMPLE , BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto size themselves based on number of keys.
Default Value: DYNAMIC_V0 (Optional)
Config Param: BLOOM_FILTER_TYPE


hoodie.simple.index.parallelism

Only applies if index type is SIMPLE. This is the amount of parallelism for index lookup, which involves a Spark Shuffle
Default Value: 50 (Optional)
Config Param: SIMPLE_INDEX_PARALLELISM


hoodie.simple.index.use.caching

Only applies if index type is SIMPLE. When true, the incoming writes will be cached to speed up index lookup by reducing IO for computing parallelism or affected partitions.
Default Value: true (Optional)
Config Param: SIMPLE_INDEX_USE_CACHING


hoodie.global.simple.index.parallelism

Only applies if index type is GLOBAL_SIMPLE. This is the amount of parallelism for index lookup, which involves a Spark Shuffle
Default Value: 100 (Optional)
Config Param: GLOBAL_SIMPLE_INDEX_PARALLELISM


hoodie.simple.index.update.partition.path

Similar to hoodie.bloom.index.update.partition.path, but for the simple index. Available since version 0.6.0.
Default Value: true (Optional)
Config Param: SIMPLE_INDEX_UPDATE_PARTITION_PATH_ENABLE


Common Configurations

The following set of configurations is common across Hudi.

Config Class: org.apache.hudi.common.config.HoodieCommonConfig

hoodie.common.spillable.diskmap.type

When handling input data that cannot be held in memory, to merge with a file on storage, a spillable diskmap is employed. By default, we use a persistent hashmap based loosely on bitcask, which offers O(1) inserts and lookups. Change this to ROCKS_DB to use RocksDB for handling the spill instead.
Default Value: BITCASK (Optional)
Config Param: SPILLABLE_DISK_MAP_TYPE


Metrics Configs

This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.

Metrics Configurations for Datadog reporter

Enables reporting on Hudi metrics using the Datadog reporter type. Hudi publishes metrics on every commit, clean, rollback etc.

Config Class: org.apache.hudi.config.metrics.HoodieMetricsDatadogConfig

hoodie.metrics.on

Turn on/off metrics reporting. Off by default.
Default Value: false (Optional)
Config Param: TURN_METRICS_ON
Since Version: 0.5.0


hoodie.metrics.reporter.type

Type of metrics reporter.
Default Value: GRAPHITE (Optional)
Config Param: METRICS_REPORTER_TYPE_VALUE
Since Version: 0.5.0


hoodie.metrics.reporter.class


Default Value: (Optional)
Config Param: METRICS_REPORTER_CLASS_NAME
Since Version: 0.6.0


Record Payload Config

This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert, based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels.

Payload Configurations

Payload related configs, that can be leveraged to control merges based on specific business fields in the data.

Config Class: org.apache.hudi.config.HoodiePayloadConfig

hoodie.payload.event.time.field

Table column/field name from which to derive the timestamp associated with the records. This can be useful, e.g., for determining the freshness of the table.
Default Value: ts (Optional)
Config Param: EVENT_TIME_FIELD


hoodie.payload.ordering.field

Table column/field name to order records that have the same key, before merging and writing to storage.
Default Value: ts (Optional)
Config Param: ORDERING_FIELD