Skip to main content
Version: Next

SQL Procedures

Stored procedures are available when use Hudi SparkSQL extensions in all spark's version.

Usage

CALL supports passing arguments by name (recommended) or by position. Mixing position and named arguments is also supported.

Named arguments

All procedure arguments are named. When passing arguments by name, arguments can be in any order and any optional argument can be omitted.

CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n);

Positional arguments

When passing arguments by position, the arguments may be omitted if they are optional.

CALL system.procedure_name(arg_1, arg_2, ... arg_n);

note: The system here has no practical meaning, the complete procedure name is system.procedure_name.

help

Show parameters and outputTypes of a procedure.

Input

Parameter NameTypeRequiredDefault ValueDescription
cmdStringNNonename of a procedure

Output

Output NameType
resultString

Example

call help(cmd => 'show_commits');
result
parameters:
param type_name default_value required
table string None true
limit integer 10 false
outputType:
name type_name nullable metadata
commit_time string true {}
action string true {}
total_bytes_written long true {}
total_files_added long true {}
total_files_updated long true {}
total_partitions_written long true {}
total_records_written long true {}
total_update_records_written long true {}
total_errors long true {}

Commit management

show_commits

Show commits' info.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned

Output

Output NameType
commit_timeString
total_bytes_writtenLong
total_files_addedLong
total_files_updatedLong
total_partitions_writtenLong
total_records_writtenLong
total_update_records_writtenLong
total_errorsLong

Example

call show_commits(table => 'test_hudi_table', limit => 10);
commit_timetotal_bytes_writtentotal_files_addedtotal_files_updatedtotal_partitions_writtentotal_records_writtentotal_update_records_writtentotal_errors
20220216171049652432653011000
20220216171027021435346101100
20220216171019361435349101100

show_commits_metadata

Show commits' metadata.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned

Output

Output NameType
commit_timeString
actionString
partitionString
file_idString
previous_commitString
num_writesLong
num_insertsLong
num_deletesLong
num_update_writesString
total_errorsLong
total_log_blocksLong
total_corrupt_logblocksLong
total_rollback_blocksLong
total_log_recordsLong
total_updated_records_compactedLong
total_bytes_writtenLong

Example

call show_commits_metadata(table => 'test_hudi_table');
commit_timeactionpartitionfile_idprevious_commitnum_writesnum_insertsnum_deletesnum_update_writestotal_errorstotal_log_blockstotal_corrupt_logblockstotal_rollback_blockstotal_log_recordstotal_updated_records_compactedtotal_bytes_written
20220109225319449commitdt=2021-05-03d0073a12-085d-4f49-83e9-402947e7e90a-0null1100000000435349
20220109225311742commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0202201092148305921100000000435340
20220109225301429commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0202201092148305921100000000435340
20220109214830592commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0202201091916310150010000000432653
20220109214830592commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0202201091916481810010000000432653
20220109191648181commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0null1100000000435341
20220109191631015commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0null1100000000435341

show_commit_extra_metadata

Show commits' extra metadata.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
instant_timeStringNNoneInstant time
metadata_keyStringNNoneKey of metadata

Output

Output NameType
instant_timeString
actionString
metadata_keyString
metadata_valueString

Example

call show_commit_extra_metadata(table => 'test_hudi_table');
instant_timeactionmetadata_keymetadata_value
20230206174349556deltacommitschema{"type":"record","name":"hudi_mor_tbl","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"ts","type":"long"}]}
20230206174349556deltacommitlatest_schema{"max_column_id":8,"version_id":20230206174349556,"type":"record","fields":[{"id":0,"name":"_hoodie_commit_time","optional":true,"type":"string","doc":""},{"id":1,"name":"_hoodie_commit_seqno","optional":true,"type":"string","doc":""},{"id":2,"name":"_hoodie_record_key","optional":true,"type":"string","doc":""},{"id":3,"name":"_hoodie_partition_path","optional":true,"type":"string","doc":""},{"id":4,"name":"_hoodie_file_name","optional":true,"type":"string","doc":""},{"id":5,"name":"id","optional":false,"type":"int"},{"id":8,"name":"ts","optional":false,"type":"long"}]}

show_archived_commits

Show archived commits.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
start_tsStringN""Start time for commits, default: now - 10 days
end_tsStringN""End time for commits, default: now - 1 day

Output

Output NameType
commit_timeString
total_bytes_writtenLong
total_files_addedLong
total_files_updatedLong
total_partitions_writtenLong
total_records_writtenLong
total_update_records_writtenLong
total_errorsLong

Example

call show_archived_commits(table => 'test_hudi_table');
commit_timetotal_bytes_writtentotal_files_addedtotal_files_updatedtotal_partitions_writtentotal_records_writtentotal_update_records_writtentotal_errors
20220216171049652432653011000
20220216171027021435346101100
20220216171019361435349101100

show_archived_commits_metadata

Show archived commits' metadata.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
start_tsStringN""Start time for commits, default: now - 10 days
end_tsStringN""End time for commits, default: now - 1 day

Output

Output NameType
commit_timeString
actionString
partitionString
file_idString
previous_commitString
num_writesLong
num_insertsLong
num_deletesLong
num_update_writesString
total_errorsLong
total_log_blocksLong
total_corrupt_logblocksLong
total_rollback_blocksLong
total_log_recordsLong
total_updated_records_compactedLong
total_bytes_writtenLong

Example

call show_archived_commits_metadata(table => 'test_hudi_table');
commit_timeactionpartitionfile_idprevious_commitnum_writesnum_insertsnum_deletesnum_update_writestotal_errorstotal_log_blockstotal_corrupt_logblockstotal_rollback_blockstotal_log_recordstotal_updated_records_compactedtotal_bytes_written
20220109225319449commitdt=2021-05-03d0073a12-085d-4f49-83e9-402947e7e90a-0null1100000000435349
20220109225311742commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0202201092148305921100000000435340
20220109225301429commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0202201092148305921100000000435340
20220109214830592commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0202201091916310150010000000432653
20220109214830592commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0202201091916481810010000000432653
20220109191648181commitdt=2021-05-02b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0null1100000000435341
20220109191631015commitdt=2021-05-010d7298b3-6b55-4cff-8d7d-b0772358b78a-0null1100000000435341
call show_archived_commits(table => 'test_hudi_table');
commit_timetotal_bytes_writtentotal_files_addedtotal_files_updatedtotal_partitions_writtentotal_records_writtentotal_update_records_writtentotal_errors
20220216171049652432653011000
20220216171027021435346101100
20220216171019361435349101100

show_commit_files

Show files of a commit.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
instant_timeStringYNoneInstant time

Output

Output NameType
actionString
partition_pathString
file_idString
previous_commitString
total_records_updatedLong
total_records_writtenLong
total_bytes_writtenLong
total_errorsLong
file_sizeLong

Example

call show_commit_files(table => 'test_hudi_table', instant_time => '20230206174349556');
actionpartition_pathfile_idprevious_committotal_records_updatedtotal_records_writtentotal_bytes_writtentotal_errorsfile_size
deltacommitdt=2021-05-037fb52523-c7f6-41aa-84a6-629041477aeb-0null014347680434768

show_commit_partitions

Show partitions of a commit.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
instant_timeStringYNoneInstant time

Output

Output NameType
actionString
partition_pathString
total_files_addedLong
total_files_updatedLong
total_records_insertedLong
total_records_updatedLong
total_bytes_writtenLong
total_errorsLong

Example

call show_commit_partitions(table => 'test_hudi_table', instant_time => '20230206174349556');
actionpartition_pathtotal_files_addedtotal_files_updatedtotal_records_insertedtotal_records_updatedtotal_bytes_writtentotal_errors
deltacommitdt=2021-05-037fb52523-c7f6-41aa-84a6-629041477aeb-00143476800

show_commit_write_stats

Show write statistics of a commit.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
instant_timeStringYNoneInstant time

Output

Output NameType
actionString
total_bytes_writtenLong
total_records_writtenLong
avg_record_sizeLong

Example

call show_commit_write_stats(table => 'test_hudi_table', instant_time => '20230206174349556');
actiontotal_bytes_writtentotal_records_writtenavg_record_size
deltacommit4347681434768

show_rollbacks

Show rollback commits.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned

Output

Output NameType
instantString
rollback_instantString
total_files_deletedInt
time_taken_in_millisLong
total_partitionsInt

Example

call show_rollbacks(table => 'test_hudi_table');
instantrollback_instanttotal_files_deletedtime_taken_in_millistime_taken_in_millis
deltacommit43476814347682

show_rollback_detail

Show details of a rollback commit.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned
instant_timeStringYNoneInstant time

Output

Output NameType
instantString
rollback_instantString
partitionString
deleted_fileString
succeededInt

Example

call show_rollback_detail(table => 'test_hudi_table', instant_time => '20230206174349556');
instantrollback_instantpartitiondeleted_filesucceeded
deltacommit43476814347682

commits_compare

Compare commit with another path.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
pathStringYNonePath of table

Output

Output NameType
compare_detailString

Example

call commits_compare(table => 'test_hudi_table', path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table');
compare_detail
Source test_hudi_table is ahead by 0 commits. Commits to catch up - []

archive_commits

archive commits.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringNNoneHudi table name
pathStringNNonePath of table
min_commitsIntN20Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.
max_commitsIntN30Archiving service moves older entries from timeline into an archived log after each write, to keep the metadata overhead constant, even as the table size grows. This config controls the maximum number of instants to retain in the active timeline.
retain_commitsIntN10Archiving of instants is batched in best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.
enable_metadataBooleanNfalseEnable the internal metadata table

Output

Output NameType
resultInt

Example

call archive_commits(table => 'test_hudi_table');
result
0

export_instants

extract instants to local folder.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
local_folderStringYNoneLocal folder
limitIntN-1Number of instants to be exported
actionsStringNclean,commit,deltacommit,rollback,savepoint,restoreCommit action
descBooleanNfalseDescending order

Output

Output NameType
export_detailString

Example

call export_instants(table => 'test_hudi_table', local_folder => '/tmp/folder');
export_detail
Exported 6 Instants to /tmp/folder

rollback_to_instant

Rollback a table to the commit that was current at some time.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
rollback_resultBoolean

Example

Roll back test_hudi_table to one instant

call rollback_to_instant(table => 'test_hudi_table', instant_time => '20220109225319449');
rollback_result
true

create_savepoint

Create a savepoint to hudi's table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
commit_timeStringYNoneCommit time
userStringN""User name
commentsStringN""Comments

Output

Output NameType
create_savepoint_resultBoolean

Example

call create_savepoint(table => 'test_hudi_table', commit_time => '20220109225319449');
create_savepoint_result
true

show_savepoints

Show savepoints.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
savepoint_timeString

Example

call show_savepoints(table => 'test_hudi_table');
savepoint_time
20220109225319449
20220109225311742
20220109225301429

delete_savepoint

Delete a savepoint to hudi's table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
instant_timeStringYNoneInstant time

Output

Output NameType
delete_savepoint_resultBoolean

Example

Delete a savepoint to test_hudi_table

call delete_savepoint(table => 'test_hudi_table', instant_time => '20220109225319449');
delete_savepoint_result
true

rollback_to_savepoint

Rollback a table to the commit that was current at some time.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
instant_timeStringYNoneInstant time

Output

Output NameType
rollback_savepoint_resultBoolean

Example

Rollback test_hudi_table to one savepoint

call rollback_to_savepoint(table => 'test_hudi_table', instant_time => '20220109225319449');
rollback_savepoint_result
true

copy_to_temp_view

copy table to a temporary view.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
query_typeStringN"snapshot"Whether data needs to be read, in incremental mode (new data since an instantTime) (or) read_optimized mode (obtain latest view, based on base files) (or) snapshot mode (obtain latest view, by merging base and (if any) log files)
view_nameStringYNoneName of view
begin_instance_timeStringN""Begin instance time
end_instance_timeStringN""End instance time
as_of_instantStringN""As of instant time
replaceBooleanNfalseReplace an existed view
globalBooleanNfalseGlobal view

Output

Output NameType
statusBoolean

Example

call copy_to_temp_view(table => 'test_hudi_table', view_name => 'copy_view_test_hudi_table');
status
0

copy_to_table

copy table to a new table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
query_typeStringN"snapshot"Whether data needs to be read, in incremental mode (new data since an instantTime) (or) read_optimized mode (obtain latest view, based on base files) (or) snapshot mode (obtain latest view, by merging base and (if any) log files)
new_tableStringYNoneName of new table
begin_instance_timeStringN""Begin instance time
end_instance_timeStringN""End instance time
as_of_instantStringN""As of instant time
save_modeStringN"overwrite"Save mode
columnsStringN""Columns of source table which should copy to new table

Output

Output NameType
statusBoolean

Example

call copy_to_table(table => 'test_hudi_table', new_table => 'copy_table_test_hudi_table');
status
0

Metadata Table management

create_metadata_table

Create metadata table of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
resultString

Example

call create_metadata_table(table => 'test_hudi_table');
result
Created Metadata Table in hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/.hoodie/metadata (duration=2.777secs)

init_metadata_table

Init metadata table of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
read_onlyBooleanNfalseRead only

Output

Output NameType
resultString

Example

call init_metadata_table(table => 'test_hudi_table');
result
Initialized Metadata Table in hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/.hoodie/metadata (duration=0.023sec)

delete_metadata_table

Delete metadata table of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
resultString

Example

call delete_metadata_table(table => 'test_hudi_table');
result
Removed Metadata Table from hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/.hoodie/metadata

show_metadata_table_partitions

Show partition of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
partitionString

Example

call show_metadata_table_partitions(table => 'test_hudi_table');
partition
dt=2021-05-01
dt=2021-05-02
dt=2021-05-03

show_metadata_table_files

Show files of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
partitionStringN""Partition name
limitIntN100Limit number

Output

Output NameType
file_pathString

Example

Show files of a hudi table under one partition.

call show_metadata_table_files(table => 'test_hudi_table', partition => 'dt=20230220');
file_path
.d3cdf6ff-250a-4cee-9af4-ab179fdb9bfb-0_20230220190948086.log.1_0-111-123
d3cdf6ff-250a-4cee-9af4-ab179fdb9bfb-0_0-78-81_20230220190948086.parquet

show_metadata_table_stats

Show metadata table stats of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
stat_keyString
stat_valueString

Example

call show_metadata_table_stats(table => 'test_hudi_table');
stat_keystat_value
dt=2021-05-03.totalBaseFileSizeInBytes23142

validate_metadata_table_files

Validate metadata table files of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
verboseBooleanNFalseIf verbose print all files

Output

Output NameType
partitionString
file_nameString
is_present_in_fsBoolean
is_present_in_metadataBoolean
fs_sizeLong
metadata_sizeLong

Example

call validate_metadata_table_files(table => 'test_hudi_table');
partitionfile_nameis_present_in_fsis_present_in_metadatafs_sizemetadata_size
dt=2021-05-03ad1e5a3f-532f-4a13-9f60-223676798bf3-0_0-4-4_00000000000002.parquettruetrue4352343523

Table information

show_table_properties

Show hudi properties of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
pathStringNNonePath of table
limitIntN10Max number of records to be returned

Output

Output NameType
keyString
valueString

Example

call show_table_properties(table => 'test_hudi_table', limit => 10);
keyvalue
hoodie.table.precombine.fieldts
hoodie.table.partition.fieldsdt

show_fs_path_detail

Show detail of a path.

Input

Parameter NameTypeRequiredDefault ValueDescription
pathStringYNoneHudi table name
is_subBooleanNfalseWhether to list files
sortBooleanNtrueSorted by storage_size
limitIntN100Limit number

Output

Output NameType
path_numLong
file_numLong
storage_sizeLong
storage_size(unit)String
storage_pathString
space_consumedLong
quotaLong
space_quotaLong

Example

call show_fs_path_detail(path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table');
path_numfile_numstorage_sizestorage_size(unit)storage_pathspace_consumedquotaspace_quota
225820656121.97MBhdfs://ns1/hive/warehouse/hudi.db/test_hudi_table-16196836-1

stats_file_sizes

Show file sizes of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
partition_pathStringN""Partition path
limitIntN10Max number of records to be returned

Output

Output NameType
commit_timeString
minLong
10thDouble
50thDouble
avgDouble
95thDouble
maxLong
num_filesInt
std_devDouble

Example

call stats_file_sizes(table => 'test_hudi_table');
commit_timemin10th50thavg95thmaxnum_filesstd_dev
20230205134149455435000435000.0435000.0435000.0435000.043500010.0

stats_wa

Show write stats and amplification of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
limitIntN10Max number of records to be returned

Output

Output NameType
commit_timeString
total_upsertedLong
total_writtenLong
write_amplification_factorString

Example

call stats_wa(table => 'test_hudi_table');
commit_timetotal_upsertedtotal_writtenwrite_amplification_factor
Total000

show_logfile_records

Show records in logfile of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
log_file_path_patternStringY10Pattern of logfile
mergeBooleanNfalseMerge results
limitIntN10Max number of records to be returned

Output

Output NameType
recordsString

Example

call show_logfile_records(table => 'test_hudi_table', log_file_path_pattern => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/*.log*');
records
{"_hoodie_commit_time": "20230205133427059", "_hoodie_commit_seqno": "20230205133427059_0_10", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "3438e233-7b50-4eff-adbb-70b1cd76f518-0", "id": 1, "name": "a1", "price": 40.0, "ts": 1111}

show_logfile_metadata

Show metadatas in logfile of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
log_file_path_patternStringY10Pattern of logfile
mergeBooleanNfalseMerge results
limitIntN10Max number of records to be returned

Output

Output NameType
instant_timeString
record_countInt
block_typeString
header_metadataString
footer_metadataString

Example

call show_logfile_metadata(table => 'hudi_mor_tbl', log_file_path_pattern => 'hdfs://ns1/hive/warehouse/hudi.db/hudi_mor_tbl/*.log*');
instant_timerecord_countblock_typeheader_metadatafooter_metadata
202302051334270591AVRO_DATA_BLOCK{"INSTANT_TIME":"20230205133427059","SCHEMA":"{\"type\":\"record\",\"name\":\"hudi_mor_tbl_record\",\"namespace\":\"hoodie.hudi_mor_tbl\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_commit_seqno\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_record_key\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_partition_path\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"_hoodie_file_name\",\"type\":[\"null\",\"string\"],\"doc\":\"\",\"default\":null},{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"},{\"name\":\"ts\",\"type\":\"long\"}]}"}{}

show_invalid_parquet

Show invalid parquet files of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
PathStringYNoneHudi table name
limitIntN100Limit number

Output

Output NameType
PathString

Example

call show_invalid_parquet(path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table');
Path
hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/7fb52523-c7f6-41aa-84a6-629041477aeb-0_0-92-99_20230205133532199.parquet

show_fsview_all

Show file system views of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
max_instantStringN""Max instant time
include_maxBooleanNfalseInclude max instant
include_in_flightBooleanNfalseInclude in flight
exclude_compactionBooleanNfalseExclude compaction
limitIntN10Max number of records to be returned
path_regexStringN"*\/*\/*"Pattern of path

Output

Output NameType
partitionString
file_idString
base_instantString
data_fileString
data_file_sizeLong
num_delta_filesLong
total_delta_file_sizeLong
delta_filesString

Example

call show_fsview_all(table => 'test_hudi_table');
partitionfile_idbase_instantdata_filedata_file_sizenum_delta_filestotal_delta_file_sizedelta_files
dt=2021-05-03d0073a12-085d-4f49-83e9-402947e7e90a-0202201092253194497fb52523-c7f6-41aa-84a6-629041477aeb-0_0-92-99_20220109225319449.parquet53194491213193.7fb52523-c7f6-41aa-84a6-629041477aeb-0_20230205133217210.log.1_0-60-63

show_fsview_latest

Show latest file system view of a table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
max_instantStringN""Max instant time
include_maxBooleanNfalseInclude max instant
include_in_flightBooleanNfalseInclude in flight
exclude_compactionBooleanNfalseExclude compaction
partition_pathStringYNonePartition path
mergeBooleanNfalseMerge results

Output

Output NameType
partitionString
file_idString
base_instantString
data_fileString
data_file_sizeLong
num_delta_filesLong
total_delta_file_sizeLong
delta_size_compaction_scheduledLong
delta_size_compaction_unscheduledLong
delta_to_base_radio_compaction_scheduledDouble
delta_to_base_radio_compaction_unscheduledDouble
delta_files_compaction_scheduledString
delta_files_compaction_unscheduledString

Example

call show_fsview_latest(table => 'test_hudi_table', partition => 'dt=2021-05-03');
partitionfile_idbase_instantdata_filedata_file_sizenum_delta_filestotal_delta_file_sizedelta_files
dt=2021-05-03d0073a12-085d-4f49-83e9-402947e7e90a-0202201092253194497fb52523-c7f6-41aa-84a6-629041477aeb-0_0-92-99_20220109225319449.parquet53194491213193.7fb52523-c7f6-41aa-84a6-629041477aeb-0_20230205133217210.log.1_0-60-63

Table services

run_clustering

Trigger clustering on a hoodie table. By using partition predicates, clustering table can be run with specified partitions, and you can also specify the order columns to sort data.

note

Newly clustering instant will be generated every call, or some pending clustering instants are executed. When calling this procedure, one of parameters table and path must be specified at least. If both parameters are given, table will take effect.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringNNoneName of table to be clustered
pathStringNNonePath of table to be clustered
predicateStringNNonePredicate to filter partition
orderStringNNoneOrder column split by ,
show_involved_partitionBooleanNfalseShow involved partition in the output
opStringNNoneOperation type, EXECUTE or SCHEDULE
order_strategyStringNNoneRecords layout optimization, linear/z-order/hilbert
optionsStringNNoneCustomize hudi configs in the format "key1=value1,key2=value2`
instantsStringNNoneSpecified instants by ,
selected_partitionsStringNNonePartitions to run clustering by ,
limitIntNNoneMax number of plans to be executed

Output

The output as follows:

Parameter NameTypeRequiredDefault ValueDescription
timestampStringNNoneInstant name
input_group_sizeIntNNoneThe input group sizes for each plan
stateStringNNoneThe instant final state
involved_partitionsStringN*Show involved partitions, default is *

Example

Clustering test_hudi_table with table name

call run_clustering(table => 'test_hudi_table');

Clustering test_hudi_table with table path

call run_clustering(path => '/tmp/hoodie/test_hudi_table');

Clustering test_hudi_table with table name, predicate and order column

call run_clustering(table => 'test_hudi_table', predicate => 'ts <= 20220408L', order => 'ts');

Clustering test_hudi_table with table name, show_involved_partition

call run_clustering(table => 'test_hudi_table', show_involved_partition => true);

Clustering test_hudi_table with table name, op

call run_clustering(table => 'test_hudi_table', op => 'schedule');

Clustering test_hudi_table with table name, order_strategy

call run_clustering(table => 'test_hudi_table', order_strategy => 'z-order');

Clustering test_hudi_table with table name, op, options

call run_clustering(table => 'test_hudi_table', op => 'schedule', options => '
hoodie.clustering.plan.strategy.target.file.max.bytes=1024*1024*1024,
hoodie.clustering.plan.strategy.max.bytes.per.group=2*1024*1024*1024');

Clustering test_hudi_table with table name, op, instants

call run_clustering(table => 'test_hudi_table', op => 'execute', instants => 'ts1,ts2');

Clustering test_hudi_table with table name, op, selected_partitions

call run_clustering(table => 'test_hudi_table', op => 'execute', selected_partitions => 'par1,par2');

Clustering test_hudi_table with table name, op, limit

call run_clustering(table => 'test_hudi_table', op => 'execute', limit => 10);
note

Limit parameter is valid only when op is execute.

show_clustering

Show pending clusterings on a hoodie table.

note

When calling this procedure, one of parameters table and path must be specified at least. If both parameters are given, table will take effect.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringNNoneName of table to be clustered
pathStringNNonePath of table to be clustered
limitIntNNoneMax number of records to be returned

Output

Parameter NameTypeRequiredDefault ValueDescription
timestampStringNNoneInstant time
groupsIntNNoneNumber of file groups to be processed

Example

Show pending clusterings with table name

call show_clustering(table => 'test_hudi_table');
timestampgroups
202204081537079282
202204081536369633

Show pending clusterings with table path

call show_clustering(path => '/tmp/hoodie/test_hudi_table');
timestampgroups
202204081537079282
202204081536369633

Show pending clusterings with table name and limit

call show_clustering(table => 'test_hudi_table', limit => 1);
timestampgroups
202204081537079282

run_compaction

Schedule or run compaction on a hoodie table.

note

For scheduling compaction, if timestamp is specified, new scheduled compaction will use given timestamp as instant time. Otherwise, compaction will be scheduled by using current system time.

For running compaction, given timestamp must be a pending compaction instant time that already exists, if it's not, exception will be thrown. Meanwhile, if timestampis specified and there are pending compactions, all pending compactions will be executed without new compaction instant generated.

When calling this procedure, one of parameters table and pathmust be specified at least. If both parameters are given, table will take effect.

Input

Parameter NameTypeRequiredDefault ValueDescription
opStringNNoneOperation type, RUN or SCHEDULE
tableStringNNoneName of table to be compacted
pathStringNNonePath of table to be compacted
timestampStringNNoneInstant time
optionsStringNNonecomma separated list of Hudi configs for compaction in the format "config1=value1,config2=value2"

Output

The output of RUN operation is EMPTY, the output of SCHEDULE as follow:

Parameter NameTypeRequiredDefault ValueDescription
instantStringNNoneInstant name

Example

Run compaction with table name

call run_compaction(op => 'run', table => 'test_hudi_table');

Run compaction with table path

call run_compaction(op => 'run', path => '/tmp/hoodie/test_hudi_table');

Run compaction with table path and timestamp

call run_compaction(op => 'run', path => '/tmp/hoodie/test_hudi_table', timestamp => '20220408153658568');

Run compaction with options

call run_compaction(op => 'run', table => 'test_hudi_table', options => hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.LogFileNumBasedCompactionStrategy,hoodie.compaction.logfile.num.threshold=3);

Schedule compaction with table name

call run_compaction(op => 'schedule', table => 'test_hudi_table');
instant
20220408153650834

Schedule compaction with table path

call run_compaction(op => 'schedule', path => '/tmp/hoodie/test_hudi_table');
instant
20220408153650834

Schedule compaction with table path and timestamp

call run_compaction(op => 'schedule', path => '/tmp/hoodie/test_hudi_table', timestamp => '20220408153658568');
instant
20220408153658568

show_compaction

Show all compactions on a hoodie table, in-flight or completed compactions are included, and result will be in reverse order according to trigger time.

note

When calling this procedure, one of parameters tableand path must be specified at least. If both parameters are given, table will take effect.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringNNoneName of table to show compaction
pathStringNNonePath of table to show compaction
limitIntNNoneMax number of records to be returned

Output

Parameter NameTypeRequiredDefault ValueDescription
timestampStringNNoneInstant time
actionStringNNoneAction name of compaction
sizeIntNNoneNumber of file slices to be compacted

Example

Show compactions with table name

call show_compaction(table => 'test_hudi_table');
timestampactionsize
20220408153707928compaction10
20220408153636963compaction10

Show compactions with table path

call show_compaction(path => '/tmp/hoodie/test_hudi_table');
timestampactionsize
20220408153707928compaction10
20220408153636963compaction10

Show compactions with table name and limit

call show_compaction(table => 'test_hudi_table', limit => 1);
timestampactionsize
20220408153707928compaction10

run_clean

Run cleaner on a hoodie table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneName of table to be cleaned
schedule_in_lineBooleanNtrueSet "true" if you want to schedule and run a clean. Set false if you have already scheduled a clean and want to run that.
clean_policyStringNNoneorg.apache.hudi.common.model.HoodieCleaningPolicy: Cleaning policy to be used. The cleaner service deletes older file slices files to re-claim space. Long running query plans may often refer to older file slices and will break if those are cleaned, before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time. By default, the cleaning policy is determined based on one of the following configs explicitly set by the user (at most one of them can be set; otherwise, KEEP_LATEST_COMMITS cleaning policy is used). KEEP_LATEST_FILE_VERSIONS: keeps the last N versions of the file slices written; used when "hoodie.cleaner.fileversions.retained" is explicitly set only. KEEP_LATEST_COMMITS(default): keeps the file slices written by the last N commits; used when "hoodie.cleaner.commits.retained" is explicitly set only. KEEP_LATEST_BY_HOURS: keeps the file slices written in the last N hours based on the commit time; used when "hoodie.cleaner.hours.retained" is explicitly set only.
retain_commitsIntNNoneWhen KEEP_LATEST_COMMITS cleaning policy is used, the number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.
hours_retainedIntNNoneWhen KEEP_LATEST_BY_HOURS cleaning policy is used, the number of hours for which commits need to be retained. This config provides a more flexible option as compared to number of commits retained for cleaning service. Setting this property ensures all the files, but the latest in a file group, corresponding to commits with commit times older than the configured number of hours to be retained are cleaned.
file_versions_retainedIntNNoneWhen KEEP_LATEST_FILE_VERSIONS cleaning policy is used, the minimum number of file slices to retain in each file group, during cleaning.
trigger_strategyStringNNoneorg.apache.hudi.table.action.clean.CleaningTriggerStrategy: Controls when cleaning is scheduled. NUM_COMMITS(default): Trigger the cleaning service every N commits, determined by hoodie.clean.max.commits
trigger_max_commitsIntNNoneNumber of commits after the last clean operation, before scheduling of a new clean is attempted.
optionsStringNNonecomma separated list of Hudi configs for cleaning in the format "config1=value1,config2=value2"

Output

Parameter NameType
start_clean_timeString
time_taken_in_millisLong
total_files_deletedInt
earliest_commit_to_retainString
bootstrap_part_metadataString
versionInt

Example

Run clean with table name

call run_clean(table => 'test_hudi_table');

Run clean with keep latest file versions policy

call run_clean(table => 'test_hudi_table', trigger_max_commits => 2, clean_policy => 'KEEP_LATEST_FILE_VERSIONS', file_versions_retained => 1)

delete_marker

Delete marker files of a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
instant_timeStringYNoneInstant name

Output

Output NameType
delete_marker_resultBoolean

Example

call delete_marker(table => 'test_hudi_table', instant_time => '20230206174349556');
delete_marker_result
true

sync_validate

Validate sync procedure.

Input

Parameter NameTypeRequiredDefault ValueDescription
src_tableStringYNoneSource table name
dst_tableStringYNoneTarget table name
modeStringY"complete"Mode
hive_server_urlStringYNoneHive server url
hive_passStringYNoneHive password
src_dbStringN"rawdata"Source database
target_dbStringNdwh_hoodie"Target database
partition_cntIntN5Partition count
hive_userStringN""Hive user name

Output

Output NameType
resultString

Example

 call sync_validate(hive_server_url=>'jdbc:hive2://localhost:10000/default', src_table => 'test_hudi_table_src', dst_table=> 'test_hudi_table_dst', mode=>'complete', hive_pass=>'', src_db=> 'default', target_db=>'default');

hive_sync

Sync the table's latest schema to Hive metastore.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
metastore_uriStringN""Metastore_uri
usernameStringN""User name
passwordStringN""Password
use_jdbcStringN""Use JDBC when hive synchronization is enabled
modeStringN""Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
partition_fieldsStringN""Field in the table to use for determining hive partition columns.
partition_extractor_classStringN""Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.
strategyStringN""Hive table synchronization strategy. Available option: RO, RT, ALL.
sync_incrementalStringN""Whether to incrementally sync the partitions to the metastore, i.e., only added, changed, and deleted partitions based on the commit metadata. If set to false, the meta sync executes a full partition sync operation when partitions are lost.

Output

Output NameType
resultString

Example

call hive_sync(table => 'test_hudi_table');
result
true

hdfs_parquet_import

add parquet files to a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
table_typeStringY""Table type, MERGE_ON_READ or COPY_ON_WRITE
src_pathStringY""Source path
target_pathStringY""target path
row_keyStringY""Primary key
partition_keyStringY""Partition key
schema_file_pathStringY""Path of Schema file
formatStringN"parquet"File format
commandStringN"insert"Import command
retryIntN0Retry times
parallelismIntNNoneParallelism
props_file_pathStringN""Path of properties file

Output

Output NameType
import_resultInt

Example

call hdfs_parquet_import(table => 'test_hudi_table', table_type => 'COPY_ON_WRITE', src_path => '', target_path => '', row_key => 'id', partition_key => 'dt', schema_file_path => '');
import_result
0

repair_add_partition_meta

Repair add partition for a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
dry_runBooleanNtrueDry run

Output

Output NameType
partition_pathString
metadata_is_presentString
actionString

Example

call repair_add_partition_meta(table => 'test_hudi_table');
partition_pathmetadata_is_presentaction
dt=2021-05-03YesNone

repair_corrupted_clean_files

Repair corrupted clean files for a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name

Output

Output NameType
resultBoolean

Example

call repair_corrupted_clean_files(table => 'test_hudi_table');
result
true

repair_deduplicate

Repair deduplicate records for a hudi table. The job dedupliates the data in the duplicated_partition_path and writes it into repaired_output_path. In the end of the job, the data in repaired_output_path is copied into the original path (duplicated_partition_path).

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
duplicated_partition_pathStringYNoneDuplicated partition path
repaired_output_pathStringYNoneRepaired output path
dry_runBooleanNtrueDry run
dedupe_typeStringN"insert_type"Dedupe type

Output

Output NameType
resultString

Example

call repair_deduplicate(table => 'test_hudi_table', duplicated_partition_path => 'dt=2021-05-03', repaired_output_path => '/tmp/repair_path/');
result
Reduplicated files placed in: /tmp/repair_path/.

repair_migrate_partition_meta

downgrade a hudi table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
dry_runBooleanNtrueDry run

Output

Output NameType
partition_pathString
text_metafile_presentString
base_metafile_presentString
actionString

Example

call repair_migrate_partition_meta(table => 'test_hudi_table');

repair_overwrite_hoodie_props

overwrite a hudi table properties.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
new_props_file_pathStringYNonePath of new properties

Output

Output NameType
propertyString
old_valueString
new_valueString

Example

call repair_overwrite_hoodie_props(table => 'test_hudi_table', new_props_file_path = > '/tmp/props');
propertyold_valuenew_value
hoodie.file.index.enabletruefalse

Bootstrap

run_bootstrap

Convert an existing table to Hudi.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneName of table to be clustered
table_typeStringYNoneTable type, MERGE_ON_READ or COPY_ON_WRITE
bootstrap_pathStringYNoneBase path of the dataset that needs to be bootstrapped as a Hudi table
base_pathStringYNoneBase path
rowKey_fieldStringYNonePrimary key field
base_file_formatStringN"PARQUET"Format of base file
partition_path_fieldStringN""Partitioned column field
bootstrap_index_classStringN"org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex"Implementation to use, for mapping a skeleton base file to a bootstrap base file.
selector_classStringN"org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector"Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped
key_generator_classStringN"org.apache.hudi.keygen.SimpleKeyGenerator"Class of key generator
full_bootstrap_input_providerStringN"org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider"Class of full bootstrap input provider
schema_provider_classStringN""Class of schema provider
payload_classStringN"org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"Class of payload
parallelismIntN1500For metadata-only bootstrap, Hudi parallelizes the operation so that each table partition is handled by one Spark task. This config limits the number of parallelism. We pick the configured parallelism if the number of table partitions is larger than this configured value. The parallelism is assigned to the number of table partitions if it is smaller than the configured value. For full-record bootstrap, i.e., BULK_INSERT operation of the records, this configured value is passed as the BULK_INSERT shuffle parallelism (hoodie.bulkinsert.shuffle.parallelism), determining the BULK_INSERT write behavior. If you see that the bootstrap is slow due to the limited parallelism, you can increase this.
enable_hive_syncBooleanNfalseWhether to enable hive sync
props_file_pathStringN""Path of properties file
bootstrap_overwriteBooleanNfalseOverwrite bootstrap path

Output

Output NameType
statusBoolean

Example

call run_bootstrap(table => 'test_hudi_table', table_type => 'COPY_ON_WRITE', bootstrap_path => 'hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table', base_path => 'hdfs://ns1//tmp/hoodie/test_hudi_table', rowKey_field => 'id', partition_path_field => 'dt',bootstrap_overwrite => true);
status
0

show_bootstrap_mapping

Show mapping files of a bootstrap table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneName of table to be clustered
partition_pathStringN""Partition path
file_idsStringN""File ids
limitIntN10Max number of records to be returned
sort_byStringN"partition"Sort by columns
descBooleanNfalseDescending order

Output

Parameter NameType
partitionString
file_idInt
source_base_pathString
source_partitionInt
source_fileString

Example

call show_bootstrap_mapping(table => 'test_hudi_table');
partitionfile_idsource_base_pathsource_partitionsource_file
dt=2021-05-03d0073a12-085d-4f49-83e9-402947e7e90a-0hdfs://ns1/hive/warehouse/hudi.db/test_hudi_table/dt=2021-05-03/d0073a12-085d-4f49-83e9-402947e7e90a-0_0-2-2_00000000000002.parquetdt=2021-05-03hdfs://ns1/tmp/dt=2021-05-03/00001.parquet

show_bootstrap_partitions

Show partitions of a bootstrap table.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneName of table to be clustered

Output

Parameter NameType
indexed_partitionsString

Example

call show_bootstrap_partitions(table => 'test_hudi_table');
indexed_partitions
dt=2021-05-03

Version management

upgrade_table

upgrade a hudi table to a specific version.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
to_versionStringYNoneVersion of hoodie table

Output

Output NameType
resultBoolean

Example

call upgrade_table(table => 'test_hudi_table', to_version => 'FIVE');
result
true

downgrade_table

downgrade a hudi table to a specific version.

Input

Parameter NameTypeRequiredDefault ValueDescription
tableStringYNoneHudi table name
to_versionStringYNoneVersion of hoodie table

Output

Output NameType
resultBoolean

Example

call downgrade_table(table => 'test_hudi_table', to_version => 'FOUR');
result
true