Skip to main content
Version: 1.0.0

SQL DML

Spark SQL

SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one.

Please refer to SQL DDL for creating Hudi tables using SQL.

Insert Into

You can use the INSERT INTO statement to add data to a Hudi table using Spark SQL. Here are some examples:

INSERT INTO <table> 
SELECT <columns> FROM <source>;
Deprecations

From 0.14.0, hoodie.sql.bulk.insert.enable and hoodie.sql.insert.mode are deprecated. Users are expected to use hoodie.spark.sql.insert.into.operation instead. To manage duplicates with INSERT INTO, please check out insert dup policy config.

Examples:

-- Insert into a copy-on-write (COW) Hudi table
INSERT INTO hudi_cow_nonpcf_tbl SELECT 1, 'a1', 20;

-- Insert into a merge-on-read (MOR) Hudi table
INSERT INTO hudi_mor_tbl SELECT 1, 'a1', 20, 1000;

-- Insert into a COW Hudi table with static partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='11') SELECT 2, 'a2', 1000;

-- Insert into a COW Hudi table with dynamic partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
Mapping to write operations

Hudi offers flexibility in choosing the underlying write operation of a INSERT INTO statement using the hoodie.spark.sql.insert.into.operation configuration. Possible options include "bulk_insert" (large inserts), "insert" (with small file management), and "upsert" (with deduplication/merging). If a precombine field is not set, "insert" is chosen as the default. For a table with preCombine field set, "upsert" is chosen as the default operation.

Insert Overwrite

The INSERT OVERWRITE statement is used to replace existing data in a Hudi table.

INSERT OVERWRITE <table> 
SELECT <columns> FROM <source>;

All existing partitions that are affected by the INSERT OVERWRITE statement will replaced with the source data. Here are some examples:

-- Overwrite non-partitioned table
INSERT OVERWRITE hudi_mor_tbl SELECT 99, 'a99', 20.0, 900;
INSERT OVERWRITE hudi_cow_nonpcf_tbl SELECT 99, 'a99', 20.0;

-- Overwrite partitioned table with dynamic partition
INSERT OVERWRITE TABLE hudi_cow_pt_tbl SELECT 10, 'a10', 1100, '2021-12-09', '10';

-- Overwrite partitioned table with static partition
INSERT OVERWRITE hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='12') SELECT 13, 'a13', 1100;

Update

You can use the UPDATE statement to modify existing data in a Hudi table directly.

UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression]

Here's an example:

-- Update data in a Hudi table
UPDATE hudi_mor_tbl SET price = price * 2, ts = 1111 WHERE id = 1;

-- Update data in a partitioned Hudi table
UPDATE hudi_cow_pt_tbl SET name = 'a1_1', ts = 1001 WHERE id = 1;

-- update using non-PK field
update hudi_cow_pt_tbl set ts = 1001 where name = 'a1';
info

The UPDATE operation requires the specification of a preCombineField.

Merge Into

The MERGE INTO statement allows you to perform more complex updates and merges against source data. The MERGE INTO statement is similar to the UPDATE statement, but it allows you to specify different actions for matched and unmatched records.

MERGE INTO tableIdentifier AS target_alias
USING (sub_query | tableIdentifier) AS source_alias
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]

<merge_condition> =A equal bool condition
<matched_action> =
DELETE |
UPDATE SET * |
UPDATE SET column1 = expression1 [, column2 = expression2 ...]
<not_matched_action> =
INSERT * |
INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
info

WHEN NOT MATCHED clauses specify the action to perform if the values do not match. There are two kinds of INSERT clauses:

  1. INSERT * clauses require that the source table has the same columns as those in the target table.
  2. INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...]) clauses do not require to specify all the columns of the target table. For unspecified target columns, insert the NULL value.

Examples below

-- source table using hudi for testing merging into non-partitioned table
create table merge_source (id int, name string, price double, ts bigint) using hudi
tblproperties (primaryKey = 'id', preCombineField = 'ts');
insert into merge_source values (1, "old_a1", 22.22, 900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000);

merge into hudi_mor_tbl as target
using merge_source as source
on target.id = source.id
when matched then update set *
when not matched then insert *
;

-- source table using parquet for testing merging into partitioned table
create table merge_source2 (id int, name string, flag string, dt string, hh string) using parquet;
insert into merge_source2 values (1, "new_a1", 'update', '2021-12-09', '10'), (2, "new_a2", 'delete', '2021-12-09', '11'), (3, "new_a3", 'insert', '2021-12-09', '12');

MERGE into hudi_cow_pt_tbl as target
using (
select id, name, '1000' as ts, flag, dt, hh from merge_source2
) source
on target.id = source.id
when matched and flag != 'delete' then
update set id = source.id, name = source.name, ts = source.ts, dt = source.dt, hh = source.hh
when matched and flag = 'delete' then delete
when not matched then
insert (id, name, ts, dt, hh) values(source.id, source.name, source.ts, source.dt, source.hh)
;

Key requirements

For a Hudi table with user configured primary keys, the join condition in Merge Into is expected to contain the primary keys of the table. For a Table where Hudi auto generates primary keys, the join condition in MIT can be on any arbitrary data columns.

Merge Into with Partial Updates

Partial updates only write updated columns instead of full change records. This is useful when you have wide tables (typical for ML feature stores) with hundreds of columns and only a few columns are updated. It reduces the write amplification as well as helps in lowering the query latency. MERGE INTO statement above can be modified to use partial updates as shown below.

-- Create a Merge-on-Read table
CREATE TABLE tableName (
id INT,
name STRING,
price DOUBLE,
_ts LONG,
description STRING
) USING hudi
TBLPROPERTIES (
type = 'mor',
primaryKey = 'id',
preCombineField = '_ts'
)
LOCATION '/location/to/basePath';

-- Insert values into the table
INSERT INTO tableName VALUES
(1, 'a1', 10, 1000, 'a1: desc1'),
(2, 'a2', 20, 1200, 'a2: desc2'),
(3, 'a3', 30, 1250, 'a3: desc3');

-- Perform partial updates using a MERGE INTO statement
MERGE INTO tableName t0
USING (
SELECT 1 AS id, 'a1' AS name, 12 AS price, 1001 AS ts
UNION ALL
SELECT 3 AS id, 'a3' AS name, 25 AS price, 1260 AS ts
) s0
ON t0.id = s0.id
WHEN MATCHED THEN UPDATE SET
price = s0.price,
_ts = s0.ts;

SELECT id, name, price, _ts, description FROM tableName;

Notice, instead of UPDATE SET *, we are updating only the price and _ts columns.

note

Partial update is not yet supported in the following cases:

  1. When the target table is a bootstrapped table.
  2. When virtual keys is enabled.
  3. When schema on read is enabled.
  4. When there is an enum field in the source data.

Delete From

You can remove data from a Hudi table using the DELETE FROM statement.

DELETE FROM tableIdentifier [ WHERE boolExpression ]

Examples below

-- Delete data from a Hudi table
DELETE FROM hudi_cow_nonpcf_tbl WHERE uuid = 1;

-- Delete data from a MOR Hudi table based on a condition
DELETE FROM hudi_mor_tbl WHERE id % 2 = 0;

-- Delete data using a non-primary key field
DELETE FROM hudi_cow_pt_tbl WHERE name = 'a1';

Data Skipping and Indexing

DML operations can be sped up using column statistics for data skipping and using indexes to reduce the amount of data scanned. For e.g. the following helps speed up the DELETE operation on a Hudi table, by using the record level index.

SET hoodie.metadata.record.index.enable=true;

DELETE from hudi_table where uuid = 'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa';

These DML operations give you powerful tools for managing your tables using Spark SQL. You can control the behavior of these operations using various configuration options, as explained in the documentation.

Flink SQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update and delete data from your Hudi tables. Let's explore them one by one.

Insert Into

You can utilize the INSERT INTO statement to incorporate data into a Hudi table using Flink SQL. Here are a few illustrative examples:

INSERT INTO <table> 
SELECT <columns> FROM <source>;

Examples:

-- Insert into a Hudi table
INSERT INTO hudi_table SELECT 1, 'a1', 20;

If the write.operation is 'upsert,' the INSERT INTO statement will not only insert new records but also update existing rows with the same record key.

-- Insert into a Hudi table in upsert mode
INSERT INTO hudi_table/*+ OPTIONS('write.operation'='upsert')*/ SELECT 1, 'a1', 20;

Update

With Flink SQL, you can use update command to update the hudi table. Here are a few illustrative examples:

UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression]
UPDATE hudi_table SET price = price * 2, ts = 1111 WHERE id = 1;
Key requirements

Update query only work with batch excution mode.

Delete From

With Flink SQL, you can use delete command to delete the rows from hudi table. Here are a few illustrative examples:

DELETE FROM tableIdentifier [ WHERE boolExpression ]
DELETE FROM hudi_table WHERE price < 100;
DELETE FROM hudi_table WHERE price < 100;
Key requirements

Delete query only work with batch excution mode.

Lookup Joins

A lookup join is typically used to enrich a table with data that is queried from an external system. The join requires one table to have a processing time attribute and the other table to be backed by a lookup source connector.

CREATE TABLE datagen_source(
id int,
name STRING,
proctime as PROCTIME()
) WITH (
'connector' = 'datagen',
'rows-per-second'='1',
'number-of-rows' = '2',
'fields.id.kind'='sequence',
'fields.id.start'='1',
'fields.id.end'='2'
);

SELECT o.id,o.name,b.id as id2
FROM datagen_source AS o
JOIN hudi_table/*+ OPTIONS('lookup.join.cache.ttl'= '2 day') */ FOR SYSTEM_TIME AS OF o.proctime AS b on o.id = b.id;

Setting Writer/Reader Configs

With Flink SQL, you can additionally set the writer/reader writer configs along with the query.

INSERT INTO hudi_table/*+ OPTIONS('${hoodie.config.key1}'='${hoodie.config.value1}')*/
INSERT INTO hudi_table/*+ OPTIONS('hoodie.keep.max.commits'='true')*/

The hudi-flink module defines the Flink SQL connector for both hudi source and sink. There are a number of options available for the sink table:

Option NameRequiredDefaultRemarks
pathYN/ABase path for the target hoodie table. The path would be created if it does not exist, otherwise a hudi table expects to be initialized successfully
table.typeNCOPY_ON_WRITEType of table to write. COPY_ON_WRITE (or) MERGE_ON_READ
write.operationNupsertThe write operation, that this write should do (insert or upsert is supported)
write.precombine.fieldNtsField used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
write.payload.classNOverwriteWithLatestAvroPayload.classPayload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for the option in-effective
write.insert.drop.duplicatesNfalseFlag to indicate whether to drop duplicates upon insert. By default insert will accept duplicates, to gain extra performance
write.ignore.failedNtrueFlag to indicate whether to ignore any non exception error (e.g. writestatus error). within a checkpoint batch. By default true (in favor of streaming progressing over data integrity)
hoodie.datasource.write.recordkey.fieldNuuidRecord key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: a.b.c
hoodie.datasource.write.keygenerator.classNSimpleAvroKeyGenerator.classKey generator class, that implements will extract the key out of incoming record
write.tasksN4Parallelism of tasks that do actual write, default is 4
write.batch.size.MBN128Batch buffer size in MB to flush data into the underneath filesystem

If the table type is MERGE_ON_READ, you can also specify the asynchronous compaction strategy through options:

Option NameRequiredDefaultRemarks
compaction.async.enabledNtrueAsync Compaction, enabled by default for MOR
compaction.trigger.strategyNnum_commitsStrategy to trigger compaction, options are 'num_commits': trigger compaction when reach N delta commits; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits'
compaction.delta_commitsN5Max delta commits needed to trigger compaction, default 5 commits
compaction.delta_secondsN3600Max delta seconds time needed to trigger compaction, default 1 hour

You can write the data using the SQL INSERT INTO statements:

INSERT INTO hudi_table select ... from ...; 

Note: INSERT OVERWRITE is not supported yet but already on the roadmap.

Non-Blocking Concurrency Control (Experimental)

Hudi Flink supports a new non-blocking concurrency control mode, where multiple writer tasks can be executed concurrently without blocking each other. One can read more about this mode in the concurrency control docs. Let us see it in action here.

In the below example, we have two streaming ingestion pipelines that concurrently update the same table. One of the pipeline is responsible for the compaction and cleaning table services, while the other pipeline is just for data ingestion.

In order to commit the dataset, the checkpoint needs to be enabled, here is an example configuration for a flink-conf.yaml:

-- set the interval as 30 seconds
execution.checkpointing.interval: 30000
state.backend: rocksdb
-- This is a datagen source that can generate records continuously
CREATE TABLE sourceT (
uuid varchar(20),
name varchar(10),
age int,
ts timestamp(3),
`partition` as 'par1'
) WITH (
'connector' = 'datagen',
'rows-per-second' = '200'
);

-- pipeline1: by default enable the compaction and cleaning services
CREATE TABLE t1(
uuid varchar(20),
name varchar(10),
age int,
ts timestamp(3),
`partition` varchar(20)
) WITH (
'connector' = 'hudi',
'path' = '${work_path}/hudi-demo/t1',
'table.type' = 'MERGE_ON_READ',
'index.type' = 'BUCKET',
'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
'write.tasks' = '2'
);

-- pipeline2: disable the compaction and cleaning services manually
CREATE TABLE t1_2(
uuid varchar(20),
name varchar(10),
age int,
ts timestamp(3),
`partition` varchar(20)
) WITH (
'connector' = 'hudi',
'path' = '${work_path}/hudi-demo/t1',
'table.type' = 'MERGE_ON_READ',
'index.type' = 'BUCKET',
'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
'write.tasks' = '2',
'compaction.schedule.enabled' = 'false',
'compaction.async.enabled' = 'false',
'clean.async.enabled' = 'false'
);

-- submit the pipelines
insert into t1 select * from sourceT;
insert into t1_2 select * from sourceT;

select * from t1 limit 20;

As you can see from the above example, we have two pipelines with multiple tasks that concurrently write to the same table. To use the new concurrency mode, all you need to do is set the hoodie.write.concurrency.mode to NON_BLOCKING_CONCURRENCY_CONTROL. The write.tasks option is used to specify the number of write tasks that will be used for writing to the table. The compaction.schedule.enabled, compaction.async.enabled and clean.async.enabled options are used to disable the compaction and cleaning services for the second pipeline. This is done to ensure that the compaction and cleaning services are not executed twice for the same table.

Consistent hashing index (Experimental)

We have introduced the Consistent Hashing Index since 0.13.0 release. In comparison to the static hashing index (Bucket Index), the consistent hashing index offers dynamic scalability of data buckets for the writer. You can find the RFC for the design of this feature. In the 0.13.X release, the Consistent Hashing Index is supported only for Spark engine. And since release 0.14.0, the index is supported for Flink engine.

To utilize this feature, configure the option index.type as BUCKET and set hoodie.index.bucket.engine to CONSISTENT_HASHING. When enabling the consistent hashing index, it's important to enable clustering scheduling within the writer. During this process, the writer will perform dual writes for both the old and new data buckets while the clustering is pending. Although the dual write does not impact correctness, it is strongly recommended to execute clustering as quickly as possible.

In the below example, we will create a datagen source and do streaming ingestion into Hudi table with consistent bucket index. In order to commit the dataset, the checkpoint needs to be enabled, here is an example configuration for a flink-conf.yaml:

-- set the interval as 30 seconds
execution.checkpointing.interval: 30000
state.backend: rocksdb
-- This is a datagen source that can generate records continuously
CREATE TABLE sourceT (
uuid varchar(20),
name varchar(10),
age int,
ts timestamp(3),
`partition` as 'par1'
) WITH (
'connector' = 'datagen',
'rows-per-second' = '200'
);

-- Create the hudi table with consistent bucket index
CREATE TABLE t1(
uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector'='hudi',
'path' = '${work_path}/hudi-demo/hudiT',
'table.type' = 'MERGE_ON_READ',
'index.type' = 'BUCKET',
'clustering.schedule.enabled'='true',
'hoodie.index.bucket.engine'='CONSISTENT_HASHING',
'hoodie.clustering.plan.strategy.class'='org.apache.hudi.client.clustering.plan.strategy.FlinkConsistentBucketClusteringPlanStrategy',
'hoodie.clustering.execution.strategy.class'='org.apache.hudi.client.clustering.run.strategy.SparkConsistentBucketClusteringExecutionStrategy',
'hoodie.bucket.index.num.buckets'='8',
'hoodie.bucket.index.max.num.buckets'='128',
'hoodie.bucket.index.min.num.buckets'='8',
'hoodie.bucket.index.split.threshold'='1.5',
'write.tasks'='2'
);

-- submit the pipelines
insert into t1 select * from sourceT;

select * from t1 limit 20;
caution

Consistent Hashing Index is supported for Flink engine since release 0.14.0 and currently there are some limitations to use it as of 0.14.0:

  • This index is supported only for MOR table. This limitation also exists even if using Spark engine.
  • It does not work with metadata table enabled. This limitation also exists even if using Spark engine.
  • Consistent hashing index does not work with bulk-insert using Flink engine yet, please use simple bucket index or Spark engine for bulk-insert pipelines.
  • The resize plan which generated by Flink engine only supports merging small file groups, the file splitting is not supported yet.
  • The resize plan should be executed through an offline Spark job. Flink engine does not support execute resize plan yet.