SQL DML
Spark SQL
SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one.
Please refer to SQL DDL for creating Hudi tables using SQL.
Insert Into
You can use the INSERT INTO
statement to add data to a Hudi table using Spark SQL. Here are some examples:
INSERT INTO <table>
SELECT <columns> FROM <source>;
From 0.14.0, hoodie.sql.bulk.insert.enable
and hoodie.sql.insert.mode
are deprecated. Users are expected to use hoodie.spark.sql.insert.into.operation
instead.
To manage duplicates with INSERT INTO
, please check out insert dup policy config.
Examples:
-- Insert into a copy-on-write (COW) Hudi table
INSERT INTO hudi_cow_nonpcf_tbl SELECT 1, 'a1', 20;
-- Insert into a merge-on-read (MOR) Hudi table
INSERT INTO hudi_mor_tbl SELECT 1, 'a1', 20, 1000;
-- Insert into a COW Hudi table with static partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='11') SELECT 2, 'a2', 1000;
-- Insert into a COW Hudi table with dynamic partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
Hudi offers flexibility in choosing the underlying write operation of a INSERT INTO
statement using
the hoodie.spark.sql.insert.into.operation
configuration. Possible options include "bulk_insert" (large inserts), "insert" (with small file management),
and "upsert" (with deduplication/merging). If a precombine field is not set, "insert" is chosen as the default. For a table with preCombine field set,
"upsert" is chosen as the default operation.
Insert Overwrite
The INSERT OVERWRITE
statement is used to replace existing data in a Hudi table.
INSERT OVERWRITE <table>
SELECT <columns> FROM <source>;
All existing partitions that are affected by the INSERT OVERWRITE
statement will replaced with the source data.
Here are some examples:
-- Overwrite non-partitioned table
INSERT OVERWRITE hudi_mor_tbl SELECT 99, 'a99', 20.0, 900;
INSERT OVERWRITE hudi_cow_nonpcf_tbl SELECT 99, 'a99', 20.0;
-- Overwrite partitioned table with dynamic partition
INSERT OVERWRITE TABLE hudi_cow_pt_tbl SELECT 10, 'a10', 1100, '2021-12-09', '10';
-- Overwrite partitioned table with static partition
INSERT OVERWRITE hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='12') SELECT 13, 'a13', 1100;