Query Engine Setup
Spark
The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi tables can be queried via the Spark datasource with a simple spark.read.parquet. See the Spark Quick Start for more examples of Spark datasource reading queries.
If your Spark environment does not have the Hudi jars installed, add --jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar to the classpath of drivers and executors. Alternatively, hudi-spark-bundle can also be fetched via the --packages option (e.g. --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3).
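As a minimal sketch, a snapshot query through the Spark datasource looks like the following, run from a spark-shell or job that already has the bundle on its classpath; the table base path is hypothetical:

```scala
// Minimal sketch of a Spark datasource read on a Hudi table.
// "/data/hudi/trips" is a hypothetical base path; substitute your own table location.
// On older Hudi releases a glob path such as "/data/hudi/trips/*/*" may be required.
val tripsDF = spark.read
  .format("hudi")
  .load("/data/hudi/trips")

tripsDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select count(*) from hudi_trips_snapshot").show()
```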
PrestoDB
PrestoDB is a popular query engine, providing interactive query performance. PrestoDB currently supports snapshot querying on COPY_ON_WRITE tables. Both snapshot and read optimized queries are supported on MERGE_ON_READ Hudi tables. Since the PrestoDB-Hudi integration has evolved over time, the installation instructions vary with the PrestoDB version. Please check the table below for the query types supported and the installation instructions for different versions of PrestoDB.
| PrestoDB Version | Installation description | Query types supported |
|---|---|---|
| < 0.233 | Requires the hudi-presto-bundle jar to be placed into <presto_install>/plugin/hive-hadoop2/, across the installation. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
| >= 0.233 | No action needed. Hudi (0.5.1-incubating) is a compile time dependency. | Snapshot querying on COW tables. Read optimized querying on MOR tables. |
| >= 0.240 | No action needed. Hudi 0.5.3 version is a compile time dependency. | Snapshot querying on both COW and MOR tables. |
We upgraded the Hudi version from 0.5.3 to 0.9.0 in Presto 0.265, but that introduced a breaking dependency change in another Presto module. See this issue for more details. Since then, we have fixed the hudi-presto-bundle in version 0.10.1. Now, we need to upgrade Hudi in Presto again. This is being tracked by HUDI-3010. Our suggestion is to avoid upgrading Presto until the issue is fixed. However, if this is not an option, then the workaround is to download the hudi-presto-bundle jar from our maven repo and place it in <presto_install>/plugin/hive-hadoop2/.
Presto Environment
- Configure Presto according to the Presto configuration document.
- Configure the hive catalog in /presto-server-0.2xxx/etc/catalog/hive.properties as follows:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://xxx.xxx.xxx.xxx:9083
hive.config.resources=.../hadoop-2.x/etc/hadoop/core-site.xml,.../hadoop-2.x/etc/hadoop/hdfs-site.xml
Query
Begin querying by connecting to the Hive metastore with the Presto client. The Presto client connection command is as follows:
# The presto client connection command
./presto --server xxx.xxx.xxx.xxx:9999 --catalog hive --schema default
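Once connected, Hudi tables synced to the Hive metastore can be queried like any other Hive table. For example (the table name hudi_trips is hypothetical):

```sql
-- hudi_trips is a hypothetical Hudi table registered in the Hive metastore
select count(*) from hudi_trips;
```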
Trino
Trino (formerly PrestoSQL) was forked off of PrestoDB a few years ago. Hudi supports 'Snapshot' queries for Copy-On-Write tables and 'Read Optimized' queries for Merge-On-Read tables, through the initial input-format-based integration in PrestoDB (pre forking). This approach has known performance limitations with very large tables, which have since been fixed on PrestoDB. We are working on replicating the same fixes on Trino as well.
To query Hudi tables on Trino, please place the hudi-trino-bundle jar into the Hive connector installation <trino_install>/plugin/hive-hadoop2.
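Connecting with the Trino CLI then mirrors the Presto example above; the server address and port below are placeholders for illustration:

```sh
# The trino client connection command (server address and port are placeholders)
./trino --server xxx.xxx.xxx.xxx:8080 --catalog hive --schema default
```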
Hive
In order for Hive to recognize Hudi tables and query them correctly,
- the HiveServer2 needs to be provided with the hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar in its aux jars path (see the sketch below). This ensures the input format classes and their dependencies are available for query planning & execution.
- For MERGE_ON_READ tables, the bundle additionally needs to be put on the hadoop/hive installation across the cluster, so that queries can pick up the custom RecordReader as well.
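One common way to provide the bundle to HiveServer2 is via the hive.aux.jars.path property; the jar location below is a hypothetical path, so adjust it (and the version) to your installation:

```properties
# Hypothetical bundle location; set in hive-site.xml or passed via --hiveconf
hive.aux.jars.path=file:///usr/lib/hudi/hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar
```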
In addition to the setup above, for beeline cli access, the hive.input.format variable needs to be set to the fully qualified path name of the inputformat org.apache.hudi.hadoop.HoodieParquetInputFormat. For Tez, additionally the hive.tez.input.format needs to be set to org.apache.hadoop.hive.ql.io.HiveInputFormat. Then proceed to query the table like any other Hive table.
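From a beeline session, those settings might look like the following sketch:

```sql
-- Beeline session settings before querying a Hudi table
set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
-- Additionally, when running on Tez:
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```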
Redshift Spectrum
Copy on Write Tables in Apache Hudi versions 0.5.2, 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.10.x, 0.11.x and 0.12.0 can be queried via Amazon Redshift Spectrum external tables. To query Hudi versions 0.10.0 and above, please use the latest version of Redshift.
Hudi tables are supported only when AWS Glue Data Catalog is used. It's not supported when you use an Apache Hive metastore as the external catalog.
Please refer to Redshift Spectrum Integration with Apache Hudi for more details.
Doris
Copy on Write Tables in Hudi version 0.10.0 can be queried via Doris external tables starting from Doris version 1.1. Please refer to Doris Hudi external table for more details on the setup.
The current default supported version of Hudi is 0.10.0; other versions have not been tested. More versions will be supported in the future.
StarRocks
Copy on Write tables in Apache Hudi 0.10.0 and above can be queried via StarRocks external tables from StarRocks version 2.2.0. Only snapshot queries are supported currently. In future releases Merge on Read tables will also be supported. Please refer to StarRocks Hudi external table for more details on the setup.