When building a change data capture pipeline for existing or newly created relational databases, one of the most common challenges is simplifying the onboarding process for multiple tables. Ingesting multiple tables into a Hudi dataset in one go is now possible using the HoodieMultiTableDeltaStreamer class, a wrapper on top of the more popular HoodieDeltaStreamer class. Currently, HoodieMultiTableDeltaStreamer supports only the COPY_ON_WRITE storage type, and ingestion is done sequentially.
This blog will guide you through configuring and running HoodieMultiTableDeltaStreamer.
HoodieMultiTableDeltaStreamer expects users to maintain table-level overridden properties in separate files under a dedicated config folder. Properties common to all tables can also be configured in a common properties file.
By default, hudi datasets are created under the path <base-path-prefix>/<database_name>/<name_of_table_to_be_ingested>. You need to provide the names of the tables to be ingested via the property hoodie.deltastreamer.ingestion.tablesToBeIngested in the format <database>.<table>, for example
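A sketch of this property, assuming two tables table1 and table2 in a database named db (the names are placeholders):

```properties
hoodie.deltastreamer.ingestion.tablesToBeIngested=db.table1,db.table2
```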
If you do not provide a database name, the table is assumed to belong to the default database, and the hudi dataset for that table is created under the path <base-path-prefix>/default/<name_of_table_to_be_ingested>. There is also a provision to override the default path for hudi datasets. You can place the hudi dataset for a particular table at a custom location by setting the property hoodie.deltastreamer.ingestion.targetBasePath in the table-level config file.
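For instance, a table-level config file could override the base path like this (the s3 location shown is purely illustrative):

```properties
hoodie.deltastreamer.ingestion.targetBasePath=s3:///custom_hudi_datasets/db/table1
```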
There are many properties that one might like to override per table, for example
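A sketch of typical per-table overrides follows; the field names and Kafka topic here are illustrative, not prescribed:

```properties
hoodie.datasource.write.recordkey.field=_row_key
hoodie.datasource.write.partitionpath.field=created_at
hoodie.deltastreamer.source.kafka.topic=topic1
```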
Properties like the above need to be set for every table to be ingested. As suggested at the beginning, users are expected to maintain separate config files for every table by setting the below property
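Assuming the property follows the pattern hoodie.deltastreamer.ingestion.&lt;database&gt;.&lt;table&gt;.configFile, a sketch for a table table1 in database db would be (the path is illustrative):

```properties
hoodie.deltastreamer.ingestion.db.table1.configFile=s3:///tmp/config/db_table1_config.properties
```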
If you do not want to set the above property for every table, you can simply create config files for every table to be ingested under the config folder with the name <database>_<table>_config.properties. For example, if you want to ingest table1 and table2 from the dummy database, where the config folder is set to s3:///tmp/config, then you need to create 2 config files at the given paths - s3:///tmp/config/dummy_table1_config.properties and s3:///tmp/config/dummy_table2_config.properties.
Finally, you can specify all the common properties in a common properties file. The common properties file does not necessarily have to lie under the config folder, but it is advisable to keep it alongside the other config files. This file will contain the below properties
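A minimal sketch of a common properties file, assuming a Kafka source and a schema registry; the class name, servers, and table list are illustrative placeholders:

```properties
hoodie.deltastreamer.ingestion.tablesToBeIngested=db.table1,db.table2
hoodie.deltastreamer.schemaprovider.class=org.apache.hudi.utilities.schema.SchemaRegistryProvider
bootstrap.servers=localhost:9092
auto.offset.reset=earliest
```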