Export Hudi datasets as a copy or as different formats

rxu

Copy to Hudi dataset

Similar to the existing HoodieSnapshotCopier, the Exporter scans the source dataset and copies it to the target output path.

spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/hudi/" \
--output-format "hudi"

Export to json or parquet dataset

The Exporter can also convert the source dataset into other formats. Currently only "json" and "parquet" are supported.

spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" # or "parquet"

Re-partitioning

When exporting to a different format, the Exporter accepts parameters for custom re-partitioning. By default, if neither of the two parameters below is given, the output dataset will have no partition.

--output-partition-field

This parameter uses an existing non-metadata field as the output partition field. All `_hoodie_*` metadata fields will be stripped during export.

spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
--output-partition-field "symbol" # assume the source dataset contains a field `symbol`

The output directory will look like this:

_SUCCESS
symbol=AMRS
symbol=AYX
symbol=CDMO
symbol=CRC
symbol=DRNA
...
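
Because the exported files sit under symbol=... directories, Spark's partition discovery recovers symbol as a column when the output is read back, and filters on it prune partitions. A minimal sketch, assuming the layout above; the class name is only illustrative:

ReadPartitionedExport.java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartitionedExport {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("read-partitioned-export")
        .getOrCreate();

    // Reading the root directory lets Spark infer `symbol` from the symbol=... folder names.
    Dataset<Row> df = spark.read().json("/tmp/exported/json/");
    df.select("symbol").distinct().show();

    // Filtering on the partition column only scans files under symbol=AYX.
    df.filter("symbol = 'AYX'").show(10, false);
  }
}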

--output-partitioner

This parameter takes the fully-qualified name of a class that implements HoodieSnapshotExporter.Partitioner. It takes precedence over --output-partition-field, which is ignored when this is provided.
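
For reference, the extension point boils down to a single method that maps the source Dataset to a configured DataFrameWriter. The shape below is inferred from the example implementation that follows; consult the HoodieSnapshotExporter source in hudi-utilities for the authoritative declaration:

import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Inferred shape of HoodieSnapshotExporter.Partitioner (declared nested inside HoodieSnapshotExporter).
public interface Partitioner {
  DataFrameWriter<Row> partition(Dataset<Row> source);
}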

An example implementation is shown below:

MyPartitioner.java

package com.foo.bar;

import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.utilities.HoodieSnapshotExporter;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MyPartitioner implements HoodieSnapshotExporter.Partitioner {

  private static final String PARTITION_NAME = "date";

  @Override
  public DataFrameWriter<Row> partition(Dataset<Row> source) {
    // Use the current hoodie partition path as the output partition.
    return source
        .withColumnRenamed(HoodieRecord.PARTITION_PATH_METADATA_FIELD, PARTITION_NAME)
        .repartition(new Column(PARTITION_NAME))
        .write()
        .partitionBy(PARTITION_NAME);
  }
}

After packaging this class into my-custom.jar and placing it on the job classpath, the submit command looks like this:

spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar,my-custom.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
--output-partitioner "com.foo.bar.MyPartitioner"