Export Hudi datasets as a copy or as different formats
Copy to Hudi dataset
Similar to the existing HoodieSnapshotCopier
, the Exporter scans the source dataset and then makes a copy of it to the target output path.
spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/hudi/" \
--output-format "hudi"
Export to json or parquet dataset
The Exporter can also convert the source dataset into other formats. Currently only "json" and "parquet" are supported.
spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" # or "parquet"
Re-partitioning
When export to a different format, the Exporter takes parameters to do some custom re-partitioning. By default, if neither of the 2 parameters below is given, the output dataset will have no partition.
--output-partition-field
This parameter uses an existing non-metadata field as the output partitions. All _hoodie_*
metadata field will be stripped during export.
spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
--output-partition-field "symbol" # assume the source dataset contains a field `symbol`
The output directory will look like this
`_SUCCESS symbol=AMRS symbol=AYX symbol=CDMO symbol=CRC symbol=DRNA ...`
--output-partitioner
This parameter takes in a fully-qualified name of a class that implements HoodieSnapshotExporter.Partitioner
. This parameter takes higher precedence than --output-partition-field
, which will be ignored if this is provided.
An example implementation is shown below:
MyPartitioner.java
package com.foo.bar;
public class MyPartitioner implements HoodieSnapshotExporter.Partitioner {
private static final String PARTITION_NAME = "date";
@Override
public DataFrameWriter<Row> partition(Dataset<Row> source) {
// use the current hoodie partition path as the output partition
return source
.withColumnRenamed(HoodieRecord.PARTITION_PATH_METADATA_FIELD, PARTITION_NAME)
.repartition(new Column(PARTITION_NAME))
.write()
.partitionBy(PARTITION_NAME);
}
}
After putting this class in my-custom.jar
, which is then placed on the job classpath, the submit command will look like this:
spark-submit \
--jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar,my-custom.jar" \
--deploy-mode "client" \
--class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
--source-base-path "/tmp/" \
--target-output-path "/tmp/exported/json/" \
--output-format "json" \
--output-partitioner "com.foo.bar.MyPartitioner"