Data Quality
Apache Hudi provides Pre-Commit Validators, which let you check that your data meets your data quality expectations as you write it with DeltaStreamer or the Spark Datasource writers.
To configure pre-commit validators, set hoodie.precommit.validators=<comma separated list of validator class names>.
Example:
df.write.format("hudi")
.option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator")
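Because the setting takes a comma-separated list, several validators can be enabled for the same write. The following is a minimal sketch of assembling that value in plain Java; the class name ValidatorOptions and the build helper are hypothetical and not part of Hudi, and only the configuration key and validator class names come from this page.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ValidatorOptions {
    // Configuration key as documented above.
    static final String VALIDATORS_KEY = "hoodie.precommit.validators";

    // Joins the validator class names with ',' and stores them under the key.
    public static Map<String, String> build(List<String> validatorClassNames) {
        Map<String, String> options = new HashMap<>();
        options.put(VALIDATORS_KEY, String.join(",", validatorClassNames));
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> options = build(Arrays.asList(
                "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator",
                "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator"));
        // Prints the two class names joined by a comma.
        System.out.println(options.get(VALIDATORS_KEY));
    }
}
```

A map built this way can be handed to the Spark writer through its options(...) method. Each validator in the list still needs its own query setting, as described in the sections that follow.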
Today you can use any of the following validators, or extend them with your own:
SQL Query Single Result
Can be used to validate that a query on the table results in a specific value.
Multiple queries separated by the ';' delimiter are supported. The expected result is included as part of each query, separated from it by '#'. Example: query1#result1;query2#result2
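To make the query1#result1;query2#result2 layout concrete, here is a minimal sketch of how such a string breaks down into query and expected-result pairs. It is an illustration only and not Hudi's own parser; the class name SingleResultQuerySpec and the rule that each entry must contain exactly one '#' are assumptions made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SingleResultQuerySpec {
    // Splits "query1#result1;query2#result2" into an ordered
    // query -> expected-result map.
    public static Map<String, String> parse(String spec) {
        Map<String, String> expectations = new LinkedHashMap<>();
        for (String entry : spec.split(";")) {
            if (entry.trim().isEmpty()) {
                continue; // tolerate empty entries between delimiters
            }
            String[] parts = entry.split("#");
            if (parts.length != 2) {
                // Assumption: exactly one '#' per entry.
                throw new IllegalArgumentException(
                        "Expected query#result, got: " + entry);
            }
            expectations.put(parts[0].trim(), parts[1].trim());
        }
        return expectations;
    }

    public static void main(String[] args) {
        String spec = "select count(*) from <TABLE_NAME> where col is null#0;"
                + "select count(distinct id) from <TABLE_NAME>#100";
        parse(spec).forEach((query, expected) ->
                System.out.println(query + " => " + expected));
    }
}
```

One consequence of this layout is that a query cannot itself contain ';' or '#', since both characters are reserved as delimiters.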
Example, "expect exactly 0 null rows":
import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
df.write.format("hudi").mode(Overwrite).
option(TABLE_NAME, tableName).
option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
option("hoodie.precommit.validators.single.value.sql.queries", "select count(*) from <TABLE_NAME> where col is null#0").
save(basePath)
SQL Query Equality
Can be used to validate that a query returns the same rows before and after the commit.
Example, "expect no change of null rows with this commit":
import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
df.write.format("hudi").mode(Overwrite).
option(TABLE_NAME, tableName).
option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
option("hoodie.precommit.validators.equality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
save(basePath)
SQL Query Inequality
Can be used to validate that a query returns different rows before and after the commit.
Example, "expect there must be a change of null rows with this commit":
import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
df.write.format("hudi").mode(Overwrite).
option(TABLE_NAME, tableName).
option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
option("hoodie.precommit.validators.inequality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
save(basePath)
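The equality and inequality validators apply opposite pass conditions to the same before/after comparison. The sketch below shows that relationship in plain Java, with query results modelled as lists of strings. It is a conceptual illustration, not Hudi's implementation; the class name BeforeAfterCheck and both method names are hypothetical, and the comparison here is order-sensitive for simplicity.

```java
import java.util.Arrays;
import java.util.List;

public class BeforeAfterCheck {
    // Equality-style check: passes only when the commit leaves
    // the query result unchanged.
    public static boolean equalityPasses(List<String> before, List<String> after) {
        return before.equals(after);
    }

    // Inequality-style check: passes only when the commit changes
    // the query result.
    public static boolean inequalityPasses(List<String> before, List<String> after) {
        return !before.equals(after);
    }

    public static void main(String[] args) {
        // Result of "select count(*) ... where col is null" before and after.
        List<String> before = Arrays.asList("3");
        List<String> after = Arrays.asList("5");
        System.out.println("equality passes:   " + equalityPasses(before, after));
        System.out.println("inequality passes: " + inequalityPasses(before, after));
    }
}
```

For any given pair of results, exactly one of the two checks passes, so configuring both validators with the same query would always reject the commit.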
Extend Custom Validator
Users can also provide their own implementations by extending the abstract class SparkPreCommitValidator and overriding this method:
void validateRecordsBeforeAndAfter(Dataset<Row> before,
Dataset<Row> after,
Set<String> partitionsAffected)
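The shape of a custom check can be sketched without Spark or Hudi on the classpath. In the example below, BeforeAfterValidator is a hypothetical stand-in that mirrors only the method signature shown above, with plain lists in place of Dataset<Row>; a real validator would extend SparkPreCommitValidator instead. The "table must not shrink" rule and the use of IllegalStateException to reject a commit are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class RowCountValidatorSketch {
    // Hypothetical stand-in for the validator hook; lists replace
    // Dataset<Row> so the sketch runs on its own.
    abstract static class BeforeAfterValidator {
        abstract void validateRecordsBeforeAndAfter(List<String> before,
                                                    List<String> after,
                                                    Set<String> partitionsAffected);
    }

    // Example rule (not a built-in Hudi validator): a commit must
    // never reduce the number of rows.
    static class NoShrinkValidator extends BeforeAfterValidator {
        @Override
        void validateRecordsBeforeAndAfter(List<String> before,
                                           List<String> after,
                                           Set<String> partitionsAffected) {
            if (after.size() < before.size()) {
                throw new IllegalStateException("Row count dropped from "
                        + before.size() + " to " + after.size()
                        + " in partitions " + partitionsAffected);
            }
        }
    }

    // Returns true when the validator accepts the commit, false when it rejects it.
    public static boolean accepts(List<String> before, List<String> after) {
        try {
            new NoShrinkValidator().validateRecordsBeforeAndAfter(
                    before, after, Collections.singleton("2024/01/01"));
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList("r1"), Arrays.asList("r1", "r2")));
        System.out.println(accepts(Arrays.asList("r1", "r2"), Arrays.asList("r1")));
    }
}
```

The pattern to carry over is that the method returns normally when the data is acceptable and throws when it is not, using the before and after views to decide.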