Data Quality
Data quality refers to the overall accuracy, completeness, consistency, and validity of data. Ensuring data quality is vital for accurate analysis and reporting, as well as for compliance with regulations and maintaining trust in your organization's data infrastructure.
Hudi provides Pre-Commit Validators, which check that your data meets your data quality expectations before a write is committed. They apply when you write with Hudi Streamer or the Spark Datasource writer.
Pre-commit validators are skipped when using the BULK_INSERT write operation type.
To enable validators, set hoodie.precommit.validators to one or more validator class names. Separate multiple class names with a comma (,).
Syntax: hoodie.precommit.validators=class_name1,class_name2
Example:
df.write.format("hudi")
  .option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator")
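When several validators are enabled, the comma-separated value can be assembled from a list rather than written by hand. The following is a minimal sketch in plain Scala. The object name ValidatorOptions and its members are hypothetical names chosen for illustration; they are not part of Hudi.

```scala
// Hypothetical helper, not part of Hudi: builds the writer option
// that enables several pre-commit validators at once.
object ValidatorOptions {
  // Fully qualified validator class names; both appear on this page.
  val validatorClasses: Seq[String] = Seq(
    "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator",
    "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator"
  )

  // The config value is the class names joined by commas.
  val options: Map[String, String] = Map(
    "hoodie.precommit.validators" -> validatorClasses.mkString(",")
  )
}
```

The resulting map could then be passed to the writer with df.write.format("hudi").options(ValidatorOptions.options), alongside whatever query settings each validator requires.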
You can use any of the following validators, or write your own:
SQL Query Single Result
org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator
The SQL Query Single Result validator runs a SQL query against the table and aborts the commit if the result does not match the expected value.
Separate multiple queries with a semicolon (;). Append the expected result to each query after a hash sign (#).
Syntax: query1#result1;query2#result2
Example:
// In this example, the validator expects that no row has a null value in the `col` column.
import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
import org.apache.spark.sql.SaveMode.Overwrite

df.write.format("hudi").mode(Overwrite).
  // "hoodie.table.name" is the literal config key for the table name.
  option("hoodie.table.name", tableName).
  option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
  option("hoodie.precommit.validators.single.value.sql.queries", "select count(*) from <TABLE_NAME> where col is null#0").
  save(basePath)
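The query1#result1;query2#result2 value can also be built from pairs of query and expected result, which helps keep the delimiters straight once more than one check is involved. This is a sketch under stated assumptions: SingleResultQueries is a hypothetical helper, not a Hudi API, and the col2 column in the sample is invented for illustration.

```scala
// Hypothetical helper, not part of Hudi: formats (query, expectedResult)
// pairs using the query1#result1;query2#result2 syntax.
object SingleResultQueries {
  def format(expectations: Seq[(String, String)]): String = {
    expectations.foreach { case (query, expected) =>
      // '#' and ';' are the delimiters, so this simple sketch rejects
      // queries or results that contain either character.
      require(!query.contains("#") && !query.contains(";"),
        s"query contains a delimiter: ${query}")
      require(!expected.contains("#") && !expected.contains(";"),
        s"result contains a delimiter: ${expected}")
    }
    expectations
      .map { case (query, expected) => s"${query}#${expected}" }
      .mkString(";")
  }

  // Sample input: `col` comes from the example on this page,
  // `col2` is an assumed second column used only for illustration.
  val example: String = format(Seq(
    "select count(*) from <TABLE_NAME> where col is null" -> "0",
    "select count(*) from <TABLE_NAME> where col2 is null" -> "0"
  ))
}
```

The formatted string would be supplied as the value of hoodie.precommit.validators.single.value.sql.queries. Rejecting embedded delimiters is a design choice of this sketch; this page does not describe an escaping mechanism, so none is assumed.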