Apache Hudi was originally developed at Uber to achieve low-latency database ingestion with high efficiency. It has been in production since August 2016, powering the massive 100PB data lake, including highly business-critical tables such as core trips, riders, and partners. It also powers several incremental Hive ETL pipelines and is currently being integrated into Uber’s data dispersal system.
Amazon Web Services
Amazon Web Services is the world’s leading cloud services provider. Apache Hudi comes pre-installed with the AWS Elastic MapReduce (EMR) offering, giving AWS users a way to perform record-level updates and deletes and to manage storage efficiently.
EMIS Health is the largest provider of Primary Care IT software in the UK, with datasets including more than 500Bn healthcare records. Hudi is used to manage their analytics datasets in production and keep them up to date with their upstream source. Presto is used to query the data written in Hudi format.
Yields.io is the first FinTech platform that uses AI for automated model validation and real-time monitoring on an enterprise-wide scale. Their data lake is managed by Hudi. They are also actively building their infrastructure for incremental, cross-language/platform machine learning using Hudi.
Hudi is used at Yotpo in several ways. First, Hudi is integrated as a writer in their open source ETL framework, https://github.com/YotpoLtd/metorikku, and is used as an output writer for a CDC pipeline, in which events generated from database binlog streams are sent to Kafka and then written to S3.
Tathastu.ai offers the largest AI/ML playground of consumer data for data scientists, AI experts and technologists to build upon. They have built a CDC pipeline using Apache Hudi and Debezium. Data from Hudi datasets is being queried using Hive, Presto and Spark.
Talks & Presentations
“Hoodie: Incremental processing on Hadoop at Uber” - By Vinoth Chandar & Prasanna Rajaperumal. Mar 2017, Strata + Hadoop World, San Jose, CA
“Hoodie: An Open Source Incremental Processing Framework From Uber” - By Vinoth Chandar. Apr 2017, DataEngConf, San Francisco, CA
“Incremental Processing on Large Analytical Datasets” - By Prasanna Rajaperumal. June 2017, Spark Summit 2017, San Francisco, CA
“Hudi: Unifying storage and serving for batch and near-real-time analytics” - By Nishith Agarwal & Balaji Varadarajan. September 2018, Strata Data Conference, New York, NY
“Hudi: Large-Scale, Near Real-Time Pipelines at Uber” - By Vinoth Chandar & Nishith Agarwal. October 2018, Spark+AI Summit Europe, London, UK
“Powering Uber’s global network analytics pipelines in real-time with Apache Hudi” - By Ethan Guo & Nishith Agarwal. April 2019, Data Council SF19, San Francisco, CA
“Building highly efficient data lakes using Apache Hudi (Incubating)” - By Vinoth Chandar. June 2019, SF Big Analytics Meetup, San Mateo, CA
“Apache Hudi (Incubating) - The Past, Present and Future Of Efficient Data Lake Architectures” - By Vinoth Chandar & Balaji Varadarajan. September 2019, ApacheCon NA 19, Las Vegas, NV, USA
“Insert, upsert, and delete data in Amazon S3 using Amazon EMR” - By Paul Codding & Vinoth Chandar. December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
“Building Robust CDC Pipeline With Apache Hudi And Debezium” - By Pratyaksh, Purushotham, Syed and Shaik. December 2019, Hadoop Summit Bangalore, India
- “The Case for incremental processing on Hadoop” - O’Reilly Ideas article by Vinoth Chandar
- “Hoodie: Uber Engineering’s Incremental Processing Framework on Hadoop” - Engineering Blog By Prasanna Rajaperumal