Alibaba Cloud provides cloud computing services to online businesses and Alibaba’s own e-commerce ecosystem. Apache Hudi is integrated into Alibaba Cloud Data Lake Analytics, offering real-time analysis on Hudi datasets.
Amazon Web Services
Amazon Web Services is the world’s leading cloud services provider. Apache Hudi is pre-installed with the AWS Elastic MapReduce (EMR) offering, providing a means for AWS users to perform record-level updates/deletes and manage storage efficiently.
EMIS Health is the largest provider of Primary Care IT software in the UK, with datasets including more than 500Bn healthcare records. Hudi is used to manage their analytics datasets in production and keep them up to date with their upstream source. Presto is used to query the data written in Hudi format.
Kyligence is the leading Big Data analytics platform company. We’ve built end-to-end solutions for various Global Fortune 500 companies in the US and China. We adopted Apache Hudi in our Cloud solution on AWS in 2019. With the help of Hudi, we are able to process upserts and deletes easily, and we use incremental views to build efficient data pipelines in AWS. The Hudi datasets can also be integrated into Kyligence Cloud directly for high-concurrency OLAP access.
Lingyue-digital Corporation belongs to BMW Group. Apache Hudi is used to ingest change data capture (CDC) data from MySQL and PostgreSQL. We build upsert scenarios on Hadoop and Spark.
The Hopsworks 1.x series supports Apache Hudi feature groups, enabling upserts and time travel.
SF-Express is the leading logistics service provider in China. Hudi is used to build a real-time data warehouse, providing real-time computing solutions with higher efficiency and lower cost for our business.
Tathastu.ai offers the largest AI/ML playground of consumer data for data scientists, AI experts and technologists to build upon. They have built a CDC pipeline using Apache Hudi and Debezium. Data from Hudi datasets is queried using Hive, Presto and Spark.
EMR from Tencent Cloud has integrated Hudi as one of its big data components since V2.2.0. Using Hudi, end users can handle either read-heavy or write-heavy use cases, and Hudi manages the underlying data stored on HDFS/COS/CHDFS using Apache Parquet and Apache Avro.
Apache Hudi was originally developed at Uber to achieve low-latency database ingestion with high efficiency. It has been in production since Aug 2016, powering the massive 100PB data lake, including highly business-critical tables such as core trips, riders, and partners. It also powers several incremental Hive ETL pipelines and is currently being integrated into Uber’s data dispersal system.
At Udemy, Apache Hudi on AWS EMR is used to ingest MySQL change data capture (CDC) data.
Yields.io is the first FinTech platform that uses AI for automated model validation and real-time monitoring on an enterprise-wide scale. Their data lake is managed by Hudi. They are also actively building their infrastructure for incremental, cross-language/platform machine learning using Hudi.
Yotpo uses Hudi in several ways. First, Hudi is integrated as a writer in their open source ETL framework, Metorikku. It is also used as an output writer for a CDC pipeline, in which events generated from database binlog streams are sent to Kafka and then written to S3.
Talks & Presentations
“Hoodie: Incremental processing on Hadoop at Uber” - By Vinoth Chandar & Prasanna Rajaperumal Mar 2017, Strata + Hadoop World, San Jose, CA
“Hoodie: An Open Source Incremental Processing Framework From Uber” - By Vinoth Chandar. Apr 2017, DataEngConf, San Francisco, CA
“Incremental Processing on Large Analytical Datasets” - By Prasanna Rajaperumal June 2017, Spark Summit 2017, San Francisco, CA.
“Hudi: Unifying storage and serving for batch and near-real-time analytics” - By Nishith Agarwal & Balaji Varadarajan September 2018, Strata Data Conference, New York, NY
“Hudi: Large-Scale, Near Real-Time Pipelines at Uber” - By Vinoth Chandar & Nishith Agarwal October 2018, Spark+AI Summit Europe, London, UK
“Powering Uber’s global network analytics pipelines in real-time with Apache Hudi” - By Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco, CA.
“Building highly efficient data lakes using Apache Hudi (Incubating)” - By Vinoth Chandar June 2019, SF Big Analytics Meetup, San Mateo, CA
“Apache Hudi (Incubating) - The Past, Present and Future Of Efficient Data Lake Architectures” - By Vinoth Chandar & Balaji Varadarajan September 2019, ApacheCon NA 19, Las Vegas, NV, USA
“Insert, upsert, and delete data in Amazon S3 using Amazon EMR” - By Paul Codding & Vinoth Chandar December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
“Building Robust CDC Pipeline With Apache Hudi And Debezium” - By Pratyaksh, Purushotham, Syed and Shaik December 2019, Hadoop Summit Bangalore, India
“Using Apache Hudi to build the next-generation data lake and its application in medical big data” - By JingHuang & Leesf March 2020, Apache Hudi & Apache Kylin Online Meetup, China
“Building a near real-time, high-performance data warehouse based on Apache Hudi and Apache Kylin” - By ShaoFeng Shi March 2020, Apache Hudi & Apache Kylin Online Meetup, China
- “The Case for incremental processing on Hadoop” - O’Reilly Ideas article by Vinoth Chandar
- “Hoodie: Uber Engineering’s Incremental Processing Framework on Hadoop” - Engineering Blog By Prasanna Rajaperumal