Apache Hudi - 2021 a Year in Review

January 6, 2022 · 5 min read

As the year came to an end, I took some time to reflect on where we are and what we accomplished in 2021. I am humbled by how strong our community is and how, despite another tough pandemic year, people from around the globe leaned in together and made this the best year yet for Apache Hudi. In this blog I want to recap some of the 2021 highlights.

Community

I want to call out how amazing it is to see such a diverse group of people step up and contribute to this project. There were over 30,000 interactions with the project on GitHub, up 2x from last year. Over the last year, 300 people contributed to the project, with over 3,000 PRs across 5 releases. We moved Apache Hudi from release 0.5.X all the way to our feature-packed 0.10.0 release. Come and join us on our active Slack channel! Over 850 community members engaged on our Slack, up about 100% from the year before. I want to give a special shout-out to our top Slack participants, who have helped answer so many questions and drive rich discussions on our channel: Sivabalan Narayanan, Nishith Agarwal, Bhavani Sudha Saktheeswaran, Vinay Patil, Rubens Soto, Dave Hagman, Raghav Tandon, Sagar Sumit, Joyan Sil, Jake D, Felix Jose, Nick Vintila, KimL, Andrew Sukhan, Danny Chan, Biswajit Mohapatra, and Pratyaksh Sharma! I know I am missing plenty of other important callouts. Every PR that landed this year has helped shape Hudi into what it is today. Thank you!

Impact

In 2021, I personally developed a deeper gratitude for, and understanding of, the magnitude of the impact we are making in the industry. Throughout the year I met more and more people who told me how Hudi transformed their business, and I was impressed by the large variety of use cases and applications that Hudi was able to serve. Some from the community who publicly shared their story include Amazon, GE, Robinhood, ByteDance, Halodoc, Baixin Bank, BiliBili, and so many more that haven’t even shared yet. One particular highlight from 2021 was attending AWS re:Invent and meeting an overwhelmingly large number of users who expressed joy with using Apache Hudi. Seeing just how many people depend on Apache Hudi raises my sense of responsibility even more.

New Features

Apache Hudi has come a long way in 2021, from v0.5.X to 0.10.0. Throughout this year we have developed innovative, leading-edge features that make it easier and easier to build streaming data lakes. Features added in 2021 include Spark SQL DML Support, Clustering, Z-Order/Hilbert curves, Metadata Table file listing elimination, Timeline Server Markers, Precommit Validators, Flink MOR write/read, Parallel Write support with OCC, Incremental Queries for MOR, Kafka Connect Sink, Delta Streamer sources for S3 and Debezium, and DBT Support. To top it all off, we put together a manifesto to realize our vision for streaming data lakes.

The Road Ahead

2021 may have been our best year so far, but it still feels like we are just getting started when we look at our new year's resolutions for 2022. In the year ahead we have bold plans to realize the first cut of our entire vision and take Hudi to 1.0, which includes full-featured multi-modal indexing for faster writes and queries, pathbreaking lock-free concurrency, new server components for caching and metadata, and finally Flink-based incremental materialized views! You can find our detailed roadmap here.

I look forward to continued collaboration with the growing Hudi community! Come join our community events and discussions in our Slack channel! Happy new year 2022!