
AI-Native Lakehouse

Modern AI workloads demand more from the data lakehouse than traditional analytics ever did. Teams building retrieval-augmented generation (RAG) pipelines, recommendation systems, content moderation, and multimodal search need to store and query embeddings, images, audio, video, and documents alongside structured data — all with the transactional guarantees, incremental processing, and table services that Hudi is known for.

Apache Hudi's AI-native capabilities bring this vision to life with four foundational building blocks:

Core Capabilities

Vector Search

Store high-dimensional embedding vectors as first-class column types and run approximate nearest neighbor (ANN) search directly in Spark SQL.

-- Declare an embedding column
CREATE TABLE products (
id STRING,
name STRING,
embedding VECTOR(768)
) USING hudi;

-- Find the 10 most similar products
SELECT id, name, _hudi_distance
FROM hudi_vector_search('products', 'embedding', ARRAY(...), 10, 'cosine');

No external vector database required. Your embeddings live in the same table as your structured data, governed by the same transactions, schemas, and access controls.

Learn more about Vector Search →

BLOB Type for Unstructured Data

Store raw binary data (images, PDFs, audio clips, model weights) directly in Hudi tables using the BLOB type. Choose between two storage modes:

| Mode        | How it works                                              | Best for                                                  |
|-------------|-----------------------------------------------------------|-----------------------------------------------------------|
| Inline      | Bytes stored directly in the table row                    | Small objects (thumbnails, short audio clips)             |
| Out-of-line | Table stores a pointer; read_blob() resolves it on demand | Large objects (high-res images, video, model checkpoints) |

CREATE TABLE documents (
doc_id STRING,
content BLOB,
summary STRING
) USING hudi;

-- Read the raw bytes on demand
SELECT doc_id, read_blob(content) AS raw_bytes FROM documents WHERE doc_id = 'doc_001';

Out-of-line BLOBs keep your table footprint small (often less than 1% of total data size) while maintaining full queryability.

Learn more about Unstructured Data →

VARIANT Type for Semi-Structured Data

AI pipelines often deal with data whose shape is not known in advance: LLM outputs, model metadata, feature dictionaries, API responses. The VARIANT type stores semi-structured, JSON-like data with full transactional support — no rigid schema required.

CREATE TABLE llm_outputs (
request_id STRING,
response VARIANT,
ts BIGINT
) USING hudi;

-- Store any JSON structure
INSERT INTO llm_outputs VALUES (
'req_001',
parse_json('{"text": "...", "stop_reason": "end_turn", "tokens": 700}'),
1000
);

-- Query back as JSON
SELECT request_id, cast(response as STRING) FROM llm_outputs;

VARIANT supports optional shredding to extract hot fields into typed columnar storage for better query performance, while keeping the flexibility for everything else.
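For instance, hot fields can be read back with typed extraction. The sketch below uses Spark SQL's `variant_get` accessor (available in recent Spark releases; check your engine version), with the `llm_outputs` table from above:

```sql
-- Extract typed fields from the VARIANT column
-- (variant_get takes a JSONPath-like path and a target type)
SELECT
  request_id,
  variant_get(response, '$.stop_reason', 'string') AS stop_reason,
  variant_get(response, '$.tokens', 'int')         AS tokens
FROM llm_outputs
WHERE variant_get(response, '$.tokens', 'int') > 500;
```

With shredding enabled, frequently queried paths like `$.tokens` can be served from typed columnar storage instead of being re-parsed per row.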

Learn more about Semi-Structured Data →

Lance File Format

Hudi's pluggable file format architecture supports Lance, a modern columnar format purpose-built for AI/ML workloads. Lance provides:

  • Efficient vector indexing and ANN search
  • Fast random access for training data sampling
  • Optimized storage for high-dimensional arrays and nested structures

CREATE TABLE embeddings (...) USING hudi
TBLPROPERTIES (
hoodie.datasource.write.base.file.format = 'lance'
);

Lance integrates seamlessly with Hudi's table services (compaction, clustering, cleaning) and works alongside existing Parquet and ORC tables.

Learn more about the Lance File Format →

Why Hudi for AI Workloads?

Unified Storage for Structured + Unstructured Data

Most AI pipelines today span multiple systems: a data warehouse for metadata, an object store for raw files, a vector database for embeddings, and custom glue code to keep them in sync. Hudi collapses this into a single table:

                      Hudi Table
┌────────────┬──────────┬────────────────┬──────────────┐
│ image_id   │ breed    │ embedding      │ image_bytes  │
│ (STRING)   │ (STRING) │ (VECTOR(1024)) │ (BLOB)       │
├────────────┼──────────┼────────────────┼──────────────┤
│ pet_001    │ Corgi    │ [0.12, ...]    │ <137 KB PNG> │
│ pet_002    │ Tabby    │ [-.03, ...]    │ <89 KB JPEG> │
└────────────┴──────────┴────────────────┴──────────────┘

One table. One set of transactions. One schema. One set of access controls.
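The table sketched above takes one DDL statement to declare (the table name here is illustrative; VECTOR and BLOB are the types introduced earlier):

```sql
-- One table holding metadata, embeddings, and raw bytes together
CREATE TABLE pet_images (
  image_id    STRING,
  breed       STRING,
  embedding   VECTOR(1024),
  image_bytes BLOB
) USING hudi;
```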

Incremental Processing for Embedding Pipelines

When new data arrives, you do not need to re-embed your entire corpus. Hudi's incremental query capabilities let you process only new or changed records:

-- Get only new images since the last embedding run
SELECT * FROM hudi_table_changes('product_images', 'latest_state', '20260101000000');

This can reduce embedding pipeline costs by 10-100x compared to full reprocessing.
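Putting it together, an incremental refresh can feed only the changed rows through an embedding function and merge the results back. This is a sketch: `embed()` stands in for a UDF you would register yourself, and the table and column names are illustrative:

```sql
-- Re-embed only the rows that changed since the last checkpoint,
-- then merge the fresh vectors back into the same table.
MERGE INTO product_images AS t
USING (
  SELECT image_id, embed(image_bytes) AS embedding
  FROM hudi_table_changes('product_images', 'latest_state', '20260101000000')
) AS s
ON t.image_id = s.image_id
WHEN MATCHED THEN UPDATE SET t.embedding = s.embedding;
```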

Transactional Guarantees

Embedding updates, metadata changes, and raw data writes happen atomically. No more inconsistent states where your vector index points to deleted images or stale embeddings.
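Because every column lives in one table, a single commit covers all of them. As a minimal illustration (table name assumed from the earlier diagram):

```sql
-- One atomic commit removes the raw bytes, metadata, and embedding together,
-- so a search can never surface a vector for a deleted image.
DELETE FROM pet_images WHERE image_id = 'pet_001';
```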

Table Services

Hudi's background table services work on AI tables just like any other:

  • Clustering — co-locate similar vectors for better search locality
  • Compaction — merge incremental updates efficiently
  • Cleaning — reclaim storage from old versions
  • Indexing — maintain metadata indexes for fast lookups
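As a sketch, inline clustering can be enabled through table properties. The keys below follow Hudi's `hoodie.clustering.*` config namespace; consult the configuration reference for your release before relying on exact names and defaults:

```sql
-- Trigger clustering inline after every few commits (illustrative values)
ALTER TABLE embeddings SET TBLPROPERTIES (
  'hoodie.clustering.inline' = 'true',
  'hoodie.clustering.inline.max.commits' = '4'
);
```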

Open Ecosystem

Hudi tables are readable by Spark, Flink, Presto, Trino, and the native Python/Rust client (hudi-rs). Your AI tables are not locked into a single engine or vendor.

Use Cases

| Use Case                             | Hudi Capabilities Used                                                                                |
|--------------------------------------|-------------------------------------------------------------------------------------------------------|
| Image/Video Search                   | VECTOR embeddings + BLOB storage + cosine similarity search                                           |
| RAG (Retrieval-Augmented Generation) | VECTOR search to retrieve relevant document chunks for LLM context                                    |
| LLM Output Management                | VARIANT for flexible response storage, VECTOR for semantic indexing                                   |
| Recommendation Systems               | VECTOR similarity for collaborative filtering, incremental re-embedding                               |
| Content Moderation                   | BLOB for raw content + VECTOR for content embeddings + incremental processing                         |
| Multimodal Analytics                 | Structured metadata + VECTOR embeddings + BLOB raw data in one table                                  |
| ML Feature Store                     | VECTOR for feature embeddings, VARIANT for sparse feature maps, time-travel for point-in-time retrieval |
| Experiment Tracking                  | VARIANT for heterogeneous model configs and metrics, incremental queries for latest runs              |
| Data Labeling Pipelines              | BLOB for raw data, incremental queries for unlabeled data, ACID for label updates                     |

Getting Started

The fastest way to try these features is the AI Quick Start Guide, which walks you through an end-to-end image similarity search pipeline in under 30 minutes.