Skip to main content
Version: Current

AI Quick Start

This guide shows how Hudi 1.2.0 brings AI-native data types to the lakehouse. You will store image embeddings (VECTOR) and raw image bytes (BLOB) in a single Hudi table, then run a similarity search that finds the top-K nearest neighbors and materializes their images — in one SQL query.

tip

Want to try this locally? This guide is also available as an interactive Jupyter notebook. Download the notebook and run it end-to-end on your machine.

Vector search results — query image on the left, top-5 nearest neighbors on the right

Example output: a query image of a German Shorthaired pointer (left) and the five most similar images found by hudi_vector_search, with cosine similarity scores. Raw image bytes are materialized by read_blob() directly in the same query.

Prerequisites

RequirementVersion
Java11+
Python3.10 – 3.12
Apache Spark3.5.x
Hudi Spark bundle1.2.0+
pip install pyspark==3.5.* pyarrow>=14.0.0 \
torch>=2.3.0 torchvision>=0.18.0 timm>=1.0.9 \
scikit-learn>=1.4.2 numpy>=1.26.0 pillow>=10.3.0 matplotlib>=3.8.0

1. Start Spark with Hudi

import os
from pathlib import Path
from pyspark.sql import SparkSession

HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR", "hudi-spark3.5-bundle_2.12-1.2.0.jar")

spark = (
SparkSession.builder
.appName("Hudi-AI-QuickStart")
.config("spark.jars", HUDI_JAR)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions",
"org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
.getOrCreate()
)

2. Load images and generate embeddings

Load the Oxford-IIIT Pet dataset (37 breeds) and generate 1024-dim embeddings with MobileNetV3.

import io, torch, timm, numpy as np
from sklearn.preprocessing import normalize
from PIL import Image
from torchvision.datasets import OxfordIIITPet

N_SAMPLES = 250

ds = OxfordIIITPet(root="~/.cache/torchvision", split="trainval", download=True)
indices = np.random.default_rng().choice(len(ds), size=N_SAMPLES, replace=False)

# Collect images as PNG bytes
data = []
for idx in indices:
img, label = ds[int(idx)]
buf = io.BytesIO(); img.convert("RGB").save(buf, format="PNG")
data.append({
"image_id": f"pets_{int(idx):06d}",
"category": ds.classes[label],
"label": int(label),
"image_bytes_raw": buf.getvalue(),
})

# Generate embeddings
model = timm.create_model("mobilenetv3_small_100", pretrained=True, num_classes=0)
model.eval()
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

batch = torch.stack([
transform(Image.open(io.BytesIO(d["image_bytes_raw"])).convert("RGB"))
for d in data
])
with torch.no_grad():
feats = normalize(model(batch).numpy())

for i, d in enumerate(data):
d["embedding"] = feats[i].tolist()

DIM = feats.shape[1] # 1024

3. Create table and insert data

Declare VECTOR(1024) for embeddings and BLOB for raw image bytes — first-class Hudi column types.

from pyspark.sql import Row
from pyspark.sql.types import *

# Register as a Spark temp view
schema = StructType([
StructField("image_id", StringType(), False),
StructField("category", StringType(), False),
StructField("label", IntegerType(), False),
StructField("image_bytes_raw", BinaryType(), False),
StructField("embedding", ArrayType(FloatType(), containsNull=False), False),
])
rows = [Row(d["image_id"], d["category"], d["label"],
d["image_bytes_raw"], d["embedding"]) for d in data]
spark.createDataFrame(rows, schema).createOrReplaceTempView("staging")
CREATE TABLE pets (
image_id STRING,
category STRING,
label INT,
image_bytes BLOB,
embedding VECTOR(1024)
) USING hudi
LOCATION '/tmp/hudi_pets'
TBLPROPERTIES (
primaryKey = 'image_id',
preCombineField = 'image_id',
type = 'cow',
'hoodie.table.base.file.format' = 'parquet',
'hoodie.write.record.merge.custom.implementation.classes'
= 'org.apache.hudi.DefaultSparkRecordMerger'
);

INSERT INTO pets
SELECT image_id, category, label,
named_struct(
'type', 'INLINE',
'data', image_bytes_raw,
'reference', cast(null as struct<
external_path:string, offset:bigint,
length:bigint, managed:boolean>)
) AS image_bytes,
embedding
FROM staging;

Key points:

  • VECTOR(1024) stores fixed-dimension embeddings for similarity search.
  • BLOB stores raw image bytes inline. For large objects, use OUT_OF_LINE to store a pointer instead — read_blob() resolves both modes transparently.

4. Materialize a BLOB with read_blob()

read_blob() is Hudi's BLOB accessor. Pass it a BLOB column, get back raw binary. Works the same for inline bytes and out-of-line references.

SELECT image_id, category,
length(read_blob(image_bytes)) AS byte_count
FROM pets
LIMIT 5;
+-----------+--------------------+----------+
| image_id| category|byte_count|
+-----------+--------------------+----------+
|pets_002081| Beagle| 249983|
|pets_003404| Shiba Inu| 349745|
|pets_001939| American Bulldog| 267667|
|pets_002457|English Cocker Sp..| 364492|
|pets_003538|Staffordshire Bul..| 427728|
+-----------+--------------------+----------+

A sample image retrieved via read_blob — a sleeping Beagle

Image bytes retrieved by read_blob(), decoded back to a PNG — confirming lossless round-trip through the Hudi BLOB column.

5. Vector search + BLOB retrieval in one query

The headline: hudi_vector_search finds the top-K nearest neighbors by cosine similarity, and read_blob() materializes image bytes only for the matching rows.

SELECT image_id,
category,
read_blob(image_bytes) AS resolved_bytes,
_hudi_distance
FROM hudi_vector_search(
'/tmp/hudi_pets', -- table path
'embedding', -- VECTOR column
ARRAY(0.12, -0.03, ...), -- query embedding (1024 floats)
5, -- top-K
'cosine' -- distance metric
)
ORDER BY _hudi_distance;
+-----------+--------------------+-----------+
| image_id| category| distance |
+-----------+--------------------+-----------+
|pets_002575| German Shorthaired| 0.378 |
|pets_000703| German Shorthaired| 0.484 |
|pets_002562| German Shorthaired| 0.598 |
|pets_002556| German Shorthaired| 0.607 |
|pets_003538|Staffordshire Bul..| 0.641 |
+-----------+--------------------+-----------+

Vector search results panel

Query: German Shorthaired pointer (left). Top-5 results ranked by cosine similarity — the search correctly identifies same-breed images as the closest matches.

One SQL query. No external vector database. Embeddings, raw images, and metadata all live in one transactional Hudi table.

6. Visualize results

import matplotlib.pyplot as plt
from PIL import Image

fig, axes = plt.subplots(1, len(results) + 1, figsize=(3 * (len(results) + 1), 3.2))

# Query image
axes[0].imshow(Image.open(io.BytesIO(query_bytes)))
axes[0].set_title("QUERY", fontweight="bold"); axes[0].axis("off")

# Top-K matches
for i, row in enumerate(results):
img = Image.open(io.BytesIO(bytes(row["resolved_bytes"])))
sim = 1.0 - float(row["_hudi_distance"])
axes[i+1].imshow(img)
axes[i+1].set_title(f"{row['category']}\nSim: {sim:.3f}")
axes[i+1].axis("off")

plt.tight_layout()
plt.savefig("hudi_vector_search_results.png", dpi=150)

What's next

TopicLink
Full interactive notebook00_main_demo.ipynb
VECTOR type referenceVector Search — element types, batch TVF, distance metrics
BLOB type referenceUnstructured Data — inline vs. out-of-line, read_blob(), configs
VARIANT typeSemi-Structured Data — flexible JSON-like storage for LLM outputs, model metadata
Lance file formatLance File Format — vector-optimized storage
AI overviewAI-Native Lakehouse — architecture and use cases