Unstructured Data Quick Start
AI and machine learning pipelines (RAG, recommendation, multimodal search) need to store and query
embeddings, raw bytes, and structured metadata side by side. Hudi's VECTOR and BLOB column
types let you keep all of those in a single transactional table. This guide walks through them end
to end: you will store image embeddings (VECTOR) and raw image bytes (BLOB) in a single Hudi
table, then run a top-K similarity search and materialize the matching images in one SQL query.
Want to try this locally? This guide is also available as an interactive Jupyter notebook. Download the notebook and run it end-to-end on your machine.

Example output: a query image of a German Shorthaired pointer (left) and the five most similar images found by hudi_vector_search, with cosine similarity scores. Raw image bytes are materialized by read_blob() directly in the same query.
Prerequisites
| Requirement | Version |
|---|---|
| Java | 11+ |
| Python | 3.10 – 3.12 |
| Apache Spark | 3.5.x |
| Hudi Spark bundle | 1.2.0+ |
pip install pyspark==3.5.* pyarrow>=14.0.0 \
torch>=2.3.0 torchvision>=0.18.0 timm>=1.0.9 \
scikit-learn>=1.4.2 numpy>=1.26.0 pillow>=10.3.0 matplotlib>=3.8.0
1. Start Spark with Hudi
import os
from pathlib import Path
from pyspark.sql import SparkSession
HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR", "hudi-spark3.5-bundle_2.12-1.2.0.jar")
spark = (
SparkSession.builder
.appName("Hudi-Unstructured-Data-QuickStart")
.config("spark.jars", HUDI_JAR)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.extensions",
"org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.hudi.catalog.HoodieCatalog")
.getOrCreate()
)
2. Load images and generate embeddings
Load the Oxford-IIIT Pet dataset (37 breeds) and generate 1024-dim embeddings with MobileNetV3.
import io, torch, timm, numpy as np
from sklearn.preprocessing import normalize
from PIL import Image
from torchvision.datasets import OxfordIIITPet
N_SAMPLES = 250
ds = OxfordIIITPet(root="~/.cache/torchvision", split="trainval", download=True)
indices = np.random.default_rng().choice(len(ds), size=N_SAMPLES, replace=False)
# Collect images as PNG bytes
data = []
for idx in indices:
img, label = ds[int(idx)]
buf = io.BytesIO(); img.convert("RGB").save(buf, format="PNG")
data.append({
"image_id": f"pets_{int(idx):06d}",
"category": ds.classes[label],
"label": int(label),
"image_bytes_raw": buf.getvalue(),
})
# Generate embeddings
model = timm.create_model("mobilenetv3_small_100", pretrained=True, num_classes=0)
model.eval()
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)
batch = torch.stack([
transform(Image.open(io.BytesIO(d["image_bytes_raw"])).convert("RGB"))
for d in data
])
with torch.no_grad():
feats = normalize(model(batch).numpy())
for i, d in enumerate(data):
d["embedding"] = feats[i].tolist()
DIM = feats.shape[1] # 1024
3. Create table and insert data
Declare VECTOR(1024) for embeddings and BLOB for raw image bytes.
from pyspark.sql import Row
from pyspark.sql.types import *
# Register as a Spark temp view
schema = StructType([
StructField("image_id", StringType(), False),
StructField("category", StringType(), False),
StructField("label", IntegerType(), False),
StructField("image_bytes_raw", BinaryType(), False),
StructField("embedding", ArrayType(FloatType(), containsNull=False), False),
])
rows = [Row(d["image_id"], d["category"], d["label"],
d["image_bytes_raw"], d["embedding"]) for d in data]
spark.createDataFrame(rows, schema).createOrReplaceTempView("staging")
CREATE TABLE pets (
image_id STRING,
category STRING,
label INT,
image_bytes BLOB,
embedding VECTOR(1024)
) USING hudi
LOCATION '/tmp/hudi_pets'
TBLPROPERTIES (
primaryKey = 'image_id',
preCombineField = 'image_id',
type = 'cow',
'hoodie.table.base.file.format' = 'parquet',
'hoodie.write.record.merge.custom.implementation.classes'
= 'org.apache.hudi.DefaultSparkRecordMerger'
);
INSERT INTO pets
SELECT image_id, category, label,
named_struct(
'type', 'INLINE',
'data', image_bytes_raw,
'reference', cast(null as struct<
external_path:string, offset:bigint,
length:bigint, managed:boolean>)
) AS image_bytes,
embedding
FROM staging;
Notes:
VECTOR(1024)stores fixed-dimension embeddings for similarity search.BLOBstores raw image bytes inline. For large objects, useOUT_OF_LINEto store a pointer instead.read_blob()resolves both modes transparently.
4. Materialize a BLOB with read_blob()
read_blob() is Hudi's BLOB accessor. Pass it a BLOB column, get back raw binary. Works the same
for inline bytes and out-of-line references.
SELECT image_id, category,
length(read_blob(image_bytes)) AS byte_count
FROM pets
LIMIT 5;
+-----------+--------------------+----------+
| image_id| category|byte_count|
+-----------+--------------------+----------+
|pets_002081| Beagle| 249983|
|pets_003404| Shiba Inu| 349745|
|pets_001939| American Bulldog| 267667|
|pets_002457|English Cocker Sp..| 364492|
|pets_003538|Staffordshire Bul..| 427728|
+-----------+--------------------+----------+

Image bytes retrieved by read_blob(), decoded back to a PNG. The round-trip through the Hudi BLOB column is lossless.
5. Vector search + BLOB retrieval in one query
hudi_vector_search returns the top-K nearest neighbors by cosine similarity; read_blob()
materializes image bytes only for the matching rows.
SELECT image_id,
category,
read_blob(image_bytes) AS resolved_bytes,
_hudi_distance
FROM hudi_vector_search(
'/tmp/hudi_pets', -- table path
'embedding', -- VECTOR column
ARRAY(0.12, -0.03, ...), -- query embedding (1024 floats)
5, -- top-K
'cosine' -- distance metric
)
ORDER BY _hudi_distance;
+-----------+--------------------+-----------+
| image_id| category| distance |
+-----------+--------------------+-----------+
|pets_002575| German Shorthaired| 0.378 |
|pets_000703| German Shorthaired| 0.484 |
|pets_002562| German Shorthaired| 0.598 |
|pets_002556| German Shorthaired| 0.607 |
|pets_003538|Staffordshire Bul..| 0.641 |
+-----------+--------------------+-----------+

Query: German Shorthaired pointer (left). Top-5 results ranked by cosine similarity.
6. Visualize results
import matplotlib.pyplot as plt
from PIL import Image
fig, axes = plt.subplots(1, len(results) + 1, figsize=(3 * (len(results) + 1), 3.2))
# Query image
axes[0].imshow(Image.open(io.BytesIO(query_bytes)))
axes[0].set_title("QUERY", fontweight="bold"); axes[0].axis("off")
# Top-K matches
for i, row in enumerate(results):
img = Image.open(io.BytesIO(bytes(row["resolved_bytes"])))
sim = 1.0 - float(row["_hudi_distance"])
axes[i+1].imshow(img)
axes[i+1].set_title(f"{row['category']}\nSim: {sim:.3f}")
axes[i+1].axis("off")
plt.tight_layout()
plt.savefig("hudi_vector_search_results.png", dpi=150)
What's next
| Topic | Link |
|---|---|
| Full interactive notebook | 00_main_demo.ipynb |
| VECTOR type reference | VECTOR in SQL DDL · SQL DML · DataFrame writes · hudi_vector_search in SQL Queries |
| BLOB type reference | BLOB in SQL DDL · SQL DML · DataFrame writes · read_blob() in SQL Queries |
| VARIANT type | VARIANT in SQL DDL · SQL DML · DataFrame writes · Querying VARIANT |
| Lance file format | Storage Layouts → Lance · DataFrame writes |
| AI lakehouse use cases | Use Cases → AI Lakehouse |