Version: 1.2.0

Unstructured Data Quick Start

AI and machine learning pipelines (RAG, recommendation, multimodal search) need to store and query embeddings, raw bytes, and structured metadata side by side. Hudi's VECTOR and BLOB column types let you keep all of those in a single transactional table. This guide walks through them end to end: you will store image embeddings (VECTOR) and raw image bytes (BLOB) in a single Hudi table, then run a top-K similarity search and materialize the matching images in one SQL query.

tip

Want to try this locally? This guide is also available as an interactive Jupyter notebook. Download the notebook and run it end-to-end on your machine.

Vector search results: query image on the left, top-5 nearest neighbors on the right

Example output: a query image of a German Shorthaired Pointer (left) and the five most similar images found by hudi_vector_search, with cosine similarity scores. The raw image bytes are materialized by read_blob() in the same query.

Prerequisites

Requirement          Version
Java                 11+
Python               3.10 – 3.12
Apache Spark         3.5.x
Hudi Spark bundle    1.2.0+

pip install "pyspark==3.5.*" "pyarrow>=14.0.0" \
    "torch>=2.3.0" "torchvision>=0.18.0" "timm>=1.0.9" \
    "scikit-learn>=1.4.2" "numpy>=1.26.0" "pillow>=10.3.0" "matplotlib>=3.8.0"

1. Start Spark with Hudi

import os
from pathlib import Path
from pyspark.sql import SparkSession

HUDI_JAR = os.getenv("HUDI_BUNDLE_JAR", "hudi-spark3.5-bundle_2.12-1.2.0.jar")

spark = (
    SparkSession.builder
    .appName("Hudi-Unstructured-Data-QuickStart")
    .config("spark.jars", HUDI_JAR)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)

2. Load images and generate embeddings

Load the Oxford-IIIT Pet dataset (37 breeds), sample 250 images, and generate L2-normalized 1024-dimensional embeddings with MobileNetV3 (the timm mobilenetv3_small_100 model).

import io

import numpy as np
import timm
import torch
from PIL import Image
from sklearn.preprocessing import normalize
from torchvision.datasets import OxfordIIITPet

N_SAMPLES = 250

ds = OxfordIIITPet(root="~/.cache/torchvision", split="trainval", download=True)
indices = np.random.default_rng().choice(len(ds), size=N_SAMPLES, replace=False)

# Collect images as PNG bytes
data = []
for idx in indices:
    img, label = ds[int(idx)]
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="PNG")
    data.append({
        "image_id": f"pets_{int(idx):06d}",
        "category": ds.classes[label],
        "label": int(label),
        "image_bytes_raw": buf.getvalue(),
    })

# Generate embeddings (num_classes=0 drops the classifier head and returns pooled features)
model = timm.create_model("mobilenetv3_small_100", pretrained=True, num_classes=0)
model.eval()
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

batch = torch.stack([
    transform(Image.open(io.BytesIO(d["image_bytes_raw"])).convert("RGB"))
    for d in data
])
with torch.no_grad():
    feats = normalize(model(batch).numpy())  # L2-normalize each row

for i, d in enumerate(data):
    d["embedding"] = feats[i].tolist()

DIM = feats.shape[1]  # 1024
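The embeddings above are L2-normalized by scikit-learn's normalize before they are stored. As a minimal pure-Python sketch of what that step does to a single vector, and why it is convenient for cosine search: once two vectors have unit length, their cosine similarity is just their dot product. The helper names l2_normalize and cosine_similarity are illustrative only and are not part of Hudi or scikit-learn.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity from the definition: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for 1024-dimensional embeddings.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# For unit-length vectors the plain dot product equals the cosine similarity.
dot = sum(x * y for x, y in zip(a, b))
```

Here dot matches cosine_similarity([3.0, 4.0], [4.0, 3.0]) up to floating-point rounding.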

3. Create table and insert data

Declare VECTOR(1024) for embeddings and BLOB for raw image bytes.

from pyspark.sql import Row
from pyspark.sql.types import (
    ArrayType, BinaryType, FloatType, IntegerType,
    StringType, StructField, StructType,
)

# Register as a Spark temp view
schema = StructType([
    StructField("image_id", StringType(), False),
    StructField("category", StringType(), False),
    StructField("label", IntegerType(), False),
    StructField("image_bytes_raw", BinaryType(), False),
    StructField("embedding", ArrayType(FloatType(), containsNull=False), False),
])
rows = [Row(d["image_id"], d["category"], d["label"],
            d["image_bytes_raw"], d["embedding"]) for d in data]
spark.createDataFrame(rows, schema).createOrReplaceTempView("staging")

CREATE TABLE pets (
    image_id    STRING,
    category    STRING,
    label       INT,
    image_bytes BLOB,
    embedding   VECTOR(1024)
) USING hudi
LOCATION '/tmp/hudi_pets'
TBLPROPERTIES (
    primaryKey = 'image_id',
    preCombineField = 'image_id',
    type = 'cow',
    'hoodie.table.base.file.format' = 'parquet',
    'hoodie.write.record.merge.custom.implementation.classes'
        = 'org.apache.hudi.DefaultSparkRecordMerger'
);

INSERT INTO pets
SELECT image_id, category, label,
       named_struct(
           'type', 'INLINE',
           'data', image_bytes_raw,
           'reference', cast(null as struct<
               external_path:string, offset:bigint,
               length:bigint, managed:boolean>)
       ) AS image_bytes,
       embedding
FROM staging;

Notes:

  • VECTOR(1024) stores fixed-dimension embeddings for similarity search.
  • BLOB stores raw image bytes. The INSERT above uses type INLINE, so the bytes live in the table itself and the reference field is null. For large objects, use OUT_OF_LINE to store a pointer instead. read_blob() resolves both modes transparently.
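Because VECTOR(1024) is a fixed-dimension type, it can help to check the staged embeddings before running the INSERT. The following is a minimal sketch, assuming records shaped like the data dictionaries built in step 2. check_embedding_dims is a hypothetical helper, not a Hudi API, and how Hudi itself reacts to a vector of the wrong length is not covered by this guide.

```python
def check_embedding_dims(records, expected_dim):
    """Return the image_ids whose embedding length differs from expected_dim."""
    return [r["image_id"] for r in records if len(r["embedding"]) != expected_dim]


# Toy records standing in for the `data` list from step 2.
sample = [
    {"image_id": "pets_000001", "embedding": [0.0] * 1024},
    {"image_id": "pets_000002", "embedding": [0.0] * 1000},  # wrong length
]

bad = check_embedding_dims(sample, expected_dim=1024)
```

In the guide's flow this would be called as check_embedding_dims(data, DIM) just before creating the staging view, and an empty list means every row matches the declared dimension.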

4. Materialize a BLOB with read_blob()

read_blob() is Hudi's BLOB accessor: pass it a BLOB column and it returns the raw binary. It works the same way for inline bytes and out-of-line references.

SELECT image_id, category,
       length(read_blob(image_bytes)) AS byte_count
FROM pets
LIMIT 5;

+-----------+--------------------+----------+
|   image_id|            category|byte_count|
+-----------+--------------------+----------+
|pets_002081|              Beagle|    249983|
|pets_003404|           Shiba Inu|    349745|
|pets_001939|    American Bulldog|    267667|
|pets_002457|English Cocker Sp..|    364492|
|pets_003538|Staffordshire Bul..|    427728|
+-----------+--------------------+----------+

A sample image retrieved via read_blob, a sleeping Beagle

Image bytes retrieved by read_blob(), decoded back to a PNG. The round-trip through the Hudi BLOB column is lossless.
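One way to check the round trip locally is to compare a digest of the bytes before writing with a digest of the bytes that come back, and to confirm that the result still starts with the PNG signature. This is a sketch under assumptions: original stands in for image_bytes_raw from step 2, returned stands in for the value of read_blob(image_bytes) after it is collected into Python, and both are placeholders here. The helper names are hypothetical.

```python
import hashlib

# Every PNG file starts with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data):
    """True if the buffer begins with the PNG file signature."""
    return bytes(data[:8]) == PNG_SIGNATURE


def same_bytes(original, returned):
    """Compare two binary buffers by SHA-256 digest."""
    return hashlib.sha256(bytes(original)).digest() == hashlib.sha256(bytes(returned)).digest()


# Placeholder payloads; in the guide these would be real image bytes.
original = PNG_SIGNATURE + b"placeholder payload"
returned = bytearray(original)  # PySpark typically hands binary columns back as bytearray

round_trip_ok = is_png(returned) and same_bytes(original, returned)
```

Wrapping both inputs in bytes() keeps the comparison independent of whether a value arrives as bytes or bytearray.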

5. Vector search + BLOB retrieval in one query

hudi_vector_search returns the top-K nearest neighbors by cosine similarity; read_blob() materializes image bytes only for the matching rows.

SELECT image_id,
       category,
       read_blob(image_bytes) AS resolved_bytes,
       _hudi_distance
FROM hudi_vector_search(
    '/tmp/hudi_pets',          -- table path
    'embedding',               -- VECTOR column
    ARRAY(0.12, -0.03, ...),   -- query embedding (1024 floats)
    5,                         -- top-K
    'cosine'                   -- distance metric
)
ORDER BY _hudi_distance;

+-----------+--------------------+-----------+
|   image_id|            category|  distance |
+-----------+--------------------+-----------+
|pets_002575|  German Shorthaired|    0.378  |
|pets_000703|  German Shorthaired|    0.484  |
|pets_002562|  German Shorthaired|    0.598  |
|pets_002556|  German Shorthaired|    0.607  |
|pets_003538|Staffordshire Bul..|    0.641  |
+-----------+--------------------+-----------+
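The query embedding is elided in the SQL above. As a minimal sketch of splicing a Python list of floats into the query text, the helpers below follow the argument order shown in this guide's hudi_vector_search call. to_sql_array and build_search_sql are hypothetical names, the values are assumed to be finite, and whether the function needs the literals cast to a particular numeric type is not stated in this guide.

```python
def to_sql_array(vec):
    """Render a list of finite floats as a Spark SQL ARRAY(...) literal."""
    return "ARRAY(" + ", ".join(repr(float(x)) for x in vec) + ")"


def build_search_sql(table_path, column, query_vec, k, metric="cosine"):
    """Build the step 5 query text for a given query embedding (illustrative only)."""
    return (
        "SELECT image_id, category, "
        "read_blob(image_bytes) AS resolved_bytes, _hudi_distance\n"
        f"FROM hudi_vector_search('{table_path}', '{column}', "
        f"{to_sql_array(query_vec)}, {int(k)}, '{metric}')\n"
        "ORDER BY _hudi_distance"
    )


# A 2-element vector keeps the example readable; a real query needs all 1024 values.
sql = build_search_sql("/tmp/hudi_pets", "embedding", [0.12, -0.03], 5)
```

The resulting string would then be passed to spark.sql(sql) and collected, for example as the results list used in the visualization step.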

Vector search results panel

Query: German Shorthaired Pointer (left). Top-5 results ranked by cosine similarity, most similar first.
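To make the ranking concrete, here is a brute-force reference in plain Python: compute the cosine distance (1 minus the cosine similarity) from the query to every stored vector and keep the K smallest. It only illustrates the metric on toy 2-D vectors. It is not how hudi_vector_search is implemented, and top_k_cosine is a hypothetical name.

```python
import heapq
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_cosine(query, rows, k):
    """rows: iterable of (image_id, vector). Returns the k nearest as (distance, image_id), closest first."""
    scored = ((cosine_distance(query, vec), image_id) for image_id, vec in rows)
    return heapq.nsmallest(k, scored)


# Toy table: four 2-D "embeddings".
rows = [
    ("a", [1.0, 0.0]),   # same direction as the query
    ("b", [0.0, 1.0]),   # orthogonal to the query
    ("c", [1.0, 1.0]),   # 45 degrees from the query
    ("d", [-1.0, 0.0]),  # opposite direction
]

hits = top_k_cosine([1.0, 0.0], rows, k=2)
```

A smaller distance means a closer match, which is why the query in this step sorts by _hudi_distance in ascending order.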

6. Visualize results

import io

import matplotlib.pyplot as plt
from PIL import Image

# Assumes `results` holds the collected rows of the step 5 query and
# `query_bytes` holds the raw bytes of the query image.
fig, axes = plt.subplots(1, len(results) + 1, figsize=(3 * (len(results) + 1), 3.2))

# Query image
axes[0].imshow(Image.open(io.BytesIO(query_bytes)))
axes[0].set_title("QUERY", fontweight="bold")
axes[0].axis("off")

# Top-K matches
for i, row in enumerate(results):
    img = Image.open(io.BytesIO(bytes(row["resolved_bytes"])))
    sim = 1.0 - float(row["_hudi_distance"])  # convert cosine distance to similarity
    axes[i + 1].imshow(img)
    axes[i + 1].set_title(f"{row['category']}\nSim: {sim:.3f}")
    axes[i + 1].axis("off")

plt.tight_layout()
plt.savefig("hudi_vector_search_results.png", dpi=150)

What's next

Topic                        Link
Full interactive notebook    00_main_demo.ipynb
VECTOR type reference        VECTOR in SQL DDL · SQL DML · DataFrame writes · hudi_vector_search in SQL Queries
BLOB type reference          BLOB in SQL DDL · SQL DML · DataFrame writes · read_blob() in SQL Queries
VARIANT type                 VARIANT in SQL DDL · SQL DML · DataFrame writes · Querying VARIANT
Lance file format            Storage Layouts → Lance · DataFrame writes
AI lakehouse use cases       Use Cases → AI Lakehouse