mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to SVGs typically rasterize them, which discards structured geometric and layout information and makes it hard to place text, images, and SVGs in a shared semantic space. This work proposes a training-free, instruction-guided multimodal embedding framework that uses a multimodal large language model (MLLM) to expose the geometric and relational structure of SVGs through modality-specific instructions and semantic rewriting. The framework introduces Multimodal Explicit One-word Limitation (mEOL), which prompts the MLLM to summarize any multimodal input into a single token whose hidden state serves as the embedding. On the first text-to-SVG retrieval benchmark, built from a repurposed VGBench, the training-free embeddings outperform encoder-based and training-based multimodal baselines, indicating that prompt-level control is an effective means of cross-modal alignment.
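The one-word, last-token idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt templates, the function names (`build_prompt`, `last_token_embedding`, `rank`), and the toy vectors are all assumptions made for illustration, and the per-token hidden states are taken as given rather than produced by a real MLLM.

```python
import math

# Hypothetical modality-specific templates; the wording used in the paper
# is not reproduced here. Each one ends by asking for a single word.
TEMPLATES = {
    "text": 'This sentence: "{content}" means in one word: "',
    "image": 'This image: <image> means in one word: "',
    "svg": 'This SVG code: "{content}" means in one word: "',
}


def build_prompt(modality, content=""):
    """Fill the template for one modality (images carry no inline text)."""
    return TEMPLATES[modality].format(content=content)


def last_token_embedding(hidden_states):
    """Take the final token's vector and L2-normalise it.

    `hidden_states` is a list of per-token vectors, assumed to come from
    the last layer of an MLLM run on a prompt built by `build_prompt`.
    """
    vec = hidden_states[-1]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def rank(query, candidates):
    """Order candidate names by cosine similarity to the query.

    All vectors are unit length, so the dot product is the cosine.
    """
    scores = {
        name: sum(q * c for q, c in zip(query, vec))
        for name, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(build_prompt("text", "a red circle above a blue square"))
    query = last_token_embedding([[9.0, 9.0], [3.0, 4.0]])
    gallery = {
        "a.svg": last_token_embedding([[1.0, 0.0]]),
        "b.svg": last_token_embedding([[3.0, 4.1]]),
    }
    print(rank(query, gallery))
```

Because every modality is reduced to the same kind of vector, text-to-SVG retrieval in this sketch is a single nearest-neighbour lookup with no learned projection head.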

📝 Abstract

Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a single token whose hidden state serves as a compact semantic embedding; and (2) a semantic SVG rewriting module that assigns meaningful identifiers and simplifies nested SVG elements through visual reasoning over the rendered image, exposing geometric and relational cues hidden in raw code. Using a repurposed VGBench, we build the first text-to-SVG retrieval benchmark and show that our training-free embeddings outperform encoder-based and training-based multimodal baselines. These results highlight prompt-level control as an effective alternative to parameter-level training for structure-aware multimodal retrieval. Project page: https://scene-the-ella.github.io/meol/
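The second component, semantic SVG rewriting, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's module: in the paper the identifiers come from an MLLM reasoning over the rendered image, whereas here a hypothetical `labels` list stands in for that output, and "simplifying nested elements" is reduced to unwrapping attribute-free `<g>` groups that hold a single child. The names `rewrite_svg`, `_flatten`, and `DRAWABLE` are invented for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialise without an ns0: prefix

G_TAG = f"{{{SVG_NS}}}g"
# Assumed set of shape elements that receive an identifier.
DRAWABLE = {"path", "rect", "circle", "ellipse", "line",
            "polygon", "polyline", "text"}


def _flatten(elem):
    """Unwrap <g> wrappers that have no attributes and exactly one child."""
    for i, child in enumerate(list(elem)):
        while child.tag == G_TAG and not child.attrib and len(child) == 1:
            child = child[0]
            elem[i] = child  # replace the wrapper with its only child
        _flatten(child)


def rewrite_svg(svg_code, labels):
    """Return SVG code with flattened groups and labelled shapes.

    `labels` is a hypothetical stand-in for names an MLLM would propose,
    one per drawable element in document order.
    """
    root = ET.fromstring(svg_code)
    _flatten(root)
    shapes = [e for e in root.iter() if e.tag.split("}")[-1] in DRAWABLE]
    for shape, label in zip(shapes, labels):
        shape.set("id", label)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    raw = (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<g><g><circle r="5"/></g></g>'
        '<rect width="2" height="2"/>'
        "</svg>"
    )
    print(rewrite_svg(raw, ["sun", "ground"]))
```

The rewritten code is what would then be placed into the SVG prompt, so that names such as `sun` and `ground` are visible to the model alongside the geometry.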

Problem

Research questions and friction points this paper is trying to address.

Scalable Vector Graphics

multimodal embedding

structure-aware retrieval

instruction-guided

training-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free

instruction-guided

multimodal embedding