CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the semantic ambiguity, stylistic inconsistency, and geometric incompatibility that arise when avatars are assembled from marketplace assets using text-only retrieval. The authors propose CMAG, a retrieval framework guided by an intermediate 3D concept scaffold, which supplies global spatial and stylistic context. The framework combines prompt decomposition, localized visual evidence extraction, prompt-conditioned taxonomy routing, and a hybrid category-wise retriever. An agentic vision-language model then filters and re-ranks candidates across categories and drives an iterative verification loop. On diverse compositional prompts, the authors report improved retrieval robustness and compositional correctness over strong baselines, which they attribute to 3D concept scaffolding under prompt ambiguity.

📝 Abstract

Metaverse platforms rely on creator-driven marketplaces where avatars are assembled from discrete, taxonomy-labeled 3D assets (e.g., tops, bottoms, shoes, accessories) under strict category and topology constraints. While users increasingly expect free-form text control, text-only retrieval is brittle: natural language is ambiguous with respect to platform taxonomies, metadata is often noisy or informal, and independently retrieved components can be stylistically inconsistent or geometrically incompatible. We propose CMAG, a concept-scaffolded retrieval and verified composition framework for marketplace avatar generation. Given a prompt, CMAG first synthesizes an intermediate 3D concept scaffold that disambiguates intent beyond text by providing global spatial and stylistic context. In parallel, a view-aware part discovery module extracts localized visual evidence via prompt decomposition and text-grounded segmentation. A prompt-conditioned taxonomy router enforces category coverage and resolves semantic-to-taxonomic mismatch, after which a hybrid category-wise retriever combines part-based fusion with a concept-residual fallback using feature suppression. Finally, an agentic vision-language model filters and re-ranks candidates across categories and drives an iterative verification loop to assemble prompt-faithful, topologically consistent avatars from catalog assets. We evaluate CMAG on diverse compositional prompts and demonstrate improved retrieval robustness and compositional correctness compared to strong baselines, highlighting the importance of 3D concept scaffolding under prompt ambiguity.
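The hybrid category-wise retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how feature suppression is computed, so the sketch assumes it means projecting the discovered part embeddings out of the concept embedding. The function names (`cosine`, `suppress`, `retrieve`), the toy 3-dimensional embeddings, and the catalog entries are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def suppress(concept, parts):
    """Assumed form of feature suppression: subtract from the concept
    embedding its projection onto each discovered part embedding.
    Sequential projection is exact only for mutually orthogonal parts."""
    residual = list(concept)
    for p in parts:
        norm2 = sum(x * x for x in p)
        if norm2 == 0:
            continue
        scale = sum(r * x for r, x in zip(residual, p)) / norm2
        residual = [r - scale * x for r, x in zip(residual, p)]
    return residual

def retrieve(categories, part_embs, concept_emb, catalog):
    """Pick one asset per routed category. Categories with a discovered
    part use that part's embedding as the query; the rest fall back to
    the concept residual."""
    residual = suppress(concept_emb, part_embs.values())
    picks = {}
    for cat in categories:
        query = part_embs.get(cat, residual)
        best_name, _ = max(catalog[cat], key=lambda item: cosine(query, item[1]))
        picks[cat] = best_name
    return picks

# Toy data (hypothetical): the router asked for three categories, but part
# discovery only found evidence for "top" and "bottom".
categories = ["top", "bottom", "shoes"]
part_embs = {"top": [1.0, 0.0, 0.0], "bottom": [0.0, 1.0, 0.0]}
concept_emb = [1.0, 1.0, 1.0]
catalog = {
    "top": [("hoodie", [0.9, 0.1, 0.0]), ("tee", [0.2, 0.8, 0.0])],
    "bottom": [("jeans", [0.0, 1.0, 0.0]), ("skirt", [1.0, 0.0, 0.2])],
    "shoes": [("sneaker", [0.1, 0.0, 0.9]), ("boot", [0.9, 0.1, 0.1])],
}
picks = retrieve(categories, part_embs, concept_emb, catalog)
print(picks)
```

In this toy setup the "shoes" category has no localized evidence, so its query is whatever remains of the concept embedding once the top and bottom directions are removed. The intent is that the fallback is steered away from features already explained by other parts.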

Problem

Research questions and friction points this paper is trying to address.

avatar generation

text-to-3D retrieval

concept disambiguation

compositional consistency

taxonomy alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

concept scaffolding

avatar generation

text-to-3D retrieval