🤖 AI Summary
This work addresses abstract visual compositional generation, in which object identity is determined solely by the spatial configuration of a few geometric primitives (e.g., parts, symmetries, topology) and is invariant to texture and fine detail. The task suffers from combinatorial explosion, severe data scarcity, and discrete feasibility constraints (e.g., non-overlap, orientation validity), yielding a sparse solution space that pixel-level generative models struggle to capture. To overcome this, we propose a geometry-semantic co-guided constrained generation framework that integrates AlphaGo-style search with fine-tuned vision-language models (VLMs). Specifically, Gumbel Monte-Carlo Tree Search (MCTS) serves as the policy engine, jointly combining constraint reasoning, neural semantic scoring, and adversarial reward learning in an end-to-end loop. Evaluated on the Tangram Assembly task, our method significantly outperforms diffusion and autoregressive baselines, achieving superior structural validity and semantic fidelity, especially under stringent geometric constraints.
📝 Abstract
We study abstract visual composition, in which identity is determined primarily by the spatial configuration of, and relations among, a small set of geometric primitives (e.g., parts, symmetry, topology), and is largely invariant to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and a vague goal specification (such as text) is non-trivial: combinatorial placement choices, limited data, and discrete feasibility requirements (overlap-free layouts, allowable orientations) create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment and supplies the reward signal. Our algorithm uses a policy network as a heuristic within Monte-Carlo Tree Search and fine-tunes that network on search-generated plans. Inspired by Generative Adversarial Networks, we further refine the reward model adversarially on generated instances: as training proceeds, the generations should approach the real data until the reward model can no longer distinguish generated instances from ground truth. On the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and autoregressive baselines, especially as constraints tighten.
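To make the search component concrete, the sketch below shows a minimal PUCT-style MCTS in which a policy prior biases action selection and a reward function scores completed assemblies. Everything here is a stand-in: the 1-D toy board, the uniform `policy_prior`, and the adjacency-based `reward` are hypothetical placeholders for the paper's tangram geometry, fine-tuned policy network, and VLM/adversarial reward model, respectively.

```python
import math
import random

BOARD = 6   # toy 1-D board with 6 cells (stand-in for tangram geometry)
PIECES = 3  # place 3 unit pieces; no two pieces may share a cell

def legal_actions(state):
    # discrete feasibility constraint: each cell holds at most one piece
    return [c for c in range(BOARD) if c not in state]

def is_terminal(state):
    return len(state) == PIECES

def reward(state):
    # placeholder semantic score: fraction of placed pieces that are adjacent
    cells = sorted(state)
    return sum(1 for a, b in zip(cells, cells[1:]) if b - a == 1) / (PIECES - 1)

def policy_prior(state, actions):
    # placeholder for the fine-tuned policy network: uniform prior
    return {a: 1.0 / len(actions) for a in actions}

class Node:
    def __init__(self, state):
        self.state = state
        self.N, self.W, self.P = {}, {}, {}  # visit counts, total value, priors

def expand(node):
    acts = legal_actions(node.state)
    node.P = policy_prior(node.state, acts)
    for a in acts:
        node.N[a], node.W[a] = 0, 0.0

def puct(node, c=1.4):
    # select the action maximizing Q + exploration bonus weighted by the prior
    total = sum(node.N.values()) or 1
    def score(a):
        q = node.W[a] / node.N[a] if node.N[a] else 0.0
        return q + c * node.P[a] * math.sqrt(total) / (1 + node.N[a])
    return max(node.N, key=score)

def rollout(state):
    # random playout to a terminal assembly, then score it
    while not is_terminal(state):
        state = state | {random.choice(legal_actions(state))}
    return reward(state)

def mcts(root_state, sims=400):
    root = Node(frozenset(root_state))
    expand(root)
    nodes = {root.state: root}
    for _ in range(sims):
        node, path = root, []
        while True:  # selection: descend via PUCT until a leaf or terminal
            a = puct(node)
            path.append((node, a))
            child_state = node.state | {a}
            if is_terminal(child_state) or child_state not in nodes:
                break
            node = nodes[child_state]
        if not is_terminal(child_state):  # expansion
            child = Node(child_state)
            expand(child)
            nodes[child_state] = child
        r = rollout(child_state)          # evaluation
        for n, a in path:                 # backup along the visited path
            n.N[a] += 1
            n.W[a] += r
    return max(root.N, key=root.N.get)    # most-visited root action
```

In the full system, the visit distribution at the root would additionally serve as a training target for the policy network, and the terminal assemblies would feed the adversarial refinement of the reward model.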