Realistic Synthetic Household Data Generation at Scale

📅 2026-02-06
🤖 AI Summary
Existing synthetic data methods struggle to model the bidirectional coupling between human behavior and domestic environments, which hinders the development of embodied agents. This work proposes a loosely coupled generative framework that co-generates personas and home environments: persona traits drive environment layout, while environment schematics and semantics in turn shape human-robot interactions. Users customize the output through natural language prompts. The framework is validated statistically with multimodal embeddings, using cosine similarity, mutual information gain, and intervention analysis. Generated data reaches a cosine similarity of 0.60 with the real-world HOMER dataset. Intervention experiments show that changes to persona attributes (age, organization, sleep pattern) produce statistically significant differences in both environment and behavior (p < 0.001, Cohen's d = 0.51–1.12), which supports the method's effectiveness and controllability.

📝 Abstract
Advances in foundation models have catalyzed research in Embodied AI on interactive agents capable of environmental reasoning and interaction. Developing such agents requires diverse, large-scale datasets. Prior frameworks generate synthetic data for long-term human-robot interactions but fail to model the bidirectional influence between human behavior and household environments. Our proposed generative framework creates household datasets at scale through loosely coupled generation of long-term human-robot interactions and environments. Human personas influence environment generation, while environment schematics and semantics shape human-robot interactions. The generated 3D data includes rich static context, such as object and environment semantics, and temporal context that captures human and agent behaviors over extended periods. Our flexible tool lets users configure environment and human activity data through natural language prompts, and it creates variations of these user-defined configurations to enable scalable data generation. We validate the framework through statistical evaluation using multi-modal embeddings and four key metrics: cosine similarity, mutual information gain, intervention analysis, and iterative improvement validation. Statistical comparisons show good alignment with a real-world dataset (HOMER; cosine similarity 0.60), while synthetic datasets (Wang et al.) show moderate alignment (0.27). Intervention analysis across changes in age, organization, and sleep pattern shows statistically significant effects (p < 0.001) with medium to large effect sizes (Cohen's d = 0.51–1.12), confirming that the bidirectional coupling translates persona traits into measurable environmental and behavioral differences. These contributions enable the development and testing of household smart devices at scale.
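To make two of the reported metrics concrete, the following is a minimal sketch of how cosine similarity between embedding vectors and Cohen's d between two groups are conventionally computed. It is not the authors' evaluation code: the function names, the pooled-standard-deviation form of Cohen's d, and the toy numbers are all assumptions chosen for illustration, and the paper's actual embedding model and aggregation procedure are not described here.

```python
import math
from statistics import mean, variance


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cohens_d(baseline, intervened):
    """Standardized mean difference using the pooled sample standard deviation.

    Assumes two independent groups with at least two observations each.
    """
    n_a, n_b = len(baseline), len(intervened)
    pooled_var = (
        (n_a - 1) * variance(baseline) + (n_b - 1) * variance(intervened)
    ) / (n_a + n_b - 2)
    return (mean(intervened) - mean(baseline)) / math.sqrt(pooled_var)


# Toy data, invented for illustration only.
synthetic_embedding = [0.2, 0.1, 0.7]
real_embedding = [0.3, 0.0, 0.6]
similarity = cosine_similarity(synthetic_embedding, real_embedding)

# A made-up behavioral measure before and after changing one persona attribute.
baseline_scores = [1.0, 2.0, 3.0]
intervened_scores = [2.0, 3.0, 4.0]
effect_size = cohens_d(baseline_scores, intervened_scores)
```

Under Cohen's conventional thresholds, values of d near 0.5 are usually read as medium effects and values of 0.8 or above as large, which is one way to interpret the reported range of 0.51 to 1.12.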
Problem

Research questions and friction points this paper is trying to address.

synthetic data generation
human-environment interaction
embodied AI
bidirectional influence
household environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

bidirectional coupling
synthetic household data
foundation models
natural language prompting
embodied AI