🤖 AI Summary
To address the challenge engineers face in comprehending complex third-party timing diagrams (TDs), this paper proposes a visual question-answering (VQA) system designed specifically for TD understanding. Methodologically, the authors introduce a controllable TD synthesis pipeline to mitigate annotation scarcity and perform domain-adaptive fine-tuning of LLaVA, a lightweight multimodal large language model, for joint modeling of TD images and natural-language queries, with untuned GPT-4o serving as a strong baseline for systematic evaluation. The contributions are threefold: (1) the first dedicated VQA framework for TD understanding; (2) a synthetic-data-driven domain-transfer approach that enables effective adaptation to the TD modality; and (3) state-of-the-art performance across multiple TD comprehension benchmarks, outperforming the zero-shot GPT-4o baseline by an average of 23.6%, demonstrating both technical efficacy and engineering practicality.
📝 Abstract
We introduce TD-Interpreter, a specialized ML tool that assists engineers in understanding complex timing diagrams (TDs), originating from a third party, during their design and verification process. TD-Interpreter is a visual question-answering environment that allows engineers to input a set of TDs and ask design and verification queries regarding these TDs. We implemented TD-Interpreter with multimodal learning by fine-tuning LLaVA, a lightweight 7B Multimodal Large Language Model (MLLM). To address limited training data availability, we developed a synthetic data generation workflow that aligns visual information with its textual interpretation. Our experimental evaluation demonstrates the usefulness of TD-Interpreter, which outperformed untuned GPT-4o by a large margin on the evaluated benchmarks.
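To make the synthetic data generation idea concrete, here is a minimal illustrative sketch (not the paper's actual pipeline, whose details are not given in the abstract). It generates a WaveDrom-style waveform specification, which a renderer could turn into a TD image, together with a question-answer pair derived from the same ground truth, so the visual and textual views are aligned by construction. The function name `make_sample` and the signal name `sig_a` are hypothetical.

```python
import json
import random

def make_sample(n_cycles=8, seed=0):
    """Generate one synthetic (diagram spec, question, answer) triple.

    Illustrative sketch only: the diagram is a WaveDrom-style wave
    string ('0'/'1' per cycle); the QA pair is read off the same spec,
    keeping image and text aligned by construction.
    """
    rng = random.Random(seed)
    # Random binary waveform, forced to start low and contain a rising edge.
    bits = [rng.choice("01") for _ in range(n_cycles)]
    if "1" not in bits[1:]:
        bits[-1] = "1"
    bits[0] = "0"
    wave = "".join(bits)
    first_high = wave.index("1")  # ground-truth answer from the spec
    spec = {"signal": [{"name": "sig_a", "wave": wave}]}
    question = "In which cycle does sig_a first go high (0-indexed)?"
    return json.dumps(spec), question, str(first_high)
```

Because the answer is computed from the specification rather than annotated by hand, such a generator can produce arbitrarily many consistent training triples, which is the property the workflow above relies on to offset annotation scarcity.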