🤖 AI Summary
Existing research struggles to effectively integrate multimodal data to uncover the public health risks posed by novel tobacco products such as flavored nicotine products, primarily due to a lack of cross-modal factual grounding. To address this gap, this work introduces the NICO dataset, comprising over 200,000 image-text samples, and proposes the NICO-RAG framework. NICO-RAG pioneers the integration of a multimodal hypergraph structure into retrieval-augmented generation (RAG), organizing entities and relations extracted from both images and text into a hypergraph during the construction phase. This design enables efficient cross-modal retrieval and reasoning based on semantic and visual similarity, without requiring explicit processing of image tokens. Experimental results demonstrate that NICO-RAG achieves performance on par with state-of-the-art image-aware RAG methods across more than 100 questions while substantially reducing computational costs.
📝 Abstract
The nicotine addiction public health crisis continues to be pervasive. In this century alone, the tobacco industry has aggressively released and marketed new products to lure new, young customers for life. These innovations, namely flavored nicotine and tobacco products such as nicotine pouches, have undone years of anti-tobacco campaign work. Past work is limited both in scope and in its ability to connect large-scale data points. We therefore introduce the Nicotine Innovation Counter-Offensive (NICO) Dataset, which provides public health researchers with over 200,000 multimodal samples, including images and text descriptions, covering 55 tobacco and nicotine product brands. In addition, to surface factual connections across this large-scale dataset, we propose NICO-RAG, a retrieval-augmented generation (RAG) framework that retrieves image features without incurring the high cost of language models or the added cost of processing image tokens over large-scale datasets such as NICO. At construction time, NICO-RAG organizes entities and relations extracted from images and text into hypergraphs to produce responses that are as factual as possible. This joint multimodal knowledge representation enables NICO-RAG to retrieve images for query answering not only by visual similarity but also by the semantic similarity of image descriptions. Experiments show that, without processing additional image tokens across over 100 questions, NICO-RAG performs comparably to a state-of-the-art RAG method adapted for images.
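The construction-time idea described above, indexing entities and relations from both modalities into hyperedges, then answering queries via the semantic similarity of image descriptions rather than image tokens, can be sketched roughly as follows. This is a minimal illustration, not the NICO-RAG implementation: the class names, the bag-of-words embedding, and the sample facts are all hypothetical stand-ins for the paper's actual extractors and encoders.

```python
import math
from dataclasses import dataclass
from collections import defaultdict

def embed(text):
    """Toy bag-of-words embedding; a real system would use a text encoder."""
    vec = defaultdict(float)
    for tok in text.lower().split():
        vec[tok] += 1.0
    return dict(vec)

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Hyperedge:
    entities: set      # entities linked by one extracted relation/fact
    description: str   # text span or image caption the fact came from
    source: str        # "text" or "image"

class MultimodalHypergraph:
    """Illustrative hypergraph index over image- and text-extracted facts."""

    def __init__(self):
        self.edges = []
        self.incidence = defaultdict(list)  # entity -> indices of its edges

    def add_fact(self, entities, description, source):
        idx = len(self.edges)
        self.edges.append(Hyperedge(set(entities), description, source))
        for e in entities:
            self.incidence[e].append(idx)

    def retrieve(self, query, k=2):
        """Rank hyperedges by semantic similarity of their descriptions;
        image facts are matched via captions, so no image tokens are read."""
        q = embed(query)
        scored = [(cosine(q, embed(e.description)), e) for e in self.edges]
        scored.sort(key=lambda p: p[0], reverse=True)
        return [e for s, e in scored[:k] if s > 0]

# Construction time: index hypothetical facts from both modalities.
hg = MultimodalHypergraph()
hg.add_fact({"BrandX", "mint pouch"},
            "BrandX markets a mint flavored nicotine pouch", "text")
hg.add_fact({"BrandX", "youth imagery"},
            "advertisement image shows bright candy-like mint packaging", "image")

# Query time: both text- and image-derived facts surface for one query.
hits = hg.retrieve("mint flavored pouch marketing")
```

The point of the sketch is the asymmetry the abstract emphasizes: images participate in retrieval only through their descriptions and shared entities, so query-time cost is that of text similarity alone.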