Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

πŸ“… 2025-11-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the high computational cost, reliance on large backbone networks, and poor deployability of KRISP in resource-constrained settings, this paper proposes Lite-KRISPβ€”a lightweight knowledge-enhanced vision-language model. Lite-KRISP decouples the knowledge injection module from the visual encoder and introduces a structured knowledge graph fusion mechanism alongside domain-specific constraints to effectively suppress hallucination and improve generalization. The model reduces parameter count by over 85%, achieves 75% of the original KRISP’s performance on the DAQUAR dataset, and accelerates inference by 3.2Γ—, enabling offline deployment on edge devices. Systematic ablation studies and synthetic VQA evaluations confirm that the proposed architecture preserves knowledge-guided reasoning capability while delivering superior efficiency, robustness, and scalability.

πŸ“ Abstract
Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into vision-language reasoning pipelines. Despite its effectiveness, the original model was developed for industrial-scale training, is computationally demanding, and is tightly coupled to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Although our replicated model achieves about 75% of the original's performance, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. Through systematic ablation studies, including a proof-of-concept on synthetic VQA data and an evaluation on the DAQUAR dataset, we offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints. Our model, configured with a low parameter budget and constrained to the domain of its external knowledge graph, suppresses AI hallucinations by generating outputs solely within that domain. The minimal parameter count allows the model to run on edge devices such as smartphones and AR/VR headsets, further enabling offline visual reasoning.
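The domain constraint described above can be illustrated with a minimal sketch. The function names and the answer-classification setup here are hypothetical (the paper does not publish this interface); the sketch only shows the general idea of masking answer logits so the model can never emit an answer that is absent from the knowledge graph:

```python
import numpy as np

def mask_logits_to_kg(logits, answer_vocab, kg_entities):
    """Restrict answer logits to entities present in the knowledge graph.

    Answers outside the KG domain are set to -inf, so softmax/argmax can
    never select them -- one simple way to realize the paper's idea of
    generating outputs solely within the KG domain.
    """
    allowed = np.array([ans in kg_entities for ans in answer_vocab])
    return np.where(allowed, logits, -np.inf)

# Toy example: a 5-answer vocabulary, of which only 3 answers exist in the KG.
answer_vocab = ["chair", "table", "unicorn", "lamp", "dragon"]
kg_entities = {"chair", "table", "lamp"}
logits = np.array([1.2, 0.4, 3.1, 0.9, 2.5])  # raw model scores

masked = mask_logits_to_kg(logits, answer_vocab, kg_entities)
best = answer_vocab[int(np.argmax(masked))]
print(best)  # "chair" -- highest-scoring in-domain answer, even though
             # the out-of-domain "unicorn" had the largest raw score
```

Note how the out-of-domain answers ("unicorn", "dragon") carry the largest raw scores, yet the mask forces the prediction back inside the KG domain; this is the mechanism by which a domain-constrained output head suppresses hallucinated answers.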
Problem

Research questions and friction points this paper is trying to address.

Developing lightweight knowledge-enhanced vision-language models with fewer parameters
Identifying design flaws and scalability issues in the original KRISP architecture
Enabling offline visual reasoning on edge devices while preventing AI hallucinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight model with fewer parameters
Knowledge-graph domain constraint prevents AI hallucinations
Functions on edge devices like smartphones
πŸ”Ž Similar Papers
No similar papers found.
Souradeep Dutta
University of British Columbia
Artificial Intelligence · Formal Methods · Machine Learning · Robotics · Cyber-Physical Systems
Keshav Bulia
Department of Metallurgical Engineering & Materials Science, Indian Institute of Technology Bombay, Mumbai, India
Neena S Nair
Department of Bioscience & Bioengineering, Indian Institute of Technology Bombay, Mumbai, India