Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study evaluates the capability of large language models (LLMs) to autonomously reverse-engineer the internal structure of black-box systems (programs, formal languages, and mathematical equations) without prior structural knowledge. Method: passive observation and active intervention are compared as data-acquisition paradigms; a prompt-engineered, goal-directed querying strategy is proposed; a cross-model mechanism for transferring intervention data is designed; and performance is benchmarked against ideal Bayesian inference. Contribution/Results: Under passive observation, LLMs plateau well below the Bayesian bound and fall into two failure modes: overcomplication (falsely assuming prior knowledge about the black box) and overlooking (failing to incorporate observations). Introducing an active "query–verify–revise" closed loop substantially improves reverse-engineering accuracy and mitigates both failure modes. This work provides empirical evidence that goal-directed active intervention is a critical pathway to strengthening LLMs' scientific-discovery capabilities in structural inference tasks.

📝 Abstract
Using AI to create autonomous researchers has the potential to accelerate scientific discovery. A prerequisite for this vision is understanding how well an AI model can identify the underlying structure of a black-box system from its behavior. In this paper, we explore how well a large language model (LLM) learns to identify a black-box function from passively observed versus actively collected data. We investigate the reverse-engineering capabilities of LLMs across three distinct types of black-box systems, each chosen to represent different problem domains where future autonomous AI researchers may have considerable impact: Program, Formal Language, and Math Equation. Through extensive experiments, we show that LLMs fail to extract information from observations, reaching a performance plateau that falls short of the ideal of Bayesian inference. However, we demonstrate that prompting LLMs to not only observe but also intervene -- actively querying the black-box with specific inputs to observe the resulting output -- improves performance by allowing LLMs to test edge cases and refine their beliefs. By providing the intervention data from one LLM to another, we show that this improvement is partly a result of engaging in the process of generating effective interventions, paralleling results in the literature on human learning. Further analysis reveals that engaging in intervention can help LLMs escape from two common failure modes: overcomplication, where the LLM falsely assumes prior knowledge about the black-box, and overlooking, where the LLM fails to incorporate observations. These insights provide practical guidance for helping LLMs more effectively reverse-engineer black-box systems, supporting their use in making new discoveries.
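The abstract's core contrast, passive observation versus active intervention, can be illustrated with a minimal sketch. This is not the paper's code: the hidden rule, the candidate hypotheses, and the helper `discriminating_input` are all hypothetical stand-ins for what an LLM would generate. The loop actively picks inputs on which surviving hypotheses disagree, queries the black box, and prunes any hypothesis the observation falsifies, which is the "query–verify–revise" idea in miniature.

```python
# Hedged sketch of active intervention against a black-box function.
# All names and the hidden rule below are illustrative assumptions.

def black_box(x):
    # Hidden rule the learner must recover (hypothetical example).
    return x * 2 + 1

# Candidate hypotheses the learner entertains: (description, function).
candidates = [
    ("x * 2 + 1", lambda x: x * 2 + 1),
    ("x * 3",     lambda x: x * 3),
    ("x + 2",     lambda x: x + 2),
]

def discriminating_input(cands, search_range=range(-10, 11)):
    """Return an input on which at least two candidates disagree, else None."""
    for x in search_range:
        outputs = {f(x) for _, f in cands}
        if len(outputs) > 1:
            return x
    return None

# Query-verify-revise loop: query the black box where hypotheses disagree,
# then keep only the hypotheses consistent with the new observation.
while len(candidates) > 1:
    x = discriminating_input(candidates)
    if x is None:
        break  # remaining candidates are indistinguishable on this range
    y = black_box(x)
    candidates = [(name, f) for name, f in candidates if f(x) == y]

print("Surviving hypothesis:", candidates[0][0])
```

A passive learner would instead receive a fixed batch of (input, output) pairs and might never see an input that separates the competing hypotheses; choosing the query is what makes the intervention informative, which mirrors the paper's finding that generating the interventions, not merely seeing their results, drives the improvement.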
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to reverse-engineer black-box systems
Comparing passive observation vs active intervention in LLMs
Identifying common failure modes in LLM reverse-engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark of LLM reverse-engineering on programs, formal languages, and math equations
Active intervention significantly improves LLM performance
Cross-model transfer of intervention data yields partial gains