🤖 AI Summary
Can large language models (LLMs) reliably adhere to organizational hierarchies and role-based access control (RBAC) constraints? This paper introduces OrgAccess, the first RBAC evaluation benchmark tailored to organization-level LLM use, comprising 40 permission categories and 70,000 multi-level, multi-conflict scenario instances. The authors define an organizational permission-reasoning evaluation paradigm that incorporates fine-grained difficulty levels (easy/medium/hard) and combinatorial conflict testing. Leveraging synthetic data generation, hierarchical permission modeling, and F1-score-based compliance assessment, they find that even state-of-the-art LLMs achieve only 0.27 F1 on the hardest tasks. These results expose fundamental deficiencies in LLMs' adherence to structured access policies, underscoring critical gaps in organizational trustworthiness. OrgAccess thus provides both a rigorous evaluation framework and empirical evidence essential for deploying trustworthy, enterprise-grade AI systems.
📝 Abstract
Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical yet underexplored challenge emerges: *can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions?* Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce **OrgAccess**, a synthetic yet representative benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further construct three difficulty tiers: 40,000 easy instances (a single permission), 10,000 medium instances (3-permission tuples), and 20,000 hard instances (5-permission tuples), to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle to maintain compliance with role-based structures despite explicit instructions, and their performance degrades further when navigating interactions involving two or more conflicting permissions. Notably, **GPT-4.1 achieves an F1-score of only 0.27 on our hardest benchmark**. This demonstrates a critical limitation in LLMs' complex rule-following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.
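To make the evaluation setup concrete, the sketch below shows one plausible way to encode a benchmark instance and score comply/refuse decisions with a binary F1. All names here (`Scenario`, the role and permission strings, the tuple sizes) are illustrative assumptions, not the paper's actual data format; only the general idea — a role, a set of granted permissions, a requested permission tuple, and F1 over compliance decisions — follows the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One hypothetical OrgAccess-style instance (names are illustrative)."""
    role: str
    granted: set            # permissions the role actually holds
    requested: tuple        # permissions the query needs: 1 / 3 / 5 for easy / medium / hard
    label: bool = field(init=False)  # ground truth: should the model comply?

    def __post_init__(self):
        # Comply only if every requested permission is granted to the role.
        self.label = all(p in self.granted for p in self.requested)

def f1_score(labels, preds):
    """Binary F1 over comply (True) / refuse (False) decisions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

scenarios = [
    Scenario("intern",  {"read_docs"},                   ("read_docs",)),
    Scenario("intern",  {"read_docs"},                   ("read_docs", "view_payroll", "approve_budget")),
    Scenario("manager", {"read_docs", "approve_budget"}, ("approve_budget",)),
]
labels = [s.label for s in scenarios]   # [True, False, True]
preds = [True, True, True]              # a model that always complies
print(round(f1_score(labels, preds), 2))  # 0.8
```

The second scenario illustrates the conflict case the benchmark targets: a single granted permission (`read_docs`) overlaps with a multi-permission request, so an over-compliant model leaks access it should refuse, which is exactly what the F1 metric penalizes.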