Falcon Perception

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of unifying visual feature extraction and task modeling in traditional perception systems, which often rely on decoupled encoder-decoder architectures. The authors propose an end-to-end, early-fusion dense Transformer that jointly processes image patches and text tokens from the first layer within a shared parameter space. This architecture employs a hybrid attention mechanism—bidirectional for image tokens and causal for prediction tokens—to integrate global visual context with autoregressive instance generation. Eschewing modular design, the model uses only a lightweight output head for continuous dense prediction and incorporates parallel high-resolution mask decoding. It achieves a Macro-F₁ of 68.0 on SA-Co, surpassing SAM3 (62.3), and demonstrates superior performance on the new PBench benchmark. Its derivative, Falcon OCR (300M parameters), attains 80.3% accuracy on olmOCR and 88.64 on OmniDocBench.
📝 Abstract
Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction. Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$ compared to SAM3's 62.3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR: a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.
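The hybrid attention pattern described above can be sketched as a single mask over a concatenated sequence. This is a minimal illustration, not the paper's implementation: it assumes image tokens come first, prediction tokens second, and that image tokens attend only among themselves while prediction tokens attend to all image tokens plus earlier prediction tokens.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_pred: int) -> np.ndarray:
    """Boolean mask: entry (i, j) is True if query token i may attend to key token j.

    Sequence layout (an assumption for this sketch): [image tokens | prediction tokens].
    Image tokens attend bidirectionally within the image block; prediction tokens
    attend to all image tokens (global visual context) and causally among themselves.
    """
    n = n_image + n_pred
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_image, :n_image] = True
    # Prediction tokens see every image token...
    mask[n_image:, :n_image] = True
    # ...and attend causally (lower-triangular) within the prediction block.
    mask[n_image:, n_image:] = np.tril(np.ones((n_pred, n_pred), dtype=bool))
    return mask

mask = hybrid_attention_mask(n_image=4, n_pred=3)
```

Such a mask would be passed to the attention scores before the softmax (disallowed positions set to negative infinity); the lower-triangular prediction block is what enables the autoregressive, variable-length instance generation the abstract refers to.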
Problem

Research questions and friction points this paper is trying to address.

perception-centric systems
early-fusion
modular encoder-decoder
dense prediction
unified architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

early-fusion
unified Transformer
hybrid attention
dense perception
scalable backbone
Aviraj Bevli
Falcon Vision Team, TII
Sofian Chaybouti
Falcon Vision Team, TII
Yasser Dahou
Dublin City University, Technology Innovation Institute
Deep learning, Vision Language Models, Visual Attention modelling
Hakim Hacid
Technology Innovation Institute (TII), UAE
Machine Learning, LLM, Databases, Information Retrieval, Edge ML
Ngoc Dung Huynh
Falcon Vision Team, TII
Phuc H. Le Khac
Falcon Vision Team, TII
Sanath Narayan
Technology Innovation Institute, Abu Dhabi
Computer Vision, Machine Learning
Wamiq Reyaz Para
Falcon Vision Team, TII
Ankit Singh
Falcon Vision Team, TII