🤖 AI Summary
This work addresses the challenge of unifying visual feature extraction and task modeling in traditional perception systems, which often rely on decoupled encoder-decoder architectures. The authors propose an end-to-end, early-fusion dense Transformer that jointly processes image patches and text tokens from the first layer within a shared parameter space. This architecture employs a hybrid attention mechanism—bidirectional for image tokens and causal for prediction tokens—to integrate global visual context with autoregressive instance generation. Eschewing modular design, the model uses only a lightweight output head for continuous dense prediction and incorporates parallel high-resolution mask decoding. It achieves a Macro-F₁ of 68.0 on SA-Co, surpassing SAM3 (62.3), and demonstrates superior performance on the new PBench benchmark. Its derivative, Falcon OCR (300M parameters), attains 80.3% accuracy on olmOCR and 88.64 on OmniDocBench.
📝 Abstract
Perception-centric systems are typically implemented with a modular encoder-decoder pipeline: a vision backbone for feature extraction and a separate decoder (or late-fusion module) for task prediction. This raises a central question: is this architectural separation essential, or can a single early-fusion stack do both perception and task modeling at scale? We introduce Falcon Perception, a unified dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer, using a hybrid attention pattern (bidirectional among image tokens, causal for prediction tokens) to combine global visual context with autoregressive, variable-length instance generation. To keep dense outputs practical, Falcon Perception retains a lightweight token interface and decodes continuous spatial outputs with specialized heads, enabling parallel high-resolution mask prediction.
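The hybrid attention pattern described above can be made concrete with a small mask-construction sketch. This is an illustrative reading of the abstract, not the paper's implementation: it assumes image tokens precede prediction tokens in the sequence, that image tokens attend only to each other bidirectionally, and that prediction tokens attend causally to everything before them (including all image tokens). The function name and layout are hypothetical.

```python
def hybrid_attention_mask(n_image: int, n_text: int) -> list[list[bool]]:
    """Build a boolean attention mask for an early-fusion sequence.

    Sequence layout (assumed): [image tokens ...][prediction tokens ...].
    mask[i][j] is True when query token i may attend to key token j.
    """
    n = n_image + n_text
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_image and j < n_image:
                # Bidirectional attention among image tokens: full visual context.
                mask[i][j] = True
            elif i >= n_image and j <= i:
                # Causal attention for prediction tokens: each sees all image
                # tokens plus previously generated prediction tokens.
                mask[i][j] = True
    return mask
```

In a real Transformer this mask would be passed to the attention operator (e.g. as `attn_mask` in a scaled-dot-product attention call) rather than materialized as nested lists; the nested-list form here is only to make the two attention regimes explicit.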
Our design promotes simplicity: we keep a single scalable backbone and shift complexity toward data and training signals, adding only small heads where outputs are continuous and dense. On SA-Co, Falcon Perception improves mask quality to 68.0 Macro-F$_1$, compared to SAM3's 62.3. We also introduce PBench, a benchmark targeting compositional prompts (OCR, spatial constraints, relations) and dense long-context regimes, where the model shows larger gains. Finally, we extend the same early-fusion recipe to Falcon OCR, a compact 300M-parameter model that attains 80.3% on olmOCR and 88.64 on OmniDocBench.