🤖 AI Summary
Lightweight semantic segmentation faces the fundamental challenge of balancing representational capacity with computational efficiency, as existing approaches are constrained by rigid architectures and implicit modeling—often relying on computationally expensive vision transformers. To address this, we propose an explicit-implicit collaborative modeling paradigm: (1) explicit Cartesian-directional views to incorporate geometric priors for long-range contextual modeling, and (2) a nested attention mechanism enabling efficient multi-scale contextual dependency capture with minimal parameters. Integrated with lightweight feature interaction and multi-scale aggregation, our model achieves state-of-the-art accuracy on ADE20K and Cityscapes while reducing FLOPs by 38% and model parameters by 52% compared to prior arts. This yields an optimal trade-off between real-time inference capability and segmentation quality.
📝 Abstract
Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.