🤖 AI Summary
This work addresses a limitation of large language models (LLMs), which typically rely solely on their final-layer representations for prediction, thereby overlooking potentially superior task-relevant features embedded in intermediate layers. To overcome this, the authors propose Inter-Layer Structural Encoders (ILSE), a framework that structurally fuses representations from all intermediate layers to construct more powerful predictive representations. The key innovation is the Cayley-Encoder, a geometric encoder that leverages the expander topology of Cayley graphs to enable efficient cross-layer information propagation and adaptively learns a task-specific combination of layers. Evaluated on 13 classification and semantic similarity benchmarks with nine pretrained LLMs spanning 14M to 8B parameters, ILSE consistently outperforms baselines, achieving up to a 44% gain in accuracy and a 25% improvement in similarity metrics, with particularly strong performance in few-shot settings.
📝 Abstract
The standard practice in Large Language Models (LLMs) is to base predictions on the final-layer token representations. Recent studies, however, show that intermediate layers encode substantial information and may contain more task-relevant features than the final layer alone. Importantly, prior work has shown that the optimal layer can differ across tasks. In this work, we introduce Inter-Layer Structural Encoders (ILSE), a structural approach that jointly learns a single effective representation from all of the LLM's internal layer representations. Central to ILSE is the Cayley-Encoder, a mathematically grounded geometric encoder that leverages expander Cayley graphs for efficient inter-layer information propagation. We evaluate ILSE on 13 classification and semantic similarity tasks with 9 pre-trained LLMs ranging from 14 million to 8 billion parameters. ILSE consistently outperforms baselines and existing approaches, achieving up to 44% improvement in accuracy and 25% in similarity metrics. We further show that ILSE is data-efficient in few-shot regimes and can make small LLMs competitive with substantially larger models.
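The core idea, fusing per-layer representations by message passing over a Cayley graph whose nodes are the LLM's layers, can be sketched as follows. This is a minimal illustrative toy, not the paper's method: the circulant Cayley graph of Z_n, the generator set, the mixing weights, and the mean-pooling readout are all assumptions made here for clarity; the actual Cayley-Encoder uses a learned, mathematically specified construction not reproduced in this abstract.

```python
import numpy as np

def cayley_adjacency(n, generators):
    """Adjacency matrix of the Cayley graph of Z_n with the given generators.

    NOTE: Z_n with a small generator set is only a stand-in; the paper's
    expander Cayley graph construction is not specified in the abstract.
    """
    A = np.zeros((n, n))
    for i in range(n):
        for g in generators:
            A[i, (i + g) % n] = 1.0  # edge to i + g
            A[i, (i - g) % n] = 1.0  # edge to i - g (graph is undirected)
    return A

def fuse_layers(layer_reprs, generators=(1,), rounds=2):
    """Hypothetical inter-layer fusion: propagate information between layer
    representations along Cayley-graph edges, then pool to one vector."""
    H = np.stack(layer_reprs)                    # (num_layers, d)
    A = cayley_adjacency(len(layer_reprs), generators)
    A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalized propagation
    for _ in range(rounds):
        H = 0.5 * H + 0.5 * A_hat @ H            # mix each node with neighbors
    return H.mean(axis=0)                        # single fused representation
```

Because Cayley graphs are vertex-transitive (hence regular), the normalized propagation matrix above is doubly stochastic, so each round is a smoothing step that redistributes information across layers without shifting the pooled mean; the expander property is what lets information from any layer reach any other in few rounds.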