🤖 AI Summary
Most existing LoRA variants modify the update rule within each layer's own weight space, so the intermediate representations formed by deeper layers go largely unused during parameter-efficient fine-tuning. This work proposes Echo-LoRA, which introduces a cross-layer representation injection mechanism: boundary hidden states from deeper source layers are aggregated into a sample-level echo representation, which lightweight projection and gating networks then inject into shallow LoRA or DoRA modules. During training, answer-only masking, masked distillation, and stochastic routing are jointly employed to stabilize this auxiliary path and reduce the gap between training and inference. The Echo path is discarded after training, so the deployed model keeps the original low-rank LoRA/DoRA form and adds no inference-time parameters or computation. On eight commonsense reasoning benchmarks across LLaMA-7B, LLaMA2-7B, and LLaMA3-8B, Echo-LoRA exceeds the reported LoRA baselines by 5.7 percentage points on average; against LoRA baselines reproduced in the authors' unified implementation the average gain is 3.0 points, and when combined with DoRA the gain is 2.7 points.
📝 Abstract
Parameter-efficient fine-tuning (PEFT) has become a practical route for adapting large language models to downstream tasks, with LoRA-style methods being particularly attractive because they are inexpensive to train and easy to deploy. Most LoRA variants, however, revise the update rule within the weight space of each layer and leave the intermediate representations formed by deeper layers largely unused. We propose Echo-LoRA, a cross-layer representation injection method for parameter-efficient fine-tuning. During training, Echo-LoRA collects boundary hidden states from deeper source layers, aggregates them into a sample-level echo representation, and uses lightweight projection and gating networks to inject the resulting signal into shallow LoRA or DoRA modules. Answer-only masking, masked distillation, and stochastic routing are used to keep this auxiliary path stable and to reduce the gap between training and inference. On eight commonsense reasoning benchmarks, Echo-LoRA exceeds the reported LoRA baselines by 5.7 percentage points on average across LLaMA-7B, LLaMA2-7B, and LLaMA3-8B. Under reproduced LoRA baselines in our unified implementation, the average gain is 3.0 points; when combined with DoRA, the gain is 2.7 points. The Echo path is discarded after training, so the deployed model keeps the original low-rank LoRA/DoRA form and adds neither inference-time parameters nor inference computation.
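The training-time flow described in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not give the exact equations, so everything specific here is an assumption made for illustration: mean pooling over answer tokens as the aggregation, adding the gated projection inside the rank-r bottleneck as the injection point, a scalar sigmoid gate, and the hypothetical names `echo_vector`, `lora_forward`, `training_forward`, `P`, `w_gate`, and `p_echo`. Masked distillation is omitted.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def echo_vector(deep_states, answer_mask):
    """Assumed aggregation: mean-pool deep hidden states over answer tokens only."""
    kept = [h for h, keep in zip(deep_states, answer_mask) if keep]
    if not kept:
        return [0.0] * len(deep_states[0])
    return [sum(col) / len(kept) for col in zip(*kept)]

def lora_forward(x, W, A, B, echo=None, P=None, w_gate=None):
    """Return W x + B z with z = A x; if an echo is given, shift z by a gated projection."""
    z = matvec(A, x)  # rank-r LoRA bottleneck
    if echo is not None:
        # Assumed gate: scalar sigmoid of a learned linear score of the echo.
        gate = 1.0 / (1.0 + math.exp(-sum(w * e for w, e in zip(w_gate, echo))))
        z = [zi + gate * pi for zi, pi in zip(z, matvec(P, echo))]
    return [b + d for b, d in zip(matvec(W, x), matvec(B, z))]

def training_forward(x, W, A, B, echo, P, w_gate, p_echo, rng):
    """Stochastic routing: take the echo path with probability p_echo, else plain LoRA."""
    if rng.random() < p_echo:
        return lora_forward(x, W, A, B, echo, P, w_gate)
    return lora_forward(x, W, A, B)

# Toy example: hidden size 2, LoRA rank 1, three tokens (the last two are answer tokens).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 1.0]]               # LoRA down-projection (r x d)
B = [[0.5], [0.0]]             # LoRA up-projection (d x r)
P = [[1.0, 0.0]]               # echo projection into the rank-r space (assumed shape)
w_gate = [0.0, 0.0]            # gate weights; zeros give gate = 0.5
x = [1.0, 2.0]
deep_states = [[9.0, 9.0], [2.0, 0.0], [4.0, 0.0]]
answer_mask = [False, True, True]

echo = echo_vector(deep_states, answer_mask)        # [3.0, 0.0]
with_echo = lora_forward(x, W, A, B, echo, P, w_gate)  # [3.25, 2.0]
inference = lora_forward(x, W, A, B)                # [2.5, 2.0], plain LoRA
```

In this sketch the inference call never touches `P`, `w_gate`, or the deep hidden states, which mirrors the abstract's statement that the deployed model keeps the plain low-rank form. Mixing both routes during training (via `p_echo`) is one way to keep the adapter usable once the echo path is removed.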