🤖 AI Summary
Existing text-Mixup methods produce samples that are often unreadable, while large language model (LLM)-enhanced approaches, though readable, offer limited controllability and are prone to manifold intrusion. To address these issues, this work proposes inversedMixup, a novel framework that aligns the output embedding space of the task model with the input embedding space of an LLM through a three-stage training process. This alignment enables, for the first time, a controllable inverse mapping from mixed embeddings back to human-readable sentences. By jointly ensuring readability and controllability, inversedMixup reveals and mitigates the manifold intrusion phenomenon inherent in textual Mixup. Experimental results demonstrate that inversedMixup significantly improves data augmentation performance in both few-shot and fully supervised settings, confirming its effectiveness and strong generalization capability.
📝 Abstract
Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable mixing ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods generate sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and thereby bridges the gap between the latent embedding space and the discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Once aligned, inversedMixup can decode mixed embeddings, produced with a controllable mixing ratio, into human-interpretable augmented sentences, thereby improving augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
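The embedding-level interpolation that the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `mixup` helper, the toy 4-dimensional embeddings, and the Beta-sampled ratio (a common Mixup convention) are all assumptions for the sake of the example.

```python
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, lam):
    """Linearly interpolate two embeddings and their (one-hot) labels
    with a controllable mixing ratio lam in [0, 1]."""
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label

# Toy example: two 4-d sentence embeddings, one-hot labels over 2 classes.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=4), rng.normal(size=4)
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lam = rng.beta(0.2, 0.2)  # ratio commonly drawn from a Beta distribution
mixed_emb, mixed_label = mixup(emb_a, emb_b, label_a, label_b, lam)
```

The mixed embedding is what inversedMixup would then map back to a readable sentence via the aligned LLM; the mixed (soft) label supervises the task model.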