🤖 AI Summary
This work examines commercial AI-text detectors such as GPTZero and Pangram, which often judge text generated by base large language models to be overwhelmingly human while flagging text from their instruction-tuned counterparts. Building on this observation, the authors propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines tested, HIP achieves a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Experiments across the Llama-3 and Qwen-3 model families (0.6B to 70B parameters) show that HIP consistently raises detector-judged human-likeness. The findings suggest that current detectors track artifacts of instruction tuning and local context rather than any invariant property of machine-generated text.
📝 Abstract
As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves human-likeness as judged by detectors. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.
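The iterative step of the pipeline can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: `iterative_paraphrase`, `toy_paraphraser`, and the `rounds` parameter are hypothetical names, and the toy string-replacement function merely stands in for the fine-tuned base model. In keeping with the detector-agnostic description, the loop never consults a detector.

```python
from typing import Callable, List


def iterative_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],
    rounds: int = 3,  # assumption: the abstract does not state a round count
) -> List[str]:
    """Apply `paraphrase` repeatedly and return every intermediate version.

    The first element is the original text; element i is the result after
    i rounds. No detector is queried anywhere in the loop.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    versions = [text]
    for _ in range(rounds):
        versions.append(paraphrase(versions[-1]))
    return versions


def toy_paraphraser(text: str) -> str:
    # Hypothetical placeholder for the fine-tuned base model:
    # it only swaps two stock phrases for plainer wording.
    return text.replace("utilize", "use").replace("in order to", "to")


if __name__ == "__main__":
    history = iterative_paraphrase(
        "We utilize a model in order to rewrite text.",
        toy_paraphraser,
        rounds=2,
    )
    for step, version in enumerate(history):
        print(step, version)
```

Keeping every intermediate version makes it possible to compare each round against the original afterwards, for example to measure how much meaning is preserved as the number of rounds grows.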