Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To preserve the ability to recall famous quotations, we additionally develop a variant of ParaPO that uses system prompts to control whether LMs should reduce regurgitation. On Llama3.1-8B, ParaPO consistently reduces regurgitation across all datasets we evaluated (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearning domain (from 17.3 to 16.9). On the instruction-tuned model Tulu3-8B, ParaPO with system prompts achieves a 27.5\% reduction in regurgitation (from 8.7 to 6.3) in creative writing, while preserving accuracy in recalling famous quotations when they are requested. In contrast, the base Tulu model with inference-time system prompts achieves only a 3.5\% reduction (from 8.7 to 8.4).
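The abstract describes preference training over (paraphrase, verbatim) pairs. The following is a minimal sketch, not the authors' released implementation, of how such pairs could be assembled and scored with a standard DPO-style objective; the `paraphrase` helper, the `prefix_len` split, and the $\beta$ value are illustrative assumptions.

```python
# Sketch (assumed setup, not the official ParaPO code): build preference pairs
# where the paraphrased continuation is "chosen" and the verbatim memorized
# continuation is "rejected", then score them with a standard DPO loss.
import torch
import torch.nn.functional as F


def build_preference_pairs(verbatim_segments, paraphrase, prefix_len=32):
    """Pair each memorized segment (rejected) with its paraphrase (chosen).

    `paraphrase` is a hypothetical helper, e.g., a call to another LM.
    """
    pairs = []
    for seg in verbatim_segments:
        prompt, completion = seg[:prefix_len], seg[prefix_len:]
        pairs.append({
            "prompt": prompt,
            "chosen": paraphrase(completion),   # preferred: non-verbatim continuation
            "rejected": completion,             # dispreferred: verbatim regurgitation
        })
    return pairs


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: raise the margin of chosen over rejected
    relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

In this sketch the sequence log-probabilities (`logp_*`) would come from the policy and frozen reference models over the completion tokens; any off-the-shelf DPO trainer that accepts prompt/chosen/rejected triples could consume the same pairs.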