This article examines LLMs’ ability to correctly label simple inferences with partisan conclusions. To this end, we develop a dataset of both formal and material inferences, containing logically equivalent pairs of inferences whose conclusions favor either the political left or the political right. Because the inferences in each pair are logically equivalent, this design lets us isolate political bias as a source of decreased performance. The samples are synthetically generated, and thus highly controlled, and cover both English and German. We evaluate 16 configurations of open and proprietary state-of-the-art LLMs on this dataset and find generally unreliable performance as well as widespread political bias, which for the English samples persists across all of our experimental settings.
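To make the paired design concrete, the following is a minimal sketch of how a performance difference between left-favoring and right-favoring conclusions could be quantified. It is an illustration under stated assumptions, not the evaluation procedure used in this article: the `Sample` fields, the binary valid/invalid labels, and the `bias_gap` measure are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All field names are hypothetical, chosen for this illustration only.
    pair_id: int      # both members of a pair share the same logical form
    leaning: str      # "left" or "right": the side the conclusion favors
    gold_valid: bool  # assumed ground-truth label of the inference
    pred_valid: bool  # label assigned by the model


def accuracy_by_leaning(samples):
    """Return the share of correctly labeled samples per political leaning."""
    totals = {"left": 0, "right": 0}
    correct = {"left": 0, "right": 0}
    for s in samples:
        totals[s.leaning] += 1
        correct[s.leaning] += int(s.gold_valid == s.pred_valid)
    return {side: correct[side] / totals[side] for side in totals if totals[side]}


def bias_gap(samples):
    """Accuracy on left-favoring minus accuracy on right-favoring samples.

    Assumed measure: since paired inferences are logically equivalent,
    an unbiased labeler should score the same on both sides (gap of 0).
    """
    acc = accuracy_by_leaning(samples)
    return acc["left"] - acc["right"]


# Toy data: two pairs, one error on a right-favoring conclusion.
samples = [
    Sample(1, "left", True, True),
    Sample(1, "right", True, False),
    Sample(2, "left", False, False),
    Sample(2, "right", False, False),
]

print(accuracy_by_leaning(samples))
print(bias_gap(samples))
```

In this toy data the model labels every left-favoring inference correctly but misses one right-favoring inference, so the gap is positive. A gap near zero would indicate no measurable difference between the two sides under this simplified measure.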