hartmann25@interspeech_2025@ISCA

#1 Reddit FlairShare: A Human-Annotated Dataset of Gender-Progressive Online Discourse

Author: Carlos Hartmann

This paper presents a large-scale dataset of Reddit comments by users who declare pronouns in their user flairs, offering a new resource for studying linguistic identity, gender expression, and digital discourse. Totaling 72 million tokens, it contains all comments by pronoun-declaring users, presenting a broader view of their language use than previous corpora that selected isolated utterances. The dataset enables research across multiple domains, including (online) sociolinguistics, natural language processing (NLP), and other social sciences. It facilitates the study of pronoun-sharing behavior, the distribution and adoption of non-binary pronouns, and the use of mixed pronouns in online discourse. Future work can expand the dataset to capture rarer pronoun declarations; nevertheless, it provides a highly curated, valuable foundation for the study of online gender expression and discourse, innovative language, and identity performance in digital spaces.

Subject: INTERSPEECH.2025 - Modelling and Learning