This paper proposes an improved approach to analyzing Mining Software Repositories (MSR) datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis. It extends an earlier dataset directory, created specifically for the analysis of MSR datasets, by adding new dataset annotations, enriching the metadata categories, and offering more advanced filtering options. Metadata for the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis combines Latent Dirichlet Allocation (LDA) topic modeling with statistical analysis. Dataset-level attributes were incorporated into the expanded directory, namely repository hosting site, format, accessibility, reusability, and dataset quality. The study reveals that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.