EUSkm2sVJ6@OpenReview

How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning

Authors: Yao Tong, Jiayuan Ye, Sajjad Zarifzadeh, Reza Shokri

How much of a given dataset was used to train a machine learning model? This is a critical question for data owners assessing the risk of unauthorized data usage and protecting their rights (United States Code, 1976). However, previous work mistakenly treats this as a binary problem (inferring whether "all or none" or "any or none" of the data was used), which is fragile when faced with real, non-binary data usage risks. To address this, we propose a fine-grained analysis called Dataset Usage Cardinality Inference (DUCI), which estimates the exact proportion of data used. Our algorithm, leveraging debiased membership guesses, matches the performance of the optimal MLE approach (with a maximum error below 0.1) at a significantly lower computational cost (e.g., 300× lower).
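
The idea of debiasing membership guesses can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each record receives a binary membership guess from some membership inference attack, and that the attack's true-positive rate (`tpr`) and false-positive rate (`fpr`) are already known, for example from reference models. Under those assumptions the expected fraction of positive guesses is `p * tpr + (1 - p) * fpr`, which can be inverted to recover the used proportion `p`. The function names and the simulation helper are hypothetical.

```python
import random


def debiased_usage_proportion(guesses, tpr, fpr):
    """Estimate the fraction of a dataset used in training.

    guesses: iterable of 0/1 membership guesses, one per record.
    tpr, fpr: assumed-known true/false positive rates of the attack.
    """
    guesses = list(guesses)
    if not guesses:
        raise ValueError("need at least one membership guess")
    if tpr <= fpr:
        raise ValueError("attack must be better than chance (tpr > fpr)")
    raw_rate = sum(guesses) / len(guesses)
    # Invert E[raw_rate] = p * tpr + (1 - p) * fpr to solve for p.
    estimate = (raw_rate - fpr) / (tpr - fpr)
    # Clip to [0, 1], since sampling noise can push the estimate outside.
    return min(1.0, max(0.0, estimate))


def simulate_guesses(n, p, tpr, fpr, seed=0):
    """Hypothetical helper: simulate noisy guesses for n records,
    of which a fraction p (rounded to whole records) was actually used."""
    rng = random.Random(seed)
    n_used = round(n * p)
    guesses = []
    for i in range(n):
        is_member = i < n_used
        hit_rate = tpr if is_member else fpr
        guesses.append(1 if rng.random() < hit_rate else 0)
    return guesses


if __name__ == "__main__":
    g = simulate_guesses(n=10_000, p=0.3, tpr=0.7, fpr=0.1)
    print("naive positive rate:", sum(g) / len(g))
    print("debiased estimate:  ", debiased_usage_proportion(g, 0.7, 0.1))
```

The naive positive rate is biased toward the attack's error rates, while the debiased estimate should land near the true proportion when `tpr` and `fpr` are accurate. The paper's own procedure may debias and aggregate differently; this sketch only shows why a correction step is needed.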

Subject: ICLR.2025 - Oral