Learning from Label Proportions

Learning from Label Proportions (LLP) is a weakly supervised problem: you only know the proportion of labels within each group of items (a bag), and you want to recover the labels of the individual items.
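
To make the setup concrete, here is a minimal sketch for binary labels. It is an illustration only, not code from our papers: the helper names `bag_proportions` and `proportion_loss`, the toy data, and the squared-error objective are all assumptions chosen for brevity.

```python
from collections import defaultdict


def bag_proportions(labels, bag_ids):
    """Return the fraction of positive (1) labels in each bag."""
    counts = defaultdict(lambda: [0, 0])  # bag -> [positives, bag size]
    for label, bag in zip(labels, bag_ids):
        counts[bag][0] += label
        counts[bag][1] += 1
    return {bag: pos / size for bag, (pos, size) in counts.items()}


def proportion_loss(pred_probs, bag_ids, proportions):
    """Mean squared gap between predicted and known bag proportions.

    Hypothetical bag-level objective: it compares the average predicted
    probability in each bag with that bag's known label proportion.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bag -> [sum of predictions, bag size]
    for prob, bag in zip(pred_probs, bag_ids):
        sums[bag][0] += prob
        sums[bag][1] += 1
    gaps = [
        (total / size - proportions[bag]) ** 2
        for bag, (total, size) in sums.items()
    ]
    return sum(gaps) / len(gaps)


# Toy data (assumed): item labels are hidden from an LLP learner.
labels = [1, 0, 0, 1, 1, 1]
bag_ids = ["a", "a", "a", "b", "b", "b"]

# This per-bag summary is the only supervision the learner sees.
proportions = bag_proportions(labels, bag_ids)
```

A model that matches every bag's proportion reaches zero loss under this objective, yet it may still mislabel individual items. This is part of why model selection in LLP needs care.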

In our KDD 2023 paper, we show that the dependence structure between bags, items, and labels defines distinct LLP variants, and that accounting for this structure leads to better model selection across a wide range of datasets and algorithms (Franco et al., 2023). In a follow-up preprint, we build on this result: we generate variant-specific datasets, propose guidelines for benchmarking LLP methods fairly, and use both to run an extensive benchmark of well-known algorithms (Franco et al., 2023).

Code and datasets are available at llp-variants-kdd and llp-variants-datasets-benchmarks.

References

2023

KDD
Dependence and Model Selection in LLP: The Problem of Variants

Gabriel Franco, Mark Crovella, and Giovanni Comarela

In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 2023


The problem of Learning from Label Proportions (LLP) has received considerable research attention and has numerous practical applications. In LLP, a hypothesis assigning labels to items is learned using knowledge of only the proportion of labels found in predefined groups, called bags. While a number of algorithmic approaches to learning in this context have been proposed, very little work has addressed the model selection problem for LLP. We argue that a careful approach to model selection for LLP requires consideration of the dependence structure that exists between bags, items, and labels. In this paper we formalize this structure and show how it affects model selection. We show how this leads to improved methods of model selection that we demonstrate outperform the state of the art over a wide range of datasets and LLP algorithms.
@inproceedings{10.1145/3580305.3599307,
  author    = {Franco, Gabriel and Crovella, Mark and Comarela, Giovanni},
  title     = {Dependence and Model Selection in LLP: The Problem of Variants},
  year      = {2023},
  isbn      = {9798400701030},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages     = {470--481},
  numpages  = {12},
  location  = {Long Beach, CA, USA},
  series    = {KDD '23},
  doi       = {10.1145/3580305.3599307},
  keywords  = {weakly supervised learning; learning from label proportions; hyperparameter selection}
}
arXiv
Evaluating LLP Methods: Challenges and Approaches

Gabriel Franco, Giovanni Comarela, and Mark Crovella

arXiv preprint arXiv:2310.19065, 2023


Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn individual item labels, knowing only the features of the data and the proportions of labels in each bag. Although LLP is a well-established problem, it has several unusual aspects that create challenges for benchmarking learning methods. To address these challenges, we develop methods capable of generating LLP datasets meeting the requirements of different variants, develop guidelines for benchmarking LLP algorithms, and illustrate the new methods and guidelines by performing an extensive benchmark of a set of well-known LLP algorithms.
@article{franco2023evaluating,
  title   = {Evaluating LLP Methods: Challenges and Approaches},
  author  = {Franco, Gabriel and Comarela, Giovanni and Crovella, Mark},
  journal = {arXiv preprint arXiv:2310.19065},
  year    = {2023}
}