Learning from Label Proportions

Model selection and benchmarking for a weakly supervised setting where only bag-level label proportions are known.

Learning from Label Proportions (LLP) is a weakly supervised problem: you only know the proportion of labels within each group of items (a bag), and you want to recover the labels of the individual items.
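To make the setup concrete, here is a minimal sketch of what an LLP learner actually observes. The toy data, the binary labels, and the helper name `bag_proportions` are illustrative assumptions, not taken from the papers or the released code.

```python
from collections import defaultdict

def bag_proportions(labels, bag_ids):
    """Return {bag_id: fraction of positive labels in that bag}."""
    counts = defaultdict(lambda: [0, 0])  # bag -> [positives, bag size]
    for y, b in zip(labels, bag_ids):
        counts[b][0] += y
        counts[b][1] += 1
    return {b: pos / n for b, (pos, n) in counts.items()}

# Toy data (hypothetical): 8 items with binary labels, split across 3 bags.
hidden_labels = [1, 0, 0, 1, 1, 1, 0, 0]   # never shown to the learner
bag_ids       = [0, 0, 0, 0, 1, 1, 2, 2]   # known for every item

# The learner sees only the items, their bag assignments, and these proportions.
proportions = bag_proportions(hidden_labels, bag_ids)
print(proportions)  # {0: 0.5, 1: 1.0, 2: 0.0}
```

Bags 1 and 2 are pure, so their item labels follow directly from the proportions. Bag 0 is mixed, so its proportion alone does not say which two of the four items are positive.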

In our KDD 2023 paper, we show that the dependence structure between bags, items, and labels defines distinct LLP variants, and that accounting for it leads to better model selection across a wide range of datasets and algorithms (Franco et al., 2023). In a follow-up arXiv preprint, we build on this result: we generate variant-specific datasets, propose guidelines for benchmarking LLP methods fairly, and then use them to run an extensive benchmark of well-known algorithms (Franco et al., 2023).

Code and datasets are available at llp-variants-kdd and llp-variants-datasets-benchmarks.

References

2023

  1. KDD
    Dependence and Model Selection in LLP: The Problem of Variants
    Gabriel Franco, Mark Crovella, and Giovanni Comarela
    In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 2023
  2. arXiv
    Evaluating LLP Methods: Challenges and Approaches
    Gabriel Franco, Giovanni Comarela, and Mark Crovella
    arXiv preprint arXiv:2310.19065, 2023