Bartalis Dávid: Submodular selection for data summarization

Individual project, professional practice III

2021/22, semester I

Supervisors:
Béres Ferenc (SZTAKI, Informatics Laboratory)
Bérczi-Kovács Erika Renáta (ELTE, Department of Operations Research)

Machine learning models, especially deep neural networks, tend to perform much better when they are trained on large data sets. Unfortunately, with millions of training examples, the training time grows as well. Submodular selection is a technique that selects representative subsets from large data sets and offers theoretical guarantees on the quality of the selected sample. With a small representative subset, it therefore has the potential to make learning significantly faster while reaching accuracy comparable to training on the full data set.

Submodularity captures the diminishing returns property of set functions and has several applications in machine learning [2]. Schreiber et al. developed apricot, a submodular selection framework in Python that implements a facility location approach as well as a feature-based approach [1]. One possible direction for future work is to implement additional submodular selection methods in apricot that scale well to large data sets. It would also be interesting to see how these methods perform on temporal data sets with concept drift, where the best representative subset may change over time.
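
To make the diminishing returns property and the facility location idea more concrete, here is a minimal sketch in plain Python. It is not apricot's implementation: the function names (gaussian_similarity, facility_location, greedy_select), the Gaussian similarity, and the toy data are all assumptions made for illustration. It uses the naive greedy strategy, which for monotone submodular functions under a size constraint is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
import math


def gaussian_similarity(points):
    """Pairwise similarity exp(-(a - b)^2) for 1-D points (values in (0, 1])."""
    return [[math.exp(-((a - b) ** 2)) for b in points] for a in points]


def facility_location(similarity, subset):
    """f(S): sum over all items of their similarity to the most similar item in S."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in similarity)


def greedy_select(similarity, k):
    """Greedily pick k indices (assumes k <= number of items).

    In every round the item with the largest marginal gain is added.
    """
    selected = []
    remaining = list(range(len(similarity)))
    for _ in range(k):
        base = facility_location(similarity, selected)
        best = max(
            remaining,
            key=lambda j: facility_location(similarity, selected + [j]) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: two well-separated groups of 1-D points (made up for illustration).
points = [0.0, 0.1, 0.3, 5.0, 5.3, 5.4]
similarity = gaussian_similarity(points)

selected = greedy_select(similarity, k=2)
print("selected indices:", selected)
print("selected points:", [points[i] for i in selected])

# Diminishing returns: adding item 5 helps less once item 4 is already chosen.
gain_small = facility_location(similarity, [1, 5]) - facility_location(similarity, [1])
gain_large = facility_location(similarity, [1, 4, 5]) - facility_location(similarity, [1, 4])
print("gain on smaller set:", gain_small)
print("gain on larger set:", gain_large)
```

On this toy input the two selected points come from different groups, which is the sense in which the subset is representative. The sketch re-evaluates the objective for every candidate in every round, so it is only suitable for small inputs; for real data sets a dedicated library such as apricot is the more reasonable choice.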

References

  1. Schreiber et al.: apricot: Submodular selection for data summarization in Python, 2019, https://github.com/jmschrei/apricot
  2. Krause, Golovin: Submodular Function Maximization, in Tractability, Cambridge University Press, pp. 71-104, 2014

Dimensionality reduction in natural language processing (NLP) tasks, with a comparison of performance and running time against already implemented statistical and linear-algebra-based methods. We would also like to work on optimizing the parameters of the submodular functions used.