Research Statement


Below is a statement by Predrag Radivojac regarding my contributions to our paper, “Fast Nonparametric Estimation of Class Proportions in the Positive-Unlabeled Classification Setting.”

Dan published a first-author paper at AAAI 2020 entitled “Fast nonparametric estimation of class proportions in the positive-unlabeled classification setting,” in which he developed and evaluated a new algorithm for estimating class priors in positive-unlabeled learning. Positive-unlabeled learning is a binary classification framework in which all labeled examples are positive and the unlabeled data contain a mix of positive and negative examples. The goal is to estimate the fraction of positives in the unlabeled data, assuming that the labeled data are an i.i.d. sample from the positive class-conditional distribution. Class prior estimation is one of the keys to making progress in positive-unlabeled learning, especially if one is concerned with estimating the posterior distributions. Although Dan’s method is not the first in the field (the first appeared in 2008), it is thus far not just the most accurate but also the simplest algorithm. It is also the fastest among those that have reasonably high accuracy.
The main methodological idea is to pick a positive example from the set of positives, find the unlabeled example closest to it, remove that example from the unlabeled data, and repeat until the unlabeled set is empty. The recorded distances between each sampled positive and its closest unlabeled example then form a curve from which one can hope to estimate class priors.
We had a rough idea about this algorithm before Dan joined the group; however, there were non-trivial challenges to be solved to actually develop a working algorithm. The biggest problems were that it was not clear how to treat high-dimensional data, how to do model selection, or how to accurately estimate class priors from the observed curve of recorded distances. We also weren’t quite sure how to do evaluation under the constraint that we needed to do model selection too. All these things had to be figured out if the algorithm was to be successful. Dan successfully solved all of them.
The paper was a close collaboration between Dan, Shantanu Jain, and myself (all authors), so many steps were decided through discussion with some back-and-forth. Dan also wrote parts of the paper. Overall, this was useful work and a challenging task. Dan demonstrated the potential to be a successful researcher as we carried out this work.
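
As a rough illustration of the main idea described in the statement above, the distance-recording loop can be sketched in a few lines of Python. This is only a minimal sketch and not the implementation from the paper: the function name `distance_curve`, the use of Euclidean distance, and uniform sampling of positives with replacement are all assumptions made for illustration. It stops at producing the curve and does not attempt the harder parts mentioned in the statement (handling high-dimensional data, model selection, or estimating the class prior from the curve).

```python
import numpy as np


def distance_curve(positives, unlabeled, seed=0):
    """Record nearest-neighbor distances while emptying the unlabeled set.

    Hypothetical helper for illustration only. Assumes Euclidean distance
    and uniform sampling of positives with replacement.
    """
    rng = np.random.default_rng(seed)
    positives = np.asarray(positives, dtype=float)
    remaining = np.asarray(unlabeled, dtype=float).copy()
    distances = []
    while len(remaining) > 0:
        # Pick one positive example at random.
        p = positives[rng.integers(len(positives))]
        # Distance from that positive to every remaining unlabeled example.
        d = np.linalg.norm(remaining - p, axis=1)
        j = int(np.argmin(d))
        # Record the distance to the closest unlabeled example, then remove it.
        distances.append(float(d[j]))
        remaining = np.delete(remaining, j, axis=0)
    return np.array(distances)


if __name__ == "__main__":
    # Synthetic demo: positives near 0, negatives near 5, 30% positives
    # in the unlabeled set (all values here are made up for the demo).
    rng = np.random.default_rng(1)
    pos = rng.normal(0.0, 1.0, size=(200, 2))
    unl = np.vstack([
        rng.normal(0.0, 1.0, size=(30, 2)),
        rng.normal(5.0, 1.0, size=(70, 2)),
    ])
    curve = distance_curve(pos, unl)
    print(len(curve), curve[:5], curve[-5:])
```

The curve has one entry per unlabeled example. In a well-separated toy setting like the demo, one would expect the recorded distances to stay small while positives remain in the unlabeled set and to grow once mostly negatives are left, which is the kind of signal a class prior estimate could be read from.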