Machine learning techniques paired with the availability of massive datasets
dramatically enhance our ability to explore the chemical compound space by
providing fast and accurate predictions of molecular properties. However,
learning on large datasets is strongly limited by the availability of
computational resources and can be infeasible in some scenarios. Moreover, the
instances in the datasets may not yet be labelled and generating the labels can
be costly, as in the case of quantum chemistry computations. Thus, there is a
need to select small training subsets from large pools of unlabelled data
points and to develop reliable ML methods that can effectively learn from small
training sets. This work focuses on predicting the molecules atomization energy
in the QM9 dataset. We investigate the advantages of employing domain
knowledge-based data sampling methods for an efficient training set selection
combined with informed ML techniques. In particular, we show how maximizing
molecular diversity in the training set selection process increases the
robustness of linear and nonlinear regression techniques such as kernel methods
and graph neural networks. We also check the reliability of the predictions
made by the graph neural network with a model-agnostic explainer based on the
rate distortion explanation framework.