🤖 AI Summary
This work addresses a limitation of continuous regression approaches to integer-valued labels: they disregard the labels' inherent discrete nature. To overcome this, the authors propose directly modeling the discrete probability distribution of integer labels using differentiable parameterized integer distributions—specifically, bit-level Bernoulli representations and discrete Laplace-like distributions. These formulations preserve output discreteness while enabling end-to-end gradient-based training. The method integrates bit decomposition, neural network parameterization, and backpropagation, and is empirically validated across tabular data, sequence prediction, and image generation tasks. Among the considered distributions, the bit-level and discrete Laplace-like variants consistently achieve the best overall performance, effectively balancing representational capacity with differentiability.
📝 Abstract
We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers, for example the number of up-votes on social media posts or the number of bicycles available at a public rental station. While it is possible to model these as continuous values and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question of whether such integer labels can be modeled directly by a discrete distribution whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous, so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction, and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which places exponentially decaying tails around a continuous mean.
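To make the two winning distributions concrete, here is a minimal sketch of their likelihoods. It is an illustration under assumptions, not the paper's implementation: the per-bit probabilities and the mean/scale parameters would in practice be predicted by a network head (e.g. via a sigmoid or softplus), and the discrete Laplace is normalized numerically over a finite support rather than in closed form.

```python
import math

def bitwise_log_prob(y, bit_probs):
    """Log-probability of non-negative integer y under independent per-bit
    Bernoullis. bit_probs[k] is the predicted probability that bit k of y
    is 1 (hypothetically the output of a sigmoid head of a network)."""
    logp = 0.0
    for k, p in enumerate(bit_probs):
        bit = (y >> k) & 1  # extract bit k of the target integer
        logp += math.log(p if bit else 1.0 - p)
    return logp

def discrete_laplace_pmf(mu, b, support):
    """Discrete Laplace-like pmf over the given integer support:
    weights exp(-|y - mu| / b) with a continuous mean mu and scale b,
    normalized numerically (an assumption made for this sketch)."""
    weights = [math.exp(-abs(y - mu) / b) for y in support]
    z = sum(weights)
    return [w / z for w in weights]
```

Both likelihoods are differentiable in their parameters (`bit_probs`, `mu`, `b`) even though the labels themselves stay discrete, which is exactly the property needed for gradient-based training.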