Benchopt: Benchmarks for ML Optimizations
attempts to address the “reproducibility crisis”:
- open datasets (OpenML)
- standardized code sharing
- benchmarks (MLPerf)
- the NeurIPS and ICLR reproducibility challenges
- new journals
optimization algorithms pervade almost every area of ML, from empirical risk minimization and variational inference to reinforcement learning; it is thus crucial to know which methods to use depending on the task and setting
- while some papers in optimization for ML provide extensive validations, many others fall short in this regard due to a lack of time and resources, and in turn feature results that are hard for other researchers to reproduce
- in addition, both performance and hardware evolve over time, which eventually makes static benchmarks obsolete
build a living, reproducible, and standardized state of the art that can serve as a foundation for future research
benchmarks are meant to evolve over time, so Benchopt offers a modular structure through which a benchmark can be easily extended with new objective functions, datasets, and solvers by adding a single file of code
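As a rough illustration of this structure, a benchmark is a folder gathering one `objective.py` file plus one file per dataset and per solver, so extending it amounts to dropping a new file into the right subfolder. The folder and file names below are hypothetical:

```
benchmark_logreg_l2/      # hypothetical benchmark folder
├── objective.py          # the Objective class: the function to minimize and report
├── datasets/
│   ├── simulated.py      # one Dataset class per file
│   └── real_data.py
└── solvers/
    ├── lbfgs.py          # one Solver class per file
    └── sgd.py
```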
three canonical problems:
- $\ell_2$-regularized logistic regression
- the Lasso
- training of a ResNet18 architecture for image classification
The Benchopt library
the considered problems are of the form
\[\theta^\star \in \argmin_{\theta\in\Theta} f(\theta; \cD, \Lambda)\]
three types of object classes:
- objective
- dataset
- solver
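As a minimal sketch (not taken from the paper) of what a solver file can look like, here is a plain gradient descent solver for the $\ell_2$-regularized logistic regression objective of the first example below. The parameter names passed to `set_objective` (here `X`, `y`, `lmbd`) and the exact return convention of `get_result` depend on the benchmark's objective and on the Benchopt version.

```python
from benchopt import BaseSolver, safe_import_context

with safe_import_context() as import_ctx:
    import numpy as np


class Solver(BaseSolver):
    """Plain gradient descent on the l2-regularized logistic loss (sketch)."""
    name = 'Python-GD'

    def set_objective(self, X, y, lmbd):
        # Quantities sent by the Objective; names are benchmark-specific.
        self.X, self.y, self.lmbd = X, y, lmbd

    def run(self, n_iter):
        X, y, lmbd = self.X, self.y, self.lmbd
        p = X.shape[1]
        # Step size 1 / L, with L an upper bound on the gradient's Lipschitz constant.
        L = np.linalg.norm(X, ord=2) ** 2 / 4 + lmbd
        theta = np.zeros(p)
        for _ in range(n_iter):
            z = y * (X @ theta)
            grad = -X.T @ (y / (1 + np.exp(z))) + lmbd * theta
            theta -= grad / L
        self.theta = theta

    def get_result(self):
        # Recent Benchopt versions expect a dict; older ones returned the array directly.
        return dict(beta=self.theta)
```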
First example: $\ell_2$-regularized logistic regression
- $X\in \IR^{n\times p}$
- $y\in \{-1, 1\}^n$
$\ell_2$-regularized logistic regression provides a GLM indexed by its parameter $\theta\in \IR^p$ to discriminate the classes by solving
\[\theta^\star =\argmin_{\theta\in \IR^p} \sum_{i=1}^n \log(1+\exp(-y_iX_i^\top\theta)) + \frac{\lambda}{2}\Vert\theta\Vert_2^2\]
where $\lambda > 0$. The objective is strongly convex with a Lipschitz-continuous gradient
- most classical method: Newton’s method
- quasi-Newton: L-BFGS (see the sketch after this list)
- truncated Newton
- methods based on stochastic estimates of the gradient: stochastic gradient descent (SGD)
  - core idea: use cheap and noisy estimates of the gradient
  - SGD generally converges either slowly due to decreasing step sizes, or to a neighborhood of the solution for constant step sizes
  - variance-reduced adaptations, such as SAG, SAGA, and SVRG
- methods based on coordinate descent
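To make the L-BFGS bullet concrete, here is a self-contained sketch that solves the problem above with `scipy.optimize.minimize(method='L-BFGS-B')` on simulated data; this is a generic illustration, not Benchopt’s own solver, and the data sizes are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, lmbd = 200, 50, 1.0
X = rng.standard_normal((n, p))
y = np.sign(rng.standard_normal(n))      # labels in {-1, +1}

def f_and_grad(theta):
    # l2-regularized logistic loss and its gradient
    z = y * (X @ theta)
    loss = np.logaddexp(0, -z).sum() + 0.5 * lmbd * theta @ theta
    grad = -X.T @ (y / (1 + np.exp(z))) + lmbd * theta
    return loss, grad

res = minimize(f_and_grad, np.zeros(p), jac=True, method='L-BFGS-B')
theta_star = res.x                        # minimizer of the regularized loss
```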
Second example: the Lasso
given $X\in \IR^{n\times p}$ and $y\in \IR^n$, the Lasso estimate solves
\[\theta^\star \in \argmin_{\theta\in \IR^p} \frac{1}{2}\Vert y - X\theta\Vert_2^2 + \lambda\Vert\theta\Vert_1\]
with $\lambda > 0$ controlling the sparsity of the solution
- iterative reweighted least-squares (IRLS)
- iterative soft-thresholding algorithm (ISTA) and its accelerated variant FISTA (sketched below)
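A minimal NumPy sketch of ISTA, i.e. proximal gradient descent where each gradient step on the quadratic term is followed by soft-thresholding (the proximal operator of $\lambda\Vert\cdot\Vert_1$); the step size and iteration count below are illustrative choices.

```python
import numpy as np

def soft_threshold(u, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista(X, y, lmbd, n_iter=500):
    # Proximal gradient descent on 0.5 * ||y - X @ theta||^2 + lmbd * ||theta||_1
    L = np.linalg.norm(X, ord=2) ** 2    # Lipschitz constant of the smooth part
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)     # gradient of the quadratic term
        theta = soft_threshold(theta - grad / L, lmbd / L)
    return theta
```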
Third example: how standard is a benchmark on ResNet18?
image classification has become a de facto standard to validate novel methods in the field.
Among the different network architectures, ResNets are extensively used in the community as they provide strong and versatile baselines
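For completeness, a stripped-down training loop for a torchvision ResNet18 might look as follows; the dataset (CIFAR-10), optimizer (SGD with momentum), and hyperparameters are assumptions made for illustration, not necessarily those used in the Benchopt benchmark.

```python
import torch
from torch import nn
from torchvision import datasets, transforms, models

# Hypothetical setup: CIFAR-10 with SGD + momentum, no data augmentation.
device = "cuda" if torch.cuda.is_available() else "cpu"
transform = transforms.Compose([transforms.ToTensor()])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

model.train()
for epoch in range(1):  # single epoch for illustration
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```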