Recently hyped ML content linked in one simple page
Sources: reddit/r/{MachineLearning,datasets}, arxivsanity, twitter, kaggle/kernels, hackernews, awesomedatasets, sota changes
Made by: Deep Phrase HK Limited


[1910.05104] Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning
Abstract: We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension. While the method still obeys a slow $\varepsilon^{-2}$ convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speedup and a convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent on problems where the sample size is limited, such as few-shot or adversarial learning.
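The randomized smoothing at the heart of PPRS replaces a non-smooth objective $f$ with its Gaussian-smoothed version $f_\gamma(\theta) = \mathbb{E}_X[f(\theta + \gamma X)]$, whose gradient can be estimated by averaging (sub)gradients at perturbed points $\theta + \gamma X_k$, as in the paper's Figure 4. A minimal sketch of that Monte Carlo estimator, using $f(\theta) = \|\theta\|_1$ as a stand-in non-smooth objective (the function, sample count `K`, and smoothing radius `gamma` are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def subgrad_f(theta):
    # Subgradient of the illustrative non-smooth objective f(theta) = ||theta||_1.
    return np.sign(theta)

def smoothed_grad(theta, gamma=0.1, K=100):
    # Monte Carlo estimate of the gradient of the smoothed objective
    # f_gamma(theta) = E[f(theta + gamma * X)], with X ~ N(0, I).
    X = rng.standard_normal((K, theta.size))
    grads = np.array([subgrad_f(theta + gamma * x) for x in X])
    return grads.mean(axis=0)
```

Far from the kink the estimate matches the true gradient (e.g. `smoothed_grad(np.array([10.0, 10.0]))` is essentially `[1., 1.]`), while near zero it interpolates smoothly instead of jumping; in PPRS the $K$ perturbed gradient evaluations are the units that get pipelined across layers.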
Figure 1: Example of a computation graph for f(x, ω) = ln(1 + e^{x/2}) + x/2 − ω sin(x) (pipeline parallel optimization setting).
Figure 2: Forward pass.
Figure 3: Backward pass.
Figure 4: Bubbling scheme used at each iteration of PPRS in the case of a sequential neural network; cell (i, k) indicates the computation of the forward pass (resp. backward pass) for ∇f_i(θ + γX_k) (optimal algorithm).
Figure 5: Comparison with GD and AGD; increasing the number of samples increases the stability of PPRS and allows for faster convergence rates. Depth: (a) moderate, ∆ = 20; (b) high, ∆ = 200 (experiments).



Related: TF-IDF
[1702.08704] Optimal algorithms for smooth and strongly convex distributed optimization in networks
[1706.10207] Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
[1906.06821] A Survey of Optimization Methods from a Machine Learning Perspective
[1909.03550] Lecture Notes: Optimization for Machine Learning
[1706.04769] Stochastic Training of Neural Networks via Successive Convex Approximations
[1505.04824] An Asynchronous Mini-Batch Algorithm for Regularized Stochastic Optimization
[1610.07448] A Framework for Parallel and Distributed Training of Neural Networks
[1411.0972] Convex Optimization for Big Data
[1805.05189] Randomized Smoothing SVRG for Large-scale Nonsmooth Convex Optimization
[1905.10018] Momentum-Based Variance Reduction in Non-Convex SGD
