Chapter 10 Comparison
We consider DSGT (Pu and Nedić 2018), DSGD (Pu, Olshevsky, and Paschalidis 2019a), DGD, EXTRA (Shi et al. 2015), D-PSGD (Lian et al. 2017), and \(D^2\) (Tang et al. 2018) here. Generally, EXTRA, DSGT, and \(D^2\) achieve better convergence properties because they add correction terms to the basic update. The following table lists the schemes of these algorithms; a small simulation sketch of the DSGD and DSGT updates follows the table.
Name | Scheme |
---|---|
c-SGD | \(x_{k+1}=x_k-\alpha_k\tilde g(k)\) |
DGD | \(X_{k+1}=WX_k-\alpha_k\nabla f(X_k)\) |
D-PSGD | \(X_{k+1}=WX_{k}-\alpha G_k\) |
EXTRA | \(X_{k+1}=WX_k-\alpha \nabla f(X_k)+\sum\limits_{t=0}^{k-1}(W-\tilde W)X_t\) |
DSGD | \(X_{k+1}=W(X_k-\alpha_{k}G_k)\) |
DSGT | \[X_{k+1}=W(X_k-\alpha_k Y_k)\\Y_{k+1}=WY_k+G_{k+1}-G_k\] |
\(D^2\) | \[X_1=W(X_0-\alpha G_0)\\X_{k+1}=W(X_k-\alpha G_k)+W(X_k-X_{k-1}-\alpha G_{k-1})\] |
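To make the table concrete, here is a minimal runnable sketch of the DSGD and DSGT updates on a toy decentralized least-squares problem. Everything problem-specific is my own assumption for illustration (local objectives \(f_i(x)=\frac{1}{2m}\Vert A_ix-b_i\Vert^2\), single-sample stochastic gradients, a ring topology \(W\), and the constants), not the settings of the cited papers.

```python
# Toy sketch (not from the papers): n agents on a ring, each with local
# f_i(x) = 1/(2m) ||A_i x - b_i||^2 and single-sample stochastic gradients.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 20                      # agents, dimension, samples per agent
A = rng.normal(size=(n, m, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=(n, m))

# Ring topology with lazy weights: symmetric and doubly stochastic.
W = np.eye(n) * 0.5
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

def stoch_grad(X):
    """Single-sample stochastic gradient of each local f_i at that agent's iterate."""
    G = np.zeros((n, d))
    for i in range(n):
        j = rng.integers(m)
        G[i] = (A[i, j] @ X[i] - b[i, j]) * A[i, j]
    return G

alpha, K = 0.02, 2000                   # fixed stepsize; a diminishing one also fits
x_star = np.linalg.lstsq(A.reshape(n * m, d), b.reshape(n * m), rcond=None)[0]

# DSGD: X_{k+1} = W (X_k - alpha_k G_k)
X = np.zeros((n, d))
for k in range(K):
    X = W @ (X - alpha * stoch_grad(X))
print("DSGD mean distance to x*:", np.linalg.norm(X - x_star, axis=1).mean())

# DSGT: X_{k+1} = W (X_k - alpha_k Y_k),  Y_{k+1} = W Y_k + G_{k+1} - G_k
X = np.zeros((n, d))
G = stoch_grad(X)
Y = G.copy()                            # gradient tracker initialized at G_0
for k in range(K):
    X = W @ (X - alpha * Y)
    G_new = stoch_grad(X)
    Y = W @ Y + G_new - G
    G = G_new
print("DSGT mean distance to x*:", np.linalg.norm(X - x_star, axis=1).mean())
```

The loops implement the two update rules exactly as written in the table; only the data, the mixing matrix, and the stepsize are made up.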
Thus \(nR'(k+1)=E\left[\Vert X_{k+1}-\mathbf{1}x^*\Vert^2\right]\) is affected by \[ E\left[\Vert H-\alpha_k G_k\Vert^2\right]. \] Similarly, for DSGD, \(nR'(k+1)\) is affected by \[ E\left[\Vert\alpha_k G_k\Vert^2 \right]=\alpha_k^2\, E\left[\Vert G_k\Vert^2 \right]. \]
We can see that a sequence of diminishing \(\alpha_k\) drives \(E\left[\Vert\alpha_k G_k\Vert^2\right]=\alpha_k^2E\left[\Vert G_k\Vert^2 \right]\) to zero as \(k\) gets large, which explains why diminishing stepsizes improve the fixed-stepsize scheme to exact convergence. On the other hand, we can also design a distributed algorithm such that \[\begin{equation} E\left[\Vert H-\alpha_k G_k\Vert^2\right]\leq E\left[\Vert\alpha_k G_k\Vert^2 \right] \tag{10.2} \end{equation}\] while preserving the consensus property.
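To make the diminishing-stepsize argument above explicit, suppose in addition (an assumption not stated above) that the stochastic gradients have a uniformly bounded second moment, \(E\left[\Vert G_k\Vert^2\right]\leq nG^2\) for some constant \(G\). Then the schedule \(\alpha_k=\frac{\theta}{m+k}\) used later for DSGT gives \[ E\left[\Vert\alpha_k G_k\Vert^2\right]=\alpha_k^2\,E\left[\Vert G_k\Vert^2\right]\leq\frac{\theta^2 nG^2}{(m+k)^2}\to 0, \qquad \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\frac{\theta}{m+k}=\infty, \] so the perturbation injected at each step vanishes while the stepsizes remain non-summable, which is what permits exact convergence instead of convergence to a neighborhood.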
10.1 Assumptions of different schemes
The following table summarizes the assumptions needed for each scheme. The columns \(\sigma^2\) and \(\zeta^2\) denote the bounded-variance conditions \[ E_{\xi }\left\|g_{i}(x ; \xi)-\nabla f_{i}(x)\right\|^{2} \leqslant \sigma^{2}, \quad \forall i, \forall x \] and \[ \frac{1}{n} \sum_{i=1}^{n}\left\|\nabla f_{i}(x)-\nabla f(x)\right\|^{2} \leqslant \zeta^{2}, \quad \forall i, \forall x \] respectively.
Here \(\zeta_0=\frac{1}{n} \sum\limits_{i=1}^{n}\left\|\nabla f_{i}(\mathbf{0})-\nabla f(\mathbf{0})\right\|^{2}\), and \(\rho_w\) is the spectral norm of \(W-\frac{\mathbf{1}\mathbf{1}^T}{n}\).
Name | \(f_i(x)\) | \(\nabla f_i(x)\) | \(W\) | Spectral gap | \(\sigma^2\) | \(\zeta^2\) |
---|---|---|---|---|---|---|
c-SGD | \(\backslash\) | Lipschitz continuous | \(\backslash\) | \(\backslash\) | \(\backslash\) | \(\backslash\) |
DGD | \(\backslash\) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\rho_w<1\) | \(\backslash\) | \(\backslash\) |
D-PSGD | \(\backslash\) | Lipschitz continuous | Sym and doubly stochastic | \(\lambda_2^2<1\) | Y | Y |
EXTRA | convex (\(f\) strongly convex leads to better rate) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\lambda_\min(W)\geq0\) | \(\backslash\) | \(\backslash\) |
DSGD | strongly convex | Lipschitz continuous | Sym and doubly stochastic | \(\rho_w<1\) | Y | N |
DSGT | strongly convex | Lipschitz continuous | doubly stochastic | \(\rho_w<1\) | Y | N |
\(D^2\) | \(\backslash\) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\lambda_2<1,\lambda_\min\geq-\frac{1}{3}\) | Y | \(\zeta_0\) |
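To make the \(\sigma^2\) and \(\zeta^2\) columns concrete, the following sketch estimates both quantities at an arbitrary point \(x\) on synthetic per-worker data (the data, the single-sample gradient model, and the point \(x\) are all made up for illustration): \(\sigma^2\) bounds the sampling variance of \(g_i(x;\xi)\) around \(\nabla f_i(x)\) within each worker, while \(\zeta^2\) bounds how far the local gradients \(\nabla f_i(x)\) spread around the global \(\nabla f(x)\).

```python
# Hypothetical illustration: estimate sigma^2 and zeta^2 on a synthetic
# decentralized least-squares problem with f_i(x) = 1/(2m) ||A_i x - b_i||^2.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 8, 5, 20
A = rng.normal(size=(n, m, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=(n, m))

x = rng.normal(size=d)                           # an arbitrary evaluation point

# Full local gradients grad f_i(x) and the global gradient grad f(x).
resid = A @ x - b                                # shape (n, m)
local_grads = np.einsum('im,imd->id', resid, A) / m   # shape (n, d)
global_grad = local_grads.mean(axis=0)

# zeta^2: average squared deviation of local gradients from the global one.
zeta2 = np.mean(np.sum((local_grads - global_grad) ** 2, axis=1))

# sigma^2: per-worker variance of the single-sample stochastic gradient
# g_i(x; xi) = (a_xi^T x - b_xi) a_xi around grad f_i(x), worst case over workers.
per_sample = resid[..., None] * A                # shape (n, m, d): all sample grads
dev = per_sample - local_grads[:, None, :]       # deviations from grad f_i(x)
sigma2 = np.max(np.mean(np.sum(dev ** 2, axis=2), axis=1))

print(f"estimated zeta^2  = {zeta2:.3f}")
print(f"estimated sigma^2 = {sigma2:.3f}")
```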
Remark.

- EXTRA makes a less restrictive assumption on \(W\), which leads to \(W\mathbf{1}=\mathbf{1}\) and \(\lambda(W)\in(-1, 1]\).
- In EXTRA, if \(\lambda_{\min}(W)<0\), we can replace \(W\) by \(\frac{I+W}{2}\) (see the short check after these remarks).
- In EXTRA, if we assume \(g(x)=\frac{1}{n}\sum\limits_{i=1}^n f_i(x) + \frac{1}{4\alpha}\Vert x\Vert_{\tilde{W}-W}\) is restricted strongly convex w.r.t. \(x^*\), we can derive a convergence rate of \(\mathcal{O}((1+\delta)^{-k})\). This condition does not require each \(f_i\) to be individually restricted strongly convex, so it is less restrictive than the assumptions in DSGT and DSGD.
- DSGD additionally assumes \(\sum\limits_{i=1}^{n}\left\|x_{i}(0)-x^{*}\right\|^{2}=\mathcal{O}(n)\) and \(\sum\limits_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}=\mathcal{O}(n)\).
- \(D^2\) is less sensitive to the data variance across workers than D-PSGD.
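The replacement in the second remark can be verified directly: if \(W\mathbf{1}=\mathbf{1}\) and \(\lambda(W)\in(-1,1]\), then \[ \frac{I+W}{2}\mathbf{1}=\mathbf{1}, \qquad \lambda\left(\frac{I+W}{2}\right)=\frac{1+\lambda(W)}{2}\in(0,1], \] so the modified matrix keeps the consensus property while removing the negative part of the spectrum, recovering the condition \(\lambda_\min\geq0\) from the table.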
Open questions:
- What is the convergence rate when assuming each \(f_i\) to be strongly convex in \(D^2\)?
- Li, Shi, and Yan (2019) assume a structure \(f_i(x)=s_i(x)+r_i(x)\) on the objective function, where \(s_i\) is differentiable and has a Lipschitz continuous gradient with parameter \(L\), and \(r_i\) is proximable (see the example below). They improve on PG-EXTRA in speed and in the dependence of the convergence rate on the network. If we assume the same structure in DSGT or DSGD, what is the convergence rate?
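As a reminder (a standard fact, not specific to Li, Shi, and Yan (2019)), \(r_i\) being proximable means that its proximal map \[ \operatorname{prox}_{\alpha r}(v)=\arg\min_{x}\; r(x)+\frac{1}{2\alpha}\Vert x-v\Vert^2 \] has a cheap closed form; for example, for \(r(x)=\Vert x\Vert_1\) it is the coordinatewise soft-thresholding \(\operatorname{prox}_{\alpha\Vert\cdot\Vert_1}(v)_j=\operatorname{sign}(v_j)\max(|v_j|-\alpha,0)\).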
10.2 Convergence rate
The following table lists the convergence rates of the algorithms above.
Name | nonconvex | convex | strongly convex |
---|---|---|---|
c-SGD | \(\mathcal{O}(\frac{1}{\sqrt{nT}}+\frac{1}{T})\) | | \(\mathcal{O}(\frac{1}{nk}+\frac{1}{k^2})\) |
DGD | | \(\mathcal{O}\left(\frac{\ln k}{\sqrt{k}}\right)\) | |
D-PSGD | \(\mathcal{O}\left(\frac{\sigma}{\sqrt{n T}}+\frac{n^{\frac{1}{4}} \zeta^{\frac{2}{3}}}{T^{\frac{2}{3}}}+\frac{1}{T}\right)\) | | |
EXTRA | | \(\mathcal{O}(\frac{1}{k})\) | \(\mathcal{O}((1+\delta)^{-k})\) |
DSGD | | | \(\mathcal{O}\left(\frac{1}{nk}+\frac{1}{k^{1.5}}+\frac{1}{k^2}\right)\) |
DSGT | | | \(\mathcal{O}\left(\frac{1}{nk} +\frac{1}{k^{\theta\mu}}+\frac{1}{k^2} \right)\) |
\(D^2\) | \(\mathcal{O}\left(\frac{\sigma}{\sqrt{n T}}+\frac{1}{T}+\frac{\zeta_{0}^{2}}{T+\sigma^{2} T^{2}}+\frac{\sigma^2}{1+\sigma^2T}\right)\) | | |
- DSGT converges to a neighborhood of \(x^*\) at the linear rate \(\mathcal{O}(\rho(A)^k)\) when choosing a fixed \(\alpha\), and converges exactly to \(x^*\) when choosing \(\alpha_k=\frac{\theta}{m+k}\). This suggests choosing the stepsize adaptively.
References
Li, Zhi, Wei Shi, and Ming Yan. 2019. “A Decentralized Proximal-Gradient Method with Network Independent Step-Sizes and Separated Convergence Rates.” IEEE Transactions on Signal Processing 67 (17). IEEE: 4494–4506.
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems, 5330–40.
Pu, Shi, and Angelia Nedić. 2018. “Distributed Stochastic Gradient Tracking Methods.”
Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”
Shi, Wei, Qing Ling, Gang Wu, and Wotao Yin. 2015. “Extra: An Exact First-Order Algorithm for Decentralized Consensus Optimization.” SIAM Journal on Optimization 25 (2). SIAM: 944–66.
Tang, Hanlin, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. 2018. “\(D^2\): Decentralized Training over Decentralized Data.” arXiv Preprint arXiv:1803.07068.