Chapter 10 Comparison
We consider DSGT (Pu and Nedić 2018), DSGD (Pu, Olshevsky, and Paschalidis 2019a), DGD, EXTRA (Shi et al. 2015), D-PSGD (Lian et al. 2017), and \(D^2\) (Tang et al. 2018) here. Generally, EXTRA, DSGT, and \(D^2\) achieve better convergence properties because they add correction terms to the basic update. The following table lists the schemes of these algorithms; a small simulation sketch of the DSGD and DSGT updates follows the table.
Name | Scheme |
---|---|
c-SGD | \(x_{k+1}=x_k-\alpha_k\tilde g(k)\) |
DGD | \(X_{k+1}=WX_k-\alpha_k\nabla f(X_k)\) |
D-PSGD | \(X_{k+1}=WX_{k}-\alpha G_k\) |
EXTRA | \(X_{k+1}=WX_k-\alpha \nabla f(X_k)+\sum\limits_{t=0}^{k-1}(W-\tilde W)X_t\) |
DSGD | \(X_{k+1}=W(X_k-\alpha_{k}G_k)\) |
DSGT | \[X_{k+1}=W(X_k-\alpha_k Y_k)\\Y_{k+1}=WY_k+G_{k+1}-G_k\] |
\(D^2\) | \[X_1=W(X_0-\alpha G_0)\\X_{k+1}=W(X_k-\alpha G_k)+W(X_k-X_{k-1}-\alpha G_{k-1})\] |
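To make the table concrete, here is a minimal runnable sketch of the DSGD and DSGT updates on a toy decentralized least-squares problem. Everything problem-specific is my own assumption for illustration (local objectives \(f_i(x)=\frac{1}{2m}\Vert A_ix-b_i\Vert^2\), single-sample stochastic gradients, a ring topology \(W\), and the constants), not the settings of the cited papers.

```python
# Toy sketch (not from the papers): n agents on a ring, each with local
# f_i(x) = 1/(2m) ||A_i x - b_i||^2 and single-sample stochastic gradients.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 5, 20                      # agents, dimension, samples per agent
A = rng.normal(size=(n, m, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=(n, m))

# Ring topology with lazy weights: symmetric and doubly stochastic.
W = np.eye(n) * 0.5
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

def stoch_grad(X):
    """Single-sample stochastic gradient of each local f_i at that agent's iterate."""
    G = np.zeros((n, d))
    for i in range(n):
        j = rng.integers(m)
        G[i] = (A[i, j] @ X[i] - b[i, j]) * A[i, j]
    return G

alpha, K = 0.02, 2000                   # fixed stepsize; a diminishing one also fits
x_star = np.linalg.lstsq(A.reshape(n * m, d), b.reshape(n * m), rcond=None)[0]

# DSGD: X_{k+1} = W (X_k - alpha_k G_k)
X = np.zeros((n, d))
for k in range(K):
    X = W @ (X - alpha * stoch_grad(X))
print("DSGD mean distance to x*:", np.linalg.norm(X - x_star, axis=1).mean())

# DSGT: X_{k+1} = W (X_k - alpha_k Y_k),  Y_{k+1} = W Y_k + G_{k+1} - G_k
X = np.zeros((n, d))
G = stoch_grad(X)
Y = G.copy()                            # gradient tracker initialized at G_0
for k in range(K):
    X = W @ (X - alpha * Y)
    G_new = stoch_grad(X)
    Y = W @ Y + G_new - G
    G = G_new
print("DSGT mean distance to x*:", np.linalg.norm(X - x_star, axis=1).mean())
```

The loops implement the two update rules exactly as written in the table; only the data, the mixing matrix, and the stepsize are made up.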
Thus \(nR'(k+1)=E\left[\Vert X_{k+1}-\mathbf{1}x^*\Vert^2\right]\) is affected by \[ E\left[\Vert H-\alpha_k G_k\Vert^2\right]. \] Similarly, for DSGD, \(nR'(k+1)\) is affected by \[ E\left[\Vert\alpha_k G_k\Vert^2 \right]=\alpha_k^2\, E\left[\Vert G_k\Vert^2 \right]. \]
We can see that a sequence of diminishing \(\alpha_k\) drives \(E\left[\Vert\alpha_k G_k\Vert^2\right]=\alpha_k^2E\left[\Vert G_k\Vert^2 \right]\) to zero as \(k\) gets large, which explains why diminishing stepsizes improve the fixed-stepsize scheme to exact convergence. On the other hand, we can also design a distributed algorithm such that \[\begin{equation} E\left[\Vert H-\alpha_k G_k\Vert^2\right]\leq E\left[\Vert\alpha_k G_k\Vert^2 \right] \tag{10.2} \end{equation}\] while preserving the consensus property.
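To make the diminishing-stepsize argument above explicit, suppose in addition (an assumption not stated above) that the stochastic gradients have a uniformly bounded second moment, \(E\left[\Vert G_k\Vert^2\right]\leq nG^2\) for some constant \(G\). Then the schedule \(\alpha_k=\frac{\theta}{m+k}\) used later for DSGT gives \[ E\left[\Vert\alpha_k G_k\Vert^2\right]=\alpha_k^2\,E\left[\Vert G_k\Vert^2\right]\leq\frac{\theta^2 nG^2}{(m+k)^2}\to 0, \qquad \sum_{k=0}^{\infty}\alpha_k=\sum_{k=0}^{\infty}\frac{\theta}{m+k}=\infty, \] so the perturbation injected at each step vanishes while the stepsizes remain non-summable, which is what permits exact convergence instead of convergence to a neighborhood.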
10.1 Assumptions of different schemes
The following table summarizes the assumptions needed for each scheme. The columns \(\sigma^2\) and \(\zeta^2\) denote the bounded-variance conditions \[ E_{\xi }\left\|g_{i}(x ; \xi)-\nabla f_{i}(x)\right\|^{2} \leqslant \sigma^{2}, \quad \forall i, \forall x \] and \[ \frac{1}{n} \sum_{i=1}^{n}\left\|\nabla f_{i}(x)-\nabla f(x)\right\|^{2} \leqslant \zeta^{2}, \quad \forall i, \forall x \] respectively.
Here \(\zeta_0=\frac{1}{n} \sum\limits_{i=1}^{n}\left\|\nabla f_{i}(\mathbf{0})-\nabla f(\mathbf{0})\right\|^{2}\), and \(\rho_w\) is the spectral norm of \(W-\frac{\mathbf{1}\mathbf{1}^T}{n}\).
Name | \(f_i(x)\) | \(\nabla f_i(x)\) | \(W\) | Spectral gap | \(\sigma^2\) | \(\zeta^2\) |
---|---|---|---|---|---|---|
c-SGD | \(\backslash\) | Lipschitz continuous | \(\backslash\) | \(\backslash\) | \(\backslash\) | \(\backslash\) |
DGD | \(\backslash\) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\rho_w<1\) | \(\backslash\) | \(\backslash\) |
D-PSGD | \(\backslash\) | Lipschitz continuous | Sym and doubly stochastic | \(\lambda_2^2<1\) | Y | Y |
EXTRA | convex (\(f\) strongly convex leads to better rate) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\lambda_\min(W)\geq0\) | \(\backslash\) | \(\backslash\) |
DSGD | strongly convex | Lipschitz continuous | Sym and doubly stochastic | \(\rho_w<1\) | Y | N |
DSGT | strongly convex | Lipschitz continuous | doubly stochastic | \(\rho_w<1\) | Y | N |
\(D^2\) | \(\backslash\) | Lipschitz continuous | Sym and \(W\mathbf{1}=\mathbf{1}\) | \(\lambda_2<1,\lambda_\min\geq-\frac{1}{3}\) | Y | \(\zeta_0\) |
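To make the \(\sigma^2\) and \(\zeta^2\) columns concrete, the following sketch estimates both quantities at an arbitrary point \(x\) on synthetic per-worker data (the data, the single-sample gradient model, and the point \(x\) are all made up for illustration): \(\sigma^2\) bounds the sampling variance of \(g_i(x;\xi)\) around \(\nabla f_i(x)\) within each worker, while \(\zeta^2\) bounds how far the local gradients \(\nabla f_i(x)\) spread around the global \(\nabla f(x)\).

```python
# Hypothetical illustration: estimate sigma^2 and zeta^2 on a synthetic
# decentralized least-squares problem with f_i(x) = 1/(2m) ||A_i x - b_i||^2.
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 8, 5, 20
A = rng.normal(size=(n, m, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=(n, m))

x = rng.normal(size=d)                           # an arbitrary evaluation point

# Full local gradients grad f_i(x) and the global gradient grad f(x).
resid = A @ x - b                                # shape (n, m)
local_grads = np.einsum('im,imd->id', resid, A) / m   # shape (n, d)
global_grad = local_grads.mean(axis=0)

# zeta^2: average squared deviation of local gradients from the global one.
zeta2 = np.mean(np.sum((local_grads - global_grad) ** 2, axis=1))

# sigma^2: per-worker variance of the single-sample stochastic gradient
# g_i(x; xi) = (a_xi^T x - b_xi) a_xi around grad f_i(x), worst case over workers.
per_sample = resid[..., None] * A                # shape (n, m, d): all sample grads
dev = per_sample - local_grads[:, None, :]       # deviations from grad f_i(x)
sigma2 = np.max(np.mean(np.sum(dev ** 2, axis=2), axis=1))

print(f"estimated zeta^2  = {zeta2:.3f}")
print(f"estimated sigma^2 = {sigma2:.3f}")
```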
Remark.

- EXTRA makes a less restrictive assumption on \(W\), which leads to \(W\mathbf{1}=\mathbf{1}\) and \(\lambda(W)\in(-1, 1]\).
- In EXTRA, if \(\lambda_{\min}(W)<0\), we can replace \(W\) by \(\frac{I+W}{2}\) (see the short check after these remarks).
- In EXTRA, if we assume \(g(x)=\frac{1}{n}\sum\limits_{i=1}^n f_i(x) + \frac{1}{4\alpha}\Vert x\Vert_{\tilde{W}-W}\) is restricted strongly convex w.r.t. \(x^*\), we can derive a convergence rate of \(\mathcal{O}((1+\delta)^{-k})\). This condition does not require each \(f_i\) to be individually restricted strongly convex, so it is less restrictive than the assumptions in DSGT and DSGD.
- DSGD additionally assumes \(\sum\limits_{i=1}^{n}\left\|x_{i}(0)-x^{*}\right\|^{2}=\mathcal{O}(n)\) and \(\sum\limits_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}=\mathcal{O}(n)\).
- \(D^2\) is less sensitive to the data variance across workers than D-PSGD.
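The replacement in the second remark can be verified directly: if \(W\mathbf{1}=\mathbf{1}\) and \(\lambda(W)\in(-1,1]\), then \[ \frac{I+W}{2}\mathbf{1}=\mathbf{1}, \qquad \lambda\left(\frac{I+W}{2}\right)=\frac{1+\lambda(W)}{2}\in(0,1], \] so the modified matrix keeps the consensus property while removing the negative part of the spectrum, recovering the condition \(\lambda_\min\geq0\) from the table.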
Open questions:
- What is the convergence rate when assuming each \(f_i\) to be strongly convex in \(D^2\)?
- Li, Shi, and Yan (2019) assume a structure \(f_i(x)=s_i(x)+r_i(x)\) on the objective function, where \(s_i\) is differentiable and has a Lipschitz continuous gradient with parameter \(L\), and \(r_i\) is proximable (see the example below). They improve on PG-EXTRA in speed and in the dependence of the convergence rate on the network. If we assume the same structure in DSGT or DSGD, what is the convergence rate?
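As a reminder (a standard fact, not specific to Li, Shi, and Yan (2019)), \(r_i\) being proximable means that its proximal map \[ \operatorname{prox}_{\alpha r}(v)=\arg\min_{x}\; r(x)+\frac{1}{2\alpha}\Vert x-v\Vert^2 \] has a cheap closed form; for example, for \(r(x)=\Vert x\Vert_1\) it is the coordinatewise soft-thresholding \(\operatorname{prox}_{\alpha\Vert\cdot\Vert_1}(v)_j=\operatorname{sign}(v_j)\max(|v_j|-\alpha,0)\).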
10.2 Convergence rate
The following table lists the convergence rates of the algorithms above.
Name | nonconvex | convex | strongly convex |
---|---|---|---|
c-SGD | \(\mathcal{O}(\frac{1}{\sqrt{nT}}+\frac{1}{T})\) | | \(\mathcal{O}(\frac{1}{nk}+\frac{1}{k^2})\) |
DGD | | \(\mathcal{O}\left(\frac{\ln k}{\sqrt{k}}\right)\) | |
D-PSGD | \(\mathcal{O}\left(\frac{\sigma}{\sqrt{n T}}+\frac{n^{\frac{1}{4}} \zeta^{\frac{2}{3}}}{T^{\frac{2}{3}}}+\frac{1}{T}\right)\) | | |
EXTRA | | \(\mathcal{O}(\frac{1}{k})\) | \(\mathcal{O}((1+\delta)^{-k})\) |
DSGD | | | \(\mathcal{O}\left(\frac{1}{nk}+\frac{1}{k^{1.5}}+\frac{1}{k^2}\right)\) |
DSGT | | | \(\mathcal{O}\left(\frac{1}{nk} +\frac{1}{k^{\theta\mu}}+\frac{1}{k^2} \right)\) |
\(D^2\) | \(\mathcal{O}\left(\frac{\sigma}{\sqrt{n T}}+\frac{1}{T}+\frac{\zeta_{0}^{2}}{T+\sigma^{2} T^{2}}+\frac{\sigma^2}{1+\sigma^2T}\right)\) | | |
- DSGT converges to a neighborhood of \(x^*\) at the linear rate \(\mathcal{O}(\rho(A)^k)\) when choosing a fixed \(\alpha\), and converges exactly to \(x^*\) when choosing \(\alpha_k=\frac{\theta}{m+k}\). This suggests choosing the stepsize adaptively.
References
Li, Zhi, Wei Shi, and Ming Yan. 2019. “A Decentralized Proximal-Gradient Method with Network Independent Step-Sizes and Separated Convergence Rates.” IEEE Transactions on Signal Processing 67 (17). IEEE: 4494–4506.
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems, 5330–40.
Pu, Shi, and Angelia Nedić. 2018. “Distributed Stochastic Gradient Tracking Methods.”
Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”
Shi, Wei, Qing Ling, Gang Wu, and Wotao Yin. 2015. “Extra: An Exact First-Order Algorithm for Decentralized Consensus Optimization.” SIAM Journal on Optimization 25 (2). SIAM: 944–66.
Tang, Hanlin, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. 2018. “\(D^2\): Decentralized Training over Decentralized Data.” arXiv Preprint arXiv:1803.07068.