Chapter 8 Some results in asymptotic network independence

8.1 Compressed Communication

8.1.1 CHOCO-SGD

Koloskova, Stich, and Jaggi (2019) proposed a decentralized stochastic method, CHOCO-SGD (see algorithm 8.1), that converges at rate \[\begin{equation} \mathcal{O}\left(\frac{1}{nk}+\frac{1}{(\delta^2\omega k)^2}\right) \end{equation}\]

where \(\delta=1-|\lambda_2(W)|\) denotes the spectral gap of the mixing matrix \(W\) and \(0<\omega\leq1\) is the compression quality factor. This method preserves asymptotic network independence and can be applied to networks where the nodes compress their model updates with quality \(0<\omega\leq1\). Additionally, the noise introduced by the compression operator vanishes as \(k\to\infty\).

Unlike DSGD, the guarantee for CHOCO-SGD (algorithm 8.1) is stated for the weighted average of the iterates, \(x_{\mathrm{avg}}=\frac{1}{S}\sum\limits_{k=0}^{k_{\mathrm{stop}}}w_k\bar x(k)\); see theorem 8.1.

Theorem 8.1 Let assumptions 1.1 and 8.2 hold. Then algorithm 8.1 with SGD stepsize \(\alpha_k=\frac{4}{\mu(a+k)}\), where \(a\geq\max\{\frac{410}{\delta^2\omega},16\kappa\}\) and \(\kappa=\frac{L}{\mu}\), and consensus stepsize \(\gamma=\frac{\delta^{2} \omega}{16 \delta+\delta^{2}+4 \beta^{2}+2 \delta \beta^{2}-8 \delta \omega}\) (with \(\beta=\Vert I-W\Vert_2\)) converges at the rate \[ \mathbb{E} \left[f(x_{\mathrm{avg}})-f(x^*)\right]=\mathcal{O}\left(\frac{\bar{\sigma}^{2}}{\mu n T}\right)+\mathcal{O}\left(\frac{\kappa G^{2}}{\mu \omega^{2} \delta^{4} T^{2}}\right)+\mathcal{O}\left(\frac{G^{2}}{\mu \omega^{3} \delta^{6} T^{3}}\right) \] where \(T\) denotes the number of iterations, \(\bar\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}^{2}\), and \(x_{\mathrm{avg}}=\frac{1}{S}\sum\limits_{k=0}^{k_{\mathrm{stop}}}w_k\bar x(k)\) with weights \(w_k=(a+k)^2\) and \(S=\sum\limits_{k=0}^{k_{\mathrm{stop}}}w_k\).
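For reference, a small Python helper that transcribes the stepsize choices of theorem 8.1 (purely illustrative; the function name and interface are not from the paper):

```python
def choco_sgd_parameters(mu, L, delta, omega, beta):
    """Stepsize choices from theorem 8.1 (illustrative transcription).

    mu, L : strong-convexity and smoothness constants of f
    delta : spectral gap 1 - |lambda_2(W)|
    omega : compression quality, 0 < omega <= 1
    beta  : the quantity appearing in gamma (||I - W||_2 in the paper)
    Returns (alpha, gamma), where alpha(k) is the SGD stepsize at iteration k.
    """
    kappa = L / mu
    a = max(410.0 / (delta ** 2 * omega), 16.0 * kappa)
    gamma = (delta ** 2 * omega) / (
        16 * delta + delta ** 2 + 4 * beta ** 2
        + 2 * delta * beta ** 2 - 8 * delta * omega
    )
    alpha = lambda k: 4.0 / (mu * (a + k))
    return alpha, gamma
```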

In the proof of theorem 8.1, the main difference is how Koloskova, Stich, and Jaggi (2019) handle the term \[ \bar x(k)-\alpha_k h(\bar x(k))-x^* . \] They bound \(\Vert h(\bar x(k))\Vert^2\) and \(\langle\bar x(k)-x^*,h(\bar x(k))\rangle\) separately (by lemma 8.1, these are controlled in terms of \(f(\bar x(k))-f(x^*)\)), whereas in DSGD we use lemma 9.2. The proof is then completed with a lemma from Stich, Cordonnier, and Jaggi (2018).
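Concretely, expanding the square shows exactly which two quantities need to be controlled: \[ \Vert \bar x(k)-\alpha_k h(\bar x(k))-x^*\Vert^2 =\Vert \bar x(k)-x^*\Vert^2 -2\alpha_k\langle \bar x(k)-x^*,\,h(\bar x(k))\rangle +\alpha_k^2\Vert h(\bar x(k))\Vert^2 . \]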

Lemma 8.1 If \(f\) has an \(L\)-Lipschitz gradient and a minimizer \(x^*\), so that \(\nabla f(x^*)=\mathbf{0}\), then \[ \|\nabla f(x)\|^{2}=\left\|\nabla f(x)-\nabla f\left(x^{\star}\right)\right\|^{2} \leq 2 L\left(f(x)-f\left(x^{\star}\right)\right). \]

Let \(Q(\cdot):\mathbb{R}^{p}\to\mathbb{R}^p\) be a known compression operator satisfying assumption 8.1. Each local objective is \(f_i(x):=E_{\xi_i\sim\mathcal{D}_i} F_i(x,\xi_i)\) for a loss function \(F_i:\mathbb{R}^p\times \Omega_{\boldsymbol\xi}\to\mathbb{R}\); each \(f_i\) satisfies assumption 1.1, and the distributions \(\mathcal{D}_1,\ldots,\mathcal{D}_n\) may differ across nodes.

Assumption 8.1 The compression operator \(Q(\cdot):\mathbb{R}^{p}\to\mathbb{R}^p\) satisfies \[ E_{Q}\|Q(x)-x\|^{2} \leq(1-\omega)\|x\|^{2}, \quad x\in\mathbb{R}^p, \] for a parameter \(\omega>0\), where \(E_Q\) denotes the expectation over the internal randomness of the operator \(Q\).
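For intuition, a minimal NumPy sketch of one operator satisfying this assumption: random sparsification, which keeps \(r\) of the \(p\) coordinates uniformly at random and zeros the rest, so that \(E_Q\|Q(x)-x\|^{2}=(1-\frac{r}{p})\|x\|^{2}\), i.e., \(\omega=\frac{r}{p}\) (the function name and interface below are illustrative, not taken from Koloskova, Stich, and Jaggi (2019)):

```python
import numpy as np

def rand_sparsify(x, r, rng=None):
    """Random sparsification: keep r of the p coordinates of x, zero the rest.

    Satisfies E_Q ||Q(x) - x||^2 = (1 - r/p) ||x||^2,
    i.e., assumption 8.1 with omega = r / p.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = x.size
    q = np.zeros_like(x)
    kept = rng.choice(p, size=r, replace=False)
    q[kept] = x[kept]
    return q
```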
The CHOCO-SGD method is stated in algorithm 8.1.

Algorithm 8.1 (CHOCO-SGD) Initialize \(x_i(0)\in \mathbb{R}^{p}\) and \(\hat x_i(0)=\mathbf{0}\in\mathbb{R}^{p}\) for \(i\in\mathcal{N}\), the consensus stepsize \(\gamma\), SGD stepsizes \(\alpha_k\geq0\), and the mixing matrix \(W=(w_{ij})\in\mathbb{R}^{n\times n}\).

For \(k=0,1,...\) do in parallel for all agents \(i\in\mathcal{N}\),

  • Sample \(\xi_i(k)\), compute \(g_i(k)=\nabla F_i(x_i(k),\xi_i(k))\)

  • \(x_i(k+\frac{1}{2})=x_i(k)-\alpha_kg_i(k)\) (stochastic gradient)

  • \(q_i(k)=Q(x_i(k+\frac{1}{2})-\hat x_i(k))\) (compression operator)

  • For every neighbor \(j\) of agent \(i\), i.e., \((i,j)\in\mathcal{E}\) (including \(j=i\), so that \(\hat x_i\) is updated as well),

    • Send \(q_i(k)\) and receive \(q_j(k)\)

    • \(\hat x_j(k+1)=\hat x_j(k)+q_j(k)\)

  • \(x_i(k+1)=x_i(k+\frac{1}{2})+\gamma \sum_{j:(i,j)\in\mathcal{E}} w_{ij}(\hat x_j(k+1)-\hat x_i(k+1))\)
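To make the update concrete, here is a minimal NumPy sketch of one synchronous iteration of algorithm 8.1, simulated centrally (the gradient oracle grad_F and the compression operator Q, e.g. a wrapper around rand_sparsify above, are placeholders; a real implementation would exchange only the compressed messages \(q_i(k)\) over the network):

```python
import numpy as np

def choco_sgd_step(X, X_hat, W, grad_F, Q, alpha, gamma, rng):
    """One synchronous CHOCO-SGD iteration (algorithm 8.1), simulated centrally.

    X      : (n, p) current iterates x_i(k)
    X_hat  : (n, p) public copies hat{x}_i(k), identical at all neighbors
    W      : (n, n) symmetric, doubly stochastic mixing matrix
    grad_F : grad_F(i, x, rng) -> stochastic gradient g_i(k)
    Q      : compression operator on a vector, e.g. lambda v: rand_sparsify(v, r)
    """
    n, _ = X.shape
    # local stochastic gradient step: x_i(k + 1/2)
    G = np.stack([grad_F(i, X[i], rng) for i in range(n)])
    X_half = X - alpha * G
    # compress the difference to the public copy and "broadcast" it
    Qmsg = np.stack([Q(X_half[i] - X_hat[i]) for i in range(n)])
    X_hat_new = X_hat + Qmsg          # every receiver applies the same update
    # consensus correction using the public copies (rows of W sum to 1)
    X_new = X_half + gamma * (W @ X_hat_new - X_hat_new)
    return X_new, X_hat_new
```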

When the compression quality \(\omega=1\), i.e., no communication compression, and the consensus stepsize \(\gamma=1\), the update in algorithm 8.1 becomes \[ x_i(k+1)=\sum\limits_{j=1}^nw_{ij}(x_j(k)-\alpha_kg_j(k)), \] which is the DSGD algorithm 7.2 (a short verification is given below). From the setting \[f_i(x):=E_{\xi_i\sim\mathcal{D}_i} F_i(x,\xi_i),\] \(g_i(k)\) automatically satisfies the unbiasedness property in assumption 4.1 whenever \(\nabla\) and the expectation over \(F_i\) can be interchanged. In addition to the second property in assumption 4.1, we also need to assume that the second moment of \(\nabla F_i(x,\xi_i)\) is finite, i.e., assumption 8.2. A slight difference is that \(\sigma_i\) can be different for each agent.
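To verify the reduction: \(\omega=1\) forces \(E_Q\|Q(x)-x\|^2\leq0\), so \(Q(x)=x\) almost surely, hence \(q_i(k)=x_i(k+\frac{1}{2})-\hat x_i(k)\) and \(\hat x_j(k+1)=x_j(k+\frac{1}{2})\). With \(\gamma=1\) and \(\sum_{j}w_{ij}=1\), the consensus step collapses to \[ x_i(k+1)=x_i(k+\tfrac{1}{2})+\sum_{j:(i,j)\in\mathcal{E}} w_{ij}\left(x_j(k+\tfrac{1}{2})-x_i(k+\tfrac{1}{2})\right)=\sum_{j=1}^{n} w_{ij}\,x_j(k+\tfrac{1}{2})=\sum_{j=1}^{n} w_{ij}\left(x_j(k)-\alpha_k g_j(k)\right). \]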

Assumption 8.2 \[\begin{align} E_{\xi_{i}}\left\|\nabla F_{i}\left(x, \xi_{i}\right)-\nabla f_{i}(x)\right\|^{2} &\leq \sigma_{i}^{2}, \quad \forall x \in \mathbb{R}^{p}, i \in\mathcal{N}\\ E_{\xi_{i}}\left\|\nabla F_{i}\left(x, \xi_{i}\right)\right\|^{2} &\leq G^{2},\quad\forall x \in \mathbb{R}^{p}, i \in\mathcal{N} \end{align}\]

8.1.2 Stochastic gradient push

The push-sum gossip algorithm is robust to stragglers and communication delays. Assran et al. (2018) propose a variant of stochastic gradient push (SGP), called overlap SGP, and show that it converges to a stationary point of smooth, non-convex objectives at rate \(\mathcal{O}(\frac{1}{\sqrt{nK}})\), with all nodes achieving consensus. Additionally, a sequence of (column-stochastic, possibly time-varying) mixing matrices \(W_k\) is used here.

We use a different convergence criterion (Lian et al. 2017): for a target accuracy \(\varepsilon>0\) and number of iterations \(K\), \[\begin{equation} \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left\|\nabla f\left(\bar{{x}}(k)\right)\right\|^{2} \leq \varepsilon. \tag{8.1} \end{equation}\]
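As an illustration, a minimal NumPy sketch of one (non-overlapping) SGP iteration in push-sum form, simulated centrally; it assumes a column-stochastic mixing matrix \(W_k\), and grad_F is a placeholder gradient oracle. The overlap variant of Assran et al. (2018) additionally allows delayed messages, which this sketch does not model:

```python
import numpy as np

def sgp_step(X, w, Wk, grad_F, alpha, rng):
    """One stochastic gradient push iteration (push-sum form), simulated centrally.

    X  : (n, p) push-sum numerators x_i(k)
    w  : (n,)   push-sum weights   w_i(k), initialized to ones
    Wk : (n, n) column-stochastic mixing matrix for iteration k
    The de-biased iterates z_i(k) = x_i(k) / w_i(k) are the ones fed to the
    gradient oracle and the ones that reach consensus.
    """
    n, _ = X.shape
    Z = X / w[:, None]                                     # de-biased iterates
    G = np.stack([grad_F(i, Z[i], rng) for i in range(n)])
    X_half = X - alpha * G                                 # local SGD step on the numerator
    return Wk @ X_half, Wk @ w                             # push-sum mixing
```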

8.2 \(D^2\)

The global scheme of \(D^2\) is \[ X_1=W(X_0-\alpha G_0)\\ X_{k+1}=W(X_k-\alpha G_k)+W(X_k-X_{k-1}+\alpha G_{k-1}) \] Since \(W\) is doubly stochastic and \(\bar x_1=\bar x_0-\alpha\bar g_0\), induction gives \[ \bar x_{k+1} = \bar x_k - \alpha \bar g_k \] Thus if we assume each \(f_i\) is strongly convex, we can still use lemma 9.3. However, for the expected consensus error \(V(k)=E\left[\Vert X_{k}-\mathbf{1}\bar x_k\Vert^2\right]\), we need to use \[\begin{align} X_{k+1}-\mathbf{1}\bar x_{k+1}&= W(2X_k-X_{k-1}-\alpha G_k+\alpha G_{k-1})-\mathbf{1}(\bar x_k-\alpha \bar g_k)\\ &=2(WX_{k}-\mathbf{1}\bar x_k)-\alpha (WG_k-\mathbf{1}\bar g_k)\\ &\quad -(WX_{k-1}-\mathbf{1}\bar x_{k-1}) +\alpha(WG_{k-1}-\mathbf{1}\bar g_{k-1}) \tag{8.2} \end{align}\]

The remaining steps are similar to those in the proof of lemma 9.4. If we can uniformly bound \(R'(k)\), we can then derive the convergence rate of \(D^2\).
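For concreteness, a minimal NumPy sketch of the global \(D^2\) recursion above, simulated centrally with a placeholder stochastic gradient oracle grad_F(i, x, rng) and a doubly stochastic mixing matrix \(W\):

```python
import numpy as np

def d2_run(X0, W, grad_F, alpha, num_iters, rng):
    """Run the D^2 recursion:
        X_1     = W (X_0 - alpha * G_0)
        X_{k+1} = W (2 X_k - X_{k-1} - alpha * G_k + alpha * G_{k-1})
    X0 : (n, p) initial iterates, W : (n, n) doubly stochastic mixing matrix.
    Returns the iterate after num_iters >= 1 updates.
    """
    n, _ = X0.shape

    def grads(X):
        return np.stack([grad_F(i, X[i], rng) for i in range(n)])

    G_prev = grads(X0)
    X_prev, X = X0, W @ (X0 - alpha * G_prev)      # first step: X_1
    for _ in range(num_iters - 1):
        G = grads(X)
        X_prev, X = X, W @ (2 * X - X_prev - alpha * G + alpha * G_prev)
        G_prev = G
    return X
```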

References

Assran, Mahmoud, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. 2018. “Stochastic Gradient Push for Distributed Deep Learning.” arXiv Preprint arXiv:1811.10792.

Koloskova, Anastasia, Sebastian U Stich, and Martin Jaggi. 2019. “Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication.” arXiv Preprint arXiv:1902.00340.

Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems, 5330–40.

Stich, Sebastian U, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. “Sparsified SGD with Memory.” In Advances in Neural Information Processing Systems, 4447–58.