Chapter 8 Some results in asymptotic network independence
8.1 Compressed Communication
8.1.1 CHOCO-SGD
Koloskova, Stich, and Jaggi (2019) proposed a decentralized stochastic method, CHOCO-SGD (see Algorithm 8.1), that converges at rate \[\begin{equation} \mathcal{O}\left(\frac{1}{nk}+\frac{1}{(\delta^2 w k)^2}\right) \end{equation}\]where \(\delta=1-|\lambda_2(W)|\) denotes the spectral gap of the mixing matrix \(W\) and \(w\leq1\) is the compression quality factor. This method preserves asymptotic network independence and can be applied to a network where the nodes compress their model updates with quality \(0<w\leq1\). Additionally, the noise introduced by the compression operator vanishes as \(k\to\infty\).
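To see the asymptotic network independence, compare the two terms of the rate (ignoring constants): the network- and compression-dependent term decays like \(1/k^2\), so \[ \frac{1}{(\delta^2 w k)^2}\leq\frac{1}{nk}\quad\Longleftrightarrow\quad k\geq\frac{n}{\delta^4 w^2}, \] i.e., after a transient of roughly \(n/(\delta^4 w^2)\) iterations the \(\frac{1}{nk}\) term dominates, which is the same leading behavior as centralized SGD with \(n\) parallel workers.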
Different from DSGD, the guarantee for CHOCO-SGD (Algorithm 8.1) is stated for the weighted averaged iterate \(x_{\mathrm{avg}}=\frac{1}{S}\sum\limits_{k=1}^{k_{\mathrm{stop}}}w_k\bar x(k)\); see Theorem 8.1.
In the proof of Theorem 8.1, the difference is that Koloskova, Stich, and Jaggi (2019) handle the term \[ \bar x(k)-\alpha_k h(\bar x(k))-x^* \] differently. They estimate \(\Vert h(\bar x(k))\Vert^2\) and \(\langle\bar x(k)-x^*,h(\bar x(k))\rangle\) (from Lemma 8.1, the latter yields \(f(\bar x(k))-f(x^*)\)), whereas in DSGD we use Lemma 9.2. Then, based on a lemma from Stich, Cordonnier, and Jaggi (2018), they complete the proof.
Let \(Q(\cdot):\mathbb{R}^{p}\to\mathbb{R}^p\) be a compression operator that is known in advance and satisfies Assumption 8.1. Let \(f_i(x):=E_{\xi_i\sim\mathcal{D}_i} F_i(x,\xi_i)\) for a loss function \(F_i:\mathbb{R}^p\times \Omega_{\boldsymbol\xi}\to\mathbb{R}\); each \(f_i\) satisfies Assumption 1.1, and the distributions \(\mathcal{D}_1,\dots,\mathcal{D}_n\) can be different on every node.
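Assumption 8.1 is not restated here; in Koloskova, Stich, and Jaggi (2019) the compression operator is required to satisfy \(E_Q\Vert Q(x)-x\Vert^2\leq(1-w)\Vert x\Vert^2\) for all \(x\in\mathbb{R}^p\). As a minimal sketch (the helper names below are ours, not from the paper), two standard operators meeting this condition with quality \(w=k/p\) are random and top-\(k\) sparsification:

```python
import numpy as np

def rand_k(x, k, rng=None):
    """Random sparsification: keep k uniformly chosen coordinates of x.
    Satisfies E||Q(x) - x||^2 = (1 - k/p) ||x||^2, i.e. quality w = k/p."""
    rng = rng or np.random.default_rng()
    p = x.shape[0]
    q = np.zeros_like(x)
    idx = rng.choice(p, size=k, replace=False)
    q[idx] = x[idx]
    return q

def top_k(x, k):
    """Top-k sparsification: keep the k largest-magnitude coordinates of x.
    Satisfies ||Q(x) - x||^2 <= (1 - k/p) ||x||^2, i.e. quality w = k/p."""
    q = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    q[idx] = x[idx]
    return q
```

Quantization-based operators can also satisfy the condition after suitable rescaling.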
Algorithm 8.1 (CHOCO-SGD) Initialize \(x_i(0)\in \mathbb{R}^{p}\), \(\hat x_i(0)=\mathbf{0}\in\mathbb{R}^{p}\) for \(i\in\mathcal{N}\), consensus stepsize \(\gamma\), SGD stepsizes \(\alpha_k\geq0\), and mixing matrix \(W=(w_{ij})\in\mathbb{R}^{n\times n}\)
For \(k=0,1,...\) do in parallel for all agents \(i\in\mathcal{N}\),
Sample \(\xi_i(k)\), compute \(g_i(k)=\nabla F_i(x_i(k),\xi_i(k))\)
\(x_i(k+\frac{1}{2})=x_i(k)-\alpha_kg_i(k)\) (stochastic gradient)
\(q_i(k)=Q(x_i(k+\frac{1}{2})-\hat x_i(k))\) (compression operator)
For every neighbor \(j\) of agent \(i\), i.e., \((i,j)\in\mathcal{E}\),
Send \(q_i(k)\) and receive \(q_j(k)\)
\(\hat x_j(k+1)=\hat x_j(k)+q_j(k)\) (likewise, locally, \(\hat x_i(k+1)=\hat x_i(k)+q_i(k)\), so that the next step is well defined)
\(x_i(k+1)=x_i(k+\frac{1}{2})+\gamma \sum_{j:(i,j)\in\mathcal{E}} w_{ij}(\hat x_j(k+1)-\hat x_i(k+1))\) (consensus step)
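The following is a minimal NumPy sketch of how Algorithm 8.1 might be simulated; the synthetic least-squares data, the ring mixing matrix, the top-\(k\) compression, and the constant stepsizes are illustrative assumptions, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k_top = 8, 20, 5            # agents, dimension, coordinates kept by top-k
T, alpha, gamma = 500, 0.01, 0.4  # iterations, SGD stepsize, consensus stepsize

# Ring topology with a simple doubly stochastic mixing matrix W.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

# Local least-squares data: f_i(x) = (1/2m)||A_i x - b_i||^2.
A = rng.normal(size=(n, 30, p))
b = rng.normal(size=(n, 30))

def stochastic_grad(i, x):
    """Unbiased gradient of f_i at x, evaluated on one randomly sampled row of A_i."""
    r = rng.integers(A.shape[1])
    return A[i, r] * (A[i, r] @ x - b[i, r])

def top_k(v, k):
    """Same top-k compression operator as in the sketch above (quality w = k/p)."""
    q = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    q[idx] = v[idx]
    return q

x = np.zeros((n, p))       # x_i(k)
x_hat = np.zeros((n, p))   # \hat x_i(k): public copies known to the neighbors

for t in range(T):
    # local stochastic gradient step: x_i(k + 1/2)
    x_half = np.array([x[i] - alpha * stochastic_grad(i, x[i]) for i in range(n)])
    # compress the difference to the public copy and "transmit" it
    q = np.array([top_k(x_half[i] - x_hat[i], k_top) for i in range(n)])
    x_hat = x_hat + q      # every node applies the received q_j (and its own q_i)
    # consensus step; W @ x_hat - x_hat matches the algorithm since rows of W sum to 1
    x = x_half + gamma * (W @ x_hat - x_hat)

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```

With `k_top = p` (so \(w=1\)) and `gamma = 1`, this simulation reduces to DSGD, matching the discussion that follows.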
When the compression quality \(w=1\), i.e., no communication compression, and the consensus stepsize \(\gamma=1\), the update in Algorithm 8.1 becomes \[ x_i(k+1)=\sum\limits_{j=1}^nw_{ij}(x_j(k)-\alpha_kg_j(k)), \] which is the DSGD update of Algorithm 7.2. From the setting \[f_i(x):=E_{\xi_i\sim\mathcal{D}_i} F_i(x,\xi_i),\] \(g_i(k)\) satisfies the unbiasedness property in Assumption 4.1 automatically whenever the gradient and the expectation of \(F_i\) can be interchanged. As for the second property in Assumption 4.1, we need to assume that the second moment of \(\nabla F_i(x,\xi_i)\) is finite, i.e., Assumption 8.2. A slight difference is that \(\sigma\) can be different for each agent.
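For instance, when the interchange is valid (e.g., under standard dominated-convergence conditions), \[ \nabla f_i(x)=\nabla E_{\xi_i\sim\mathcal{D}_i}F_i(x,\xi_i)=E_{\xi_i\sim\mathcal{D}_i}\nabla F_i(x,\xi_i), \] so \(E[g_i(k)\mid x_i(k)]=\nabla f_i(x_i(k))\), which is exactly the unbiasedness requirement.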
8.1.2 Stochastic gradient push
The push-sum gossip algorithm is robust to stragglers and communication delays. Assran et al. (2018) propose a variant of stochastic gradient push (SGP), called overlap SGP, which converges to a stationary point of smooth, non-convex objectives at rate \(\mathcal{O}(\frac{1}{\sqrt{nK}})\), with all nodes achieving consensus. Additionally, a sequence of time-varying mixing matrices \(W_k\) is used here.
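As a rough sketch (not the exact implementation of Assran et al. (2018)), the push-sum mechanism can be illustrated as follows: each node keeps a push-sum weight alongside its parameters, takes a stochastic gradient step at the de-biased variable, and mixes with a column-stochastic matrix. The quadratic losses, the fixed directed-ring matrix `P`, and the constant stepsize below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T, alpha = 6, 10, 300, 0.05

# Column-stochastic mixing matrix for a directed ring:
# each node keeps half of its mass and pushes half to its successor.
P = np.zeros((n, n))
for j in range(n):
    P[j, j] = 0.5
    P[(j + 1) % n, j] = 0.5

# Local quadratic losses f_i(x) = (1/2)||x - c_i||^2 with noisy gradients.
c = rng.normal(size=(n, p))

def stochastic_grad(i, z):
    return (z - c[i]) + 0.1 * rng.normal(size=p)

x = np.zeros((n, p))   # push-sum numerators
w = np.ones(n)         # push-sum weights
z = x.copy()           # de-biased iterates z_i = x_i / w_i

for t in range(T):
    # local SGD step on the numerator, using the gradient at the de-biased iterate
    x_half = np.array([x[i] - alpha * stochastic_grad(i, z[i]) for i in range(n)])
    # push-sum mixing with the column-stochastic matrix P
    x = P @ x_half
    w = P @ w
    z = x / w[:, None]

# all z_i should end up near the minimizer of (1/n) sum_i f_i, i.e. the mean of c_i
print("max distance to average minimizer:", np.abs(z - c.mean(axis=0)).max())
```

The push-sum weights \(w_i\) correct the bias that column-stochastic (rather than doubly stochastic) mixing would otherwise introduce into the de-biased iterates \(z_i=x_i/w_i\).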
We use a different convergence criterion (Lian et al. 2017): for a number of iterations \(K\) and a given accuracy \(\varepsilon>0\), \[\begin{equation} \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left\|\nabla f\left(\bar{x}(k)\right)\right\|^{2} \leq \varepsilon. \tag{8.1} \end{equation}\]

8.2 \(D^2\)
The global (matrix) scheme of \(D^2\) is \[ X_1=W(X_0-\alpha G_0),\\ X_{k+1}=W(X_k-\alpha G_k)+W(X_k-X_{k-1}+\alpha G_{k-1}); \] multiplying by \(\frac{1}{n}\mathbf{1}^\top\), using the double stochasticity of \(W\), and inducting on \(k\), we have \[ \bar x_{k+1} = \bar x_k - \alpha \bar g_k. \] Thus, if we assume each \(f_i\) is strongly convex, we can still use Lemma 9.3. However, for the expected consensus error \(V(k)=E\left[\Vert X_{k}-\mathbf{1}\bar x_k\Vert^2\right]\), we need to use \[\begin{align} X_{k+1}-\mathbf{1}\bar x_{k+1}&= W(2X_k-X_{k-1}-\alpha G_k+\alpha G_{k-1})-\mathbf{1}(\bar x_k-\alpha \bar g_k)\\ &=2(WX_{k}-\mathbf{1}\bar x_k)-\alpha (WG_k-\mathbf{1}\bar g_k)\\ &\quad -(WX_{k-1}-\mathbf{1}\bar x_{k-1}) +\alpha(WG_{k-1}-\mathbf{1}\bar g_{k-1}); \tag{8.2} \end{align}\]the following steps are then similar to those in the proof of Lemma 9.4. If we can uniformly bound \(R'(k)\), we can derive the convergence rate of \(D^2\).
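To make the recursion concrete, here is a minimal NumPy sketch of the \(D^2\) scheme above on a toy problem; the heterogeneous quadratic losses, the ring mixing matrix, and the constant stepsize are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, T, alpha = 8, 10, 400, 0.05

# Doubly stochastic mixing matrix on a ring.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

# Heterogeneous local quadratics f_i(x) = (1/2)||x - c_i||^2 with noisy gradients.
c = rng.normal(size=(n, p))

def G(X):
    """Stacked stochastic gradients: row i is a noisy gradient of f_i at X[i]."""
    return (X - c) + 0.1 * rng.normal(size=X.shape)

# D^2 recursion: X_1 = W(X_0 - alpha*G_0),
# X_{k+1} = W(X_k - alpha*G_k) + W(X_k - X_{k-1} + alpha*G_{k-1}).
X_prev = np.zeros((n, p))
G_prev = G(X_prev)
X = W @ (X_prev - alpha * G_prev)

for k in range(1, T):
    G_cur = G(X)
    X_next = W @ (X - alpha * G_cur) + W @ (X - X_prev + alpha * G_prev)
    X_prev, G_prev, X = X, G_cur, X_next

x_bar = X.mean(axis=0)
print("consensus error:", np.linalg.norm(X - x_bar))
print("distance to minimizer of average loss:", np.linalg.norm(x_bar - c.mean(axis=0)))
```

In this toy example the averaged iterate follows exactly \(\bar x_{k+1}=\bar x_k-\alpha\bar g_k\), as derived above, while the consensus error is driven by the recursion (8.2).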
References
Assran, Mahmoud, Nicolas Loizou, Nicolas Ballas, and Michael Rabbat. 2018. “Stochastic Gradient Push for Distributed Deep Learning.” arXiv Preprint arXiv:1811.10792.
Koloskova, Anastasia, Sebastian U Stich, and Martin Jaggi. 2019. “Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication.” arXiv Preprint arXiv:1902.00340.
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems, 5330–40.
Stich, Sebastian U, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. “Sparsified SGD with Memory.” In Advances in Neural Information Processing Systems, 4447–58.