Chapter 7 Asymptotic network independence

An undesirable property of distributed optimization methods is that increasing the number of nodes may greatly increase the time needed to reach the same \(\varepsilon\) accuracy (error \(<\varepsilon\)) as the centralized version. Pu, Olshevsky, and Paschalidis (2019b) discuss this phenomenon under the following scenario:

\[\begin{equation} \text { Time }_{n, \varepsilon} \text { (decentralized) } \leq p(\mathcal{G}) \text { Time }_{n, \varepsilon} \text { (centralized) } \tag{7.1} \end{equation}\]

where \(\text { Time }_{n, \varepsilon} \text { (decentralized) }\) denotes the time for the decentralized algorithm on \(n\) nodes to reach \(\varepsilon\) accuracy, \(\text { Time }_{n, \varepsilon} \text { (centralized) }\) is the time for the centralized algorithm, which can query \(n\) gradients per time step, to reach \(\varepsilon\) accuracy, and \(\mathcal{G}=(\mathcal{N},\mathcal{E})\) denotes the graph. Typically, \(p(\mathcal{G})\) is at least \(\mathcal{O}(n^2)\)[1], which is impractical: \(p(\mathcal{G})=\mathcal{O}(n^2)\) implies the distributed version would be \(n^2\) times slower than the centralized one with the same computational power.

\(p(\mathcal{G})=\mathcal{O}(1)\) is the desirable setting: it means the decentralized algorithm converges to the optimum at a rate comparable to a centralized algorithm with the same computational power[2]. Fortunately, some distributed stochastic optimization methods achieve this once the iteration counter \(k\) is large enough; this is the asymptotic network independence property: asymptotically, it is as if the network were not even there. In chapter 9, we summarize some notes showing the asymptotic network independence property of algorithm 7.1 with \(\alpha_k=\frac{\theta}{\mu(k+K)}\) (Pu, Olshevsky, and Paschalidis 2019a).

7.1 SGD and DSGD

We consider a distributed stochastic gradient descent (DSGD) method, algorithm 7.1 below, under assumptions 1.1, 4.1, 4.2, and 4.3 plus \(W\) symmetric[3], similar to the settings we discussed in chapter 4. By assumption 1.1, there exists a unique solution \(x^*\in\mathbb{R}^p\) to problem (1.1).

Algorithm 7.1 (DSGD) Each agent \(i\) chooses the same step size \(\alpha_k\) at the \(k\)th iteration and is initialized with an arbitrary \(x_i(0)\in\mathbb{R}^p\).

For k = 0, 1, …,

  • For \(i\in\mathcal{N}\),

    • \(x_i(k+1) = \sum\limits_{j=1}^nw_{ij}(x_j(k)-\alpha_k g_j(x_j(k),\xi_j(k)))\)

Here \(\{\alpha_k\}\) is a sequence of nonnegative, non-increasing step sizes. In the long run, \(x_i(k)=x_j(k),\forall i,j\in\mathcal{N}\), i.e., DSGD belongs to the class of consensus-based distributed optimization methods; consensus can be achieved under assumptions 4.2 and 4.3 plus \(W\) symmetric. Let \(X(k)=(x_1(k),...,x_n(k))^T\in\mathbb{R}^{n\times p}\), \(G(k)=(g_1(x_1(k),\xi_1(k)),...,g_n(x_n(k),\xi_n(k)))^T\in\mathbb{R}^{n\times p}\), and \(W=(w_{ij})\in\mathbb{R}^{n\times n}\); then we can rewrite algorithm 7.1 compactly as \[\begin{equation} X(k+1)=W(X(k)-\alpha_kG(k)) \tag{7.2} \end{equation}\]
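To make the matrix form (7.2) concrete, here is a minimal NumPy sketch of a DSGD run. The quadratic local objectives, the lazy-Metropolis mixing matrix on a ring graph, and the noise level `sigma` are illustrative assumptions, not part of the original setting; the chosen \(W\) is symmetric and doubly stochastic, in line with the standing assumptions on the mixing matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 3                      # number of agents, dimension

# Illustrative local objectives f_i(x) = 0.5 * ||x - b_i||^2,
# so grad f_i(x) = x - b_i and the global optimum is x* = mean_i(b_i).
B = rng.normal(size=(n, p))
x_star = B.mean(axis=0)

# Symmetric doubly stochastic mixing matrix W for a ring graph
# (lazy Metropolis weights): each agent averages with its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

sigma = 0.5                       # stochastic-gradient noise level (assumed)
mu = 1.0                          # strong convexity constant of the f_i above
X = rng.normal(size=(n, p))       # arbitrary initialization x_i(0)

for k in range(1, 1001):
    alpha_k = 1.0 / (mu * k)      # step size alpha_k = 1/(mu k)
    G = (X - B) + sigma * rng.normal(size=(n, p))  # noisy gradients g_i
    X = W @ (X - alpha_k * G)     # update (7.2): X(k+1) = W(X(k) - a_k G(k))

print("R'(k) ≈", np.mean(np.sum((X - x_star) ** 2, axis=1)))
```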

Todo: \(x_i(k)\stackrel{?}{=}x_j(k),\ \forall i,j\in\mathcal{N}\)

We compare DSGD with centralized stochastic gradient descent (SGD), which can query \(n\) gradients at each iteration.

Algorithm 7.2 (SGD) Initialize with an arbitrary \(x(0)\in\mathbb{R}^{p}\) and choose a step size \(\alpha_k\) for each step.

For k=0,1,…,

  • \(x(k+1)=x(k)-\alpha_k\bar g(k)\)

where \(\bar g(k)=\frac{1}{n}\sum\limits_{i=1}^n g_i(x(k),\xi_i(k))\) and \(\alpha_k=\frac{1}{\mu k}\). We use \(\bar g(k)\) here to make the gradient comparable to that in DSGD, i.e., \(\sum\limits_{j=1}^n w_{ij} g_j(x_j(k),\xi_j(k))\).
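For comparison, here is a matching sketch of algorithm 7.2, continuing the hypothetical quadratic setup from the DSGD sketch above; the only change is that all \(n\) noisy gradients are queried at the single centralized iterate and averaged into \(\bar g(k)\).

```python
# Centralized SGD (algorithm 7.2) on the same assumed quadratic setup:
# all n gradients are queried at the common iterate x(k) and averaged.
x = rng.normal(size=p)                    # arbitrary x(0)
for k in range(1, 1001):
    alpha_k = 1.0 / (mu * k)              # alpha_k = 1/(mu k)
    noisy_grads = (x - B) + sigma * rng.normal(size=(n, p))  # g_i(x(k), xi_i(k))
    g_bar = noisy_grads.mean(axis=0)      # averaged gradient g_bar(k)
    x = x - alpha_k * g_bar

print("R(k) ≈", np.sum((x - x_star) ** 2))
```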

Choosing the squared \(2\)-norm as the loss, the distance between \(x(k)\) and \(x^*\) at the \(k\)th step of SGD is \[\begin{equation} R(k)=E \left[\Vert x(k)-x^*\Vert^2\right]=\frac{1}{n}\sum\limits_{i=1}^nE \left[\Vert x(k)-x^*\Vert^2\right] \tag{7.3} \end{equation}\] which suggests evaluating the analogous distance for DSGD by \[\begin{equation} R'(k)=\frac{1}{n}\sum\limits_{i=1}^n E\left[\Vert x_i(k)-x^*\Vert^2\right] \tag{7.4} \end{equation}\] Additionally, (7.4) can be split into two sources of error, one from optimization and one from consensus: with \(\bar x(k)=\frac{1}{n}\sum_{i=1}^n x_i(k)\), \[\begin{equation} R'(k)=\underbrace{E \left[\Vert \bar x(k)-x^*\Vert^2\right]}_{\text{expected optimization error}} + \underbrace{\frac{1}{n}\sum_{i=1}^nE \left[\Vert x_i(k)-\bar x(k)\Vert^2\right]}_{\text{expected consensus error}} \tag{7.5} \end{equation}\] This is because the cross term vanishes: \[\begin{align} \frac{1}{n}\sum_{i=1}^n E\left[\langle x_i(k)-\bar x(k), \bar x(k) - x^*\rangle\right]&=E\left[\langle \frac{1}{n}\sum_{i=1}^nx_i(k)-\bar x(k), \bar x(k) - x^*\rangle\right]\\ &=0, \end{align}\] since \(\frac{1}{n}\sum_{i=1}^n x_i(k)=\bar x(k)\).
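The decomposition (7.5) in fact holds pointwise, before taking expectations. A quick numerical sanity check, reusing `rng`, `n`, `p`, and `x_star` from the sketches above:

```python
# Verify (7.5) pointwise: with x_bar = mean_i x_i(k),
# (1/n) sum_i ||x_i - x*||^2 == ||x_bar - x*||^2 + (1/n) sum_i ||x_i - x_bar||^2
X_test = rng.normal(size=(n, p))
x_bar = X_test.mean(axis=0)
lhs = np.mean(np.sum((X_test - x_star) ** 2, axis=1))
rhs = (np.sum((x_bar - x_star) ** 2)
       + np.mean(np.sum((X_test - x_bar) ** 2, axis=1)))
print(np.isclose(lhs, rhs))      # True: the cross term averages to zero
```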

7.2 Bounds

We next compare SGD and DSGD by analyzing their error bounds.

Lemma 7.1 Let assumptions 1.1, 4.1, 4.2, and 4.3 plus \(W\) symmetric hold. For SGD (algorithm 7.2), we have \[ R(k+1) \leq\left(1-\alpha_{k} \mu\right)^{2} R(k)+\frac{\alpha_{k}^{2} \sigma^{2}}{n} \]
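To see what the lemma implies, one can iterate the bound directly. This is a minimal sketch with assumed constants \(\mu\), \(\sigma\), \(n\) and the step size \(\alpha_k=1/(\mu k)\); the iterated bound settles onto the familiar \(\mathcal{O}\left(\sigma^2/(n\mu^2 k)\right)\) decay.

```python
# Iterate the lemma 7.1 bound
#   R(k+1) <= (1 - alpha_k * mu)^2 R(k) + alpha_k^2 sigma^2 / n
# with alpha_k = 1/(mu k); the constants below are illustrative assumptions.
mu, sigma, n = 1.0, 0.5, 10
R = 1.0                           # assumed initial error R(1)
for k in range(1, 10001):
    alpha_k = 1.0 / (mu * k)
    R = (1 - alpha_k * mu) ** 2 * R + alpha_k ** 2 * sigma ** 2 / n

print(R, sigma ** 2 / (n * mu ** 2 * 10000))   # bound vs sigma^2/(n mu^2 k)
```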

Denote \(U(k)=E \left[\Vert \bar x(k)-x^*\Vert^2\right]\) and \(V(k)=\sum\limits_{i=1}^nE \left[\Vert x_i(k)-\bar x(k)\Vert^2\right]\). Then we have

Lemma 7.2 Let the same assumptions as in lemma 7.1 hold. Then \[ U(k+1) \leq\left(1-\frac{1}{k}\right)^{2} U(k)+\frac{2 L}{\sqrt{n} \mu} \frac{\sqrt{U(k) V(k)}}{k}+\frac{L^{2}}{n \mu^{2}} \frac{V(k)}{k^{2}}+\frac{\sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}} \]
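One can also iterate the lemma 7.2 bound numerically. Since the recursion for \(V(k)\) is not given here, the sketch below assumes, purely for illustration, a consensus error decaying as \(V(k)=c/k^2\) (the rate suggested by the \(1/\tilde k^2\) term in the chapter-9 bound), and checks that \(U(k)\) then tracks the centralized \(\mathcal{O}(1/(nk))\) decay; all constants are assumed.

```python
import math

# Iterate the lemma 7.2 bound with an assumed consensus-error decay V(k) = c/k^2.
mu, Lip, sigma, n, c = 1.0, 2.0, 0.5, 10, 5.0   # illustrative constants
U = 1.0                                          # assumed initial U(1)
for k in range(1, 10001):
    V = c / k ** 2
    U = ((1 - 1.0 / k) ** 2 * U
         + 2 * Lip / (math.sqrt(n) * mu) * math.sqrt(U * V) / k
         + Lip ** 2 / (n * mu ** 2) * V / k ** 2
         + sigma ** 2 / (n * mu ** 2) / k ** 2)

print(U, sigma ** 2 / (n * mu ** 2 * 10000))     # U(k) still tracks sigma^2/(n mu^2 k)
```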

In chapter 9, lemma 9.7 shows that there exists \(K_0\) such that for \(k>K_0\), \(R'(k)\leq \frac{\hat W}{\tilde k}+\frac{\hat V}{\tilde k^2}\), where \(\tilde k\) is some shift of \(k\), under the choice of step size \(\alpha_k=\frac{\theta}{\mu(k+K)}\), \(K:=\left\lceil\frac{2 \theta L^{2}}{\mu^{2}}\right\rceil\). For large \(\tilde k\) the \(\frac{\hat W}{\tilde k}\) term dominates, so \(R'(k)\) eventually decays at the same \(\mathcal{O}(1/k)\) rate as the centralized bound.

Remark.

  • Viewing \(R(k)\) and \(R'(k)\) as risk functions: if \(V(k)\) decays fast enough compared to \(U(k)\), we then have \(R(k)\approx R'(k)\) for large \(k\).

  • The asymptotic network independence phenomenon: after a transient, DSGD performs comparably to centralized stochastic gradient descent with the same computational power.

7.3 Possible ways to achieve asymptotic network independence

  • Considering nonconvex objective functions (e.g., distributed training of deep neural networks);

  • Exploring communication reduction techniques that do not sacrifice the asymptotic network independence property;

  • Reducing the transient time.

Additionally, an open question is whether distributed methods can compete with centralized ones when exact gradients are available.

We outline some broader perspectives related to the above issues in chapter 8.

References

Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”

Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019b. “Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning.”


  1. Smaller \(p(\mathcal{G})\) is possible for some special cases.

  2. Computing power: two processors have the same computing power if they can run the same programs (after translation into each processor’s machine language) and produce the same results.

  3. It seems \(W\) in DSGT is also symmetric?