Chapter 9 A sharp estimate of the transient time of DSGD
In this chapter, we derive an estimate of the transient time \(K_T\) of DSGD and then show it is sharp (Pu, Olshevsky, and Paschalidis 2019a). Along the way, we prove lemma 7.2 in chapter 7.
As before, we assume the \(f_i\), \(i\in\mathcal{N}\), are \(\mu\)-strongly convex with \(L\)-Lipschitz continuous gradients, the graph \(\mathcal{G}=(\mathcal{N},\mathcal{E})\) is connected and undirected, and we are able to obtain “good” gradient estimates \(g_i(x_i(k),\xi_i(k))\), i.e. assumptions 1.1, 4.1, 4.2, and 4.3 hold.
We first derive recursions for \(R'(k)\), \(U(k)\), and \(V(k)\) defined in chapter 7.
9.1 \(U(k)\) and \(V(k)\)
To keep the notation consistent, we still write \(h(X)=\frac{1}{n}\mathbf{1}^T\nabla F(X)\), which the authors denote by \(\bar\nabla F(X)\). Let \(\bar g(k)=\frac{1}{n}\mathbf{1}^TG(X(k),\boldsymbol\xi(k))\). The idea is the same as in (4.3): \[\begin{align} \bar x(k+1)-x^*&=\bar x(k)-\alpha_k\left[\bar g(k)-\frac{1}{n}\mathbf{1}^T\nabla F(X(k))\right]-\alpha_k\left[\frac{1}{n}\mathbf{1}^T\nabla F(X(k))-\nabla f(\bar x(k))\right]\\ &\quad-\alpha_k\nabla f(\bar x(k))-x^*\\ &=(\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*)-\alpha_k(\bar g(k)-h(X(k)))-\alpha_k(h(X(k))-\nabla f(\bar x(k))) \tag{9.1} \end{align}\]Then we can apply lemma 4.1 and assumption 4.1. Moreover, we use \(|\langle a,b\rangle|\leq \Vert a\Vert \cdot\Vert b\Vert\) to deal with \(\langle\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*,\nabla f(\bar x(k))-h(X(k))\rangle\). Taking the expectation conditioned on \(X(k)\) then gives lemma 9.1.
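As a sanity check, here is a small numerical verification (our own; all quantities are random placeholders) that the regrouping in (9.1) is an exact add-and-subtract identity:

```python
# Numerical check that (9.1) is an identity: xbar - a*gbar - xstar equals the
# sum of the three regrouped terms. All vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d, a = 3, 0.1                                   # dimension, stepsize alpha_k
xbar, xstar, gbar = rng.normal(size=(3, d))     # \bar x(k), x^*, \bar g(k)
h, grad_f = rng.normal(size=(2, d))             # h(X(k)), \nabla f(\bar x(k))

lhs = xbar - a * gbar - xstar
rhs = (xbar - a * grad_f - xstar) - a * (gbar - h) - a * (h - grad_f)
assert np.allclose(lhs, rhs)
```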
Here the second inequality uses the Cauchy–Schwarz inequality.
We can also split \(\frac{2 \alpha_{k} L}{\sqrt{n}}\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|\|X(k)-\mathbf{1} \bar{x}(k)\|\) as \[\begin{align} &\frac{2 \alpha_{k} L}{\sqrt{n}}\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|\|X(k)-\mathbf{1} \bar{x}(k)\|\\ &\leq \lambda^2c\Vert \bar x(k)-x^*\Vert^2 + \frac{\alpha_k^2L^2}{cn}\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 \tag{9.2} \end{align}\]for an arbitrary \(c>0\) (in chapter 4 we took \(c=\mu\)), where \(\lambda\) comes from lemma 9.2. Lemma 9.2 and lemma 3.4 differ only in the choice of \(\alpha\).
From lemma 9.1 and formula (9.2), taking the full expectation on both sides, we have, for \(\alpha_k\in(0,2/L)\),
\[\begin{align} U(k+1)\leq \lambda^2(1+c) U(k)+\frac{\alpha_k^2L^2}{n}\left(1+\frac{1}{c}\right)V(k)+\frac{\alpha_k^2\sigma^2}{n} \tag{9.3} \end{align}\]Taking \(c=\frac{3}{8}\alpha_k\mu\) and letting \(\alpha_k\leq\min\{\frac{1}{L},\frac{1}{3\mu}\}\) in (9.3), we derive lemma 9.3: \(\alpha_k\leq \frac{1}{L}\) makes \(\lambda = 1-\alpha_k\mu\) in lemma 9.2, while \(\alpha_k\leq \frac{1}{3\mu}\) makes \((1+c)\lambda^2\leq 1-\frac{3}{2}\alpha_k\mu\).
To see why \(1-\frac{3}{2}\alpha_k\mu\): write \(t=\alpha_k\mu\). With \(c=\frac{3}{8}t\) and \(\lambda=1-t\), \[(1+c)\lambda^2=\left(1+\tfrac{3t}{8}\right)(1-t)^2=1-\tfrac{13t}{8}+\tfrac{t^2}{4}+\tfrac{3t^3}{8},\] so \((1+c)\lambda^2\leq 1-\frac{3t}{2}\) is equivalent to \(2t+3t^2\leq 1\), which holds whenever \(t=\alpha_k\mu\leq\frac{1}{3}\).
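A quick numerical spot-check of this claim (our own addition):

```python
# Verify (1 + 3t/8)(1 - t)^2 <= 1 - 1.5 t for t = alpha_k * mu in (0, 1/3];
# the inequality is tight exactly at t = 1/3.
import numpy as np

t = np.linspace(1e-6, 1 / 3, 10_000)
assert np.all((1 + 3 * t / 8) * (1 - t) ** 2 <= 1 - 1.5 * t)
```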
Next, we bound \(\Vert \nabla F(X(k))\Vert^2\). Recall that each \(\nabla f_i\) is \(L\)-Lipschitz continuous, and we want the bound in terms of \(X(k)-\mathbf{1}\bar x(k)\) and \(\bar x(k)-x^*\), so we have
\[\begin{align} &\Vert \nabla F(X(k))\Vert^2\\ &= \Vert \nabla F(X(k))-\nabla F(\mathbf{1}\bar x(k))+\nabla F(\mathbf{1}\bar x(k))-\nabla F(\mathbf{1}x^*)+\nabla F(\mathbf{1}x^*)\Vert^2\\ &\leq\left(L\|X(k)-\mathbf{1} \bar{x}(k)\|+\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)^{2}\\ &\leq 2 L^{2}\|X(k)-\mathbf{1} \bar{x}(k)\|^{2}+4 n L^{2}\left\|\bar{x}(k)-x^{*}\right\|^{2}+4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2} \tag{9.6} \end{align}\]where the last step applies \((a+b)^2\leq 2a^2+2b^2\) twice: \((a+b+c)^2\leq 2a^2+2(b+c)^2\leq 2a^2+4b^2+4c^2\). Using inequalities (9.5) and (9.6) in (9.4), we have
\[\begin{align} &\frac{1}{\rho_w^2}E\left[\Vert X(k+1)-\mathbf{1}\bar x(k+1)\Vert^2|X(k)\right]-\alpha_k^2n\sigma^2\\ &\leq \Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + \alpha_k^2 \Vert \nabla F(X(k))\Vert^2 + 2\alpha_k\Vert X(k)-\mathbf{1}\bar x(k)\Vert\cdot \Vert \nabla F(X(k))\Vert \end{align}\] For the last term, from (9.6), \[\begin{align} &2\alpha_k\Vert X(k)-\mathbf{1}\bar x(k)\Vert\cdot \Vert \nabla F(X(k))\Vert\\ &\leq2\alpha_k \Vert X(k)-\mathbf{1}\bar x(k)\Vert\left(L\|X(k)-\mathbf{1} \bar{x}(k)\|+\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)\\ &\leq 2\alpha_kL\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + c\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2+\frac{\alpha_k^2}{c}\left(\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)^2\\ &\leq (2\alpha_kL+c)\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + \frac{\alpha_k^2}{c}\left(2 n L^{2}\left\|\bar{x}(k)-x^{*}\right\|^{2}+2\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}\right) \end{align}\]for any \(c>0\); the last inequality uses \((a+b)^2\leq 2(a^2+b^2)\). Letting \(c=\frac{1-\rho_w^2}{2}\), as in (4.6) of chapter 4, we obtain lemma 9.4.
9.2 Asymptotic network independence of DSGD
In this section, we first show that for algorithm 7.1, \(U(k)=\mathcal O(\frac{1}{k})\) and \(V(k)=\mathcal O(\frac{1}{k^2})\), i.e. algorithm 7.1 enjoys a sublinear convergence rate. More specifically, we show that there exists \(N\) such that for \(k>N\), \[\begin{equation} U(k)\leq \frac{\hat W(1-\rho_w^2)}{\tilde k},\quad V(k)\leq \frac{\hat V(1-\rho_w^2,\hat W)}{\tilde k^2} \tag{9.7} \end{equation}\] where \(\tilde k\) is some shift of \(k\), and \(\hat W(\cdot)\) and \(\hat V(\cdot)\) are functions. The goal is to show that, asymptotically, \(\frac{1}{n}\sum\limits_{i=1}^nE\left[\Vert x_i(k)-x^*\Vert^2\right]=R'(k)=U(k)+\frac{1}{n}V(k)\) has the same convergence rate as \(R(k)\) in SGD. Notice that \(V(k)\) can be shown to decay faster than \(U(k)\), so if we can bound \(\hat W(1-\rho_w^2)\) by a quantity \(C\) which does not depend on \(\rho_w^2\), i.e. does not depend on the network, then we have \[\begin{equation} R'(k)\leq\frac{C}{\tilde k}+\frac{1}{n} \frac{\hat V(1-\rho_w^2,\hat W)}{\tilde k^2} \tag{9.8} \end{equation}\]which shows the asymptotic network independence property of DSGD. Then from (9.8) we can obtain the transient time for DSGD to reach the asymptotic convergence rate.
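To make the transient-time idea concrete, here is a toy computation (all constants are made up) of the iteration at which the network-dependent \(1/\tilde k^2\) term in (9.8) stops dominating:

```python
# Find the crossover k where C/k >= (1/n) * Vhat / k^2, i.e. k >= Vhat / (n*C).
# Beyond this point the bound (9.8) matches centralized SGD up to constants.
import numpy as np

C, Vhat, n = 1.0, 1e4, 10                  # hypothetical constants
k = np.arange(1, 10_000)
crossover = k[np.argmax(C / k >= Vhat / (n * k**2))]
print(crossover)                           # 1000 = Vhat / (n * C)
```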
We first give a uniform bound for \[\begin{align} E\left[\Vert X(k)-\mathbf{1}x^*\Vert^2\right]&=E\left[\sum_{i=1}^n\Vert x_i(k)-x^*\Vert^2\right]\\ &=\sum_{i=1}^n E\left[\Vert x_i(k)-x^*\Vert^2\right]\\ &=nR'(k) \end{align}\]Remark.
- The recursion gives \(X(k)-\mathbf{1}x^*=W^k(X(0)-\mathbf{1}x^*)-\sum\limits_{i+j=k}W^i\alpha_jG(j)\); the first part of the \(\max\{\cdot\}\) comes from the initial-condition term \(W^k(X(0)-\mathbf{1}x^*)\).
- The authors appear to derive the second part of the \(\max\{\cdot\}\) from \(E\left[\|X(k)\|^{2}\right] \leq \max \left\{\|X(0)\|^{2}, \sum_{i=1}^{n} R_{i}\right\}\).
9.2.1 Sublinear rate
Recall \(R'(k)=U(k)+\frac{1}{n}V(k)\); accordingly, we introduce \(W(k)\) as \[\begin{equation} W(k):=U(k)+\omega(k)V(k),\quad \forall k\geq0 \tag{9.9} \end{equation}\]where \(\omega(k)>0\) is to be determined later. The function in (9.9) serves as a Lyapunov function.
From lemma 9.5 we have the uniform bound \(R'(k)\leq \frac{\hat X}{n}\); we want the same property for \(W(k)\), i.e. \(W(k)\leq\frac{\hat X}{n}\). This is how we determine \(\omega(k)\). From lemmas 9.3 and 9.4, we then establish a recursion for \(W(k)\). Unrolling it by induction produces a product of the form \(\prod\limits_{t=a}^{k-1}(1-\frac{\gamma}{t})\), which lemma 9.6 lets us bound (see the numerical check below). Then \(U(k)\) is bounded via \(U(k)\leq W(k)\), and \(V(k)\) is bounded using lemma 9.4 together with the bound for \(U(k)\).
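Lemma 9.6 is not restated in these notes; a standard bound of this type, which we assume here, is \(\prod_{t=a}^{k-1}(1-\frac{\gamma}{t})\leq(\frac{a}{k})^{\gamma}\) for \(1<\gamma\leq a\) (via \(1-x\leq e^{-x}\) and \(\sum_{t=a}^{k-1}\frac{1}{t}\geq\ln\frac{k}{a}\)). A quick numerical check of that assumed form:

```python
# Check prod_{t=a}^{k-1} (1 - gamma/t) <= (a/k)^gamma numerically.
import numpy as np

a, gamma = 12, 3.0
for k in (20, 100, 1000):
    t = np.arange(a, k)
    assert np.prod(1 - gamma / t) <= (a / k) ** gamma
```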
Lemma 9.7 shows that algorithm 7.1 with the stepsize policy \[ \alpha_k=\frac{\theta}{\mu(k+K)},\quad K:=\left\lceil\frac{2 \theta L^{2}}{\mu^{2}}\right\rceil \] enjoys a sublinear rate (a toy simulation of this policy appears at the end of this subsection), where \[ \hat{W}:=\frac{K_{1} \hat{X}}{n}+\frac{3}{(4 \theta-3)}\left(\frac{\sigma^{2} \theta^{2}}{n \mu^{2}}+\frac{\sigma^{2} \rho_{w}^{2} \theta^{2}}{2 \mu^{2}}\right)+\frac{12\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2} \rho_{w}^{2} \theta^{2}}{(4 \theta-3) n \mu^{2}\left(1-\rho_{w}^{2}\right)} \]
\[ \hat{V}:=\max \left\{K_{1}^{2} \hat{X}, \frac{8 \theta^{2} \rho_{w}^{2}}{\mu^{2}\left(1-\rho_{w}^{2}\right)}\left[\frac{4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}}{\left(1-\rho_{w}^{2}\right)}+n \sigma^{2}+\frac{4 n L^{2} \hat{W}}{\left(1-\rho_{w}^{2}\right) K_{1}}\right]\right\} \]
\[ \tilde k=k+K \]
Remark.
\(\omega(k)=\frac{12 \alpha_{k} L^{2}}{n \mu\left(1-\rho_{w}^{2}\right)}=\frac{f(\alpha_k)}{n(1-\rho_w^2)}\leq \frac{1}{n}\) so that \(W(k)\leq \frac{\hat X}{n}\). This requires \(f(\alpha_k)=\frac{12\alpha_kL^2}{\mu}\leq 1-\rho_w^2\), which is satisfied for the choice \(K_1=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\geq \frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\); indeed, for \(\tilde k=k+K\geq K_1\), \(f(\alpha_k)=\frac{12\theta L^2}{\mu^2\tilde k}\leq\frac{1-\rho_w^2}{2}\leq 1-\rho_w^2\).
The uniform bound on \(R'(k)\) (lemma 9.5) gives bounds for the shifted quantities: \(\tilde U(k)\leq \frac{\hat X}{n}\), \(\tilde V(k)\leq \hat X\), and \(\tilde W(k)\leq\frac{\hat X}{n}\). This can also be seen from the definition in (9.10), which shifts \(U(k)\), \(V(k)\), and \(W(k)\) horizontally (e.g. \(\tilde V(k)=V(k-K)\)).
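Below is a minimal simulation sketch of algorithm 7.1 with this stepsize policy, assuming the update \(X(k+1)=W(X(k)-\alpha_k G(k))\) consistent with the recursion noted earlier. The quadratic objectives, ring topology, and noise model are toy choices of ours; only the update rule and stepsize follow the text.

```python
# Toy DSGD run: n agents with f_i(x) = 0.5 * m_i * ||x - b_i||^2, lazy ring
# weights, stepsize alpha_k = theta / (mu * (k + K)). Illustration only.
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, theta = 8, 2, 0.5, 3.0

m = rng.uniform(1.0, 2.0, size=n)                # f_i is m_i-strongly convex
b = rng.normal(size=(n, d))
mu, L = m.min(), m.max()                         # global mu and L
x_star = (m @ b) / m.sum()                       # minimizer of f = (1/n) sum f_i

W = np.eye(n) / 2                                # lazy ring: 1/2 self, 1/4 each neighbor
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

K = int(np.ceil(2 * theta * L**2 / mu**2))
X = rng.normal(size=(n, d))
for k in range(50_000):
    alpha = theta / (mu * (k + K))
    G = m[:, None] * (X - b) + sigma * rng.normal(size=(n, d))  # noisy gradients
    X = W @ (X - alpha * G)                      # DSGD step

print(np.mean(np.sum((X - x_star) ** 2, axis=1)))  # estimate of R'(k) ~ O(1/k)
```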
9.2.2 Asymptotic network independence
In this section, we show the asymptotic network independence of algorithm 7.1 with the stepsize policy \(\alpha_k=\frac{\theta}{\mu(k+K)}\). That is, although \(R'(k)=U(k)+\frac{1}{n}V(k)\) depends on the network through the term \(1-\rho_w^2\), we show that this part decays faster. Hence there exist \(K_0\) and \(C\) such that for \(k\geq K_0\), \(R'(k)\leq \frac{C}{\tilde k}\).
From lemma 9.3, substituting \(\alpha_k=\frac{\theta}{\mu(k+K)}\) and letting \(k'=k+K\), we get \[\begin{equation} \tilde{U}(k'-K+1) \leq\left(1-\frac{3 \theta}{2 k'}\right) \tilde{U}(k'-K)+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{\tilde{V}(k'-K)}{k'}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k'^{2}}, \quad \forall k' \geq K_{1} \end{equation}\] Then from (9.10), we simplify the above as \[\begin{equation} \tilde{U}(k+1) \leq\left(1-\frac{3 \theta}{2 k}\right) \tilde{U}(k)+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{\tilde{V}(k)}{k}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}}, \quad \forall k \geq K_{1} \tag{9.11} \end{equation}\] By induction and lemma 9.6, we bound \(\tilde U(k)\) with the network-dependent term decaying faster, i.e. theorem 9.1. Additionally, since \(\tilde V(k)=V(k-K)\leq \frac{\hat V}{k^2}\) by lemma 9.7, (9.11) becomes \[\begin{align} \tilde U(k)\leq \frac{K_{1}^{1.5 \theta}}{k^{1.5 \theta}} \tilde{U}\left(K_{1}\right)+\sum_{t=K_{1}}^{k-1} \frac{(t+1)^{1.5 \theta}}{k^{1.5 \theta}}\left(\frac{\theta^{2} \sigma^{2}}{n \mu^{2} t^{2}}+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{\hat V}{t^3}\right) \tag{9.12} \end{align}\]The inequality holds for \(1<\frac{3\theta}{2}\leq \frac{K_1}{2}\), where \(K_1:=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\). To simplify (9.12), we claim that, with \(a=K_1\) and \(b=k\), \[ \sum_{t=a}^{b-1}\frac{(t+1)^{1.5\theta}}{t^2}\leq \frac{b^{1.5 \theta-1}}{1.5 \theta-1}+\frac{3 b^{1.5 \theta-2}}{1.5 \theta-2}+3 b^{1.5 \theta-2} \] and \[ \sum_{t=a}^{b-1} \frac{(t+1)^{1.5 \theta}}{t^{3}} \leq 2\int_{a}^{b} t^{1.5 \theta-3} d t \leq \frac{2 b^{1.5 \theta-2}}{1.5 \theta-2}, \] where the factor \(2\) absorbs the \((1+1/t)^{1.5\theta}\) correction for \(t\geq K_1\).
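A rough numerical check of these two claimed bounds (our own, with arbitrary \(\theta>2\) and a start index \(a\geq 3\theta\)):

```python
# Check the two sum bounds above for p = 1.5*theta and a K_1-like start index.
import numpy as np

theta, a = 3.0, 10
p = 1.5 * theta
for b in (100, 1000):
    t = np.arange(a, b)
    s2 = np.sum((t + 1) ** p / t**2)
    s3 = np.sum((t + 1) ** p / t**3)
    assert s2 <= b ** (p - 1) / (p - 1) + 3 * b ** (p - 2) / (p - 2) + 3 * b ** (p - 2)
    assert s3 <= 2 * b ** (p - 2) / (p - 2)
```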
Then, \[\begin{align} \tilde{U}(k) &\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2} k}+\frac{3 \theta^{2}(1.5 \theta-1) \sigma^{2}}{(1.5 \theta-2) n \mu^{2}} \frac{1}{k^{2}}+\frac{K_{1}^{1.5 \theta}}{k^{1.5 \theta}} \tilde{U}\left(K_{1}\right)+\frac{6 \theta L^{2} \hat{V}}{(1.5 \theta-2) n \mu^{2}} \frac{1}{k^{2}}\\ &\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2} \tilde{k}}+\left[\frac{3 \theta^{2}(1.5 \theta-1) \sigma^{2}}{(1.5 \theta-2) n \mu^{2}}+\frac{6 \theta L^{2} \hat{V}}{(1.5 \theta-2) n \mu^{2}}\right] \frac{1}{\tilde{k}^{2}} \\ &\quad +\frac{K_1^{1.5\theta}\hat X}{n}\frac{1}{\tilde k^3},\forall k\geq K_1-K \tag{9.13} \end{align}\] Remark.
Here \(\hat V=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right)\) under some constraints on \(\Vert X(0)-\mathbf{1}x^*\Vert^2\) and \(\Vert \nabla F(\mathbf{1}x^*)\Vert^2\), according to lemma 9.8.
Lemma 9.8 Suppose \(\sum\limits_{i=1}^{n}\left\|x_{i}(0)-x^{*}\right\|^{2}=\mathcal{O}(n)\) and \(\sum\limits_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}=\mathcal{O}(n)\). Then \[ \hat V=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right) \]
This is because, given such conditions on \(\Vert X(0)-\mathbf{1}x^*\Vert^2\) and \(\Vert \nabla F(\mathbf{1}x^*)\Vert^2\), we have \(\hat X=\mathcal{O}(n)\) and, from lemma 9.7, \(\hat W=\mathcal{O}\left(\frac{1}{1-\rho_w^2}\right)\).
For \(\hat V\), noticing \(K_1=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\leq \frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}+1\), we have \[\begin{align} \hat V&=\max \left\{K_{1}^{2} \hat{X}, \frac{8 \theta^{2} \rho_{w}^{2}}{\mu^{2}\left(1-\rho_{w}^{2}\right)}\left[\frac{4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}}{\left(1-\rho_{w}^{2}\right)}+n \sigma^{2}+\frac{4 n L^{2} \hat{W}}{\left(1-\rho_{w}^{2}\right) K_{1}}\right]\right\}\\ &\leq\max\left\{C_1\frac{n}{(1-\rho_w^2)^2}, C_2\frac{n\rho_w^2}{(1-\rho_w^2)^2}+C_3\frac{n\rho_w^2}{1-\rho_w^2}+C_4\frac{n\rho_w^2}{(1-\rho_w^2)^2}\right\}\\ &=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right) \end{align}\] for some constants \(C_1,\dots,C_4\). Recall \(R'(k)=U(k)+\frac{1}{n}V(k)\); from theorem 9.1, lemma 9.7, and lemma 9.8, we obtain \[\begin{equation} R'(k)\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2}} \frac{1}{\tilde{k}}+\mathcal{O}\left(\frac{1}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{\tilde{k}^{2}} \tag{9.14} \end{equation}\]Next we improve the bound in theorem 9.1.
9.2.3 Improved Bound
In the derivation of theorem 9.1, we started from lemma 9.3, which is based on lemma 9.1. From lemma 9.1 to lemma 9.3, the Cauchy–Schwarz inequality is used twice. Now we start directly from lemma 9.1 and do not introduce an arbitrary \(c>0\) (see (9.2)). Moreover, we consider the situation where \(\alpha_k=\frac{\theta}{\mu(k+K)}\) and \(k\geq K_1-K\). Then we have \[\begin{equation} \tilde{U}(k+1) \leq\left(1-\frac{2 \theta}{k}\right) \tilde{U}(k)+\frac{\theta^{2} \tilde{U}(k)}{k^{2}}+\frac{2 \theta L}{\sqrt{n} \mu} \frac{\sqrt{\tilde{U}(k) \tilde{V}(k)}}{k}+\frac{\theta^{2} L^{2}}{n \mu^{2}} \frac{\tilde{V}(k)}{k^{2}}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}} \end{equation}\]Here \((1-\frac{\theta}{k})^2\) is expanded so that the powers of \(\frac{1}{k}\) are explicit. Then by induction, lemma 9.6, lemma 9.7, and a similar bounding of the sums \(\sum\limits_{t=K_1}^{k-1}\frac{(t+1)^b}{t^a}\), we obtain improved results for \(R'(k)=\frac{1}{n} \sum\limits_{i=1}^{n} E\left[\left\|x_{i}(k)-x^{*}\right\|^{2}\right]\).
9.3 Transient time
First we derive the convergence rate of centralized SGD (algorithm 7.2). Similarly, using lemma 9.2 and assumption 4.1, \[\begin{align} &E\left[\Vert x(k+1)-x^*\Vert^2|x(k) \right]\\ &\quad=E\left[\Vert x(k)-\alpha_k\nabla f(x(k))-x^*-\alpha_k (\tilde g(k)-\nabla f(x(k)))\Vert^2\,\big|\,x(k) \right]\\ &\quad\leq (1-\alpha_k\mu)^2\Vert x(k)-x^*\Vert^2+\frac{\alpha_k^2}{n^2}\sum_{i=1}^nE\left[\Vert \nabla f_i(x(k))-g_i(k)\Vert^2|x(k)\right]\\ &\quad\leq (1-\alpha_k\mu)^2\Vert x(k)-x^*\Vert^2+\frac{\alpha_k^2\sigma^2}{n} \tag{9.15} \end{align}\] Taking the full expectation on both sides of (9.15) proves lemma 7.1 in chapter 7. Substituting \(\alpha_k=\frac{\theta}{\mu k}\) in (9.15) and taking the full expectation on both sides, we have \[\begin{equation} R(k+1) \leq \left(1-\frac{2 \theta}{k}\right)R(k)+\frac{\theta^{2}}{k^{2}}R(k)+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}} \end{equation}\]First we show that there exists \(c_3=\mathcal{O}(\frac{1}{n})\) such that \(R(k)=E\left[\Vert x(k)-x^*\Vert^2\right]\leq \frac{c_3}{k}\) for all \(k\geq K_2:=\left\lceil\frac{\theta L}{\mu}\right\rceil\). Then by induction, lemma 9.6, and the bounds for the sums, we obtain the convergence rate for algorithm 7.2.
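As a sanity check (our own, with made-up constants), iterating this recursion numerically shows \(kR(k)\) leveling off at \(\frac{\theta^2\sigma^2}{(2\theta-1)n\mu^2}\), the constant that reappears in the remark below:

```python
# Iterate R(k+1) <= (1 - 2*theta/k + (theta/k)^2) R(k) + theta^2 sigma^2/(n mu^2 k^2)
# and watch k * R(k) converge to theta^2 sigma^2 / ((2*theta - 1) * n * mu^2).
theta, sigma, n, mu = 3.0, 1.0, 10, 1.0
R, K2 = 1.0, 10                       # K2 stands in for ceil(theta * L / mu)
for k in range(K2, 200_000):
    R = (1 - 2 * theta / k + (theta / k) ** 2) * R + theta**2 * sigma**2 / (n * mu**2 * k**2)
print(200_000 * R, theta**2 * sigma**2 / ((2 * theta - 1) * n * mu**2))
```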
Comparing theorem 9.3 and theorem 9.2 (or formula (9.14)), we derive the transient time \(K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right)\).
Remark.
From formula (9.14), we have \[
R'(k)\leq \frac{\theta^{2} \sigma^{2}}{ (2\theta-1)n \mu^{2}\tilde k}\left[\frac{2\theta-1}{1.5\theta-1}+\mathcal{O}\left(\frac{n}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{\tilde{k}}\right]
\] Since \(\theta>2\) is a constant, setting \[
\mathcal{O}\left(\frac{n}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{K_T}=\mathcal{O}(1),
\] we obtain \[
K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right)
\]
\[\begin{align} \frac{1}{(1-\rho_w)^2} \bigg/ \frac{1}{(1-\rho_w^2)^2}&=\left(\frac{1-\rho_w^2}{1-\rho_w}\right)^2\\ &=(1+\rho_w)^2 \end{align}\] where \(\rho_w<1\), so the ratio lies in \([1,4)\). Hence \(\mathcal{O}(\frac{1}{(1-\rho_w^2)^2})\) and \(\mathcal{O}(\frac{1}{(1-\rho_w)^2})\) have the same order.
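As an illustration (ours) of how strongly \(K_T\) can depend on the topology, the following computes \(\rho_w\) as the second largest singular value of a lazy ring weight matrix; for a ring, \(1-\rho_w^2=\Theta(1/n^2)\), so \(K_T=\mathcal{O}(n/(1-\rho_w^2)^2)\) grows like \(n^5\):

```python
# K_T scaling on a ring: rho_w = second largest singular value of W.
import numpy as np

for n in (8, 16, 32, 64):
    W = np.eye(n) / 2
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25
    rho = np.linalg.svd(W, compute_uv=False)[1]   # singular values sorted descending
    print(n, rho, n / (1 - rho**2) ** 2)          # K_T up to constants
```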
9.5 Summary
Recall that in chapter 7 we have (7.1), i.e. \[\begin{equation} \text { Time }_{n, \varepsilon} \text { (decentralized) } \leq p(\mathcal{G}) \text { Time }_{n, \varepsilon} \text { (centralized) } \end{equation}\]Corollary 9.1 indicates that for DSGD with \(\alpha_k=\frac{\theta}{\mu(k+K)}\), \(\theta>2\), once \(k\geq K_T\) we have \(p(\mathcal{G})=\mathcal{O}(1)\); this is what we call asymptotic network independence.
References
Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”