Chapter 9 A sharp estimate of the transient time of DSGD

In this chapter, we derive an estimate of the transient time \(K_T\) of DSGD and then show that it is sharp (Pu, Olshevsky, and Paschalidis 2019a). Along the way, we prove lemma 7.2 from chapter 7.

As before, we assume the \(f_i, i\in\mathcal{N}\), are \(\mu\)-strongly convex with \(L\)-Lipschitz continuous gradients, the graph \(\mathcal{G}=(\mathcal{N},\mathcal{E})\) is connected and undirected, and we are able to obtain “good” gradient estimates \(g_i(x_i(k),\xi_i(k))\), i.e. assumptions 1.1, 4.1, 4.2, and 4.3 hold.
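To keep a concrete instance in mind throughout the chapter, here is a minimal sketch of a toy problem satisfying these assumptions (all names and numbers are mine, not from the paper): quadratic local objectives with curvature in \([\mu,L]\) (assumption 1.1), Metropolis weights on a ring (symmetric and doubly stochastic, so \(\rho_w<1\) as in assumption 4.1), and unbiased gradient noise with \(E\Vert g_i-\nabla f_i\Vert^2=\sigma^2\) (assumptions 4.2 and 4.3).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu, L, sigma = 10, 3, 1.0, 4.0, 2.0

def ring(n):
    """Metropolis weights on a ring: self, left, right each get 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    return W

W = ring(n)
rho_w = np.linalg.norm(W - np.full((n, n), 1 / n), 2)  # spectral norm of W - (1/n) 11^T
print(f"rho_w = {rho_w:.4f}")                          # < 1 since the ring is connected

# f_i(x) = 0.5 (x - c_i)^T diag(h) (x - c_i): mu-strongly convex, L-Lipschitz gradient
h = np.linspace(mu, L, p)
c = rng.standard_normal((n, p))

def stoch_grad(i, x):
    """g_i(x, xi): unbiased, with E||g_i - grad f_i||^2 = sigma^2."""
    return h * (x - c[i]) + rng.normal(0.0, sigma / np.sqrt(p), size=p)
```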

We first derive the recursions for \(R'(k)\), \(U(k)\), and \(V(k)\) defined in chapter 7.

9.1 \(U(k)\) and \(V(k)\)

To keep notation consistent, we still use \(h(X)=\frac{1}{n}\mathbf{1}^T\nabla F(X)\), which the authors denote by \(\bar\nabla F(X)\). Let \(\bar g(k)=\frac{1}{n}\mathbf{1}^TG(X(k),\boldsymbol\xi(k))\).¹ The idea is the same as in (4.3): \[\begin{align} \bar x(k+1)-x^*&=\bar x(k)-\alpha_k\left[\bar g(k)-\frac{1}{n}\mathbf{1}^T\nabla F(X(k))\right]\\ &\quad-\alpha_k\left[\frac{1}{n}\mathbf{1}^T\nabla F(X(k))-\nabla f(\bar x(k))\right]-\alpha_k\nabla f(\bar x(k))-x^*\\ &=\left(\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*\right)-\alpha_k\left(\bar g(k)-h(X(k))\right)\\ &\quad-\alpha_k\left(h(X(k))-\nabla f(\bar x(k))\right) \tag{9.1} \end{align}\]

Then we can use lemma 4.1 and assumption 4.1. Moreover, we use \(|\langle a,b\rangle|\leq \Vert a\Vert \cdot\Vert b\Vert\) to deal with \(\langle\bar x(k)-\alpha_k\nabla f(\bar x(k))-x^*,\nabla f(\bar x(k))-h(X(k))\rangle\). Taking expectation conditioned on \(X(k)\), we obtain lemma 9.1.

Lemma 9.1 For algorithm 7.1, \(\forall k\geq0\), we have, \[\begin{align} &E\left[\left\|\bar{x}(k+1)-x^{*}\right\|^{2} | X(k)\right] \leq\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|^{2}\\ &+\frac{2 \alpha_{k} L}{\sqrt{n}}\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|\|X(k)-\mathbf{1} \bar{x}(k)\|+\frac{\alpha_{k}^{2} L^{2}}{n}\|X(k)-\mathbf{1} \bar{x}(k)\|^{2}+\frac{\alpha_{k}^{2} \sigma^{2}}{n} \end{align}\]
From lemma 9.1, let \(\alpha_k=\frac{1}{\mu k}\) (note \(\alpha_k<\frac{2}{\mu+L}\) once \(k>\frac{\mu+L}{2\mu}\)), use lemma 3.4, and take full expectation on both sides; this proves lemma 7.2 in chapter 7. \[\begin{align} U(k+1)&\leq \left(1-\frac{1}{k}\right)^2 U(k)+\frac{2L\left(1-\frac{1}{k}\right)}{\mu k\sqrt{n}}E\left[\Vert \bar x(k)-x^*\Vert\cdot \Vert X(k)-\mathbf{1}\bar x(k)\Vert\right]\\ &\quad+\frac{L^2}{\mu^2 k^2n}V(k)+\frac{\sigma^2}{\mu^2k^2n}\\ &\leq \left(1-\frac{1}{k}\right)^2 U(k)+\frac{2L}{\mu\sqrt{n}}\frac{\sqrt{U(k)V(k)}}{k}+\frac{L^2}{\mu^2 k^2n}V(k)+\frac{\sigma^2}{\mu^2k^2n} \end{align}\]

where we use the Cauchy–Schwarz inequality \(E[ab]\leq\sqrt{E[a^2]E[b^2]}\) in the second inequality.

We can also split \(\frac{2 \alpha_{k} L}{\sqrt{n}}\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|\|X(k)-\mathbf{1} \bar{x}(k)\|\) as \[\begin{align} &\frac{2 \alpha_{k} L}{\sqrt{n}}\left\|\bar{x}(k)-\alpha_{k} \nabla f(\bar{x}(k))-x^{*}\right\|\|X(k)-\mathbf{1} \bar{x}(k)\|\\ &\leq c\lambda^2\Vert \bar x(k)-x^*\Vert^2 + \frac{\alpha_k^2L^2}{cn}\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 \tag{9.2} \end{align}\]

for an arbitrary \(c>0\) (in chapter 4 we took \(c=\mu\)). Here \(\lambda\) comes from lemma 9.2, which differs from lemma 3.4 only in the admissible range of \(\alpha\).

Lemma 9.2 Let assumption 1.1 hold. Then \(\forall x\in\mathbb{R}^p\) and \(\alpha\in(0,2/L)\), we have \[ \left\|x-\alpha \nabla f(x)-x^{*}\right\| \leq \lambda\left\|x-x^{*}\right\| \] where \(\lambda=\max (|1-\alpha \mu|,|1-\alpha L|)\).
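For a quadratic \(f\), the contraction factor in lemma 9.2 is exact on the extreme eigendirections, which makes the lemma easy to sanity-check numerically. Below is a throwaway check (all names and numbers are mine): for \(f(x)=\frac{1}{2}x^THx\) with the spectrum of \(H\) in \([\mu,L]\), so \(x^*=0\), the ratio \(\Vert x-\alpha\nabla f(x)\Vert/\Vert x\Vert\) never exceeds \(\lambda\).

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu, L = 5, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
H = Q @ np.diag(np.linspace(mu, L, p)) @ Q.T     # f(x) = 0.5 x^T H x, x* = 0

for alpha in (0.05, 0.2, 1 / L, 0.45):           # all within (0, 2/L) = (0, 0.5)
    lam = max(abs(1 - alpha * mu), abs(1 - alpha * L))
    xs = rng.standard_normal((1000, p))
    ratios = np.linalg.norm(xs - alpha * xs @ H, axis=1) / np.linalg.norm(xs, axis=1)
    assert ratios.max() <= lam + 1e-12
    print(f"alpha={alpha:.3f}: worst ratio {ratios.max():.4f} <= lambda {lam:.4f}")
```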

From lemma 9.1 and formula (9.2), taking full expectation on both sides, we have for \(\alpha_k\in(0,2/L)\),

\[\begin{align} U(k+1)\leq (1+c)\lambda^2 U(k)+\frac{\alpha_k^2L^2}{n}\left(1+\frac{1}{c}\right)V(k)+\frac{\alpha_k^2\sigma^2}{n} \tag{9.3} \end{align}\]

Take \(c=\frac{3}{8}\alpha_k\mu\) and let \(\alpha_k\leq\min\{\frac{1}{L},\frac{1}{3\mu}\}\) in (9.3); this yields lemma 9.3. The condition \(\alpha_k\leq \frac{1}{L}\) makes \(\lambda = 1-\alpha_k\mu\) in lemma 9.2, while \(\alpha_k\leq \frac{1}{3\mu}\) makes \((1+c)\lambda^2\leq 1-\frac{3}{2}\alpha_k\mu\).

To see why \(1-\frac{3}{2}\alpha_k\mu\) appears: write \(t:=\alpha_k\mu\in(0,\frac{1}{3}]\). With \(\lambda=1-t\) and \(c=\frac{3}{8}t\), \[\begin{align} (1+c)\lambda^2&=\left(1+\tfrac{3}{8}t\right)(1-t)^2=1-\tfrac{13}{8}t+\tfrac{1}{4}t^2+\tfrac{3}{8}t^3\\ &\leq 1-\left(\tfrac{13}{8}-\tfrac{1}{12}-\tfrac{1}{24}\right)t=1-\tfrac{3}{2}t, \end{align}\] since \(t\leq\frac{1}{3}\) gives \(\frac{1}{4}t^2\leq\frac{1}{12}t\) and \(\frac{3}{8}t^3\leq\frac{1}{24}t\).
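A quick numerical sweep (my own sanity check, not from the paper) confirms the scalar inequality above, with equality exactly at \(t=\alpha_k\mu=\frac{1}{3}\):

```python
import numpy as np

t = np.linspace(1e-6, 1 / 3, 100_000)        # t = alpha_k * mu in (0, 1/3]
lhs = (1 + 3 * t / 8) * (1 - t) ** 2         # (1 + c) * lambda^2
rhs = 1 - 1.5 * t
assert np.all(lhs <= rhs + 1e-12)
print("max gap:", np.max(rhs - lhs), " gap at t=1/3:", rhs[-1] - lhs[-1])
```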

Lemma 9.3 Under algorithm 7.1, let \(\alpha_k\leq\min\{\frac{1}{L},\frac{1}{3\mu}\}\), then \[ U(k+1) \leq\left(1-\frac{3}{2} \alpha_{k} \mu\right) U(k)+\frac{3 \alpha_{k} L^{2}}{n \mu} V(k)+\frac{\alpha_{k}^{2} \sigma^{2}}{n} \]
For \(V(k+1)\), similar to (4.6), denote \(G(k):=G(X(k),\boldsymbol\xi(k))\). Then \[\begin{align} &E\left[\Vert X(k+1)-\mathbf{1}\bar x(k+1)\Vert^2\,|\,X(k)\right]\\ &=E\left[\Vert WX(k)-\alpha_kWG(k)-\mathbf{1}(\bar x(k)-\alpha_k\bar g(k))\Vert^2\,|\,X(k)\right]\\ &\leq \rho_w^2\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2+\alpha_k^2\rho_w^2E\left[\Vert G(k)-\mathbf{1}\bar g(k)\Vert^2\,|\,X(k)\right]\\ &\quad- 2\alpha_k\rho_w^2E\left[\langle X(k)-\mathbf{1}\bar x(k),G(k)-\mathbf{1}\bar g(k)\rangle\,|\,X(k)\right]\\ &= \rho_w^2\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2+\alpha_k^2\rho_w^2E\left[\Vert G(k)-\mathbf{1}\bar g(k)\Vert^2\,|\,X(k)\right]\\ &\quad- 2\alpha_k\rho_w^2\langle X(k)-\mathbf{1}\bar x(k),\nabla F(X(k))-\mathbf{1}h(X(k))\rangle \tag{9.4} \end{align}\] where the last step uses \(E[G(k)\,|\,X(k)]=\nabla F(X(k))\). Next, we show \[\begin{align} E\left[\Vert G(k)-\mathbf{1}\bar g(k)\Vert^2\,|\,X(k)\right] &\leq\|\nabla F(X(k))-\mathbf{1} h(X(k))\|^{2}+n \sigma^{2},\\ \|\nabla F(X(k))-\mathbf{1} h(X(k))\|^{2}&\leq \Vert \nabla F(X(k))\Vert^2 \tag{9.5} \end{align}\] This is because \[\begin{align} &E\left[\Vert G(k)-\mathbf{1}\bar g(k)\Vert^2\,|\,X(k)\right]\\ &=E\left[\Vert (G(k)-\nabla F(X(k)))-\mathbf{1}(\bar g(k)-h(X(k)))+(\nabla F(X(k))-\mathbf{1}h(X(k)))\Vert^2\,|\,X(k)\right]\\ &= E\left[\Vert G(k)-\nabla F(X(k))\Vert^2\,|\,X(k)\right]-nE\left[\Vert \bar g(k)-h(X(k)) \Vert^2\,|\,X(k)\right]\\ &\quad+\Vert \nabla F(X(k))-\mathbf{1}h(X(k))\Vert^2\\ &\leq n \sigma^{2}+ \|\nabla F(X(k))-\mathbf{1} h(X(k))\|^{2}, \end{align}\] and \[\begin{align} \Vert \nabla F(X(k))-\mathbf{1}h(X(k))\Vert^2 &= \Vert \nabla F(X(k))\Vert^2 +n\Vert h(X(k))\Vert^2 - 2\langle\nabla F(X(k)),\mathbf{1}h(X(k))\rangle\\ &= \Vert \nabla F(X(k))\Vert^2-n\Vert h(X(k))\Vert^2\leq \Vert \nabla F(X(k))\Vert^2, \end{align}\] where, using \(\langle\mathbf{A}, \mathbf{B}\rangle:=\sum_{i=1}^{n}\left\langle A_{i}, B_{i}\right\rangle\), \[\begin{align} \langle\nabla F(X(k)),\mathbf{1}h(X(k))\rangle&=\sum_{i=1}^n\left\langle\nabla f_i(x_i(k)),\frac{1}{n}\sum_{j=1}^n\nabla f_j(x_j(k))\right\rangle\\ &=n\left\langle \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_i(k)),\frac{1}{n}\sum_{j=1}^n\nabla f_j(x_j(k))\right\rangle\\ &=n\Vert h(X(k))\Vert^2 \end{align}\]

Next, we bound \(\Vert \nabla F(X(k))\Vert^2\). Recall that each \(\nabla f_i\) is \(L\)-Lipschitz continuous, and we want a bound in terms of \(X(k)-\mathbf{1}\bar x(k)\) and \(\bar x(k)-x^*\), so we insert \(\nabla F(\mathbf{1}\bar x(k))\) and \(\nabla F(\mathbf{1}x^*)\):

\[\begin{align} &\Vert \nabla F(X(k))\Vert^2\\ &= \Vert \nabla F(X(k))-\nabla F(\mathbf{1}\bar x(k))+\nabla F(\mathbf{1}\bar x(k))-\nabla F(\mathbf{1}x^*)+\nabla F(\mathbf{1}x^*)\Vert^2\\ &\leq\left(L\|X(k)-\mathbf{1} \bar{x}(k)\|+\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)^{2}\\ &\leq 2 L^{2}\|X(k)-\mathbf{1} \bar{x}(k)\|^{2}+4 n L^{2}\left\|\bar{x}(k)-x^{*}\right\|^{2}+4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2} \tag{9.6} \end{align}\] The last step uses \((a+b+c)^2\leq 2a^2+2(b+c)^2\leq 2a^2+4b^2+4c^2\).

Using the two inequalities (9.5) and (9.6) in (9.4), we have

\[\begin{align} &\frac{1}{\rho_w^2}E\left[\Vert X(k+1)-\mathbf{1}\bar x(k+1)\Vert^2\,|\,X(k)\right]-\alpha_k^2n\sigma^2\\ &\leq \Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + \alpha_k^2 \Vert \nabla F(X(k))\Vert^2 + 2\alpha_k\Vert X(k)-\mathbf{1}\bar x(k)\Vert\cdot \Vert \nabla F(X(k))\Vert \end{align}\] For the last term, from (9.6), \[\begin{align} &2\alpha_k\Vert X(k)-\mathbf{1}\bar x(k)\Vert\cdot \Vert \nabla F(X(k))\Vert\\ &\leq2\alpha_k \Vert X(k)-\mathbf{1}\bar x(k)\Vert\left(L\|X(k)-\mathbf{1} \bar{x}(k)\|+\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)\\ &\leq 2\alpha_kL\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + \left[c\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2+\frac{\alpha_k^2}{c}\left(\sqrt{n} L\left\|\bar{x}(k)-x^{*}\right\|+\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|\right)^2\right]\\ &\leq (2\alpha_kL+c)\Vert X(k)-\mathbf{1}\bar x(k)\Vert^2 + \frac{\alpha_k^2}{c}\left(2 n L^{2}\left\|\bar{x}(k)-x^{*}\right\|^{2}+2\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}\right) \end{align}\]

for any \(c>0\); the middle step uses \(2ab\leq ca^2+\frac{b^2}{c}\), and the last inequality uses \((a+b)^2\leq 2(a^2+b^2)\). Let \(c=\frac{1-\rho_w^2}{2}\), the same choice as in (4.6) in chapter 4; then we have lemma 9.4.

Lemma 9.4 Under algorithm 7.1, \(\forall k\geq0\), \[\begin{align} V(k+1)&\leq \rho_{w}^{2}\left(\frac{3-\rho_{w}^{2}}{2}+2 \alpha_{k} \rho_{w}^{2} L+2 \alpha_{k}^{2} \rho_{w}^{2} L^{2}\right) V(k) + \\ & \rho_{w}^{2} \alpha_{k}^{2}\left[\frac{8 n L^{2}}{\left(1-\rho_{w}^{2}\right)} U(k)+\frac{8\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}}{\left(1-\rho_{w}^{2}\right)}+n \sigma^{2}\right] \end{align}\]
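Lemmas 9.3 and 9.4 form a coupled pair of scalar recursions in \((U(k),V(k))\). Iterating them numerically already suggests the rates \(U(k)=\mathcal O(1/k)\) and \(V(k)=\mathcal O(1/k^2)\) proved below; here is a minimal sketch (all constants, including the stand-in value for \(\Vert\nabla F(\mathbf{1}x^*)\Vert^2\) and the initial bounds, are arbitrary choices of mine):

```python
import numpy as np

mu, L, sigma, n, theta = 1.0, 4.0, 1.0, 20, 3.0
rho2 = 0.5                                   # rho_w^2
gradF_star_sq = 1.0                          # stand-in for ||grad F(1 x*)||^2
K = int(np.ceil(2 * theta * L**2 / mu**2))

U, V = 1.0, float(n)                         # arbitrary starting bounds
for k in range(1, 100_001):
    a = theta / (mu * (k + K))               # alpha_k <= min(1/L, 1/(3 mu)) here
    # lemma 9.3 and lemma 9.4, evaluated simultaneously with the old (U, V)
    U, V = ((1 - 1.5 * a * mu) * U + 3 * a * L**2 / (n * mu) * V + a**2 * sigma**2 / n,
            rho2 * ((3 - rho2) / 2 + 2 * a * rho2 * L + 2 * a**2 * rho2 * L**2) * V
            + rho2 * a**2 * (8 * n * L**2 / (1 - rho2) * U
                             + 8 * gradF_star_sq / (1 - rho2) + n * sigma**2))
print(f"k*U ~ {1e5 * U:.4f},  k^2*V ~ {1e10 * V:.4f}")
```

Both printed quantities stabilize as the horizon grows, consistent with (9.7) below.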

9.2 Asymptotic network independence of DSGD

In this section, we first show that for algorithm 7.1, \(U(k)=\mathcal O(\frac{1}{k})\) and \(V(k)=\mathcal O(\frac{1}{k^2})\), i.e. algorithm 7.1 enjoys a sublinear convergence rate. More specifically, we show that \(\exists N\) s.t. \(\forall k>N\), \[\begin{equation} U(k)\leq \frac{\hat W(1-\rho_w^2)}{\tilde k},\quad V(k)\leq \frac{\hat V(1-\rho_w^2,\hat W)}{\tilde k^2} \tag{9.7} \end{equation}\] where \(\tilde k\) is some shift of \(k\), and \(\hat W(\cdot)\), \(\hat V(\cdot)\) are functions. The goal is to show that asymptotically, \(\frac{1}{n}\sum\limits_{i=1}^nE\left[\Vert x_i(k)-x^*\Vert^2\right]=R'(k)=U(k)+\frac{1}{n}V(k)\) has the same convergence rate as \(R(k)\) for SGD. Notice that \(V(k)\) can be shown to decay faster than \(U(k)\), so if we can bound \(\hat W(1-\rho_w^2)\) by another quantity \(C\) that does not depend on \(\rho_w^2\), i.e. does not depend on the network, then \[\begin{equation} R'(k)\leq\frac{C}{\tilde k}+\frac{1}{n} \frac{\hat V(1-\rho_w^2,\hat W)}{\tilde k^2} \tag{9.8} \end{equation}\]

which is the asymptotic network independence property of DSGD. From (9.8), we can then obtain the transient time for DSGD to reach the asymptotic convergence rate.

We first give a uniform bound for \[\begin{align} E\left[\Vert X(k)-\mathbf{1}x^*\Vert^2\right]&=E\left[\sum_{i=1}^n\Vert x_i(k)-x^*\Vert^2\right]\\ &=\sum_{i=1}^n E\left[\Vert x_i(k)-x^*\Vert^2\right]\\ &=nR'(k) \end{align}\]
Lemma 9.5 For algorithm 7.1, \(\forall k\geq0\), \[ E\left[\left\|X(k)-1 x^{*}\right\|^{2}\right] \leq \hat{X}:=\max \left\{\left\|X(0)-1 x^{*}\right\|^{2}, \frac{9 \sum_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}}{\mu^{2}}+\frac{n \sigma^{2}}{L^{2}}\right\} \]

Remark.

  • Unrolling the recursion gives \(X(k)-\mathbf{1}x^*=W^k(X(0)-\mathbf{1}x^*)-\sum\limits_{j=0}^{k-1}\alpha_jW^{k-j}G(j)\); the first part of the \(\max\{\cdot\}\) comes from \(\Vert W^k(X(0)-\mathbf{1}x^*)\Vert\leq\Vert X(0)-\mathbf{1}x^*\Vert\).

  • The authors appear to derive the second part of the \(\max\{\cdot\}\) from \(E\left[\|X(k)\|^{2}\right] \leq \max \left\{\|X(0)\|^{2}, \sum_{i=1}^{n} R_{i}\right\}\).

9.2.1 Sublinear rate

Recall \(R'(k)=U(k)+\frac{1}{n}V(k)\), hence we introduce \(W(k)\) as \[\begin{equation} W(k):=U(k)+\omega(k)V(k),\quad \forall k\geq0 \tag{9.9} \end{equation}\]

where \(\omega(k)>0\) is to be determined later. \(W(k)\) in (9.9) is our Lyapunov function.

From lemma 9.5, we have the uniform bound \(R'(k)\leq \frac{\hat X}{n}\); we want the same property for \(W(k)\), i.e. \(W(k)\leq\frac{\hat X}{n}\). This is how we determine \(\omega(k)\). From lemmas 9.3 and 9.4, we further establish a recursion for \(W(k)\). Unrolling it by induction produces a term of the form \(\prod\limits_{t=a}^{k-1}(1-\frac{\gamma}{t})\), which lemma 9.6 lets us bound. Then \(U(k)\) is bounded via \(U(k)\leq W(k)\), and \(V(k)\) is bounded from lemma 9.4 together with the bound for \(U(k)\).

Lemma 9.6 \(\forall 1<a<k(a\in\mathbb{N})\) and \(1<\gamma<a/2\), \[ \frac{a^{2 \gamma}}{k^{2 \gamma}} \leq \prod_{t=a}^{k-1}\left(1-\frac{\gamma}{t}\right) \leq \frac{a^{\gamma}}{k^{\gamma}} \]
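Lemma 9.6 is easy to check numerically; a throwaway verification (the parameter triples are mine):

```python
import numpy as np

for a, k, gamma in [(8, 200, 1.5), (10, 1000, 2.0), (50, 10_000, 3.5)]:
    assert 1 < gamma < a / 2
    prod = np.prod(1 - gamma / np.arange(a, k))          # t = a, ..., k-1
    lower, upper = (a / k) ** (2 * gamma), (a / k) ** gamma
    assert lower <= prod <= upper
    print(f"a={a}, k={k}, gamma={gamma}: {lower:.3e} <= {prod:.3e} <= {upper:.3e}")
```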

Lemma 9.7 shows that algorithm 7.1 with the step-size policy \[ \alpha_k=\frac{\theta}{\mu(k+K)},\quad K:=\left\lceil\frac{2 \theta L^{2}}{\mu^{2}}\right\rceil \] enjoys a sublinear rate.

Lemma 9.7 (Sublinear rate of DSGD) For algorithm 7.1, let \[\begin{equation*} K_{1}:=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil \end{equation*}\] Then \(\forall k\geq K_1-K\), we have \[\begin{equation*} U(k)\leq \frac{\hat W}{\tilde k},\quad V(k)\leq \frac{\hat V}{\tilde k^2} \end{equation*}\]

where, \[ \hat{W}:=\frac{K_{1} \hat{X}}{n}+\frac{3}{(4 \theta-3)}\left(\frac{\sigma^{2} \theta^{2}}{n \mu^{2}}+\frac{\sigma^{2} \rho_{w}^{2} \theta^{2}}{2 \mu^{2}}\right)+\frac{12\left\|\nabla F\left(1 x^{*}\right)\right\|^{2} \rho_{w}^{2} \theta^{2}}{(4 \theta-3) n \mu^{2}\left(1-\rho_{w}^{2}\right)} \]

\[ \hat{V}:=\max \left\{K_{1}^{2} \hat{X}, \frac{8 \theta^{2} \rho_{w}^{2}}{\mu^{2}\left(1-\rho_{w}^{2}\right)}\left[\frac{4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}}{\left(1-\rho_{w}^{2}\right)}+n \sigma^{2}+\frac{4 n L^{2} \hat{W}}{\left(1-\rho_{w}^{2}\right) K_{1}}\right]\right\} \]

\[ \tilde k=k+K \]

Remark.
\(\omega(k)=\frac{12 \alpha_{k} L^{2}}{n \mu\left(1-\rho_{w}^{2}\right)}=\frac{f(\alpha_k)}{n(1-\rho_w^2)}\leq \frac{1}{n}\) so that \(W(k)\leq \frac{\hat X}{n}\). This requires \(f(\alpha_k)=\frac{12\alpha_kL^2}{\mu}\leq 1-\rho_w^2\), which is satisfied by the choice \(K_1=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\geq \frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\), i.e.,

\[\begin{align} f(\alpha_k)&=\frac{12\alpha_kL^2}{\mu}\stackrel{k\geq K_1-K}\leq \frac{12L^2}{\mu}\frac{\theta}{\mu K_1}\\ &\stackrel{K_1\geq \frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}}\leq \frac{1-\rho_w^2}{2} \end{align}\]
The shift \(\tilde k\) reflects the step size \(\alpha_k=\frac{\theta}{\mu(k+K)}\). Moreover, we introduce the following shifts of \(U(k)\), \(V(k)\), and \(W(k)\) for simplicity. \[\begin{equation} \tilde{U}(k):=U(k-K), \quad \tilde{V}(k):=V(k-K), \quad \tilde{W}(k):=W(k-K), \quad \forall k \geq K \tag{9.10} \end{equation}\]

The uniform bound on \(R'(k)\) (lemma 9.5) yields bounds for these quantities: \(\tilde U(k)\leq \frac{\hat X}{n}\), \(\tilde V(k)\leq \hat X\), and \(\tilde W(k)\leq\frac{\hat X}{n}\). This can also be seen from the definition (9.10), which merely shifts \(U(k)\), \(V(k)\), and \(W(k)\) horizontally.
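To get a feel for the constants in lemma 9.7, the following helper (a direct transcription of the formulas above; the example numbers are mine) evaluates \(K\), \(K_1\), \(\hat W\), and \(\hat V\):

```python
import numpy as np

def dsgd_constants(mu, L, sigma, theta, n, rho2, X_hat, gradF_star_sq):
    """Evaluate K, K_1, W_hat, V_hat from lemma 9.7; rho2 stands for rho_w^2."""
    K = int(np.ceil(2 * theta * L**2 / mu**2))
    K1 = int(np.ceil(24 * L**2 * theta / ((1 - rho2) * mu**2)))
    W_hat = (K1 * X_hat / n
             + 3 / (4 * theta - 3) * (sigma**2 * theta**2 / (n * mu**2)
                                      + sigma**2 * rho2 * theta**2 / (2 * mu**2))
             + 12 * gradF_star_sq * rho2 * theta**2
             / ((4 * theta - 3) * n * mu**2 * (1 - rho2)))
    V_hat = max(K1**2 * X_hat,
                8 * theta**2 * rho2 / (mu**2 * (1 - rho2))
                * (4 * gradF_star_sq / (1 - rho2) + n * sigma**2
                   + 4 * n * L**2 * W_hat / ((1 - rho2) * K1)))
    return K, K1, W_hat, V_hat

K, K1, W_hat, V_hat = dsgd_constants(mu=1.0, L=4.0, sigma=1.0, theta=3.0,
                                     n=20, rho2=0.5, X_hat=20.0, gradF_star_sq=20.0)
print(f"K={K}, K1={K1}, W_hat={W_hat:.2f}, V_hat={V_hat:.1f}")
```

Note how \(K_1\) and \(\hat V\) blow up as \(\rho_w^2\to1\): the network enters \(R'(k)\) only through these quantities, which multiply the faster-decaying \(1/\tilde k^2\) term below.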

9.2.2 Asymptotic network independence

In this section, we show the asymptotic network independence of algorithm 7.1 with the step-size policy \(\alpha_k=\frac{\theta}{\mu(k+K)}\). That is, although \(R'(k)=U(k)+\frac{1}{n}V(k)\) depends on the network through the term \(1-\rho_w^2\), we show that the network-dependent part decays faster. Then after finitely many iterations (\(\exists K_0\) s.t. for \(k\geq K_0\)), \(R'(k)\leq \frac{C}{\tilde k}\) for some constant \(C\).

From lemma 9.3, substitute \(\alpha_k=\frac{\theta}{\mu(k+K)}\) and let \(k'=k+K\); then \[\begin{equation} U(k'-K+1) \leq\left(1-\frac{3 \theta}{2 k'}\right) U(k'-K)+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{V(k'-K)}{k'}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k'^{2}}, \quad \forall k' \geq K_{1} \end{equation}\] From (9.10), this simplifies to \[\begin{equation} \tilde{U}(k+1) \leq\left(1-\frac{3 \theta}{2 k}\right) \tilde{U}(k)+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{\tilde{V}(k)}{k}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}}, \quad \forall k \geq K_{1} \tag{9.11} \end{equation}\] By induction and lemma 9.6, we bound \(\tilde U(k)\) so that the network-dependent term decays faster, i.e. theorem 9.1. Specifically, from lemma 9.7, \(\tilde V(k)=V(k-K)\leq \frac{\hat V}{k^2}\), so unrolling (9.11) gives \[\begin{align} \tilde U(k)\leq \frac{K_{1}^{1.5 \theta}}{k^{1.5 \theta}} \tilde{U}\left(K_{1}\right)+\sum_{t=K_{1}}^{k-1} \frac{(t+1)^{1.5 \theta}}{k^{1.5 \theta}}\left(\frac{\theta^{2} \sigma^{2}}{n \mu^{2} t^{2}}+\frac{3 \theta L^{2}}{n \mu^{2}} \frac{\hat V}{t^3}\right) \tag{9.12} \end{align}\]

Here lemma 9.6 applies with \(\gamma=\frac{3\theta}{2}\), since \(1<\frac{3\theta}{2}\leq \frac{K_1}{2}\) with \(K_1:=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\). To simplify (9.12), we claim that \[ \sum_{t=K_1}^{k-1}\frac{(t+1)^{1.5\theta}}{t^2}\leq \frac{k^{1.5 \theta-1}}{1.5 \theta-1}+\frac{3 k^{1.5 \theta-2}}{1.5 \theta-2}+3 k^{1.5 \theta-2} \] and \[ \sum_{t=K_1}^{k-1} \frac{(t+1)^{1.5 \theta}}{t^{3}} \leq \frac{2 k^{1.5 \theta-2}}{1.5 \theta-2} \] (compare the second sum with \(\int t^{1.5\theta-3}\,dt\); the factor \((1+1/t)^{1.5\theta}\) is absorbed into the constants).

Then, \[\begin{align} \tilde{U}(k) &\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2} k}+\frac{3 \theta^{2}(1.5 \theta-1) \sigma^{2}}{(1.5 \theta-2) n \mu^{2}} \frac{1}{k^{2}}+\frac{K_{1}^{1.5 \theta}}{k^{1.5 \theta}} \tilde{U}\left(K_{1}\right)+\frac{6 \theta L^{2} \hat{V}}{(1.5 \theta-2) n \mu^{2}} \frac{1}{k^{2}} \end{align}\] and hence, writing \(U(k)=\tilde U(\tilde k)\) with \(\tilde k=k+K\), \[\begin{align} U(k) &\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2} \tilde{k}}+\left[\frac{3 \theta^{2}(1.5 \theta-1) \sigma^{2}}{(1.5 \theta-2) n \mu^{2}}+\frac{6 \theta L^{2} \hat{V}}{(1.5 \theta-2) n \mu^{2}}\right] \frac{1}{\tilde{k}^{2}} \\ &\quad +\frac{K_1^{1.5\theta}\hat X}{n}\frac{1}{\tilde k^3},\quad\forall k\geq K_1-K \tag{9.13} \end{align}\]

Remark.

  • The last term in (9.13) uses \(\theta>2\) (so that \(1.5\theta>3\) and \(K_1^{1.5\theta}/\tilde k^{1.5\theta}\leq K_1^{1.5\theta}/\tilde k^3\)) together with \(\tilde U(K_1)\leq \frac{\hat X}{n}\), i.e. the uniform bound from lemma 9.5.

  • How exactly does (9.13) yield theorem 9.1, which drops the \(\frac{1}{\tilde k^3}\) term?
Theorem 9.1 Under algorithm 7.1 with \(\alpha_k=\frac{\theta}{\mu(k+K)}\), let \(\theta>2, K:=\left\lceil\frac{2 \theta L^{2}}{\mu^{2}}\right\rceil\), and \(K_{1}:=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\), we have \[ U(k) \leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2} \tilde{k}}+\left[\frac{3 \theta^{2}(1.5 \theta-1) \sigma^{2}}{(1.5 \theta-2) n \mu^{2}}+\frac{6 \theta L^{2} \hat{V}}{(1.5 \theta-2) n \mu^{2}}\right] \frac{1}{\tilde{k}^{2}}, \quad \forall k \geq K_{1}-K \]

where \(\hat V=\mathcal{O}(\frac{n}{(1-\rho_w^2)^2})\) under the conditions on \(\Vert X(0)-\mathbf{1}x^*\Vert^2\) and \(\Vert \nabla F(\mathbf{1}x^*)\Vert^2\) given in lemma 9.8.

Lemma 9.8 Suppose \(\sum\limits_{i=1}^{n}\left\|x_{i}(0)-x^{*}\right\|^{2}=\mathcal{O}(n) \text { and } \sum\limits_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}=\mathcal{O}(n)\), then \[ \hat V=\mathcal{O}(\frac{n}{(1-\rho_w^2)^2}) \]

This is because, given such conditions on \(\Vert X(0)-\mathbf{1}x^*\Vert^2\) and \(\Vert \nabla F(\mathbf{1}x^*)\Vert^2\), we have \(\hat X=\mathcal{O}(n)\), and from lemma 9.7, \(\hat W=\mathcal{O}(\frac{1}{1-\rho_w^2})\).

For \(\hat V\), noticing \(K_1=\left\lceil\frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}\right\rceil\leq \frac{24 L^{2} \theta}{\left(1-\rho_{w}^{2}\right) \mu^{2}}+1\), we have \[\begin{align} \hat V&=\max \left\{K_{1}^{2} \hat{X}, \frac{8 \theta^{2} \rho_{w}^{2}}{\mu^{2}\left(1-\rho_{w}^{2}\right)}\left[\frac{4\left\|\nabla F\left(\mathbf{1} x^{*}\right)\right\|^{2}}{\left(1-\rho_{w}^{2}\right)}+n \sigma^{2}+\frac{4 n L^{2} \hat{W}}{\left(1-\rho_{w}^{2}\right) K_{1}}\right]\right\}\\ &=\max\left\{C_1\frac{n}{(1-\rho_w^2)^2},\ C_2\frac{n\rho_w^2}{(1-\rho_w^2)^2}+C_3\frac{n\rho_w^2}{1-\rho_w^2}+C_4\frac{n\rho_w^2}{(1-\rho_w^2)^2}\right\}\\ &=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right) \end{align}\] Recall \(R'(k)=U(k)+\frac{1}{n}V(k)\); from theorem 9.1, lemma 9.7, and lemma 9.8, we obtain \[\begin{equation} R'(k)\leq \frac{\theta^{2} \sigma^{2}}{(1.5 \theta-1) n \mu^{2}} \frac{1}{\tilde{k}}+\mathcal{O}\left(\frac{1}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{\tilde{k}^{2}} \tag{9.14} \end{equation}\]

Next we improve the bound in theorem 9.1.

9.2.3 Improved Bound

In the derivation of theorem 9.1, we started from lemma 9.3, which is based on lemma 9.1; passing from lemma 9.1 to lemma 9.3, Cauchy–Schwarz-type splittings are used twice. Now we start directly from lemma 9.1 and do not introduce an arbitrary \(c>0\) (see (9.2)). Moreover, we consider the situation where \(\alpha_k=\frac{\theta}{\mu(k+K)}\) and \(k\geq K_1-K\). Then we have \[\begin{equation} \tilde{U}(k+1) \leq\left(1-\frac{2 \theta}{k}\right) \tilde{U}(k)+\frac{\theta^{2} \tilde{U}(k)}{k^{2}}+\frac{2 \theta L}{\sqrt{n} \mu} \frac{\sqrt{\tilde{U}(k) \tilde{V}(k)}}{k}+\frac{\theta^{2} L^{2}}{n \mu^{2}} \frac{\tilde{V}(k)}{k^{2}}+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}} \end{equation}\]

We expanded \(\left(1-\frac{\theta}{k}\right)^2\) to make the powers of \(\frac{1}{k}\) explicit. Then by induction, lemma 9.6, lemma 9.7, and similar bounds for the sums \(\sum\limits_{t=K_1}^{k-1}\frac{(t+1)^b}{t^a}\), we obtain an improved bound on \(R'(k)=\frac{1}{n} \sum\limits_{i=1}^{n} E\left[\left\|x_{i}(k)-x^{*}\right\|^{2}\right]\).

Theorem 9.2 For algorithm 7.1 with \(\alpha_k=\frac{\theta}{\mu(k+K)}\), suppose \(\theta>2\), \(\sum\limits_{i=1}^{n}\left\|x_{i}(0)-x^{*}\right\|^{2}=\mathcal{O}(n) \text { and } \sum\limits_{i=1}^{n}\left\|\nabla f_{i}\left(x^{*}\right)\right\|^{2}=\mathcal{O}(n)\), then for \(k\geq K_1-K\), \[\begin{align} R'(k)\leq \frac{\theta^{2} \sigma^{2}}{(2 \theta-1) n \mu^{2} \tilde{k}}+\mathcal{O}\left(\frac{1}{\sqrt{n}\left(1-\rho_{w}\right)}\right) \frac{1}{\tilde{k}^{1.5}}+\mathcal{O}\left(\frac{1}{\left(1-\rho_{w}\right)^{2}}\right) \frac{1}{\tilde{k}^{2}} \end{align}\]

9.3 Transient time

First we derive the convergence rate of the centralized algorithm 7.2. Similarly, we use lemma 9.2 and assumption 4.1; then \[\begin{align} &E\left[\Vert x(k+1)-x^*\Vert^2\,|\,x(k) \right]\\ &\quad=E\left[\Vert x(k)-\alpha_k\nabla f(x(k))-x^*-\alpha_k (\tilde g(k)-\nabla f(x(k)))\Vert^2\,|\,x(k) \right]\\ &\quad\leq (1-\alpha_k\mu)^2\Vert x(k)-x^*\Vert^2+\frac{\alpha_k^2}{n^2}\sum_{i=1}^nE\left[\Vert \nabla f_i(x(k))-g_i(k)\Vert^2\,|\,x(k)\right]\\ &\quad\leq (1-\alpha_k\mu)^2\Vert x(k)-x^*\Vert^2+\frac{\alpha_k^2\sigma^2}{n} \tag{9.15} \end{align}\] Taking full expectation on both sides of (9.15) proves lemma 7.1 in chapter 7. Substituting \(\alpha_k=\frac{\theta}{\mu k}\) in (9.15) and taking full expectation on both sides, we have \[\begin{equation} R(k+1) \leq \left(1-\frac{2 \theta}{k}\right)R(k)+\frac{\theta^{2}}{k^{2}}R(k)+\frac{\theta^{2} \sigma^{2}}{n \mu^{2}} \frac{1}{k^{2}} \end{equation}\]

First we show that \(\exists c_3=\mathcal{O}(\frac{1}{n})\) s.t. \(E\left[\Vert x(k)-x^*\Vert^2\right]\leq \frac{c_3}{k}\), \(\forall k\geq K_2:=\left\lceil\frac{\theta L}{\mu}\right\rceil\). Then by induction, lemma 9.6, and bounds for the sums, we obtain the convergence rate of algorithm 7.2.

Theorem 9.3 Under algorithm 7.2, suppose \(k\geq K_2\), we have \[ E\left[\left\|x(k)-x^{*}\right\|^{2}\right] \leq \frac{\theta^{2} \sigma^{2}}{(2 \theta-1) n \mu^{2} k}+\mathcal{O}\left(\frac{1}{n}\right) \frac{1}{k^{2}} \]
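The leading term of theorem 9.3 can be observed in simulation. A minimal sketch (diagonal quadratic \(f\), \(\alpha_k=\frac{\theta}{\mu k}\) started at \(k=K_2\); every modeling choice is mine): the measured \(kR(k)\) should sit below the bound, since the bound is worst-case in \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(2)
p, mu, L, sigma, n, theta = 4, 1.0, 4.0, 2.0, 10, 3.0
h = np.linspace(mu, L, p)              # f(x) = 0.5 * sum_j h_j x_j^2, so x* = 0
K2 = int(np.ceil(theta * L / mu))
trials, k_max = 1000, 20_000

x = rng.standard_normal((trials, p))   # one row per Monte Carlo run
for k in range(K2, k_max + 1):
    # averaged stochastic gradient: E||noise||^2 = sigma^2 / n (n samples averaged)
    noise = rng.normal(0.0, sigma / np.sqrt(n * p), size=(trials, p))
    x = x - theta / (mu * k) * (h * x + noise)
R = np.mean(np.sum(x**2, axis=1))      # Monte Carlo estimate of R(k_max)
bound = theta**2 * sigma**2 / ((2 * theta - 1) * n * mu**2 * k_max)
print(f"k*R(k) = {k_max * R:.4f}  vs  theorem 9.3 leading term = {k_max * bound:.4f}")
```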

Comparing theorem 9.3 with theorem 9.2 (or formula (9.14)), we derive the transient time \(K_T=\mathcal{O}(\frac{n}{(1-\rho_w^2)^2})\).

Corollary 9.1 (Transient Time) It takes \(K_T=\mathcal{O}(\frac{n}{(1-\rho_w^2)^2})\) iterations for algorithm 7.1 with \(\alpha_k=\frac{\theta}{\mu(k+K)}\) to reach the asymptotic rate of convergence of algorithm 7.2, i.e. when \(k\geq K_T\), we have \(\frac{1}{n} \sum_{i=1}^{n} E\left[\left\|x_{i}(k)-x^{*}\right\|^{2}\right] \leq \frac{\theta^{2} \sigma^{2}}{(2 \theta-1) n \mu^{2} k}\, \mathcal{O}(1)\).

Remark.
From formula (9.14), we have \[ R'(k)\leq \frac{\theta^{2} \sigma^{2}}{ (2\theta-1)n \mu^{2}\tilde k}\left[\frac{2\theta-1}{1.5\theta-1}+\mathcal{O}\left(\frac{n}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{\tilde{k}}\right] \] Since \(\theta>2\) is a constant, requiring \[ \mathcal{O}\left(\frac{n}{\left(1-\rho_{w}^2\right)^{2}}\right) \frac{1}{K_T}=\mathcal{O}(1) \] gives \[ K_T=\mathcal{O}\left(\frac{n}{(1-\rho_w^2)^2}\right) \]

The same argument can also start from theorem 9.2.
Remark.
\[\begin{align} \frac{1}{(1-\rho_w)^2} / \frac{1}{(1-\rho_w^2)^2}&=\left(\frac{1-\rho_w^2}{1-\rho_w}\right)^2\\ &=(1+\rho_w)^2 \end{align}\] where \(\rho_w<1\). Hence \(\mathcal{O}(\frac{1}{(1-\rho_w^2)^2})\) and \(\mathcal{O}(\frac{1}{(1-\rho_w)^2})\) have the same order.
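Finally, the asymptotic network independence can be seen end-to-end in a small experiment. The sketch below (ring vs. complete graph, shared diagonal quadratic objectives, \(\alpha_k=\frac{\theta}{\mu(k+K)}\); all modeling choices are mine and only illustrate corollary 9.1) runs DSGD on two networks with very different \(\rho_w\) and prints \(\tilde k\,R'(k)\): past the transient, both values roughly agree and sit below the centralized constant \(\frac{\theta^2\sigma^2}{(2\theta-1)n\mu^2}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, mu, L, sigma, theta = 10, 3, 1.0, 4.0, 2.0, 3.0
K = int(np.ceil(2 * theta * L**2 / mu**2))

# local objectives f_i(x) = 0.5 (x - c_i)^T diag(h) (x - c_i); then x* = mean of c_i
h = np.linspace(mu, L, p)
c = rng.standard_normal((n, p))
x_star = c.mean(axis=0)

def ring(n):
    """Doubly stochastic weights on a ring (self, left, right = 1/3 each)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    return W

def run_dsgd(W, k_max, trials=200):
    X = np.broadcast_to(rng.standard_normal((n, p)), (trials, n, p)).copy()
    for k in range(1, k_max + 1):
        alpha = theta / (mu * (k + K))
        G = h * (X - c) + rng.normal(0.0, sigma / np.sqrt(p), size=X.shape)
        X = np.einsum('ij,tjq->tiq', W, X - alpha * G)   # X(k+1) = W (X(k) - a_k G(k))
    return np.mean(np.sum((X - x_star) ** 2, axis=2))    # estimate of R'(k_max)

k_max = 20_000
for name, W in [("ring", ring(n)), ("complete", np.full((n, n), 1 / n))]:
    rho_w = np.linalg.norm(W - np.full((n, n), 1 / n), 2)
    Rp = run_dsgd(W, k_max)
    print(f"{name:8s} rho_w={rho_w:.3f}:  (k+K)*R'(k) = {(k_max + K) * Rp:.4f}")
print("centralized constant:", theta**2 * sigma**2 / ((2 * theta - 1) * n * mu**2))
```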

9.4 Sharpness

From Wikipedia: a bound is sharp (sometimes called optimal) if it cannot be made more restrictive without failing in some cases. Here, sharpness means the transient time for DSGD to reach the asymptotic convergence rate is also lower bounded by a quantity of the same order on suitably chosen instances, i.e. \(K_T=\Omega\left(\frac{n}{(1-\rho_w^2)^2}\right)\), so the estimate in corollary 9.1 cannot be improved.

9.5 Summary

Recall in chapter 7, we have (7.1), i.e. \[\begin{equation} \text { Time }_{n, \varepsilon} \text { (decentralized) } \leq p(\mathcal{G}) \text { Time }_{n, \varepsilon} \text { (centralized) } \end{equation}\]

Corollary 9.1 indicates that for DSGD with \(\alpha_k=\frac{\theta}{\mu(k+K)}\), \(\theta>2\), we have \(p(\mathcal{G})=\mathcal{O}(1)\) once \(k\geq K_T\); this is what we call asymptotic network independence.

References

Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”


  1. In chapter 7, \(\bar g(k)=\frac{1}{n}\sum\limits_{i=1}^n g_i(x_i(k),\xi_i(k))\)