Chapter 11 Exact diffusion

Consider the following problem: \[\begin{equation} x^*=\underset{x\in\mathbb{R}^p}{\arg\min}\sum_{i=1}^nf_i(x),\quad\text{where } f_i(x)=E_{\xi_i}Q(x,\xi_i) \tag{11.1} \end{equation}\]

The noisy gradient \(\nabla Q(x_{i,k},\xi_{i,k})\) is more concrete than the abstract \(g(x_{i,k},\xi_{i,k})\) used in previous chapters, such as chapter 4. In the following, we take \(g(x_{i,k},\xi_{i,k})=\nabla Q(x_{i,k},\xi_{i,k})\).

We focus on the class of diffusion strategies here. Exact diffusion and the traditional diffusion strategy (taking the adapt-then-combine (ATC) formulation as an example) are summarized in the following table, where \(X_k\) stacks the local iterates \(x_{i,k}\) and \(\nabla G(X_k,\boldsymbol\xi_k)\) stacks the noisy local gradients \(g(x_{i,k},\xi_{i,k})\),

Name Scheme
adapt-then-combine (ATC) diffusion \[X_{k+\frac{1}{2}}=X_{k}-\alpha\nabla G(X_k,\boldsymbol\xi_k)\ (\text{adaptation})\\X_{k+1}=WX_{k+\frac{1}{2}}\ (\text{combination})\]
exact diffusion (stochastic version) \[X_{k+\frac{1}{3}}=X_{k}-\alpha\nabla G(X_k,\boldsymbol\xi_k)\ (\text{adaptation})\\X_{k+\frac{2}{3}}=X_{k+\frac{1}{3}}+X_k-X_{k-1+\frac{1}{3}}\ (\text{correction})\\X_{k+1}=WX_{k+\frac{2}{3}}\ (\text{combination})\]
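To make the two schemes concrete, the following minimal NumPy sketch runs both updates side by side on a toy problem. The quadratic local losses \(f_i(x)=\frac{1}{2}\Vert A_ix-b_i\Vert^2\), the ring network, the noise level, and all constants are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                                # n agents, dimension p (illustrative)

# Hypothetical local losses f_i(x) = 0.5 * ||A_i x - b_i||^2
A = rng.normal(size=(n, 10, p))
b = rng.normal(size=(n, 10))

def noisy_grad(i, x, noise=0.01):
    """g(x_{i,k}, xi_{i,k}): exact local gradient plus small additive noise."""
    return A[i].T @ (A[i] @ x - b[i]) + noise * rng.normal(size=p)

# Symmetric doubly-stochastic mixing matrix W on a ring (Metropolis weights)
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

alpha, T = 0.02, 3000
grads = lambda X: np.stack([noisy_grad(i, X[i]) for i in range(n)])

# Adapt-then-combine (ATC) diffusion: adaptation, then combination
X_atc = np.zeros((n, p))
for _ in range(T):
    X_atc = W @ (X_atc - alpha * grads(X_atc))

# Exact diffusion: adaptation, correction, combination
X_ed = np.zeros((n, p))
psi_prev = X_ed.copy()                     # psi_0 := x_0
for _ in range(T):
    psi = X_ed - alpha * grads(X_ed)       # adaptation
    phi = psi + X_ed - psi_prev            # correction
    X_ed, psi_prev = W @ phi, psi          # combination

# Distance to the global minimizer x* of sum_i f_i
x_star = np.linalg.lstsq(A.reshape(-1, p), b.reshape(-1), rcond=None)[0]
print(np.linalg.norm(X_atc - x_star), np.linalg.norm(X_ed - x_star))
```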
The performance of bias-correction methods under stochastic and adaptive settings was previously unclear. Yuan et al. (2019) show that the correction step in exact diffusion leads to better steady-state performance in stochastic scenarios. Exact diffusion assumes each \(f_i\) is differentiable and \(\mu\)-strongly convex. For sufficiently small step sizes \(\alpha\), exact diffusion converges exponentially fast, at the rate \(1-\mathcal{O}(\alpha\mu)\), to a neighborhood of \(x^*\), which can be characterized as \[\begin{equation} \underset{k\to \infty}{\lim\sup}\frac{1}{n}E\left[\Vert X_k-\mathbf{1}x^*\Vert^2\right]_{ed} =\mathcal{O}\left(\frac{\alpha\bar\sigma^2}{n\mu}+\frac{\delta^2}{n\mu^2}\cdot\frac{\alpha^2\bar\sigma^2}{1-\lambda}\right) \tag{11.2} \end{equation}\] In comparison, the traditional diffusion strategy converges at a similar rate to the following neighborhood of \(x^*\): \[\begin{equation} \underset{k\to \infty}{\lim\sup}\frac{1}{n}E\left[\Vert X_k-\mathbf{1}x^*\Vert^2\right]_{d} =\mathcal{O}\left(\frac{\alpha\bar\sigma^2}{n\mu}+\frac{\delta^2}{n\mu^2}\cdot\left(\frac{\alpha^2\bar\sigma^2}{1-\lambda}+ \frac{\alpha^2b^2}{(1-\lambda)^2}\right)\right) \tag{11.3} \end{equation}\]

where \(\bar\sigma^2\) measures the size of the gradient noise, \(\delta\) bounds the Hessian matrices of the \(f_i\), \(b^2=\sum\limits_{i=1}^n\Vert \nabla f_i(x^*)\Vert^2\) quantifies how much the local gradients disagree at \(x^*\), and \(\lambda=\max\{|\lambda_2(W)|,|\lambda_n(W)|\}\in(0,1)\), so that \(1-\lambda\) is the spectral gap of the network.
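For intuition about the network quantities just defined, \(\lambda\) and the spectral gap \(1-\lambda\) can be computed directly from \(W\); the ring network below is an assumed example:

```python
import numpy as np

n = 5
W = 0.5 * np.eye(n)                        # Metropolis weights on a ring (assumed example)
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25

eigs = np.sort(np.linalg.eigvalsh(W))      # ascending; eigs[-1] = 1 since W is doubly stochastic
lam = max(abs(eigs[-2]), abs(eigs[0]))     # lambda = max{|lambda_2(W)|, |lambda_n(W)|}
print(f"lambda = {lam:.4f}, spectral gap 1 - lambda = {1 - lam:.4f}")
```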

Assumption 11.1 Each \(f_i\) is \(\mu\)-strongly convex and twice differentiable, and \[ \mu I_p\leq\nabla^2 f_i(x)\leq\delta I_p,\quad\forall x\in\mathbb{R}^p \]

This is a stronger smoothness requirement on \(f_i\) than merely assuming \(\nabla f_i\) is \(L\)-Lipschitz continuous.

Assumption 11.2 The network is undirected and strongly connected, and the mixing matrix \(W\) is symmetric and doubly stochastic.

For the noisy gradient, we assume something different from assumption 4.1, where we assumed the variance of the noisy gradient is bounded by a uniform \(\sigma^2\).

Assumption 11.3 For each agent \(i\) and iteration \(k\), we have \[\begin{align} E\left[g(x_{i,k},\xi_{i,k})|\mathcal{F}_{k}\right]&=\nabla f_i(x_{i,k})\\ E\left[\Vert g(x_{i,k},\xi_{i,k})-\nabla f_i(x_{i,k})\Vert^2|\mathcal{F}_{k}\right]&\leq \beta_i\Vert x_{i,k}-x^*\Vert^2 + \sigma^2_i \end{align}\] where \(\bar\sigma^2=\frac{1}{n}\sum_{i=1}^n\sigma_i^2\).

Comparing the two characterizations (11.2) and (11.3), we see that exact diffusion outperforms the traditional diffusion strategy in the following ways:

  • Exact diffusion removes the bias term proportional to \(\frac{\alpha^2b^2}{(1-\lambda)^2}\).

  • When the exact gradient is available (the deterministic setting), i.e. \(\sigma_i^2=0,i=1,2,...,n\), exact diffusion converges exactly to \(x^*\).

  • When the bias term \(\frac{\alpha^2b^2}{(1-\lambda)^2}\) is significant, i.e. \(b^2\) is large or the network is badly connected (\(\lambda\) close to 1), and the step size is not sufficiently small (that is, \(\alpha\leq d_3(1-\lambda)^{2+y}, y>0\), fails to hold), exact diffusion outperforms the traditional strategy; a numerical illustration follows this list.
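To see when the extra bias term actually matters, one can plug numbers into (11.2) and (11.3) while pretending the hidden \(\mathcal{O}(\cdot)\) constants are all 1; the parameter values below are assumptions chosen purely for illustration.

```python
import numpy as np

# Treat the O(.) expressions in (11.2)/(11.3) as exact with unit constants (illustration only)
def ed_bound(alpha, n, mu, delta, sig2, lam):            # exact diffusion, (11.2)
    return alpha * sig2 / (n * mu) + (delta**2 / (n * mu**2)) * alpha**2 * sig2 / (1 - lam)

def d_bound(alpha, n, mu, delta, sig2, lam, b2):         # traditional diffusion, (11.3)
    bias = (delta**2 / (n * mu**2)) * alpha**2 * b2 / (1 - lam)**2
    return ed_bound(alpha, n, mu, delta, sig2, lam) + bias

n, mu, delta, sig2, b2 = 20, 1.0, 10.0, 1.0, 100.0       # assumed problem parameters
for lam in (0.5, 0.99):                                  # well- vs badly-connected network
    for alpha in (1e-2, 1e-5):
        ratio = d_bound(alpha, n, mu, delta, sig2, lam, b2) / ed_bound(alpha, n, mu, delta, sig2, lam)
        print(f"lam = {lam:4}, alpha = {alpha:.0e}: diffusion / exact-diffusion = {ratio:10.1f}")
```

The ratio grows as \(\lambda\to 1\), reflecting that the removed bias term dominates on badly-connected networks unless \(\alpha\) is extremely small.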

11.1 Decreasing stepsize

Here we exploit the Hessian matrices of \(f_i,i=1,2,...,n\). Letting \(\tilde x_{i,k}=x^*-x_{i,k}\), the mean value theorem gives \[\begin{equation} \nabla f_i(x_{i,k})-\nabla f_i(x^*)=-H_{i,k}\tilde x_{i,k} \end{equation}\] where \[\begin{equation} H_{i,k}=\int_{0}^1\nabla^2 f_i(x^*-r\tilde x_{i,k})dr\in\mathbb{R}^{p\times p} \end{equation}\] We can then write \[\begin{equation} \nabla F(X_k)-\nabla F(\mathbf{1}x^*):= -\left( \begin{array}{c} (H_{1,k}\tilde x_{1,k})^T\\ \vdots\\ (H_{n,k}\tilde x_{n,k})^T \end{array}\right) \in\mathbb{R}^{n\times p} \end{equation}\] However, this notation does not take the form \(\mathcal{H}_k\tilde X_k\), which would separate the Hessian matrices from the iterates. To gain that advantage, we stack the local variables of the different agents into one long vector, i.e., \[\begin{equation} \mathcal{X}_k = \left( \begin{array}{c} x_{1,k}\\ x_{2,k}\\ \vdots\\ x_{n,k} \end{array}\right):=\mathrm{col}\{x_{1,k},...,x_{n,k}\}\in\mathbb{R}^{np} \end{equation}\] Similarly, we denote \(\mathcal{G}_k:=\mathrm{col}\{g_1(x_{1,k},\xi_{1,k}),...,g_n(x_{n,k},\xi_{n,k})\}\). Then exact diffusion with decreasing stepsize \(\alpha_k\) can be rewritten globally as \[\begin{align} \mathcal{X}_{k+1} &= \overline{\mathcal{W}} (2 \mathcal{X}_k-\mathcal{X}_{k-1}-\alpha_k\mathcal{G}_k+\alpha_{k-1}\mathcal{G}_{k-1})\\ \mathcal{X}_1&=\overline{\mathcal{W}} (\mathcal{X}_0-\alpha_0\mathcal{G}_0),\quad\mathcal{X}_0=\mathbf{0} \tag{11.4} \end{align}\] where \(\overline{\mathcal{W}}=(\mathcal{W}+I_{np})/2\) and \(\mathcal{W}=W\otimes I_{p}\in\mathbb{R}^{np\times np}\). Essentially, exact diffusion is a primal-dual method. Since \(W\) is symmetric and doubly stochastic, so is \(\overline W\); thus \(I_n-\overline W\) is symmetric positive semi-definite (p.s.d.) and admits the spectral decomposition \(I_n-\overline W = U\Sigma U^T:=V^2\). Let \(\mathcal{V}=V\otimes I_p\), so that \(\mathcal{V}^2=I_{np}-\overline{\mathcal{W}}\); we then obtain the primal-dual version of exact diffusion (11.5). \[\begin{equation} \begin{cases} \mathcal{X}_{k+1}=\overline{\mathcal{W}}\left(\mathcal{X}_k-\alpha_k\mathcal{G}_k\right)-\mathcal{V}\mathcal{Y}_k\\ \mathcal{Y}_{k+1}=\mathcal{Y}_k+\mathcal{V}\mathcal{X}_{k+1} \end{cases} \tag{11.5} \end{equation}\] This is because \[\begin{align} \mathcal{X}_{k+1}-\mathcal{X}_{k}&=\overline{\mathcal{W}}\left(\mathcal{X}_{k}-\mathcal{X}_{k-1}-\alpha_k\mathcal{G}_k +\alpha_{k-1}\mathcal{G}_{k-1}\right)-\mathcal{V}\mathcal{Y}_k+\mathcal{V}\mathcal{Y}_{k-1}\\ &=\overline{\mathcal{W}}\left(\mathcal{X}_{k}-\mathcal{X}_{k-1}-\alpha_k\mathcal{G}_k +\alpha_{k-1}\mathcal{G}_{k-1}\right)-\mathcal{V}(\mathcal{Y}_{k-1}+\mathcal{V}\mathcal{X}_k)+\mathcal{V}\mathcal{Y}_{k-1}\\ &=\overline{\mathcal{W}} (2 \mathcal{X}_k-\mathcal{X}_{k-1}-\alpha_k\mathcal{G}_k+\alpha_{k-1}\mathcal{G}_{k-1})-\mathcal{X}_{k} \end{align}\] where the last step uses \(\mathcal{V}^2=I_{np}-\overline{\mathcal{W}}\).

We also have \(\mathcal{Y}_k=\mathcal{Y}_0+\mathcal{V}\sum\limits_{i=1}^k \mathcal{X}_i\).
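The equivalence between (11.4) and (11.5) is easy to check numerically. The sketch below uses a fixed stepsize and a deterministic linear stand-in for \(\mathcal{G}_k\); both are simplifying assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 3
d = n * p

# Ring mixing matrix, lifted: Wbar = ((W + I)/2) kron I_p
W = 0.5 * np.eye(n)
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25
Wb = np.kron((W + np.eye(n)) / 2, np.eye(p))

# V from the spectral decomposition I - Wbar = U Sigma U^T =: V^2
sig, U = np.linalg.eigh(np.eye(d) - Wb)
V = U @ np.diag(np.sqrt(np.clip(sig, 0.0, None))) @ U.T

# Deterministic stand-in for the stacked gradients: G(X) = H X - c
H = np.diag(rng.uniform(1.0, 2.0, d))
c = rng.normal(size=d)
G = lambda X: H @ X - c
alpha = 0.05

# Common first step: X_1 = Wbar(X_0 - alpha*G_0), with X_0 = 0 and Y_0 = 0
X_prev = np.zeros(d)
X = Wb @ (X_prev - alpha * G(X_prev))      # iterate of form (11.4)
Xpd, Y = X.copy(), V @ X                   # iterate of form (11.5); Y_1 = Y_0 + V X_1
for _ in range(50):
    X_prev, X = X, Wb @ (2 * X - X_prev - alpha * G(X) + alpha * G(X_prev))
    Xpd = Wb @ (Xpd - alpha * G(Xpd)) - V @ Y
    Y = Y + V @ Xpd
print(np.max(np.abs(X - Xpd)))             # agrees to machine precision
```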

Let \(\nabla\mathcal{J}(\mathcal{X}_k):=\mathrm{col}\{\nabla f_1(x_{1,k}),...,\nabla f_n(x_{n,k})\}\). Lemma 11.1 then states the optimality condition for problem (11.1).

Lemma 11.1 Under assumption 11.1, if there exist vectors \((\mathcal{X}^*,\mathcal{Y}^*)\) satisfying \[\begin{align} \alpha_k\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)+\mathcal{V}\mathcal{Y}^*&=\mathbf{0}\\ \mathcal{V}\mathcal{X}^*&=\mathbf{0} \end{align}\] where \(\mathcal{X}^*=\mathrm{col}\{x_1^*,...,x_n^*\}\), then it holds that \[ x_1^*=x_2^*=\cdots=x_n^*=x^* \] where \(x^*\) is the unique solution to problem (11.1).

We follow the proof in Yuan et al. (2018), where the authors treat the more general situation with a different stepsize for each agent and an objective that is a weighted average of the \(f_i\).

First, since \(\mathrm{null}(\mathcal{V})=\mathrm{span}\{\mathbf{1}_n\otimes I_p\}\), \[\begin{align} \mathbf{0}=\mathcal{V}\mathcal{X}^*\Longleftrightarrow \mathcal{X}^*\in\mathrm{null}(\mathcal{V}) \end{align}\]

so, \(x_1^*=x_2^*=\cdots=x_n^*\).

Additionally, let \(\mathcal{I}=\mathbf{1}_n\otimes I_p\) and multiply both sides of \(\alpha_k\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)+\mathcal{V}\mathcal{Y}^*=\mathbf{0}\) by \(\mathcal{I}^T\). Since \(\mathcal{V}\) is symmetric and \(\mathcal{V}\mathcal{I}=\mathbf{0}\), we have \[\begin{align} \alpha_k\mathcal{I}^T\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)&=\alpha_k(\mathbf{1}_n\otimes I_p)^T(\overline{W}\otimes I_p)\nabla \mathcal{J}(\mathcal{X}^*)\\ &=\alpha_k(\mathbf{1}_n^T\otimes I_p)(\overline{W}\otimes I_p)\nabla \mathcal{J}(\mathcal{X}^*)\\ &=\alpha_k(\mathbf{1}_n^T\overline{W})\otimes (I_pI_p)\nabla \mathcal{J}(\mathcal{X}^*)\\ &=\alpha_k(\mathbf{1}_n^T\otimes I_p)\nabla \mathcal{J}(\mathcal{X}^*)\\ &=\alpha_k\sum_{i=1}^n\nabla f_i(x_i^*)\\ &=0 \end{align}\] where we use the column stochasticity of \(\overline W\), i.e. \(\mathbf{1}_n^T\overline{W}=\mathbf{1}_n^T\). Since \(x_1^*=\cdots=x_n^*\), the condition \(\sum\limits_{i=1}^n\nabla f_i(x_i^*)=0\) implies that this common value must coincide with the minimizer \(x^*\) of problem (11.1), by strong convexity of \(\sum_i f_i\). We can also see from the above proof that changing the stepsize from a fixed \(\alpha\) to a decreasing \(\alpha_k\) does not change much (though one may ask whether requiring the conditions of lemma 11.1 to hold for every \(k\) is too strong; see the remark below). From lemma 11.1, we have the following error dynamics: \[\begin{equation} \begin{cases} \tilde{\mathcal{X}}_{k+1}=\overline{\mathcal{W}}\left[\tilde{\mathcal{X}}_{k}+\alpha_k(\nabla \mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*))\right]-\mathcal{V}\tilde{\mathcal{Y}}_k+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k)\\ \tilde{\mathcal{Y}}_{k+1}=\tilde{\mathcal{Y}}_{k}+\mathcal{V}\tilde{\mathcal{X}}_{k+1} \end{cases} \tag{11.6} \end{equation}\]

where \(\mathcal{S}_k(\mathcal{X}_k)=\mathcal{G}_k-\nabla \mathcal{J}(\mathcal{X}_k)\) is the gradient noise, \(\tilde{\mathcal{X}}_k:=\mathcal{X}^*-\mathcal{X}_k\), and \(\tilde{\mathcal{Y}}_k:=\mathcal{Y}^*-\mathcal{Y}_k\).

Proof.
From \(\mathcal{V}\mathcal{X}^*=0\), we have \(0=\mathcal{V}^2\mathcal{X}^*=(I_{np}-\overline{\mathcal{W}})\mathcal{X}^*\), hence \(\mathcal{X}^*=\overline{\mathcal{W}}\mathcal{X}^*\), then \[\begin{align} \tilde{\mathcal{X}}_{k+1}&=\mathcal{X}^*-\mathcal{X}_{k+1}\\ &=\overline{\mathcal{W}}\mathcal{X}^*-\overline{\mathcal{W}}\left(\mathcal{X}_k-\alpha_k\mathcal{G}_k\right)+\mathcal{V}\mathcal{Y}_k\\ &=\overline{\mathcal{W}}\left[\tilde{\mathcal{X}}_{k}+\alpha_k(\nabla\mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*))\right]+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k)+\alpha_k\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)+\mathcal{V}\mathcal{Y}_k\\ &=\overline{\mathcal{W}}\left[\tilde{\mathcal{X}}_{k}+\alpha_k(\nabla\mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*))\right]-\mathcal{V}\tilde{\mathcal{Y}_k}+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k) \end{align}\]
Remark.
In the above proof, we use the condition \(\alpha_k\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)+\mathcal{V}\mathcal{Y}^*=\mathbf{0}\) in lemma 11.1 so that \(\alpha_k\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)=-\mathcal{V}\mathcal{Y}^*\). On the other hand, if requiring the conditions in lemma 11.1 to hold for every \(k\in\mathbb{N}\) is too strong, we may instead require \[\begin{align} \overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*)+\mathcal{V}\mathcal{Y}^*&=\mathbf{0}\\ \mathcal{V}\mathcal{X}^*&=\mathbf{0} \end{align}\] which also leads to \(\sum\limits_{i=1}^n\nabla f_i(x_i^*)=0\). The error dynamics then become \[\begin{align} \tilde{\mathcal{X}}_{k+1}&=\overline{\mathcal{W}}\left[\tilde{\mathcal{X}}_{k}+\alpha_k(\nabla \mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*))\right]-\mathcal{V}\tilde{\mathcal{Y}}_k+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k)+(1-\alpha_k)\mathcal{V}\mathcal{Y}^*\\ &=\overline{\mathcal{W}}\left[\tilde{\mathcal{X}}_{k}+\alpha_k(\nabla \mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*))\right]-\mathcal{V}\tilde{\mathcal{Y}}_k+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k)-(1-\alpha_k)\overline{\mathcal{W}}\nabla \mathcal{J}(\mathcal{X}^*) \end{align}\]
Now we can see the advantage of the stacked notation mentioned at the beginning. Let \[\begin{equation} \mathcal{H}_k=\mathrm{diag}\{H_{1,k},...,H_{n,k}\}\in\mathbb{R}^{np\times np} \end{equation}\] Then \(\nabla \mathcal{J}(\mathcal{X}_k)-\nabla \mathcal{J}(\mathcal{X}^*)=-\mathcal{H}_k\tilde{\mathcal{X}}_k\), and the error dynamics (11.6) become \[\begin{equation} \begin{cases} \tilde{\mathcal{X}}_{k+1}=\overline{\mathcal{W}}\left(\tilde{\mathcal{X}}_{k}-\alpha_k\mathcal{H}_k\tilde{\mathcal{X}}_k\right)-\mathcal{V}\tilde{\mathcal{Y}}_k+\alpha_k\overline{\mathcal{W}}\mathcal{S}_k(\mathcal{X}_k)\\ \tilde{\mathcal{Y}}_{k+1}=\tilde{\mathcal{Y}}_{k}+\mathcal{V}\tilde{\mathcal{X}}_{k+1} \end{cases} \tag{11.7} \end{equation}\]

We can further write (11.7) in matrix form:

\[\begin{align} \left( \begin{array}{c} \tilde{\mathcal{X}_{k+1}}\\ \tilde{\mathcal{Y}_{k+1}}\\ \end{array} \right) &= \left(\begin{array}{cc} \overline{\mathcal{W}}&-\mathcal{V}\\ \mathcal{V}\overline{\mathcal{W}}&\overline{\mathcal{W}} \end{array} \right) \left[ \left(\begin{array}{cc} I_{np}&0\\ 0&I_{np} \end{array} \right) - \alpha_k\left(\begin{array}{cc} \mathcal{H}_k&0\\ 0&0 \end{array} \right) \right] \left( \begin{array}{c} \tilde{\mathcal{X}_{k}}\\ \tilde{\mathcal{Y}_{k}}\\ \end{array} \right) +\alpha_k \left( \begin{array}{c} \overline{\mathcal{W}}\\ \mathcal{V}\overline{\mathcal{W}}\\ \end{array} \right) \mathcal{S}_k(\mathcal{X}_k) \\ &:=\mathcal{B}(I_{2np}-\alpha_k\mathcal{T}_k) \left( \begin{array}{c} \tilde{\mathcal{X}_{k}}\\ \tilde{\mathcal{Y}_{k}}\\ \end{array} \right) +\alpha_k \mathcal{B}_l \mathcal{S}_k(\mathcal{X}_k) \tag{11.8} \end{align}\] Note that \[\begin{equation}\mathcal{B}_l=\mathcal{B}\left( \begin{array}{c} I_{np}\\ 0 \end{array} \right) \end{equation}\]

and \(\mathcal{B}\) admits the following decomposition (see Yuan et al. (2018)), which is essentially an eigendecomposition.

Lemma 11.2 Under assumptions 11.1 and 11.2, the matrix \(\mathcal{B}\) can be decomposed as \[\begin{align} \mathcal{B}&= \left(\mathcal{R}_1,\mathcal{R}_2,c\mathcal{K}_R\right)\mathrm{diag}(I_p,I_p,\mathcal{D}_1) \left( \begin{array}{c} \mathcal{L}_1^T\\ \mathcal{L}_2^T\\ \frac{1}{c}\mathcal{K}_L \end{array} \right)\\ &:=\mathcal{K}\mathcal{D}\mathcal{K}^{-1} \end{align}\] for any \(c>0\), where \[\begin{align} &\mathcal{R}_{1}=\left(\begin{array}{l} \mathcal{I} \\ 0 \end{array}\right) \in \mathbb{R}^{2 n p \times p}, \quad \mathcal{R}_{2}=\left(\begin{array}{l} 0 \\ \mathcal{I} \end{array}\right) \in \mathbb{R}^{2 n p \times p}\\ &\mathcal{L}_{1}=\left(\begin{array}{c} \frac{1}{n} \mathcal{I} \\ 0 \end{array}\right) \in \mathbb{R}^{2 n p \times p},\quad \mathcal{L}_{2}=\left(\begin{array}{c} 0 \\ \frac{1}{n} \mathcal{I} \end{array}\right) \in \mathbb{R}^{2 n p\times p}\\ &\mathcal{K}_R= \left( \begin{array}{c} \mathcal{K}_{R,u}\\ \mathcal{K}_{R,l} \end{array} \right)\in\mathbb{R}^{2np\times 2(n-1)p},\\ &\mathcal{K}_L=(\mathcal{K}_{L,l},\mathcal{K}_{L,r})\in\mathbb{R}^{2(n-1)p\times 2np} \end{align}\] and \(\mathcal{I}=\mathbf{1}_n\otimes I_p\in\mathbb{R}^{np\times p}\).
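Lemma 11.2 can be probed numerically: build \(\mathcal{B}\) for a concrete network (an assumed ring with \(p=1\)) and inspect its spectrum. Two eigenvalues equal 1, matching the \(I_p,I_p\) blocks of \(\mathcal{D}\), and for this example the largest remaining modulus coincides with \(\Vert\mathcal{D}_1\Vert=\sqrt{\lambda'}\) from lemma 4 of Yuan et al. (2018):

```python
import numpy as np

n, p = 5, 1                                 # p = 1 keeps the spectrum easy to read
W = 0.5 * np.eye(n)                         # assumed ring network
for i in range(n):
    W[i, (i - 1) % n] += 0.25
    W[i, (i + 1) % n] += 0.25
Wb = (W + np.eye(n)) / 2

sig, U = np.linalg.eigh(np.eye(n) - Wb)     # I - Wbar = U Sigma U^T = V^2
V = U @ np.diag(np.sqrt(np.clip(sig, 0.0, None))) @ U.T

B = np.block([[Wb, -V], [V @ Wb, Wb]])      # the matrix B from (11.8)
moduli = np.sort(np.abs(np.linalg.eigvals(B)))[::-1]
print(moduli)                               # two eigenvalues equal to 1; the rest strictly smaller

lam2 = np.sort(np.linalg.eigvalsh(W))[-2]
print(np.sqrt((1 + lam2) / 2))              # sqrt(lambda'): matches the third-largest modulus
```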
Multiplying both sides of the error dynamics (11.8) by \(\mathcal{K}^{-1}\) and using the decomposition of \(\mathcal{B}\), we have \[\begin{align} \mathcal{K}^{-1} \left( \begin{array}{c} \tilde{\mathcal{X}_{k+1}}\\ \tilde{\mathcal{Y}_{k+1}}\\ \end{array} \right)&:= \left( \begin{array}{c} \bar z_{k+1}\\ \hat z_{k+1}\\ \check z_{k+1} \end{array} \right)\\ &= \mathcal{D}(\mathcal{K}^{-1}I_{2np}\mathcal{K}-\alpha_k\mathcal{K}^{-1}\mathcal{T}_k\mathcal{K})\mathcal{K}^{-1} \left( \begin{array}{c} \tilde{\mathcal{X}_{k}}\\ \tilde{\mathcal{Y}_{k}}\\ \end{array} \right) + \alpha_k\mathcal{K}^{-1}\mathcal{B}_l\mathcal{S}_k(\mathcal{X}_k)\\ &= \mathcal{D}(I_{2np}-\alpha_k\mathcal{K}^{-1}\mathcal{T}_k\mathcal{K}) \left( \begin{array}{c} \bar z_{k}\\ \hat z_{k}\\ \check z_{k} \end{array} \right) + \alpha_k\mathcal{K}^{-1}\mathcal{B}_l\mathcal{S}_k(\mathcal{X}_k) \end{align}\] Moreover, \[\begin{align} \mathcal{K}^{-1}\mathcal{T}_k\mathcal{K}&= \left( \begin{array}{cc} \frac{1}{n}\mathcal{I}^T&0\\ 0&\frac{1}{n}\mathcal{I}^T\\ \frac{1}{c}\mathcal{K}_{L,l}&\frac{1}{c}\mathcal{K}_{L,r} \end{array} \right) \left( \begin{array}{cc} \mathcal{H}_k&0\\ 0&0 \end{array} \right) \left( \begin{array}{ccc} \mathcal{I}&0&c\mathcal{K}_{R,u}\\ 0&\mathcal{I}&c\mathcal{K}_{R,l} \end{array} \right)\\ &=\left(\begin{array}{ccc} \frac{1}{n} \mathcal{I}^T \mathcal{H}_{k}\mathcal{I} & 0 & \frac{c}{n} \mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u} \\ 0 & 0 & 0 \\ \frac{1}{c} \mathcal{K}_{L, l} \mathcal{H}_{k}\mathcal{I} & 0 & \mathcal{K}_{L, l} \mathcal{H}_{k} \mathcal{K}_{R,u} \end{array}\right)\\ &=\left( \begin{array}{ccc} \frac{1}{n}\sum_{i=1}^n H_{i,k} & 0 & \frac{c}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\\ 0&0&0\\ \frac{1}{c}\mathcal{K}_L\mathcal{T}_k\mathcal{R}_1 & 0 & \mathcal{K}_L\mathcal{T}_k\mathcal{K}_R \end{array}\right) \end{align}\] This is because \(\mathcal{T}_k\) has only one nonzero block: \[\begin{align} \mathcal{T}_k=\left( \begin{array}{cc} \mathcal{H}_k & 0\\ 0 &0 \end{array} \right) \end{align}\]

We also have,

\[\begin{align} \mathcal{K}^{-1}\mathcal{B}_l\mathcal{S}_k(\mathcal{X}_k)&= \mathcal{K}^{-1}\mathcal{B}\left( \begin{array}{c} I_{np}\\ 0 \end{array} \right) \mathcal{S}_k(\mathcal{X}_k)\\ &=\mathcal{D}\mathcal{K}^{-1} \left( \begin{array}{c} I_{np}\\ 0 \end{array} \right)\mathcal{S}_k(\mathcal{X}_k)\\ &=\left( \begin{array}{c} \frac{1}{n}\mathcal{I}^T\\ 0\\ \frac{1}{c}\mathcal{D}_1\mathcal{K}_{L,l} \end{array} \right)\mathcal{S}_k(\mathcal{X}_k) \end{align}\] where we write \(\mathcal{K}_L=(\mathcal{K}_{L,l},\mathcal{K}_{L,r})\); note also that \(\mathcal{K}_L\mathcal{B}=\mathcal{D}_1\mathcal{K}_L\) from the decomposition, so \(\mathcal{D}_1\mathcal{K}_{L,l}=\mathcal{K}_L\mathcal{B}_l\). We can see that, for \(\hat z_k\), we have \[\begin{equation} \hat z_{k+1} = \hat z_k \end{equation}\]

Hence we only need to track \((\bar z_k,\check z_k)^T\) in what follows, where \(\bar z_k\in\mathbb{R}^{p}\) and \(\check z_k\in\mathbb{R}^{2(n-1)p}\). (With \(\mathcal{Y}_0=\mathbf{0}\) and \(\mathcal{Y}^*\) chosen in the range of \(\mathcal{V}\), we have \(\hat z_k=\frac{1}{n}\mathcal{I}^T\tilde{\mathcal{Y}}_k=0\) for all \(k\), so \(\hat z_k\) drops out of the recovery below.)

Lemma 11.3 (Transformed Error Dynamics) Under assumptions 11.1 and 11.2, the transformed error dynamics for exact diffusion are \[\begin{align} \left(\begin{array}{c} \bar z_{k+1}\\ \check z_{k+1} \end{array} \right) &=\left(\begin{array}{cc} I_p - \frac{\alpha_k}{n}\sum_{i=1}^n H_{i,k} & -\frac{\alpha_k c}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\\ -\frac{\alpha_k}{c}\mathcal{D}_1\mathcal{K}_L\mathcal{T}_k\mathcal{R}_1& \mathcal{D}_1-\alpha_k\mathcal{D}_1\mathcal{K}_L\mathcal{T}_k\mathcal{K}_R \end{array} \right) \left(\begin{array}{c} \bar z_{k}\\ \check z_{k} \end{array} \right)\\ &\quad + \alpha_k \left( \begin{array}{c} \frac{1}{n}\mathcal{I}^T\\ \frac{1}{c}\mathcal{D}_1\mathcal{K}_{L,l} \end{array} \right)\mathcal{S}_k(\mathcal{X}_k)\\ &:=A_k \left(\begin{array}{c} \bar z_{k}\\ \check z_{k} \end{array} \right) +\alpha_k \left( \begin{array}{c} \frac{1}{n}\mathcal{I}^T\\ \frac{1}{c}\mathcal{D}_1\mathcal{K}_{L,l} \end{array} \right)\mathcal{S}_k(\mathcal{X}_k) \end{align}\] and \[\begin{align} \left( \begin{array}{c} \tilde{\mathcal{X}_{k}}\\ \tilde{\mathcal{Y}_{k}}\\ \end{array} \right) =(\mathcal{R}_1, c\mathcal{K}_R) \left(\begin{array}{c} \bar z_k\\ \check z_k \end{array} \right) \end{align}\]

Next we carry out a non-asymptotic analysis of the convergence of \((\bar z_k,\check z_k)^T\) under a decreasing stepsize \(\alpha_k:=\frac{\theta}{k+m}\), for some positive constants \(\theta\) and \(m\).

11.1.1 Preliminary results

We first treat a simpler situation in which assumption 11.3 reduces to assumption 4.1 (i.e. \(\beta_i=0\)). Then, \[\begin{align} E\left[\mathcal{S}_k(\mathcal{X}_k)|\mathcal{F}_k\right]&=0,\quad\forall k\\ E\left[\Vert \mathcal{S}_{k}(\mathcal{X}_{k})\Vert^2|\mathcal{F}_k\right]&\leq \frac{\sigma^2}{n} \tag{11.9} \end{align}\]

where \(\mathcal{F}_k\) is the \(\sigma\)-algebra generated by \(\{\xi_0,...,\xi_{k-1}\}\).

Then from lemma 11.3, \[\begin{align} E\left[\Vert \bar z_{k+1}\Vert^2|\mathcal{F}_k\right]&\leq\Vert (I_p-\frac{\alpha_k}{n}\sum_{i=1}^n H_{i,k})\bar z_k-\frac{\alpha_kc}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\check z_k\Vert^2\\ &\quad + \frac{\alpha_k^2}{n^2}\Vert \mathcal{I}^T\Vert^2 E\left[\Vert \mathcal{S}_k(\mathcal{X}_k)\Vert^2|\mathcal{F}_k\right] \tag{11.10} \end{align}\] where the cross term vanishes since \(E\left[\mathcal{S}_k(\mathcal{X}_k)|\mathcal{F}_k\right]=0\). For the first term, according to \(2\langle a,b\rangle\leq \gamma\Vert a\Vert^2+\gamma^{-1}\Vert b\Vert^2,\forall \gamma>0\), \(\Vert \mathcal{I}\Vert^2=n\), \(\Vert \mathcal{H}_k\Vert^2\leq\delta^2\), and \(\Vert I_p-\frac{\alpha_k}{n}\sum\limits_{i=1}^n H_{i,k}\Vert^2\leq(1-\alpha_k\mu)^2\) when \(\alpha_k\leq\frac{1}{\delta}\), we have \[\begin{align} &\quad\Vert (I_p-\frac{\alpha_k}{n}\sum_{i=1}^n H_{i,k})\bar z_k-\frac{\alpha_kc}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\check z_k\Vert^2\\ &\leq (1+\gamma)\Vert (I_p-\frac{\alpha_k}{n}\sum_{i=1}^n H_{i,k})\bar z_k\Vert^2 + (1+\frac{1}{\gamma})\Vert \frac{\alpha_kc}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\check z_k\Vert^2\\ &\leq (1+\gamma)(1-\alpha_k\mu)^2\Vert \bar z_k\Vert^2+(1+\frac{1}{\gamma})\frac{\alpha_k^2c^2\delta^2}{n}\Vert \mathcal{K}_{R,u}\Vert^2\Vert \check z_k\Vert^2 \tag{11.11} \end{align}\] We further let \(\alpha_k\leq \frac{1}{3\mu}\) and choose \(\gamma = \frac{3}{8}\alpha_k\mu\) (see the proof of lemma 2.6 in Pu, Olshevsky, and Paschalidis (2019a)); then \((1+\gamma)(1-\alpha_k\mu)^2\leq 1-\frac{3}{2}\alpha_k\mu\) and \((1+\frac{1}{\gamma})\alpha_k\leq\frac{3}{\mu}\), so when \(\alpha_k\leq\min\{\frac{1}{\delta},\frac{1}{3\mu}\}\), (11.11) becomes \[\begin{align} &\quad\Vert (I_p-\frac{\alpha_k}{n}\sum_{i=1}^n H_{i,k})\bar z_k-\frac{\alpha_kc}{n}\mathcal{I}^T\mathcal{H}_k\mathcal{K}_{R,u}\check z_k\Vert^2\\ &\leq(1-\frac{3}{2}\alpha_k\mu)\Vert \bar z_k\Vert^2 + \frac{3\alpha_k c^2\Vert \mathcal{K}_{R,u}\Vert^2\delta^2}{n\mu}\Vert \check z_k\Vert^2 \tag{11.12} \end{align}\] Substituting (11.12) and (11.9) into (11.10) and taking full expectations on both sides, we obtain lemma 11.4. Notice that \[\begin{equation} \Vert \mathcal{K}_{R,u}\Vert^2= \Vert \left(I_{np},0\right)\mathcal{K}_{R}\Vert^2\leq \Vert \mathcal{K}_{R}\Vert^2 \end{equation}\]
Lemma 11.4 Under assumptions 11.1, 11.2, and 4.1, supposing \(\alpha_k\leq \min\{\frac{1}{\delta},\frac{1}{3\mu}\}\), we have \[\begin{align} E \left[\Vert \bar z_{k+1}\Vert^2\right]\leq (1-\frac{3}{2}\alpha_k\mu)E\left[\Vert\bar z_k\Vert^2\right] + \frac{3\alpha_k c^2\Vert \mathcal{K}_{R}\Vert^2\delta^2}{n\mu}E\left[\Vert\check z_k\Vert^2\right] + \frac{\alpha_k^2\sigma^2}{n} \end{align}\]
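The two elementary inequalities behind the choice \(\gamma=\frac{3}{8}\alpha_k\mu\) can be sanity-checked numerically (a quick verification sketch, not part of the original argument):

```python
import numpy as np

mu = 1.0
for t in np.linspace(1e-4, 0.33, 6):            # t = alpha_k * mu, with alpha_k <= 1/(3 mu)
    a = t / mu
    g = 3.0 * a * mu / 8.0                      # gamma = (3/8) alpha_k mu
    ok1 = (1 + g) * (1 - a * mu) ** 2 <= 1 - 1.5 * a * mu
    ok2 = (1 + 1 / g) * a <= 3.0 / mu
    print(f"alpha*mu = {t:.4f}: (1+g)(1-am)^2 <= 1-1.5am: {ok1}, (1+1/g)a <= 3/mu: {ok2}")
```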

Similarly, for \(E\left[\Vert \check z_{k+1}\Vert^2|\mathcal{F}_{k}\right]\), we have,

\[\begin{align} E\left[\Vert \check z_{k+1}\Vert^2|\mathcal{F}_{k}\right]&\leq\Vert \mathcal{D}_1\check z_k-\frac{\alpha_k}{c}\mathcal{D}_1\mathcal{K}_L\mathcal{T}_k(\mathcal{R}_1\bar z_k+c\mathcal{K}_R\check z_k)\Vert^2\\ &\quad + \frac{\alpha_k^2\Vert\mathcal{D}_1\mathcal{K}_{L,l}\Vert^2}{c^2}E\left[\Vert \mathcal{S}_k(\mathcal{X}_k)\Vert^2|\mathcal{F}_k\right] \tag{11.13} \end{align}\] Let \(\lambda :=\max \{|\lambda_2(W)|,|\lambda_n(W)|\}\) and \(\lambda' = (1+\lambda_2(W))/2\in(0,1)\); then from lemma 4 in Yuan et al. (2018), \(\Vert \mathcal{D}_1\Vert=\sqrt{\lambda'}\). Additionally, \[\begin{align} \Vert \mathcal{T}_k\Vert^2&= \Vert \left( \begin{array}{cc} \mathcal{H}_k&0\\ 0&0 \end{array} \right) \Vert^2\leq\delta^2\\ \Vert \mathcal{R}_1\Vert^2 &= \Vert\left( \begin{array}{c} \mathcal{I}\\ 0 \end{array} \right)\Vert^2=\Vert \mathcal{I}\Vert^2=n \end{align}\] Moreover, from the decomposition of \(\mathcal{B}\) and \(\mathcal{B}_l=\mathcal{B}\left( \begin{array}{c} I_{np}\\ 0 \end{array} \right)\), we have \[\begin{equation} \Vert \mathcal{B}_l\Vert^2\leq \Vert \mathcal{B}\Vert^2\leq 1 \end{equation}\] Then, similarly to (11.11), we have \[\begin{align} &\quad \Vert \mathcal{D}_1\check z_k-\frac{\alpha_k}{c}\mathcal{D}_1\mathcal{K}_L\mathcal{T}_k(\mathcal{R}_1\bar z_k+c\mathcal{K}_R\check z_k)\Vert^2\\ &\leq (1+\gamma)\lambda' \Vert \check z_k\Vert^2 + (1+\frac{1}{\gamma})\frac{\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\delta^2}{c^2}\Vert \mathcal{R}_1\bar z_k+c\mathcal{K}_R\check z_k\Vert^2\\ &\leq (1+\gamma)\lambda' \Vert \check z_k\Vert^2 + (1+\frac{1}{\gamma})\frac{\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\delta^2}{c^2}\left(2n\Vert \bar z_k\Vert^2+2c^2\Vert \mathcal{K}_R\Vert^2\Vert\check z_k\Vert^2\right)\\ &\leq (1+\frac{1}{\gamma})\frac{2n\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\delta^2}{c^2}\Vert \bar z_k\Vert^2 \\ &\quad + \left[(1+\gamma)\lambda'+(1+\frac{1}{\gamma})\left({2\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\Vert \mathcal{K}_R\Vert^2\delta^2}\right)\right]\Vert \check z_k\Vert^2 \tag{11.14} \end{align}\] Then we have \[\begin{align} &\quad E\left[\Vert \check z_{k+1}\Vert^2\right]\\ &\leq (1+\frac{1}{\gamma})\frac{2n\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\delta^2}{c^2}E\left[\Vert \bar z_k\Vert^2\right] + \alpha_k^2\frac{\lambda'\Vert\mathcal{K}_L\Vert^2\sigma^2}{nc^2}\\ &\quad + \left[(1+\gamma)\lambda'+(1+\frac{1}{\gamma})\left({2\alpha_k^2\lambda'\Vert \mathcal{K}_L\Vert^2\Vert \mathcal{K}_R\Vert^2\delta^2}\right)\right]E\left[\Vert \check z_k\Vert^2\right] \tag{11.15} \end{align}\] Denote \[\begin{equation} B(k)=E\left[\Vert \bar z_k\Vert^2\right],\quad H(k) = E\left[\Vert\check z_k\Vert^2\right] \end{equation}\] We want to derive uniform bounds \(\hat B\) and \(\hat H\) for \(B(k)\) and \(H(k)\) respectively, such that \[\begin{equation} B(k)\leq \frac{\hat B}{(m+k)^a},\quad H(k)\leq \frac{\hat H}{(m+k)^b} \end{equation}\] where \(a,b\) are some positive integers. Note that the relationship between \((\tilde{\mathcal{X}_k},\tilde{\mathcal{Y}_k})^T\) and \((\bar z_k,\check z_k)^T\) is \[\begin{align} \left(\begin{array}{c} \bar z_k\\ \check z_k \end{array} \right)&=\left(\begin{array}{c} \mathcal{L}_1^T\\ \frac{1}{c}\mathcal{K}_L\end{array}\right) \left(\begin{array}{c} \tilde{\mathcal{X}_k}\\ \tilde{\mathcal{Y}_k} \end{array} \right)\\ &=\left(\begin{array}{cc} \frac{1}{n}\mathcal{I}^T&0\\ \frac{1}{c}\mathcal{K}_{L,l}&\frac{1}{c}\mathcal{K}_{L,r} \end{array} \right) \left(\begin{array}{c} \tilde{\mathcal{X}_k}\\ \tilde{\mathcal{Y}_k} \end{array} \right) \end{align}\]
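Before carrying out the non-asymptotic analysis, it may help to simulate the coupled recursions of lemma 11.4 and (11.15) with stand-in coefficients; everything below is a placeholder chosen only to exhibit the expected decay \(B(k)=\mathcal{O}(1/(m+k))\) and \(H(k)=\mathcal{O}(1/(m+k)^2)\) for this toy choice, not the true constants.

```python
import numpy as np

# Placeholder coefficients standing in for those of lemma 11.4 and (11.15)
mu, c1, c2, lam_eff = 1.0, 0.5, 0.5, 0.9   # lam_eff plays the role of (1+gamma)*lambda'
theta, m = 4.0 / mu, 50                    # alpha_k = theta / (k + m)
B, H = 1.0, 1.0                            # B(k) = E||zbar_k||^2, H(k) = E||zcheck_k||^2
for k in range(1, 200001):
    a = theta / (k + m)
    B, H = (1 - 1.5 * a * mu) * B + c1 * a * H + a**2, \
           c2 * a**2 * B + (lam_eff + c2 * a**2) * H + a**2
    if k in (10_000, 50_000, 200_000):
        print(f"k = {k:6d}: B*(k+m) = {B*(k+m):.3f}, H*(k+m)^2 = {H*(k+m)**2:.3f}")
```

Both rescaled quantities plateau, suggesting the targets \(B(k)\leq \hat B/(m+k)\) and \(H(k)\leq \hat H/(m+k)^2\) are the right ansatz for this stepsize schedule.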

References

Pu, Shi, Alex Olshevsky, and Ioannis Ch. Paschalidis. 2019a. “A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent.”

Yuan, Kun, Sulaiman A. Alghunaim, Bicheng Ying, and Ali H. Sayed. 2019. “On the Performance of Exact Diffusion over Adaptive Networks.” In 2019 IEEE 58th Conference on Decision and Control (CDC), 4898–4903. IEEE.

Yuan, Kun, Bicheng Ying, Xiaochuan Zhao, and Ali H. Sayed. 2018. “Exact Diffusion for Distributed Optimization and Learning—Part II: Convergence Analysis.” IEEE Transactions on Signal Processing 67 (3): 724–39.