Chapter 12 Decentralized Proximal Gradient Algorithms with Linear Convergence Rates

Alghunaim et al. (2019) propose a general primal-dual algorithmic framework for the problem \[\begin{equation} x^* = \underset{x\in\mathbb{R}^p}{\arg\min} \frac{1}{n}\sum_{i=1}^n f_i(x)+R(x) \tag{12.1} \end{equation}\]

where \(R:\mathbb{R}^p\to\mathbb{R}\cup\{+\infty\}\) is a convex, possibly nonsmooth function common to all agents. Whether it can be relaxed to agent-specific functions \(R_i(x)\) is not explicitly answered; however, in that agent-specific scenario, if each agent can compute only one gradient and one proximal mapping per iteration (for the smooth and nonsmooth parts, respectively), even with unlimited communication per iteration, linear convergence to the exact \(x^*\in\mathbb{R}^p\) cannot be achieved. Each \(f_i\), known only to agent \(i\), is \(\mu\)-strongly convex with \(L\)-Lipschitz continuous gradient, where \(\mu\leq L\).

Instead of stacking the local variables \(x_i\in\mathbb{R}^p\) into a matrix, we use vectorized notation here to best fit the setting in (Alghunaim et al. 2019). \[\begin{equation} \mathcal{X} := \left(\begin{array}{c} x_1\\ x_2\\ \vdots\\ x_n \end{array} \right) \end{equation}\] Then \(\mathcal{X}\in\mathbb{R}^{np}\); denote \(J(\mathcal{X}):=\sum\limits_{i=1}^nf_i(x_i)\). Let \(B,C\in\mathbb{R}^{np\times np}\) be two general symmetric matrices satisfying \[\begin{align} B\mathcal{X}=0&\quad\Longleftrightarrow\quad x_1=\cdots=x_n\\ C=0\ \text{ or }\ \left(C\mathcal{X}=0 \Longleftrightarrow B\mathcal{X}=0\right) \tag{12.2} \end{align}\]
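As an illustration (an assumed example, not taken from the source), one choice satisfying the first condition in (12.2) is \(B=(I-A)\otimes I_p\), where \(A\) is a symmetric, doubly stochastic, primitive mixing matrix; its null space is exactly the consensus subspace. A minimal numerical sketch, assuming numpy:

```python
# A minimal sketch (assumed example): verify that B = (I - A) ⊗ I_p satisfies
# B X = 0  <=>  x_1 = ... = x_n for a symmetric, doubly stochastic, primitive
# mixing matrix A on a connected graph.
import numpy as np

n, p = 4, 3
# Mixing matrix for a ring of 4 agents (symmetric, doubly stochastic, primitive).
A = np.array([[0.5 , 0.25, 0.  , 0.25],
              [0.25, 0.5 , 0.25, 0.  ],
              [0.  , 0.25, 0.5 , 0.25],
              [0.25, 0.  , 0.25, 0.5 ]])
B = np.kron(np.eye(n) - A, np.eye(p))  # symmetric; null space = consensus subspace

X_consensus = np.tile(np.random.randn(p), n)  # stacked vector with x_1 = ... = x_n
X_general = np.random.randn(n * p)            # generic stacked vector

print(np.allclose(B @ X_consensus, 0))  # True:  B X = 0 on the consensus subspace
print(np.allclose(B @ X_general, 0))    # False (almost surely) otherwise
```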

12.1 UDA

First we consider \(R(x)=0\), where problem (12.1) reduces to problem (1.1). We have the following unified decentralized algorithm (UDA). The network is assumed to be connected and undirected, and the mixing matrix \(A\) is symmetric, doubly stochastic, and primitive, i.e. there exists \(j\in\mathbb{N}^+\) such that all entries of \(A^j\) are positive.

Algorithm 12.1 (UDA) * Initialize \(y_0 = 0\in\mathbb{R}^{np}\) and \(\mathcal{X}_0\) to an arbitrary value; set the step size \(\alpha\) and the combination matrix \(\overline{\mathcal{A}}\)

  • For k = 0, 1, …, T-1

  • \(z_{k+1} = (I - C)\mathcal{X}_k - \alpha\nabla J(\mathcal{X}_k) - By_k\) (primal-descent)

  • \(y_{k+1} = y_k + Bz_{k+1}\) (dual-ascent)

  • \(\mathcal{X}_{k+1} = \overline{\mathcal{A}}z_{k+1}\) (combine)

  • Output \(\frac{1}{n}\sum\limits_{i=1}^nx_{i,T}\)

Remark. Algorithm 12.1 is a general framework: the matrix \(\overline{\mathcal{A}}\) must be specified and corresponds to the mixing matrix \(A\) (the network combination matrix).
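A minimal sketch of Algorithm 12.1, assuming numpy: `grad_J` is a user-supplied function returning the stacked gradient \(\nabla J(\mathcal{X})\), and `A_bar`, `B`, `C` are placeholders for a concrete choice satisfying (12.2) and Assumption 12.1 (e.g., built from the mixing matrix \(A\)).

```python
# A sketch of Algorithm 12.1 (UDA); grad_J, A_bar, B, C are assumed inputs.
import numpy as np

def uda(grad_J, A_bar, B, C, alpha, X0, T):
    d = X0.shape[0]                  # d = n * p, the stacked dimension
    I = np.eye(d)
    X = X0.copy()
    y = np.zeros(d)                  # y_0 = 0
    for _ in range(T):
        z = (I - C) @ X - alpha * grad_J(X) - B @ y   # primal-descent
        y = y + B @ z                                 # dual-ascent
        X = A_bar @ z                                 # combine
    return X                         # stacked iterate X_T; average the blocks for the output
```

For quadratic local objectives such as \(f_i(x)=\tfrac{1}{2}\Vert x-b_i\Vert^2\), for example, `grad_J` simply returns the blockwise differences \(x_i-b_i\).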

From the difference \(z_{k+1}-z_k\) and the dual update \(y_{k}-y_{k-1}=Bz_{k}\), we can eliminate \(y_k\) and obtain \[\begin{equation} z_{k+1} = (I-B^2)z_k + (I-C)(\mathcal{X}_k-\mathcal{X}_{k-1}) - \alpha (\nabla J(\mathcal{X}_{k}) - \nabla J(\mathcal{X}_{k-1})) \tag{12.3} \end{equation}\]

Choosing different \(\{\overline{\mathcal{A}}, B, C\}\) recovers many state-of-the-art algorithms (e.g., EXTRA, Exact Diffusion, and NIDS arise from particular choices).

12.2 PUDA

When \(R(x)\not=0\), we can extend UDA 12.1 to a proximal unified decentralized algorithm (PUDA). Let \[\begin{equation} \mathcal{R}(\mathcal{X}):=\frac{1}{n}\sum_{i=1}^n R(x_i) \tag{12.4} \end{equation}\] and define the proximal operator \(\mathrm{prox}_{\alpha f}(\cdot)\) by \[\begin{equation} \mathrm{prox}_{\alpha f}(x)=\underset{z}{\arg \min } f(z)+\frac{1}{2 \alpha}\|z-x\|^{2} \end{equation}\]
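As a concrete (hypothetical) instance, take \(R(x)=\lambda\Vert x\Vert_1\); its proximal map is soft-thresholding, and since \(\mathcal{R}\) in (12.4) is separable across agents, \(\mathrm{prox}_{\alpha\mathcal{R}}\) acts blockwise on the stacked vector (the \(1/n\) factor in (12.4) scales the per-block threshold). A minimal sketch, assuming numpy:

```python
# Soft-thresholding: prox of the l1 norm, applied blockwise for the stacked R in (12.4).
import numpy as np

def soft_threshold(v, t):
    """prox_{t*||.||_1}(v) = sign(v) * max(|v| - t, 0), elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_stacked_l1(X, alpha, lam, n, p):
    # With R(X) = (1/n) * sum_i lam * ||x_i||_1, each block x_i is thresholded at alpha*lam/n.
    return soft_threshold(X.reshape(n, p), alpha * lam / n).reshape(-1)
```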

then we have,

Algorithm 12.2 (PUDA) * Initialize \(y_0 = 0\in\mathbb{R}^{np}\) and \(\mathcal{X}_0\) to an arbitrary value; set the step size \(\alpha\) and the combination matrix \(\overline{\mathcal{A}}\)

  • For k = 0, 1, …, T-1

  • \(z_{k+1} = (I - C)\mathcal{X}_k - \alpha\nabla J(\mathcal{X}_k) - By_k\) (primal-descent)

  • \(y_{k+1} = y_k + Bz_{k+1}\) (dual-ascent)

  • \(\mathcal{X}_{k+1} = \mathrm{prox}_{\alpha \mathcal{R}}(\overline{\mathcal{A}}z_{k+1})\) (combine)

  • Output \(\frac{1}{n}\sum\limits_{i=1}^nx_{i,T}\)

Remark. Note that the network quantity \(\mathcal{R}(\mathcal{X})\) is not \(R(\bar x)\).
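A minimal sketch of one PUDA iteration, assuming numpy; it differs from the UDA sketch above only in the combine step, where a block-separable proximal map of \(\mathcal{R}\) (e.g., `prox_stacked_l1` above) is applied after mixing. All matrices remain placeholders for a choice satisfying Assumption 12.1.

```python
# One PUDA iteration; prox_R(v, alpha) is an assumed callable implementing prox_{alpha*R}.
import numpy as np

def puda_step(X, y, grad_J, prox_R, A_bar, B, C, alpha):
    I = np.eye(X.shape[0])
    z = (I - C) @ X - alpha * grad_J(X) - B @ y    # primal-descent
    y_next = y + B @ z                             # dual-ascent
    X_next = prox_R(A_bar @ z, alpha)              # combine, then proximal step
    return X_next, y_next
```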

For PUDA 12.2, a unique fixed point \((\mathcal{X}^*, y^*, z^*)\) exists. Moreover, under suitable conditions, PUDA 12.2 converges linearly to that point with a fixed step size on the order of \(\mathcal{O}(\frac{1}{L})\).

Assumption 12.1 Assume the matrices \(B,C\in\mathbb{R}^{np\times np}\) satisfy (12.2) and the following \[ \overline{\mathcal{A}}^{2} \leq I-B^{2} \quad \text { and } \quad 0 \leq C<2 I \]
Theorem 12.1 Let each \(f_i\) be \(\mu\)-strongly convex with \(L\)-Lipschitz continuous gradient and let Assumption 12.1 hold. If \(y_0=0\) and the step size satisfies \(\alpha<\frac{2-\sigma_{\max}(C)}{L}\), then \[ \Vert \mathcal{X}_{k+1}-\mathcal{X}^*\Vert^2 + \Vert y_{k+1}-y^*\Vert^2 \leq \gamma (\Vert \mathcal{X}_{k}-\mathcal{X}^*\Vert^2 + \Vert y_{k}-y^*\Vert^2) \] where \(\gamma = \max\{1-\alpha\mu(2-\sigma_\max(C)-\alpha L),\ 1- \underline{\sigma}(B)\}\).
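For intuition, consider the special case \(C=0\) with \(\alpha=\frac{1}{L}\) (which satisfies \(\alpha<\frac{2}{L}\)); substituting into the expression for \(\gamma\) gives \[ 1-\alpha\mu(2-\sigma_{\max}(C)-\alpha L) = 1-\frac{\mu}{L}(2-0-1)=1-\frac{\mu}{L}, \qquad \gamma=\max\Big\{1-\frac{\mu}{L},\ 1-\underline{\sigma}(B)\Big\}, \] so the contraction factor is governed by the condition number \(L/\mu\) and the spectrum of \(B\).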

References

Alghunaim, Sulaiman A., Ernest K. Ryu, Kun Yuan, and Ali H. Sayed. 2019. “Decentralized Proximal Gradient Algorithms with Linear Convergence Rates.”