Notes on Distributed Optimization
2020-04-26
Chapter 1 Introduction
These are some notes on distributed optimization, covering a few algorithms, their convergence analyses, and some of my own understanding. Although the authors of the cited papers already provide proofs, I fill in some details and try to explain why the proofs are structured the way they are, so that they become easier to follow, especially for myself.
We summarize some distributed gradient-based algorithms for solving problem (1.1), namely the Push-Pull Gradient Method¹ (Pu et al. 2018) and Distributed Stochastic Gradient Tracking (DSGT) (Pu and Nedić 2018).
\[\begin{equation} \underset{x\in\mathbb{R}^p}{\min} f(x):=\frac{1}{n}\sum_{i=1}^n f_i(x) \tag{1.1} \end{equation}\]where \(f_i:\mathbb{R}^p\to \mathbb{R}\) is known only by agent \(i\), and all the agents communicate and exchange information over a network.
In general, each agent in these two methods maintains a decision variable \(x\in \mathbb{R}^p\) and an auxiliary variable \(y\in\mathbb{R}^p\), and both methods take the form (1.2) at the \((k+1)\)th iteration.
\[\begin{align} X_{k+1} &= S_1(X_k-\boldsymbol\alpha Y_k)\\ Y_{k+1} &= S_2Y_k + T(X_{k+1}) - T(X_k) \tag{1.2} \end{align}\]where \(S_1\) and \(S_2\) are the matrices inducing the communication graphs, \(T(\cdot)\) is an estimate of the gradient, and \(\boldsymbol\alpha = \text{diag}(\alpha_1,...,\alpha_n)\in\mathbb{R}^{n\times n}\), where \(\alpha_i\) is the step size of agent \(i\). A sketch of this template appears after the definitions below.
\[\begin{align*} X_k &= \left(x_{1,k},x_{2,k},...,x_{n,k}\right)^T\in\mathbb{R}^{n\times p}\\ Y_k &= \left(y_{1,k},y_{2,k},...,y_{n,k}\right)^T\in\mathbb{R}^{n\times p}\\ \end{align*}\]Here \(x_{i,k}\in\mathbb{R}^p\) denotes the decision variable of agent \(i\) at the \(k\)th iteration, and \(y_{i,k}\) the auxiliary variable of agent \(i\) at the \(k\)th iteration.
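To make the template (1.2) concrete, here is a minimal sketch in Python/NumPy. The quadratic objectives, the ring graph, and all names below are my own illustrative choices, not the setup of the cited papers; in particular I take \(S_1 = S_2\) doubly stochastic for simplicity, while Push-Pull uses a row-stochastic matrix for the \(x\)-update and a column-stochastic one for the \(y\)-update.

```python
import numpy as np

# Illustrative local objectives f_i(x) = 0.5 * ||x - b_i||^2,
# so grad f_i(x) = x - b_i (an assumption for this sketch).
n, p = 5, 3                      # number of agents, dimension
rng = np.random.default_rng(0)
B = rng.standard_normal((n, p))  # row i holds b_i

def T(X):
    """Stack the local gradients: row i is grad f_i(x_i)."""
    return X - B

# Doubly stochastic mixing matrix for a ring graph (S1 = S2 here).
S = np.zeros((n, n))
for i in range(n):
    S[i, i] = 0.5
    S[i, (i - 1) % n] = 0.25
    S[i, (i + 1) % n] = 0.25

alpha = np.diag(np.full(n, 0.1))   # per-agent step sizes

X = np.zeros((n, p))
Y = T(X)                           # standard initialization Y_0 = T(X_0)
for k in range(200):
    X_next = S @ (X - alpha @ Y)   # decision-variable update
    Y = S @ Y + T(X_next) - T(X)   # gradient-tracking update
    X = X_next

# Every row of X should approach the minimizer of (1/n) * sum_i f_i,
# which for these quadratics is the average of the b_i.
print(np.allclose(X, B.mean(axis=0), atol=1e-6))
```

The second update explains the name "gradient tracking": summing it over agents shows that the average of the rows of \(Y_k\) equals the average of the local gradients at every iteration, so \(y_{i,k}\) tracks the global gradient.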
Under Assumption 1.1, problem (1.1) has a unique solution \(x^*\in\mathbb{R}^{1\times p}\). To prove the convergence of these methods, the idea is to bound three quantities, namely \(\bar x_{k+1}-x^*\), \(X_{k+1}-\mathbf{1}\bar x_{k+1}\), and \(Y_{k+1}-\mathbf{1}\bar y_{k+1}\), by a linear combination of their previous values under suitable norms. This introduces a matrix \(A\) of coefficients. To make the iteration converge, we need \(\rho(A)<1\), i.e., the spectral radius of \(A\) must be less than \(1\) (a similar idea to a contraction mapping), which in turn imposes constraints on the step size \(\alpha\). By doing so, the authors prove convergence and derive the convergence rate.
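The following toy computation illustrates why \(\rho(A)<1\) suffices: if the vector \(v_k\) of the three error quantities satisfies \(v_{k+1}\le A v_k\) componentwise with \(A\) nonnegative, then \(v_k \le A^k v_0 \to 0\) geometrically. The entries of \(A\) below are made up for illustration; in the papers they depend on the step size, \(\mu\), \(L\), and the graph.

```python
import numpy as np

# Made-up nonnegative coefficient matrix coupling the three error bounds.
A = np.array([[0.9,  0.05, 0.01],
              [0.1,  0.8,  0.05],
              [0.05, 0.1,  0.8 ]])

rho = max(abs(np.linalg.eigvals(A)))
print(f"spectral radius = {rho:.4f}")   # < 1 here

v = np.array([1.0, 1.0, 1.0])           # initial error bounds
for k in range(500):
    v = A @ v
print(v)  # all three bounds decay geometrically, at rate ~ rho
```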
Assumption 1.1 Each \(f_i\) is \(\mu\)-strongly convex and its gradient is \(L\)-Lipschitz continuous, i.e. \(\forall x,x'\in\mathbb{R}^{p}\) \[ \left\langle\nabla f_{i}(x)-\nabla f_{i}\left(x^{\prime}\right), x-x^{\prime}\right\rangle \geq \mu\left\|x-x^{\prime}\right\|_{2}^{2} \]
\[ \left\|\nabla f_{i}(x)-\nabla f_{i}\left(x^{\prime}\right)\right\|_{2} \leq L\left\|x-x^{\prime}\right\|_{2} \]Unless stated otherwise, we use capital letters to denote matrices, and all vectors are column vectors. The appendix of (Koloskova, Stich, and Jaggi 2019) collects some inequalities that we use repeatedly in the proofs and derivations.
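As a sanity check of Assumption 1.1, the sketch below verifies both inequalities numerically for a quadratic \(f(x)=\frac{1}{2}x^\top Q x\) with \(Q\) symmetric positive definite, for which \(\mu=\lambda_{\min}(Q)\) and \(L=\lambda_{\max}(Q)\); the matrix \(Q\) is randomly generated for illustration only.

```python
import numpy as np

# For f(x) = 0.5 * x^T Q x with Q symmetric positive definite,
# grad f(x) = Q x, mu = lambda_min(Q), and L = lambda_max(Q).
rng = np.random.default_rng(1)
p = 4
M = rng.standard_normal((p, p))
Q = M @ M.T + p * np.eye(p)        # symmetric positive definite

mu = np.linalg.eigvalsh(Q).min()
L = np.linalg.eigvalsh(Q).max()

grad = lambda x: Q @ x
for _ in range(1000):
    x, xp = rng.standard_normal(p), rng.standard_normal(p)
    d, g = x - xp, grad(x) - grad(xp)
    assert g @ d >= mu * (d @ d) - 1e-9                        # strong convexity
    assert np.linalg.norm(g) <= L * np.linalg.norm(d) + 1e-9   # Lipschitz gradient
print("Assumption 1.1 holds with mu =", round(mu, 3), "and L =", round(L, 3))
```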
References
Koloskova, Anastasia, Sebastian U Stich, and Martin Jaggi. 2019. “Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication.” arXiv Preprint arXiv:1902.00340.
Pu, Shi, and Angelia Nedić. 2018. “Distributed Stochastic Gradient Tracking Methods.”
Pu, Shi, Wei Shi, Jinming Xu, and Angelia Nedić. 2018. “A Push-Pull Gradient Method for Distributed Optimization in Networks.” In 2018 IEEE Conference on Decision and Control (CDC), 3385–90. IEEE.
In the original paper, the authors consider minimizing the sum of \(f_i\), where each \(f_i\) is known by agent \(i\) only.↩