Four lectures on probabilistic methods for data science

Methods of high-dimensional probability play a central role in applications for statistics, signal processing, theoretical computer science and related fields. These lectures present a sample of particularly useful tools of high-dimensional probability, focusing on the classical and matrix Bernstein's inequalities and the uniform matrix deviation inequality. We illustrate these tools with applications for dimension reduction, network analysis, covariance estimation, matrix completion and sparse signal recovery. The lectures are geared towards beginning graduate students who have taken a rigorous course in probability but may not have any experience in data science applications.


Lecture 1: Concentration of sums of independent random variables
These lectures present a sample of modern methods of high-dimensional probability and illustrate these methods with applications in data science. This sample is not comprehensive by any means, but it could serve as a point of entry into a branch of modern probability that is motivated by a variety of data-related problems.
To get the most out of these lectures, you should have taken a graduate course in probability, have a good command of linear algebra (including singular value decomposition) and be familiar with basic concepts in normed spaces (including $L^p$ spaces).
All of the material of these lectures is covered more systematically, at a slower pace, and with a wider range of applications, in my forthcoming textbook [53]. You may also be interested in two similar tutorials: [51] is focused on random matrices, and a more advanced text [52] discusses high-dimensional inference problems.
It should be possible to use these lectures for self-study or group study. You will find here many places where you are invited to do some work (marked in the text e.g. by "check this!"), and you are encouraged to do it to get a better grasp of the material. Each lecture ends with a section called "Notes" where you will find references for the results just discussed, as well as some improvements and extensions.
We are now ready to start.
Probabilistic reasoning has a major impact on modern data science. There are roughly two ways in which this happens.
• Randomized algorithms, which perform some operations at random, have long been developed in computer science and remain very popular. Randomized algorithms are among the most effective methods (and sometimes the only known ones) for many data problems.
• Random models of data form the usual premise of statistical analysis. Even when the data at hand is deterministic, it is often helpful to think of it as a random sample drawn from some unknown distribution ("population").
In these lectures, we will encounter both randomized algorithms and random models of data.

Sub-gaussian distributions
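Before we start discussing probabilistic methods, we will introduce an important class of probability distributions that forms a natural "habitat" for random variables in many theoretical and applied problems. These are sub-gaussian distributions. As the name suggests, we will be looking at an extension of the most fundamental distribution in probability theory: the gaussian, or normal, distribution $N(\mu, \sigma)$.
The standard normal random variable $X \sim N(0,1)$ has rapidly decaying tails, $P\{|X| \ge t\} \le 2\exp(-t^2/2)$ for all $t \ge 0$; its moments grow slowly, $\|X\|_p := (E|X|^p)^{1/p} = O(\sqrt{p})$ as $p \to \infty$; the moment generating function (MGF) of its square is finite near zero, $E\exp(cX^2) \le 2$ for some $c > 0$; and its MGF is $E\exp(\lambda X) = \exp(\lambda^2/2)$ for all $\lambda \in R$.
All these properties tell the same story from four different perspectives. It is not very difficult to show (although we will not do it here) that for any random variable X, not necessarily Gaussian, these four properties are essentially equivalent.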
Proposition 1.1.1 (Sub-gaussian properties). For a random variable X, the following properties are equivalent.
Tails: $P\{|X| \ge t\} \le 2\exp(-t^2/K_1^2)$ for all $t \ge 0$.
Moments: $\|X\|_p \le K_2\sqrt{p}$ for all $p \ge 1$.
MGF of square: $E\exp(X^2/K_3^2) \le 2$.
Moreover, if $E X = 0$ then these properties are also equivalent to the following one:
MGF: $E\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for all $\lambda \in R$.
Random variables that satisfy one of the first three properties (and thus all of them) are called sub-gaussian. The best $K_3$ is called the sub-gaussian norm of X, and is usually denoted $\|X\|_{\psi_2}$, that is
$$\|X\|_{\psi_2} := \inf\{t > 0 : E\exp(X^2/t^2) \le 2\}.$$
One can check that $\|\cdot\|_{\psi_2}$ indeed defines a norm; it is an example of the general concept of the Orlicz norm. Proposition 1.1.1 states that the numbers $K_i$ in all four properties are equivalent to $\|X\|_{\psi_2}$ up to absolute constant factors.
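To make the definition of the sub-gaussian norm concrete, the following sketch computes it numerically for a random variable with finitely many values, by bisection on $t$. It is only an illustration: the helper name `subgaussian_norm` and the bracketing interval are assumptions of this example. For a Rademacher variable (values ±1 with probability 1/2 each) the defining condition reads $\exp(1/t^2) \le 2$, so the norm equals $1/\sqrt{\ln 2}$.

```python
import math

def subgaussian_norm(values, probs, lo=1e-3, hi=1e3, iters=200):
    """Smallest t with E exp(X^2 / t^2) <= 2 for a finitely supported X.

    Assumes the answer lies in [lo, hi]. The expectation is decreasing
    in t, so plain bisection applies.
    """
    def mgf_of_square(t):
        try:
            return sum(p * math.exp(x * x / (t * t))
                       for x, p in zip(values, probs))
        except OverflowError:
            return math.inf   # t is far too small

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) <= 2.0:
            hi = mid          # mid is admissible, try a smaller t
        else:
            lo = mid          # mid violates the condition
    return hi

# Rademacher variable: values -1 and +1 with probability 1/2 each.
rademacher_norm = subgaussian_norm([-1.0, 1.0], [0.5, 0.5])
# Bernoulli variable: values 0 and 1 with probability 1/2 each.
bernoulli_norm = subgaussian_norm([0.0, 1.0], [0.5, 0.5])
print(rademacher_norm, bernoulli_norm)
```

The same bisection applies to any bounded discrete distribution, which is one way to see numerically that bounded random variables are sub-gaussian.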
Example 1.1.2. As we already noted, the standard normal random variable $X \sim N(0,1)$ is sub-gaussian. Similarly, arbitrary normal random variables $X \sim N(\mu, \sigma)$ are sub-gaussian. Another example is a Bernoulli random variable X that takes values 0 and 1 with probabilities 1/2 each. More generally, any bounded random variable X is sub-gaussian. In contrast, Poisson, exponential, Pareto and Cauchy distributions are not sub-gaussian. (Verify all these claims; this is not difficult.)

Hoeffding's inequality
You may remember from a basic course in probability that the normal distribution $N(\mu, \sigma)$ has a remarkable property: the sum of independent normal random variables is also normal. Here is a version of this property for sub-gaussian distributions.
Proposition 1.2.1 (Sums of sub-gaussians). Let $X_1, \ldots, X_N$ be independent, mean zero, sub-gaussian random variables. Then $\sum_{i=1}^N X_i$ is sub-gaussian, and
$$\Big\|\sum_{i=1}^N X_i\Big\|_{\psi_2}^2 \le C \sum_{i=1}^N \|X_i\|_{\psi_2}^2,$$
where C is an absolute constant.
Proof. Let us bound the moment generating function of the sum for any $\lambda \in R$:
$$E\exp\Big(\lambda\sum_{i=1}^N X_i\Big) = \prod_{i=1}^N E\exp(\lambda X_i) \le \prod_{i=1}^N \exp\big(C\lambda^2\|X_i\|_{\psi_2}^2\big) = \exp(\lambda^2 K^2), \quad\text{where } K^2 := C\sum_{i=1}^N \|X_i\|_{\psi_2}^2.$$
Here the first step uses independence, and the second uses the last property in Proposition 1.1.1. Using again the last property in Proposition 1.1.1, we conclude that the sum $S = \sum_{i=1}^N X_i$ is sub-gaussian, and $\|S\|_{\psi_2} \le C_1 K$ where $C_1$ is an absolute constant. The proof is complete.
Let us rewrite Proposition 1.2.1 in a form that is often more useful in applications, namely as a concentration inequality. To do this, we simply use the first property in Proposition 1.1.1 for the sum $\sum_{i=1}^N X_i$. We immediately get the following.
Theorem 1.2.2 (General Hoeffding's inequality). Let $X_1, \ldots, X_N$ be independent, mean zero, sub-gaussian random variables. Then, for every $t \ge 0$ we have
$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big(-\frac{ct^2}{\sum_{i=1}^N \|X_i\|_{\psi_2}^2}\Big).$$
Hoeffding's inequality controls how far, and with what probability, a sum of independent random variables can deviate from its mean, which is zero.
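A small simulation can illustrate this kind of tail bound. The sketch below is not part of the lectures: it assumes Rademacher summands (values ±1 with equal probabilities) and compares the empirical tail with the classical explicit form $2\exp(-t^2/(2N))$ of Hoeffding's inequality for such sums. The theorem above leaves the constant $c$ unspecified, so the explicit constant is an assumption taken from the classical statement.

```python
import math
import random

def rademacher_sum_tail(n_terms, t, trials, rng):
    """Empirical estimate of P(|X_1 + ... + X_N| >= t) for Rademacher X_i."""
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n_terms))
        if abs(s) >= t:
            hits += 1
    return hits / trials

def hoeffding_bound(n_terms, t):
    """Classical Hoeffding bound 2 exp(-t^2 / (2N)) for Rademacher sums."""
    return 2.0 * math.exp(-t * t / (2.0 * n_terms))

rng = random.Random(0)   # fixed seed so the run is reproducible
N = 100
results = {t: (rademacher_sum_tail(N, t, 2000, rng), hoeffding_bound(N, t))
           for t in (10, 20, 30)}
for t, (empirical, bound) in results.items():
    print(f"t={t}: empirical tail {empirical:.4f}, Hoeffding bound {bound:.4f}")
```

The bound is not tight for small t (it exceeds 1 at t = 10), but it captures the gaussian-type decay of the tail as t grows.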

Sub-exponential distributions
Sub-gaussian distributions form a sufficiently wide class of distributions. Many results in probability and data science are proved nowadays for sub-gaussian random variables. Still, as we noted, there are some natural random variables that are not sub-gaussian. For example, the square $X^2$ of a normal random variable $X \sim N(0,1)$ is not sub-gaussian. (Check!) To cover examples like this, we will introduce a similar but weaker notion of sub-exponential distributions.
Proposition 1.3.1 (Sub-exponential properties). For a random variable X, the following properties are equivalent, in the same sense as in Proposition 1.1.1.
Tails: $P\{|X| \ge t\} \le 2\exp(-t/K_1)$ for all $t \ge 0$.
Moments: $\|X\|_p \le K_2\, p$ for all $p \ge 1$.
MGF of absolute value: $E\exp(|X|/K_3) \le 2$.
Moreover, if $E X = 0$ then these properties imply the following one:
MGF: $E\exp(\lambda X) \le \exp(\lambda^2 K_4^2)$ for $|\lambda| \le 1/K_4$.
Just like we did for sub-gaussian distributions, we call the best $K_3$ the sub-exponential norm of X and denote it by $\|X\|_{\psi_1}$, that is
$$\|X\|_{\psi_1} := \inf\{t > 0 : E\exp(|X|/t) \le 2\}.$$
Sub-exponential and sub-gaussian random variables are closely related: X is sub-gaussian exactly when $X^2$ is sub-exponential. Indeed, inspecting the definitions you will quickly see that
$$\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2. \tag{1.3.2}$$
(Check!)

Bernstein's inequality
A version of Hoeffding's inequality for sub-exponential random variables is called Bernstein's inequality. You may naturally expect to see a sub-exponential tail bound in this result. So it may come as a surprise that Bernstein's inequality actually has a mixture of two tails, sub-gaussian and sub-exponential. Let us state and prove the inequality first, and then we will comment on the mixture of the two tails.
Theorem 1.4.1 (Bernstein's inequality). Let $X_1, \ldots, X_N$ be independent, mean zero, sub-exponential random variables. Then, for every $t \ge 0$ we have
$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big[-c\min\Big(\frac{t^2}{\sum_{i=1}^N \|X_i\|_{\psi_1}^2},\ \frac{t}{\max_i \|X_i\|_{\psi_1}}\Big)\Big].$$
Proof. For simplicity, we will assume that K = 1 and only prove the one-sided bound (without absolute value); the general case is not much harder. Our approach will be based on bounding the moment generating function of the sum $S := \sum_{i=1}^N X_i$. To see how the MGF can be helpful here, choose $\lambda \ge 0$ and use Markov's inequality to get
$$P\{S \ge t\} = P\{\exp(\lambda S) \ge \exp(\lambda t)\} \le e^{-\lambda t}\, E\exp(\lambda S). \tag{1.4.2}$$
Recall that $S = \sum_{i=1}^N X_i$ and use independence to express the right side of (1.4.2) as
$$e^{-\lambda t}\prod_{i=1}^N E\exp(\lambda X_i).$$
(Check!) It remains to bound the MGF of each term $X_i$, and this is a much simpler task. If we choose $\lambda$ small enough so that
$$0 < \lambda \le \frac{c}{\max_i \|X_i\|_{\psi_1}}, \tag{1.4.3}$$
then we can use the last property in Proposition 1.3.1 to get
$$E\exp(\lambda X_i) \le \exp\big(C\lambda^2\|X_i\|_{\psi_1}^2\big).$$
Substituting this into (1.4.2), we conclude that
$$P\{S \ge t\} \le \exp\Big(-\lambda t + C\lambda^2\sum_{i=1}^N \|X_i\|_{\psi_1}^2\Big).$$
The left side does not depend on $\lambda$ while the right side does. So we can choose $\lambda$ that minimizes the right side subject to the constraint (1.4.3). When this is done carefully, we obtain the tail bound stated in Bernstein's inequality. (Do this!)
Now, why does Bernstein's inequality have a mixture of two tails? The sub-exponential tail should of course be there. Indeed, even if the entire sum consisted of a single term $X_i$, the best bound we could hope for would be of the form $\exp(-ct/\|X_i\|_{\psi_1})$. The sub-gaussian term could be explained by the central limit theorem, which states that the sum should become approximately normal as the number of terms N increases to infinity.
Remark 1.4.4 (Bernstein's inequality for bounded random variables). Suppose the random variables $X_i$ are uniformly bounded, which is a stronger assumption than being sub-gaussian. Then there is a useful version of Bernstein's inequality, which unlike Theorem 1.4.1 is sensitive to the variances of the $X_i$'s. It states that if $K > 0$ is such that $|X_i| \le K$ almost surely for all i, then, for every $t \ge 0$, we have
$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big(-\frac{t^2/2}{\sigma^2 + CKt}\Big). \tag{1.4.5}$$
Here $\sigma^2 = \sum_{i=1}^N E X_i^2$ is the variance of the sum. This version of Bernstein's inequality can be proved in essentially the same way as Theorem 1.4.1. We will not do it here, but a stronger Theorem 2.2.1, which is valid for matrix-valued random variables $X_i$, will be proved in Lecture 2.
To compare this with Theorem 1.4.1, note that $\sigma^2 + CKt \le 2\max(\sigma^2, CKt)$. So we can state the probability bound (1.4.5) as
$$P\Big\{\Big|\sum_{i=1}^N X_i\Big| \ge t\Big\} \le 2\exp\Big[-c\min\Big(\frac{t^2}{\sigma^2},\ \frac{t}{K}\Big)\Big].$$
Just like before, here we also have a mixture of two tails, sub-gaussian and sub-exponential. The sub-gaussian tail is a bit sharper than in Theorem 1.4.1, since it depends on the variances rather than the sub-gaussian norms of the $X_i$. The sub-exponential tail, on the other hand, is weaker, since it depends on the sup-norms rather than the sub-exponential norms of the $X_i$.
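The two regimes in the bound for bounded random variables can be explored numerically. The sketch below is an illustration under an assumption: it takes the constant in (1.4.5) to be $C = 1/3$, which is the classical form of Bernstein's inequality and matches the scalar case of Theorem 2.2.1. The function names are hypothetical.

```python
import math

def bernstein_bound(t, sigma2, K):
    """Classical form 2 exp(-(t^2/2) / (sigma^2 + K t / 3)) of bound (1.4.5)."""
    return 2.0 * math.exp(-(t * t / 2.0) / (sigma2 + K * t / 3.0))

def two_tail_bound(t, sigma2, K):
    """Weaker mixture form 2 exp(-min(t^2 / (4 sigma^2), 3 t / (4 K))).

    It follows from sigma^2 + K t / 3 <= 2 max(sigma^2, K t / 3).
    """
    return 2.0 * math.exp(-min(t * t / (4.0 * sigma2), 3.0 * t / (4.0 * K)))

sigma2, K = 100.0, 1.0   # e.g. N = 100 terms with variance 1 and |X_i| <= 1
for t in (5.0, 30.0, 300.0, 600.0):
    # The minimum switches from the gaussian to the exponential term at
    # t = 3 sigma^2 / K.
    regime = "sub-gaussian" if K * t / 3.0 <= sigma2 else "sub-exponential"
    print(t, regime, bernstein_bound(t, sigma2, K), two_tail_bound(t, sigma2, K))
```

For deviations t well below $3\sigma^2/K$ the bound behaves like a gaussian tail with variance $\sigma^2$; beyond that threshold the exponential tail, governed by K, takes over.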

Sub-gaussian random vectors
The concept of sub-gaussian distributions can be extended to higher dimensions. Consider a random vector X taking values in $R^n$. We call X a sub-gaussian random vector if all one-dimensional marginals of X, i.e. the random variables $\langle X, x\rangle$ for $x \in R^n$, are sub-gaussian. The sub-gaussian norm of X is defined as
$$\|X\|_{\psi_2} := \sup_{x \in S^{n-1}} \|\langle X, x\rangle\|_{\psi_2},$$
where $S^{n-1}$ denotes the unit Euclidean sphere in $R^n$.
Example 1.5.1. Examples of sub-gaussian distributions in $R^n$ include the standard normal distribution $N(0, I_n)$ (why?), the uniform distribution on the centered Euclidean sphere of radius $\sqrt{n}$, the uniform distribution on the cube $\{-1, 1\}^n$, and many others. The last example can be generalized: a random vector $X = (X_1, \ldots, X_n)$ with independent and sub-gaussian coordinates is sub-gaussian, with $\|X\|_{\psi_2} \le C\max_i \|X_i\|_{\psi_2}$.
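
Johnson-Lindenstrauss Lemma
Concentration inequalities like Hoeffding's and Bernstein's are successfully used in the analysis of algorithms. Let us give one example for the problem of dimension reduction. Suppose we have some data that is represented as a set of N points in $R^n$. (Think, for example, of n gene expressions of N patients.) We would like to compress the data by representing it in a lower dimensional space $R^m$ instead of $R^n$, with $m \ll n$. By how much can we reduce the dimension without losing the important features of the data?
The basic result in this direction is the Johnson-Lindenstrauss Lemma. It states that a remarkably simple dimension reduction method works: a random linear map from $R^n$ to $R^m$ with $m \sim \log N$, see Figure 1.6.3. The logarithmic function grows very slowly, so we can usually reduce the dimension dramatically. What exactly is a random linear map? Several models are possible. Here we will model such a map using a Gaussian random matrix, that is, an $m \times n$ matrix A with independent N(0, 1) entries. More generally, we can consider an $m \times n$ matrix A whose rows are independent, mean zero, isotropic (that is, $E XX^T = I_n$ for each row X) and sub-gaussian random vectors in $R^n$. For example, the entries of A can be independent Rademacher entries, those taking values ±1 with equal probabilities.
Theorem 1.6.1 (Johnson-Lindenstrauss Lemma). Let $\mathcal{X}$ be a set of N points in $R^n$ and $\varepsilon \in (0, 1)$. Consider an $m \times n$ matrix A whose rows $X_1, \ldots, X_m$ are independent, mean zero, isotropic and sub-gaussian random vectors in $R^n$. Rescale A by defining the "Gaussian random projection"
$$P := \frac{1}{\sqrt{m}} A.$$
Assume that
$$m \ge C\varepsilon^{-2}\log N,$$
where C is an appropriately large constant that depends only on the sub-gaussian norms of the vectors $X_i$. Then, with high probability (say, 0.99), the map P preserves the distances between all points in $\mathcal{X}$ with error $\varepsilon$, that is
$$(1-\varepsilon)\|x - y\|_2 \le \|Px - Py\|_2 \le (1+\varepsilon)\|x - y\|_2 \quad\text{for all } x, y \in \mathcal{X}. \tag{1.6.2}$$
Proof. Take a closer look at the desired conclusion (1.6.2). By linearity, $Px - Py = P(x - y)$. So, dividing the inequality by $\|x - y\|_2$, we can rewrite (1.6.2) in the following way:
$$1 - \varepsilon \le \|Pz\|_2 \le 1 + \varepsilon \quad\text{for all } z \in T, \tag{1.6.4}$$
where
$$T := \Big\{\frac{x - y}{\|x - y\|_2} : x, y \in \mathcal{X} \text{ distinct points}\Big\}.$$
It will be convenient to square the inequality (1.6.4). Using that $1 + \varepsilon \le (1+\varepsilon)^2$ and $1 - \varepsilon \ge (1-\varepsilon)^2$, we see that it is enough to show that
$$1 - \varepsilon \le \|Pz\|_2^2 \le 1 + \varepsilon \quad\text{for all } z \in T. \tag{1.6.5}$$
By construction, the coordinates of the vector $Pz = \frac{1}{\sqrt{m}}Az$ are $\frac{1}{\sqrt{m}}\langle X_i, z\rangle$. Thus we can restate (1.6.5) as
$$\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| \le \varepsilon \quad\text{for all } z \in T. \tag{1.6.6}$$
Results like (1.6.6) are often proved by combining concentration and a union bound. In order to use concentration, we first fix $z \in T$. By assumption, the random variables $\langle X_i, z\rangle^2 - 1$ are independent; they have zero mean (use isotropy to check this!), and they are sub-exponential (use (1.3.2) to check this). Then Bernstein's inequality (Theorem 1.4.1) gives
$$P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le 2\exp(-c\varepsilon^2 m).$$
(Check!)
Finally, we can unfix z by taking a union bound over all possible $z \in T$:
$$P\Big\{\max_{z \in T}\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le \sum_{z \in T} P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \langle X_i, z\rangle^2 - 1\Big| > \varepsilon\Big\} \le |T| \cdot 2\exp(-c\varepsilon^2 m). \tag{1.6.7}$$
By definition of T, we have $|T| \le N^2$. So, if we choose $m \ge C\varepsilon^{-2}\log N$ with an appropriately large constant C, we can make (1.6.7) bounded by 0.01. The proof is complete.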
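The Johnson-Lindenstrauss Lemma is easy to try out numerically. The following sketch is an illustration only: it assumes NumPy, uses a Gaussian random matrix, and picks the sizes (50 points, n = 1000, m = 400) arbitrarily rather than from the constant C in the theorem. The function names are hypothetical.

```python
import itertools
import numpy as np

def random_projection(points, m, rng):
    """Apply P = A / sqrt(m) to each row of `points`, A with N(0, 1) entries."""
    n = points.shape[1]
    A = rng.standard_normal((m, n))
    return points @ A.T / np.sqrt(m)

def max_distortion(points, projected):
    """Largest relative change | ||Px - Py|| / ||x - y|| - 1 | over all pairs."""
    worst = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        d_old = np.linalg.norm(points[i] - points[j])
        d_new = np.linalg.norm(projected[i] - projected[j])
        worst = max(worst, abs(d_new / d_old - 1.0))
    return worst

rng = np.random.default_rng(0)
points = rng.standard_normal((50, 1000))     # N = 50 points in R^1000
projected = random_projection(points, m=400, rng=rng)
eps = max_distortion(points, projected)
print(f"largest relative distortion over all pairs: {eps:.3f}")
```

Repeating the experiment with smaller m shows the distortion growing roughly like $1/\sqrt{m}$, in line with the requirement $m \ge C\varepsilon^{-2}\log N$.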

Notes
The material presented in Sections 1.1-1.5 is basic and can be found e.g. in [51] and [53] with all the proofs. Bernstein's and Hoeffding's inequalities that we covered here are two basic examples of concentration inequalities. There are many other useful concentration inequalities for sums of independent random variables (e.g. Chernoff's and Bennett's) and for more general objects. The textbook [53] is an elementary introduction to concentration; the books [10,33,34] offer more comprehensive and more advanced accounts of this area.
The original version of the Johnson-Lindenstrauss Lemma was proved in [26]. The version we gave here, Theorem 1.6.1, was stated with probability of success 0.99, but an inspection of the proof gives probability $1 - 2\exp(-c\varepsilon^2 m)$, which is much better for large m. A great variety of ramifications and applications of the Johnson-Lindenstrauss Lemma are known, see e.g. [2,4,7,10,29,37].

Lecture 2: Concentration of sums of independent random matrices
In the previous lecture we proved Bernstein's inequality, which quantifies how a sum of independent random variables concentrates about its mean. We will now study an extension of Bernstein's inequality to higher dimensions, which holds for sums of independent random matrices.

Matrix calculus
The key idea of developing a matrix Bernstein's inequality will be to use matrix calculus, which allows us to operate with matrices as with scalars: adding and multiplying them of course, but also comparing matrices and applying functions to matrices. Let us explain this.
We can compare matrices to each other using the notion of being positive semidefinite. Let us focus here on $n \times n$ symmetric matrices. If $A - B$ is a positive semidefinite matrix, which we denote $A - B \succeq 0$, then we say that $A \succeq B$ (and, of course, $B \preceq A$). This defines a partial order on the set of $n \times n$ symmetric matrices. The term "partial" indicates that, unlike the real numbers, there exist $n \times n$ symmetric matrices A and B that cannot be compared. (Give an example where neither $A \succeq B$ nor $B \succeq A$!) Next, let us guess how to measure the magnitude of a matrix A. The magnitude of a scalar $a \in R$ is measured by the absolute value $|a|$; it is the smallest non-negative number t such that $-t \le a \le t$.
Extending this reasoning to matrices, we can measure the magnitude of an $n \times n$ symmetric matrix A by the smallest non-negative number t such that
$$-tI \preceq A \preceq tI,$$
where I denotes the $n \times n$ identity matrix. The smallest t is called the operator norm of A and is denoted $\|A\|$. Diagonalizing A, we can see that
$$\|A\| = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}. \tag{2.1.1}$$
With a little more work (do it!), we can see that $\|A\|$ is the norm of A acting as a linear operator on $R^n$ equipped with the Euclidean norm $\|\cdot\|_2$; this is why $\|A\|$ is called the operator norm. Thus $\|A\|$ is the smallest non-negative number M such that $\|Ax\|_2 \le M\|x\|_2$ for all $x \in R^n$.
Finally, we will need to be able to take functions of matrices. Let $f : R \to R$ be a function and X be an $n \times n$ symmetric matrix. We can define f(X) in two equivalent ways. The spectral theorem allows us to represent X as
$$X = \sum_{i=1}^n \lambda_i u_i u_i^T,$$
where $\lambda_i$ are the eigenvalues of X and $u_i$ are the corresponding eigenvectors. Then we can simply define
$$f(X) := \sum_{i=1}^n f(\lambda_i) u_i u_i^T.$$
Note that f(X) has the same eigenvectors as X, but the eigenvalues change under the action of f. An equivalent way to define f(X) is using power series. Suppose the function f has a convergent power series expansion about some point $x_0 \in R$, i.e.
$$f(x) = \sum_{k=0}^\infty a_k (x - x_0)^k.$$
Then one can check that the following matrix series converges (for example, in the operator norm) and defines f(X):
$$f(X) = \sum_{k=0}^\infty a_k (X - x_0 I)^k.$$
(Check!)
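The two definitions of f(X) can be compared numerically. The sketch below assumes NumPy and uses the exponential function with expansion point $x_0 = 0$; the helper names `matrix_function` and `exp_power_series` are chosen for this illustration. It also checks the eigenvalue formula (2.1.1) for the operator norm against NumPy's spectral norm.

```python
import numpy as np

def matrix_function(f, X):
    """f(X) = sum_i f(lambda_i) u_i u_i^T for a symmetric matrix X."""
    eigvals, eigvecs = np.linalg.eigh(X)
    # Scaling column i of eigvecs by f(lambda_i), then multiplying by
    # eigvecs.T, assembles the sum of rank-one terms.
    return (eigvecs * f(eigvals)) @ eigvecs.T

def exp_power_series(X, terms=40):
    """Partial sum of sum_k X^k / k!, the power series of exp about x_0 = 0."""
    total = np.zeros_like(X)
    term = np.eye(X.shape[0])        # X^0 / 0!
    for k in range(terms):
        total = total + term
        term = term @ X / (k + 1)    # X^(k+1) / (k+1)!
    return total

X = np.array([[2.0, 1.0], [1.0, 0.0]])
exp_spectral = matrix_function(np.exp, X)
exp_series = exp_power_series(X)
operator_norm = np.max(np.abs(np.linalg.eigvalsh(X)))   # formula (2.1.1)
print(np.allclose(exp_spectral, exp_series), operator_norm)
```

The spectral definition is usually the more practical one for symmetric matrices, since it works for any function f defined on the eigenvalues, not only for analytic ones.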

Matrix Bernstein's inequality
We are now ready to state and prove a remarkable generalization of Bernstein's inequality for random matrices.
Theorem 2.2.1 (Matrix Bernstein's inequality). Let $X_1, \ldots, X_N$ be independent, mean zero, $n \times n$ symmetric random matrices, such that $\|X_i\| \le K$ almost surely for all i. Then, for every $t \ge 0$ we have
$$P\Big\{\Big\|\sum_{i=1}^N X_i\Big\| \ge t\Big\} \le 2n\exp\Big(-\frac{t^2/2}{\sigma^2 + Kt/3}\Big).$$
Here $\sigma^2 = \big\|\sum_{i=1}^N E X_i^2\big\|$ is the norm of the "matrix variance" of the sum.
The scalar case, where n = 1, is the classical Bernstein's inequality we stated in (1.4.5). A remarkable feature of matrix Bernstein's inequality, which makes it especially powerful, is that it does not require any independence of the entries (or the rows or columns) of $X_i$; all that is needed is that the random matrices $X_i$ be independent from each other.
We will prove matrix Bernstein's inequality and give a few applications in this and the next lecture.
Our proof will be based on bounding the moment generating function (MGF) $E\exp(\lambda S)$ of the sum $S = \sum_{i=1}^N X_i$. Note that to exponentiate the matrix $\lambda S$ in order to define the matrix MGF, we rely on the matrix calculus that we introduced in Section 2.1.
If the terms $X_i$ were scalars, independence would yield the classical fact that the MGF of a sum is the product of the MGFs, i.e.
$$E\exp(\lambda S) = E\prod_{i=1}^N e^{\lambda X_i} = \prod_{i=1}^N E e^{\lambda X_i}. \tag{2.2.2}$$
But for matrices, this reasoning breaks down badly, for in general
$$e^{X+Y} \ne e^X e^Y$$
even for $2 \times 2$ symmetric matrices X and Y. (Give a counterexample!) Fortunately, there are some trace inequalities that can often serve as proxies for the missing equality $e^{X+Y} = e^X e^Y$. One such proxy is the Golden-Thompson inequality, which states that
$$\mathrm{tr}(e^{X+Y}) \le \mathrm{tr}(e^X e^Y) \tag{2.2.3}$$
for any $n \times n$ symmetric matrices X and Y. Another result, which we will actually use in the proof of matrix Bernstein's inequality, is Lieb's inequality.
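The failure of $e^{X+Y} = e^X e^Y$ and the Golden-Thompson inequality can both be observed on a small example. The sketch below assumes NumPy; the two matrices are one possible choice of non-commuting symmetric matrices, picked for this illustration, and `sym_expm` is a hypothetical helper name.

```python
import numpy as np

def sym_expm(X):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(X)
    return (eigvecs * np.exp(eigvals)) @ eigvecs.T

# Two symmetric 2 x 2 matrices that do not commute.
X = np.array([[1.0, 0.0], [0.0, -1.0]])
Y = np.array([[0.0, 1.0], [1.0, 0.0]])

exp_of_sum = sym_expm(X + Y)                  # e^{X+Y}
product_of_exps = sym_expm(X) @ sym_expm(Y)   # e^X e^Y

print("commute:", np.allclose(X @ Y, Y @ X))
print("e^{X+Y} equals e^X e^Y:", np.allclose(exp_of_sum, product_of_exps))
print("tr e^{X+Y} =", np.trace(exp_of_sum),
      " tr e^X e^Y =", np.trace(product_of_exps))
```

For this pair the traces can also be computed by hand: $\mathrm{tr}\, e^{X+Y} = 2\cosh\sqrt{2}$ and $\mathrm{tr}(e^X e^Y) = 2\cosh^2 1$, and the first is indeed the smaller one.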
Theorem 2.2.4 (Lieb's inequality). Let H be an $n \times n$ symmetric matrix. Then the function
$$f(X) = \mathrm{tr}\exp(H + \log X)$$
is concave (footnote 8) on the set of $n \times n$ positive definite symmetric matrices.
Note that in the scalar case, where n = 1, the function f in Lieb's inequality is linear and the result is trivial.
To use Lieb's inequality in a probabilistic context, we will combine it with the classical Jensen's inequality. It states that for any concave function f and a random matrix X, one has (footnote 9)
$$E f(X) \le f(E X). \tag{2.2.5}$$
Using this for the function f in Lieb's inequality, we get
$$E\,\mathrm{tr}\exp(H + \log X) \le \mathrm{tr}\exp(H + \log E X).$$
And changing variables to $X = e^Z$, we get the following:
Lemma 2.2.6 (Lieb's inequality for random matrices). Let H be a fixed $n \times n$ symmetric matrix and Z be an $n \times n$ symmetric random matrix. Then
$$E\,\mathrm{tr}\exp(H + Z) \le \mathrm{tr}\exp(H + \log E e^Z).$$
Lieb's inequality is a perfect tool for bounding the MGF of a sum of independent random matrices $S = \sum_{i=1}^N X_i$. To do this, let us condition on the random matrices $X_1, \ldots, X_{N-1}$. Apply Lemma 2.2.6 for the fixed matrix $H := \sum_{i=1}^{N-1} \lambda X_i$ and the random matrix $Z := \lambda X_N$, and afterwards take expectation with respect to $X_1, \ldots, X_{N-1}$. By the law of total expectation, we get
$$E\,\mathrm{tr}\, e^{\lambda S} \le E\,\mathrm{tr}\exp\Big[\sum_{i=1}^{N-1} \lambda X_i + \log E e^{\lambda X_N}\Big].$$
Next, apply Lemma 2.2.6 in a similar manner for $H := \sum_{i=1}^{N-2} \lambda X_i + \log E e^{\lambda X_N}$ and $Z := \lambda X_{N-1}$, and so on. After N times, we obtain:
Lemma 2.2.7 (MGF of a sum of independent random matrices). Let $X_1, \ldots, X_N$ be independent $n \times n$ symmetric random matrices. Then the sum $S = \sum_{i=1}^N X_i$ satisfies
$$E\,\mathrm{tr}\, e^{\lambda S} \le \mathrm{tr}\exp\Big[\sum_{i=1}^N \log E e^{\lambda X_i}\Big].$$
8 Formally, concavity of f means that $f(\lambda X + (1-\lambda)Y) \ge \lambda f(X) + (1-\lambda) f(Y)$ for all matrices X and Y in the domain of f and all $\lambda \in [0, 1]$.
9 Jensen's inequality is usually stated for a convex function g and a scalar random variable X, and it reads $g(E X) \le E g(X)$. From this, inequality (2.2.5) for concave functions and random matrices easily follows. (Check!)
Think of this inequality as a matrix version of the scalar identity (2.2.2). The main difference is that it bounds the trace of the MGF rather than the MGF itself.
You may recall from a course in probability theory that the quantity $\log E e^{\lambda X_i}$ that appears in this bound is called the cumulant generating function of $X_i$. Lemma 2.2.7 reduces the complexity of our task significantly, for it is much easier to bound the cumulant generating function of each single random matrix $X_i$ than to say something about their sum. Here is a simple bound.
Lemma 2.2.8 (Moment generating function). Let X be an $n \times n$ symmetric random matrix. Assume that $E X = 0$ and $\|X\| \le K$ almost surely. Then, for all $0 < \lambda < 3/K$ we have
$$E\exp(\lambda X) \preceq \exp\big(g(\lambda)\, E X^2\big) \quad\text{where}\quad g(\lambda) = \frac{\lambda^2/2}{1 - \lambda K/3}.$$
Proof. First, check that the following scalar inequality holds for $0 < \lambda < 3/K$ and $|x| \le K$:
$$e^{\lambda x} \le 1 + \lambda x + g(\lambda) x^2.$$
Then extend it to matrices using matrix calculus: if $0 < \lambda < 3/K$ and $\|X\| \le K$ then
$$e^{\lambda X} \preceq I + \lambda X + g(\lambda) X^2.$$
(Do these two steps carefully!) Finally, take expectation and recall $E X = 0$ to obtain
$$E e^{\lambda X} \preceq I + g(\lambda)\, E X^2 \preceq \exp\big(g(\lambda)\, E X^2\big).$$
In the last inequality, we use the matrix version of the scalar inequality $1 + z \le e^z$ that holds for all $z \in R$. The lemma is proved.
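The scalar inequality at the start of this proof can be sanity-checked on a grid before proving it. The sketch below is a numerical check, not a proof; the value K = 2 and the grid resolution are arbitrary choices for this illustration, and the function names are hypothetical.

```python
import math

def g(lam, K):
    """g(lambda) = (lambda^2 / 2) / (1 - lambda K / 3), for 0 < lambda < 3/K."""
    return (lam * lam / 2.0) / (1.0 - lam * K / 3.0)

def scalar_gap(lam, x, K):
    """Right side minus left side of e^{lam x} <= 1 + lam x + g(lam) x^2."""
    return 1.0 + lam * x + g(lam, K) * x * x - math.exp(lam * x)

K = 2.0
lams = [3.0 / K * i / 50 for i in range(1, 50)]     # grid inside (0, 3/K)
xs = [-K + 2.0 * K * j / 100 for j in range(101)]   # grid on [-K, K]
worst_gap = min(scalar_gap(lam, x, K) for lam in lams for x in xs)
print("smallest gap on the grid:", worst_gap)       # attained at x = 0
```

The gap vanishes at x = 0, where both sides equal 1, and is positive elsewhere on the grid. A proof compares the Taylor series of $e^z$ with a geometric series, using $k! \ge 2 \cdot 3^{k-2}$ for $k \ge 2$.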
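Proof of matrix Bernstein's inequality. We would like to bound the operator norm of the random matrix $S = \sum_{i=1}^N X_i$, which, as we know from (2.1.1), is the largest eigenvalue of S by magnitude. For simplicity of exposition, let us drop the absolute value from (2.1.1) and just bound the maximal eigenvalue of S, which we denote $\lambda_{\max}(S)$. (Once this is done, we can repeat the argument for $-S$ to reinstate the absolute value. Do this!) So, for any $0 < \lambda < 3/K$, we are to bound
$$\begin{aligned}
P\{\lambda_{\max}(S) \ge t\} &= P\big\{e^{\lambda \cdot \lambda_{\max}(S)} \ge e^{\lambda t}\big\} && \text{(multiply by } \lambda > 0 \text{ and exponentiate)} \\
&\le e^{-\lambda t}\, E\, e^{\lambda \cdot \lambda_{\max}(S)} && \text{(by Markov's inequality)} \\
&= e^{-\lambda t}\, E\, \lambda_{\max}(e^{\lambda S}) && \text{(check!)} \\
&\le e^{-\lambda t}\, E\, \mathrm{tr}\, e^{\lambda S} && \text{(the maximal eigenvalue is bounded by the sum of the eigenvalues)} \\
&\le e^{-\lambda t}\, \mathrm{tr}\exp\Big[\sum_{i=1}^N \log E e^{\lambda X_i}\Big] && \text{(by Lemma 2.2.7)} \\
&\le e^{-\lambda t}\, \mathrm{tr}\exp[g(\lambda) Z] && \text{(by Lemma 2.2.8)}
\end{aligned}$$
where
$$Z := \sum_{i=1}^N E X_i^2.$$
Since Z is positive semidefinite with $\|Z\| = \sigma^2$, the trace of $\exp[g(\lambda)Z]$ is at most n times its maximal eigenvalue, so
$$P\{\lambda_{\max}(S) \ge t\} \le n\exp\big[-\lambda t + g(\lambda)\sigma^2\big].$$
It remains to choose $\lambda$ in this bound. A convenient choice, which satisfies $0 < \lambda < 3/K$, is $\lambda = t/(\sigma^2 + Kt/3)$. (Check!) Substituting this value for $\lambda$, we conclude
$$P\{\lambda_{\max}(S) \ge t\} \le n\exp\Big(-\frac{t^2/2}{\sigma^2 + Kt/3}\Big).$$
This completes the proof of Theorem 2.2.1.
Bernstein's inequality gives a powerful tail bound for $\big\|\sum_{i=1}^N X_i\big\|$. This easily implies a useful bound on the expectation:
Corollary 2.2.9 (Expected norm of sum of random matrices). Let $X_1, \ldots, X_N$ be independent, mean zero, $n \times n$ symmetric random matrices, such that $\|X_i\| \le K$ almost surely for all i. Then
$$E\Big\|\sum_{i=1}^N X_i\Big\| \lesssim \sigma\sqrt{\log n} + K\log n,$$
where $\sigma = \big\|\sum_{i=1}^N E X_i^2\big\|^{1/2}$ and the sign $\lesssim$ hides an absolute constant factor.
Proof. The link from tail bounds to expectation is provided by the basic identity
$$E Z = \int_0^\infty P\{Z > t\}\, dt, \tag{2.2.10}$$
which is valid for any non-negative random variable Z. (Check it!) Integrating the tail bound given by matrix Bernstein's inequality, you will arrive at the expectation bound we claimed. (Check!)
Notice in this corollary a mild, logarithmic, dependence on the ambient dimension n. As we will see shortly, this can be an important feature in some applications.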
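The last substitution in the proof of matrix Bernstein's inequality is a short computation that can be double-checked numerically. The sketch below evaluates the exponent $-\lambda t + g(\lambda)\sigma^2$ at $\lambda = t/(\sigma^2 + Kt/3)$ and compares it with the exponent in Theorem 2.2.1. The numerical values of t, $\sigma^2$ and K are arbitrary choices for this illustration, and the function names are hypothetical.

```python
def proof_exponent(lam, t, sigma2, K):
    """Exponent -lam t + g(lam) sigma^2 from the proof, for 0 < lam < 3/K."""
    g = (lam * lam / 2.0) / (1.0 - lam * K / 3.0)
    return -lam * t + g * sigma2

def bernstein_exponent(t, sigma2, K):
    """Closed-form exponent -(t^2/2) / (sigma^2 + K t / 3) of Theorem 2.2.1."""
    return -(t * t / 2.0) / (sigma2 + K * t / 3.0)

t, sigma2, K = 4.0, 10.0, 1.5
lam_star = t / (sigma2 + K * t / 3.0)          # the choice made in the proof
value_at_lam_star = proof_exponent(lam_star, t, sigma2, K)
print(lam_star, value_at_lam_star, bernstein_exponent(t, sigma2, K))
```

With these values $\lambda = 1/3$ lies inside $(0, 3/K) = (0, 2)$, and both exponents equal $-2/3$.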

Community recovery in networks
Matrix Bernstein's inequality has many applications. The one we are going to discuss first is for the analysis of networks. A network can be mathematically represented by a graph, a set of n vertices with edges connecting some of them. For simplicity, we will consider undirected graphs where the edges do not have arrows. Real world networks often tend to have clusters, or communities: subsets of vertices that are connected by unusually many edges. (Think, for example, about a friendship network where communities form around some common interests.) An important problem in data science is to recover communities from a given network.
We are going to explain one of the simplest methods for community recovery, which is called spectral clustering.But before we introduce it, we will first of all place a probabilistic model on the networks we consider.In other words, it will be convenient for us to view networks as random graphs whose edges are formed at random.Although not all real-world networks are truly random, this simplistic model can motivate us to develop algorithms that would empirically succeed also for real-world networks.
The basic probabilistic model of random graphs is the Erdős-Rényi model G(n, p), in which every pair of the n vertices is connected by an edge independently with the same fixed probability p. The Erdős-Rényi model is very simple. But it is not a good choice if we want to model a network with communities, for every pair of vertices has the same chance to be connected. So let us introduce a natural generalization of the Erdős-Rényi model that does allow for community structure:
Definition 2.3.2 (Stochastic block model). Partition a set of n vertices into two subsets ("communities") with n/2 vertices each, and connect every pair of vertices independently with probability p if they belong to the same community and $q < p$ if not. The resulting random graph is said to follow the stochastic block model G(n, p, q).
Suppose we are shown one instance of a random graph generated according to a stochastic block model G(n, p, q). How can we find which vertices belong to which community?
The spectral clustering algorithm we are going to explain will do precisely this. It will be based on the spectrum of the adjacency matrix A of the graph, which is the $n \times n$ symmetric matrix whose entries $A_{ij}$ equal 1 if the vertices i and j are connected by an edge, and 0 otherwise. The adjacency matrix A is a random matrix. Let us compute its expectation first. This is easy, since the entries of A are Bernoulli random variables. If i and j belong to the same community then $E A_{ij} = p$ and otherwise $E A_{ij} = q$. Thus $E A$ has block structure: for example, if n = 4 then $E A$ looks like this:
$$E A = \begin{pmatrix} p & p & q & q \\ p & p & q & q \\ q & q & p & p \\ q & q & p & p \end{pmatrix}.$$
(For illustration purposes, we grouped the vertices from each community together. In reality, we do not know in advance how to group them, but we do not need to.) You will easily check that $E A$ has rank 2, and the non-zero eigenvalues and the corresponding eigenvectors are
$$\lambda_1(E A) = \Big(\frac{p+q}{2}\Big) n, \quad v_1(E A) = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}; \qquad \lambda_2(E A) = \Big(\frac{p-q}{2}\Big) n, \quad v_2(E A) = \begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \end{pmatrix}. \tag{2.3.4}$$
The eigenvalues and eigenvectors of $E A$ tell us a lot about the community structure of the underlying graph. Indeed, the first (larger) eigenvalue,
$$d := \lambda_1(E A) = \Big(\frac{p+q}{2}\Big) n,$$
is the expected degree of any vertex of the graph. The second eigenvalue tells us whether there is any community structure at all (which happens when $p \ne q$ and thus $\lambda_2(E A) \ne 0$). The first eigenvector $v_1$ is not informative of the structure of the network at all. It is the second eigenvector $v_2$ that tells us exactly how to separate the vertices into the two communities: the signs of the coefficients of $v_2$ can be used for this purpose.
Thus if we know $E A$, we can recover the community structure of the network from the signs of the second eigenvector. The problem is that we do not know $E A$. Instead, we know the adjacency matrix A. And if, by some chance, A is not far from $E A$, we may hope to use A to approximately recover the community structure. So is it true that $A \approx E A$? The answer is yes, and we can prove it using matrix Bernstein's inequality.
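The claims about the spectrum of $E A$ are easy to confirm numerically. The sketch below assumes NumPy and uses arbitrary illustrative values n = 8, p = 0.6, q = 0.2, with the first n/2 vertices forming one community. The helper name `expected_adjacency` is hypothetical.

```python
import numpy as np

def expected_adjacency(n, p, q):
    """E A for the stochastic block model; the first n/2 vertices form one community."""
    labels = np.repeat([1, -1], n // 2)
    same_community = labels[:, None] == labels[None, :]
    return np.where(same_community, p, q), labels

n, p, q = 8, 0.6, 0.2
EA, labels = expected_adjacency(n, p, q)
eigvals, eigvecs = np.linalg.eigh(EA)    # eigenvalues in ascending order
lam1, lam2 = eigvals[-1], eigvals[-2]    # the two largest eigenvalues
v2 = eigvecs[:, -2]                      # eigenvector for the second largest

print("rank:", np.linalg.matrix_rank(EA))
print("lambda_1:", lam1, "expected:", (p + q) * n / 2)
print("lambda_2:", lam2, "expected:", (p - q) * n / 2)
print("signs of v_2:", np.sign(v2))
```

The signs of $v_2$ are constant on each community and opposite across communities. The overall sign is arbitrary, since eigenvectors are only defined up to sign.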

Theorem 2.3.5 (Concentration of the stochastic block model). Let A be the adjacency matrix of a G(n, p, q) random graph. Then
$$E\|A - E A\| \lesssim \sqrt{d\log n} + \log n.$$
Here $d = (p + q)n/2$ is the expected degree.
Proof. Let us sketch the argument. To use matrix Bernstein's inequality, let us break A into a sum of independent random matrices
$$A = \sum_{i,j:\ i \le j} X_{ij},$$
where each matrix $X_{ij}$ contains a pair of symmetric entries of A, or one diagonal entry (footnote 13). Matrix Bernstein's inequality obviously applies for the sum
$$A - E A = \sum_{i \le j} (X_{ij} - E X_{ij}).$$
Corollary 2.2.9 gives (footnote 14)
$$E\|A - E A\| \lesssim \sigma\sqrt{\log n} + K\log n, \tag{2.3.6}$$
where $\sigma^2 = \big\|\sum_{i \le j} E(X_{ij} - E X_{ij})^2\big\|$ and $K = \max_{ij}\|X_{ij} - E X_{ij}\|$. It is a good exercise to check that $\sigma^2 \lesssim d$ and $K \le 2$. Substituting these bounds into (2.3.6) completes the proof.
How useful is Theorem 2.3.5 for community recovery? Suppose that the network is not too sparse, namely $d \gg \log n$. Then Theorem 2.3.5 gives $E\|A - E A\| \lesssim \sqrt{d\log n}$, while $\|E A\| = \lambda_1(E A) = d$, so that in expectation $\|A - E A\| \ll \|E A\|$.
In other words, A nicely approximates $E A$: the relative error of approximation is small in the operator norm. At this point one can apply classical results from the perturbation theory for matrices, which state that since A and $E A$ are close, their eigenvalues and eigenvectors must also be close. The relevant perturbation results are Weyl's inequality for eigenvalues and Davis-Kahan's inequality for eigenvectors, which we will not reproduce here. Heuristically, what they give us is
$$v_2(A) \approx v_2(E A) = (1, \ldots, 1, -1, \ldots, -1)^T.$$
13 Precisely, if $i \ne j$, then $X_{ij}$ has all zero entries except the (i, j) and (j, i) entries, which equal $A_{ij}$. If $i = j$, the only non-zero entry of $X_{ij}$ is the (i, i) entry.
14 We will liberally use the notation $\lesssim$ to hide constant factors appearing in the inequalities. Thus, $a \lesssim b$ means that $a \le Cb$ for some constant C.
Then we should expect that most of the coefficients of $v_2(A)$ are positive on one community and negative on the other. So we can use $v_2(A)$ to approximately recover the communities. This method is called spectral clustering:
Spectral Clustering Algorithm. Compute $v_2(A)$, the eigenvector corresponding to the second largest eigenvalue of the adjacency matrix A of the network. Use the signs of the coefficients of $v_2(A)$ to predict the community membership of the vertices.
We saw that spectral clustering should perform well for the stochastic block model G(n, p, q) if it is not too sparse, namely if the expected degrees satisfy $d = (p + q)n/2 \gg \log n$. A more careful analysis along these lines, which you should be able to do yourself with some work, leads to the following more rigorous result.
Theorem 2.3.7 (Guarantees of spectral clustering). Consider a random graph generated according to the stochastic block model G(n, p, q) with $p > q$, and set $a = pn$, $b = qn$. Suppose that
$$(a - b)^2 \gg \log(n)\,(a + b).$$
Then, with high probability, the spectral clustering algorithm recovers the communities up to o(n) misclassified vertices.
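The Spectral Clustering Algorithm takes only a few lines to try out on simulated data. The sketch below is an illustration under stated assumptions: it uses NumPy, a fairly dense model with n = 200, p = 0.6, q = 0.1 (values chosen for this example), and allows self-loops on the diagonal so that all entries with $i \le j$ are independent. The function names are hypothetical.

```python
import numpy as np

def sample_sbm(n, p, q, rng):
    """Adjacency matrix of G(n, p, q); the first n/2 vertices form one community."""
    labels = np.repeat([1, -1], n // 2)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    # Independent Bernoulli entries for i <= j, then symmetrize.
    upper = np.triu((rng.random((n, n)) < probs).astype(float))
    A = upper + np.triu(upper, 1).T
    return A, labels

def spectral_clustering(A):
    """Signs of the eigenvector for the second largest eigenvalue of A."""
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    v2 = eigvecs[:, -2]
    return np.where(v2 >= 0, 1, -1)

rng = np.random.default_rng(0)
A, labels = sample_sbm(n=200, p=0.6, q=0.1, rng=rng)
predicted = spectral_clustering(A)
# Communities are only identified up to swapping the two labels.
accuracy = max(np.mean(predicted == labels), np.mean(predicted == -labels))
print(f"fraction of correctly classified vertices: {accuracy:.3f}")
```

Making the graph sparser (smaller p and q with the same n) degrades the accuracy, which is the regime where the condition $d \gg \log n$ starts to matter.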

Notes
The idea to extend concentration inequalities like Bernstein's to matrices goes back to R. Ahlswede and A. Winter [3]. They used the Golden-Thompson inequality (2.2.3) and proved a slightly weaker form of matrix Bernstein's inequality than we gave in Section 2.2. R. Oliveira [42,43] found a way to improve this argument and gave a result similar to Theorem 2.2.1. The version of matrix Bernstein's inequality we gave here (Theorem 2.2.1) and a proof based on Lieb's inequality is due to J. Tropp [45].
The survey [46] contains a comprehensive introduction to matrix calculus, a proof of Lieb's inequality (Theorem 2.2.4), a detailed proof of matrix Bernstein's inequality (Theorem 2.2.1) and a variety of applications. A proof of the Golden-Thompson inequality (2.2.3) can be found in [8, Theorem 9.3.7].
In Section 2.3 we scratched the surface of an interdisciplinary area of network analysis. For a systematic introduction to networks, refer to the book [41]. Stochastic block models (Definition 2.3.2) were introduced in [28]. The community recovery problem in stochastic block models, sometimes also called the community detection problem, has been in the spotlight in the last few years. A vast and still growing body of literature exists on algorithms and theoretical results for community recovery, see the book [41], the survey [21], papers such as [9,24,25,27,32,40,54] and the references therein.
A concentration result similar to Theorem 2.3.5 can be found in [42]; the argument there is also based on matrix concentration. This theorem is not quite optimal. For dense networks, where the expected degree d satisfies $d \gtrsim \log n$, the concentration inequality in Theorem 2.3.5 can be improved to
$$E\|A - E A\| \lesssim \sqrt{d}. \tag{2.4.1}$$
This improved bound goes back to the original paper [20], which studies the simpler Erdős-Rényi model, but the results extend to stochastic block models [16]; it can also be deduced from [6,27,32].

Lecture 3: Covariance estimation and matrix completion
In the last lecture, we proved matrix Bernstein's inequality and gave an application for network analysis. We will spend this lecture discussing a couple of other interesting applications of matrix Bernstein's inequality. In Section 3.1 we will work on covariance estimation, a basic problem in high-dimensional statistics. In Section 3.2, we will derive a useful bound on norms of random matrices, which unlike Bernstein's inequality does not require any boundedness assumptions on the distribution. We will apply this bound in Section 3.3 for a problem of matrix completion, where we are shown a small sample of the entries of a matrix and asked to guess the missing entries.

Covariance estimation
Covariance estimation is a problem of fundamental importance in high-dimensional statistics. Suppose we have a sample of data points $X_1, \ldots, X_N$ in $\mathbb{R}^n$. It is often reasonable to assume that these points are independently sampled from the same probability distribution (or "population") which is unknown. We would like to learn something useful about this distribution.
Denote by $X$ a random vector that has this (unknown) distribution. The most basic parameter of the distribution is the mean $\mathbb{E}X$. One can estimate $\mathbb{E}X$ from the sample by computing the sample mean $\frac{1}{N}\sum_{i=1}^N X_i$. The law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity, i.e.
$$\frac{1}{N}\sum_{i=1}^N X_i \to \mathbb{E}X \quad \text{as } N \to \infty.$$
The next most basic parameter of the distribution is the covariance matrix
$$\Sigma := \mathbb{E}(X - \mathbb{E}X)(X - \mathbb{E}X)^T.$$
This is a higher-dimensional version of the usual notion of variance of a random variable $Z$, which is
$$\operatorname{Var}(Z) = \mathbb{E}(Z - \mathbb{E}Z)^2.$$
The eigenvectors of the covariance matrix $\Sigma$ are called the principal components.
Principal components that correspond to large eigenvalues of $\Sigma$ are the directions in which the distribution of $X$ is most extended, see Figure 3.1.1. These are often the most interesting directions in the data. Practitioners often visualize high-dimensional data by projecting it onto the span of a few (maybe two or three) of such principal components; the projection may reveal some hidden structure of the data. This method is called Principal Component Analysis (PCA). One can estimate the covariance matrix $\Sigma$ from the sample by computing the sample covariance
$$\Sigma_N := \frac{1}{N}\sum_{i=1}^N (X_i - \mathbb{E}X_i)(X_i - \mathbb{E}X_i)^T.$$
Again, the law of large numbers guarantees that the estimate becomes tight as the sample size $N$ grows to infinity, i.e.
$$\Sigma_N \to \Sigma \quad \text{as } N \to \infty.$$
But how large should the sample size $N$ be for covariance estimation? Generally, one cannot have $N < n$ for dimension reasons. (Why?) We are going to show that $N \sim n \log n$ is enough. In other words, covariance estimation is possible with just logarithmic oversampling.
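For simplicity, we shall state the covariance estimation bound for mean zero distributions. (If the mean is not zero, we can estimate it from the sample and subtract.)

Theorem 3.1.2 (Covariance estimation). Let $X$ be a random vector in $\mathbb{R}^n$ with covariance matrix $\Sigma$. Suppose that
$$\|X\|_2^2 \lesssim \mathbb{E}\|X\|_2^2 = \operatorname{tr}\Sigma \quad \text{almost surely.} \tag{3.1.3}$$
Then, for every $N \ge 1$, we have
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\| \Big( \sqrt{\frac{n \log n}{N}} + \frac{n \log n}{N} \Big).$$

Before we pass to the proof, let us note that Theorem 3.1.2 yields the covariance estimation result we promised. Let $\varepsilon \in (0, 1)$. If we take a sample of size
$$N \sim \varepsilon^{-2} n \log n,$$
then we are guaranteed covariance estimation with a good relative error:
$$\mathbb{E}\|\Sigma_N - \Sigma\| \le \varepsilon \|\Sigma\|.$$

Proof. Apply matrix Bernstein's inequality (Corollary 2.2.9) for the sum of independent random matrices $X_i X_i^T - \Sigma$ and get
$$\mathbb{E}\|\Sigma_N - \Sigma\| = \frac{1}{N}\, \mathbb{E}\Big\|\sum_{i=1}^N (X_i X_i^T - \Sigma)\Big\| \lesssim \frac{1}{N}\big(\sigma\sqrt{\log n} + K \log n\big), \tag{3.1.4}$$
where
$$\sigma^2 = \Big\|\sum_{i=1}^N \mathbb{E}(X_i X_i^T - \Sigma)^2\Big\| = N\,\big\|\mathbb{E}(XX^T - \Sigma)^2\big\|$$
and $K$ is chosen so that $\|XX^T - \Sigma\| \le K$ almost surely. It remains to bound $\sigma$ and $K$. Expanding the square and using the boundedness condition (3.1.3) gives $\sigma^2 \lesssim N \operatorname{tr}(\Sigma)\|\Sigma\|$, while the triangle inequality and (3.1.3) give $\|XX^T - \Sigma\| \le \|X\|_2^2 + \|\Sigma\| \lesssim \operatorname{tr}\Sigma$, so we can take $K \sim \operatorname{tr}\Sigma$. Substituting these bounds into (3.1.4) and using that $\operatorname{tr}\Sigma \le n\|\Sigma\|$ completes the proof.

Remark 3.1.5 (Low-dimensional distributions). The proof of Theorem 3.1.2 actually gives
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\| \Big( \sqrt{\frac{r \log n}{N}} + \frac{r \log n}{N} \Big),$$
where $r = \operatorname{tr}(\Sigma)/\|\Sigma\|$ is the stable rank of $\Sigma^{1/2}$. (Check this!) Therefore, covariance estimation is possible with $N \sim r \log n$ samples.

Remark 3.1.6 (The boundedness condition). It is a good exercise to check that if we remove the boundedness condition (3.1.3), a nontrivial covariance estimation is impossible in general. (Show this!) But how do we know whether the boundedness condition holds for data at hand? We may not, but we can enforce this condition by truncation. All we have to do is to discard 1% of data points with largest norms. (Check this accurately, assuming that such truncation does not change the covariance significantly.)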
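To see the sample size requirement at work, here is a minimal numerical sketch. It is an illustration only: the test distribution (a diagonal covariance with random sign coordinates, chosen so that the squared norm of every sample equals the trace of the covariance and the boundedness condition holds with equality), the function names and the chosen sizes are all assumptions made for this example.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance (1/N) * sum_i X_i X_i^T for mean-zero rows X_i."""
    N = X.shape[0]
    return X.T @ X / N

def relative_error(n, N, rng):
    """Relative operator-norm error of the sample covariance."""
    # Hypothetical test distribution: X = sqrt(diag) * (random signs), so
    # E X X^T = diag(diag) and ||X||_2^2 = tr(Sigma) for every sample.
    diag = np.linspace(1.0, 2.0, n)
    signs = rng.choice([-1.0, 1.0], size=(N, n))
    X = signs * np.sqrt(diag)
    Sigma = np.diag(diag)
    Sigma_N = sample_covariance(X)
    # ord=2 is the operator (spectral) norm.
    return np.linalg.norm(Sigma_N - Sigma, 2) / np.linalg.norm(Sigma, 2)

rng = np.random.default_rng(0)
n = 50
errors = {N: relative_error(n, N, rng) for N in (100, 1000, 10000)}
for N, err in errors.items():
    print(f"N = {N:6d}   relative error = {err:.3f}")
```

The printed errors should shrink as N grows past n log n, in line with the square-root rate in the theorem; the exact values depend on the random seed.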

Norms of random matrices
We have worked a lot with the operator norm of matrices, denoted $\|A\|$. One may ask if there is a formula that expresses $\|A\|$ in terms of the entries $A_{ij}$. Unfortunately, there is no such formula. The operator norm is a more difficult quantity in this respect than the Frobenius norm, which as we know can be easily expressed in terms of entries: $\|A\|_F = (\sum_{i,j} A_{ij}^2)^{1/2}$. If we cannot express $\|A\|$ in terms of the entries, can we at least get a good estimate? Let us consider $n \times n$ symmetric matrices for simplicity. In one direction, $\|A\|$ is always bounded below by the largest Euclidean norm of the rows $A_i$:
$$\|A\| \ge \max_i \|A_i\|_2 = \max_i \Big(\sum_j A_{ij}^2\Big)^{1/2}.$$
(Check!) Unfortunately, this bound is sometimes very loose, and the best possible upper bound is
$$\|A\| \le \sqrt{n} \cdot \max_i \|A_i\|_2. \tag{3.2.2}$$
(Show this bound, and give an example where it is sharp.) Fortunately, for random matrices with independent entries the bound (3.2.2) can be improved to the point where the upper and lower bounds almost match.

15 The Frobenius norm of an $n \times m$ matrix, sometimes also called Hilbert-Schmidt norm, is defined as $\|A\|_F = (\sum_{i=1}^n \sum_{j=1}^m A_{ij}^2)^{1/2}$. Equivalently, for an $n \times n$ symmetric matrix, $\|A\|_F = (\sum_{i=1}^n \lambda_i(A)^2)^{1/2}$, where $\lambda_i(A)$ are the eigenvalues of $A$. Thus the stable rank of $A$ can be expressed as $r(A) = \sum_{i=1}^n \lambda_i(A)^2 / \max_i \lambda_i(A)^2$.

Theorem 3.2.3 (Norms of random matrices without boundedness assumptions). Let $A$ be an $n \times n$ symmetric random matrix whose entries on and above the diagonal are independent, mean zero random variables. Then
$$\mathbb{E}\max_i \|A_i\|_2 \le \mathbb{E}\|A\| \le C \log n \cdot \mathbb{E}\max_i \|A_i\|_2,$$
where $A_i$ denote the rows of $A$.
In words, the operator norm of a random matrix is almost determined by the norm of the rows.
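Our proof of this result will be based on matrix Bernstein's inequality, more precisely on Corollary 2.2.9. There is one surprising point. How can we use matrix Bernstein's inequality, which applies only for bounded distributions, to prove a result like Theorem 3.2.3 that does not have any boundedness assumptions? We will do this using a trick based on conditioning and symmetrization. Let us introduce this technique first.

Lemma 3.2.4 (Symmetrization). Let $X_1, \ldots, X_N$ be independent, mean zero random vectors in a normed space, and let $\varepsilon_1, \ldots, \varepsilon_N$ be independent Rademacher random variables that are also independent of the $X_i$. Then
$$\frac{1}{2}\, \mathbb{E}\Big\|\sum_{i=1}^N \varepsilon_i X_i\Big\| \le \mathbb{E}\Big\|\sum_{i=1}^N X_i\Big\| \le 2\, \mathbb{E}\Big\|\sum_{i=1}^N \varepsilon_i X_i\Big\|.$$

Proof. To prove the upper bound, let $(X_i')$ be an independent copy of the random vectors $(X_i)$. Then
$$\mathbb{E}\Big\|\sum_i X_i\Big\| = \mathbb{E}\Big\|\sum_i X_i - \mathbb{E}\Big(\sum_i X_i'\Big)\Big\| \quad \text{(since } \mathbb{E}\textstyle\sum_i X_i' = 0)$$
$$\le \mathbb{E}\Big\|\sum_i X_i - \sum_i X_i'\Big\| \quad \text{(by Jensen's inequality)}$$
$$= \mathbb{E}\Big\|\sum_i (X_i - X_i')\Big\|.$$
The distribution of the random vectors $Y_i := X_i - X_i'$ is symmetric, which means that the distributions of $Y_i$ and $-Y_i$ are the same. (Why?) Thus the distributions of the random vectors $Y_i$ and $\varepsilon_i Y_i$ are also the same, for all we do is change the signs of these vectors at random and independently of the values of the vectors. Summarizing, we can replace $X_i - X_i'$ in the sum above with $\varepsilon_i (X_i - X_i')$. Thus
$$\mathbb{E}\Big\|\sum_i X_i\Big\| \le \mathbb{E}\Big\|\sum_i \varepsilon_i (X_i - X_i')\Big\|$$
$$\le \mathbb{E}\Big\|\sum_i \varepsilon_i X_i\Big\| + \mathbb{E}\Big\|\sum_i \varepsilon_i X_i'\Big\| \quad \text{(by triangle inequality)}$$
$$= 2\, \mathbb{E}\Big\|\sum_i \varepsilon_i X_i\Big\| \quad \text{(the two sums have the same distribution).}$$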
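This proves the upper bound in the symmetrization inequality. The lower bound can be proved by a similar argument. (Do this!)

Proof of Theorem 3.2.3. We already discussed the lower bound in Theorem 3.2.3. The proof of the upper bound will be based on matrix Bernstein's inequality. First, we decompose $A$ in the same way as we did in the proof of Theorem 2.3.5. Thus we represent $A$ as a sum of independent, mean zero, symmetric random matrices $Z_{ij}$, each of which contains a pair of symmetric entries of $A$ (or one diagonal entry):
$$A = \sum_{i \le j} Z_{ij}, \qquad Z_{ij} := \begin{cases} A_{ij}(e_i e_j^T + e_j e_i^T), & i < j, \\ A_{ii}\, e_i e_i^T, & i = j. \end{cases}$$
Apply the symmetrization inequality (Lemma 3.2.4) for the random matrices $Z_{ij}$ and get
$$\mathbb{E}\|A\| = \mathbb{E}\Big\|\sum_{i \le j} Z_{ij}\Big\| \le 2\, \mathbb{E}\Big\|\sum_{i \le j} X_{ij}\Big\|, \tag{3.2.5}$$
where we set $X_{ij} := \varepsilon_{ij} Z_{ij}$ and $\varepsilon_{ij}$ are independent Rademacher random variables. Now we condition on $A$. The random variables $Z_{ij}$ become fixed values and all randomness remains in the Rademacher random variables $\varepsilon_{ij}$. Note that $X_{ij}$ are (conditionally) bounded almost surely, and this is exactly what we have lacked to apply matrix Bernstein's inequality. Now we can do it. Writing $\mathbb{E}_\varepsilon$ for the expectation with respect to the $\varepsilon_{ij}$ only (that is, conditional on $A$), Corollary 2.2.9 gives
$$\mathbb{E}_\varepsilon \Big\|\sum_{i \le j} X_{ij}\Big\| \lesssim \sigma\sqrt{\log n} + K \log n, \tag{3.2.6}$$
where $\sigma^2 = \big\|\sum_{i \le j} \mathbb{E}_\varepsilon X_{ij}^2\big\|$ and $K = \max_{i \le j} \|X_{ij}\|$. A good exercise is to check that
$$\sigma \lesssim \max_i \|A_i\|_2 \quad \text{and} \quad K \lesssim \max_i \|A_i\|_2.$$
(Do it!) Substituting into (3.2.6), we get
$$\mathbb{E}_\varepsilon \Big\|\sum_{i \le j} X_{ij}\Big\| \lesssim \log n \cdot \max_i \|A_i\|_2.$$
Finally, we unfix $A$ by taking expectation of both sides of this inequality with respect to $A$ and using the law of total expectation. Together with (3.2.5), this completes the proof.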
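We stated Theorem 3.2.3 for symmetric matrices, but it is simple to extend it to general $m \times n$ random matrices $A$. The bound in this case becomes
$$\mathbb{E}\|A\| \le C \log(m + n) \cdot \Big( \mathbb{E}\max_i \|A_i\|_2 + \mathbb{E}\max_j \|A^j\|_2 \Big), \tag{3.2.7}$$
where $A_i$ and $A^j$ denote the rows and columns of $A$. To see this, apply Theorem 3.2.3 to the $(m + n) \times (m + n)$ symmetric random matrix $\begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$.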
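The two-sided comparison between the operator norm and the largest row norm is easy to probe numerically. The following sketch assumes standard Gaussian entries and a particular size, which are choices made only for illustration; the helper name is hypothetical.

```python
import numpy as np

def symmetric_random_matrix(n, rng):
    """Symmetric matrix whose entries on and above the diagonal are independent N(0, 1)."""
    G = rng.standard_normal((n, n))
    upper = np.triu(G)                  # entries on and above the diagonal
    return upper + np.triu(G, 1).T      # mirror the strictly upper part

rng = np.random.default_rng(1)
n = 200
A = symmetric_random_matrix(n, rng)

op_norm = np.linalg.norm(A, 2)                # operator norm
max_row = np.linalg.norm(A, axis=1).max()     # largest Euclidean norm of a row

print(f"operator norm        = {op_norm:.2f}")
print(f"max row norm         = {max_row:.2f}")
print(f"ratio                = {op_norm / max_row:.2f}")
print(f"sqrt(n), log(n)      = {np.sqrt(n):.2f}, {np.log(n):.2f}")
```

For Gaussian entries the ratio stays bounded by a small constant, far below the deterministic worst case of order square root of n.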

Matrix completion
Consider a fixed, unknown $n \times n$ matrix $X$. Suppose we are shown $m$ randomly chosen entries of $X$. Can we guess all the missing entries? This important problem is called matrix completion. We will analyze it using the bounds on the norms of random matrices we just obtained. Obviously, there is no way to guess the missing entries unless we know something extra about the matrix $X$. So let us assume that $X$ has low rank:
$$\operatorname{rank}(X) =: r \ll n.$$
The number of degrees of freedom of an $n \times n$ matrix with rank $r$ is $O(rn)$. (Why?) So we may hope that
$$m \sim rn \tag{3.3.1}$$
observed entries of $X$ will be enough to determine $X$ completely. But how? Here we will analyze what is probably the simplest method for matrix completion. Take the matrix $Y$ that consists of the observed entries of $X$ while all unobserved entries are set to zero. Unlike $X$, the matrix $Y$ may not have small rank. Compute the best rank $r$ approximation (see footnote 17) of $Y$. The result, as we will show, will be a good approximation to $X$.
But before we show this, let us define sampling of entries more rigorously. Assume each entry of $X$ is shown or hidden independently of others with fixed probability $p$. Which entries are shown is decided by independent Bernoulli random variables
$$\delta_{ij} \sim \mathrm{Ber}(p) \quad \text{with} \quad p := \frac{m}{n^2},$$
which are often called selectors in this context. The value of $p$ is chosen so that among the $n^2$ entries of $X$, the expected number of selected (known) entries is $m$. Define the $n \times n$ matrix $Y$ with entries
$$Y_{ij} := \delta_{ij} X_{ij}.$$

17 The best rank $r$ approximation of an $n \times n$ matrix $A$ is a matrix $B$ of rank $r$ that minimizes the operator norm $\|A - B\|$ or, alternatively, the Frobenius norm $\|A - B\|_F$ (the minimizer turns out to be the same). One can compute $B$ by truncating the singular value decomposition $A = \sum_{i=1}^n s_i u_i v_i^T$ of $A$ as follows: $B = \sum_{i=1}^r s_i u_i v_i^T$, where we assume that the singular values $s_i$ are arranged in non-increasing order.
We can assume that we are shown $Y$, for it is a matrix that contains the observed entries of $X$ while all unobserved entries are replaced with zeros. The following result shows how to estimate $X$ based on $Y$.

Theorem 3.3.2 (Matrix completion). Let $\hat{X}$ be a best rank $r$ approximation to $p^{-1} Y$. Then
$$\mathbb{E}\, \frac{1}{n} \|\hat{X} - X\|_F \le C \log(n) \sqrt{\frac{rn}{m}}\, \|X\|_\infty, \tag{3.3.3}$$
as long as $m \ge n \log n$. Here $\|X\|_\infty = \max_{i,j} |X_{ij}|$ denotes the maximum magnitude of the entries of $X$.
Before we prove this result, let us understand what this bound says about the quality of matrix completion. The recovery error is measured in the Frobenius norm, and the left side of (3.3.3) is
$$\frac{1}{n} \|\hat{X} - X\|_F = \Big( \frac{1}{n^2} \sum_{i,j=1}^n |\hat{X}_{ij} - X_{ij}|^2 \Big)^{1/2}.$$
Thus Theorem 3.3.2 controls the average error per entry in the mean-squared sense.
To make the error small, let us assume that we have a sample of size
$$m \gg rn \log^2 n,$$
which is slightly larger than the ideal size we discussed in (3.3.1). This makes $C \log(n) \sqrt{rn/m} = o(1)$ and forces the recovery error to be bounded by $o(1) \|X\|_\infty$. Summarizing, Theorem 3.3.2 says that the expected average error per entry is much smaller than the maximal magnitude of the entries of $X$. This is true for a sample of almost optimal size $m$. The smaller the rank $r$ of the matrix $X$, the fewer entries of $X$ we need to see in order to do matrix completion.
Proof of Theorem 3.3.2.
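Step 1: The error in the operator norm. Let us first bound the recovery error in the operator norm. Decompose the error into two parts using triangle inequality:
$$\|\hat{X} - X\| \le \|\hat{X} - p^{-1} Y\| + \|p^{-1} Y - X\|.$$
Recall that $\hat{X}$ is a best rank $r$ approximation to $p^{-1} Y$, while $X$ itself has rank $r$. Then the first part of the error is smaller than the second part, i.e. $\|\hat{X} - p^{-1} Y\| \le \|p^{-1} Y - X\|$, and we have
$$\|\hat{X} - X\| \le 2 \|p^{-1} Y - X\| = \frac{2}{p} \|Y - pX\|. \tag{3.3.4}$$
The entries of the matrix $Y - pX$,
$$(Y - pX)_{ij} = (\delta_{ij} - p) X_{ij},$$
are independent and mean zero random variables. Thus we can apply the bound (3.2.7) on the norms of random matrices and get
$$\mathbb{E}\|Y - pX\| \le C \log n \cdot \Big( \mathbb{E}\max_i \|(Y - pX)_i\|_2 + \mathbb{E}\max_j \|(Y - pX)^j\|_2 \Big). \tag{3.3.5}$$
All that remains is to bound the norms of the rows and columns of $Y - pX$. This is not difficult if we note that they can be expressed as sums of independent random variables:
$$\|(Y - pX)_i\|_2^2 = \sum_{j=1}^n (\delta_{ij} - p)^2 X_{ij}^2 \le \sum_{j=1}^n (\delta_{ij} - p)^2 \cdot \|X\|_\infty^2,$$
and similarly for columns. Taking expectation and noting that $\mathbb{E}(\delta_{ij} - p)^2 = \operatorname{Var}(\delta_{ij}) = p(1 - p)$, we get
$$\mathbb{E}\|(Y - pX)_i\|_2 \le \big( \mathbb{E}\|(Y - pX)_i\|_2^2 \big)^{1/2} \le \sqrt{pn}\, \|X\|_\infty. \tag{3.3.6}$$
This is a good bound, but we need something stronger in (3.3.5). Since the maximum appears inside the expectation, we need a uniform bound, which will say that all rows are bounded simultaneously with high probability. Such uniform bounds are usually proved by applying concentration inequalities followed by a union bound. Bernstein's inequality (1.4.5) yields
$$\mathbb{P}\Big\{ \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le \exp(-ctpn) \quad \text{for } t \ge 10.$$
(Check!) This probability can be further bounded by $n^{-ct}$ using the assumption that $m = pn^2 \ge n \log n$. A union bound over $n$ rows leads to
$$\mathbb{P}\Big\{ \max_{i} \sum_{j=1}^n (\delta_{ij} - p)^2 > tpn \Big\} \le n \cdot n^{-ct} \quad \text{for } t \ge 10.$$
Integrating this tail, we conclude using (2.2.10) that
$$\mathbb{E}\max_{i} \sum_{j=1}^n (\delta_{ij} - p)^2 \lesssim pn.$$
(Check!) And this yields the desired bound on the rows,
$$\mathbb{E}\max_{i} \|(Y - pX)_i\|_2 \lesssim \sqrt{pn}\, \|X\|_\infty,$$
which is the improvement of (3.3.6) we wanted. We can argue similarly for the columns. Substituting into (3.3.5), this gives
$$\mathbb{E}\|Y - pX\| \lesssim \log(n) \sqrt{pn}\, \|X\|_\infty.$$
Then, by (3.3.4), we get
$$\mathbb{E}\|\hat{X} - X\| \lesssim \log(n) \sqrt{\frac{n}{p}}\, \|X\|_\infty. \tag{3.3.7}$$

Step 2: Passing to Frobenius norm. Now we will need to pass from the operator to the Frobenius norm. This is where we will use for the first (and only) time the rank of $X$. We know that $\operatorname{rank}(X) \le r$ by assumption and $\operatorname{rank}(\hat{X}) \le r$ by construction, so $\operatorname{rank}(\hat{X} - X) \le 2r$. There is a simple relationship between the operator and Frobenius norms:
$$\|\hat{X} - X\|_F \le \sqrt{2r}\, \|\hat{X} - X\|.$$
(Check it!) Take expectation of both sides and use (3.3.7); we get
$$\mathbb{E}\|\hat{X} - X\|_F \le \sqrt{2r}\, \mathbb{E}\|\hat{X} - X\| \lesssim \log(n) \sqrt{\frac{rn}{p}}\, \|X\|_\infty.$$
Dividing both sides by $n$, we can rewrite this bound as
$$\mathbb{E}\, \frac{1}{n} \|\hat{X} - X\|_F \lesssim \log(n) \sqrt{\frac{rn}{pn^2}}\, \|X\|_\infty.$$
But $pn^2 = m$ by definition of the sampling probability $p$. This yields the desired bound (3.3.3).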
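The completion method analyzed above (fill unobserved entries with zeros, rescale by the sampling probability, truncate the singular value decomposition) can be sketched in a few lines. This is an illustration under assumed parameters: the size, rank, sampling probability and Gaussian factors of the test matrix, as well as the function names, are choices made for this example only.

```python
import numpy as np

def best_rank_r(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s is non-increasing
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def complete(Y, p, r):
    """Estimate X by the best rank-r approximation of Y / p."""
    return best_rank_r(Y / p, r)

rng = np.random.default_rng(2)
n, r, p = 300, 2, 0.5

# Hypothetical ground truth: a random matrix of rank r.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Selectors delta_ij ~ Ber(p); unobserved entries of Y are set to zero.
delta = rng.random((n, n)) < p
Y = np.where(delta, X, 0.0)

X_hat = complete(Y, p, r)
avg_err = np.linalg.norm(X_hat - X, "fro") / n     # average error per entry
print(f"average error per entry = {avg_err:.3f}")
print(f"max entry magnitude     = {np.abs(X).max():.3f}")
```

The average error per entry should come out well below the largest entry of the matrix, which is the qualitative content of the theorem; no exact recovery is claimed.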
3.4. Notes
Theorem 3.1.2 on covariance estimation is a version of [51, Corollary 5.52], see also [31]. The logarithmic factor is in general necessary. This theorem is a general-purpose result. If one knows some additional structural information about the covariance matrix (such as sparsity), then fewer samples may be needed, see e.g. [11, 15, 35].
Although the logarithmic factor in Theorem 3.2.3 cannot be completely removed in general, it can be improved. Our argument actually gives
$$\mathbb{E}\|A\| \le C \sqrt{\log n} \cdot \mathbb{E}\max_i \|A_i\|_2 + C \log n \cdot \mathbb{E}\max_{i,j} |A_{ij}|.$$
Using different methods, one can save an extra $\sqrt{\log n}$ factor and show that
$$\mathbb{E}\|A\| \le C\, \mathbb{E}\max_i \|A_i\|_2 + C \sqrt{\log n} \cdot \mathbb{E}\max_{i,j} |A_{ij}|$$
(see [6]); see also [48] for a related bound. (The results in [6, 48] are stated for Gaussian random matrices; the corresponding bounds for general entries can be deduced by using conditioning and symmetrization.) The surveys [6, 51] and the textbook [53] present several other useful techniques to bound the operator norm of random matrices.

The matrix completion problem, which we discussed in Section 3.3, has attracted a lot of recent attention. E. Candes and B. Recht [13] showed that one can often achieve exact matrix completion, thus computing the precise values of all missing entries of a matrix, from $m \sim rn \log^2(n)$ randomly sampled entries. For exact matrix completion, one needs an extra incoherence assumption that is not present in Theorem 3.3.2. This assumption basically excludes matrices that are simultaneously sparse and low rank (such as a matrix all but one of whose entries are zero; it would be extremely hard to complete it, since sampling will likely miss the non-zero entry). Many further results on exact matrix completion are known, e.g. [14, 17, 23, 49]. Theorem 3.3.2 with a simple proof is borrowed from [44]; see also the tutorial [52]. This result only guarantees approximate matrix completion, but it does not have any incoherence assumptions on the matrix.

Lecture 4: Matrix deviation inequality
In this last lecture we will study a new uniform deviation inequality for random matrices. This result will be a far-reaching generalization of the Johnson-Lindenstrauss Lemma we proved in Lecture 1.
Consider the same setup as in Theorem 1.6.1, where $A$ is an $m \times n$ random matrix whose rows are independent, mean zero, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$. (If you find it helpful to think in terms of concrete examples, let the entries of $A$ be independent $N(0, 1)$ random variables.) Like in the Johnson-Lindenstrauss Lemma, we will be looking at $A$ as a linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$, and we will be interested in what $A$ does to points in some set in $\mathbb{R}^n$. This time, however, we will allow for infinite sets $T \subset \mathbb{R}^n$.
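Let us start by analyzing what $A$ does to a single fixed vector $x \in \mathbb{R}^n$. We have
$$\mathbb{E}\|Ax\|_2^2 = \mathbb{E}\sum_{j=1}^m \langle A_j, x \rangle^2 \quad \text{(where } A_j^T \text{ denote the rows of } A)$$
$$= \sum_{j=1}^m \mathbb{E}\langle A_j, x \rangle^2 \quad \text{(by linearity)}$$
$$= m \|x\|_2^2 \quad \text{(using isotropy of } A_j).$$
Further, if we believe that concentration about the mean holds here (and in fact, it does), we should expect that
$$\|Ax\|_2 \approx \sqrt{m}\, \|x\|_2 \tag{4.0.1}$$
with high probability.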
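Similarly to the Johnson-Lindenstrauss Lemma, our next goal is to make (4.0.1) hold simultaneously over all vectors $x$ in some fixed set $T \subset \mathbb{R}^n$. Precisely, we may ask how large the average uniform deviation is:
$$\mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \Big|. \tag{4.0.2}$$
This quantity should clearly depend on some notion of the size of $T$: the larger $T$, the larger the uniform deviation should be. So, how can we quantify the size of $T$ for this problem? In the next section we will do precisely this: introduce a convenient, geometric measure of the sizes of sets in $\mathbb{R}^n$, which is called Gaussian width.

Gaussian width
For a bounded set $T \subset \mathbb{R}^n$ and $g \sim N(0, I_n)$, the Gaussian width $w(T) := \mathbb{E}\sup_{x \in T} \langle g, x \rangle$ and the Gaussian complexity $\gamma(T) := \mathbb{E}\sup_{x \in T} |\langle g, x \rangle|$ (Definition 4.1.1) are closely related. Indeed (see footnote 19),
$$w(T) = \frac{1}{2}\, w(T - T) = \frac{1}{2}\, \mathbb{E}\sup_{x, y \in T} \langle g, x - y \rangle = \frac{1}{2}\, \mathbb{E}\sup_{x, y \in T} |\langle g, x - y \rangle| = \frac{1}{2}\, \gamma(T - T). \tag{4.1.2}$$
(Check these identities!)

Gaussian width has a natural geometric interpretation. Suppose $g$ is a unit vector in $\mathbb{R}^n$. Then a moment's thought reveals that $\sup_{x, y \in T} \langle g, x - y \rangle$ is simply the width of $T$ in the direction of $g$, i.e. the distance between the two hyperplanes with normal $g$ that touch $T$ on both sides, as shown in Figure 4.1.3. Then, by (4.1.2), $2w(T)$ can be obtained by averaging the width of $T$ over all directions $g$. This reasoning is valid except where we assumed that $g$ is a unit vector. Instead, for $g \sim N(0, I_n)$ we have $\mathbb{E}\|g\|_2^2 = n$ and $\|g\|_2 \approx \sqrt{n}$ with high probability.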
(Check both these claims using Bernstein's inequality.) Thus, we need to scale by the factor $\sqrt{n}$. Ultimately, the geometric interpretation of the Gaussian width becomes the following: $w(T)$ is approximately $\sqrt{n}/2$ times larger than the usual, geometric width of $T$ averaged over all directions.
A good exercise is to compute the Gaussian width and complexity for some simple sets, such as the unit balls of the $\ell_p$ norms in $\mathbb{R}^n$, which we denote $B_p^n = \{x \in \mathbb{R}^n : \|x\|_p \le 1\}$. In particular, we have
$$\gamma(B_2^n) \sim \sqrt{n}, \qquad \gamma(B_1^n) \sim \sqrt{\log n}.$$
For any finite set $T \subset B_2^n$, we have
$$\gamma(T) \lesssim \sqrt{\log |T|}.$$
The same holds for the Gaussian width $w(T)$. (Check these facts!) A look at these examples reveals that the Gaussian width captures some non-obvious geometric qualities of sets. Of course, the fact that the Gaussian width of the unit Euclidean ball $B_2^n$ is of order $\sqrt{n}$ is not surprising: the usual, geometric width in all directions is 2 and the Gaussian width is about $\sqrt{n}$ times that. But it may be surprising that the Gaussian width of the $\ell_1$ ball $B_1^n$ is much smaller, and so is the width of any finite set $T$ (unless the set has exponentially large cardinality). As we will see later, Gaussian width nicely captures the geometric size of "the bulk" of a set.
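The orders of magnitude for the Euclidean ball and the cross-polytope can be checked by simulation. The sketch below is a Monte Carlo estimate that relies on duality (the supremum of a linear functional over a unit ball equals the dual norm); the function name, dimension and number of trials are assumptions made for illustration.

```python
import numpy as np

def gaussian_width_ball(n, p, trials, rng):
    """Monte Carlo estimate of the Gaussian width of the unit l_p ball, p in {1, 2}.

    Uses sup_{||x||_p <= 1} <g, x> = ||g||_q with 1/p + 1/q = 1. These balls are
    symmetric, so the Gaussian width and Gaussian complexity coincide.
    """
    g = rng.standard_normal((trials, n))
    if p == 2:
        sup = np.linalg.norm(g, axis=1)     # dual of l_2 is l_2
    elif p == 1:
        sup = np.abs(g).max(axis=1)         # dual of l_1 is l_infinity
    else:
        raise ValueError("only p = 1 and p = 2 are sketched here")
    return sup.mean()

rng = np.random.default_rng(3)
n, trials = 1000, 2000
w2 = gaussian_width_ball(n, 2, trials, rng)
w1 = gaussian_width_ball(n, 1, trials, rng)
print(f"w(B_2^n) ~ {w2:.2f}   vs sqrt(n)        = {np.sqrt(n):.2f}")
print(f"w(B_1^n) ~ {w1:.2f}   vs sqrt(2 log n)  = {np.sqrt(2 * np.log(n)):.2f}")
```

The contrast between the two estimates makes the point of the paragraph concrete: the cross-polytope is tiny in Gaussian width even though it has the same diameter as the Euclidean ball.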

Matrix deviation inequality
Now we are ready to answer the question we asked in the beginning of this lecture: what is the magnitude of the uniform deviation (4.0.2)? The answer is surprisingly simple: it is bounded by the Gaussian complexity of $T$. The proof is not too simple however, and we will skip it.

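Theorem 4.2.1 (Matrix deviation inequality). Let $A$ be an $m \times n$ matrix whose rows $A_i$ are independent, isotropic and sub-gaussian random vectors in $\mathbb{R}^n$. Let $T \subset \mathbb{R}^n$ be a fixed bounded set. Then
$$\mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \Big| \le C K^2 \gamma(T),$$
where $K = \max_i \|A_i\|_{\psi_2}$ is the maximal sub-gaussian norm of the rows of $A$.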
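Remark 4.2.2 (Tail bound). It is often useful to have results that hold with high probability rather than in expectation. There exists a high-probability version of the matrix deviation inequality, and it states the following. Let $u \ge 0$. Then the event
$$\sup_{x \in T} \Big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \Big| \le C K^2 \big[ \gamma(T) + u \cdot \operatorname{rad}(T) \big] \tag{4.2.3}$$
holds with probability at least $1 - 2\exp(-u^2)$. Here $\operatorname{rad}(T)$ is the radius of $T$, defined as
$$\operatorname{rad}(T) := \sup_{x \in T} \|x\|_2.$$
Since $\operatorname{rad}(T) \lesssim \gamma(T)$ (check!) we can continue the bound (4.2.3) by
$$\lesssim K^2 u\, \gamma(T)$$
for all $u \ge 1$. This is a weaker but still a useful inequality. For example, we can use it to bound all higher moments of the deviation:
$$\Big( \mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \Big|^p \Big)^{1/p} \le C_p K^2 \gamma(T), \tag{4.2.4}$$
where $C_p \le C\sqrt{p}$ for $p \ge 1$. (Check this using Proposition 1.1.1.) The same inequality also controls the deviation of squares (Remark 4.2.5):
$$\mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2^2 - m \|x\|_2^2 \Big| \le C K^4 \gamma(T)^2 + C K^2 \sqrt{m}\, \operatorname{rad}(T)\, \gamma(T).$$
(Do this calculation using (4.2.4) for $p = 2$.) We will use this bound in Section 4.4.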
The matrix deviation inequality has many consequences. We will explore some of them now.

Deriving Johnson-Lindenstrauss Lemma
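We started this lecture by promising a result that is more general than the Johnson-Lindenstrauss Lemma. So let us show how to quickly derive Johnson-Lindenstrauss from the matrix deviation inequality.

Proof of Theorem 1.6.1 from Theorem 4.2.1. Assume we are in the situation of the Johnson-Lindenstrauss Lemma (Theorem 1.6.1). Given a set $\mathcal{X} \subset \mathbb{R}^n$ of $N$ points, consider the normalized difference set
$$T := \Big\{ \frac{x - y}{\|x - y\|_2} : x, y \in \mathcal{X} \text{ distinct} \Big\}.$$
Then $T$ is a finite subset of the unit sphere of $\mathbb{R}^n$, and thus the bound on the Gaussian complexity of finite sets from Section 4.1 gives
$$\gamma(T) \lesssim \sqrt{\log |T|} \le \sqrt{\log N^2} \lesssim \sqrt{\log N}.$$
The matrix deviation inequality (Theorem 4.2.1) then yields
$$\sup_{x, y \in \mathcal{X},\ x \ne y} \Big| \frac{\|A(x - y)\|_2}{\|x - y\|_2} - \sqrt{m} \Big| \lesssim \sqrt{\log N} \le \varepsilon \sqrt{m}$$
with high probability, say 0.99. (To pass from expectation to high probability, we can use Markov's inequality. The last bound uses the assumption $m \gtrsim \varepsilon^{-2} \log N$ of the Johnson-Lindenstrauss Lemma.) Multiplying both sides by $\|x - y\|_2 / \sqrt{m}$, we conclude that, with probability at least 0.99,
$$(1 - \varepsilon) \|x - y\|_2 \le \frac{1}{\sqrt{m}} \|Ax - Ay\|_2 \le (1 + \varepsilon) \|x - y\|_2 \quad \text{for all } x, y \in \mathcal{X}.$$
This is exactly the conclusion of the Johnson-Lindenstrauss Lemma.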
The argument based on the matrix deviation inequality, which we just gave, can be easily extended to infinite sets. It allows one to state a version of the Johnson-Lindenstrauss Lemma for general, possibly infinite, sets, which depends on the Gaussian complexity of $T$ rather than its cardinality. (Try to do this!)
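The distance preservation derived above can be observed directly. The following sketch measures the worst relative distortion of pairwise distances under a rescaled Gaussian matrix; the point cloud, the dimensions and the helper name are hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_distortion(points, A):
    """Largest value of | ||A(x - y)||_2 / (sqrt(m) ||x - y||_2) - 1 | over pairs of rows."""
    m = A.shape[0]
    worst = 0.0
    for i in range(points.shape[0] - 1):
        diffs = points[i + 1:] - points[i]                 # x_j - x_i for all j > i
        before = np.linalg.norm(diffs, axis=1)
        after = np.linalg.norm(diffs @ A.T, axis=1) / np.sqrt(m)
        worst = max(worst, np.abs(after / before - 1.0).max())
    return worst

rng = np.random.default_rng(4)
N, n, m = 50, 1000, 400
points = rng.standard_normal((N, n))       # hypothetical data set of N points in R^n
A = rng.standard_normal((m, n))            # rows are isotropic and sub-gaussian

eps = pairwise_distortion(points, A)
print(f"worst relative distortion over all pairs: {eps:.3f}")
```

Increasing m should shrink the distortion roughly like the square root of log N over m, while the ambient dimension n plays no role.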

Covariance estimation
In Section 3.1, we introduced the problem of covariance estimation, and we showed that $N \sim n \log n$ samples are enough to estimate the covariance matrix of a general distribution in $\mathbb{R}^n$. We will now show how to do better if the distribution is sub-gaussian.
where $K \subset \mathbb{R}^n$ is some known set in $\mathbb{R}^n$ that describes anything that we know about $x$ a priori. (Admittedly, we are operating on a high level of generality here. If you need a concrete example, we will consider it in Section 4.6.) Summarizing, here is the problem we are trying to solve. Determine a solution $\hat{x} = \hat{x}(A, y, K)$ to the underdetermined linear equation $y = Ax$ as accurately as possible, assuming that $x \in K$.
A variety of approaches to this and similar problems were proposed during the last decade. The one we will describe here is based on optimization. To do this, it will be convenient to convert the set $K$ into a function on $\mathbb{R}^n$ which is called the Minkowski functional of $K$. This is basically a function whose level sets are multiples of $K$. To define it formally, assume that $K$ is star-shaped, which means that together with any point $x$, the set $K$ contains the entire interval that connects $x$ with the origin; see Figure 4.5.1 for illustration. The Minkowski functional of $K$ is defined as
$$\|x\|_K := \inf\{ t > 0 : x/t \in K \}, \quad x \in \mathbb{R}^n.$$
If the set $K$ is convex and symmetric about the origin, $\|x\|_K$ is actually a norm on $\mathbb{R}^n$. (Check this!)

Now we propose the following way to solve the recovery problem: solve the optimization program
$$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'. \tag{4.5.2}$$
Note that this is a very natural program: it looks at all solutions to the equation $y = Ax'$ and tries to "shrink" the solution $x'$ toward $K$. (This is what minimization of the Minkowski functional is about.) Also note that if $K$ is convex, this is a convex optimization program, and thus can be solved effectively by one of the many available numeric algorithms.
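The main question we should now be asking is: would the solution to this program approximate the original vector $x$? The following result bounds the approximation error for a probabilistic model of linear equations. Assume that $A$ is a random matrix as in Theorem 4.2.1.

Theorem 4.5.3 (Recovery by optimization). The solution $\hat{x}$ of the optimization program (4.5.2) satisfies
$$\mathbb{E}\|\hat{x} - x\|_2 \lesssim \frac{w(K)}{\sqrt{m}},$$
where $w(K)$ is the Gaussian width of $K$.

Proof. Both the original vector $x$ and the solution $\hat{x}$ are feasible vectors for the optimization program (4.5.2). Then
$$\|\hat{x}\|_K \le \|x\|_K \quad \text{(since } \hat{x} \text{ minimizes the Minkowski functional)}$$
$$\le 1 \quad \text{(since } x \in K).$$
Thus both $\hat{x}, x \in K$. We also know that $A\hat{x} = Ax = y$, which yields
$$\|A\hat{x} - Ax\|_2 = 0. \tag{4.5.4}$$
Let us apply the matrix deviation inequality (Theorem 4.2.1) for $T := K - K$. It gives
$$\mathbb{E}\sup_{u, v \in K} \Big| \|A(u - v)\|_2 - \sqrt{m}\, \|u - v\|_2 \Big| \lesssim \gamma(K - K) = 2 w(K),$$
where we used (4.1.2) in the last identity. Substitute $u = \hat{x}$ and $v = x$ here. We may do this since, as we noted above, both these vectors belong to $K$. But then the term $\|A(u - v)\|_2$ will be equal to zero by (4.5.4). It disappears from the bound, and we get
$$\mathbb{E}\sqrt{m}\, \|\hat{x} - x\|_2 \lesssim w(K).$$
Dividing both sides by $\sqrt{m}$ we complete the proof.

Suppose we know that the signal $x$ is sparse, which means that only few coordinates of $x$ are nonzero. As before, our task is to recover $x$ from the random linear measurements given by the vector $y = Ax$, where $A$ is an $m \times n$ random matrix. This is a basic example of sparse recovery problems, which are ubiquitous in various disciplines.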
The number of nonzero coefficients of a vector $x \in \mathbb{R}^n$, or the sparsity of $x$, is often denoted $\|x\|_0$. This is similar to the notation for the $\ell_p$ norm $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$, and for a reason. You can quickly check that
$$\|x\|_0 = \lim_{p \to 0} \|x\|_p^p. \tag{4.6.1}$$
(Do this!) Keep in mind that neither $\|x\|_0$ nor $\|x\|_p$ for $0 < p < 1$ are actually norms on $\mathbb{R}^n$, since they fail the triangle inequality. (Give an example.) Let us go back to the sparse recovery problem. Our first attempt to recover $x$ is to try the following optimization problem:
$$\min \|x'\|_0 \quad \text{subject to} \quad y = Ax'. \tag{4.6.2}$$
This is sensible because this program selects the sparsest feasible solution. One can show that the program performs well. But there is an implementation caveat: the function $f(x) = \|x\|_0$ is highly non-convex and even discontinuous. There is simply no known algorithm to solve the optimization problem (4.6.2) efficiently.
To overcome this difficulty, let us turn to the relation (4.6.1) for an inspiration. What if we replace $\|x\|_0$ in the optimization problem (4.6.2) by $\|x\|_p$ with $p > 0$? The smallest $p$ for which $f(x) = \|x\|_p$ is a genuine norm (and thus a convex function on $\mathbb{R}^n$) is $p = 1$. So let us try
$$\min \|x'\|_1 \quad \text{subject to} \quad y = Ax'. \tag{4.6.3}$$
This is a convexification of the non-convex program (4.6.2), and a variety of numeric convex optimization methods are available to solve it efficiently.
We will now show that $\ell_1$ minimization works nicely for sparse recovery. As before, we assume that $A$ is a random matrix as in Theorem 4.2.1.
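To make the convex program (4.6.3) concrete, here is a small numerical sketch. The solver is an ADMM (alternating direction method of multipliers) iteration for basis pursuit, swapped in here only because it needs nothing beyond NumPy; it is one of many possible solvers and is not a method discussed in these lectures. The step parameter, iteration count, problem sizes and test signal are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def basis_pursuit_admm(A, y, rho=1.0, iters=5000):
    """Approximately solve  min ||x||_1  subject to  A x = y  by ADMM."""
    m, n = A.shape
    AAt_inv = np.linalg.inv(A @ A.T)       # assumes A has full row rank
    P = np.eye(n) - A.T @ AAt_inv @ A      # orthogonal projector onto ker(A)
    q = A.T @ (AAt_inv @ y)                # minimum-norm solution of A x = y
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(iters):
        x = P @ (z - u) + q                # project z - u onto {x : A x = y}
        z = soft_threshold(x + u, 1.0 / rho)
        u = u + x - z                      # scaled dual update
    return x

rng = np.random.default_rng(5)
n, m = 60, 30
x_true = np.zeros(n)
x_true[[5, 20, 41]] = [1.5, -2.0, 1.0]     # hypothetical 3-sparse signal
A = rng.standard_normal((m, n))            # Gaussian measurement matrix
y = A @ x_true

x_hat = basis_pursuit_admm(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative recovery error = {rel_err:.2e}")
```

With 30 measurements of a 3-sparse vector in dimension 60, the minimizer typically coincides with the true signal, so the printed error reflects only the accuracy of the iterative solver.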

Notes
For a more thorough introduction to Gaussian width and its role in high-dimensional estimation, refer to the tutorial [52] and the textbook [53]; see also [5]. The matrix deviation inequality (Theorem 4.2.1) is borrowed from [36]. In the special case where $A$ is a Gaussian random matrix, this result follows from the work of G. Schechtman [50]. In the general case of sub-gaussian distributions, earlier variants of Theorem 4.2.1 were proved by B. Klartag and S. Mendelson [30], S. Mendelson, A. Pajor and N. Tomczak-Jaegermann [39] and S. Dirksen [19].
Theorem 4.4.1 for covariance estimation can be proved alternatively using more elementary tools (Bernstein's inequality and ε-nets), see [51]. However, no known elementary approach exists for the low-rank covariance estimation discussed in Remark 4.4.3. The bound (4.4.4) was proved by V. Koltchinskii and K. Lounici [31] by a different method.
In Section 4.5, we scratched the surface of a recently developed area of sparse signal recovery, which is also called compressed sensing. Our presentation there essentially follows the tutorial [52]. Theorem 4.6.4 can be improved: if we take $m \gtrsim s \log(n/s)$ measurements, then with high probability the optimization program (4.6.3) recovers the unknown signal $x$ exactly, i.e. $\hat{x} = x$.
First results of this kind were proved by J. Romberg, E. Candes and T. Tao [12] and a great number of further developments followed; refer e.g. to the book [22] and the chapter in [18] for an introduction to this rich area.

Figure 1.6.3. Johnson-Lindenstrauss Lemma states that a random projection of $N$ data points from dimension $n$ to dimension $m \sim \log N$ preserves the geometry of the data.

Definition 2.3.1 (Erdös-Rényi model). Consider a set of $n$ vertices and connect every pair of vertices independently and with fixed probability $p$. The resulting random graph is said to follow the Erdös-Rényi model $G(n, p)$.

Figure 2.3.3 illustrates a simulation of a stochastic block model.

Figure 3.1.1. Data points $X_1, \ldots, X_N$ sampled from a distribution in $\mathbb{R}^n$ and the principal components of the covariance matrix.

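Details for the proof of Theorem 3.1.2. We have
$$\mathbb{E}(XX^T - \Sigma)^2 = \mathbb{E}\|X\|_2^2\, XX^T - \Sigma^2 \quad \text{(check by expanding the square)}$$
$$\precsim \operatorname{tr}(\Sigma) \cdot \mathbb{E}XX^T \quad \text{(drop } \Sigma^2 \text{ and use (3.1.3))}$$
$$= \operatorname{tr}(\Sigma) \cdot \Sigma.$$
Thus $\sigma^2 \lesssim N \operatorname{tr}(\Sigma) \|\Sigma\|$. Next, to bound $K$, we have
$$\|XX^T - \Sigma\| \le \|X\|_2^2 + \|\Sigma\| \quad \text{(by triangle inequality)}$$
$$\lesssim \operatorname{tr}\Sigma + \|\Sigma\| \quad \text{(using (3.1.3))}$$
$$\le 2 \operatorname{tr}\Sigma =: K.$$
Substitute the bounds on $\sigma$ and $K$ into (3.1.4) and get
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \frac{1}{N} \Big( \sqrt{N \operatorname{tr}(\Sigma) \|\Sigma\| \log n} + \operatorname{tr}(\Sigma) \log n \Big).$$
To complete the proof, use that $\operatorname{tr}\Sigma \le n \|\Sigma\|$ (check this!) and simplify the bound.

Remark 3.1.5 (Low-dimensional distributions). Far fewer samples are needed for covariance estimation for low-dimensional, or approximately low-dimensional, distributions. To measure approximate low-dimensionality we can use the notion of the stable rank of $\Sigma^{1/2}$. The stable rank of a matrix $A$ is defined as the square of the ratio of the Frobenius to operator norms (see footnote 15):
$$r(A) := \frac{\|A\|_F^2}{\|A\|^2}.$$
The stable rank is always bounded by the usual, linear algebraic rank, $r(A) \le \operatorname{rank}(A)$, and it can be much smaller. (Check both claims.) Our proof of Theorem 3.1.2 actually gives
$$\mathbb{E}\|\Sigma_N - \Sigma\| \lesssim \|\Sigma\| \Big( \sqrt{\frac{r \log n}{N}} + \frac{r \log n}{N} \Big), \quad \text{where } r = r(\Sigma^{1/2}) = \frac{\operatorname{tr}\Sigma}{\|\Sigma\|}.$$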

Definition 4.1.1. Let $T \subset \mathbb{R}^n$ be a bounded set, and $g$ be a standard normal random vector in $\mathbb{R}^n$, i.e. $g \sim N(0, I_n)$. Then the quantities
$$w(T) := \mathbb{E}\sup_{x \in T} \langle g, x \rangle \quad \text{and} \quad \gamma(T) := \mathbb{E}\sup_{x \in T} |\langle g, x \rangle|$$
are called the Gaussian width of $T$ and the Gaussian complexity of $T$, respectively.

Figure 4.1.3. The width of a set $T$ in the direction of $g$.

$\gamma(B_1^n) \sim \sqrt{\log n}$.
19 The set $T - T$ is defined as $\{x - y : x, y \in T\}$. More generally, given two sets $A$ and $B$ in the same vector space, the Minkowski sum of $A$ and $B$ is defined as $A + B = \{a + b : a \in A, b \in B\}$.

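The higher moments of the deviation satisfy
$$\Big( \mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2 - \sqrt{m}\, \|x\|_2 \Big|^p \Big)^{1/p} \le C_p K^2 \gamma(T),$$
where $C_p \le C\sqrt{p}$ for $p \ge 1$. (Check this using Proposition 1.1.1.)

Remark 4.2.5 (Deviation of squares). It is sometimes helpful to bound the deviation of the square $\|Ax\|_2^2$ rather than $\|Ax\|_2$ itself. We can easily deduce the deviation of squares by using the identity $a^2 - b^2 = (a - b)^2 + 2b(a - b)$ for $a = \|Ax\|_2$ and $b = \sqrt{m}\, \|x\|_2$. Doing this, we conclude that
$$\mathbb{E}\sup_{x \in T} \Big| \|Ax\|_2^2 - m \|x\|_2^2 \Big| \le C K^4 \gamma(T)^2 + C K^2 \sqrt{m}\, \operatorname{rad}(T)\, \gamma(T).$$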

Figure 4.5.1. The set on the left (whose boundary is shown) is star-shaped, the set on the right is not.

Now we propose the following way to solve the recovery problem: solve the optimization program
$$\min \|x'\|_K \quad \text{subject to} \quad y = Ax'. \tag{4.5.2}$$

Theorem 4.5.3 (Recovery by optimization). The solution $\hat{x}$ of the optimization program (4.5.2) satisfies
$$\mathbb{E}\|\hat{x} - x\|_2 \lesssim \frac{w(K)}{\sqrt{m}},$$
where $w(K)$ is the Gaussian width of $K$.

Theorem 4.5.3 says that a signal $x \in K$ can be efficiently recovered from $m \sim w(K)^2$ random linear measurements.

Sparse recovery
Let us illustrate Theorem 4.5.3 with an important specific example of the feasible set $K$.

Theorem 4.6.4 (Sparse recovery by optimization). Assume that an unknown vector $x \in \mathbb{R}^n$ has at most $s$ non-zero coordinates, i.e. $\|x\|_0 \le s$. The solution $\hat{x}$ of the optimization program (4.6.3) satisfies
$$\mathbb{E}\|\hat{x} - x\|_2 \lesssim \sqrt{\frac{s \log n}{m}}\, \|x\|_2.$$

Proof. Since $\|x\|_0 \le s$, the Cauchy-Schwarz inequality shows that
$$\|x\|_1 \le \sqrt{s}\, \|x\|_2. \tag{4.6.5}$$
(Check!) Denote the unit ball of the $\ell_1$ norm in $\mathbb{R}^n$ by $B_1^n$, i.e. $B_1^n := \{x \in \mathbb{R}^n : \|x\|_1 \le 1\}$. Then we can rewrite (4.6.5) as the inclusion
$$x \in \sqrt{s}\, \|x\|_2 \cdot B_1^n := K.$$
Apply Theorem 4.5.3 for this set $K$. We noted in Section 4.1 that the Gaussian width of $B_1^n$ satisfies $w(B_1^n) \lesssim \sqrt{\log n}$, so
$$w(K) = \sqrt{s}\, \|x\|_2 \cdot w(B_1^n) \lesssim \sqrt{s \log n}\, \|x\|_2.$$
Substitute this into Theorem 4.5.3 and complete the proof.

Theorem 4.6.4 says that an $s$-sparse signal $x \in \mathbb{R}^n$ can be efficiently recovered from $m \sim s \log n$ random linear measurements.
If the network is relatively dense, i.e. $d \gtrsim \log n$, one can improve the guarantee (2.3.8) of spectral clustering in Theorem 2.3.7. All one has to do is to use the improved concentration inequality (2.4.1) instead of Theorem 2.3.5. Furthermore, in this case there exist algorithms that can recover the communities exactly, i.e. without any misclassified vertices, and with high probability, see e.g. [1, 16, 27, 38]. For sparser networks, where $d \ll \log n$ and possibly even $d = O(1)$, relatively few algorithms had been known until recently, but now there exist many approaches that provably recover communities in sparse stochastic block models, see e.g. [9, 16, 24, 25, 32, 40, 54].