Network formation by reinforcement learning: the long and medium run

We investigate a simple stochastic model of social network formation by the process of reinforcement learning with discounting of the past. In the limit, for any value of the discounting parameter, small, stable cliques are formed. However, the time it takes to reach the limiting state in which cliques have formed is very sensitive to the discounting parameter. Depending on this value, the limiting result may or may not be a good predictor for realistic observation times.


Introduction
Each day, each member of a small group of individuals selects two others with whom to interact. The individuals are of various types, and their types determine the payoff to each from the interaction. That is to say that the interaction is modeled as a symmetric 3-person game. Probabilities of selecting individuals evolve by reinforcement learning, where the reinforcements are the payoffs of the interaction. We consider two games. The first is a degenerate game, "Three's Company". Here there is only one type and everyone gets equal reinforcement for every interaction. The analysis of "Three's Company" is then used in the analysis of a second game, a three-person Stag Hunt. Here there are two types, Stag Hunters and Hare Hunters. Hare Hunters always get a payoff of 3, while a Stag Hunter gets a payoff of 4 if he interacts with two other Stag Hunters, otherwise he gets nothing.
The point of this modeling exercise is threefold. First, there is a substantial literature, reviewed in the next section, that compares learning models to laboratory data in order to make inferences about underlying psychological mechanisms for learning and strategy formation. Our model explores subgroup formation in a way that is informed by previous research on plausible mechanisms and parameter values. Secondly, the notion of modeling the co-evolution of interaction networks and strategies was set out in [SP00]. This calls for the formulation and analysis of basic stochastic network models. To this end, we provide an analysis of two such models. Our analysis is robust enough to shed light on similar models, many of which have been studied in less depth. Finally, many of the models previously studied share a common mathematical description. We aim to provide a collection of rigorous results on this class of models, which will allow scientists quickly to understand the long- and medium-term behavior of these stochastic models.
Stochastic models of reinforcement learning were introduced by Estes [Est50] and by Bush and Mosteller [BM55]. On each of a series of trials the learning agent chooses between a finite set of alternative acts, and gets a payoff. The choice is assumed to be governed by a probability vector that evolves according to prescribed learning dynamics. In the learning dynamics investigated by Estes and by Bush and Mosteller, the new probability vector after a trial is a weighted average of the previous probability vector and the vector putting unit weight on the act just chosen. The weight on the unit vector is the product of the payoff and a learning rate parameter. This kind of learning satisfies what Bush and Mosteller call independence of path: the probability vector at time n + 1 depends only on the probability vector at time n, the act chosen, and the magnitude of the payoff. As a consequence, if the payoffs to the acts are fixed, the learning process is a Markov chain with the probability vector as its state. Some Markov models for learning have been firmly established in the psychology literature, appearing in survey texts by the early 1970s [IT69,Nor72].
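The weighted-average update described above can be sketched in a few lines of Python. This is only an illustration: the function name `bush_mosteller_update` and the requirement that the product of payoff and learning rate lie in [0, 1] are assumptions made here, not details taken from [Est50] or [BM55].

```python
def bush_mosteller_update(probs, chosen, payoff, learning_rate):
    """Return the choice probabilities after one trial.

    The new vector is a weighted average of the old vector and the unit
    vector on the chosen act. The weight on the unit vector is
    payoff * learning_rate, assumed here to lie in [0, 1].
    """
    weight = payoff * learning_rate
    if not 0.0 <= weight <= 1.0:
        raise ValueError("payoff * learning_rate must lie in [0, 1]")
    return [
        (1.0 - weight) * p + (weight if act == chosen else 0.0)
        for act, p in enumerate(probs)
    ]


# The next state depends only on the current vector, the act chosen and
# the payoff, which is the "independence of path" property.
new_probs = bush_mosteller_update([0.5, 0.3, 0.2], chosen=0, payoff=1.0, learning_rate=0.2)
print(new_probs)
```

Because the update takes only the current probability vector as input, iterating it with fixed payoffs gives a Markov chain on the probability simplex.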
Mathematically, a reinforcement process as defined in the literature on reinforced random walks [CD97,Dav90,Dav99,Pem88,Pem92] need not have this Markov property. The current probability can depend on the history of choices and payoffs, via summary statistics or propensities associated to the possible actions. The probability vector is a function of all the propensities (though in general not a one-to-one function if the process is not Markovian). The possibility of using such processes to model reinforcement learning was introduced by Luce [Luc59].
Luce considered a range of models for the evolution of the propensities. The payoffs for an action taken might modify its propensity multiplicatively, additively, or in some combination. In Luce's gamma model, the new propensity, $v'(i)$, for an action $i$, is the sum of $\gamma_i$ and the product of the old propensity and a factor $\beta_i$, that is, $v'(i) = \beta_i v(i) + \gamma_i$, where both $\beta_i$ and $\gamma_i$ are functions of the payoff for action $i$. Luce investigated his models using the linear response rule, $p(i) = v(i) / \sum_j v(j)$, which simply normalizes the propensities. However, the separation of the questions of propensity evolution and response rule opens the possibility of other alternatives such as the logistic response rule: $p(i) = \exp(b v(i)) / \sum_j \exp(b v(j))$, with $b$ a learning parameter. This response rule is used in the learning models of Busemeyer and Townsend [BT93] and Camerer and Ho [CH99]. One advantage of the logistic response rule is that it allows deterrent reinforcement to be modeled: since probabilities are proportional to exponentials of propensities, they remain positive when propensities are allowed to become negative.
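To make the separation between propensity evolution and response rule concrete, here is a minimal Python sketch of the gamma-model update together with the linear and logistic response rules. The function names are illustrative assumptions. The vectors `beta` and `gamma` are supplied by the caller, since the text only says that they are functions of the payoff.

```python
import math


def gamma_model_update(v, beta, gamma):
    """Gamma-model step: v'(i) = beta[i] * v(i) + gamma[i] for every action i."""
    return [b * vi + g for vi, b, g in zip(v, beta, gamma)]


def linear_response(v):
    """p(i) = v(i) / sum_j v(j); requires nonnegative propensities, not all zero."""
    total = sum(v)
    return [vi / total for vi in v]


def logistic_response(v, b):
    """p(i) = exp(b v(i)) / sum_j exp(b v(j)); well defined for negative v too."""
    shift = max(b * vi for vi in v)  # subtract the max for numerical stability
    exps = [math.exp(b * vi - shift) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]


# Additive reinforcement of action 1 by a payoff of 3 (beta = 1 everywhere).
v = gamma_model_update([1.0, 2.0], beta=[1.0, 1.0], gamma=[0.0, 3.0])
print(linear_response(v))
# A negative propensity still receives positive probability under the logistic rule.
print(logistic_response([-5.0, 0.0], b=1.0))
```

The same propensity vector can thus be fed to either response rule, which is the design freedom the paragraph above points to.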
The first experimental corroboration of these models, of which we are aware, was by Herrnstein [Her70]. Thorndike had previously proposed the "Law of Effect", which Herrnstein quantifies as the "Matching Law": the probability of choosing an action is proportional to the accumulated rewards. Let propensities evolve by adding payoffs, that is, $\beta_i = 1$ and $\gamma_i$ is zero if the action $i$ was not taken, and is otherwise equal to the payoff. If we follow the linear response rule, we obtain Herrnstein's matching law. Herrnstein reports data from laboratory experiments with humans as well as with animals, from which one may conclude the broad applicability of the model.
There is a special case whose limiting behavior is well known. If each action is equally reinforced, the process is mathematically equivalent to Pólya's urn process [EP23], with each action represented by a different color of ball initially in the urn. The process converges to a random limit, whose support is the whole probability simplex. In other words, any limiting state of propensities or probabilities is possible.
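The equally reinforced special case can be simulated directly. The sketch below is illustrative only: the function name `polya_urn_fractions`, the unit reward and the choice of seeds are assumptions, and it uses the standard-library `random.Random.choices` for the linear response step.

```python
import random


def polya_urn_fractions(initial, steps, rng):
    """Run an equally reinforced urn and return the final colour fractions."""
    weights = list(initial)
    for _ in range(steps):
        # Linear response: colour i is drawn with probability weights[i] / total.
        i = rng.choices(range(len(weights)), weights=weights)[0]
        weights[i] += 1.0  # every action receives the same reinforcement
    total = sum(weights)
    return [w / total for w in weights]


# Different random seeds settle near different fractions, reflecting the
# random limit whose support is the whole simplex.
for seed in range(5):
    print(seed, polya_urn_fractions([1.0, 1.0], 2000, random.Random(seed)))
```

Replacing the constant reward by an action-dependent payoff turns the same loop into Herrnstein's additive propensities with linear response.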
In 1960, Suppes and Atkinson [SA60] introduced interactive reinforcement to model learning behavior in games. A number of players choose between alternatives as before, but the payoffs to each player now depend on the acts chosen by all players. Players modify their choice probabilities by reinforcement learning dynamics of the Bush-Mosteller type. If joint actions fix the payoffs and we take the state of the system to be the vector, indexed by players, of vectors of choice probabilities, then the dynamics will be Markovian.
Insofar as multi-agent reinforcement learning has been studied, it has been largely in the framework of Suppes and Atkinson. Macy [Mac90,Mac91] applies multi-player reinforcement to study collective action problems from a bounded rationality viewpoint. Borgers and Sarin [BS97] draw a connection between multi-agent Bush-Mosteller dynamics and the replicator dynamics of evolutionary game theory [MS82], showing that the two coincide in a certain limit. Perhaps the greatest impulse to this direction of study was the widely cited 1995 paper of Roth and Erev [RE95]. They proposed a multi-agent reinforcement model based on Herrnstein's linear reinforcement and response.
Here and in subsequent publications [ER98,BE98], they show a good fit with a wide range of empirical data. Limiting behavior in the basic model has recently been studied by Beggs [Beg02] and by Ianni [Ian02].
In [SP00], both basic and discounted versions of Roth-Erev learning are applied to social network formation. Individuals begin with prior propensities to interact with each other, and interactions are modeled as two-person games. Individuals have given strategies, and interactions between individuals evolve by reinforcement learning. The analysis begins with a series of results on "Making Friends", a network formation model in the special case where the game interaction is trivial. Nontrivial strategic interaction is then introduced, and it is shown that the co-evolution of network and strategy depends on relative rates of evolution as well as on other features of the model.
The present work is a natural outgrowth of the investigations begun in [SP00]. In the richer context of multi-agent interactions, more phenomena arise, namely clique formation and a metastable state of high network connectivity for an initial epoch whose length depends dramatically on the discounting parameter. In Section 5.3 we discuss the implications of these features for a wide class of models.

Mathematical background
Our ultimate goal is to understand qualitative phenomena such as clique formation, or tendency of the interaction frequencies toward some limiting values. The mathematical literature on reinforcement processes contains results in these directions. It will be instructive to review these, and to examine the mathematical classification of such processes, although we will need to go beyond this level of analysis to explain the behavior of network models such as Three's Company on timescales we can observe.
Reinforcement processes fall into two main types, trapping and non-trapping. A process is said to be trapping if there are proper subsets of actions for each player such that there is a positive probability that all players always play from this subset of actions. For example, if the repetition of any single vector $(i_j)$ of actions (action $i_j$ for player $j$) is sufficiently self-reinforcing that it might cause the actions $i_j$ to be perpetuated forever, then the process is trapping. The specific dynamics investigated by Bush and Mosteller in 1955 are trapping, as are most logistic response models. By contrast, models that give all times in the past an equal effect on the present, such as Herrnstein's dynamics and Roth-Erev dynamics, tend not to be trapping.
One of several modifications suggested by Roth and Erev to maximize agreement of their model with the data is to introduce a discounting parameter x ∈ (0, 1). The past is discounted via multiplication by a factor of (1 − x) at each step. Formally, this is a version of Luce's gamma model with $\beta_i = 1 - x$ for all $i$. It is known from the theory of urn processes that discounting may cause trapping. For example, it follows from a theorem of H. Rubin reported in [Dav90] that if Pólya's urn is altered by discounting the past, there will be a point in time beyond which only one color is ever chosen. This holds as well with Roth-Erev type models: the discounted Roth-Erev model is trapping, while the undiscounted model is not. In [SP00], discounted and nondiscounted versions of several games are studied, and equilibria examined for stability. Again, discounting causes trapping, and we investigate the robustness of the trapping when the discounting parameter becomes negligible. In a related paper, Bonacich and Liggett [BL02] investigate Bush-Mosteller dynamics in a two-person interaction representing gift giving. Their model has discounting, and they find a set of trapping states.
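The effect of discounting on an urn can be seen in a short Python sketch. It assumes a unit reward for the colour drawn and multiplication of all weights by (1 − x) at each step; the function names are illustrative and the code is not taken from [Dav90] or [RE95].

```python
import random


def discounted_urn_step(weights, x, rng):
    """One step of a discounted urn: draw a colour, discount the past, reward the draw."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    new_weights = [(1.0 - x) * w for w in weights]  # discount every propensity
    new_weights[i] += 1.0                           # unit reward for the colour drawn
    return new_weights, i


def run_discounted_urn(initial, x, steps, rng):
    """Iterate the step and return the final weights and the sequence of draws."""
    weights, draws = list(initial), []
    for _ in range(steps):
        weights, i = discounted_urn_step(weights, x, rng)
        draws.append(i)
    return weights, draws


weights, draws = run_discounted_urn([1.0, 1.0], x=0.5, steps=200, rng=random.Random(0))
print(weights, draws[-10:])
```

A colour that goes unchosen for k consecutive steps has its weight multiplied by (1 − x)^k while the chosen colour's weight stays at least 1, which is the mechanism behind the eventual fixation on a single colour.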
It is in general an outstanding problem in the theoretical study of reinforcement models to show that trapping must occur with probability 1 if it occurs with positive probability. This was only recently proved, for instance, for the reinforced random walk on a graph with three vertices, via a complicated argument [Lim01]. Much of the effort that has gone into the mathematical study of these models has been directed at these difficult limiting questions. In the non-trapping case, even though the choice of action does not fixate, the probabilities for some of the actions may tend to zero. A series of papers in the 1990s by Benaim and others [BH95, Ben98, Ben99] establishes some basic tests for whether, in undiscounted Roth-Erev type models, probabilities will tend toward deterministic vectors.
From the point of view of applications, in a situation where it can be proven or surmised that trapping occurs, we are mainly interested in characterizing the states in which we may become trapped and in determining how long it will be before the process becomes trapped. Recalling our initial discussion of modeling goals, we are particularly interested in results that are robust as parameters and modeling details vary, or, when they are not robust, in understanding how these details of the model affect observed qualitative behavior.
4 Three's Company: a ternary interaction model

Specification of the model
The game "Three's Company" models collaboration of trios of agents from a fixed population. At each time step, each agent selects two others with whom to form a temporary collusion. An agent may be involved in multiple collusions during a single time step: one that she initiates, and zero or more initiated by another agent. Analogously to the basic game "Making Friends", introduced in [SP00], Three's Company has a constant reward structure: every collaboration results in an identical positive outcome, so every agent in every temporary collusion increases by an identical amount her propensity to choose each of the other two agents in the trio. The choice probabilities follow what could be called multilinear response. The probability of an agent choosing to form a trio with two other agents i and j is taken to be proportional to the product of her propensity for i with her propensity for j. In addition to providing a model for self-organization based on a simple matching law type of response mechanism, this model is meant to provide a basis for the analysis of games such as the three person stag hunting game discussed in the next section. We now give a more formal mathematical definition of Three's Company, taken from [PS03a].
Fix a positive integer $N \geq 4$, representing the size of the population. For $t \geq 0$ and $1 \leq i, j \leq N$, define random variables $W(i,j,t)$ and $U(i,t)$ inductively on a common probability space $(\Omega, \mathcal{F}, \mathbf{P})$ as follows. The $W$ variables are positive numbers, and the $U$ variables are subsets of the population of cardinality 3. One may think of the $U$ variables as random triangles in the complete graph with a vertex representing each agent. The variable $U(i,t)$ is equal to the trio formed by agent $i$ at time $t$. The $W$ variables represent propensities: $W(i,j,t)$ will be the propensity for player $i$ to choose player $j$ on the time step $t$. The initialization is $W(i,j,0) = 1$ for all $i \neq j$, while $W(i,i,0) = 0$. We write $W(e,t)$ for $W(i,j,t)$ when $e$ is the edge (unordered set) $\{i,j\}$ (note that the evolution rules below imply that $W(i,j,t) = W(j,i,t)$ for all $i$, $j$ and $t$). The inductive step, for $t \geq 0$, defines probabilities (formally, conditional probabilities given the past) for the variables $U(i,t)$ in terms of the variables $W(r,s,t)$, $r, s \leq N$, and then defines $W(i,j,t+1)$ in terms of $W(i,j,t)$ and the variables $U(r,t)$, $r \leq N$. In line with the multilinear response and constant reward structure described above, the equations are, for distinct $i$, $j$, $k$:
$$\mathbf{P}(U(i,t) = \{i,j,k\} \mid \mathcal{F}_t) = \frac{W(i,j,t)\,W(i,k,t)}{\sum_{\{r,s\}} W(i,r,t)\,W(i,s,t)}; \qquad (4.1)$$
$$W(i,j,t+1) = (1-x)\,W(i,j,t) + \sum_{r=1}^{N} \mathbf{1}_{\{i,j\} \subseteq U(r,t)}. \qquad (4.2)$$
The sum in the denominator of (4.1) runs over unordered pairs $\{r,s\}$ of distinct agents other than $i$. Here $(1-x)$ is the factor per unit time by which the past is discounted, and the σ-field $\mathcal{F}_t$ conditioned on is that of the process up to time $t$.
The following alternative statement of the evolution equation (4.1) is useful for those familiar with the analytic machinery (cf. [Pem92]) that is typically used to reduce such a process to a stochastic approximation. Think of the normalized matrix $X(t) := W(t) / \sum_{r,s} W(r,s,t)$ as the state vector. This is then an asymptotically time-homogeneous Markov chain, with an evolution rule
$$X(t+1) = X(t) + g(t)\,[\mu(X(t)) + \xi_t], \qquad (4.3)$$
where g(t) = 1/x + O(1/t), the drift vector field µ maps the simplex of normalized matrices into its tangent space and may be explicitly computed, and $\xi_t$ are martingale increments of order 1.
In the non-discounted case, g(t) = 1/t+O(1/t), and much information about the long term behavior of this process can be discovered by an analysis of the flow dX/dt = µ(X) [Ben99]. In the discounted case, g(t) does not go to zero and an alternative analysis is required.
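One way to make the specification concrete is the following minimal Python sketch of a single time step. It assumes the multilinear response described in the text (agent i picks the pair {j, k} with probability proportional to the product of her propensities for j and for k) and a unit reward added to every edge of every chosen trio. The names `init_weights`, `choose_trio`, `step` and `simulate` are illustrative and do not come from [PS03a].

```python
import itertools
import random


def init_weights(n):
    """W(i, j, 0) = 1 for i != j and 0 on the diagonal."""
    return [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]


def choose_trio(w, i, rng):
    """Agent i picks {i, j, k} with probability proportional to w[i][j] * w[i][k]."""
    others = [j for j in range(len(w)) if j != i]
    pairs = list(itertools.combinations(others, 2))
    scores = [w[i][j] * w[i][k] for j, k in pairs]
    j, k = rng.choices(pairs, weights=scores)[0]
    return frozenset((i, j, k))


def step(w, x, rng):
    """Every agent chooses a trio; then discount by (1 - x) and reward chosen edges."""
    n = len(w)
    trios = [choose_trio(w, i, rng) for i in range(n)]
    new_w = [[(1.0 - x) * w[i][j] for j in range(n)] for i in range(n)]
    for trio in trios:
        for a, b in itertools.permutations(trio, 2):
            new_w[a][b] += 1.0  # assumed unit reward per chosen trio containing {a, b}
    return new_w, trios


def simulate(n, x, steps, seed=0):
    """Run the process and return the final weights and the last round of trios."""
    rng = random.Random(seed)
    w, trios = init_weights(n), []
    for _ in range(steps):
        w, trios = step(w, x, rng)
    return w, trios


if __name__ == "__main__":
    _, last_trios = simulate(n=6, x=0.5, steps=500)
    print(sorted(sorted(t) for t in set(last_trios)))
```

Inspecting the last round of trios for several values of x is a quick way to see whether a given run has already split into cliques. The sketch makes no attempt to reproduce the trial counts reported in the paper.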

Analysis of the model
Equations (4.1) and (4.2) completely specify the model for the given parameters N and x. Simulations for a population of size 6 (N = 6) showed the following behavior. When x = .5 (a rather steep discount rate, though not unheard of in psychological laboratory experiments [BS02]), all 1,000 trials broke up into two cliques of size 3, with no interactions across clique boundaries. In larger populations, with the same discount rate, again decomposition into cliques occurs, this time of sizes 3, 4 and 5, whose members interact exclusively with other members of the same clique.
When N = 6 and x = .4 we found that 994 out of the 1,000 trials had decomposed into two cliques of three (we allowed the process to continue for 1,000,000 time steps). When x was decreased to .3, only 13 of the 1,000 trials showed decomposition into cliques, while in the remainder of the trials all six members of the population remained well connected through the 1,000,000 time steps. Finally, when x = .2, a reasonable discount rate for individuals though still steeper than in most economic models, none out of 1,000 trials had broken into cliques. All six members of the population remained well connected after 1,000,000 time steps.
To summarize the simulation data, high discount rates lead to trapping, with each agent restricting her choices to members of a clique of size 3 (or, in larger populations, size 4 or 5). Less steep discount rates lead to less trapping or no trapping at all. Interestingly, the simulation data is contradicted by the following theorem, proved in the appendix.
Theorem 4.1 In Three's Company, with any population size N ≥ 6 and any discount rate x ∈ (0, 1), with probability 1 the population may be partitioned into subsets of sizes 3, 4 and 5, such that each member of each subset chooses each other with positive limiting frequency, and chooses members outside the subset only finitely often. Every partition into sets of sizes 3, 4 and 5 has positive probability of occurring.
In other words, despite the simulation data, trapping always occurs. The set of traps is the set of all ways of decomposing into cliques of sizes 3, 4 and 5. The apparent contradiction between the simulation and the theorem is resolved by Theorem 4.2, whose proof is given in the companion paper [PS03a]. The theorem states that the time for the population to break into cliques increases exponentially in 1/x as the discount rate 1 − x increases to 1.
Theorem 4.2 For each $N \geq 6$ there is a $\delta > 0$ and numbers $c_N > 0$ such that in Three's Company with $N$ players and discount rate $1 - x$, the probability is at least $\delta$ that each player will play with each other player beyond time $\exp(c_N x^{-1})$.
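To get a feel for how quickly the time scale in Theorem 4.2 grows as x decreases, one can tabulate exp(c/x) for a few values of x. The constant c = 1 used below is purely hypothetical, since the theorem asserts only that some positive constant exists, and the function name is an assumption of this sketch.

```python
import math


def escape_time_scale(x, c):
    """exp(c / x): the time scale in Theorem 4.2 for a hypothetical constant c."""
    return math.exp(c / x)


for x in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"x = {x}: exp(1/x) = {escape_time_scale(x, 1.0):.1f}")
```

Whatever the true constant, the exponential dependence on 1/x is the qualitative point, and it matches the sharp change in observed behavior between x = .4 and x = .2.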

The three player stag hunt

Specification of the model
We now replace the uniformly positive reward structure by a nontrivial game, which is a three player version of Rousseau's Stag Hunt. For the purposes of our model, agents are divided into two types, hare hunters and stag hunters. That is, we model strategic choice as unchanging, at least on the time scale where network evolution is taking place. No matter which other two agents a hare hunter goes hunting with, the hare hunter comes back with a hare (hares can be caught by individuals). A stag hunter, on the other hand, comes home empty-handed unless in a trio of three stag hunters, in which case each comes home with one third share of a stag. One third of a stag is better than a whole hare, but evidently riskier because it will not materialize if any member of the hunting party decides to play it safe and focus attention on bagging a hare. In the three player stag hunting game, as in Three's Company, at each time step each agent chooses two others with whom to form a collusion. The payoffs are as follows. Whenever a hare hunter is a member of a trio, his reward is 3. A stag hunter's reward is 4 if in a trio of three stag hunters and 0 otherwise. A formal model is as follows.
Let $N = 2n$ be an even integer representing the size of the population and let $x \in (0,1)$ be the discount parameter. The variables $\{W(i,j,t), U(i,t) : 1 \leq i, j \leq N;\ t \geq 0\}$ are defined again on $(\Omega, \mathcal{F}, \mathbf{P})$ with the $W$ variables taking positive values and representing propensities and the $U$ variables taking values in the subsets of $\{1, \ldots, N\}$ of cardinality 3 and representing choices of trios. We initialize the $W$ variables by $W(i,j,0) = 1 - \delta_{ij}$, just as before, and we invoke a linear response mechanism (4.1) just as before. Now, instead of the trivial reward structure (4.2), the propensities evolve according to the hunting bounties $\ldots\,\mathbf{1}_{i \in U(q,t) = \{q,r,s\}}$. (5.4) The factor in front of the last sum is 2 rather than 4 because the sum counts the trio $\{q,r,s\}$, chosen by agent $q$, exactly twice: as $(q,r,s)$ and as $(q,s,r)$.
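The payoff structure of the three player stag hunt stated in the text (3 for a hare hunter in any trio, 4 for a stag hunter in a trio of three stag hunters, 0 otherwise) can be written as a small Python function. The name `stag_hunt_payoff` and the representation of types as a list of booleans are assumptions of this sketch. It encodes only the payoffs, not the full propensity update (5.4).

```python
def stag_hunt_payoff(agent, trio, is_stag_hunter):
    """Payoff to `agent` from hunting in `trio` (an iterable of three agents).

    Hare hunters always receive 3. A stag hunter receives 4 only when all
    three members of the trio are stag hunters, and 0 otherwise.
    """
    if agent not in trio:
        raise ValueError("agent must be a member of the trio")
    if not is_stag_hunter[agent]:
        return 3
    return 4 if all(is_stag_hunter[m] for m in trio) else 0


# Agents 0, 1, 2 hunt stag; agent 3 hunts hare.
types = [True, True, True, False]
print(stag_hunt_payoff(0, (0, 1, 2), types))
print(stag_hunt_payoff(0, (0, 1, 3), types))
print(stag_hunt_payoff(3, (0, 1, 3), types))
```

A stag hunter's zero payoff in any mixed trio is what keeps her propensities toward hare hunters from growing.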

Analysis of the model
The propensities for stag hunters to choose rabbit hunters remain at their initial values, whence stag hunters choose other stag hunters with limiting probability 1. The stag hunters are never affected by the rabbit hunters' choices, so the stag hunters mimic Three's Company among themselves precisely except for the times, numbering only O(log t) by time t, when they choose rabbit hunters. We know therefore, that eventually they fall into cliques of size 3, 4 and 5, but that this will take a long time if the discount parameter is small.
Rabbit hunters may form cliques of size 3, 4 and 5 as well, but because they are rewarded for choosing stag hunters, they may also attach to stag hunters. The chosen stag hunters have cliques of their own and ignore the rabbit hunters, except during the times that they are purposelessly called to hunt with them. These attachments can be one rabbit continually calling on a particular pair of stags or two rabbits continually calling on a single stag. In either case the one or two rabbits are isolated from all hunters other than their chosen stag hunters.
What matters here is not the details of the trapping state but the time scale on which the trap forms and the likelihood of a rabbit hunter ending up in a sub-optimal trap. This likelihood decreases as the discount rate becomes small for the following reason. Rabbit hunters choosing to hunt with stag hunters are getting no reciprocal invitations, whereas any time they choose to hunt with other rabbit hunters, their mutual success creates a likelihood of future reciprocal invitations. These reciprocal invitations are then successful and increase the original hunter's propensity for choosing the other rabbit hunter. Thus, on average, propensity for a rabbit hunter to form a hunting party with other rabbit hunters will increase faster than propensity to call on stag hunters, and the relative weights will drift toward the rabbit-rabbit groupings. The smaller the discount parameter, x, the more chance this has to occur before a chance run of similar choices locks an agent into a particular clique.
Simulations show that stag hunters find each other rapidly. With six stag hunters and six rabbit hunters and a discount rate of .5, the probability that a stag hunter will visit a rabbit hunter usually drops below half a percent in 25 iterations. Within 50 iterations of the process this happened in all 1,000 trials, and this remains true for values of x between .5 and .1. For x=.01, 100 iterations of the process suffice for stag hunters to meet stag hunters at this level, and 200 iterations are enough when x=.001. Rabbit hunters find each other more slowly, except when they are frozen into interactions with stag hunters. When the past is heavily discounted the latter possibility is a serious one. At x=.5, at least one rabbit hunter interacted with a stag hunter (after 10,000 iterations) in 384 of 1,000 trials. This number dropped to 217 for x=.4, 74 for x=.3, 6 for x=.2, and 0 for x=.1. Reliable clique formation among stag hunters is much slower, in line with results of the last section, taking about 100,000 iterations for x=.5 and 1,000,000 iterations for x=.4.

Further discussion
The two models discussed in this paper are highly idealized. But from these, we can learn some general principles as to how to analyze a much wider class of models.
The first principle is that when x is near zero, the process should for a long time behave similarly to the non-discounted process (x = 0). Here, following [Ben99], one must find equilibria for the flow dX/dt = µ(X), and classify these as to stability. Unstable equilibria, in general, do not matter (though see [PS03b] for cases in which the effects of unstable equilibria may last quite a while). Stable equilibria may be possible trapping states, or may not be. The interesting case is when a stable equilibrium for the non-discounted process is not a possible trapping state for the discounted process. In this case, the process may get pseudo-trapped there, that is, may remain there for a very long time. Just how long will depend on the model, though Theorem 4.2 extends rather robustly to a broader class of linearly stable states (for the non-discounted process) that are non-trapping for the discounted process.
Another mathematical technique relevant to these analyses, which we have not yet tried to apply, is quasi-stationary analysis. Recall that equation (4.3) describes an asymptotically time-homogeneous Markov chain. If there is trapping, this chain is not ergodic. A chain that is not ergodic may be conditioned to stay in a set of transient states. The stationary measure of the conditioned chain is called a quasi-stationary measure for the original chain. The study of these was begun in the 1960s by Seneta and others (see, e.g., [DS65]) and there is now an extensive literature. In particular, it is sometimes possible to understand the time scale on which the process leaves the transient states.
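As a toy illustration of quasi-stationary analysis, the sketch below computes the quasi-stationary measure of a small absorbing chain by power iteration on the transition matrix restricted to its transient states. The two-state matrix is a made-up example and is not derived from Three's Company; the function name `quasi_stationary` is likewise an assumption.

```python
def quasi_stationary(q, iterations=500):
    """Power iteration for the normalized left eigenvector of a substochastic matrix.

    `q[i][j]` is the probability of moving from transient state i to transient
    state j; each row may sum to less than 1, the deficit being the chance of
    leaving the transient set. Returns (nu, rate), where nu is the
    quasi-stationary measure and rate is the one-step survival probability
    when the chain is started from nu.
    """
    n = len(q)
    nu = [1.0 / n] * n
    rate = 0.0
    for _ in range(iterations):
        nxt = [sum(nu[i] * q[i][j] for i in range(n)) for j in range(n)]
        rate = sum(nxt)
        nu = [v / rate for v in nxt]
    return nu, rate


# Hypothetical chain: from either transient state the process is absorbed
# with probability 0.1 per step.
toy = [[0.5, 0.4],
       [0.3, 0.6]]
nu, rate = quasi_stationary(toy)
print(nu, rate, 1.0 / (1.0 - rate))  # measure, survival rate, mean exit time
```

Started from the quasi-stationary measure, the exit time from the transient set is geometric with parameter 1 − rate, which is how this kind of analysis yields a time scale for leaving a pseudo-trapped regime.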

Conclusion
Our analysis reinforces the emphasis of Suppes and Atkinson, and of Roth and Erev, on the medium run for empirical applications. Long run limiting behavior may simply never be seen. It is useful to quantify the time scale on which we can expect medium run behavior to persist, and Theorem 4.2 is meant to serve as a prototypical result in this direction. Indeed, Theorem 4.2 is proved via a stronger result [PS03a, Theorem 4.1], which applies to many trapping models as the discount rate becomes negligible. As to the nature of the medium run behavior, analyses tend to be model-dependent.
7 Appendix: proof of Theorem 4.1
Let $G(t)$ be the graph whose edges are all $e$ such that $e \subseteq U(i,t)$ for some $i$, that is, the set of edges whose weights are increased from time $t$ to time $t+1$. The following two easy lemmas capture some helpful estimates.
Proof: The first part is a consequence of the equation for the evolution of the total weight. The second part follows from the first, and from the fact that when $e \in G(t)$ then $W(e,t+1) \geq 1$ and hence $W(e,t+k) \geq (1-x)^{k-1}$.
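The total-weight argument used in this proof can be spelled out as follows. This is a sketch under the assumption that each chosen trio adds one unit to each of its three edges, so that the $N$ trios chosen at each step add $3N$ in total; the symbol $T(t)$ for the total edge weight is introduced here for illustration only.

```latex
% Assumption: W(e,t+1) = (1-x) W(e,t) + \#\{ i : e \subseteq U(i,t) \},
% with the sum below taken over unordered edges e.
\begin{align*}
T(t) &:= \sum_{e} W(e,t), \\
T(t+1) &= (1-x)\,T(t) + 3N, \\
T(t) &= (1-x)^{t}\,T(0) + \frac{3N}{x}\bigl(1-(1-x)^{t}\bigr)
        \;\le\; \max\{T(0),\, 3N x^{-1}\}, \\
W(e,t+k) &\ge (1-x)^{k-1}\,W(e,t+1) \;\ge\; (1-x)^{k-1}
        \quad \text{whenever } e \in G(t).
\end{align*}
```

The third line holds because $T(t)$ is a convex combination of $T(0)$ and $3N x^{-1}$. The last line uses only that weights are discounted by at most a factor $(1-x)$ per step and never decrease otherwise.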
Let $\overline{G}$ denote the transitive (irreflexive) closure of a graph $G$; thus $\overline{G}$ is the smallest disjoint union of complete graphs that contains $G$. There is a path from $v$ to $w$ of length at most $N$; denote this path $(v = v_1, v_2, \ldots, v_r = w)$. If $r = 2$ then the inequality (7.1) for $e \in G(t)$ follows from Lemma 7.1. If $r \geq 3$, we let $E(H,v,w,1)$ be the event that for every $2 \leq j \leq r-1$, the edge $\{v_{j-1}, v_{j+1}\}$ is in $G(t+1)$. Since this event contains the intersection over $j$ of the events that $U(v_j,t) = \{v_j, v_{j-1}, v_{j+1}\}$, since Lemma 7.1 bounds each of these probabilities from below, and since the events are conditionally independent given $\mathcal{F}_t$, we have a lower bound on the probability of $E(H,v,w,1)$. In general, for $1 \leq k \leq r-2$, let $E(H,v,w,k)$ be the event that for every $2 \leq j \leq r-k$, the edge $\{v_{j-1}, v_{j+k}\}$ is in $G(t+k)$. We claim that conditional on $E(H,v,w,l)$ for all $l < k$, the conditional probability of $E(H,v,w,k)$ given $\mathcal{F}_{t+k-1}$ can be bounded below: inductively, Lemma 7.1 bounds from below the product $W(v_j, v_{j-1}, t)\,W(v_j, v_{j+k}, t)$, and hence the probability that $U(v_j,t) = \{v_j, v_{j-1}, v_{j+k}\}$; these conditionally independent probabilities may then be multiplied to prove the claim, with the bound depending only on $x$ and $N$.
From this argument, we see that the intersection $E(H,v,w) := \bigcap_{1 \leq k \leq r-2} E(H,v,w,k)$ has a probability which is bounded from below. Sequentially, we may choose a sequence of values for $w$ running through all vertices of $H$ at some distance $r(w) - 1 \geq 2$ from $v$, measured in the metric on $H$. For each such $w$, we can bound from below the probability that in $r - 2$ more time steps the path from $v$ to $w$ will be transitively completed. We denote these events $E'(H,v,w)$, the prime denoting the time shift to allow events analogous to $E(H,v,w)$ to occur sequentially. Summing the time to run over all $w \in H$ yields at most $N^2$ time steps. Let $E(H,v)$ denote the intersection of all the events $E'(H,v,w)$. Inductively, we see that the probability of $E(H,v)$ is bounded from below by a positive number depending only on $N$ and $x$.
Finally, we let $(H,v)$ vary with $H$ exhausting components of $G(t)$ and $v$ a choice function on the vertices of $H$. The events $E(H,v)$ are all conditionally independent given $\mathcal{F}_t$, so the probability of their intersection, $E$, is bounded from below by a positive constant which we call $c$. By Lemma 7.1 once more, on $E$, we know that (7.1) is satisfied for each $e \in G(t)$. For a set $V$ of vertices, let $E(V,t)$ denote the event that from time $t$ onward, $V$ is isolated from its complement. If $V$ is the vertex set of a component of $G(t)$, then the conditional probability given $\mathcal{F}_t$ of the event $E(V,t)$ may be bounded from below as follows. For any $v \in V$, $w \in V^c$, and for any $s \geq t$, if the edge $e := \{v,w\}$ is not in $G(r)$ for any $t \leq r < s$, then by part 1 of Lemma 7.1, its weight $W(e,s)$ is at most $(1-x)^{s-t}\,3N x^{-1}$. Since $\sum_z W(v,z,s) \geq 2$ for all $v$ and $s$, it follows from the evolution equations that

It follows that
uniformly in N, x and t as s − t → ∞ (though the uniformity in N and x is not needed). By the conditional Borel-Cantelli Lemma, it follows that on the event that V is the vertex set of a component of G(t).
By the reverse direction of the Conditional Borel-Cantelli Lemma, the event $E(V,t)$ occurs for some $t$ with probability 1 on the event that $V$ is a component of $G(t)$ infinitely often. Let $e = \{v,w\}$ be any edge. If $e \notin G(t)$ infinitely often, then since there are only finitely many subsets of vertices, it follows that $v \in V$ and $w \in W$ for some disjoint $V$ and $W$ that are infinitely often components of $G(t)$. This implies that $e \in G(t)$ finitely often. We have shown that, almost surely, the edges come in two types: those in $G(t)$ finitely often and those in $G(t)$ all but finitely often. This further implies that $G(t)$ is eventually constant. Denote this almost sure limit by $G_\infty$. It remains to characterize $G_\infty$.
It is evident that $G_\infty$ contains no component of size less than three, since $G(t)$ is the union of triangles $U(i,t)$. Suppose that $G(t) = H$ for some $H$ of cardinality at least six. By Lemma 7.2, conditional on $\mathcal{F}_t$ and $G(t) = H$, for every $e \in H$. Write $H$ as the disjoint union of sets $J$ and $K$, each of cardinality at least three. Then with probability at least $\left( \frac{(1-x)^{N^2}}{3N + (1-x)^{N^2}} \right)^{|J|+|K|}$, $U(i, t+N^2) \subseteq J$ for every $i \in J$ and $U(i, t+N^2) \subseteq K$ for every $i \in K$. In this case, $G(t+N^2)$ has components that are proper subsets of $H$. By the martingale convergence theorem, $\mathbf{P}(H \text{ is a component of } G_\infty \mid \mathcal{F}_t)$ converges with probability 1 to the indicator function of $H$ being a component of $G_\infty$. From the above computation, it is not possible for $\mathbf{P}(H \text{ is a component of } G_\infty \mid \mathcal{F}_t)$ to converge to 1 when $H$ has cardinality six or more. Therefore, every component of $G_\infty$ has cardinality 3, 4 or 5.
The rest of the proof is easy. Let $V_1, \ldots, V_k$ be any partition of $[N]$ into sets of cardinalities 3, 4 and 5. The derivation of (7.2) shows that in other words, with positive probability $G_\infty$ has $k$ components which are precisely the complete graphs on $V_1, \ldots, V_k$. It is elementary that a coupling may be produced between the Three's Company processes on populations of sizes $N$ and $K < N$ (with the same $x$ value), so that if $\{\tilde{W}(i,j,t), \tilde{U}(i,t)\}$ are the weight and choice variables for the smaller population, then $\tilde{U}(i,t) = U(i,t)$ and $\tilde{W}(i,j,t+1) = W(i,j,t+1)$ for all $t < \tau$, where $\tau$ is the first time, possibly infinite, at which $U(i,t)$ contains an edge between $[K]$ and $\{K+1, \ldots, N\}$. In general, coupling methods show that if $\mathbf{P}(G_\infty = G_0 \mid \mathcal{F}_t) > 1 - \epsilon$ then the conditional distribution of the Three's Company process from time $t$ onward given $\mathcal{F}_t$ and $G_\infty = G_0$, shifted back $t$ time units and restricted to a component $H$ of $G_\infty$, is within $\epsilon$ in total variation of the distribution of the Three's Company process on $H$ started with initial weights $W'(i,j,0) := W(i,j,t)$.
The Three's Company process on a population of size 3, 4 or 5 with any discount rate 1 − x < 1 is ergodic: to see this just note that the Markov chain whose state space is the collection of W variables is Harris recurrent as a consequence of Lemma 7.2. The invariant measure gives positive weight to each edge, so each agent chooses each other with positive frequency, finishing the proof of Theorem 4.1.