Skip to main content
eScholarship
Open Access Publications from the University of California

A Bayesian Metric for Network Similarity

Abstract

Networks of every kind and in numerous fields areomnipresent in today’s society (e.g. brain networks, socialnetworks) and are the intense subject of research. It wouldbe of great utility to have a computationally efficient andgenerally applicable method for assessing similarity ofnetworks. The field (going back to the 1950s) has not comeup with such a method (albeit a few moves in this directionexist, such as Jaccard coefficients, QAP--quadraticassignment procedure, and more recently Menezes & Roth,2013, and Asta & Shalizi, 2014). I present a Bayesian-basedmetric for assessing similarity of two networks, possibly ofdifferent size, that include nodes and links between nodes. Iassume the nodes are labeled so that both the nodes andlinks between two nodes that are shared between the twonetworks can be identified.The method calculates similarity as (a monotonictransformation of) the odds that the two observed networks,termed V and W, were produced by random sampling froma single master network, termed G, as opposed to generationby two different but similar networks, termed Gv and Gw.The simplest form of the method ignores strengths thatcould be assigned to nodes and links, and considers onlynodes and links that are, or are not, shared by the networks.Suppose there are n V nodes and N V links only in V, n Wnodes and N W links only in W and n c nodes and N c linksshared between the networks. Thus the number of nodes inV is n c + n V and the number in W is n c + n W . The number ofunique nodes in both V and W is n c + n V + n W = n. Thenumber of links in V is N c + N V and the number in W is N c +N W . The number of unique links in both V and W is N c + N V+ N W = N.The single master network, G, is assumed to consist of theunion of the nodes and links in the two networks, and has nnodes and N links. The probability a given shared node willbe randomly and independently sampled twice is[(n V +n c )/n][(n W +n c )/n]. The probability a given shared linkwill be randomly and independently sampled twice is[(N V +N c )/N][(N W +N c )/N].If there are two generating networks I assume they eachhave n nodes and N links. I also assume they are similar, because we would not be comparing dissimilar networks.The degree of similarity is controlled by ‘tuning’parameters 1 : Gv and Gw are assumed to share αn nodes andβN links. The probability a given shared node will besampled twice is then α[(n V +n c )/n][(n W +n c )/n], and theprobability a given shared link will be sampled twice isβ[(N V +N c )/N][(N W +N c )/N]. The likelihood ratio λ js for G vs(GV, GW) as generator of a given shared node is then 1/αand the likelihood ratio π js of a given shared link is then 1/β.For a non-shared node, say in V, similar reasoning gives alikelihood ratio λ kV of[1-(n W +n c )/n)] /[1– α(n W +n c )/n]and for a non-shared link a likelihood ratio π kV of[1-(N W +N c )/n)] /[1– α(N W +N c )/N]For a non-shared node or link in W substitute a Wsubscript for the V subscript in these likelihood ratios.Computational efficiency is a necessity if the similaritymetric is to be applied to large networks. For this reason Ido not calculate the exact probabilities for the numbers ofshared and non-shared nodes and links that are observed(the combinatoric complexity of such calculations isenormous). Instead I make the simplifying assumption thateach node and link contribute the likelihood ratios givenabove and that the total odds is obtained by multiplying allthe likelihood ratios together. This simplification canperhaps be justified if similar distortion is produced by thissimplifying assumption for both the cases of G and (G V ,G W )as generators. Under this simplifying assumption the overallodds becomes:φ(1/2) = (λ js ) nc (λ kV ) nV (λ jW ) nW (π js ) Nc (π kV ) NV (π jW ) NWTaking the log of this product converts the calculation tosums and makes calculation highly efficient.This abstract is too short to permit giving the different andmore complex results that hold for the several cases whenthe nodes and/or links have associated strengths. I give asummary of some of the results here. The results for linksand nodes are similar so consider the results for nodes. Letthere be just one set of strength values, Si for the i-th node.Norm these to sum to 1.0. For either generation by G or(Gv,Gw) assume sampling is made without replacement andproportional to strength. Let Ziv and Ziw be theprobabilities that node i will be sampled by n v +n c samples,or n w +n c samples respectively. The Z’s would be difficult to obtain analytically but could be estimated by Monte Carlosampling. Consider two possibilities for the way that Gv andGw overlap. In Case A the probability a node will be sharedis simply α, independent of strength. In Case B, theprobability a node will be shared is an increasingfunction of strength, Y i .For Case A the likelihood ratio for a shared node i is:1/α. For a node k only in V the likelihood ratio is: λ kV =(1-­‐Zkw)/{1 – α (1-­‐Zkw)}. For a node only in W exchangethe subscripts v and w. Then we have for the odds due tonodes: φ D = (1/α) nc Π k (λ kV )Π j (λ jW ).For Case B the likelihood ratio for a shared node i is1/Y i . For a node k only in V the likelihood ratio is: λ kV =(1-­‐Zkw)/{1–Y k (1-­‐Zkw)}. Again switch v and w subscriptsfor a node only in W. Then we have for the odds due tonodes: φ D = Π i (1/Y i )Π k (λ kV )Π j (λ jW ).These expressions would have analogous forms forlinks, with different Ns, Z’s and Y’s, and the overall oddswould, as before, be a product of the odds for nodes andthe odds for links.The critical difference between Cases A and B is thedegree to which evidence based on an observed sharednode or link is strength dependent: For Case B thisevidence rises as strength decreases. This should raiseconcerns: However strengths are obtained there is likelyto be measurement noise that reduces the reliability oflow strength values. This might argue in favor of usingCase A, or if one preferred Case B to restrict the Yi valuesto lie above a lower bound. The idea would be to letevidence depend most on the nodes (or links with highstrength values.It should be observed that the existence of acomputationally efficient and generally applicable metricfor network similarity would allow alignment of non-labeled networks. One would search for the alignment ofnodes that would maximize the metric.I have many relevant publications demonstrating somedegree of expertise in Bayesian modeling (e.g.: Shiffrin &Chandramouli, in press; Shiffrin, Chandramouli, &Grünwald, 2015; Chandramouli & Shiffrin, 2015; Nelson &Shiffrin, 2013; Cox & Shiffrin, 2012; Shiffrin, Lee, Kim, &Wagenmakers, 2008; Cohen, Shiffrin, Gold, Ross, & Ross,2007; Denton & Shiffrin; Huber, Shiffrin, Lyle, & Ruys,2001; Shiffrin & Steyvers, 1997). I note that the presentresults are in a vague sense an extension of the metricproposed for matching memory probes to memory tracesthat are given in Cox and Shiffrin (2012) and in theappendix of Nelsonb and Shiffrin (2013).

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View