US20160350382A1 - Estimating influence using sketches - Google Patents

Info

Publication number
US20160350382A1
Authority
US
United States
Prior art keywords
influence
nodes
graph
node
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/236,986
Inventor
Renato F. Werneck
Daniel Delling
Thomas Pajor
Edith Cohen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Priority to US15/236,986
Assigned to MICROSOFT CORPORATION reassignment MICROSOFT CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: DELLING, DANIEL, PAJOR, THOMAS, WERNECK, RENATO F., COHEN, EDITH
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC reassignment MICROSOFT TECHNOLOGY LICENSING, LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MICROSOFT CORPORATION
Publication of US20160350382A1
Legal status: Abandoned

Classifications

    • G06F17/30519
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/30958

Definitions

  • Propagation of contagion is a fundamental process in social, biological, and physical networks.
  • Graphs can be used to model a network, and propagation of contagion can be used to model the spread of information, influence, or a viral infection with respect to the nodes of the graph.
  • Diffusion patterns in the graph can be specified by a probabilistic model, such as independent cascade (IC), or captured by a set of representative traces.
  • influence queries include determining the influence of a specified seed set of nodes in a graph, and identifying the most influential seed set of a given size in the graph (i.e., influence maximization). Answering an influence query may involve edge traversals in hundreds of graph instances, and may not scale well for very large graphs. Influence maximization is hard even to approximate.
  • the standard approach is the greedy algorithm, which iteratively selects the node that maximizes the marginal gain in influence and adds it to the seed set.
  • the greedy algorithm does not scale well for graphs with more than a few million edges.
  • a graph that includes multiple nodes and edges is received. Multiple instances of the graph are generated by randomly instantiating the edges according to, for example, a binary independent cascade model or a randomized edge length independent cascade model. Where the binary independent cascade model is used, combined reachability sketches are generated for each node across all instances of the graph. Where the randomized edge length independent cascade model is used, combined all-distances sketches are generated for each node across all instances of the graph. Depending on which model is used, the combined reachability or all-distances sketches are used to estimate the influence of nodes in the graph or to estimate a subset of nodes from a graph of a specified size with a maximum influence using a greedy algorithm.
  • a graph is received by a computing device.
  • the graph includes nodes and edges.
  • a sketch is computed by the computing device.
  • the sketch may be either a reachability sketch or an all-distances sketch.
  • An influence query is received by the computing device.
  • the influence query may be a query for an estimate of the influence of a subset of nodes or for an estimate of a subset of nodes of a specified size with a maximum combined influence.
  • a result is determined in response to the influence query using one or more of the computed sketches by the computing device. The determined result is provided in response to the influence query by the computing device.
  • sketches are received by a computing device.
  • Each sketch is associated with a plurality of nodes of a graph.
  • Each sketch may be one or more of a reachability sketch or an all-distances sketch.
  • An influence query is received for a subset of the nodes of a specified size having a maximum influence by the computing device.
  • a first node of the plurality of nodes that when added to the subset of the nodes increases an influence of the subset of the nodes by the greatest amount is determined using the sketch associated with the first node and the sketches associated with the nodes in the subset of the nodes by the computing device.
  • the determined first node is added to the subset of the nodes by the computing device. That the subset of the nodes is of the specified size is determined by the computing device.
  • the subset of the nodes is provided by the computing device.
  • FIG. 1 shows an environment for answering influence queries.
  • FIG. 2 is an illustration of an implementation of an influence engine.
  • FIG. 3 is an operational flow of an implementation of a method for estimating the results of an influence query for a graph.
  • FIG. 4 is an operational flow of an implementation of a method for determining a subset of nodes from a graph of a specified size that maximizes an influence of the nodes in the subset of nodes.
  • FIG. 5 shows an exemplary computing environment.
  • FIG. 1 shows an environment 100 for answering influence queries 145 on graphs.
  • the graphs may include a plurality of nodes and edges and may include both directed and undirected graphs.
  • the graphs may be weighted or unweighted.
  • a graph may represent a variety of entities and structures such as a social network, the Internet, populations of humans or animals, and cities, for example.
  • An example of an influence query 145 includes a request to determine the influence of a subset of nodes S from a graph G.
  • the influence query 145 may include identifiers of the one or more nodes in the subset.
  • Another example of an influence query 145 may be to identify a subset of nodes of a particular size that includes the nodes from the graph G with the highest combined influence.
  • the influence query 145 may include an indicator of the desired size of the subset of nodes. This type of query 145 is known as influence maximization.
  • the influence of a node is a measure of how connected a particular node in the graph is to the other nodes of a graph. Identifying nodes with high influence can have many uses in a variety of fields. For example, for social networking, identifying users with high influence can be used for marketing purposes to determine which users to give a free product to in order to maximize the exposure of the product. As another example, for public health, the influence of users can be used to model how a disease may be spread, or to identify which users to target for vaccination.
  • In the independent cascade (IC) model, an independent random variable is assigned to each edge (u, v) of a graph G to model the influence of the node u on the node v.
  • a single instance of the graph may be created by instantiating the random variables for each edge, and the influence of a particular node may be determined across many of these graph instances.
  • One version of the IC model is known as binary IC.
  • the random variable assigned to each edge is binary and may be one or zero.
  • the assigned variable represents whether or not the particular edge is live or null.
  • a live edge (u, v) means that once a node u is infected, so is the node v.
  • the influence of a particular node can be based on the number of nodes that are infected when the particular node is infected across all of the instances of the graph.
  • the variable assigned to the (directed) edge (u, v) and the edge (v, u) may be different.
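The binary IC notion of influence described above can be illustrated with a short sketch. It assumes each instance is given simply as a list of live directed edges, and it measures influence as the average number of infected nodes per instance; the function names `reachable_count` and `binary_ic_influence` are hypothetical and do not come from the patent.

```python
from collections import deque

def reachable_count(live_edges, seeds):
    """Count the nodes infected in one instance when `seeds` start infected."""
    adj = {}
    for u, v in live_edges:
        adj.setdefault(u, []).append(v)  # edges are directed: (u, v) is not (v, u)
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:  # a live edge (u, v) infects v once u is infected
                seen.add(v)
                queue.append(v)
    return len(seen)

def binary_ic_influence(instances, seeds):
    """Average the per-instance infected counts over all instances."""
    return sum(reachable_count(edges, seeds) for edges in instances) / len(instances)

instances = [
    [("a", "b"), ("b", "c")],  # both edges live
    [("a", "b")],              # edge (b, c) is null in this instance
    [],                        # no live edges
]
print(binary_ic_influence(instances, ["a"]))  # (3 + 2 + 1) / 3 = 2.0
```

This direct computation traverses every instance for every query, which is the cost the sketch-based approach is meant to avoid.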
  • In the randomized edge length (REL) IC model, each edge may be randomly assigned any non-negative value.
  • Each variable may represent a variety of values such as time.
  • the assigned variable to an edge (u, v) may represent how much time may elapse before the node v becomes infected after the node u has become infected.
  • the influence of a particular node may similarly be based on how much the other nodes are infected when the particular node is infected across all of the instances of the graph, but may change based on a current time value.
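As a rough illustration of time-dependent influence under the REL model, the sketch below counts the nodes whose infection time (shortest-path length from the seeds) falls within a deadline, averaged over instances. The hard deadline is an assumption chosen for simplicity, and the names `infected_within` and `timed_influence` are hypothetical.

```python
import heapq

def infected_within(adj, seeds, deadline):
    """Nodes whose shortest-path distance from any seed is at most `deadline`."""
    dist = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in dist]
    heapq.heapify(heap)
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v, length in adj.get(u, ()):
            nd = d + length
            # Only keep candidates that can still meet the deadline.
            if nd <= deadline and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return settled

def timed_influence(instances, seeds, deadline):
    """Average number of nodes infected by `deadline` across all instances."""
    total = sum(len(infected_within(adj, seeds, deadline)) for adj in instances)
    return total / len(instances)

# Each instance maps u -> [(v, elapsed time before u infects v), ...].
instances = [
    {"a": [("b", 1.0)], "b": [("c", 2.0)]},
    {"a": [("b", 4.0)], "b": [("c", 0.5)]},
]
print(timed_influence(instances, ["a"], deadline=3.0))  # (3 + 1) / 2 = 2.0
```

Raising the deadline can only increase the count, which matches the idea that influence depends on the current time value.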
  • a graph G with nodes V and edges E may be used to generate a set $\{G_i\}$ of graph instances.
  • a particular instance $G_i = (V, E_i, w_i)$ may be specified by an edge set $E_i$ with lengths $w_i(e) \ge 0$.
  • the influence of S over all instances $\{G_i\}$ may be defined as the average of the single instance influences, where $l$ is the total number of instances, using formula 2:
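Written out, the averaging just described takes the following form. This is a reconstruction from the prose only, assuming $\mathrm{Inf}(G_i, S)$ denotes the influence of S in the single instance $G_i$; the notation is illustrative.

```latex
\mathrm{Inf}(\{G_i\}, S) \;=\; \frac{1}{l} \sum_{i=1}^{l} \mathrm{Inf}(G_i, S)
```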
  • one method for solving the influence maximization problem described above, for a seed subset S of a specified size, is by using a greedy algorithm.
  • the algorithm starts with the empty seed set S and determines the node from the graph with the greatest influence using the formula 1 and adds the node to S.
  • the greedy algorithm determines the node from the graph that when added to S results in the greatest increase in influence for the subset (i.e., the node with the highest marginal gain in influence). The algorithm is stopped when the seed set S has the desired size.
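The greedy selection described above can be sketched as follows. This minimal version assumes binary instances given as adjacency dictionaries and evaluates influence exactly by traversal, which is the unscalable baseline rather than the sketch-based method; `reach`, `influence`, and `greedy_seed_set` are hypothetical names.

```python
def reach(adj, seeds):
    """Set of nodes reachable from `seeds` in one instance."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def influence(instances, seeds):
    """Average reachable-node count over all instances."""
    return sum(len(reach(adj, seeds)) for adj in instances) / len(instances)

def greedy_seed_set(nodes, instances, size):
    """Repeatedly add the node with the highest marginal gain in influence."""
    seeds = []
    size = min(size, len(nodes))
    while len(seeds) < size:
        base = influence(instances, seeds)
        candidates = [n for n in nodes if n not in seeds]
        best = max(candidates,
                   key=lambda n: influence(instances, seeds + [n]) - base)
        seeds.append(best)
    return seeds

nodes = ["a", "b", "c", "d"]
instances = [
    {"a": ["b", "c"]},
    {"a": ["b"], "d": ["c"]},
]
print(greedy_seed_set(nodes, instances, 2))  # ['a', 'd']
```

Every iteration re-evaluates influence for every remaining node across every instance, which illustrates why this direct form does not scale to large graphs.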
  • the environment 100 may include an influence engine 180 that estimates influence queries 145 using one or more sketches generated from a graph, rather than directly from the graph as described above.
  • the sketches may be computed in a preprocessing phase resulting in a reduction of processing resources.
  • the influence engine 180 may be in communication with a graph provider 160 and a client device 110 through a network 120 .
  • the client device 110 may include a desktop personal computer, workstation, laptop, personal digital assistant (PDA), smartphone, cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly with the network 120 .
  • the network 120 may be a variety of network types including the public switched telephone network (PSTN), a cellular telephone network, and a packet switched network (e.g., the Internet).
  • the influence engine 180 may receive a graph 165 from the graph provider 160 through the network 120 , and may generate a sketch for each node of the graph 165 .
  • a sketch of a graph is like a summary of a graph and includes some number of nodes and edges from the graph selected according to a sampling function, potentially with some associated information.
  • the generated sketches may be stored by the influence engine 180 as the sketch data 187 .
  • each generated sketch for a node may be a reachability sketch and may indicate which nodes are reachable in the graph 165 from the node by following paths from the node in the graph 165 .
  • Each reachability sketch may be a combined reachability sketch in that it is based on all instances of the graph 165 . How each reachability sketch is generated is described further with respect to FIG. 2 .
  • the influence engine 180 may generate what is referred to herein as a combined all-distances sketch for each node.
  • An all-distances sketch for a node v includes a random sample of nodes from the graph, where the inclusion probability of a node u in the sample decreases with its distance from v.
  • the combined all-distances sketch for a node may be (conceptually) a combination of the all-distances sketches generated for the node across all instances of the graph 165 .
  • the particular methods used to generate the combined all-distances sketches and combined reachability sketches are described further with respect to FIG. 2 .
  • the influence engine 180 may receive an influence query 145 from a client device 110 and may generate an estimate in response to the query using the sketches stored in the sketch data 187 .
  • the estimate may be provided by the influence engine 180 to the client device 110 as the results 186 .
  • the influence engine 180 may estimate the influence using the reachability sketches associated with each of the nodes in the subset from the sketch data 187 .
  • the influence may be estimated by estimating (using the sketches of each node in the subset) the cardinality of the union of the reachability sets of all nodes in the subset. Other methods may be used.
  • the influence of the subset may be estimated using the combined all-distances sketches associated with the nodes in the subset and one or more estimators. Other methods may be used.
  • the influence engine 180 may determine the subset of nodes of a specified size using a form of the greedy algorithm described above while estimating the influence of the nodes using either the reachability sketches or the combined all-distances sketches associated with each node of the graph 165 .
  • the greedy algorithm and its application are described further below with respect to FIG. 2 .
  • FIG. 2 is an illustration of an implementation of an influence engine 180 .
  • the influence engine 180 may include several components such as an instance engine 210 , a sketch generator 220 , and an influence estimator 225 . More or fewer components may be supported by the influence engine 180 .
  • the instance engine 210 may receive a graph 165 and may generate one or more instances based on the graph 165 .
  • the graph 165 may be received from the graph provider 160 and may include a plurality of edges and nodes. Each edge may further have an associated weight.
  • the generated instances may be stored by the instance engine 210 as the graph instance data 215 .
  • the number of instances generated from a graph 165 may be set by a user or administrator, for example.
  • the instance engine 210 may generate an instance from the graph 165 by assigning either a one or a zero to each edge in the graph 165 .
  • the one or zero may be randomly assigned to each edge using a biased coin, for example. Other methods for randomly, or pseudo randomly, assigning values may be used. Where a zero is assigned to an edge in an instance of the graph 165 , the edge may be deemed to be dead or inactive in the instance. Conversely, where a one is assigned to an edge in an instance of the graph 165 , the edge may be deemed to be live or active in the instance.
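A minimal sketch of the biased-coin instantiation might look like the following. It assumes each edge carries its own probability of being live, which the text does not spell out, and uses a seeded pseudo-random generator for repeatability; `sample_binary_instances` is a hypothetical name.

```python
import random

def sample_binary_instances(edge_probs, num_instances, seed=0):
    """Flip one biased coin per edge per instance; keep the edges that come up live."""
    rng = random.Random(seed)
    instances = []
    for _ in range(num_instances):
        # rng.random() lies in [0, 1), so p = 1.0 is always live and p = 0.0 never is.
        live = [edge for edge, p in edge_probs.items() if rng.random() < p]
        instances.append(live)
    return instances

edge_probs = {("a", "b"): 1.0, ("b", "c"): 0.0, ("a", "c"): 0.5}
for live_edges in sample_binary_instances(edge_probs, num_instances=4):
    print(live_edges)
```

Each returned list holds the live edges of one instance; edges left out are the dead (inactive) ones.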
  • the instance engine 210 may generate an instance from the graph 165 by assigning a positive value to each edge in the graph 165 .
  • the positive value assigned to an edge may be randomly selected from a distribution. Any method for randomly, or pseudo randomly, assigning values may be used.
  • the value assigned to an edge may represent a time that a first node associated with the edge may take to infect a second node associated with the edge, for example.
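For the randomized edge length case, one possible instantiation draws each edge length from an exponential distribution. The choice of distribution and the per-edge mean parameter are assumptions made for illustration only, and `sample_rel_instances` is a hypothetical name.

```python
import random

def sample_rel_instances(edge_means, num_instances, seed=0):
    """Draw an independent random length for every edge in every instance."""
    rng = random.Random(seed)
    instances = []
    for _ in range(num_instances):
        # expovariate takes the rate (1 / mean) of the exponential distribution.
        lengths = {edge: rng.expovariate(1.0 / mean)
                   for edge, mean in edge_means.items()}
        instances.append(lengths)
    return instances

edge_means = {("a", "b"): 1.0, ("b", "c"): 2.5}
for lengths in sample_rel_instances(edge_means, num_instances=3):
    print(lengths)
```

Here every edge appears in every instance, and only its length (for example, the time to infect) varies.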
  • the sketch generator 220 may, for each node in the graph 165 , generate a sketch for the node.
  • the sketch generated for each node may be a combined sketch, and may be generated for the node based on all of the instances of the graph 165 generated by the instance engine 210 .
  • the generated sketches for each node may be stored by the sketch generator 220 as the sketch data 187 .
  • each of the generated sketches may be reachability sketches.
  • the reachability sketches may be bottom-k min hash sketches, where k is the size or number of samples in the sketch. Other types of reachability sketches may be used.
  • the combined reachability set of a node u is $R_u = \{(v, i) \mid v \text{ is reachable from } u \text{ in } G_i\}$.
  • the sketch generator 220 may generate a reachability sketch by, for each node and instance pair (v, i) of a graph 165 , associating a random rank value with the pair.
  • the random rank value may be hash based.
  • the random rank value $r_u^i \sim U[0, 1]$ may be selected from the uniform distribution on [0, 1].
  • the combined reachability sketch $X_u$ for the node u may then be generated as the set of the k smallest associated rank values amongst $\{ r_v^i \mid (v, i) \in R_u \}$ (formula 3).
  • the sketch generator 220 may generate the combined reachability sketch for each node by performing sequential pruned graph searches.
  • the sketch generator 220 may rank the node instance pairs based on the assigned random rank values (from lowest to highest). Pruned searches may then be performed using the ranked node instance pairs.
  • For each node instance pair (u, i), taken in order of increasing rank, a search may be performed from u using the reversed edges of $G_i$.
  • For each node v visited by the search, if $X_v$ holds fewer than k values, the value $r_u^i$ may be added to $X_v$. Otherwise the search may be pruned at v.
  • $X_v$ may include the bottom-k combined reachability sketch of v as described above in the formula 3. Other methods for generating reachability sketches may be used.
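The sequential pruned searches can be sketched roughly as below. This follows the prose description under some assumptions: instances are adjacency dictionaries of live edges, ranks are drawn uniformly from [0, 1), and a search is pruned at any node whose sketch already holds k values. The name `combined_reachability_sketches` is hypothetical.

```python
import random
from collections import deque

def combined_reachability_sketches(nodes, instances, k, seed=0):
    """Return ({node: ranks kept for it}, {(node, instance): rank})."""
    rng = random.Random(seed)
    # Build the reversed adjacency of every instance once.
    reverse = []
    for adj in instances:
        rev = {}
        for u, outs in adj.items():
            for v in outs:
                rev.setdefault(v, []).append(u)
        reverse.append(rev)
    ranks = {(u, i): rng.random()
             for i in range(len(instances)) for u in nodes}
    sketches = {v: [] for v in nodes}
    # Process node-instance pairs from lowest to highest rank.
    for (u, i), r in sorted(ranks.items(), key=lambda item: item[1]):
        seen, queue = {u}, deque([u])
        while queue:
            v = queue.popleft()
            if len(sketches[v]) >= k:
                continue  # sketch of v is full: prune the search here
            sketches[v].append(r)  # v reaches u in instance i
            for w in reverse[i].get(v, ()):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return sketches, ranks

nodes = ["a", "b", "c"]
instances = [{"a": ["b"], "b": ["c"]}, {"a": ["c"]}]
sketches, ranks = combined_reachability_sketches(nodes, instances, k=2)
print(sketches)
```

Because pairs are processed by increasing rank, each sketch list is already sorted, and its last element is the largest rank it retained.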
  • each of the generated sketches may be all-distances sketches, and may be used to generate a combined all-distances sketch for each node across all instances of the graph 165 .
  • An all-distances sketch for a node v includes a random sample of nodes from the graph, where the inclusion probability of a node u in the sample decreases with its distance from v.
  • the combined all-distances sketch for a node is a combination of all of the all-distances sketches generated for the node for each instance of the graph 165 .
  • the sketch generator 220 may generate a combined all-distances sketch cADS(u) for a node u by, for each node and instance pair (u, i) for a graph 165 , associating a random rank value with the pair similarly as described above for the reachability sketches.
  • the sketch generator 220 may rank the node instance pairs based on the assigned random rank values. Pruned Dijkstra searches may be iteratively performed by the sketch generator 220 using the ranked node instance pairs by increasing rank $r_u^i$ using the reversed edges of $G_i$.
  • For each node v visited by a search from u at distance $d_{vu}^i$, a determination is made as to whether there is an entry $(x, y) \in \text{cADS}(v)$ where $y \le d_{vu}^i$.
  • Here cADS(v) is the combined all-distances sketch of v. If so, the Dijkstra search is pruned at v. Otherwise, cADS(v) is updated to include $(r_u^i, d_{vu}^i)$.
  • Other methods for generating combined all-distances sketches may be used.
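A rough sketch of the pruned Dijkstra construction is given below. It assumes instances are adjacency dictionaries with edge lengths, and it generalizes the "there is an entry" test to "at least k entries at distance no greater than the current one", so that k = 1 corresponds to the description; that generalization and the name `combined_ads` are assumptions.

```python
import heapq
import random

def combined_ads(nodes, instances, k=1, seed=0):
    """Return ({node: [(rank, distance), ...]}, {(node, instance): rank})."""
    rng = random.Random(seed)
    reverse = []
    for adj in instances:
        rev = {}
        for u, outs in adj.items():
            for v, length in outs:
                rev.setdefault(v, []).append((u, length))
        reverse.append(rev)
    ranks = {(u, i): rng.random()
             for i in range(len(instances)) for u in nodes}
    cads = {v: [] for v in nodes}
    # Run one reverse Dijkstra per node-instance pair, by increasing rank.
    for (u, i), r in sorted(ranks.items(), key=lambda item: item[1]):
        heap, settled = [(0.0, u)], set()
        while heap:
            d, v = heapq.heappop(heap)
            if v in settled:
                continue
            settled.add(v)
            # Entries already in cADS(v) have lower rank; count the ones at least as close.
            closer = sum(1 for _, y in cads[v] if y <= d)
            if closer >= k:
                continue  # prune the search at v
            cads[v].append((r, d))
            for w, length in reverse[i].get(v, ()):
                if w not in settled:
                    heapq.heappush(heap, (d + length, w))
    return cads, ranks

nodes = ["a", "b", "c"]
instances = [{"a": [("b", 1.0)], "b": [("c", 2.0)]}, {"a": [("c", 0.5)]}]
cads, ranks = combined_ads(nodes, instances, k=1)
print(cads["a"])
```

With k = 1, each later entry in a node's sketch has a strictly smaller distance than the entries before it, which reflects the idea that closer nodes are more likely to be sampled.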
  • the influence estimator 225 may generate results 186 in response to a received influence query 145 . How the influence estimator 225 generates the results 186 may depend both on the type of influence query 145 (i.e., whether the query is for an estimation of the influence of a subset of the nodes of the graph 165 or to determine a subset of nodes of the graph 165 of a specified size that have a maximum influence) and whether or not the generated instances of the graph 165 are generated using binary IC or REL IC.
  • the influence estimator 225 may determine the influence of a subset of nodes S identified by the influence query 145 by estimating the cardinality of the union $\bigcup_{u \in S} R_u$ from the combined reachability sketches $X_u$ for all nodes u in the subset S.
  • the influence estimator 225 may compute a threshold rank $t_u$ of each node u using formula 4 (where $k^{\text{th}}$ indicates the k-th smallest element of the set): $t_u = k^{\text{th}} \{ r_v^i \mid (v, i) \in R_u \}$ (4)
  • the influence estimator 225 may estimate the cardinality
  • the influence estimator 225 may further estimate the cardinality of the union $\bigcup_{u \in S} R_u$ using the bottom-k sketches of each set $R_u$ for $u \in S$.
  • the cardinality of the union $\bigcup_{u \in S} R_u$ may be estimated by the influence estimator using formula 5:
  • $$\sum_{z \in \bigcup_{v \in S} X_v \setminus \{t_v\}} \frac{1}{\max_{u \in S \mid z \in X_u \setminus \{t_u\}} t_u} \qquad (5)$$
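The union estimate can be sketched in a few lines. The version below assumes each sketch is a list of rank values, takes the threshold of a full sketch to be its k-th smallest rank, and, as a further assumption, uses a threshold of 1.0 for sketches holding fewer than k values (so that such sketches are counted exactly). The name `estimate_union_cardinality` is hypothetical.

```python
def estimate_union_cardinality(sketches, k):
    """Estimate the size of the union of the sets summarized by bottom-k sketches."""
    best = {}  # rank z -> largest threshold among sketches holding z below their threshold
    for sketch in sketches:
        ordered = sorted(sketch)
        if len(ordered) >= k:
            threshold, below = ordered[k - 1], ordered[:k - 1]
        else:
            threshold, below = 1.0, ordered  # assumption for sketches that are not full
        for z in below:
            best[z] = max(best.get(z, 0.0), threshold)
    # Each retained rank contributes the inverse of its inclusion threshold.
    return sum(1.0 / t for t in best.values())

x_a = [0.1, 0.2, 0.4]
x_b = [0.1, 0.3, 0.5]
print(estimate_union_cardinality([x_a], k=3))       # 2 / 0.4 = 5.0
print(estimate_union_cardinality([x_a, x_b], k=3))  # 1/0.5 + 1/0.4 + 1/0.5, about 6.5
```

The estimate counts node instance pairs, so dividing it by the number of instances would give an average per-instance figure in line with the definition of influence as an average over instances.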
  • the influence estimator 225 may determine the subset S using the greedy algorithm described above. However, rather than compute the actual influence using the formula 2, the influence estimator 225 may use the combined reachability sketches and may estimate the influence of the subset using the formula 5.
  • the influence estimator 225 may apply the greedy algorithm by first creating a priority queue for each node u in the graph 165 . Initially, the nodes in the graph may be ordered based on their estimated influence or
  • the node u with the highest priority may be added to the subset S by the influence estimator 225 .
  • the node u at the top of the priority queue is retrieved. If its freshness value indicates that it has not been evaluated for this iteration, the marginal gain of adding u to S is estimated using the formula 5 above. If the estimated gain is less than a previously estimated maximum gain, then u is added back into the priority queue with the updated maximum gain value and freshness value. If the estimated gain is greater than the previously estimated maximum gain, then u is added to S, and the algorithm is repeated until S is full.
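The priority queue with freshness values resembles a common lazy-evaluation form of the greedy algorithm, sketched below. It may differ in detail from the procedure described above: here a node is accepted as soon as its gain at the top of the queue is fresh for the current iteration. The `marginal_gain` callback, the set-cover example, and the name `lazy_greedy` are illustrative assumptions; in the described system the gain would come from the sketch-based estimate.

```python
import heapq

def lazy_greedy(nodes, marginal_gain, size):
    """Pick `size` nodes, re-evaluating a node's gain only when it reaches the top."""
    seeds = []
    # heapq is a min-heap, so gains are negated; the third field is the freshness value.
    heap = [(-marginal_gain(n, seeds), n, 0) for n in nodes]
    heapq.heapify(heap)
    while heap and len(seeds) < size:
        neg_gain, node, freshness = heapq.heappop(heap)
        if freshness == len(seeds):
            seeds.append(node)  # gain is up to date and is the largest in the queue
        else:
            # Stale entry: recompute against the current seed set and requeue.
            heapq.heappush(heap, (-marginal_gain(node, seeds), node, len(seeds)))
    return seeds

cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage_gain(node, seeds):
    covered = set().union(*(cover[s] for s in seeds)) if seeds else set()
    return len(cover[node] - covered)

print(lazy_greedy(list(cover), coverage_gain, 2))  # ['c', 'a']
```

The shortcut relies on marginal gains never increasing as the seed set grows, so a stale gain is an upper bound on the fresh one.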
  • the influence estimator 225 may determine the influence of a subset of nodes S identified by the influence query 145 using an estimator according to formula 6 where $\alpha$ is any non-increasing function:
  • For a subset of nodes S with only one node u, a node always influences itself, so the influence of that node is one.
  • a historic inverse probability estimator may be used to estimate the influence contribution of nodes that are a positive distance from u.
  • the influence estimator 225 may create what is referred to as a union all-distances sketch.
  • the influence estimator 225 may generate the union all-distances sketch from each of the combined all-distances sketches from the nodes in the subset of nodes S. For example, the influence estimator 225 may take the k smallest ranks from the combined all-distances sketches for each instance.
  • the influence estimator 225 may then estimate the influence of the nodes in the subset of nodes S by applying the estimator of formula 6 to entries in the union all-distances sketch.
  • the influence estimator 225 may determine the subset S using the greedy algorithm similarly as described above for the binary IC model. However, rather than estimate the influence of the subset at each iteration of the algorithm using the formula 5, the influence estimator may estimate the influence using the estimator of formula 6.
  • FIG. 3 is an operational flow of an implementation of a method 300 for estimating the results of an influence query for a graph.
  • the method 300 may be implemented by the influence engine 180 , for example.
  • a graph is received at 301 .
  • the graph 165 may be received by the influence engine 180 from the graph provider 160 .
  • the graph 165 may include a plurality of nodes and a plurality of edges.
  • the graph may be weighted or unweighted, and may be directed or undirected.
  • a sketch is computed at 303 .
  • Each sketch may be computed by the sketch generator 220 .
  • each sketch may be a combined sketch and may be a combination of the sketches generated for the node across all instances of the received graph 165 .
  • the sketch generated for each node may be a combined reachability sketch.
  • the sketch generated for each node may be a combined all-distances sketch.
  • Other types of sketches may be used.
  • the computed sketches may be stored by the sketch generator 220 as the sketch data 187 .
  • the influence query 145 may be received by the influence engine 180 from the client device 110 .
  • the influence query 145 may be a request to estimate the influence of a subset of nodes of the graph 165 , or may be a request to estimate the subset of nodes of the graph 165 of a specified size with a maximum influence.
  • the influence query 145 may include identifiers of the one or more nodes in the subset.
  • Where the influence query is a request to estimate the subset of nodes of the graph 165 of a specified size with a maximum influence, the influence query 145 may include an indicator of the specified size.
  • One or more sketches are retrieved based on the influence query at 307 .
  • the one or more sketches may be retrieved from the sketch data 187 by the influence estimator 225 .
  • the influence estimator 225 may retrieve the sketches associated with the one or more nodes of the subset.
  • the influence estimator 225 may retrieve the sketches associated with every node of the graph 165 .
  • Results are determined in response to the influence query based on the retrieved one or more sketches at 309 .
  • the results 186 may be estimates and may be determined by the influence estimator 225 . How the results 186 are estimated may depend on the type of query 145 and the IC model used to generate the graph instances.
  • the influence estimator 225 may estimate the influence of a subset of nodes by estimating the cardinality of the union of the reachability sets of the nodes in the subset, using the reachability sketch associated with each node.
  • the influence estimator 225 may estimate the influence of a subset of nodes by applying an estimator, such as a historic inverse probability estimator, to the sketches associated with the nodes in the subset of nodes.
  • the influence estimator 225 may determine the subset using a greedy algorithm.
  • the generated results are provided in response to the influence query at 311 .
  • the generated results 186 may be provided to the client device 110 that originated the influence query 145 by the influence engine 180 .
  • the influence engine 180 may return to 305 where a new influence query 145 may be received from a client device 110 .
  • FIG. 4 is an operational flow of an implementation of a method 400 for determining a subset of nodes from a graph of a specified size that maximizes an influence of the nodes in the subset of nodes.
  • the method 400 may be implemented by the influence engine 180 , for example.
  • a plurality of sketches is received at 401 .
  • the plurality of sketches may be received from the sketch data 187 by the influence estimator 225 .
  • Each sketch of the plurality of sketches may be associated with a node of a graph 165 .
  • the sketches may be combined sketches across all instances of the graph 165 .
  • the combined sketches may be combined reachability sketches or may be combined all-distances sketches.
  • the influence query 145 may be received by the influence engine 180 from a client device 110 .
  • the influence query 145 may specify a size and may be a query for a subset of nodes from the graph 165 of the specified size that maximizes the influence of the nodes in the subset.
  • a node of the plurality of nodes with a greatest estimated marginal gain in influence is determined at 405 .
  • the node with the greatest estimated marginal gain may be determined by the influence estimator 225 using the sketches associated with each node in the graph 165 .
  • the influence estimator 225 may determine the node with the greatest marginal gain by, for each node of the graph 165 that is not already in the subset, estimating the influence of the nodes already in the subset when the node being considered is included. The node that results in the largest increase in the estimated influence for the subset of nodes may then be determined to be the node with the greatest estimated marginal gain. Other methods may be used.
  • the determined node is added to the subset at 407 .
  • the determined node may be added to the subset of nodes by the influence estimator 225 .
  • a determination is made as to whether the subset of nodes is the specified size. If the subset of nodes is the specified size, then the method 400 may continue at 411 . Otherwise, the method 400 may return to 405 where another node is selected.
  • the influence estimator 225 may recompute the received sketches before or after adding the node to the subset of nodes. As nodes are added to the subset of nodes, the sketches become less accurate. Accordingly, the sketches may be recomputed to account for the nodes already added to the subset of nodes. The recomputation may be triggered by the calculated marginal gain falling below a threshold value, or after some number of nodes have been added to the subset, for example.
  • the determined subset of nodes is provided at 411 .
  • the determined subset of nodes may be provided by the influence engine in response to the influence query 145 as the results 186 .
  • FIG. 5 shows an exemplary computing environment in which example implementations and aspects may be implemented.
  • the computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.
  • Examples of computing systems that may be suitable include personal computers (PCs), server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • Computer-executable instructions such as program modules, being executed by a computer may be used.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium.
  • program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • an exemplary system for implementing aspects described herein includes a computing device, such as computing device 500 .
  • computing device 500 typically includes at least one processing unit 502 and memory 504 .
  • memory 504 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two.
  • This most basic configuration is illustrated in FIG. 5 by dashed line 506 .
  • Computing device 500 may have additional features/functionality.
  • computing device 500 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape.
  • additional storage is illustrated in FIG. 5 by removable storage 508 and non-removable storage 510 .
  • Computing device 500 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by computing device 500 and include both volatile and non-volatile media, and removable and non-removable media.
  • Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Memory 504 , removable storage 508 , and non-removable storage 510 are all examples of computer storage media.
  • Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media may be part of computing device 500.
  • Computing device 500 may contain communication connection(s) 512 that allow the device to communicate with other devices.
  • Computing device 500 may also have input device(s) 514 such as a keyboard, mouse, pen, voice input device, touch input device, etc.
  • Output device(s) 516 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
  • Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include PCs, network servers, and handheld devices, for example.

Abstract

A graph that includes multiple nodes and edges is received. Multiple instances of the graph are generated by randomly instantiating the edges according to either a binary independent cascade model or a randomized edge length independent cascade model. Where the binary independent cascade model is used, combined reachability sketches are generated for each node across all instances of the graph. Where the randomized edge length independent cascade model is used, combined all-distances sketches are generated for each node across all instances of the graph. Depending on which model is used, the combined reachability or all-distances sketches are used to estimate the influence of nodes in the graph or to estimate a subset of nodes from a graph of a specified size with a maximum influence using a greedy algorithm.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of U.S. application Ser. No. 14/290,209, filed on May 29, 2014, the disclosure of which is hereby incorporated in its entirety.
  • BACKGROUND
  • Propagation of contagion is a fundamental process in social, biological, and physical networks. Graphs can be used to model a network, and propagation of contagion can be used to model the spread of information, influence, or a viral infection with respect to the nodes of the graph. Diffusion patterns in the graph can be specified by a probabilistic model, such as independent cascade (IC), or captured by a set of representative traces.
  • Basic computational problems in the study of diffusion are influence queries. These queries include determining the influence of a specified seed set of nodes in a graph, and identifying the most influential seed set of a given size in the graph (i.e., influence maximization). Answering an influence query may involve edge traversals in hundreds of graph instances, and may not scale well for very large graphs. Influence maximization is hard even to approximate. Both in theory and practice, the standard is the greedy algorithm, which iteratively selects a node which maximizes a marginal gain in influence and adds it to the seed set. However, the greedy algorithm does not scale well for graphs with more than a few million edges.
  • SUMMARY
  • A graph that includes multiple nodes and edges is received. Multiple instances of the graph are generated by randomly instantiating the edges according to, for example, a binary independent cascade model or a randomized edge length independent cascade model. Where the binary independent cascade model is used, combined reachability sketches are generated for each node across all instances of the graph. Where the randomized edge length independent cascade model is used, combined all-distances sketches are generated for each node across all instances of the graph. Depending on which model is used, the combined reachability or all-distances sketches are used to estimate the influence of nodes in the graph or to estimate a subset of nodes from a graph of a specified size with a maximum influence using a greedy algorithm.
  • In an implementation, a graph is received by a computing device. The graph includes nodes and edges. For each node of the graph, a sketch is computed by the computing device. The sketch may be either a reachability sketch or an all-distances sketch. An influence query is received by the computing device. The influence query may be a query for an estimate of the influence of a subset of nodes or for an estimate of a subset of nodes of a specified size with a maximum combined influence. A result is determined in response to the influence query using one or more of the computed sketches by the computing device. The determined result is provided in response to the influence query by the computing device.
  • In an implementation, sketches are received by a computing device. Each sketch is associated with a plurality of nodes of a graph. Each sketch may be one or more of a reachability sketch or an all-distances sketch. An influence query is received for a subset of the nodes of a specified size having a maximum influence by the computing device. A first node of the plurality of nodes that when added to the subset of the nodes increases an influence of the subset of the nodes by the greatest amount is determined using the sketch associated with the first node and the sketches associated with the nodes in the subset of the nodes by the computing device. The determined first node is added to the subset of the nodes by the computing device. That the subset of the nodes is of the specified size is determined by the computing device. In response to the determination, the subset of the nodes is provided by the computing device.
  • This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing summary, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the embodiments, there are shown in the drawings example constructions of the embodiments; however, the embodiments are not limited to the specific methods and instrumentalities disclosed. In the drawings:
  • FIG. 1 shows an environment for answering influence queries;
  • FIG. 2 is an illustration of an implementation of an influence engine;
  • FIG. 3 is an operational flow of an implementation of a method for estimating the results of an influence query for a graph;
  • FIG. 4 is an operational flow of an implementation of a method for determining a subset of nodes from a graph of a specified size that maximizes an influence of the nodes in the subset of nodes; and
  • FIG. 5 shows an exemplary computing environment.
  • DETAILED DESCRIPTION
  • FIG. 1 shows an environment 100 for answering influence queries 145 on graphs. The graphs may include a plurality of nodes and edges and may include both directed and undirected graphs. In addition, the graphs may be weighted or unweighted. A graph may represent a variety of entities and structures such as a social network, the Internet, populations of humans or animals, and cities, for example.
  • An example of an influence query 145 includes a request to determine the influence of a subset of nodes S from a graph G. The influence query 145 may include identifiers of the one or more nodes in the subset. Another example of an influence query 145 may be to identify a subset of nodes of a particular size that includes the nodes from the graph G with the highest combined influence. The influence query 145 may include an indicator of the desired size of the subset of nodes. This type of query 145 is known as influence maximization.
  • The influence of a node is a measure of how connected a particular node in the graph is to the other nodes of a graph. Identifying nodes with high influence can have many uses in a variety of fields. For example, for social networking, identifying users with high influence can be used for marketing purposes to determine which users to give a free product to in order to maximize the exposure of the product. As another example, for public health, the influence of users can be used to model how a disease may be spread, or to identify which users to target for vaccination.
  • One model for the diffusion (or contagion) of information in graphs based on influence is known as independent cascade (IC) in which an independent random variable is assigned to each edge (u, v) of a graph G to model the influence of the node u on the node v. A single instance of the graph may be created by instantiating the random variables for each edge, and the influence of a particular node may be determined across many of these graph instances.
  • One version of the IC model is known as binary IC. In binary IC, the random variable assigned to each edge is binary and may be one or zero. The assigned variable represents whether or not the particular edge is live or null. A live edge (u, v) means that once a node u is infected, so is the node v. In such a model, the influence of a particular node can be based on the number of nodes that are infected when the particular node is infected across all of the instances of the graph. As may be appreciated, where the graphs are undirected graphs, the variable assigned to the (directed) edge (u, v) and the edge (v, u) may be different.
  • Another version of the IC model is known as randomized edge length (REL) IC. Unlike binary IC where the assigned variables are limited to one or zero, in REL IC each edge may be randomly assigned any non-negative value. Each value may represent a quantity such as time. For example, the value assigned to an edge (u, v) may represent how much time may elapse before the node v becomes infected after the node u has become infected. In such a model, the influence of a particular node may similarly be based on how much the other nodes are infected when the particular node is infected across all of the instances of the graph, but may change based on a current time value.
  • In one version of the IC model, a graph G with nodes V and edges E may be used to generate a set $\{G_i\}$ of graph instances. A particular instance $G_i = (V, E_i, w_i)$ may be specified by an edge set $E_i$ with lengths $w_i(e) \geq 0$. The influence of a subset of nodes S in a particular instance $G_i$ may be defined using formula 1, where $d_{Su}^{i} = \min_{v \in S} d_{vu}^{i}$ is the distance in instance i from S to u and α is a non-increasing function:

  • $$\operatorname{inf}(G_i, S) = \sum_{u \in V} \alpha\bigl(d_{Su}^{i}\bigr) \qquad (1)$$
  • The influence of S over all instances {Gi} may be defined as the average of the single instance influences where l is the total number of instances using formula 2:
  • $$\operatorname{inf}(\{G_i\}, S) = \frac{1}{l} \sum_{i=1}^{l} \operatorname{inf}(G_i, S) \qquad (2)$$
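  • As an illustration of formulas 1 and 2, the following minimal sketch computes the exact influence of a seed set by running a multi-source Dijkstra search in each instance and averaging the results. The function names, the dictionary-based instance format, and the threshold kernel used for α are assumptions made for this example and do not come from the description.

```python
import heapq
import math

def distances_from_set(nodes, weighted_edges, seeds):
    """Multi-source Dijkstra: dist[u] = min over v in seeds of d(v, u)."""
    adj = {u: [] for u in nodes}
    for (u, v), w in weighted_edges.items():
        adj[u].append((v, w))
    dist = {u: math.inf for u in nodes}
    heap = []
    for s in seeds:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def influence_single(nodes, weighted_edges, seeds, alpha):
    """Formula 1: sum of alpha(d_Su) over all nodes u of one instance."""
    dist = distances_from_set(nodes, weighted_edges, seeds)
    return sum(alpha(dist[u]) for u in nodes)

def influence_average(nodes, instances, seeds, alpha):
    """Formula 2: average of the single-instance influences."""
    total = sum(influence_single(nodes, inst, seeds, alpha) for inst in instances)
    return total / len(instances)

# Hypothetical toy data: two instances given as {(u, v): length}.
nodes = ["a", "b", "c"]
instances = [{("a", "b"): 1.0, ("b", "c"): 1.0}, {("a", "b"): 5.0}]
alpha = lambda d: 1.0 if d <= 2.0 else 0.0  # assumed non-increasing kernel
print(influence_average(nodes, instances, ["a"], alpha))
```

  • With a threshold kernel, unreachable nodes (distance infinity) contribute zero, so the binary IC case can be expressed in the same form by giving every live edge length zero.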
  • Using the formulas 1 and 2, one method for solving the influence maximization problem described above for a seed set S of a specified size is a greedy algorithm. For a first iteration, the algorithm starts with the empty seed set S, determines the node from the graph with the greatest influence using the formulas 1 and 2, and adds the node to S. At each subsequent iteration, the greedy algorithm determines the node from the graph that when added to S results in the greatest increase in influence for the subset (i.e., the node with the highest marginal gain in influence). The algorithm is stopped when the seed set S has the desired size.
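  • The greedy procedure can be sketched as follows for the binary IC case, where the influence of a seed set in one instance is the number of nodes reachable from it. This is a minimal illustration under assumed names and an assumed adjacency-dictionary format; it recomputes the exact influence at every step, which is the cost the sketch-based approach is meant to avoid.

```python
from collections import deque

def reachable(adj, seeds):
    """Nodes reachable from any seed by following live edges (BFS)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def influence(instances, seeds):
    """Average number of reachable nodes over all instances."""
    return sum(len(reachable(adj, seeds)) for adj in instances) / len(instances)

def greedy_seed_set(nodes, instances, size):
    """Repeatedly add the node with the highest marginal gain in influence."""
    seeds = []
    for _ in range(size):
        candidates = (u for u in nodes if u not in seeds)
        best = max(candidates, key=lambda u: influence(instances, seeds + [u]))
        seeds.append(best)
    return seeds

# Hypothetical toy data: each instance maps a node to its live out-neighbors.
nodes = ["a", "b", "c", "d"]
instances = [{"a": ["b"], "c": ["d"]}, {"a": ["b", "c"]}]
print(greedy_seed_set(nodes, instances, 2))
```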
  • For both binary IC and REL IC, determining the influence of a particular node or subset of nodes and solving the influence maximization problem using the greedy algorithm may be computationally expensive for very large graphs. Accordingly, the environment 100 may include an influence engine 180 that estimates influence queries 145 using one or more sketches generated from a graph, rather than directly from the graph as described above. The sketches may be computed in a preprocessing phase resulting in a reduction of processing resources.
  • The influence engine 180 may be in communication with a graph provider 160 and a client device 110 through a network 120. The client device 110 may include a desktop personal computer, workstation, laptop, personal digital assistant (PDA), smartphone, cell phone, or any WAP-enabled device or any other computing device capable of interfacing directly or indirectly with the network 120. The network 120 may be a variety of network types including the public switched telephone network (PSTN), a cellular telephone network, and a packet switched network (e.g., the Internet). The graph provider 160, the influence engine 180, and the client device 110 may be implemented together or separately using one or more computing devices such as the computing device 500 illustrated with respect to FIG. 5.
  • The influence engine 180 may receive a graph 165 from the graph provider 160 through the network 120, and may generate a sketch for each node of the graph 165. A sketch of a graph is like a summary of a graph and includes some number of nodes and edges from the graph selected according to a sampling function, potentially with some associated information. The generated sketches may be stored by the influence engine 180 as the sketch data 187.
  • Depending on the implementation, for binary IC, each generated sketch for a node may be a reachability sketch and may indicate which nodes are reachable in the graph 165 from the node by following paths from the node in the graph 165. Each reachability sketch may be a combined reachability sketch in that it is based on all instances of the graph 165. How each reachability sketch is generated is described further with respect to FIG. 2.
  • For REL IC, rather than generate reachability sketches, the influence engine 180 may generate what is referred to herein as a combined all-distances sketch for each node. An all-distances sketch for a node v includes a random sample of nodes from the graph, where the inclusion probability of a node u in the sample decreases with its distance from v. The combined all-distances sketch for a node may be (conceptually) a combination of the all-distances sketches generated for the node across all instances of the graph 165. The particular methods used to generate the combined all-distances sketches and combined reachability sketches are described further with respect to FIG. 2.
  • The influence engine 180 may receive an influence query 145 from a client device 110 and may generate an estimate in response to the query using the sketches stored in the sketch data 187. The estimate may be provided by the influence engine 180 to the client device 110 as the results 186.
  • How the influence engine 180 determines the estimate may depend on both the type of the influence query 145, as well as whether the influence is being determined based on a binary IC model or a REL IC model. Where the influence query 145 is a request to estimate the influence of a subset of nodes in the graph 165 and the instances of the graph 165 are based on the binary IC model, the influence engine 180 may estimate the influence using the reachability sketches associated with each of the nodes in the subset from the sketch data 187. The influence may be estimated by estimating (using the sketches of each node in the subset) the cardinality of the union of the reachability sets of all nodes in the subset. Other methods may be used.
  • Where the instances of the graph are based on the REL IC model, the influence of the subset may be estimated using the combined all-distances sketches associated with the nodes in the subset and one or more estimators. Other methods may be used.
  • Where the influence query 145 is an influence maximization request for a graph 165 with instances generated using either the binary IC model or the REL IC model, the influence engine 180 may determine the subset of nodes of a specified size using a form of the greedy algorithm described above while estimating the influence of the nodes using either the reachability sketches or the combined all-distances sketches associated with each node of the graph 165. The greedy algorithm and its application are described further below with respect to FIG. 2.
  • FIG. 2 is an illustration of an implementation of an influence engine 180. As shown, the influence engine 180 may include several components such as an instance engine 210, a sketch generator 220, and an influence estimator 225. More or fewer components may be supported by the influence engine 180.
  • The instance engine 210 may receive a graph 165 and may generate one or more instances based on the graph 165. The graph 165 may be received from the graph provider 160 and may include a plurality of edges and nodes. Each edge may further have an associated weight. The generated instances may be stored by the instance engine 210 as the graph instance data 215. The number of instances generated from a graph 165 may be set by a user or administrator, for example.
  • Where the binary IC model is used, the instance engine 210 may generate an instance from the graph 165 by assigning either a one or a zero to each edge in the graph 165. The one or zero may be randomly assigned to each edge using a biased coin, for example. Other methods for randomly, or pseudo randomly, assigning values may be used. Where a zero is assigned to an edge in an instance of the graph 165, the edge may be deemed to be dead or inactive in the instance. Conversely, where a one is assigned to an edge in an instance of the graph 165, the edge may be deemed to be live or active in the instance.
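  • A biased coin flip per edge can be sketched as below. The edge format (u, v, p), with p the assumed probability that the edge is live, and the function name are illustrative assumptions; the description does not specify how the bias is chosen.

```python
import random

def sample_binary_ic_instance(edges, rng):
    """Return the live edges of one binary IC instance.

    edges: iterable of (u, v, p), where p is the assumed probability
    that the directed edge (u, v) is assigned a one (live).
    """
    return [(u, v) for (u, v, p) in edges if rng.random() < p]

# Hypothetical toy graph and a fixed seed for repeatability.
rng = random.Random(0)
edges = [("a", "b", 0.5), ("b", "c", 0.5), ("a", "c", 0.1)]
instances = [sample_binary_ic_instance(edges, rng) for _ in range(4)]
print(instances)
```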
  • Where the REL IC model is used, the instance engine 210 may generate an instance from the graph 165 by assigning a positive value to each edge in the graph 165. The positive value assigned to an edge may be randomly selected from a distribution. Any method for randomly, or pseudo randomly, assigning values may be used. The value assigned to an edge may represent a time that a first node associated with the edge may take to infect a second node associated with the edge, for example.
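  • For the REL IC case, one possible instantiation draws each edge length from an exponential distribution. The choice of distribution, the per-edge rate parameter, and the function name are assumptions for this example only; the description leaves the distribution open.

```python
import random

def sample_rel_ic_instance(edges, rng):
    """Return {(u, v): length} for one REL IC instance.

    edges: iterable of (u, v, rate); each length is drawn from an
    exponential distribution with the given (assumed) rate.
    """
    return {(u, v): rng.expovariate(rate) for (u, v, rate) in edges}

# Hypothetical toy graph and a fixed seed for repeatability.
rng = random.Random(0)
edges = [("a", "b", 1.0), ("b", "c", 2.0)]
lengths = sample_rel_ic_instance(edges, rng)
print(lengths)
```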
  • The sketch generator 220 may, for each node in the graph 165, generate a sketch for the node. The sketch generated for each node may be a combined sketch, and may be generated for the node based on all of the instances of the graph 165 generated by the instance engine 210. The generated sketches for each node may be stored by the sketch generator 220 as the sketch data 187.
  • Where the instances of the graph 165 are based on the binary IC model, each of the generated sketches may be reachability sketches. The reachability sketches may be bottom-k min hash sketches, where k is the size or number of samples in the sketch. Other types of reachability sketches may be used.
  • For a node u ∈ $G_i$, the reachability set $R_u^{i}$ (i.e., all nodes in the instance i that are reachable from the node u) is defined as $R_u^{i} = \{v \mid u \leadsto v \text{ in } G_i\}$, where $u \leadsto v$ means that v is reachable from u. When combining reachability sets across all instances of the graph 165, the combined reachability set is $R_u = \{(v, i) \mid u \leadsto v \text{ in } G_i\}$.
  • The sketch generator 220 may generate a reachability sketch by, for each node and instance pair (v, i) of a graph 165, associating a random rank value with the pair. Depending on the implementation, the random rank value may be hash based. The random rank value $r_v^{i} \sim U[0, 1]$ may be selected from the uniform distribution on [0, 1].
  • The combined reachability sketch $X_u$ for the node u may then be generated from the set of the k smallest associated rank values amongst $\{r_v^{i} \mid (v, i) \in R_u\}$ by the sketch generator 220 according to formula 3, where Bottom-k of a set is the subset consisting of the k smallest associated rank values:

  • $$X_u = \text{Bottom-}k\,\{r_v^{i} \mid v \in R_u^{i}\} \qquad (3)$$
  • In some implementations, the sketch generator 220 may generate the combined reachability sketch for each node by performing sequential pruned graph searches. The sketch generator 220 may rank the node instance pairs based on the assigned random rank values (from lowest to highest). Pruned searches may then be performed using the ranked node instance pairs.
  • For a node instance pair (u, i), a search may be performed from u using the reversed edges of $G_i$. When a new node v is visited and the size of its current sketch $X_v$ is smaller than k, the value $r_u^{i}$ may be added to $X_v$. Otherwise, the search may be pruned at v. Eventually, $X_v$ may include the bottom-k combined reachability sketch of v as described above in the formula 3. Other methods for generating reachability sketches may be used.
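  • The pruned reverse searches can be sketched as below. The code follows the procedure as worded above (process pairs by increasing rank, search backwards, stop at nodes whose sketch already holds k ranks). The data layout and function name are assumptions, and the code is an illustration of the described steps rather than a verified reproduction; in particular, it does not check whether pruning always yields the exact bottom-k set of formula 3 when several instances are combined.

```python
import random
from collections import deque

def combined_reachability_sketches(nodes, instances, k, rng):
    """instances: list of {u: [v, ...]} dicts of live edges.

    Returns (sketches, ranks), where sketches[v] is a list of at most
    k rank values in increasing order and ranks[(v, i)] is the random
    rank assigned to each node instance pair.
    """
    # Build the reversed adjacency lists of every instance.
    reverse = []
    for adj in instances:
        rev = {}
        for u, outs in adj.items():
            for v in outs:
                rev.setdefault(v, []).append(u)
        reverse.append(rev)

    ranks = {(v, i): rng.random()
             for i in range(len(instances)) for v in nodes}
    sketches = {v: [] for v in nodes}

    # Process node instance pairs from lowest to highest rank.
    for (u, i), r in sorted(ranks.items(), key=lambda item: item[1]):
        seen = {u}
        queue = deque([u])
        while queue:
            v = queue.popleft()
            if len(sketches[v]) >= k:
                continue  # sketch is full: prune the search at v
            sketches[v].append(r)
            for w in reverse[i].get(v, ()):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return sketches, ranks

# Hypothetical toy data.
nodes = ["a", "b", "c"]
instances = [{"a": ["b"], "b": ["c"]}, {"a": ["c"]}]
sketches, ranks = combined_reachability_sketches(nodes, instances, 2, random.Random(0))
print(sketches)
```

  • Because pairs are processed in increasing rank order, each sketch is built already sorted, and the k-th entry of a full sketch can be read directly as its threshold rank.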
  • Where the graph 165 is based on the REL IC model, each of the generated sketches may be all-distances sketches, and may be used to generate a combined all-distances sketch for each node across all instances of the graph 165. An all-distances sketch for a node v includes a random sample of nodes from the graph, where the inclusion probability of a node u in the sample decreases with its distance from v. The combined all-distances sketch for a node is a combination of all of the all-distances sketches generated for the node for each instance of the graph 165.
  • The sketch generator 220 may generate a combined all-distances sketch cADS(u) for a node u by, for each node and instance pair (u, i) for a graph 165, associating a random rank value with the pair similarly as described above for the reachability sketches.
  • The sketch generator 220 may rank the node instance pairs based on the assigned random rank values. Pruned Dijkstra searches may be iteratively performed by the sketch generator 220 using the ranked node instance pairs (u, i) by increasing rank $r_u^{i}$ using the reversed edges of $G_i$. When a new node v is visited, a determination is made as to whether there is an entry (x, y) ∈ cADS(v) where $y \leq d_{vu}^{i}$. Here cADS(v) is the combined all-distances sketch of v. If so, the Dijkstra search is pruned at v. Otherwise, cADS(v) is updated to include $(r_u^{i}, d_{vu}^{i})$. Other methods for generating combined all-distances sketches may be used.
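  • A possible rendering of the pruned Dijkstra searches is shown below. It generalizes the pruning test with an assumed parameter k: the search is pruned at a node once its sketch holds at least k entries that are no farther away than the current distance, and k = 1 corresponds to the single-entry check in the text. The data layout and names are assumptions, and the code illustrates the described steps without claiming to be the exact procedure of the disclosure.

```python
import heapq
import random

def combined_all_distances_sketches(nodes, instances, k, rng):
    """instances: list of {(u, v): length} dicts.

    Returns (cads, ranks), where cads[v] is a list of (rank, distance)
    entries and ranks[(u, i)] is the rank of each node instance pair.
    """
    # Reversed adjacency lists: rev[v] holds (u, length) for edge (u, v).
    reverse = []
    for lengths in instances:
        rev = {}
        for (u, v), w in lengths.items():
            rev.setdefault(v, []).append((u, w))
        reverse.append(rev)

    ranks = {(u, i): rng.random()
             for i in range(len(instances)) for u in nodes}
    cads = {v: [] for v in nodes}

    for (u, i), r in sorted(ranks.items(), key=lambda item: item[1]):
        dist = {u: 0.0}
        heap = [(0.0, u)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist[v]:
                continue  # stale heap entry
            closer = sum(1 for (_, y) in cads[v] if y <= d)
            if closer >= k:
                continue  # enough closer, lower-ranked entries: prune at v
            cads[v].append((r, d))
            for w, length in reverse[i].get(v, ()):
                nd = d + length
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
    return cads, ranks

# Hypothetical toy data.
nodes = ["a", "b", "c"]
instances = [{("a", "b"): 1.0, ("b", "c"): 2.0}, {("a", "c"): 0.5}]
cads, ranks = combined_all_distances_sketches(nodes, instances, 1, random.Random(0))
print(cads["a"])
```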
  • The influence estimator 225 may generate results 186 in response to a received influence query 145. How the influence estimator 225 generates the results 186 may depend both on the type of influence query 145 (i.e., whether the query is for an estimation of the influence of a subset of the nodes of the graph 165 or to determine a subset of nodes of the graph 165 of a specified size that have a maximum influence) and whether or not the generated instances of the graph 165 are generated using binary IC or REL IC.
  • Where the instances of the graph 165 are binary IC, the influence estimator 225 may determine the influence of a subset of nodes S identified by the influence query 145 by estimating the cardinality of the union $\bigcup_{u \in S} R_u$ of the combined reachability sets, using the combined reachability sketches $X_u$ for all nodes u in the subset S. In some implementations, when estimating the cardinality of the union, the influence estimator 225 may compute a threshold rank $t_u$ of each node u using formula 4 (where $k^{\text{th}}$ indicates the k-th smallest element of the set):

  • $$t_u = k^{\text{th}}\bigl(\{r_v^{i} \mid v \in R_u^{i}\}\bigr) \qquad (4)$$
  • The influence estimator 225 may estimate the cardinality $|R_u|$ as $(k-1)/t_u$. The influence estimator 225 may further estimate the cardinality of the union $\bigcup_{u \in S} R_u$ using the bottom-k sketches of each set $R_u$ for u ∈ S. In some implementations, the influence estimator 225 may estimate the cardinality of the union by computing the bottom-k sketch of the union, which has a threshold value $t = k^{\text{th}}\{\bigcup_{u \in S} X_u\}$, and applying the cardinality estimator $(k-1)/t$.
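  • The bottom-k estimator (k-1)/t can be illustrated with a short example. The function name and the handling of sets with fewer than k ranks (returning the exact count) are assumptions added for this sketch.

```python
import random

def estimate_cardinality(ranks, k):
    """Estimate the size of a set from the ranks of its elements.

    Uses the k-th smallest rank t as the threshold and returns
    (k - 1) / t. If fewer than k ranks are present, the sketch holds
    every element, so the exact count is returned (an assumption of
    this example).
    """
    smallest = sorted(ranks)[:k]
    if len(smallest) < k:
        return float(len(smallest))
    return (k - 1) / smallest[-1]

# Hypothetical check: 1000 elements with uniform random ranks, k = 64.
rng = random.Random(0)
ranks = [rng.random() for _ in range(1000)]
print(estimate_cardinality(ranks, 64))
```

  • Only the k smallest ranks are needed, so the same function can be applied to a stored bottom-k sketch instead of the full set of ranks.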
  • The cardinality of the union Uu∈S Ru may be estimated by the influence estimator using formula 5:
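  • $$\widehat{\Bigl|\bigcup_{u \in S} R_u\Bigr|} = \sum_{z \,\in\, \bigcup_{v \in S} X_v \setminus \{t_v\}} \frac{1}{\max_{u \in S \,\mid\, z \in X_u \setminus \{t_u\}} t_u} \qquad (5)$$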
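  • Reading formula 5 as a sum, over every rank z that lies strictly below the threshold of at least one sketch in S, of the inverse of the largest such threshold, the estimator can be sketched as follows. This reading, the function name, and the treatment of sketches with fewer than k ranks (threshold taken as 1, so each rank counts as one element) are assumptions of this example.

```python
def estimate_union_cardinality(sketches, k):
    """sketches: list of sorted rank lists X_u, each with at most k ranks.

    For each rank z below the threshold t_u of some sketch, keep the
    largest such threshold, then sum the inverse thresholds.
    """
    best_threshold = {}
    for x in sketches:
        if len(x) < k:
            # Assumption: an unfilled sketch lists every element of R_u,
            # so each of its ranks is counted exactly once (threshold 1).
            threshold, members = 1.0, x
        else:
            threshold, members = x[k - 1], x[:k - 1]
        for z in members:
            best_threshold[z] = max(best_threshold.get(z, 0.0), threshold)
    return sum(1.0 / t for t in best_threshold.values())

# Hypothetical sketches of two nodes with k = 3.
x_a = [0.125, 0.25, 0.5]
x_b = [0.125, 0.3, 0.4]
print(estimate_union_cardinality([x_a, x_b], 3))
```

  • For a single full sketch the sum reduces to (k-1)/t, which matches the single-set estimator.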
  • Where the influence query 145 is a request to determine a subset S of an indicated size having a maximum influence, the influence estimator 225 may determine the subset S using the greedy algorithm described above. However, rather than compute the actual influence using the formula 2, the influence estimator 225 may use the combined reachability sketches and may estimate the influence of the subset using the formula 5.
  • In some implementations, the influence estimator 225 may apply the greedy algorithm by first creating a priority queue that contains each node u in the graph 165. Initially, the nodes in the graph may be ordered based on their estimated influence or $|R_u|$. The nodes may also be associated with a freshness value that indicates the last time that the influence value associated with the node was updated.
  • For the first iteration, the node u with the highest priority may be added to the subset S by the influence estimator 225. For subsequent iterations, the node u at the top of the priority queue is retrieved. If its freshness value indicates that it has not been evaluated for this iteration, the marginal gain of adding u to S is estimated using the formula 5 above. If the estimated gain is less than a previously estimated maximum gain, then u is added back into the priority queue with the updated maximum gain value and freshness value. If the estimated gain is greater than the previously estimated maximum gain, then u is added to S, and the algorithm is repeated until S is full.
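  • A lazy-evaluation variant of the greedy algorithm, in the spirit of the priority queue and freshness values described above, can be sketched as follows. It is not necessarily the exact procedure of the disclosure: the `estimate` callback stands in for any influence estimator (for example one based on formula 5), all names are hypothetical, and the approach relies on marginal gains not increasing as the seed set grows.

```python
import heapq

def lazy_greedy(nodes, estimate, size):
    """Select `size` seeds using stale-gain re-evaluation.

    estimate(seeds) returns the (estimated) influence of a list of seeds.
    Heap entries are (-gain, node, freshness), where freshness is the
    seed-set size at the time the gain was computed.
    """
    seeds = []
    current = estimate(seeds)
    heap = [(-(estimate([u]) - current), u, 0) for u in nodes]
    heapq.heapify(heap)
    while heap and len(seeds) < size:
        neg_gain, u, freshness = heapq.heappop(heap)
        if freshness == len(seeds):
            # Gain is up to date and is the largest in the queue: select u.
            seeds.append(u)
            current -= neg_gain
        else:
            # Gain is stale: re-estimate it and put u back into the queue.
            gain = estimate(seeds + [u]) - current
            heapq.heappush(heap, (-gain, u, len(seeds)))
    return seeds

# Hypothetical coverage-style influence function used as a stand-in.
covered = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(seeds):
    return len(set().union(*(covered[u] for u in seeds))) if seeds else 0

print(lazy_greedy(list(covered), coverage, 2))
```

  • Only nodes that reach the top of the queue with an outdated gain are re-evaluated, so most nodes are never re-estimated in a given iteration.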
  • Where the instances of the graph 165 are generated using the REL IC model, the influence estimator 225 may determine the influence of a subset of nodes S identified by the influence query 145 using an estimator according to formula 6 where α is any non-increasing function:

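  • $$\operatorname{inf}(\{G_i\}, S) = \frac{1}{l} \sum_{(v,i)} \max_{u \in S} \alpha\bigl(d_{uv}^{i}\bigr) = \frac{1}{l} \sum_{(v,i)} \alpha\Bigl(\min_{u \in S} d_{uv}^{i}\Bigr) \qquad (6)$$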
  • For a subset of nodes S with only one node u, a node always influences itself, so the influence of that node is one. For the other nodes in the graph 165, a historic inverse probability estimator may be used to estimate the influence contribution of nodes that are a positive distance from u. For a subset of nodes S with more than one node u, the influence estimator 225 may create what is referred to as a union all-distances sketch. The influence estimator 225 may generate the union all-distances sketch from each of the combined all-distances sketches from the nodes in the subset of nodes S. For example, the influence estimator 225 may take the k smallest ranks from the combined all-distances sketches for each instance. The influence estimator 225 may then estimate the influence of the nodes in the subset of nodes S by applying the estimator of formula 6 to entries in the union all-distances sketch.
  • Where the influence query 145 is a request to determine a subset S of an indicated size having a maximum influence, the influence estimator 225 may determine the subset S using the greedy algorithm similarly as described above for the binary IC model. However, rather than estimate the influence of the subset at each iteration of the algorithm using the formula 5, the influence estimator may estimate the influence using the estimator of formula 6.
  • FIG. 3 is an operational flow of an implementation of a method 300 for estimating the results of an influence query for a graph. The method 300 may be implemented by the influence engine 180, for example.
  • A graph is received at 301. The graph 165 may be received by the influence engine 180 from the graph provider 160. The graph 165 may include a plurality of nodes and a plurality of edges. The graph may be weighted or unweighted, and may be directed or undirected.
  • For each node in the graph, a sketch is computed at 303. Each sketch may be computed by the sketch generator 220. Depending on the implementation, each sketch may be a combined sketch and may be a combination of the sketches generated for the node across all instances of the received graph 165. Where an instance of a graph is generated using the binary IC model, the sketch generated for each node may be a combined reachability sketch. Where an instance of a graph is generated using the REL IC model, the sketch generated for each node may be a combined all-distances sketch. Other types of sketches may be used. The computed sketches may be stored by the sketch generator 220 as the sketch data 187.
  • An influence query is received at 305. The influence query 145 may be received by the influence engine 180 from the client device 110. Depending on the implementation, the influence query 145 may be a request to estimate the influence of a subset of nodes of the graph 165, or may be a request to estimate the subset of nodes of the graph 165 of a specified size with a maximum influence. Where the influence query 145 is a request to estimate the influence of a subset of nodes in the graph 165, the influence query 145 may include identifiers of the one or more nodes in the subset. Where the influence query is a request to estimate the subset of nodes of the graph 165 of a specified size with a maximum influence, the influence query 145 may include an indicator of the specified size.
  • One or more sketches are retrieved based on the influence query at 307. The one or more sketches may be retrieved from the sketch data 187 by the influence estimator 225. Where the influence query 145 is a request to estimate the influence of a subset of nodes of the graph 165, the influence estimator 225 may retrieve the sketches associated with the one or more nodes of the subset. Where the influence query 145 is a request to estimate the subset of nodes of the graph 165 of a specified size with a maximum influence, the influence estimator 225 may retrieve the sketches associated with every node of the graph 165.
  • Results are determined in response to the influence query based on the retrieved one or more sketches at 309. The results 186 may be estimates and may be determined by the influence estimator 225. How the results 186 are estimated may depend on the type of query 145 and the IC model used to generate the graph instances.
  • For graph instances that are based on the binary IC model, the influence estimator 225 may estimate the influence of a subset of nodes by estimating the cardinality of the union of the reachability sets of the nodes in the subset, using the reachability sketches associated with each node. For graph instances that are based on the REL IC model, the influence estimator 225 may estimate the influence of a subset of nodes by applying an estimator, such as a historic inverse probability estimator, to the sketches associated with the nodes in the subset of nodes. Where the query 145 is a request to identify the subset of a specified size with the maximum influence, the influence estimator 225 may determine the subset using a greedy algorithm.
  • The generated results are provided in response to the influence query at 311. The generated results 186 may be provided to the client device 110 that originated the influence query 145 by the influence engine 180. After providing the generated results 186, the influence engine 180 may return to 305 where a new influence query 145 may be received from a client device 110.
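  • As an illustration of the union-based estimate described for the binary IC model, the following minimal Python sketch assumes that each reachability sketch is a bottom-k sketch, i.e., the k smallest random ranks (drawn uniformly from (0, 1)) of the node-instance pairs reachable from a node. The function names, the parameter k, the rank scheme, and the (k - 1)/tau cardinality estimator are assumptions made for illustration and are not taken from the implementation described herein.

```python
import heapq


def merge_sketches(sketches, k):
    """Bottom-k sketch of the union: the k smallest distinct ranks overall."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return heapq.nsmallest(k, merged)


def estimate_cardinality(sketch, k):
    """Estimate the size of the set summarized by a sorted bottom-k sketch."""
    if len(sketch) < k:
        # Fewer than k ranks means the sketch holds the whole set.
        return float(len(sketch))
    # Otherwise use the k-th smallest rank (tau) as a threshold.
    return (k - 1) / sketch[k - 1]


def estimate_influence(sketches, k, num_instances):
    """Average estimated reachable-set size per graph instance."""
    union = merge_sketches(sketches, k)
    return estimate_cardinality(union, k) / num_instances


if __name__ == "__main__":
    # Hypothetical sketches for two nodes; ranks are illustrative values.
    sketch_a = [0.125, 0.25]
    sketch_b = [0.25, 0.5]
    print(estimate_influence([sketch_a, sketch_b], k=4, num_instances=1))
```

  • Merging before estimating counts a rank shared by several sketches only once, which is why the estimate is taken over the union rather than summed per node.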
  • FIG. 4 is an operational flow of an implementation of a method 400 for determining a subset of nodes from a graph of a specified size that maximizes an influence of the nodes in the subset of nodes. The method 400 may be implemented by the influence engine 180, for example.
  • A plurality of sketches is received at 401. The plurality of sketches may be received from the sketch data 187 by the influence estimator 225. Each sketch of the plurality of sketches may be associated with a node of a graph 165. The sketches may be combined sketches across all instances of the graph 165. Depending on the implementation, the combined sketches may be combined reachability sketches or may be combined all-distances sketches.
  • An influence query is received at 403. The influence query 145 may be received by the influence engine 180 from a client device 110. The influence query 145 may specify a size and may be a query for a subset of nodes from the graph 165 of the specified size that maximizes the influence of the nodes in the subset.
  • A node of the plurality of nodes with a greatest estimated marginal gain in influence is determined at 405. The node with the greatest estimated marginal gain may be determined by the influence estimator 225 using the sketches associated with each node in the graph 165.
  • The influence estimator 225 may determine the node with the greatest marginal gain by, for each node of the graph 165 that is not already in the subset, estimating the influence of the nodes already in the subset when the node being considered is included. The node that results in the largest increase in the estimated influence for the subset of nodes may then be determined to be the node with the greatest estimated marginal gain. Other methods may be used.
  • The determined node is added to the subset at 407. The determined node may be added to the subset of nodes by the influence estimator 225. A determination is made as to whether the subset of nodes is the specified size. If the subset of nodes is the specified size, then the method 400 may continue at 411. Otherwise, the method 400 may return to 405 where another node is selected.
  • Depending on the implementation, the influence estimator 225 may recompute the received sketches before or after adding the node to the subset of nodes. As nodes are added to the subset of nodes, the sketches become less accurate. Accordingly, the sketches may be recomputed to account for the nodes already added to the subset of nodes. The recomputation may be triggered by the calculated marginal gain falling below a threshold value, or after some number of nodes have been added to the subset, for example.
  • The determined subset of nodes is provided at 411. The determined subset of nodes may be provided by the influence engine in response to the influence query 145 as the results 186.
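  • The selection loop of the method 400 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation described herein: the sketches are assumed to be bottom-k sketches stored as lists of ranks, the names union_estimate and greedy_seed_set are hypothetical, and the optional recomputation of the sketches as nodes are added is omitted for brevity. The estimate returned is a union cardinality; dividing it by the number of graph instances would give an influence value without changing which node is selected.

```python
import heapq


def union_estimate(sketches, k):
    """Estimate the cardinality of the union summarized by bottom-k sketches."""
    merged = heapq.nsmallest(k, set().union(*sketches))
    if len(merged) < k:
        return float(len(merged))  # the union is fully contained in the sketch
    return (k - 1) / merged[k - 1]


def greedy_seed_set(node_sketches, size, k):
    """Repeatedly add the node with the greatest estimated marginal gain."""
    chosen = []
    chosen_sketches = []
    current = 0.0
    while len(chosen) < size and len(chosen) < len(node_sketches):
        best_node, best_value = None, None
        for node, sketch in node_sketches.items():
            if node in chosen:
                continue
            # Estimated influence of the subset if this node were included.
            value = union_estimate(chosen_sketches + [sketch], k)
            if best_value is None or value > best_value:
                best_node, best_value = node, value
        chosen.append(best_node)
        chosen_sketches.append(node_sketches[best_node])
        current = best_value
    return chosen, current


if __name__ == "__main__":
    # Hypothetical per-node sketches; ranks are illustrative values.
    sketches = {"a": [0.125, 0.25], "b": [0.125, 0.25, 0.5], "c": [0.75]}
    print(greedy_seed_set(sketches, size=2, k=8))
```

  • Because the current estimate is the same for every candidate within a round, choosing the largest estimate for the enlarged subset is equivalent to choosing the largest marginal gain.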
  • FIG. 5 shows an exemplary computing environment in which example implementations and aspects may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality.
  • Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • With reference to FIG. 5, an exemplary system for implementing aspects described herein includes a computing device, such as computing device 500. In its most basic configuration, computing device 500 typically includes at least one processing unit 502 and memory 504. Depending on the exact configuration and type of computing device, memory 504 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in FIG. 5 by dashed line 506.
  • Computing device 500 may have additional features/functionality. For example, computing device 500 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in FIG. 5 by removable storage 508 and non-removable storage 510.
  • Computing device 500 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computing device 500 and include both volatile and non-volatile media, and removable and non-removable media.
  • Computer storage media include volatile and non-volatile, and removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 504, removable storage 508, and non-removable storage 510 are all examples of computer storage media. Computer storage media include, but are not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 500. Any such computer storage media may be part of computing device 500.
  • Computing device 500 may contain communication connection(s) 512 that allow the device to communicate with other devices. Computing device 500 may also have input device(s) 514 such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 516 such as a display, speakers, printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
  • It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the processes and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium where, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter.
  • Although exemplary implementations may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network or distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices might include PCs, network servers, and handheld devices, for example.
  • Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

What is claimed:
1. A method comprising:
receiving a graph comprising a plurality of nodes, by a computing device;
for each node of the graph, computing a sketch by the computing device;
receiving an influence query by the computing device, wherein the influence query is a query for one of (1) an estimate of an influence of a subset of nodes of the plurality of nodes or (2) an estimate of a subset of nodes of the plurality of nodes of a specified size with a maximum combined influence, wherein the influence of a node is a measure of how connected the node in the graph is to the other nodes of the graph;
determining a result in response to the influence query using one or more of the computed sketches by the computing device; and
providing the determined result in response to the influence query by the computing device.
2. The method of claim 1, wherein each sketch is one or more of an all-distances sketch or a reachability sketch.
3. The method of claim 1, further comprising generating a plurality of instances from the graph, and computing a sketch for a node comprises:
computing a sketch for the node for each of the generated instances; and
combining the computed sketches for each of the generated instances.
4. The method of claim 3, wherein the combined computed sketches are combined all-distances sketches or combined reachability sketches.
5. The method of claim 3, wherein the plurality of instances are randomly generated from the graph.
6. The method of claim 5, wherein the graph further comprises a plurality of edges, and wherein randomly generating an instance of the graph comprises randomly assigning a value to each edge of the graph.
7. The method of claim 6, wherein an assigned value is either a one or a zero.
8. The method of claim 6, wherein an assigned value is any non-zero value.
9. The method of claim 1, wherein the influence query identifies a subset of nodes from the graph, and wherein determining the result in response to the influence query using one or more of the computed sketches comprises estimating an influence of the nodes of the subset of nodes using the sketches computed for each of the nodes in the subset of nodes.
10. The method of claim 9, wherein the influence is estimated based on a union of the sketches computed for each of the nodes in the subset of nodes.
11. The method of claim 1, wherein the influence query is a query for a subset of nodes from the graph of a specified size having a maximum influence, and determining the result in response to the influence query using one or more of the computed sketches comprises using a greedy algorithm to determine the subset of nodes.
12. The method of claim 11, wherein using the greedy algorithm to determine the subset of nodes comprises:
determining a node of the plurality of nodes that when added to the subset of nodes increases an influence of the subset of nodes by the greatest amount using the sketch computed for the determined node and the sketches computed for the nodes of the subset of nodes; and
adding the determined node to the subset of nodes.
13. A system comprising:
a computing device; and
an influence engine adapted to:
receive a graph comprising a plurality of nodes;
for each node of the graph, compute a sketch;
receive an influence query, wherein the influence query is a query for an estimate of an influence of a subset of nodes of the plurality of nodes, wherein the influence of a node is a measure of how connected the node in the graph is to the other nodes of the graph;
determine a result in response to the influence query using one or more of the computed sketches; and
provide the determined result in response to the influence query.
14. The system of claim 13, wherein each sketch is one or more of an all-distances sketch or a reachability sketch.
15. The system of claim 13, wherein the influence engine is further adapted to generate a plurality of instances from the graph, and the influence engine adapted to compute a sketch for a node comprises the influence engine adapted to:
compute a sketch for the node for each of the generated instances; and
combine the computed sketches for each of the generated instances.
16. The system of claim 15, wherein the plurality of instances are randomly generated from the graph.
17. The system of claim 16, wherein the graph further comprises a plurality of edges, and wherein randomly generating an instance of the graph comprises randomly assigning a value to each edge of the graph.
18. A method comprising:
generating a plurality of instances from a graph comprising a plurality of nodes, by a computing device;
for each node of the graph, computing a sketch by the computing device, wherein computing a sketch for a node comprises computing a sketch for the node for each of the generated instances and combining the computed sketches for each of the generated instances;
receiving an influence query by the computing device, wherein the influence query is a query for an estimate of an influence of a subset of nodes of the plurality of nodes, wherein the influence of a node is a measure of how connected the node in the graph is to the other nodes of the graph;
determining a result in response to the influence query using one or more of the computed sketches by the computing device; and
providing the determined result in response to the influence query by the computing device.
19. The method of claim 18, wherein the plurality of instances are randomly generated from the graph, wherein the graph further comprises a plurality of edges, and wherein randomly generating an instance of the graph comprises randomly assigning a value to each edge of the graph.
20. The method of claim 18, wherein the influence query identifies a subset of nodes from the graph, and wherein determining the result in response to the influence query using one or more of the computed sketches comprises estimating an influence of the nodes of the subset of nodes using the sketches computed for each of the nodes in the subset of nodes.
US15/236,986 2014-05-29 2016-08-15 Estimating influence using sketches Abandoned US20160350382A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/236,986 US20160350382A1 (en) 2014-05-29 2016-08-15 Estimating influence using sketches

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/290,209 US9443034B2 (en) 2014-05-29 2014-05-29 Estimating influence using sketches
US15/236,986 US20160350382A1 (en) 2014-05-29 2016-08-15 Estimating influence using sketches

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/290,209 Continuation US9443034B2 (en) 2014-05-29 2014-05-29 Estimating influence using sketches

Publications (1)

Publication Number Publication Date
US20160350382A1 (en) 2016-12-01

Family

ID=54702064

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/290,209 Active 2034-11-11 US9443034B2 (en) 2014-05-29 2014-05-29 Estimating influence using sketches
US15/236,986 Abandoned US20160350382A1 (en) 2014-05-29 2016-08-15 Estimating influence using sketches

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/290,209 Active 2034-11-11 US9443034B2 (en) 2014-05-29 2014-05-29 Estimating influence using sketches

Country Status (1)

Country Link
US (2) US9443034B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695043A (en) * 2020-06-16 2020-09-22 桂林电子科技大学 Social network blocking influence maximization method based on geographic area

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9798458B2 (en) 2013-10-02 2017-10-24 The Joan and Irwin Jacobs Technion-Cornell Innovation Institute Methods, systems, and apparatuses for accurate measurement and real-time feedback of solar ultraviolet exposure
US9880052B2 (en) 2013-10-02 2018-01-30 The Joan and Irwin Jacobs Technion-Cornell Innovation Institute Methods, systems, and apparatuses for accurate measurement and real-time feedback of solar ultraviolet exposure
US10120956B2 (en) * 2014-08-29 2018-11-06 GraphSQL, Inc. Methods and systems for distributed computation of graph data
US10527490B2 (en) 2015-08-25 2020-01-07 The Joan and Irwin Jacobs Technion-Cornell Innovation Institute Methods, systems, and apparatuses for accurate measurement and real-time feedback of solar ultraviolet exposure
US10739253B2 (en) 2016-06-07 2020-08-11 Youv Labs, Inc. Methods, systems, and devices for calibrating light sensing devices
US10386194B2 (en) 2016-06-10 2019-08-20 Apple Inc. Route-biased search
US10060753B2 (en) 2016-08-17 2018-08-28 Apple Inc. On-demand shortcut computation for routing
US10018476B2 (en) 2016-08-17 2018-07-10 Apple Inc. Live traffic routing
USD829112S1 (en) 2016-08-25 2018-09-25 The Joan and Irwin Jacobs Technion-Cornell Innovation Institute Sensing device
WO2018156572A1 (en) * 2017-02-21 2018-08-30 Virginia Commonwealth University Intellectual Property Foundation Importance sketching of influence dynamics in massive-scale networks
CN107135153A (en) * 2017-04-28 2017-09-05 常州工学院 The information source and influence power node positioning method inversely reviewed based on propagation path
US10942970B2 (en) * 2018-10-12 2021-03-09 Oracle International Corporation Reachability graph index for query processing
US10876886B2 (en) 2018-10-19 2020-12-29 Youv Labs, Inc. Methods, systems, and apparatuses for accurate measurement of health relevant UV exposure from sunlight

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962520B2 (en) * 2007-04-11 2011-06-14 Emc Corporation Cluster storage using delta compression
US8370313B2 (en) * 2009-06-10 2013-02-05 James Snow Scoring nodes in a directed graph with positive and negative links
US8619084B2 (en) * 2010-05-03 2013-12-31 International Business Machines Corporation Dynamic adaptive process discovery and compliance
US8631044B2 (en) * 2008-12-12 2014-01-14 The Trustees Of Columbia University In The City Of New York Machine optimization devices, methods, and systems
US8645412B2 (en) * 2011-10-21 2014-02-04 International Business Machines Corporation Computing correlated aggregates over a data stream
US8666920B2 (en) * 2010-02-15 2014-03-04 Microsoft Corporation Estimating shortest distances in graphs using sketches
US8688701B2 (en) * 2007-06-01 2014-04-01 Topsy Labs, Inc Ranking and selecting entities based on calculated reputation or influence scores
US8959525B2 (en) * 2009-10-28 2015-02-17 International Business Machines Corporation Systems and methods for affinity driven distributed scheduling of parallel computations
US9471691B1 (en) * 2012-12-07 2016-10-18 Google Inc. Systems, methods, and computer-readable media for providing search results having contacts from a user's social graph

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8359276B2 (en) 2006-09-20 2013-01-22 Microsoft Corporation Identifying influential persons in a social network
CN101859315A (en) 2010-04-30 2010-10-13 西北工业大学 Heuristic solving method for maximizing influence of social network
US20110295626A1 (en) 2010-05-28 2011-12-01 Microsoft Corporation Influence assessment in social networks
US8751618B2 (en) 2011-04-06 2014-06-10 Yahoo! Inc. Method and system for maximizing content spread in social network
CN102819664B (en) 2012-07-18 2015-02-18 中国人民解放军国防科学技术大学 Influence maximization parallel accelerating method based on graphic processing unit

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962520B2 (en) * 2007-04-11 2011-06-14 Emc Corporation Cluster storage using delta compression
US8688701B2 (en) * 2007-06-01 2014-04-01 Topsy Labs, Inc Ranking and selecting entities based on calculated reputation or influence scores
US8631044B2 (en) * 2008-12-12 2014-01-14 The Trustees Of Columbia University In The City Of New York Machine optimization devices, methods, and systems
US8370313B2 (en) * 2009-06-10 2013-02-05 James Snow Scoring nodes in a directed graph with positive and negative links
US8959525B2 (en) * 2009-10-28 2015-02-17 International Business Machines Corporation Systems and methods for affinity driven distributed scheduling of parallel computations
US8666920B2 (en) * 2010-02-15 2014-03-04 Microsoft Corporation Estimating shortest distances in graphs using sketches
US8619084B2 (en) * 2010-05-03 2013-12-31 International Business Machines Corporation Dynamic adaptive process discovery and compliance
US8645412B2 (en) * 2011-10-21 2014-02-04 International Business Machines Corporation Computing correlated aggregates over a data stream
US8868599B2 (en) * 2011-10-21 2014-10-21 International Business Machines Corporation Computing correlated aggregates over a data stream
US9471691B1 (en) * 2012-12-07 2016-10-18 Google Inc. Systems, methods, and computer-readable media for providing search results having contacts from a user's social graph

Also Published As

Publication number Publication date
US20150347625A1 (en) 2015-12-03
US9443034B2 (en) 2016-09-13

Similar Documents

Publication Publication Date Title
US9443034B2 (en) Estimating influence using sketches
Ohsaka et al. Dynamic influence analysis in evolving networks
Ohsaka et al. Fast and accurate influence maximization on large networks with pruned monte-carlo simulations
Cohen et al. Sketch-based influence maximization and computation: Scaling up with guarantees
US10115115B2 (en) Estimating similarity of nodes using all-distances sketches
US8392398B2 (en) Query optimization over graph data streams
US8719211B2 (en) Estimating relatedness in social network
US10936765B2 (en) Graph centrality calculation method and apparatus, and storage medium
US20210158211A1 (en) Linear time algorithms for privacy preserving convex optimization
US8666920B2 (en) Estimating shortest distances in graphs using sketches
US8438189B2 (en) Local computation of rank contributions
US8521724B2 (en) Processing search queries using a data structure
US20090306996A1 (en) Rating computation on social networks
US20080270549A1 (en) Extracting link spam using random walks and spam seeds
EP2601622A1 (en) Predicting a user behavior number of a word
WO2019019385A1 (en) Cross-platform data matching method and apparatus, computer device and storage medium
US20130103671A1 (en) Processing Search Queries In A Network Of Interconnected Nodes
WO2018149337A1 (en) Information distribution method, device, and server
US20150170030A1 (en) Determining geo-locations of users from user activities
US11514038B2 (en) Systems and methods for quantum global optimization
Avrachenkov et al. Inference in osns via lightweight partial crawls
Lee et al. Computing the stationary distribution locally
US20150169794A1 (en) Updating location relevant user behavior statistics from classification errors
US11468521B2 (en) Social media account filtering method and apparatus
US11341585B2 (en) Importance sketching of influence dynamics in massive-scale networks

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WERNECK, RENATO F.;DELLING, DANIEL;PAJOR, THOMAS;AND OTHERS;SIGNING DATES FROM 20140526 TO 20140527;REEL/FRAME:039435/0980

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:039436/0075

Effective date: 20141014

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION