https://www.threadingbuildingblocks.org/docs/help/index.htm

Parallelizing Data Flow and Dependence Graphs

In addition to loop parallelism, the Intel® Threading Building Blocks (Intel® TBB) library also supports graph parallelism. It's possible to create graphs that are highly scalable, but it is also possible to create graphs that are completely sequential.

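In short: besides loop parallelism, TBB also supports graph parallelism. This makes it possible to build highly scalable graphs, and equally possible to build graphs that run completely sequentially.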

Using graph parallelism, computations are represented by nodes and the communication channels between these computations are represented by edges. When a node in the graph receives a message, a task is spawned to execute its body object on the incoming message. Messages flow through the graph across the edges that connect the nodes. The following sections present two examples of applications that can be expressed as graphs. For more information on tasks, see the task scheduler material in the Intel TBB Developer Guide.

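In graph parallelism, computations are represented as nodes, and the communication channels between computations are represented as edges. When a node receives a message, a task is executed. Messages flow through the graph along the edges that connect the nodes. Two examples follow.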

The following figure shows a streaming or data flow application where a sequence of values is processed as each value passes through the nodes in the graph. In this example, the sequence is created by a function F. For each value in the sequence, G squares the value and H cubes the value. J then takes each of the squared and cubed values and adds them to a global sum. After all values in the sequence are completely processed, sum is equal to the sum of the sequence of squares and cubes from 1 to 10. In a streaming or data flow graph, the values actually flow across the edges; the output of one node becomes the input of its successor(s).

The figure below shows a streaming, or data flow, application.

Simple Data Flow Graph

Intel® Threading Building Blocks (Intel® TBB) Developer Guide (Chinese): Parallelizing Data Flow and Dependence Graphs

The following graphic shows a different form of graph application. In this example, a dependence graph is used to establish a partial ordering among the steps for making a peanut butter and jelly sandwich. In this partial ordering, you must first get the bread before spreading the peanut butter or jelly on the bread. You must spread on the peanut butter before you put away the peanut butter jar, and likewise spread on the jelly before you put away the jelly jar. And, you need to spread on both the peanut butter and jelly before putting the two slices of bread together. This is a partial ordering because, for example, it doesn't matter if you spread on the peanut butter first or the jelly first. It also doesn't matter if you finish making the sandwich before putting away the jars.

The figure below shows another kind of graph application, one that expresses the steps of a task as a dependence graph.

Dependence Graph for Making a Sandwich

While it can be inferred that resources, such as the bread, or the jelly jar, are shared between ordered steps, it is not explicit in the graph. Instead, only the required ordering of steps is explicit in a dependence graph. For example, you must "Put jelly on 1 slice" before you "Put away jelly jar".

The flow graph interface in the Intel TBB library allows you to express data flow and dependence graphs such as these, as well as more complicated graphs that include cycles, conditionals, buffering and more. If you express your application using the flow graph interface, the runtime library spawns tasks to exploit the parallelism that is present in the graph. For example, in the first example above, perhaps two different values might be squared in parallel, or the same value might be squared and cubed in parallel. Likewise in the second example, the peanut butter might be spread on one slice of bread in parallel with the jelly being spread on the other slice. The interface expresses what is legal to execute in parallel, but allows the runtime library to choose at runtime what will be executed in parallel.

TBB lets you express data flow and dependence graphs, as well as more complex graphs that include, for example, cycles, conditionals, and buffering.

The support for graph parallelism is contained within the namespace tbb::flow and is defined in the flow_graph.h header file.


Basic Flow Graph Concepts

The basic concepts.

Flow Graph Basics: Graph Object

Conceptually a flow graph is a collection of nodes and edges. Each node belongs to exactly one graph and edges are made only between nodes in the same graph. In the flow graph interface, a graph object represents this collection of nodes and edges, and is used for invoking whole graph operations such as waiting for all tasks related to the graph to complete, resetting the state of all nodes in the graph, and canceling the execution of all nodes in the graph.

The code below creates a graph object and then waits for all tasks spawned by the graph to complete. The call to wait_for_all in this example returns immediately since this is a trivial graph with no nodes or edges, and therefore no tasks are spawned.

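    #include "tbb/flow_graph.h"
    using namespace tbb::flow;

    graph g;
    g.wait_for_all();  // no nodes or edges, so no tasks: returns immediately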

Flow Graph Basics: Nodes

A node is a class that inherits from tbb::flow::graph_node and also typically inherits from tbb::flow::sender<T> , tbb::flow::receiver<T> or both. A node performs some operation, usually on an incoming message and may generate zero or more output messages. Some nodes require more than one input message or generate more than one output message.

Nodes perform the computation.

While it is possible to define your own node types by inheriting from graph_node, sender and receiver, it is more typical to construct a graph from predefined node types. The predefined node types are listed in the flow graph section of the Intel TBB reference documentation.

A function_node is a predefined type available in flow_graph.h and represents a simple function with one input and one output. The constructor for a function_node takes three arguments:

template< typename Body> function_node(graph &g, size_t concurrency, Body body)

Parameter	Description
Body	Type of the body object.
g	The graph the node belongs to.
concurrency	The concurrency limit for the node. You can use the concurrency limit to control how many invocations of the node are allowed to proceed concurrently, from 1 (serial) to an unlimited number.
body	User defined function object, or lambda expression, that is applied to the incoming message to generate the outgoing message.

Below is code for creating a simple graph that contains a single function_node. In this example, a node n is constructed that belongs to graph g, and has a second argument of 1, which allows at most 1 invocation of the node to occur concurrently. The body is a lambda expression that prints each value v that it receives, spins for v seconds, prints the value again, and then returns v unmodified. The code for the function spin_for is not provided.

    graph g;
    function_node< int, int > n( g, 1, []( int v ) -> int {
        cout << v;
        spin_for( v );
        cout << v;
        return v;
    } );
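Since spin_for is left undefined, one possible stand-in for experimenting with the node above is a simple busy-wait built on the standard library. This is only a sketch of what such a helper might look like; the guide does not say how the original was implemented, and taking the duration as a double number of seconds is an assumption made so that short waits are easy to try.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the omitted spin_for helper: keeps the calling
// thread busy for the given number of seconds instead of putting it to sleep.
void spin_for(double seconds) {
    assert(seconds >= 0.0);
    using clock = std::chrono::steady_clock;
    const auto deadline =
        clock::now() + std::chrono::duration_cast<clock::duration>(
                           std::chrono::duration<double>(seconds));
    while (clock::now() < deadline) {
        // Intentionally empty: burn CPU time to simulate work.
    }
}
```

An int argument such as v converts implicitly, so the call spin_for( v ) in the example compiles unchanged against this version.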

After the node is constructed in the example above, you can pass messages to it, either by connecting it to other nodes using edges or by invoking its function try_put. Using edges is described in the next section.

    n.try_put( 1 );
    n.try_put( 2 );
    n.try_put( 3 );

You can then wait for the messages to be processed by calling wait_for_all on the graph object:

    g.wait_for_all();

In the above example code, the function_node n was created with a concurrency limit of 1. When it receives the message sequence 1, 2 and 3, the node n spawns a task to apply the body to the first input, 1. When that task is complete, it spawns another task to apply the body to 2. Likewise, the node waits for that task to complete before spawning a third task to apply the body to 3. The calls to try_put never block waiting for a task to be spawned; if a node cannot immediately spawn a task to process a message, the message is buffered in the node. When it is legal, based on concurrency limits, a task is spawned to process the next buffered message.

In the above graph, each message is processed sequentially. If, however, you construct the node with a different concurrency limit, parallelism can be achieved:

    function_node< int, int > n( g, tbb::flow::unlimited, []( int v ) -> int {
        cout << v;
        spin_for( v );
        cout << v;
        return v;
    } );

You can use unlimited as the concurrency limit to instruct the library to spawn a task as soon as a message arrives, regardless of how many other tasks have been spawned. You can also use any specific value, such as 4 or 8, to limit concurrency to at most 4 or 8, respectively. It is important to remember that spawning a task does not mean creating a thread. So while a graph may spawn many tasks, only the number of threads available in the library's thread pool will be used to execute these tasks.

Suppose you use unlimited in the function_node constructor instead and call try_put on the node:

    n.try_put( 1 );
    n.try_put( 2 );
    n.try_put( 3 );
    g.wait_for_all();

The library spawns three tasks, each one applying n's lambda expression to one of the messages. If you have a sufficient number of threads available on your system, then all three invocations of the body will occur in parallel. If, however, you have only one thread in the system, they execute sequentially.

