Flow Graphs, Speculative Locks, and Task Arenas in Intel® Threading Building Blocks (continued)

Original author: Vladimir Polin, Alexey Kukanov, Michael J. Voss
This post is the second half of a translation of the article “Flow Graphs, Speculative Locks, and Task Arenas in Intel Threading Building Blocks” from Parallel Universe Magazine, Issue 18, 2014. In this half of the article we look at speculative locks, which take advantage of Intel Transactional Synchronization Extensions technology, and at user-managed task arenas, which provide advanced control over the level of parallelism and the isolation of work. If you are interested, read on.

Speculative Locks


Intel TBB 4.2 offers speculative locks: new synchronization classes based on Intel Transactional Synchronization Extensions (Intel TSX) technology.
A speculative lock can allow critical sections protected by the same lock to run at the same time, on the assumption that their accesses and modifications of the data do not conflict. If a data conflict does occur, one or more speculative executions are cancelled without touching the protected data and, therefore, without affecting other threads. The threads involved in the conflict then re-execute their critical sections and may take the lock “for real” to protect the data (Intel TSX does not guarantee that speculative execution will eventually succeed).
In the Intel TBB implementation of speculative locks [6, 7], all these steps happen transparently to the user; the programmer simply uses the API of the corresponding mutex. Moreover, on a processor that does not support Intel TSX, the implementation immediately falls back to an ordinary lock. Thus developers can write portable programs that take advantage of transactional synchronization where it is available.
Intel TBB now provides two mutex classes that support Intel TSX: speculative_spin_mutex and speculative_spin_rw_mutex; the latter was added as a “preview feature” in Intel TBB 4.2 Update 2.
The speculative_spin_mutex class is very similar to spin_mutex; both live in the same tbb/spin_mutex.h header file. The main difference of speculative_spin_mutex, apart from Intel TSX support, is its size. To avoid sharing a cache line with other data, which would very likely lead to conflicts and loss of performance, an instance of speculative_spin_mutex occupies two cache lines.
An example of using a speculative lock:
#include <set>
#include "tbb/spin_mutex.h"

tbb::speculative_spin_mutex tsx_mtx;
std::set<int> g_Set;
void thread_safe_add_to_set( int value ) {
    tbb::speculative_spin_mutex::scoped_lock lock(tsx_mtx);
    g_Set.insert(value);
}

The speculative_spin_rw_mutex class, as its name suggests, implements a speculative read-write spin lock. One might note that any speculative lock is non-exclusive by definition, allowing not only concurrent reads but also concurrent writes as long as there are no conflicts; so “speculative RW lock” may sound like a tautology. However, recall that a thread may take the lock “for real.” In that case speculative_spin_mutex must give the thread exclusive access no matter whether it reads or modifies the data, so it is not a true RW lock. speculative_spin_rw_mutex, on the other hand, allows multiple readers to run even in that case; moreover, “real” and speculative readers can run simultaneously. This entails extra overhead: the internal data fields of the class must reside on different cache lines to avoid simultaneous access to the same cache line from several cores (false sharing). Because of this, each instance of speculative_spin_rw_mutex occupies three cache lines.
Although speculative_spin_rw_mutex currently lives in the tbb/spin_rw_mutex.h header file, and even uses spin_rw_mutex as part of its implementation, the two classes are not fully compatible: speculative_spin_rw_mutex has no lock() and unlock() methods. It requires scoped locking, i.e. it must be used through the speculative_spin_rw_mutex::scoped_lock class. Since this is a preview feature, the macro TBB_PREVIEW_SPECULATIVE_SPIN_RW_MUTEX must be set to a non-zero value before including the header file.
Unfortunately, the usefulness of speculative locks strongly depends on the workload. Do not assume that these new classes are simply “better locks.” Careful performance studies are needed to decide, case by case, whether a speculative lock is the right tool.

User-Managed Task Arenas


Another significant piece of functionality recently added to the library is user-managed task arenas. In our terminology, an arena is a place where threads share and pick up tasks to execute. Initially, the library supported a single global arena for the whole application. Then, based on user feedback that work launched by different threads should be isolated, we changed it to support a separate arena for each application thread. Later we received requests to decouple concurrency control and work isolation from application threads. To meet these needs, we introduced user-managed task arenas. At the moment this is still a preview feature, and using it requires setting the macro TBB_PREVIEW_TASK_ARENA to a non-zero value, but we are working to make it fully supported later in the year (2014).
The API for user-managed task arenas is provided by the task_arena class. When constructing a task_arena, the user can specify the desired concurrency and how much of that concurrency should be reserved for application threads.
#define TBB_PREVIEW_TASK_ARENA 1
#include "tbb/task_arena.h"
tbb::task_arena my_arena(4, 1);

In this example, an arena is created for four threads, with one slot reserved for an application thread. This means that up to three worker threads managed by the Intel TBB library can join this arena and work on the tasks in it. There is no limit on how many application threads can submit work to this arena, but the total concurrency is limited to four; all “extra” threads will be unable to join the arena and execute tasks there.
In order to submit work to the arena, call its execute() or enqueue() method:
my_arena.enqueue( a_job_functor );
my_arena.execute( a_job_functor2 );

Work for either method can be expressed as a C++11 lambda expression or a functor. The two methods differ in how the work is submitted. task_arena::enqueue() is an asynchronous, “fire-and-forget” call: the thread that calls it does not join the arena and returns immediately. task_arena::execute(), in contrast, does not return until the submitted work is finished; if possible, the calling thread joins the arena and executes tasks itself, otherwise it blocks until the work is completed.
You can submit many individual tasks to a task_arena, but that is not what it is intended for. Usually a single task is submitted that creates enough parallelism inside the arena, for example by calling parallel_for:
my_arena.execute( [&]{
    tbb::parallel_for(0,N,iteration_functor());
});

or by starting a flow graph:
tbb::flow::graph g;
... // create the graph here
my_arena.enqueue( [&]{
    ... // start graph computations
});
... // do something else
my_arena.execute( [&]{
    g.wait_for_all();
}); // does not return until the flow graph finishes

More information on task arenas can be found in the Intel TBB Reference Guide.

Conclusion


The Intel Threading Building Blocks C++ template library offers a rich set of components for efficient, high-level, task-based parallelism and for developing portable applications that will be able to harness the full power of future multi-core architectures. The library lets application developers focus on expressing the parallelism in their algorithms without having to manage the low-level details of that parallelism. In addition to high-performance implementations of the most commonly used high-level parallel algorithms and thread-safe containers, the library provides low-level building blocks such as a thread-safe, scalable memory allocator, locks, and atomic operations.
Although the Intel TBB library is already quite comprehensive and well recognized by the community, we continue to improve its performance and expand its functionality. In Intel TBB 4.0 we released flow graph support to let developers more easily implement algorithms based on data or execution dependency graphs. In Intel TBB 4.2 we exposed the benefits of Intel Transactional Synchronization Extensions through new synchronization classes, and we responded to user requests for advanced control over concurrency and task isolation by introducing user-managed task arenas.
You can find the latest Intel TBB versions and further information on our sites.


Bibliography
[1] Michael McCool, Arch Robison, James Reinders, “Structured Parallel Programming”. parallelbook.com
[2] Vladimir Polin, “Android* Tutorial: Writing a Multithreaded Application using Intel Threading Building Blocks”. software.intel.com/en-us/android/articles/android-tutorial-writing-a-multithreaded-application-using-intel-threading-building-blocks
[3] Vladimir Polin, “Windows* 8 Tutorial: Writing a Multithreaded Application for the Windows Store* using Intel Threading Building Blocks”. software.intel.com/en-us/blogs/2013/01/14/windows-8-tutorial-writing-a-multithreaded-application-for-the-windows-store-using
[4] Michael J. Voss, “The Intel Threading Building Blocks Flow Graph”, Dr. Dobb's, October 2011. www.drdobbs.com/tools/the-intel-threading-building-blocks-flow/231900177
[5] Aparna Chandramowlishwaran, Kathleen Knobe, and Richard Vuduc, “Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems”, 2010 Symposium on Parallel & Distributed Processing (IPDPS), April 2010.
[6] Christopher Huson, “Transactional Memory Support: the speculative_spin_mutex”. software.intel.com/en-us/blogs/2013/10/07/transactional-memory-support-the-speculative-spin-mutex
[7] Christopher Huson, “Transactional Memory Support: the speculative_spin_rw_mutex”. software.intel.com/en-us/blogs/2014/03/07/transactional-memory-support-the-speculative-spin-rw-mutex-community-preview
