[Translation] When to use parallel streams

    Authors: Doug Lea with Brian Goetz, Paul Sandoz, Aleksey Shipilev, Heinz Kabutz, Joe Bowbeer, ...

    The java.util.stream framework contains data-driven operations on collections and other data sources. Most stream methods apply the same operation to each element. Using the collection method parallelStream(), when several cores are available, a data-driven computation can be turned into a data-parallel one. But when is it worth doing?

    Consider using S.parallelStream().operation(F) instead of S.stream().operation(F) when the operations are independent of each other, and either computationally expensive, or applied to a large number of elements of an efficiently splittable data structure, or both. More precisely:

    • F, the per-element function (usually a lambda), is independent: applying it to any element neither depends on nor affects its application to any other element (for recommendations on using independent (non-interfering) stateless functions, see the documentation of the stream package).
    • S, the source collection, is efficiently splittable. Besides collections, there are other stream sources that parallelize well, for example java.util.SplittableRandom (for which you can call stream.parallel() to parallelize). But most I/O-based sources are designed primarily for sequential use.
    • The total sequential execution time exceeds a minimum threshold. Today, on most platforms, that threshold is roughly (within a factor of ten) 100 microseconds. Exact measurement is not required: for practical purposes it is enough to multiply N (the number of elements) by Q (the cost of applying F to one element), where Q can be roughly estimated as the number of operations or lines of code, and then check that N * Q is at least 10000 (if you are risk-averse, add a zero or two). So if F is a tiny function like x -> x + 1, parallel execution makes sense when N >= 10000. Conversely, if F is a heavyweight computation, like finding the next best move in a chess game, Q is so large that N can be neglected, as long as the collection can be split at all.
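The N * Q rule of thumb can be sketched in code. In this sketch the class, method, and threshold names are ours, not part of any API; the threshold is the article's rough 10000 figure:

```java
import java.util.List;
import java.util.stream.Stream;

public class NQHeuristic {
    // Rough threshold from the text: parallelize only when N * Q >= 10_000,
    // where Q is a crude per-element cost estimate (operations or lines of code).
    static final long MIN_TOTAL_WORK = 10_000;

    static <T> Stream<T> chooseStream(List<T> source, long costPerElementQ) {
        long totalWork = (long) source.size() * costPerElementQ;
        return totalWork >= MIN_TOTAL_WORK
                ? source.parallelStream()
                : source.stream();
    }

    public static void main(String[] args) {
        List<Integer> small = List.of(1, 2, 3);
        // 3 elements * Q=1 = 3 < 10_000, so the sequential stream is chosen.
        System.out.println(chooseStream(small, 1).isParallel()); // prints false
    }
}
```

In real code the decision is usually hidden behind a method, as the price-stream example later in this article shows.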

    The stream framework will not (and cannot) enforce any of the above. If the computations are not independent, running them in parallel makes no sense, or worse, will be harmful and lead to errors. The remaining criteria derive from the above engineering issues and tradeoffs:

    • Start-up
      The arrival of additional cores in processors has, in most cases, been accompanied by power-management mechanisms that can make those cores slow to start up, sometimes with further overheads from the JVM, the operating system, and the hypervisor. Consequently, the threshold at which parallel mode pays off roughly corresponds to the time needed for subtask processing to spread across enough cores. Past that point, parallel computation can even be more energy-efficient than sequential computation (depending on the details of the processors and systems; for an example, see the article).
    • Granularity
      It rarely makes sense to split up computations that are already small. The framework usually partitions the task so that the pieces can occupy all available cores of the system. If, after start-up, there is almost no work for each core, the (mostly sequential) effort of setting up the parallel computation is wasted. Given that in practice the number of cores ranges from 2 to 256, the threshold also prevents the undesirable effect of over-partitioning the task.
    • Splittability
      The most efficiently splittable collections include ArrayList and {Concurrent}HashMap, as well as plain arrays (T[], split using the static methods of java.util.Arrays). The least splittable are LinkedList, BlockingQueue, and most I/O-based sources. The rest fall somewhere in between (data structures that support random access and/or efficient search usually split well). If splitting the data takes longer than processing it, the effort is wasted. If Q is large enough, parallelism can pay off even for a LinkedList, but this is quite a rare case. Moreover, some sources cannot be split all the way down to single elements, which may limit how finely the task can be decomposed.

    Obtaining precise measurements of these effects can be difficult (though, with some effort, doable using tools like JMH). But their combined effect is easy to notice. To feel it yourself, run an experiment. For example, on a 32-core test machine, running small functions like max() or sum() over an ArrayList, the break-even point is approximately 10,000 elements. Above that, speedups of up to 20x are observed. The running time for collections with fewer than 10,000 elements is not much less than for 10,000, and hence slower than sequential processing. The worst results occur with fewer than 100 elements: the threads involved end up doing nothing useful, because the computation finishes before they even start. On the other hand, when per-element operations are time-consuming, the benefits of efficiently and fully splittable collections such as ArrayList are immediately visible.
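A crude version of such an experiment might look like the sketch below. Timings vary wildly across machines and JVMs, and a naive nanoTime comparison is no substitute for a harness like JMH; this only shows the shape of the experiment:

```java
import java.util.ArrayList;
import java.util.List;

public class BreakEven {
    // Naive wall-clock timing of a single run; JMH would do this properly.
    static long time(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return System.nanoTime() - t0;
    }

    static long seqSum(List<Integer> data) {
        return data.stream().mapToLong(Integer::longValue).sum();
    }

    static long parSum(List<Integer> data) {
        return data.parallelStream().mapToLong(Integer::longValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) data.add(i);

        // One warm-up pass so JIT compilation does not dominate the timings.
        seqSum(data); parSum(data);
        System.out.println("sequential ns: " + time(() -> seqSum(data)));
        System.out.println("parallel   ns: " + time(() -> parSum(data)));
    }
}
```

Repeating the run with list sizes of 100, 10,000, and 1,000,000 elements is the easiest way to see the break-even point described above on your own hardware.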

    To restate all of the above: using parallel() for an unjustifiably small amount of computation can cost around 100 microseconds, while using it otherwise should save at least that much time (or perhaps hours, for very large tasks). The specific costs and benefits will vary over time, across platforms, and with context. For example, running small computations in parallel inside a sequential loop amplifies the ups and downs (performance microbenchmarks that exhibit this may not reflect real-world behavior).

    Questions and answers

    • Why doesn't the JVM itself decide when to perform operations in parallel?

    It could try, but the decision would be wrong too often. The search for fully automatic multicore parallelization has not produced a universal solution over the past thirty years, so the framework relies on a more robust approach that asks the user only for a yes-or-no choice. That choice rests on engineering judgments that constantly arise in sequential programming as well, and that are unlikely ever to disappear completely. For example, you can get a hundredfold slowdown by searching for the maximum value in a collection containing a single element, compared with just using that value directly (without the collection). Sometimes the JVM can optimize such cases away for you. But that rarely happens in sequential cases, and never in parallel mode. On the other hand, we can expect that, as they mature, the tools will help users make better decisions.

    • What if I don't know enough about the parameters (F, N, Q, S) to make a good decision?

    This, too, resembles problems that often occur in sequential programming. For example, the Collection method S.contains(x) is usually fast if S is a HashSet, slow if it is a LinkedList, and middling in other cases. Usually, the best way out for the author of a component that uses a collection is to encapsulate it and publish only a specific operation on it. Then users are shielded from the choice. The same applies to parallel operations. For example, a component with an internal collection of prices can define a method that checks its size against a threshold below which parallelism does not pay, as long as per-element computations are not too expensive. Example:

    private static final int MIN_PAR = 10_000; // threshold below which parallelism does not pay

    public long getMaxPrice() { return priceStream().max(Long::compare).orElse(0L); }

    private Stream<Long> priceStream() {
      return (prices.size() < MIN_PAR)
        ? prices.stream() : prices.parallelStream();
    }

    This idea can be extended to other considerations about when and how to use parallelism.

    • What if my function might do I / O or synchronized operations?

    At one extreme are functions that fail the independence criterion: those whose I/O is inherently sequential, those that access locking synchronized resources, and those where a failure in one parallel I/O-performing subtask affects the others. Parallelizing these makes little sense. At the other extreme are computations that only occasionally perform I/O or rarely-blocking synchronization (for example, most forms of logging, and uses of concurrent collections such as ConcurrentHashMap). These are harmless. Everything in between requires more analysis. If each subtask can block for a significant time waiting on I/O or access, CPU resources will sit idle with no way for the program or the JVM to use them. Everybody suffers from that. In such cases, parallel streams are not always the right choice. But there are good alternatives, for example asynchronous I/O and the CompletableFuture approach.
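A sketch of the CompletableFuture alternative for blocking work. Here fetch is a stand-in for real I/O, and the dedicated pool and its sizing are illustrative assumptions, not a prescription:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncFetch {
    // Simulated blocking call, standing in for real I/O such as an HTTP request.
    static String fetch(String url) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "response from " + url;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a", "b", "c");
        // A dedicated pool keeps blocking work off the common ForkJoinPool
        // that parallel streams share with the rest of the application.
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        List<CompletableFuture<String>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> fetch(u), pool))
                .toList();
        futures.forEach(f -> System.out.println(f.join()));
        pool.shutdown();
    }
}
```

The calls overlap while blocked, so total latency approaches that of the slowest call rather than the sum of all of them.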

    • What if my source is based on I / O?

    At the moment, Stream generators built on JDK I/O (for example, BufferedReader.lines()) are mainly geared toward sequential use, processing elements one by one as they arrive. Supporting high-performance bulk processing of buffered I/O is possible, but currently requires writing custom Stream generators, Spliterators, and Collectors. Support for some common cases may be added in future JDK releases.
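For example, BufferedReader.lines() delivers lines one at a time; a minimal sketch of the sequential use it is designed for (the in-memory StringReader stands in for a real file):

```java
import java.io.BufferedReader;
import java.io.StringReader;

public class LineSource {
    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("a\nbb\nccc"));
        // lines() yields elements one by one as they are read; calling
        // .parallel() on it rarely helps, because the source splits poorly.
        long totalChars = reader.lines().mapToLong(String::length).sum();
        System.out.println(totalChars); // prints 6
    }
}
```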

    • What if my program runs on a loaded computer and all the cores are busy?

    Machines usually have a fixed number of cores and cannot magically create more when parallel operations run. However, as long as the criteria for choosing parallel mode clearly argue in its favor, there is no reason for doubt. Your parallel tasks will compete with others for the CPU, and you will see less speedup. In most cases, that is still more effective than the alternatives. The underlying mechanism is designed so that, if no spare cores are available, you will notice only a slight slowdown compared with the sequential version, unless the system is so overloaded that it spends all its time context-switching instead of doing real work, or is configured on the assumption that all processing is sequential. If you have such a system, perhaps the administrator has already disabled the use of multiple threads/cores in the JVM settings. And if you are that administrator yourself, it may make sense to do so.
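The JVM setting alluded to is the common ForkJoinPool's parallelism, which parallel streams use by default; it can be inspected at runtime and capped at startup with the system property java.util.concurrent.ForkJoinPool.common.parallelism:

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSize {
    public static void main(String[] args) {
        // Launching the JVM with
        //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=1
        // effectively makes parallel streams run almost sequentially.
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```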

    • Do all operations parallelize when using parallel mode?

    Yes, at least to some extent. But bear in mind that the stream framework takes the constraints of sources and methods into account when choosing how. In general, the fewer the constraints, the greater the potential parallelism. On the other hand, there is no guarantee that the framework will identify and exploit every available opportunity for parallelism. In some cases, if you have the time and expertise, a hand-built solution may exploit the possibilities of parallelism much better.

    • What speedup will I get from parallelism?

    If you follow these guidelines, usually enough to be worthwhile. Predictability is not a strong point of modern hardware and systems, so there is no universal answer. Cache locality, GC characteristics, JIT compilation, memory-access contention, data layout, OS scheduling policies, and the presence of a hypervisor are some of the factors with significant influence. They affect sequential performance too, but parallelism often amplifies their effects: an issue causing a 10 percent difference in sequential execution can cause a tenfold difference in parallel processing.

    The stream framework includes some features that improve your chances of a speedup. For example, primitive specializations such as IntStream usually help parallel mode more than sequential mode, because they not only reduce resource (and memory) consumption but also improve cache locality. And using ConcurrentHashMap instead of HashMap in a parallel collect operation reduces internal overheads. More tips and recommendations will emerge as experience with the framework accumulates.
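Both features are easy to try. The sketch below uses IntStream to avoid boxing each element, and Collectors.groupingByConcurrent to collect directly into a ConcurrentMap instead of merging per-thread HashMaps (the word list is just illustrative data):

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamFeatures {
    public static void main(String[] args) {
        // IntStream works on primitive ints, avoiding an Integer box per element.
        long sum = IntStream.rangeClosed(1, 100).parallel().sum();
        System.out.println(sum); // prints 5050

        // groupingByConcurrent collects into a single ConcurrentMap, so parallel
        // subtasks avoid the merge step of an ordinary groupingBy collect.
        List<String> words = List.of("ant", "bee", "cat", "ape");
        ConcurrentMap<Character, List<String>> byFirst = words.parallelStream()
                .collect(Collectors.groupingByConcurrent(w -> w.charAt(0)));
        System.out.println(byFirst.get('a')); // contains "ant" and "ape"
    }
}
```

Note that concurrent collection gives up encounter order within each group, which is fine for unordered aggregations like this one.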

    • All this is too scary! Couldn't we just establish rules for JVM properties that turn parallelism off?

    We don't want to tell you what to do. Giving programmers new ways to do something wrong can be frightening. Mistakes in code, architecture, and estimation will of course occur. Decades ago, some people predicted that application-level parallelism would lead to great trouble. But it never came true.
