Side Effects – Streams

Side Effects

Efficient execution of parallel streams that produces the desired results requires the stream operations (and their behavioral parameters) to avoid certain side effects.

  • Non-interfering behaviors

The behavioral parameters of stream operations should be non-interfering (p. 909)—both for sequential and parallel streams. Unless the stream data source is concurrent, the stream operations should not modify it during the execution of the stream. See building streams from collections (p. 897).
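As a minimal sketch of interference (the class and method names are hypothetical, and Java 9 or later is assumed for List.of()), the lambda below adds elements to the stream's own data source during traversal. With an ArrayList the fail-fast check typically reports the modification, whereas a CopyOnWriteArrayList is a concurrent source whose stream traverses a snapshot.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class NonInterferenceDemo {

  // Returns true if modifying the source during traversal was detected.
  static boolean interferes(List<String> source) {
    try {
      // Interfering behavior: the lambda modifies the stream's data source.
      source.stream().forEach(s -> source.add(s + "!"));
      return false;                  // Traversal completed without an exception.
    } catch (ConcurrentModificationException e) {
      return true;                   // The source detected the modification.
    }
  }

  public static void main(String[] args) {
    // Non-concurrent source: the modification is detected (fail-fast).
    System.out.println(interferes(new ArrayList<>(List.of("a", "b"))));
    // Concurrent source: the stream traverses a snapshot of the list.
    System.out.println(interferes(new CopyOnWriteArrayList<>(List.of("a", "b"))));
  }
}
```

Fail-fast detection is a best-effort diagnostic and not a guarantee, so code should not rely on the exception being thrown.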

  • Stateless behaviors

The behavioral parameters of stream operations should be stateless (p. 909), for both sequential and parallel streams. A behavioral parameter implemented as a lambda expression should not depend on any state that might change during the execution of the stream pipeline. The results from a stateful behavioral parameter can be nondeterministic or even incorrect, whereas a stateless behavioral parameter always yields the same result for the same input.

The behavioral parameters of stream operations in a pipeline should avoid accessing shared state. Executing the pipeline in parallel can lead to race conditions in accessing the shared state, and using synchronization code to provide thread-safety may defeat the purpose of parallelization. Using the three-argument reduce() or collect() method can be a better way to encapsulate such state.
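The contrast can be sketched as follows, with hypothetical class and method names. The first method accumulates results in a list shared by all threads, and the second uses the three-argument collect() method so that no state is shared.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

class SharedStateDemo {

  // Stateful: the lambda writes to a list shared by all worker threads.
  // Synchronization makes the writes thread-safe but serializes them, and
  // the order of the elements in the result is not predictable.
  static List<Integer> squaresViaSharedList(int n) {
    List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
    IntStream.rangeClosed(1, n).parallel().forEach(i -> shared.add(i * i));
    return shared;
  }

  // Stateless: collect() creates separate result containers as needed and
  // merges them, preserving the encounter order of the stream.
  static List<Integer> squaresViaCollect(int n) {
    List<Integer> result = IntStream.rangeClosed(1, n).parallel()
        .map(i -> i * i)
        .boxed()
        .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
    return result;
  }

  public static void main(String[] args) {
    System.out.println(squaresViaSharedList(10)); // Element order may vary.
    System.out.println(squaresViaCollect(10));    // Always in encounter order.
  }
}
```

Replacing the synchronized list with a plain ArrayList in the first method would introduce a race condition, which is why that variant is not shown as runnable code.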

The intermediate operations distinct(), skip(), limit(), and sorted() are stateful (p. 915, p. 915, p. 917, p. 929). See also Table 16.3, p. 938. They can carry extra performance overhead when executed in a parallel stream, as such an operation can entail multiple passes over the data and may require significant data buffering.

Ordering

An ordered stream (p. 891) processed by operations that preserve the encounter order will produce the same results, regardless of whether it is executed sequentially or in parallel. However, repeated execution of an unordered stream, whether sequential or parallel, can produce different results.

Preserving the encounter order of elements in an ordered parallel stream can incur a performance penalty. The performance of an ordered parallel stream can be improved if the ordering constraint is removed by calling the unordered() intermediate operation on the stream (p. 932).

The three stateful intermediate operations distinct(), skip(), and limit() can perform better in a parallel stream that is unordered than in one that is ordered (p. 915, p. 915, p. 917). In an unordered parallel stream, the distinct() operation can retain any occurrence of a duplicate value, rather than necessarily the first occurrence. The skip() operation can skip any n elements, not necessarily the first n elements. The limit() operation can truncate the stream after any n elements, and not necessarily after the first n elements.
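The effect of relaxing the ordering constraint on limit() can be sketched as follows. The class and method names are hypothetical, and the source of 1,000 integers is chosen only for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class UnorderedLimitDemo {

  // Ordered parallel stream: limit() must deliver the first n elements.
  static List<Integer> firstN(int n) {
    return IntStream.range(0, 1_000).boxed().parallel()
        .limit(n)
        .collect(Collectors.toList());
  }

  // Unordered parallel stream: limit() may deliver any n elements.
  static List<Integer> anyN(int n) {
    return IntStream.range(0, 1_000).boxed().parallel()
        .unordered()
        .limit(n)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(firstN(5)); // Always the first five elements.
    System.out.println(anyN(5));   // Some five elements; may vary between runs.
  }
}
```

Whether the unordered variant is actually faster depends on the data source and the rest of the pipeline, so it is worth measuring rather than assuming.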

The terminal operation findAny() is intentionally nondeterministic, and can return any element in the stream (p. 952). It is especially suited for parallel streams.

The forEach() terminal operation ignores the encounter order, but the forEachOrdered() terminal operation preserves the order (p. 948). The sorted() stateful intermediate operation, on the other hand, enforces a specific encounter order, regardless of whether it is executed in a parallel pipeline (p. 929).
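A short sketch of these ordering guarantees in a parallel stream follows. The class name, method names, and sample words are assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class OrderingDemo {
  static final List<String> WORDS = List.of("delta", "alpha", "charlie", "bravo");

  // forEachOrdered() processes the elements in encounter order, one at a
  // time, even though the stream is parallel.
  static String joinInEncounterOrder() {
    StringBuilder sb = new StringBuilder();
    WORDS.parallelStream().forEachOrdered(w -> sb.append(w).append(' '));
    return sb.toString().trim();
  }

  // findFirst() respects the encounter order; findAny() need not.
  static Optional<String> first() { return WORDS.parallelStream().findFirst(); }
  static Optional<String> any()   { return WORDS.parallelStream().findAny(); }

  // sorted() imposes its own encounter order on the rest of the pipeline.
  static List<String> sortedWords() {
    return WORDS.parallelStream().sorted().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(joinInEncounterOrder()); // delta alpha charlie bravo
    System.out.println(first().get());          // delta
    System.out.println(any().get());            // Any one of the four words.
    System.out.println(sortedWords());          // [alpha, bravo, charlie, delta]
  }
}
```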

Autoboxing and Unboxing of Numeric Values

As the Stream API allows both object and numeric streams, and provides support for conversion between them (p. 934), choosing a numeric stream when possible can offset the overhead of autoboxing and unboxing in object streams.
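The difference can be sketched as follows, using hypothetical class and method names. The first method boxes every value in an object stream, the second stays with primitive values throughout, and the third converts an object stream to a numeric stream before summing.

```java
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class NumericStreamDemo {

  // Object stream: every element and partial sum is an Integer object.
  static int sumBoxed(int n) {
    return Stream.iterate(1, i -> i + 1).limit(n).reduce(0, Integer::sum);
  }

  // Numeric stream: the values remain primitive ints throughout.
  static int sumPrimitive(int n) {
    return IntStream.rangeClosed(1, n).sum();
  }

  // Conversion: mapToInt() turns a Stream<String> into an IntStream.
  static int totalLength(List<String> words) {
    return words.stream().mapToInt(String::length).sum();
  }

  public static void main(String[] args) {
    System.out.println(sumBoxed(100));                      // 5050
    System.out.println(sumPrimitive(100));                  // 5050
    System.out.println(totalLength(List.of("ab", "cde")));  // 5
  }
}
```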

As we have seen, in order to take full advantage of parallel execution, composition of a stream pipeline must follow certain rules to facilitate parallelization. In summary, the benefits of using parallel streams are best achieved when:

  • The stream data source is of a sufficient size and the stream is easily splittable into substreams.
  • The stream operations have no adverse side effects and are computation-intensive enough to warrant parallelization.