UE4 – Improving speed with ParallelFor

Developing for the Unreal Engine, we have sometimes to make complex for loops with a lot of stuff in the body. Usually, it’s a performance breaker, especially when the number of iterations of the for-loop increases.

The purpose of this small article is to introduce a nice trick that can greatly improves the performance of these operations : it’s the ParallelFor. ParallelFor allows us to multi-thread any for-loop in a minute, thus dividing the execution time by splitting the work between several threads. Let’s see how we can replace our for-loops with the ParallelFor.

Replacing a classical for loop with ParallelFor.

Let’s say we have a TArray of int32 and we want to increment the value of each element. The traditional for-loop can be written the following way:

for(int32& Element : Array)
{
    ++Element;
}

Or this way (using an index):

for(int32 Idx = 0; Idx < Array.Num(); Idx++)
{
    ++Array[Idx];
}

The execution time will clearly depends on the number of elements, but it could be drastically reduced by running the operation in several threads, which can be done with ParallelFor.

First, to do this we have to include ParallelFor.

#include "Runtime/Core/Public/Async/ParallelFor.h"

ParallelFor is a function which usually takes 2 parameters: the number of elements, and the body of the for-loop as a lambda expression. Here the number of elements is Array.Num() and the body of the for-loop is ++Array[Idx], so the syntax of the ParallelFor simply is:

ParallelFor(Array.Num(), [&](int32 Idx) {
    ++Array[Idx];
});

We may note that we use & in the lambda expression, to capture the reference to the Array, and the lambda expression gives as parameter the index of the element to manage. Now, the increment of elements will be done in several threads, thus reducing the total amount of time required to process the whole collection. As we can see, it is pretty straightforward and simple to implement. However, there are some aspects we must be careful about.

Precautions.

We have to make some precautions when using ParallelFor. Indeed, unlike the classical for loop, we have no guarantee on the order of the execution.

Moreover, as ParallelFor is multi-threaded, the classical issues of synchronization may intervene: if the body of the ParallelFor is trying to update a variable outside of the loop, we have to imagine that several threads will be doing the same at the same time. For instance, if we want to compute the sum of values of a TArray, ParallelFor is not the way to go. Generally speaking, we try to reserve ParallelFor for unitary operations, i.e. operations without interactions.

However, if we still have to use ParallelFor, we may use FCriticalSection which will prevents the concurrent execution of a small amount of code. Let’s imagine we have an array, and we want to copy all elements which are multiples of 5 into a new array (I don’t know why someone would want to make this, but let’s just say).

The syntax using ParallelFor would be:

ParallelFor(Input.Num(), [&](int32 Idx)
{
    if(Input[Idx]% 5 == 0)
    {
        Output.Add(Input[Idx]);
    }
});

But we have to keep in mind that the code of the lambda expression may be called simultaneously by several threads. If two threads try to add an element to the Output array, it will result in an exception.

To prevent this we will have to declare a FCriticalSection, lock it before the critical operation and unlock it after the critical operation (i.e. the operation that can’t be performed simultaneously by different threads):

FCriticalSection Mutex;
ParallelFor(Input.Num(), [&](int32 Idx)
{
    if(Input[Idx] % 5 == 0)
    {
        Mutex.Lock();
        Output.Add(Input[Idx]);
        Mutex.Unlock();
    }
});

The usage of the critical section will prevent multiple threads from running Output.Add(Input[Idx]) at the same time: when the lock is acquired by a thread, the other threads have to wait for it to be unlocked. But, it also reduce the interest of the ParallelFor. If all operations are critical, it means that the body can be executed by only one thread at once, and therefore we will prefer to use a standard for-loop.

Performance evaluation.

When using FCriticalSection, the benefits of using ParallelFor are less evident, and we have to test the performance with and without the ParallelFor. To reduce the amount of code writing, we can use the 3rd parameter of the ParallelFor function. It’s simply a boolean value, set to false (which is the default value), it means that the body will be run by different threads, and set to true it means that it will be run by a single thread.

Comparing the performance with the last parameter set to false or to true will allow us to know if it is interesting to use a ParallelFor or not. If it is set that a single threaded loop is more interesting, we will revert to the standard writing as the last parameter of the ParallelFor is mainly designed for debugging purposes, and the use of ParallelFor for a singly threaded operation is not relevant.

Leave a Reply