Temporary allocations in C++
Heap allocations in C++ are notoriously expensive when they hit the generic allocator. Many high-performance software projects employ custom allocation strategies that fit a scenario and speed up allocations. A great overview of allocation strategies common in games is given by Stefan Reinalter in his Molecular musings blog.
In this post I’ll talk specifically about temporary allocations in C++ and how to optimize STL allocations. The STL is difficult to use well because its types perform many heap allocations, and the allocator is a template parameter, which makes types with different allocators incompatible. C++11 and C++14, however, added features that in my opinion make the STL a viable option again, even for performance-critical code.
Temporary objects usually survive only a small scope within the execution of the program, but this scope (a function, say) might be entered many times. If the objects in it always allocate from the heap, we can suffer a substantial performance hit.
Imagine this function:
```cpp
void OnTaskComplete()
{
    std::vector<Task*> tasksToSchedule;
    for (auto& task : m_WaitingTasks)
    {
        if (task->CanBeScheduled())
        {
            tasksToSchedule.push_back(task);
            // Do some more stuff
        }
    }
    for (auto& task : tasksToSchedule)
    {
        // Schedule task
    }
}
```
The function in the example might get called after a task in a job system completes. It checks whether any of the waiting tasks can now be scheduled. It might get called hundreds of times per frame. The vanilla vector there can hurt performance, as it will perform a heap allocation every time at least one task has to be scheduled.
We know that the lifetime of the object is really narrow – it is temporary to the scope of the function. Let’s make it a lot faster.
We will employ a custom allocator for temporary objects that outperforms the standard heap allocations by 7x-10x.
A “linear allocator” pre-allocates a chunk of memory up-front and hands out pieces of it on each request. We manage nothing except the offset of where we currently are in the buffer. We can, however, “reset” the allocator back to an earlier offset, which effectively lets us reuse the memory. It is common in games to use such an allocator for all allocations within a frame and reset it at the beginning of the next one. This gives very fast O(1) allocations within the frame and a super-quick O(1) bulk “deallocation”. We can only get into trouble if we consume more memory within the frame than the allocator has to offer. Assertions are in order.
```cpp
#include <cassert>
#include <cstdlib>
#include <memory> // std::align

// In the full sample TEMP_MALLOC/TEMP_FREE are configurable;
// here they simply map to the system allocator.
#define TEMP_MALLOC(size) std::malloc(size)
#define TEMP_FREE(ptr) std::free(ptr)

class LinearAllocator
{
public:
    LinearAllocator(size_t size)
        : m_Ptr{ static_cast<char*>(TEMP_MALLOC(size)) }
        , m_TotalSize{ size }
        , m_FreeSpace{ size }
    {}

    ~LinearAllocator()
    {
        TEMP_FREE(m_Ptr);
    }

    void* Allocate(size_t size, unsigned alignment /* power of 2 */)
    {
        assert((alignment & (alignment - 1)) == 0);
        auto currentPtr = static_cast<void*>(m_Ptr + (m_TotalSize - m_FreeSpace));
        // std::align advances currentPtr and shrinks m_FreeSpace by the
        // alignment adjustment; it returns nullptr if size doesn't fit
        auto retPtr = std::align(alignment, size, currentPtr, m_FreeSpace);
        if (!retPtr)
        {
            assert(false && "Linear allocator full!"); // no space
            return nullptr;
        }
        m_FreeSpace -= size;
        return retPtr;
    }

    void Free()
    {
        // do nothing - memory is reclaimed in bulk by Reset
    }

    void Reset(size_t freeSpace)
    {
        m_FreeSpace = freeSpace;
    }

    size_t CurrentFreeSpace() const
    {
        return m_FreeSpace;
    }

private:
    char* m_Ptr;
    size_t m_TotalSize;
    size_t m_FreeSpace;
};
```
In the systems I work with, we have multiple threads that work on “tasks”. This makes defining a frame somewhat difficult. In this sample we will instead employ a scope object that resets the linear allocator in its destructor.
Our allocator will also be thread-local, so that every thread has its own copy.
```cpp
struct TempAllocatorScope
{
public:
    TempAllocatorScope()
        : m_Space(tlsLinearAllocator->CurrentFreeSpace())
    {}

    ~TempAllocatorScope()
    {
        tlsLinearAllocator->Reset(m_Space);
    }

private:
    size_t m_Space;
};
```
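To see the scope in action, here is a minimal, self-contained sketch of how it gets used. It repeats a condensed `LinearAllocator` so the snippet compiles on its own, and it assumes `tlsLinearAllocator` is pointed at a per-thread instance during thread start-up:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Condensed repeat of the LinearAllocator above, so this snippet stands alone.
class LinearAllocator
{
public:
    explicit LinearAllocator(std::size_t size)
        : m_Ptr(static_cast<char*>(std::malloc(size)))
        , m_TotalSize(size)
        , m_FreeSpace(size)
    {}
    ~LinearAllocator() { std::free(m_Ptr); }

    void* Allocate(std::size_t size, std::size_t alignment)
    {
        void* current = m_Ptr + (m_TotalSize - m_FreeSpace);
        void* ret = std::align(alignment, size, current, m_FreeSpace);
        assert(ret && "Linear allocator full!");
        m_FreeSpace -= size;
        return ret;
    }
    void Reset(std::size_t freeSpace) { m_FreeSpace = freeSpace; }
    std::size_t CurrentFreeSpace() const { return m_FreeSpace; }

private:
    char* m_Ptr;
    std::size_t m_TotalSize;
    std::size_t m_FreeSpace;
};

// Each thread points at its own allocator instance (assumed to be
// set up when the thread starts).
thread_local LinearAllocator* tlsLinearAllocator = nullptr;

struct TempAllocatorScope
{
    TempAllocatorScope() : m_Space(tlsLinearAllocator->CurrentFreeSpace()) {}
    ~TempAllocatorScope() { tlsLinearAllocator->Reset(m_Space); }
private:
    std::size_t m_Space;
};

// Everything allocated inside the scope is reclaimed in one go
// when the scope object is destroyed.
void DoTemporaryWork()
{
    TempAllocatorScope scope;
    void* scratch = tlsLinearAllocator->Allocate(256, 16);
    (void)scratch; // ... use the scratch buffer for the duration of the call ...
}   // destructor resets the allocator; the 256 bytes are free again
```

Scopes can also nest: each one captures the current free space on entry and restores exactly that value on exit.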
Finally, we need an STL-compatible allocator that forwards all calls to our linear one, which lets us use it with any STL type.
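The exact adapter is in the linked sample; as an illustration, a minimal C++11-style allocator sketch (the names `TempStdAllocator` and `TempVector` are mine, and a condensed `LinearAllocator` is repeated so the snippet compiles standalone) could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Condensed repeat of the LinearAllocator above.
class LinearAllocator
{
public:
    explicit LinearAllocator(std::size_t size)
        : m_Ptr(static_cast<char*>(std::malloc(size)))
        , m_TotalSize(size)
        , m_FreeSpace(size)
    {}
    ~LinearAllocator() { std::free(m_Ptr); }

    void* Allocate(std::size_t size, std::size_t alignment)
    {
        void* current = m_Ptr + (m_TotalSize - m_FreeSpace);
        void* ret = std::align(alignment, size, current, m_FreeSpace);
        assert(ret && "Linear allocator full!");
        m_FreeSpace -= size;
        return ret;
    }
    void Reset(std::size_t freeSpace) { m_FreeSpace = freeSpace; }
    std::size_t CurrentFreeSpace() const { return m_FreeSpace; }

private:
    char* m_Ptr;
    std::size_t m_TotalSize;
    std::size_t m_FreeSpace;
};

// One allocator per thread; 1 MB is an arbitrary budget for the example.
thread_local LinearAllocator tlsLinearAllocator(1024 * 1024);

// C++11 "minimal allocator": value_type, allocate and deallocate are enough,
// std::allocator_traits fills in the rest.
template <typename T>
struct TempStdAllocator
{
    using value_type = T;

    TempStdAllocator() = default;
    template <typename U>
    TempStdAllocator(const TempStdAllocator<U>&) noexcept {}

    T* allocate(std::size_t n)
    {
        // Forward to the linear allocator; alignof(T) is always a power of 2.
        return static_cast<T*>(
            tlsLinearAllocator.Allocate(n * sizeof(T), alignof(T)));
    }

    void deallocate(T*, std::size_t) noexcept
    {
        // No-op: memory is reclaimed in bulk when the allocator is reset.
    }
};

// All instances share the same thread-local buffer, so they compare equal.
template <typename T, typename U>
bool operator==(const TempStdAllocator<T>&, const TempStdAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TempStdAllocator<T>&, const TempStdAllocator<U>&) { return false; }

// An alias keeps call sites tidy.
template <typename T>
using TempVector = std::vector<T, TempStdAllocator<T>>;
```

With this in place, the temporary vector from the first example becomes a `TempVector<Task*>` and never touches the general-purpose heap.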
The complete test example is available here.
The sample allows us to switch between the linear allocator and the default one to measure the gains. On my machine it was between 7x – 10x with the VS 2015 runtime.
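The linked gist contains the complete benchmark. As an illustration only, here is a condensed, self-contained sketch of the same measurement idea (all the names are mine, not the sample's, and no particular speedup is promised since timings vary by machine and build settings):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <vector>

// Condensed versions of the pieces above so the benchmark stands alone.
class LinearAllocator
{
public:
    explicit LinearAllocator(std::size_t size)
        : m_Ptr(static_cast<char*>(std::malloc(size))), m_TotalSize(size), m_FreeSpace(size) {}
    ~LinearAllocator() { std::free(m_Ptr); }
    void* Allocate(std::size_t size, std::size_t alignment)
    {
        void* current = m_Ptr + (m_TotalSize - m_FreeSpace);
        void* ret = std::align(alignment, size, current, m_FreeSpace);
        assert(ret && "Linear allocator full!");
        m_FreeSpace -= size;
        return ret;
    }
    void Reset(std::size_t freeSpace) { m_FreeSpace = freeSpace; }
    std::size_t CurrentFreeSpace() const { return m_FreeSpace; }
private:
    char* m_Ptr;
    std::size_t m_TotalSize;
    std::size_t m_FreeSpace;
};

thread_local LinearAllocator tlsLinearAllocator(1024 * 1024);

struct TempAllocatorScope
{
    TempAllocatorScope() : m_Space(tlsLinearAllocator.CurrentFreeSpace()) {}
    ~TempAllocatorScope() { tlsLinearAllocator.Reset(m_Space); }
private:
    std::size_t m_Space;
};

template <typename T>
struct TempStdAllocator
{
    using value_type = T;
    TempStdAllocator() = default;
    template <typename U> TempStdAllocator(const TempStdAllocator<U>&) noexcept {}
    T* allocate(std::size_t n)
    { return static_cast<T*>(tlsLinearAllocator.Allocate(n * sizeof(T), alignof(T))); }
    void deallocate(T*, std::size_t) noexcept {}
};
template <typename T, typename U>
bool operator==(const TempStdAllocator<T>&, const TempStdAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TempStdAllocator<T>&, const TempStdAllocator<U>&) { return false; }

// Times `iterations` runs of a small temporary-vector workload, in microseconds.
// The same harness runs with either vector type, so the only variable is the allocator.
template <typename Vec>
long long MeasureMicroseconds(int iterations)
{
    using Clock = std::chrono::high_resolution_clock;
    long long sink = 0;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        TempAllocatorScope scope; // bulk-frees the linear memory each iteration
        Vec v;
        for (int j = 0; j < 64; ++j)
            v.push_back(j);
        sink += v.back();
    }
    const auto end = Clock::now();
    if (sink < 0) std::puts("unreachable"); // keep the loop from being optimized out
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Calling `MeasureMicroseconds<std::vector<int>>(N)` and `MeasureMicroseconds<std::vector<int, TempStdAllocator<int>>>(N)` and comparing the two numbers reproduces the shape of the measurement, though a release build and a realistic workload are needed for meaningful figures.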
Full code:
https://gist.github.com/stoyannk/8122bbd9871fa7b5fd10
If you are passionate about game technology and you would like to use your skills to tackle similar complex problems, join us. Check out our open positions here.
Follow Stoyan on Twitter: @stoyannk