Xavier Lamorlette

C++ Performance

Notes on C++ performance optimisation, from a training course by Hubert Matthews and from my own experience.

Contents:

Methodology

Performance optimisation methodology:

Tools to measure performance:

Notes on GCC and CPU Costs

GCC recommended compilation flags:

A branch misprediction is as costly as an L2 cache read.

On modern CPUs, floating point operations are as fast as integer operations.

Beware: virtual functions are usually not inlined.
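
An illustrative sketch (hypothetical types): a call through a base-class pointer is an indirect call through the vtable, which the compiler usually cannot inline, whereas the same loop on a concrete type can be inlined.

    #include <cstddef>

    struct Shape {
        virtual ~Shape() = default;
        virtual double area() const = 0;
    };

    struct Square : Shape {
        double side = 1.0;
        double area() const override { return side * side; }
    };

    // Virtual dispatch in a hot loop: each call goes through the vtable, usually not inlined.
    double total_area(const Shape* const* shapes, std::size_t n) {
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            total += shapes[i]->area();
        return total;
    }

    // Same loop on the concrete type: the compiler can inline the call.
    double total_area(const Square* squares, std::size_t n) {
        double total = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            total += squares[i].area();
        return total;
    }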

Caches

Cache efficiency is linked not only to temporal locality, but also to spatial locality.

Read amplification: useless data is brought into the cache because of a bad data layout. A bad layout can also prevent vectorisation (the use of parallel CPU instructions, which requires data to be laid out linearly).
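
A minimal sketch of the layout issue (the field names are hypothetical): with an array of structs, summing one field drags the neighbouring fields into the cache; a struct-of-arrays layout keeps the accessed data contiguous and lets the compiler vectorise the loop.

    #include <cstddef>
    #include <vector>

    // Array of structs: summing x also pulls y and z into the cache (read amplification).
    struct PointAoS { double x, y, z; };

    double sum_x(const std::vector<PointAoS>& points) {
        double sum = 0.0;
        for (const auto& p : points) sum += p.x;
        return sum;
    }

    // Struct of arrays: the x values are contiguous, cache-friendly and vectorisable.
    struct PointsSoA {
        std::vector<double> x, y, z;
    };

    double sum_x(const PointsSoA& points) {
        double sum = 0.0;
        for (std::size_t i = 0; i < points.x.size(); ++i) sum += points.x[i];
        return sum;
    }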

Containers

deque is roughly a vector of fixed-size arrays. Growing it does not reallocate and move the elements as with vector (only the block of pointers is reallocated). It is a bit faster on writes than a vector without memory reservation.
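
A small usage sketch: growing a deque never moves existing elements, whereas a vector that has not reserved capacity reallocates and moves its contents as it grows.

    #include <deque>
    #include <vector>

    void fill(int n) {
        std::deque<int> d;
        std::vector<int> v;
        for (int i = 0; i < n; ++i) {
            d.push_back(i);   // at most a new fixed-size block is allocated; elements never move
            v.push_back(i);   // occasional reallocation: all elements are moved to a larger buffer
        }
    }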

set (balanced binary tree) is much slower than unordered_set (hash table with singly-linked lists for collisions).

The find() of unordered_set is O(1) on average. But beware of the cost of the hash function.

With unordered_set, memory can be pre-allocated with reserve() (as with vector).

With set, an iterator hint can be given on insertion (useful when inserting elements in sorted order).
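
A small sketch of the two previous points: reserve() on an unordered_set to avoid rehashing, and an end() hint when inserting into a set in ascending order.

    #include <set>
    #include <unordered_set>

    void build(int n) {
        std::unordered_set<int> hashed;
        hashed.reserve(n);                      // pre-allocate buckets, avoids rehashing while growing
        for (int i = 0; i < n; ++i)
            hashed.insert(i);

        std::set<int> ordered;
        for (int i = 0; i < n; ++i)
            ordered.insert(ordered.end(), i);   // hint: amortised O(1) when inserting in sorted order
    }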

Boost's flat_set and flat_map are more cache-friendly (denser) than the std equivalents.
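
A minimal sketch, assuming Boost.Container is available: a flat_set keeps its elements sorted in one contiguous buffer, so lookups touch fewer cache lines than in a node-based set (at the cost of slower insertions).

    #include <boost/container/flat_set.hpp>

    bool contains_flat(const boost::container::flat_set<int>& s, int value) {
        // Sorted contiguous storage: binary search with good cache behaviour.
        return s.find(value) != s.end();
    }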

Data Access

6 questions about data access:

When choosing a data structure, memory density is important for caching reasons.

Strings

std::string_view (C++17): lightweight read-only view onto a std::string. It can lead to a dangling reference if the underlying std::string is destroyed.
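
A small sketch of the dangling case (hypothetical functions): the view only borrows the string's buffer, so returning a view onto a local std::string leaves it pointing at freed memory.

    #include <string>
    #include <string_view>

    std::string_view first_word_of_temp() {
        std::string s = "hello world";
        std::string_view view = s;
        return view.substr(0, view.find(' '));   // BUG: s is destroyed here, the returned view dangles
    }

    std::string_view first_word(std::string_view text) {
        return text.substr(0, text.find(' '));   // fine: the caller owns the underlying buffer
    }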

boost::string_ref is a std::string_view equivalent with extra functionality.

For small strings, consider std::array<char, N>.

Multithreaded Programming

With GCC, the STL can be used in parallel mode to execute algorithms on multiple cores (-D_GLIBCXX_PARALLEL and -fopenmp).
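
A hedged sketch of how this is enabled: compile with the two flags and the standard algorithms are transparently replaced by parallel versions (the sort below is an arbitrary example).

    // Build (assumption: GCC with libstdc++):
    //   g++ -O2 -D_GLIBCXX_PARALLEL -fopenmp main.cpp
    #include <algorithm>
    #include <vector>

    void sort_big(std::vector<double>& values) {
        std::sort(values.begin(), values.end());   // dispatched to a parallel sort in parallel mode
    }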

In C++17, look at the standard parallel algorithms (execution policies) and std::transform_reduce.
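
A minimal C++17 sketch: an execution policy asks the library to parallelise, and std::transform_reduce combines a per-element transformation with a reduction in one pass.

    #include <execution>
    #include <functional>
    #include <numeric>
    #include <vector>

    double sum_of_squares(const std::vector<double>& values) {
        return std::transform_reduce(std::execution::par,              // run in parallel
                                     values.begin(), values.end(),
                                     0.0,
                                     std::plus<>{},                    // reduction
                                     [](double x) { return x * x; });  // per-element transform
    }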

For efficient multithreading, align data on cache lines (to avoid false sharing between threads).
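
A small sketch of the idea (the 64-byte cache line is an assumption, the usual value on x86): per-thread counters packed in the same cache line would cause false sharing; aligning each one to its own cache line avoids it.

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t kCacheLineSize = 64;   // assumption: typical x86 cache line

    // Each counter occupies its own cache line, so two threads incrementing
    // different counters do not invalidate each other's cache line (no false sharing).
    struct alignas(kCacheLineSize) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[4];   // e.g. one per worker thread

    void work(int thread_index) {
        counters[thread_index].value.fetch_add(1, std::memory_order_relaxed);
    }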

CPU Vectorisation

Vectorisation relies on the SIMD processor extensions: Single Instruction Multiple Data.
SSE = Streaming SIMD Extensions.
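
An illustrative sketch: a simple loop over contiguous arrays like this is the kind of code the compiler can auto-vectorise into SSE/AVX instructions (e.g. with -O3 on GCC), processing several elements per instruction.

    #include <cstddef>

    // Contiguous data and a simple loop body: the compiler can emit SIMD
    // instructions that add several doubles at a time.
    void add_arrays(const double* a, const double* b, double* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }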

This page was last updated in January 2019.

The content of this site, as an original intellectual work, is protected by copyright.
For any comments, you can write to me at xavier.lamorlette@gmail.com.