Xavier Lamorlette - Performance

C++ Performance

Notes sur l'optimisation de performances en C++, issues d'une formation d'Hubert Matthews, et de mon expérience.

Sommaire :

Methodology
Notes on GCC and CPU costs
Caches
Containers
Data Access
Strings
Multithreaded programming
CPU vectorisation

Methodology

Performance optimisation methodology:

- start profiling / timing
- remove unnecessary calls and operations
- remove dynamic memory allocations
- try different algorithms affecting caching
- optimise data layout for cache-aware programming

Tools to mesure performance:

- without instrumentation:
- - perf
- - oprofile
- - top
- - iotop
- - vmstat
- - iostat
- - bash time
- - HPC Toolkit: uses sampling
- - Intel VTune

- with instrumentation:
- - Tracy
- - std::chrono::high_resolution_clock()
```
#include <chrono>

template <typename F>
void time(const std::string & label, F function) {
  using namespace std::chrono;
  auto start = high_resolution_clock::now();
  function();
  auto end = high_resolution_clock::now();
  auto ms = duration_cast<milliseconds>(end - start).count();
  std::cout << label << ": " << ms << " ms" << std::endl;
}

int main() {
  time("mycode", []{
    for (auto i = 0; i < 1000; i++) {
      f();
    }
  });
}
```
- - boost::time::cpu_timer

Notes on GCC and CPU costs

GCC recommended compilation flags:

-march=native
-Wsuggest-final-types and -Wsuggest-final-methods to help compiler devirtualise

A branch misprediction is as costly as a L2 cache read.

On modern CPUs, floating point operations are as fast as integer operations.

Beware virtual functions are usually not inlined.

Caches

Caches efficiency is not only linked to temporal locality, but also to spacial locality.

Read amplification: useless data is brought to cache because of bad data layout. This can also prevent vectorisation: use of CPU parallelised instructions if data is laid out linearly.

Containers

deque is a vector of vectors. No reallocation when the size increases as with vector (only reallocation of pointer blocks). A bit faster on write than vector without reservation of memory.

set (balanced binary tree) is much slower than unordered_set (hash table with singly-linked lists for collisions).

The find() of unordered_set is 0(1). But beware the cost of the hash function.

For unordered_set, possibility to pre-allocate memory with reserve (as with vector).

For set, possibility to give a hint on insertion.

Boot FlatSet and FlatMap are more cache friendly (denser) than the std equivalents.

Data Access

6 questions about data access:

- primary key access
- non-primary key access (find by value)
- range scans / sequential access
- read / write ratio
- working set size
- consistency

When choosing the data structure, the memory density is important for caching reasons.

Strings

std::string_view (C++17): lightweight read-only view on a std::string. It can lead to dangling reference if the underlying std::string is destroyed.

boost::string_ref is a std::string_view equivalent with extra functionality.

For small strings, consider std::array<char, N>.

Multithreaded programming

With GCC, STL can be compiled in parallel mode to execute algorithms on multiple CPU (-D_GLIBCXX_PARALLEL and -fopenmp).

In C++17, look for standard parallelisation and transform_reduce.

For efficient multithreading, align data with cache lines.

CPU vectorisation

La vectorisation repose sur les extensions processeurs SIMD : Single Instruction Multiple Data.
SSE = Streaming SIMD Extensions.

La derniére mise à jour de cette page date de mai 2024.

Le contenu de ce site est, en tant qu'œuvre originale de l'esprit, protégé par le droit d'auteur.
Pour tout commentaire, vous pouvez m'écrire à xavier.lamorlette@gmail.com.