Xavier Lamorlette
Notes sur l'optimisation de performances en C++, issues d'une formation d'Hubert Matthews, et de mon expérience.
Sommaire :
Performance optimisation methodology:
Tools to mesure performance:
std::chrono::high_resolution_clock()
#include <chrono>
template <typename F>
void time(const std::string & label, F function) {
using namespace std::chrono;
auto start = high_resolution_clock::now();
function();
auto end = high_resolution_clock::now();
auto ms = duration_cast<milliseconds>(end - start).count();
std::cout << label << ": " << ms << " ms" << std::endl;
}
int main() {
time("mycode", []{
for (auto i = 0; i < 1000; i++) {
f();
}
});
}
boost::time::cpu_timer
GCC recommended compilation flags:
-march=native
-Wsuggest-final-types
and -Wsuggest-final-methods
to help compiler devirtualiseA branch misprediction is as costly as a L2 cache read.
On modern CPUs, floating point operations are as fast as integer operations.
Beware virtual functions are usually not inlined.
Caches efficiency is not only linked to temporal locality, but also to spacial locality.
Read amplification: useless data is brought to cache because of bad data layout. This can also prevent vectorisation: use of CPU parallelised instructions if data is laid out linearly.
deque
is a vector of vectors.
No reallocation when the size increases as with vector
(only reallocation of pointer blocks).
A bit faster on write than vector
without reservation of memory.
set
(balanced binary tree) is much slower than unordered_set
(hash table with singly-linked lists for collisions).
The find()
of unordered_set
is 0(1). But beware the cost of the hash function.
For unordered_set
, possibility to pre-allocate memory with reserve
(as with vector
).
For set
, possibility to give a hint on insertion.
Boot FlatSet
and FlatMap
are more cache friendly (denser) than the std
equivalents.
6 questions about data access:
When choosing the data structure, the memory density is important for caching reasons.
std::string_view
(C++17): lightweight read-only view on a std::string
.
It can lead to dangling reference if the underlying std::string
is destroyed.
boost::string_ref
is a std::string_view
equivalent with extra functionality.
For small strings, consider std::array<char, N>
.
With GCC, STL can be compiled in parallel mode to execute algorithms on multiple CPU
(-D_GLIBCXX_PARALLEL
and -fopenmp
).
In C++17, look for standard parallelisation and transform_reduce
.
For efficient multithreading, align data with cache lines.
La vectorisation repose sur les extensions processeurs SIMD : Single Instruction Multiple Data.
SSE = Streaming SIMD Extensions.
La derniére mise à jour de cette page date de mai 2024.
Le contenu de ce site est, en tant qu'œuvre originale de l'esprit, protégé par le droit d'auteur.
Pour tout commentaire, vous pouvez m'écrire à xavier.lamorlette@gmail.com.