CBJ, what I do is processors down at the RTL level (Verilog) and also compiler optimizations, so the low-hanging fruit from my POV is possibly different from yours.
In processor land, random memory access (i.e. too much indirection) is the enemy of high performance computing. Period! (I/O is the other killer, but at least most people are aware of it)
We can try to compensate for that mainly by increasing cache sizes, since DRAM access latency is a major bottleneck, one that can make certain algorithms take about the same time on a 166MHz Pentium 2 as on a 4GHz Core i9.
Compiler optimizations help too, but in the end the main engineering idiom stands: you can't foolproof against talented fools (e.g. Python).
Optimizing the data structures is good, but optimizing the memory accesses is (to me) the real low-hanging fruit, and it is far too often overlooked.
To illustrate the point, take this example code:
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <chrono>

int* data = nullptr;
#define DATA_SZ (1024*1024*128)

void initData();
int doWork(bool random);
int _rand();

int main()
{
    using std::chrono::high_resolution_clock;
    using std::chrono::duration;
    duration<double, std::micro> t_delta;
    int ret;

    initData();
    printf("data init\n");

    auto t_start = high_resolution_clock::now();
    for (int j = 0; j < 5; j++) {
        t_start = high_resolution_clock::now();
        ret = doWork(false); // use and print the return value so the optimizer can't drop the work
        t_delta = high_resolution_clock::now() - t_start;
        printf("0x%8.8X seq run: %.2fus\n", ret, t_delta.count());
    }
    for (int j = 0; j < 5; j++) {
        t_start = high_resolution_clock::now();
        ret = doWork(true); // use and print the return value so the optimizer can't drop the work
        t_delta = high_resolution_clock::now() - t_start;
        printf("0x%8.8X rnd run: %.2fus\n", ret, t_delta.count());
    }

    if (data != nullptr)
        free(data);
    return 0;
}

int doWork(bool random)
{
    int ret = 0;
    for (int i = 0; i < DATA_SZ; i++)
    {
        int rnd = _rand() & (DATA_SZ - 1);
        ret += rnd;                   // tally the rand value so the optimizer can't drop it
        int addr = random ? rnd : i;  // either walk the array in order or jump around
        ret += data[addr];            // force the data access
    }
    return ret;
}

void initData()
{
    data = (int*)malloc(sizeof(int) * DATA_SZ);
    if (data == nullptr)
        exit(1);
    for (int i = 0; i < DATA_SZ; i++)
    {
        data[i] = _rand();
    }
}

// Two precessing LFSRs instead of CRT rand(), since rand() has nondeterministic performance.
unsigned int rnd_seed1 = 0xDEADBEEF;
unsigned int rnd_seed2 = 0x1EE17889;
int _rand()
{
    unsigned int feed = ((rnd_seed1 & (1u << 2)) >> 2) ^ ((rnd_seed1 & (1u << 30)) >> 30);
    rnd_seed1 = (rnd_seed1 << 1) | feed;
    feed = ((rnd_seed2 & (1u << 1)) >> 1) ^ ((rnd_seed2 & (1u << 28)) >> 28);
    rnd_seed2 = (rnd_seed2 << 1) | feed;
    return (int)(rnd_seed1 ^ rnd_seed2);
}
All it does is make an array of random ints and then do a whole bunch of memory accesses, some sequential, some random.
Running it gives me the following results for sequential (same-page) vs. random accesses:
Code: Select all
data init
0x5B959E7D seq run: 272732.10us
0x88610FC5 seq run: 275153.10us
0xEBA323E0 seq run: 271510.10us
0xDA428E09 seq run: 283839.20us
0xECA3E5AC seq run: 273258.40us
0x320BE4DF rnd run: 3825904.80us
0xB52BC756 rnd run: 3831637.00us
0xAC8D19B7 rnd run: 4047264.80us
0xCDD836A0 rnd run: 3846358.00us
0xBF6C3A38 rnd run: 4055526.60us
The randomized access result is an example of what happens when you have a complex data structure with multiple levels of object instances and data residing all over the place.
A super complex data structure is one that looks like
Code: Select all
obj->blah->foo->bar->more->stuff->Process()
All those pointer dereferences (or the C++ sugar that hides them) compound when you're running this inside a loop and each "obj" has its own unique "blah", "foo", ...
i.e. the execution sequence becomes the following if all the memory objects have not been thoughtfully allocated (the braces show each object's index in a memory pool)
Code: Select all
obj{0}->blah{300}->foo{1}->bar{523}->more{1}->stuff{87}->Process();
obj{1}->blah{0}->foo{634}->bar{3}->more{54}->stuff{1}->Process();
obj{2}->blah{109}->foo{32}->bar{276}->more{400}->stuff{543}->Process();
obj{3}->blah{239}->foo{298}->bar{102}->more{0}->stuff{200}->Process();
...
Even though the root "obj" elements are sequential in memory and likely don't trigger cache misses between accesses, the "blah" and other child elements do.
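Roughly, that kind of layout looks like the sketch below (the types and names are made up to mirror the example, not taken from any real codebase): every child is a separate heap allocation, so every "->" is a dependent load that can land on a different cache line or DRAM page.
Code: Select all
struct Stuff { int payload; int Process() { return payload; } };
struct More  { Stuff* stuff; };
struct Bar   { More*  more;  };
struct Foo   { Bar*   bar;   };
struct Blah  { Foo*   foo;   };
struct Obj   { Blah*  blah;  };

// Each iteration walks a chain of dependent pointer loads before Process()
// can even start; if the children were allocated separately over time, they
// end up scattered all over the heap and each hop is a likely cache miss.
int runAll(Obj** objs, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += objs[i]->blah->foo->bar->more->stuff->Process();
    return sum;
}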
One resolution is to lay out blah and the other children right after obj in memory, which effectively flattens obj.
Alternatively, re-sort and reorganize the child elements so that the blah of obj{0} sits right next to the blah of obj{1}, and so on for the other children.
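Here's a rough sketch of both approaches, again with hypothetical types purely for illustration: either embed the children by value so the whole chain for one object is a single contiguous block, or keep each child type in its own contiguous pool indexed in parent order.
Code: Select all
#include <vector>

// Plain child payloads for illustration (hypothetical, mirroring the names above).
struct Blah  { int payload; };
struct Foo   { int payload; };
struct Bar   { int payload; };
struct More  { int payload; };
struct Stuff { int payload; };

// Fix 1: flatten obj - the children live by value inside the parent, so
// obj{i} and all of its children share one contiguous block of memory.
struct FlatObj
{
    Blah  blah;
    Foo   foo;
    Bar   bar;
    More  more;
    Stuff stuff;
    int Process() { return blah.payload + stuff.payload; } // no pointer chasing
};

// Fix 2: keep each child type in its own pool, indexed in the same order as
// the parents, so the blah of obj{0} sits right next to the blah of obj{1}.
struct ObjPools
{
    std::vector<Blah>  blahs;  // blahs[i] belongs to object i
    std::vector<Foo>   foos;
    std::vector<Bar>   bars;
    std::vector<More>  mores;
    std::vector<Stuff> stuffs;

    int Process(int i) { return blahs[i].payload + stuffs[i].payload; }
};
Either way, a loop over all the objects now marches through memory mostly sequentially, like the "seq run" numbers above, instead of hopping around like the "rnd run" ones.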
(in processor land, a memory page is the same thing as a cache page, not to be confused with a virtual page, which is an OS thing)