Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:
```c
typedef struct { unsigned key; unsigned char val; } UnpackedStruct;
```
The two members of this struct take 5 bytes in total. However, the compiler pads the struct to a multiple of the 4-byte alignment of “key” (so that “key” stays aligned in an array of these structs), and “sizeof(UnpackedStruct)” therefore returns 8. 3 bytes are wasted per element. This waste of memory is the key reason why my khash library keeps keys and values in two separate arrays, even though this leads to more cache misses.
Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:
```c
typedef struct { unsigned key; unsigned char val; } __attribute__ ((__packed__)) PackedStruct;
```
With this, “sizeof(PackedStruct)” returns 5. Then why doesn’t gcc pack structs by default? Is it because unaligned memory access hurts performance? A Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusion.
Hash tables have become the bottleneck of my recent work, so I decided to revisit the question: does a packed struct hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t, uint8_t) integer pairs stored in either the packed or the unpacked struct shown above and see whether the performance differs. The following table shows the CPU time on my x86_64 laptop:
Key type | Value type | Size per element (bytes) | CPU seconds |
---|---|---|---|
`unsigned` | `uint8_t` | 5 | 10.249 |
`UnpackedStruct` | N/A | 8 | 9.429 |
`PackedStruct` | N/A | 5 | 9.287 |
The table says it all: on x86_64 CPUs, an array of packed structs does not hurt performance compared to an array of unpacked structs. With both gcc and clang, the packed struct is consistently faster, perhaps because it takes less space, which may help cache performance. The source code can be found here.
Finally, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack structs by default.
Nice article. We have always used packed structures in our applications, more for convenience than for performance. But we never found any performance issues with packed structures. This article explains why! Thanks.
I made a compiler. I started with aligning, then tested without. No discernible difference, so I removed it, because I like small code.
When I make something with wood, I carefully measure. I met a mechanical engineer and was shocked when, on a prototype, he stuck a switch in the center without measuring. If it looks right… it looks right!
I was taught to align both code and data. In a jump table, you would align all destinations. Not significant in practice. I didn’t measure.
I made major optimizations, but they had no effect because the CPU was already converting CISC to RISC and scheduling. The right answer would be to undo the optimizations, but I have pride in my code.
I misspoke. I never had automatic data alignment, just code alignment. I always align my structures by hand, so it never occurred to me. You might test data alignment — it might be silly to align by hand, I’ll bet. I have dignity, though.
I misspoke. Auto alignment is really stupid. What I meant to say is any alignment might be pointless, but I have dignity.
I did not get into register scheduling and trying to optimize for the CPU instruction pipes. I don’t care that much. It’s voodoo. I’ll laugh at people who hand-code asm.
It’s not much trouble to align data structures by hand. It forces you to group better.
I saw something hilarious on http://www.osdev.org
Someone was saying that bit-field order could be changed. I find that hilarious.
Only reply to this one. I also hand-align structs as much as possible. However, in the example in my post, how could you align a 4-byte integer next to a 1-byte integer without wasting space? Of course we can use a 5-byte uint8_t[] array, but that is cumbersome.
On modern CPUs, the ratio of CPU clock to bus clock is huge: for every bus clock, you have ample CPU clocks to process the data. So your artificial benchmark is memory bound because the CPU work is trivial; the only thing you were really testing was bus speed. Unless you have complex code/access patterns that the prefetcher can’t guess, you aren’t really testing what you think you’re testing. Given that CPU manufacturers recommend aligned data, if you could make a benchmark agnostic to the memory bus, cache and prefetch predictor, you’d find aligned data faster. You have shown just how little difference it makes on modern CPUs, and that is noteworthy. But be careful about drawing conclusions from it.
gcc cannot arbitrarily pack by default because that would break the platform ABI. You may ask why the platform ABI chose to align.
In this case, with a tiny data structure, the packed version is faster (by ~5%) because of the per-element overhead (8 bytes vs 5). But when I use big structures in a hash table, the packed version is slower (by ~15%) because in real tests I need to work with that data. In simple words, the hash table may work faster, but all other data processing will be slower.
However, it’s always good to try both variants in each specific case.
On x86 there hasn’t been a penalty for unaligned data [on non-vector code] since the Pentium MMX. From the Pentium Pro forward, the only case that is slower is crossing a cache line, which was 32 bytes then and is 64 bytes now. I’m not 100% sure about the Athlon, but I would expect it was the same (and it had 64-byte cache lines back then, if anyone cares).
You’re wrong: pre-Haswell Xeons have a significant performance penalty when accessing unaligned data. You should also consider how compiler vector optimizations are affected.
I know this thread is quite a few years old now, but running khash_test on both an i7 gaming machine and a low-power Intel Celeron laptop, the packed data is consistently ~11% faster on both machines. There’s a lot of misinformation floating around, but after these tests I’m pretty convinced it’s worth testing packing as an improvement. There are probably still cases where aligned accesses are faster, but I haven’t seen any so far.
(also someone else ran some similar tests: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/)