Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:
typedef struct {
unsigned key;
unsigned char val;
} UnpackedStruct;
The two members of this struct take 5 bytes in total. However, because “val” has to be aligned with the longer “key”, “sizeof(UnpackedStruct)” returns 8. 3 bytes are wasted in this struct. Waste of memory is the key reason why my khash library uses two separate arrays to keep keys and values even though this leads to more cache misses.
Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:
typedef struct {
unsigned key;
unsigned char val;
} __attribute__ ((__packed__)) PackedStruct;
With this, “sizeof(PackedStruct)” returns 5. Then why gcc does not use this by default? Is it because unaligned memory hurt performance? Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusions.
Hash table has become the bottleneck of my recent works, so I decided to revisit the question: does packed struct hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t,uint8_t) integer pairs stored in either packed or unpacked struct shown above and see if the performance is different. The following table shows the CPU time on my x86_64 laptop:
| Key type | Value type | size per elem | CPU seconds |
|---|---|---|---|
| Unsigned | uint8_t | 5 | 10.249 |
| UnpackedStruct | N/A | 8 | 9.429 |
| PackedStruct | N/A | 5 | 9.287 |
The table says it all: on x64 CPUs, a packed struct array does not hurt performance in comparison to an unpacked struct array. With both gcc and clang, packed struct is consistently faster, perhaps because packed struct takes smaller space, which might help cache performance. The source code can be found here.
At last, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack struct by default.