Most C programmers know that in a C struct, members have to be aligned in memory. Take the following struct as an example:
```c
typedef struct { unsigned key; unsigned char val; } UnpackedStruct;
```
The two members of this struct take 5 bytes in total. However, the compiler pads the struct to a multiple of the 4-byte alignment of “key” (so that “key” stays aligned in an array of these structs), and “sizeof(UnpackedStruct)” therefore returns 8. 3 bytes are wasted per element. This waste of memory is the key reason why my khash library keeps keys and values in two separate arrays, even though this leads to more cache misses.
Khash was initially written about 10 years ago when I was young and foolish. I later learned that with gcc/clang, it is possible to byte-pack the struct:
```c
typedef struct { unsigned key; unsigned char val; } __attribute__ ((__packed__)) PackedStruct;
```
With this, “sizeof(PackedStruct)” returns 5. Then why doesn’t gcc pack structs by default? Is it because unaligned memory access hurts performance? A Google search pointed me to this question on StackOverflow. There was a discussion, but no clear conclusion.
Hash tables have become the bottleneck of my recent work, so I decided to revisit the question: does a packed struct hurt performance on x86_64 CPUs? As usual, I did a very simple benchmark: with khash, I insert/delete 50 million (uint32_t, uint8_t) integer pairs stored in either the packed or the unpacked struct shown above and see whether the performance differs. The following table shows the CPU time on my x86_64 laptop:
Key type | Value type | Size per element (bytes) | CPU seconds |
---|---|---|---|
`unsigned` | `uint8_t` | 5 | 10.249 |
`UnpackedStruct` | N/A | 8 | 9.429 |
`PackedStruct` | N/A | 5 | 9.287 |
The table says it all: on x86_64 CPUs, an array of packed structs does not hurt performance compared to an array of unpacked structs. With both gcc and clang, the packed struct is consistently faster, perhaps because it takes less space, which may help cache performance. The source code can be found here.
Finally, it should be noted that x86 CPUs have been optimized for unaligned memory access. On other CPUs, the results may be very different. Perhaps that is why gcc does not pack structs by default.
Nice article. We have always used packed structures in our applications, more for convenience than for performance. But we never found any performance issues with packed structures. This article explains why! Thanks.
I made a compiler. I started with aligning, then tested without. No discernible difference, so I removed it, because I like small code.
When I make something with wood, I carefully measure. I met a mechanical engineer and was shocked when, on a prototype, he stuck a switch in the center without measuring. If it looks right… it looks right!
I was taught to align both code and data. In a jump table, you would align all destinations. Not significant in practice. I didn’t measure.
I made major optimizations, but they had no effect because the CPU was already converting CISC to RISC and scheduling. The right answer would be to undo the optimizations, but I have pride in my code.
I misspoke. I never had automatic data alignment, just code alignment. I always align my structures by hand, so it never occurred to me. You might test data alignment — it might be silly to align by hand, I’ll bet. I have dignity, though.
I misspoke. Auto alignment is really stupid. What I meant to say is any alignment might be pointless, but I have dignity.
I did not get into register scheduling and trying to optimize for the CPU instruction pipes. I don’t care that much. It’s voodoo. I’ll laugh at people who hand-code asm.
It’s not much trouble to align data structures by hand. It forces you to group better.
I saw something hilarious on http://www.osdev.org
Someone was saying that bit-field order could be changed. I find that hilarious.
Only reply to this one. I also hand-align structs as much as possible. However, in the example in my post, how could you align a 4-byte integer next to a 1-byte integer without wasting space? Of course we can use a 5-byte uint8_t[] array, but that is cumbersome.
On modern CPUs, the ratio of CPU clock to bus clock is huge: for every bus clock, you have ample CPU clocks to process the data. So your artificial benchmark is memory bound because the CPU work is trivial; the only thing you were really testing was bus speed. Unless you have complex code/access patterns that the prefetcher can’t guess, you aren’t really testing what you think you’re testing. Given that CPU manufacturers recommend aligned data, if you could make a benchmark agnostic to the memory bus, cache and prefetch predictor, you’d find aligned data faster. You have shown just how little difference it makes on modern CPUs, and that is noteworthy. But be careful about drawing conclusions from it.
gcc cannot arbitrarily pack by default because that would break the platform ABI. You may ask why the platform ABI chose to align.
In this case, with a tiny data structure, the packed version is faster (by ~5%) because of the per-element overhead (8 bytes vs 5). But when I use big structures in a hash table, the packed version is slower (by ~15%) because in real tests I need to work with that data. In simple words, the hash table may work faster, but all other data processing will be slower.
However, it’s always good to try both variants in each specific case.
On x86 there hasn’t been a penalty for unaligned data [on non-vector code] since the Pentium MMX. From the Pentium Pro forward, the only case that is slower is crossing a cache line, which was 32 bytes then and is 64 bytes now. I’m not 100% sure about the Athlon, but I would expect it was the same (and it had 64-byte cache lines back then, if anyone cares).
You’re wrong: pre-Haswell Xeons have a significant performance penalty when accessing unaligned data. You should also consider how compiler vector optimizations are affected.
I know this thread is quite a few years old now, but running khash_test on both an i7 gaming machine and a low-power Intel Celeron laptop, the packed data is consistently ~11% faster on both machines. There’s a lot of misinformation floating around, but after these tests I’m pretty convinced it’s worth testing packing as an improvement. There are probably still cases where aligned accesses are faster, but I haven’t seen any so far.
(also someone else ran some similar tests: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/)