One of the C/C++ single header libraries I maintain is hashmap.h. This library is a super light weight and easily integrated hashmap - I’m not focused on the performance necessarily here, it’s really there primarily as a nice-to-use hashmap for when you just want to throw something in there. But that doesn’t mean I want the performance to suck either. If I can make it faster without compromising the API, I’m totally gonna do that!

One thing I used previously was adding a special case when people were using the library with SSE4.2 enabled:

for (; i < len; i++) {
  crc32val = _mm_crc32_u8(crc32val, HASHMAP_CAST(unsigned char, s[i]));
}

This SSE4.2 path was 2.5x faster than the fallback (a lookup table).

I use a M1 Macbook Air, which as everyone knows uses the Arm architecture. There is no SSE4.2 on Arm, so can I make this faster on Arm too?

Well turns out I can! There is an extension for CRC32 on Arm too, that I can use.

for (; i < len; i++) {
  crc32val = __crc32b(crc32val, HASHMAP_CAST(unsigned char, s[i]));
}

And running this on my M1 Macbook Air this goes 2.5x too! I patted myself on the back and went to do a PR on GitHub. Then I realised - why am I only doing my CRC calculations on a byte-by-byte basis?

Turns out I just didn’t know how the CRC intrinsics actually worked. And there are CRC intrinsics that let us process multiple bytes at a time:

for (; (i + sizeof(uint64_t)) < len; i += sizeof(uint64_t)) {
  uint64_t next;
  memcpy(&next, &s[i], sizeof(next));
  crc32val = __crc32d(crc32val, next);
}

This will process eight bytes at a time instead of one, so how much faster does it go? 5.2x faster than the single-byte CRC intrinsic, and a whopping 13.3x faster than the original fallback code.

And the SSE4.2 code can process four bytes at a time with the intrinsics there too:

for (; (i + sizeof(unsigned)) < len; i += sizeof(unsigned)) {
  unsigned next;
  memcpy(&next, &s[i], sizeof(next));
  crc32val = _mm_crc32_u32(crc32val, next);
}

Adding this makes the X86 code go 3.1x faster than the single-byte CRC intrinsic, and a nice 7.7x than the original fallback code.

I also noticed that I only have power of 2 size for my hashmap, but I wasn’t relying on this when doing the hashmap calculations (I was doing an integer modulus, where I could just use an and-mask instead!). Doing this made the hashing calculations 1.01x faster, but every little helps.

All the code has now landed in the library and any users will be able to make use of it already!