SPIR-V is a simple binary intermediate language used for graphics shaders and compute kernels. Wearing my work hat (I work at Codeplay Software Ltd.), I have been contributing to the SPIR-V specification as one of its authors since 2014. SPIR-V’s primary goals are (as I see them):

  • Have a regular binary structure.
  • Be easily extendable.
  • Be easy to validate for correctness.
  • Be easy to produce from compiler toolchains.
  • Be easy to consume in tools and drivers.

One thing SPIR-V has not prioritised, though, is the size of the resultant binaries. The awesome @aras_p wrote a great summary of the problem (and his tool SMOL-V) on his blog - SPIR-V Compression. The SMOL-V tool is a single C++ header / single C++ source file.

I’m a big fan of single C header libraries, and was curious whether I could write a similar tool to his, in C, using my knowledge of SPIR-V to get a better compression ratio. In my previous blog posts ‘spirv-stats - a tool to output statistics of your SPIR-V shader modules‘ and ‘spirv-stats update - exposing more information‘ I took an in-depth look at what takes up the most space in the SPIR-V shaders that @aras_p was using for testing.

Then I began writing my own tool for compressing SPIR-V shaders, which I’m calling YARI-V (a yari is a type of Japanese spear, which seemed an appropriate name for a sister encoding to SPIR-V).

In the remainder of this post I’ll walk you through the steps I took to compress the SPIR-V shaders that @aras_p was using for testing, and compare and contrast the result of my own library YARI-V against SMOL-V.

Test Set

I didn’t have handy access to real world shaders like @aras_p had for his SMOL-V tool, so I simply used the 341 shaders he tests SMOL-V against to test YARI-V as well. The total size of the uncompressed shaders is 4868.47 kilobytes, and each compression attempt below is evaluated as a percentage of this size.

Varint Encoding

The first thing I thought to do was use the varint encoding from Google’s Protocol Buffers for everything. For the uninitiated, SPIR-V is word based - everything is held in 32 bit values. IDs are in the range [1..N), starting at 1 and incrementing from there as more IDs are required. This means that for small shaders, most IDs in use are going to be small unsigned integer numbers. Let’s take a look at the OpDecorate instruction as an example:

| word count | opcode  | <id> target | decoration | literal* |
|:----------:|:-------:|:-----------:|:----------:|:--------:|
| 2 bytes    | 2 bytes | 4 bytes     | 4 bytes    | 4 bytes  |

As we can see, the instruction is made up of:

  • two bytes for the word count
  • two bytes for the opcode
  • four bytes for the <id> to target with this decoration
  • four bytes for the decoration itself (an enumeration of values in the range [0..N))
  • four bytes for each optional literal (some decorations can take other values)

What I did was take the word count and varint encode that (this value is normally very low for opcodes) - the only opcodes that could have a word count greater than 127 (the magic cutoff to fit within 1 byte using varint encoding) are the ones that take strings (like OpString, OpName, OpMemberName). This meant that word count was taking 1 byte instead of 2 in most cases.

Next, I varint encoded the opcode. Most of the opcodes we use are below the 127 cutoff for varint encoding, so we can encode this as 1 byte instead of 2 again. The worst case to fit within 2 bytes using varint encoding is 16383, and our maximum opcode at present is in the 4000 range, so we shouldn’t ever require more bytes than the original encoding by using varint.
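For reference, here’s a minimal sketch in C of the varint encoding itself (the function name and signature are my own illustration, not YARI-V’s actual API): values below 128 take one byte, values below 16384 take two, and so on up to five bytes for a full 32 bit value.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode an unsigned 32 bit value as a varint: 7 bits per byte, low bits
   first, with the top bit set on every byte except the last. Returns the
   number of bytes written (1 to 5). */
static size_t varint_encode(uint32_t value, uint8_t *out) {
  size_t bytes = 0;
  do {
    uint8_t byte = value & 0x7Fu;
    value >>= 7;
    if (value != 0) {
      byte |= 0x80u; /* more bytes follow */
    }
    out[bytes++] = byte;
  } while (value != 0);
  return bytes;
}
```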

Next, the <id>. In small shaders it would be normal for all <id>’s to be less than 127, but even in large shaders most <id>’s are likely to be lower than the 2 byte boundary for varint - the value 16383. Given that an <id> previously took 4 bytes every time, we are saving at least 2 bytes in nearly all the cases we care about.

The decoration used in OpDecorate currently has [0..44) possible values, so this will always fit inside 1 byte of our varint encoding.

And any literal used by the decoration we’ll just varint encode and hope that it’ll be worth it.

After doing this for all opcodes in the SPIR-V shaders, we reduced our SPIR-V shader size to:

2279.97 kilobytes (46.8%)

So a pretty healthy start for reducing the size of the binaries!

OpLabel <-> OpNop

So after varint encoding, the next thing I looked at was the output from my spirv-stats tool. It showed that OpLabel was being used 9915 times in our shaders. OpLabel’s opcode value is 248, which means it requires 2 bytes when varint encoded. So I decided to find another opcode that wasn’t being used within our shaders and whose value was small enough to fit within a 1 byte varint value. OpNop has the value 0 and is never used within our shaders, so I decided to swap the values of these two during encoding, and swap them back during decoding.

The next thing I noticed was that OpLabel has a constant word count - the number of words it takes is the same for every use of the opcode. Given that it is constant, we can simply skip encoding the word count and infer its value during decoding.

2260.61 kilobytes 46.43%

This reduced the size by about 19 kilobytes.
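A rough sketch of these two tricks in C is below - the helper is my own illustration rather than YARI-V’s actual code, though the SpvOp values match the SPIR-V specification:

```c
#include <stdint.h>

/* OpNop is 0 and never appears in these shaders; OpLabel is 248 and appears
   9915 times. Swapping the two means OpLabel varint encodes to one byte. */
enum { SpvOpNop = 0, SpvOpLabel = 248 };

static uint32_t remap_opcode(uint32_t opcode) {
  if (opcode == SpvOpLabel) return SpvOpNop;   /* 248 -> 0: one varint byte */
  if (opcode == SpvOpNop)   return SpvOpLabel; /* keep the mapping reversible */
  return opcode;
}

/* OpLabel is always exactly 2 words (the opcode word plus its result <id>),
   so the encoder writes only the remapped opcode and the <id>, and the
   decoder re-inserts the constant word count of 2. */
```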

Finding Moar Things to Swap

After realising that swapping opcodes whose values are greater than 127 with unused or little-used opcodes whose values are smaller than 128 worked, I decided to go through the next set of most used opcodes (as found by spirv-stats) and swap them too, and where possible skip encoding the word count of an opcode if it was constant.

I first swapped OpFMul <-> OpSourceContinued, and OpFAdd with OpSource:

2246.63 kilobytes 46.15%

Then I swapped OpBranch and OpSourceExtension, not encoding the word count of OpBranch:

2236.58 kilobytes 45.94%

Then I swapped OpFSub and OpUndef, not encoding OpFSub’s word count, and also not encoding OpFAdd’s and OpFMul’s word counts:

2217.26 kilobytes 45.54%

In total another 43 kilobytes shaved off the size!

Delta Encoding

SPIR-V uses a compiler intermediate form known as Static Single Assignment (SSA for short), which means the result of an opcode is assigned to an <id> once, and that <id> is never reassigned. This means that once we go over the 127 value boundary for an <id>, we are going to require 2 bytes for every subsequent <id> we encode.

For the most part, <id>’s will be linearly increasing throughout the length of the program, e.g. for the current opcode we can be quite confident that the previous opcode had an <id> of our <id> - 1. Given this fact, I delta encoded our <id> against the previous known <id>. There was a problem though - what if the previously used <id> was actually bigger than our <id>? This would result in the subtraction creating a large unsigned integer number, which would take 5 bytes to encode! To get round this, I used a lovely little bit twiddling hack called zig-zag encoding (used in Google’s Protocol Buffers, but explained really well here). Zig-zag encoding allows all integers in the range [-64..64) to be encoded using one byte when combined with our varint encoding, meaning that even if the previous <id> was actually larger than our own, we would still hopefully be able to encode the delta from it to our own in 1 or 2 bytes (rather than a worst case of 5 bytes).
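Here’s a sketch of zig-zag encoding combined with the delta, reusing the varint_encode helper from the earlier sketch (again, the function names are my own illustration, not YARI-V’s):

```c
#include <stddef.h>
#include <stdint.h>

/* Map a signed delta to an unsigned value so that small deltas of either
   sign stay small: 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
   Deltas in [-64..64) then fit in a single varint byte. */
static uint32_t zigzag_encode(int32_t delta) {
  return ((uint32_t)delta << 1) ^ (uint32_t)(delta >> 31);
}

static int32_t zigzag_decode(uint32_t value) {
  return (int32_t)(value >> 1) ^ -(int32_t)(value & 1);
}

/* Delta encode a result <id> against the previously seen result <id>. */
static size_t encode_delta_id(uint32_t id, uint32_t previous_id, uint8_t *out) {
  return varint_encode(zigzag_encode((int32_t)(id - previous_id)), out);
}
```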

I also thought I’d try delta encoding our types separately. In general the <id>’s assigned to types are close to each other in the SPIR-V shaders, because types are all declared in one section at the beginning of the shaders. So I thought by delta encoding the types I’d also get a nice little compression win.

2111.88 kilobytes 43.38%

So doing this shaved a lovely 106 kilobytes off of the YARI-V encoded size.

Never Encode a Constant Word Count

I’d already shown that not encoding the word count where possible would save us at least 1 byte per opcode we could do this for, so I did a pass over all the opcodes in SPIR-V to not encode the word count for all opcodes where the word count was a constant.

2081.48 kilobytes 42.75%

This shaved a further 30 kilobytes off our YARI-V encoded size.
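Conceptually this is just a lookup shared by the encoder and decoder - a heavily abridged sketch (not YARI-V’s real table), with opcode values taken from the SPIR-V specification:

```c
#include <stdint.h>

/* Return the known constant word count for an opcode, or 0 if the opcode is
   variable length and its word count must still be encoded. */
static uint32_t constant_word_count(uint32_t opcode) {
  switch (opcode) {
  case 129: /* OpFAdd   */ return 5;
  case 133: /* OpFMul   */ return 5;
  case 248: /* OpLabel  */ return 2;
  case 249: /* OpBranch */ return 2;
  default:  return 0; /* variable length */
  }
}
```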

Fake Opcodes

I was a little disappointed that never encoding the constant word count only knocked 30 kilobytes off of our encoded size - then I realised, the most used opcodes in our SPIR-V shaders are all variable length (as per the specification). But are they really? I added some output to spirv-stats to show when OpLoad and OpStore had the optional additional Memory Access literal - and it turns out exactly zero of our OpLoad’s and OpStore’s used it! So for our purposes, OpLoad and OpStore had a constant word count, they just didn’t know it.

What I did was split OpLoad into two encodings: OpLoad, and a new fake opcode called OpLoadWithMemoryAccess. I set the value for OpLoadWithMemoryAccess to above 500 (the largest SPIR-V opcode value in use at present is in the low 300’s, so I hope this is safe enough for the time being), and with this split both our OpLoad and OpLoadWithMemoryAccess opcodes have constant word counts (4 for OpLoad, 5 for OpLoadWithMemoryAccess). Doing this allowed me to save 1-2 bytes for each use of OpLoad (which accounts for 16% of the opcodes in our SPIR-V shaders!)

2032.74 kilobytes 41.75%

Next I did the same for OpStore, making a new fake opcode OpStoreWithMemoryAccess, and not encoding the now constant word count for OpStore and OpStoreWithMemoryAccess.

2004.19 kilobytes 41.17%

In total shaving 77 kilobytes off of our YARI-V encoded size.
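Conceptually the split looks something like the sketch below. The fake opcode values above 500 are my own arbitrary picks for illustration, while SpvOpLoad and SpvOpStore are the real values from the specification:

```c
#include <stdint.h>

enum {
  SpvOpLoad  = 61,
  SpvOpStore = 62,
  FAKE_OP_LOAD_WITH_MEMORY_ACCESS  = 501,
  FAKE_OP_STORE_WITH_MEMORY_ACCESS = 502
};

/* OpLoad is 4 words without the optional Memory Access literal and 5 with
   it; OpStore is 3 or 4. Splitting each into two opcodes means every
   variant has a constant word count, so the word count is never encoded. */
static uint32_t choose_encoded_opcode(uint32_t opcode, uint32_t word_count) {
  if (opcode == SpvOpLoad) {
    return (word_count == 4) ? SpvOpLoad : FAKE_OP_LOAD_WITH_MEMORY_ACCESS;
  }
  if (opcode == SpvOpStore) {
    return (word_count == 3) ? SpvOpStore : FAKE_OP_STORE_WITH_MEMORY_ACCESS;
  }
  return opcode;
}
```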

More Decorations

OpDecorate is the third most used opcode, with 8.28% of the total opcodes. I added some information to spirv-stats to output how many of the decorations had no literals and how many had one literal (none of the decorations in use today take more than one). 71% of the uses have no literals, and 29% have one. So I decided to split OpDecorate into three encodings: one that contains a decoration with no literals, one that contains a decoration with exactly one literal, and one with two or more literals (to future proof the encoder). This allowed me to make all of our uses of OpDecorate have a constant word count, meaning we do not need to encode it. I also swapped these new fake opcodes with OpLine and OpExtension so that their opcode values fit in a single varint byte.

1996.46 kilobytes 41.01%

Shaving 8 kilobytes off of the YARI-V encoding.

Moar Member Decorations

Given the success of splitting OpDecorate, I decided to do the same with OpMemberDecorate, which is the sixth most used opcode in our SPIR-V shaders. 90% of the uses of OpMemberDecorate had 1 literal, so I decided I’d split it into three encodings (just like I did with OpDecorate): one that contains a decoration with no literals, one that contains a decoration with exactly one literal, and one with two or more literals. I also swapped these new fake opcodes with OpExtInstImport and OpMemoryModel.

I also noticed that I wasn’t delta encoding the <id>’s for the new fake OpDecorate or OpMemberDecorate variants, so I did that too.

1967.65 kilobytes 40.42%

All of this resulted in shaving a further 29 kilobytes off of our YARI-V encoding.

(Non) Initialised Variables

I added a check to spirv-stats to see if any of the OpVariables we were declaring had initialisers - none of them did. So I added a new encoding for OpVariable that has an initialiser, which meant I could skip encoding the word count.

1949.05 kilobytes 40.03%

I then applied the same logic to OpConstant - all of our constants were using one word for the actual constant value (all of our constants were 32 bit integers and floats), so I could split the encoding of an OpConstant holding a 64 bit integer or double out into a separate opcode, allowing me to not output the word count of our OpConstant’s.

1938.48 kilobytes 39.82%

Shaving 29 kilobytes off of our YARI-V encoding.

Access Chains

To get a pointer into a composite (say an array or struct) we use OpAccessChain to work out what we want to load. I added some information to spirv-stats to output the number of indices being used with OpAccessChain. 78% were using one index (say indexing into an array), 19% were using two indices (used if you were indexing into an array of structs), and 2% were using three indices.

I decided to split OpAccessChain into four encodings: one that contains one index, one that contains two indices, one that contains three indices, and one for all other index combinations. I also swapped these new fake opcodes with OpExecutionMode, OpCapability and OpTypeVoid.

1919.76 kilobytes 39.43%

Shaving 19 kilobytes off of our YARI-V encoding.

Everyday I’m Shuffling

OpVectorShuffle takes 3.4% of the opcodes in the SPIR-V shader module, but 6% of the size of the module (it’s a lot of bytes per opcode hit).

The first thing I noticed was that OpVectorShuffle was working on at most two vec4’s (the SPIR-V shaders I’m dealing with are used in Vulkan, where 4 element vectors are the maximum). So I decided to split OpVectorShuffle into four encodings; one that contains two components, one that contains three components, one that contains four components, and one for all other component combinations. I also swapped these with gaps in the SPIR-V opcode range at the 8, 13 & 18 opcode values.

1910.26 kilobytes 39.24%

Only 9 kilobytes shaved, which wasn’t so great. My next observation was that, when shuffling two vec4’s together, the maximum number of states each component literal could be in was 9, in the range [-1..8) - where -1 denotes that we want an undefined result in that component of the vector. I checked, and none of our encodings of OpVectorShuffle were using -1, so given that all of our literals are less than 8, we can use at most 3 bits to encode each literal! I extended the new OpVectorShuffle encodings I had previously made to encode the literals in at most 2 bytes (1 byte for the two literal case, 2 bytes for the three and four literal cases).

1892.43 kilobytes 38.87%

I next checked how many of our OpVectorShuffle’s were actually doing a swizzle - i.e. they were taking the same vector for both vector operands, and were only accessing values from the first vector. A whopping 82% of our OpVectorShuffle’s were doing exactly this, so I added some new fake opcodes for OpVectorSwizzle, using 2 bits to encode each literal (in a swizzle at most 4 elements of a vec4 are being shuffled around, and each index can be encoded in 2 bits).

1874.17 kilobytes 38.50%

Shaving a cool 36 kilobytes off of our YARI-V encoded size.
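As a sketch of the swizzle packing (my own illustrative helper, not YARI-V’s exact code):

```c
#include <stdint.h>

/* A swizzle only ever selects from the four elements of a single vec4, so
   each component index is in [0..4) and fits in 2 bits - four components
   pack into a single byte. */
static uint8_t pack_swizzle(const uint32_t *components, uint32_t count) {
  uint8_t packed = 0;
  for (uint32_t i = 0; i < count; i++) { /* count is at most 4 */
    packed |= (uint8_t)((components[i] & 0x3u) << (2u * i));
  }
  return packed;
}
```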

Swapshop

I noticed that OpBranchConditional and OpSelectionMerge were being used enough that requiring 2 bytes to encode their opcodes was silly, so I swapped these with the OpenCL-specific OpTypeEvent and OpTypeDeviceEvent for a further 6 kilobyte reduction in our YARI-V encoded size:

1868.01 kilobytes 38.37%

Composing

OpCompositeExtract and OpCompositeConstruct take a decent amount of space in the SPIR-V binary with 7% of the bytes dedicated to them.

I first split OpCompositeExtract into two encodings; one that has exactly one literal, and one for all other cases:

1859.28 kilobytes 38.19%

Then I split OpCompositeConstruct into four encodings; one that has one constituent, one that has two constituents, one that has three constituents, and one for all other cases:

1855.46 kilobytes 38.11%

Next, I noticed that OpCompositeExtract was being used mostly to lift a scalar from a vector for some scalar calculation. So I detected when OpCompositeExtract was being used with literals in the range [0..4), and added four encodings of OpCompositeExtract; one that assumes the literal is zero, one that assumes it is one, one that assumes it is two, and one that assumes it is three:

1843.40 kilobytes 37.86%

Relaxing Precisely While Decorating

The most used decoration of OpDecorate was RelaxedPrecision - with 66% of the 23770 uses of the opcode encoding that. So I added a new fake opcode for OpDecorateRelaxedPrecision, allowing me to not actually encode the decoration for RelaxedPrecision and skip the unnecessary byte.

1828.03 kilobytes 37.55%

I then used the same logic on OpMemberDecorate. The most used decoration with OpMemberDecorate was for Offset - accounting for 90% of the 14332 uses of the opcode. I added a new fake opcode for OpMemberDecorateOffset, to skip outputting the decoration in this most used case.

1816.17 kilobytes 37.30%

And with this I was really excited because I’d finally beaten @aras_p‘s SMOL-V (his was taking 1837.88 kilobytes for the shaders).

The Big Plot Twist

One thing I hadn’t been keeping an eye on (showing my newbieness to all things compression) was the compression ratio when passing YARI-V into something like zstd. SMOL-V is primarily a data filtering algorithm - it runs on a SPIR-V shader to create SMOL-V such that the SMOL-V is much more easily compressible than the SPIR-V was. My mistake was thinking of YARI-V solely as a compression format, and not as a filtering algorithm.

When I tested running zstd at level 20 encoding on YARI-V versus SMOL-V, YARI-V was taking 440 kilobytes compressed to SMOL-V’s 348 kilobytes! Even though the encoding of YARI-V was smaller than SMOL-V’s, SMOL-V was clearly filtering the data such that it made the compressor’s life easier.

I had to now work out how to increase repetition in my YARI-V encoding to help out compressors.

Delta Encoding More Things

I had previously only delta encoded the result <id> of my opcodes - but <id>’s are used in the bodies of opcodes too. I looked at our three most used opcodes and started there.

For OpLoad and OpStore, I delta encoded the <id> that they were loading from or storing to. This should result in a 1 byte encoding, as most OpLoad’s and OpStore’s use the result of an OpAccessChain to work out where to load from or store to, and the access chain is usually the instruction immediately before the load or store.

1810.70 kilobytes 37.19%

Unvarinting Things

By using the varint encoding everywhere I was being a little overzealous. For example, if we were declaring an OpConstant that was a floating point value, as long as the floating point constant was not denormal it would always take 5 bytes to encode instead of the 4 bytes it would have taken had we not used varint encoding. So I added some logic to detect constants that had any bits set that would result in a 4 byte or larger varint encoding, and just memcpy’d these into the YARI-V encoding instead.

It turns out that 47% of our constants fitted this pattern, which saved one byte per constant encoded.

1805.71 kilobytes 37.09%
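A sketch of the size check, reusing the varint_encode helper from earlier (how the decoder tells the raw form apart from the varint form is an encoding detail I’m not showing here):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A varint needs 4 or more bytes once any of bits 21..31 are set, which is
   true of the bit pattern of almost any normal float. In that case copying
   the raw 4 byte word is never worse (and usually 1 byte better). */
static size_t encode_constant_word(uint32_t word, uint8_t *out) {
  if (word >= (1u << 21)) {
    memcpy(out, &word, sizeof(word)); /* the raw 4 bytes of the word */
    return sizeof(word);
  }
  return varint_encode(word, out); /* at most 3 bytes in this branch */
}
```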

The Case of the Mistaken Delta Encoded Types

I still wasn’t anywhere near where I needed to be when compressed, and I was having trouble working out why I was still so far away. So I did some analysis of the number of bytes taken up by delta encoding both our <id>’s and types. It turned out that our types were being delta encoded to 1 or 2 bytes, which seemed reasonable at first glance. But looking more closely, I realised that, since types are declared at the beginning of the SPIR-V shader modules, they mostly have low <id>’s assigned to them. In most cases the <id>’s were less than 127 for our types. This meant that instead of delta encoding them - which was giving us roughly a 50/50 split between 1 and 2 byte encodings - if I just used varint encoding without delta encoding, I got nearly 95% of types encoded in 1 byte.

1675.96 kilobytes 34.42%

Shaving a whopping 130 kilobytes off of our resultant YARI-V encoding!

I then noticed that SMOL-V had an option to strip the non-essential instructions from the SPIR-V shader modules. This involves removing debug instructions like OpName, OpMemberName, OpLine, etc. that aren’t required for the SPIR-V to function correctly.

I added in my own option on encoding to handle stripping of the debug instructions, which resulted in:

1498.67 kilobytes 32.41%

Which is a further 176 kilobytes smaller than our non-stripped YARI-V encoding, and 130 kilobytes smaller than SMOL-V’s equivalent stripped encoding.

At this stage I thought I was golden - I’d cracked the puzzle and surely my YARI-V encoding would beat SMOL-V once compressed? Oh how wrong I was!

Results

|                              | Size (kilobytes) | Compression (%) |
|------------------------------|-----------------:|----------------:|
| SPIR-V                       |      4868.468750 |     100.000000% |
| SPIR-V + zstd 20             |       590.573242 |      12.130575% |
| SMOL-V                       |      1837.881836 |      37.750717% |
| SMOL-V + zstd 20             |       386.879883 |       7.946644% |
| YARI-V                       |      1675.956055 |      34.424706% |
| YARI-V + zstd 20             |       390.577148 |       8.022587% |
| SMOL-V (stripped)            |      1629.115234 |      33.462580% |
| SMOL-V (stripped) + zstd 20  |       348.073242 |       7.149542% |
| YARI-V (stripped)            |      1498.666016 |      30.783108% |
| YARI-V (stripped) + zstd 20  |       364.057617 |       7.477867% |

The brass tacks are that SMOL-V, with stripping, and then fed through zstd at level 20, is 16 kilobytes smaller than the equivalent YARI-V (stripped and fed through zstd at level 20), even though YARI-V is 130 kilobytes smaller than SMOL-V when comparing the two encodings directly.

My main suspicion is that my approach of creating new fake opcodes, and thus allowing me to avoid outputting the word count, is probably wrong. Only 9.74% of my opcodes required a word count in the end - but across the 286932 opcodes used in the input SPIR-V shaders this still means that 27956 opcodes were using the more expensive approach of encoding their word count explicitly.

My other main suspicion is that SMOL-V is creating a binary stream that zstd can more easily work out how to compress - more sequences of bits must be the same in SMOL-V as compared to YARI-V. I think my approach of simply trying to compress the input SPIR-V into as concise a form as I could with YARI-V meant I lost sight of the big picture - that this should have been more of a filtering step on the SPIR-V rather than a compression algorithm in its own right.

Future Work

One thing I’d like to look at is whether I could remap the <id>’s in the SPIR-V (guarded by an option) such that we could increase the delta encoding success rate. At present our delta encoded IDs take up:

| bytes per delta encoded <id> | percentage of opcodes |
|-----------------------------:|----------------------:|
|                            1 |            58.180900% |
|                            2 |            27.674441% |
|                            3 |            14.144659% |
|                            4 |             0.000000% |
|                            5 |             0.000000% |

At present, 58% of the time we delta encode we get a value that will fit within 1 byte of our zig-zagged varint encoding, 27% fits within 2 bytes, and 14% within 3 bytes. I think the best place to start would be to try and decrease the number of times a 3 byte encoding is required, and try to remap <id>’s for locality.

A great example of where this would be useful is with OpConstant’s. OpConstant’s are declared early in the SPIR-V shader module and are therefore generally given a low <id>. But they tend to be used in the bodies of the functions, which occur much later on. If an OpConstant was used by an OpFMul, it would be awesome if we could make the <id> of the OpConstant close to the <id> of the OpFMul to increase our chances of a 1 byte delta encoding.

Getting YARI-V

YARI-V is available on GitHub, licensed under the unlicense. I hope the code is useful to someone - even though I fell short of my aim, I very much enjoyed the journey of trying.