Hello Chris,
Your branchless UTF-8 decoder expects a buffer whose size is divisible by 4, which makes the API awkward for the user. Could you not have the user supply the size of the buffer, and then keep a local 4-byte buffer into which you move the bytes from the user-supplied buffer, filling in 0 wherever the input runs out? The move should be branchless, of course; you just have to bully the compiler into turning a ternary into a cmov. This way the API is nice and the code is still branchless.
Correction: you do not need to bully the compiler nearly as much as I thought. You can just make a zeroed buffer and then memcpy the length of the string (or of the buffer, whichever is smaller) into it. The memcpy can usually be optimized into a single instruction, and the comparison is trivially turned into a cmov on all compilers. What do you think?
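Roughly what I have in mind, as a minimal sketch in C. The helper name load4_padded is made up for illustration and is not part of your decoder. Whether the min really becomes a cmov and the memcpy gets inlined depends on the compiler and flags, so treat that as something to check in the generated assembly rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original decoder): fill a 4-byte
 * window with up to 4 bytes from src and zero the rest, so a decoder
 * that always reads 4 bytes never touches memory past the input. */
static void load4_padded(unsigned char out[4], const unsigned char *src, size_t len)
{
    assert(src != NULL);            /* precondition; compiled out with NDEBUG */

    /* min(len, 4): compilers commonly lower this ternary to a
     * conditional move, but that is an assumption, not a guarantee. */
    size_t n = len < 4 ? len : 4;

    memset(out, 0, 4);              /* zeroed local buffer */
    memcpy(out, src, n);            /* copy only the bytes that exist */
}
```

The caller would pass the number of bytes remaining in its input as len, then hand the 4-byte window to the existing decoder in place of the raw pointer.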