There's no performance improvement for current 64-bit capable Intel processors with this version. Those processors were already happy running native x86-64 code, achieving the same level of performance as they do with SSE2 code. There is, however, one advantage to using this version on Intel processors as well: performance should be more stable, regardless of C compiler version and options.
Some technical detail for the curious:
Yes, the new 64-bit mode SSE2 code is also different. The DES S-box implementations are not derived from the 32-bit mode MMX/SSE2 ones; rather, they have been generated anew from Matthew Kwan's optimized S-box expressions, making use of all 16 XMM registers. A Perl script was written and used to compile the expressions into a virtual 3-operand architecture, convert that to 2-operand form and allocate virtual registers, allocate real registers and spill whatever didn't fit into registers to memory locations, and do some instruction scheduling based on published SSE2 instruction latencies for the Pentium 4 (which are larger than those for current AMD64 processors, so the schedules should be good for both camps). Then the script's output required some manual work to resolve the few occurrences of unsupported operand combinations. Finally, the individual S-boxes were subjected to "brute-force instruction scheduling", which tested thousands of different ways in which instructions could be re-ordered, measuring the actual execution times in clock cycles (yes, code was actually being run on the CPU multiple times for each of the thousands of code versions being tested).
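To give an idea of the 2-operand conversion step: the actual tool is a Perl script and is not shown here, but a hypothetical Python sketch (made-up instruction tuples, not the script's real representation) of lowering a virtual 3-operand instruction into an x86-style move plus destructive 2-operand instruction could look like this:

```python
# Lower "dst = src1 OP src2" (virtual 3-operand form) to x86-style
# 2-operand code: "movdqa dst, src1; OP dst, src2". The move is elided
# when dst already aliases src1, or (for commutative ops) src2.
# Instruction names are real SSE2 mnemonics; the representation is made up.

COMMUTATIVE = {"pxor", "pand", "por"}

def lower(insn):
    dst, op, src1, src2 = insn
    if dst == src1:
        return [(op, dst, src2)]
    if dst == src2 and op in COMMUTATIVE:
        return [(op, dst, src1)]
    # The register allocator must not assign dst == src2 for a
    # non-commutative op here, or the move would clobber src2.
    assert dst != src2
    return [("movdqa", dst, src1), (op, dst, src2)]
```

The commutativity check is one place where "unsupported operand combinations" can arise: a non-commutative op such as pandn with dst aliasing src2 cannot be lowered this way and needs different register assignment or a scratch register.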
Now, you would expect excellent performance from the above, wouldn't you? Well, it's not so great in practice. The 8-register MMX code from which the non-64-bit-mode SSE2 code was derived is excellent, making it very hard to do better, even with more registers. Yes, the availability of 16 registers helps save some moves and reduces the instruction count in the S-boxes by 10%, but it turns out that accessing the added 8 XMM registers may be slower than accessing memory (actually, L1 cache) in some cases. I think this has to do with limitations of the instruction decoder. For example, blindly converting the "old" 8-register SSE2 code to use the extra 8 registers instead of temporaries in memory results in an 8% slowdown on an Athlon 64, but makes no difference on a Xeon, which is consistent with the "decoder theory". Brute-force re-scheduling that code does win back some of the lost 8%, but it does not restore the original 8-register code's performance.
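For the curious, here is roughly what the brute-force scheduling amounts to, as a hypothetical Python sketch. The real harness timed actual runs on the CPU in clock cycles; here a toy cost function stands in, merely penalizing a result being consumed by the immediately following instruction (no latency-hiding gap):

```python
from itertools import permutations

def schedules(insns, deps):
    """Yield every ordering of insns (as index tuples) that respects
    the data dependencies in deps, given as (producer, consumer) pairs.
    Exhaustive, so only feasible for small blocks."""
    for order in permutations(range(len(insns))):
        pos = {idx: p for p, idx in enumerate(order)}
        if all(pos[i] < pos[j] for i, j in deps):
            yield order

def cost(order, deps):
    """Toy stand-in for measured cycle counts: count dependent pairs
    scheduled back to back."""
    pos = {idx: p for p, idx in enumerate(order)}
    return sum(1 for i, j in deps if pos[j] == pos[i] + 1)

def best_schedule(insns, deps):
    return min(schedules(insns, deps), key=lambda o: cost(o, deps))
```

For example, with two independent dependency chains, the best schedule interleaves them so that no instruction waits on its immediate predecessor.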
So for the time being, I am satisfied with the brand new 16-register code being about as fast as the old 8-register code. In future versions, I might further improve the Perl script to do smarter register allocation, taking into account the fact that consecutive instructions which access registers XMM8-XMM15 have a higher cost on AMD64 processors. (In my testing, the slowdown is seen with 4+ consecutive instructions like that, but in real-world code there are more constraints.) I might also generate some hybrid native x86-64 + SSE2 code, which might make this issue irrelevant... or it might not. :-)
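A smarter allocator along those lines would first have to spot the problematic pattern. A hypothetical Python sketch (the 4-instruction threshold is taken from the testing mentioned above; the instruction representation is made up) that flags runs of consecutive instructions all touching XMM8-XMM15:

```python
# Flag runs of min_len or more consecutive instructions that each access
# at least one of the "high" registers XMM8-XMM15 - the pattern observed
# to carry extra cost on AMD64. A register allocator could try to break
# such runs up by remapping some operands to XMM0-XMM7 or to memory.

HIGH = {f"xmm{n}" for n in range(8, 16)}

def high_reg_runs(insns, min_len=4):
    """insns: list of operand tuples; return (start, length) of long runs."""
    runs, start = [], None
    for i, ops in enumerate(insns + [()]):  # empty sentinel flushes the last run
        if any(r in HIGH for r in ops):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    return runs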