;//////////////////////////////////////////////////////////////////////////////// ;// THE SCOTCH-WARE LICENSE (Revision 0): ;// <aaronryool/gmail.com> wrote this file. As long as you retain this notice you ;// can do whatever you want with this stuff. If we meet some day, and you think ;// this stuff is worth it, you can buy me a shot of scotch in return ;//////////////////////////////////////////////////////////////////////////////// ; George Marsaglia's xorshift PRNG ; t = r10 prng_state: w: dq 99999 x: dq 100 y: dq 35 z: dq 27365 prng_seed: rdtsc mov qword [x], rax ret prng: ; t = x mov r10, qword [x] ; t ^= t << 11 mov rax, r10 shl rax, 11 xor r10, rax ; t ^= t >> 8 mov rax, r10 shr rax, 8 xor r10, rax ; x = y, y = z, z = w push qword [w] push qword [y] push qword [z] pop qword [y] pop qword [x] pop qword [z] ; w ^= w >> 19 mov rax, [w] shr rax, 19 xor qword [w], rax ; w ^= t xor qword [w], r10 ; return w mov rax, qword [w] ret

## Sunday, June 26, 2016

### George Marsaglia's xorshift PRNG lightweight implementation

Subscribe to:
Post Comments
(
Atom
)

With a small bit of tweaking, the performance can be more than doubled.

ReplyDeleteFirst, switching to RIP-relative addressing eliminates 6 instruction bytes and shaves off 18% of the running time. You can do this for the whole source with `default rel` or on individual stores/loads, like `[rel w]`.

Second, prng() makes 17 memory accesses when only 8 are necessary (4 loads, 4 stores). Registers are fast and should be exploited as much as possible. Load each of w, x, y, z exactly once, operating on them purely as registers, then store the results back. With careful register selection (allowing the final value of w to naturally arrive in rax), this shaves off 6 more instruction bytes and brings the run time down to just 40% of the original.

There are probably further gains to be made with thoughtful instruction scheduling, but it's probably negligible since, for a function this small, a modern x86 will do that on its own so long as there are no false dependencies.

As a final note, prng_seed() does a *very* poor job at seeding the PRNG. The rdtsc instruction only writes eax, so the upper 32 bits of x are left zeroed. You could fill these from edx, but there's a more fundamental problem: one of the notable weaknesses of xorshift is the lack of avalanche effect. You can see this for yourself. Run prng_seed() and print out the full, 16-nibble hexidecimal outputs of prng(). Notice how much of it is zeroed, and how many thousands of iterations it takes until it finally appears to be random.

Xorshift needs to be seeded thoroughly to be effective. Leaving multiple bytes of the state zeroed/uninitialized will give poor results. The solution, if you want to continue seeding from rdtsc, is to hash the seed with something like splitmix64 and use it seed all of xorshift's state from the output.