Fun with ARM barriers and GHC RTS

While reviewing part of ARM support code in RTS I’ve found out that there are some barriers which are not implemented for ARM yet. This leads me to investigation if they are really needed and I’ve found nice little rts/testwsdeque testcase which fails. The testcase tests WSDeque which is basically lock-less deque implementation for GHC RTS. So as the test fails something is badly wrong with this.
I’ve decided to implement missing barriers and found very useful reference to The JSR-133 Cookbook for Compiler Writers in include/stg/SMP.h header file. The same header file where all the barriers are implemented. The document contains nice table listing various kinds of barriers together with instructions used to implement them on various CPU architectures. ARMv7 was among them. Doug Lea did really nice work in writing it. The isn recommended to use was dmb and I already know this isn from various ARM documentation. ARM in fact provides two isns for implementing barriers: dmb and dsb. I’ve not been 100% sure which to use and so Doug’s document was really useful for me.
Anyway, even after this, rts/testwsdeque still failed. Let’s start searching again. This time I’ve found really nice although quite complex Barrier Litmus Tests and Cookbook which on a few examples recommends some best practice when and how to use barrier instruction in solving common programming problems (spin-locks etc.). I learn that although LDREX/STREXT isns provides kind of synchronization primitives they do not enforce any barrier and so I’ve also added dmb isn into GHC’s xchg and cas functions.
Let’s rerun the test and it still fails sometime. I’ve used simple script to run it in the loop and see if it fails:

while (true); do ./testwsdeque; echo -n .; done

Example of wrong output is:

........internal error: FAIL: 6706788 3 13
    (GHC version 7.1.20110701 for arm_unknown_linux)
    Please report this as a GHC bug:
................internal error: FAIL: 5463172 1 12
    (GHC version 7.1.20110701 for arm_unknown_linux)
    Please report this as a GHC bug:
...................internal error: FAIL: 6496304 1 11
    (GHC version 7.1.20110701 for arm_unknown_linux)
    Please report this as a GHC bug:
.........internal error: FAIL: 6192568 3 13
    (GHC version 7.1.20110701 for arm_unknown_linux)
    Please report this as a GHC bug:

So testcase passes several times for one failure, but still fails.

What now?
I’ve looked into testcase, printed it. I’ve also found appropriate rts/WSDeque.[c|h] sources and printed them too.
Side note: I don’t have several monitors setup here, I’m just using single 23″ LG W2220P in portrait mode but the viewing surface is still small for such manual “debugging”. So I usually print all the relevant code, lay it either on desk or even on floor and then read the code step by step and think about it.
So I ended with printed relevant source files and half of hour later I’ve been more and more convinced that my ARMv7 specific barriers and using of barrier in xchg/cas functions all is right and that the issue really might be in RTS work-stealing deque implementation. I have some feeling leading to it… Well, you know, GHC team is usually working on x86/x64 boxes. Some of the team members are on MacOSX/x64 and some of them are even using Niagara, i.e. former Sun’s UltraSPARC Tx processors and Solaris. Both hardware platforms are quite nice when it comes to load/store reordering. On the other hand I’ve found this nice note on a blog post dealing with barriers in Linux kernel on ARM:

Since the supported architecture with the weakest memory model (effectively the one that permits the most reordering) was the DEC Alpha, this was used as the reference architecture. No other architectures have since surpassed the DEC Alpha in this regard, but ARMv7-A comes pretty close.

And my idea which comes from this was simple, if Alpha was the weakest and if ARMv7 is pretty close, then perhaps ARMv7 is more weak (ie. permit more memory access reordering) than usually tested x86/x64 or UltraSPARC and then some bug might really slipped into RTS’ deque implementation. Deque code itself was written in 2008-2009. I was thinking that it was really a low chance that it was tested on Alpha even if some of GHC still contains Alpha code (which looks quite dead now (both CPU and GHC support for it I mean)). So the idea of a bug in deque implementation looked more and more real and I’ve been quite curious if I find it or not. Well, some time later I got to it! 🙂 pushWSDeque function which pushes specified data to the deque for consumption by stealing threads contained following code:

pushWSDeque (WSDeque* q, void * elem)
    StgWord t;
    StgWord b;
    StgWord sz = q->moduloSize; 
    b = q->bottom;
    q->elements[b & sz] = elem;
    q->bottom = b + 1;

I’ve deleted the code which is not important for the bug explanation. The bug happens on those two lines, or I shall rather tell between them!

    q->elements[b & sz] = elem;
    q->bottom = b + 1;

What you may see is assignment of elem into deque and then incrementing deque’s bottom variable to let stealing threads know, there are some new data in deque. As I learn during the bug hunting, I cannot be sure at all that the sequence will look like this. In fact it might be very well reversed by modern CPUs to:

    q->bottom = b + 1;
    q->elements[b & sz] = elem;

which if this happen would mean that: if (1) there are no data in deque and if (2) we do have some eager stealing thread waiting for new data (or polling for new data) and if (3) the sequence is reordered like above then just between the execution of the two lines stealing thread might got to its run and think hey, there are some data in the deque, let’s consume it and then it’ll got some random data since intended data are not yet assigned to the deque. And that’s all since the second line of the code above has not been executed yet. So stealing thread gets something which it should not.
Solution is quite simple, modify the code sequence to:

    q->elements[b & sz] = elem;
    q->bottom = b + 1;

The write_barrier(); which is effectively translated to dmb isn on ARMv7 enforces actual assignment of elem to really happen (and not only this, but all other pending assignments/writes before the isn execution) before the CPU comes to execute code incrementing deque’s bottom variable.
Does it solved the issue? I hope so, the while loop of testwsdeque testing was running several hundreds times without any failure. I’m running nearly full GHC testsuite now to see the results, but this will take another few hours anyway, so I’ll need to wait and see if I broke something or not. But anyway there is at least some chance that this was really the bugfix. And if so, then I’m going to push the patch upstream of course…

Conclusion: ARM is nicely RISCy and I learn some new stuff about barriers. I’ve known about them, but I’ve not had a chance to touch the stuff till now although I’m already quite some time from the college…


4 thoughts on “Fun with ARM barriers and GHC RTS

  1. Awesome work – thank you!

    It sounds like you have the barriers completely under control but, just in case, here’s an informal sketch of the difference between the DSB and the DMB:
    – DMB forces a memory operation to occur in a particular order with respect to another memory operation
    – DSB forces a memory operation to occur in a particular order with respect to any effects that an instruction might have.
    So you use DMB in situations like the above where one memory location contains control information about another memory location.
    And you use DSB in situations where something other than a memory location is involved: flushing the cache, changing page tables, etc.

    [There is also ISB (Instruction Synchronization Barrier) – used when you’re messing with the instruction cache (e.g., JITing code in the FFI support code).]

    • Hello Alastair,

      thanks a lot for your informal explanation of DMB/DSB/ISB. I also understand them in the same way. I’m just not sure what you mean by JITing in the FFI support code. Is it anything already contained in GHC which we shall also review or fix? On the other hand for example ISB we will need to use in rts/Linker.c (for GHCi) during the relocations mapping… (you set jumps and so you need to flush whole instruction stream…)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s