More on interrupts and atomic operations

In my previous post I wrote about interrupts and atomic operations in embedded systems. After getting bitten myself with just such an issue I thought I would write about the details of what I just experienced on my current project.

First, the background of the issue; the device I am working on connects to a wireless access point to provide an interface to some PC-based software. The issue was that when the PC software started up sometimes the device would appear to drop messages or stop responding to messages. The issue was hard to track down because if the device was run under debugger control the messages appeared to be received just fine. This led me to suspect the code’s logic was being corrupted by interrupts disrupting the management of messages out of the receive buffers.

When a message is received from the network it is placed into a circular buffer, the index of the next free location is updated, and the total number of characters in the buffer is updated. This all happens in the response to an interrupt from the wireless module. Since there was no related interrupt that could be disrupting this I turned my attention to the background task that unpacks the data in the circular buffer into packets for processing. However, what I saw there was my buffer manipulation being protected by the wireless module interrupt being turned off, just as I wrote about previously.

This deepened the mystery since it appeared I was following good practice and keeping the buffer management code atomic, at least to the extent of protecting it to further interrupts from the wireless module, the only other code that manipulated the buffer variables.

The next step was to instrument the two methods that enabled and disable the wireless module interrupt. I did this to make sure that each disable call was matched by an enable call. The results of this test revealed that there were more enable operations than disable operations. After some thinking I realized that the enable and disable methods themselves were not atomic, therefore it was possible for an interrupt to arrive during the execution of the disable call, but before the interrupt had been disabled. What was needed was to track the enable and disable calls and only re-enable them when all callers had requested the re-enable.

To do this, a counter was incremented when interrupt enable was requested and decremented when disable was requested. However, the disable action would only be taken if the counter decremented to zero.

This update appears to have resolved my lost message and non-received message issues.

So, if your system has what appear to be random failures, closely examine the data flow through it and the effects that interrupts may have on that flow. Plus, double check that the code being used to manage the interrupts themselves is clean and does not make assumptions like mine did. Just because a caller requests an interrupt be enabled the code needs to be ensure the action is only taken when each caller that requested a disable has also requested an enable.