This week I tackled one of the most annoying bugs I've encountered. It was a linker bug, which seems to be the most annoying kind of bug there is.
We have the ability to run PC simulations of our embedded system on recorded input. That means that if we see a bug on some input, we can almost certainly reproduce it consistently, again and again, by simply rerunning the simulation on the same recorded data.
QA reported that for a certain recording the simulation never ended. A quick GDB examination showed that a sort helper function had gotten into an infinite loop.
Looking at the code, everything seemed legit: a vector with 1 item required sorting, and its begin() and end() iterators were passed to a sorting function.
Up to this point everything looked fine. The problem was that the sorting function needed the distance between the two iterators, and that distance came out negative. This is template code, so it was a little complex to debug, but stepping through the C++ lines in GDB everything seemed all right. Except that the distance always came out negative.
The next step I took was looking at the assembly. There I could see that the calculations seemed to be OK. The address of the end iterator was retrieved using the base() function and subtracted accordingly. The result was a value that seemed logical (24). It was then divided by 4 (size of ptr) to get 6, which was then multiplied (imul) by 0xcccccccd, which should have given us 1, the number of elements between begin and end.
However, the result was a negative value rather than 1.
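To see why the arithmetic goes wrong, here is a minimal sketch that replays the instruction sequence described above. It assumes the multiplier is 0xCCCCCCCD, the multiplicative inverse of 5 modulo 2^32, which is what compilers typically emit for an exact division by 20: shift right by 2, then multiply by the inverse of 5. The helper name `distance_as_compiled` is made up for illustration and is not taken from the STL or from our code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: replays "divide by 4, then imul by 0xCCCCCCCD" on a
// byte distance and returns what a signed 32-bit register would then hold.
std::int64_t distance_as_compiled(std::int32_t byte_distance) {
    assert(byte_distance >= 0);  // this sketch only covers non-negative distances

    // Divide by 4 (a shift right by 2): 24 -> 6, 20 -> 5.
    std::uint32_t quarter = static_cast<std::uint32_t>(byte_distance) >> 2;

    // Multiply by the inverse of 5; unsigned math wraps modulo 2^32 like imul.
    std::uint32_t product = quarter * 0xCCCCCCCDu;

    // Reinterpret the 32-bit pattern as a signed value without relying on
    // implementation-defined conversions.
    if (product >= 0x80000000u) {
        return static_cast<std::int64_t>(product) - 0x100000000LL;
    }
    return static_cast<std::int64_t>(product);
}
```

With a byte distance of 20 the function returns 1, because 5 * 0xCCCCCCCD wraps around to exactly 1. With a byte distance of 24 it returns -858993458 (the bit pattern 0xCCCCCCCE): the trick is only valid when the distance is an exact multiple of 20, so the code doing the subtraction and the code that laid out the elements disagreed about the element size.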
Now, this was STL code, so it didn't make sense that the issue would be there. Yet everything up to that point seemed to be OK. My friend remarked, "When you are looking at assembly code, you've probably gone too far."
We were at it for a few hours but had no idea what was wrong. Finally my friend noticed that even though the name of the struct used for the vector was correct, it was not located in the correct path. This quickly led us to find that the same struct had been copied to another file in the same namespace, but in that file it contained an extra field. So the correct size of the struct was actually 20, and the linker had decided to link to that other struct instead of the correct one.
Changing the name of one of the structs solved the issue and put it to rest.
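Renaming is what we did. Another common way to avoid this kind of collision, which we did not use here, is to put file-local types in an unnamed namespace so they get internal linkage and the template code instantiated for them cannot be merged with a same-named type from another file. A minimal sketch, where `Record`, `by_key` and `sort_records` are hypothetical names and not the ones from our code base:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace {  // unnamed namespace: everything inside is private to this file

// Hypothetical stand-in for the struct from the story (five 32-bit fields).
struct Record {
    std::int32_t key;
    std::int32_t payload[4];
};

// Comparison used for sorting, also local to this file.
bool by_key(const Record& a, const Record& b) { return a.key < b.key; }

// Sorts the records in place and returns how many there are.
std::size_t sort_records(std::vector<Record>& records) {
    std::sort(records.begin(), records.end(), by_key);
    assert(std::is_sorted(records.begin(), records.end(), by_key));
    return records.size();
}

}  // namespace
```

Another file can define its own `Record` with a different layout inside its own unnamed namespace, and the two never meet at link time. This only works for types that really are private to a single source file; a struct shared through a header still needs one definition and one name.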
This issue reminded me of the many, many possible issues you can have with linkers, and of the interesting posts I've written about here.