Saturday, September 17, 2022

The problem of repeated crashes

Postmortem app crash reporting and analysis is a bit of a hobbyhorse for me. I see it as an extension of the bug reporting facility - a crash in production (typically, but not always) indicates a bug. But does every crash indicate a separate, distinct bug?

When I say "crash" I mean "a condition that would normally cause a program to terminate, unless special care has been taken to prevent immediate termination". So a crash condition could be:

  • A *nix signal (modulo SIGUSRx)
  • An uncaught Windows structured exception (SEH)
  • An uncaught Java/.NET/Python/JavaScript exception
  • A run-time error (e.g. a failed array bounds check)

So imagine a crash has occurred, your faithful crash reporting subsystem has reported it, and you have a report. At the very minimum, the report contains a location indicator (a crash address, or a source file/line if you are lucky) and the nature of the crash: the numeric signal code if it was a signal, the exception class if it was an exception. Ideally, a stack trace would also be there, along with some local state.
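To make the rest of the discussion concrete, here is a minimal sketch of such a report in Python. The field names and the example values below are my own invention - the real shape varies by platform and by reporting library.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CrashReport:
    # The nature of the crash: a signal name, an SEH code,
    # or an exception class, depending on the platform.
    condition: str
    # Location indicator: a raw address at minimum,
    # source file/line if symbol information is available.
    address: int = 0
    source_file: Optional[str] = None
    source_line: Optional[int] = None
    # Nice to have: the call stack (innermost frame first) and the app version.
    stack: List[str] = field(default_factory=list)
    app_version: str = ""

# A hypothetical incoming report.
report = CrashReport(
    condition="NullReferenceException",
    source_file="Orders.cs",
    source_line=120,
    stack=["foo", "bar", "happy"],
    app_version="2.0",
)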

On the receiving end, there is your bug tracker. And the bug tracker has to answer a question - is this a new bug, or another instance of an existing crash-causing bug? It's a surprisingly involved question.

For one thing, matching file/line is not a guarantee of anything. To use a trivial example, an

x = MyArray[MyIndex];

line may cause a "null reference" exception if MyArray is not initialized, or an "array out of bounds" exception if the value of MyIndex is off; and those two conditions could easily have different root causes.
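A first-order mitigation is to make the crash condition part of the grouping key, so that the null reference and the out-of-bounds crash land in different buckets despite sharing a file and a line. A naive sketch, reusing the CrashReport shape from above (the key itself is my own arbitrary choice):

def naive_group_key(report: CrashReport) -> tuple:
    # File/line alone would lump the two exceptions from the example together;
    # including the crash condition keeps them apart.
    return (report.source_file, report.source_line, report.condition)

null_ref = CrashReport(condition="NullReferenceException",
                       source_file="Orders.cs", source_line=120)
out_of_bounds = CrashReport(condition="IndexOutOfRangeException",
                            source_file="Orders.cs", source_line=120)
assert naive_group_key(null_ref) != naive_group_key(out_of_bounds)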

On the other hand, once we start dealing with several versions of the app coexisting on different production devices, the same bug could cause crashes at different locations in different versions while sharing the same root cause. Trivially, if lines were added to the file that contains the crashing function, somewhere above it, between version 2 and version 3, the line number in the crash report would shift accordingly. A heuristic to catch this would be matching line numbers relative not to the top of the file but to the top of the crashing function - though only as long as the function source itself is otherwise the same between versions.
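A sketch of that heuristic, still reusing the CrashReport shape, and assuming the analyzer can look up the enclosing function and its starting line in the debug symbols of the reported app_version - how exactly depends entirely on the toolchain:

def function_relative_key(report: CrashReport,
                          function_name: str,
                          function_start_line: int) -> tuple:
    # Express the crash line as an offset from the top of the crashing
    # function, so that edits elsewhere in the file between versions do not
    # split one bug into several buckets. Only valid while the function
    # body itself stays the same between the versions being compared.
    offset = None
    if report.source_line is not None:
        offset = report.source_line - function_start_line
    return (report.source_file, function_name, offset, report.condition)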

Still, even matching location/condition is sometimes not sufficient. What if a function crashes on an unexpectedly null argument, but said argument can come from multiple sources and therefore different root causes? You don't want to match the whole call stack, especially considering that the root-cause error could be halfway through the stack and the far end of it might legitimately differ. In other words, the following is a possible situation (crashing frame on top):

foo() - CRASH!        foo() - CRASH!
happy() (error A)     lucky() (error B)

But so is this:

foo() - CRASH!        foo() - CRASH!
bar() (error A)       bar() (error A)
happy()               lucky()

In the former case, it's two different errors with the same crash location/type; in the latter, it's the same error manifesting downstack. I don't have a good answer for how a crash analysis system can tell these two cases apart, short of human triage.
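A common compromise is to match only a prefix of the stack - the crashing frame plus a few of its callers - but the two situations above show why any fixed depth is a gamble. A sketch, using the frame lists from the diagrams (innermost frame first):

def stack_prefix_key(stack, depth):
    # Compare only the innermost `depth` frames of the call stack.
    return tuple(stack[:depth])

# First situation: two distinct errors, same crash location and type.
first_a = ["foo", "happy"]            # error A
first_b = ["foo", "lucky"]            # error B

# Second situation: the same error in bar(), surfacing downstack in foo().
second_a = ["foo", "bar", "happy"]    # error A
second_b = ["foo", "bar", "lucky"]    # error A again

# Too shallow: the two distinct errors of the first situation get lumped.
assert stack_prefix_key(first_a, 1) == stack_prefix_key(first_b, 1)
# Too deep: the single error of the second situation gets split in two.
assert stack_prefix_key(second_a, 3) != stack_prefix_key(second_b, 3)
# Depth 2 happens to get both cases right here, but only because the faulty
# frame sits exactly one call above the crash; in general that distance varies.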

Ideally, inconsistent state should not be able to cross function call boundaries. The mandatory null-marking of modern C#, for example, is a step in that direction, as are type hints. But consistency is such a squishy concept. Null/not-null is the simplest invariant of them all, but there are so many subtle ways to introduce state that turns out to be invalid later. On top of everything else, consistency checks are not free of runtime cost. In desktop/server computing the cost might be negligible, but mobile and, especially, embedded developers still retain the habit of counting bytes and cycles.
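To illustrate the trade-off on that simplest invariant: a guard at the function boundary moves the crash to the point where the bad state crosses over, so the report names the guilty caller rather than some frame further downstack - at the price of one extra check per call. A Python sketch with a hypothetical function and argument:

from typing import Optional

def render_order(order: Optional[dict]) -> str:
    # Fail fast at the boundary: inconsistent state is rejected on entry,
    # so the eventual crash report points at the caller that passed it in,
    # not at whatever code trips over it several frames later.
    if order is None:
        raise ValueError("render_order() called with order=None")
    return "Order #{}".format(order["id"])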

With all that in mind, the algorithm of a crash report analyzer needs to strike a balance between lumping too eagerly and splitting too eagerly. It's the well-known lumper-splitter problem, all over again.
