Production-Run Software Failure Diagnosis via Hardware Performance Counters @ ASPLOS'13

Posted in Bookmark by Mingxing Zhang

URL: http://dl.acm.org/citation.cfm?id=2451128

This paper presents PBI, a system that uses existing hardware performance counters to diagnose production-run failures caused by sequential and concurrency bugs with low overhead (< 10%).

First, note that this tool diagnoses bugs; it does not detect or prevent them during production runs. That goal lets PBI rely on statistical methods. As a consequence, you must collect enough failing runs before you can diagnose a bug, which usually requires knowing how to trigger it.

Personally, I think the most important observation in this paper is that a wide variety of common software bugs are reflected in a small subset of the hardware events that performance counters support.

  • For concurrency bugs, those events are cache-coherence events (state changes in the MESI protocol). For example, the I-predicate and S-predicate can differentiate failing runs from successful runs for all 4 types of atomicity violations. A more detailed discussion can be found in Sec. 3.1.2 of the paper.
  • For sequential bugs, PBI uses branch-related events, because many semantic bugs are related to wrong control flow.

This paper also proposes a statistical method to identify which events are highly correlated with failure runs.
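To make the idea of ranking events by their correlation with failures concrete, here is a minimal sketch in Python. It is not the paper's exact formula: it assumes a simplified score in the style of statistical debugging, where a predicate's failure rate is compared with the overall failure rate. The `Predicate` type, the `rank_predicates` function, the instruction addresses, and the event labels are all hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set, Tuple


class Predicate(NamedTuple):
    """Hypothetical record: an event observed at a given instruction."""
    pc: int      # address of the instruction that triggered the sampled event
    event: str   # illustrative label, e.g. "I" or "S" (MESI state), "branch-taken"


def rank_predicates(
    runs: Iterable[Tuple[Set[Predicate], bool]],
) -> List[Tuple[Predicate, float, float]]:
    """Rank predicates by how strongly they correlate with failing runs.

    Each run is (set of predicates observed in that run, whether the run failed).
    Returns (predicate, increase, recall) tuples, best candidates first.
    Assumed simplification: 'increase' compares against the overall failure
    rate, not against a per-site context as fuller formulations do.
    """
    runs = list(runs)
    total_failures = sum(1 for _, failed in runs if failed)
    if not runs or total_failures == 0:
        return []  # nothing to diagnose without at least one failing run
    baseline = total_failures / len(runs)

    seen: Counter = Counter()       # runs in which the predicate was observed
    seen_fail: Counter = Counter()  # failing runs in which it was observed
    for predicates, failed in runs:
        for p in predicates:
            seen[p] += 1
            if failed:
                seen_fail[p] += 1

    scored = []
    for p, count in seen.items():
        increase = seen_fail[p] / count - baseline  # > 0: predicts failure
        recall = seen_fail[p] / total_failures      # share of failures covered
        scored.append((p, increase, recall))
    scored.sort(key=lambda s: (s[1], s[2]), reverse=True)
    return scored


# Toy data: `suspect` appears only in failing runs, `benign` appears in both.
suspect = Predicate(pc=0x400A10, event="I")
benign = Predicate(pc=0x400B20, event="S")
runs = [
    ({suspect, benign}, True),
    ({suspect}, True),
    ({benign}, False),
    ({benign}, False),
    ({benign}, True),
]
ranked = rank_predicates(runs)
```

With this toy data the `suspect` predicate is ranked first, since every run that observed it failed. This also shows why enough failing runs are needed: with only one or two failures, many unrelated predicates would get similarly high scores.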