Monday October 23rd, 12-1PM @ BA5205
Speaker: Yongle Zhang
Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach
Complex and unforeseen failures in distributed systems must be diagnosed and replicated in a development environment so that developers can understand the underlying problem and verify the resolution. System logs often form the only source of diagnostic information, and developers reconstruct a failure using manual guesswork. This is an unpredictable and time-consuming process which can lead to costly service outages while a failure is repaired.
This paper describes Pensieve, a tool capable of reconstructing near-minimal failure reproduction steps from log files and system bytecode, without human involvement. Unlike existing solutions that use symbolic execution to search for the entire path leading to the failure, Pensieve is based on the Partial Trace Observation, which states that programmers do not simulate the entire execution to understand the failure, but follow a combination of control and data dependencies to reconstruct a simplified trace that only contains events that are likely to be relevant to the failure. Pensieve follows a set of carefully designed rules to infer a chain of causally dependent events leading to the failure symptom while aggressively skipping unrelated code paths to avoid the path-explosion overheads of symbolic execution models.
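To give a rough feel for the event chaining idea, the sketch below walks backward from a failure symptom over a hand-written dependency map and keeps only the events the symptom causally depends on. It is an illustration under stated assumptions, not Pensieve's implementation: the real tool derives dependencies from log files and bytecode, whereas the event names, the DEPENDENCIES table, and the chain_events function here are all hypothetical.

```python
# Hypothetical toy model: each event maps to the events it depends on
# (standing in for the control and data dependencies a tool would infer).
DEPENDENCIES = {
    "log: write failed": ["datanode marked dead", "client writes block"],
    "datanode marked dead": ["heartbeat missed"],
    "heartbeat missed": ["datanode shut down"],
    "client writes block": [],
    "datanode shut down": [],
    "balancer runs": [],  # happened in the system, but unrelated to the symptom
}


def chain_events(symptom, dependencies):
    """Return the events the symptom depends on, causes before effects.

    Only events reachable backward from the symptom are visited, so
    unrelated events are never explored.
    """
    visited = set()
    ordered = []

    def visit(event):
        if event in visited:
            return
        visited.add(event)
        for cause in dependencies.get(event, []):
            visit(cause)
        # Append after all causes, so each event follows what it depends on.
        ordered.append(event)

    visit(symptom)
    return ordered


if __name__ == "__main__":
    for step in chain_events("log: write failed", DEPENDENCIES):
        print(step)
```

In this toy run the "balancer runs" event never appears in the result, which mirrors the point of the Partial Trace Observation: the reconstructed trace contains only events relevant to the failure, not the whole execution.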
Yongle Zhang is a PhD student at the University of Toronto, working on computer systems and reliability with Prof. Ding Yuan. His current research focuses on system profiling and failure diagnosis.