CloudRaid: Hunting Concurrency Bugs in the Cloud via Log-Mining
Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outage. This paper presents CloudRaid, a new effective tool to battle distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems, by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, exercising many permutations of message orderings extensively. Those already sufficiently-tested message orderings are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover those message orderings which are feasible, but not sufficiently tested. Specifically, CloudRaid tries to flip the order of a pair of messages $<S,P>$ if they may happen in parallel, but $S$ always arrives before $P$ from existing logs, i.e., excercising the order $P \rightarrowtail S$. The log-based approach makes it suitable to live systems.
We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS and Cassandra. CloudRaid can automatically test 40 different versions of the 4 systems (10 versions per system) in 35 hours, and can successfully trigger 28 concurrency bugs, including 8 new bugs that have never been found before. The 8 new bugs have all been confirmed by their original developers, and 3 of them are considered as critical bugs that have already been fixed.
Tue 6 Nov
|10:30 - 10:52|
|10:52 - 11:15|
|11:15 - 11:37|
|11:37 - 12:00|