CloudRaid: Hunting Concurrency Bugs in the Cloud via Log-Mining (ESEC/FSE 2018 - Research Papers)

Sun 4 - Fri 9 November 2018 Lake Buena Vista, Florida, United States

Who

Jie Lu, Feng Li, Lian Li, Xiaobing Feng

Track

ESEC/FSE 2018 Research Papers

When

Tue 6 Nov 2018 10:30 - 10:52 at Horizons 10-11 - Concurrency and Races. Chair(s): Willem Visser

Abstract

Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outages. This paper presents CloudRaid, a new and effective tool for battling distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, extensively exercising many permutations of message orderings. Message orderings that have already been sufficiently tested are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover message orderings that are feasible but not sufficiently tested. Specifically, CloudRaid tries to flip the order of a pair of messages $\langle S, P \rangle$ if they may happen in parallel but $S$ always arrives before $P$ in existing logs, i.e., it exercises the order $P \rightarrowtail S$. This log-based approach makes CloudRaid suitable for live systems.
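The pair-selection idea described above can be illustrated with a small sketch. This is not CloudRaid's implementation. It assumes each logged run has already been reduced to an ordered list of message names, and that some separate analysis has supplied the set of pairs that may happen in parallel. The function name `candidate_flips`, the message names, and the simplification of using only the first occurrence of each message per run are all hypothetical.

```python
from itertools import combinations


def candidate_flips(runs, may_be_parallel):
    """Return pairs (S, P) that may run in parallel but were only ever
    logged in the order S-before-P, so the order P-before-S is untested.

    runs: list of runs, each an ordered list of message names (assumed input).
    may_be_parallel: set of frozenset({a, b}) pairs (assumed to come from
    a separate analysis that is not modeled here).
    """
    seen_orders = set()
    for run in runs:
        # Keep only the first occurrence of each message in this run.
        # Dicts preserve insertion order, so keys are in arrival order.
        first_pos = {}
        for index, msg in enumerate(run):
            first_pos.setdefault(msg, index)
        # combinations() yields (a, b) with a arriving before b.
        for a, b in combinations(first_pos, 2):
            seen_orders.add((a, b))

    candidates = []
    for s, p in sorted(seen_orders):
        never_reversed = (p, s) not in seen_orders
        if never_reversed and frozenset((s, p)) in may_be_parallel:
            candidates.append((s, p))
    return candidates


# Hypothetical message names, for illustration only.
runs = [
    ["register", "launch", "heartbeat"],
    ["register", "heartbeat", "launch"],
]
may_be_parallel = {
    frozenset(("register", "launch")),
    frozenset(("launch", "heartbeat")),
}
print(candidate_flips(runs, may_be_parallel))
```

In this toy input, `launch` and `heartbeat` were already logged in both orders, so they are skipped. `register` always precedes `launch` and the two may run in parallel, so that pair is the only one selected for flipping.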

We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS and Cassandra. CloudRaid can automatically test 40 different versions of the 4 systems (10 versions per system) in 35 hours, and can successfully trigger 28 concurrency bugs, including 8 new bugs that have never been found before. The 8 new bugs have all been confirmed by their original developers, and 3 of them are considered critical and have already been fixed.

Jie Lu

Feng Li

Institute of Computing Technology at Chinese Academy of Sciences, China

Lian Li

Institute of Computing Technology at Chinese Academy of Sciences, China

Xiaobing Feng

ICT CAS