I recently graduated from the UC Berkeley Ph.D. program, where I was a member of the NetSys Lab and was advised by Sylvia Ratnasamy. My Ph.D. thesis work focused on understanding performance in data analytics frameworks.
I am also a committer and PMC member for Apache Spark. My work on Spark has focused on improving scheduler performance, and I currently help maintain and review pull requests for the scheduler code. I have worked on two high-throughput schedulers for Spark, Sparrow and Drizzle, and I have also written and talked about how users can better understand the performance of their Spark workloads.
The first component of my thesis work focused on characterizing the performance of large-scale data analytics frameworks like Spark. As part of that project, I added instrumentation to Spark to measure how much time is spent doing network and disk I/O. Most of that instrumentation is now part of Spark, and can be visualized in the Spark UI by clicking the "Event Timeline" link on the stage detail page. More information about that project is available here; that page includes links to some detailed traces we collected.
One takeaway from my work measuring performance in current systems is that today's systems make it difficult to reason about performance. In Spark, for example, pervasive pipelining and parallelism make it difficult (even with extensive instrumentation and metrics) for users to model performance and understand how changing the software or hardware configuration would impact performance. Today's users have many choices in how to configure their workloads (e.g., what type of EC2 instance should they use to run their job?); without the ability to reason about performance, they cannot configure for the best performance. The second part of my Ph.D. research focused on a new system, Monotasks, that we designed with the single goal of making it easy for users to reason about performance. Monotasks is a replacement for the execution layer of Apache Spark, and is fully API-compatible with Spark. For more information about Monotasks, refer to our SOSP paper (linked below).
Monotasks: Architecting for Performance Clarity in Data Analytics Frameworks. Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker. SOSP 2017.
Drizzle: Fast and Adaptable Stream Processing at Scale. Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica.
Performance Clarity as a First-Class Design Principle. Kay Ousterhout, Christopher Canel, Max Wolffe, Sylvia Ratnasamy, Scott Shenker. HotOS 2017.
Making Sense of Performance in Data Analytics Frameworks. Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun. NSDI 2015.
Sparrow: Distributed, Low Latency Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica. SOSP 2013.
The Case for Tiny Tasks in Compute Clusters. Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica. HotOS 2013.