OSDI Most Cited Papers from 2008 - 2017

Paper Name Author Year Citation
TensorFlow: A System For Large-Scale Machine Learning

Sanjay Ghemawat, Jianmin Chen, Michael Isard, Paul Barham, Matthieu Devin, Manjunath Kudlur, Rajat Monga, Zhifeng Chen, Geoffrey Irving, Paul A Tucker, Jeffrey Dean, Sherry Moore, Derek Gordon Murray, Martin Abadi, Vijay Vasudevan, Andy Davis, Martin Wicke, Pete Warden, Josh Levenberg, Benoit Steiner 2016 6268
TaintDroid: An Information-Flow Tracking System For Realtime Privacy Monitoring On Smartphones

Today’s smartphone operating systems frequently fail to provide users with adequate control over and visibility into how third-party applications use their private data. the authors address these shortcomings with TaintDroid, an efficient, system-wide dynamic taint tracking and analysis system capable of simultaneously tracking multiple sources of sensitive data. Using TaintDroid to monitor the behavior of 30 popular third-party Android applications, we found 68 instances of potential misuse of users’ private information across 20 applications.

William Enck, Patrick Mcdaniel, Jaeyeon Jung, Peter Gilbert, Anmol Sheth, Landon P Cox, Byunggon Chun 2010 2983
KLEE: Unassisted And Automatic Generation Of High-Coverage Tests For Complex Systems Programs

Cristian Cadar, Daniel Dunbar, Dawson R Engler 2008 2163
Improving MapReduce Performance In Heterogeneous Environments

Matei Zaharia, Andy Konwinski, Anthony D Joseph, Randy H Katz, Ion Stoica 2008 1751
Onix: A Distributed Control Platform For Large-scale Production Networks

Computer networks lack a general control paradigm, as traditional networks do not provide any networkwide management abstractions.To address this, the authors present Onix, a platform on top of which a network control plane can be implemented as a distributed system. Control planes written within Onix operate on a global view of the network, and use basic state distribution primitives provided by the platform.

Scott Shenker, Leon Poutievski, Natasha Gude, Teemu Koponen, Jeremy Stribling, Min Zhu, Martin Casado, Takayuki Hama, Hiroaki Inoue, Rajiv Ramanathan, Yuichiro Iwata 2010 1315
Spanner: Google’s Globally-Distributed Database

This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. Spanner is Google’s scalable, multi-version, globallydistributed, and synchronously-replicated database.

Sanjay Ghemawat, Lindsay Rolig, Sebastian Kanthak, David F Nagle, Sergey Melnik, David Mwaura, Alexander Lloyd, Christopher Heiser, Andrew Fikes, Peter Hochschild, Eugene Kogan, Sean Quinlan, Michal Szymaniak, Andrey Gubarev, Chris Jorgen Taylor, Wilson Hsieh, Dale Woodford, Christopher Frost, James C Corbett, Jeffrey Dean, Yasushi Saito, Michael Epstein, Rajesh Rao, J J Furman, Hongyi Li, Ruth Wang 2012 1188
PowerGraph: Distributed Graph-Parallel Computation On Natural Graphs

Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions. in this paper, the authors characterize the challenges of computation on natural graphs in the context of existing graphparallel abstractions. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs.Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.

Danny Bickson, Carlos Guestrin, Joseph Gonzalez, Haijie Gu, Yucheng Low 2012 992
GraphChi: Large-Scale Graph Computation On Just A PC

Current systems for graph computation require a distributed computing cluster to handle very large real-world problems. in this work, the authors present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. We further extend GraphChi to support graphs that evolve over time, and demonstrate that, on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation.By repeating experiments reported for existing distributed systems, we show that, with only fraction of the resources, GraphChi can solve the same problems in very reasonable time.

Guy E Blelloch, Aapo Kyrola, Carlos Guestrin 2012 755
Reining In The Outliers In Map-Reduce Clusters Using Mantri

the authors present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques.Deployment in Bing’s production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.

Srikanth Kandula, Ganesh Ananthanarayanan, Edward Harris, Yi Lu, Bikas Saha, Ion Stoica, Albert Greenberg 2010 635
GraphX: Graph Processing In A Distributed Dataflow Framework

the authors introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system.

Reynold S Xin, Daniel Crankshaw, Michael J Franklin, Ion Stoica, Ankur Dave, Joseph E Gonzalez 2014 588
Finding And Reproducing Heisenbugs In Concurrent Programs

Madanlal Musuvathi, Shaz Qadeer, Iulian Neamtiu, Thomas Ball, Gerard Basler, Piramanayagam Arumuga Nainar 2008 557
Can The Production Network Be The Testbed?

A persistent problem in computer network research is validation. In this paper, the authors describe a way to build a testbed that is embedded in—and thus grows with—the network. When deciding how to evaluate a new feature or bug fix, a researcher or operator must trade-off realism (in terms of scale, actual user traffic, real equipment) and cost (larger scale costs more money, real user traffic likely requires downtime, and real equipment requires vendor adoption which can take years).The technique—embodied in our first prototype, FlowVisor—slices the network hardware by placing a layer between the control plane and the data plane.

Nick Mckeown, Guru Parulkar, Rob Sherwood, Guido Appenzeller, Glen Gibb, Kokkiong Yap, Martin Casado 2010 535
Large-scale Incremental Processing Using Distributed Transactions And Notifications

the authors have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index.By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

Frank Dabek, Daniel Peng 2010 489
Availability In Globally Distributed Storage Systems

the authors characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

Francois Labelle, Vananh Truong, Murray Stokely, Florentina I Popovici, Luiz Andre Barroso, Sean Quinlan, Daniel Ford, Carrie Grimes 2010 480
Finding A Needle In Haystack: Facebook's Photo Storage

This paper describes Haystack, an object storage system optimized for Facebook’s Photos application.

Peter Vajgel, Jason Sobel, Doug Beaver, Sanjeev Kumar, Harry C Li 2010 465
Difference Engine: Harnessing Memory Redundancy In Virtual Machines

Amin Vahdat, Michael Vrable, George Varghese, Alex C Snoeren, Diwaker Gupta, Stefan Savage, Geoffrey M Voelker, Sangmin Lee 2008 448
Corey: An Operating System For Many Cores

Silas Boydwickizer, Frans Kaashoek, Aleksey Pesterev, Robert Morris, Yuehua Dai, Haibo Chen, Lex Stein, Rong Chen, Yandong Mao, Ming Wu, Zheng Zhang, Yang Zhang 2008 427
Scaling Distributed Machine Learning With The Parameter Server

the authors propose a parameter server framework for distributed machine learning problems. We propose a parameter server framework for distributed machine learning problems.Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.To demonstrate the scalability of the proposed framework, we show experimental results on petabytes of real data with billions of examples and parameters on problems ranging from Sparse Logistic Regression to Latent Dirichlet Allocation and Distributed Sketching.

Eugene J Shekita, Vanja Josifovski, Alexander J Smola, Mu Li, David G Andersen, Amr Ahmed, James Long, Jun Woo Park, Boryiing Su 2014 425
An Analysis Of Linux Scalability To Many Cores

This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48- core computer.

Silas Boydwickizer, M Frans Kaashoek, Nickolai Zeldovich, Aleksey Pesterev, Robert Morris, Yandong Mao, Austin T Clements 2010 388
Shielding Applications From An Untrusted Cloud With Haven

Marcus Peinado, Galen C Hunt, Andrew Baumann 2014 376