Areas of Focus

Software systems research, architecture, design, and development; large-scale software systems; programming languages; compilers; runtime systems; data structures; algorithms; concurrency and parallelism; multi/many-core architectures; and parallel and distributed systems.

Dissertation Research

PhD Committee: Calvin Ribbens, Srinidhi Varadarajan, Eli Tilevich, Naren Ramakrishnan, Doug Lea, and Vikram Adve.
Dissertation: Hari K. Pyla, Safe Concurrent Programming and Execution, Doctoral Dissertation, Department of Computer Science, Virginia Tech, 2013.

Multi-core processors are the mainstay of computing, and concurrent programming techniques leverage their power. In practice, however, it is hard to get concurrent programming right: concurrency bugs are particularly hard to find, debug, and fix. Furthermore, many applications are inherently difficult to parallelize due to execution order constraints, resulting in underutilization of processor cores.

My doctoral dissertation tackled these fundamental challenges with concurrency, parallelism, and effective utilization of the processing cores in multi-core architectures. Specifically, some of the contributions are:

Designed and developed runtime techniques that make applications self-resilient to, and self-aware of, software failures, enabling them to automatically detect and recover from concurrency bugs without requiring any modifications to the compiler, operating system, or the applications themselves.
Designed and developed a practical and extremely scalable system (billions of locks, at rates exceeding hundreds of millions of locks per second) capable of eliminating concurrency bugs in applications.
Designed and developed a programming model framework, compiler, and language-independent runtime system to improve the performance of applications and algorithms that are inherently hard to parallelize and fundamentally non-deterministic. The resulting framework significantly simplified parallel programming, reduced code complexity, and improved overall application performance by over 60x.

Research Problem-I

The growing ecosystem of multi-threaded applications has brought the issues of lock-based synchronization to the forefront of everyday programming. In practice, it is hard to write correct lock-based code because of the potential for deadlocks. Even well-written codes that devote an inordinate amount of effort to deadlock avoidance cannot always avoid deadlocks, particularly when the order of lock acquisitions is not known a priori. Furthermore, because of the potential for deadlocks, programmers cannot arbitrarily compose lock-based components without knowing their internal locking structure, which limits the composability of lock-based code.

Approach

We address this problem using two distinct approaches.

1. A pure runtime approach: We present Sammati (meaning "agreement" in Sanskrit), a software system that transparently and deterministically detects and eliminates deadlocks in threaded codes without requiring any modifications to application source code and without recompiling or relinking. Sammati supports applications written in type-unsafe languages such as C and C++. It is implemented as a pre-loadable library that overloads the standard POSIX threads (Pthreads) interface and makes the acquisition of mutual exclusion locks a deadlock-free operation. The conference version of the paper on Sammati is available at [PACT_2010]. A draft of the journal version of this paper is available at [TOPLAS Submission].

2. A compiler-directed and runtime approach: We present Serenity, an efficient and practical system that transparently detects and eliminates deadlocks in real-world POSIX multi-threaded applications written in C and C++. Additionally, Serenity detects common programming errors that typically occur in lock-based code, including asymmetric data races, livelocks, priority inversion, and bugs that stem from asynchronous signaling.

Serenity comfortably supports real applications that acquire hundreds of millions of locks per second, and it scales well with the number of threads. Our results using the SPLASH, Phoenix, and PARSEC benchmark suites, several desktop and server applications, and synthetic programs show that, for a majority of applications, Serenity incurs marginal performance overhead, with scalability almost identical to native thread execution.

Sammati and Serenity provide critical tools for programmers designing, implementing, and debugging complex applications for emerging many-core platforms.
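
To illustrate what runtime deadlock detection involves, here is a minimal sketch of one textbook technique: finding a cycle in a waits-for graph built from lock ownership. It is not the algorithm used by Sammati or Serenity, which this page does not describe; the function names and the dictionary representation are assumptions made for illustration, and the sketch covers detection only, not elimination.

```python
from typing import Dict, Hashable, List, Optional


def build_waits_for(owner: Dict[Hashable, Hashable],
                    waiting: Dict[Hashable, Hashable]) -> Dict[Hashable, Hashable]:
    """Map each blocked thread to the thread holding the lock it wants.

    owner[lock] = thread currently holding `lock` (hypothetical bookkeeping)
    waiting[thread] = lock that `thread` is blocked on
    """
    return {t: owner[lock] for t, lock in waiting.items() if lock in owner}


def find_deadlock_cycle(waits_for: Dict[Hashable, Hashable]) -> Optional[List[Hashable]]:
    """Return the threads forming a cycle in the waits-for graph, or None.

    A thread blocks on at most one mutex at a time, so every node has at
    most one outgoing edge and a simple pointer walk finds any cycle.
    """
    for start in waits_for:
        path: List[Hashable] = []
        seen_at: Dict[Hashable, int] = {}
        node = start
        while node in waits_for and node not in seen_at:
            seen_at[node] = len(path)
            path.append(node)
            node = waits_for[node]
        if node in seen_at:
            # The walk returned to a node on the current path: that suffix is the cycle.
            return path[seen_at[node]:]
    return None


# Classic lock-order inversion: T1 holds A and wants B, T2 holds B and wants A.
owner = {"A": "T1", "B": "T2"}
waiting = {"T1": "B", "T2": "A"}
cycle = find_deadlock_cycle(build_waits_for(owner, waiting))
```

In this example `cycle` contains both threads, since each is waiting on a lock the other holds. A runtime that intercepts lock acquisitions could run such a check whenever a thread is about to block.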

Download

You can download Sammati [HERE]. The research paper on Serenity is currently under review at a major systems conference.

Research Problem-II

The ongoing multi/many-core processor revolution requires that programmers leverage explicit concurrency to improve performance. Unfortunately, a large body of applications and algorithms are inherently hard to parallelize, either because data and control dependencies impose execution order constraints or because they are sensitive to their input data and do not scale perfectly, which leaves several cores idle. The key challenge is to enable such applications to use multi/many-core processors efficiently for improved performance.

Approach

We present a simple speculative programming framework, Anumita (meaning "guess" in Sanskrit), in which coarse-grain speculative code blocks execute concurrently, but the results of only a single speculation modify the program state. Anumita is implemented as a shared library that exposes APIs for common type-unsafe languages, including C, C++, and Fortran. Its runtime system transparently (a) creates, instantiates, and destroys speculative control flows, (b) performs name-space isolation, (c) tracks data accesses for each speculation, (d) commits the memory updates of successful speculations, and (e) recovers from the memory side effects of any mispredictions.

Anumita supports a wide range of applications by providing expressive evaluation criteria for speculative execution that go beyond time-to-solution to include arbitrary quality-of-solution criteria. Anumita simplifies speculative parallelism and relieves the programmer of the subtleties of concurrent programming.
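
The semantics described above (concurrent speculations on isolated state, a single committed winner, and a quality-of-solution criterion) can be sketched as follows. This is a toy model, not Anumita's API or mechanism: the `speculate` function, its parameters, and the dictionary-based state are hypothetical, isolation is emulated by deep-copying, and Python threads illustrate the control flow rather than real parallel speedup.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional, Sequence

State = Dict[str, Any]


def speculate(state: State,
              blocks: Sequence[Callable[[State], None]],
              quality: Callable[[State], float]) -> int:
    """Run every block on a private copy of `state`; commit only the best one.

    Returns the index of the winning block. A block that raises is treated
    as a failed speculation and its private copy is discarded.
    """
    def run(block: Callable[[State], None]) -> Optional[State]:
        private = copy.deepcopy(state)   # isolation: blocks never see each other's writes
        try:
            block(private)
        except Exception:
            return None                  # mis-speculation: nothing to roll back
        return private

    with ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
        results = list(pool.map(run, blocks))

    # Highest quality wins; ties go to the earliest block.
    candidates = [(quality(r), -i) for i, r in enumerate(results) if r is not None]
    if not candidates:
        raise RuntimeError("all speculations failed; state left unchanged")
    winner = -max(candidates)[1]
    state.clear()
    state.update(results[winner])        # commit: only the winner's updates become visible
    return winner


# Hypothetical speculations: two heuristics for the same problem and one that fails.
def greedy(s: State) -> None:
    s["colors"] = 5

def backtrack(s: State) -> None:
    s["colors"] = 3

def broken(s: State) -> None:
    raise ValueError("heuristic not applicable")

program_state: State = {"colors": None}
winner = speculate(program_state, [greedy, backtrack, broken],
                   quality=lambda s: -s["colors"])   # fewer colors is better
```

Here the second block wins because it yields the fewest colors, and only its update reaches `program_state`. A time-to-solution criterion would instead commit whichever speculation finishes first and cancel the rest.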

Using Anumita requires minimal modifications (8-10 lines on average) to application source code. In high-performance computing, where the OpenMP threading model is prevalent, Anumita extends speculation to OpenMP programs through pragmas. The full paper on Anumita is available at [OOPSLA_2011].

Download

You can download Anumita [HERE].

Related Publications

Hari K. Pyla and Srinidhi Varadarajan, ACM Student Research Competition: Transparent Runtime Deadlock Elimination, in ACM SIGARCH Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), 2012 [PDF]
Hari K. Pyla, Calvin Ribbens, and Srinidhi Varadarajan, Exploiting Coarse-Grain Speculative Parallelism, in ACM SIGPLAN Proceedings of Object-Oriented Programming, Systems, Languages & Applications (OOPSLA/SPLASH), 2011 [PDF]
Hari K. Pyla, Composing Locks by Decomposing Deadlocks, in ACM SIGPLAN Proceedings of Object-Oriented Programming, Systems, Languages & Applications (OOPSLA/SPLASH) Companion, 2011 [PDF]
Hari K. Pyla, Coarse-grain Speculation for Emerging Processors, in ACM SIGPLAN Proceedings of Object-Oriented Programming, Systems, Languages & Applications (OOPSLA/SPLASH) Companion, 2011 [PDF]
Hari K. Pyla, Efficient Parallel Execution for Un-parallelizable Codes via Coarse-Grain Speculation, in ACM Student Research Competition Grand Finals, 2011 [PDF]
Hari K. Pyla and Srinidhi Varadarajan, Avoiding Deadlock Avoidance, in ACM SIGARCH Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010 [PDF]

Related Awards

ACM SIGPLAN John Vlissides Award, 2011
ACM Student Research Competition, First Place, SPLASH/OOPSLA 2011
Graduate Research Symposium, Second Place, Virginia Tech 2012
ACM Student Research Competition, Bronze Medal, PACT 2012

Master’s Thesis

Research Problem

Reducing energy and power consumption in high-performance computing is becoming a major challenge. Compute clusters are consuming more power at higher densities than ever before. This results in increased thermal dissipation, the need for powerful cooling systems, and ultimately a reduction in system reliability as temperatures increase. Further, the unprecedented increase in temperature reduces the lifetime of processors and other on-board system components. However, from a systems research perspective, current technologies and approaches do not give the user fine-grained, adaptive control to monitor and regulate the power and thermal attributes of an application.

Approach

We propose a fast, easy-to-use, accurate, and portable software framework called Tempest (for temperature estimator) that leverages emergent thermal sensors to enable users to profile, evaluate, and reduce the thermal characteristics of systems and applications. Tempest provides key insights into the thermal and power characteristics of an application. In particular, it addresses common questions such as:

Which parts of an application's code benefit from which thermal management techniques?
Where in the code-base should one start optimizing an application in order to reduce temperature and energy?
Are the thermal properties of applications the same across different machines and clusters?
How do thermal optimizations of an application affect the performance?
How can one modify code adaptively at runtime in order to reduce temperature and energy consumption?
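
As an illustration of the kind of analysis behind the first two questions, the sketch below attributes timestamped temperature samples to the code regions that were executing when each sample was taken. It is not Tempest's implementation or interface: the function name, the tuple layouts, and the numbers are hypothetical placeholders chosen only to make the example concrete, not measurements.

```python
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[float, float]        # (timestamp in seconds, temperature in degrees Celsius)
Region = Tuple[str, float, float]   # (region name, start time, end time)


def mean_temperature_by_region(samples: List[Sample],
                               regions: List[Region]) -> Dict[str, float]:
    """Average the sensor samples that fall inside each region's time intervals.

    A region name may appear several times (for example, a function called
    repeatedly); all of its intervals contribute to one average.
    """
    temps_by_name: Dict[str, List[float]] = {}
    for name, start, end in regions:
        hits = [temp for ts, temp in samples if start <= ts < end]
        temps_by_name.setdefault(name, []).extend(hits)
    # Regions with no samples are omitted rather than reported as zero.
    return {name: mean(temps) for name, temps in temps_by_name.items() if temps}


# Hypothetical trace: one sample per second, two regions executed back to back.
samples = [(0.0, 40.0), (1.0, 42.0), (2.0, 50.0), (3.0, 54.0)]
regions = [("io_phase", 0.0, 2.0), ("solver", 2.0, 4.0)]
profile = mean_temperature_by_region(samples, regions)
hottest = max(profile, key=profile.get)
```

With this made-up trace, `hottest` is the `solver` region, which would make it the first candidate for thermal optimization.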

In this thesis, we designed, developed, and illustrated the use of Tempest to analyze the thermal effects of various parallel benchmarks in clusters. We also showed how users can analyze the effects of thermal optimizations on cluster applications. Dynamic Voltage and Frequency Scaling (DVFS) reduces the power consumption of high-performance clusters by lowering processor voltage and frequency during periods of low utilization. We designed Tempest to measure the runtime effects of processor frequency on thermals. Our findings indicate that HPC workload characteristics greatly impact the effects of DVFS on temperature and energy. We propose a thermal-aware DVFS scheduling approach that proactively controls processor voltage across a cluster by evaluating and predicting trends in processor temperature. We identify approaches that can maintain temperature thresholds and reduce temperature with minimal impact on performance. Our results indicate that proactive, temperature-aware scheduling of DVFS can reduce cluster-wide processor thermals by more than 10 degrees Celsius, the threshold for improving electronic reliability by 50%.
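
The idea of proactive, trend-based DVFS control can be sketched as a small decision rule: fit a line to recent temperature readings, extrapolate a few steps ahead, and lower the frequency before the predicted temperature crosses a threshold. This is a simplified illustration under stated assumptions, not the scheduler developed in the thesis; the least-squares predictor, the function names, and every parameter (horizon, margin, frequency steps) are hypothetical.

```python
from typing import Sequence


def predict_temperature(history: Sequence[float], horizon: int) -> float:
    """Extrapolate a least-squares line through `history` by `horizon` steps.

    `history` holds equally spaced readings, oldest first, and must not be empty.
    """
    n = len(history)
    if n < 2:
        return history[-1]
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den
    return y_mean + slope * ((n - 1 + horizon) - x_mean)


def next_frequency(history: Sequence[float],
                   freqs_mhz: Sequence[int],
                   current_mhz: int,
                   threshold_c: float,
                   horizon: int = 2,
                   margin_c: float = 5.0) -> int:
    """Choose the next frequency from `freqs_mhz` (sorted ascending).

    Step down if the predicted temperature reaches the threshold; step up
    only if it is comfortably below it, so the controller does not oscillate.
    """
    i = freqs_mhz.index(current_mhz)
    predicted = predict_temperature(history, horizon)
    if predicted >= threshold_c and i > 0:
        return freqs_mhz[i - 1]
    if predicted < threshold_c - margin_c and i < len(freqs_mhz) - 1:
        return freqs_mhz[i + 1]
    return current_mhz


# Hypothetical frequency steps and readings (degrees Celsius).
FREQS = [1000, 1400, 1800, 2200]
rising = [60.0, 62.0, 64.0, 66.0]
chosen = next_frequency(rising, FREQS, current_mhz=2200, threshold_c=68.0)
```

In this example the latest reading (66) is still below the threshold (68), but the fitted trend predicts 70 two steps ahead, so the rule steps down one frequency level in advance. A purely reactive controller would wait until the threshold had already been crossed.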

Related Publications

Hari K. Pyla, Tempest: A Framework for High Performance Thermal-Aware Distributed Computing, Master's Thesis, Department of Computer Science, Virginia Tech, ETD-05242007-220451, 2007
Hari K. Pyla, High Performance Thermal-aware Distributed Computing, VDM Verlag Dr. Mueller e.K., ISBN-13: 978-3836435598, 2008
Hari K. Pyla, Dong Li, and Kirk W. Cameron, Thermal-aware High Performance Computing using Tempest, in Proceedings of the 19th IEEE/ACM International Conference on High Performance Computing and Communications (SC’07), Reno, NV, 2007
Kirk W. Cameron, Hari K. Pyla, and Srinidhi Varadarajan, Tempest: A portable tool to identify hot spots in parallel code, in IEEE Proceedings of the International Conference on Parallel Processing, ICPP, 2007 [PDF]
Dong Li, H.-C. Chang, Hari K. Pyla, and Kirk W. Cameron, System-level, thermal-aware, fully-loaded process scheduling, in IEEE Proceedings of the 22nd International Symposium on Parallel and Distributed Processing, IPDPS, 2008 [PDF]

Intellectual Property

Hari K. Pyla, Kirk W. Cameron, and X. Feng, Method and System for Correlation and Analysis of Performance and Sensor Data in Computer Systems, Virginia Tech, Ed. US, VTIP:07-019, 2007