Heterogeneous Parallel Computing with GPUs
GPUs present unique opportunities as highly parallel, energy-efficient processors. However, GPU architectures are not yet well enough understood to be leveraged effectively in programming languages and their compilers. The goal of my research is to make it possible to leverage GPUs effectively and efficiently within high-level programming environments. This effort has followed three main threads:
- Studying specific applications to understand GPUs
We have studied several applications that appear difficult to parallelize on GPUs, optimizing them in order to better understand the trade-offs of GPU architectures. The applications include k-means clustering, page ranking, multi-dimensional scaling, lossless LZSS compression, longest substring matching, longest common subsequence matching, and several MATLAB benchmarks. This has led to two important new insights:
  - The best possible performance is invariably achieved not by making an either/or choice between CPUs and GPUs, but by leveraging both; and
  - By the time the programmer encodes algorithms using a specific GPU programming model, it is already too late to fully optimize the code! A more effective strategy would be to provide programmers with abstractions of primitives best suited to the various run-time environments (CPUs and GPUs) while making them aware of the trade-offs, as sketched below. This would encourage programmers to creatively devise more effective algorithms for heterogeneous environments, something no automatic tool can do (yet!).
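To make the second insight concrete, the sketch below shows what such a primitive might look like. This is a hypothetical illustration in CUDA C++, not an interface from any of our projects; the names (`scale`, `Target`) and the launch parameters are assumptions made for the example. The point is that the primitive exposes the CPU/GPU choice, and its trade-offs, at the call site instead of hiding them inside a tool.

```cpp
// Hypothetical sketch (not an actual project API): a data-parallel primitive
// that lets the programmer choose its execution target explicitly.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

enum class Target { CPU, GPU };

// The target parameter makes the CPU/GPU trade-off (transfer cost vs.
// parallelism) visible to the programmer rather than hidden in a tool.
void scale(float* host, int n, float factor, Target t) {
    if (t == Target::CPU) {
        for (int i = 0; i < n; ++i) host[i] *= factor;  // plain sequential loop
    } else {
        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        int threads = 256;
        scale_kernel<<<(n + threads - 1) / threads, threads>>>(dev, factor, n);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
    }
}

int main() {
    float a[1024];
    for (int i = 0; i < 1024; ++i) a[i] = 1.0f;
    scale(a, 1024, 2.0f, Target::GPU);  // large inputs may amortize transfers
    scale(a, 1024, 0.5f, Target::CPU);  // small inputs often favor the CPU
    printf("a[0] = %f\n", a[0]);        // expected: 1.0
    return 0;
}
```

Because data-transfer cost often dominates for small inputs, exposing the target lets the programmer, who knows the data sizes and the surrounding algorithm, make a choice that no automatic heuristic currently makes reliably.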
- Developing high-level GPU programming interfaces
Motivated by these application studies, together with the need to simplify heterogeneous programming, we have been designing embedded DSLs for GPU computing in several languages, including C++, Mozilla Rust, and JavaScript. Different language environments pose different challenges: for example, aliasing can be tricky to handle in C++, while security concerns are important for JavaScript. In all cases, our design attempts to make it possible to specify GPU computations using host-language idioms as much as possible, so that the computations are expressed naturally. The Rust and JavaScript implementations use LLVM and its PTX backend for code generation.
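As an illustration of the idiom, here is a minimal sketch in CUDA C++. Our published implementations target Rust and JavaScript, so this C++ analogue, including the `gpu_for_each` name, is an assumption made purely for the sketch: the kernel body is written as an ordinary host-language lambda, and the launch details stay behind the abstraction.

```cpp
// Minimal sketch of an embedded GPU "for-each" in C++; compile with
// nvcc --extended-lambda to allow __device__ lambdas in host code.
#include <cstdio>
#include <cuda_runtime.h>

template <typename F>
__global__ void apply_kernel(F f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

// Hypothetical DSL entry point: the computation stays in host-language
// syntax, while the launch configuration is hidden behind the abstraction.
template <typename F>
void gpu_for_each(int n, F f) {
    int threads = 256;
    apply_kernel<<<(n + threads - 1) / threads, threads>>>(f, n);
    cudaDeviceSynchronize();
}

int main() {
    int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the
    cudaMallocManaged(&y, n * sizeof(float));  // example free of explicit copies
    for (int i = 0; i < n; ++i) { x[i] = float(i); y[i] = 2.0f; }

    // The computation reads like an ordinary C++ loop body.
    gpu_for_each(n, [=] __device__ (int i) { y[i] += x[i]; });

    printf("y[10] = %f\n", y[10]);  // expected: 12.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

In each host language the aim is the same: the kernel body is written in the language's own function or closure syntax, and code generation to PTX happens behind the scenes.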
- Exploring compiler techniques for optimizing performance on GPUs
The high-level GPU programming interfaces need to be supported by sophisticated compilers to make them compelling alternatives to lower-level interfaces for high-performance applications. One important problem we are trying to solve is the automatic allocation of memory spaces, which requires modeling the memory behavior of GPUs (illustrated below). A second problem is automatic scheduling of computations across the CPU and GPU to maximize performance. Our initial success with MATLAB has shown that such automatic scheduling is feasible. One direction we are exploring is automatic conversion to task-parallel computations, to leverage the various scheduling strategies made available by task libraries.
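The memory-space allocation problem can be seen in miniature below. This is an illustrative CUDA sketch, not our compiler's output, and the kernel names and sizes are assumptions: the same 3-point stencil is written two ways, reading only global memory versus staging a tile in on-chip shared memory so that each element is loaded from global memory once instead of three times. An automatic allocator must model access patterns well enough to predict which placement wins.

```cpp
// Two memory placements for the same 3-point stencil. Choosing between such
// placements automatically is the memory-space allocation problem.
// Illustrative sketch; assumes 256 threads per block.
#include <cstdio>
#include <cuda_runtime.h>

// Version 1: every operand is read from (slow) global memory; interior
// elements are fetched three times each across the block.
__global__ void stencil_global(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

// Version 2: the block stages its tile plus a one-element halo in fast
// on-chip shared memory, so each element is read from global memory once.
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];  // blockDim.x elements plus halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    if (i < n) tile[t] = in[i];
    if (threadIdx.x == 0 && i > 0) tile[0] = in[i - 1];
    if (threadIdx.x == blockDim.x - 1 && i < n - 1) tile[t + 1] = in[i + 1];
    __syncthreads();
    if (i > 0 && i < n - 1)
        out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}

int main() {
    int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    stencil_global<<<blocks, threads>>>(in, out, n);  // both versions produce
    stencil_shared<<<blocks, threads>>>(in, out, n);  // identical values
    cudaDeviceSynchronize();

    printf("out[1] = %f\n", out[1]);  // expected: 3.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```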
Related publications:
- Adnan Ozsoy, Arun Chauhan, and Martin Swany. Fast Longest Common Subsequence with General Integer Scoring Support on GPUs. In Proceedings of the 2014 International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), 2014. Held in conjunction with the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). [Article DOI]
- Adnan Ozsoy, Arun Chauhan, and Martin Swany. Achieving TeraCUPS on Longest Common Subsequence Problem using GPGPUs. In Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), 2013.
- Adnan Ozsoy, Martin Swany, and Arun Chauhan. Optimizing LZSS Compression on GPGPUs. Future Generation Computer Systems (FGCS), 2013. [Article DOI]
- Eric Holk, Milinda Pathirage, Arun Chauhan, Andrew Lumsdaine, and Nicholas D. Matsakis. GPU Programming in Rust: Implementing High-level Abstractions in a Systems-level Language. In 18th International Workshop on High-level Parallel Programming Models and Supporting Environments (HIPS), 2013. Held in conjunction with the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS). [Full text]
- Thilina Gunarathne, Bimalee Salpitikorala, Arun Chauhan, and Geoffrey Fox. Iterative Statistical Kernels on Contemporary GPUs. International Journal of Computational Science and Engineering, 8(1), pages 58–77, 2013. [Article DOI]
- Adnan Ozsoy, Martin Swany, and Arun Chauhan. Pipelined Parallel LZSS for Streaming Data Compression on GPGPUs. In Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2012. [Article DOI] Paper invited for submission as a journal article.
- Eric Holk, William Byrd, Nilesh Mahajan, Jeremiah Willcock, Arun Chauhan, and Andrew Lumsdaine. Declarative Parallel Programming for GPUs. In Koen De Bosschere, Erik H. D'hollander, Gerhard R. Joubert, David Padua, Frans Peters, and Mark Sawyer, editors, Applications, Tools and Techniques on the Road to Exascale Computing, volume 22 of Advances in Parallel Computing, pages 297–304. IOS Press, Amsterdam, Netherlands, 2012. Proceedings of the 14th biennial ParCo Conference, 2011. [Full text]
- Thilina Gunarathne, Bimalee Salpitikorala, and Arun Chauhan. Optimizing OpenCL Kernels for Iterative Statistical Algorithms on GPUs. In Proceedings of the 2nd International Workshop on GPUs and Scientific Applications (GPUScA), 2011. [Proceedings URL] Paper invited for submission as a journal article.
- Eric Holk, William Byrd, Nilesh Mahajan, Jeremiah Willcock, Arun Chauhan, and Andrew Lumsdaine. Declarative Parallel Programming for GPUs. In Proceedings of the International Conference on Parallel Computing (ParCo), 2011. [Full text]
- Chun-Yu Shei, Pushkar Ratnalikar, and Arun Chauhan. Automating GPU Computing in MATLAB. In Proceedings of the International Conference on Supercomputing (ICS), pages 245–254, 2011. [Article DOI]
- Chun-Yu Shei, Adarsh Yoga, Madhav Ramesh, and Arun Chauhan. MATLAB Parallelization through Scalarization. In Proceedings of the 15th Workshop on the Interaction between Compilers and Computer Architectures (INTERACT), pages 44–53, 2011. Held in conjunction with the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA). [Article DOI]
Related open-source software releases:
- CUDA-based GPU implementation of lossless LZSS compression.
- GPU extension for the Mozilla Rust language, using the LLVM PTX backend.
- A JIT compiler supporting GPU computations in JavaScript in Firefox, using the LLVM PTX backend; it works with content scripts but requires installing a Firefox extension.