John Kloosterman

John Kloosterman is a Research Fellow at the University of Michigan, conducting research in energy-efficient computer architectures and systems. He will be starting as a Lecturer at the University of Michigan in the fall.

Teaching

EECS 183 (Elementary Programming Concepts), Fall 2018
EECS 280 (Programming and Data Structures), Winter 2017

Research

Register file design: GPUs need hundreds of kilobytes of register file capacity because so many threads execute simultaneously. However, only a few of these registers are accessed in any given period of time. RegLess (published at MICRO 2017) is a technique that saves energy by using a much smaller register structure that stores only the active registers.
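
As a rough illustration of the scale involved, the sketch below computes the register capacity one GPU core needs to hold every resident thread's registers at once. The thread count, registers per thread, and register width are assumed round numbers chosen for illustration; they are not figures from the RegLess paper.

```python
# Back-of-the-envelope register file sizing for one GPU core.
# All three constants are assumed, illustrative values.
RESIDENT_THREADS = 2048      # threads resident on one core (assumed)
REGISTERS_PER_THREAD = 32    # registers allocated to each thread (assumed)
BYTES_PER_REGISTER = 4       # 32-bit registers (assumed)


def register_file_bytes(threads, regs_per_thread, bytes_per_reg):
    """Capacity needed to hold every resident thread's registers at once."""
    return threads * regs_per_thread * bytes_per_reg


size_kib = register_file_bytes(
    RESIDENT_THREADS, REGISTERS_PER_THREAD, BYTES_PER_REGISTER
) / 1024
print(f"{size_kib:.0f} KiB")  # prints "256 KiB"
```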

Memory coalescing: Nearby threads on a GPU tend to access nearby locations in memory, so requests to the same cache line can be merged to increase memory throughput. WarpPool (published at MICRO 2015) exploits a new type of memory locality, between loads made by different thread groups, to merge more requests.
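
The effect of merging can be sketched with a simple counting model. This is not WarpPool's hardware mechanism, only an illustration of why merging across thread groups can reduce the number of memory requests. The 128-byte cache line size, the function names, and the two access patterns are assumptions made for the example.

```python
CACHE_LINE_BYTES = 128  # assumed cache line size


def cache_lines(addresses, line_bytes=CACHE_LINE_BYTES):
    """Return the set of cache line indices touched by the byte addresses."""
    return {addr // line_bytes for addr in addresses}


def requests_per_warp(warps):
    """Merge within each warp only: one request per distinct line per warp."""
    return sum(len(cache_lines(w)) for w in warps)


def requests_across_warps(warps):
    """Merge across warps too: one request per distinct line overall."""
    return len(set().union(*(cache_lines(w) for w in warps)))


# Two hypothetical warps of 32 threads, each thread loading a 4-byte word.
# Warp A reads the even words and warp B the odd words of the same 256 bytes.
warp_a = [8 * tid for tid in range(32)]
warp_b = [8 * tid + 4 for tid in range(32)]

print(requests_per_warp([warp_a, warp_b]))      # prints 4
print(requests_across_warps([warp_a, warp_b]))  # prints 2
```

In this made-up pattern each warp touches the same two cache lines, so merging across warps halves the request count.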

Multi-kernel execution: Multiple kernels running on the same GPU can have complementary resource requirements, meaning that in the best case, running two kernels together doubles the throughput. This work investigates how to get similar benefits even when the resource demands are not as perfectly matched.
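
A toy model makes the best case and the imperfect case concrete. It assumes each kernel can be described by the fraction of each resource it uses when running alone, and that both co-running kernels are slowed by the same factor until no resource is oversubscribed. The model, the resource names, and the utilization numbers are illustrative assumptions, not the method used in this work.

```python
def corun_throughput(kernel_a, kernel_b):
    """Combined throughput of two co-running kernels, relative to one alone.

    Each kernel is a dict mapping a resource name to the fraction of that
    resource it uses when running alone at full speed (0.0 to 1.0).
    Toy assumption: both kernels slow down equally until the most
    contended resource is no longer oversubscribed.
    """
    resources = set(kernel_a) | set(kernel_b)
    peak_demand = max(
        (kernel_a.get(r, 0.0) + kernel_b.get(r, 0.0) for r in resources),
        default=0.0,
    )
    slowdown = max(1.0, peak_demand)
    return 2.0 / slowdown


# Perfectly complementary: one kernel uses only compute, the other only memory.
best = corun_throughput({"compute": 1.0, "memory": 0.0},
                        {"compute": 0.0, "memory": 1.0})
# Imperfect match: both resources are oversubscribed by 20%.
mixed = corun_throughput({"compute": 0.9, "memory": 0.4},
                         {"compute": 0.3, "memory": 0.8})

print(f"{best:.2f}")   # prints 2.00
print(f"{mixed:.2f}")  # prints 1.67
```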

Work Experience

Google, Software Engineering Intern, 2015
Designed and implemented a high-performance parallel C++ memory profiling tool used across many Google projects.

Logos Research Systems, Software Engineering Intern, 2013
Created an OpenCV-based system to automatically place and format text on PowerPoint slide backgrounds.

Calvin College, Student Web Developer, 2010-2013
Designed the student newspaper website, major parts of the Hymnary digital humanities resource, and a scholarship application system.

Publications

RegLess: Just-in-Time Operand Staging for GPUs
John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, Scott Mahlke
MICRO 2017

WarpPool: Sharing Requests with Inter-Warp Coalescing for Throughput Processors
John Kloosterman, Jonathan Beaumont, Mick Wollman, Ankit Sethia, Ron Dreslinski, Trevor Mudge, Scott Mahlke
MICRO 2015

local_malloc: malloc() for OpenCL local memory (poster)
John Kloosterman, Joel Adams
ACM SRC Poster, SC13