Understanding Task Parallelism: Providing insight into scheduling, memory, and performance for CPUs and Graphics
- Location: 2446, ITC, Lägerhyddsvägen 2, Uppsala
- Doctoral student: Ceballos, Germán
- About the thesis
- Organizer: Division of Computer Systems (Avdelningen för datorteknik)
- Contact person: Ceballos, Germán
Maximizing the performance of computer systems while making them more energy efficient is vital for future developments in engineering, medicine, entertainment, and other fields. However, the increasing complexity of software, hardware, and their interactions makes this task difficult. Software developers must contend with complex memory architectures, such as multilevel caches on modern CPUs, and with keeping thousands of cores busy on GPUs, both of which make programming harder.
Task-based programming provides high-level abstractions to simplify the development process. In this model, independent tasks (functions) are submitted to a runtime system, which orchestrates their execution across hardware resources. This approach has become popular and successful because the runtime can distribute the workload automatically and can potentially optimize the execution to minimize data movement (e.g., by being aware of the cache hierarchy).
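The submit-and-orchestrate pattern described above can be illustrated with a minimal sketch. It assumes Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in for a task runtime; this is not one of the runtime systems studied in the thesis, and it does no cache-aware scheduling. The names `block_sum` and `run_tasks` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def block_sum(block):
    """One independent task: sum a block of numbers."""
    return sum(block)


def run_tasks(data, block_size=4, workers=2):
    """Split data into blocks, submit one task per block, combine the results.

    The executor plays the role of the runtime system: the caller only
    submits tasks, and the pool decides which worker runs each one.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each submit() hands one independent task to the pool.
        futures = [pool.submit(block_sum, b) for b in blocks]
        # Collect the partial results and reduce them.
        return sum(f.result() for f in futures)


total = run_tasks(list(range(16)))
```

The caller never assigns a task to a specific worker. That placement decision, which affects both load balance and data reuse, is left entirely to the runtime.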
However, to build better runtime systems, we need to understand the performance bottlenecks of current and future multicore architectures. Unfortunately, because most existing work was designed for sequential or thread-based workloads, there is an overall lack of tools and methods for gaining insight into the execution of task-based applications, insight that both the runtime and programmers need in order to detect potential optimizations.
In this thesis, we address this lack of tools by providing fast, accurate, and mathematically sound models to understand the execution of task-based applications. In particular, we center these models around three key aspects of the execution: memory behavior (data locality), scheduling, and performance. Our contributions provide insight into the interplay between the schedule's behavior, data reuse through the cache hierarchy, and the resulting performance. These contributions lay the groundwork for improving runtime systems. We first apply these methods to analyze a diverse set of CPU applications, and then extend them to one of the most common workloads in current systems: graphics rendering on GPUs.