University of Utah Special Topics

CS6969: Fast and Correct GPU Code

A project-centered course on building, understanding, testing, and improving GPU primitives for modern ML and HPC systems.

Course Overview

Modern ML and HPC systems depend on carefully engineered computational primitives, including GPU kernels and numerical library functions, to achieve both high performance and trustworthy behavior. Even heavily tested kernels can hide subtle functional bugs, unstable numeric behavior, and performance defects that quietly leave large amounts of hardware capability unused.

This course studies the numeric, semantic, and scheduling abstractions needed to build GPU code that is both correct and fast. We focus on how data representations, execution order, memory behavior, parallel synchronization, and performance models interact in real kernels. We also examine tools and methods that can make future primitives more systematic to design, including data-flow languages, MLIR-style compiler infrastructure, verification techniques, and measurement-driven performance analysis.
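As a small illustration of how execution order and data representation interact, the sketch below (our own example, not course material) sums the same array twice: once sequentially and once with a GPU atomic reduction. Because floating-point addition is not associative, the two association orders can produce visibly different results, and the atomic version can vary from run to run.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical illustration: a float reduction whose result depends on the
// order in which atomics happen to commit. Same values, different answer.
__global__ void atomic_sum(const float* x, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i]);  // commit order is nondeterministic
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = 1.0f / (1.0f + i);  // decaying terms

    float *dx, *dsum;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dsum, sizeof(float));
    cudaMemcpy(dx, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Reference sum in one fixed left-to-right order.
    float ref = 0.0f;
    for (int i = 0; i < n; ++i) ref += h[i];

    // Parallel sum: identical values, a different association order.
    cudaMemset(dsum, 0, sizeof(float));
    atomic_sum<<<(n + 255) / 256, 256>>>(dx, n, dsum);
    float gpu = 0.0f;
    cudaMemcpy(&gpu, dsum, sizeof(float), cudaMemcpyDeviceToHost);

    printf("sequential = %.7f  atomic = %.7f  diff = %.3e\n",
           ref, gpu, ref - gpu);
    cudaFree(dx);
    cudaFree(dsum);
    return 0;
}
```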

Course highlights:

  • Hands-on work with AWS Neuron.
  • A deep dive into MLIR-AIR, with a focus on MLIR transformations.
  • A detailed look at modern tile-based languages.

The class is co-taught with Professor Sreepathi Pai of the University of Rochester. It is explicitly project-centered: student-designed primitives will be tested in realistic ML and HPC settings, and the course is intended to support paper writing and public artifact release when the work matures enough to justify it.

[Image: parallel GPU arithmetic]
GPU correctness and performance are treated together, not as separate concerns.

Timetable

This table is a structured version of the timetable embedded in the shared syllabus document. It preserves the semester flow while keeping the public website readable.

| Date | Lead | Topics | Readings / Slides | Assignments / Notes |
|---|---|---|---|---|
| Mon 1/5 | Both | Course organization; semester goals and project framing | | Semester launch |
| Wed 1/7 | Both | Number systems and tools; intro to performance fundamentals | | Asg1 released, due 1/14. See the assignment Overleaf. |
| Mon 1/12 | Sree | Intro to GPU performance | GPU performance lecture material | Continue Asg1 |
| Wed 1/14 | Both + student presenters | Formal model of GPU execution; throughput models; race effects and GKLEE demo | | Asg2 assigned, due 1/21. Detect races using Faial and optionally GKLEE. |
| Mon 1/19 | Holiday | MLK Day | No class | University holiday |
| Tue 1/20 | Guest talk | Interactive computing in nature recreation and youth sports | Prof. Michael Jones, BYU | Special lecture |
| Wed 1/21 | Guest talk | Modular static cost analysis and related verification ideas | Tiago Cogumbreiro / Faial material | Asg3 assigned, due 1/28. See the Asg-3 writeup workspace. |
| Thu 1/22 | Guest talk | 50 years of parallel programming | Prof. Keshav Pingali | Kahlert Distinguished Lecture |
| Mon 1/26 | Ganesh + David | AWS training; Tilus; modular scheduling | | AWS and low-precision kernel focus |
| Wed 1/28 | Ganesh | AWS training; Neuron architecture; Mojo | | Asg4 due 2/6; Asg5 due 2/13. |
| Mon 2/2 | Ganesh + Sree + students | AWS tensor-addition walkthrough; profiling; student talks | | Interactive experimental session |
| Wed 2/4 | Both + student speakers | Follow-on AWS material; student presentations | | Project-selection writeup due 2/13 |
| Mon 2/9 | Both + student speaker | Discussion of Asg1-Asg4; Hoare logic for GPU programs | | Homework review |
| Wed 2/11 | Both + student speaker | Memory hierarchy paper discussion | Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking | Read before class |
| Mon 2/16 | Holiday | Presidents Day | No class | University holiday |
| Wed 2/18 | Both + student speakers | ThunderKittens, HipKittens, TVM-FFI discussion | | Paper-discussion format |
| Mon 2/23 | Ganesh + student speakers | MLIR-AIR paper and software tryout | | Unit-test and software exploration |
| Wed 2/25 | Guest lecture | Visit by Dr. Sangeeta Chowdhary on MLIR-AIR | AMD / MLIR-AIR effort | Asg6 due 3/6; final project proposal expected |
| Mon 3/2 | Ganesh + students | Faial race-checking and GKLEE | | Project-idea discussion |
| Wed 3/4 | Students | Brief project idea presentations | | Project pitching |
| Mon 3/9 | Holiday | Spring break | No class | University holiday |
| Wed 3/11 | Holiday | Spring break | No class | University holiday |
| Mon 3/16 | Project mode | Project discussions | Project-specific slides and brainstorming | Loose post-break schedule begins |
| Wed 3/18 | Project mode | Project discussions | Team meetings and review | Project focus |
| Mon 3/23 | Guest / project mode | SLEEK paper and code discussion | Andrew Rodriguez presentation | |
| Wed 3/25 | Project mode | Project meetings; brief look at Hexagon MLIR | Qualcomm slides and related arXiv material | Project focus |
| Mon 3/30 | Project mode | Project meetings | Team progress and debugging | Project focus |
| Wed 4/1 | Discussion | cuFuzz discussion | NVIDIA cuFuzz research page | Tooling / bug-finding focus |
| Mon 4/6 | Discussion | MLIR transform dialect and xDSL | Course-generated transformations, scripts, and slides | Compiler transformation focus |
| Mon 4/20 | All groups | Last lecture; short project presentations | In-class final project updates | All groups present briefly |
| Mon 4/27 | Due date | Project reports due | Final report submission | Written deliverable deadline |
| Mon 5/4 | Administrative | Grades due | End of semester | Course closeout |

Syllabus Snapshot

Table of Contents

  1. Course organization and project expectations
  2. Number systems, floating-point, and tool foundations
  3. GPU performance fundamentals and throughput modeling
  4. Formal GPU execution models, races, and schedule-sensitive bugs
  5. AWS Trainium and Neuron/NKI experimentation
  6. Compiler and language systems: Tilus, Mojo, MLIR, MLIR-AIR
  7. Profiling, tracing, and performance-measurement workflows
  8. Verification, race-checking, and floating-point analysis
  9. Student paper presentations and visiting research talks
  10. Semester-long project development and artifact release

Semester Shape

  • Early weeks emphasize abstractions: numeric representation, correctness reasoning, and cost models.
  • Middle weeks shift into concrete GPU and Trainium experimentation, with profiling and tool use.
  • Later weeks increasingly revolve around paper discussion, project reviews, and system-building.
  • Short student presentations are threaded throughout to connect reading with active experimentation.

The detailed course document lays out a semester that moves from basic numerical and performance foundations toward concrete GPU experiments on current systems. The course opens with number systems, correctness tools, performance fundamentals, and GPU throughput modeling, then develops a formal view of GPU execution, race behavior, and schedule-sensitive bugs.
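As a minimal sketch of the schedule-sensitive bugs this part of the course targets (a hypothetical example of ours, not one from the syllabus), the kernel below omits a barrier between a shared-memory write and a neighboring read. It is racy, yet it can still produce correct output on many schedules, which is exactly why tools such as Faial and GKLEE matter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread publishes its value to shared memory, then reads its
// neighbor's slot. Without __syncthreads(), the read races with the
// write, so the result depends on warp scheduling.
__global__ void rotate_right(const int* in, int* out, int n) {
    __shared__ int buf[256];
    int i = threadIdx.x;
    buf[i] = in[i];
    // __syncthreads();         // omitting this barrier is a data race
    out[i] = buf[(i + 1) % n];  // may observe a stale value
}

int main() {
    const int n = 256;
    int h[256], r[256];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *din, *dout;
    cudaMalloc(&din, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(din, h, n * sizeof(int), cudaMemcpyHostToDevice);
    rotate_right<<<1, n>>>(din, dout, n);
    cudaMemcpy(r, dout, n * sizeof(int), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (r[i] != (i + 1) % n);
    printf("%d of %d outputs wrong (may be 0 on a lucky schedule)\n", bad, n);
    cudaFree(din);
    cudaFree(dout);
    return 0;
}
```

A race checker flags the unsynchronized write/read pair statically or symbolically, rather than hoping a test run happens to hit an unlucky schedule.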

The middle of the semester shifts into hands-on systems work. The shared syllabus emphasizes AWS Trainium access, profiling exercises, compiler and language ecosystems such as Mojo and Tilus, and performance-model readings. Students are expected to run kernels, measure them, explain observed bottlenecks, and connect those measurements back to formal and algorithmic reasoning about the code.
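As a hedged sketch of that measure-then-explain loop (our own example with placeholder sizes, not a course assignment), the code below times a memory-bound copy kernel with CUDA events and converts the elapsed time into achieved bandwidth, the first number one would set against the device's theoretical peak.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A copy kernel moves one float in and one float out per thread, so its
// throughput is bounded by memory bandwidth, not arithmetic.
__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 26;  // 64M floats: 256 MiB read + 256 MiB written
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);  // warm-up launch
    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbytes = 2.0 * n * sizeof(float) / 1e9;  // one read + one write
    printf("copy: %.3f ms, %.1f GB/s achieved\n", ms, gbytes / (ms / 1e3));
    // Compare against peak bandwidth from cudaGetDeviceProperties, or
    // cross-check the same kernel under nsys for a fuller picture.
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```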

The document also makes clear that the class is discussion-heavy and research-oriented. Short student presentations are integrated throughout the semester so that teams regularly read papers, explain ideas to the class, and use those readings to sharpen their own project direction. The overall pattern is deliberate: learn the abstractions, inspect real artifacts, experiment on modern accelerators, and then design or repair primitives with publishable discipline.

Projects and Outcomes

The course document frames the project as the center of the class. Teams are encouraged to choose ambitious GPU themes, including new kernels, correctness and performance diagnostics, compiler-assisted kernel design, and experiments involving realistic ML or HPC frameworks. Assignments along the way are structured to feed into that larger project rather than stand alone.

Expected outcomes include a well-documented artifact, a clear correctness and performance story, and potentially a paper-ready result. Students are pushed not only to make something run, but to justify why it is correct, explain why it performs the way it does, and show how the underlying abstractions support those claims.

[Images: a hammer looking for a nail; a nail looking for the right hammer]
The course treats tools as methods to be matched to the right correctness and performance question.

Resources and Logistics

The detailed syllabus highlights several practical resources: University computing access, CHPC systems, AWS Trainium resources, and guest or partner lectures that connect course material to active systems research. It also uses shared communication channels and frequent instructor contact to keep project work moving.

If you want the original detailed course planning material, see the archived course materials and the semester documents in the repository. This homepage is meant to provide the compact public-facing summary.

A consolidated catalog of the public URLs embedded throughout the main syllabus document and its side tabs is available here: Full Link Catalog.

Software Tools

The shared course document points students toward a hands-on stack of systems for writing, checking, and profiling GPU primitives.

  • AWS Trainium + Neuron/NKI: the main accelerator experimentation path in the syllabus, including NKI kernels, Neuron Explorer, profiling traces, and attention and matrix-multiplication tutorials.
  • CHPC GPU workflow: CUDA-capable campus systems, `nvcc`, `nvidia-smi`, `nsys`, and batch allocation workflows for NVIDIA profiling.
  • Faial: a race-checking and static cost-analysis tool used in the course to reason about warp-level behavior and correctness/performance interactions.
  • GKLEE: symbolic and concolic GPU bug-finding, used as a reference point for race exposure and schedule-sensitive failures.
  • Tilus: a tile-level GPGPU language for low-precision computation, treated as a language-design case study for structured primitive construction.
  • Mojo: discussed as an emerging systems language for high-performance kernel and HPC-oriented experimentation.
  • MLIR and MLIR-AIR: compiler infrastructure and accelerator-lowering frameworks used to connect loop nests, transformations, and hardware realization.
  • AIR2CUDA and related tooling: software artifacts used to inspect lowering pathways from MLIR-AIR-style flows toward GPU code generation.
  • NVBit and custom instrumentation: dynamic GPU instrumentation ideas, including barrier-focused tooling and low-level runtime inspection.
  • VerCors, CIVL, and FP analysis tools: formal and numeric-analysis tools for proving race freedom, checking semantics, and studying floating-point error.

Papers by Topic

The readings in the shared syllabus cluster naturally into a few recurring themes.

  • Performance and throughput modeling: papers such as uiCA, Facile, the shared-memory atomic bottleneck work, and modular static cost analysis build the vocabulary for predicting and explaining kernel throughput (see the roofline sketch after this list).
  • GPU execution cost and productivity: works such as NPBench, data-centric Python, and CUDA cost-model papers connect user productivity, performance portability, and evaluation-cost reasoning.
  • Race detection and GPU verification: the syllabus groups FastTrack, FSE 2010 SMT-based GPU verification, GKLEE, GPUVerify, HiRace, Memory Access Protocols, and VerCors as complementary approaches to proving or detecting correctness properties.
  • Formal semantics and Hoare-style reasoning: materials such as Hoare logic for GPU programs, memory-model readings, and CIVL point students toward specification-first reasoning instead of purely empirical debugging.
  • Floating-point rigor: the background includes Goldberg’s classic essay, floating-point error-analysis work, Herbie-style rewriting, and scalable rigorous FP analysis, tying numerical semantics directly to kernel trustworthiness.
  • Scheduling, mapping, and specialization: software pipelining, warp specialization, distributed tensor mapping, and distributed Fourier mapping papers capture the scheduling side of making kernels and tensor systems fast.
  • Compiler and accelerator design: MLIR, MLIR-AIR, Tilus, and recent accelerator-lowering work show how modern compiler structures can encode performance intent and hardware structure more systematically.
  • Project-facing frontier systems: RenderMan XPU, tritonBLAS, ParallelKittens, ProofWright, GEAK, TileGym, and Tensor Core survey material serve as examples of current systems that students can study, reimplement, or benchmark against.
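To make the throughput-modeling vocabulary above concrete, here is a back-of-envelope roofline calculation (our own sketch; the peak figures are placeholders to be replaced with your device's specifications, not measurements). For a matrix-multiply-shaped kernel, arithmetic intensity follows from counting flops against mandatory memory traffic, and the attainable rate is the minimum of the compute and memory ceilings.

```cuda
#include <cstdio>

// Roofline bound for an SGEMM-shaped kernel: 2*M*N*K flops over
// (M*K + K*N + M*N) * 4 bytes of mandatory traffic.
// attainable = min(peak_flops, peak_bandwidth * intensity)
int main() {
    const double peak_tflops = 19.5;    // placeholder FP32 peak (TFLOP/s)
    const double peak_bw_gbs = 1555.0;  // placeholder DRAM bandwidth (GB/s)

    const double M = 4096, N = 4096, K = 4096;
    const double flops = 2.0 * M * N * K;
    const double bytes = (M * K + K * N + M * N) * 4.0;
    const double intensity = flops / bytes;  // FLOP per byte

    // GB/s * FLOP/byte = GFLOP/s; divide by 1e3 for TFLOP/s.
    const double mem_ceiling = peak_bw_gbs * intensity / 1e3;
    const double bound = mem_ceiling < peak_tflops ? mem_ceiling : peak_tflops;

    printf("intensity = %.1f FLOP/byte -> bound = %.1f TFLOP/s (%s-bound)\n",
           intensity, bound, bound < peak_tflops ? "memory" : "compute");
    return 0;
}
```

At these sizes the intensity is high enough that the kernel sits on the compute roof; shrinking K or tiling poorly pushes it toward the memory roof, which is the kind of shift the course's cost-model readings teach students to predict before profiling confirms it.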