University of Utah Special Topics

CS6969: Fast and Correct GPU Code

A project-centered course on building, understanding, testing, and improving GPU primitives for modern ML and HPC systems.

Course Overview

Modern ML and HPC systems depend on carefully engineered computational primitives, including GPU kernels and numerical library functions, to achieve both high performance and trustworthy behavior. Even heavily tested kernels can hide subtle functional bugs, unstable numeric behavior, and performance defects that quietly leave large amounts of hardware capability unused.

This course studies the numeric, semantic, and scheduling abstractions needed to build GPU code that is both correct and fast. We focus on how data representations, execution order, memory behavior, parallel synchronization, and performance models interact in real kernels. We also examine tools and methods that can make future primitives more systematic to design, including data-flow languages, MLIR-style compiler infrastructure, verification techniques, and measurement-driven performance analysis.
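As a small illustration of how execution order and data representation interact, the sketch below (our own example, not course material) sums the same array twice: once sequentially and once with a GPU atomic reduction. Because floating-point addition is not associative, the two association orders can produce visibly different results, and the atomic version can vary from run to run.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical illustration: a float reduction whose result depends on the
// order in which atomics happen to commit. Same values, different answer.
__global__ void atomic_sum(const float* x, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i]);  // commit order is nondeterministic
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = 1.0f / (1.0f + i);  // decaying terms

    float *dx, *dsum;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dsum, sizeof(float));
    cudaMemcpy(dx, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Reference sum in one fixed left-to-right order.
    float ref = 0.0f;
    for (int i = 0; i < n; ++i) ref += h[i];

    // Parallel sum: identical values, a different association order.
    cudaMemset(dsum, 0, sizeof(float));
    atomic_sum<<<(n + 255) / 256, 256>>>(dx, n, dsum);
    float gpu = 0.0f;
    cudaMemcpy(&gpu, dsum, sizeof(float), cudaMemcpyDeviceToHost);

    printf("sequential = %.7f  atomic = %.7f  diff = %.3e\n",
           ref, gpu, ref - gpu);
    cudaFree(dx);
    cudaFree(dsum);
    return 0;
}
```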

Course highlights:

  • Hands-on work with AWS Neuron.
  • A deep dive into MLIR-AIR, with a focus on MLIR transformations.
  • A detailed look at modern tile-based languages.

The class is co-taught with Professor Sreepathi Pai of the University of Rochester. It is explicitly project-centered: student-designed primitives will be tested in realistic ML and HPC settings, and the course is intended to support paper writing and public artifact release when the work matures enough to justify it.

[Image: parallel GPU arithmetic]
GPU correctness and performance are treated together, not as separate concerns.

Timetable

This table is a structured version of the timetable embedded in the shared syllabus document. It preserves the semester flow while keeping the public website readable.

| Date | Lead | Topics | Readings / Slides | Assignments / Notes |
|---|---|---|---|---|
| Mon 1/5 | Both | Course organization; semester goals and project framing | | Semester launch |
| Wed 1/7 | Both | Number systems and tools; intro to performance fundamentals | | Asg1 released, due 1/14. See the assignment Overleaf. |
| Mon 1/12 | Sree | Intro to GPU performance | GPU performance lecture material | Continue Asg1 |
| Wed 1/14 | Both + student presenters | Formal model of GPU execution; throughput models; race effects and GKLEE demo | | Asg2 assigned, due 1/21. Detect races using Faial and optionally GKLEE. |
| Mon 1/19 | Holiday | MLK Day | No class | University holiday |
| Tue 1/20 | Guest talk | Interactive computing in nature recreation and youth sports | Prof. Michael Jones, BYU | Special lecture |
| Wed 1/21 | Guest talk | Modular static cost analysis and related verification ideas | Tiago Cogumbreiro / Faial material | Asg3 assigned, due 1/28. See the Asg-3 writeup workspace. |
| Thu 1/22 | Guest talk | 50 years of parallel programming | Prof. Keshav Pingali | Kahlert Distinguished Lecture |
| Mon 1/26 | Ganesh + David | AWS training; Tilus; modular scheduling | | AWS and low-precision kernel focus |
| Wed 1/28 | Ganesh | AWS training; Neuron architecture; Mojo | | Asg4 due 2/6; Asg5 due 2/13. |
| Mon 2/2 | Ganesh + Sree + students | AWS tensor-addition walkthrough; profiling; student talks | | Interactive experimental session |
| Wed 2/4 | Both + student speakers | Follow-on AWS material; student presentations | | Project-selection writeup due 2/13 |
| Mon 2/9 | Both + student speaker | Discussion of Asg1-Asg4; Hoare logic for GPU programs | | Homework review |
| Wed 2/11 | Both + student speaker | Memory hierarchy paper discussion | Dissecting the NVIDIA Turing T4 GPU via Microbenchmarking | Read before class |
| Mon 2/16 | Holiday | Presidents Day | No class | University holiday |
| Wed 2/18 | Both + student speakers | ThunderKittens, HipKittens, TVM-FFI discussion | | Paper-discussion format |
| Mon 2/23 | Ganesh + student speakers | MLIR-AIR paper and software tryout | | Unit-test and software exploration |
| Wed 2/25 | Guest lecture | Visit by Dr. Sangeeta Chowdhary on MLIR-AIR | AMD / MLIR-AIR effort | Asg6 due 3/6; final project proposal expected |
| Mon 3/2 | Ganesh + students | Faial race-checking and GKLEE | | Project-idea discussion |
| Wed 3/4 | Students | Brief project idea presentations | | Project pitching |
| Mon 3/9 | Holiday | Spring break | No class | University holiday |
| Wed 3/11 | Holiday | Spring break | No class | University holiday |
| Mon 3/16 | Project mode | Project discussions | Project-specific slides and brainstorming | Loose post-break schedule begins |
| Wed 3/18 | Project mode | Project discussions | Team meetings and review | Project focus |
| Mon 3/23 | Guest / project mode | SLEEK paper and code discussion | Andrew Rodriguez presentation | |
| Wed 3/25 | Project mode | Project meetings; brief look at Hexagon MLIR | Qualcomm slides and related arXiv material | Project focus |
| Mon 3/30 | Project mode | Project meetings | Team progress and debugging | Project focus |
| Wed 4/1 | Discussion | cuFuzz discussion | NVIDIA cuFuzz research page | Tooling / bug-finding focus |
| Mon 4/6 | Discussion | MLIR transform dialect and xDSL | Course-generated transformations, scripts, and slides | Compiler transformation focus |
| Mon 4/20 | All groups | Last lecture; short project presentations | In-class final project updates | All groups present briefly |
| Mon 4/27 | Due date | Project reports due | Final report submission | Written deliverable deadline |
| Mon 5/4 | Administrative | Grades due | End of semester | Course closeout |

Syllabus Snapshot

Table of Contents

  1. Course organization and project expectations
  2. Number systems, floating-point, and tool foundations
  3. GPU performance fundamentals and throughput modeling
  4. Formal GPU execution models, races, and schedule-sensitive bugs
  5. AWS Trainium and Neuron/NKI experimentation
  6. Compiler and language systems: Tilus, Mojo, MLIR, MLIR-AIR
  7. Profiling, tracing, and performance-measurement workflows
  8. Verification, race-checking, and floating-point analysis
  9. Student paper presentations and visiting research talks
  10. Semester-long project development and artifact release

Semester Shape

  • Early weeks emphasize abstractions: numeric representation, correctness reasoning, and cost models.
  • Middle weeks shift into concrete GPU and Trainium experimentation, with profiling and tool use.
  • Later weeks increasingly revolve around paper discussion, project reviews, and system-building.
  • Short student presentations are threaded throughout to connect reading with active experimentation.

The detailed course document lays out a semester that moves from basic numerical and performance foundations toward concrete GPU experiments on current systems. The course opens with number systems, correctness tools, performance fundamentals, and GPU throughput modeling, then develops a formal view of GPU execution, race behavior, and schedule-sensitive bugs.
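As a minimal sketch of the schedule-sensitive bugs this part of the course targets (a hypothetical example of ours, not one from the syllabus), the kernel below omits a barrier between a shared-memory write and a neighboring read. It is racy, yet it can still produce correct output on many schedules, which is exactly why tools such as Faial and GKLEE matter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread publishes its value to shared memory, then reads its
// neighbor's slot. Without __syncthreads(), the read races with the
// write, so the result depends on warp scheduling.
__global__ void rotate_right(const int* in, int* out, int n) {
    __shared__ int buf[256];
    int i = threadIdx.x;
    buf[i] = in[i];
    // __syncthreads();         // omitting this barrier is a data race
    out[i] = buf[(i + 1) % n];  // may observe a stale value
}

int main() {
    const int n = 256;
    int h[256], r[256];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *din, *dout;
    cudaMalloc(&din, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(din, h, n * sizeof(int), cudaMemcpyHostToDevice);
    rotate_right<<<1, n>>>(din, dout, n);
    cudaMemcpy(r, dout, n * sizeof(int), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (r[i] != (i + 1) % n);
    printf("%d of %d outputs wrong (may be 0 on a lucky schedule)\n", bad, n);
    cudaFree(din);
    cudaFree(dout);
    return 0;
}
```

A race checker flags the unsynchronized write/read pair statically or symbolically, rather than hoping a test run happens to hit an unlucky schedule.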

The middle of the semester shifts into hands-on systems work. The shared syllabus emphasizes AWS Trainium access, profiling exercises, compiler and language ecosystems such as Mojo and Tilus, and performance-model readings. Students are expected to run kernels, measure them, explain observed bottlenecks, and connect those measurements back to formal and algorithmic reasoning about the code.
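As a hedged sketch of that measure-then-explain loop (our own example with placeholder sizes, not a course assignment), the code below times a memory-bound copy kernel with CUDA events and converts the elapsed time into achieved bandwidth, the first number one would set against the device's theoretical peak.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A copy kernel moves one float in and one float out per thread, so its
// throughput is bounded by memory bandwidth, not arithmetic.
__global__ void copy_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 26;  // 64M floats: 256 MiB read + 256 MiB written
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);  // warm-up launch
    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbytes = 2.0 * n * sizeof(float) / 1e9;  // one read + one write
    printf("copy: %.3f ms, %.1f GB/s achieved\n", ms, gbytes / (ms / 1e3));
    // Compare against peak bandwidth from cudaGetDeviceProperties, or
    // cross-check the same kernel under nsys for a fuller picture.
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```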

The document also makes clear that the class is discussion-heavy and research-oriented. Short student presentations are integrated throughout the semester so that teams regularly read papers, explain ideas to the class, and use those readings to sharpen their own project direction. The overall pattern is deliberate: learn the abstractions, inspect real artifacts, experiment on modern accelerators, and then design or repair primitives with publishable discipline.

Projects and Outcomes

The course document frames the project as the center of the class. Teams are encouraged to choose ambitious GPU themes, including new kernels, correctness and performance diagnostics, compiler-assisted kernel design, and experiments involving realistic ML or HPC frameworks. Assignments along the way are structured to feed into that larger project rather than stand alone.

Expected outcomes include a well-documented artifact, a clear correctness and performance story, and potentially a paper-ready result. Students are pushed not only to make something run, but to justify why it is correct, explain why it performs the way it does, and show how the underlying abstractions support those claims.

[Images: a hammer looking for a nail; a nail looking for the right hammer]
The course treats tools as methods to be matched to the right correctness and performance question.

Resources and Logistics

The detailed syllabus highlights several practical resources: University computing access, CHPC systems, AWS Trainium resources, and guest or partner lectures that connect course material to active systems research. It also uses shared communication channels and frequent instructor contact to keep project work moving.

If you want the original detailed course planning material, see the archived course materials and the semester documents in the repository. This homepage is meant to provide the compact public-facing summary.

A consolidated catalog of the public URLs embedded throughout the main syllabus document and its side tabs is available here: Full Link Catalog.

Software Tools

The shared course document points students toward a hands-on stack of systems for writing, checking, and profiling GPU primitives.

  • AWS Trainium + Neuron/NKI: the main accelerator experimentation path in the syllabus, including NKI kernels, Neuron Explorer, profiling traces, and attention and matrix-multiplication tutorials.
  • CHPC GPU workflow: CUDA-capable campus systems, `nvcc`, `nvidia-smi`, `nsys`, and batch allocation workflows for NVIDIA profiling.
  • Faial: a race-checking and static cost-analysis tool used in the course to reason about warp-level behavior and correctness/performance interactions.
  • GKLEE: symbolic and concolic GPU bug-finding, used as a reference point for race exposure and schedule-sensitive failures.
  • Tilus: a tile-level GPGPU language for low-precision computation, treated as a language-design case study for structured primitive construction.
  • Mojo: discussed as an emerging systems language for high-performance kernel and HPC-oriented experimentation.
  • MLIR and MLIR-AIR: compiler infrastructure and accelerator-lowering frameworks used to connect loop nests, transformations, and hardware realization.
  • AIR2CUDA and related tooling: software artifacts used to inspect lowering pathways from MLIR-AIR-style flows toward GPU code generation.
  • NVBit and custom instrumentation: dynamic GPU instrumentation ideas, including barrier-focused tooling and low-level runtime inspection.
  • VerCors, CIVL, and FP analysis tools: formal and numeric-analysis tools for proving race freedom, checking semantics, and studying floating-point error.

Papers by Topic

The readings in the shared syllabus cluster naturally into a few recurring themes.

  • Performance and throughput modeling: papers such as uiCA, Facile, the shared-memory atomic bottleneck work, and modular static cost analysis build the vocabulary for predicting and explaining kernel throughput (see the roofline sketch after this list).
  • GPU execution cost and productivity: works such as NPBench, data-centric Python, and CUDA cost-model papers connect user productivity, performance portability, and evaluation-cost reasoning.
  • Race detection and GPU verification: the syllabus groups FastTrack, FSE 2010 SMT-based GPU verification, GKLEE, GPUVerify, HiRace, Memory Access Protocols, and VerCors as complementary approaches to proving or detecting correctness properties.
  • Formal semantics and Hoare-style reasoning: materials such as Hoare logic for GPU programs, memory-model readings, and CIVL point students toward specification-first reasoning instead of purely empirical debugging.
  • Floating-point rigor: the background includes Goldberg’s classic essay, floating-point error-analysis work, Herbie-style rewriting, and scalable rigorous FP analysis, tying numerical semantics directly to kernel trustworthiness.
  • Scheduling, mapping, and specialization: software pipelining, warp specialization, distributed tensor mapping, and distributed Fourier mapping papers capture the scheduling side of making kernels and tensor systems fast.
  • Compiler and accelerator design: MLIR, MLIR-AIR, Tilus, and recent accelerator-lowering work show how modern compiler structures can encode performance intent and hardware structure more systematically.
  • Project-facing frontier systems: RenderMan XPU, tritonBLAS, ParallelKittens, ProofWright, GEAK, TileGym, and Tensor Core survey material serve as examples of current systems that students can study, reimplement, or benchmark against.
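To make the throughput-modeling vocabulary above concrete, here is a back-of-envelope roofline calculation (our own sketch; the peak figures are placeholders to be replaced with your device's specifications, not measurements). For a matrix-multiply-shaped kernel, arithmetic intensity follows from counting flops against mandatory memory traffic, and the attainable rate is the minimum of the compute and memory ceilings.

```cuda
#include <cstdio>

// Roofline bound for an SGEMM-shaped kernel: 2*M*N*K flops over
// (M*K + K*N + M*N) * 4 bytes of mandatory traffic.
// attainable = min(peak_flops, peak_bandwidth * intensity)
int main() {
    const double peak_tflops = 19.5;    // placeholder FP32 peak (TFLOP/s)
    const double peak_bw_gbs = 1555.0;  // placeholder DRAM bandwidth (GB/s)

    const double M = 4096, N = 4096, K = 4096;
    const double flops = 2.0 * M * N * K;
    const double bytes = (M * K + K * N + M * N) * 4.0;
    const double intensity = flops / bytes;  // FLOP per byte

    // GB/s * FLOP/byte = GFLOP/s; divide by 1e3 for TFLOP/s.
    const double mem_ceiling = peak_bw_gbs * intensity / 1e3;
    const double bound = mem_ceiling < peak_tflops ? mem_ceiling : peak_tflops;

    printf("intensity = %.1f FLOP/byte -> bound = %.1f TFLOP/s (%s-bound)\n",
           intensity, bound, bound < peak_tflops ? "memory" : "compute");
    return 0;
}
```

At these sizes the intensity is high enough that the kernel sits on the compute roof; shrinking K or tiling poorly pushes it toward the memory roof, which is the kind of shift the course's cost-model readings teach students to predict before profiling confirms it.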