Calling it the largest advancement since the NVIDIA CUDA platform was introduced in 2006, NVIDIA has launched CUDA 13.1 with CUDA Tile, which the company said introduces a virtual instruction set for tile-based parallel programming, with a focus on writing algorithms at a higher level and abstracting away the details of specialized hardware such as tensor cores.
The company said CUDA exposes a single-instruction, multiple-thread (SIMT) hardware and programming model for developers. This supports fine-grained control over how code is executed. However, it can also require considerable effort to write code that performs well, especially across multiple GPU architectures.
While there are libraries to help developers extract performance, such as NVIDIA CUDA-X and NVIDIA CUTLASS, CUDA Tile offers a new way to program GPUs at a higher level than SIMT, NVIDIA said.
According to the company:
With the evolution of computational workloads, especially in AI, tensors have become a fundamental data type. NVIDIA has developed specialized hardware to operate on tensors, such as NVIDIA Tensor Cores (TC) and NVIDIA Tensor Memory Accelerators (TMA), which are now integral to every new GPU architecture.
With more complex hardware, more software is needed to help harness these capabilities. CUDA Tile abstracts away tensor cores and their programming models so that code using CUDA Tile is compatible with current and future tensor core architectures.
Tile-based programming enables programming of algorithms by specifying chunks of data, or tiles, and then defining the computations performed on those tiles. There is no need to set how an algorithm is executed at an element-by-element level: the compiler and runtime will handle that.
This programming paradigm is common in languages such as Python, where libraries like NumPy enable programmers to specify data types like matrices, then specify and execute bulk operations with simple code. Under the covers, the right things happen, and computations continue transparently to the programmer.
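The NumPy comparison can be made concrete with a small sketch. This example is an illustration added here, not NVIDIA code: the function names `add_elementwise` and `add_bulk` are hypothetical, and the code runs on the CPU. It only contrasts spelling out every index with stating a single bulk operation and leaving the execution details to the library.

```python
import numpy as np

def add_elementwise(a, b):
    """Element-by-element addition: the programmer spells out every index."""
    out = np.empty_like(a)
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = a[i, j] + b[i, j]
    return out

def add_bulk(a, b):
    """Bulk addition: one expression; NumPy decides how to execute it."""
    return a + b

# Two small 3x4 matrices used as example inputs.
a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = np.ones((3, 4), dtype=np.float32)

result_loop = add_elementwise(a, b)
result_bulk = add_bulk(a, b)
```

Both functions produce the same matrix; the bulk version simply says what to compute and leaves how to the library, which is the style the tile model is being compared to.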
The foundation of CUDA Tile is CUDA Tile IR (intermediate representation). CUDA Tile IR introduces a virtual instruction set that enables native programming of the hardware as tile operations. Developers can write higher-level code that is efficiently executed across multiple generations of GPUs with minimal changes.
While NVIDIA Parallel Thread Execution (PTX) ensures portability for SIMT programs, CUDA Tile IR extends the CUDA platform with native support for tile-based programs. Developers focus on partitioning their data-parallel programs into tiles and tile blocks, letting CUDA Tile IR handle the mapping onto hardware resources such as threads, the memory hierarchy, and tensor cores.
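As a rough illustration of what partitioning a data-parallel program into tiles can look like, the sketch below splits a matrix multiplication into fixed-size tiles using plain NumPy on the CPU. It does not use CUDA Tile IR or cuTile Python: the function name `tiled_matmul`, the tile size of 4, and the loop structure are assumptions chosen for the example. In the model the article describes, the programmer would state only the per-tile computation, and the mapping onto threads, memory, and tensor cores would be handled by the toolchain.

```python
import numpy as np

TILE = 4  # hypothetical tile edge length, chosen only for this example

def tiled_matmul(a, b, tile=TILE):
    """Multiply a (m x k) by b (k x n) one tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    if m % tile or n % tile or k % tile:
        raise ValueError("this sketch assumes dimensions divisible by the tile size")

    c = np.zeros((m, n), dtype=a.dtype)
    # Each (bi, bj) pair identifies one output tile. On a GPU these
    # iterations would be independent units of work; here they run in order.
    for bi in range(m // tile):
        for bj in range(n // tile):
            acc = np.zeros((tile, tile), dtype=a.dtype)
            # Walk the shared dimension one tile at a time and accumulate.
            for bk in range(k // tile):
                a_tile = a[bi * tile:(bi + 1) * tile, bk * tile:(bk + 1) * tile]
                b_tile = b[bk * tile:(bk + 1) * tile, bj * tile:(bj + 1) * tile]
                acc += a_tile @ b_tile  # the per-tile computation
            c[bi * tile:(bi + 1) * tile, bj * tile:(bj + 1) * tile] = acc
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 12))
b = rng.standard_normal((12, 16))
c = tiled_matmul(a, b)
```

The result matches an ordinary `a @ b`; the point of the sketch is only that the algorithm is expressed in terms of whole tiles rather than individual elements.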
By raising the level of abstraction, CUDA Tile IR enables users to build higher-level hardware-specific compilers, frameworks, and domain-specific languages (DSLs) for NVIDIA hardware. CUDA Tile IR for tile programming is analogous to PTX for SIMT programming.
One thing to point out is that it’s not an either/or situation. Tile programming on GPUs is another approach to writing GPU code, but programmers don’t have to choose between SIMT and tile programming; they coexist. When SIMT is needed, programmers write kernels as they normally do. When they want to operate using tensor cores, programmers write tile kernels.
Figure 2 shows a high-level diagram of how CUDA Tile fits into a representative software stack, and how the tile path exists as a separate but complementary path to the existing SIMT path.
CUDA Tile IR is one layer beneath where a vast majority of programmers will interface with tile programming. Unless a compiler or library is being written, programmers probably won’t need to concern themselves with the details of the CUDA Tile IR software.
- NVIDIA cuTile Python: Most developers will interface with CUDA Tile programming through software like NVIDIA cuTile Python, an NVIDIA Python implementation that uses CUDA Tile IR as the back end. We have a blog post that explains how to use cuTile Python with links to sample code and documentation.
- CUDA Tile IR: For developers looking to build their own DSL compiler or library, CUDA Tile IR is where programmers interface with CUDA Tile. The CUDA Tile IR documentation and specification include information on the CUDA Tile IR programming abstractions, syntax, and semantics. If programmers are writing a tool/compiler/library that currently targets PTX, then they can adapt their software to also target CUDA Tile IR.


