Dr. Stephen Neuendorffer, Fellow, AMD
Abstract
Given the significant changes in technology scaling, innovation in computer architecture has become increasingly important for continued performance improvements. A key aspect of these new architectures is how they handle data movement between physical memories. This talk will present an overview of our work at AMD developing programming models and compilers suited to these new ML accelerator architectures. These tools leverage MLIR concepts to describe data movement explicitly while connecting these low-level concepts to higher-level programming models such as PyTorch.
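As a rough, illustrative sketch of this connection (using PyTorch's standard torch.fx graph capture; it does not reflect AMD's actual tooling), the compiler's starting point is a captured graph of the model, and the lowering then decides which buffers live in which physical memory and when data moves between them:

    import torch
    import torch.fx as fx

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(64, 64)

        def forward(self, x):
            return torch.relu(self.linear(x))

    # Capture the high-level model as a graph of operations. A backend
    # compiler would walk this graph and lower each node into explicit
    # transfers between physical memories plus compute on accelerator
    # cores; that lowering step is not shown here.
    graph_module = fx.symbolic_trace(TinyModel())
    print(graph_module.graph)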
Dr. Jan Moritz Joseph, CEO, Roofline.ai
Abstract
Local execution of AI models is critical for many applications that are time-critical or data-sensitive, yet deploying these models presents significant challenges. One major issue is the inability to easily switch between target platforms, as specific software features may not be compatible across systems. Certain AI algorithms cannot be fully executed if their layers are not supported by conventional deployment methods. Addressing these challenges is essential for effective AI deployment.
In this talk, we will present Roofline’s flexible SDK, which targets multiple hardware platforms, including CPUs, GPUs, and NPUs, by compiling models ahead of time. It is also simple to use from Python, making model deployment accessible to ML developers with limited embedded expertise. Our approach reduces both memory footprint and model latency for optimized execution.
We will explain how the flexibility of our solution is achieved by leveraging existing and new compiler abstractions from MLIR. Efficiency will be shown through case studies on platforms like the Raspberry Pi, executing AI models on GPUs and CPUs. We will compare our results against state-of-the-art solutions such as TorchInductor and TFLite, showing 2-5x improvements in memory footprint or latency. Additionally, we will showcase support for new GPU platforms that have not yet been used for GPGPU/AI tasks. This will be underlined by coverage of the most popular models on Hugging Face and in other model zoos.
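As a purely hypothetical sketch of the shape of such an ahead-of-time workflow from Python (the names below are invented for illustration and are not Roofline's SDK API), the model is compiled once on a development machine, shipped as a compact artifact, and executed on the device by a small runtime:

    import pickle

    class CompiledModel:
        """Stand-in for an AOT-compiled artifact for a CPU/GPU/NPU backend."""
        def __init__(self, weights, target):
            self.weights = weights
            self.target = target

        def run(self, x):
            # A real artifact would dispatch pre-generated kernels on the
            # selected backend; here we apply a toy affine model instead.
            w, b = self.weights
            return [w * v + b for v in x]

    def compile_ahead_of_time(weights, target="cpu"):
        # Compilation happens offline, so no ML framework is needed on the
        # target device at inference time.
        with open("model.bin", "wb") as f:
            pickle.dump(CompiledModel(weights, target), f)

    def load_and_run(inputs):
        with open("model.bin", "rb") as f:
            return pickle.load(f).run(inputs)

    compile_ahead_of_time(weights=(2.0, 1.0), target="npu")
    print(load_and_run([1.0, 2.0, 3.0]))  # [3.0, 5.0, 7.0]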
Sean Silva, Principal Engineer, EnChargeAI
Abstract
While AI accelerator hardware can easily be compared on the basis of datasheet specs like TOPS or TOPS/W, the practical performance of AI accelerators is just as strongly determined by the ability to utilize them effectively. This talk is informed by multiple AI hardware bring-up efforts over the years and provides simple mental models, war stories, and practical advice to ensure your hardware performs well in a production setting.
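A back-of-the-envelope calculation illustrates why utilization matters as much as the headline spec; the figures below are invented for the example:

    peak_tops = 100.0          # datasheet peak (TOPS)
    ops_per_inference = 8e9    # operations per forward pass (8 GOP)
    measured_latency_s = 2e-3  # measured end-to-end latency per inference

    achieved_tops = ops_per_inference / measured_latency_s / 1e12
    utilization = achieved_tops / peak_tops
    print(f"achieved: {achieved_tops:.1f} TOPS, utilization: {utilization:.0%}")
    # achieved: 4.0 TOPS, utilization: 4%
    # An accelerator with half the peak but 50% utilization would deliver
    # 25 TOPS on the same workload, more than 6x faster in practice.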
Prof. Dr. Tobias Grosser, Associate Professor at The University of Cambridge
Abstract
The wide adoption of Deep Neural Networks and the resulting desire for more hardware resources have fueled the rapid development of innovative custom hardware accelerators that are increasingly difficult to program. Many proposed hardware designs are only evaluated with hand-written microkernels, and the few evaluated on entire neural networks typically require significant investments in building the necessary software stacks. Highly sophisticated neural network compilers emerged that assemble full DNNs out of expert-written microkernels, but they were traditionally hand-crafted for each platform, which prevented both scaling and integration with industry-supported compilation flows.
We present Quidditch, a novel neural network compiler and runtime that provides an end-to-end workflow from a high-level network description to high-performance code running on ETH Occamy, one of the first chiplet-based AI research hardware accelerators. Quidditch builds on IREE, an industry-strength AI compiler and runtime focused on GPUs. Quidditch imports NNs from PyTorch, JAX, and TensorFlow and offers optimisations such as memory and multi-level concurrency-guided tiling and asynchronous memory transfers to scratchpad. We pair this with a high-performance microkernel generator, which enables us to run full DNNs with full FPU occupancy and a more than 10x speed-up over IREE’s generic LLVM backend on our custom hardware accelerator. The microkernel compiler is implemented entirely in MLIR dialects and is based on a progressive lowering approach that preserves information from the high-level input down to assembly code generation. By providing key building blocks for scaling AI accelerator compilation to full neural networks, we aim to accelerate the evaluation of custom AI hardware and, as a result, AI hardware development overall.
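To make the role of asynchronous scratchpad transfers concrete, the following minimal Python sketch shows the double-buffering pattern such a schedule relies on; it illustrates the idea only and is not Quidditch's implementation (on real hardware the transfer and the compute run concurrently rather than back to back):

    import numpy as np

    def microkernel(tile):
        # Stand-in for a high-performance microkernel running out of scratchpad.
        return tile * 2.0

    def run_double_buffered(tiles):
        results = []
        current = np.copy(tiles[0])  # "DMA in" the first tile up front
        for i in range(len(tiles)):
            # Issue the transfer for the next tile ...
            nxt = np.copy(tiles[i + 1]) if i + 1 < len(tiles) else None
            # ... while computing on the tile already resident in scratchpad.
            # The compiler's schedule overlaps these two steps so the FPUs
            # never stall waiting for memory.
            results.append(microkernel(current))
            current = nxt
        return results

    tiles = [np.full((4, 4), i, dtype=np.float32) for i in range(3)]
    print([float(r[0, 0]) for r in run_double_buffered(tiles)])  # [0.0, 2.0, 4.0]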
Ananda Samajdar, Research Staff Member, IBM
Abstract
Over the last decade, AI has been changing the face of computing and our daily lives at an unimaginable pace and magnitude. The walk from the humble AlexNet, to transformers, to the gargantuan large language models of today has been a constant source of challenge and excitement for computer scientists in academia and industry alike. Researchers across the compute stack need to innovate to enable efficient training and inference. In this talk, I will present our perspective and innovations at IBM Research on algorithms, hardware acceleration, and the software stack to enable efficient execution of current and upcoming AI workloads.