Distinguished Engineer, NVIDIA
Assistant Professor, Carnegie Mellon University
Tianqi Chen is currently an Assistant Professor in the Machine Learning Department and the Computer Science Department at Carnegie Mellon University. He is a Distinguished Engineer at NVIDIA. He received his PhD from the Paul G. Allen School of Computer Science & Engineering at the University of Washington. He has created several widely adopted machine learning systems, including XGBoost, TVM, and MLC-LLM.
Building ML Systems Foundations in the Age of AI
We are currently living in an exciting era for AI, where machine learning systems and infrastructure are crucial for training and deploying efficient AI models. The modern machine learning systems landscape is rich with diverse components, including compilers, libraries, DSLs, frameworks, and coding agents. In this talk, I will explore how we can build a common foundation that enables interoperability across these components. We will also discuss our experience bringing foundation models to both edge and cloud through machine learning compilation. Finally, we will touch on how to build a virtuous cycle in which AI itself is used in the ML systems production flow.
Chair for Compiler Construction, TU Dresden
As Chair for Compiler Construction at TU Dresden, Jeronimo Castrillon works at the intersection of programming languages, compilers, and computer architecture. His group develops tools and abstractions that make complex, heterogeneous hardware accessible to developers—bridging the gap between high-level software design and efficient hardware execution.
Compilers for In-Memory Computing Systems
Fueled by exciting advances in materials and devices, in-memory computing architectures now represent a promising avenue for advancing computing systems. Many manual designs have already demonstrated orders-of-magnitude improvements in compute efficiency over classical von Neumann architectures across different application domains. In this talk we discuss automation flows for programming in-memory architectures and exploring their parameter space. We report on current efforts to build an extensible framework around the MLIR compiler infrastructure that abstracts from individual technologies to foster reuse. Concretely, we present optimising flows for in-memory accelerators based on crossbars, content-addressable memories, and bulk bitwise logic operations. We believe this kind of automation is key to navigating the heterogeneous landscape of in-memory accelerators more quickly and to bringing the benefits of emerging architectures to a broader range of applications.
Freelance Software Developer & Mojo Champion
Maxim Zaks is a freelance software developer and Mojo Champion contributing to the Mojo standard library and core ecosystem. He authors language enhancement proposals, explores compiler and performance optimizations, and actively supports the Mojo developer community. Maxim regularly speaks at technical meetups and conferences, sharing insights on language design, systems programming, and high-performance computing with Mojo.
Solving the Multi-Platform Problem with Mojo
AI workloads push programming languages to their limits: developers need low-level control for performance, high-level ergonomics for productivity, and seamless portability across heterogeneous platforms. Mojo is a new systems programming language designed to address these challenges by combining Python interoperability with modern compilation techniques. In this talk, we’ll dive into how Mojo enables multi-platform targeting through conditional compilation, and how MLIR ops can be embedded directly into libraries to unlock performance-critical paths. I’ll illustrate these ideas with concrete examples from the Mojo standard library and the MAX open-source codebase, showing how Mojo helps unify the fragmented AI software stack.
Professor of Computer Architecture, TU Wien
He is a Full Professor of Computer Architecture at the Institute of Computer Engineering, TU Wien Informatics. Before joining TU Wien, he led a research group at the Chair of Electronic Design Automation at TU Munich.
His research focuses on Electronic System Level (ESL) design, RISC-V domain-specific architectures, tinyML and embedded ML compiler toolchains, as well as functional safety and hardware security. He is a Senior Member of IEEE and an active contributor to the RISC-V community.
Graph-Level Tiling, Operator Patching, and Fusing for Distributed, Memory-Optimized, and Fault-Tolerant TinyML Deployment
A new generation of AI-enhanced microcontrollers now delivers performance in the hundreds of GOPS, but their inherently low-cost, low-power design still limits on-chip SRAM and ROM to just a few megabytes. As a result, memory capacity remains a central challenge when deploying TinyML models onto these devices. Several techniques—such as pruning and compression—have been introduced to reduce peak memory consumption, and operator tiling and fusion have proven particularly effective for generating memory-aware buffer layouts and execution schedules.
Beyond single-device optimization, tiling and fusing can also be leveraged to aggregate memory across multiple microcontrollers in distributed inference settings. Furthermore, operator patching using checksum-based methods enables modification of the dataflow graph for fault-tolerant execution.
In this talk, we present an overview of graph-level tiling and fusion techniques and their roles in distributed, memory-optimized, and fault-tolerant TinyML deployment. We also introduce an ONNX-based library that integrates these methods directly at the dataflow-graph level, simplifying their adoption in practical toolchains.
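As a back-of-the-envelope illustration of why tiling and fusion matter for peak memory (a toy sketch only, not part of the ONNX-based library described above; all shapes, tile sizes, and helper names are invented, and the halo regions real convolution tiling needs are ignored), the following compares layer-by-layer execution against a depth-first tiled/fused schedule for a small operator chain:

```python
# Toy comparison of peak activation memory: layer-by-layer execution vs. a
# depth-first tiled/fused schedule over a small chain of operators.
# (height, width, channels) of each feature map along the chain.
FEATURE_MAPS = [(64, 64, 16), (64, 64, 16), (64, 64, 32), (64, 64, 32)]
BYTES_PER_ELEM = 1  # int8 activations


def tensor_bytes(shape):
    h, w, c = shape
    return h * w * c * BYTES_PER_ELEM


def peak_memory_layerwise(fmaps):
    """Each operator materializes its full input and full output buffer."""
    return max(
        tensor_bytes(fmaps[i]) + tensor_bytes(fmaps[i + 1])
        for i in range(len(fmaps) - 1)
    )


def peak_memory_tiled(fmaps, tile_rows):
    """Fused chain executed tile by tile: only a `tile_rows`-high slice of
    every feature map is live at once (halo/overlap regions ignored)."""
    def tile_bytes(shape):
        h, w, c = shape
        return min(tile_rows, h) * w * c * BYTES_PER_ELEM

    return sum(tile_bytes(s) for s in fmaps)


if __name__ == "__main__":
    print("layer-wise peak :", peak_memory_layerwise(FEATURE_MAPS), "bytes")
    print("tiled (8 rows)  :", peak_memory_tiled(FEATURE_MAPS, 8), "bytes")
```

Even in this simplified model the tiled schedule keeps only thin slices of the intermediates resident, which is the effect the memory-aware buffer layouts and execution schedules above exploit.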
CTO, Roofline.ai
As CTO of Roofline.ai, Maximilian Bartel drives innovation in AI performance engineering. His work brings together deep insights from compilers, hardware, and AI to help developers understand and optimize the efficiency of modern machine learning workloads.
Heterogeneous Execution of Accelerators Using IREE
Edge accelerators are rarely designed to run entire networks end-to-end. They handle the compute-heavy parts: convolutions, matrix multiplies, attention blocks. And even when an accelerator can run all of today's models, it'll be on the market for years. Customers will want to run tomorrow's architectures on yesterday's silicon.
The only realistic path forward is heterogeneous execution: run what you can on the NPU, fall back to the GPU or CPU for the rest. The problem is that most toolchains handle this as an afterthought. LiteRT's delegate mechanism and ONNX Runtime's execution providers use synchronous APIs by default, introducing sync points at every handoff between devices. This overhead can easily eat into the speedup you were hoping to get from the accelerator in the first place.
In this talk, I'll present our approach built on the IREE compiler and runtime. Unlike runtime-driven fallback, IREE's compiler generates execution schedules with explicit async semantics baked in. This lets us overlap compute and data movement across CPU and NPU without either device waiting on the other. I'll walk through how we achieve this, what tradeoffs come up (latency vs. throughput, memory pressure, partitioning heuristics), and some open questions we're still wrestling with.
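To make the scheduling idea concrete without tying it to IREE's actual APIs, here is a minimal, purely conceptual Python sketch (the device functions and timings are stand-ins) of why a pipelined schedule beats one with a sync point at every handoff:

```python
# Conceptual sketch only -- this does not use IREE's APIs. A pipelined
# schedule lets the transfer of chunk i+1 overlap the compute of chunk i,
# instead of serializing every host/accelerator handoff.
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_to_npu(chunk):
    time.sleep(0.01)          # stand-in for DMA / host-to-device copy
    return chunk


def npu_compute(chunk):
    time.sleep(0.02)          # stand-in for the accelerator's work
    return chunk * 2


def run_synchronous(chunks):
    # Every handoff is a sync point: transfer, wait, compute, wait, repeat.
    return [npu_compute(transfer_to_npu(c)) for c in chunks]


def run_pipelined(chunks):
    # Issue the next transfer while the current chunk is being computed.
    results = []
    with ThreadPoolExecutor(max_workers=1) as copy_engine:
        pending = copy_engine.submit(transfer_to_npu, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()
            pending = copy_engine.submit(transfer_to_npu, nxt)
            results.append(npu_compute(ready))
        results.append(npu_compute(pending.result()))
    return results


if __name__ == "__main__":
    data = list(range(16))
    for fn in (run_synchronous, run_pipelined):
        start = time.perf_counter()
        fn(data)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

In the real system the compiler emits this overlap as explicit async semantics in the execution schedule, rather than relying on runtime threads as the toy above does.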
Senior AI/ML Compiler Engineer, NXP Semiconductors
Moritz is a Senior AI/ML Compiler Engineer at NXP Semiconductors, where he drives the development of NXP's next-generation compiler-based AI/ML deployment tools. These will enable customers to deploy their AI models from any framework to the full range of NXP's portfolio, spanning MCUs, MPUs, and discrete NPUs, building on the latest open-source technology and ecosystem solutions.
Professor, INSA Hauts-de-France and CNRS
Prof. Smail Niar, INSA Hauts-de-France/Université Polytechnique Hauts-de-France (UPHF) & CNRS, received his PhD in Computer Engineering from the University of Lille (France) in 1990. Since then, he has been a professor at UPHF and INSA Hauts-de-France. He is a member of the computer science department at the “Laboratory of Automation, Mechanical and Computer Engineering”, a joint research unit of CNRS and UPHF/INSA. His research interests are AI/ML-based embedded systems, autonomous transportation systems, HPC, and edge computing.
Hardware-Aware AI: Bridging Model Design, Compilers, and Edge Deployment
Deploying deep learning (DL) on resource-constrained edge devices requires hardware-aware and highly efficient solutions. This is particularly challenging due to the high computational and memory cost of standard convolution layers in modern Convolutional Neural Networks (CNNs). In my talk, I will present state-of-the-art approaches for efficient DL deployment using Hardware-Aware Neural Architecture Search (HW-NAS), extended with compiler-integrated convolution-level co-optimization. The talk will focus on three complementary strategies (the first is illustrated with a toy sketch after the list):
1. Surrogate Models and ML4ML for Fast Exploration
2. Model Compression and Dynamic NAS
3. Compiler-Integrated Convolution Search (CONAS)
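As a toy illustration of the first strategy (everything below, from the architecture encoding to the fake latency measurement and the k-NN predictor, is invented for the example and is not the surrogate model from the talk), a cheap predictor fitted on a handful of measured configurations can rank a much larger candidate pool without deploying each one:

```python
# Toy surrogate-guided exploration: measure a few candidates "on device",
# fit a cheap predictor, then rank a large pool with the predictor alone.
import random

random.seed(0)


def sample_arch():
    """An 'architecture' here is just (depth, width multiplier, kernel size)."""
    return (random.randint(4, 20),
            random.choice([0.25, 0.5, 1.0]),
            random.choice([3, 5, 7]))


def measure_latency_ms(arch):
    """Stand-in for an on-device measurement (the expensive step)."""
    depth, width, k = arch
    return depth * width * (k ** 2) * 0.05 + random.gauss(0, 0.1)


def knn_predict(train, arch, k=3):
    """Tiny k-nearest-neighbour surrogate over architecture features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda t: dist(t[0], arch))[:k]
    return sum(lat for _, lat in nearest) / k


# 1. Measure a small training set (the only expensive part).
train = [(a, measure_latency_ms(a)) for a in (sample_arch() for _ in range(20))]

# 2. Rank a large candidate pool purely with the surrogate.
pool = [sample_arch() for _ in range(2000)]
best = min(pool, key=lambda a: knn_predict(train, a))
print("predicted-best arch:", best,
      "predicted latency:", round(knn_predict(train, best), 2), "ms")
```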
Research Engineer, CEA
Iryna de Albuquerque Silva is a research engineer at CEA, in France. Her research interests include embedded AI, computer architecture, and compilation techniques, with a particular focus on how software-hardware co-design can support predictable and resource-efficient deployment of intelligent workloads on constrained platforms. She previously worked on the certification of machine-learning-based applications for safety-critical real-time embedded systems, the topic of her PhD.
Aidge is a comprehensive, open-source, and collaborative platform hosted by the Eclipse Foundation, designed to support the entire AI lifecycle from model design to optimized deployment. Specifically tailored for embedded systems and edge devices, Aidge provides a rich ecosystem of tools and methodologies that enable efficient analysis, optimization, validation, and deployment of AI models in resource-constrained environments. The platform's development is supported by the DeepGreen (France 2030) and NEUROKIT2E (Europe ChipsJU) projects, with continuous enrichment and validation through collaborations with both industrial and academic partners.
Member of Technical Staff, Fractile
Perry Gibson is an ML Compiler Engineer at Fractile, an AI inference hardware startup, where he brings a compiler-centric perspective to the hardware–software co-design challenges the company is addressing. He completed his PhD in Across-stack DNN Acceleration at the University of Glasgow's gicLAB under Dr José Cano Reyes, exploring the cross-domain interactions of machine learning, software, and hardware techniques to improve performance and efficiency. At Fractile, he balances his time between contributing to design-space exploration activities and building out Fractile's production software stack.
The Compiler Before The Horse: Design Space Exploration at Fractile
As the demands of large-scale DNN model deployments grow, innovation increasingly depends on tight coupling between hardware and software from the very beginning. Fractile, a UK startup building rack-scale accelerators for high-performance AI inference, adopted a strategy of developing functional compilers before the underlying architecture was fully fixed. This approach has contributed to our rapid design-space exploration, early validation of hardware ideas, and a fast feedback loop across teams.
This talk describes how early functional simulation, Python-based prototype compilers, and cross-disciplinary collaboration helped the team evaluate architectural concepts long before the arrival of silicon. It also covers how priorities shift as the architecture matures: moving from flexible exploratory tooling toward robust, scalable compiler infrastructure; refining IRs to capture nuances of the memory hierarchy and other concerns; and balancing “don’t be weird” software principles with the need to support novel, product-defining hardware capabilities such as in-memory compute.
Attendees will hear how Fractile’s hardware-software co-design process reduced iteration time, clarified trade-offs and bottlenecks, and helped drive key architectural decisions, along with a look at some of the compiler challenges and opportunities we have explored and those we still face.
Uladzislau graduated from Belarusian State University of Informatics and Radioelectronics in 2019. As a Senior Software Engineer at Huawei Kopernik RC, he leverages over five years of experience in game development and systems engineering. He now drives low-level optimizations for Ascend NPUs and advances heterogeneous computing, focusing on maximizing the efficiency of AI hardware infrastructure.
Hardware-Affine Compiler Ecosystem Optimization for Ascend
This talk presents SGLang as a high-level inference framework and its systematic optimizations for NPU-based deployment, including key capabilities such as KV cache management, expert parallelism, PD disaggregation, and quantization, as well as cross-platform optimizations like DP-Attention and speculative decoding. As high-performance inference increasingly relies on fused operators, differences in operator implementations and interfaces across hardware architectures have become a major challenge for unified cross-platform optimization. To address this issue, we explore and adopt a torch.compile–based approach, leveraging the inductor backend to achieve efficient NPU-oriented code generation while preserving a unified operator-level model representation. The talk further introduces our concrete work within the Triton/Inductor ecosystem, including fused operator design for CV and matrix computations, the integration of the CATLASS template library with automatic tuning, support for graph mode and sparse DMA, and enabling triton-distributed through aclshmem.
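For readers unfamiliar with the mechanism, the framework-side entry point looks roughly like the sketch below; this is generic torch.compile usage with the stock Inductor backend, not the Ascend-specific code-generation path or the fused operators described in the talk:

```python
# Generic torch.compile usage: the model stays a single operator-level graph,
# and the Inductor backend generates fused kernels for the selected device.
import torch


class SmallMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(256, 512),
            torch.nn.GELU(),
            torch.nn.Linear(512, 256),
        )

    def forward(self, x):
        return self.net(x)


model = SmallMLP().eval()
# "inductor" is the default backend; a hardware vendor can plug its own
# code-generation path in behind it while this call site stays unchanged.
compiled = torch.compile(model, backend="inductor")

with torch.no_grad():
    out = compiled(torch.randn(8, 256))
print(out.shape)
```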