This is the website of Simon Boehm. I work on making code run faster at the Astera Institute. If you like these posts, subscribe to the mailing list.
How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others. You can download the code for all kernels from GitHub. Also check out wangzyon’s repo, from which I copied the benchmarking setup. This post is less polished than my normal uploads, and includes many more sidenotes. I used it as a notepad for ideas and scribbles while writing the kernels. That’s why I called it a worklog :)
Pipeline-Parallelism: Distributed Training via Model Partitioning
Pipeline parallelism makes it possible to train large models that don’t fit into a single GPU’s memory. Example: Huggingface’s BLOOM model is a 175B parameter Transformer model. Storing the weights as bfloat16 requires 350GB, but the GPUs they used to train BLOOM ‘only’ have 80GB of memory, and training requires much more memory than just loading the model weights. So their final training was distributed across 384 GPUs. This is made possible by assigning different layers of the model to different GPUs, a process called model partitioning. Implemented naively, model partitioning results in low GPU utilization. In this post, we’ll first discuss the naive implementation of pipeline parallelism and some of its problems. Then, we’ll talk about GPipe and PipeDream, two more recent algorithms that alleviate some of the issues with naive pipeline parallelism.
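The memory arithmetic in this excerpt can be checked in a few lines. This is only a rough sketch using the numbers quoted above. The `naive_utilization` helper is hypothetical and assumes that every pipeline stage takes equal time and that only one stage is active at any moment, which is the simplest model of a naive implementation.

```python
import math

# Numbers quoted in the excerpt above.
n_params = 175e9        # model parameters
bytes_per_param = 2     # bfloat16 uses 2 bytes per value
gpu_memory_gb = 80      # memory of one GPU

# Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes).
weights_gb = n_params * bytes_per_param / 1e9

# Lower bound on the number of GPUs needed to hold the weights alone.
# Real training needs far more (gradients, optimizer state, activations).
min_gpus_for_weights = math.ceil(weights_gb / gpu_memory_gb)


def naive_utilization(n_stages: int) -> float:
    """Fraction of time each GPU is busy when only one stage runs at a time.

    Hypothetical helper: assumes all stages take the same amount of time.
    """
    return 1.0 / n_stages


print(f"weights: {weights_gb:.0f} GB")
print(f"GPUs needed for weights alone: {min_gpus_for_weights}")
print(f"naive utilization with 4 stages: {naive_utilization(4):.0%}")
```

Under these assumptions, adding more stages makes each GPU idle for a larger share of the time, which is the utilization problem the excerpt refers to.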
Data-Parallel Distributed Training of Deep Learning Models
In this post, I want to have a look at a common technique for distributing model training: data parallelism. It allows you to train your model faster by replicating the model among multiple compute nodes, and dividing the dataset among them. Data parallelism works particularly well for models that are very parameter efficient (meaning a high ratio of FLOPs per forward pass to number of parameters), like CNNs. At the end of the post, we’ll look at some code for implementing data parallelism efficiently, taken from my tiny Python library ShallowSpeed.
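To make the idea concrete, here is a minimal single-process sketch of data parallelism in NumPy. It is not ShallowSpeed's API. The linear model, the `mse_grad` helper, and the worker count are all assumptions chosen for illustration. Each simulated worker holds the same weights, computes a gradient on its own equally sized shard of the batch, and the gradients are then averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # a batch of 8 samples with 3 features
y = rng.normal(size=8)        # regression targets
w = rng.normal(size=3)        # model weights, replicated on every worker


def mse_grad(w, X, y):
    """Gradient of mean((X @ w - y) ** 2) with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


# Divide the batch into equally sized shards, one per simulated worker.
n_workers = 4
X_shards = np.split(X, n_workers)
y_shards = np.split(y, n_workers)

# Each worker computes a gradient on its local shard only.
local_grads = [mse_grad(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

# Averaging the local gradients stands in for the communication step
# that a real multi-node setup would perform.
avg_grad = np.mean(local_grads, axis=0)

# With equal shard sizes, this matches the gradient on the full batch.
full_grad = mse_grad(w, X, y)
print(np.allclose(avg_grad, full_grad))
```

Because the averaged gradient equals the full-batch gradient when shards are the same size, every replica can apply the same update and the weights stay in sync.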
Fast Multidimensional Matrix Multiplication on CPU from Scratch
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with a cycle taking a third of a nanosecond. Numpy does this using a highly optimized BLAS implementation. BLAS is short for Basic Linear Algebra Subprograms. These are libraries providing fast implementations of, e.g., matrix multiplications or dot products. They are sometimes tailored to one specific (family of) CPUs, like Intel’s MKL or Apple’s Accelerate. However, non-vendor-specific implementations like OpenBLAS are also available. How hard is it to recreate performance that’s roughly similar using plain C++?
Becoming a Better Programmer by Tightening Feedback Loops
I’m interested in strategies to improve deliberately and continuously as a programmer. I wrote up this post as a rough working note to get thoughts on it from others. I’ve thought about this on and off for the last two years, and have talked to ~25 experienced programmers about it. Mostly, it feels like “programmer training” is not a topic that is taken very seriously, probably because this skill is hard to quantify. As there are no established strategies, the potential returns to thinking about this topic increase.
lleaves - Compiling Decision Trees for Fast Prediction using LLVM
Gradient-boosted decision trees are a commonly used machine learning algorithm that performs well on real-world tabular datasets. There are many libraries available for training them, most commonly LightGBM and XGBoost. Sadly, few of the popular libraries are optimized for fast prediction and deployment. As a remedy, I spent the last few months building lleaves, an open-source decision tree compiler and Python package.
The Normalizing Flow Network
The Normalizing Flow Network (NFN) is a normalizing-flow based regression model, great at modelling complex conditional densities. Look at our recent paper on noise regularization for conditional density estimation for some results of using the NFN on real-world and benchmark regression datasets.
Here I’ll explain the structure of the NFN and go through some of the math. Implementations can be found in our Open-Source Python package as well as in my repo.
Less Readworthy Posts
A List of my Favorite Tools
A list of (mostly software) tools that I use more than once a week, and that help speed up my work. I recommend most things on this list to friends so often that I decided to write them up.
A Local Search Engine
A tool for searching through every document I've ever read, locally and within seconds.
René Girard & Mimetic Theory for Non-Philosophers
Mimetic theory is a simple but immensely powerful concept. It explains how humans learn, why laws exist, and why too many people want to go into finance. The idea was developed by René Girard, a French philosopher, member of the Académie Française and professor at Stanford. In the last few years, independent of Girard’s research, studies into imitation, formation of desire, and mirror neurons have been published that bring forward empirical justification for the theory. Let’s start by looking at the core concept of mimetic theory: imitative desire. This is the primer I would have wanted to read before diving into the primary literature, which is eye-opening but can be dense.