This is the website of Simon Boehm. Previously, I was a compiler engineering intern at AMD, where I worked on deep learning accelerators & MLIR compilers. Earlier this year, I graduated from ETH Zurich with a CS MSc.
Data-Parallel Distributed Training of Deep Learning Models (September 8, 2022)
In this post, I want to have a look at a common technique for distributing model training: data parallelism. It allows you to train your model faster by replicating the model among multiple compute nodes and dividing the dataset among them. Data parallelism works particularly well for models that are very parameter-efficient (meaning a high ratio of FLOPs per forward pass to number of parameters), like CNNs. At the end of the post, we'll look at some code for implementing data parallelism efficiently, taken from my tiny Python library ShallowSpeed.
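To make the core idea concrete, here is a minimal sketch of a data-parallel update step, written with mpi4py to average gradients across workers. It illustrates the technique only and is not ShallowSpeed's actual API; the function and variable names are made up for this example.

```python
# Minimal sketch of data-parallel SGD: every worker holds a full model
# replica, computes gradients on its own shard of the batch, and the
# gradients are averaged across all workers before the update.
# Hypothetical illustration only; not ShallowSpeed's API.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def data_parallel_step(params, local_grads, lr=0.01):
    avg_grads = []
    for g in local_grads:
        # Sum this gradient across all workers, then divide by world size.
        summed = np.empty_like(g)
        comm.Allreduce(g, summed, op=MPI.SUM)
        avg_grads.append(summed / comm.Get_size())
    # Every worker applies the identical averaged update, so the model
    # replicas stay in sync without ever exchanging parameters.
    return [p - lr * g for p, g in zip(params, avg_grads)]

if __name__ == "__main__":
    params = [np.zeros(3)]
    grads = [np.ones(3) * comm.Get_rank()]  # each worker's local gradients
    print(data_parallel_step(params, grads))
```

Launched with e.g. `mpirun -n 4 python sketch.py`, every rank prints the same updated parameters, which is exactly the invariant data parallelism maintains.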
Fast Multidimensional Matrix Multiplication on CPU from Scratch (August 14, 2022)
Numpy can multiply two 1024x1024 matrices on a 4-core Intel CPU in ~8ms. This is incredibly fast, considering that it boils down to 18 FLOPs / core / cycle, with a cycle taking a third of a nanosecond. Numpy does this using a highly optimized BLAS implementation. (BLAS is short for Basic Linear Algebra Subprograms: libraries providing fast implementations of, e.g., matrix multiplication or dot products. Some are tailored to one specific family of CPUs, like Intel's MKL or Apple's Accelerate, but non-vendor-specific implementations like OpenBLAS are also available.) How hard is it to recreate roughly similar performance using plain C++?
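If you want to check the baseline on your own machine, a rough timing harness like the one below works; the exact numbers depend on your CPU, thread count, dtype, and which BLAS build NumPy links against.

```python
import time
import numpy as np

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A @ B  # warm-up, so one-time setup costs aren't measured
start = time.perf_counter()
runs = 20
for _ in range(runs):
    A @ B
elapsed = (time.perf_counter() - start) / runs

# A square matmul of size N performs 2 * N^3 floating-point operations.
flops = 2 * N**3
print(f"{elapsed * 1e3:.2f} ms per matmul, {flops / elapsed / 1e9:.1f} GFLOP/s")
```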
Becoming a Better Programmer by Tightening Feedback Loops (May 12, 2022)
I'm interested in strategies for improving deliberately and continuously as a programmer. (I wrote up this post as a rough working note, to get thoughts on it from others.) I've thought about this on and off for the last two years and have talked to ~25 experienced programmers about it. Mostly, it feels like "programmer training" is not a topic that is taken very seriously, probably because the skill is hard to quantify. Since there are no established strategies, the potential returns to thinking about this topic are all the greater.
lleaves - Compiling Decision Trees for Fast Prediction using LLVM (September 20, 2021)
Gradient-boosted decision trees are a commonly used machine learning algorithm that performs well on real-world tabular datasets. There are many libraries available for training them, most commonly LightGBM and XGBoost. Sadly, few of the popular libraries are optimized for fast prediction & deployment. As a remedy, I spent the last few months building lleaves, an open-source decision tree compiler and Python package.
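Usage follows a compile-once, predict-many pattern. The sketch below stays close to the project's README; the model path and input data are placeholders, and the feature count must match whatever model you trained.

```python
import lleaves
import numpy as np

# Load a tree ensemble trained with LightGBM and dumped to a text file.
# "model.txt" is a placeholder path for this example.
llvm_model = lleaves.Model(model_file="model.txt")

# Compile the ensemble to native machine code via LLVM (one-time cost).
llvm_model.compile()

# Predict; the input's second dimension must equal the model's feature count
# (5 here is an arbitrary placeholder).
data = np.random.rand(10, 5)
predictions = llvm_model.predict(data)
```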
René Girard & Mimetic Theory for Non-Philosophers (May 27, 2020)
Mimetic theory is a simple but immensely powerful concept. It explains how humans learn, why laws exist, and why too many people want to go into finance. The idea was developed by René Girard, a French philosopher, member of the Académie Française, and professor at Stanford. In the last few years, independent of Girard's research, studies into imitation, the formation of desire, and mirror neurons have been published that bring forward empirical justification for the theory. Let's start by looking at the core concept of mimetic theory: imitative desire. (This is the primer I would have wanted to read before diving into the primary literature, which is eye-opening but can be dense.)
The Normalizing Flow Network (August 8, 2019)
The Normalizing Flow Network (NFN) is a normalizing-flow-based regression model that is great at modelling complex conditional densities. Have a look at our recent paper on noise regularization for conditional density estimation for results from using the NFN on real-world and benchmark regression datasets.
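Every normalizing flow rests on the change-of-variables formula: push a simple base density through an invertible map and correct by the Jacobian determinant. Here is a minimal one-dimensional sketch using an affine flow, chosen purely for illustration; the NFN itself composes more expressive flows.

```python
import numpy as np

def affine_flow_logpdf(x, shift, log_scale):
    """log p_x(x) for x = z * exp(log_scale) + shift, with z ~ N(0, 1).

    Change of variables: log p_x(x) = log p_z(f^{-1}(x)) - log|df/dz|.
    """
    z = (x - shift) * np.exp(-log_scale)        # invert the flow
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))  # standard-normal log-density
    return log_pz - log_scale                   # subtract log|det Jacobian|

# For conditional density estimation, shift and log_scale become functions
# of the conditioning input; fitting them by maximizing this log-likelihood
# is the basic training objective.
print(affine_flow_logpdf(x=1.0, shift=0.5, log_scale=np.log(2.0)))
```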
Less important posts
A List of my Favorite Tools (May 29, 2022)
A list of (mostly software) tools that I use more than once a week, and that help speed up my work. I recommend most things on this list to friends so often that I decided to write them up.
A Local Search Engine (April 30, 2021)
A tool for searching through every document I've ever read, locally and within seconds.