Jekyll2023-10-03T19:19:02+00:00https://lyndonduong.com/feed.xmllyndonduong.comlyndon's homepagelyndon duongI defended my PhD thesis!2023-07-11T00:00:00+00:002023-07-11T00:00:00+00:00https://lyndonduong.com/phd-defense<p>Youtube recording of my PhD thesis defense at NYU Center for Neural Science.
<!--more--></p>
<iframe width="1280" height="720" src="https://www.youtube.com/embed/R50yEhZWR6w" title="Lyndon Duong PhD Defense: Adaptive Coding Efficiency & Stochastic Geometry" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>lyndon duongYoutube recording of my PhD thesis defense at NYU Center for Neural Science.Statistical whitening with 1D projections2023-03-05T00:00:00+00:002023-03-05T00:00:00+00:00https://lyndonduong.com/1d-proj-gaussian<p>Geometric intuition for <a href="https://doi.org/10.48550/arXiv.2301.11955">our recent paper</a> on statistical whitening using overcomplete bases.
<!--more-->
<a href="https://colab.research.google.com/github/lyndond/lyndond.github.io/blob/master/code/2023-03-05-1d-proj-gaussian.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
<a href="https://github.com/lyndond/lyndond.github.io/blob/master/code/2023-03-05-1d-proj-gaussian.ipynb"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open on GitHub" /></a></p>
<!-- describe statistical whitening and history -->
<p>A short Tweet thread about this paper can be found <a href="https://twitter.com/lyndoryndo/status/1632823010649423872">here</a>.</p>
<p>A very old problem in statistics and signal processing is to <strong>statistically whiten</strong> a signal, i.e. to linearly transform a signal with covariance \({\bf C}\) to one with identity covariance. The most common approach is to find the principal components of the signal (the eigenvectors of \({\bf C}\)), then scale the signal according to how much it varies along each principal axis. The downside of this approach is that if the inputs <em>change</em>, then the principal axes need to be recomputed.</p>
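<p>As a minimal sketch of this conventional approach (the toy data and variable names below are my own, not from the paper), eigenvector-based (ZCA) whitening in numpy looks like:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated 2D signal; its sample covariance plays the role of C
x = rng.standard_normal((10_000, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
C = np.cov(x.T)

# Find the principal axes (eigenvectors of C), rescale each to unit
# variance, then rotate back: the symmetric ZCA whitening matrix
evals, evecs = np.linalg.eigh(C)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
x_white = x @ W.T

C_white = np.cov(x_white.T)  # identity, up to numerical precision
```

<p>The downside noted above is visible here: if the distribution of <code class="language-plaintext highlighter-rouge">x</code> drifts, the eigenvectors must be recomputed from scratch.</p>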
<p>In our recent preprint <a href="https://arxiv.org/abs/2301.11955">(https://arxiv.org/abs/2301.11955)</a>, we introduce a <strong>completely different and novel approach</strong> to statistical whitening. We do away with finding principal components altogether, and instead develop a framework for whitening with a <em>fixed</em> <strong>frame</strong> (i.e. an overcomplete basis) using concepts borrowed from <a href="https://en.wikipedia.org/wiki/Frame_(linear_algebra)">frame theory of linear algebra</a>, and <a href="https://en.wikipedia.org/wiki/Tomography">tomography</a>, the science of reconstructing signals from projections.</p>
<p><img src="/assets/posts/1d-proj-gaussian/1d_intuition.png" alt="png" /></p>
<p>The figure above shows the geometric concepts behind our approach. It’s useful to know that we can geometrically represent densities with covariance matrices \({\bf C}\) as ellipsoids in \(N\)-dimensional space (top left panel, shaded black). <a href="https://doi.org/10.1006/cgip.1994.1012">Old work in tomography</a> has shown that ellipsoids can be represented (reconstructed) from a series of 1D projections. The 1D projected densities are plotted as vertical lines in the middle panel, with colours corresponding to the axes along which the original density was projected, and colour saturation denoting probability at a given point. It turns out that if the density is Gaussian, then \(N(N+1)/2\) projections along unique axes are <strong>necessary and sufficient</strong> to represent the original density. This number of required projections is the number of independent parameters in a covariance matrix. Importantly, the set of 1D projection axes <em>can exclude the principal components</em> of the original density, and is <strong>overcomplete</strong>, i.e. linearly dependent, since there are more than \(N\) projections.</p>
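<p>A quick numerical check of this counting argument (my own sketch, not code from the paper): for \(N=2\), the variances along \(N(N+1)/2 = 3\) distinct unit axes are linear in the three free entries of \({\bf C}\), so they pin down the covariance exactly, even though none of the axes is a principal component:</p>

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 1.0]])   # ground-truth 2D covariance

# N(N+1)/2 = 3 distinct (non-principal) projection axes
thetas = np.array([0.1, 1.0, 2.2])
axes = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

# Projected variance along each unit axis w is w^T C w
proj_var = np.einsum('ki,ij,kj->k', axes, C, axes)

# Each measurement is linear in (C11, C12, C22):
# w1^2*C11 + 2*w1*w2*C12 + w2^2*C22
A = np.stack([axes[:, 0]**2, 2*axes[:, 0]*axes[:, 1], axes[:, 1]**2], axis=1)
c11, c12, c22 = np.linalg.solve(A, proj_var)
C_hat = np.array([[c11, c12], [c12, c22]])   # recovers C exactly
```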
<p>Unlike conventional tomographic approaches, the main goal of our study isn’t to reconstruct the ellipsoid, but rather to use the information derived from its projections to <strong>whiten</strong> the original signal. The top right plot shows each 1D marginal density’s variance; notice how the variance of the 1D projections is proportional to the length of the corresponding 1D slice, and that for this non-white joint density, the variances are quite variable. Meanwhile, for a whitened signal (bottom row), <strong>all projected variances equal 1</strong>! This geometric intuition involving 1D projections of Gaussian densities forms the foundation of our framework.</p>
<p>In the <a href="https://arxiv.org/abs/2301.11955">paper</a> we show: 1) how to operationalize these geometric ideas into an optimization objective function to learn a statistical whitening transform; and 2) how to derive a recurrent neural network (RNN) that iteratively optimizes this objective, and converges to a steady-state solution where the outputs of the network are statistically white.</p>
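<p>To give a flavour of how such an iterative scheme can work (this is my own toy batch sketch, not the RNN from the paper), gains on a fixed overcomplete frame can be nudged until every projected variance equals 1:</p>

```python
import numpy as np

C = np.array([[2.0, 0.8], [0.8, 1.0]])           # input covariance (not white)
thetas = np.array([0.0, np.pi / 3, 2 * np.pi / 3])
W = np.stack([np.cos(thetas), np.sin(thetas)])   # fixed 2x3 frame; no eigenvectors

g = np.zeros(3)                                  # one gain per frame axis
for _ in range(20_000):
    M = np.linalg.inv(np.eye(2) + W @ np.diag(g) @ W.T)  # linear response map
    C_out = M @ C @ M.T                                  # output covariance
    v = np.einsum('ik,ij,jk->k', W, C_out, W)            # projected variances
    g += 0.01 * (v - 1.0)                                # grow gain where variance > 1

# At the fixed point, all projected variances are ~1, so C_out is ~identity
```

<p>Note this sketch recomputes a matrix inverse per step purely for clarity; the appeal of the recurrent formulation is that the network dynamics do this implicitly.</p>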
<p>Mechanistically, this RNN adaptively whitens a signal by scaling it according to its marginal variance along a fixed, overcomplete set of projection axes, <em>without ever</em> learning the eigenvectors of the inputs. This is an attractive solution because constantly (re)learning eigenvectors with dynamically changing inputs may pose stability issues in a network. From a theoretical neuroscience perspective, our findings are particularly exciting because they generalize <a href="https://doi.org/10.1038/35090500">well-established ideas of single-neuron adaptive efficient coding <strong>via gain control</strong></a> to the level of a neural population.</p>lyndon duongGeometric intuition for our recent paper on statistical whitening using overcomplete bases.I gave a talk on stochastic shape metrics2023-02-24T00:00:00+00:002023-02-24T00:00:00+00:00https://lyndonduong.com/mila-talk-shapes<p>My (virtual) talk on our ICLR 2023 paper at Mila Quebec Neural AI RG.
<!--more--></p>
<p><a href="https://github.com/ahwillia/netrep"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open in GitHub" /></a></p>
<p>Duong, Zhou, Nassar, Berman, Olieslagers, and Williams, “Representational dissimilarity metric spaces for stochastic neural networks”, ICLR 2023 <a href="https://arxiv.org/abs/2211.11665">arXiv:2211.11665</a>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/qOJT6gGSKzg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>lyndon duongMy (virtual) talk on our ICLR 2023 paper at Mila Quebec Neural AI RG.Python QR code generator for poster presentations2023-02-13T00:00:00+00:002023-02-13T00:00:00+00:00https://lyndonduong.com/qr-code-python<p>Code snippet for generating QR codes with transparent backgrounds.
<!--more--></p>
<p>Code for this post:
<a href="https://colab.research.google.com/github/lyndond/lyndond.github.io/blob/master/code/2023-02-13-qr-code-python.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
<a href="https://github.com/lyndond/lyndond.github.io/blob/master/code/2023-02-13-qr-code-python.ipynb"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open in GitHub" /></a></p>
<p>I wanted to put a QR code linking to an arXiv preprint on my CoSyNe poster but the online solutions all had ugly white backgrounds. So I wrote a Python snippet to generate a QR code with <strong>transparent whitespace</strong>. It creates a QR code, converts it to a numpy RGBA array, and sets the alpha channel to opaque wherever there is a black pixel (and transparent elsewhere).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">qrcode</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">qrcode</span>
<span class="k">def</span> <span class="nf">make_qr_png</span><span class="p">(</span>
<span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">filename</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">os</span><span class="p">.</span><span class="n">PathLike</span><span class="p">]</span> <span class="o">=</span> <span class="s">'qr_py.png'</span><span class="p">,</span>
<span class="n">dpi</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="mi">300</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-></span> <span class="bp">None</span><span class="p">:</span>
<span class="s">"""Saves a png QR code to a URL with transparent whites and background.
Parameters
----------
url: url that the QR code should point to.
filename: For saving.
dpi: matplotlib figure dpi.
"""</span>
<span class="n">qr</span> <span class="o">=</span> <span class="n">qrcode</span><span class="p">.</span><span class="n">QRCode</span><span class="p">()</span>
<span class="n">qr</span><span class="p">.</span><span class="n">add_data</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">qr</span><span class="p">.</span><span class="n">make_image</span><span class="p">()</span>
<span class="c1"># cast to numpy
</span> <span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="n">h</span><span class="p">,</span> <span class="n">w</span> <span class="o">=</span> <span class="n">img</span><span class="p">.</span><span class="n">shape</span>
<span class="n">rgb</span> <span class="o">=</span> <span class="p">[</span><span class="n">img</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
<span class="n">rgb</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span><span class="n">rgb</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># set alpha channel
</span> <span class="n">rgba</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">rgba</span><span class="p">[...,:</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">rgb</span>
<span class="n">rgba</span><span class="p">[...,</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">rgba</span><span class="p">[...,:</span><span class="mi">3</span><span class="p">],</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span><span class="o">*</span><span class="mi">255</span>
<span class="n">rgba</span> <span class="o">=</span> <span class="n">rgba</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="c1">#
</span> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="n">dpi</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">rgba</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">transparent</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"doi.org/10.48550/arXiv.2301.11955"</span> <span class="c1"># set to whatever; shorter urls make cleaner QR codes
</span><span class="n">make_qr_png</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>lyndon duongCode snippet for generating QR codes with transparent backgrounds.Stochastic shape metrics without the agonizing pain2023-02-10T00:00:00+00:002023-02-10T00:00:00+00:00https://lyndonduong.com/stochastic-shape-metrics<p>A no-math intuitive account of our method <a href="https://arxiv.org/abs/2211.11665">published in ICLR 2023</a>.
<!--more--></p>
<p>Code for this post:
<a href="https://colab.research.google.com/github/lyndond/lyndond.github.io/blob/master/code/2023-02-10-stochastic-shape-metrics.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>
<a href="https://github.com/lyndond/lyndond.github.io/blob/master/code/2023-02-10-stochastic-shape-metrics.ipynb"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open in GitHub" /></a></p>
<p>Code for methods in the paper:
<a href="https://github.com/ahwillia/netrep"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Code for methods in paper:" /></a></p>
<p><a href="/mila-talk-shapes/">Watch my 45min talk on shape metrics.</a></p>
<p>Neuroscience and machine learning experiments now produce datasets of several animals or networks performing the same task. Below are two matrices representing simultaneously recorded activities from neurons in two different networks.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/network_matrices.png" style="width:20em" /></div>
<p>You can think of these as responses from two different animals, or two brain regions, or two deep net layers. Each column is a neuron and each row is the population response to a given stimulus condition. A fundamental question in neural representation learning & neuroscience is: Given multiple networks that were doing the same task, how related are their neural representations of the task? These two network matrices are high-dimensional, with the number of dimensions equal to the number of neurons, but for the sake of visualization, let’s assume the responses lie on some low-dimensional 3D manifold, where each row of the matrix (the population response to a condition) is plotted as a colored point in this space.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/pringles.png" style="width:20em" /></div>
<p>This manifold traces out a purple Pringles chip for network 1, and a green, slightly warped and rotated Pringles chip for network 2. How can we compare these representational geometric objects to each other, and how does that relationship correlate with the task? There have been many proposed methods to answer this question of network similarity, but I’m going to talk about a recently proposed method (and our extensions to it) that draws on ideas from the field of <strong>statistical shape analysis</strong> (<a href="https://arxiv.org/abs/2110.14739">Williams et al. 2022</a>).</p>
<h2 id="shape-metrics-on-neural-representations">Shape metrics on neural representations</h2>
<p>We treat each joint response as a high-dimensional “shape” and wish to rigorously quantify notions of <strong>distance</strong> between these two shapes. The upshot is that the distance between these two geometric shapes should be invariant to some <strong>group of nuisance transformations</strong>. For instance, the two manifolds above both look like Pringles chips, but are slightly warped and rotated from each other. So, even though the two manifolds are not <strong>exactly</strong> the same, in most cases, being a simple linear transform away from the other is not interesting. We are instead more interested in first aligning the representations, then analyzing the remaining differences. These ideas are used heavily in image registration, where we want to align two images of the same object (e.g. medical CT scans) that are slightly rotated or translated.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/pringles_aligned.png" style="width:30em" /></div>
<p>Given an allowable set of transformations (like a permutation, rotation/reflection, shifts, scaling), we should be able to first align the two networks; whatever difference remains is what we define as the distance between them. Indeed, the choice of how flexible or restrictive the group of transformations should be is itself an interesting hypothesis that the experimenter must decide on. Mathematically, this nuisance-transform invariant distance is a <strong>metric</strong>, which must obey these four properties:</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/metric_properties.png" style="width:15em" /></div>
<p>From top to bottom, these are: identity of indiscernibles, non-negativity, symmetry, and triangle inequality. These are the same properties that make Euclidean distance a metric. Because this is a bona fide metric, we can analyze pairwise distances between <strong>multiple networks simultaneously</strong>. This enables downstream clustering analysis and visualization with theoretical guarantees on correctness.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/abo_mouse.png" /></div>
<p>The plot above shows ~50 mouse brain areas from the Allen Brain Observatory. Analyzing their representational similarity with shape metrics allows us to plot all the networks in what we call “shape space”. The color of each dot corresponds to which brain area the network recording was from, and was only added after the fact. This shows that shape metrics provide an unsupervised way to discover functional similarity between different networks.</p>
<h2 id="neural-responses-are-noisy">Neural responses are noisy</h2>
<p>The above-described shape metrics were derived for scenarios in which responses are deterministic; i.e. the same stimulus always elicits the same response. They can be applied to neural data where we have taken the conditional mean response across trials. <strong>But neural responses are stochastic!</strong> Imagine now that we show networks a stimulus (dark blue or cyan), producing a different response each time for each network.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/noise_correlations.png" style="width:30em" /></div>
<p>Note that the conditional means (white stars) are the same in both networks, but the shapes of the conditional distributions (ellipses) are very different. In neuroscience, these are referred to as “noise correlations”, and they exist in all brain areas across all species. It’s therefore important for us to develop a way to compare how stimuli are encoded in different animals/networks not just by comparing the conditional means, but also the conditional noise. We need a notion of distance and an alignment procedure that takes into account noise in each response.</p>
<h2 id="dissimilarity-between-gaussian-representations">Dissimilarity between Gaussian representations</h2>
<p>In the paper we describe ways of addressing stochasticity in comparing representational geometry, borrowing two ideas from the theory of optimal transport: Wasserstein distance and Energy distance. Here I’m only going to talk about Wasserstein distance.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/stochastic_responses.png" style="width:20em" /></div>
<p>Consider two networks (purple and orange) responding to some condition, each with a response mean and now also a covariance about that mean. If we model responses as Gaussian, then the 2-Wasserstein distance between them has the nice property of having an analytic solution which nicely decouples into two quantities familiar to experimentalists. The first term is simply the difference between the network conditional means.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/means.png" style="width:10em" /></div>
<p>The second term is what’s known as the Bures metric, which quantifies the difference in orientation and scale of the covariance noise clouds.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/bures.gif" style="width:15em" /></div>
<p>Geometrically, interpolating two covariances using the Bures metric traces out a path (a geodesic) between the two covariance clouds which linearly interpolates the sum of the principal axis standard deviations.
The below plot shows the geodesic between covariance clouds \({\bf C}_a \rightarrow {\bf C}_b\) (90 degree rotation) and \({\bf C}_a \rightarrow {\bf C}_c\) (simple isotropic scaling).
Note that the sum of principal axis standard deviations (i.e. sum of the square rooted eigenvalues) are linearly interpolated along the geodesic.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/geodesic.png" /></div>
<p>Intuitively, you can think of Wasserstein distance as the amount of work it takes to move and transform a pile of dirt that is shaped like the purple density to the orange one, or vice versa. Most importantly for us, Wasserstein distance is a metric, and therefore provides a natural stochastic extension to, and all the benefits of, the deterministic shape metrics described above.</p>
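<p>For concreteness, here’s a small numerical sketch of the analytic form under the Gaussian assumption (the function and variable names are my own): the squared distance decomposes exactly into the two terms above, the squared mean difference plus the squared Bures metric between the covariances:</p>

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(mu_a, cov_a, mu_b, cov_b):
    """Squared 2-Wasserstein distance between two Gaussians:
    squared mean difference plus squared Bures metric."""
    sqrt_a = sqrtm(cov_a)
    bures_sq = np.trace(cov_a + cov_b - 2 * sqrtm(sqrt_a @ cov_b @ sqrt_a))
    return np.sum((mu_a - mu_b) ** 2) + np.real(bures_sq)

mu = np.zeros(2)
C_a = np.eye(2)
C_c = 4 * np.eye(2)                     # isotropic scaling of C_a
d = gaussian_w2_sq(mu, C_a, mu, C_c)    # mean term is 0; Bures term carries it all
```

<p>Both terms vanish only when the two Gaussians coincide, which is part of what makes this a bona fide metric on (mean, covariance) pairs.</p>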
<h2 id="stochastic-shape-metrics-on-neural-representations">Stochastic shape metrics on neural representations</h2>
<p>Equipped with the Wasserstein distance, we developed a procedure that can take two networks with stochastic responses and globally align them using a single transform, which now takes into account both their conditional mean & covariance structure. The remaining difference between them is quantified as the Wasserstein distance between the two joint distributions.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/stochastic_aligned.png" style="width:30em" /></div>
<h2 id="application-variational-autoencoders">Application: Variational Autoencoders</h2>
<p>To demonstrate the utility and scalability of our stochastic shape metrics, we turn to variational autoencoders (VAEs). A VAE takes an image as input; an encoder network bottlenecks each input into a Gaussian distribution over latents. A decoder network then draws samples from these conditional densities to reconstruct the input. These models are a perfect test-bed for our stochastic metric because the conditional responses are Gaussian by design, and so the Wasserstein distance is <strong>exact</strong>.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/vae_diagram.png" style="width:30em" /></div>
<p>We used our method to compare 1800 different networks trained with different objectives and hyperparameters (<a href="https://arxiv.org/abs/1811.12359">Locatello et al. 2018</a>). This distance matrix has about 1.6 million unique elements and would have taken ~10 years of computation time if each pairwise distance were computed serially, but with some parallelization tricks, we were able to drop the compute time down to a couple of hours.</p>
<div style="text-align:center"><img src="/assets/posts/stochastic_shapes/vae_analyses.png" /></div>
<p>With this distance matrix, because our method defines a metric on stochastic representations, we can visualize and embed each network into shape space, and colour each network by its objective. We see that each colour clusters together as we might hope, and that our method reveals representational differences between networks trained with different objectives. Alternatively, we can use non-parametric K-nearest-neighbours analyses directly on the distance matrix. Doing this, we found that it’s possible to quantify and decode various parameters of interest such as the objective, how well each VAE reconstructed its inputs, how well the latents were “disentangled”, and even their random initialization seed!</p>
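<p>A minimal sketch of that kind of analysis (with toy data standing in for the real pairwise distance matrix; my own sketch, not the paper’s code): because the distances form a valid metric, leave-one-out k-nearest-neighbour decoding can be run directly on the matrix, with no embedding step:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: 40 "networks" drawn from two training objectives
pts = np.concatenate([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
labels = np.array([0] * 20 + [1] * 20)          # objective of each network
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)  # pairwise distances

# Leave-one-out k-NN decoding directly on the distance matrix
k = 3
np.fill_diagonal(D, np.inf)                  # exclude self-matches
nearest = np.argsort(D, axis=1)[:, :k]       # indices of k nearest networks
votes = labels[nearest]
pred = (votes.mean(axis=1) > 0.5).astype(int)
acc = (pred == labels).mean()                # decoding accuracy of the objective
```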
<h2 id="summary">Summary</h2>
<p>Quantifying representational similarity is fundamental to neuroscience & ML. Shape metrics offer a principled way to compare responses of many networks. We develop shape metrics for stochastic neural responses. This method is highly scalable and can be used to compare thousands of networks simultaneously. These analyses can provide direct insight into how representational geometry relates to experimental variables of interest (e.g. behaviour, task performance, etc.).</p>lyndon duongA no-math intuitive account of our method published in ICLR 2023.My Google PhD internship experience2022-11-01T00:00:00+00:002022-11-01T00:00:00+00:00https://lyndonduong.com/google-internship<p>My internship research project at Google on machine learning for video compression.
<!--more--></p>
<p>I wrote about how I prepared for and passed the Google coding interview <a href="/coding-interview/">here</a>.</p>
<h2 id="internship-summary">Internship summary</h2>
<p>For my summer 2022 internship project at Google, I worked on the Open Codecs team under the Google Chrome umbrella, researching nonlinear transform methods using machine learning (e.g. deep nets) for video compression.
The model I developed works on intra-frame prediction residuals, and largely draws inspiration from my PhD research on adaptive gain control, and <a href="https://arxiv.org/abs/2007.03034">Ballé et al. (2020)</a>.</p>
<p>Some general takeaways:</p>
<ul>
<li>Most PhD projects were research oriented compared to e.g. undergrad intern projects, and I had tons of freedom to explore my own ideas.</li>
<li>Everyone on my team was very nice and helpful, and seemed to have a healthy work-life balance.</li>
<li>Bay Area summer weather is pretty much unbeatable, but day-to-day life was sleepy compared to NYC.</li>
<li>The internship (12-14 weeks) goes by very fast, especially since onboarding takes at least a couple weeks.</li>
<li>TensorFlow 2.0 is a pain to use and debug compared to PyTorch, but <code class="language-plaintext highlighter-rouge">tensorflow.data</code> is very nice.</li>
<li>Modern video codecs are built upon decades of heuristics and incremental engineering improvements.</li>
<li>Because of hardware limitations, we are <em>far</em> from end-to-end machine learning-based video codecs. The current state of the art is mostly traditional signal processing with maybe a few ML modules sprinkled in. The discrete cosine transform (DCT) is simply too cheap to replace with a neural network for now.</li>
<li>Google has one giant monorepo, so you can see everybody’s code. This was very useful when I was stuck on some infrastructural issue (e.g. distributed model training) and needed examples to copy and modify.</li>
<li>All internal Google tools felt like they were either in Beta, or in some stage of deprecation. It was really frustrating to follow some approved tutorial only to find that it was out-of-date and there was some new way of doing things.</li>
<li>Google’s code review tools (Critique, Gerrit) are way nicer than GitHub PR reviews.</li>
<li>Fig, Google’s distributed version control system based on Mercurial, is so much better than Git.</li>
</ul>
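<p>The point above about the DCT is easy to appreciate with a small sketch (a toy example of my own): for smooth blocks, a handful of DCT coefficients carry nearly all of the energy, which is exactly what a transform code exploits, at a tiny compute cost:</p>

```python
import numpy as np
from scipy.fft import dct, idct

# A smooth 8-sample block, a stand-in for a row of a prediction residual
x = np.cos(np.linspace(0, np.pi, 8))
X = dct(x, norm='ortho')          # orthonormal DCT-II: energy-preserving

# Crude "coding": zero out all but the 3 largest-magnitude coefficients
X_coded = X.copy()
X_coded[np.argsort(np.abs(X))[:-3]] = 0.0
x_rec = idct(X_coded, norm='ortho')

err = np.mean((x - x_rec) ** 2)   # small, despite discarding 5 of 8 coefficients
```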
<h2 id="my-project">My project</h2>
<p>I wrote up the results for a conference paper, which was accepted to the IEEE Int’l Conference on Acoustics, Speech and Signal Processing 2023 (ICASSP), taking place in Rhodes, Greece <a href="https://doi.org/10.48550/arXiv.2210.14308">(arXiv:2210.14308)</a>!
Figure 1 of the paper (below) shows the architecture of the model.
I developed a method that serves as a (non)linear drop-in replacement for, or <em>augmentation</em> of, existing transform coding modules in the AV1 codec.
The TL;DR is that a base autoencoder and hyperprior are trained on a large dataset of video frame prediction residuals.
To allow the model to operate at different bit rates, we can train auxiliary parameters (gain modulations; pink) to control the rate-dependent output scale at each layer.
In the paper, we show the model can be trained end-to-end, and that we can augment the DCT with learned gain modulations (quantization matrices) and hyperpriors to significantly improve performance at a fraction of the cost of a full-blown nonlinear transform (e.g. a deep net).</p>
<p><img src="/assets/posts/multirate_compression/fig_arch.png" alt="architecture" /></p>lyndon duongMy internship research project at Google on machine learning for video compression.Reproducible latents2022-07-10T00:00:00+00:002022-07-10T00:00:00+00:00https://lyndonduong.com/reproducible-latents<p>Can we rely on VAEs to generate reproducible latents?</p>
<!--more-->
<p><a href="https://arxiv.org/abs/1312.6114">Variational autoencoders</a> (VAEs) originated as a method for learning probabilistic generative models of data.
In recent years, there have been countless studies using VAEs as tools to infer low-dimensional latent structure from high-dimensional data.
The nonlinearity/flexibility of these models raises the question: how reliable are VAEs in uncovering these latents?
This is especially important if we want to use them to draw hard scientific conclusions from our data.</p>
<p>Below are latents from the exact same network architecture (an MLP with ReLU non-linearities) trained to auto-encode MNIST digits with the standard VAE objective, using 5 different random initialization seeds.
Each dot is a different example and different colors are different digits.</p>
<p><img src="/assets/posts/reproducible_latents/mnist_2d.png" alt="latent" /></p>
<p>All of the converged reconstruction errors were effectively the same, but the latents are wildly different (even after rotating/scaling to minimize squared error).</p>
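<p>That alignment step (rotating/scaling to minimize squared error) is a Procrustes problem; here’s a toy sketch with synthetic latents (my own variable names, not the actual MNIST latents) using scipy:</p>

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
# Two toy 2D "latent" sets: the second is a rotated, scaled copy of the first
z1 = rng.standard_normal((500, 2))
theta = 1.2
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
z2 = 3.0 * z1 @ R.T + 0.01 * rng.standard_normal((500, 2))

# Best orthogonal map (plus scale) taking z2 onto z1 in the least-squares sense
Q, scale = orthogonal_procrustes(z2, z1)
z2_aligned = z2 @ Q * (scale / np.sum(z2 ** 2))
err = np.mean((z2_aligned - z1) ** 2)   # tiny: these two latent sets DO match
```

<p>For the MNIST latents in the figure, the analogous <code class="language-plaintext highlighter-rouge">err</code> stays large even after alignment, which is the point: the seeds produce genuinely different solutions, not rotated copies of one another.</p>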
<p>There are ways to remedy this, related to ideas of <a href="https://en.wikipedia.org/wiki/Identifiability">identifiability</a>, which has recently gained popularity in latent variable and generative modelling.
<a href="/pivae/">(I even implemented an identifiable model for spiking neural data in a previous note.)</a>
For this note, I just wanted to show a simple, compelling example of how different the solutions can be.
As researchers and practitioners we should take care to re-run the same experiments with a bunch of different seeds to ensure we get the same qualitative trends in the latents.</p>lyndon duongCan we rely on VAEs to generate reproducible latents?My first C++ open source contribution – Stan Library2022-01-13T00:00:00+00:002022-01-13T00:00:00+00:00https://lyndonduong.com/first-cpp-pr<p>Writing L1 and L2 vector norms with reverse- and forward-mode autodiff.
<!--more--></p>
<p>I successfully got my first <a href="https://github.com/stan-dev/math/pull/2636">C++ Pull Request</a> with ~430 lines of code merged into the <a href="https://github.com/stan-dev">Stan Library</a>, a popular statistical library used for Bayesian modeling and inference.</p>
<p>I spent a good chunk of Xmas break reading textbooks and watching tutorials on C++.
But, you can only learn so much by doing textbook exercises and watching YouTube videos, so I wanted to build something <em>real</em> to solidify my understanding.
C++ seems to be the lingua franca of video game programming, and there are plenty of tutorials online about how to build simple games, but I’m not super into that.
It made more sense to me to work on a project where I could leverage my existing skills and domain expertise (numerical linear algebra, machine learning, probabilistic models).</p>
<p>This is what motivated me to write a tiny linear algebra matrix class to <a href="/mlp-train-cpp/">build and train a neural network from scratch</a>.
While this was fun, the project felt more like a one-off rather than a “real” software project.
I watched an old <a href="https://youtu.be/NOCElcMcFik">CppCon talk by Titus Winters</a>, the C++ tech lead at Google, who described how writing software with long-term stability and maintainability requires a completely different mindset than what most start-up or junior/student devs (i.e. me) are used to.
This inspired me to contribute to a longer-term, large open-source collaboration in order to learn more about things like complex C++ library builds and unit testing.</p>
<h2 id="choosing-and-open-source-project">Choosing an open source project</h2>
<p>I was torn between trying to contribute to either the PyTorch, or Stan libraries.
There were pros and cons to each that I had to weigh.
I use PyTorch extensively in my day-to-day, so I’m quite familiar with how it works; however, its repo is a behemoth with thousands of contributors and outstanding pull requests, so it looks very easy to get lost in all the noise.
With Stan, on the other hand, I have far less experience using the library; but, I am reasonably familiar with the problem domain (probabilistic models and probabilistic programming), and the smaller community seemed less daunting to join.
I think what ultimately pushed me over the fence was Bob Carpenter, one of the main Stan developers who happens to work down the hall from me; he is very friendly + knowledgeable, and encouraged me to contribute to the library when I mentioned I wanted to hone my C++ skills.</p>
<h2 id="my-contributions-to-the-stan-library">My contributions to the Stan Library</h2>
<p>While most people interface with the Stan library using a higher-level scripting language (e.g. RStan, PyStan, etc.), the Stan framework itself is largely built on a C++ foundation.
At the heart of Stan is its Math library, which basically wraps the existing <code class="language-plaintext highlighter-rouge">Eigen</code> C++ linear algebra library and extends it with automatic differentiation capabilities.</p>
<p>There was an <a href="https://github.com/stan-dev/math/issues/2562">outstanding <code class="language-plaintext highlighter-rouge">stan-dev/math</code> issue</a> from this past summer describing the need for L1 and L2 norms in the library.
It was tagged as a <code class="language-plaintext highlighter-rouge">good first issue</code>, meaning that it was an ideal problem for newcomers to work on.</p>
<h3 id="the-code-i-wrote">The code I wrote</h3>
<p>My contributions mainly revolved around different function templates for these L1 and L2 norms.
These had to be exhaustive for all possible use-cases in the library.</p>
<ol>
<li>Templated L1 and L2 norm functions operating on Containers with underlying <code class="language-plaintext highlighter-rouge">std::is_arithmetic</code> types</li>
<li>L1 and L2 norm functions with reverse-mode autodiff capabilities</li>
<li>L1 and L2 norm functions with forward-mode autodiff capabilities</li>
<li>Extensive Google Test unit testing for all these new functions with standard-use and edge cases</li>
</ol>
<h2 id="things-i-learned">Things I learned</h2>
<ul>
<li>The Stan Library team is super friendly and welcoming
<ul>
<li>The back-and-forth process from initial pull request to getting reviewed and merged into the main <code class="language-plaintext highlighter-rouge">develop</code> branch only took 4 days, Dec 30 - Jan 2, (i.e. they were kind enough to review over the New Year weekend).</li>
</ul>
</li>
<li>Google Test C++ unit testing framework
<ul>
<li>This is a very popular tool in C++ projects so I’m glad I’m getting more familiar with it.</li>
<li>I’m very familiar with <code class="language-plaintext highlighter-rouge">PyTest</code> for Python, and Google Test feels pretty similar so far.</li>
</ul>
</li>
<li>C++ Template metaprogramming tricks
<ul>
<li><code class="language-plaintext highlighter-rouge">stan-dev/math</code> is a (mostly) header-only library so at times it felt like template metaprogramming olympics to get stuff to run.</li>
<li>The Stan devs were very helpful during the review process and guided me through the confusing template expressions and type-trait landscape.</li>
</ul>
</li>
<li>How to implement reverse and forward-mode autodifferentiation
<ul>
<li>It’s funny, I feel like I’ve used and read about autodiff for years now but never actually had to sit down and implement it myself.</li>
<li>The derivatives of L1 and L2 norms that I implemented were pretty straightforward, but <a href="http://www.matrixcalculus.org/">matrixcalculus.org</a> was a good resource to double-check my math.</li>
<li>Most contemporary machine learning uses reverse-mode autodiff (for backpropagation), so <a href="https://kenndanielso.github.io/mlrefined/blog_posts/3_Automatic_differentiation/3_4_AD_forward_mode.html">this blog post</a> was a good resource to familiarize myself with implementing forward-mode.</li>
</ul>
</li>
</ul>
<p>My experience as a first-time contributor to Stan (and a first-time C++ contributor to anything at all) was great, and I’m looking forward to making more contributions in the future.</p>lyndon duongWriting L1 and L2 vector norms with reverse- and forward-mode autodiff.C++ neural networks from scratch – Pt 3. model training2022-01-06T00:00:00+00:002022-01-06T00:00:00+00:00https://lyndonduong.com/mlp-train-cpp<p>Training a multilayer perceptron built in pure C++.
<!--more--></p>
<p><a href="https://github.com/lyndond/lyndond.github.io/blob/master/code/2021-12-22_neural_net_cpp/"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open on GitHub" /></a></p>
<ul>
<li><a href="/linalg-cpp/">Part 1 – building a matrix library</a></li>
<li><a href="/mlp-build-cpp/">Part 2 – building an MLP</a></li>
<li><a href="/mlp-train-cpp/">Part 3 – model training</a></li>
</ul>
<h2 id="fitting-a-neural-network-to-data">Fitting a neural network to data</h2>
<p>We’ve built a tiny matrix library, and a flexible multilayer perceptron (MLP) with forward and backward methods.
Now, it’s time to test if it can learn and fit data!</p>
<p><img src="/assets/posts/nn_cpp/nn_architecture.png" alt="architecture" /></p>
<p>Using the <code class="language-plaintext highlighter-rouge">make_model()</code> function (defined in Part 2) to create an MLP with 3 hidden layers with 8 hidden units each, we just need to write code for the data generation and the model training loop.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// main.cpp</span>
<span class="cp">#include "matrix.h" // contains matrix library
#include "nn.h" // contains our MLP implementation
#include <fstream> // for std::ofstream
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// init model</span>
<span class="kt">int</span> <span class="n">in_channels</span><span class="p">{</span><span class="mi">1</span><span class="p">},</span> <span class="n">out_channels</span><span class="p">{</span><span class="mi">1</span><span class="p">};</span>
<span class="kt">int</span> <span class="n">hidden_units_per_layer</span><span class="p">{</span><span class="mi">8</span><span class="p">},</span> <span class="n">hidden_layers</span><span class="p">{</span><span class="mi">3</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">lr</span><span class="p">{</span><span class="mf">.5</span><span class="n">f</span><span class="p">};</span>
<span class="k">auto</span> <span class="n">model</span> <span class="o">=</span> <span class="n">make_model</span><span class="p">(</span>
<span class="n">in_channels</span><span class="p">,</span>
<span class="n">out_channels</span><span class="p">,</span>
<span class="n">hidden_units_per_layer</span><span class="p">,</span>
<span class="n">hidden_layers</span><span class="p">,</span>
<span class="n">lr</span><span class="p">);</span>
<span class="c1">// open file to save loss, x, y, and model(x)</span>
<span class="n">std</span><span class="o">::</span><span class="n">ofstream</span> <span class="n">my_file</span><span class="p">;</span>
<span class="n">my_file</span><span class="p">.</span><span class="n">open</span> <span class="p">(</span><span class="s">"data.txt"</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">max_iter</span><span class="p">{</span><span class="mi">10000</span><span class="p">};</span>
<span class="kt">float</span> <span class="n">mse</span><span class="p">;</span>
<span class="c1">//////////////////////////////////</span>
<span class="c1">////* training loop goes here*////</span>
<span class="c1">//////////////////////////////////</span>
<span class="n">my_file</span><span class="p">.</span><span class="n">close</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="writing-the-training-loop">Writing the training loop</h2>
<p>Let’s fit our model to a nonlinear function: \(y = \sin^2(x)\) where \(x\in[0, \pi)\).
On each iteration, we’ll generate an <code class="language-plaintext highlighter-rouge">(x, y)</code> pair using this function, then pass <code class="language-plaintext highlighter-rouge">x</code> through our model,
\(\hat{y} \leftarrow \texttt{model}(x)\),
and use our <code class="language-plaintext highlighter-rouge">model.backprop()</code> method to compute the gradient and backpropagate with respect to \(\texttt{loss}\leftarrow (y-\hat{y})^2\).</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* training loop */</span>
<span class="k">const</span> <span class="kt">float</span> <span class="n">PI</span> <span class="p">{</span><span class="mf">3.14159</span><span class="p">};</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span><span class="o"><=</span><span class="n">max_iter</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// generate (x, y) training data: y = sin^2(x)</span>
<span class="k">auto</span> <span class="n">x</span> <span class="o">=</span> <span class="n">mtx</span><span class="o"><</span><span class="kt">float</span><span class="o">>::</span><span class="n">rand</span><span class="p">(</span><span class="n">in_channels</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">multiply_scalar</span><span class="p">(</span><span class="n">PI</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">apply_function</span><span class="p">([](</span><span class="kt">float</span> <span class="n">v</span><span class="p">)</span> <span class="o">-></span> <span class="kt">float</span> <span class="p">{</span> <span class="k">return</span> <span class="n">sin</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="o">*</span> <span class="n">sin</span><span class="p">(</span><span class="n">v</span><span class="p">);</span> <span class="p">});</span>
<span class="c1">// forward and backward</span>
<span class="k">auto</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="n">model</span><span class="p">.</span><span class="n">backprop</span><span class="p">(</span><span class="n">y</span><span class="p">);</span> <span class="c1">// loss and grads computed in here</span>
<span class="c1">// function that logs (loss, x, y, y_hat)</span>
<span class="n">log</span><span class="p">(</span><span class="n">my_file</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="trained-model">Trained model</h2>
<p>I logged the (<code class="language-plaintext highlighter-rouge">loss</code>, <code class="language-plaintext highlighter-rouge">x</code>, <code class="language-plaintext highlighter-rouge">y</code>, <code class="language-plaintext highlighter-rouge">y_hat</code>) values to a <code class="language-plaintext highlighter-rouge">.txt</code> file, then parsed & plotted them in Python (plotting code is also included in the repo).
The model clearly learns the function and reduces error over time (left panel).
This is also qualitatively evident comparing the model outputs toward the beginning of training (middle panel – looks like trash) to those from the late phase of training (right panel).
Looks pretty great :).</p>
<p><img src="/assets/posts/nn_cpp/mlp_cpp.png" alt="training results" /></p>
<h2 id="recap">Recap</h2>
<p>We’ve come a long way: from zero to a fully trained model in just a couple hundred lines of C++.
We built our own <code class="language-plaintext highlighter-rouge">Matrix</code> class with linear algebra capabilities, and a flexible implementation of a backprop-able MLP.
Despite this being a relatively simple task in, say, Python, it’s fun to do away with all the machine learning library abstractions and just write things yourself – back to basics.
I find that in higher-level languages you’re always worrying about whether or not your implementation is as efficient as it could be.
E.g. when writing loops in some analysis code you always wonder if there’s a better vectorized approach.
In C++, since you’re just so close to the bare metal, it seems to me like you just… write the loops.
<code class="language-plaintext highlighter-rouge">Julia</code> kind of has the same vibes but C++ feels a bit more satisfyingly raw.</p>
<p>This was my very first C++ project, and it covered a lot of useful topics to get familiarized with the language: strict typing & type inference, OOP, the standard library, and templated programming are a few things that immediately come to mind.
I’m a long way from writing fully idiomatic C++ (the code written in this project has a Pythonic flavour to it) but it’ll be fun to look back at this down the road and see how I’ve improved.</p>lyndon duongTraining a multilayer perceptron built in pure C++.C++ neural networks from scratch – Pt 2. building an MLP2021-12-29T00:00:00+00:002021-12-29T00:00:00+00:00https://lyndonduong.com/mlp-build-cpp<p>Building a trainable multilayer perceptron in pure C++.
<!--more--></p>
<p><a href="https://github.com/lyndond/lyndond.github.io/blob/master/code/2021-12-22_neural_net_cpp/"><img src="https://img.shields.io/badge/Open on GitHub-success.svg" alt="Open on GitHub" /></a></p>
<ul>
<li><a href="/linalg-cpp/">Part 1 – building a matrix library</a></li>
<li><a href="/mlp-build-cpp/">Part 2 – building an MLP</a></li>
<li><a href="/mlp-train-cpp/">Part 3 – model training</a></li>
</ul>
<h2 id="mlp-class">MLP class</h2>
<p>Now that we’ve built our matrix class and basic linear algebra functionality, let’s use it to build a multilayer perceptron (MLP).
We’ll first go through the class constructor, and then implement the <code class="language-plaintext highlighter-rouge">forward()</code> and <code class="language-plaintext highlighter-rouge">backward()</code> methods.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
#include "matrix.h"
#include <random>
#include <utility>
#include <cassert>
#include <cmath> // for exp() in sigmoid
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">lynalg</span><span class="p">;</span> <span class="c1">// matrix linalg lib from matrix.h</span>
<span class="k">namespace</span> <span class="n">nn</span> <span class="p">{</span>
<span class="k">template</span><span class="o"><</span><span class="k">typename</span> <span class="nc">T</span><span class="p">></span>
<span class="k">class</span> <span class="nc">MLP</span> <span class="p">{</span>
<span class="nl">public:</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">size_t</span><span class="o">></span> <span class="n">units_per_layer</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Matrix</span><span class="o"><</span><span class="n">T</span><span class="o">>></span> <span class="n">bias_vectors</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Matrix</span><span class="o"><</span><span class="n">T</span><span class="o">>></span> <span class="n">weight_matrices</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">Matrix</span><span class="o"><</span><span class="n">T</span><span class="o">>></span> <span class="n">activations</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">lr</span><span class="p">;</span>
<span class="c1">////////////////////////////////</span>
<span class="c1">///* Constructor goes here *////</span>
<span class="c1">////////////////////////////////</span>
<span class="c1">////////////////////////////////</span>
<span class="c1">//* Forward method goes here *//</span>
<span class="c1">////////////////////////////////</span>
<span class="c1">//////////////////////////////////</span>
<span class="c1">//* Backward method goes here *//</span>
<span class="c1">/////////////////////////////////</span>
<span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="constructor">Constructor</h3>
<p>Time to make use of our matrix/linear algebra library from Part 1!
The <code class="language-plaintext highlighter-rouge">MLP</code> constructor initializes each layer’s weights and biases with random Gaussian noise.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">explicit</span> <span class="nf">MLP</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">size_t</span><span class="o">></span> <span class="n">units_per_layer</span><span class="p">,</span> <span class="kt">float</span> <span class="n">lr</span> <span class="o">=</span> <span class="mf">.001</span><span class="n">f</span><span class="p">)</span><span class="o">:</span>
<span class="n">units_per_layer</span><span class="p">(</span><span class="n">units_per_layer</span><span class="p">),</span>
<span class="n">weight_matrices</span><span class="p">(),</span>
<span class="n">bias_vectors</span><span class="p">(),</span>
<span class="n">activations</span><span class="p">(),</span>
<span class="n">lr</span><span class="p">(</span><span class="n">lr</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">units_per_layer</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">size_t</span> <span class="n">in_channels</span><span class="p">{</span><span class="n">units_per_layer</span><span class="p">[</span><span class="n">i</span><span class="p">]};</span>
<span class="kt">size_t</span> <span class="n">out_channels</span><span class="p">{</span><span class="n">units_per_layer</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]};</span>
<span class="c1">// initialize to random Gaussian</span>
<span class="k">auto</span> <span class="n">W</span> <span class="o">=</span> <span class="n">lynalg</span><span class="o">::</span><span class="n">mtx</span><span class="o"><</span><span class="n">T</span><span class="o">>::</span><span class="n">randn</span><span class="p">(</span><span class="n">out_channels</span><span class="p">,</span> <span class="n">in_channels</span><span class="p">);</span>
<span class="n">weight_matrices</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">W</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">b</span> <span class="o">=</span> <span class="n">lynalg</span><span class="o">::</span><span class="n">mtx</span><span class="o"><</span><span class="n">T</span><span class="o">>::</span><span class="n">randn</span><span class="p">(</span><span class="n">out_channels</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">bias_vectors</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
<span class="n">activations</span><span class="p">.</span><span class="n">resize</span><span class="p">(</span><span class="n">units_per_layer</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="forward-pass">Forward pass</h3>
<p>Each layer of the neural network will be of the form <code class="language-plaintext highlighter-rouge">output <- sigmoid( Weight.matmul( input ) + bias )</code>.
First, we can implement the sigmoid nonlinearity, which is pretty straightforward.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">inline</span> <span class="k">auto</span> <span class="nf">sigmoid</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="mf">1.0</span><span class="n">f</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The forward pass computes the activations at each layer, saves them to <code class="language-plaintext highlighter-rouge">activations</code>, and passes them forward as the input to the next layer.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="nf">forward</span><span class="p">(</span><span class="n">Matrix</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">get</span><span class="o"><</span><span class="mi">0</span><span class="o">></span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="n">units_per_layer</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&&</span> <span class="n">get</span><span class="o"><</span><span class="mi">1</span><span class="o">></span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">);</span> <span class="c1">// expect a column vector input</span>
<span class="n">activations</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="n">Matrix</span> <span class="n">prev</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">units_per_layer</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">Matrix</span> <span class="n">y</span> <span class="o">=</span> <span class="n">weight_matrices</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">matmul</span><span class="p">(</span><span class="n">prev</span><span class="p">);</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">y</span> <span class="o">+</span> <span class="n">bias_vectors</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">apply_function</span><span class="p">(</span><span class="n">sigmoid</span><span class="p">);</span>
<span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
<span class="n">prev</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">prev</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="backward-pass">Backward pass</h2>
<p>We’re going to need the derivative of the sigmoid, written in terms of the sigmoid’s output (i.e. the stored activations):</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">inline</span> <span class="k">auto</span> <span class="nf">d_sigmoid</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">){</span>
<span class="k">return</span> <span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">backprop()</code> method takes the target output as its input.
We apply the strategy described in Chapter 5 of Bishop’s <em>Pattern Recognition and Machine Learning</em>.
It’s been covered in tons of other articles, so I’m going to omit the details and only focus on the C++ implementation here.
<a href="https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/">This article has a good walk-through of the details</a>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">backprop</span><span class="p">(</span><span class="n">Matrix</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">target</span><span class="p">)</span> <span class="p">{</span>
<span class="n">assert</span><span class="p">(</span><span class="n">get</span><span class="o"><</span><span class="mi">0</span><span class="o">></span><span class="p">(</span><span class="n">target</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">==</span> <span class="n">units_per_layer</span><span class="p">.</span><span class="n">back</span><span class="p">());</span>
<span class="c1">// determine the simple error</span>
<span class="c1">// error = target - output</span>
<span class="k">auto</span> <span class="n">y</span> <span class="o">=</span> <span class="n">target</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="n">activations</span><span class="p">.</span><span class="n">back</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">error</span> <span class="o">=</span> <span class="p">(</span><span class="n">target</span> <span class="o">-</span> <span class="n">y_hat</span><span class="p">);</span>
<span class="c1">// backprop the error from output to input and step the weights</span>
<span class="k">for</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">weight_matrices</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span> <span class="p">;</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//calculating errors for previous layer</span>
<span class="k">auto</span> <span class="n">Wt</span> <span class="o">=</span> <span class="n">weight_matrices</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">T</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">prev_errors</span> <span class="o">=</span> <span class="n">Wt</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">error</span><span class="p">);</span>
<span class="c1">// apply derivative of function evaluated at activations</span>
<span class="c1">//backprop for biases</span>
<span class="k">auto</span> <span class="n">d_outputs</span> <span class="o">=</span> <span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">].</span><span class="n">apply_function</span><span class="p">(</span><span class="n">d_sigmoid</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">gradients</span> <span class="o">=</span> <span class="n">error</span><span class="p">.</span><span class="n">multiply_elementwise</span><span class="p">(</span><span class="n">d_outputs</span><span class="p">);</span>
<span class="n">gradients</span> <span class="o">=</span> <span class="n">gradients</span><span class="p">.</span><span class="n">multiply_scalar</span><span class="p">(</span><span class="n">lr</span><span class="p">);</span>
<span class="c1">// backprop for weights</span>
<span class="k">auto</span> <span class="n">a_trans</span> <span class="o">=</span> <span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">T</span><span class="p">();</span>
<span class="k">auto</span> <span class="n">weight_gradients</span> <span class="o">=</span> <span class="n">gradients</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a_trans</span><span class="p">);</span>
<span class="c1">//adjust weights</span>
<span class="n">bias_vectors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">bias_vectors</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">add</span><span class="p">(</span><span class="n">gradients</span><span class="p">);</span>
<span class="n">weight_matrices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">weight_matrices</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">add</span><span class="p">(</span><span class="n">weight_gradients</span><span class="p">);</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">prev_errors</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="creating-the-neural-net">Creating the neural net</h2>
<p>Let’s write a helper function that takes the input and output dimensionality along with the hidden-layer specifications, and returns an initialized model.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="nf">make_model</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">in_channels</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">out_channels</span><span class="p">,</span>
<span class="kt">size_t</span> <span class="n">hidden_units_per_layer</span><span class="p">,</span>
<span class="kt">int</span> <span class="n">hidden_layers</span><span class="p">,</span>
<span class="kt">float</span> <span class="n">lr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">size_t</span><span class="o">></span> <span class="n">units_per_layer</span><span class="p">;</span>
<span class="n">units_per_layer</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">in_channels</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">hidden_layers</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
<span class="n">units_per_layer</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">hidden_units_per_layer</span><span class="p">);</span>
<span class="n">units_per_layer</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">out_channels</span><span class="p">);</span>
<span class="n">nn</span><span class="o">::</span><span class="n">MLP</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">model</span><span class="p">(</span><span class="n">units_per_layer</span><span class="p">,</span> <span class="n">lr</span><span class="p">);</span>
<span class="k">return</span> <span class="n">model</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><img src="/assets/posts/nn_cpp/nn_architecture.png" alt="mlp" /></p>
<p>So if we want to initialize a model with 1D input and output, and 3 hidden layers with 8 hidden units each, we can call our function like:</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">model</span> <span class="o">=</span> <span class="n">make_model</span><span class="p">(</span>
<span class="cm">/*in_channels=*/</span><span class="mi">1</span><span class="p">,</span>
<span class="cm">/*out_channels=*/</span><span class="mi">1</span><span class="p">,</span>
<span class="cm">/*hidden_units_per_layer=*/</span><span class="mi">8</span><span class="p">,</span>
<span class="cm">/*hidden_layers=*/</span><span class="mi">3</span><span class="p">,</span>
<span class="cm">/*lr=*/</span><span class="mf">.5</span><span class="n">f</span><span class="p">);</span>
</code></pre></div></div>
<p>which would then yield a network architecture that looks like the above figure, with Gaussian random weights and biases.</p>
<p>Great – now that we’ve built a trainable multilayer perceptron on top of our bare-bones linear algebra library, it’s time to train it.</p>