by Johannes Lechner - July 16, 2021

ML Sample Generator Project | Phase 2 pt3

Convolutional Networks

Convolutional networks include one or more convolutional layers. These layers are typically used for feature extraction, and stacking several of them can often extract very detailed features. Depending on the input shape of the data, convolutional layers can be one- or multidimensional, but they are usually 2D because they are mainly used for working with images. The feature extraction is achieved by applying filters to the input data. The image below shows a very simple black and white (or pink and white) image and a size 3 filter that can detect vertical left-sided edges. The resulting image can then be shrunk without losing as much information as reducing the original’s dimensions would.

2D convolution with filter size 3 detecting vertical left edges
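To make the filter idea concrete, here is a minimal sketch in plain Python. The 5×5 image and the filter values are assumptions chosen for illustration (the exact filter from the figure is not reproduced here); the filter responds positively wherever the pixels to the right are brighter than the pixels to the left, which is one way to detect a vertical left edge.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image without padding.

    Like convolutional layers in most ML libraries, this computes a
    cross-correlation (the kernel is not flipped).
    """
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for row in range(out_height):
        out_row = []
        for col in range(out_width):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical image: 0 = white background, 1 = coloured vertical bar.
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]

# Assumed size 3 filter: positive where the right side is brighter than the left.
left_edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, left_edge_filter)

# Keeping only positive responses leaves the left edges and drops the right ones.
left_edges = [[max(0, value) for value in row] for row in feature_map]

for row in left_edges:
    print(row)
```

The 5×5 input becomes a 3×3 feature map, which is the kind of reduction the paragraph above describes: the output is smaller but still records where the edge is.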

In this project, all models containing convolutional layers are based on WaveGAN. For this, the samples had to be cut down to a length of 16384, as WaveGAN only works with windows of this size. In detail, the two models consist of five convolutional layers, each followed by a leaky rectified linear unit activation function, and one final dense layer. Both models were again trained for 700 epochs.

Convolutional Autoencoder

The convolutional autoencoder produces samples that only have the general shape of a snare drum. There is an impact and a tail but, like the small autoencoders, it is clicky. In contrast to the normal autoencoders, the sound is not noisy but rather ringing. The latent vector does change the sound, but a third party listening to the result would not guess that it is supposed to be a snare drum.

Ringy conv ae sample

GAN

The generative adversarial network worked much better than the autoencoder. While still being far from a snare drum sound, it produced a continuous latent space with samples resembling the shape of a snare drum. The sound itself, however, very closely resembles a bitcrushed version of the original samples. It would be interesting to develop this further, as the current results suggest that there is just something wrong with the layers. However, the network takes very long to train, which might be due to the custom implementation of the train function that it requires.

Bitcrushed sounding GAN sample

Variational Autoencoder

Variational autoencoders are a sub-type of autoencoders. The big difference to a vanilla autoencoder is the encoder’s last layer, the sampling layer. With this, variational autoencoders are designed to provide a continuous latent space, which is much better for generative models than just sampling from what has been provided. This is achieved by having the encoder output two different vectors instead of one: one for the standard deviation and one for the mean. This provides a distribution rather than a single point, leading to the decoder learning that an area is responsible for a feature and not a single sample.
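The sampling step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name and the example values are made up, and many implementations let the encoder output the logarithm of the variance instead of the standard deviation for numerical stability.

```python
import random


def sample_latent(mean, std, rng=random):
    """Draw one latent vector: z = mean + std * epsilon, epsilon ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


rng = random.Random(0)   # seeded so that the sketch is repeatable

# Hypothetical encoder output for a single input sample (2D latent space).
mean = [0.5, -1.0]
std = [0.1, 0.3]

# The same input lands on a slightly different latent point every time,
# so the decoder sees a small area instead of a single point.
for _ in range(3):
    print(sample_latent(mean, std, rng))
```

With a standard deviation of zero the layer collapses to the behaviour of a vanilla autoencoder, where every input maps to exactly one point.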

Training the variational autoencoder was especially troublesome as it required a custom class with its own train step function. The difficulty with this type of model is that the right mix between reconstruction loss and KL loss has to be found, otherwise the model produces unhelpful results. The currently trained models all have a ramp-up time of 30,000 batches until the KL loss takes full effect. This value gets multiplied by a different factor depending on the model. The trained versions use a factor of 0.01 (A), 0.001 (B), and 0.0001 (C). Model A produces a snare drum-like sound, but it is very metallic. Additionally, instead of having a continuous latent space, the sample does not change at all. Model B produces a much better sample but still does not include many changes. The main changes are the volume of the sample as well as it getting a little bit more clicky towards the edges of the y axis. Model C has much more varied sounds, but the continuity is more or less not present. In some areas the sample seems to get slightly filtered over one third of the vector’s axis but then rapidly changes the sound multiple times over the next 10%. Still, out of the three variational autoencoders, model C produced the best results.
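A minimal sketch of how such a weighting could look, assuming a linear ramp (the post only states the 30,000 batches and the three factors, not the shape of the ramp). The function names are hypothetical and not taken from the project code.

```python
RAMP_UP_BATCHES = 30_000                        # from the text above
FACTORS = {"A": 0.01, "B": 0.001, "C": 0.0001}  # from the text above


def kl_weight(batch_index, factor, ramp_up_batches=RAMP_UP_BATCHES):
    """Weight of the KL loss: grows linearly (assumed) from 0 to `factor`."""
    ramp = min(1.0, batch_index / ramp_up_batches)
    return factor * ramp


def total_loss(reconstruction_loss, kl_loss, batch_index, factor):
    """Mix of reconstruction loss and weighted KL loss for one batch."""
    return reconstruction_loss + kl_weight(batch_index, factor) * kl_loss


for name, factor in FACTORS.items():
    weights = [kl_weight(batch, factor) for batch in (0, 15_000, 30_000, 60_000)]
    print(name, weights)
```

Starting with a weight of zero lets the model learn to reconstruct first; the KL term then gradually pulls the latent space towards a continuous distribution.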

VAE with 0.01 contribution (A) sample

VAE with 0.001 contribution (B) sample

VAE with 0.0001 contribution (C) sample

Next Steps

As I briefly mentioned before, this project will ultimately run on a web server, which means the next step will be deciding how to run this app. Since all of the project has been written in Python so far, Django would be a good solution. But since TensorFlow offers a JavaScript library as well, this is not the only possible way to go. You will find out more about this in the next semester.

by Johannes Lechner - July 16, 2021

ML Sample Generator Project | Phase 2 pt2

Autoencoder Results

As mentioned in the previous post, I have trained nine autoencoders to (re)produce snare drum samples. For easier comparison I have visualized the results below. Each image shows the location of all ~7500 input samples in the latent space.

Rectified Linear Unit

All three graphics show that the samples are mostly close together but some are very far out. A continuous representation is not possible with any of the three models. Reducing the latent vector’s maximum on both axes definitely helps, but even then the resulting samples are not too pleasing to hear. The small network has clicks in the beginning and generates very quiet but noisy tails after the initial impact. The medium network includes some quite okay samples, but moving around in the latent space often produces similar, though less pronounced, issues as the small network. The big network produces the best sounding samples but has no continuous changes.

Clicky small relu sample

Noisy medium relu sample

Quite good big relu sample

Hyperbolic Tangent

These three networks each produce different patterns with a cluster at (0, 0). The similarities between the medium and the big network lead me to believe that, when increasing the number of trainable parameters, there is a smooth transition from random noise, to forming small clusters, to turning 45° clockwise and refining the clusters. Just like in the relu version, the reproduced audio samples of the small network contain clicks. The samples are, however, much better. The medium sized network is the best one out of all the trained models. It produces mostly good samples and has a continuous latent space. One issue, however, is that there are still some clicky areas in the latent space. The big network is only the second best overall because it mostly lacks a continuous latent space. The produced audio samples are, however, very pleasing to hear and resemble the originals quite well.

Clicky small tanh sample

Close-to-original medium tanh sample

Close-to-original big tanh sample

Sigmoid

This group shows a clear tendency to cluster more strongly the more trainable parameters exist. While in the above two groups the medium and the big network produced better results, in this case the small network is by far the best. The big network delivers primarily noisy audio samples. The medium network’s samples are very noisy as well, but they are more easily identifiable as snare drum sounds. The small network has by far the closest sounds to the originals but produces clicks at the beginning as well.

Clicky small sigmoid sample

Noisy medium sigmoid sample

Super noisy big sigmoid sample

In the third part of this series we will take a closer look at the other models.

by Johannes Lechner - July 16, 2021

ML Sample Generator Project | Phase 2 pt1

A few months ago I already explained a little bit about machine learning. This was because I started working on a project involving machine learning. Here’s a quick refresh on what I want to do and why:

Electronic music production often requires gathering audio samples from different libraries, which, depending on the library and on the platform, can be quite costly as well as time consuming. The core idea of this project was to create a simple application with as few parameters as possible that generates a drum sample for the end user via unsupervised machine learning. The interface’s editable parameters enable the user to control the sound of the generated sample, and a drag-and-drop space could map a dragged sample’s properties to the parameters. To simplify interaction with the program as much as possible, the dataset should only be learned once and not by the end user. Thus, the application would work with the trained models rather than the whole algorithm. This would be a benefit as the end result should be a web application where this project is run. Taking a closer look at the machine learning process, the idea was to train the network in the experimentation phase with snare drum samples from the library noiiz. With as many different networks as possible, this would then create a decently sized batch of models from which the best one could be selected for phase 3.

So far I have worked with four different models in different variations to gather some knowledge on what works and what does not. To evaluate them I created a custom GUI.

The GUI

Producing a GUI for testing purposes was pretty simple and straightforward. Implementing a Loop Play option required the use of threads, which was a little bit of a challenge, but working on the interface was possible without any major problems thanks to the library PySimpleGUI. The application worked mostly bug-free and enabled extensive testing of the models as well as saving some great samples. However, as can be seen below, this GUI is only usable for testing purposes and does not meet the specifications developed in the first phase of this project. For the final product a much simpler app should exist, and instead of being standalone it should run on a web server.

Autoencoders

An autoencoder is an unsupervised learning method where input data is encoded into a latent vector (therefore the name autoencoder). To get from the input to the latent vector multiple dense layers reduce the dimensionality of the data, creating a bottleneck layer and forcing the encoder to get rid of less important information. This results in data loss but also in a much smaller representation of input data. The latent vector can then be decoded back to produce a similar data sample to the original. While training an autoencoder, the weights and biases of individual neurons are modified to reduce data loss as much as possible.

In this project autoencoders seemed to be a valuable tool because audio samples, even when they are only 2 seconds long, can add up to a huge size. Training an autoencoder would reduce this information down to only a latent vector with a few dimensions and the trained model itself, which seems perfect for a web application. The past semester resulted in nine different autoencoders, each containing dense layers only. The autoencoders differ from each other in the number of trainable parameters, the activation function, or both. The chosen activation functions are rectified linear unit, hyperbolic tangent and sigmoid. These are used in all layers of the encoder as well as all layers of the decoder except for the last one, which has to get back to an audio sample (where individual data points are positive and negative).
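The three activation functions can be compared with a short sketch in plain Python (the input values are arbitrary examples). It also hints at why the decoder’s last layer is presumably treated differently: relu and sigmoid never output negative values, while audio data points can be negative.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


ACTIVATIONS = {"relu": relu, "tanh": math.tanh, "sigmoid": sigmoid}

inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]   # arbitrary example values
outputs = {name: [fn(x) for x in inputs] for name, fn in ACTIVATIONS.items()}

for name, values in outputs.items():
    print(name, [round(v, 3) for v in values])
```

Only tanh can produce negative outputs, and it is limited to the range between -1 and 1.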

Additionally, the autoencoders’ size (as in the number of trainable parameters) is one of the following three:

Two dense layers with units 9 and 2 (encoder) or 9 and sample length (decoder) with trainable parameters
Three dense layers with units 96, 24 and 2 (encoder) or 24, 96 and sample length (decoder) with trainable parameters
Four dense layers with units 384, 96, 24 and 2 (encoder) or 24, 96, 384 and sample length (decoder) with trainable parameters
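The number of trainable parameters for each size can be computed from the layer sizes. The sketch below assumes ordinary dense layers with one bias per unit and uses a placeholder sample length of 1000, which is hypothetical and not the length used in the project; the function names are made up for illustration.

```python
def dense_parameters(layer_sizes):
    """Weights plus biases for a stack of dense layers.

    `layer_sizes` lists the input size followed by each layer's unit count.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def autoencoder_parameters(sample_length, hidden_units, latent_size=2):
    """Parameters of an encoder plus a mirrored decoder."""
    encoder = [sample_length, *hidden_units, latent_size]
    decoder = encoder[::-1]
    return dense_parameters(encoder) + dense_parameters(decoder)


SAMPLE_LENGTH = 1000   # hypothetical placeholder, not the project's value

small = autoencoder_parameters(SAMPLE_LENGTH, [9])
medium = autoencoder_parameters(SAMPLE_LENGTH, [96, 24])
big = autoencoder_parameters(SAMPLE_LENGTH, [384, 96, 24])

print(small, medium, big)
```

Because the first encoder layer and the last decoder layer are connected to every data point of the sample, they dominate the total count.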

Combining these two attributes results in nine unique models, better understandable as a 3×3 matrix as follows:

	Small (2 layers)	Medium (3 layers)	Big (4 layers)
Rectified linear unit	Ae small relu	Ae med relu	Ae big relu
Hyperbolic tangent	Ae small tanh	Ae med tanh	Ae big tanh
Sigmoid	Ae small sig	Ae med sig	Ae big sig

All nine of the autoencoders above have been trained on the same dataset for 700 epochs. We will take a closer look at the results in the next post.

by Johannes Lechner - January 10, 2021 / January 19, 2021

Audio & Machine Learning (pt 3)

Part 3: Audio Generation using Machine Learning

Image processing and generation using machine learning have been significantly enhanced by deep neural networks. Even pictures of human faces can now be artificially created, as shown on thispersondoesnotexist.com. Images, however, are not that difficult to analyse. A 1024px-by-1024px image, as shown on thispersondoesnotexist, has “only” 1,048,576 pixels; split into three channels, that is 3,145,728 values. Now compare this to a two-second-long audio file. Keep in mind that two seconds really cannot contain much audio: certainly not a whole song, although drum samples can be cut down to only two seconds of playtime. An audio file usually has a sample rate of 44.1 kHz. This means that one second of audio contains 44,100 samples, two seconds therefore 88,200. CD quality audio wav files have a bit depth of 16 bit (which today is the bare minimum in digital audio workstations). So each of the 88,200 samples in a two-second mono file can take one of 2¹⁶ = 65,536 values, compared to 256 values per channel in a typical 8-bit image. That is a lot of detail to get right. But even though music, or audio generation in general, is a very human process and audio data can get very big very fast, machine learning can already provide convincing results.
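A quick way to check the sizes involved, assuming mono audio and 8 bits per image channel (both are assumptions for this sketch):

```python
# Image: 1024 x 1024 pixels with three colour channels.
image_pixels = 1024 * 1024
image_values = image_pixels * 3
levels_per_image_value = 2 ** 8        # assumed 8 bits per channel

# Audio: two seconds at 44.1 kHz with a bit depth of 16 bit.
sample_rate = 44_100
duration_seconds = 2
bit_depth = 16
audio_samples = sample_rate * duration_seconds
levels_per_audio_sample = 2 ** bit_depth
audio_bits = audio_samples * bit_depth

print("image values:", image_values)
print("audio samples:", audio_samples)
print("levels per image value:", levels_per_image_value)
print("levels per audio sample:", levels_per_audio_sample)
print("audio size in bits:", audio_bits)
```

The bit depth determines how many different values a single sample can take, while the sample rate and the duration determine how many samples there are.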

Midi

Before talking about analysing audio files, we have to talk about the number one workaround: midi. Midi files only store note data such as pitch, velocity, and duration, but not actual audio. The difference in file size is not even comparable which makes midi a very useful file type to be used in machine learning.

FlowMachines is one of the more popular projects that work with midi. It is a plugin for DAWs that can help musicians generate scores. Users can choose from different styles to sound like, for example, the Beatles. These styles correspond to different trained models. FlowMachines works so well that there is already commercial music produced with it.

Audio

Midi generation is a very useful helper, but it will not replace musicians. Generating audio on the other hand could potentially do that. Right now, generating short samples is the only viable way to go and it is just in its early stages but still, that could replace sample subscription services one day. One very recently developed architecture that seems to deliver very promising results is the GAN.

Generative Adversarial Networks

A generative adversarial network (GAN) simultaneously trains two models rather than one: A generator which trains with random values and captures the data distribution, and a discriminator which estimates the probability that a sample came from the training data rather than the generator. Through backpropagation both networks continuously enhance each other which leads to the generator getting better at generating fake data and the discriminator getting better at finding out whether the data came from the training data or the generator.
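The two competing objectives can be sketched with the standard binary cross-entropy formulation of GAN losses. This is a plain Python illustration with made-up discriminator outputs, not the loss of any specific audio model; some architectures use other loss functions.

```python
import math


def binary_cross_entropy(prediction, target):
    """Cross-entropy between a predicted probability and a 0/1 target."""
    eps = 1e-7
    p = min(max(prediction, eps), 1.0 - eps)   # avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(real_scores, fake_scores):
    """Low when real data is scored near 1 and generated data near 0."""
    real = sum(binary_cross_entropy(p, 1.0) for p in real_scores) / len(real_scores)
    fake = sum(binary_cross_entropy(p, 0.0) for p in fake_scores) / len(fake_scores)
    return real + fake


def generator_loss(fake_scores):
    """Low when the discriminator scores generated data near 1."""
    return sum(binary_cross_entropy(p, 1.0) for p in fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs (probability that a sample is real).
real_scores = [0.9, 0.8]
fake_scores = [0.2, 0.1]

print("discriminator loss:", round(discriminator_loss(real_scores, fake_scores), 3))
print("generator loss:", round(generator_loss(fake_scores), 3))
```

In this example the discriminator is winning: its loss is low while the generator’s loss is high. During training, each network’s weights are updated to lower its own loss, which raises the other’s.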

An already very sophisticated generative adversarial network for audio generation is WaveGAN. It can train on audio examples of up to 4 seconds in length at 16 kHz. The demo includes a simple drum machine with very clearly synthesized sounds but shows how GANs might be the right direction to go. But what GANs really have to offer is the parallel processing shown in GANSynth. Instead of predicting a single sample at a time, which autoregressive models are pretty good at, GANSynth can process multiple sequences in parallel, making it about 50,000 times faster than WaveNet.

https://magenta.tensorflow.org/gansynth
https://github.com/chrisdonahue/wavegan
https://www.musictech.net/news/sony-flowmachines-plug-in-uses-ai/

by Johannes Lechner - December 13, 2020 / January 19, 2021

Audio & Machine Learning (pt 1)

Part 1: What is Machine Learning?

Machine Learning is essentially just a type of algorithm that improves over time. But instead of humans adjusting the algorithm, the computer does it itself. In this process, computers discover how to do something without being explicitly programmed to do so. The benefit of such an approach to problem solving is that algorithms too complex for humans to develop can be learned by the machine. This leads to programmers being able to focus on what goes into and what comes out of the algorithm rather than the algorithm itself.

Approaches

There are three broad categories of machine learning approaches:

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Supervised learning is used for figuring out how to get from an input to an output. The inputs are labelled, meaning the dataset (or rather the trainset, the part of the dataset used for training) is already split up into categories. The goal of using supervised learning is to generate a model that can map inputs to outputs. An example would be automatic audio file tagging, such as deciding between drum and guitar.

Unsupervised learning is used when the input data has not been labelled. The algorithm has to find out on its own how to describe a dataset. Common use cases are feature learning and discovering patterns in data (which might not have been visible without machine learning).

Reinforcement learning is probably what you have seen on YouTube. These are the algorithms that interact with something (like a human would with a controller, for example) and are either punished or rewarded for their behavior. Algorithms learning to play Super Mario World or Tesla’s Autopilot are trained with reinforcement learning.

Of course, there are other approaches as well, but these are a minority, and it is easier to just stick with the three categories above.

Models

The process of machine learning is to create an algorithm which can describe a set of data. This algorithm is called a model. A model exists from the start and is then trained. Trained models can then be used, for example, to categorize files. There are various types of models:

Classifying
Regression
Clustering
Dimensionality reduction
Neural networks / deep learning

Classifying models are used to (you guessed it) classify data. They predict the type of data which can be several options (for example colors). One of the simplest classifying models is a decision tree which follows a flowchart-like concept of asking a question and getting either yes or no as an answer (or in more of a programmer’s terms: if and else statements). If you think of it as a tree (the way it is meant to be understood) you start at the root with one question, then get on to a branch where the next question is until you reach a leaf, which represents the class or tag you want to assign.
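Written as if and else statements, a tiny decision tree could look like the following sketch. The features and the threshold are hypothetical and hand-written purely for illustration; a real decision tree learns its questions from labelled data.

```python
def classify_sample(duration_seconds, has_clear_pitch):
    """Walk a hand-written decision tree from the root to a leaf."""
    if duration_seconds < 1.0:      # root: is the sound short?
        if has_clear_pitch:         # branch: does it have a clear pitch?
            return "guitar"         # leaf
        return "drum"               # leaf
    return "guitar"                 # leaf


print(classify_sample(0.3, has_clear_pitch=False))
print(classify_sample(0.3, has_clear_pitch=True))
print(classify_sample(2.5, has_clear_pitch=False))
```

Every call follows exactly one path through the tree, and the leaf at the end of that path is the assigned tag.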

Regression models come from statistical analysis. There is a multitude of regression models, the simplest of which is linear regression. Linear regression tries to describe a dataset with just one linear function. The data is mapped onto a 2-dimensional space and then a linear function which “kind of” fits all the data is drawn. An example of regression analysis would be Microsoft Excel’s trendline tool.

non-linear regression | from not enough learning (left) to overfitting (right)
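For the linear case, the line that “kind of” fits the data can be computed directly with the least squares method. The sketch below uses plain Python and made-up data points.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up, slightly noisy data points.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))

# The fitted function can then predict a value for an unseen input.
print("prediction for x = 6:", round(slope * 6 + intercept, 3))
```

The fit minimizes the sum of squared vertical distances between the points and the line.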

Clustering is used to group similar objects together. If you have unclassified data and want to make use of supervised learning, clustering models can automatically group the objects into classes for you.
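Grouping unlabelled data can be sketched with k-means, one common clustering algorithm. The example below is a simplified one-dimensional version in plain Python; the values (imagined sample durations in seconds) and the starting centroids are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Simplified 1D k-means: assign to nearest centroid, then move centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(cluster) / len(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabelled durations in seconds.
durations = [0.1, 0.15, 0.2, 1.8, 2.0, 2.2]

centroids, clusters = k_means_1d(durations, centroids=[0.0, 1.0])
print("centroids:", [round(c, 3) for c in centroids])
print("clusters:", clusters)
```

The resulting groups can then be given names by hand and used as labels for supervised learning.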

Dimensionality reduction models (again an aptronym) reduce the dimensionality of the dataset. The dimensionality is the number of variables used to describe a dataset. As different variables usually do not contribute equally to the dataset, it can still be reliably described by fewer variables. One example of dimensionality reduction is principal component analysis (PCA). In 2D space, PCA generates a best-fitting line, which is the line with the least squared distance from the points to the line.

2D principal component analysis | the ideal state would be when the red lines are the smallest
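In 2D the best-fitting line can be computed in closed form from the variances and the covariance of the points. The following plain Python sketch uses made-up points; projecting them onto the line turns every 2D point into a single number.

```python
import math


def principal_axis(points):
    """Return the centre of the points and the unit direction of the first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the major axis of the 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(angle), math.sin(angle))


def project(points, centre, direction):
    """Reduce each 2D point to its 1D coordinate along the principal axis."""
    return [(x - centre[0]) * direction[0] + (y - centre[1]) * direction[1]
            for x, y in points]


# Made-up points scattered roughly along a diagonal.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]

centre, direction = principal_axis(points)
print("centre:", centre)
print("direction:", tuple(round(d, 3) for d in direction))
print("1D coordinates:", [round(c, 3) for c in project(points, centre, direction)])
```

The distances that get lost in this projection are the perpendicular distances from the points to the line, which is what PCA keeps as small as possible.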

Deep learning will be covered in part 2, as it is the main focus of this series.

https://en.wikipedia.org/wiki/Glossary_of_artificial_intelligence
https://www.educba.com/machine-learning-models/
https://www.educba.com/machine-learning-algorithms/