ML Sample Generator Project | Phase 2 pt2

Autoencoder Results

As mentioned in the previous post, I trained nine autoencoders to (re)produce snare drum samples. For easier comparison, I have visualized the results below. Each image shows where all ~7500 input samples are located in the latent space.
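To make the comparison grid explicit, here is a minimal sketch of how the nine variants line up: three activation functions times three network sizes. The activation formulas are the standard definitions; the names `ACTIVATIONS`, `SIZES` and `MODEL_GRID` are made up for this illustration, and the sketch says nothing about the actual layer widths or where in each network the activation is applied.

```python
import math
from itertools import product

# Standard definitions of the three activation functions compared in this post.
def relu(x: float) -> float:
    return max(0.0, x)                  # output range: [0, inf)

def tanh(x: float) -> float:
    return math.tanh(x)                 # output range: (-1, 1)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))   # output range: (0, 1)

# Hypothetical lookup tables, used here for illustration only.
ACTIVATIONS = {"relu": relu, "tanh": tanh, "sig": sigmoid}
SIZES = ["small", "medium", "big"]

# One label per trained autoencoder: 3 activations x 3 sizes = 9 models.
MODEL_GRID = [f"{size} {name} ae" for name, size in product(ACTIVATIONS, SIZES)]

for label in MODEL_GRID:
    print(label)
```

The labels follow the same order as the sections below: first the three relu networks, then tanh, then sigmoid.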

Rectified Linear Unit
Small relu ae
Medium relu ae
Big relu ae

All three plots show that most samples lie close together while a few are very far out. None of the three models allows a continuous representation. Reducing the latent vector’s maximum on both axes definitely helps, but even then the resulting samples are not very pleasant to listen to. The small network has clicks at the beginning and generates very quiet but noisy tails after the initial impact. The medium network includes some quite okay samples, but moving around in the latent space often produces similar, though less pronounced, issues as the small network. The big network produces the best-sounding samples but has no continuous changes.
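One way to read "reducing the latent vector's maximum on both axes" is to clip each latent coordinate to a fixed range before it is passed to the decoder. The sketch below assumes exactly that. The function name `clamp_latent`, the symmetric range and the value of `LIMIT` are hypothetical and not the settings used for the results above.

```python
from typing import Tuple

Point = Tuple[float, float]

def clamp_latent(point: Point, limit: float) -> Point:
    """Clip both latent coordinates to the range [-limit, limit]."""
    x, y = point
    return (max(-limit, min(limit, x)), max(-limit, min(limit, y)))

LIMIT = 2.0  # hypothetical bound, chosen only for this example

# A far-out point is pulled back to the edge of the allowed square,
# while a point that is already inside stays where it is.
print(clamp_latent((5.3, -0.4), LIMIT))  # (2.0, -0.4)
print(clamp_latent((0.1, 0.2), LIMIT))   # (0.1, 0.2)
```

This keeps exploration inside the dense region where most input samples sit and away from the sparse outliers.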

Clicky small relu sample
Noisy medium relu sample
Quite good big relu sample

Hyperbolic Tangent
Small tanh ae
Medium tanh ae
Big tanh ae

These three networks each produce a different pattern, all with a cluster at (0|0). The similarities between the medium and the big network lead me to believe that, as the number of trainable parameters increases, there is a smooth transition from random noise, to forming small clusters, to rotating 45° clockwise and refining the clusters. Just like the relu version, the small network reproduces audio samples that contain clicks. The samples are, however, much better. The medium-sized network is the best of all the trained models. It produces mostly good samples and has a continuous latent space. One issue, however, is that there are still some clicky areas in the latent space. The big network is the second best overall, held back because it, too, mostly lacks a continuous latent space. The audio samples it produces are, however, very pleasing to hear and resemble the originals quite well.
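A rough way to check whether a latent space is "continuous" is to walk along a straight line between two latent points, decode every step, and look at how much neighbouring outputs differ. The following is only a sketch under assumptions: `latent_walk` and `largest_jump` are hypothetical helpers, the mean absolute difference is just one possible measure, and `toy_decode` is a stand-in for a trained decoder.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]

def latent_walk(start: Point, end: Point, steps: int) -> List[Point]:
    """Return `steps` evenly spaced latent points from start to end (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        (
            start[0] + (end[0] - start[0]) * i / (steps - 1),
            start[1] + (end[1] - start[1]) * i / (steps - 1),
        )
        for i in range(steps)
    ]

def largest_jump(decode: Callable[[Point], Sequence[float]], path: List[Point]) -> float:
    """Largest mean absolute difference between consecutively decoded outputs."""
    decoded = [decode(p) for p in path]
    jumps = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(decoded, decoded[1:])
    ]
    return max(jumps)

# Stand-in decoder for demonstration only; a real check would call the trained model.
def toy_decode(p: Point) -> List[float]:
    return [p[0] * 0.5, p[1] * 0.5, (p[0] + p[1]) * 0.25]

path = latent_walk((-1.0, -1.0), (1.0, 1.0), steps=5)
print(path)
print(largest_jump(toy_decode, path))
```

A small largest jump along many such walks would suggest smooth transitions between samples, while a few large jumps would point to the abrupt changes described for the big networks.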

Clicky small tanh sample
Close-to-original medium tanh sample
Close-to-original big tanh sample

Sigmoid
Small sig ae
Medium sig ae
Big sig ae

This group shows a clear tendency to cluster more strongly as the number of trainable parameters grows. While in the two groups above the medium and the big networks produced better results, here the small network is by far the best. The big network delivers primarily noisy audio samples; the medium network's samples are very noisy as well, but they are more easily identifiable as snare drum sounds. The small network comes by far the closest to the originals but, again, produces clicks at the beginning.

Clicky small sigmoid sample
Noisy medium sigmoid sample
Super noisy big sigmoid sample

In the third part of this series we will take a closer look at the other models.

Self Publishing Musician & Label Owner. I do everything myself including video editing, motion graphics, full stack web dev, etc. And I'm a Sound Design student as well.