ML Sample Generator Project | Phase 2 pt3

Convolutional Networks

Convolutional networks include one or more convolutional layers, which are typically used for feature extraction. Stacking several of them can extract very detailed features. Depending on the input shape of the data, convolutional layers can be one- or multidimensional, but they are usually 2D, as they are mainly used for working with images. The feature extraction is achieved by applying filters to the input data. The image below shows a very simple black and white (or pink and white) image with a size-3 filter that detects vertical left-sided edges. The resulting image can then be shrunk without losing as much data as reducing the original's dimensions would.

2D convolution with filter size 3 detecting vertical left edges
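
To make the filtering step concrete, here is a small NumPy sketch (illustrative only, not the project's code) of a size-3 filter that detects vertical left-sided edges:

```python
import numpy as np

# A 6x6 "image": a bright (1) vertical bar on a dark (0) background.
image = np.zeros((6, 6))
image[:, 2:4] = 1.0

# Size-3 filter responding to a dark-to-bright transition, i.e. a
# vertical left-sided edge, applied horizontally.
kernel = np.array([[-1.0, 0.0, 1.0]])

# "Valid" 2D convolution done by hand (correlation form, as ML libraries use).
h, w = image.shape
kh, kw = kernel.shape
out = np.zeros((h - kh + 1, w - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

# The bar's left edge (dark turning bright) lights up with +1,
# its right edge with -1, everything else stays 0.
```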

In this project, all models containing convolutional layers are based on WaveGAN. This required cutting the samples down to a length of 16384, as WaveGAN only works with windows of this size. In detail, the two models consist of five convolutional layers, each followed by a leaky rectified linear unit activation function, and one final dense layer. Both models were again trained for 700 epochs.
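
A stack like this could be sketched in Keras as follows; the filter counts, kernel size and strides are my assumptions for illustration, not the project's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_conv_net(window_len=16384):
    """Five Conv1D layers, each followed by LeakyReLU, then one dense layer."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(window_len, 1)))
    # Filter counts and strides are illustrative assumptions.
    for filters in (64, 128, 256, 512, 1024):
        model.add(layers.Conv1D(filters, kernel_size=25,
                                strides=4, padding="same"))
        model.add(layers.LeakyReLU(0.2))
    model.add(layers.Flatten())
    model.add(layers.Dense(1))  # the final dense layer
    return model
```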

Convolutional Autoencoder

The convolutional autoencoder produces samples that only have the general shape of a snare drum. There is an impact and a tail, but like the small autoencoders it is clicky. In contrast to the normal autoencoders, though, the overall sound is not noisy but rather ringing. The latent vector does change the sound, but a third party hearing these samples would not guess that they are supposed to be snare drums.

Ringy conv ae sample

The generative adversarial network worked much better than the autoencoder. While still far from a snare drum sound, it produced a continuous latent space with samples resembling the shape of a snare drum. The sound itself, however, very closely resembles a bitcrushed version of the original samples. It would be interesting to develop this further, as the current results suggest that something is simply wrong with the layers, but the network takes very long to train, which might be due to the custom implementation of the train function it requires.

Bitcrushed sounding GAN sample

Variational Autoencoder

Variational autoencoders are a sub-type of autoencoders. Their big difference from a vanilla autoencoder is the encoder's last layer, the sampling layer. With it, variational autoencoders always provide a continuous latent space, which is much better for generative models than merely sampling from what has been provided. This is achieved by having the encoder output two vectors instead of one: one for the standard deviation and one for the mean. Together they describe a distribution rather than a single point, so the decoder learns that an area, not a single sample, is responsible for a feature.
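
The sampling step can be sketched in a few lines; this is a generic illustration of the reparameterization trick, not the project's code:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_latent(mean, log_var):
    """Draw z = mean + sigma * epsilon, with epsilon ~ N(0, 1).

    The encoder outputs `mean` and `log_var` (log of the variance);
    sigma is recovered as exp(0.5 * log_var).
    """
    epsilon = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * epsilon

# Two-dimensional latent space, as used for the autoencoders in this project.
z = sample_latent(np.array([0.0, 1.0]), np.array([-2.0, -2.0]))
```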

Training the variational autoencoder was especially troublesome, as it required a custom class with its own train step function. The difficulty with this type of model is finding the right mix between reconstruction loss and KL loss; otherwise the model produces unhelpful results. The currently trained models all have a ramp-up time of 30,000 batches until the KL loss takes full effect. This value gets multiplied by a different factor depending on the model: the trained versions use factors of 0.01 (A), 0.001 (B) and 0.0001 (C). Model A produces a snare-drum-like sound, but it is very metallic. Additionally, instead of having a continuous latent space, the sample does not change at all. Model B produces a much better sample but still does not include many changes; the main ones are the volume of the sample and it getting a little more clicky towards the edges of the y axis. Model C has many more different sounds, but continuity is more or less absent: in some areas the sample seems to get slightly filtered over one third of the vector's axis, but then rapidly changes the sound multiple times over the next 10%. Still, out of the three variational autoencoders, model C produced the best results.
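
The ramp-up described above can be written as a small weighting function (values taken from the text; the linear shape of the ramp is my assumption):

```python
def kl_weight(batch, ramp_batches=30_000, factor=0.01):
    """Weight of the KL loss term at a given training batch.

    Rises linearly over the first `ramp_batches` batches, then stays at
    the model-specific factor (0.01 for A, 0.001 for B, 0.0001 for C).
    """
    return min(batch / ramp_batches, 1.0) * factor

# Inside a custom train step, the total loss would then be:
# total_loss = reconstruction_loss + kl_weight(batch) * kl_loss
```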

VAE with 0.01 contribution (A) sample
VAE with 0.001 contribution (B) sample
VAE with 0.0001 contribution (C) sample

Next Steps

As I briefly mentioned before, this project will ultimately run on a web server, which means the next step is deciding how to serve the app. Since the whole project has been written in Python so far, Django would be a good fit. But since TensorFlow offers a JavaScript library as well, this is not the only possible way to go. You will find out more about this next semester.

ML Sample Generator Project | Phase 2 pt2

Autoencoder Results

As mentioned in the previous post, I have trained nine autoencoders to (re)produce snare drum samples. For easier comparison I have visualized the results below. Each image shows the location of all ~7,500 input samples in the latent space.

Rectified Linear Unit
Small relu ae
Medium relu ae
Big relu ae

All three graphics show that the samples mostly lie close together, while some are very far out. A continuous representation is not possible with any of the three models. Reducing the latent vector's maximum on both axes definitely helps, but even then the resulting samples are not too pleasing to hear. The small network has clicks in the beginning and generates very quiet but noisy tails after the initial impact. The medium network includes some quite okay samples, but moving around in the latent space often produces similar but less pronounced issues than the small network. And the big network produces the best-sounding samples but has no continuous changes.

Clicky small relu sample
Noisy medium relu sample
Quite good big relu sample
Hyperbolic Tangent
Small tanh ae
Medium tanh ae
Big tanh ae

These three networks each produce different patterns with a cluster at (0|0). The similarities between the medium and the big network lead me to believe that, as the number of trainable parameters increases, there is a smooth transition from random noise to forming small clusters, to turning 45° clockwise and refining the clusters. Just like the relu version, the reproduced audio samples of the small network contain clicks; the samples are, however, much better. The medium-sized network is the best of all the trained models: it produces mostly good samples and has a continuous latent space. One remaining issue is that there are still some clicky areas in the latent space. The big network is the second best overall, as it too mostly lacks a continuous latent space. The audio samples it produces are, however, very pleasing to hear and resemble the originals quite well.

Clicky small tanh sample
Close-to-original medium tanh sample
Close-to-original big tanh sample
Sigmoid
Small sig ae
Medium sig ae
Big sig ae

This group shows a clear tendency to form clusters as the number of trainable parameters grows. While in the two groups above the medium and the big network produced better results, in this case the small network is by far the best. The big network delivers primarily noisy audio samples; the medium network's are very noisy as well, but better identifiable as snare drum sounds. The small network has by far the closest sounds to the originals but also produces clicks at the beginning.

Clicky small sigmoid sample
Noisy medium sigmoid sample
Super noisy big sigmoid sample

In the third part of this series we will take a closer look at the other models.

ML Sample Generator Project | Phase 2 pt1

A few months ago I already explained a little bit about machine learning. This was because I started working on a project involving machine learning. Here’s a quick refresh on what I want to do and why:

Electronic music production often requires gathering audio samples from different libraries, which, depending on the library and the platform, can be quite costly as well as time-consuming. The core idea of this project was to create a simple application with as few parameters as possible that generates a drum sample for the end user via unsupervised machine learning. The interface's editable parameters let the user control the sound of the generated sample, and a drag-and-drop space could map a dragged sample's properties to the parameters. To simplify interaction with the program as much as possible, the dataset should only be learned once, and not by the end user. Thus the application would work with the trained models rather than the whole algorithm, a benefit since the end result should be a web application. Taking a closer look at the machine learning process, the idea was to train the networks in the experimentation phase with snare drum samples from the library Noiiz. With as many different networks as possible, this would create a decently sized batch of models from which the best one could be selected for phase 3.

So far I have worked with four different models in different variations to gather some knowledge on what works and what does not. To evaluate them I created a custom GUI.


Producing a GUI for testing purposes was pretty simple and straightforward. Implementing a loop-play option required the use of threads, which was a bit of a challenge, but working on the interface posed no major problems thanks to the library PySimpleGUI. The application worked mostly bug-free, enabled extensive testing of the models and already yielded some great samples. However, as can be seen below, this GUI is only usable for testing purposes and does not meet the specifications developed in the first phase of this project. The final product should be a much simpler app, and instead of being standalone it should run on a web server.


An autoencoder is an unsupervised learning method in which input data is encoded into a latent vector (hence the name autoencoder). To get from the input to the latent vector, multiple dense layers reduce the dimensionality of the data, creating a bottleneck layer and forcing the encoder to discard less important information. This results in data loss, but also in a much smaller representation of the input data. The latent vector can then be decoded to produce a data sample similar to the original. While training an autoencoder, the weights and biases of the individual neurons are adjusted to reduce this data loss as much as possible.

In this project autoencoders seemed to be a valuable tool, because audio samples, even when as short as only two seconds, can add up to a huge size. Training an autoencoder reduces this information down to a latent vector with just a few dimensions plus the trained model itself, which seems perfect for a web application. The past semester resulted in nine different autoencoders, each containing dense layers only. The autoencoders differ from each other in the number of trainable parameters, the activation function, or both. The chosen activation functions are rectified linear unit, hyperbolic tangent and sigmoid. They are used in all layers of the encoder and all layers of the decoder except the last one, which has to map back to an audio sample (where individual data points are positive and negative).

Additionally, the autoencoders’ size (as in the amount of trainable parameters) is one of the following three: 

  • Two dense layers with units 9 and 2 (encoder) or 9 and sample length (decoder) with trainable parameters
  • Three dense layers with units 96, 24 and 2 (encoder) or 24, 96 and sample length (decoder) with trainable parameters
  • Four dense layers with units 384, 96, 24 and 2 (encoder) or 24, 96, 384 and sample length (decoder) with trainable parameters

Combining these two attributes results in nine unique models, better understandable as a 3×3 matrix as follows:

                      | Small (2 layers) | Medium (3 layers) | Big (4 layers)
Rectified linear unit | Ae small relu    | Ae med relu       | Ae big relu
Hyperbolic tangent    | Ae small tanh    | Ae med tanh       | Ae big tanh
Sigmoid               | Ae small sig     | Ae med sig        | Ae big sig

All nine autoencoders above have been trained on the same dataset for 700 epochs. We will take a closer look at the results in the next post.

INDEPTH Sound Design

Indepth Sound Design is a sound design channel on YouTube that explores the philosophy and techniques of sound design. To do so, it shows and explains examples from real films. Indepth Sound Design describes itself as a treasure trove of educational sound deconstructions, audio stem breakdowns and other sonic inspirations. The site was created by Mike James Gallagher.

Examples of sound design deconstruction:

This example looks at the film Independence Day, breaking its sound down into the different layers. The scene is 3:45 long and is played four times.

First with only the sound effects, then only dialogue and Foley, then only the music, and finally everything together in the final mix.

The second example covers a 1:09 min scene from Terminator 2. This scene, too, is shown separately with the layers sound FX, ambience, Foley, music and final mix.

Afterwards, the film's sound designer Gary Rydstrom talks about how the sound design for this scene came together.


Indepth Sound Design

How Music Producers Find Their “Sound”

Do you catch yourself recognising whose track or song you are listening to when you're just shuffling randomly through Spotify, even before you look at the artist name? That is because successful music producers make sure you can instantly recognise them. This is quite beneficial, because it imprints on the listener's mind and makes them more likely to recognise and share the artist's future releases with their network.

So how do musicians and music producers do this? There are some key points that can easily help you understand this occurrence better.

1) There’s no shortcut! 

You know the 10,000-hour rule? Or, as some have put it in the musical context, the 1,000-songs rule? There's really no way around it! This applies to any skill in life, not just music. However, the end consumer usually never knows how many songs an artist never releases. Those are all practice songs. For every release you see out there, there might be hundreds of unreleased songs done prior to it. If the musician just keeps creating instead of getting hung up on one song, they will eventually grow into their own unique way of structuring, as well as editing, songs.

2) They use unique elements 

So many producers and musicians use samples from Splice, which leads to the listener feeling like they've already heard a song even if they haven't. Songs get lost in the sea of similar musical works, but every now and then something with a unique flavour pops up and it's hard to forget. Musicians who make their own synth sounds, play exotic instruments or even build their own instruments are the ones that stick around in our minds.

3) Using the same sound in multiple songs

This is the easiest and most obvious way in which musicians and producers show their own style. You might hear a similar bass or drum pattern in multiple songs or tracks from the same musician. In rap/hiphop you will also hear producer tags (e.g. "DJ Khaled" being said at the beginning of each track).

4) Great Musicians/Producers don’t stick to one style/trend

Music has existed for so long and progressed so fast lately that it is hard to stand out, especially if you stick strictly to genres. Nowadays, great musicians come up with their own subgenres or mix a few different ones into a musical piece. You won't really remember the musicians or producers who just follow in the footsteps of the greats who already established a certain genre. If you can't quite put your finger on why you like someone's music so much and why they sound "different", they are probably experimenting with a combination of genres.


The Basics

The picture above shows Harmor's interface. We can group it into three sections: the red part, the gray part and the window on the right. Firstly, the easiest section to understand is the window on the right. Harmor is an additive synthesizer, which means the sounds it generates are made up of sine waves added on top of each other. The window on the right displays the frequencies of the individual sine waves played over the last few seconds. Secondly, the red window is where most of the sound is generated. There are different sections and color-coded knobs to help identify what works together. Left of the center you can see an A/B switch: the red section exists twice, once for state A and once for state B, and these states can be mixed together via the fader below. Lastly, the gray area is for global controls. The only exception is the IMG tab, which we will cover a little later.

As you can see, there are many knobs, tabs and dropdowns. In addition, most of the processing can be altered with envelopes. These allow the user to draw a graph with arbitrarily many points and either use it as an ADSR curve or an LFO, or map it to keyboard, velocity, X, Y & Z quick modulation and more. At this point it might already be clear that Harmor is a hugely versatile synth. It is marketed as an additive / subtractive synthesizer and offers an immense amount of features, which we will take a closer look at now.

Additive or Subtractive?

As mentioned above, Harmor is marketed as an additive / subtractive synthesizer. But what does that mean? While Harmor is built on additive synthesis as its foundation, the available features closely resemble those of a typical subtractive synth. Because Harmor is additive, no audio streams are processed directly; instead, a table of frequency and amplitude data is manipulated, resulting in an efficient, accurate and sometimes very unfamiliar and creative way to generate audio. Harmor features four of these additive / subtractive oscillators. Two can be seen in the image above in the top left corner; these can be mixed in different modes and then mixed again with the other two via the A/B switch. In addition to the four oscillators, Harmor can also synthesize sound from the IMG section: the user can drag and drop audio or image files in, and Harmor can act like a sampler, re-synthesizing audio or even generating audio from images drawn in Photoshop.
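
The additive principle itself is easy to sketch in a few lines of Python (a generic illustration, unrelated to Harmor's actual implementation):

```python
import numpy as np

def additive(freqs, amps, sr=44100, seconds=1.0):
    """Sum sine partials, each with its own frequency and amplitude.

    This frequency/amplitude pairing is exactly the kind of table an
    additive synth manipulates instead of processing an audio stream.
    """
    t = np.arange(int(sr * seconds)) / sr
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

# A sawtooth-like tone: partials at multiples of 110 Hz with 1/n amplitudes.
tone = additive([110 * n for n in range(1, 16)],
                [1 / n for n in range(1, 16)])
```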

The Generator Section

As you can see, in addition to the different subsections being walled in by dotted lines, this section is color-coded as well. The Timbre section allows you to select any waveform by, again, drawing it, and then morphing between two of them with different mixing modes. Harmor allows you to import a single-cycle waveform to generate the envelope, but you can also import any sample and generate a waveform from it. Here is an example where I dragged a full song into it and processed it with the internal compressor module only:

The blur module allows you to generate reverb-like effects and also preverb. Tremolo generates the effect of a stereo vibrato (think jazz organs). The harmonizer clones existing harmonics at the defined offset/octaves. And prism shifts partials away from their original relationship with the fundamental frequency: a little prism usually generates a detune-like effect, more of it usually produces metallic sounds. And here is the interesting part: as with many other parameters, you can edit the harmonic prism mapping via the envelopes section. This allows you to create an offset to the amount knob on a per-frequency basis. Here is an example of prism in use:

As you can see in the analyzer on the right, there is movement over time. In the harmonic prism envelope I painted a graph so that the knob does not modify lower frequencies but only kicks in from +3 octaves upwards.
The other options in this section, unison, pitch, vibrato and legato, should be familiar from other synthesizers.

The Filter Section

As seen above, Harmor features two filters per state. Each filter can use a curve selected from the presets menu; the presets include low pass, band pass, high pass and comb filtering. Additionally, you can draw your own curve as explained in the Basics section above. The filters also offer controls for the envelope mix, keyboard tracking, width, actual frequency and resonance. But the cool thing is how these filters are combined: the knob in the middle lets you fade between only filter 2, parallel processing, only filter 1, filter 1 plus serial processing, and serial processing only. In the bottom half there is a one-knob pluck control as well as a phaser module with, again, custom-shaped filters.

The Bottom Section

As you can see above, the bottom section features some general global functions. On the left side most should be clear. The XYZ coordinate grid offers a fast way to automate many parameters by mapping them to X, Y or Z and then just editing events in the DAW. On the top right, however, there are four tabs that open new views. Above we have already seen the ENV section, where you can modulate just about anything. The green tab is the image tab. We already know that Harmor can generate sound from images and sound (note that this is a different way of using existing sound: before, I loaded it into an oscillator, whereas now we are talking about the IMG tab). On the right you can see a whole lot of knobs, some of which can be modified by clicking in the image. C and F are coarse and fine playback speed adjustments, and time is the time offset. The other controls change how the image is interpreted and could partially be outsourced to image editors. I am going to skip this part, as the post would otherwise get a whole lot more complicated. It would probably be best to just try it out yourself.

The third tab contains some standard effects. These are quite good, but the compressor especially stands out, as it rivals the easy-but-useful character of OTT.

And finally, the last section: Advanced (did you really think this was advanced until now? :P). Literally the whole plugin can be restructured here. I usually only go in here to enable perfect precision mode, threaded mode (enables multi-core processing) and high precision image resynthesis. Most of these features are rarely needed and seem more like debugging features, so I will not go into detail about them, but as before I encourage you to try them out. Harmor can be very overwhelming, and as many people mention in reviews: “Harmor’s biggest strength is also it’s greatest weakness, and probably why there are so few reviews for such an amazing synth. You can use Harmor for years, and still feel like a noob only scratching the surface. That makes writing a review difficult. How can you give an in-depth review, when you feel so green behind the ears? You only need to watch a few YT videos (e.g. Seamless) or chat with another user to discover yet another side to this truly versatile beast.”

Harmor on KVR ⬈
Harmor on Image-Line ⬈
Harmor Documentation ⬈ (a whole lot more details and a clickable image if you have more detailed questions)

Sound Design – What Does Magic Sound Like? A Look At How The Harry Potter Films Redefined The Sound Of Magic

Here is an interesting video on the sound design of magic in the Harry Potter series of films.

This video suggests that although there were some indications in literature as to what magic might sound like, the medium of film had never seen such a formalisation of the sound of magic before the Harry Potter films came along: such a variety of spells cast with specific gestures and feelings. Even if the film makers didn't quite know what it should all sound like, they definitely knew that they didn't want it to sound like shooting scenes from science fiction films.

In preparation for the first Harry Potter film, director Chris Columbus told supervising sound editor Eddy Joseph that he didn't want anything modern, futuristic or electronic. Although the sound of magic did change and develop throughout the series of films, it is said that this was a mantra the film makers and sound designers continued to hold to.

Instead, if the spell being cast had a specific sound related to it, they would use that, like water, fire, freezing, etc. Sometimes the sound of what the spell is impacting is all that is needed. When it comes to levitation, silence works just fine. But there are plenty of examples where the magic doesn't have a specific sound attached, and this is where the sound designers get the chance to be creative.

There is no doubt that the sound of magic developed through the Harry Potter films, but there was a major change with the third film, The Prisoner of Azkaban: out went the explosions and whooshes, and in came much softer sounds for most of the spells, which made the magic less aggressive and more mysterious. That is the style the team built on for the key Patronus spell, which is built out of a chorus of voices.

Another development through the Harry Potter films is that the spell sounds become more personal and appropriate for each character, giving the impression that the spell comes out of the magician just like their breath.

Watch the video and hear and see the examples played out.

The genius of Trent Reznor

One of the most influential bands of our time is certainly the American band Nine Inch Nails (NIN), founded by singer / composer / programmer / multi-instrumentalist / visionary / genius Trent Reznor in 1988.

Nine Inch Nails have sold over 20 million records, were nominated for 13 Grammys and won two. Time magazine named Reznor one of its most influential people in 1997, while Spin magazine once described him as "the most vital artist in music".

Their concerts are characterized by their extensive use of thematic visuals, complex special effects and elaborate lighting; songs are often rearranged to fit a given performance, and melodies or lyrics of songs that are not scheduled to play are sometimes assimilated into other songs.

Trent is also famous for soundtracks of him along with his bandmate Atticus Ross.

They (he) deconstructed the traditional rock song, a bit like Sonic Youth did, but went in a more electronic and aggressive direction. Their early work is characterized by massive use of industrial sounds (although not as massive as the Berliners Einstürzende Neubauten), while lately the focus is on analog and modular synths.

The sound design work is a really important part of their composition, as important as the harmony and the melody. They probably used every electronic instrument (and piece of software) they could find, turning them all into their signature and creating that industrial NIN sound. Reznor's sound is always clearly identifiable, and part of that is due to his sound design, which includes digital distortion and processed noise from a wide variety of sound sources.

What I find really impressive, besides the sound design and beautiful dark lyrics, is the unique choice of harmony and melody progression.

Nothing is predictable, and even in the simplest progression there is that note that takes you deep into Reznor's mind, away from any other musical world.

Reznor’s music has a decidedly shy tone that sets the stage for his often obscure lyrics.

His use of harmony, chords and melody also has a huge impact on his sound. In the movie Sound City, Reznor explains that he has a foundation in music theory, especially with regard to the keyboard, and that this subconsciously influences his writing:

“My grandma pushed me into piano.  I remember when I was 5, I started taking classical lessons.  I liked it, and I felt like I was good at it, and I knew in life that I was supposed to make music. I practiced long and hard and studied and learned how to play an instrument that provided me a foundation where I can base everything I think of in terms of where it sits on the piano… I like having that foundation in there.  That’s a very un-punk rock thing to say. Understanding an instrument, and thinking about it, and learning that skill has been invaluable to me.”

Here are some examples of his writing process:

  • Right where it belongs

Here’s is a continuous shifting between D major e D minor, that marks also an emotional shift of feeling, going constantly from sad to happy and viceversa. This helps give the song its emotional vibe.

  • Closer

Here the melodic line ascends through the notes E, F, G and Ab. The last note is intentionally "out of key" to give a unique sound.

  • March of the Pigs

The harmonic and melodic choices of this song are simply impressive. They are exactly what an experienced musician would NEVER do, yet they work great.

The progression is unusual because the second chord is a tritone away from the first chord (meaning something really dissonant, the kind of sound you would usually try to avoid). The melody is brilliant. The song is (mostly) in the key of D minor (the notes of the D minor chord being D, F and A), yet the vocal line sings an F#. Singing the major third in a minor key is supposedly the worst thing to do, and yet it sounds great.

I must say that falling in love with their music helped me to "color outside the lines". It is a wonderful feeling to know how things should be and to consciously break those rules to follow the pure essence of your art.

For anyone interested in learning more about chord theory, here is the full article I was inspired by:

Wwise Implementation in Unity (Part 5: AkListener, AkGameObj & AkEnvironment)

In my last blog post I introduced the basic principle of Unity's sound engine, which is based on the use of Audio Listeners and Audio Sources. The Wwise sound engine works on the same principle and ships with ready-made scripts that implement the same systems. These scripts are called AkAudioListener and AkGameObj. AkAudioListener is the Wwise replacement for Unity's Audio Listener.

If a GameObject's position and orientation are needed, which is the case for most diegetic sounds in a three-dimensional game world, that GameObject must also be registered as such in Wwise. This is what the AkGameObj script, which is attached to all such GameObjects, is for.

The GameObject that serves as the Main Camera and carries the AkAudioListener script must also be given an AkGameObj script, because its position and orientation are needed to calculate the distance and direction to other GameObjects. In Wwise, the behaviour of directional hearing and of the level drop-off with increasing distance between the Audio Listener and a sound-producing GameObject (like the rolloff in Unity) can be configured.

Configuring the sound behaviour of objects in a three-dimensional game world
Level drop-off behaviour depending on the distance between the Audio Listener and the GameObject

The AkGameObj script can additionally transmit other parameters from the game, such as Switches, RTPCs or environment information.

Reverb zones are realized in Wwise with aux busses, which mix a reverb effect into the direct signal within Wwise. In fact, effects other than reverb can be used as well. The script needed for this is the AkEnvironment script, which is placed on a collider. In the Inspector of the collider GameObject you can select which aux bus is activated as soon as the collider is entered.

AkEnvironment script that activates the "Testreverb"

It is important that GameObjects such as the Audio Listener (i.e. the Main Camera) are configured in their AkGameObj script to react to AkEnvironment scripts and thus send a signal to the respective aux bus.

This AkGameObj is configured to be environment-aware, i.e., reverb-capable

Audio Listener and Audio Sources in Unity

One of the fundamental principles of sound playback in the Unity game engine is the interplay of so-called Audio Listeners and Audio Sources. All games and sound implementations developed with Unity build on this principle. A game can be imagined a bit like a film in which the director (or, here, the game developer) determines exactly what the player can and cannot see. The player has far more freedom than in a film, but the basic idea is similar. On the visual level, a camera is used that the player can steer. It is usually called the Main Camera and is often bound to the player character in first or third person. Everything that lies within the camera's field of view and carries visible textures can be seen.

Things behave similarly on the sound level. Analogous to the visual side, the Audio Listener can be thought of as the Main Camera and Audio Sources as visible GameObjects. In fact, the Audio Listener is in most cases attached to the Main Camera, and Audio Sources to the GameObjects that produce sound. The Audio Listener is essentially a microphone that receives signals from Audio Sources and forwards them to the audio outputs. By default, there can only be one Audio Listener in Unity.
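This listener/source pairing can be reproduced with Unity's built-in components. `AudioListener` and `AudioSource` are standard Unity API; the clip itself is a placeholder to be assigned in the Inspector.

```csharp
using UnityEngine;

// Sketch: minimal listener/source pairing with Unity's built-in audio.
public class BasicAudioSetup : MonoBehaviour
{
    public AudioClip clip; // assign any AudioClip in the Inspector

    void Start()
    {
        // Exactly one AudioListener may exist; it usually sits on the Main Camera.
        if (Camera.main.GetComponent<AudioListener>() == null)
            Camera.main.gameObject.AddComponent<AudioListener>();

        // The sound-emitting GameObject gets an AudioSource that plays the clip.
        var source = gameObject.AddComponent<AudioSource>();
        source.clip = clip;
        source.Play();
    }
}
```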

Audio Listeners thus receive signals from Audio Sources, and these signals can additionally be routed through effects such as reverb when the Audio Listener is inside a so-called Reverb Zone. Reverb Zones are spherical GameObjects that cover a certain area and, once the Audio Listener has entered, mix in a reverb share that depends on how close the Audio Listener is to the center of the Reverb Zone.

Audio Reverb Zone in the Unity Inspector with its various configuration options
Visualization of the reverb behavior: inside the "Full Reverb" area the diffuse signal's share is 100%; in the "Gradient Reverb" area the diffuse signal's share decreases in favor of the direct signal the further out the Audio Listener is
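A Reverb Zone with the two radii described above can also be set up from code. `AudioReverbZone` with `minDistance`, `maxDistance`, and `reverbPreset` is standard Unity API; the chosen values and the cave preset are arbitrary examples.

```csharp
using UnityEngine;

// Sketch: a spherical reverb zone. Inside minDistance the reverb is
// at full strength ("full reverb"); between minDistance and maxDistance
// the diffuse share fades out toward the direct signal ("gradient reverb").
public class CaveReverb : MonoBehaviour
{
    void Start()
    {
        var zone = gameObject.AddComponent<AudioReverbZone>();
        zone.minDistance = 5f;                      // full-reverb radius
        zone.maxDistance = 15f;                     // outer edge of the gradient
        zone.reverbPreset = AudioReverbPreset.Cave; // one of Unity's built-in presets
    }
}
```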

Audio Sources can be implemented in two ways, 2D and 3D, and the two can be blended with the Spatial Blend slider.

2D means that neither the distance nor the direction to the Audio Listener is relevant; the sound is played back independently of both. Unless it is processed further in the Unity mixer or in the Inspector of the same GameObject, the sound is played back as it is. This is mainly used for soundtracks or background soundscapes such as atmo recordings.

Audio Source with its various configuration options

3D, on the other hand, places an Audio Source in the three-dimensional game world and allows directional hearing as well as variable levels depending on the distance between the Audio Source and the Audio Listener. Directional hearing is achieved by converting the outgoing audio signal to mono and panning it via level differences. There are solutions for more complex surround-sound setups, but by default Unity works only with this simplified form of directional hearing. Among the various 3D configuration options, the rolloff is especially important: it determines how the level falls off with the distance to the Audio Listener. By default either a logarithmic or a linear rolloff is used; alternatively, a custom, hand-drawn rolloff curve can be created. In addition, the parameters Min Distance and Max Distance are set. These mark the range in which the Audio Listener must be located to hear the Audio Source and in which the rolloff takes place. If the Audio Listener is outside this range, the signal is not forwarded to it and consequently is not played back.
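The 3D settings discussed above map directly onto `AudioSource` properties (`spatialBlend`, `rolloffMode`, `minDistance`, `maxDistance`), all standard Unity API; the concrete values below are arbitrary examples.

```csharp
using UnityEngine;

// Sketch: configuring a fully 3D AudioSource with a linear rolloff.
public class Spatial3dSource : MonoBehaviour
{
    void Start()
    {
        var source = gameObject.AddComponent<AudioSource>();

        source.spatialBlend = 1f;  // 1 = fully 3D; 0 would be fully 2D
        source.rolloffMode = AudioRolloffMode.Linear; // or .Logarithmic / .Custom
        source.minDistance = 2f;   // full level up to 2 units away
        source.maxDistance = 30f;  // silent beyond 30 units
    }
}
```

With `AudioRolloffMode.Custom`, the hand-drawn curve mentioned above would be supplied instead of the built-in falloff shapes.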

The different rolloff behaviors