Processing Audio Data for Use in Machine Learning with Python

I am currently working on a project where I am using machine learning to generate audio samples. One of the steps involved is pre-processing.

What is pre-processing?

Pre-processing means modifying input data so that it is easier to handle. An everyday example would be packing items into boxes so that they are easier to store. In my case, I use pre-processing to make sure all audio samples are uniform before working with them any further. By uniform I mean the same sample rate, the same file type, the same length and the same time of peak. This matters because an inconsistent mess of samples makes it much harder for the algorithm to learn the dataset and to return similar samples instead of random noise.

The Code: Step by step

First, we need some samples to work with. Once they are downloaded and stored somewhere, we need to specify the path to that folder. I import os to store the path and to list the files in it like so:


import os

# Folder that contains the raw samples
PATH = r"C:/Samples"
# Names of all entries in that folder
DIR = os.listdir(PATH)


 Since we are already declaring constants, we can add the following:
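The exact values used for this project are not shown, so here is a minimal sketch with placeholder values. The names SAMPLERATE, ATTACK_SEC and LENGTH_SEC are the ones used in the rest of the script; the numbers are assumptions and should be replaced with values that fit your own data.

```python
# Placeholder values (assumptions, not the original settings)
SAMPLERATE = 44100   # target sample rate in Hz
ATTACK_SEC = 0.05    # time from the start of the file to the peak, in seconds
LENGTH_SEC = 1.0     # total length of every processed sample, in seconds

# Basic sanity check: the peak has to fit inside the final length
assert 0 <= ATTACK_SEC < LENGTH_SEC
```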




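These are the “settings” for our pre-processing script: SAMPLERATE (the target sample rate), ATTACK_SEC (the time to the peak in seconds) and LENGTH_SEC (the final length in seconds). The right values depend strongly on the data, so when you write this on your own, work out which values make sense for your samples.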

Instead of ATTACK_SEC we could use ATTACK_SAMPLES as well, but I prefer to calculate the lengths in samples from the constants above:

import numpy as np

# Convert seconds to a whole number of samples at the target sample rate
attack_samples = int(np.round(ATTACK_SEC * SAMPLERATE, 0))
length_samples = int(np.round(LENGTH_SEC * SAMPLERATE, 0))
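To illustrate why the conversion rounds before casting to int, here is a small sketch. The helper name seconds_to_samples is hypothetical and only wraps the same np.round call used above.

```python
import numpy as np

def seconds_to_samples(seconds, samplerate):
    """Convert a duration in seconds to a whole number of samples (hypothetical helper)."""
    return int(np.round(seconds * samplerate))

# 50 ms at 44.1 kHz
print(seconds_to_samples(0.05, 44100))

# Floating-point products can land just below a whole number,
# so truncating with int() alone may come out one sample short.
print(int(0.29 * 100), seconds_to_samples(0.29, 100))
```

Rounding first keeps the result at the nearest whole sample even when the product is not exact.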

One last thing: since we usually want to pre-process more than one file, from now on everything will run in a for-loop:

for file in DIR:

Because DIR holds the names returned by os.listdir, every file in the directory can now simply be accessed through the file variable.
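Note that os.listdir returns every entry in the folder, including sub-folders and non-audio files. A possible way to filter the list is sketched below; the helper list_audio_files and the extension list are assumptions for illustration, not part of the original script.

```python
import os

AUDIO_EXTENSIONS = (".wav", ".flac", ".ogg")  # assumed list, adjust as needed

def list_audio_files(path, extensions=AUDIO_EXTENSIONS):
    """Return full paths of the audio files directly inside `path` (hypothetical helper)."""
    paths = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        # Skip sub-folders and files with other extensions
        if os.path.isfile(full_path) and name.lower().endswith(extensions):
            paths.append(full_path)
    return paths
```

Looping over the result of this function instead of DIR would skip anything that is not an audio file.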

Now the actual pre-processing begins. First, we read the file and make sure that we get a 2D array whether it is a stereo file or a mono file. Then we keep only the first channel and resample it with librosa.

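import librosa
import soundfile as sf

# (inside the for-loop)
# Read the file; always_2d=True returns shape (frames, channels), even for mono
data, samplerate = sf.read(PATH + "/" + file, always_2d=True)

# Keep only the first channel
data = data[:, 0]

# Resample to the target sample rate
sample = librosa.resample(data, orig_sr=samplerate, target_sr=SAMPLERATE)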
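Taking data[:, 0] keeps only the first channel of a stereo file. A possible alternative, which is not what the script above does, is to average all channels. The sketch below compares both on a small synthetic array; the helper to_mono is hypothetical.

```python
import numpy as np

def to_mono(data, mode="first"):
    """Reduce a (frames, channels) array to 1D (hypothetical helper).

    mode="first" keeps channel 0, like data[:, 0];
    mode="mean" averages all channels instead.
    """
    if mode == "first":
        return data[:, 0]
    if mode == "mean":
        return data.mean(axis=1)
    raise ValueError(f"unknown mode: {mode}")

# Synthetic stereo signal: 4 frames, 2 channels
stereo = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 1.0]])
print(to_mono(stereo, "first"))
print(to_mono(stereo, "mean"))
```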

The next step is to detect the peak and to align it to a fixed time. The time-to-peak is set by our constant ATTACK_SEC, and the actual peak position can be found with numpy’s argmax on the absolute values. We then compare the two: if the peak comes too late, we cut away the start, and if it comes too early, we pad the start with silence:

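# Index of the loudest sample (largest absolute amplitude)
peak_timestamp = np.argmax(np.abs(sample))

if peak_timestamp > attack_samples:
    # Peak comes too late: cut away the start
    new_start = peak_timestamp - attack_samples
    processed_sample = sample[new_start:]
elif peak_timestamp < attack_samples:
    # Peak comes too early: pad the start with silence
    gap_to_start = attack_samples - peak_timestamp
    processed_sample = np.pad(sample, pad_width=[gap_to_start, 0])
else:
    # Peak is already in the right place
    processed_sample = sample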
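The same alignment logic can be tried out on synthetic data. This is a minimal sketch, assuming a 1D numpy array as input; the function name align_peak and the test signals are made up for illustration.

```python
import numpy as np

def align_peak(sample, attack_samples):
    """Shift a 1D signal so its loudest sample sits at index attack_samples."""
    peak = int(np.argmax(np.abs(sample)))
    if peak > attack_samples:
        # Peak is too late: drop the surplus samples at the start
        return sample[peak - attack_samples:]
    if peak < attack_samples:
        # Peak is too early: prepend silence
        return np.pad(sample, pad_width=[attack_samples - peak, 0])
    return sample

late = np.array([0.0, 0.1, 0.2, 0.3, -1.0, 0.2])   # peak at index 4
early = np.array([0.9, 0.1, 0.0])                  # peak at index 0

for signal in (late, early):
    aligned = align_peak(signal, attack_samples=2)
    print(np.argmax(np.abs(aligned)))
```

In both cases the peak of the returned array ends up at the requested index, while the total length still differs from file to file.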

 And now we do something very similar but this time with the LENGTH_SEC constant:

if processed_sample.shape[0] > length_samples:
    # Too long: cut away the end
    processed_sample = processed_sample[:length_samples]
elif processed_sample.shape[0] < length_samples:
    # Too short: pad the end with silence
    cut_length = length_samples - processed_sample.shape[0]
    processed_sample = np.pad(processed_sample, pad_width=[0, cut_length])
else:
    # Already the right length
    processed_sample = processed_sample
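As a quick check of the length step, here is a self-contained sketch on dummy arrays. The function name fix_length is hypothetical and simply wraps the trim-or-pad logic described above.

```python
import numpy as np

def fix_length(sample, length_samples):
    """Trim or zero-pad the end of a 1D signal to exactly length_samples."""
    current = sample.shape[0]
    if current > length_samples:
        return sample[:length_samples]
    if current < length_samples:
        return np.pad(sample, pad_width=[0, length_samples - current])
    return sample

print(fix_length(np.ones(6), 4).shape)  # longer input gets trimmed
print(fix_length(np.ones(2), 4))        # shorter input gets zeros appended
```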

Note that we use slicing (the : operator) to cut away parts of the samples and np.pad() to add silence to either the beginning or the end. The two entries of pad_width=[] are the amounts added before and after the signal, so the position of the 0 decides which side stays untouched.
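A tiny example on a three-element array shows both operations side by side. The array values are arbitrary.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0])

# pad_width=[before, after]: number of zeros added on each side
padded_start = np.pad(signal, pad_width=[2, 0])
padded_end = np.pad(signal, pad_width=[0, 2])

print(padded_start)  # zeros in front of the signal
print(padded_end)    # zeros behind the signal

# Slicing does the opposite and removes samples
print(signal[1:])    # drops the first sample
print(signal[:2])    # keeps only the first two samples
```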

With this, the pre-processing is done. The script can be hooked into another program right away, so you could stop here. But there is one more thing we can do. The following addition lets us preview the original and the processed samples, both as a plot and by simply playing them:

import sounddevice as sd

import time

import matplotlib.pyplot as plt
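The preview code itself is not shown here, so the following is only a sketch of how such a preview could look, not the original implementation. The helpers time_axis and preview are hypothetical, both arrays are assumed to share the same sample rate (for example sample and processed_sample), and sd.wait() is used to block during playback instead of a timed sleep.

```python
import numpy as np

def time_axis(sample, samplerate):
    """Return the time in seconds of every sample, for use as the x-axis."""
    return np.arange(sample.shape[0]) / samplerate

def preview(original, processed, samplerate, play=True):
    """Plot both signals and optionally play them back (hypothetical helper)."""
    # Imported here so the rest of the script still runs without audio/plot support
    import matplotlib.pyplot as plt
    import sounddevice as sd

    fig, (ax_top, ax_bottom) = plt.subplots(2, 1)
    ax_top.plot(time_axis(original, samplerate), original)
    ax_top.set_title("original")
    ax_bottom.plot(time_axis(processed, samplerate), processed)
    ax_bottom.set_title("processed")
    fig.tight_layout()
    plt.show()

    if play:
        for signal in (original, processed):
            sd.play(signal, samplerate)
            sd.wait()  # block until playback has finished
```

Inside the loop this could be called as preview(sample, processed_sample, SAMPLERATE).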











 Alternatively, we can also just save the files somewhere using soundfile:

sf.write(os.path.join(PATH, "preprocessed", file), processed_sample, SAMPLERATE, subtype='FLOAT')
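Writing into the preprocessed folder assumes that this folder already exists. A minimal sketch for creating it on demand is shown below; the helper prepare_output_path is hypothetical and uses only the standard library.

```python
import os

def prepare_output_path(base_dir, filename, subfolder="preprocessed"):
    """Create base_dir/subfolder if needed and return the target file path (hypothetical helper)."""
    out_dir = os.path.join(base_dir, subfolder)
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    return os.path.join(out_dir, filename)
```

The returned path could then be passed as the first argument to sf.write.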

 And now we are really done. If you have any comments or suggestions leave them down below!
