
Discovering Google Magenta's DDSP library

Digitally process audio data with AI & ML

Introduction

For the course Digital Creativity we explored Google Magenta's open-source DDSP library.
We decided to work mostly on Google Colab because it is much more convenient regarding installation, dependencies and training on a GPU. The only exception is the dataset: it was downloaded from Google Cloud to a local disk and sorted there using this notebook.
There are already notebooks for converting your own wave data to the format DDSP needs (TFRecord). Since we did not have enough suitable wave data of our own, we used a TFRecord dataset with prepared MIDI samples.

We familiarized ourselves with DDSP by going through many of the tutorials (DDSP TUTORIALS).
Afterwards we used our gathered and sorted TFRecord data for a small training run on a single instrument type and then predicted a sample of another instrument with the help of an adjusted DDSP NOTEBOOK (we recommend working with this version for reproduction, continuation etc.). Prediction here means predicting how a sample (e.g. a keyboard tone) would sound with the sound characteristics (timbre) of a different instrument (e.g. a string), or more simply: how would a keyboard tone sound if it were played with a string timbre?

Theory

Challenge: Representation of Audio

A song of three minutes corresponds to roughly a million time steps, but the relevant information is far sparser. The art is to extract those features and find a meaningful representation for music. If music is only represented as a bit stream of 1s and 0s, it is very difficult to know what is going on.

Bias In Conventional Representations

[Figure: ddsp_challenges_waveforms]

Phase Alignment

With strided convolutions, waves are represented as overlapping frames. In reality, however, the waveform's phase shifts relative to the fixed frame boundaries, so the frames would have to be aligned precisely with the signal's phase; otherwise the representation introduces bias.

Fourier based Models

Another widely used approach is to learn the waveform frames in the Fourier domain: decompose them into sine and cosine components and finally reconstruct the sound wave from these Fourier coefficients. However, the frames overlap, so this procedure again introduces bias.

Autoregressive Models

Autoregressive models try to mitigate these problems by constructing the waveform sample by sample, so they do not suffer from the same bias as the approaches above.
However, waveform shape still does not correlate perfectly with human perception, so errors are penalized incoherently during training: for example, two waveforms can sound identical to humans yet produce different perceptual losses for the model. Moreover, these models need a lot of data to work.

Back to the Roots: Oscillation based Models

[Figure: oscillations]

Oscillation is defined as the process of repeating variations of any quantity or measure about its equilibrium value in time.
Most things in nature oscillate (vibrate) at one or more characteristic (natural) frequencies.
Familiar examples are the motion of a clock pendulum or a playground swing, the up-and-down motion of small boats, ocean waves, and the motion of strings or reeds on musical instruments.


[Figure: annotated_synthesis_features]

Rather than predicting waveforms or Fourier coefficients, these models generate the oscillations directly.
Such analysis/synthesis models use expert knowledge and hand-tuned heuristics to extract interpretable synthesis parameters (analysis), such as loudness in dB and frequencies in Hz, which are then used by the generative algorithm (synthesis).

[Figure: ddsp_harmonic_transformation]

With these features you can represent a harmonic oscillation precisely using only:

  • Fundamental frequency F0 (Hz)
  • Harmonics (integer multiples of F0: odd, even, ...)
  • Amplitude (dB)


This representation does not make the model completely free of bias, but it seems to capture the nature and complexity of sound better than the alternatives above.
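
To make this concrete, here is a minimal NumPy sketch (not taken from the repository) that renders a harmonic tone from exactly these three quantities; the frequency, harmonic count and amplitude are made-up example values:

```python
import numpy as np

def harmonic_tone(f0_hz, n_harmonics, amplitude_db, duration_s=1.0, sample_rate=16000):
    """Render a simple harmonic oscillation from F0, its harmonics and an amplitude."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    amplitude = 10.0 ** (amplitude_db / 20.0)            # dB -> linear gain
    # Sum sinusoids at integer multiples of the fundamental, with decaying weights.
    audio = sum(np.sin(2.0 * np.pi * f0_hz * k * t) / k for k in range(1, n_harmonics + 1))
    return amplitude * audio / np.max(np.abs(audio))      # scale the peak to the target amplitude

# Example: a 220 Hz tone with 8 harmonics at -6 dB (hypothetical values).
tone = harmonic_tone(f0_hz=220.0, n_harmonics=8, amplitude_db=-6.0)
```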

Dataset

For our first trial we used the nsynth/full dataset, but then realized its features were not well suited for working with DDSP, so we switched to nsynth/gansynth_subset.f0_and_loudness/2.3.3, which provides the f0 and loudness features that were missing from the full set.

Download

If you would like to try out training on a single instrument without downloading the whole dataset, we uploaded two TFRecord files (each containing a lot of samples!) for string and keyboard to the data folder, which used up our entire GitHub LFS (Large File Storage) quota. That data should be enough to train on string samples (or keyboard) and predict on a single keyboard sample (or string).

For more efficient training we downloaded the whole GANSYNTH 2.3.3 subset from Google Cloud with this link.

To download multiple items at once you need to use gsutil. This requires having the Google Cloud CLI installed on your computer:

1.) install the Google Cloud CLI
2.) make sure gsutil is available (e.g. try gsutil ls in a command prompt: is the command recognized?)
3.) download the files with gsutil from the cloud to a (local) storage location (an external drive, e.g. "E:\gansynth", is recommended for large amounts of data):

 gsutil -m cp -r "gs://tfds-data/PATH" "STORAGE_PATH" 

[Figure: download_nsynth]

Sorting

For our project we used the TensorFlow GANSynth subset of the NSynth dataset. It offers preprocessed samples that contain the most relevant features (amplitude and frequency) ready to use with the DDSP library.
For efficient training we downloaded the samples of the 11 instrument types instead of streaming them. Since the data was not sorted by instrument type, we had to do this step ourselves in order to observe the effects of training on a single instrument type. We read the TFRecord files into Python, parsed them to JSON to identify the instrument label, and then wrote them back to per-instrument TFRecord files with the help of this notebook. For this to work properly, we had to continuously free the already-written objects from memory so that it did not overflow. All in all, this procedure took around 10 hours to sort the samples of the first dataset and significantly less time (30-60 minutes) for the second (smaller) set.
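
A minimal sketch of that sorting step (accessing the parsed proto directly rather than going through JSON as the notebook does); the feature key and file paths are assumptions that should be checked against your data:

```python
import tensorflow as tf

INPUT_FILES = ["E:/gansynth/train.tfrecord-00000-of-00128"]  # hypothetical shard path
FAMILY_KEY = "instrument/family"                             # assumed feature key, verify in your files

writers = {}  # one TFRecord writer per instrument family

for raw in tf.data.TFRecordDataset(INPUT_FILES):
    example = tf.train.Example.FromString(raw.numpy())
    family = example.features.feature[FAMILY_KEY].int64_list.value[0]
    if family not in writers:
        writers[family] = tf.io.TFRecordWriter(f"E:/gansynth_sorted/family_{family}.tfrecord")
    writers[family].write(raw.numpy())
    del example  # release the parsed proto so memory does not grow unboundedly

for w in writers.values():
    w.close()
```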

Raw TFRecord String Representation

[Figure: tfrecord_raw_string]

TFRecord JSON Representation

[Figure: tfrecord_json_representation]

Adjusting Feature Names

To get our TFRecord data working with DDSP (e.g. the training notebook) we had to adjust the data-loading classes slightly to accept feature names with slashes instead of underscores (f0_hz = f0/hz); otherwise we would have had to repeat the whole sorting process just to rename the features.
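
A minimal sketch of the renaming idea (not the repository's actual class), parsing such records while mapping the slash-style names onto the underscore-style names DDSP expects; the feature lengths are illustrative assumptions:

```python
import tensorflow as tf

# Assumed shapes: 4 s of 16 kHz audio and 1000 frames of f0/loudness (adjust to your data).
feature_description = {
    "audio":       tf.io.FixedLenFeature([64000], tf.float32),
    "f0/hz":       tf.io.FixedLenFeature([1000], tf.float32),
    "loudness/db": tf.io.FixedLenFeature([1000], tf.float32),
}

def parse_and_rename(raw_example):
    parsed = tf.io.parse_single_example(raw_example, feature_description)
    # Map the slash-style keys onto the underscore-style keys DDSP expects.
    return {
        "audio": parsed["audio"],
        "f0_hz": parsed["f0/hz"],
        "loudness_db": parsed["loudness/db"],
    }

dataset = tf.data.TFRecordDataset(["family_8.tfrecord"]).map(parse_and_rename)  # hypothetical file
```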

Feature Representation

The features are stored as FloatList tensors containing the values over very small time steps (e.g. an audio length of 64000 samples).
For efficient processing, the features of the input data have to be aligned with the architecture of the neural network.
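
As a rough orientation, assuming DDSP's usual defaults of a 16 kHz sample rate and 250 feature frames per second (library defaults, not something specific to this repository), the alignment between raw audio length and per-frame features works out as follows:

```python
# Hypothetical figures: 64000 audio samples at 16 kHz with 250 f0/loudness frames per second.
sample_rate = 16000                          # audio samples per second
frame_rate = 250                             # f0/loudness frames per second
n_samples = 64000                            # length of one example's audio FloatList

duration_s = n_samples / sample_rate         # 4.0 seconds per example
n_frames = int(duration_s * frame_rate)      # 1000 f0/loudness values per example
print(duration_s, n_frames)
```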

[Figure: feature_structure_gan]

Training

The DDSP architecture is based on an autoencoder network.
That is where the DDSP library comes in: it offers sound modules (synthesizers) that are differentiable, so backpropagation can tune their synthesis parameters (analogous to recreating a sound on a synthesizer), and thanks to these specialized, structured layers the models do not learn as much bias as the other approaches.
These layer types also make autoencoder training faster and therefore give quick feedback, which offers a more instrument-like workflow than iterating for 16 hours of training before you can try further changes.
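
To illustrate what "differentiable synthesizer parameters tuned by backpropagation" means, here is a minimal, self-contained TensorFlow sketch (not the DDSP library's own API) that fits the harmonic amplitudes of a tiny additive synth to a target tone by gradient descent on a simple spectral loss; all values are made-up examples:

```python
import math
import tensorflow as tf

sample_rate = 16000
t = tf.range(sample_rate, dtype=tf.float32) / sample_rate          # 1 second of time stamps
f0_hz = 220.0
harmonics = tf.range(1.0, 9.0)                                     # 8 harmonics

def synth(amplitudes):
    """Differentiable additive synth: weighted sum of harmonic sinusoids."""
    phases = 2.0 * math.pi * f0_hz * harmonics[:, None] * t[None, :]
    return tf.reduce_sum(amplitudes[:, None] * tf.sin(phases), axis=0)

# Made-up target timbre and learnable synth parameters.
target = synth(tf.constant([1.0, 0.5, 0.33, 0.25, 0.2, 0.17, 0.14, 0.125]))
amps = tf.Variable(tf.ones([8]) * 0.1)
opt = tf.keras.optimizers.Adam(learning_rate=0.05)

target_mag = tf.abs(tf.signal.stft(target, 1024, 256))
for step in range(200):
    with tf.GradientTape() as tape:
        pred_mag = tf.abs(tf.signal.stft(synth(amps), 1024, 256))
        loss = tf.reduce_mean(tf.abs(pred_mag - target_mag))        # L1 spectral loss
    opt.apply_gradients([(tape.gradient(loss, amps), amps)])
```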

Training of Autoencoders

[Figure: ddsp_autoencoder]

For training on a TFRecord dataset we recommend using this notebook as a starting point.

Python Code for Layers/Synths

[Figure: colab_tut_training_basic_code_python_soundmodules]

Results

We obtained the following outputs from three different synthesizers (= neural layers), each trained on the same string data (until the learning curve flattened, usually around 4.5-5) and used to predict the same keyboard sample:

[Figure: harmonic_training_string]
Harmonic Synthesizer

[Figure: sinusoid_training_string]
Sinusoid Synthesizer

[Figure: wavetable_training_string]
Wavetable Synthesizer





We can observe from the spectrograms that the harmonic synthesizer, as you would probably expect, has the richest harmonic distribution.



Possible Next Steps

Since time for this project was scarce and the complexity relatively high, we have not yet completed a full, large training run. To build on the gathered data and the lessons learned from the small single-instrument training, options for a longer training would be:

  • try bigger training on the timbre transfer notebook
  • train a VST on the VST notebook
  • ...

We also prepared an adjusted timbre transfer notebook, since the original version did not work with the updated dependencies.

For more content, just have a look at the DDSP demos: there are a lot of (new) ideas once you're familiar with the library and the data!

Further Links

Youtube: Google staff research scientist Jesse Engel explaining DDSP

Citation

All notebook sources in the folder ddsp_notebooks_adjusted belong to Google Magenta's DDSP research team.



# Copyright 2021 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
@inproceedings{  
  engel2020ddsp,  
  title={DDSP: Differentiable Digital Signal Processing},  
  author={Jesse Engel and Lamtharn (Hanoi) Hantrakul and Chenjie Gu and Adam Roberts},  
  booktitle={International Conference on Learning Representations},  
  year={2020},  
  url={https://openreview.net/forum?id=B1x1ma4tDr}  
}  



Notebook Sources

training on single instrument notebook
timbre transfer notebook

Picture Sources (README & presentation)

DDSP paper

playground oscillation

waves oscillation
