Subjective evaluation of sound quality and control of drum synthesis with StyleWaveGAN

In this paper, we present a study on the subjective evaluation of the sound quality of the proposed StyleWaveGAN, as well as a subjective evaluation of the precision of the control using timbre descriptors from the Audio Commons toolbox. In the context of professional audio production, StyleWaveGAN is our contribution to fast, simple, yet extensive drum generation: it synthesizes waveforms faster than real time on a GPU, directly in CD quality and up to a duration of 1.5 s, while retaining a considerable amount of control over the generation. The simplicity of the control method comes from our differentiable implementation of high-level descriptors based on the Audio Commons models, which allows us to control the synthesis with ease in terms of interpolation and latent separation when used in conjunction with StyleWaveGAN. We evaluate our control method with statistical metrics as well as measurements of the psychophysical response to variations of the control. We also perform perceptual tests to evaluate the sound quality of the generation against DrumGAN.

Data and augmented samples

Drum Type	Original	Lowest augmentation parameters	Highest augmentation parameters
Kick
Snare
Tom
Closed Hi-Hat
Open Hi-Hat

Synthesized samples

Drum Type	Samples
Kick
Snare
Tom
Closed Hi-Hat
Open Hi-Hat

Perceptual control samples

Descriptors	Samples
Brightness	Delta = 0.9; Delta = 6.7
Depth	Delta = 1.8; Delta = 8.9
Warmth	Delta = 7.5; Delta = 4.2

Psychophysical evaluation results

In this section, we show all the CDF plots obtained from our perceptual evaluation of control quality.

Descriptors	Q20	Q50	Q80
Brightness (Combined)
Brightness (Positive)
Brightness (Negative)
Brightness (Whole range)
Depth (Combined)
Depth (Positive)
Depth (Negative)
Depth (Whole range)
Warmth (Combined)
Warmth (Positive)
Warmth (Negative)
Warmth (Whole range)
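
As a reading aid for the Q20, Q50, and Q80 columns above, here is a minimal Python sketch of how such points could be read off an empirical CDF. It assumes that Q20, Q50, and Q80 denote the values at which the CDF reaches 20%, 50%, and 80%; this page does not define them, so treat that reading as an assumption. The function names and the example data are hypothetical and are not taken from our evaluation.

```python
def empirical_cdf(samples):
    """Return the sorted values and, for each, the fraction of samples at or below it."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def cdf_level_crossing(samples, level):
    """Smallest observed value at which the empirical CDF reaches `level`.

    `level` is a fraction in (0, 1], e.g. 0.2 for an assumed "Q20".
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 < level <= 1.0:
        raise ValueError("level must be in (0, 1]")
    xs, cdf = empirical_cdf(samples)
    for x, p in zip(xs, cdf):
        if p >= level:
            return x
    return xs[-1]  # unreachable in practice since the CDF ends at 1.0


# Hypothetical data: descriptor differences (Delta) collected in a listening test.
deltas = [7, 2, 9, 4, 1, 10, 3, 8, 5, 6]
q20, q50, q80 = (cdf_level_crossing(deltas, lv) for lv in (0.2, 0.5, 0.8))
```

With real data, `deltas` would be replaced by whatever quantity the plotted CDF is built from; the sketch only illustrates the lookup, not the experimental protocol.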