GitHub link: Singing-Voice-Separation.
A minimal project for singing voice separation. Every component is as simple as possible:
The network is a standard 3-layer RNN with GRU cells. Each layer has only 256 units, so the model file is only 12.5 MB.
It uses an L2 loss and a momentum optimizer, and converges in around 1600 iterations, which takes less than 4 hours on a 1080 Ti.
The dataset is MIR-1K, which is only around 650 MB and contains 1000 music clips.
Yet it’s quite powerful for its task (well, at least on this dataset). Just check the demo.
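The repo's actual model code isn't reproduced here, but the architecture described above can be sketched in plain numpy. Only the 3 GRU layers with 256 units come from the text; the spectrogram size, the soft-mask output, and all names below are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRULayer:
    """One GRU layer with hidden size H, applied frame by frame."""
    def __init__(self, input_size, hidden_size, rng):
        I, H = input_size, hidden_size
        s = 0.1
        # z: update gate, r: reset gate, c: candidate state
        self.Wz, self.Uz, self.bz = rng.standard_normal((H, I)) * s, rng.standard_normal((H, H)) * s, np.zeros(H)
        self.Wr, self.Ur, self.br = rng.standard_normal((H, I)) * s, rng.standard_normal((H, H)) * s, np.zeros(H)
        self.Wc, self.Uc, self.bc = rng.standard_normal((H, I)) * s, rng.standard_normal((H, H)) * s, np.zeros(H)
        self.H = H

    def __call__(self, xs):
        h = np.zeros(self.H)
        out = []
        for x in xs:                      # xs: (T, input_size)
            z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)
            r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)
            c = np.tanh(self.Wc @ x + self.Uc @ (r * h) + self.bc)
            h = (1.0 - z) * h + z * c
            out.append(h)
        return np.stack(out)              # (T, H)

rng = np.random.default_rng(0)
n_bins = 513                              # assumed spectrogram size (e.g. 1024-point FFT)
layers = [GRULayer(n_bins, 256, rng), GRULayer(256, 256, rng), GRULayer(256, 256, rng)]
W_out = rng.standard_normal((n_bins, 256)) * 0.1   # projection back to frequency bins

mix = np.abs(rng.standard_normal((100, n_bins)))   # magnitude spectrogram, 100 frames
h = mix
for layer in layers:
    h = layer(h)
mask = sigmoid(h @ W_out.T)               # soft vocal mask in (0, 1)
vocals = mask * mix                       # estimated vocal spectrogram
```

A mask-based output like this is a common choice for source separation; the real model's output layer and loss wiring may differ.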
Download the pre-trained model from here. Extract it at the root directory of the project, and make sure you have some files under
For how to use this tool, check
eval.process_single_example. NOTE: the input file must be in
.wav format with a 16000 Hz sample rate. You can easily convert a file into this format using ffmpeg.
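One way to do that conversion is a small wrapper around the ffmpeg CLI. The helper names and file paths below are made up; `-ar` (sample rate) is a standard ffmpeg option:

```python
import subprocess

def ffmpeg_cmd(src, dst):
    """Build an ffmpeg command converting src to a 16000 Hz .wav file."""
    # -y overwrites dst if it exists; -ar sets the output sample rate.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", dst]

def convert_to_16k_wav(src, dst):
    """Requires ffmpeg on PATH; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_cmd(src, dst), check=True)

cmd = ffmpeg_cmd("song.mp3", "song.wav")
```

Equivalently, from a shell: `ffmpeg -i song.mp3 -ar 16000 song.wav`.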
If you insist on training it yourself, you’ll need to:
Download the dataset to
Run data.train_test_split to split the train and test sets.
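The repo's `data.train_test_split` isn't shown here; a minimal sketch of what such a split might look like, assuming it shuffles clip file names into disjoint train/test lists (the function signature, the 0.1 test ratio, and the clip naming are all assumptions):

```python
import random

def train_test_split(clip_names, test_ratio=0.1, seed=0):
    """Shuffle clip names deterministically and split into train/test lists."""
    names = sorted(clip_names)            # sort first so the split is reproducible
    random.Random(seed).shuffle(names)
    n_test = max(1, int(len(names) * test_ratio))
    return names[n_test:], names[:n_test]

# MIR-1K contains roughly 1000 clips; hypothetical file names.
clips = [f"clip_{i:04d}.wav" for i in range(1000)]
train, test = train_test_split(clips)
```

Fixing the seed keeps the split stable across runs, which matters when comparing checkpoints.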