Singing Voice Separation

GitHub link: Singing-Voice-Separation.

A minimal project for singing voice separation. Every component is as simple as possible:

  • The network is a traditional 3-layer-RNN with GRU cell. Each layer has 256 neurons only. The model file is only 12.5MB.

  • It uses L2 loss and momentum optimizer, taking around 1600 iters to converge, which needs less than 4 hours on a 1080Ti.

  • The dataset is MIR-1k dataset, which is only around 650MB containing 1000 music clips.

Yet, it’s quite powerful for its task (well, only on this dataset). Just check the demo.



  1. Download the pre-trained model from here. Extract it at the root directory of the project - make sure you have some files under ./log/baseline.

  2. Run python

About how to use this tool, check eval.process_single_example. NOTE: the input file must be in .wav format with 16000 sample rate. You can easily convert a file into this using ffmpeg.


Why bother?

If you insist to train by yourself, you’ll need to:

  1. Download the dataset to ./data.

  2. Use data.train_test_split to split train and test set.

  3. Use


  • librosa
  • numpy
  • tensorflow
  • mir_eval