A TensorFlow implementation of the GAN-TTS paper.
- Text embeddings are generated by a pre-trained TensorFlow BERT model.
- Linguistic features are not predicted by external models. Instead, they are predicted by a feature net that works together with the generator and the discriminator. The feature net is a simple CBHG module that takes a text embedding as input and outputs a tensor of linguistic features.
- You can explore the data flow and tensor dimensions using the notebook. The discriminator used in the notebook differs from the original one because the Colab GPU couldn't handle the original discriminator.
- I trained the model on a very small dataset, 17 audio-text pairs from LJSpeech, because I didn't have a proper machine to use.
- To evaluate this GAN, I used the Fréchet distance, with all embeddings calculated by the pre-trained VGGish TensorFlow model.
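
The evaluation step above can be illustrated with a minimal sketch of the Fréchet distance between two sets of embeddings. It is not the code used in this repository: the function name `frechet_distance` and the NumPy-only approach are assumptions made for illustration, and the embeddings are assumed to be already extracted (for example with VGGish) and stored as arrays of shape `(num_clips, embedding_dim)`. Each set is modelled as a Gaussian, and the distance is `||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))`.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between two embedding sets, each modelled as a Gaussian.

    emb_a, emb_b: arrays of shape (num_clips, embedding_dim).
    Hypothetical helper for illustration, not the repository's own code.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)

    # Mean and covariance of each set (rows are samples, columns are features).
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)

    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_a C_b. Tiny negative or imaginary parts caused by
    # round-off are discarded before taking the square root.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Usage with random stand-in data; real use would pass embeddings of
# ground-truth audio and generated audio instead.
rng = np.random.default_rng(0)
real_emb = rng.normal(size=(200, 8))
fake_emb = rng.normal(loc=0.3, size=(200, 8))
print(frechet_distance(real_emb, fake_emb))
```

A lower value means the generated audio embeddings are statistically closer to the real ones. With only 17 clips, the covariance estimates are very noisy, so scores computed on such a small set should be read with caution.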