Text-To-Speech

2-Stage Pipeline

  1. Generate an intermediate representation with lower resolution than the audio
    1. e.g. Mel-Spectrograms, Linguistic Features, STFT
  2. Synthesize raw waveform audio from the intermediate representation
    1. cf. Vocoder
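
To make "lower resolution" concrete, here is a minimal sketch that computes an STFT magnitude from a one-second test tone and compares the number of time steps. It covers only the intermediate representation, not the vocoder stage. The sample rate, window size, hop length, and the helper name `stft_magnitude` are assumptions chosen for illustration, not values from these notes.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sample rate in Hz
N_FFT = 1024         # assumed analysis window size in samples
HOP_LENGTH = 256     # assumed hop between frames in samples


def stft_magnitude(wave, n_fft=N_FFT, hop=HOP_LENGTH):
    """Return the magnitude STFT, shape (frames, n_fft // 2 + 1).

    Hypothetical helper: Hann window, no padding, frames that would
    run past the end of the signal are dropped.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a 440 Hz sine wave stands in for real speech audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(wave)
peak_bin = int(np.argmax(spec.mean(axis=0)))

print("waveform samples:", len(wave))
print("spectrogram shape (frames, bins):", spec.shape)
print("samples per frame step:", HOP_LENGTH)
print("strongest frequency bin:", peak_bin)
```

With these settings the time axis shrinks by roughly the hop length, which is the sense in which the first-stage output is coarser than the waveform. A mel-spectrogram would additionally reduce the frequency axis to a small number of mel bands.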

Reference