Text-To-Speech
Two-Stage Pipeline
Stage 1: generate an intermediate representation with lower resolution than the audio
e.g. Mel-Spectrograms, Linguistic Features, STFT
Stage 2: synthesize raw waveform audio from the intermediate representation
cf. Vocoder
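
To make "lower resolution than the audio" concrete, here is a minimal sketch that computes a magnitude STFT, one of the intermediate representations listed above, from a synthetic sine wave. It is an illustration only: in a real TTS system stage 1 predicts this representation from text rather than computing it from audio, and the function name `stft_magnitude`, the sample rate, `n_fft`, and `hop` are assumptions chosen for the example, not values from these notes.

```python
import cmath
import math


def stft_magnitude(signal, n_fft=64, hop=16):
    """Magnitude STFT with a Hann window.

    Returns a list of frames; each frame holds n_fft // 2 + 1 frequency bins.
    (Hypothetical helper for illustration; a naive DFT, not an FFT.)
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    # Slide a window over the signal, advancing by `hop` samples each time.
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = []
        for k in range(n_fft // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Assumed toy settings: 1024 samples of a 1 kHz sine at 8 kHz.
sample_rate = 8000
freq = 1000.0
waveform = [math.sin(2 * math.pi * freq * t / sample_rate) for t in range(1024)]

spec = stft_magnitude(waveform)
num_frames = len(spec)    # time steps in the intermediate representation
num_bins = len(spec[0])   # frequency bins per frame
peak_bin = max(range(num_bins), key=lambda k: spec[0][k])

print(len(waveform), "samples ->", num_frames, "frames of", num_bins, "bins")
print("strongest bin in first frame:", peak_bin)
```

The time axis shrinks by roughly the hop size (1024 samples become 61 frames here), which is the sense in which the intermediate representation is lower resolution. Stage 2, the vocoder, has to invert this step and recover the sample-level waveform from the frame-level features.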
Reference