🐟 Fish Audio S2 Pro

State-of-the-Art Dual-Autoregressive Text-to-Speech

80+ languages supported · Zero-shot voice cloning · 15,000+ inline emotion tags

🏷️ Supported Emotion Tags

15,000+ unique tags supported. Use free-form descriptions like [whisper in small voice] or [professional broadcast tone]. Common tags:

[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited] [laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo] [angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting] [loud] [surprised] [short pause] [exhale] [delight] [panting] [audience laughter] [with strong accent] [volume down] [clearing throat] [sad] [moaning] [shocked]
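
The bracket syntax above can be checked before text is sent for synthesis. The following is a minimal Python sketch, not part of Fish Audio's tooling: the helper names (extract_tags, split_common_and_freeform) are hypothetical, the code does not call the model, and placing a tag directly before the text it should affect is an assumption rather than documented behavior. It only finds the inline tags in a script and separates the common tags listed above from free-form descriptions.

```python
import re

# Common tags copied from the list above. The page states that free-form
# descriptions are also accepted, so this set is not exhaustive.
COMMON_TAGS = {
    "pause", "emphasis", "laughing", "inhale", "chuckle", "tsk", "singing",
    "excited", "laughing tone", "interrupting", "chuckling", "excited tone",
    "volume up", "echo", "angry", "low volume", "sigh", "low voice",
    "whisper", "screaming", "shouting", "loud", "surprised", "short pause",
    "exhale", "delight", "panting", "audience laughter",
    "with strong accent", "volume down", "clearing throat", "sad",
    "moaning", "shocked",
}

# Matches one non-nested [tag]; the capture group holds the text inside.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text):
    """Return inline tags in order of appearance, trimmed and lowercased."""
    return [tag.strip().lower() for tag in TAG_PATTERN.findall(text)]


def split_common_and_freeform(text):
    """Separate tags found in text into (common, freeform) lists."""
    tags = extract_tags(text)
    common = [t for t in tags if t in COMMON_TAGS]
    freeform = [t for t in tags if t not in COMMON_TAGS]
    return common, freeform


# Assumed usage: each tag precedes the passage it is meant to color.
script = (
    "[whisper] I have a secret. [short pause] "
    "[professional broadcast tone] And now, the news."
)
common, freeform = split_common_and_freeform(script)
print(common)    # ['whisper', 'short pause']
print(freeform)  # ['professional broadcast tone']
```

A check like this is useful mainly for catching typos in common tags (for example [wisper]), which would otherwise be treated as free-form descriptions.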

🌍 Supported Languages

Tier 1: Japanese · English · Chinese
Tier 2: Korean · Spanish · Portuguese · Arabic · Russian · French · German
Also supported: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

Language is auto-detected from the input text; no configuration is needed.
