This is the next level of this project.
There are a lot of cool TTS systems in this space.
I could keep using Piper since it's built on VITS, and it looks like phonemization is now embedded in Piper's Python package: https://github.com/rhasspy/piper-phonemize
When I looked into implementing beeping, the difficult part seemed to be getting a phoneme-level breakdown of each word with timings, something like:
"F" (50ms), "U" (100ms), "CK" (80ms)
I keep the first phoneme, mute the middle, and unmute for CK.
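Here's a rough sketch of that keep/mute/unmute idea in Python, assuming I already have per-phoneme durations for the word (which is the hard part). The function name `censor_middle`, the `(label, duration_ms)` tuple format, and the optional beep parameters are all made up for illustration; none of this is a Piper API.

```python
import math

def censor_middle(samples, sample_rate, phonemes, beep_hz=None, beep_gain=0.3):
    """Keep the first and last phoneme, silence (or beep over) everything between.

    samples:  list of floats for ONE word, starting at the word's first sample
    phonemes: list of (label, duration_ms) in spoken order (hypothetical format)
    beep_hz:  None to mute the middle, or a frequency to replace it with a tone
    """
    if len(phonemes) < 3:
        return list(samples)  # no "middle" to censor

    # Turn durations into cumulative sample boundaries: [0, end_of_p1, end_of_p2, ...]
    bounds = [0]
    elapsed_ms = 0
    for _label, duration_ms in phonemes:
        elapsed_ms += duration_ms
        bounds.append(round(elapsed_ms * sample_rate / 1000))

    start = bounds[1]                      # end of the first phoneme
    end = min(bounds[-2], len(samples))    # start of the last phoneme

    out = list(samples)
    for i in range(start, end):
        if beep_hz is None:
            out[i] = 0.0
        else:
            out[i] = beep_gain * math.sin(2 * math.pi * beep_hz * (i - start) / sample_rate)
    return out

# Placeholder audio: 230 ms of a constant signal at 16 kHz, not real speech.
sample_rate = 16000
phonemes = [("F", 50), ("U", 100), ("CK", 80)]
word = [0.5] * (sample_rate * 230 // 1000)

muted = censor_middle(word, sample_rate, phonemes)
beeped = censor_middle(word, sample_rate, phonemes, beep_hz=1000.0)
```

With these numbers, "F" covers the first 50 ms, the 100 ms "U" gets silenced or beeped, and "CK" plays untouched. Real audio would probably want a short fade at the boundaries to avoid clicks, which this sketch skips.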
Luckily I'm only working with English right now, because it looks like this gets harder for Arabic: https://www.reddit.com/r/TextToSpeech/comments/1ooiabf/how_can_i_extract_phoneme_timings_for_lipsync/
I could use ElevenLabs or Azure, but we ball! Also, they apparently provide visemes with exact millisecond start/end times? Kinda crazy.
