
Time: 2022-09-06 17:06:39

I want to write a program that plays an audio file of a text being read aloud. I want to highlight the syllable currently being spoken in green and the rest of the current word in red. What kind of data structure should I use to store the audio file and the information that tells the program when to switch to the next word or syllable?


4 answers



This is a slightly left-field suggestion, but have you looked at karaoke software? It may not be seen as "serious" enough, but it sounds very similar to what you're doing. For example, Aegisub is a subtitling program that lets you create subtitles in the SSA/ASS format. It has karaoke tools for highlighting the chosen word or part of a word.

It's most commonly used for subtitling anime, but it also works for audio provided you have a suitable player. These are sadly quite rare on the Mac.


The format looks similar to the one proposed by Yuval A:

{\K132}Unmei {\K34}no {\K54}tobira
{\K60}{\K132}yukkuri {\K36}to {\K142}hirakareta

The lengths are durations rather than absolute offsets. This makes it easier to shift the start of the line without recalculating all the offsets. The double entry indicates a pause.
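Since the question is about Java, here is a minimal sketch of how a line in this format could be turned back into absolute offsets. It assumes the `\K` values are centiseconds (as they are in ASS karaoke tags) and that an empty tag such as `{\K60}` is a pause; the class and method names are made up for illustration and are not part of Aegisub or any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for one karaoke line such as "{\K132}Unmei {\K34}no". */
final class KaraokeLine {

    /** A chunk of text that is active from startCs (inclusive) to endCs (exclusive). */
    static final class Segment {
        final String text;
        final int startCs;
        final int endCs;

        Segment(String text, int startCs, int endCs) {
            this.text = text;
            this.startCs = startCs;
            this.endCs = endCs;
        }
    }

    // Matches "{\K<digits>}" followed by the text up to the next tag.
    private static final Pattern TAG = Pattern.compile("\\{\\\\[Kk](\\d+)\\}([^{]*)");

    /** Converts per-chunk durations into offsets relative to the start of the line. */
    static List<Segment> parse(String line) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        int clock = 0; // running offset in centiseconds
        while (m.find()) {
            int duration = Integer.parseInt(m.group(1));
            String text = m.group(2);
            if (!text.isEmpty()) {
                segments.add(new Segment(text, clock, clock + duration));
            }
            // A tag with no text is treated as a pause: it only advances the clock.
            clock += duration;
        }
        return segments;
    }
}
```

Because only the running clock is accumulated, shifting the whole line means adding one number to every offset after parsing; the stored durations never change.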


Is there a good reason this needs to be part of your Java program, or would an off-the-shelf solution do?




How about a simple data structure that records which batch of letters makes up the next syllable, together with the timestamp for switching to that syllable?


Just a quick example:


[0:00] This [0:02] is [0:05] an [0:07] ex- [0:08] am- [0:10] ple
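One way this could look in Java, as a rough sketch rather than a finished design: keep the audio as an ordinary file played by whatever audio API you use, and keep the timing data in a sorted map keyed by start time. Each entry stores character offsets for the current word and the current syllable, so the UI can paint the syllable green and the rest of the word red. `SyllableTimeline` and `Cue` are hypothetical names invented for this example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative timeline: start time in milliseconds mapped to a highlight cue. */
final class SyllableTimeline {

    /** Character offsets into the displayed text; end offsets are exclusive. */
    static final class Cue {
        final int wordStart;
        final int wordEnd;
        final int sylStart;
        final int sylEnd;

        Cue(int wordStart, int wordEnd, int sylStart, int sylEnd) {
            this.wordStart = wordStart;
            this.wordEnd = wordEnd;
            this.sylStart = sylStart;
            this.sylEnd = sylEnd;
        }
    }

    private final TreeMap<Long, Cue> cues = new TreeMap<>();

    /** Registers the syllable that becomes active at startMs. */
    void add(long startMs, int wordStart, int wordEnd, int sylStart, int sylEnd) {
        cues.put(startMs, new Cue(wordStart, wordEnd, sylStart, sylEnd));
    }

    /** Returns the cue active at the given playback position, or null before the first cue. */
    Cue at(long positionMs) {
        Map.Entry<Long, Cue> entry = cues.floorEntry(positionMs);
        return entry == null ? null : entry.getValue();
    }
}
```

A UI timer would poll the player's current position (for example `Clip.getMicrosecondPosition()` divided by 1000, if you play the file through `javax.sound.sampled`), call `at(...)`, and repaint only when the returned cue changes. `floorEntry` makes each lookup logarithmic and keeps working if the user seeks backwards.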



Highlighting part of a word sounds like you're getting into phonetics, that is, the sounds that make up words. It's going to be really difficult to turn a sound file into something that will "read" a text. Your best bet is to use the text itself to drive a phonetics-based engine such as FreeTTS, which is based on the Java Speech API.

To do this you're going to have to take the text to be read, split it into phonetic syllables, and play them one at a time. So "syllable" becomes "syl", "la", "ble". Playback would then be: highlight "syl", say it, and move on to the next one.


This is really "old-skool"; it was done the same way on the original Apple II.



You might want to get familiar with FreeTTS, an open source tool: http://freetts.sourceforge.net/docs/index.php

You might want to feed only a few words to the TTS engine at a time: highlight them, and once they have been spoken, de-highlight them and move on to the next batch of words.
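A rough sketch of that loop in Java follows. It deliberately does not call FreeTTS directly: the `speakBlocking` callback is a stand-in for whatever engine call you end up using, and the sketch assumes that call returns only after the batch has been spoken. `BatchReader` and all parameter names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative helper that reads a text aloud a few words at a time. */
final class BatchReader {

    /** Splits text on whitespace into batches of at most wordsPerBatch words. */
    static List<String> batches(String text, int wordsPerBatch) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || wordsPerBatch <= 0) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += wordsPerBatch) {
            int end = Math.min(i + wordsPerBatch, words.length);
            out.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return out;
    }

    /**
     * Highlights each batch, speaks it, then removes the highlight.
     * speakBlocking is assumed to return only once the batch has been spoken.
     */
    static void readAloud(String text, int wordsPerBatch,
                          Consumer<String> highlight,
                          Consumer<String> speakBlocking,
                          Consumer<String> unhighlight) {
        for (String batch : batches(text, wordsPerBatch)) {
            highlight.accept(batch);
            speakBlocking.accept(batch);
            unhighlight.accept(batch);
        }
    }
}
```

In a Swing or JavaFX program the loop would need to run off the UI thread, with the highlight callbacks posting back to it, since the speech call blocks.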



