Data structure for linking text with audio in Java

I want to write a program that plays an audio file of a text being read aloud. I want to highlight the syllable the audio is currently playing in green and the rest of the current word in red. What kind of data structure should I use to store the audio file and the information that tells the program when to switch to the next word/syllable?

4 solutions

#1

This is a slightly left-field suggestion, but have you looked at karaoke software? It may not be seen as "serious" enough, but it sounds very similar to what you're doing. For example, Aegisub is a subtitling program that lets you create subtitles in the SSA/ASS format. It has karaoke tools for highlighting the chosen word or part of a word.

It's most commonly used for subtitling anime, but it also works for audio provided you have a suitable player. These are sadly quite rare on the Mac.

The format looks similar to the one proposed by Yuval A:

{\K132}Unmei {\K34}no {\K54}tobira
{\K60}{\K132}yukkuri {\K36}to {\K142}hirakareta

The numbers are durations rather than absolute offsets, which makes it easier to shift the start of a line without recalculating all the offsets. The doubled tag at the start of the second line ({\K60} with no text after it) indicates a pause.
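As a rough illustration of how such a line could be consumed from Java, here is a minimal sketch that converts the per-syllable durations into absolute start and end times. The class and method names (`KaraokeLine`, `parse`) are made up for this example, and it assumes the `\K` values are in centiseconds, as in ASS karaoke tags. It handles only the simple tag form shown above, not the full SSA/ASS syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns one karaoke-tagged line into timed syllables. */
final class KaraokeLine {

    /** One chunk of text and the time span (in ms) during which it is highlighted. */
    static final class Syllable {
        final String text;
        final long startMs;
        final long endMs;

        Syllable(String text, long startMs, long endMs) {
            this.text = text;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    // Matches a "{\K<digits>}" tag followed by the text up to the next tag.
    private static final Pattern TAG =
            Pattern.compile("\\{\\\\[Kk](\\d+)\\}([^{]*)");

    /** Parses a line whose first tag starts at lineStartMs. */
    static List<Syllable> parse(String line, long lineStartMs) {
        List<Syllable> result = new ArrayList<>();
        long cursor = lineStartMs;
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            // Assumption: durations are centiseconds, so multiply by 10 for ms.
            long durationMs = Long.parseLong(m.group(1)) * 10;
            String text = m.group(2);
            if (!text.isEmpty()) {
                result.add(new Syllable(text, cursor, cursor + durationMs));
            }
            // A tag with no text is a pause: it only advances the clock.
            cursor += durationMs;
        }
        return result;
    }
}
```

Because only the line's start time is absolute, moving a whole line means changing one number; the syllable times are recomputed on parse.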

Is there a good reason this needs to be part of your Java program, or is an off-the-shelf solution possible?

#2

How about a simple data structure that records which batch of letters makes up the next syllable, together with the timestamp for switching to that syllable?

Just a quick example:

[0:00] This [0:02] is [0:05] an [0:07] ex- [0:08] am- [0:10] ple
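One way this could look in Java is a sorted map from start time to a cue, where each cue stores the character range of the syllable and of the word containing it. That gives you both ranges you need: the syllable to paint green and the rest of the word to paint red. This is only a sketch under assumptions: `SyllableTrack` and `Cue` are hypothetical names, times are in milliseconds, and the audio itself stays in its own file while the track holds only the timing data. The playback position would come from whatever player you use (for instance `Clip.getMicrosecondPosition()` in `javax.sound.sampled`, converted to milliseconds).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical cue track: maps a start time to the syllable that begins then. */
final class SyllableTrack {

    /** Character offsets (end exclusive) into the full text. */
    static final class Cue {
        final int syllableStart, syllableEnd; // the part currently being spoken
        final int wordStart, wordEnd;         // the word that contains it

        Cue(int syllableStart, int syllableEnd, int wordStart, int wordEnd) {
            this.syllableStart = syllableStart;
            this.syllableEnd = syllableEnd;
            this.wordStart = wordStart;
            this.wordEnd = wordEnd;
        }
    }

    // Sorted by start time so the active cue can be found with one lookup.
    private final TreeMap<Long, Cue> cuesByStartMs = new TreeMap<>();

    void add(long startMs, Cue cue) {
        cuesByStartMs.put(startMs, cue);
    }

    /** Returns the cue active at positionMs, or null before the first cue. */
    Cue activeAt(long positionMs) {
        Map.Entry<Long, Cue> entry = cuesByStartMs.floorEntry(positionMs);
        return entry == null ? null : entry.getValue();
    }
}
```

For the example text "This is an example", the cue for "am-" would be added as `track.add(8000, new Cue(13, 15, 11, 18))`; a UI timer can then poll `activeAt(position)` and repaint only when the returned cue changes.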

#3

Highlighting part of a word sounds like you're getting into phonetics, that is, the sounds that make up words. It's going to be really difficult to turn a sound file into something that will "read" a text. Your best bet is to use the text itself to drive a phonetics-based engine, like FreeTTS, which is based on the Java Speech API.

To do this you're going to have to take the text to be read, split it into phonetic syllables and play each one. So "syllable" is "syl" "la" "ble". Playing would be: highlight "syl", say it, and move to the next one.

This is really "old-skool"; it was done the same way on the original Apple II.

#4

You might want to get familiar with FreeTTS, an open source tool: http://freetts.sourceforge.net/docs/index.php

You might want to feed only a few words to the TTS engine at a time: highlight them and, once they have been spoken, de-highlight them and move to the next batch of words.
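A loop along those lines might look like the sketch below. It does not call FreeTTS directly: `Speaker` and `Highlighter` are hypothetical interfaces invented for this example, and the sketch assumes `speak` blocks until the engine has finished the batch. An adapter over FreeTTS or any other engine would have to provide that behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper around a TTS engine; assumed to block until spoken. */
interface Speaker {
    void speak(String text);
}

/** Hypothetical wrapper around the text display. */
interface Highlighter {
    void highlight(int start, int end); // character range, end exclusive
    void clear();
}

final class ReadAlong {

    private static final Pattern WORD = Pattern.compile("\\S+");

    /** Speaks the text a few words at a time, highlighting each batch. */
    static void read(String text, int wordsPerBatch,
                     Speaker speaker, Highlighter ui) {
        Matcher m = WORD.matcher(text);
        int batchStart = 0;
        int batchEnd = 0;
        int count = 0;
        while (m.find()) {
            if (count == 0) {
                batchStart = m.start();
            }
            batchEnd = m.end();
            count++;
            if (count == wordsPerBatch) {
                speakBatch(text, batchStart, batchEnd, speaker, ui);
                count = 0;
            }
        }
        if (count > 0) { // leftover words that did not fill a whole batch
            speakBatch(text, batchStart, batchEnd, speaker, ui);
        }
    }

    private static void speakBatch(String text, int start, int end,
                                   Speaker speaker, Highlighter ui) {
        ui.highlight(start, end);
        speaker.speak(text.substring(start, end)); // returns once spoken (assumption)
        ui.clear();
    }
}
```

In a Swing application this loop would need to run off the event dispatch thread, with the highlight calls posted back to it, since a blocking `speak` would otherwise freeze the UI.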

BR,
~A
