
Data Sharing Requirements for Speech-to-Text Model Finetuning

Discover the requirements and format for sending audio data to help finetune Corti's ASR models.


Data definition

Data for automatic speech recognition (ASR) training or finetuning takes the form of pairs of audio recordings and target transcripts of that audio.

Audio

The audio recordings should reflect the domain in which the ASR is expected to function, including background noise and the type of recording equipment, as well as speaker dialect, accent, gender, race, age, and other characteristics that might affect the speech.

In cases where recordings do not reflect the target domain, this should be communicated; it will require extra effort from the ML team during model training and evaluation to ensure performance on the target domain.

Transcript

The transcripts must be verbatim (or near-verbatim) to be useful for training, fine-tuning and/or evaluating an ASR model. Any syntactic property of the transcripts expected from the delivered ASR model must also be reflected in the target transcripts (e.g. capitalization and punctuation).

Transcripts generated by other highly accurate dictation software vendors can be excellent sources of data, provided they meet the criteria for verbatim or near-verbatim accuracy, as this ensures alignment with the expected outputs of the ASR model. Being machine-generated, such transcripts often contain errors that make them ill suited for direct use in model training; they remain useful for a number of other use cases and can also be corrected in an active learning setup. In short, leveraging such transcripts can often be an effective way to produce high-quality datasets for training or evaluation.

In cases where verbatim transcripts cannot be obtained it might be possible for Corti’s ML team to extract approximately verbatim transcripts from near-verbatim texts such as summaries written from a recorded dictation.

Example

A data sample (excluding the audio file itself) might contain the following parts:

{"audio_file": "file_name.wav", 	
"start": "00:00:00.4",
"stop": "00:00:09.2",
"transcript": "An example of training and evaluation data for an ASR! Wow!"}

While this data sample is in JSON format, other text formats such as CSV are also supported, as long as they comply with the guidelines for the transcript itself (note that transcripts usually include commas, which the chosen format must preserve!).
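
Below is a minimal Python sketch of how a single sample could be serialized in both formats. It is only an illustration: the file names samples.jsonl and samples.csv are invented, and Python's csv module is shown simply as one way to quote transcript fields that contain commas.

import csv
import json

# Hypothetical sample following the structure shown above.
sample = {
    "audio_file": "file_name.wav",
    "start": "00:00:00.4",
    "stop": "00:00:09.2",
    "transcript": "An example of training and evaluation data for an ASR! Wow!",
}

# JSON: one object per line (JSON Lines) keeps samples easy to stream.
with open("samples.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# CSV: the csv module quotes the transcript so its commas do not break the columns.
with open("samples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(sample), quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerow(sample)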


Metadata

Per-sample metadata

A data sample can also include metadata such as speaker_id, gender, race, age, language, dialect, location, and date; in general, as much metadata as possible should be included. This metadata might be used during training, and it is always valuable for estimating model performance on subpopulations.
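
As a purely hypothetical illustration (the field values below are invented and do not form a fixed schema), a sample record with per-sample metadata might look like this:

sample_with_metadata = {
    "audio_file": "file_name.wav",
    "start": "00:00:00.4",
    "stop": "00:00:09.2",
    "transcript": "An example of training and evaluation data for an ASR! Wow!",
    # Hypothetical per-sample metadata; include as many fields as you can share.
    "speaker_id": 17,
    "gender": "female",
    "age": 42,
    "language": "da",
    "dialect": "South Jutlandic",
    "location": "Aabenraa",
    "date": "2024-03-01",
}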

Domain-related metadata

For specialized domains, metadata can also include lists of specialized words. These can help familiarize the model with terms that are rare in the data but important for correct understanding and downstream uses of the transcript. A list of words does not need associated audio recordings, but if recordings are available they can be included.
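
For example, such a word list could be delivered as a simple JSON file. The sketch below is only an assumption about one possible layout; the file name specialized_terms.json and the medical terms are illustrative.

import json

# Hypothetical specialized terms; no audio recordings are required for these.
specialized_terms = [
    "myocardial infarction",
    "electrocardiogram",
    "anticoagulant",
]

with open("specialized_terms.json", "w", encoding="utf-8") as f:
    json.dump(specialized_terms, f, ensure_ascii=False, indent=2)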


Amount of data

For a language where dialects are not a big concern, 50 hours of audio recordings and corresponding verbatim, or near-verbatim, transcripts for 20+ speakers are needed to see useful improvements in transcript quality.

Specialized domains

The more specialized the vocabulary, the more audio-transcript pairs are needed to provide equivalent transcription quality.

Examples of specialized domains ranked by specialization (highest to lowest):

  1. Internal communication at a hospital

  2. Consultations at specialist doctor practices

  3. Consultations at general practitioners

  4. Calls to a maternity ward

  5. Calls to emergency services

Dialects

For languages with a significant number of strong dialects, diverse speaker representation for each dialect is required.

If the dialects are “fairly similar”, a good starting point would be 5-10 hours of audio and 5+ speakers per dialect. In the extreme case where the dialects are very different (almost separate languages), upwards of 50 hours and 20+ speakers per dialect would be required.

One reason for these data requirements is the disconnect between the written and spoken language for strong dialects. As an example, Swiss German has no official written form but is similar in many ways to High German. In such cases, target transcripts might be in High German while the speech is Swiss German, which complicates the task imposed on the model.

Lack of written form

Some languages lack a written form. For a transcription task to make sense in these cases, a substitute written form must be agreed upon. In some cases, cultural norms might impose a common written language into which the speech could be transcribed, effectively extending the speech recognition task to include translation.

This disconnect between spoken and written forms further increases the required amount of data. An estimate is 100 hours of audio recordings and corresponding verbatim, or near-verbatim, transcripts for 20+ speakers.


Data format

Audio files should be provided as individual files. The audio should not be resampled and should be delivered in the format and encoding in which it was originally recorded, even if this means the audio files come in several different formats. A single file per recording is preferred, i.e. there is no need to divide any recording into multiple audio files.

Transcripts should be segmented, i.e. each audio file should be transcribed as several separate phrases, each labelled with its start and stop times in the full audio file. Timestamps should be provided with at least 100 ms resolution in the format H:MM:SS.m, where H is hours, MM is minutes (00-59), SS is seconds (00-59), and m is tenths of a second (i.e. steps of 100 milliseconds). Transcripts can be provided in different formats, as long as their link to the source audio file remains clear.
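
The following Python sketch (an illustration, not an official validation tool) parses timestamps in the H:MM:SS.m format described above and converts them to seconds, which is handy for checking segment durations.

import re

# One or more hour digits, two-digit minutes/seconds, one digit of tenths (100 ms).
TIMESTAMP = re.compile(r"^(\d+):([0-5]\d):([0-5]\d)\.(\d)$")

def parse_timestamp(value: str) -> float:
    """Convert 'H:MM:SS.m' to seconds; m counts tenths of a second (100 ms steps)."""
    match = TIMESTAMP.match(value)
    if match is None:
        raise ValueError(f"{value!r} is not in H:MM:SS.m format")
    hours, minutes, seconds, tenths = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + tenths / 10

# Example: the duration of a segment running from 0:00:11.3 to 0:00:15.9.
duration = parse_timestamp("0:00:15.9") - parse_timestamp("0:00:11.3")  # 4.6 seconds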

Example format 1

Audio files along with individual metadata files:

delivered_data/ 
├── file_name_1.wav
├── file_name_1.json
├── file_name_2.mp3
├── file_name_2.json
├── file_name_3.ogg
├── file_name_3.json
├── ...
└── file_name_N.json

The individual file_name_X.json files would then look like:

[
  {
    "file": "file_name_1.wav",
    "start": "0:00:00.0",
    "stop": "0:00:09.4",
    "speaker_id": 0,
    "transcript": "Hello, this is the captain speaking."
  },
  {
    "file": "file_name_1.wav",
    "start": "0:00:11.3",
    "stop": "0:00:15.9",
    "speaker_id": 1,
    "transcript": "I bet this one drew their license in a machine."
  },
  ...
]
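
For completeness, here is a rough Python validation sketch for this layout. It assumes the delivered_data/ directory name and the JSON keys used in the example above; adjust both to match an actual delivery.

import json
from pathlib import Path

root = Path("delivered_data")
required_keys = {"file", "start", "stop", "transcript"}

for meta_path in sorted(root.glob("*.json")):
    segments = json.loads(meta_path.read_text(encoding="utf-8"))
    for index, segment in enumerate(segments):
        # Report segments that lack required keys or point at missing audio files.
        missing = required_keys - segment.keys()
        if missing:
            print(f"{meta_path.name}, segment {index}: missing keys {sorted(missing)}")
        elif not (root / segment["file"]).exists():
            print(f"{meta_path.name}, segment {index}: audio file {segment['file']} not found")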

Example format 2

Audio files along with a single metadata file:

delivered_data/ 
├── file_name_1.wav
├── file_name_2.mp3
├── file_name_3.ogg
├── ...
└── metadata.json

The metadata.json file might then look like:

[
  {
    "file": "file_name_1.wav",
    "start": "0:00:00.1",
    "stop": "0:00:09.4",
    "speaker_id": 0,
    "transcript": "Hello, this is the captain speaking."
  },
  {
    "file": "file_name_1.wav",
    "start": "0:00:11.6",
    "stop": "0:00:15.2",
    "speaker_id": 1,
    "transcript": "I bet this one drew their license in a machine."
  },
  ...
  {
    "file": "file_name_2.mp3",
    "start": "0:00:04.4",
    "stop": "0:00:13.6",
    "speaker_id": 2,
    "transcript": "Nonetheless, the plane landed safely."
  }
]
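
A similarly rough sketch (again assuming the delivered_data/ directory and the keys shown above) can group the segments in the single metadata file by their source recording, which shows how this layout maps back to per-recording transcripts.

import json
from collections import defaultdict
from pathlib import Path

root = Path("delivered_data")
segments_by_file = defaultdict(list)

# Collect every segment under the audio file it was transcribed from.
for segment in json.loads((root / "metadata.json").read_text(encoding="utf-8")):
    segments_by_file[segment["file"]].append(segment)

for file_name, file_segments in segments_by_file.items():
    print(f"{file_name}: {len(file_segments)} transcribed segments")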

To continuously improve the ASR models' ability to deal with special dialect words and pronunciations, as well as medical terms, additional data may be required. The amounts above should be seen as a starting point.

If you would like more accurate results and would like to help train our models, please contact [email protected] to learn more.
