How to connect a custom speech model with speech to text captioning?

Quynh Huynh (NON EA SC ALT) 40 Reputation points Microsoft Employee
2025-12-01T22:55:33.45+00:00

While following the steps for training a custom speech model, I noticed that most of the documentation describes WAV as the expected audio file format. For our particular use case, we'd like to use the custom speech model to generate caption files for MP4 videos. Are there any suggestions on how this could work? Converting MP4 to WAV seems redundant, given that Speech Studio also supports a captioning solution.

Thank you!

Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. Aryan Parashar 3,685 Reputation points Microsoft External Staff Moderator
    2025-12-02T10:29:05.7466667+00:00

    Hi Quynh Huynh,

    You are correct; training a custom model requires WAV files. Here is the documentation:
    https://v4.hkg1.meaqua.org/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train#audio-data-for-training-or-testing

    However, for the captioning workflow you mentioned, you can input MP4 files directly without converting them, so no extra conversion step is needed in day-to-day use.

    You might also consider using the Whisper model available in Azure AI Foundry as an alternative.

    Let me know if you have any further questions.
