Audio
Add background music, voiceovers, and captions to create immersive, accessible Arcades.
Audio Recording
How do I record my own voice directly into Arcade?
To record your own voice:
Click the Record button in the Recording panel.
Select your microphone under Devices.
Choose whether you want to record:
Audio only
Audio and video (camera recording)
If you select None (Audio Only) under camera devices, it will record just your voice without video.
After recording, Arcade will automatically generate captions for your recording.
Can I record video and audio together?
Yes. If you want to record yourself speaking on camera, select a video device (such as your MacBook Pro camera) and your microphone. This will create a video step with both your image and your voice.
If you want an audio-only recording, select None (Audio Only) for the video device before recording.
Audio Uploading
Can I upload my own audio file?
Yes, you can upload your own MP3 audio file to an Arcade. You can either:
Upload background music through the Design panel
Upload an audio file directly to a specific step (chapter, image, or video step) using the Upload option in the Recording menu
When you upload audio to a step, Arcade will automatically generate captions for it.
How do I upload pre-recorded audio or video?
Instead of recording live, you can upload a file by:
Clicking the Upload button in the Recording panel.
Selecting an MP3 (audio) or MP4 (video) file from your computer.
Arcade will automatically generate captions for any uploaded files.
What happens to the audio in uploaded videos?
If you upload an MP4 file that contains audio, the audio will play by default in your Arcade. You can control the behavior of the audio for each video step. When you select a video step, you will see an option to mute or unmute the video's audio.
This gives you flexibility to use the video’s original sound or rely on background music or voiceover instead.
Background Music
How can I add background music to my Arcade?
You can add background music to your Arcade by navigating to the Design panel. In the background music section, you can select from Arcade's available music tracks or upload your own MP3 file. Background music will play continuously as viewers move through your Arcade.
If you upload an MP4 file that includes audio, that audio will automatically play when viewers reach the relevant step.
Can I upload my own background music?
Yes, you can upload your own MP3 file to use as background music. Navigate to the Design panel, choose the background music section, and upload your file.
What happens if I have background music and step audio?
If you have both background music and an individual step with audio (either synthetic voice, uploaded audio, or video audio), the step's audio will play on top of the background music.
Depending on your design goals, you may want to mute the background music during steps with voiceovers or important video content.
Captions
Does Arcade automatically generate captions?
Yes. Any time you:
Record your own audio
Upload an MP3 or MP4 file
Add synthetic voice
Arcade automatically generates captions for that audio content. Viewers can opt to toggle off the closed captions in the Arcade viewer.
This ensures your Arcade is accessible and easy to follow, even without sound.
Can I customize or edit captions?
It depends on how the audio was added:
Audio recorded inside Arcade: You can edit the captions manually. After recording, you'll see an editable captions panel where you can click and change the text.
Uploaded MP3 or MP4 files: Captions cannot currently be edited.
If you need to change captions for uploaded audio, you would need to either re-upload a corrected audio file or re-record using Arcade’s recording tool.
Common Questions
Can I add different audio to different steps?
Yes. You can add unique audio clips to individual steps. Whether it’s an image step, chapter step, or video step, you can:
Record new audio
Upload a different MP3 or MP4
Generate a synthetic voiceover
This allows you to tailor the sound experience step-by-step across your Arcade.
What happens if I have background music, video audio, and synthetic voice all at once?
If a step has multiple audio sources, Arcade will layer them:
Background music will continue unless manually paused.
Step-specific audio (video or synthetic voice) will play on top.
In most cases, viewers will hear both unless you mute either the video or background music on that step.
We recommend muting background music on steps with detailed voiceovers to ensure clarity.
Synthetic Audio
Our AI synthetic voices, provided by Eleven Labs, have a high degree of customizability. Read below to understand more about how to change the pausing, pacing, emotion, pronunciation, and more.
What is synthetic voice in Arcade?
Synthetic voice (SV) is a feature that uses AI-generated voices to narrate your Arcade. You can use this to automatically create professional-sounding voiceovers without needing to record your own audio.
Arcade partners with Eleven Labs to provide a wide variety of voices and languages for synthetic voice generation.
How do I add synthetic voice to my Arcade?
There are two main ways to add synthetic voice:
Generate synthetic voice for a hotspot: You can check a box next to any hotspot to automatically generate matching synthetic voice for that hotspot. Once complete, a toast notification will confirm that SV was created for the step.
Manually add synthetic voice to a step: In the Voiceover AI panel, you can type in a script and choose a voice to generate an audio file for a specific step.
What voices are available for synthetic voice?
Arcade offers a variety of English voices, each with a unique tone and accent. You can preview each voice before selecting it.
Can I use my own Eleven Labs API key?
Yes. If you have your own Eleven Labs API key, you can connect it to Arcade. This allows you to:
Use your custom voices
Clone your own voice and use it inside your Arcades
We provide step-by-step instructions on how to connect your Eleven Labs account inside the Arcade settings.
What languages are supported for synthetic voice?
Arcade’s synthetic voice feature supports many languages beyond English. If you type your script in another language, Arcade will automatically use a matching voice for that language, when available.
A full list of supported languages is available inside the Voiceover AI panel.
If you need a language or voice that is not included, reach out to our support team.
Basics
Generating high-quality voiceovers can be a difficult and time-consuming task. However, with Arcade, you can effortlessly produce synthetic voiceovers that sound both professional and natural simply by inputting your script.
You can add voiceovers to chapters and image steps to provide clear and concise explanations, improving the user experience.
Arcade's synthetic voiceovers support 29 languages, with several accents available for languages like Portuguese and Spanish.
What if my language or specific accent isn't supported?
Please reach out to us! We can add more languages and accents as requested.
Can I use my own voice as a synthetic voice?
You can record your own voice using our camera and microphone recording settings. If you're interested in cloning your own speaking voice, reach out to us on our support channels.
Prompting Synthetic Voiceovers
Effective techniques to guide the ElevenLabs AI in adding pauses, conveying emotion, and pacing the speech.
Pauses
There are a few ways to introduce a pause or break and influence the rhythm and cadence of the speaker. The most consistent way is programmatically, using the syntax <break time="1.5s" />. This will create an exact and natural pause in the speech. It is not just added silence between words: the AI has an actual understanding of this syntax and will add a natural pause.
However, since this is more than just inserted silence, how the AI handles these pauses can vary. As usual, the voice used plays a pivotal role in the output. Some voices, particularly those trained with a few "uh"s and "ah"s in them, have been shown to sometimes insert those vocal mannerisms during the pauses, like a real speaker might.
An example could look like this:
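"Give me one second to think about it." <break time="1.0s" /> "Yes, that would work."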
Break time should be specified in seconds; the AI can handle pauses of up to 3 seconds in length. Break tags can be used in Speech Synthesis and via the API, but are not yet available for Projects.
Please avoid using an excessive number of break tags, as that has been shown to potentially cause some instability in the AI. The speech might start speeding up and become very fast, or the AI might introduce more noise in the audio and a few other strange artifacts. We are working on resolving this.
The following alternatives are inconsistent and might not always work; we recommend using the break syntax above for consistency.
One trick that seems to provide the most consistent output, aside from the break syntax above, is a simple dash - or the em-dash —. You can even add multiple dashes such as -- -- for a longer pause.
An ellipsis ... can sometimes also work to add a pause between words, but it usually also adds some "hesitation" or "nervousness" to the voice that might not always fit.
Pronunciation
This feature is currently only supported in English.
In certain instances, you may want the model to pronounce a word, name, or phrase in a specific way. Pronunciation can be specified using standardised pronunciation alphabets. Currently we support the International Phonetic Alphabet (IPA) and the CMU Arpabet. Pronunciations are specified by wrapping words using the Speech Synthesis Markup Language (SSML) phoneme tag.
To use this feature, you need to wrap the desired word or phrase in the <phoneme alphabet="ipa" ph="your-IPA-Pronunciation-here">word</phoneme> tag for IPA, or the <phoneme alphabet="cmu-arpabet" ph="your-CMU-pronunciation-here">word</phoneme> tag for CMU Arpabet. Replace "your-IPA-Pronunciation-here" or "your-CMU-pronunciation-here" with the desired IPA or CMU Arpabet pronunciation.
An example for IPA:
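<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>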
An example for CMU Arpabet:
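<phoneme alphabet="cmu-arpabet" ph="AE1 K CH UW2 AH0 L IY0">actually</phoneme>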
It is important to note that this only works per word, meaning that if you, for example, have a name with a first and last name that you want to be pronounced a certain way, you will have to create the pronunciation for each word individually.
English is a lexical stress language, which means that within multi-syllable words, some syllables are emphasized more than others. The relative salience of each syllable is crucial for proper pronunciation and meaning distinctions. So, it is very important to remember to include the lexical stress when writing both IPA and ARPAbet as otherwise, the outcome might not be optimal.
Take the word “talon”, for example.
Incorrect:
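<phoneme alphabet="cmu-arpabet" ph="T AE L AH N">talon</phoneme>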
Correct:
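<phoneme alphabet="cmu-arpabet" ph="T AE1 L AH0 N">talon</phoneme>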
The first example might switch between putting the primary emphasis on AE and AH, while the second example will always be pronounced reliably with the emphasis on AE and no stress on AH.
If you write it as:
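<phoneme alphabet="cmu-arpabet" ph="T AE0 L AH1 N">talon</phoneme>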
It will always put emphasis on AH instead of AE.
Emotion
If you want the AI to express a specific emotion, the best approach is to write in a style similar to that of a book. To find good prompts to use, you can flip through some books and identify words and phrases that convey the desired emotion.
For instance, you can use dialogue tags to express emotions, such as he said, confused, or he shouted angrily. These types of prompts will help the AI understand the desired emotional tone and try to generate a voiceover that accurately reflects it. With this approach, you can create highly customized voiceovers that are perfect for a variety of applications.
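A script line using this technique might look like the following (an illustrative example, not a required format):
"I just can't get this step to work," he said, confused.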
You will also have to remove the prompt from the text afterwards, as the AI will read exactly what you give it. The AI can also sometimes infer the intended emotion from the text's context, even without the use of tags.
This is not always perfect, since you are relying on the AI's discretion to understand from the context of the text whether something is sarcastic, funny, or just plain.
Pacing
Based on varying user feedback and test results, it has been theorized that using a single long sample for voice cloning has brought more success for some, compared to using multiple smaller samples. The current theory is that the AI stitches these samples together without any separation, causing pacing issues and faster speech. This is likely why some people have reported fast-talking clones.
To control the pacing of the speaker, you can use the same approach as in emotion, where you write in a style similar to that of a book. While it’s not a perfect solution, it can help improve the pacing and ensure that the AI generates a voiceover at the right speed. With this technique, you can create high-quality voiceovers that are both customized and easy to listen to.
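As a sketch of what that can look like in a script (an illustrative example, not an official template), you might write:
"Let's take this one step at a time," she said slowly, pausing between each thought.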