Audio
Add background music, voiceovers, and captions to create immersive, accessible Arcades.
Audio Recording
How do I record my own voice directly into Arcade?
To record your own voice:
Click the Record button in the Recording panel.
Select your microphone under Devices.
Choose whether you want to record:
Audio only
Audio and video (camera recording)
If you select None (Audio Only) under camera devices, it will record just your voice without video.
After recording, Arcade will automatically generate captions for your recording.
Can I record video and audio together?
Yes. If you want to record yourself speaking on camera, select a video device (such as your MacBook Pro camera) and your microphone. This will create a video step with both your image and your voice.
If you want an audio-only recording, select None (Audio Only) for the video device before recording.
Audio Uploading
Can I upload my own audio file?
Yes, you can upload your own MP3 audio file to an Arcade. You can either:
Upload background music through the Design panel
Upload an audio file directly to a specific step (chapter, image, or video step) using the Upload option in the Recording menu
When you upload audio to a step, Arcade will automatically generate captions for it.
How do I upload pre-recorded audio or video?
Instead of recording live, you can upload a file by:
Clicking the Upload button in the Recording panel.
Selecting an MP3 (audio) or MP4 (video) file from your computer.
Arcade will automatically generate captions for any uploaded files.
What happens to the audio in uploaded videos?
If you upload an MP4 file that contains audio, the audio will play by default in your Arcade. You can control the behavior of the audio for each video step. When you select a video step, you will see an option to mute or unmute the video's audio.
This gives you flexibility to use the video’s original sound or rely on background music or voiceover instead.
Background Music
How can I add background music to my Arcade?
You can add background music to your Arcade by navigating to the Design panel. In the background music section, you can select from Arcade's available music tracks or upload your own MP3 file. Background music will play continuously as viewers move through your Arcade.
If you upload an MP4 file that includes audio, that audio will automatically play when viewers reach the relevant step.
Can I upload my own background music?
Yes, you can upload your own MP3 file to use as background music. Navigate to the Design panel, choose the background music section, and upload your file.
What happens if I have background music and step audio?
If you have both background music and an individual step with audio (either synthetic voice, uploaded audio, or video audio), the step's audio will play on top of the background music.
Depending on your design goals, you may want to mute the background music during steps with voiceovers or important video content.
Captions
Does Arcade automatically generate captions?
Yes. Any time you:
Record your own audio
Upload an MP3 or MP4 file
Add synthetic voice
Arcade automatically generates captions for that audio content. Viewers can opt to toggle off the closed captions in the Arcade viewer.
This ensures your Arcade is accessible and easy to follow, even without sound.
Can I customize or edit captions?
It depends on how the audio was added:
Audio recorded inside Arcade: You can edit the captions manually. After recording, you'll see an editable captions panel where you can click and change the text.
Uploaded MP3 or MP4 files: Captions cannot currently be edited.
If you need to change captions for uploaded audio, you would need to either re-upload a corrected audio file or re-record using Arcade’s recording tool.
Common Questions
Can I add different audio to different steps?
Yes. You can add unique audio clips to individual steps. Whether it’s an image step, chapter step, or video step, you can:
Record new audio
Upload a different MP3 or MP4
Generate a synthetic voiceover
This allows you to tailor the sound experience step-by-step across your Arcade.
What happens if I have background music, video audio, and synthetic voice all at once?
If a step has multiple audio sources, Arcade will layer them:
Background music will continue unless manually paused.
Step-specific audio (video or synthetic voice) will play on top.
In most cases, viewers will hear both unless you mute either the video or background music on that step.
We recommend muting background music on steps with detailed voiceovers to ensure clarity.
Synthetic Audio
Our AI synthetic voices, provided by Eleven Labs, have a high degree of customizability. Read below to understand more about how to change the pausing, pacing, emotion, pronunciation, and more.
What is synthetic voice in Arcade?
Synthetic voice (SV) is a feature that uses AI-generated voices to narrate your Arcade. You can use this to automatically create professional-sounding voiceovers without needing to record your own audio.
Arcade partners with Eleven Labs to provide a wide variety of voices and languages for synthetic voice generation.
How do I add synthetic voice to my Arcade?
There are two main ways to add synthetic voice:
Generate synthetic voice for a hotspot: You can check a box next to any hotspot to automatically generate matching synthetic voice for that hotspot. Once complete, a toast notification will confirm that SV was created for the step.
Manually add synthetic voice to a step: In the Voiceover AI panel, you can type in a script and choose a voice to generate an audio file for a specific step.
What voices are available for synthetic voice?
Arcade offers a variety of English voices, each with a unique tone and accent. You can preview each voice before selecting it.
Can I use my own Eleven Labs API key?
Yes. If you have your own Eleven Labs API key, you can connect it to Arcade. This allows you to:
Use your custom voices
Clone your own voice and use it inside your Arcades
We provide step-by-step instructions on how to connect your Eleven Labs account inside the Arcade settings.
What languages are supported for synthetic voice?
Arcade’s synthetic voice feature supports many languages beyond English. If you type your script in another language, Arcade will automatically use a matching voice for that language, when available.
A full list of supported languages is available inside the Voiceover AI panel.
If you need a language or voice that is not included, reach out to our support team.
Basics
Generating high-quality voiceovers can be a difficult and time-consuming task. However, with Arcade, you can effortlessly produce synthetic voiceovers that sound both professional and natural simply by inputting your script.
You can add voiceovers to chapters and image steps to provide clear and concise explanations, improving the user experience.
Arcade's synthetic voiceovers support 29 languages, with several accents available for languages like Portuguese and Spanish.
What if my language or specific accent isn't supported?
Please reach out to us! We can add more languages and accents as requested.
Can I use my own voice as a synthetic voice?
You can record your own voice using our camera and microphone recording settings. If you're interested in cloning your own speaking voice, reach out to us on our support channels.
Prompting Synthetic Voiceovers
Effective techniques to guide the ElevenLabs AI in adding pauses, conveying emotion, and pacing the speech.
Pauses
There are a few ways to introduce a pause or break and influence the rhythm and cadence of the speaker. The most consistent way is programmatically, using the syntax <break time="1.5s" />. This will create an exact and natural pause in the speech. It is not just added silence between words: the AI has an actual understanding of this syntax and will add a natural pause.
However, since this is more than just inserted silence, how the AI handles these pauses can vary. As usual, the voice used plays a pivotal role in the output. Some voices, particularly those trained with a few "uh"s and "ah"s in them, have been shown to sometimes insert those vocal mannerisms during the pauses, like a real speaker might.
An example could look like this:
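"Give me one second to think about it." <break time="1.0s" /> "Yes, that would work."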
Break time should be specified in seconds; the AI can handle pauses of up to 3 seconds in length. Break tags can be used in Speech Synthesis and via the API, but are not yet available for Projects.
Please avoid using an excessive number of break tags, as that has been shown to potentially cause some instability in the AI. The speech might start speeding up and become very fast, or the AI might introduce more noise in the audio and a few other strange artifacts. We are working on resolving this.
The following alternatives are inconsistent and might not always work; we recommend using the break syntax above for consistency.
One trick that seems to provide the most consistent output, aside from the break syntax above, is a simple dash - or the em-dash —. You can even add multiple dashes such as -- -- for a longer pause.
An ellipsis ... can sometimes also work to add a pause between words, but it usually also adds some "hesitation" or "nervousness" to the voice that might not always fit.
Pronunciation
This feature is currently only supported in English.
In certain instances, you may want the model to pronounce a word, name, or phrase in a specific way. Pronunciation can be specified using standardised pronunciation alphabets. Currently we support the International Phonetic Alphabet (IPA) and the CMU Arpabet. Pronunciations are specified by wrapping words using the Speech Synthesis Markup Language (SSML) phoneme tag.
To use this feature, you need to wrap the desired word or phrase in the <phoneme alphabet="ipa" ph="your-IPA-Pronunciation-here">word</phoneme> tag for IPA, or the <phoneme alphabet="cmu-arpabet" ph="your-CMU-pronunciation-here">word</phoneme> tag for CMU Arpabet. Replace "your-IPA-Pronunciation-here" or "your-CMU-pronunciation-here" with the desired IPA or CMU Arpabet pronunciation.
An example for IPA:
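<phoneme alphabet="ipa" ph="ˈæktʃuəli">actually</phoneme>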
An example for CMU Arpabet:
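<phoneme alphabet="cmu-arpabet" ph="AE1 K CH UW2 AH0 L IY0">actually</phoneme>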
It is important to note that this only works per word, meaning that if you, for example, have a name with a first and last name that you want to be pronounced a certain way, you will have to create the pronunciation for each word individually.
English is a lexical stress language, which means that within multi-syllable words, some syllables are emphasized more than others. The relative salience of each syllable is crucial for proper pronunciation and meaning distinctions. So, it is very important to remember to include the lexical stress when writing both IPA and ARPAbet as otherwise, the outcome might not be optimal.
Take the word “talon”, for example.
Incorrect:
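<phoneme alphabet="cmu-arpabet" ph="T AE L AH N">talon</phoneme>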
Correct:
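<phoneme alphabet="cmu-arpabet" ph="T AE1 L AH0 N">talon</phoneme>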
The first example might switch between putting the primary emphasis on AE and AH, while the second example will always be pronounced reliably with the emphasis on AE and no stress on AH.
If you write it as:
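<phoneme alphabet="cmu-arpabet" ph="T AE0 L AH1 N">talon</phoneme>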
It will always put emphasis on AH instead of AE.
Emotion
If you want the AI to express a specific emotion, the best approach is to write in a style similar to that of a book. To find good prompts to use, you can flip through some books and identify words and phrases that convey the desired emotion.
For instance, you can use dialogue tags to express emotions, such as he said, confused, or he shouted angrily. These types of prompts will help the AI understand the desired emotional tone and try to generate a voiceover that accurately reflects it. With this approach, you can create highly customized voiceovers that are perfect for a variety of applications.
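A script line using this technique might look like the following (an illustrative example, not a required format):
"I just can't get this step to work," he said, confused.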
You will also have to remove the prompt from the text afterwards, as the AI will read exactly what you give it. The AI can also sometimes infer the intended emotion from the text's context, even without the use of tags.
This is not always perfect, since you are relying on the AI's discretion to understand from the context of the text whether something is sarcastic, funny, or just plain.
Pacing
Based on varying user feedback and test results, it has been theorized that using a single long sample for voice cloning has brought more success for some, compared to using multiple smaller samples. The current theory is that the AI stitches these samples together without any separation, causing pacing issues and faster speech. This is likely why some people have reported fast-talking clones.
To control the pacing of the speaker, you can use the same approach as in emotion, where you write in a style similar to that of a book. While it’s not a perfect solution, it can help improve the pacing and ensure that the AI generates a voiceover at the right speed. With this technique, you can create high-quality voiceovers that are both customized and easy to listen to.
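As a sketch of what that can look like in a script (an illustrative example, not an official template), you might write:
"Let's take this one step at a time," she said slowly, pausing between each thought.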