Voice Generation Skill

Generate realistic speech using AI (Google Gemini TTS, ElevenLabs, OpenAI TTS).

Prerequisites

At least one API key is required:

GOOGLE_API_KEY - For Google Gemini TTS (same key as video/image/music) ✅
ELEVENLABS_API_KEY - For ElevenLabs high-quality voice synthesis
OPENAI_API_KEY - For OpenAI TTS voices
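A quick way to see which providers are usable is to check the environment before picking an API. The sketch below assumes the three variable names listed above; the provider labels and the `available_providers` helper are illustrative names, not part of the skill's scripts.

```python
import os

# Variable names come from the list above; the short provider labels
# are an illustrative choice for this sketch.
KEY_TO_PROVIDER = {
    "GOOGLE_API_KEY": "gemini",
    "ELEVENLABS_API_KEY": "elevenlabs",
    "OPENAI_API_KEY": "openai",
}


def available_providers(env=None):
    """Return provider labels whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for key, provider in KEY_TO_PROVIDER.items()
        if env.get(key, "").strip()
    ]


if __name__ == "__main__":
    found = available_providers()
    print("Usable TTS providers:", ", ".join(found) or "none (set at least one key)")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.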

Available APIs

Google Gemini TTS (Recommended - Same API Key)

Best for: Podcasts, dialogues, audiobooks with style control
Voices: 30 voices with natural language style control
Multi-speaker: Up to 2 speakers for dialogues ✅
Languages: 24 languages (auto-detected)
Features: Control style, accent, pace via prompts
Output: 24kHz WAV
API Key: Same GOOGLE_API_KEY as video/image/music ✅

ElevenLabs (Best Quality)

Best for: Natural-sounding voices, voice cloning, long-form content
Voices: 100+ pre-made voices + custom voice cloning
Languages: 29+ languages
Models: Eleven Multilingual v2, Eleven Turbo v2

OpenAI TTS (Simplest)

Best for: Quick, reliable text-to-speech with consistent quality
Voices: alloy, echo, fable, onyx, nova, shimmer
Models: tts-1 (fast), tts-1-hd (high quality)
Output: MP3, Opus, AAC, FLAC

Workflow

Step 1: Understand the Request

Parse the user's voice request for:

Text content: What should be spoken?
Voice type: Male, female, specific character?
Tone: Professional, casual, dramatic, cheerful?
Use case: Narration, voiceover, audiobook, notification?
Language: English, Spanish, other?
Speed: Normal, slow, fast?

Step 2: Select Voice and API

Choose based on requirements:

| Use Case | Recommended API | Reason |
|----------|----------------|--------|
| Default / Same key as video | Gemini TTS | Same GOOGLE_API_KEY ✅ |
| Multi-speaker dialogue | Gemini TTS | Up to 2 speakers built-in |
| Style/accent control | Gemini TTS | Natural language prompts |
| Voice cloning | ElevenLabs | Only API with cloning |
| 100+ voice options | ElevenLabs | Widest selection |
| Audiobook/podcast | ElevenLabs or Gemini | Both excellent for long content |
| Quick narration | OpenAI TTS | Fast, reliable |
| Budget-conscious | OpenAI TTS | Lower cost |
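The selection table above can be approximated as a small decision function. This is a sketch under stated assumptions: `choose_api` and its parameters are hypothetical, the fallback order for the default case (Gemini, then OpenAI, then ElevenLabs) is a guess at a sensible preference, and the more-than-two-speakers rule comes from the multi-speaker limit described under Error Handling.

```python
def choose_api(needs_cloning=False, speakers=1, needs_style=False,
               available=("gemini", "elevenlabs", "openai")):
    """Pick a provider label following the selection table (hypothetical helper)."""
    if needs_cloning:
        preferred = ["elevenlabs"]      # only API with voice cloning
    elif speakers > 2:
        preferred = ["elevenlabs"]      # beyond Gemini's two-speaker limit
    elif speakers == 2 or needs_style:
        preferred = ["gemini"]          # built-in dialogue and style prompts
    else:
        # Default case; this fallback order is an assumption of the sketch.
        preferred = ["gemini", "openai", "elevenlabs"]

    for name in preferred:
        if name in available:
            return name
    raise LookupError(f"no configured provider among {preferred}")


print(choose_api())                      # default request
print(choose_api(needs_cloning=True))    # cloning request
```

Raising instead of silently falling back keeps a hard requirement such as cloning from being served by a provider that cannot meet it.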

Step 3: Prepare the Text

Optimize text for speech:

Add pauses: Use commas, periods for natural rhythm
Spell out numbers: "1,234" → "one thousand two hundred thirty-four" (if needed)
Handle acronyms: Write "NASA" when it should be read as a word, or "N.A.S.A." when it should be spelled out letter by letter
Mark emphasis: Some APIs support emphasis markers

Example transformation:

Original: "The Q4 2024 results show a 15% YoY increase."
Optimized: "The Q4 2024 results show a fifteen percent year-over-year increase."
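The transformation above can be partly automated. The following is a minimal sketch, not the skill's own preprocessing: `number_to_words`, `prepare_for_speech`, and the `ABBREVIATIONS` table are hypothetical names, and only percentages, comma-grouped numbers, and listed abbreviations are expanded. Bare figures such as years are left untouched.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Illustrative abbreviation table; extend it per project.
ABBREVIATIONS = {"YoY": "year-over-year"}


def number_to_words(n):
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


def prepare_for_speech(text):
    """Expand percentages, comma-grouped numbers, and known abbreviations."""
    def spell(match):
        return number_to_words(int(match.group(1).replace(",", "")))

    # Percentages first, so "15%" becomes "fifteen percent".
    text = re.sub(r"(\d{1,3}(?:,\d{3})+|\d+)%", lambda m: spell(m) + " percent", text)
    # Then numbers written with thousands separators, such as "1,234".
    text = re.sub(r"\b(\d{1,3}(?:,\d{3})+)\b", spell, text)
    for short, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    return text


print(prepare_for_speech("The Q4 2024 results show a 15% YoY increase."))
```

Years and labels such as "Q4" are deliberately skipped because their spoken form depends on context that a regular expression cannot judge.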

Step 4: Generate the Audio

Execute the appropriate script from ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/:

For Google Gemini TTS (single speaker):

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Welcome to our podcast!" \
  --voice "Charon"

Gemini TTS with style direction:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --text "Have a wonderful day!" \
  --voice "Puck" \
  --style "Say cheerfully with a British accent:"

Gemini TTS multi-speaker (dialogue):

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
  --multi \
  --speaker "Host:Charon" \
  --speaker "Guest:Aoede" \
  --text "Host: Welcome to the show!
Guest: Thanks for having me!"

For ElevenLabs:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
  --text "Your text here" \
  --voice "Rachel" \
  --model "eleven_multilingual_v2"

For OpenAI TTS:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
  --text "Your text here" \
  --voice "nova" \
  --model "tts-1-hd"

List Gemini voices:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py --list-voices
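When the scripts are driven from Python rather than a shell, building the argument list explicitly avoids quoting problems with text that contains quotes or newlines. The sketch below uses only the `--text`, `--voice`, and `--style` flags shown above; `gemini_tts_command` and `run_tts` are hypothetical helpers, and what the script prints on success is not documented here, so the captured output is returned as-is.

```python
import os
import subprocess
from pathlib import Path


def gemini_tts_command(text, voice, style=None, plugin_root=None):
    """Build the argv list for gemini_tts.py (hypothetical helper)."""
    # Falls back to the current directory if CLAUDE_PLUGIN_ROOT is unset;
    # that fallback is an assumption of this sketch.
    root = Path(plugin_root or os.environ.get("CLAUDE_PLUGIN_ROOT", "."))
    script = root / "skills" / "voice-generation" / "scripts" / "gemini_tts.py"
    cmd = ["python3", str(script), "--text", text, "--voice", voice]
    if style:
        cmd += ["--style", style]
    return cmd


def run_tts(cmd):
    """Run the command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


cmd = gemini_tts_command("Have a wonderful day!", "Puck",
                         style="Say cheerfully with a British accent:")
print(cmd)
```

Passing a list to `subprocess.run` bypasses the shell, so the text reaches the script exactly as written.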

Step 5: Deliver the Result

Provide the generated audio file path
Mention the voice and settings used
Offer to:
- Try a different voice
- Adjust speed or tone
- Use a different API
- Generate in a different format

Error Handling

Missing API key: Inform the user which key is needed:

Gemini TTS: Same GOOGLE_API_KEY as video/image - https://aistudio.google.com/apikey
ElevenLabs: https://elevenlabs.io
OpenAI: https://platform.openai.com/api-keys

Missing package: Gemini TTS requires the google-genai package. Install it with: pip install google-genai

Text too long: Split into chunks and concatenate, or suggest shorter text.
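One way to handle the split-and-concatenate approach is to break the text at sentence boundaries, generate one file per chunk, and join the results. The sketch below is illustrative: `split_for_tts` and `concat_wavs` are hypothetical helpers, the 4,000-character default is an assumed safety margin rather than a documented limit, and the join uses Python's standard `wave` module, so it applies to WAV output only.

```python
import re
import wave


def split_for_tts(text, max_chars=4000):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def concat_wavs(inputs, output):
    """Join WAV inputs (paths or file objects) that share one audio format."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("nothing to concatenate")
    params = None
    with wave.open(output, "wb") as out:
        for item in inputs:
            with wave.open(item, "rb") as part:
                fmt = (part.getnchannels(), part.getsampwidth(), part.getframerate())
                if params is None:
                    params = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError("all parts must share the same audio format")
                out.writeframes(part.readframes(part.getnframes()))


print(split_for_tts("One. Two. Three.", max_chars=9))
```

Compressed formats such as MP3 cannot be joined this way and would need a dedicated audio tool.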

Rate limit: Suggest waiting or trying a different API.

Unsupported language: Suggest an alternative API that supports the language.

Multi-speaker limit: Gemini TTS supports at most two speakers. For dialogues with more speakers, use ElevenLabs and make a separate call for each speaker's lines.
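The speaker limit can be checked before calling the script. The sketch below assumes the dialogue format used in the multi-speaker example, where each turn starts a line as "Label: text". `speakers_in_dialogue` and `fits_gemini_multi` are hypothetical helpers, and any line that happens to match that pattern is counted as a speaker turn.

```python
import re

MAX_GEMINI_SPEAKERS = 2  # limit stated in this section


def speakers_in_dialogue(script):
    """Return speaker labels in order of first appearance."""
    seen = []
    for line in script.splitlines():
        # A label is up to 40 characters, followed by a colon and spoken text.
        match = re.match(r"\s*([^:\n]{1,40}):\s+\S", line)
        if match:
            label = match.group(1).strip()
            if label not in seen:
                seen.append(label)
    return seen


def fits_gemini_multi(script):
    """True when the dialogue stays within the two-speaker limit."""
    return len(speakers_in_dialogue(script)) <= MAX_GEMINI_SPEAKERS


dialogue = "Host: Welcome to the show!\nGuest: Thanks for having me!"
print(speakers_in_dialogue(dialogue), fits_gemini_multi(dialogue))
```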

Voice Selection Guide

Google Gemini TTS Voices (30 voices)

| Style | Voices | Best For |
|-------|--------|----------|
| Bright/Upbeat | Zephyr, Puck, Aoede, Laomedeia | Marketing, cheerful content |
| Firm/Informative | Charon, Kore, Orus, Rasalgethi | News, tutorials, professional |
| Soft/Warm | Achernar, Sulafat, Vindemiatrix | Meditation, gentle narration |
| Smooth | Algieba, Despina, Callirrhoe | Audiobooks, storytelling |
| Clear | Erinome, Iapetus, Pulcherrima | Instructions, clarity |
| Character | Fenrir (excitable), Enceladus (breathy), Algenib (gravelly), Gacrux (mature) | Character voices, drama |
| Friendly | Achird, Zubenelgenubi (casual) | Casual, conversational |

Gemini TTS Style Tips:

Use natural language: --style "Say angrily:" or --style "Whisper mysteriously:"
Specify accents: --style "Speak with a British accent from London:"
Control pace: --style "Speak slowly and deliberately:"
Combine: --style "Say excitedly with a Southern US accent:"
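Style prompts like the ones above can be assembled from separate pieces when the manner and accent come from different parts of a request. This is a small sketch: `build_style` is a hypothetical helper, and the phrasing it produces simply mirrors the examples in this list rather than any required syntax.

```python
def build_style(manner=None, accent=None):
    """Compose a style prefix for --style, or return None if nothing is requested."""
    if not manner and not accent:
        return None
    # "Say" reads naturally with a manner; "Speak" when only an accent is given.
    words = ["Say" if manner else "Speak"]
    if manner:
        words.append(manner)
    if accent:
        article = "an" if accent[0].lower() in "aeiou" else "a"
        words.append(f"with {article} {accent} accent")
    return " ".join(words) + ":"


print(build_style("excitedly", "Southern US"))
print(build_style(accent="British"))
```

Returning `None` for an empty request lets the caller omit the `--style` flag entirely.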

OpenAI TTS Voices

| Voice | Description | Best For |
|-------|-------------|----------|
| alloy | Neutral, balanced | General purpose |
| echo | Warm, conversational | Podcasts, casual |
| fable | Expressive, British | Storytelling |
| onyx | Deep, authoritative | Narration, professional |
| nova | Friendly, upbeat | Marketing, tutorials |
| shimmer | Soft, gentle | Meditation, ASMR |

ElevenLabs Popular Voices

| Voice | Description | Best For |
|-------|-------------|----------|
| Rachel | Young female, American | Narration, audiobooks |
| Domi | Young female, energetic | Marketing, ads |
| Bella | Young female, soft | Storytelling |
| Antoni | Young male, well-rounded | Narration |
| Josh | Young male, deep | Audiobooks |
| Arnold | Mature male, authoritative | Documentary |
| Adam | Middle-aged male, deep | Narration |
| Sam | Young male, raspy | Character voices |

Best Practices

For Narration

Use a consistent voice throughout
Add natural pauses between paragraphs
Consider pacing for the content type

For Dialogue

Use different voices for different characters
Match voice characteristics to character descriptions
Adjust speed for emotional scenes

For Accessibility

Use clear, well-paced speech
Avoid overly stylized voices
Test with screen readers if applicable

API Comparison

| Feature | Gemini TTS | ElevenLabs | OpenAI TTS |
|---------|------------|------------|------------|
| API Key | GOOGLE_API_KEY ✅ | ELEVENLABS_API_KEY | OPENAI_API_KEY |
| Voice quality | Excellent | Excellent | Very good |
| Voice variety | 30 voices | 100+ voices | 6 voices |
| Multi-speaker | ✅ Up to 2 | ❌ No | ❌ No |
| Style control | ✅ Natural language | Limited | ❌ No |
| Voice cloning | ❌ No | ✅ Yes | ❌ No |
| Languages | 24 | 29+ | 50+ |
| Speed control | Via prompts | Yes | Yes (0.25-4x) |
| Max length | 32k tokens | 5,000 chars | 4,096 chars |
| Output format | WAV (24kHz) | MP3, WAV | MP3, Opus, AAC, FLAC |
| Same key as video/image | ✅ Yes | ❌ No | ❌ No |