IA e agentes
Áudio e mídia para composições HyperFrames
Produz voz, música, efeitos, transcrições, legendas e remoção de fundo para projetos HyperFrames.
Ver código no GitHub Instala diretamente do repositório-fonte.
O que esta skill faz
Esta skill centraliza recursos de áudio e mídia do HyperFrames no engine scripts/audio.mjs. Ela atende TTS, BGM, SFX, Whisper, legendas, letras, karaoke, estilos por palavra e remoção de fundo.
Quando usar
- Gerar voiceover com HeyGen, ElevenLabs ou Kokoro
- Adicionar música de fundo e efeitos sonoros
- Transcrever áudio com Whisper
- Criar legendas ou karaoke com estilo por palavra
- Remover o fundo de um recurso visual
Como usar
- Revise o repositório e identifique os recursos exigidos pela composição
- Crie um audio_request.json com a solicitação de mídia
- Execute scripts/audio.mjs com os diretórios corretos
- Revise audio_meta.json e os arquivos em assets
- Consuma e anime os dados gerados no HTML
O que revisar antes de instalar
- A disponibilidade de provedores depende das credenciais presentes
- O acesso HeyGen é resolvido por variáveis ou arquivo próprio, não pela CLI
- A colocação dos recursos na composição pertence ao fluxo hyperframes-core
- Fallbacks locais podem produzir resultados diferentes dos serviços externos
SKILL.md
---
name: hyperframes-media
description: Audio and media assets for HyperFrames compositions, produced by one shared audio engine (`scripts/audio.mjs`) — multi-provider TTS (HeyGen / ElevenLabs / Kokoro local), background music + sound effects (HeyGen audio-library retrieval by default, with local Lyria / MusicGen BGM generation and a bundled SFX library as the no-credential fallback), Whisper transcription, background removal, and caption authoring. Use for voiceover / TTS, BGM, SFX / sound effects, transcription, captions / subtitles / lyrics / karaoke / per-word styling, voice + provider selection, and music-mood prompting.
---
# HyperFrames Media
Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see `hyperframes-core`.
## The audio engine — one source for TTS · BGM · SFX
Workflows do NOT hand-roll audio or vendor a copy. There is one engine — **`scripts/audio.mjs`** — that takes a neutral `audio_request.json` and writes `audio_meta.json` (plus assets under `assets/voice|bgm|sfx`):
```bash
# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json
```
All three capabilities degrade on **ONE switch** — whether a HeyGen credential is present (resolved from `$HEYGEN_API_KEY` / `$HYPERFRAMES_API_KEY` / `~/.heygen`, **not** the CLI):
| Capability | HeyGen credential present | absent |
| ---------- | -------------------------------------------------- | ---------------------------------------------------- |
| TTS | HeyGen Starfish REST (native word timestamps) | → ElevenLabs → Kokoro (chain `transcribe` for words) |
| BGM | HeyGen music **retrieval** | Lyria → MusicGen local **generation** (detached) |
| SFX | HeyGen sound-effects **retrieval** (min_score 0.4) | bundled 21-file library (`assets/sfx/`) |
- **Request** (`audio_request.json`): `{ provider?, lang?, speed?, lines: [{ id, text, sfx?: [names] }], bgm: { mode?, query?, prompt? } }`. `id` joins each line back to the caller's model (a frame number, a scene id, …). `bgm.mode` = `retrieve | generate | none`; omit for auto (retrieve when credentialed, else generate). An **explicit** `retrieve` is strict — it skips rather than starting a detached generate (for callers with no `wait-bgm` step).
- **Output** (`audio_meta.json`, id-keyed): `{ tts_provider, voice_id, bgm, bgm_pending, …, voices: [{ id, path, duration_s, words }], sfx: [{ id, name, file, source, offset_s, duration_s, volume }], total_duration_s }`.
- `--only tts,bgm,sfx` runs a subset and **merges** into an existing `--out` (e.g. TTS+BGM early, SFX once cues exist).
- BGM generate is spawned **detached** (`bgm_pending: true`) — run `scripts/wait-bgm.mjs` before assembling.
- `scripts/heygen-tts.mjs` is a single-shot CLI over the same code (one text → wav + words) for when you just need HeyGen TTS without a request file.
Full flag list + the `audio_meta.json` schema live in the header of `scripts/audio.mjs`. The references below cover the provider details and edge cases behind each capability.
## Provider chains (the detail behind the engine)
**TTS** — first available provider wins (the engine, or `npx hyperframes tts "..."`):
| Order | Provider | Detected when | Word timestamps |
| ----- | ----------------------------- | -------------------------------------------- | ---------------------------------------------------------------- |
| 1 | HeyGen (Starfish) | `$HEYGEN_API_KEY` / `hyperframes auth login` | **Yes, native** — pass `--words narration.words.json` to capture |
| 2 | ElevenLabs | `$ELEVENLABS_API_KEY` set | No — chain `transcribe` after |
| 3 | Kokoro-82M (local, 54 voices) | always (no key required) | No — chain `transcribe` after |
> The published `hyperframes tts` CLI is often the local-only build (its `--help` says "Kokoro-82M", no `--provider`/`--words`) and silently falls back to Kokoro even with `$HEYGEN_API_KEY` set. That is why the engine's HeyGen path is the self-contained `scripts/heygen-tts.mjs` (REST), NOT the CLI; the CLI is used only for the Kokoro path. See `references/tts.md`.
**BGM & SFX** — by default **retrieved** from the HeyGen audio library (`/v3/audio/sounds`), same credential as HeyGen TTS, with the no-credential fallback from the switch above:
| Asset | HeyGen `type` | Lands in | Fallback (no credential) |
| ----- | ------------------------------- | ---------------------------------------------------------- | ---------------------------------------------------------- |
| BGM | `music` | `assets/bgm/track.mp3` (retrieve) · `track.wav` (generate) | Lyria / MusicGen generation |
| SFX | `sound_effects` (min_score 0.4) | `assets/sfx/<slug>.mp3` | bundled 21-file library (`assets/sfx/*` + `manifest.json`) |
See `references/bgm.md` and `references/sfx.md`.
## Routing
| Task | Read |
| ------------------------------------------------------------------- | -------------------------------------------- |
| The audio engine — request/meta schema, `--only`, the switch | `scripts/audio.mjs` (header comment) |
| `npx hyperframes tts` / `heygen-tts.mjs` — providers, voices, words | `references/tts.md` |
| BGM — HeyGen retrieval + local Lyria / MusicGen generation | `references/bgm.md` |
| SFX — HeyGen retrieval (min_score 0.4) + bundled local library | `references/sfx.md` |
| `npx hyperframes transcribe` — Whisper, model rules, output shape | `references/transcribe.md` |
| `npx hyperframes remove-background` — transparent cutouts | `references/remove-background.md` |
| TTS → transcription → captions (no recorded voiceover) | `references/tts-to-captions.md` |
| Caption authoring — style detection, layout, word grouping, exit | `references/captions/authoring.md` |
| Transcript handling — input formats, quality gates, cleanup, APIs | `references/captions/transcript-handling.md` |
| Caption motion — karaoke, marker effects, audio-reactive | `references/captions/motion.md` |
| Model caches, system dependencies, troubleshooting | `references/requirements.md` |
## Non-negotiable rules
- **One engine, no vendored copies.** Produce audio via `scripts/audio.mjs` (or `heygen-tts.mjs` for one-shot HeyGen TTS). Don't re-implement TTS/BGM/SFX inside a workflow — write an `audio_request.json` adapter and call the engine.
- **"HeyGen available" = a resolvable credential, not the CLI.** The whole switch keys off `heygenCredential()`; the published `hyperframes tts` may be Kokoro-only, and there is no `hyperframes bgm` / `hyperframes sfx` command at all.
- **Voice IDs are provider-specific.** `am_michael` is Kokoro-only; HeyGen UUIDs don't work on Kokoro. If you pass `--voice`, also pin `--provider` to avoid silent provider drift when the user's env changes.
- **Always pass `--model` to `transcribe`.** The CLI default `small.en` silently translates non-English audio. See `references/transcribe.md` → "Language Rule".
- **HeyGen returns word timestamps; ElevenLabs / Kokoro do not.** The engine chains `transcribe` automatically for the latter two; standalone, pass `--words` to HeyGen or run `transcribe` against the audio file.
- **Captions consume the flat word-array format** with `{ id, text, start, end }`. See `references/transcribe.md` → "Output Shape".
- **`remove-background --background-output` is hole-cut, not inpainted.** For "scene without the person", a different tool is needed. See `references/remove-background.md` → "When NOT the right tool".
- **BGM/SFX default to HeyGen retrieval; the no-credential fallback is generation (BGM) or the bundled library (SFX).** `/audio/sounds` ranks by a text query — name effects concretely (`glass shatter`, not `dramatic sound`); a no-match **skips**, never blocks the render. SFX sit at volume ~0.35 under voice + BGM. See `references/sfx.md` / `references/bgm.md`.