โ† All skills
Tencent SkillHub ยท Developer Tools

Google Gemini Media

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".

skill openclawclawhub Free
0 Downloads
0 Stars
0 Installs
0 Score
High Signal

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".

โฌ‡ 0 downloads โ˜… 0 stars Unverified but indexed

Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.1

Documentation

ClawHub primary doc Primary doc: SKILL.md 40 sections Open source page

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates: Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration) Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API) Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio) Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence) Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone) Audio understanding (upload/inline; description, transcription, time-range transcription, token counting) Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as the main line; currently only Node.js/REST examples are provided. If your project already wraps other languages or frameworks, map this Skill's request structure, model selection, and I/O spec to your wrapper layer.

2. Quick routing (decide which capability to use)

Do you need to produce images? Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5) Do you need to understand images? Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6) Do you need to produce video? Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7) Do you need to understand video? Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8) Do you need to read text aloud? Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9) Do you need to understand audio? Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)

3.0 Prerequisites (dependencies and tools)

Node.js 18+ (match your project version) Install SDK (example): npm install @google/genai REST examples only need curl; if you need to parse image Base64, install jq (optional).

3.1 Authentication and environment variables

Put your API key in GEMINI_API_KEY REST requests use x-goog-api-key: $GEMINI_API_KEY

3.2 Two file input modes: Inline vs Files API

Inline (embedded bytes/Base64) Pros: shorter call chain, good for small files. Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling. Files API (upload then reference) Pros: good for large files, reusing the same file, or multi-turn conversations. Typical flow: files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable) Use file_data / file_uri in generateContent Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example 10-15MB warning) or is reused, you automatically route through the Files API.

3.3 Unified handling of binary media outputs

Images: usually returned as inline_data (Base64) in response parts; in the SDK use part.as_image() or decode Base64 and save as PNG/JPG. Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono). Video (Veo): long-running async task; poll the operation; download the file (or use the returned URI).

4. Model selection matrix (choose by scenario)

Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

4.1 Image generation (Nano Banana)

gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing. gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

4.2 General image/video/audio understanding

Docs use gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).

4.3 Video generation (Veo)

Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).

4.4 Speech generation (TTS)

Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).

5.1 Text-to-Image

SDK (Node.js) minimal template import { GoogleGenAI } from "@google/genai"; import * as fs from "node:fs"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const response = await ai.models.generateContent({ model: "gemini-2.5-flash-image", contents: "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme", }); const parts = response.candidates?.[0]?.content?.parts ?? []; for (const part of parts) { if (part.text) console.log(part.text); if (part.inlineData?.data) { fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64")); } } REST (with imageConfig) minimal template curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" -H "x-goog-api-key: $GEMINI_API_KEY" -H "Content-Type: application/json" -d '{ "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}], "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}} }' REST image parsing (Base64 decode) curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \ -H "x-goog-api-key: $GEMINI_API_KEY" \ -H "Content-Type: application/json" \ -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \ | jq -r '.candidates[0].content.parts[] | select(.inline_data) | .inline_data.data' \ | base64 --decode > out.png # macOS can use: base64 -D > out.png

5.2 Text-and-Image-to-Image

Use case: given an image, add/remove/modify elements, change style, color grading, etc. SDK (Node.js) minimal template import { GoogleGenAI } from "@google/genai"; import * as fs from "node:fs"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const prompt = "Add a nano banana on the table, keep lighting consistent, cinematic tone."; const imageBase64 = fs.readFileSync("input.png").toString("base64"); const response = await ai.models.generateContent({ model: "gemini-2.5-flash-image", contents: [ { text: prompt }, { inlineData: { mimeType: "image/png", data: imageBase64 } }, ], }); const parts = response.candidates?.[0]?.content?.parts ?? []; for (const part of parts) { if (part.inlineData?.data) { fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64")); } }

5.3 Multi-turn image iteration (Multi-turn editing)

Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style"). To output mixed "text + image" results, set response_modalities to ["TEXT", "IMAGE"].

5.4 ImageConfig

You can set in generationConfig.imageConfig or the SDK config: aspectRatio: e.g. 16:9, 1:1. imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive and model support can vary).

6.1 Two ways to provide input images

Inline image data: suitable for small files (total request size < 20MB). Files API upload: better for large files or reuse across multiple requests.

6.2 Inline images (Node.js) minimal template

import { GoogleGenAI } from "@google/genai"; import * as fs from "node:fs"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const imageBase64 = fs.readFileSync("image.jpg").toString("base64"); const response = await ai.models.generateContent({ model: "gemini-3-flash-preview", contents: [ { inlineData: { mimeType: "image/jpeg", data: imageBase64 } }, { text: "Caption this image, and list any visible brands." }, ], }); console.log(response.text);

6.3 Upload and reference with Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const uploaded = await ai.files.upload({ file: "image.jpg" }); const response = await ai.models.generateContent({ model: "gemini-3-flash-preview", contents: createUserContent([ createPartFromUri(uploaded.uri, uploaded.mimeType), "Caption this image.", ]), }); console.log(response.text);

6.4 Multi-image prompts

Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.

7.1 Core features (must know)

Generates 8-second high-fidelity video, optionally 720p / 1080p / 4k, and supports native audio generation (dialogue, ambience, SFX). Supports: Aspect ratio (16:9 / 9:16) Video extension (extend a generated video; typically limited to 720p) First/last frame control (frame-specific) Up to 3 reference images (image-based direction)

7.2 SDK (Node.js) minimal template: async polling + download

import { GoogleGenAI } from "@google/genai"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const prompt = "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience."; let operation = await ai.models.generateVideos({ model: "veo-3.1-generate-preview", prompt, config: { resolution: "1080p" }, }); while (!operation.done) { await new Promise((resolve) => setTimeout(resolve, 10_000)); operation = await ai.operations.getVideosOperation({ operation }); } const video = operation.response?.generatedVideos?.[0]?.video; if (!video) throw new Error("No video returned"); await ai.files.download({ file: video, downloadPath: "out.mp4" });

7.3 REST minimal template: predictLongRunning + poll + download

Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.

7.4 Common controls (recommend a unified wrapper)

aspectRatio: "16:9" or "9:16" resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive) When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood). Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.

7.5 Important limits (engineering fallback needed)

Latency can vary from seconds to minutes; implement timeouts and retries. Generated videos are only retained on the server for a limited time (download promptly). Outputs include a SynthID watermark. Polling fallback (with timeout/backoff) pseudocode const deadline = Date.now() + 300_000; // 5 min let sleepMs = 2000; while (!operation.done && Date.now() < deadline) { await new Promise((resolve) => setTimeout(resolve, sleepMs)); sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000); operation = await ai.operations.getVideosOperation({ operation }); } if (!operation.done) throw new Error("video generation timed out");

8.1 Video input options

Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse. Inline video data: for smaller files. Direct YouTube URL: can analyze public videos.

8.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const uploaded = await ai.files.upload({ file: "sample.mp4" }); const response = await ai.models.generateContent({ model: "gemini-3-flash-preview", contents: createUserContent([ createPartFromUri(uploaded.uri, uploaded.mimeType), "Summarize this video. Provide timestamps for key events.", ]), }); console.log(response.text);

8.3 Timestamp prompting strategy

Ask for segmented bullets with "(mm:ss)" timestamps. Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.

9.1 Positioning

Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.). Distinguish from the Live API: Live API is more interactive and non-structured audio/multimodal conversation; TTS is focused on controlled narration.

9.2 Single-speaker TTS (Node.js) minimal template

import { GoogleGenAI } from "@google/genai"; import * as fs from "node:fs"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const response = await ai.models.generateContent({ model: "gemini-2.5-flash-preview-tts", contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }], config: { responseModalities: ["AUDIO"], speechConfig: { voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" }, }, }, }, }); const data = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? ""; if (!data) throw new Error("No audio returned"); fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));

9.3 Multi-speaker TTS (max 2 speakers)

Requirements: Use multiSpeakerVoiceConfig Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).

9.4 Voice options and language

voice_name supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore, etc.). The model can auto-detect input language and supports 24 languages (see docs for the list).

9.5 "Director notes" (strongly recommended for high-quality voice)

Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.

10.1 Typical tasks

Describe audio content (including non-speech like birds, alarms, etc.) Generate transcripts Transcribe specific time ranges Count tokens (for cost estimates/segmentation)

10.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai"; const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY }); const uploaded = await ai.files.upload({ file: "sample.mp3" }); const response = await ai.models.generateContent({ model: "gemini-3-flash-preview", contents: createUserContent([ "Describe this audio clip.", createPartFromUri(uploaded.uri, uploaded.mimeType), ]), }); console.log(response.text);

10.3 Key limits and engineering tips

Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC. Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change). Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters). If total request size exceeds 20MB, you must use the Files API.

Example A: Image generation -> validation via understanding

Generate product images with Nano Banana (require negative space, consistent lighting). Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements. If not satisfied, feed the generated image into text+image editing and iterate.

Example B: Video generation -> video understanding -> narration script

Generate an 8-second shot with Veo (include dialogue or SFX). Download and save (respect retention window). Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

Example C: Audio understanding -> time-range transcription -> TTS redub

Upload meeting audio and transcribe full content. Transcribe or summarize specific time ranges. Use TTS to generate a "broadcast" version of the summary.

12. Compliance and risk (must follow)

Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content. Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints. Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

13. Quick reference (Checklist)

Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro). Pick the right input mode: inline for small files; Files API for large/reuse. Parse binary outputs correctly: image/audio via inline_data decode; video via operation polling + download. For video generation: set aspectRatio / resolution, and download promptly (avoid expiration). For TTS: set response_modalities=["AUDIO"]; max 2 speakers; speaker names must match prompt. For audio understanding: countTokens when needed; segment long audio or use Files API.

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
1 Docs
  • SKILL.md Primary doc