Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Extract and analyze content from video ads using Gemini Vision AI. Supports frame extraction, OCR text detection, audio transcription, and AI-powered scene analysis. Use when analyzing video creative content, extracting text overlays, or generating scene-by-scene descriptions.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
AI-powered video content extraction using Google Gemini Vision.
- Frame Extraction: Smart sampling with scene change detection
- OCR Text Detection: Extract text overlays using EasyOCR
- Audio Transcription: Convert speech to text with Google Cloud Speech
- AI Scene Analysis: Describe each scene using Gemini Vision
- Native Video Analysis: Direct video understanding for longer content
- Thumbnail Generation: Auto-generate thumbnails from first frame
    # Required for Gemini Vision
    GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

    # Required for audio transcription
    # (same service account needs Speech-to-Text API enabled)
    pip install opencv-python pillow easyocr ffmpeg-python google-cloud-speech vertexai google-api-python-client

Also requires ffmpeg and ffprobe to be installed on the system.
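Before running the extractor, it can help to confirm that the system tools and the credentials variable are actually in place. The following is a minimal sketch, not part of the skill itself: `missing_prerequisites` is a hypothetical helper name, and it only checks that `ffmpeg` and `ffprobe` are on the PATH and that `GOOGLE_APPLICATION_CREDENTIALS` points at an existing file. It does not verify that the Speech-to-Text API is enabled for the service account.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; `env` and `which` are injectable
    so the check can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    problems = []

    # ffmpeg and ffprobe must be discoverable on the PATH.
    for tool in ("ffmpeg", "ffprobe"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # The service-account file must be configured and must exist.
    creds = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(creds):
        problems.append(f"credentials file does not exist: {creds}")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

Running the script prints one line per missing item and nothing when everything is found.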
    from scripts.video_extractor import VideoExtractor
    from scripts.models import ExtractedVideoContent
    import vertexai
    from vertexai.generative_models import GenerativeModel

    # Initialize Vertex AI
    vertexai.init(project="your-project-id", location="us-central1")
    gemini_model = GenerativeModel("gemini-1.5-flash")

    # Create extractor
    extractor = VideoExtractor(gemini_model=gemini_model)

    # Analyze video
    result = extractor.extract_content("/path/to/video.mp4")

    print(f"Duration: {result.duration}s")
    print(f"Scenes: {len(result.scene_timeline)}")
    print(f"Text overlays: {len(result.text_timeline)}")
    print(f"Transcript: {result.transcript[:200]}...")
    frames, timestamps, text_timeline, scene_timeline, thumbnail = extractor.extract_smart_frames(
        "/path/to/video.mp4",
        scene_interval=2,    # Check for scene changes every 2s
        text_interval=0.5    # Check for text every 0.5s
    )
    # Works with images too
    result = extractor.extract_content("/path/to/image.jpg")
    print(result.scene_timeline[0]['description'])
    ExtractedVideoContent(
        video_path="/path/to/video.mp4",
        duration=30.5,
        transcript="Here's what we found...",
        text_timeline=[
            {"at": 0.0, "text": ["Download Now"]},
            {"at": 5.5, "text": ["50% Off Today"]}
        ],
        scene_timeline=[
            {"timestamp": 0.0, "description": "Woman using phone app..."},
            {"timestamp": 2.0, "description": "Product showcase with features..."}
        ],
        thumbnail_url="/static/thumbnails/video_thumb.jpg",
        extraction_complete=True
    )
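Note that the two timelines use different keys for time: `text_timeline` entries use `at`, while `scene_timeline` entries use `timestamp`. As a rough sketch of how the output might be consumed, the hypothetical helper below (`merged_timeline` is not part of the skill) folds both lists into one chronological sequence, using plain dictionaries shaped like the example output above.

```python
def merged_timeline(text_timeline, scene_timeline):
    """Combine both timelines into one list of (seconds, kind, payload) tuples.

    Hypothetical helper for illustration. Entries are sorted by time; when a
    scene and a text overlay share a timestamp, the scene comes first because
    "scene" sorts before "text".
    """
    events = [
        (entry["at"], "text", " / ".join(entry["text"]))
        for entry in text_timeline
    ]
    events += [
        (entry["timestamp"], "scene", entry["description"])
        for entry in scene_timeline
    ]
    return sorted(events, key=lambda event: (event[0], event[1]))


if __name__ == "__main__":
    text_timeline = [
        {"at": 0.0, "text": ["Download Now"]},
        {"at": 5.5, "text": ["50% Off Today"]},
    ]
    scene_timeline = [
        {"timestamp": 0.0, "description": "Woman using phone app..."},
        {"timestamp": 2.0, "description": "Product showcase with features..."},
    ]
    for seconds, kind, payload in merged_timeline(text_timeline, scene_timeline):
        print(f"{seconds:5.1f}s  {kind:<5}  {payload}")
```

With a real result, the same call would presumably be `merged_timeline(result.text_timeline, result.scene_timeline)`, assuming the attributes hold lists of dictionaries as shown in the example output.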
| Feature | Description |
| --- | --- |
| Scene Detection | Histogram-based change detection (threshold=65) |
| OCR Confidence | Tiered thresholds (0.5 high, 0.3 low) |
| AI Proofreading | Gemini cleans up OCR errors |
| Source Reconciliation | Merges OCR + Vision text intelligently |
| Native Video | Direct Gemini analysis for <20MB files |
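The table names the techniques but not their details, so the following is only an illustrative sketch under stated assumptions, not the skill's actual implementation. It assumes the scene threshold of 65 is a percentage of pixels whose histogram bin changed between two grayscale frames, and that the OCR tiers mean "accept at 0.5 and above, flag for proofreading between 0.3 and 0.5, discard below 0.3". All function names are hypothetical, and frames are represented as flat lists of 0-255 grayscale values to keep the example dependency-free.

```python
def gray_histogram(pixels, bins=16):
    """Count 0-255 grayscale values into equal-width bins."""
    counts = [0] * bins
    width = 256 // bins
    for value in pixels:
        counts[min(value // width, bins - 1)] += 1
    return counts


def histogram_change_percent(prev_pixels, cur_pixels, bins=16):
    """Percentage (0-100) of pixels that changed histogram bin.

    Assumes both frames contain the same number of pixels.
    """
    prev_hist = gray_histogram(prev_pixels, bins)
    cur_hist = gray_histogram(cur_pixels, bins)
    # Each pixel that moved is counted once leaving and once arriving.
    moved = sum(abs(a - b) for a, b in zip(prev_hist, cur_hist)) / 2
    return 100.0 * moved / max(len(prev_pixels), 1)


def is_scene_change(prev_pixels, cur_pixels, threshold=65):
    """Assumed reading of threshold=65: percent of pixels that changed bin."""
    return histogram_change_percent(prev_pixels, cur_pixels) >= threshold


def tier_ocr_results(detections, high=0.5, low=0.3):
    """Split (text, confidence) pairs into (accepted, needs_review).

    Assumed tier semantics: >= high is accepted as-is, >= low is kept for
    proofreading, anything lower is discarded.
    """
    accepted, needs_review = [], []
    for text, confidence in detections:
        if confidence >= high:
            accepted.append(text)
        elif confidence >= low:
            needs_review.append(text)
    return accepted, needs_review
```

In this reading, the low-tier results are the ones a proofreading pass would be most useful for, since they are plausible but not trustworthy on their own.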
Customize AI behavior by editing prompts in the prompts/ folder:
- scene_analysis.md - Frame analysis prompts
- scene_reconciliation.md - Scene enrichment prompts

- "What text appears in this video ad?"
- "Describe each scene in this creative"
- "What does the narrator say?"
- "Extract the call-to-action from this ad"