Requirements

- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, tool calling, LoRA fine-tuning, and embeddings.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install brief:

"I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."

Upgrade brief:

"I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, fine-tuning via LoRA/DoRA, and embeddings generation.
Use this skill when the task involves:

- Running LLM/VLM inference on macOS/iOS with Apple Silicon
- Streaming text generation from local models
- Vision tasks with images/video (VLMs)
- Tool calling / function calling with models
- LoRA adapter training and fine-tuning
- Text embeddings for RAG/semantic search
The package is organized into four library modules:

- MLXLMCommon - Core infrastructure (ModelContainer, ChatSession, KVCache, etc.)
- MLXLLM - Text-only LLM support (Llama, Qwen, Gemma, Phi, DeepSeek, etc. - examples, not exhaustive)
- MLXVLM - Vision-Language Models (Qwen2-VL, PaliGemma, Gemma3, etc. - examples, not exhaustive)
- Embedders - Embedding models (BGE, Nomic, MiniLM)
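If the package is not yet part of your build, a minimal Package.swift sketch follows. The repository URL, branch, and product names here are assumptions, so point them at wherever your copy of mlx-swift-lm is actually hosted.

```swift
// swift-tools-version: 5.9
// Minimal sketch: the package URL, branch, and product names below are
// assumptions -- substitute the actual source of your mlx-swift-lm copy.
import PackageDescription

let package = Package(
    name: "MyLLMTool",
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyLLMTool",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples")
            ]
        )
    ]
)
```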
| Purpose | File Path |
| --- | --- |
| Thread-safe model wrapper | Libraries/MLXLMCommon/ModelContainer.swift |
| Simplified chat API | Libraries/MLXLMCommon/ChatSession.swift |
| Generation & streaming | Libraries/MLXLMCommon/Evaluate.swift |
| KV cache types | Libraries/MLXLMCommon/KVCache.swift |
| Model configuration | Libraries/MLXLMCommon/ModelConfiguration.swift |
| Chat message types | Libraries/MLXLMCommon/Chat.swift |
| Tool call processing | Libraries/MLXLMCommon/Tool/ToolCallFormat.swift |
| Concurrency utilities | Libraries/MLXLMCommon/Utilities/SerialAccessContainer.swift |
| LLM factory & registry | Libraries/MLXLLM/LLMModelFactory.swift |
| VLM factory & registry | Libraries/MLXVLM/VLMModelFactory.swift |
| LoRA configuration | Libraries/MLXLMCommon/Adapters/LoRA/LoRAContainer.swift |
| LoRA training | Libraries/MLXLLM/LoraTrain.swift |
Text generation quick start:

```swift
import MLXLLM
import MLXLMCommon

// Load model (downloads from HuggingFace automatically)
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)

// Create chat session
let session = ChatSession(modelContainer)

// Single response
let response = try await session.respond(to: "What is Swift?")
print(response)

// Streaming response
for try await chunk in session.streamResponse(to: "Explain concurrency") {
    print(chunk, terminator: "")
}
```
Vision-language models:

```swift
import MLXVLM
import MLXLMCommon

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen2-VL-2B-Instruct-4bit")
)

let session = ChatSession(modelContainer)

// With image (video is also an optional parameter)
let image = UserInput.Image.url(imageURL)
let response = try await session.respond(
    to: "Describe this image",
    image: image,
    video: nil  // Optional video parameter
)
```
Embeddings:

```swift
import Embedders

// Note: Embedders uses the loadModelContainer() helper (not a factory pattern)
let container = try await loadModelContainer(
    configuration: ModelConfiguration(id: "mlx-community/bge-small-en-v1.5-mlx")
)

let embeddings = await container.perform { model, tokenizer, pooler in
    let tokens = tokenizer.encode(text: "Hello world")
    let input = MLXArray(tokens).expandedDimensions(axis: 0)
    let output = model(input)
    let pooled = pooler(output, normalize: true)
    eval(pooled)
    return pooled
}
```
ChatSession manages conversation history and the KV cache automatically:

```swift
let session = ChatSession(
    modelContainer,
    instructions: "You are a helpful assistant",  // System prompt
    generateParameters: GenerateParameters(
        maxTokens: 500,
        temperature: 0.7
    )
)

// Multi-turn conversation (history preserved automatically)
let r1 = try await session.respond(to: "What is 2+2?")
let r2 = try await session.respond(to: "And if you multiply that by 3?")

// Clear session to start fresh
await session.clear()
```
For lower-level control, use generate() directly:

```swift
let input = try await modelContainer.prepare(input: UserInput(prompt: .text("Hello")))
let stream = try await modelContainer.generate(input: input, parameters: GenerateParameters())

for await generation in stream {
    switch generation {
    case .chunk(let text):
        print(text, terminator: "")
    case .info(let info):
        print("\n\(info.tokensPerSecond) tok/s")
    case .toolCall(let call):
        // Handle tool call
        break
    }
}
```
Tool calling:

```swift
// 1. Define tool
struct WeatherInput: Codable { let location: String }
struct WeatherOutput: Codable { let temperature: Double; let conditions: String }

let weatherTool = Tool<WeatherInput, WeatherOutput>(
    name: "get_weather",
    description: "Get current weather",
    parameters: [.required("location", type: .string, description: "City name")]
) { input in
    WeatherOutput(temperature: 22.0, conditions: "Sunny")
}

// 2. Include tool schema in request
let input = UserInput(
    prompt: .text("What's the weather in Tokyo?"),
    tools: [weatherTool.schema]
)

// 3. Handle tool calls in generation stream
for await generation in try await modelContainer.generate(input: input, parameters: params) {
    switch generation {
    case .chunk(let text):
        print(text)
    case .toolCall(let call):
        let result = try await call.execute(with: weatherTool)
        print("Weather: \(result.conditions)")
    case .info:
        break
    }
}
```

See references/tool-calling.md for multi-turn tool use and for feeding results back to the model.
Generation parameters:

```swift
let params = GenerateParameters(
    maxTokens: 1000,           // nil = unlimited
    maxKVSize: 4096,           // Sliding window (uses RotatingKVCache)
    kvBits: 4,                 // Quantized cache (4 or 8 bit)
    temperature: 0.7,          // 0 = greedy/argmax
    topP: 0.9,                 // Nucleus sampling
    repetitionPenalty: 1.1,    // Penalize repeats
    repetitionContextSize: 20  // Window for penalty
)
```
Restore chat from persisted history:

```swift
let history: [Chat.Message] = [
    .system("You are helpful"),
    .user("Hello"),
    .assistant("Hi there!")
]

let session = ChatSession(
    modelContainer,
    history: history
)
// Continues from this point
```
Image inputs:

```swift
// From URL (file or remote)
let image = UserInput.Image.url(fileURL)

// From CIImage
let image = UserInput.Image.ciImage(ciImage)

// From MLXArray directly
let image = UserInput.Image.array(mlxArray)
```
Video inputs:

```swift
// From URL (file or remote)
let video = UserInput.Video.url(videoURL)

// From AVFoundation asset
let video = UserInput.Video.avAsset(avAsset)

// From pre-extracted frames
let video = UserInput.Video.frames(videoFrames)

let response = try await session.respond(
    to: "What happens in this video?",
    video: video
)
```
Multiple images:

```swift
let images: [UserInput.Image] = [
    .url(url1),
    .url(url2)
]

let response = try await session.respond(
    to: "Compare these two images",
    images: images,
    videos: []
)
```
Image preprocessing:

```swift
let session = ChatSession(
    modelContainer,
    processing: UserInput.Processing(
        resize: CGSize(width: 512, height: 512)  // Resize images
    )
)
```
Recommended patterns:

```swift
// DO: Use ChatSession for multi-turn conversations
let session = ChatSession(modelContainer)

// DO: Use AsyncStream APIs (modern, Swift concurrency)
for try await chunk in session.streamResponse(to: prompt) { ... }

// DO: Check Task.isCancelled in long-running loops
for try await generation in stream {
    if Task.isCancelled { break }
    // process generation
}

// DO: Use ModelContainer.perform() for thread-safe access
await modelContainer.perform { context in
    // Access model, tokenizer safely
    let tokens = try context.tokenizer.applyChatTemplate(messages: messages)
    return tokens
}

// DO: When breaking early from generation, use generateTask() to get a task handle
// This is the lower-level API used internally by ChatSession
let (stream, task) = generateTask(...)  // Returns (AsyncStream, Task)
for await item in stream {
    if shouldStop { break }
}
await task.value  // Ensures KV cache cleanup before next generation
```

generateTask() is defined in Evaluate.swift. Most users should use ChatSession, which handles this internally.
Patterns to avoid:

```swift
// DON'T: Share MLXArray across tasks (not Sendable)
let array = MLXArray(...)
Task { array.sum() }  // Wrong!

// DON'T: Use deprecated callback-based generation
// Old: generate(input: input, parameters: params) { tokens in ... }  // Deprecated
// New: for await generation in try generate(input: input, parameters: params, context: context) { ... }

// DON'T: Use old perform(model, tokenizer) signature
// Old: modelContainer.perform { model, tokenizer in ... }  // Deprecated
// New: modelContainer.perform { context in ... }

// DON'T: Forget to eval() MLXArrays before returning from perform()
await modelContainer.perform { context in
    let result = context.model(input)
    eval(result)  // Required before returning
    return result.item(Float.self)
}
```
Thread-safety rules (a serialization sketch follows this list):

- ModelContainer is Sendable and thread-safe
- ChatSession is NOT thread-safe (use it from a single task)
- MLXArray is NOT Sendable - don't pass it across isolation boundaries
- Use SendableBox for transferring non-Sendable data in consuming contexts
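When several tasks need to share one conversation, a simple way to honor the ChatSession rule is to route every call through an actor. This is an application-side sketch, not a library API; only ModelContainer, ChatSession, and respond(to:) come from the package.

```swift
import MLXLMCommon

// Application-side pattern (not part of mlx-swift-lm): actor isolation
// lets at most one ask() run at a time, so the non-thread-safe
// ChatSession is never used concurrently.
actor SharedChat {
    private let session: ChatSession

    init(container: ModelContainer) {
        // Create the session inside the actor so it never crosses
        // an isolation boundary (ModelContainer itself is Sendable).
        self.session = ChatSession(container)
    }

    func ask(_ prompt: String) async throws -> String {
        try await session.respond(to: prompt)
    }
}
```

Callers on any task can then `try await chat.ask("...")` without coordinating with each other.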
Cache management:

```swift
// For long contexts, use sliding window cache
let params = GenerateParameters(maxKVSize: 4096)

// For memory efficiency, use quantized cache
let params = GenerateParameters(kvBits: 4)  // or 8

// Clear session cache when done
await session.clear()
```
For detailed documentation on specific topics, see:

| Reference | When to Use |
| --- | --- |
| references/model-container.md | Loading models, ModelContainer API, ModelConfiguration |
| references/kv-cache.md | Cache types, memory optimization, cache serialization |
| references/concurrency.md | Thread safety, SerialAccessContainer, async patterns |
| references/tool-calling.md | Function calling, tool formats, ToolCallProcessor |
| references/tokenizer-chat.md | Tokenizer, Chat.Message, EOS tokens |
| references/supported-models.md | Model families, registries, model-specific config |
| references/lora-adapters.md | LoRA/DoRA/QLoRA, loading adapters |
| references/training.md | LoRATrain API, fine-tuning |
| references/embeddings.md | EmbeddingModel, pooling, use cases |
Most common migrations (see individual reference files for topic-specific deprecations):

| If you see... | Use instead... |
| --- | --- |
| generate(... didGenerate:) callback | generate(...) -> AsyncStream |
| perform { model, tokenizer in } | perform { context in } |
| TokenIterator(prompt: MLXArray) | TokenIterator(input: LMInput) |
| ModelRegistry typealias | LLMRegistry or VLMRegistry |
| createAttentionMask(h:cache:[KVCache]?) | createAttentionMask(h:cache:KVCache?) |

Each reference file contains a "Deprecated Patterns" section with topic-specific migrations.
The framework handles these automatically:

| Feature | Details |
| --- | --- |
| EOS token loading | Loaded from config.json |
| EOS token override | Priority: generation_config.json > config.json > defaults |
| EOS token merging | All sources merged at generation time |
| EOS token detection | Stops generation automatically when EOS encountered |
| Chat template application | Applied automatically via applyChatTemplate() |
| Tool call format detection | Inferred from model_type in config.json |
| Cache type selection | Based on GenerateParameters (maxKVSize, kvBits) |
| Tokenizer loading | Loaded from tokenizer.json automatically |
| Model weights loading | Downloaded and loaded from HuggingFace |
| Feature | When to Configure |
| --- | --- |
| extraEOSTokens | Only if model has unlisted stop tokens |
| toolCallFormat | Only to override auto-detection |
| maxKVSize | To enable sliding window cache |
| kvBits | To enable quantized cache (4 or 8 bit) |
| maxTokens | To limit output length |
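As a worked example of that table, here is a sketch that overrides only what the defaults don't cover. It assumes the ModelConfiguration and GenerateParameters initializers shown earlier; the stop token string and the numbers are illustrative placeholders, not recommendations.

```swift
import MLXLLM
import MLXLMCommon

// Illustrative sketch: "<|im_end|>" is a placeholder stop token --
// check your model's config before adding extra EOS tokens.
let config = ModelConfiguration(
    id: "mlx-community/Qwen3-4B-4bit",
    extraEOSTokens: ["<|im_end|>"]  // Only if the model has unlisted stop tokens
)
let container = try await LLMModelFactory.shared.loadContainer(configuration: config)

// Opt in to a sliding-window, quantized cache and cap output length.
let params = GenerateParameters(
    maxTokens: 512,   // Limit output length
    maxKVSize: 4096,  // Sliding window cache
    kvBits: 8         // Quantized cache
)
let session = ChatSession(container, generateParameters: params)
```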