Requirements

- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, tool calling, LoRA fine-tuning, and embeddings.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install brief:

"I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."

Upgrade brief:

"I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, fine-tuning via LoRA/DoRA, and embeddings generation.
Use this skill when the task involves:

- Running LLM/VLM inference on macOS/iOS with Apple Silicon
- Streaming text generation from local models
- Vision tasks with images/video (VLMs)
- Tool calling / function calling with models
- LoRA adapter training and fine-tuning
- Text embeddings for RAG/semantic search
The package is organized into four library modules:

- MLXLMCommon - Core infrastructure (ModelContainer, ChatSession, KVCache, etc.)
- MLXLLM - Text-only LLM support (Llama, Qwen, Gemma, Phi, DeepSeek, etc. - examples, not exhaustive)
- MLXVLM - Vision-Language Models (Qwen2-VL, PaliGemma, Gemma3, etc. - examples, not exhaustive)
- Embedders - Embedding models (BGE, Nomic, MiniLM)
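If the package is not yet part of your build, a minimal Package.swift sketch follows. The repository URL, branch, and product names here are assumptions, so point them at wherever your copy of mlx-swift-lm is actually hosted.

```swift
// swift-tools-version: 5.9
// Minimal sketch: the package URL, branch, and product names below are
// assumptions -- substitute the actual source of your mlx-swift-lm copy.
import PackageDescription

let package = Package(
    name: "MyLLMTool",
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyLLMTool",
            dependencies: [
                .product(name: "MLXLLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples")
            ]
        )
    ]
)
```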
| Purpose | File Path |
| --- | --- |
| Thread-safe model wrapper | Libraries/MLXLMCommon/ModelContainer.swift |
| Simplified chat API | Libraries/MLXLMCommon/ChatSession.swift |
| Generation & streaming | Libraries/MLXLMCommon/Evaluate.swift |
| KV cache types | Libraries/MLXLMCommon/KVCache.swift |
| Model configuration | Libraries/MLXLMCommon/ModelConfiguration.swift |
| Chat message types | Libraries/MLXLMCommon/Chat.swift |
| Tool call processing | Libraries/MLXLMCommon/Tool/ToolCallFormat.swift |
| Concurrency utilities | Libraries/MLXLMCommon/Utilities/SerialAccessContainer.swift |
| LLM factory & registry | Libraries/MLXLLM/LLMModelFactory.swift |
| VLM factory & registry | Libraries/MLXVLM/VLMModelFactory.swift |
| LoRA configuration | Libraries/MLXLMCommon/Adapters/LoRA/LoRAContainer.swift |
| LoRA training | Libraries/MLXLLM/LoraTrain.swift |
Text generation quick start:

```swift
import MLXLLM
import MLXLMCommon

// Load model (downloads from HuggingFace automatically)
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)

// Create chat session
let session = ChatSession(modelContainer)

// Single response
let response = try await session.respond(to: "What is Swift?")
print(response)

// Streaming response
for try await chunk in session.streamResponse(to: "Explain concurrency") {
    print(chunk, terminator: "")
}
```
Vision-language models:

```swift
import MLXVLM
import MLXLMCommon

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen2-VL-2B-Instruct-4bit")
)

let session = ChatSession(modelContainer)

// With image (video is also an optional parameter)
let image = UserInput.Image.url(imageURL)
let response = try await session.respond(
    to: "Describe this image",
    image: image,
    video: nil  // Optional video parameter
)
```
Embeddings:

```swift
import Embedders

// Note: Embedders uses the loadModelContainer() helper (not a factory pattern)
let container = try await loadModelContainer(
    configuration: ModelConfiguration(id: "mlx-community/bge-small-en-v1.5-mlx")
)

let embeddings = await container.perform { model, tokenizer, pooler in
    let tokens = tokenizer.encode(text: "Hello world")
    let input = MLXArray(tokens).expandedDimensions(axis: 0)
    let output = model(input)
    let pooled = pooler(output, normalize: true)
    eval(pooled)
    return pooled
}
```
ChatSession manages conversation history and the KV cache automatically:

```swift
let session = ChatSession(
    modelContainer,
    instructions: "You are a helpful assistant",  // System prompt
    generateParameters: GenerateParameters(
        maxTokens: 500,
        temperature: 0.7
    )
)

// Multi-turn conversation (history preserved automatically)
let r1 = try await session.respond(to: "What is 2+2?")
let r2 = try await session.respond(to: "And if you multiply that by 3?")

// Clear session to start fresh
await session.clear()
```
For lower-level control, use generate() directly:

```swift
let input = try await modelContainer.prepare(input: UserInput(prompt: .text("Hello")))
let stream = try await modelContainer.generate(input: input, parameters: GenerateParameters())

for await generation in stream {
    switch generation {
    case .chunk(let text):
        print(text, terminator: "")
    case .info(let info):
        print("\n\(info.tokensPerSecond) tok/s")
    case .toolCall(let call):
        // Handle tool call
        break
    }
}
```
Tool calling:

```swift
// 1. Define tool
struct WeatherInput: Codable { let location: String }
struct WeatherOutput: Codable { let temperature: Double; let conditions: String }

let weatherTool = Tool<WeatherInput, WeatherOutput>(
    name: "get_weather",
    description: "Get current weather",
    parameters: [.required("location", type: .string, description: "City name")]
) { input in
    WeatherOutput(temperature: 22.0, conditions: "Sunny")
}

// 2. Include tool schema in request
let input = UserInput(
    prompt: .text("What's the weather in Tokyo?"),
    tools: [weatherTool.schema]
)

// 3. Handle tool calls in generation stream
for await generation in try await modelContainer.generate(input: input, parameters: params) {
    switch generation {
    case .chunk(let text):
        print(text)
    case .toolCall(let call):
        let result = try await call.execute(with: weatherTool)
        print("Weather: \(result.conditions)")
    case .info:
        break
    }
}
```

See references/tool-calling.md for multi-turn tool use and for feeding results back to the model.
Generation parameters:

```swift
let params = GenerateParameters(
    maxTokens: 1000,           // nil = unlimited
    maxKVSize: 4096,           // Sliding window (uses RotatingKVCache)
    kvBits: 4,                 // Quantized cache (4 or 8 bit)
    temperature: 0.7,          // 0 = greedy/argmax
    topP: 0.9,                 // Nucleus sampling
    repetitionPenalty: 1.1,    // Penalize repeats
    repetitionContextSize: 20  // Window for penalty
)
```
Restore chat from persisted history:

```swift
let history: [Chat.Message] = [
    .system("You are helpful"),
    .user("Hello"),
    .assistant("Hi there!")
]

let session = ChatSession(
    modelContainer,
    history: history
)
// Continues from this point
```
Image inputs:

```swift
// From URL (file or remote)
let image = UserInput.Image.url(fileURL)

// From CIImage
let image = UserInput.Image.ciImage(ciImage)

// From MLXArray directly
let image = UserInput.Image.array(mlxArray)
```
Video inputs:

```swift
// From URL (file or remote)
let video = UserInput.Video.url(videoURL)

// From AVFoundation asset
let video = UserInput.Video.avAsset(avAsset)

// From pre-extracted frames
let video = UserInput.Video.frames(videoFrames)

let response = try await session.respond(
    to: "What happens in this video?",
    video: video
)
```
Multiple images:

```swift
let images: [UserInput.Image] = [
    .url(url1),
    .url(url2)
]

let response = try await session.respond(
    to: "Compare these two images",
    images: images,
    videos: []
)
```
Image preprocessing:

```swift
let session = ChatSession(
    modelContainer,
    processing: UserInput.Processing(
        resize: CGSize(width: 512, height: 512)  // Resize images
    )
)
```
Recommended patterns:

```swift
// DO: Use ChatSession for multi-turn conversations
let session = ChatSession(modelContainer)

// DO: Use AsyncStream APIs (modern, Swift concurrency)
for try await chunk in session.streamResponse(to: prompt) { ... }

// DO: Check Task.isCancelled in long-running loops
for try await generation in stream {
    if Task.isCancelled { break }
    // process generation
}

// DO: Use ModelContainer.perform() for thread-safe access
await modelContainer.perform { context in
    // Access model, tokenizer safely
    let tokens = try context.tokenizer.applyChatTemplate(messages: messages)
    return tokens
}

// DO: When breaking early from generation, use generateTask() to get a task handle
// This is the lower-level API used internally by ChatSession
let (stream, task) = generateTask(...)  // Returns (AsyncStream, Task)
for await item in stream {
    if shouldStop { break }
}
await task.value  // Ensures KV cache cleanup before next generation
```

generateTask() is defined in Evaluate.swift. Most users should use ChatSession, which handles this internally.
Patterns to avoid:

```swift
// DON'T: Share MLXArray across tasks (not Sendable)
let array = MLXArray(...)
Task { array.sum() }  // Wrong!

// DON'T: Use deprecated callback-based generation
// Old: generate(input: input, parameters: params) { tokens in ... }  // Deprecated
// New: for await generation in try generate(input: input, parameters: params, context: context) { ... }

// DON'T: Use old perform(model, tokenizer) signature
// Old: modelContainer.perform { model, tokenizer in ... }  // Deprecated
// New: modelContainer.perform { context in ... }

// DON'T: Forget to eval() MLXArrays before returning from perform()
await modelContainer.perform { context in
    let result = context.model(input)
    eval(result)  // Required before returning
    return result.item(Float.self)
}
```
Thread-safety rules (a serialization sketch follows this list):

- ModelContainer is Sendable and thread-safe
- ChatSession is NOT thread-safe (use it from a single task)
- MLXArray is NOT Sendable - don't pass it across isolation boundaries
- Use SendableBox for transferring non-Sendable data in consuming contexts
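When several tasks need to share one conversation, a simple way to honor the ChatSession rule is to route every call through an actor. This is an application-side sketch, not a library API; only ModelContainer, ChatSession, and respond(to:) come from the package.

```swift
import MLXLMCommon

// Application-side pattern (not part of mlx-swift-lm): actor isolation
// lets at most one ask() run at a time, so the non-thread-safe
// ChatSession is never used concurrently.
actor SharedChat {
    private let session: ChatSession

    init(container: ModelContainer) {
        // Create the session inside the actor so it never crosses
        // an isolation boundary (ModelContainer itself is Sendable).
        self.session = ChatSession(container)
    }

    func ask(_ prompt: String) async throws -> String {
        try await session.respond(to: prompt)
    }
}
```

Callers on any task can then `try await chat.ask("...")` without coordinating with each other.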
Cache management:

```swift
// For long contexts, use sliding window cache
let params = GenerateParameters(maxKVSize: 4096)

// For memory efficiency, use quantized cache
let params = GenerateParameters(kvBits: 4)  // or 8

// Clear session cache when done
await session.clear()
```
For detailed documentation on specific topics, see:

| Reference | When to Use |
| --- | --- |
| references/model-container.md | Loading models, ModelContainer API, ModelConfiguration |
| references/kv-cache.md | Cache types, memory optimization, cache serialization |
| references/concurrency.md | Thread safety, SerialAccessContainer, async patterns |
| references/tool-calling.md | Function calling, tool formats, ToolCallProcessor |
| references/tokenizer-chat.md | Tokenizer, Chat.Message, EOS tokens |
| references/supported-models.md | Model families, registries, model-specific config |
| references/lora-adapters.md | LoRA/DoRA/QLoRA, loading adapters |
| references/training.md | LoRATrain API, fine-tuning |
| references/embeddings.md | EmbeddingModel, pooling, use cases |
Most common migrations (see individual reference files for topic-specific deprecations):

| If you see... | Use instead... |
| --- | --- |
| generate(... didGenerate:) callback | generate(...) -> AsyncStream |
| perform { model, tokenizer in } | perform { context in } |
| TokenIterator(prompt: MLXArray) | TokenIterator(input: LMInput) |
| ModelRegistry typealias | LLMRegistry or VLMRegistry |
| createAttentionMask(h:cache:[KVCache]?) | createAttentionMask(h:cache:KVCache?) |

Each reference file contains a "Deprecated Patterns" section with topic-specific migrations.
The framework handles these automatically:

| Feature | Details |
| --- | --- |
| EOS token loading | Loaded from config.json |
| EOS token override | Priority: generation_config.json > config.json > defaults |
| EOS token merging | All sources merged at generation time |
| EOS token detection | Stops generation automatically when EOS encountered |
| Chat template application | Applied automatically via applyChatTemplate() |
| Tool call format detection | Inferred from model_type in config.json |
| Cache type selection | Based on GenerateParameters (maxKVSize, kvBits) |
| Tokenizer loading | Loaded from tokenizer.json automatically |
| Model weights loading | Downloaded and loaded from HuggingFace |
| Feature | When to Configure |
| --- | --- |
| extraEOSTokens | Only if model has unlisted stop tokens |
| toolCallFormat | Only to override auto-detection |
| maxKVSize | To enable sliding window cache |
| kvBits | To enable quantized cache (4 or 8 bit) |
| maxTokens | To limit output length |
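As a worked example of that table, here is a sketch that overrides only what the defaults don't cover. It assumes the ModelConfiguration and GenerateParameters initializers shown earlier; the stop token string and the numbers are illustrative placeholders, not recommendations.

```swift
import MLXLLM
import MLXLMCommon

// Illustrative sketch: "<|im_end|>" is a placeholder stop token --
// check your model's config before adding extra EOS tokens.
let config = ModelConfiguration(
    id: "mlx-community/Qwen3-4B-4bit",
    extraEOSTokens: ["<|im_end|>"]  // Only if the model has unlisted stop tokens
)
let container = try await LLMModelFactory.shared.loadContainer(configuration: config)

// Opt in to a sliding-window, quantized cache and cap output length.
let params = GenerateParameters(
    maxTokens: 512,   // Limit output length
    maxKVSize: 4096,  // Sliding window cache
    kvBits: 8         // Quantized cache
)
let session = ChatSession(container, generateParameters: params)
```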