Apple MLX Acceleration

Under Development

This feature is currently under active development. It may not be included in the stable release version or may have limited functionality.

MLX is an array framework for machine learning on Apple Silicon, brought to you by the Apple machine learning research team. For macOS users on Apple Silicon (M1 and later), MLX provides a highly optimized, native experience for running LLMs.

What is MLX?

  • Repository: github.com/ml-explore/mlx

  • Key Figure: Awni Hannun (@awni), a machine learning researcher at Apple and one of the lead developers of MLX, who has been instrumental in optimizing Transformer models for Apple Silicon.

Released quietly by Apple in late 2023, MLX is designed to be familiar to users of NumPy and PyTorch but built from the ground up for the Unified Memory Architecture of Apple Silicon.
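
For a feel of the API, here is a minimal sketch of MLX's NumPy-style interface, assuming the mlx package is installed on an Apple Silicon Mac (for example via pip install mlx):

    import mlx.core as mx

    a = mx.array([1.0, 2.0, 3.0])   # array creation mirrors np.array
    b = mx.ones(3)                  # mirrors np.ones
    c = a * b + 2.0                 # operations are recorded, not yet computed
    mx.eval(c)                      # forces evaluation of the lazy graph
    print(c)                        # array([3, 4, 5], dtype=float32)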

Key Advantages

  • Unified Memory: MLX allows the CPU and GPU to share the same memory pool. This means zero-copy data transfer: the model weights stay in one place, and both processors can access them without duplication.

  • Dynamic Graph Construction: Like PyTorch, MLX builds computation graphs on the fly, making it flexible and easy to debug.

  • Lazy Computation: Operations are only computed when their results are actually needed, saving power and improving efficiency (both this and the shared memory model are illustrated in the sketch after this list).
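
The sketch below (again assuming the mlx package) illustrates both points: nothing is computed until mx.eval is called, and the same arrays can be processed by the GPU or the CPU through the per-operation stream argument, with no data copies in between:

    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))

    c_gpu = mx.matmul(a, b, stream=mx.gpu)  # schedule this op on the GPU
    c_cpu = mx.matmul(a, b, stream=mx.cpu)  # schedule the same op on the CPU, same buffers
    mx.eval(c_gpu, c_cpu)                   # evaluation happens only here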

Role in Backend.AI GO

While llama.cpp is a fantastic cross-platform solution, MLX is often the superior choice specifically for macOS users.

The MLX Server

Backend.AI GO integrates an inference server based on mlx-lm. When you choose an MLX model:

  1. The app launches a dedicated Python-based sidecar process built on MLX's Metal backend for Apple Silicon.

  2. It loads models directly into the Unified Memory.

  3. It typically provides higher token generation speeds (tokens per second) than other runtimes on the same hardware. A sketch of such a sidecar launch follows this list.
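
For illustration only, the following is a hypothetical sketch of such a sidecar launch using the mlx_lm.server module from the mlx-lm package; Backend.AI GO manages this process internally, so the exact command, model name, and port here are placeholders:

    # Hypothetical sidecar launch; requires `pip install mlx-lm`.
    import subprocess

    sidecar = subprocess.Popen([
        "python", "-m", "mlx_lm.server",
        "--model", "mlx-community/Mistral-7B-Instruct-v0.3-4bit",  # placeholder repo
        "--port", "8080",
    ])
    # The server exposes an OpenAI-compatible HTTP API, e.g.
    # http://localhost:8080/v1/chat/completions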

Why Choose MLX?

If you are on a Mac, you might wonder whether to use GGUF (llama.cpp) or MLX.

Choose MLX if:

  • Maximum Speed: You want the absolute highest inference speed your Mac can provide.

  • Battery Efficiency: MLX is often more power-efficient for long-running tasks.

  • Latest Models: The MLX community (led by Awni Hannun and others) is very quick to port new model families such as Llama 3, Mistral, and Qwen.

Choose GGUF (llama.cpp) if:

  • Compatibility: You are using an Intel Mac or need a specific quantization format not yet available in MLX.

  • Extreme Quantization: You need to run a very large model on limited RAM using aggressive quantization (e.g., Q2_K).

Supported Models

Backend.AI GO supports loading MLX-converted models directly from Hugging Face. Look for models with the mlx tag or those inside the mlx-community organization.

  • Common Format: MLX models usually come as a folder containing config.json, tokenizer.json, and *.safetensors weight files (see the loading sketch below).
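
As a sketch, loading such a model by its repository name with the mlx-lm Python API looks roughly like this (the repository below is only an example; any MLX-converted model of a supported architecture should work the same way):

    from mlx_lm import load, generate

    # Downloads the MLX-converted weights from Hugging Face on first use
    model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
    text = generate(model, tokenizer,
                    prompt="Explain unified memory in one sentence.",
                    max_tokens=64)
    print(text)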

Backend.AI GO leverages the cutting-edge work of the Apple MLX team and Awni Hannun to deliver a best-in-class AI experience on macOS.