Microsoft Foundry Local is a free tool that lets developers run generative‑AI models entirely on‑device: no Azure subscription, no internet connection once a model is downloaded, and no data leaving your laptop or desktop. It ships an OpenAI‑compatible REST endpoint backed by ONNX Runtime execution providers (CUDA, DirectML, NPU, or pure CPU) so you can reuse existing SDK code with almost zero changes. The project debuted at Build 2025 as part of Microsoft's push toward edge‑ready, privacy‑first AI development.
The foundry-local-sdk mirrors the Azure AI Foundry SDK, so you can scale to the cloud later with little more than a config swap.
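To make that concrete, here is a minimal sketch of what such a swap can look like. It assumes an OpenAI‑compatible cloud endpoint from your Azure AI Foundry project; the USE_CLOUD / CLOUD_* environment variable names are hypothetical, and the local-side attributes (endpoint, api_key, get_model_info) are the same ones used in the SDK examples later in this post.

```python
import os
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"

if os.getenv("USE_CLOUD") == "1":
    # Cloud path: endpoint, key, and deployment name come from your Azure AI Foundry
    # project; the env-var names here are purely illustrative.
    client = openai.OpenAI(base_url=os.environ["CLOUD_OPENAI_BASE_URL"],
                           api_key=os.environ["CLOUD_OPENAI_API_KEY"])
    model = os.environ["CLOUD_MODEL_DEPLOYMENT"]
else:
    # Local path: the manager starts the Foundry Local service and loads the model.
    manager = FoundryLocalManager(alias)
    client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    model = manager.get_model_info(alias).id

reply = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(reply.choices[0].message.content)
```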
# 1. Install
winget install Microsoft.FoundryLocal # Windows
# or
brew tap microsoft/foundrylocal && \
brew install foundrylocal           # macOS
# 2. Chat with an SLM
foundry model run phi-3.5-mini      # auto-downloads the variant best suited to your hardware
The CLI caches models under ~/.foundry and exposes a local OpenAI‑compatible endpoint (the service picks its own port at startup; foundry service status or the SDK's FoundryLocalManager reports the exact URL) that you can plug into OpenAI SDKs, LangChain, or Semantic Kernel.
Foundry ships pre‑quantised INT4/INT8 checkpoints that use ONNX Runtime's weight‑only quantisation to slash VRAM use while keeping accuracy within roughly one percentage point of FP16.
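To see roughly what that buys, the back‑of‑the‑envelope estimate below compares FP16 and INT4 weight footprints. It is only a sketch: parameter counts are approximate, the group size is a typical value, and activations and KV cache are ignored.

```python
def approx_weight_gb(n_params: float, bits_per_weight: int, group_size: int = 128) -> float:
    """Approximate weight storage in GB for weight-only quantisation.

    Low-bit weights keep one FP16 scale (2 bytes) per `group_size` weights;
    activations and KV cache are not counted.
    """
    packed = n_params * bits_per_weight / 8
    scales = (n_params / group_size) * 2 if bits_per_weight < 16 else 0.0
    return (packed + scales) / 1e9

for name, params in [("phi-3.5-mini (~3.8B params)", 3.8e9),
                     ("mistral-7b-v0.2 (~7.2B params)", 7.2e9)]:
    fp16, int4 = approx_weight_gb(params, 16), approx_weight_gb(params, 4)
    print(f"{name}: FP16 ≈ {fp16:.1f} GB, INT4 ≈ {int4:.1f} GB ({fp16 / int4:.1f}x smaller)")
```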
Because the local gateway implements the same /v1/chat/completions and /v1/embeddings routes as OpenAI, you can point existing Python or JS clients at base_url=manager.endpoint and they just work.
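The same goes for frameworks built on top of those clients. As a sketch (it assumes the langchain-openai package is installed; the alias and prompt are placeholders), a LangChain chat model can be pointed at the local gateway like this:

```python
from foundry_local import FoundryLocalManager
from langchain_openai import ChatOpenAI

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)          # ensures the service is running and the model is loaded

llm = ChatOpenAI(
    base_url=manager.endpoint,                # local OpenAI-compatible gateway
    api_key=manager.api_key,                  # placeholder; no real key is needed locally
    model=manager.get_model_info(alias).id,   # resolve the alias to the loaded variant's ID
)
print(llm.invoke("Name one advantage of on-device inference.").content)
```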
Microsoft performs vulnerability scans and red‑team evaluations on every model before adding it to the Azure AI Foundry catalogue that Foundry Local mirrors. (microsoft.com)
Model: phi-4 (Phi-4-cuda-gpu)
Model: phi-4 (Phi-4-generic-gpu)
Model: phi-4 (Phi-4-generic-cpu)
Model: phi-3-mini-128k (Phi-3-mini-128k-instruct-cuda-gpu)
Model: phi-3-mini-128k (Phi-3-mini-128k-instruct-generic-gpu)
Model: phi-3-mini-128k (Phi-3-mini-128k-instruct-generic-cpu)
Model: phi-3-mini-4k (Phi-3-mini-4k-instruct-cuda-gpu)
Model: phi-3-mini-4k (Phi-3-mini-4k-instruct-generic-gpu)
Model: phi-3-mini-4k (Phi-3-mini-4k-instruct-generic-cpu)
Model: mistral-7b-v0.2 (mistralai-Mistral-7B-Instruct-v0-2-cuda-gpu)
Model: mistral-7b-v0.2 (mistralai-Mistral-7B-Instruct-v0-2-generic-gpu)
Model: mistral-7b-v0.2 (mistralai-Mistral-7B-Instruct-v0-2-generic-cpu)
Model: phi-3.5-mini (Phi-3.5-mini-instruct-cuda-gpu)
Model: phi-3.5-mini (Phi-3.5-mini-instruct-generic-gpu)
Model: phi-3.5-mini (Phi-3.5-mini-instruct-generic-cpu)
Model: phi-4-mini-reasoning (Phi-4-mini-reasoning-cuda-gpu)
Model: phi-4-mini-reasoning (Phi-4-mini-reasoning-generic-gpu)
Model: phi-4-mini-reasoning (Phi-4-mini-reasoning-generic-cpu)
Model: deepseek-r1-14b (deepseek-r1-distill-qwen-14b-cuda-gpu)
Model: deepseek-r1-14b (deepseek-r1-distill-qwen-14b-generic-gpu)
Model: deepseek-r1-14b (deepseek-r1-distill-qwen-14b-generic-cpu)
Model: deepseek-r1-7b (deepseek-r1-distill-qwen-7b-cuda-gpu)
Model: deepseek-r1-7b (deepseek-r1-distill-qwen-7b-generic-gpu)
Model: deepseek-r1-7b (deepseek-r1-distill-qwen-7b-generic-cpu)
Model: phi-4-mini (Phi-4-mini-instruct-cuda-gpu)
Model: phi-4-mini (Phi-4-mini-instruct-generic-gpu)
Model: phi-4-mini (Phi-4-mini-instruct-generic-cpu)
Model: qwen2.5-0.5b (qwen2.5-0.5b-instruct-cuda-gpu)
Model: qwen2.5-0.5b (qwen2.5-0.5b-instruct-generic-gpu)
Model: qwen2.5-0.5b (qwen2.5-0.5b-instruct-generic-cpu)
Model: qwen2.5-coder-0.5b (qwen2.5-coder-0.5b-instruct-cuda-gpu)
Model: qwen2.5-coder-0.5b (qwen2.5-coder-0.5b-instruct-generic-gpu)
Model: qwen2.5-coder-0.5b (qwen2.5-coder-0.5b-instruct-generic-cpu)
Model: qwen2.5-1.5b (qwen2.5-1.5b-instruct-cuda-gpu)
Model: qwen2.5-1.5b (qwen2.5-1.5b-instruct-generic-gpu)
Model: qwen2.5-1.5b (qwen2.5-1.5b-instruct-generic-cpu)
Model: qwen2.5-7b (qwen2.5-7b-instruct-cuda-gpu)
Model: qwen2.5-7b (qwen2.5-7b-instruct-generic-gpu)
Model: qwen2.5-7b (qwen2.5-7b-instruct-generic-cpu)
Model: qwen2.5-coder-1.5b (qwen2.5-coder-1.5b-instruct-cuda-gpu)
Model: qwen2.5-coder-1.5b (qwen2.5-coder-1.5b-instruct-generic-gpu)
Model: qwen2.5-coder-1.5b (qwen2.5-coder-1.5b-instruct-generic-cpu)
Model: qwen2.5-coder-7b (qwen2.5-coder-7b-instruct-cuda-gpu)
Model: qwen2.5-coder-7b (qwen2.5-coder-7b-instruct-generic-gpu)
Model: qwen2.5-coder-7b (qwen2.5-coder-7b-instruct-generic-cpu)
Model: qwen2.5-14b (qwen2.5-14b-instruct-cuda-gpu)
Model: qwen2.5-14b (qwen2.5-14b-instruct-generic-gpu)
Model: qwen2.5-14b (qwen2.5-14b-instruct-generic-cpu)
Model: qwen2.5-coder-14b (qwen2.5-coder-14b-instruct-cuda-gpu)
Model: qwen2.5-coder-14b (qwen2.5-coder-14b-instruct-generic-gpu)
Model: qwen2.5-coder-14b (qwen2.5-coder-14b-instruct-generic-cpu)
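If you would rather enumerate that catalogue in code than scan CLI output, the Python SDK exposes a catalogue listing. The sketch below assumes the list_catalog_models() method and the alias/id fields described in the SDK reference; the grouping logic is just illustrative.

```python
from collections import defaultdict
from foundry_local import FoundryLocalManager

manager = FoundryLocalManager()               # attaches to (or starts) the local service
by_alias = defaultdict(list)
for model in manager.list_catalog_models():   # every model Foundry Local can download
    by_alias[model.alias].append(model.id)

for alias, variant_ids in sorted(by_alias.items()):
    print(f"{alias}: {', '.join(variant_ids)}")
```

To actually chat with one of these models, point the standard OpenAI Python client at the local endpoint: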
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
mgr = FoundryLocalManager(alias)              # starts the service and loads the model if needed
client = openai.OpenAI(
    base_url=mgr.endpoint,                    # looks just like Azure or OpenAI
    api_key=mgr.api_key,                      # placeholder; nothing is sent off-device
)
resp = client.chat.completions.create(
    model=mgr.get_model_info(alias).id,       # resolve the alias to the loaded variant
    messages=[{"role": "user", "content": "Explain transformers in 3 bullets"}],
)
print(resp.choices[0].message.content)
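For interactive UIs you will usually want tokens as they are generated. Continuing from the snippet above (client, mgr, and alias are reused), the standard OpenAI streaming pattern works unchanged against the local endpoint:

```python
stream = client.chat.completions.create(
    model=mgr.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Explain transformers in 3 bullets"}],
    stream=True,                               # yield chunks instead of one final message
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

The JavaScript SDK follows the same pattern: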
import { FoundryLocalManager } from "foundry-local-sdk";
import OpenAI from "openai";

const alias = "phi-3.5-mini";
const mgr = new FoundryLocalManager();
const modelInfo = await mgr.init(alias);      // starts the service and loads the model

const openai = new OpenAI({
  baseURL: mgr.endpoint,                      // local OpenAI-compatible gateway
  apiKey: mgr.apiKey,                         // placeholder; no real key needed locally
});
const chat = await openai.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: "user", content: "Explain quantization in 2 lines." }],
});
console.log(chat.choices[0].message.content);
Is Foundry Local free? Yes: install it via winget or Homebrew with no license cost and no Azure account.
Which operating systems are supported? Windows 11, Windows Server 2025, and macOS Sonoma (Intel & Apple Silicon).
Do I need a GPU? No. Foundry Local falls back to CPU builds automatically, but it accelerates on CUDA GPUs, DirectML GPUs, or NPUs when available, typically for 5-10x speed-ups. (devblogs.microsoft.com, blogs.windows.com)
How do I fetch or update a model? foundry model download <name> pulls the latest quantised weights from the Azure AI Foundry catalogue.