What Is Microsoft Foundry Local?

Microsoft Foundry Local is a free tool that lets developers run generative‑AI models entirely on‑device—no Azure subscription, no internet connection, and no data leaving your laptop or desktop. It ships an OpenAI‑compatible REST endpoint backed by ONNX Runtime accelerators (CUDA, DirectML, NPU, or pure CPU) so you can reuse existing SDK code with almost zero changes. The project debuted at Build 2025 as part of Microsoft’s push toward edge‑ready AI and privacy‑first development models.
Why Run AI Models Locally?
- Full data privacy – prompts and outputs never leave the device.
- Low latency – under 50 ms on recent RTX 40-series GPUs and Intel® NPUs. (devblogs.microsoft.com, blogs.windows.com)
- Zero cloud cost – ideal for classrooms, prototypes, and air‑gapped teams. (windowsforum.com)
- Same code everywhere – the foundry-local-sdk mirrors the Azure AI Foundry SDK, letting you scale to the cloud with a config swap.
- Hardware-aware builds – Foundry auto-downloads the CUDA, DirectML, NPU, or CPU variant that best matches your machine. (github.com)
Two‑Step Quick‑Start
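The two steps, as a sketch (the winget package ID is the one published in the official docs; swap phi-3.5-mini for any alias in the catalogue below):

```
# Step 1 - install the CLI (Windows; on macOS: brew tap microsoft/foundrylocal && brew install foundrylocal)
winget install Microsoft.FoundryLocal

# Step 2 - download, load, and chat with a model
foundry model run phi-3.5-mini
```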
The CLI caches models in ~/.foundry and exposes http://localhost:11434/v1/ as an OpenAI-compatible endpoint for SDKs, LangChain, or Semantic Kernel; if the service binds a different port on your machine, foundry service status reports the active endpoint.
Foundry Local Feature Deep‑Dive
ONNX Runtime Acceleration
Foundry ships pre-quantised INT4/INT8 checkpoints that use ONNX Runtime's weight-only compression to slash VRAM while keeping accuracy within about one percentage point of FP16.
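As a sketch of the underlying technique (not Foundry's actual build pipeline), ONNX Runtime's MatMul4BitsQuantizer applies this same class of weight-only INT4 compression to an exported model; the file names here are placeholders:

```
# pip install onnx onnxruntime  (the 4-bit quantizer ships in onnxruntime >= 1.16)
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model_fp16.onnx")  # hypothetical input checkpoint
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()  # rewrites MatMul weights into INT4 blocks in place
quantizer.model.save_model_to_file("model_int4.onnx", use_external_data_format=True)
```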
OpenAI‑Compatible Gateway
Because the local gateway implements the same /v1/chat/completions
and /v1/embeddings
routes as OpenAI, you can point existing Python or JS clients to base_url=manager.endpoint
and they just work.
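For example, a raw request against the chat route (a sketch: the port is the one quoted above, the variant ID comes from the catalogue below, and the model is assumed to be loaded already, e.g. via foundry model run):

```
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Phi-3.5-mini-instruct-generic-cpu",
        "messages": [{"role": "user", "content": "Hello from the edge"}]
      }'
```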
Secure by Default
Microsoft performs vulnerability scans and red‑team evaluations on every model before adding it to the Azure AI Foundry catalogue that Foundry Local mirrors. (microsoft.com)
Local vs. Cloud: When to Choose Each
- Choose local when data privacy, zero cloud cost, offline operation, or low latency dominate.
- Choose cloud (Azure AI Foundry) when you need larger frontier models such as GPT-4o or production-scale throughput; because the SDKs mirror each other, the switch is a config swap.
Model Catalogue
Every alias below ships in three hardware variants (cuda-gpu, generic-gpu, generic-cpu) appended to a common base ID; for example, phi-4 resolves to Phi-4-cuda-gpu, Phi-4-generic-gpu, or Phi-4-generic-cpu depending on your hardware.
- phi-4 (Phi-4)
- phi-3-mini-128k (Phi-3-mini-128k-instruct)
- phi-3-mini-4k (Phi-3-mini-4k-instruct)
- mistral-7b-v0.2 (mistralai-Mistral-7B-Instruct-v0-2)
- phi-3.5-mini (Phi-3.5-mini-instruct)
- phi-4-mini-reasoning (Phi-4-mini-reasoning)
- deepseek-r1-14b (deepseek-r1-distill-qwen-14b)
- deepseek-r1-7b (deepseek-r1-distill-qwen-7b)
- phi-4-mini (Phi-4-mini-instruct)
- qwen2.5-0.5b (qwen2.5-0.5b-instruct)
- qwen2.5-coder-0.5b (qwen2.5-coder-0.5b-instruct)
- qwen2.5-1.5b (qwen2.5-1.5b-instruct)
- qwen2.5-7b (qwen2.5-7b-instruct)
- qwen2.5-coder-1.5b (qwen2.5-coder-1.5b-instruct)
- qwen2.5-coder-7b (qwen2.5-coder-7b-instruct)
- qwen2.5-14b (qwen2.5-14b-instruct)
- qwen2.5-coder-14b (qwen2.5-coder-14b-instruct)
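You can also browse this catalogue programmatically; a minimal sketch, assuming the Python SDK's list_catalog_models helper (the CLI equivalent is foundry model list):

```
from foundry_local import FoundryLocalManager

manager = FoundryLocalManager()  # bootstraps the local service without loading a model
for m in manager.list_catalog_models():
    print(m.alias, "->", m.id)   # alias plus the hardware-specific variant ID
```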
Integration Patterns
Python Example
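A minimal sketch with the foundry-local-sdk and openai packages; the phi-3.5-mini alias is one pick from the catalogue above:

```
# pip install foundry-local-sdk openai
from foundry_local import FoundryLocalManager
from openai import OpenAI

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)  # starts the service and downloads the model if needed

client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,  # resolves the alias to the hardware-specific variant
    messages=[{"role": "user", "content": "Why run models locally?"}],
)
print(response.choices[0].message.content)
```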
JavaScript / TypeScript
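The same pattern in TypeScript, assuming the foundry-local-sdk npm package (run as an ES module on Node 18+ so top-level await works):

```
// npm install foundry-local-sdk openai
import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

const alias = "phi-3.5-mini";
const manager = new FoundryLocalManager();
const modelInfo = await manager.init(alias); // starts the service, downloads the model if needed

const client = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey });
const response = await client.chat.completions.create({
  model: modelInfo.id, // hardware-specific variant resolved from the alias
  messages: [{ role: "user", content: "Why run models locally?" }],
});
console.log(response.choices[0].message.content);
```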
Real‑World Use Cases
- Edge RAG on Secure PCs - Run a Phi‑3‑mini summarizer locally; fall back to cloud GPT‑4o for lengthy documents only when a network is available (see the sketch after this list). (learn.microsoft.com)
- Workshops & Hackathons - Students clone the repo and build chatbots without burning Azure credit.
- Healthcare Pilots - PII stripped locally before optional cloud escalation. (microsoft.com)
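A hypothetical fallback pattern for the Edge RAG case above; the helper names and the 8,000-character threshold are illustrative, not part of Foundry Local:

```
from openai import OpenAI, APIConnectionError

# Local gateway (endpoint quoted earlier in this post); the key is unused locally.
local = OpenAI(base_url="http://localhost:11434/v1/", api_key="unused")

def summarize(text: str, cloud: OpenAI | None = None) -> str:
    # Escalate long documents to cloud GPT-4o only when the network round-trip succeeds.
    if cloud is not None and len(text) > 8_000:
        try:
            return _chat(cloud, "gpt-4o", text)
        except APIConnectionError:
            pass  # offline or unreachable: fall through to the local model
    return _chat(local, "Phi-3-mini-4k-instruct-generic-cpu", text)

def _chat(client: OpenAI, model: str, text: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
    )
    return resp.choices[0].message.content or ""
```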
Friendly FAQ
Q1. Is Microsoft Foundry Local free?
Yes - install via winget or Homebrew with no license cost or Azure account.
Q2. Which operating systems are supported?
Windows 11, Windows Server 2025, and macOS 14 (Sonoma) or later on Apple Silicon.
Q3. Does Foundry Local need a GPU?
No. It auto‑selects CPU builds but accelerates on CUDA GPUs, DirectML GPUs, or NPUs when available for 5‑10× speed‑ups. (devblogs.microsoft.com, blogs.windows.com)
Q4. How do I update models?
foundry model download <name> fetches the latest quantised weights from the Azure AI Foundry catalogue.
Next Steps & Resources
- Official GitHub Repo - star and follow releases.
- Build 2025 Session C17E57EB - “Bring AI Foundry to Local” demo.
- ONNX Quantization Guide - cut VRAM by 75 %.
- DirectML + NPU Preview - harness Intel® AI Boost on Copilot+ PCs. (devblogs.microsoft.com)
- Azure AI Foundry Docs - full SDK & model registry tutorial.