What Is Microsoft Foundry Local?

Microsoft Foundry Local is a free tool that lets developers run generative‑AI models entirely on‑device—no Azure subscription, no internet connection, and no data leaving your laptop or desktop. It ships an OpenAI‑compatible REST endpoint backed by ONNX Runtime accelerators (CUDA, DirectML, NPU, or pure CPU) so you can reuse existing SDK code with almost zero changes. The project debuted at Build 2025 as part of Microsoft's push toward edge‑ready AI and privacy‑first development.

Why Run AI Models Locally?

  • Full data privacy – prompts and outputs never leave the device.
  • Low latency – responses in under 50 ms on recent RTX 40‑series GPUs and Intel® NPUs. (devblogs.microsoft.com, blogs.windows.com)
  • Zero cloud cost – ideal for classrooms, prototypes, and air‑gapped teams. (windowsforum.com)
  • Same code everywhere – the foundry‑local-sdk mirrors the Azure AI Foundry SDK, letting you scale to the cloud with a config swap.
  • Hardware‑aware builds – Foundry auto‑downloads the CUDA, DirectML, NPU, or CPU variant that best matches your machine. (github.com)

Two‑Step Quick‑Start

The CLI caches models under ~/.foundry and exposes an OpenAI‑compatible endpoint at http://localhost:<port>/v1/ (the service picks a port at startup; run foundry service status to see the exact URL) for SDKs, LangChain, or Semantic Kernel.
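
The two steps themselves look like this – a sketch assuming the documented winget/Homebrew packages and the phi-3.5-mini alias:

```bash
# Step 1 – install the CLI
winget install Microsoft.FoundryLocal                            # Windows
# brew tap microsoft/foundrylocal && brew install foundrylocal   # macOS

# Step 2 – download a model and start chatting;
# Foundry picks the best variant for your hardware
foundry model run phi-3.5-mini
```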

Foundry Local Feature Deep‑Dive

ONNX Runtime Acceleration

Foundry ships pre‑quantised INT4/INT8 checkpoints that use ONNX Runtime's weight‑only quantisation to slash VRAM while keeping accuracy within one percentage point of FP16.
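
As a back‑of‑the‑envelope check on what that buys you, here is a small sketch (weights only; the KV cache and runtime overhead come on top):

```python
# Rough VRAM needed to hold model weights at a given precision.
def weight_vram_gib(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1024**3

print(f"7B @ FP16: {weight_vram_gib(7, 16):.1f} GiB")  # ~13.0 GiB
print(f"7B @ INT4: {weight_vram_gib(7, 4):.1f} GiB")   # ~3.3 GiB, roughly a 75% cut
```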

OpenAI‑Compatible Gateway

Because the local gateway implements the same /v1/chat/completions and /v1/embeddings routes as OpenAI, you can point existing Python or JS clients to base_url=manager.endpoint and they just work.
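
For instance, a raw HTTP call behaves like any OpenAI‑compatible server. A sketch (the port is assigned at service startup, and the model ID must be a variant you have already downloaded):

```bash
curl http://localhost:<port>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Phi-3.5-mini-instruct-generic-cpu",
        "messages": [{"role": "user", "content": "Hello from the edge!"}]
      }'
```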

Secure by Default

Microsoft performs vulnerability scans and red‑team evaluations on every model before adding it to the Azure AI Foundry catalogue that Foundry Local mirrors. (microsoft.com)

Local vs. Cloud: When to Choose Each

Scenario | Foundry Local | Azure AI Foundry
Offline field app | ✅ zero connectivity required | 🚫 needs a network connection
HIPAA/PII workloads | ✅ data stays on device | ✔️ possible, but with extra compliance steps
Burst traffic / heavy RAG | 🚫 limited by local hardware | ✅ autoscaling GPU clusters

Model Catalogue

Every alias ships in three hardware variants – a CUDA GPU build, a generic GPU build, and a generic CPU build – and Foundry downloads whichever best matches your machine.

Alias | Variant IDs
phi-4 | Phi-4-{cuda-gpu, generic-gpu, generic-cpu}
phi-3-mini-128k | Phi-3-mini-128k-instruct-{cuda-gpu, generic-gpu, generic-cpu}
phi-3-mini-4k | Phi-3-mini-4k-instruct-{cuda-gpu, generic-gpu, generic-cpu}
mistral-7b-v0.2 | mistralai-Mistral-7B-Instruct-v0-2-{cuda-gpu, generic-gpu, generic-cpu}
phi-3.5-mini | Phi-3.5-mini-instruct-{cuda-gpu, generic-gpu, generic-cpu}
phi-4-mini-reasoning | Phi-4-mini-reasoning-{cuda-gpu, generic-gpu, generic-cpu}
deepseek-r1-14b | deepseek-r1-distill-qwen-14b-{cuda-gpu, generic-gpu, generic-cpu}
deepseek-r1-7b | deepseek-r1-distill-qwen-7b-{cuda-gpu, generic-gpu, generic-cpu}
phi-4-mini | Phi-4-mini-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-0.5b | qwen2.5-0.5b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-coder-0.5b | qwen2.5-coder-0.5b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-1.5b | qwen2.5-1.5b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-7b | qwen2.5-7b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-coder-1.5b | qwen2.5-coder-1.5b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-coder-7b | qwen2.5-coder-7b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-14b | qwen2.5-14b-instruct-{cuda-gpu, generic-gpu, generic-cpu}
qwen2.5-coder-14b | qwen2.5-coder-14b-instruct-{cuda-gpu, generic-gpu, generic-cpu}

Integration Patterns

Python Example
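
The original snippet was not preserved here, so below is a minimal sketch assuming the foundry-local-sdk and openai packages (pip install foundry-local-sdk openai). The manager starts the local service, downloads the model on first use, and hands its endpoint to the stock OpenAI client:

```python
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"  # any alias from the catalogue above

# Start the Foundry Local service and load the best variant for this machine
manager = FoundryLocalManager(alias)

# Point the stock OpenAI client at the local gateway
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

stream = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```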

JavaScript / TypeScript
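
The equivalent Node flow, again as a sketch assuming the foundry-local-sdk and openai npm packages:

```typescript
import { OpenAI } from "openai";
import { FoundryLocalManager } from "foundry-local-sdk";

const alias = "phi-3.5-mini";

// Start the service and download/load the best variant for this machine
const manager = new FoundryLocalManager();
const modelInfo = await manager.init(alias);

// Point the stock OpenAI client at the local gateway
const client = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey });

const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: "user", content: "Why is the sky blue?" }],
});
console.log(response.choices[0].message.content);
```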

Real‑World Use Cases

  1. Edge RAG on Secure PCs - Run a Phi‑3‑mini summarizer locally, falling back to cloud GPT‑4o for lengthy documents only when a network connection exists (see the sketch after this list). (learn.microsoft.com)
  2. Workshops & Hackathons - Students clone the repo and build chatbots without burning Azure credit.
  3. Healthcare Pilots - PII stripped locally before optional cloud escalation. (microsoft.com)
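
Pattern 1 reduces to a try-cloud-for-long-docs, otherwise-stay-local routing function. A hypothetical sketch – the environment-variable names, length threshold, and cloud endpoint are all placeholders:

```python
import os
import openai

# Placeholder configuration; endpoints, keys, and model IDs are illustrative.
LOCAL = openai.OpenAI(base_url=os.environ["FOUNDRY_LOCAL_ENDPOINT"], api_key="unused")
CLOUD = openai.OpenAI(base_url=os.environ["CLOUD_ENDPOINT"], api_key=os.environ["CLOUD_KEY"])

def summarize(doc: str, long_doc_chars: int = 20_000) -> str:
    """Summarize on-device by default; escalate long docs to GPT-4o when the network is up."""
    prompt = [{"role": "user", "content": f"Summarize:\n{doc}"}]
    if len(doc) > long_doc_chars:
        try:
            r = CLOUD.chat.completions.create(model="gpt-4o", messages=prompt)
            return r.choices[0].message.content
        except openai.APIConnectionError:
            pass  # offline in the field: fall through to the local model
    r = LOCAL.chat.completions.create(model=os.environ["FOUNDRY_LOCAL_MODEL"], messages=prompt)
    return r.choices[0].message.content
```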

Friendly FAQ

Q1. Is Microsoft Foundry Local free?

Yes. Install it via winget (Windows) or Homebrew (macOS) - no license cost, no Azure account.

Q2. Which operating systems are supported?

Windows 10/11, Windows Server 2025, and macOS Sonoma or later on Apple Silicon.

Q3. Does Foundry Local need a GPU?

No. It falls back to CPU builds automatically, but runs on CUDA GPUs, DirectML GPUs, or NPUs when available, typically a 5‑10× speed‑up. (devblogs.microsoft.com, blogs.windows.com)

Q4. How do I update models?

foundry model pull <name> fetches the latest quantised weights from the Azure AI Foundry catalogue.

Next Steps & Resources

  • Official GitHub Repo - star and follow releases.
  • Build 2025 Session C17E57EB - “Bring AI Foundry to Local” demo.
  • ONNX Quantization Guide - cut VRAM by 75 %.
  • DirectML + NPU Preview - harness Intel® AI Boost on Copilot+ PCs. (devblogs.microsoft.com)
  • Azure AI Foundry Docs - full SDK & model registry tutorial.
