May 20, 2025

What Is Microsoft Foundry Local?

Run GPT‑quality models offline with Microsoft Foundry Local. Learn installation, SDK tips, and privacy‑first edge‑AI workflows for Windows 11 & macOS.

Microsoft Foundry Local is a free tool that lets developers run generative‑AI models entirely on‑device – no Azure subscription, no internet connection, and no data leaving your laptop or desktop. It exposes an OpenAI‑compatible REST endpoint backed by ONNX Runtime accelerators (CUDA, DirectML, NPU, or pure CPU), so you can reuse existing SDK code with almost zero changes. The project debuted at Build 2025 as part of Microsoft's push toward edge‑ready AI and privacy‑first development.

Why Run AI Models Locally?

  • Full data privacy – prompts and outputs never leave the device.
  • Low latency – sub‑50 ms responses are achievable on recent RTX‑40 GPUs and Intel® NPUs. (devblogs.microsoft.com, blogs.windows.com)
  • Zero cloud cost – ideal for classrooms, prototypes, and air‑gapped teams. (windowsforum.com)
  • Same code everywhere – the foundry‑local-sdk mirrors the Azure AI Foundry SDK, letting you scale to the cloud with a config swap (see the sketch after this list).
  • Hardware‑aware builds – Foundry auto‑downloads the CUDA, DirectML, NPU or CPU variant that best matches your machine. (github.com)
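Here is a minimal sketch of that config swap: one flag decides whether the OpenAI client talks to the local gateway or a cloud endpoint. The USE_LOCAL, CLOUD_ENDPOINT, and CLOUD_API_KEY names are illustrative, not part of either SDK.

# Sketch: route the same OpenAI client to Foundry Local or the cloud.
# USE_LOCAL, CLOUD_ENDPOINT, CLOUD_API_KEY are illustrative names.
import os
import openai
from foundry_local import FoundryLocalManager

if os.getenv("USE_LOCAL", "1") == "1":
    alias = "phi-3.5-mini"
    manager = FoundryLocalManager(alias)   # starts the service, loads the model
    client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
    model = manager.get_model_info(alias).id
else:
    client = openai.OpenAI(base_url=os.environ["CLOUD_ENDPOINT"],
                           api_key=os.environ["CLOUD_API_KEY"])
    model = "gpt-4o"

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)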

Two‑Step Quick‑Start

# 1. Install
winget install Microsoft.FoundryLocal        # Windows
#   or
brew tap microsoft/foundrylocal && \
brew install foundrylocal                    # macOS

# 2. Chat with an SLM
foundry model run phi-3.5-mini               # downloads on first run, then opens an interactive chat

The CLI caches models in ~/.foundry and exposes a local OpenAI‑compatible endpoint for SDKs, LangChain, or Semantic Kernel. The service picks its port at startup, so read the base URL from foundry service status or the SDK's manager.endpoint rather than hard‑coding it.
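A short sketch of inspecting that cache through the Python SDK instead of poking ~/.foundry directly; it assumes the foundry-local-sdk package is installed and the service can start:

# Sketch: list locally cached models and the gateway's base URL.
from foundry_local import FoundryLocalManager

manager = FoundryLocalManager()          # attaches to (or starts) the service
print("Gateway base URL:", manager.endpoint)
for model in manager.list_cached_models():
    print(model.alias, "->", model.id)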

Foundry Local Feature Deep‑Dive

ONNX Runtime Acceleration

Foundry ships pre‑quantised INT4/INT8 checkpoints that leverage ONNX Runtime's weight‑only compression to slash VRAM while keeping accuracy within about one percentage point of FP16.
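As a back‑of‑envelope illustration of what weight‑only quantisation buys (weights only; KV cache and runtime overhead come on top), using Phi‑3.5‑mini's roughly 3.8 B parameters:

# Weight memory is roughly parameters x bits-per-weight / 8 bytes.
# 1e9 params at bits/8 bytes each works out to (params_billion * bits / 8) GB.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"3.8B params @ {label}: ~{weight_gb(3.8, bits):.1f} GB")
# FP16 ~7.6 GB -> INT4 ~1.9 GB: a 4x cut in weight memory.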

OpenAI‑Compatible Gateway

Because the local gateway implements the same /v1/chat/completions and /v1/embeddings routes as OpenAI, you can point existing Python or JS clients to base_url=manager.endpoint and they just work.
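For example, any plain HTTP client can call the chat route directly; a minimal sketch with the requests library, where the manager only supplies the dynamically assigned base URL:

# Sketch: POST straight to the gateway's /v1/chat/completions route.
import requests
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)     # ensures the service + model are up
payload = {
    "model": manager.get_model_info(alias).id,
    "messages": [{"role": "user", "content": "Say hi in five words."}],
}
resp = requests.post(f"{manager.endpoint}/chat/completions",
                     json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])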

Secure by Default

Microsoft performs vulnerability scans and red‑team evaluations on every model before adding it to the Azure AI Foundry catalogue that Foundry Local mirrors. (microsoft.com)

Local vs. Cloud: When to Choose Each

Scenario                  | Foundry Local                 | Azure AI Foundry
Offline field app         | ✅ zero connectivity required | 🚫 requires a network connection
HIPAA/PII workloads       | ✅ data stays on device       | ✔️ possible, but needs extra compliance steps
Burst traffic / heavy RAG | 🚫 limited by local hardware  | ✅ autoscaling GPU clusters

Model Catalogue

Every alias ships in three hardware variants – cuda-gpu, generic-gpu, and generic-cpu – and Foundry picks the best one for your machine. Run foundry model list to see the full, current catalogue.

Alias                | Base model
phi-4                | Phi-4
phi-3-mini-128k      | Phi-3-mini-128k-instruct
phi-3-mini-4k        | Phi-3-mini-4k-instruct
mistral-7b-v0.2      | mistralai-Mistral-7B-Instruct-v0-2
phi-3.5-mini         | Phi-3.5-mini-instruct
phi-4-mini-reasoning | Phi-4-mini-reasoning
deepseek-r1-14b      | deepseek-r1-distill-qwen-14b
deepseek-r1-7b       | deepseek-r1-distill-qwen-7b
phi-4-mini           | Phi-4-mini-instruct
qwen2.5-0.5b         | qwen2.5-0.5b-instruct
qwen2.5-coder-0.5b   | qwen2.5-coder-0.5b-instruct
qwen2.5-1.5b         | qwen2.5-1.5b-instruct
qwen2.5-7b           | qwen2.5-7b-instruct
qwen2.5-coder-1.5b   | qwen2.5-coder-1.5b-instruct
qwen2.5-coder-7b     | qwen2.5-coder-7b-instruct
qwen2.5-14b          | qwen2.5-14b-instruct
qwen2.5-coder-14b    | qwen2.5-coder-14b-instruct

Integration Patterns

Python Example

import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)   # starts the service and loads the model

# The local gateway looks just like Azure or OpenAI
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

resp = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Explain transformers in 3 bullets"}],
)
print(resp.choices[0].message.content)
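Streaming works through the same client; a minimal sketch, assuming the gateway honours the standard stream flag:

# Sketch: stream tokens as they are generated instead of waiting for the reply.
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

stream = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Explain quantization in 2 lines."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()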

JavaScript / TypeScript

import { FoundryLocalManager } from "foundry-local-sdk";
import OpenAI from "openai";

const alias = "phi-3.5-mini";
const manager = new FoundryLocalManager();
const modelInfo = await manager.init(alias);   // starts the service, loads the model

const openai = new OpenAI({ baseURL: manager.endpoint, apiKey: manager.apiKey });

const chat = await openai.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: "user", content: "Explain quantization in 2 lines." }]
});
console.log(chat.choices[0].message.content);

Real‑World Use Cases

  1. Edge RAG on Secure PCs - Run a Phi‑3‑mini summarizer locally; fall back to cloud GPT‑4o for lengthy docs only when a network is available (sketched after this list). (learn.microsoft.com)
  2. Workshops & Hackathons - Students clone the repo and build chatbots without burning Azure credits.
  3. Healthcare Pilots - PII is stripped locally before optional cloud escalation. (microsoft.com)
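A minimal sketch of the hybrid pattern in item 1; has_network, the length threshold, and the cloud model name are illustrative placeholders, not part of any SDK:

# Sketch: summarize locally by default, escalate to the cloud only for
# long documents when a network is reachable.
import os
import socket
import openai
from foundry_local import FoundryLocalManager

def has_network(host="8.8.8.8", port=53, timeout=2) -> bool:
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def summarize(doc: str) -> str:
    if len(doc) > 20_000 and has_network():
        client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        model = "gpt-4o"
    else:
        alias = "phi-3-mini-4k"
        manager = FoundryLocalManager(alias)
        client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
        model = manager.get_model_info(alias).id
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize:\n{doc}"}],
    )
    return resp.choices[0].message.content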

Friendly FAQ

Q1. Is Microsoft Foundry Local free?

Yes - install via winget or Homebrew with no license cost or Azure account.

Q2. Which operating systems are supported?

Windows 11, Windows Server 2025, and macOS on Apple Silicon (Sonoma or later).

Q3. Does Foundry Local need a GPU?

No. It falls back to CPU builds automatically, and accelerates on CUDA GPUs, DirectML GPUs, or NPUs when available, typically a 5‑10× speed‑up. (devblogs.microsoft.com, blogs.windows.com)

Q4. How do I update models?

foundry model download <name> fetches the latest quantised weights from the Azure AI Foundry catalogue.

Next Steps & Resources

  • Official GitHub Repo - star and follow releases.
  • Build 2025 Session C17E57EB - “Bring AI Foundry to Local” demo.
  • ONNX Quantization Guide - cut VRAM by 75 %.
  • DirectML + NPU Preview - harness Intel® AI Boost on Copilot+ PCs. (devblogs.microsoft.com)
  • Azure AI Foundry Docs - full SDK & model registry tutorial.
