VS Code's GitHub Copilot Chat + LM Studio Local API for Offline Coding

Keep Copilot Chat prompts private with an LM Studio local OpenAI-compatible backend—fully offline, no token costs, and easy setup steps.

8 days ago
Tags: lm-studio, local-llm, ai-tools, VS Code, Copilot, AI coding assistant, developer tools
Pair GitHub Copilot with a Local LM Studio Backend for Private, Offline Coding

If you like the GitHub Copilot Chat experience but want your prompts and code to stay on your own machine, there is a straightforward way to wire Copilot's polished chat UI to a local LLM running through LM Studio. The result is a coding assistant that respects your privacy, works without an internet connection, and costs nothing beyond the electricity to run your hardware. In this guide I'll walk you through the integration itself, the model choices, the trade-offs, and the gotchas.

This post stays focused on the Copilot wiring rather than LM Studio installation. For the install, model download, and first chat walkthrough, our platform-specific guides have you covered: "LM Studio Guide: Run Local LLMs on Your Mac Fast" and "LM Studio Windows Guide: Run Local LLMs Effortlessly". I won't rehash those steps here, so if LM Studio isn't running yet, start with the guide for your OS, then come back.

The Pitch: AI Coding That Stays on Your Laptop

The interesting thing about LM Studio's local server is that it speaks the OpenAI API. That single fact is what makes the Copilot integration possible. GitHub Copilot Chat, in its newer versions, can talk to any OpenAI-compatible endpoint, including the one LM Studio exposes at http://localhost:1234/v1. Once you point Copilot at that address, every chat message you send gets processed by a model running in your laptop's memory. Nothing leaves the device, there's no per-token billing, and there's no rate limit beyond what your hardware can sustain.
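To make the "speaks the OpenAI API" point concrete, here is a minimal Python sketch (standard library only) that builds the same chat request for two different base URLs. The helper name `build_chat_request` and the placeholder API key are illustrative assumptions, not part of LM Studio or Copilot, and the request is only constructed here, never sent.

```python
import json
import urllib.request

LOCAL_BASE_URL = "http://localhost:1234/v1"    # LM Studio's default, per this guide
HOSTED_BASE_URL = "https://api.openai.com/v1"  # shown only for comparison


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "placeholder-key") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


MODEL = "qwen2.5-coder-7b-instruct"  # example identifier; use whatever you loaded
local = build_chat_request(LOCAL_BASE_URL, MODEL, "Explain this regex.")
hosted = build_chat_request(HOSTED_BASE_URL, MODEL, "Explain this regex.")

# Same payload, different destination: only the URL changes.
print(local.full_url)             # http://localhost:1234/v1/chat/completions
print(local.data == hosted.data)  # True
```

Because the payload is identical, any client that lets you override the base URL can in principle be pointed at the local server.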

For developers working on proprietary codebases, in regulated industries, or just on personal projects they don't want logged by a third party, this is meaningful. You get the same convenience of an inline chat panel, with the data sovereignty of running the model yourself. It also unlocks a fully offline workflow: on a plane, in a restricted environment, or on an air-gapped network, your coding assistant is right there.

Info

This guide assumes LM Studio is already installed and you've downloaded at least one coding-capable model. If you haven't reached that point, follow the Mac or Windows setup guide linked above before continuing.

A quick honest disclaimer before we go further: this isn't a magic "free GPT-4" setup. Local models are smaller, slower, and less capable than the frontier cloud models. What you get back is control and privacy, not raw benchmark wins. We'll cover how to pick a model that holds up well for code, and where this approach shines versus where you'll still want to reach for the cloud.

What You'll Need (Without Repeating Setup Guides)

Here's the short list. I've kept it tight because the heavy lifting of getting LM Studio running is already documented in the reference posts above.

  • VS Code (1.85 or newer) with the GitHub Copilot Chat extension installed and signed in
  • LM Studio running locally, with a coding-capable model downloaded (DeepSeek Coder V2 Lite, Qwen2.5 Coder, Codestral, or similar)
  • A working local API server inside LM Studio (we'll enable this in Step 1)
  • Roughly 8 GB of free VRAM or RAM for a 7B-class model, 16 GB+ if you want comfortable headroom for larger context windows
  • About 10 GB of free disk for the model files plus the LM Studio cache

Tip

If you're on a Mac with Apple Silicon, you can offload most of the inference to the GPU. On Windows, an NVIDIA card with CUDA support will do the same. CPU-only inference works too, but expect noticeably slower token generation.

One thing to double-check before you start: Copilot Chat's "Bring Your Own Model" support only exists in recent releases. The flow below assumes you have at least version 0.14 of the Copilot Chat extension, so update both VS Code and the extension if the command used in Step 2 doesn't appear.

Step 1: Fire Up LM Studio's Local API

With LM Studio installed and a model already downloaded, you need to do two things: load the model into memory, and turn on the local API server. The model browser and download walkthrough is covered in the setup guides; the focus here is the server toggle.

  1. Open the Developer tab in LM Studio (the icon that looks like a terminal prompt, on the left sidebar).
  2. Pick your coding model from the dropdown at the top. If you've already downloaded DeepSeek Coder V2 Lite, Qwen2.5 Coder 7B Instruct, or Codestral, select it now.
  3. Start the server by clicking the green "Start Server" toggle. The default port is 1234, and the OpenAI-compatible base URL will be http://localhost:1234/v1.
  4. Copy the model identifier shown in the server panel. You'll need this exact string for the Copilot config in Step 2.
  5. Verify it's live by running a quick request from a terminal using the curl command provided.

Once the server is up, test it from your terminal. This is the fastest way to confirm the API is responding before you touch VS Code:

Bash
curl http://localhost:1234/v1/models

You should get a JSON response listing the model you loaded. If you get connection refused, LM Studio isn't running, or the server wasn't started. If you get a model list back, you're good to proceed.
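If you would rather script this check, the sketch below does the same thing in Python with the standard library. It assumes the response follows the usual OpenAI-style shape (a `data` array of objects with an `id` field); the function names and the sample payload are made up for illustration.

```python
import json
import urllib.error
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Pull model identifiers out of an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1",
                    timeout: float = 5.0) -> list[str]:
    """Return the served model ids, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError):
        # Typically "connection refused": LM Studio is closed or the server is off.
        return []


# Offline illustration with a hand-written payload of the assumed shape.
sample = {"object": "list", "data": [{"id": "qwen2.5-coder-7b-instruct", "object": "model"}]}
print(parse_model_ids(sample))  # ['qwen2.5-coder-7b-instruct']
```

With the server running, `fetch_model_ids()` should return the identifier you need for the Copilot config; an empty list means the endpoint could not be reached.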

To confirm the model actually generates text, send a chat completion request next:

Bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b-instruct",
    "messages": [{"role": "user", "content": "Write a Python function that flattens a nested list."}],
    "temperature": 0.2
  }'

If that returns a sensible Python snippet, your local backend is fully operational. Keep LM Studio running in the background while you work in VS Code; closing it drops the API endpoint.
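For reference, here is roughly what the same chat request looks like from Python, including how to pull the reply text out of the response. This is a sketch that assumes the standard OpenAI response layout (`choices[0].message.content`); `ask_local_model` and `extract_reply` are hypothetical helper names, and the demo at the bottom uses a hand-written response so it runs without a server.

```python
import json
import urllib.request


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


def ask_local_model(prompt: str,
                    model: str = "qwen2.5-coder-7b-instruct",
                    base_url: str = "http://localhost:1234/v1",
                    temperature: float = 0.2) -> str:
    """Send one user message to the local server and return the reply text."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Local generation can be slow, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))


# Offline illustration: a fabricated response with the assumed structure.
fake_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "def flatten(xs): ..."}}
    ]
}
print(extract_reply(fake_response))  # def flatten(xs): ...
```

With LM Studio's server running, `print(ask_local_model("Write a Python function that flattens a nested list."))` should print the model's answer.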

Picking a Model That Understands Code

Not every model in LM Studio's catalog is great for code. The ones worth your time for this integration are models specifically post-trained on code corpora. The two camps to know are dense decoder-only models (Qwen2.5 Coder, DeepSeek Coder, Codestral) and code-specialized variants of general models. For most laptops, 7B-class models in Q4_K_M quantisation hit the sweet spot of "fast enough" and "smart enough." Larger models give you better reasoning but eat RAM and slow token generation. We'll compare the popular picks in a dedicated section later.
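A rough way to reason about whether a quantized model will fit: weight size is approximately parameter count times bits per weight, divided by 8. The sketch below assumes about 4.8 bits per weight for Q4_K_M (an approximation, since the format mixes precisions), approximate parameter counts, and a flat 2 GB allowance for context cache and runtime overhead. All three numbers are assumptions to adjust, not measured values.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, free_gb: float,
                   overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the free memory."""
    return estimate_weights_gb(params_billion) + overhead_gb <= free_gb


# Parameter counts below are approximate and used only for illustration.
for label, params in [("7B-class", 7.6), ("14B-class", 14.8), ("22B-class", 22.2)]:
    print(f"{label}: ~{estimate_weights_gb(params):.1f} GB of weights")

print(fits_in_memory(7.6, free_gb=8.0))   # True
print(fits_in_memory(14.8, free_gb=8.0))  # False
```

The estimates land close to the file sizes listed in the model comparison later in this guide, which is a useful sanity check before committing to a multi-gigabyte download.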

Step 2: Point GitHub Copilot Chat to Your Local Server

This is the heart of the integration. Open your VS Code chatLanguageModels.json (Ctrl+Shift+P → "Chat: Open Language Models (JSON)") and add the following configuration. Replace the model name with whatever LM Studio showed in Step 1:

JSON
[
    {
        "name": "LM Studio",
        "vendor": "customendpoint",
        "apiType": "chat-completions",
        "apiKey": "your-api-key",
        "models": [
            {
                "id": "google/gemma-4-e4b",
                "name": "google/gemma-4-e4b",
                "url": "http://localhost:1234/v1",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 131072,
                "maxOutputTokens": 51200
            }
        ]
    }
]

A few details worth explaining:

  • vendor and apiType: set to "customendpoint" and "chat-completions", which together tell Copilot Chat that it is talking to a generic OpenAI-compatible chat endpoint rather than a first-party provider.
  • apiKey: required by the schema, but by default LM Studio's local server doesn't actually validate it, so any placeholder string works.
  • models: the array where you define the model Copilot will use. The id and name fields should match the model identifier shown in LM Studio's server panel, so replace "google/gemma-4-e4b" with whichever model you have loaded.
  • url: points to LM Studio's local server at http://localhost:1234/v1.
  • toolCalling and vision: both set to true here, meaning Copilot will attempt to use those capabilities if the loaded model supports them. If your model doesn't support function calling or vision, set these to false to avoid unexpected behavior.
  • maxInputTokens and maxOutputTokens: define the context window limits. Adjust the values shown (131072 and 51200) to match the actual context length your chosen model was loaded with in LM Studio.
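If you switch models often, generating the entry from a script avoids hand-editing mistakes. The sketch below simply mirrors the field names from the snippet above; they have not been checked against VS Code's schema here, and the function name, the example model identifier, and the token limits (32768 and 4096) are placeholders to replace with your own values.

```python
import json


def build_lm_studio_config(model_id: str, max_input_tokens: int, max_output_tokens: int,
                           tool_calling: bool = False, vision: bool = False,
                           base_url: str = "http://localhost:1234/v1") -> list:
    """Return a config list shaped like the guide's snippet for a single local model."""
    return [{
        "name": "LM Studio",
        "vendor": "customendpoint",
        "apiType": "chat-completions",
        "apiKey": "your-api-key",  # placeholder; the guide notes LM Studio ignores it by default
        "models": [{
            "id": model_id,
            "name": model_id,
            "url": base_url,
            "toolCalling": tool_calling,   # off unless you know the model supports it
            "vision": vision,
            "maxInputTokens": max_input_tokens,
            "maxOutputTokens": max_output_tokens,
        }],
    }]


# Example values only: use the identifier and context length shown in LM Studio.
config = build_lm_studio_config("qwen2.5-coder-7b-instruct", 32768, 4096)
print(json.dumps(config, indent=4))
```

Since `json.dumps` produces the text, the output is always syntactically valid JSON, which sidesteps trailing-comma mistakes. The capability flags default to false here as the conservative choice.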

After saving the file, restart VS Code. This isn't strictly required, but it forces Copilot Chat to re-read the configuration cleanly. Once it reloads, open the Copilot Chat panel and click the model selector at the top. You should see your local entry in the list, under the names you set in the config ("LM Studio" for the provider and the model's name field in the snippet above).

To verify the wiring is correct, send a simple prompt in the chat:

"In TypeScript, write a debounce function with proper typing."

Warning

If the model picker doesn't show your custom entry, the most common cause is a malformed JSON snippet (a trailing comma is the usual suspect). Open chatLanguageModels.json and validate it with your editor's JSON checker before reloading VS Code.
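A quick way to find the offending character outside the editor is to run the file's text through a strict JSON parser. The sketch below uses Python's built-in `json` module; `check_json` is a hypothetical helper name. One assumption to keep in mind: this parser rejects comments too, so strip any comments from the file before checking it this way.

```python
import json


def check_json(text: str) -> str:
    """Return "ok", or a message pointing at the first syntax error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return "ok"


good = '[{"name": "LM Studio", "models": []}]'
bad = '[{"name": "LM Studio", "models": [],}]'  # trailing comma after the array

print(check_json(good))  # ok
print(check_json(bad))   # reports the position of the problem
```

To check the real file, read it with `pathlib.Path(...).read_text()` and pass the result to `check_json`; the path depends on your OS and VS Code profile.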

If the response comes back quickly and reads like actual code, you're talking to your local model through Copilot's UI. Notice the experience: it's the same chat box, slash commands, and conversation history you get with the hosted models, but the inference is happening in another window on your machine. The status indicator at the bottom of the chat panel won't show the usual "Copilot" branding anymore when you're on the local backend, which is a quick visual confirmation the swap worked.

Step 3: Put It to Work

Once the model is wired up, you can use Copilot Chat exactly the way you normally would. Ask it to refactor a function, explain a regex, write tests, or convert a snippet from one language to another. From your perspective inside VS Code, nothing has changed about the workflow. From a data-flow perspective, everything has changed: the request never leaves your laptop.

Here's a representative interaction that shows the kind of help this setup delivers well:

You: "I'm building a FastAPI endpoint that accepts a list of orders and returns the total revenue. The orders have status fields that can be 'pending', 'paid', or 'refunded'. Only count paid orders. Add input validation and unit tests."

The local model will give you a structured response with the route, the Pydantic models, and a pytest file. Quality varies by model (more on that next), but the Qwen2.5 Coder and Codestral families handle this kind of request well, including the validation logic and reasonable test coverage.

One important scope clarification: this integration covers Copilot Chat, not inline ghost-text completions. The text suggestions that appear as you type still come from GitHub's cloud model unless you separately disable them. If you want a fully local inline-completion experience, you can disable Copilot's cloud completions and use a dedicated extension like Continue, which is covered in the "Beyond Copilot Chat" section below.

Info

If you want to switch back to the hosted Copilot model mid-session, just open the model picker in the chat panel and choose the default Copilot option. Your custom local entry stays in the list, so toggling is a one-click operation.

Model Picks for Local Coding

The model you choose has the biggest impact on the quality of this setup. Below is a side-by-side of the most useful coding models you can load in LM Studio right now, tuned for laptops with 16 GB of RAM or more.

| Model | Size (Q4_K_M) | Best For | Notes |
|---|---|---|---|
| Qwen2.5 Coder 7B Instruct | ~4.7 GB | General code generation, multi-language | Strong on Python, TS, Rust. Fast on most laptops. |
| Qwen2.5 Coder 14B Instruct | ~9 GB | More complex reasoning, larger refactors | Needs 16 GB+ RAM. Noticeably smarter at architecture. |
| DeepSeek Coder V2 Lite | ~9 GB | Code completion and chat | 16B MoE that activates ~2.4B per token. Good middle ground. |
| Codestral 22B (Q4) | ~13 GB | Multi-language, instruction following | Mistral's code model. Hungry for RAM but very capable. |
| StarCoder2 7B | ~4.5 GB | Fill-in-the-middle, completions | Excellent at code completion, less so at chat-style tasks. |
| DeepSeek Coder 6.7B | ~4 GB | Older but reliable, low VRAM | Still solid for simpler tasks on modest hardware. |

If you want a quick shortlist without reading the table in detail:

  • Got a 16 GB Mac or Windows machine? Start with Qwen2.5 Coder 14B. It's the strongest balance of capability and speed at that RAM tier.
  • Running on 8 GB or want faster responses? Use Qwen2.5 Coder 7B. Quality is only marginally below the 14B for everyday tasks.
  • Want maximum coding chops and have the headroom? Codestral 22B in Q4 quantization is hard to beat, but you'll need to close Chrome to free up RAM.
  • Mostly doing small completions and one-liners? StarCoder2 7B is fast and competent, with a strong fill-in-the-middle mode.

A practical workflow tip: download two models, a fast 7B for quick chat questions and a slower 14B for architecture-level prompts. Swap between them in LM Studio's developer tab whenever you need a different trade-off. The model switch in LM Studio takes about 5 to 10 seconds.
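The shortlist logic can be written down as a small filter over the table's file sizes. This is only a sketch: the sizes are copied from the table above, and the rule of spending at most 60% of total RAM on model weights (leaving the rest for the OS, editor, and context cache) is an assumption chosen to match the recommendations in this section, not a measured threshold.

```python
# (model name, approximate Q4_K_M size in GB), copied from the table above
MODELS = [
    ("Qwen2.5 Coder 7B Instruct", 4.7),
    ("Qwen2.5 Coder 14B Instruct", 9.0),
    ("DeepSeek Coder V2 Lite", 9.0),
    ("Codestral 22B", 13.0),
    ("StarCoder2 7B", 4.5),
    ("DeepSeek Coder 6.7B", 4.0),
]


def models_that_fit(total_ram_gb: float, weight_fraction: float = 0.6) -> list[str]:
    """List models whose weights fit within the assumed share of total RAM."""
    budget_gb = total_ram_gb * weight_fraction
    return [name for name, size_gb in MODELS if size_gb <= budget_gb]


for ram in (8, 16, 32):
    print(f"{ram} GB: {', '.join(models_that_fit(ram))}")
```

Under these assumptions an 8 GB machine is limited to the 7B-class models, 16 GB adds the 14B and DeepSeek Coder V2 Lite, and Codestral 22B only appears once you have more headroom.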

Weighing the Trade-offs: Local vs. Cloud Copilot

This setup is not a one-to-one replacement for the hosted Copilot. It has real benefits and real costs, and the right choice depends on what you're optimizing for.

Pros

  • Code and prompts never leave your machine, which matters for proprietary work, regulated industries, and personal projects
  • Fully offline: works on planes, in secure facilities, or anywhere with restricted network access
  • Zero per-token cost after the initial model download, no monthly usage caps
  • No rate limits beyond what your hardware can sustain
  • Easy to experiment with different open-source models without juggling API keys

Cons

  • Smaller models (7B to 22B) are noticeably less capable than frontier cloud models on complex reasoning
  • Slower token generation, especially on CPU-only or low-VRAM machines
  • Higher RAM and disk usage; running other heavy apps at the same time can cause swapping
  • Vision input and tool calling depend entirely on the loaded model (see the toolCalling and vision flags in Step 2), and effectiveness is reduced on very large codebases
  • Inline ghost-text completions still use the cloud unless you switch extensions

Tip

A practical hybrid pattern: use the local backend for routine chat tasks (explaining code, writing tests, drafting snippets) and keep a hosted Copilot subscription for the gnarly architectural questions where the larger model genuinely helps. You're not paying for tokens you don't need, and you're not forcing a 7B model to design a distributed system.
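Written as a rule of thumb, the hybrid pattern might look like the sketch below. The task labels and the `pick_backend` function are hypothetical; nothing in Copilot routes prompts this way automatically, so treat it as a checklist for deciding which entry to select in the model picker.

```python
# Task labels are hypothetical; adapt them to how you categorize your own prompts.
ROUTINE_TASKS = {"explain", "write-tests", "draft-snippet"}


def pick_backend(task: str, offline: bool = False, sensitive: bool = False) -> str:
    """Return "local" or "hosted" for a chat task under a simple hybrid policy."""
    if offline or sensitive:
        # No network, or code that must not leave the machine: local is the only option.
        return "local"
    return "local" if task in ROUTINE_TASKS else "hosted"


print(pick_backend("explain"))                                    # local
print(pick_backend("design-distributed-system"))                  # hosted
print(pick_backend("design-distributed-system", sensitive=True))  # local
```

The privacy and offline checks come first on purpose: they are hard constraints, while the routine-versus-complex split is only a preference.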

For a lot of day-to-day coding, the local setup is more than capable. The "less smart" gap shrinks dramatically if you stay within the model's training distribution, meaning mainstream languages and common patterns. Where you'll feel the gap most is on novel problems, long-horizon reasoning, and tasks that benefit from broad world knowledge.

Common Hiccups (And How to Fix Them)

A few questions come up reliably when developers first try this integration. Here are the most frequent ones, plus a deeper dive into the trickier ones.

Beyond Copilot Chat: More Local AI in Your Editor

If you like this setup and want to push further, the same LM Studio endpoint powers a few other VS Code extensions that bring even more local AI features into the editor.

Success

Both Continue and Cody are excellent next steps once you have a working LM Studio backend. They unlock inline completions, multi-file context, and slash commands, all running against the same local model.

  • Continue.dev: Open-source extension with inline completions, a chat panel, and the ability to select multiple files as context. Configure it with the same http://localhost:1234/v1 endpoint and you're fully local.
  • Cody (Sourcegraph): Strong on multi-repo context, code attribution, and slash commands like /explain and /test. Has a free tier that works with local models.
  • Cline (formerly Claude Dev): An agent-style extension that can run multi-step coding tasks against local models. Less polished chat UX, but powerful for automation-heavy workflows.
  • Roo Code / CodeGPT: Additional extensions that consume OpenAI-compatible APIs. Worth a look if you want a different UI surface for the same backend.

The bigger point: once you have LM Studio running, you're not locked into Copilot. You can run multiple extensions against the same local server, swap models freely, and design a setup that matches how you actually work.

Your First Move After Reading

You have everything you need. The Copilot integration takes about ten minutes once LM Studio is running, and the only irreversible step is downloading the model file. Start with Qwen2.5 Coder 7B if you want a quick win, then graduate to a larger model when you know what your hardware can handle.

Ready to wire it up?
Load a coding model in LM Studio, paste the chatLanguageModels.json snippet, restart VS Code, and you'll be chatting with a local model in Copilot's UI within minutes. Share what you find in the community.

Once you've got it working, the real value comes from experimenting. Try the same prompt on a 7B and a 14B model and compare. Try switching to a different quantisation level. Try loading the same model with a longer context window and see if it actually helps. The setup is small, but the surface area for tuning is large, and every developer ends up with a slightly different preferred configuration. Have fun with it, and if you find a model-and-settings combo that works particularly well for a specific stack (Elixir, Rust, Elisp, whatever), share it. The community is small but generous, and there's still a lot to learn about what local models can do for everyday coding.
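To make the 7B-versus-14B comparison less subjective, you can time the same prompt against each model. The sketch below assumes the server's response includes the OpenAI-style `usage.completion_tokens` field and takes the HTTP call as a parameter (`send`, a callable you supply), so the demo runs offline with a stub. The model names and the stub's numbers are placeholders, not real measurements.

```python
import time
from typing import Callable


def tokens_per_second(response: dict, elapsed_seconds: float) -> float:
    """Generation speed from the `usage` block, or 0.0 if it is missing."""
    completion_tokens = response.get("usage", {}).get("completion_tokens", 0)
    return completion_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0


def compare_models(prompt: str, models: list[str],
                   send: Callable[[str, str], dict]) -> dict[str, float]:
    """Run the same prompt against each model and record tokens per second."""
    results = {}
    for model in models:
        start = time.perf_counter()
        response = send(model, prompt)  # send(model, prompt) -> parsed JSON response
        results[model] = tokens_per_second(response, time.perf_counter() - start)
    return results


def fake_send(model: str, prompt: str) -> dict:
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    time.sleep(0.01)
    return {"usage": {"completion_tokens": 50}}


results = compare_models("Write a debounce function.", ["model-7b", "model-14b"], fake_send)
for model, speed in results.items():
    print(f"{model}: {speed:.0f} tokens/s")
```

Note that the timing is wall-clock and includes prompt processing, so it measures end-to-end throughput rather than pure generation speed. Swap `fake_send` for a function that posts to http://localhost:1234/v1/chat/completions to measure your own hardware.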
