LM Studio

June 24, 2026

🚀 Optimization Guide: LM Studio + pi Agent

Target Hardware: Intel Ultra 9 285K | NVIDIA RTX 5060 Ti (16GB) | 64GB RAM

This guide defines the ideal settings to maximize the performance of the pi agent, balancing the intelligence of large models with the speed of smaller ones.

🧠 The Fundamental Concept: VRAM vs. RAM

Your GPU has 16GB of VRAM.

  • If the model + context fit within 16GB → Maximum Speed (GPU).
  • If the model + context exceed 16GB → Slowdown (spillover to RAM/CPU).

🛠️ Configuration Profiles

1. Profile: Deep Reasoning (Large Models: 26B+)

Use this profile for complex architecture and difficult logic tasks, where accuracy matters more than speed. Example: gemma-4-2quanto-26b

ParameterRecommended ValueReason
GPU OffloadPartial (Manual)Adjust so total GPU usage stays around 14GB.
Context Length16384 (16k)Prevents the context window from pushing too many layers to the CPU.
CPU Thread Pool18Leaves cores available for the OS and the pi agent.
Eval. Batch Size512Balances prompt processing and VRAM usage.
Physical Batch Size512Keeps the inference pipeline stable.
Max Concur. Pred.1The pi agent works sequentially; there is no need for more.

2. Profile: Instant Coding (Light Models: 7B - 9B)

Use this profile for rapid refactoring, unit test creation, and file reading. Focus on ultra-low latency. Example: Llama-3-8B, Mistral-7B, Phi-3

ParameterRecommended ValueReason
GPU OffloadMax (All Layers)Ensures the entire model resides in VRAM.
Context Length32768 or 65536Uses the extra VRAM to give the agent longer-term memory.
CPU Thread Pool18Keeps the system responsive.
Eval. Batch Size1024 or 2048Drastically speeds up long prompt reading.
Physical Batch Size1024Aligns with evaluation for high throughput.
Max Concur. Pred.1Keeps focus on the single task of the agent.

📖 Glossary of Advanced Parameters

  • GPU Offload: Defines how many model layers are processed by the GPU. The goal is always to use as much as possible without exceeding the 16GB total.
  • Context Length: The “short-term memory.” A larger value lets the pi agent read more code at once, but increases memory consumption.
  • CPU Thread Pool Size: How many cores of your Intel Ultra 9 CPU are dedicated to the model’s math computation when it is running via RAM.
  • Evaluation/Physical Batch Size: Defines how many tokens are processed in a single step during prefill and decoding. Larger values increase throughput (tokens per second), but require more VRAM.

⚠️ Critical Warnings

[!CAUTION] VRAM MONITORING: Before loading a model, check the nvidia-smi command. If your GPU is already under heavy load (for example, more than 5GB occupied by other apps), you will not be able to run large models in full-GPU mode. Close browsers and other heavy applications before starting LM Studio.

[!TIP] GOLDEN TIP: If you notice that the pi agent is responding very slowly (low tokens/s), reduce the Context Length or lower the number of layers in GPU Offload to ensure the model fits entirely in your RTX 5060 Ti.