🚀 Optimization Guide: LM Studio + pi Agent
Target Hardware: Intel Core Ultra 9 285K | NVIDIA RTX 5060 Ti (16GB) | 64GB RAM
This guide defines the ideal settings to maximize the performance of the pi agent, balancing the intelligence of large models with the speed of smaller ones.
🧠 The Fundamental Concept: VRAM vs. RAM
Your GPU has 16GB of VRAM.
- If the model + context fit within 16GB → Maximum Speed (GPU).
- If the model + context exceed 16GB → Slowdown (spillover to RAM/CPU).
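The rule above can be turned into a rough back-of-the-envelope check. The sketch below assumes a standard transformer whose KV cache stores one key and one value vector per layer, per KV head, per token. Every number in `ModelSpec` is a hypothetical placeholder, not a measured value for any real model, and the 2 GiB reserve is an assumption that mirrors the "around 14GB" target used later in this guide. Real usage also includes compute buffers, so treat the result as an estimate only.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    """Hypothetical model description; all fields are illustrative assumptions."""
    weights_gib: float   # size of the quantized model file, in GiB
    n_layers: int        # number of transformer layers
    n_kv_heads: int      # number of key/value heads
    head_dim: int        # dimension of each attention head
    kv_bytes: int = 2    # bytes per cached element (2 = 16-bit cache)


def kv_cache_gib(spec: ModelSpec, context_length: int) -> float:
    """Estimate KV cache size: keys + values (factor 2) per layer, head, and token."""
    total_bytes = (2 * spec.n_layers * spec.n_kv_heads * spec.head_dim
                   * spec.kv_bytes * context_length)
    return total_bytes / GIB


def fits_in_vram(spec: ModelSpec, context_length: int,
                 vram_gib: float = 16.0, reserved_gib: float = 2.0) -> bool:
    """True if weights + KV cache stay under the VRAM budget minus a safety reserve."""
    needed = spec.weights_gib + kv_cache_gib(spec, context_length)
    return needed <= vram_gib - reserved_gib


# Hypothetical 8B-class model: 4.5 GiB of weights, 32 layers, 8 KV heads of size 128.
small = ModelSpec(weights_gib=4.5, n_layers=32, n_kv_heads=8, head_dim=128)
print(kv_cache_gib(small, 32768))   # KV cache estimate at 32k context, in GiB
print(fits_in_vram(small, 32768))   # whether the full model + context fits on the GPU
```

If `fits_in_vram` returns `False`, the options are the ones this guide describes: shorten the context or offload fewer layers to the GPU.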
🛠️ Configuration Profiles
1. Profile: Deep Reasoning (Large Models: 26B+)
Use this profile for complex architecture and difficult logic tasks, where accuracy matters more than speed.
Example: gemma-4-2quanto-26b
| Parameter | Recommended Value | Reason |
|---|---|---|
| GPU Offload | Partial (Manual) | Adjust so total GPU usage stays around 14GB. |
| Context Length | 16384 (16k) | Prevents the context window from pushing too many layers to the CPU. |
| CPU Thread Pool | 18 | Leaves cores available for the OS and the pi agent. |
| Eval. Batch Size | 512 | Balances prompt processing and VRAM usage. |
| Physical Batch Size | 512 | Keeps the inference pipeline stable. |
| Max Concur. Pred. | 1 | The pi agent works sequentially; there is no need for more. |
2. Profile: Instant Coding (Light Models: 7B - 9B)
Use this profile for rapid refactoring, unit test creation, and file reading. Focus on ultra-low latency.
Example: Llama-3-8B, Mistral-7B, Phi-3
| Parameter | Recommended Value | Reason |
|---|---|---|
| GPU Offload | Max (All Layers) | Ensures the entire model resides in VRAM. |
| Context Length | 32768 or 65536 | Uses the extra VRAM to give the agent longer-term memory. |
| CPU Thread Pool | 18 | Keeps the system responsive. |
| Eval. Batch Size | 1024 or 2048 | Drastically speeds up long prompt reading. |
| Physical Batch Size | 1024 | Aligns with evaluation for high throughput. |
| Max Concur. Pred. | 1 | Keeps focus on the single task of the agent. |
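For quick reference, the two profiles can be written down as plain data. This is only a sketch: the field names are hypothetical and are not LM Studio configuration keys, and where a table offers two values (for example 32768 or 65536) the lower one is used. Model sizes between the two ranges are not covered by this guide, so the helper returns `None` for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """One row set from the tables above; field names are illustrative only."""
    name: str
    gpu_offload: str
    context_length: int
    cpu_threads: int
    eval_batch_size: int
    physical_batch_size: int
    max_concurrent_predictions: int


DEEP_REASONING = Profile(
    name="Deep Reasoning",
    gpu_offload="partial (manual, about 14GB on GPU)",
    context_length=16384,
    cpu_threads=18,
    eval_batch_size=512,
    physical_batch_size=512,
    max_concurrent_predictions=1,
)

INSTANT_CODING = Profile(
    name="Instant Coding",
    gpu_offload="max (all layers)",
    context_length=32768,      # the guide also allows 65536
    cpu_threads=18,
    eval_batch_size=1024,      # the guide also allows 2048
    physical_batch_size=1024,
    max_concurrent_predictions=1,
)


def pick_profile(params_billion: float) -> Optional[Profile]:
    """Map a model size (billions of parameters) to a profile from this guide."""
    if params_billion >= 26:
        return DEEP_REASONING
    if params_billion <= 9:
        return INSTANT_CODING
    return None  # mid-sized models: no recommendation in this guide


profile = pick_profile(8)
print(profile.name, profile.context_length)
```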
📖 Glossary of Advanced Parameters
- GPU Offload: Defines how many model layers are processed by the GPU. The goal is always to use as much as possible without exceeding the 16GB total.
- Context Length: The “short-term memory.” A larger value lets the pi agent read more code at once, but increases memory consumption.
- CPU Thread Pool Size: How many cores of your Intel Core Ultra 9 CPU are dedicated to the model’s math computation when it is running via RAM.
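- Evaluation/Physical Batch Size: Defines how many tokens are processed in a single step, mainly during prefill (prompt processing). Larger values increase prompt-processing throughput (tokens per second), but require more VRAM. With a single concurrent prediction, decoding still generates one token per step.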
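To see why batch size matters for long prompts, it helps to count processing steps. The sketch below assumes the prompt is consumed in chunks of at most one batch each. The 12,000-token prompt is a made-up example, and the function name is hypothetical.

```python
def prefill_steps(prompt_tokens: int, batch_size: int) -> int:
    """Number of chunks needed to process a prompt, rounding up."""
    if prompt_tokens < 0 or batch_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and batch_size must be > 0")
    return (prompt_tokens + batch_size - 1) // batch_size


# Hypothetical 12,000-token prompt at the batch sizes used in this guide.
for batch_size in (512, 1024, 2048):
    print(batch_size, prefill_steps(12000, batch_size))
# 512 -> 24 steps, 1024 -> 12 steps, 2048 -> 6 steps
```

Fewer steps usually means faster prompt reading, at the cost of larger temporary buffers in VRAM.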
⚠️ Critical Warnings
[!CAUTION] VRAM MONITORING: Before loading a model, check current VRAM usage with the nvidia-smi command. If your GPU is already under heavy load (for example, more than 5GB occupied by other apps), you will not be able to run large models in full-GPU mode. Close browsers and other heavy applications before starting LM Studio.
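This check can also be scripted. The sketch below calls `nvidia-smi` with its `--query-gpu` and `--format=csv,noheader,nounits` options, which print used and total memory in MiB, one line per GPU. The helper names and the way the 5GB figure is applied are illustrative assumptions.

```python
import subprocess
from typing import List, Tuple


def parse_memory_csv(output: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (in MiB), one line per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Ask nvidia-smi for used and total VRAM; requires the NVIDIA driver tools."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def too_busy_for_large_model(used_mib: int, limit_gib: float = 5.0) -> bool:
    """True if other apps already occupy more VRAM than the guide's 5GB example."""
    return used_mib > limit_gib * 1024
```

On a machine with the NVIDIA driver installed, `query_gpu_memory()` returns one `(used, total)` pair per GPU, and passing the `used` value to `too_busy_for_large_model` tells you whether to close other applications first.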
[!TIP] GOLDEN TIP: If you notice that the pi agent is responding very slowly (low tokens/s), VRAM is probably overflowing. Reduce the Context Length, or lower the number of layers in GPU Offload, so that everything assigned to the GPU fits within the 16GB of your RTX 5060 Ti.