Dust off that gaming PC you haven't had time to play on (so many model releases 😵💫); we're going to use it to build an async agent.
*Tested on an Ubuntu box with an Intel Core i5, an RTX 3060 (12 GB VRAM), and 32 GB of RAM.
To run Qwen3-4B-Instruct locally with good performance (roughly 50 tokens per second), plan on about 8 GB of GPU VRAM (FP8 quantized) or 16 GB+ (bf16), plus 16 GB of system RAM.
Dramatis Personae (Glossary)
- vLLM: High-performance inference engine for serving LLMs locally.
- Qwen/Qwen3-4B-Instruct-2507: An efficient open-source instruction-tuned model from Alibaba’s Qwen team.
- OpenAI API compatibility: vLLM mimics the OpenAI API, so any OpenAI-style client can talk to it.
- smolagents: Lightweight agent framework for building tool-using LLMs.
- venv: Python's built-in virtual environment manager.
- R.A.G. (Retrieval Augmented Generation): A technique that enhances LLM responses by providing relevant context from external data sources.
1. Environment Setup
This guide assumes you already have Python 3.10+, pip, and GPU drivers (NVIDIA or another supported vendor) installed.
# Create and activate venv
python3 -m venv .venv
source .venv/bin/activate
# Upgrade pip
pip install -U pip
# Install runtime deps
pip install smolagents openai requests vllm
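Optional sanity check: before starting the server, confirm the GPU is visible from Python. This is a minimal sketch that assumes PyTorch was pulled in as a vLLM dependency (it normally is); the filename check_gpu.py is just a suggestion.
# Quick GPU sanity check (save as check_gpu.py and run it)
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")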
2. Start vLLM with Qwen
We’ll run Qwen locally via vLLM, exposing it on port 8000 with OpenAI-compatible endpoints.
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-4B-Instruct-2507 `# Model name/path to load (4B parameter LLM)` \
--host 127.0.0.1 `# Local bind address for the server` \
--port 8000 `# Port for the server to listen on` \
--dtype float16 `# Data type for model weights (float16 for reduced VRAM usage)` \
--max-model-len 8192 `# Max context length in tokens for input/output` \
--gpu-memory-utilization 0.95 `# Fraction of GPU VRAM to use (95% of 12GB)` \
--swap-space 20 `# GB of system RAM for offloading when VRAM is full` \
--max-num-seqs 1 `# Max parallel sequences to process at once` \
--max-num-batched-tokens 8192 `# Max tokens processed in a single batch`
These settings keep Qwen within the memory budget of a consumer GPU; tweak them for your card.
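Before moving on, it's worth confirming the server is actually serving the model. A minimal sketch using the openai client installed earlier, hitting vLLM's OpenAI-compatible /v1/models endpoint:
# List the models the local vLLM server exposes
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
for m in client.models.list():
    print(m.id)  # should include Qwen/Qwen3-4B-Instruct-2507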
3. The Simple Hello Agent
With the vLLM server running, open a new terminal and create a simple agent that can both chat naturally and use tools. It will respond to "Hello" and also demonstrate the ping tool. Save it as agent_hello.py:
from smolagents import CodeAgent, tool, OpenAIServerModel
from openai import OpenAI

# Direct OpenAI client for simple chat
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY",
)

# smolagents setup for tool usage
model = OpenAIServerModel(
    model_id="Qwen/Qwen3-4B-Instruct-2507",
    api_base="http://127.0.0.1:8000/v1",
    api_key="EMPTY",
    timeout=60,
)

@tool
def ping() -> str:
    """Return a simple string to prove tool-calling works."""
    return "pong"

# CodeAgent for tool usage
agent = CodeAgent(
    tools=[ping],
    model=model,
    verbosity_level=2,
)

def simple_chat(message: str) -> str:
    """Simple chat without any agent overhead."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",
        messages=[{"role": "user", "content": message}],
        max_tokens=100,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print("=== Hello Test ===")
    print(simple_chat("Hello"))
    print("\n=== Ping Tool Test ===")
    print(agent.run("Use the ping tool"))
Run it:
python agent_hello.py
You should see Qwen's response to "Hello" printed to the console, followed by the ping tool demonstration.
4. The Async Agent (with Tools)
Now let’s expand: an async REPL where you can type commands. It handles:
- ping → returns "pong".
- fetch users [N] → downloads dummy JSON users from https://dummyjson.com and saves them to data/users.json. This data becomes available for chat context.
- Any other input → chats with the model, automatically including the fetched JSON as context for informed responses.
Save it as agent_async_fn.py:
import os
import json
import asyncio
import requests
from openai import OpenAI

# --- Config ---
BASE_URL = "http://127.0.0.1:8000/v1"
MODEL = "Qwen/Qwen3-4B-Instruct-2507"
DATA_DIR = "data"
USERS_FILE = os.path.join(DATA_DIR, "users.json")
os.makedirs(DATA_DIR, exist_ok=True)

client = OpenAI(base_url=BASE_URL, api_key="not-needed")

# --- Tools ---
def ping() -> str:
    """Return a simple pong string."""
    return "pong"

def fetch_users(limit: int = 5) -> str:
    """
    Fetch dummy users from https://dummyjson.com.
    Save results to data/users.json and return a summary.
    """
    url = f"https://dummyjson.com/users?limit={limit}"
    resp = requests.get(url)
    resp.raise_for_status()
    payload = resp.json()
    with open(USERS_FILE, "w") as f:
        json.dump(payload, f, indent=2)
    users = payload.get("users", [])
    first_names = [u["firstName"] + " " + u["lastName"] for u in users[:3]]
    more = max(0, len(users) - 3)
    summary = f"Users: {len(users)} total; first: {', '.join(first_names)}"
    if more:
        summary += f" … +{more} more"
    return f"Saved to: {USERS_FILE}\n{summary}"

def load_user_context() -> str:
    """Load saved users.json and return as string for RAG context."""
    if not os.path.exists(USERS_FILE):
        return ""
    try:
        with open(USERS_FILE) as f:
            data = json.load(f)
        return json.dumps(data, indent=2)
    except Exception as e:
        return f"(Could not load JSON from {USERS_FILE}: {e})"

# --- Chat wrapper ---
def chat_with_model(prompt: str, context: str = "") -> str:
    """Send a message to the model, optionally with RAG context."""
    messages = []
    if context:
        messages.append({"role": "system", "content": f"Context:\n{context}"})
    messages.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content

# --- REPL loop ---
async def repl():
    print("⚡ Async Agent (users only). Type 'exit' to quit.")
    print("Examples:")
    print(" ping")
    print(" fetch users")
    print(" fetch users 3\n")
    while True:
        text = input("you> ").strip()
        if text.lower() in {"exit", "quit"}:
            break
        if text == "ping":
            print("[ping…]")
            print(ping())
        elif text.startswith("fetch users"):
            parts = text.split()
            limit = int(parts[2]) if len(parts) > 2 else 5
            print(f"[fetch users {limit}…]")
            print(fetch_users(limit))
        else:
            print("[chat…]")
            context = load_user_context()
            reply = chat_with_model(text, context=context)
            print(reply)

if __name__ == "__main__":
    asyncio.run(repl())
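A note on the "async" part: the loop above calls input() directly, which blocks the event loop while it waits for you to type. That's harmless in this single-task REPL, but if you later run other coroutines alongside it, one option is to push the read onto a worker thread. A minimal sketch, where read_line is a hypothetical helper rather than part of the script above:
import asyncio

async def read_line(prompt: str = "you> ") -> str:
    # asyncio.to_thread (Python 3.9+) runs blocking input() in a worker
    # thread, so other asyncio tasks can keep running while you type
    return (await asyncio.to_thread(input, prompt)).strip()

# Inside repl(), the read would become:
#     text = await read_line()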
Run it:
python agent_async_fn.py
Guess what? You've just implemented Retrieval Augmented Generation (R.A.G.): by fetching external data and injecting it into the chat context, you gave the model access to information beyond its training data. From buzzword to hands-on experience.
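Right now the retrieval step is blunt: load_user_context() injects the entire users.json into the system prompt. As the dataset grows, you'd want to retrieve only what's relevant to the question. A minimal keyword-based sketch; load_relevant_users is a hypothetical drop-in for load_user_context() in the chat branch:
import json
import os

def load_relevant_users(question: str, users_file: str = "data/users.json") -> str:
    """Return only the users whose names appear in the question."""
    if not os.path.exists(users_file):
        return ""
    with open(users_file) as f:
        users = json.load(f).get("users", [])
    q = question.lower()
    hits = [u for u in users
            if u.get("firstName", "").lower() in q
            or u.get("lastName", "").lower() in q]
    # Fall back to the full list if nothing matches
    return json.dumps(hits or users, indent=2)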
5. Try It Out
you> ping
[ping…]
pong
you> fetch users 3
[fetch users 3…]
Saved to: data/users.json
Users: 3 total; first: John Doe, Jane Smith, Alice Johnson
Contents of data/users.json (abridged):
{
"users": [
{ "firstName": "John", "lastName": "Doe", "email": "john@example.com" },
{ "firstName": "Jane", "lastName": "Smith", "email": "jane@example.com" },
{ "firstName": "Alice", "lastName": "Johnson", "email": "alice@example.com" }
]
}
you> how old is John Doe?
[chat…]
He is 25 years old in the dummy dataset.
Wrap-Up
- Hello Agent: minimal one-shot demo.
- Async Agent: adds simple tools + local JSON context (RAG).
- vLLM + Qwen: efficient, open, and runs on consumer GPUs.
From here you can:
- Add more tools (file I/O, DB queries, shell commands); see the sketch after this list.
- Swap Qwen for a different model.
- Build a proper front-end over the REPL.
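As a starting point for the first item, here's a minimal sketch of an extra tool in the smolagents style used by agent_hello.py. The read_file name is just for illustration:
from smolagents import tool

@tool
def read_file(path: str) -> str:
    """Read a local text file and return its contents.

    Args:
        path: Path to the file to read.
    """
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Register it alongside ping:
# agent = CodeAgent(tools=[ping, read_file], model=model, verbosity_level=2)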
You now have a local Qwen agent setup, both the simple and async versions, with inference running entirely on your own GPU.
MIT License
Copyright (c) 2025 Tasetic Wave LLC
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.