Dust off that gaming PC you haven't had time to play on (so many model releases 😵‍💫). We're going to use it to build an async agent.

*Tested on an Ubuntu box with an Intel Core i5, an RTX 3060 (12 GB VRAM), and 32 GB of RAM.

To run Qwen3-4B-Instruct locally with good performance (roughly 50 tokens per second), you'll need about 8 GB of GPU VRAM (FP8 quantized) or 16 GB+ (bf16), plus at least 16 GB of system RAM.
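
A quick back-of-envelope for those numbers (weights only; the KV cache and runtime overhead add a few more GB on top):

Python
# Rough VRAM estimate for the model weights alone.
params = 4e9  # approximate parameter count of Qwen3-4B
print(f"bf16 weights: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes per weight
print(f"fp8 weights:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte per weight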

Dramatis Personae (Glossary)

  • vLLM: High-performance inference engine for serving LLMs locally.
  • Qwen/Qwen3-4B-Instruct-2507: An efficient open-source instruction-tuned model from Alibaba’s Qwen team.
  • OpenAI API compatibility: vLLM mimics the OpenAI API, so any OpenAI-style client can talk to it.
  • smolagents: A lightweight agent framework for building tool-using LLMs.
  • venv: Python's built-in virtual environment manager.
  • R.A.G. (Retrieval Augmented Generation): A technique that enhances LLM responses by providing relevant context from external data sources.

1. Environment Setup

Assuming you already have Python 3.10+, pip, and GPU drivers (NVIDIA or other supported) installed.

Bash
# Create and activate venv
python3 -m venv .venv
source .venv/bin/activate

# Upgrade pip
pip install -U pip

# Install runtime deps
pip install smolagents openai requests vllm
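
Optionally, confirm the packages installed by printing their versions (the names below match the pip install line above):

Python
# Optional sanity check: print the installed versions of the runtime deps.
from importlib.metadata import version

for pkg in ("smolagents", "openai", "requests", "vllm"):
    print(f"{pkg}: {version(pkg)}")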

2. Start vLLM with Qwen

We’ll run Qwen locally via vLLM, exposing it on port 8000 with OpenAI-compatible endpoints.

Bash
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 `# Model name/path to load (4B-parameter LLM)` \
  --host 127.0.0.1 `# Local bind address for the server` \
  --port 8000 `# Port for the server to listen on` \
  --dtype float16 `# Data type for model weights (float16 for reduced VRAM usage)` \
  --max-model-len 8192 `# Max context length in tokens for input/output` \
  --gpu-memory-utilization 0.95 `# Fraction of GPU VRAM to use (95% of 12 GB)` \
  --swap-space 20 `# GB of system RAM for offloading when VRAM is full` \
  --max-num-seqs 1 `# Max parallel sequences to process at once` \
  --max-num-batched-tokens 8192 `# Max tokens processed in a single batch`

That keeps Qwen under control on consumer GPUs, but tweak for your card.
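
Before moving on, you can confirm the server is up by querying its OpenAI-compatible /v1/models endpoint (this assumes the command above is running and bound to 127.0.0.1:8000):

Python
import requests

# Ask the local vLLM server which models it is serving.
resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # should include Qwen/Qwen3-4B-Instruct-2507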

3. The Simple Hello Agent

With the vLLM server running, open a new terminal and create a simple agent that can both chat naturally and use tools. It will respond to Hello and also demonstrate the ping tool. Save it as agent_hello.py:

Python
from smolagents import CodeAgent, tool, OpenAIServerModel
from openai import OpenAI

# Direct OpenAI client for simple chat
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY",
)

# smolagents setup for tool usage
model = OpenAIServerModel(
    model_id="Qwen/Qwen3-4B-Instruct-2507",
    api_base="http://127.0.0.1:8000/v1",
    api_key="EMPTY",
    timeout=60,
)

@tool
def ping() -> str:
    """Return a simple string to prove tool-calling works."""
    return "pong"

# CodeAgent for tool usage
agent = CodeAgent(
    tools=[ping],
    model=model,
    verbosity_level=2,
)

def simple_chat(message: str) -> str:
    """Simple chat without any agent overhead."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-4B-Instruct-2507",
        messages=[{"role": "user", "content": message}],
        max_tokens=100
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print("=== Hello Test ===")
    print(simple_chat("Hello"))
    print("\n=== Ping Tool Test ===")
    print(agent.run("Use the ping tool"))

Run it:

Bash
python agent_hello.py

You should see Qwen's response to Hello printed to the console, followed by the agent running the ping tool and returning pong.
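
Because vLLM speaks the OpenAI protocol, the same client can also stream tokens as they are generated. Here's an optional variation on simple_chat (reusing the client defined in agent_hello.py):

Python
# Streaming variant: print tokens as they arrive instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()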

4. The Async Agent (with Tools)

Now let’s expand: an async REPL where you can type commands. It has:

  • ping → returns "pong".
  • fetch users [N] → downloads dummy JSON users from https://dummyjson.com, saves to data/users.json. This data becomes available for chat context.
  • Any other input → chats with the model, automatically including the fetched JSON as context for informed responses.

Save as agent_async_fn.py:

Python
import os
import json
import asyncio
import requests
from openai import OpenAI

# --- Config ---
BASE_URL = "http://127.0.0.1:8000/v1"
MODEL = "Qwen/Qwen3-4B-Instruct-2507"
DATA_DIR = "data"
USERS_FILE = os.path.join(DATA_DIR, "users.json")

os.makedirs(DATA_DIR, exist_ok=True)

client = OpenAI(base_url=BASE_URL, api_key="not-needed")


# --- Tools ---
def ping() -> str:
    """Return a simple pong string."""
    return "pong"


def fetch_users(limit: int = 5) -> str:
    """
    Fetch dummy users from https://dummyjson.com.
    Save results to data/users.json and return a summary.
    """
    url = f"https://dummyjson.com/users?limit={limit}"
    resp = requests.get(url, timeout=30)  # timeout so a slow network can't hang the REPL
    resp.raise_for_status()
    payload = resp.json()

    with open(USERS_FILE, "w") as f:
        json.dump(payload, f, indent=2)

    users = payload.get("users", [])
    first_names = [u["firstName"] + " " + u["lastName"] for u in users[:3]]
    more = max(0, len(users) - 3)
    summary = f"Users: {len(users)} total; first: {', '.join(first_names)}"
    if more:
        summary += f" … +{more} more"

    return f"Saved to: {USERS_FILE}\n{summary}"


def load_user_context() -> str:
    """Load saved users.json and return as string for RAG context."""
    if not os.path.exists(USERS_FILE):
        return ""
    try:
        with open(USERS_FILE) as f:
            data = json.load(f)
        return json.dumps(data, indent=2)
    except Exception as e:
        return f"(Could not load JSON from {USERS_FILE}: {e})"


# --- Chat wrapper ---
def chat_with_model(prompt: str, context: str = "") -> str:
    """Send a message to the model, optionally with RAG context."""
    messages = []
    if context:
        messages.append({"role": "system", "content": f"Context:\n{context}"})
    messages.append({"role": "user", "content": prompt})

    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content


# --- REPL loop ---
async def repl():
    print("⚡ Async Agent (users only). Type 'exit' to quit.")
    print("Examples:")
    print("  ping")
    print("  fetch users")
    print("  fetch users 3\n")

    while True:
        text = (await asyncio.to_thread(input, "you> ")).strip()  # run blocking input() off the event loop
        if text.lower() in {"exit", "quit"}:
            break

        if text == "ping":
            print("[ping…]")
            print(ping())

        elif text.startswith("fetch users"):
            parts = text.split()
            limit = int(parts[2]) if len(parts) > 2 else 5
            print(f"[fetch users {limit}…]")
            print(fetch_users(limit))

        else:
            print("[chat…]")
            context = load_user_context()
            reply = chat_with_model(text, context=context)
            print(reply)


if __name__ == "__main__":
    asyncio.run(repl())

Run it:

Bash
python agent_async_fn.py

Guess what? You've just implemented Retrieval Augmented Generation (R.A.G.) by fetching external data and injecting it into the chat context, giving your model access to information beyond its training data. From buzzword to hands-on experience.
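
Right now load_user_context() injects the entire users.json file into the prompt. For larger datasets, the "retrieval" step would select only the relevant slice. Here's a minimal sketch of that idea (retrieve_relevant_users is a hypothetical helper, not part of agent_async_fn.py):

Python
import json

def retrieve_relevant_users(query: str, users_file: str = "data/users.json") -> str:
    """Naive retrieval: keep only users whose first or last name appears in the query."""
    with open(users_file) as f:
        users = json.load(f).get("users", [])
    q = query.lower()
    hits = []
    for u in users:
        name = f"{u.get('firstName', '')} {u.get('lastName', '')}".strip().lower()
        if name and any(part in q for part in name.split()):
            hits.append(u)
    return json.dumps(hits, indent=2) if hits else ""

You could swap this in for load_user_context() inside the chat branch to keep the prompt small as the dataset grows.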

5. Try It Out

Output
you> ping
[ping…]
pong

you> fetch users 3
[fetch users 3…]
Saved to: data/users.json
Users: 3 total; first: John Doe, Jane Smith, Alice Johnson

you> how old is John Doe?
[chat…]
He is 25 years old in the dummy dataset.

Wrap-Up

  • Hello Agent: minimal one-shot demo.
  • Async Agent: adds simple tools + local JSON context (RAG).
  • vLLM + Qwen: efficient, open, and runs on consumer GPUs.

From here you can:

  • Add more tools (file I/O, DB queries, shell commands); see the sketch after this list.
  • Swap Qwen for a different model.
  • Build a proper front-end over the REPL.
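
For instance, the first item could start as a read file command wired into the REPL dispatch (read_file is a hypothetical addition, not part of agent_async_fn.py):

Python
def read_file(path: str, max_chars: int = 2000) -> str:
    """Return up to max_chars characters of a local text file."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()[:max_chars]
    except OSError as e:
        return f"(Could not read {path}: {e})"

# Inside the REPL loop, next to the existing 'fetch users' branch:
#   elif text.startswith("read file "):
#       print(read_file(text[len("read file "):].strip()))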

You now have a local Qwen agent setup, both the simple and async versions, with the model running entirely on your own GPU (only the fetch users tool needs internet access).



MIT License

Copyright (c) 2025 Tasetic Wave LLC

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.