I Ran a 1B AI Agent on a $0 Budget — 100+ tok/s on 8GB GPU

Sursa: https://youtu.be/i-Oq_CcFsT4?si=lClTPzCk3kMvEaAG Data: 2026-05-31 Creator: Prompt Engineer Format: Video (~24:14 min) Tags: @work @growth

TL;DR

MiniCPM 5 1B (2.17 GB, necesita 7-8 GB VRAM) rulează la 100+ tok/s pe un GPU de 8 GB. Videoul demonstrează 3 metode: Ollama (simplu, rapid), vLLM (throughput mai mare, necesar pentru apps publice, necesita WSL pe Windows), și OpenCode (alternativa open-source la Claude Code). Concluzie: modelul e excelent pentru chat/rezumare/reasoning, slab la tool-calling și modificare fișiere. Cheia pentru frameworks agentice (OpenCode, Hermes): --max-model-len 64000.

Puncte cheie

MiniCPM 5 1B de la OpenBMB — 2.17 GB, GGUF disponibil pe HuggingFace, performant pe GPU de 8 GB
100 tok/s pe 8 GB VRAM — viteză impresionantă pentru un model local, gratuit
3 metode de rulare: Ollama (cel mai simplu), vLLM (throughput mai mare, multi-user), OpenCode CLI
vLLM pe Windows: necesita WSL + Ubuntu + Miniconda. Comanda: vllm serve OpenBMB/MiniCPM-S-1B-sft --port 1234 --gpu-memory-utilization 0.65 --max-model-len 64000 --enable-auto-tool-choice --tool-call-parser hermes
64K context obligatoriu pentru frameworks agentice (OpenCode, Hermes Agent) — fără el primești eroare de context prea mic
Limitări: tool-calling slab, nu poate modifica fișiere în agentic mode; bun pentru text/reasoning/chat
OpenCode = open-source Claude Code, configurabil cu orice model vLLM prin custom provider

Idei acționabile

Test MiniCPM 5 1B pe Ollama (LXC 104) — are deja Ollama, modelul e 2.17 GB, VRAM disponibil? Bun pentru taskuri de rezumare/reasoning locale @work
vLLM ca alternativă la API Anthropic pentru taskuri repetitive simple (rezumare bonuri, extragere date) — cost $0 @work

Transcrierea

Hey guys, I'm gonna take you through a journey. So we have this new model, Mini CPM 1 billion. And I'm gonna show you different ways in which you can use this beautiful small model. The file size of this model is just 2.17 GB. But if you are trying to make it run, we do need about 7 to 8 gigs of VRAM. And this fits perfectly on my PC because I have 8 GB of VRAM and 16 GB of RAM. So let's go ahead and try to use this. Now I will show you how to use this on Olama. You can do LM Studio as well. I'm gonna show you how I do the UI part on Streamlit. I'm gonna show you VLM inferencing as well. So that we can ultimately patch everything up and try to run it with the open code. Which is an open source alternative of Cloud code. Now this is really one of the toughest jobs for any LLM. Now 1 billion model LLM, we really doubt if it's possible or not. Now if you want to see if it was possible to do it or not. Please stay till the later part of the video when I will show you if it was a success or not. But rest of the things like Olama and VLM was a success. LM Studio was a success as well. There are so many options to run this and I'm gonna touch upon a few and do this quickly. Let's go ahead. So we have this from OpenBMB. We have MiniCPM 5 1 billion model. And this is a very small model. You can see you can compare it with other Qen 3 0.6 billion model Qen 3.5 0.8 billion model. Or LLM 2.5 1.2 billion model. And you can see our new model which is MiniCPM 5 1 billion. It's actually performing pretty good in agendic works in logical reasoning and math reasoning. Again in coding and in general knowledge. So let's go ahead and let's see this. We have different options that you can choose. We have this base version of 1 billion. And then we have the SFT version. We have the base version here. And then we have this GGUF version which is for LLAMA CPP, Olama and LM Studio. Now I'm gonna use this. So what I'm gonna do is I'm gonna click here and go to Hugging Face and then download this. So I'm just gonna copy or click here and then download it here. So this is downloaded. This is the model. And for running it on Olama, we first need to install Olama. The installation process of Olama is pretty easy. You can go to PowerShell and run this command. Or you can go to Downloads here. And again have options for all these. You can go ahead and download the desktop app as well for Windows. Okay, so I already have installed Olama. So because of which I can do this. I can go to our command prompt here. And then I can do Olama list. And you can see the list of models that I have. But the important thing is, I don't want this. But the important thing is that I want to use this GGUF model. Now for using this model, we need to create a file called model file. Okay, just a model file. So what you need to do is you can go ahead and start up a text document, remove the extension and just name it model file. Okay, so once you do that. Now let's open up this in a VS code so that I can show you properly. So here we have this model file. So inside this model file, we need to do this. So we are taking the path where the GGUF file is. So this is my GGUF file. So I'm saying from the GGUF file and the template. I'm just leaving it like this. And I'm putting the parameter of temperature of 0.7. So this is the model file. And now using the model file, what you can do is you can go ahead and do cmd now to this particular location. And now you can do olama create the name of the model. So let's say I'm going to call this mini CPM 5 1b model. Okay. And then I'm going to say dash F and the model file. So I'm making a new model known as mini CPM 5 1b and we're using the model file. So inside the model file, as you can see, we have the things written here. So this model file. So inside the model file, we have the location of the GGUF because this is the actual model. We have the template and then we have the parameter of temperature. You can put other things as well. If you go to olama model file, you will see all the list of things that you can do. So I can go to model file reference and go to some examples here. So you can see from the model, set the temperature, set the number of context here, and then the system prompt. So even I can set this number of context. Let's set this context as well. So what I can do is I can put in like this, a context length could be, okay, could be anything between like 2000. It's okay. So now what we are going to do is I'm going to cmd here. And then now we are going to create this. I think I've already have written this. So this is the one olama create mini CPM 51B dash F and then the model file. That's going to press enter. And this is going to use the GGUF file, which you have downloaded. And it's going to make or save it as a new model. Now, if you say olama list, then you'll see that we have this new model mini CPM 51B latest. This is the new model that we have. Now what you can do is we can say olama run the name of the model. So mini CPM then 51B latest. So you can run this model now. So it's loading the two GB model. And if you want to see the memory utilization and everything, you can go ahead to WSL. So we'll see how to install WSL as well. But once you go to WSL, which is Windows subsystem for Linux, there you can run this command just a minute. There you can run this command. So I can say watch dash N one second. And I can say NVIDIA dash SMI. So this will show me, let me minimize this. And you can see we have the 8188 MB 8188 GB GPU. And out of which 1.50 has been used as of now. Okay, let's go ahead and ask some questions. What is the capital of India? You can see the speed. It's really amazing. The capital of India is New Delhi. Cool. Then if you want to see the number of tokens, I can just top this control D and then say clear. But again, if you run this with a verbose tag here and then say what is the capital of India? Then you can see the number of tokens here. You can see it's 100 tokens per second. And it's really amazing. So you can go ahead and use this on Olama. But the outputs are not that great when you use an Olama. But if you use in VLLM, I'm going to show you a better control on the outputs. Okay, so this is first step. You can go ahead and use this on Olama. Works completely fine. You can see you get 100 tokens per second. I have cap cut now running. If it doesn't run, then it goes even faster than this. But a local model, a powerful model giving you 100 tokens per second. It's really amazing. All right. The next part of the video would be to run it on VLLM using VLLM because it gives you higher throughput and it's better control. And again, multiple users requesting info and it's going to inference it and serve each one of them. And it's not going to confuse and send the outputs of one to another. So it's a better serving engine. VLLM is always best when you have public facing applications. So let's go ahead and use VLLM to host to run this one billion model here. So for using VLLM, what I'm going to do is, okay, so for using with VLLM, what we need to do is first go ahead and if you're on a Windows system, you need WSL because on Windows, it doesn't work much and it's really frustrating to get VLLM to work. So what are you going to do is we are going to use WSL, which is Windows subsystem for Linux. Basically a window, a Linux environment on a Windows machine. Okay. So we're going to run this on PowerShell and this is going to just get you started. So once you have that, you'll be able to open something called WSL here need to install Ubuntu as well. So you have Ubuntu and WSL here. And then what I'm going to do is I'm going to make one environment here using mini conda or conda. You need to install mini conda here as well. So when you first start this, you will have something like this. So you will have, sorry, you will have something like this. And then you need to install mini conda. So you go to mini conda, install here, to install mini conda here. And then you can go to Linux installer here, copy this, go back and paste it here. It'll download the files. And then you go down and run this, say enter, do the necessary requirements here. So enter, say yes, enter, and you are done. Okay. So it's not installed in my case because I already have a mini conda environment here. Okay. So once you install mini conda, then again, make a new environment which will be like conda, create dash and let's call it VLLM ENV and then say Python 3.12-Y. Okay. So this will create a VLLM environment here, VLLM underscore ENV environment on your WSL, on your Linux system. So this is going to install everything here. Okay. Now we need to do conda activate, I guess a conda activate VLLM underscore ENV. So this will activate the environment. Cool. Next, what you need to do is go ahead and install VLLM. So say pip install VLLM. And this is going to download the VLLM. This will take some time, but it will be worth it. And again, just keep it in this fashion and use it later as well. You don't need to download it every day and every time you use an LLM. This is a one-time setup that you can do on your system and you can keep it forever. Of course, you need to update the things when it is required. Okay. I'm not going to bore you with the installation here. So what I've done is I've already got this installed. So I can see the list of environments conda info dash dash ENVS. This will give me the list of environments. So the environment is this one open BNB. I can say conda activate open BNB. So this is my environment in which the VLLM is already installed. So now what I can do is I already have this command. I will share this with you, but this is the command that we need to use. So VLLM serve, we are serving this model open BNB. I can say mini CPM 1B. And then we will serve it in the port of 1, 2, 3, 4 GPU memory utilization is 0.65. Because if I keep it one, then it the 8GVGV that I have falls short. Therefore, I had to reduce this. Now the most important thing and the only important thing if you want to take from this video is this 64,000 value. We need a maximum model length of 64,000 value. If you want to work with open code or if you want to work with Hermos agent and cloud code. So if you want to use this model 1 billion model with even cloud code, you can do that. I will bring it a different video. You can obviously integrate with Hermos agent if it's possible with cloud code. And in this video, I'm just going to show you the open code way, but it's absolutely possible for Hermos agent and cloud code as well. I'll bring in a separate video for those. But today let's focus on open code. We need this 64,000 and it's essential. And I was trying this with Hermos agent. I will bring a separate video on this, but you can see the error that we have here. This is an internal error model open BNB mini CPM 1B has a context 20 of 16,000, but which is below the minimum 64,000 required by Hermos agent. And therefore I suggest to keep this value whenever you're working with agentic frameworks. Okay. So 64,000 now in order to have that 64,000, you need because it takes some GPUs. So we need to make some space in the GPU. So for which we need to reduce this value. Okay. And then we are putting enable auto tool choice. This will help us choose the tools. And then we're using the tool parser of Hermos and this will help you output tool calls. So let's run this VLM serve open VMB mini CPM 5, 1 billion. We got the same to open code. So this is running. You can see that it's pretty good. You can see the memory requirements. And that will be interesting. And it's not yet increased, but it will be in just a moment. So I have a GB GPU and it's going to increase. So I can see that it's increasing 32, 3.2 GB is used. It will go even more. Okay. It is downloading and making everything ready. So finally, everything has started up. It's VLM. Your 1 billion model is running here on the port 1, 2, 3, 4 on the local host. And I can see the usage here with a GB of GPU. I've used 6.4 GB is being used. And that includes my recording cap cut as well. So now let's go ahead and use this. How to use this now. What do you say we have this model? So if I go to a CMD and if I do a CURL and I can say HTTP and say local host, 1, 2, 3, 4, V1 and say models, then you'll get the list of models. So you can see that we have this open BMV mini CPM 1 billion. So this is the actual model. Okay. So if I use this, you can use this on the front end that I've shown you at the start. We're going to use trimlet. Now I have already made a code. I will share that with you as well. But this is a code app.py. You're using this endpoint of 1, 2, 3, 4 V1 chat completions. This is the model name. By the way, you can go ahead and try out the other models. I think the SFT would perform better as well. This is post train that we're using right now. But this is before the reinforcement learning and OPD. So you can go ahead and use this SFT model as well. But ultimately, the starting model was this pre training model. And then we have the SFT and then we have the RL plus OPD. So this is the final model that you can use. If you are on Apple, you can go ahead and use this MLX. We have seen how to use this on Olamma. But you can check out my other videos where I've shown you how to use this on Lama CPP and LM Studio as well. The Olamma and LM Studio are just have been made on the Lama CPP here. So let's go back enough of blabbering. This is the model that we're using setting the title of me CPM. And then we are putting this streamlit code here. I'll share the code. It's it's a normal code, nothing much. And there I need to install a streamlit streamlit and say requests. Sorry, it should be requests. Okay, seems like this is installed. Let's go ahead and run this now. Streamlit run app.py. So this will start up my app.py here. And you can see that we have this beautiful interface. Mini CPM 1 billion chat here. So let's go ahead and chat this. And I have deliberately not kept any memory as of now. So what is the capital of India? And you can see it's pretty fast. This is the thinking mode here till here and the output here. How cool is that? And I can say which is larger 9.11 or 9.9. 9.9 is larger because it's the first decimal digit larger than 9.11. And it's really great. Which is larger 2 to the power 3 or 3 to the power 2. And as a matter of fact, even chat GPT did this wrong. You can see I was asking some trick questions if I show you the exact conversation that we're having. So you can see that this wears some of the conversations that I was having with chat GPT so that you know I can ask some trick questions here. And you can see this one. You can see question number 10 which is larger 2 to the power 3 or 3 to the power 2. And the answer that chat GPT gave was equal. I mean come on. 3 to the power 2 is 9 and 2 to the power 3 is 8. And therefore 3 to the power 2 is larger. And you can see if we go to our favorite 1 billion model, let's see the output here. So you can see 3 to the power 2 is larger and it was able to solve this. It's really amazing how fast is this. I can go ahead and look at some other questions here. For example, write this number in words. Let's go ahead and see here. So you can see this one. Output. So all these are the thinking process and the output is 1024. Okay. And then other questions. For example, what comes next 248 1632. It's so fast. You know, like 100 tokens. It's really it's amazingly fast. You can see it's 32 here and harder reasoning. Let's say a doctor gives you three pills and say take one every 30 minutes. How long until all pills are taken? Okay. This is the question you got the question. I think a doctor gives you three pills and say take one pill every 30 minutes. So right now is let's say 12 o'clock and 12 30 you take one and one PM you take one and then one 30 you take one. So how long it would be like, you know, 90 minutes. So let's see the output. These are all the thinking process. I cannot complain, but I just want the final output. Okay. This is another thing which I even did wrong and says take one every 30 minutes. Okay. It considered that it took the first pill at the start the second 30 minutes later and the third 60 minutes later. What is the output here? Okay. First pill now second pill 30 minutes and the third pill 60 minutes. I was mistaken. Sorry. But what I want is that it's not, you know, it's not cutting the outputs and it's able to give the entire output here. And that is really amazing for me. And now thanks to the 64,000 tokens that we have put and then this code as well. And the, in the output tokens and the maximum allowable tokens also I have given a very good number, which is about 2048. And therefore it's not, it will not exclude now given this number considering that we will have the history as well. So if these questions would have been history and it remembered everything from the start, then it would easily cross, you know, 20k or 30k tokens. And then it will be useful when we have a 64k tokens here. But for this example, you can see it works really good. It's fun. It's fast and it's local. Now taking one step further, if you have been following and watching this video till now, let's go to the, the third criteria in which we are going to use this with open code. Now this is revolutionary. A one billion model trying to run on open code. If it was success, I don't think so. I don't want to say now, but let's go ahead and see for ourselves if it's a success or not. So for this, I can go and install open code using npm. So I can say npm, copy this, go to any cmd on your system. And then I can say npm i dash g, which means dash g is for global and this is done. Okay. So once we have this, what I'm going to do is I'm going to go to this directory that I was playing with. I'm going to go to cmd here and then I'm going to go and say open code. So open code and this will enter into your interface here. Pretty cool. Next, we want to have the models. So I can say models here, switch models and you can pick up the model from here. But let's go ahead and first use the desktop app in which I can show you more clearly how to set up a model. I'm going to close this. I'm going to go here and download the desktop app again. So this is the desktop app. I'm going to download four windows and once you download and install this, you'll have something like open code. So open code is opening. I'm going to go ahead and click here on the open projects. Go to that folder that we're working on. So select that folder and then say new session. And now we need to set our models here. Okay. So I click on the choose model here, click on the manage models here and then click on connect provider and then go to custom provider. So here what I'm going to say is YouTube. Let's say one billion. This is the provider ID. The display name could be the same thing like one billion. The base URL, this is important, HTTP. And then we have localhost or I can say 127. 001. And then I can say v1 and the port as well. Sorry. One to three, four. And then I'm going to say v1. Okay. And then there is no requirement of any API key, but the model name is really, really important here. And the model that we have seen is this one open BNB copied. No, open BNB, any CPM, be mini CPM five. So this will be like model ID is like open BNB. Then we have mini CPM five and one billion. The display name could be anything. I'm going to put like YouTube one billion. And let's go ahead and test this out. If it works or not, otherwise we need to change some things. So this is done. I'm going to close this. Let's open up again, open code. Okay. So we selected this model that we've used, which is this YouTube one B. And now we can go ahead and ask, hi, how are you cool? I can say what is the capital of India? Cool. And then if you remember, I had some questions that I asked. And if you remember, I have just one. And if you remember, I just asked some chat GPT for some trick questions here. So let's see, copy this and let's go ahead and let's go ahead and ask this question here. So divide 30 by half and then add 10. It doesn't have the mathematical abilities. But if you go to open code here and then maybe change the model, let's see. So we pick the model, which is this one, YouTube one B, and then ask the same question here. The same question that you asked just now, divide 30 by half. So you can see this is able to give me the answer because we're using on the code here. Now the difference between the CLI that I've observed and the web app is that in the web app, you're not able to make the files automatically. It's difficult to make the files automatically. But if you're using on the CLI, which is equivalent to cloud code, then you're able to make the files as well. But this one billion model is not it's able to make the entire files for yourself. But it's at least it's trying, but at least it's trying to get the results. And you can see that if I say, read the contents of the folder of the current folder, or you can say read the contents of the files of the current folder, then you see that it was able to get this file, this file app.txt. And we really do have that app.txt file here. And I can say, read the contents of the app.txt file. Go ahead. So you can see it lacks that it lacks that capabilities of using and changing the files in your system. But having said that, this is an amazing model for doing all other sorts of things without any file modification or shell commands running. So those are not available here. But still, you can go ahead and ask these questions as well. Which one is heavier? One kg of iron or one kg of cotton? And you can see that the weights are approximately iron this point 18 kg and cotton is 0.75. I want to expect that it performs well in the tool calling feature as well. So apart from the tool calling, I was able to find this as a very good model for using on LamaCPP, Olama LM Studio. And you can go ahead and use it and do the VLLM here as well. Now for any summarization task or the task which has words without any tool usage, you can go ahead and absolutely use this model. This is amazingly fast. It's like 100 tokens per second on the 8GB GPU that I have. So go ahead and test this out and let me know if you were able to connect this with all these things that I've mentioned here with Olama VLLM and OpenCode. I'm going to show in the next video on Cloud Code as well. And maybe find a way to make that tool call usage. Okay. But in summary, it's a very good model. It works with Olama. It works with VLLM. Even it works with OpenCode. We're able to connect. And there is no issue with the context as you have seen. 64K tokens. If you keep that 64K tokens, you know, if you have lesser GPU, if you have a 4GB GPU, then try to make this GPU utilization lesser. But you have to keep the 64K here as well. Now I've tested out so many things which I cut from the video. I've tested out with SG Lang as well and it failed miserably. VLLM was the one which at least we were able to serve this. And at least we are able to use this with OpenCode. Okay. So in the next video, in the subsequent videos, I will play with this model and do all sorts of automations. Try it with Cloud Code, try it with Hermes agent and I will show you the results. So stay tuned, stay subscribed and I will see you in the next one.

25 KiB Raw Blame History

I Ran a 1B AI Agent on a $0 Budget — 100+ tok/s on 8GB GPU

TL;DR

Puncte cheie

Idei acționabile

Transcrierea

25 KiB

Raw Blame History