Run LLMs on Cloud GPUs in Minutes
Shows how to quickly set up powerful language models on cloud GPUs, appealing to AI enthusiasts and developers.
This video demonstrates how to run large language models from Hugging Face on powerful GPUs using Vast.ai and the Oobabooga web UI. The presenter walks through selecting a template, allocating sufficient disk space, choosing an appropriate GPU, and downloading models.
The video shows how to run large language models from Hugging Face on powerful GPUs using Oobabooga as the web UI.
Select the recommended template for Oobabooga, which sets up the environment and opens port 7860 for the Gradio web interface.
Allocate at least 80 GB of disk space upfront because many models are 60-70 GB and disk space cannot be added later.
Check the model's GPU RAM requirements (e.g., Falcon 40B needs ~60 GB) and select a GPU with sufficient RAM, such as an A100 with 80 GB.
Choose a GPU with enough RAM, like a 1x A100 (80 GB) or multi-GPU options. Cheaper alternatives include 4x A5000 (96 GB) or A6000 (48 GB).
The instance takes about 3-5 minutes to load. Once ready, click the open button to access the Oobabooga web UI on port 7860.
In the Models tab, paste the Hugging Face username/model name (e.g., from the LLM leaderboard) and click download. After download, load the model into GPU RAM.
Check the billing tab to estimate the credits needed for long runs, and set an auto-billing threshold so the instance is not stopped when the balance gets low.
By following these steps, you can easily run large language models on cloud GPUs via Vast.ai using Oobabooga, ensuring proper disk space and GPU RAM allocation.
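The two sizing checks in the steps above (GPU RAM and disk space) can be sketched in a few lines of Python. This is an illustrative sketch only: the GPU RAM totals are the figures quoted in the video, while `GpuOption`, `options_that_fit`, `disk_is_enough`, and the 10 GB headroom default are hypothetical names and values introduced here, not part of Vast.ai or Oobabooga.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    """One rentable configuration: `count` cards of the same model."""
    name: str
    count: int
    vram_gb_each: int

    @property
    def total_vram_gb(self) -> int:
        return self.count * self.vram_gb_each


# GPU RAM figures as quoted in the video (not a price list or live inventory).
OPTIONS = [
    GpuOption("RTX 4090", 1, 24),
    GpuOption("A6000", 1, 48),
    GpuOption("A100", 1, 80),
    GpuOption("A5000", 4, 24),
    GpuOption("A100", 2, 80),
]


def options_that_fit(required_vram_gb, options=OPTIONS):
    """Return the configurations with enough total GPU RAM, smallest first."""
    fits = [o for o in options if o.total_vram_gb >= required_vram_gb]
    return sorted(fits, key=lambda o: o.total_vram_gb)


def disk_is_enough(model_size_gb, allocated_gb, headroom_gb=10):
    """Check the up-front disk allocation against the model download size.

    The headroom value is an assumption; disk cannot be added later,
    so it is safer to round up.
    """
    return allocated_gb >= model_size_gb + headroom_gb


# A model needing ~60 GB of GPU RAM (the Falcon 40B example from the video):
for option in options_that_fit(60):
    print(f"{option.count}x {option.name}: {option.total_vram_gb} GB")
# 1x A100: 80 GB
# 4x A5000: 96 GB
# 2x A100: 160 GB

print(disk_is_enough(model_size_gb=70, allocated_gb=80))  # True
print(disk_is_enough(model_size_gb=70, allocated_gb=16))  # False (the 16 GB default)
```

Sorting by total GPU RAM is only a rough stand-in for cost; actual prices vary by listing and should be read from the console.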
"Title accurately describes the content: launching Oobabooga on Vast.ai cloud GPUs."
What is the minimum disk space recommended for running one large language model on Vast.ai?
At least 80 GB.
2:07
Why can't you add disk space later to a Vast.ai instance?
All disk space must be allocated upfront; it cannot be added later.
1:59
What GPU RAM does a 1x A100 provide?
80 gigabytes.
3:58
How do you download a model from Hugging Face in Oobabooga?
Paste the Hugging Face username and model name into the Models tab and click download.
7:38
What port does the Oobabooga template open for the web interface?
Port 7860.
6:00
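For the download step, the Models tab expects an identifier in `username/model` form. A minimal sketch of checking that form before pasting it in is shown here. `normalize_repo_id` and its simplified pattern are assumptions made for illustration, not Hugging Face's own validation rules; `tiiuae/falcon-40b` is used as the example because the video mentions Falcon 40B.

```python
import re

# Simplified pattern (an assumption): exactly one "/" separating two
# non-empty parts made of letters, digits, ".", "_" and "-".
_REPO_ID = re.compile(r"^[A-Za-z0-9][\w.\-]*/[A-Za-z0-9][\w.\-]*$")
_HUB_PREFIX = "https://huggingface.co/"


def normalize_repo_id(text: str) -> str:
    """Turn a pasted model-page URL or ID into 'username/model'.

    Raises ValueError if the result does not look like 'username/model'.
    """
    text = text.strip()
    if text.startswith(_HUB_PREFIX):
        text = text[len(_HUB_PREFIX):]
    text = text.strip("/")
    if not _REPO_ID.match(text):
        raise ValueError(f"expected 'username/model', got {text!r}")
    return text


print(normalize_repo_id("https://huggingface.co/tiiuae/falcon-40b"))
# tiiuae/falcon-40b
print(normalize_repo_id("  tiiuae/falcon-40b  "))
# tiiuae/falcon-40b
```

A bare model name without the username (for example `falcon-40b`) is rejected, which matches the instruction to paste both parts.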
Disk Space Allocation
Emphasizes the critical need to allocate sufficient disk space upfront, as it cannot be added later.
1:34
GPU RAM Matching
Explains the importance of matching GPU RAM to model requirements to avoid failures.
2:22
GPU Selection Options
Lists various GPU options with different RAM sizes and costs, helping users choose appropriately.
3:42
Model Download Process
Shows the simple process of downloading models from Hugging Face directly within Oobabooga.
7:22
[00:00] hello uh welcome to Vast in this video I
[00:03] want to show you how you can run some of
[00:07] the best large language models that
[00:10] exist from Hugging Face or other places
[00:12] on very powerful GPUs and so
[00:18] let me get started today we'll be using
[00:21] Oobabooga as the web UI which is a
[00:25] great interface for prompting the models
[00:28] and also kind of loading and managing
[00:30] them that's some open source software
[00:32] that we'll load up in an instance on
[00:34] Vast
[00:36] so I'll first kind of Click into the
[00:38] console and make sure that you are
[00:41] logged into your account and that you
[00:43] have credits if you've never done this
[00:46] before with Vast we have a different
[00:47] video that can go over a lot of some of
[00:49] the basics
[00:50] but for Oobabooga
[00:54] you're going to come in here and select
[00:57] our recommended template for that
[01:00] and uh that's gonna have the description
[01:03] here it kind of shows you which language
[01:06] models that you can run with it and uh
[01:09] it's going to have some specific options
[01:11] and an onstart script that you don't
[01:13] want to mess with that's going to set up
[01:15] your environment correctly so this will
[01:18] work it's also going to open a port for
[01:22] the open button and that will launch the
[01:25] gradio web interface so that it all just
[01:28] works so really all you need to do is
[01:30] Select that uh template
[01:34] now one of the most important things
[01:38] um
[01:39] is to make sure you allocate enough
[01:41] disk storage I just reset my filters
[01:44] because the default is only 16 gigabytes
[01:46] which is not going to be enough
[01:49] a lot of these large language models are
[01:51] 60 70 gigabytes to download and your
[01:54] instance will start to throw errors if
[01:57] it runs out of disk space you also need
[01:59] to allocate all the disk space that you
[02:01] want to use up front for this instance
[02:03] you cannot add it later
[02:05] so with that in mind you're probably
[02:07] going to want if you're just going to
[02:08] try one language model at least about 80
[02:13] gigabytes so I'm just going to move the
[02:15] slider to 81 and get that all set up the
[02:20] other important thing to understand when
[02:22] you're running these large language
[02:23] models is to match the GPU with the
[02:27] model that you want to run for example
[02:30] if you're looking at Hugging Face
[02:34] Hugging Face has an actual LLM
[02:37] leaderboard and so you can see some of
[02:40] the most popular models here and
[02:45] um
[02:46] how you can run them
[02:48] and basically what you will need to do
[02:51] is to load these into Oobabooga once we
[02:54] have that running so we'll come back to
[02:56] this
[02:57] but know that the model that you each
[03:00] one of these models that you're trying
[03:02] to run for example if you want to run
[03:05] Falcon 40 billion you need to read
[03:09] through and understand how much GPU RAM
[03:12] this is going to require because if this
[03:15] requires say
[03:17] 60 gigabytes of GPU RAM and you select a
[03:20] GPU that only has 10 gigabytes of GPU
[03:24] RAM it is not going to work so you need
[03:26] to make sure that the large language
[03:28] model that you want to run it's going to
[03:31] have
[03:32] um uh you need to figure out what exact
[03:35] specifications it needs
[03:37] and then select an appropriate GPU
[03:42] what I like to do is to just actually
[03:44] select the GPU that has the most GPU RAM
[03:47] which is one of our A100 uh
[03:51] SXM4s or PCIes so I'm going to go ahead
[03:54] and select a 1x SXM4 these have 80
[03:58] gigabytes of GPU RAM so they have uh one
[04:03] of the more powerful cards that are out
[04:05] right now from NVIDIA and 80 gigabytes
[04:09] is enough for most large language
[04:11] models you can also select a multi-gpu
[04:14] instance so if I needed even more space
[04:17] I could have a 2x A100 that would
[04:20] have 160 gigabytes of GPU RAM available
[04:23] for the large language models or I could
[04:26] select sort of a cheaper option like an
[04:29] A5000 and this 4x A5000
[04:36] has actually 96 gigabytes of GPU RAM and
[04:41] it is a little bit cheaper than a single
[04:44] A100 you can also look at an A6000 they
[04:47] have 48 gigabytes of GPU RAM
[04:51] and an A40
[04:53] has 48 gigabytes of GPU RAM
[04:57] the consumer graphics cards like the
[04:59] 4090 and 3090 are only going to have 24
[05:02] gigabytes of GPU RAM each so again this
[05:06] is just something that you want to be
[05:08] really aware of and make sure that
[05:10] you're selecting a GPU that's going to
[05:11] have enough space so I'm going to go
[05:13] ahead and select a 1x A100 and now this
[05:18] is going to load I have 80 gigabytes
[05:21] allocated on this instance and I have
[05:25] selected the Oobabooga web UI which is
[05:31] our recommended template and so if I
[05:34] jump into my instances here I can see
[05:36] that this is being created and set up
[05:37] for me
[05:40] it's going to take three four or five
[05:43] minutes to load maybe a little bit
[05:45] longer it's really going to depend on
[05:46] the internet connection speed of the
[05:48] machine and
[05:50] um
[05:51] uh the size of the image this one loaded
[05:54] in about three and a half minutes for me
[05:56] and now the open button is going to open
[06:00] port 7860 which was put in the
[06:04] environment variables when we set this
[06:06] up
[06:07] um and uh
[06:09] there's a few things that were installed
[06:11] and set up on the onstart script but
[06:13] anyways this is all just stuff that's in
[06:16] the template that we have set up for
[06:18] Oobabooga
[06:20] and I'm going to go ahead and open that
[06:22] interface up and here it is
[06:26] so uh here's where you can actually
[06:27] query the model that you set up the most
[06:30] important thing is going to be
[06:32] downloading and setting up the model so
[06:35] um this software is not
[06:38] developed or maintained by Vast this is
[06:43] open source software so to understand
[06:45] how to use this software you're going to
[06:48] want to find the open source project for
[06:53] this
[06:54] and load that
[07:02] so here's the
[07:05] GitHub that's going to have a readme
[07:08] um
[07:09] of course the installation steps you
[07:12] don't have to worry about because
[07:13] um you're using a Docker image and
[07:15] everything is
[07:18] pre-loaded
[07:21] um
[07:22] so you can place the models into the
[07:24] model folder or when you're using the
[07:27] web UI you can just simply go to the
[07:29] models tab where it was before and
[07:32] here's where you can download the custom
[07:34] model so for Hugging Face you just use
[07:38] the username and model so for example if
[07:43] I wanted to try to use just looking
[07:45] at the leaderboard if I wanted to use
[07:47] this model I would simply select the
[07:51] username and the name of the model like
[07:53] that and copy and paste it into the web
[07:55] UI
[07:57] and hit download and now it is going to
[08:00] start downloading
[08:02] this model once that model is downloaded
[08:04] I will be able to load the model in here
[08:06] into the GPU
[08:09] RAM in the instance
[08:11] sort of memory so then I can query it I
[08:14] can go back to text generation and
[08:16] actually start using it there's probably
[08:18] some other things that you can do and
[08:22] become familiar with in this interface
[08:23] this is a very nice way to run LLMs so
[08:27] that you don't have to use a command
[08:28] line
[08:32] um so there's quite a bit here but
[08:34] again you're going to want to read about
[08:36] this in the Oobabooga
[08:40] GitHub
[08:44] so if I go back and look at my instance
[08:48] um
[08:49] you can see that it's running
[08:51] you can also click on the billing tab if
[08:54] you just want to see you know if you're
[08:55] going to run something for multiple days
[08:57] you can get an idea of how many credits
[08:59] you're going to need you can set up your
[09:01] auto billing threshold so that your
[09:05] instance is not stopped when your
[09:08] balance gets low
[09:10] and that's the basics of running
[09:13] Oobabooga on Vast thanks for your time
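The billing check mentioned near the end of the video comes down to simple arithmetic: hourly rate times hours of runtime. The sketch below shows that calculation under stated assumptions. The $1.50/hour rate is a made-up example, not a Vast.ai price, and the function names are hypothetical; the estimate covers GPU rental time only and ignores any other charges.

```python
def credits_needed(hourly_rate_usd: float, days: float) -> float:
    """Estimate the credits for running an instance continuously for `days`."""
    return round(hourly_rate_usd * 24 * days, 2)


def hours_remaining(balance_usd: float, hourly_rate_usd: float) -> float:
    """Estimate how long the current balance lasts at a given hourly rate."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return balance_usd / hourly_rate_usd


# Hypothetical rate of $1.50/hour for a three-day run:
print(credits_needed(1.50, 3))      # 108.0
print(hours_remaining(36.0, 1.50))  # 24.0
```

Comparing `hours_remaining` against the planned run length gives a quick sense of whether an auto-billing threshold is needed to keep the instance from being stopped.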